Binary neural networks for speech recognition

Cited by: 14

Authors:

Qian, Yan-min [1,2]

Xiang, Xu [1,2]

Affiliations:

[1] Shanghai Jiao Tong Univ, Key Lab Shanghai Educ Commiss Intelligent Interac, Shanghai 200240, Peoples R China

[2] Shanghai Jiao Tong Univ, Dept Comp Sci & Engn, SpeechLab, Shanghai 200240, Peoples R China

Source:

FRONTIERS OF INFORMATION TECHNOLOGY & ELECTRONIC ENGINEERING | 2019 / Vol. 20 / Issue 05

Funding:

National Natural Science Foundation of China;

Keywords:

Speech recognition; Binary neural networks; Binary matrix multiplication; Knowledge distillation; Population count;

DOI:

10.1631/FITEE.1800469

Chinese Library Classification (CLC) number:

TP [Automation technology, computer technology];

Discipline classification code:

0812;

Abstract:

Recently, deep neural networks (DNNs) have significantly outperformed Gaussian mixture models in acoustic modeling for speech recognition. However, the substantial increase in computational load during the inference stage makes deep models difficult to deploy directly on low-power embedded devices. To alleviate this issue, structure sparseness and low-precision fixed-point quantization have been widely applied. In this work, binary neural networks for speech recognition are developed to reduce the computational cost during the inference stage. A fast implementation of binary matrix multiplication is introduced. On modern central processing unit (CPU) and graphics processing unit (GPU) architectures, a 5-7 times speedup over full-precision floating-point matrix multiplication can be achieved in real applications. Several kinds of binary neural networks and related model optimization algorithms are developed for large vocabulary continuous speech recognition acoustic modeling. In addition, to improve the accuracy of binary models, knowledge distillation from the normal full-precision floating-point model to the compressed binary model is explored. Experiments on the standard Switchboard speech recognition task show that the proposed binary neural networks can deliver a 3-4 times speedup over the normal full-precision deep models. With knowledge distillation from the normal floating-point models, the binary DNNs or binary convolutional neural networks (CNNs) can restrict the word error rate (WER) degradation to within 15.0%, compared to the normal full-precision floating-point DNNs or CNNs, respectively. In particular, for the binary CNN with binarization only on the convolutional layers, the WER degradation is very small, almost negligible, with the proposed approach.
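The abstract names binary matrix multiplication and population count but does not show the implementation. The following is a minimal sketch of the general idea only, assuming weights and activations are constrained to {+1, -1} and packed one bit per value; it is not the authors' optimized CPU/GPU kernel, and the function names `pack_signs`, `binary_dot`, and `binary_matmul` are hypothetical.

```python
def pack_signs(values):
    """Pack a sequence of +1/-1 values into an int: bit i is 1 where values[i] is +1."""
    word = 0
    for i, v in enumerate(values):
        if v > 0:
            word |= 1 << i
    return word


def binary_dot(a_bits, b_bits, n):
    """Dot product of two length-n {+1, -1} vectors given in packed form."""
    # XOR sets a bit wherever the two signs differ. Each differing position
    # contributes -1 to the dot product and each matching position +1.
    mismatches = bin(a_bits ^ b_bits).count("1")  # population count
    return n - 2 * mismatches


def binary_matmul(A, B):
    """Multiply A (m x n) by B (n x p), both with entries in {+1, -1}."""
    n = len(B)
    rows = [pack_signs(r) for r in A]
    cols = [pack_signs(c) for c in zip(*B)]  # pack each column of B
    return [[binary_dot(r, c, n) for c in cols] for r in rows]


A = [[1, -1, 1, 1], [-1, -1, 1, -1]]
B = [[1, -1], [1, 1], [-1, 1], [1, 1]]
print(binary_matmul(A, B))  # [[0, 0], [-4, 0]]
```

In a real kernel the packed words would be fixed-width machine words and the population count a single hardware instruction, which is where a speedup over floating-point multiply-accumulate would come from; the Python version only illustrates the arithmetic identity.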

Pages: 701-715

Page count: 15
