FluentNet: End-to-End Detection of Stuttered Speech Disfluencies With Deep Learning

Cited by: 25
Authors
Kourkounakis, Tedd [1 ]
Hajavi, Amirhossein [1 ]
Etemad, Ali [1 ]
Affiliations
[1] Queens Univ, Dept Elect & Comp Engn, Kingston, ON K7L 3N6, Canada
Keywords
Speech processing; Deep learning; Training; Benchmark testing; Tools; Speaker recognition; Residual neural networks; Attention; disfluency; deep learning; BLSTM; speech; stutter; squeeze-and-excitation; RECOGNITION; PROLONGATIONS; DYSFLUENCIES; TRANSFORMER; REPETITIONS; MFCC;
DOI
10.1109/TASLP.2021.3110146
Chinese Library Classification (CLC)
O42 [Acoustics]
Discipline Codes
070206; 082403
Abstract
Millions of people are affected by stuttering and other speech disfluencies, and most people have experienced mild stutters while communicating under stressful conditions. While there has been much research on automatic speech recognition and language models, stutter detection and recognition have not received as much attention. To this end, we propose an end-to-end deep neural network, FluentNet, capable of detecting a number of different stutter types. FluentNet consists of a Squeeze-and-Excitation residual convolutional neural network that facilitates the learning of strong spectral frame-level representations, followed by a set of bidirectional long short-term memory (BLSTM) layers that aid in learning effective temporal relationships. Lastly, FluentNet uses an attention mechanism to focus on the important parts of speech and obtain better performance. We perform a number of experiments, comparisons, and ablation studies to evaluate our model. Our model achieves state-of-the-art results, outperforming other solutions in the field on the publicly available UCLASS dataset. Additionally, we present LibriStutter: a stuttered speech dataset based on the public LibriSpeech dataset with synthesized stutters. We also evaluate FluentNet on this dataset, showing the strong performance of our model against a number of baseline and state-of-the-art techniques.
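The abstract describes a pipeline of an SE-ResNet spectral front end, BLSTM layers, and attention-based temporal pooling. Below is a minimal PyTorch sketch of that kind of architecture; the class names (SEResidualBlock, FluentNetSketch), layer counts, feature dimensions, and the choice of six stutter classes are illustrative assumptions, not the configuration reported in the paper.

```python
# Minimal sketch of the architecture described in the abstract:
# SE-ResNet front end -> BiLSTM -> attention pooling -> stutter-type classifier.
# All hyperparameters below are illustrative assumptions, not the paper's values.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SEResidualBlock(nn.Module):
    """Residual conv block with squeeze-and-excitation channel reweighting."""

    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)
        # Squeeze-and-excitation: global pool -> bottleneck MLP -> channel gates.
        self.se = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        out = out * self.se(out)          # reweight channels
        return F.relu(out + x)            # residual connection


class FluentNetSketch(nn.Module):
    """Spectrogram -> SE-ResNet -> BiLSTM -> attention -> per-type logits."""

    def __init__(self, n_mels: int = 64, n_stutter_types: int = 6,
                 channels: int = 32, hidden: int = 128):
        super().__init__()
        self.stem = nn.Conv2d(1, channels, 3, padding=1)
        self.res_blocks = nn.Sequential(*[SEResidualBlock(channels) for _ in range(3)])
        self.blstm = nn.LSTM(channels * n_mels, hidden, num_layers=2,
                             batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hidden, 1)          # frame-level attention scores
        self.classifier = nn.Linear(2 * hidden, n_stutter_types)

    def forward(self, spec):                          # spec: (batch, 1, n_mels, frames)
        h = self.res_blocks(F.relu(self.stem(spec)))  # (batch, C, n_mels, frames)
        b, c, m, t = h.shape
        h = h.permute(0, 3, 1, 2).reshape(b, t, c * m)  # per-frame feature vectors
        h, _ = self.blstm(h)                          # (batch, frames, 2*hidden)
        w = torch.softmax(self.attn(h), dim=1)        # attention weights over frames
        pooled = (w * h).sum(dim=1)                   # weighted temporal pooling
        return self.classifier(pooled)                # logits per stutter type


if __name__ == "__main__":
    model = FluentNetSketch()
    logits = model(torch.randn(2, 1, 64, 200))        # 2 clips, 64 mel bins, 200 frames
    print(logits.shape)                               # torch.Size([2, 6])
```

The attention layer here is a simple learned softmax over BLSTM frame outputs, used as a stand-in for the attention mechanism the abstract mentions.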
Pages: 2986-2999
Number of pages: 14
Related Papers
50 in total (10 shown below)
  • [1] MINTZAI: End-to-end Deep Learning for Speech Translation
    Etchegoyhen, Thierry
    Arzelus, Haritz
    Gete, Harritxu
    Alvarez, Aitor
    Hernaez, Inma
    Navas, Eva
    Gonzalez-Docasal, Ander
    Osacar, Jaime
    Benites, Edson
    Ellakuria, Igor
    Calonge, Eusebi
    Martin, Maite
[J]. PROCESAMIENTO DEL LENGUAJE NATURAL, 2020, (65): 97 - 100
  • [2] Arabic speech recognition using end-to-end deep learning
    Alsayadi, Hamzah A.
    Abdelhamid, Abdelaziz A.
    Hegazy, Islam
    Fayed, Zaki T.
    [J]. IET SIGNAL PROCESSING, 2021, 15 (08) : 521 - 534
  • [3] End-to-End Automatic Speech Recognition with Deep Mutual Learning
    Masumura, Ryo
    Ihori, Mana
    Takashima, Akihiko
    Tanaka, Tomohiro
    Ashihara, Takanori
    [J]. 2020 ASIA-PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE (APSIPA ASC), 2020, : 632 - 637
  • [4] End-to-End Deep Learning Speech Recognition Model for Silent Speech Challenge
    Kimura, Naoki
    Su, Zixiong
    Saeki, Takaaki
    [J]. INTERSPEECH 2020, 2020, : 1025 - 1026
  • [5] An End-to-End Detection Method for WebShell with Deep Learning
    Qi, Longchen
    Kong, Rui
    Lu, Yang
    Zhuang, Honglin
    [J]. 2018 EIGHTH INTERNATIONAL CONFERENCE ON INSTRUMENTATION AND MEASUREMENT, COMPUTER, COMMUNICATION AND CONTROL (IMCCC 2018), 2018, : 660 - 665
  • [6] End-to-end deep learning approach for Parkinson's disease detection from speech signals
    Quan, Changqin
    Ren, Kang
    Luo, Zhiwei
    Chen, Zhonglue
    Ling, Yun
    [J]. BIOCYBERNETICS AND BIOMEDICAL ENGINEERING, 2022, 42 (02) : 556 - 574
  • [7] End-to-End Deep Learning Framework for Speech Paralinguistics Detection Based on Perception Aware Spectrum
    Cai, Danwei
    Ni, Zhidong
    Liu, Wenbo
    Cai, Weicheng
    Li, Gang
    Li, Ming
    [J]. 18TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2017), VOLS 1-6: SITUATED INTERACTION, 2017, : 3452 - 3456
  • [8] An End-to-end Deep Learning Scheme for Atrial Fibrillation Detection
    Jia, Yingjie
    Jiang, Haoyu
    Yang, Ping
    He, Xianliang
    [J]. 2020 COMPUTING IN CARDIOLOGY, 2020,
  • [9] Deep End-to-End Representation Learning for Food Type Recognition from Speech
    Sertolli, Benjamin
    Cummins, Nicholas
    Sengur, Abdulkadir
    Schuller, Bjorn W.
    [J]. ICMI'18: PROCEEDINGS OF THE 20TH ACM INTERNATIONAL CONFERENCE ON MULTIMODAL INTERACTION, 2018, : 574 - 578
  • [10] An End-to-End Deep Learning Speech Coding and Denoising Strategy for Cochlear Implants
    Gajecki, Tom
    Nogueira, Waldo
    [J]. 2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 3109 - 3113