Speech Activity Detection on YouTube Using Deep Neural Networks

Cited by: 0
Authors
Ryant, Neville [1]
Liberman, Mark [1]
Yuan, Jiahong [1]
Affiliations
[1] Linguistic Data Consortium, Philadelphia, PA 19104 USA
Funding
National Science Foundation (USA);
Keywords
speech activity detection; voice activity detection; segmentation; deep neural networks;
DOI
Not available
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104; 0812; 0835; 1405;
Abstract
Speech activity detection (SAD) is an important first step in speech processing. Commonly used methods (e.g., frame-level classification using Gaussian mixture models (GMMs)) work well under stationary noise conditions, but do not generalize well to domains such as YouTube, where videos may exhibit a diverse range of environmental conditions. One solution is to augment the conventional cepstral features with additional, hand-engineered features (e.g., spectral flux, spectral centroid, multiband spectral entropies) that are robust to changes in environment and recording condition. An alternative approach, explored here, is to learn robust features during the course of training using an appropriate architecture such as deep neural networks (DNNs). In this paper, we demonstrate that a DNN whose input consists of multiple frames of Mel-frequency cepstral coefficients (MFCCs) yields a drastically lower frame-wise error rate on YouTube videos (19.6%) than a conventional GMM-based system (40%).
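The abstract mentions two ideas that a short sketch may help make concrete: feeding the network several consecutive MFCC frames at once, and scoring the detector by frame-wise error rate. The Python below is a minimal illustration under stated assumptions, not the authors' implementation. The abstract does not give the context width, the number of coefficients, or the padding scheme, so the function names `splice_frames` and `frame_error_rate`, the `context` parameter, and the edge-repeat padding are all hypothetical choices.

```python
import numpy as np


def splice_frames(feats: np.ndarray, context: int) -> np.ndarray:
    """Stack each frame with `context` neighbours on either side.

    feats has shape (num_frames, num_coeffs), e.g. one MFCC vector per frame.
    The result has shape (num_frames, (2 * context + 1) * num_coeffs).
    Assumption: edges are padded by repeating the first and last frame.
    """
    num_frames = feats.shape[0]
    # Pad only along the time axis so every frame has a full window.
    padded = np.pad(feats, ((context, context), (0, 0)), mode="edge")
    # Slice i holds, for every frame t, the frame at offset (i - context).
    windows = [padded[i:i + num_frames] for i in range(2 * context + 1)]
    return np.concatenate(windows, axis=1)


def frame_error_rate(predicted, reference) -> float:
    """Fraction of frames whose speech/non-speech label is wrong."""
    predicted = np.asarray(predicted)
    reference = np.asarray(reference)
    return float(np.mean(predicted != reference))
```

For instance, a (4, 3) feature matrix spliced with `context=1` becomes a (4, 9) matrix that could serve as the input to a frame-level classifier, and `frame_error_rate([1, 0, 1, 1], [1, 1, 1, 0])` evaluates to 0.5 because two of the four frame labels differ.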
Pages: 728-731
Page count: 4
Related papers
50 in total
  • [31] Automatic Recognition of Kazakh Speech Using Deep Neural Networks
    Mamyrbayev, Orken
    Turdalyuly, Mussa
    Mekebayev, Nurbapa
    Alimhan, Keylan
    Kydyrbekova, Aizat
    Turdalykyzy, Tolganay
    INTELLIGENT INFORMATION AND DATABASE SYSTEMS, ACIIDS 2019, PT II, 2019, 11432 : 465 - 474
  • [32] Binaural Speech Intelligibility Estimation Using Deep Neural Networks
    Kondo, Kazuhiro
    Taira, Kazuya
    Kobayashi, Yosuke
    19TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2018), VOLS 1-6: SPEECH RESEARCH FOR EMERGING MARKETS IN MULTILINGUAL SOCIETIES, 2018, : 1858 - 1862
  • [33] Speech Recognition Using Deep Neural Networks: A Systematic Review
    Nassif, Ali Bou
    Shahin, Ismail
    Attili, Imtinan
    Azzeh, Mohammad
    Shaalan, Khaled
    IEEE ACCESS, 2019, 7 : 19143 - 19165
  • [34] Enhancing analysis of diadochokinetic speech using deep neural networks
    Segal-Feldman, Yael
    Hitczenko, Kasia
    Goldrick, Matthew
    Buchwald, Adam
    Roberts, Angela
    Keshet, Joseph
    Computer Speech and Language, 2025, 90
  • [35] PERCEPTUALLY GUIDED SPEECH ENHANCEMENT USING DEEP NEURAL NETWORKS
    Zhao, Yan
    Xu, Buye
    Giri, Ritwik
    Zhang, Tao
    2018 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2018, : 5074 - 5078
  • [36] STATISTICAL PARAMETRIC SPEECH SYNTHESIS USING DEEP NEURAL NETWORKS
    Zen, Heiga
    Senior, Andrew
    Schuster, Mike
    2013 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2013, : 7962 - 7966
  • [37] Speech Emotion Recognition using Convolution Neural Networks and Deep Stride Convolutional Neural Networks
    Wani, Taiba Majid
    Gunawan, Teddy Surya
    Qadri, Syed Asif Ahmad
    Mansor, Hasmah
    Kartiwi, Mira
    Ismail, Nanang
    PROCEEDING OF 2020 6TH INTERNATIONAL CONFERENCE ON WIRELESS AND TELEMATICS (ICWT), 2020,
  • [38] The Representation of Speech in Deep Neural Networks
    Scharenborg, Odette
    van der Gouw, Nikki
    Larson, Martha
    Marchiori, Elena
    MULTIMEDIA MODELING, MMM 2019, PT II, 2019, 11296 : 194 - 205
  • [39] Detection of activity and position of speakers by using deep neural networks and acoustic data augmentation
    Vecchiotti, Paolo
    Pepe, Giovanni
    Principi, Emanuele
    Squartini, Stefano
    EXPERT SYSTEMS WITH APPLICATIONS, 2019, 134 : 53 - 65
  • [40] Mispronunciation Detection and Diagnosis in L2 English Speech Using Multidistribution Deep Neural Networks
    Li, Kun
    Qian, Xiaojun
    Meng, Helen
    IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2017, 25 (01) : 193 - 207