Speech Activity Detection on YouTube Using Deep Neural Networks

Cited by: 0

Authors:

Ryant, Neville [1]

Liberman, Mark [1]

Yuan, Jiahong [1]

Affiliations:

[1] Linguistic Data Consortium, Philadelphia, PA 19104 USA

Source:

14TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2013), VOLS 1-5 | 2013

Funding:

National Science Foundation (USA);

Keywords:

speech activity detection; voice activity detection; segmentation; deep neural networks;

DOI:

Not available

Chinese Library Classification number:

TP18 [Artificial intelligence theory];

Subject classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Speech activity detection (SAD) is an important first step in speech processing. Commonly used methods (e.g., frame-level classification using Gaussian mixture models (GMMs)) work well under stationary noise conditions, but do not generalize well to domains such as YouTube, where videos may exhibit a diverse range of environmental conditions. One solution is to augment the conventional cepstral features with additional, hand-engineered features (e.g., spectral flux, spectral centroid, multiband spectral entropies) that are robust to changes in environment and recording condition. An alternative approach, explored here, is to learn robust features during the course of training using an appropriate architecture such as deep neural networks (DNNs). In this paper we demonstrate that a DNN whose input consists of multiple frames of mel-frequency cepstral coefficients (MFCCs) yields a drastically lower frame-wise error rate (19.6%) on YouTube videos than a conventional GMM-based system (40%).
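The abstract names two ideas that a small example can make concrete: a DNN input built from multiple MFCC frames, and a frame-wise error rate. The following is a minimal sketch in plain Python, not the authors' implementation. The function names `stack_context` and `frame_error_rate`, the symmetric context width, the edge-padding rule, and the 0/1 label convention are all assumptions made for illustration; the abstract does not specify any of them.

```python
from typing import List, Sequence


def stack_context(
    frames: Sequence[Sequence[float]], left: int, right: int
) -> List[List[float]]:
    """Concatenate each frame with `left` preceding and `right` following frames.

    Assumption: edge frames are padded by repeating the first or last frame.
    """
    n = len(frames)
    stacked: List[List[float]] = []
    for t in range(n):
        window: List[float] = []
        for offset in range(-left, right + 1):
            idx = min(max(t + offset, 0), n - 1)  # clamp index to [0, n - 1]
            window.extend(frames[idx])
        stacked.append(window)
    return stacked


def frame_error_rate(reference: Sequence[int], hypothesis: Sequence[int]) -> float:
    """Fraction of frames whose label differs from the reference.

    Assumption: labels are 1 for speech and 0 for non-speech.
    """
    if len(reference) != len(hypothesis):
        raise ValueError("label sequences must have the same length")
    if not reference:
        raise ValueError("label sequences must be non-empty")
    errors = sum(r != h for r, h in zip(reference, hypothesis))
    return errors / len(reference)


# Toy data: 4 frames of 2-dimensional "MFCCs" (made-up numbers, not real features).
mfcc = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0], [6.0, 7.0]]

# One frame of context on each side gives a 3 * 2 = 6 dimensional input per frame.
stacked = stack_context(mfcc, left=1, right=1)

# Hypothetical reference and predicted labels for 5 frames.
toy_error = frame_error_rate([1, 1, 0, 0, 1], [1, 0, 0, 0, 1])
```

In this sketch each stacked vector would be the input to a frame-level classifier, and the error rate is computed over all frames regardless of class. Whether the reported 19.6% and 40% figures were computed exactly this way is not stated in the abstract.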


Pages: 728-731

Page count: 4

