Voice Activity Detection in the Wild: A Data-Driven Approach Using Teacher-Student Training

Cited by: 22
Authors
Dinkel, Heinrich [1, 2]
Wang, Shuai [1, 2]
Xu, Xuenan [1, 2]
Wu, Mengyue [1, 2]
Yu, Kai [1, 2]
Affiliations
[1] Shanghai Jiao Tong Univ, AI Inst, MoE Key Lab Artificial Intelligence, X-LANCE Lab, Dept Comp Sci & Engn, Shanghai 200240, Peoples R China
[2] State Key Lab Media Convergence Prod Technol & Sy, Beijing 100803, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Hidden Markov models; Training; Data models; Speech recognition; Mathematical model; Training data; Speech enhancement; Voice activity detection; Speech activity detection; Weakly supervised learning; Convolutional neural networks; Teacher-student learning; SPEECH ACTIVITY DETECTION; SOUND EVENT DETECTION; NEURAL-NETWORKS; ALGORITHM; FEATURES;
DOI
10.1109/TASLP.2021.3073596
Chinese Library Classification (CLC) number
O42 [Acoustics]
Discipline classification codes
070206; 082403
Abstract
Voice activity detection (VAD) is an essential pre-processing component for speech-related tasks such as automatic speech recognition (ASR). Traditional supervised VAD systems obtain frame-level labels from an ASR pipeline by using, e.g., a Hidden Markov model. These ASR models are commonly trained on clean and fully transcribed data, which restricts VAD training to clean or synthetically noised datasets. Therefore, a major challenge for supervised VAD systems is their generalization to noisy, real-world data. This work proposes a data-driven teacher-student approach for VAD that uses vast and unconstrained audio data for training. Unlike previous approaches, it requires only weak labels during teacher training, which allows any real-world, potentially noisy dataset to be used. Our approach first trains a teacher model on a source dataset (Audioset) using clip-level supervision. After training, the teacher provides frame-level guidance to a student model on an unlabeled target dataset. A multitude of student models trained on mid- to large-sized datasets (Audioset, Voxceleb, NIST SRE) are investigated. Our approach is then evaluated on clean, artificially noised, and real-world data. We observe significant performance gains in artificially noised and real-world scenarios. Lastly, we compare our approach against other unsupervised and supervised VAD methods, demonstrating our method's superiority.
Pages: 1542-1555
Page count: 14
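
The two training stages described in the abstract (clip-level supervision for the teacher, frame-level guidance for the student) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration, not the authors' implementation: the abstract does not name the pooling function or the loss, so linear softmax pooling and binary cross-entropy are used here as stand-ins, and the function names `clip_probability`, `teacher_loss`, and `student_loss` are hypothetical.

```python
import math
from typing import List


def clip_probability(frame_probs: List[float]) -> float:
    """Pool per-frame speech probabilities into one clip-level probability.

    Assumption: linear softmax pooling, sum(p^2) / sum(p). The abstract
    does not specify which pooling the teacher uses.
    """
    total = sum(frame_probs)
    if total == 0.0:
        return 0.0
    return sum(p * p for p in frame_probs) / total


def bce(pred: float, target: float, eps: float = 1e-7) -> float:
    """Binary cross-entropy for one prediction; accepts soft targets."""
    pred = min(max(pred, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(target * math.log(pred) + (1.0 - target) * math.log(1.0 - pred))


def teacher_loss(frame_probs: List[float], clip_label: float) -> float:
    """Stage 1: the teacher only sees a weak, clip-level label."""
    return bce(clip_probability(frame_probs), clip_label)


def student_loss(student_probs: List[float], teacher_probs: List[float]) -> float:
    """Stage 2: the student is guided frame by frame by the teacher's output."""
    if not student_probs or len(student_probs) != len(teacher_probs):
        raise ValueError("expected two non-empty sequences of equal length")
    per_frame = [bce(s, t) for s, t in zip(student_probs, teacher_probs)]
    return sum(per_frame) / len(per_frame)


if __name__ == "__main__":
    # Hypothetical teacher output for a four-frame clip labeled "speech".
    teacher_frames = [0.05, 0.90, 0.95, 0.10]
    print("teacher clip-level loss:", teacher_loss(teacher_frames, 1.0))

    # Hypothetical student output on an unlabeled clip, scored against
    # the teacher's frame-level predictions used as soft labels.
    student_frames = [0.20, 0.70, 0.80, 0.30]
    print("student frame-level loss:", student_loss(student_frames, teacher_frames))
```

The point of the sketch is the asymmetry between the two losses: the teacher's loss is computed once per clip after pooling, whereas the student's loss is computed per frame, so the student never needs a human-provided label on the target dataset.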