Deep Speaker Embeddings for Speaker Verification of Children

Cited by: 0
Authors
Abed, Mohammed Hamzah [1 ]
Sztaho, David [1 ]
Affiliation
[1] Budapest Univ Technol & Econ, Dept Telecommun & Artificial Intelligence, Magyar Tudosok Korutja 2, H-1117 Budapest, Hungary
Keywords
Forensic voice comparison; children speaker verification; X-vector; RESNET-TDNN; ECAPA-TDNN; likelihood-ratio framework; IDENTIFICATION;
DOI
10.1007/978-3-031-70566-3_6
CLC Classification
TP18 [Theory of Artificial Intelligence];
Discipline Codes
081104; 0812; 0835; 1405;
Abstract
Currently, deep speaker embedding models are the most advanced feature extraction methods for speaker verification. However, their effectiveness in identifying children's voices has not been thoroughly researched. While various methods have been proposed in recent years, most of them concentrate on adult speakers, with fewer researchers focusing on children. This study examines three deep learning-based speaker embedding methods and their ability to differentiate between child speakers in speaker verification. The study evaluated the X-vector, ECAPA-TDNN, and RESNET-TDNN methods for forensic voice comparison, using pre-trained models and fine-tuning them on children's speech samples. Evaluations were carried out in the likelihood-ratio framework, with likelihood-ratio scores calibrated on children's voices. The Samromur Children dataset was used to evaluate the workflow. It comprises 131 h of speech from 3175 speakers aged between 4 and 17, of both sexes. The results indicate that RESNET-TDNN has the lowest EER and Cllr(min) values (10.8% and 0.368, respectively) without fine-tuning the embedding models. With fine-tuning, ECAPA-TDNN performs best (EER and Cllr(min) are 2.9% and 0.111, respectively). No difference was found between the sexes of the speakers. When the results were analysed by speaker age range (4-10, 11-15, and 16-17), varying levels of performance were observed: the younger speakers were identified less accurately by the original pre-trained models, although after fine-tuning this tendency changed slightly. The results indicate that the models could be used in real-life investigative casework and that fine-tuning helps mitigate the performance degradation for young speakers.
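The abstract reports performance as EER and Cllr(min). As a minimal illustrative sketch (not the authors' evaluation code), the two metrics can be computed from trial scores as follows: EER is the operating point where the false-accept and false-reject rates coincide, and Cllr is the application-independent log-likelihood-ratio cost commonly used in forensic voice comparison. The function and variable names here are illustrative assumptions.

```python
import math

def cllr(tar_lrs, non_lrs):
    # Log-likelihood-ratio cost over likelihood ratios (LRs):
    # tar_lrs are LRs from same-speaker (target) trials,
    # non_lrs are LRs from different-speaker (non-target) trials.
    # Cllr = 0.5 * (mean log2(1 + 1/LR_tar) + mean log2(1 + LR_non))
    tar_term = sum(math.log2(1 + 1 / lr) for lr in tar_lrs) / len(tar_lrs)
    non_term = sum(math.log2(1 + lr) for lr in non_lrs) / len(non_lrs)
    return 0.5 * (tar_term + non_term)

def eer(tar_scores, non_scores):
    # Sweep thresholds over all observed scores and return the
    # operating point where false-accept rate (FAR) and
    # false-reject rate (FRR) are closest (no interpolation).
    best_gap, best_eer = float("inf"), 1.0
    for t in sorted(set(tar_scores) | set(non_scores)):
        far = sum(s >= t for s in non_scores) / len(non_scores)
        frr = sum(s < t for s in tar_scores) / len(tar_scores)
        if abs(far - frr) < best_gap:
            best_gap, best_eer = abs(far - frr), (far + frr) / 2
    return best_eer
```

An uninformative system (LR = 1 on every trial) yields Cllr = 1.0, while well-calibrated, well-separated scores drive both metrics toward zero, matching the direction of the improvements reported above.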
Pages: 58-69 (12 pages)
Related Papers (50 in total)
  • [21] SPEAKER DIARIZATION THROUGH SPEAKER EMBEDDINGS
    Rouvier, Mickael
    Bousquet, Pierre-Michel
    Favre, Benoit
    2015 23RD EUROPEAN SIGNAL PROCESSING CONFERENCE (EUSIPCO), 2015, : 2082 - 2086
  • [22] Adapting Speaker Embeddings for Speaker Diarisation
    Kwon, Youngki
    Jung, Jee-weon
    Heo, Hee-Soo
    Kim, You Jin
    Lee, Bong-Jin
    Chung, Joon Son
    INTERSPEECH 2021, 2021, : 3101 - 3105
  • [23] Speaker Diarization Using Deep Recurrent Convolutional Neural Networks for Speaker Embeddings
    Cyrta, Pawel
    Trzcinski, Tomasz
    Stokowiec, Wojciech
    INFORMATION SYSTEMS ARCHITECTURE AND TECHNOLOGY, PT I, 2018, 655 : 107 - 117
  • [24] PARTIAL AUC OPTIMIZATION BASED DEEP SPEAKER EMBEDDINGS WITH CLASS-CENTER LEARNING FOR TEXT-INDEPENDENT SPEAKER VERIFICATION
    Bai, Zhongxin
    Zhang, Xiao-Lei
    Chen, Jingdong
    2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2020, : 6819 - 6823
  • [25] Unsupervised deep feature embeddings for speaker diarization
    Ahmad, Rehan
    Zubair, Syed
    TURKISH JOURNAL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCES, 2019, 27 (04) : 3138 - 3149
  • [26] Deep Speaker Embeddings Based Online Diarization
    Avdeeva, Anastasia
    Novoselov, Sergey
    SPEECH AND COMPUTER, SPECOM 2022, 2022, 13721 : 24 - 32
  • [27] Deep Neural Network Embeddings with Gating Mechanisms for Text-Independent Speaker Verification
    You, Lanhua
    Guo, Wu
    Dai, Li-Rong
    Du, Jun
    INTERSPEECH 2019, 2019, : 1168 - 1172
  • [28] Phonetic-Attention Scoring for Deep Speaker Features in Speaker Verification
    Li, Lantian
    Tang, Zhiyuan
    Shi, Ying
    Wang, Dong
    2019 ASIA-PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE (APSIPA ASC), 2019, : 284 - 288
  • [29] DEEP SPEAKER REPRESENTATION USING ORTHOGONAL DECOMPOSITION AND RECOMBINATION FOR SPEAKER VERIFICATION
    Kim, Insoo
    Kim, Kyuhong
    Kim, Jiwhan
    Choi, Changkyu
    2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2019, : 6126 - 6130
  • [30] Deep Speaker Feature Learning for Text-independent Speaker Verification
    Li, Lantian
    Chen, Yixiang
    Shi, Ying
    Tang, Zhiyuan
    Wang, Dong
    18TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2017), VOLS 1-6: SITUATED INTERACTION, 2017, : 1542 - 1546