LARGE-SCALE SELF-SUPERVISED SPEECH REPRESENTATION LEARNING FOR AUTOMATIC SPEAKER VERIFICATION

Cited by: 27
Authors
Chen, Zhengyang [1, 2]
Chen, Sanyuan [2]
Wu, Yu [2]
Qian, Yao [2]
Wang, Chengyi [2]
Liu, Shujie [2]
Qian, Yanmin [1]
Zeng, Michael [2]
Affiliations
[1] Shanghai Jiao Tong Univ, Dept Comp Sci & Engn, X-LANCE Lab, MoE Key Lab Artificial Intelligence, AI Inst, Shanghai, Peoples R China
[2] Microsoft Corp, Redmond, WA 98052 USA
Keywords
representation learning; self-supervised pretrain; speaker verification;
DOI
10.1109/ICASSP43922.2022.9747814
Chinese Library Classification (CLC) number
O42 [Acoustics];
Discipline classification codes
070206; 082403;
Abstract
Speech representations learned from large-scale unlabeled data have shown better generalizability than those from supervised learning and have therefore attracted considerable interest for various downstream tasks. In this paper, we explore the limits of speech representations learned with different self-supervised objectives and datasets for automatic speaker verification (ASV), especially with a well-recognized SOTA ASV model, ECAPA-TDNN [1], as the downstream model. The representations from all hidden layers of the pre-trained model are first averaged with learnable weights and then fed into the ECAPA-TDNN as input features. Experimental results on the VoxCeleb dataset show that the weighted average representation is significantly superior to FBank, a conventional handcrafted feature for ASV. Our best single system achieves 0.537%, 0.569%, and 1.180% equal error rate (EER) on the three official trials of VoxCeleb1, respectively. The ensemble system with three pre-trained models further improves the EER to 0.479%, 0.536%, and 1.023%. Among the three evaluation trials, our best system outperforms the winner system [2] of the VoxCeleb Speaker Recognition Challenge 2021 (VoxSRC2021) on the VoxCeleb1-E trial.
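The layer-averaging step described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract only says the hidden-layer representations are averaged with learnable weights, so the softmax normalization of the weights, the function names `softmax` and `weighted_layer_average`, and the toy data are all assumptions made for illustration. The scores are fixed numbers here rather than trained parameters, and the downstream ECAPA-TDNN is not shown.

```python
import math


def softmax(scores):
    """Turn unconstrained scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_layer_average(layer_outputs, layer_scores):
    """Combine per-layer features into a single feature sequence.

    layer_outputs: list of L layers; each layer is a list of T frames;
                   each frame is a list of D floats.
    layer_scores:  list of L unconstrained scalars. In training these
                   would be learnable; here they are plain numbers.
    Assumption: weights are softmax-normalized (not stated in the abstract).
    """
    if len(layer_outputs) != len(layer_scores):
        raise ValueError("need exactly one score per layer")
    weights = softmax(layer_scores)
    num_frames = len(layer_outputs[0])
    dim = len(layer_outputs[0][0])
    combined = [[0.0] * dim for _ in range(num_frames)]
    for weight, layer in zip(weights, layer_outputs):
        for t in range(num_frames):
            for d in range(dim):
                combined[t][d] += weight * layer[t][d]
    return combined


# Toy data (hypothetical): 3 layers, 2 frames, 2 feature dimensions.
layers = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[2.0, 0.0], [0.0, 2.0]],
    [[3.0, 0.0], [0.0, 3.0]],
]
# Equal scores give equal weights, so the result is the plain layer mean.
features = weighted_layer_average(layers, [0.0, 0.0, 0.0])
```

In a real system the combined `features` sequence would replace FBank as the input to the speaker model, and the per-layer scores would be updated by backpropagation together with that model.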
Pages: 6147-6151
Number of pages: 5
Related papers
50 in total
  • [1] Self-supervised contrastive representation learning for large-scale trajectories
    Li, Shuzhe
    Chen, Wei
    Yan, Bingqi
    Li, Zhen
    Zhu, Shunzhi
    Yu, Yanwei
    [J]. FUTURE GENERATION COMPUTER SYSTEMS-THE INTERNATIONAL JOURNAL OF ESCIENCE, 2023, 148 : 357 - 366
  • [2] Bootstrap Equilibrium and Probabilistic Speaker Representation Learning for Self-Supervised Speaker Verification
    Mun, Sung Hwan
    Han, Min Hyun
    Lee, Dongjune
    Kim, Jihwan
    Kim, Nam Soo
    [J]. IEEE ACCESS, 2021, 9 : 167615 - 167627
  • [3] ADVERSARIAL DEFENSE FOR AUTOMATIC SPEAKER VERIFICATION BY CASCADED SELF-SUPERVISED LEARNING MODELS
    Wu, Haibin
    Li, Xu
    Liu, Andy T.
    Wu, Zhiyong
    Meng, Helen
    Lee, Hung-yi
    [J]. 2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021, : 6718 - 6722
  • [4] ROBUST SPEAKER VERIFICATION WITH JOINT SELF-SUPERVISED AND SUPERVISED LEARNING
    Wang, Kai
    Zhang, Xiaolei
    Zhang, Miao
    Li, Yuguang
    Lee, Jaeyun
    Cho, Kiho
    Park, Sung-UN
    [J]. 2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 7637 - 7641
  • [5] Automatic Pronunciation Assessment using Self-Supervised Speech Representation Learning
    Kim, Eesung
    Jeon, Jae-Jin
    Seo, Hyeji
    Kim, Hoon
    [J]. INTERSPEECH 2022, 2022, : 1411 - 1415
  • [6] The effect of speech pathology on automatic speaker verification: a large-scale study
    Tayebi Arasteh, Soroosh
    Weise, Tobias
    Schuster, Maria
    Noeth, Elmar
    Maier, Andreas
    Yang, Seung Hee
    [J]. SCIENTIFIC REPORTS, 2023, 13 (01)
  • [7] The effect of speech pathology on automatic speaker verification: a large-scale study
    Soroosh Tayebi Arasteh
    Tobias Weise
    Maria Schuster
    Elmar Noeth
    Andreas Maier
    Seung Hee Yang
    [J]. Scientific Reports, 13
  • [8] Self-supervised Learning for Large-scale Item Recommendations
    Yao, Tiansheng
    Yi, Xinyang
    Cheng, Derek Zhiyuan
    Yu, Felix
    Chen, Ting
    Menon, Aditya
    Hong, Lichan
    Chi, Ed H.
    Tjoa, Steve
    Kang, Jieqi
    Ettinger, Evan
    [J]. PROCEEDINGS OF THE 30TH ACM INTERNATIONAL CONFERENCE ON INFORMATION & KNOWLEDGE MANAGEMENT, CIKM 2021, 2021, : 4321 - 4330
  • [9] Self-Supervised Speech Representation Learning: A Review
    Mohamed, Abdelrahman
    Lee, Hung-yi
    Borgholt, Lasse
    Havtorn, Jakob D.
    Edin, Joakim
    Igel, Christian
    Kirchhoff, Katrin
    Li, Shang-Wen
    Livescu, Karen
    Maaloe, Lars
    Sainath, Tara N.
    Watanabe, Shinji
    [J]. IEEE JOURNAL OF SELECTED TOPICS IN SIGNAL PROCESSING, 2022, 16 (06) : 1179 - 1210
  • [10] Improving the Adversarial Robustness for Speaker Verification by Self-Supervised Learning
    Wu, Haibin
    Li, Xu
    Liu, Andy T.
    Wu, Zhiyong
    Meng, Helen
    Lee, Hung-Yi
    [J]. IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2022, 30 : 202 - 217