LARGE-SCALE SELF-SUPERVISED SPEECH REPRESENTATION LEARNING FOR AUTOMATIC SPEAKER VERIFICATION

Cited by: 27
Authors
Chen, Zhengyang [1, 2]
Chen, Sanyuan [2]
Wu, Yu [2]
Qian, Yao [2]
Wang, Chengyi [2]
Liu, Shujie [2]
Qian, Yanmin [1]
Zeng, Michael [2]
Affiliations
[1] Shanghai Jiao Tong Univ, Dept Comp Sci & Engn, X LANCE Lab, MoE Key Lab Artificial Intelligence,AI Inst, Shanghai, Peoples R China
[2] Microsoft Corp, Redmond, WA 98052 USA
Keywords
representation learning; self-supervised pretrain; speaker verification;
DOI
10.1109/ICASSP43922.2022.9747814
Chinese Library Classification number
O42 [Acoustics];
Discipline classification code
070206 ; 082403 ;
Abstract
Speech representations learned from large-scale unlabeled data have shown better generalizability than those from supervised learning and have therefore attracted considerable interest for various downstream tasks. In this paper, we explore the limits of speech representations learned with different self-supervised objectives and datasets for automatic speaker verification (ASV), in particular with a well-recognized SOTA ASV model, ECAPA-TDNN [1], as the downstream model. The representations from all hidden layers of the pre-trained model are first averaged with learnable weights and then fed into the ECAPA-TDNN as input features. Experimental results on the VoxCeleb dataset show that the weighted-average representation is significantly superior to FBank, a conventional handcrafted feature for ASV. Our best single system achieves equal error rates (EER) of 0.537%, 0.569%, and 1.180% on the three official trials of VoxCeleb1, respectively. The ensemble system with three pre-trained models further reduces the EER to 0.479%, 0.536%, and 1.023%. Among the three evaluation trials, our best system outperforms the winning system [2] of the VoxCeleb Speaker Recognition Challenge 2021 (VoxSRC2021) on the VoxCeleb1-E trial.
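The layer-averaging step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract says only that hidden-layer representations are "averaged with learnable weights", so the softmax normalization of the weights, the function name `weighted_layer_average`, and the toy shapes (13 layers, 50 frames, 8 dimensions) are assumptions made here for illustration. In a real system the weight vector would be trained jointly with the downstream model.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    shifted = x - np.max(x)
    exp = np.exp(shifted)
    return exp / exp.sum()


def weighted_layer_average(hidden_states, layer_logits):
    """Combine per-layer representations into one feature sequence.

    hidden_states: array of shape (num_layers, num_frames, feat_dim)
    layer_logits:  array of shape (num_layers,), the learnable parameters
                   (assumption: normalized with a softmax so they sum to 1)
    Returns an array of shape (num_frames, feat_dim).
    """
    hidden_states = np.asarray(hidden_states, dtype=np.float64)
    weights = softmax(np.asarray(layer_logits, dtype=np.float64))
    # Contract the layer axis: sum_l weights[l] * hidden_states[l]
    return np.tensordot(weights, hidden_states, axes=(0, 0))


# Toy example with hypothetical sizes: 13 layers, 50 frames, 8-dim features.
rng = np.random.default_rng(0)
hidden = rng.standard_normal((13, 50, 8))
logits = np.zeros(13)  # equal logits reduce to a plain mean over layers
features = weighted_layer_average(hidden, logits)
print(features.shape)
```

The resulting `features` array would take the place of FBank features as input to the downstream speaker model. With all logits equal, the result is the unweighted mean over layers, which is a natural starting point before the weights are learned.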
Pages: 6147-6151
Number of pages: 5