LARGE-SCALE SELF-SUPERVISED SPEECH REPRESENTATION LEARNING FOR AUTOMATIC SPEAKER VERIFICATION

Cited by: 27
Authors
Chen, Zhengyang [1, 2]
Chen, Sanyuan [2]
Wu, Yu [2]
Qian, Yao [2]
Wang, Chengyi [2]
Liu, Shujie [2]
Qian, Yanmin [1]
Zeng, Michael [2]
Affiliations
[1] Shanghai Jiao Tong Univ, Dept Comp Sci & Engn, X-LANCE Lab, MoE Key Lab Artificial Intelligence, AI Inst, Shanghai, Peoples R China
[2] Microsoft Corp, Redmond, WA 98052 USA
Keywords
representation learning; self-supervised pretrain; speaker verification;
DOI
10.1109/ICASSP43922.2022.9747814
Chinese Library Classification number
O42 [Acoustics];
Discipline classification code
070206 ; 082403 ;
Abstract
Speech representations learned from large-scale unlabeled data have shown better generalizability than those from supervised learning and have therefore attracted considerable interest for various downstream tasks. In this paper, we explore the limits of speech representations learned with different self-supervised objectives and datasets for automatic speaker verification (ASV), using a well-recognized SOTA ASV model, ECAPA-TDNN [1], as the downstream model. The representations from all hidden layers of the pre-trained model are first averaged with learnable weights and then fed into the ECAPA-TDNN as input features. Experimental results on the VoxCeleb dataset show that the weighted average representation is significantly superior to FBank, a conventional handcrafted feature for ASV. Our best single system achieves 0.537%, 0.569%, and 1.180% equal error rate (EER) on the three official trials of VoxCeleb1, respectively. An ensemble of three pre-trained models further improves the EER to 0.479%, 0.536%, and 1.023%. Among the three evaluation trials, our best system outperforms the winning system [2] of the VoxCeleb Speaker Recognition Challenge 2021 (VoxSRC2021) on the VoxCeleb1-E trial.
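The layer-averaging step described in the abstract can be illustrated with a minimal sketch. The abstract only states that the hidden-layer representations are averaged with learnable weights; the softmax normalization of those weights, the function names, and the toy inputs below are assumptions made for illustration, not the authors' implementation. In a real system the per-layer scalars would be trained jointly with the downstream model.

```python
import math


def softmax(logits):
    """Turn unconstrained per-layer scalars into weights that sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_layer_average(layer_outputs, layer_logits):
    """Average hidden-layer outputs with one weight per layer.

    layer_outputs: list of L layers, each a list of T frames,
                   each frame a list of D floats.
    layer_logits:  list of L scalars (hypothetical trainable parameters;
                   softmax normalization is an assumption of this sketch).
    Returns a T x D feature sequence for the downstream model.
    """
    if len(layer_outputs) != len(layer_logits):
        raise ValueError("need exactly one logit per layer")
    weights = softmax(layer_logits)
    num_frames = len(layer_outputs[0])
    dim = len(layer_outputs[0][0])
    averaged = [[0.0] * dim for _ in range(num_frames)]
    for weight, layer in zip(weights, layer_outputs):
        for t in range(num_frames):
            for d in range(dim):
                averaged[t][d] += weight * layer[t][d]
    return averaged


# Toy input: 2 layers, 2 frames, 2 dimensions (made-up numbers).
layers = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[3.0, 6.0], [5.0, 8.0]],
]
# Equal logits give equal weights, so the result is the plain mean.
features = weighted_layer_average(layers, [0.0, 0.0])
print(features)
```

With equal logits the output reduces to the unweighted mean of the layers; raising one layer's logit shifts the average toward that layer, which is how training could emphasize the layers most useful for speaker verification.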
Pages: 6147-6151
Number of pages: 5