Efficient Self-Supervised Learning Representations for Spoken Language Identification

Cited by: 6
Authors
Liu, Hexin [1 ]
Perera, Leibny Paola Garcia [3 ]
Khong, Andy W. H. [1 ]
Chng, Eng Siong [1 ]
Styles, Suzy J. [2 ]
Khudanpur, Sanjeev [3 ]
Affiliations
[1] Nanyang Technol Univ, Sch Elect & Elect Engn, Singapore 639798, Singapore
[2] Nanyang Technol Univ, Sch Social Sci, Psychol, Singapore 639818, Singapore
[3] Johns Hopkins Univ, CLSP & HLT COE, Baltimore, MD 21218 USA
Funding
U.S. National Science Foundation; National Research Foundation of Singapore;
Keywords
Task analysis; Context modeling; Feature extraction; Data models; Computational modeling; Speech processing; Acoustics; Downstream; language identification; representation; self-supervised learning;
DOI
10.1109/JSTSP.2022.3201445
CLC Classification
TM [Electrical Engineering]; TN [Electronics and Communication Technology];
Subject Classification
0808; 0809;
Abstract
Self-supervised learning has been widely exploited to learn powerful speech representations. The premise of this paper is that these learned self-supervised representations contain irrelevant information for a particular downstream task. Hence, we investigate efficient methods to compute reliable representations and discard redundant information for language identification (LID) using a pre-trained multilingual wav2vec 2.0 model. To determine an optimal basic system, we compare the performance of wav2vec features extracted from different inner layers of the context network. For this approach, the x-vector self-attention LID (XSA-LID) model forms the backbone used to discriminate between distinct languages. We then propose to employ two mechanisms to reduce irrelevant information of the representations in LID. The first is the attentive squeeze-and-excitation (SE) block for dimension-wise scaling and the second is the linear bottleneck (LBN) block that reduces irrelevant information by nonlinear dimension reduction. We incorporate these two methods in the XSA-LID model and conduct experiments on AP19-OLR data and the MLS14 data in NIST LRE 2017. By replacing the previous input features with wav2vec 2.0 features, the XSA-LID model achieves 63.79% relative improvement in terms of the average cost on AP19-OLR data, and 40.42%, 41.54% and 18.97% relative improvement on 3 s, 10 s and 30 s test speech in the MLS14 data in NIST LRE 2017, respectively. In addition, the proposed LBN-XSA model achieves 9.85% relative improvement on AP19-OLR data and over 10% overall improvement on the MLS14 data with a modest number of additional parameters compared to the XSA-LID model. Finally, in terms of average cost and accuracy, the proposed LBN-XSA model outperforms the XSA-LID model which adopts the fine-tuned features on the AP19-OLR data.
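The two mechanisms named in the abstract — squeeze-and-excitation (SE) dimension-wise scaling and a linear bottleneck (LBN) for nonlinear dimension reduction — can be illustrated schematically. The following numpy sketch is not the authors' implementation; the function names, feature sizes, and reduction ratio are illustrative assumptions, shown only to make the two operations concrete.

```python
import numpy as np

rng = np.random.default_rng(0)

def se_scale(x, w1, w2):
    """SE-style dimension-wise scaling of frame-level features x of shape (T, D):
    squeeze by averaging over time, excite through a small bottleneck MLP,
    then gate each feature dimension with a sigmoid weight in (0, 1)."""
    z = x.mean(axis=0)                    # squeeze: (T, D) -> (D,)
    h = np.maximum(w1 @ z, 0.0)           # excitation bottleneck + ReLU: (D/r,)
    s = 1.0 / (1.0 + np.exp(-(w2 @ h)))   # sigmoid gate per dimension: (D,)
    return x * s                          # rescale every frame dimension-wise

def linear_bottleneck(x, w_down, w_up):
    """LBN-style block: project down to d << D, apply a nonlinearity so that
    irrelevant directions are discarded, then project back up to D."""
    h = np.maximum(x @ w_down, 0.0)       # (T, D) -> (T, d), ReLU
    return h @ w_up                       # (T, d) -> (T, D)

# Illustrative sizes: T frames, D-dim features, SE ratio r, bottleneck width d.
T, D, r, d = 50, 16, 4, 8
x = rng.standard_normal((T, D))
w1 = 0.1 * rng.standard_normal((D // r, D))
w2 = 0.1 * rng.standard_normal((D, D // r))
y = se_scale(x, w1, w2)
z = linear_bottleneck(x,
                      0.1 * rng.standard_normal((D, d)),
                      0.1 * rng.standard_normal((d, D)))
```

Because the SE gate is a sigmoid, each output dimension is attenuated rather than amplified, which matches the abstract's framing of suppressing task-irrelevant information; the LBN path instead forces the representation through a narrower nonlinear layer.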
Pages: 1296 - 1307
Page count: 12