Phoneme Recognition in Korean Singing Voices Using Self-Supervised English Speech Representations

Cited by: 0
Authors
Wu, Wenqin [1 ]
Lee, Joonwhoan [1 ]
Affiliation
[1] Jeonbuk Natl Univ, Dept Comp Sci & Engn, Artificial Intelligence Lab, Jeonju 54896, South Korea
Source
APPLIED SCIENCES-BASEL | 2024 / Vol. 14 / Issue 18
Funding
National Research Foundation of Singapore;
Keywords
phoneme recognition; Korean singing voices; self-supervised learning;
DOI
10.3390/app14188532
Chinese Library Classification
O6 [Chemistry];
Discipline classification code
0703 ;
Abstract
In general, it is difficult to obtain a large labeled dataset for deep learning-based phoneme recognition in singing voices. Singing voices also pose inherent challenges compared with speech because of their distinct variations in pitch, duration, and intensity. This paper proposes a detour method to overcome this data scarcity and applies it to the recognition of Korean phonemes in singing voices. The method starts by pre-training HuBERT, a self-supervised speech representation model, on a large-scale English corpus. The model is then adapted to the Korean speech domain with a relatively small Korean corpus, in which the Korean phonemes are interpreted as similar English ones. Finally, the speech-adapted model is trained again on a tiny Korean singing voice corpus for speech-to-singing adaptation. In this final adaptation, melodic supervision, which uses pitch information, is adopted to improve performance. For evaluation, multi-level error rates based on Word Error Rate (WER) were used. HuBERT-based transfer learning for adaptation improved the phoneme-level error rate on Korean speech by as much as 31.19%. On singing voices, melodic supervision improved the rate by 0.55%. The significant improvement in speech recognition underscores the considerable potential of a model equipped with general human voice representations captured from the English corpus to improve phoneme recognition with less target-language speech data. Moreover, the musical variation in singing voices is beneficial for phoneme recognition in singing voices. The proposed method could be applied to phoneme recognition in other languages that have fewer speech and singing voice corpora.
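The abstract reports error rates derived from WER at several levels, including the phoneme level. As a rough illustration only, the sketch below computes an edit-distance-based error rate over phoneme sequences, which is the usual way WER is carried over to phonemes. The exact metric definition used in the paper is not given in the abstract, so this is an assumption; the function names and the romanized phoneme labels are hypothetical.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two token sequences."""
    # prev[j] holds the distance between the processed prefix of ref
    # and the first j tokens of hyp.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference token
                curr[j - 1] + 1,     # insertion of a hypothesis token
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def error_rate(ref: Sequence[str], hyp: Sequence[str]) -> float:
    """(substitutions + deletions + insertions) / reference length."""
    if not ref:
        raise ValueError("reference sequence must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


# Hypothetical romanized phoneme labels, for illustration only.
reference = ["s", "a", "r", "a", "ng"]
hypothesis = ["s", "a", "l", "a"]
rate = error_rate(reference, hypothesis)  # one substitution + one deletion
```

The same function yields a word-level rate when the sequences hold words instead of phonemes, which is one plausible reading of "multi-level" here.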
Pages: 14
Related papers
50 in total
  • [1] PHONEME SEGMENTATION USING SELF-SUPERVISED SPEECH MODELS
    Strgar, Luke
    Harwath, David
    2022 IEEE SPOKEN LANGUAGE TECHNOLOGY WORKSHOP, SLT, 2022, : 1067 - 1073
  • [2] Phonetic Analysis of Self-supervised Representations of English Speech
    Wells, Dan
    Tang, Hao
    Richmond, Korin
    INTERSPEECH 2022, 2022, : 3583 - 3587
  • [3] Evaluating Self-Supervised Speech Representations for Speech Emotion Recognition
    Atmaja, Bagus Tris
    Sasou, Akira
    IEEE ACCESS, 2022, 10 : 124396 - 124407
  • [4] Self-Supervised Contrastive Learning for Singing Voices
    Yakura, Hiromu
    Watanabe, Kento
    Goto, Masataka
    IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2022, 30 : 1614 - 1623
  • [5] Self-Supervised Models for Phoneme Recognition: Applications in Children's Speech for Reading Learning
    Medin, Lucas Block
    Pellegrini, Thomas
    Gelin, Lucile
    INTERSPEECH 2024, 2024, : 5168 - 5172
  • [6] SPEECH EMOTION RECOGNITION USING SELF-SUPERVISED FEATURES
    Morais, Edmilson
    Hoory, Ron
    Zhu, Weizhong
    Gat, Itai
    Damasceno, Matheus
    Aronowitz, Hagai
    2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 6922 - 6926
  • [7] Cross-lingual Self-Supervised Speech Representations for Improved Dysarthric Speech Recognition
    Hernandez, Abner
    Perez-Toro, Paula Andrea
    Noeth, Elmar
    Orozco-Arroyave, Juan Rafael
    Maier, Andreas
    Yang, Seung Hee
    INTERSPEECH 2022, 2022, : 51 - 55
  • [8] UNIVERSAL PARALINGUISTIC SPEECH REPRESENTATIONS USING SELF-SUPERVISED CONFORMERS
    Shor, Joel
    Jansen, Aren
    Han, Wei
    Park, Daniel
    Zhang, Yu
    2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 3169 - 3173
  • [9] Self-supervised Speech Representations Still Struggle with African American Vernacular English
    Chang, Kalvin
    Chou, Yi-Hui
    Shi, Jiatong
    Chen, Hsuan-Ming
    Holliday, Nicole
    Scharenborg, Odette
    Mortensen, David R.
    INTERSPEECH 2024, 2024, : 4643 - 4647
  • [10] Cross-Corpus Training Strategy for Speech Emotion Recognition Using Self-Supervised Representations
    Pastor, Miguel A.
    Ribas, Dayana
    Ortega, Alfonso
    Miguel, Antonio
    Lleida, Eduardo
    APPLIED SCIENCES-BASEL, 2023, 13 (16):