Crossing language identification: Multilingual ASR framework based on semantic dataset creation & Wav2Vec 2.0

Cited by: 1
Authors
Anidjar, Or Haim [1, 2, 3, 4]
Yozevitch, Roi [1]
Bigon, Nerya [1]
Abdalla, Najeeb [1]
Myara, Benjamin [1]
Marbel, Revital [1, 2, 5]
Affiliations
[1] Ariel Univ, Sch Comp Sci, Golan Hts 1, IL-4077625 Ariel, Israel
[2] Ariel Univ, Ariel Cyber Innovat Ctr, Golan Hts 1, IL-4077625 Ariel, Israel
[3] Ariel Univ, Kinemat & Computat Geometry Lab K&CG, Golan Hts 1, IL-4077625 Ariel, Israel
[4] Ariel Univ, Data Sci & Artificial Intelligence Res Ctr, Golan Hts 1, IL-4077625 Ariel, Israel
[5] Coll Law & Business, Fac Informat Syst & Comp Sci, David Bengur 26, IL-5257346 Ramat Gan, Israel
Keywords
Wav2Vec 2.0; Automatic Speech Recognition; Transformers; Word Error Rate; Character Error Rate; Language identification; Speech
DOI
10.1016/j.mlwa.2023.100489
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
This study proposes a methodology for improving multilingual Automatic Speech Recognition (ASR) systems that exploits the high semantic similarity between sentences across different languages and removes the need for Language Identification (LID). To this end, special bilingual datasets were created from the Mozilla Common Voice datasets in Spanish, Russian, and Portuguese. The process computes sentence embeddings with Language-agnostic BERT and selects sentence pairs with high and with low cosine similarity. We then train the Wav2Vec 2.0 XLSR-53 model on these datasets and assess its performance using the Character Error Rate (CER) and Word Error Rate (WER) metrics. The experimental results indicate that models trained on high-similarity samples consistently outperform their low-similarity counterparts, which underlines the importance of selecting data with high semantic similarity for precise and dependable ASR performance. Furthermore, eliminating LID yields a simpler system with reduced computational cost and the capacity for real-time text output. These findings offer insights for developing more efficient and accurate multilingual ASR systems, particularly for real-time and on-device applications.
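The two computational ideas named in the abstract, selecting sentence pairs by cosine similarity and scoring transcripts with WER and CER, can be illustrated with a minimal sketch. This is not the authors' pipeline: the embeddings below are toy 2-D vectors standing in for Language-agnostic BERT sentence embeddings, the function names (`select_pairs`, `wer`, `cer`) are hypothetical, and choosing the top-k and bottom-k pairs over all cross-language combinations is an assumption, since the abstract does not state the selection rule or thresholds.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def select_pairs(emb_a, emb_b, k):
    """Return the k most and k least similar cross-language index pairs.

    Assumption: every sentence in language A is compared with every
    sentence in language B; the paper may use a different scheme.
    """
    scored = sorted(
        ((cosine_similarity(u, v), i, j)
         for i, u in enumerate(emb_a)
         for j, v in enumerate(emb_b)),
        reverse=True,
    )
    high = [(i, j) for _, i, j in scored[:k]]
    low = [(i, j) for _, i, j in scored[-k:]]
    return high, low


def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or word lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word Error Rate: word-level edit distance over reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character Error Rate: character-level edit distance over reference length."""
    return edit_distance(reference, hypothesis) / len(reference)


# Toy stand-ins for sentence embeddings in two languages (illustrative only).
spanish_emb = [[1.0, 0.0], [0.0, 1.0]]
portuguese_emb = [[0.9, 0.1], [-1.0, 0.2]]

high_pairs, low_pairs = select_pairs(spanish_emb, portuguese_emb, k=1)
print(high_pairs, low_pairs)
print(wer("hola mundo cruel", "hola mundo"), cer("casa", "cosa"))
```

In a real setup the toy vectors would be replaced by embeddings from a multilingual sentence encoder, and the hypothesis strings by the output of the fine-tuned ASR model.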
Pages: 12