Case Study on Data Collection of Kreol Morisien, a Low-Resourced Creole Language

Cited by: 0
Authors
Bastien, David Joshen [1]
Chumroo, Vijay Prakash [1]
Bastien, Johan Patrice [1]
Affiliations
[1] Hydrus Labs Ltd, Roche Brunes, Rose Hill, Mauritius
Source
Keywords
Natural Language Processing; Machine Learning; Speech-to-text; Information Extractor; Mauritian Creole; Data Collection
DOI
Not available
Chinese Library Classification (CLC) number
TP39 [Computer applications]
Discipline classification code
081203; 0835
Abstract
This case study lays the foundations for the development of Kreol Morisien NLP (KreMoN), a series of Natural Language Processing tools for processing Mauritian Creole. While most work to date focuses on detailing Machine Learning algorithms, this work focuses on the first step needed for any low-resourced language: the collection of data. We present a process currently being used to collect audio and textual data for a low-resourced language such as Mauritian Creole. This data will be used to develop a speech-to-text system as well as an Information Extractor for Mauritian Creole. As part of the case study, we detail some of the work carried out on existing textual data in non-standardized Mauritian Creole, for which an NLP pre-processing pipeline adapted to low-resourced languages has been developed.
Pages: 10