Automatic language identification: a case study of Pahari languages

Authors
Rachana Gusain
Satya Ranjan Dash
Shantipriya Parida
Girish Nath Jha
Affiliations
[1] Doon University
[2] KIIT University
[3] Silo AI
[4] Jawaharlal Nehru University
Source
Language Resources and Evaluation | 2023 / Volume 57
Keywords
Low-resource languages; Corpus development; Statistical analysis; Language identification; Northern Indo-Aryan; Pahari; Nepali; Garhwali; Kumaoni; Dogri
DOI
Not available
Abstract
In an attempt to expand the inclusiveness of Natural Language Processing, this paper focuses on developing resources and building machine learning models to identify four languages of the Northern Indo-Aryan family, also known as Pahari languages: Nepali, Garhwali, Kumaoni, and Dogri. This is the first attempt at building identification models for Pahari languages and at developing a plain-text corpus for Garhwali and Kumaoni, both of which are lesser-known, under-resourced languages/mother tongues of India. The collected corpus, which also includes data in Nepali and Dogri, is statistically analyzed at the word level. We also trained traditional machine learning models for Pahari language identification on this corpus and found that linear Support Vector Machines based on character n-grams performed best, with 99.28% accuracy.
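The character n-gram features mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not state the n-gram range, weighting, or tooling the authors used, so the function name `char_ngrams`, the default range of 1 to 3, and the boundary padding below are assumptions made for illustration only.

```python
from collections import Counter


def char_ngrams(text: str, n_min: int = 1, n_max: int = 3) -> Counter:
    """Count character n-grams of length n_min..n_max in `text`.

    Hypothetical helper: the n-gram range and the space padding are
    illustrative choices, not settings taken from the paper.
    """
    # Pad with one space on each side so n-grams at the start and end
    # of the text carry boundary information.
    padded = f" {text.strip()} "
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(padded) - n + 1):
            counts[padded[i:i + n]] += 1
    return counts


# Bigram counts for a single Latin-script token, used only as a toy input.
features = char_ngrams("pahari", 2, 2)
```

In a full pipeline, count vectors of this kind (one per sentence or document) would be passed to a linear classifier; scikit-learn's `LinearSVC` is one common choice, though the abstract does not say which implementation was used.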
Pages: 1361 - 1387
Page count: 26
Related papers
50 in total
  • [1] Automatic language identification: a case study of Pahari languages
    Gusain, Rachana
    Dash, Satya Ranjan
    Parida, Shantipriya
    Jha, Girish Nath
    LANGUAGE RESOURCES AND EVALUATION, 2023, 57 (03) : 1361 - 1387
  • [2] AUTOMATIC LANGUAGE IDENTIFICATION OF THREE INDIAN LANGUAGES USING VECTOR QUANTIZATION
    Roy, Pinki
    Das, Pradip K.
    FOURTH INTERNATIONAL CONFERENCE ON COMPUTER AND ELECTRICAL ENGINEERING (ICCEE 2011), 2011, : 293 - +
  • [3] Automatic Language Identification for Berber and Arabic Languages using Prosodic Features
    Lounnas, Khaled
    Demri, Lyes
    Teffahi, Hocine
    Falek, Leila
    PROCEEDINGS 2018 3RD INTERNATIONAL CONFERENCE ON ELECTRICAL SCIENCES AND TECHNOLOGIES IN MAGHREB (CISTEM), 2018, : 239 - 242
  • [4] Automatic Language Identification for Romance Languages using Stop Words and Diacritics
    Truica, Ciprian-Octavian
    Velcin, Julien
    Boicea, Alexandru
    2015 17TH INTERNATIONAL SYMPOSIUM ON SYMBOLIC AND NUMERIC ALGORITHMS FOR SCIENTIFIC COMPUTING (SYNASC), 2016, : 243 - 246
  • [5] Automatic identification of European languages
    Zhdanova, AV
    NATURAL LANGUAGE PROCESSING AND INFORMATION SYSTEMS, 2002, 2553 : 76 - 84
  • [6] A GMM-BASED HIERARCHICAL AUTOMATIC LANGUAGE IDENTIFICATION SYSTEM FOR INDIAN LANGUAGES
    Jothilakshmi, S.
    Ramalingam, V.
    Palanivel, S.
    APPLIED ARTIFICIAL INTELLIGENCE, 2012, 26 (06) : 554 - 570
  • [7] Automatic Language Identification for Seven Indian Languages using Higher Level Features
    Madhu, Chithra
    George, Anu
    Mary, Leena
    2017 IEEE INTERNATIONAL CONFERENCE ON SIGNAL PROCESSING, INFORMATICS, COMMUNICATION AND ENERGY SYSTEMS (SPICES), 2017,
  • [8] Language Identification for Austronesian Languages
    Dunn, Jonathan
    Nijhof, Wikke
    LREC 2022: THIRTEENTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION, 2022, : 6530 - 6539
  • [9] Experiments on Automatic Language Identification for Philippine Languages using Acoustic Gaussian Mixture Models
    Laguna, Ann Franchesca
    Guevara, Rowena Cristina
    2014 IEEE REGION 10 SYMPOSIUM, 2014, : 657 - 662
  • [10] Automatic language identification
    Zissman, MA
    Berkling, KM
    SPEECH COMMUNICATION, 2001, 35 (1-2) : 115 - 124