Automatic language identification: a case study of Pahari languages

Authors
Rachana Gusain
Satya Ranjan Dash
Shantipriya Parida
Girish Nath Jha
Affiliations
[1] Doon University
[2] KIIT University
[3] Silo AI
[4] Jawaharlal Nehru University
Source
Language Resources and Evaluation | 2023 / Volume 57
Keywords
Low-resource languages; Corpus development; Statistical analysis; Language identification; Northern Indo-Aryan; Pahari; Nepali; Garhwali; Kumaoni; Dogri
Abstract
In an attempt to expand the inclusiveness of Natural Language Processing, this paper focuses on developing resources and building machine learning models to identify four languages of the Northern Indo-Aryan family, also known as Pahari languages: Nepali, Garhwali, Kumaoni, and Dogri. This is the first attempt at building identification models for Pahari languages and at developing a plain-text corpus for Garhwali and Kumaoni, both of which are lesser-known and under-resourced languages/mother tongues of India. The collected corpus, including data in Nepali and Dogri, is statistically analyzed at the word level. We also trained traditional machine learning models for Pahari language identification on this corpus and found that character n-gram-based linear Support Vector Machines performed best, with 99.28% accuracy.
Pages: 1361-1387
Page count: 26