IndicDialogue: A dataset of subtitles in 10 Indic languages for Indic language modeling

Authors
Arnob, Noor Mairukh Khan [1]
Faiyaz, A. [1]
Fuad, Md Mubtasim [1]
Masud, Shah Murtaza Rashid Al [1]
Das, Baivab [1]
Mridha, M. F. [2]
Affiliations
[1] Univ Asia Pacific, Dept Comp Sci & Engn, Dhaka, Bangladesh
[2] Amer Int Univ Bangladesh, Dept Comp Sci, Dhaka, Bangladesh
Source
DATA IN BRIEF | 2024 / Vol. 55
Keywords
Natural Language Processing (NLP); Low-resource languages; Linguistics; Inclusive AI;
DOI
10.1016/j.dib.2024.110690
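Chinese Library Classification (CLC) number
O [Mathematical Sciences and Chemistry]; P [Astronomy and Earth Sciences]; Q [Biological Sciences]; N [General Natural Sciences];
Discipline classification code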
07 ; 0710 ; 09 ;
Abstract
Languages of the Indian subcontinent are underrepresented in current NLP literature. To help narrow this gap, we present the IndicDialogue dataset, which contains subtitles and dialogues in 10 major Indic languages: Hindi, Bengali, Marathi, Telugu, Tamil, Urdu, Odia, Sindhi, Nepali, and Assamese. The dataset is sourced from OpenSubtitles.org. Subtitles are pre-processed to remove irrelevant tags, timestamps, square brackets, and links, and the remaining dialogues are stored in JSONL files. The IndicDialogue dataset comprises 7750 raw subtitle files (SRT), 11 JSONL files, 6,853,518 dialogues, and 42,188,569 words. It is designed to serve as a foundation for language model pre-training for low-resource languages, enabling a wide range of downstream tasks including word embeddings, topic modeling, conversation synthesis, neural machine translation, and text summarization. © 2024 The Author(s). Published by Elsevier Inc. This is an open access article under the CC BY-NC license (http://creativecommons.org/licenses/by-nc/4.0/).
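The pre-processing described in the abstract (dropping tags, timestamps, square-bracketed text, and links, then writing dialogues to JSONL) might look roughly like the sketch below. It is an illustration only, not the authors' pipeline: the regular expressions, the function names `clean_srt` and `to_jsonl`, and the JSONL field names `language` and `dialogue` are all assumptions, since the abstract does not specify them.

```python
import json
import re

# Assumed patterns; the abstract does not give the authors' exact cleaning rules.
TIMESTAMP = re.compile(
    r"^\d{2}:\d{2}:\d{2}[,.]\d{3}\s*-->\s*\d{2}:\d{2}:\d{2}[,.]\d{3}"
)
TAG = re.compile(r"<[^>]+>|\{[^}]*\}")      # e.g. <i>...</i> or {\an8}
BRACKETED = re.compile(r"\[[^\]]*\]")       # e.g. [music]
LINK = re.compile(r"(?:https?://|www\.)\S+")


def clean_srt(srt_text):
    """Return one cleaned dialogue string per SRT cue (hypothetical helper)."""
    dialogues = []
    normalized = srt_text.replace("\r\n", "\n").strip()
    # SRT cues are separated by blank lines.
    for block in re.split(r"\n\s*\n", normalized):
        lines = block.split("\n")
        # The first line of a cue is normally its numeric index; drop it.
        if lines and lines[0].strip().isdigit():
            lines = lines[1:]
        kept = []
        for line in lines:
            line = line.strip()
            if not line or TIMESTAMP.match(line):
                continue
            for pattern in (TAG, BRACKETED, LINK):
                line = pattern.sub("", line)
            line = " ".join(line.split())  # collapse leftover whitespace
            if line:
                kept.append(line)
        if kept:
            dialogues.append(" ".join(kept))
    return dialogues


def to_jsonl(dialogues, language):
    """Serialize dialogues as JSONL text; the field names are assumptions."""
    return "\n".join(
        json.dumps({"language": language, "dialogue": d}, ensure_ascii=False)
        for d in dialogues
    )


if __name__ == "__main__":
    sample = (
        "1\n00:00:01,000 --> 00:00:03,000\n<i>नमस्ते</i> [music]\n\n"
        "2\n00:00:04,000 --> 00:00:06,000\nwww.example.com\nआप कैसे हैं?\n"
    )
    print(to_jsonl(clean_srt(sample), "hi"))
```

Passing `ensure_ascii=False` keeps Indic scripts readable in the output file rather than escaping them as `\uXXXX` sequences.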
Pages: 11