IndicDialogue: A dataset of subtitles in 10 Indic languages for Indic language modeling

Cited by: 0
Authors
Arnob, Noor Mairukh Khan [1]
Faiyaz, A. [1]
Fuad, Md Mubtasim [1]
Masud, Shah Murtaza Rashid Al [1]
Das, Baivab [1]
Mridha, M. F. [2]
Affiliations
[1] Univ Asia Pacific, Dept Comp Sci & Engn, Dhaka, Bangladesh
[2] Amer Int Univ Bangladesh, Dept Comp Sci, Dhaka, Bangladesh
Source
DATA IN BRIEF | 2024, Vol. 55
Keywords
Natural Language Processing (NLP); Low-resource languages; Linguistics; Inclusive AI;
DOI
10.1016/j.dib.2024.110690
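Chinese Library Classification (CLC) number
O [Mathematical sciences and chemistry]; P [Astronomy and earth sciences]; Q [Biological sciences]; N [General natural sciences];
Discipline classification code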
07 ; 0710 ; 09 ;
Abstract
The languages of the Indian subcontinent are underrepresented in current NLP literature. To mitigate this gap, we present the IndicDialogue dataset, which contains subtitles and dialogues in 10 major Indic languages: Hindi, Bengali, Marathi, Telugu, Tamil, Urdu, Odia, Sindhi, Nepali, and Assamese. The dataset is sourced from OpenSubtitles.org. Subtitles are pre-processed to remove irrelevant tags, timestamps, square brackets, and links, and the remaining dialogues are stored in JSONL files. The IndicDialogue dataset comprises 7750 raw subtitle files (SRT), 11 JSONL files, 6,853,518 dialogues, and 42,188,569 words. It is designed to serve as a foundation for language model pre-training for low-resource languages, enabling a wide range of downstream tasks including word embeddings, topic modeling, conversation synthesis, neural machine translation, and text summarization. (c) 2024 The Author(s). Published by Elsevier Inc. This is an open access article under the CC BY-NC license (http://creativecommons.org/licenses/by-nc/4.0/)
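The pre-processing described in the abstract (dropping tags, timestamps, square-bracketed text, and links, then writing dialogues to JSONL) can be illustrated with a minimal sketch. This is not the authors' pipeline: the regular expressions, the function names `clean_srt` and `to_jsonl`, the JSONL field name `dialogue`, and the sample text are all assumptions made for illustration only.

```python
import json
import re

# Assumed patterns; the actual rules used for IndicDialogue may differ.
TIMESTAMP = re.compile(
    r"^\d{2}:\d{2}:\d{2}[,.]\d{3}\s*-->\s*\d{2}:\d{2}:\d{2}[,.]\d{3}.*$"
)
TAG = re.compile(r"<[^>]+>|\{\\[^}]*\}")       # <i>...</i> and {\an8}-style tags
BRACKETED = re.compile(r"\[[^\]]*\]")          # [music], [door slams], ...
LINK = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)


def clean_srt(srt_text):
    """Return one cleaned dialogue string per non-empty SRT cue."""
    text = srt_text.lstrip("\ufeff").replace("\r\n", "\n").strip()
    dialogues = []
    for block in re.split(r"\n\s*\n", text):
        lines = block.split("\n")
        # The first line of an SRT cue is normally its numeric index.
        if lines and lines[0].strip().isdigit():
            lines = lines[1:]
        kept = []
        for line in lines:
            line = line.strip()
            if not line or TIMESTAMP.match(line):
                continue
            line = TAG.sub("", line)
            line = BRACKETED.sub("", line)
            line = LINK.sub("", line)
            line = re.sub(r"\s+", " ", line).strip()
            if line:
                kept.append(line)
        if kept:
            dialogues.append(" ".join(kept))
    return dialogues


def to_jsonl(dialogues, field="dialogue"):
    """Serialize dialogues as JSON Lines; the field name is hypothetical."""
    return "\n".join(
        json.dumps({field: d}, ensure_ascii=False) for d in dialogues
    )


# Made-up sample, not taken from the dataset.
SAMPLE = """1
00:00:01,000 --> 00:00:03,500
<i>नमस्ते, आप कैसे हैं?</i>

2
00:00:04,000 --> 00:00:06,000
[door slams]
www.example.com

3
00:00:07,000 --> 00:00:09,000
আমি ভালো আছি।
ধন্যবাদ।
"""

cleaned = clean_srt(SAMPLE)
word_count = sum(len(d.split()) for d in cleaned)  # whitespace-delimited words
print(to_jsonl(cleaned))
```

Counting words by whitespace splitting, as in `word_count`, is only one possible convention; the abstract does not say how its word total was computed.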
Pages: 11
Related papers
50 in total
  • [31] HMM Based Language Identification from Speech Utterances of Popular Indic Languages Using Spectral and Prosodic Features
    Sadanandam, Manchala
    TRAITEMENT DU SIGNAL, 2021, 38 (02) : 521 - 528
  • [32] Exploring the Role of Language Families for Building Indic Speech Synthesisers
    Prakash, Anusha
    Murthy, Hema A.
    IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2023, 31 : 734 - 747
  • [33] Naamapadam: A Large-Scale Named Entity Annotated Data for Indic Languages
    Mhaske, Arnav
    Kedia, Harshit
    Doddapaneni, Sumanth
    Khapra, Mitesh M.
    Kumar, Pratyush
    Murthy, V. Rudra
    Kunchukuttan, Anoop
    PROCEEDINGS OF THE 61ST ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2023): LONG PAPERS, VOL 1, 2023 : 10441 - 10456
  • [34] Samanantar: The Largest Publicly Available Parallel Corpora Collection for 11 Indic Languages
    Ramesh, Gowtham
    Doddapaneni, Sumanth
    Bheemaraj, Aravinth
    Jobanputra, Mayank
    Raghavan, A. K.
    Sharma, Ajitesh
    Sahoo, Sujit
    Diddee, Harshita
    Mahalakshmi, J.
    Kakwani, Divyanshu
    Kumar, Navneet
    Pradeep, Aswin
    Nagaraj, Srihari
    Deepak, Kumar
    Raghavan, Vivek
    Kunchukuttan, Anoop
    Kumar, Pratyush
    Khapra, Mitesh Shantadevi
    TRANSACTIONS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, 2022, 10 : 145 - 162
  • [35] Neural machine translation system of indic languages-an attention based approach
    Shah, Parth
    Bakrola, Vishvajit
    2019 2nd International Conference on Advanced Computational and Communication Paradigms, ICACCP 2019, 2019
  • [36] NeuMorph: Neural Morphological Tagging for Low-Resource Languages-An Experimental Study for Indic Languages
    Chakrabarty, Abhisek
    Chaturvedi, Akshay
    Garain, Utpal
    ACM TRANSACTIONS ON ASIAN AND LOW-RESOURCE LANGUAGE INFORMATION PROCESSING, 2020, 19 (01)
  • [37] IndicBART: A Pre-trained Model for Indic Natural Language Generation
    Dabre, Raj
    Shrotriya, Himani
    Kunchukuttan, Anoop
    Puduppully, Ratish
    Khapra, Mitesh M.
    Kumar, Pratyush
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2022), 2022 : 1849 - 1863
  • [38] Word Sense Disambiguation from English to Indic Language: Approaches and Opportunities
    Mishra, Binod Kumar
    Jain, Suresh
    SOFT COMPUTING AND ITS ENGINEERING APPLICATIONS, ICSOFTCOMP 2022, 2023, 1788 : 135 - 146
  • [39] Cross-language framework for word recognition and spotting of Indic scripts
    Bhunia, Ayan Kumar
    Roy, Partha Pratim
    Mohta, Akash
    Pal, Umapada
    PATTERN RECOGNITION, 2018, 79 : 12 - 31
  • [40] Command and control of industrial manipulator through speech-based interfaces in Indic Languages
    Saravanan, N.
    Sivaramakrishnan, R.
    JOURNAL OF SUPERCOMPUTING, 2019, 75 (08) : 5106 - 5117