IndicDialogue: A dataset of subtitles in 10 Indic languages for Indic language modeling

Cited by: 0
Authors
Arnob, Noor Mairukh Khan [1]
Faiyaz, A. [1]
Fuad, Md Mubtasim [1]
Masud, Shah Murtaza Rashid Al [1]
Das, Baivab [1]
Mridha, M. F. [2]
Affiliations
[1] Univ Asia Pacific, Dept Comp Sci & Engn, Dhaka, Bangladesh
[2] Amer Int Univ Bangladesh, Dept Comp Sci, Dhaka, Bangladesh
Source
Data in Brief | 2024 / Vol. 55
Keywords
Natural Language Processing (NLP); Low-resource languages; Linguistics; Inclusive AI
DOI
10.1016/j.dib.2024.110690
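Chinese Library Classification (CLC) number
O [Mathematical sciences and chemistry]; P [Astronomy and earth sciences]; Q [Biological sciences]; N [General natural sciences]
Discipline classification code
07; 0710; 09
Abstract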
The languages of the Indian subcontinent are underrepresented in current NLP literature. To address this gap, we present the IndicDialogue dataset, which contains subtitles and dialogues in 10 major Indic languages: Hindi, Bengali, Marathi, Telugu, Tamil, Urdu, Odia, Sindhi, Nepali, and Assamese. The dataset is sourced from OpenSubtitles.org, and the subtitles are pre-processed to remove irrelevant tags, timestamps, square brackets, and links, so that only the relevant dialogues are retained in JSONL files. The IndicDialogue dataset comprises 7,750 raw subtitle files (SRT), 11 JSONL files, 6,853,518 dialogues, and 42,188,569 words. It is designed to serve as a foundation for language model pre-training for low-resource languages, enabling a wide range of downstream tasks including word embeddings, topic modeling, conversation synthesis, neural machine translation, and text summarization. (c) 2024 The Author(s). Published by Elsevier Inc. This is an open access article under the CC BY-NC license (http://creativecommons.org/licenses/by-nc/4.0/).
Pages: 11
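
The pre-processing summarized in the abstract (stripping tags, timestamps, square brackets, and links from SRT subtitles, then storing the dialogues as JSONL) might look roughly like the sketch below. This is not the authors' code: the function names `clean_srt` and `to_jsonl`, the regular expressions, the choice to drop bracketed text together with its brackets, and the `"dialogue"` field name are all assumptions made for illustration.

```python
import json
import re

# All patterns below are illustrative assumptions, not the dataset's actual rules.
TIMESTAMP_RE = re.compile(
    r"\d{1,2}:\d{2}:\d{2}[,.]\d{1,3}\s*-->\s*\d{1,2}:\d{2}:\d{2}[,.]\d{1,3}"
)
TAG_RE = re.compile(r"<[^>]+>|\{[^}]*\}")       # e.g. <i>...</i> or {\an8}
LINK_RE = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)
BRACKET_RE = re.compile(r"\[[^\]]*\]")          # e.g. [music]


def clean_line(line):
    """Strip tags, links, and bracketed text from one subtitle line."""
    line = TAG_RE.sub("", line)
    line = LINK_RE.sub("", line)
    line = BRACKET_RE.sub("", line)
    return " ".join(line.split())  # collapse leftover whitespace


def clean_srt(srt_text):
    """Return one cleaned dialogue string per SRT cue, skipping empty cues."""
    dialogues = []
    normalized = srt_text.replace("\r\n", "\n").strip()
    # SRT cues are separated by blank lines.
    for block in re.split(r"\n\s*\n", normalized):
        lines = block.split("\n")
        # Locate the timestamp line; the cue text is everything after it.
        ts_index = next(
            (i for i, ln in enumerate(lines) if TIMESTAMP_RE.search(ln)), None
        )
        if ts_index is None:
            continue  # not a well-formed cue
        cleaned = [clean_line(ln) for ln in lines[ts_index + 1:]]
        text = " ".join(part for part in cleaned if part)
        if text:
            dialogues.append(text)
    return dialogues


def to_jsonl(dialogues):
    """Serialize dialogues as JSON Lines; the 'dialogue' key is hypothetical."""
    return "\n".join(
        json.dumps({"dialogue": d}, ensure_ascii=False) for d in dialogues
    )
```

Passing `ensure_ascii=False` keeps Indic scripts readable in the output file instead of escaping them as `\uXXXX` sequences, which seems the natural choice for a corpus intended for language model pre-training.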