Building English - Punjabi Aligned Parallel Corpora of Nouns from Comparable Corpora

Authors
Kaur, Dilshad [1]
Singh, Satwinder [1]
Affiliations
[1] Cent Univ Punjab Bathinda, Dept Comp Sci & Technol, Bathinda, Punjab, India
Keywords
Aligned corpora; comparable corpora; English-Punjabi; parallel corpora
DOI
10.2478/acss-2023-0024
Chinese Library Classification
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract
Comparable corpora are a practical resource for extracting parallel data because they are abundantly available, which matters most for language pairs where parallel data are scarce. This study focuses on building parallel data for the Punjabi-English language pair. The raw data were collected from the web content of "Mann Ki Baat", a collection of the textual speeches of the Prime Minister of India, Mr. Narendra Modi, broadcast on the last Sunday of every month. The data were cleaned and pre-processed using a natural language toolkit. A BERT-based alignment model was built to align the two text files at the sentence level. Noun forms were then extracted with the NLTK library in Python. The resulting noun-aligned dataset for the English-Punjabi language pair is available in the Mendeley Data repository.
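
The alignment and noun-extraction steps described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes sentence embeddings have already been computed (for example by a multilingual BERT-style encoder) and uses small hand-written vectors in their place. The function names `cosine`, `align_sentences`, and `extract_nouns`, the greedy best-match strategy, and the 0.8 threshold are all illustrative assumptions. The noun filter expects `(token, tag)` pairs with Penn Treebank tags, the format NLTK's `pos_tag` returns by default.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def align_sentences(src_vecs, tgt_vecs, threshold=0.8):
    """Greedily pair each source sentence with its most similar target.

    Returns (source_index, target_index, score) triples; pairs scoring
    below the (assumed) threshold are dropped as non-parallel.
    """
    pairs = []
    for i, u in enumerate(src_vecs):
        scores = [cosine(u, v) for v in tgt_vecs]
        if not scores:
            continue
        j = max(range(len(scores)), key=scores.__getitem__)
        if scores[j] >= threshold:
            pairs.append((i, j, scores[j]))
    return pairs


def extract_nouns(tagged_tokens):
    """Keep tokens whose Penn Treebank tag starts with NN (NN, NNS, NNP, NNPS)."""
    return [word for word, tag in tagged_tokens if tag.startswith("NN")]


# Toy stand-ins for encoder output: two English and three Punjabi sentences.
en_vecs = [[1.0, 0.0, 0.1], [0.0, 1.0, 0.0]]
pa_vecs = [[0.1, 0.9, 0.0], [0.9, 0.1, 0.1], [0.0, 0.0, 1.0]]
aligned = align_sentences(en_vecs, pa_vecs)

# Hand-tagged tokens in the (token, tag) format produced by a POS tagger.
tagged = [("Prime", "NNP"), ("Minister", "NNP"), ("speaks", "VBZ"), ("monthly", "RB")]
nouns = extract_nouns(tagged)
```

With real data, the toy vectors would be replaced by encoder output for each sentence and the hand-tagged tokens by tagger output. The threshold would need tuning on a held-out sample of known parallel sentences.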
Pages: 245-251
Number of pages: 7