Building English-Punjabi Aligned Parallel Corpora of Nouns from Comparable Corpora

Cited by: 0
Authors
Kaur, Dilshad [1]
Singh, Satwinder [1]
Affiliations
[1] Cent Univ Punjab Bathinda, Dept Comp Sci & Technol, Bathinda, Punjab, India
Keywords
Aligned corpora; comparable corpora; English-Punjabi; parallel corpora
DOI
10.2478/acss-2023-0024
Chinese Library Classification (CLC) number
TP301 [Theory, methods]
Discipline classification code
081202
Abstract
Comparable corpora are a suitable resource for extracting parallel data because they are abundantly available, which matters most for language pairs where parallel data are scarce. This study focuses on building parallel data for the Punjabi-English language pair. The raw data were collected from the web content of "Mann Ki Baat", a collection of speech texts by the Prime Minister of India, Mr. Narendra Modi, broadcast on the last Sunday of every month. The data were cleaned and pre-processed using a natural language toolkit. A BERT-based alignment model was built to align the two text files at the sentence level. Noun forms were then extracted using the NLTK library in Python. The resulting noun-aligned dataset for the English-Punjabi language pair is available in the Mendeley Data repository.
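The abstract does not say how sentence representations are matched once BERT has produced them. As a rough illustration only, the sketch below assumes each sentence has already been turned into a vector and pairs sentences greedily by cosine similarity. The function names, the 0.5 threshold, and the hand-made three-dimensional vectors are all hypothetical stand-ins, not the authors' model or data.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def align_sentences(src_vecs, tgt_vecs, threshold=0.5):
    """Greedy one-to-one alignment (an assumed strategy, not the paper's).

    Scores every source/target pair, then accepts pairs from the highest
    score downward, skipping sentences that are already aligned and
    stopping once scores fall below the threshold.
    """
    scored = sorted(
        (
            (cosine(u, v), i, j)
            for i, u in enumerate(src_vecs)
            for j, v in enumerate(tgt_vecs)
        ),
        reverse=True,
    )
    used_src, used_tgt, pairs = set(), set(), []
    for score, i, j in scored:
        if score < threshold:
            break
        if i in used_src or j in used_tgt:
            continue
        used_src.add(i)
        used_tgt.add(j)
        pairs.append((i, j))
    return sorted(pairs)


# Toy vectors standing in for sentence embeddings (hypothetical values).
english_vecs = [[1.0, 0.0, 0.1], [0.0, 1.0, 0.0], [0.2, 0.1, 1.0]]
punjabi_vecs = [[0.1, 0.9, 0.0], [0.9, 0.1, 0.2], [0.0, 0.0, 1.0]]

alignment = align_sentences(english_vecs, punjabi_vecs)
print(alignment)
```

Each returned pair is (English sentence index, Punjabi sentence index). In a real pipeline the vectors would come from a multilingual encoder, and sentences with no match above the threshold would simply be left unaligned, which is the usual situation in comparable rather than parallel corpora.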
Pages: 245-251
Page count: 7