Building English-Punjabi Aligned Parallel Corpora of Nouns from Comparable Corpora

Authors
Kaur, Dilshad [1]
Singh, Satwinder [1]
Affiliations
[1] Cent Univ Punjab Bathinda, Dept Comp Sci & Technol, Bathinda, Punjab, India
Keywords
Aligned corpora; comparable corpora; English-Punjabi; parallel corpora
DOI
10.2478/acss-2023-0024
Chinese Library Classification number
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract
Comparable corpora are a suitable resource for extracting parallel data because they are abundantly available, which matters most for language pairs where parallel data are scarce. This study focuses on building parallel data for the Punjabi-English language pair. The raw data were collected from the web content of "Mann Ki Baat", a collection of speech texts by the Prime Minister of India, Mr. Narendra Modi, broadcast on the last Sunday of every month. The data were cleaned and pre-processed using a natural language toolkit. A BERT-based alignment model was built to align the two text files at the sentence level. Nouns were then extracted with the NLTK library in Python. The resulting noun-aligned dataset for the English-Punjabi language pair is available in the Mendeley Data repository.
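The sentence-level alignment step mentioned in the abstract can be illustrated with a minimal sketch. The record does not describe the authors' model, so this is not their implementation. It assumes that each sentence has already been mapped to an embedding vector (for example by a multilingual BERT encoder) and pairs sentences greedily by cosine similarity. The function names, the toy three-dimensional vectors, and the 0.8 threshold are all hypothetical choices made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def align_sentences(src_vecs, tgt_vecs, threshold=0.8):
    """Greedy one-to-one alignment of sentence embeddings.

    Assumption: src_vecs and tgt_vecs are precomputed sentence
    embeddings; the threshold value is illustrative only.
    Returns a list of (src_index, tgt_index, score) sorted by src_index.
    """
    # Score every candidate pair, best first.
    candidates = [
        (cosine(u, v), i, j)
        for i, u in enumerate(src_vecs)
        for j, v in enumerate(tgt_vecs)
    ]
    candidates.sort(key=lambda c: c[0], reverse=True)

    used_src, used_tgt, pairs = set(), set(), []
    for score, i, j in candidates:
        if score < threshold:
            break  # remaining pairs are too dissimilar to align
        if i in used_src or j in used_tgt:
            continue  # each sentence is aligned at most once
        used_src.add(i)
        used_tgt.add(j)
        pairs.append((i, j, score))
    return sorted(pairs)


# Toy stand-ins for English and Punjabi sentence embeddings (hypothetical).
english_vecs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
punjabi_vecs = [[0.0, 0.9, 0.1], [0.9, 0.1, 0.0], [0.5, 0.5, 0.5]]

alignment = align_sentences(english_vecs, punjabi_vecs)
for src, tgt, score in alignment:
    print(f"English sentence {src} <-> Punjabi sentence {tgt} (cosine {score:.3f})")
```

Sentences whose best match falls below the threshold stay unaligned, which reflects the fact that comparable corpora, unlike parallel corpora, are not guaranteed to contain a translation for every sentence.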
Pages: 245-251
Number of pages: 7