HindEnCorp - Hindi-English and Hindi-only Corpus for Machine Translation

Authors
Bojar, Ondrej [1]
Diatka, Vojtech [2]
Rychly, Pavel [3]
Stranak, Pavel [1]
Suchomel, Vit [3]
Tamchyna, Ales [1]
Zeman, Daniel [1]
Affiliations
[1] Charles Univ Prague, Fac Math & Phys, Inst Formal & Appl Linguist, Prague, Czech Republic
[2] Charles Univ Prague, Fac Arts, Dept Linguist, Prague, Czech Republic
[3] Masaryk Univ, Fac Informat, Nat Language Proc Ctr, CS-60177 Brno, Czech Republic
Keywords
corpora; parallel corpora; machine translation
DOI
Not available
Chinese Library Classification number
H0 [Linguistics]
Subject classification codes
030303; 0501; 050102
Abstract
We present HindEnCorp, a parallel corpus of Hindi and English, and HindMonoCorp, a monolingual corpus of Hindi, both in release version 0.5. Both corpora were collected from web sources and preprocessed primarily for training statistical machine translation systems. HindEnCorp consists of 274k parallel sentences (3.9 million Hindi and 3.8 million English tokens). HindMonoCorp amounts to 787 million tokens in 44 million sentences. Both corpora are freely available for non-commercial research, and their preliminary release has been used by numerous participants in the WMT 2014 shared translation task.
Pages: 3550-3555
Number of pages: 6