HindEnCorp - Hindi-English and Hindi-only Corpus for Machine Translation

Authors
Bojar, Ondrej [1]
Diatka, Vojtech [2]
Rychly, Pavel [3]
Stranak, Pavel [1]
Suchomel, Vit [3]
Tamchyna, Ales [1]
Zeman, Daniel [1]
Affiliations
[1] Charles Univ Prague, Fac Math & Phys, Inst Formal & Appl Linguist, Prague, Czech Republic
[2] Charles Univ Prague, Fac Arts, Dept Linguist, Prague, Czech Republic
[3] Masaryk Univ, Fac Informat, Nat Language Proc Ctr, CS-60177 Brno, Czech Republic
Keywords
corpora; parallel corpora; machine translation;
DOI
Not available
Chinese Library Classification number
H0 [Linguistics];
Discipline classification codes
030303 ; 0501 ; 050102 ;
Abstract
We present HindEnCorp, a parallel corpus of Hindi and English, and HindMonoCorp, a monolingual corpus of Hindi, in their release version 0.5. Both corpora were collected from web sources and preprocessed primarily for the training of statistical machine translation systems. HindEnCorp consists of 274k parallel sentences (3.9 million Hindi and 3.8 million English tokens). HindMonoCorp amounts to 787 million tokens in 44 million sentences. Both corpora are freely available for non-commercial research, and their preliminary release has been used by numerous participants of the WMT 2014 shared translation task.
Pages: 3550-3555
Number of pages: 6