Unsupervised Parallel Sentence Extraction with Parallel Segment Detection Helps Machine Translation

Cited by: 0
Authors
Hangya, Viktor [1]
Fraser, Alexander [1]
Affiliations
[1] Ludwig-Maximilians-Universität München, Center for Information and Language Processing, Munich, Germany
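Funding
European Research Council
Keywords
DOI
Not available
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract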
Mining parallel sentences from comparable corpora is important. Most previous work relies on supervised systems trained on parallel data, which limits their applicability in low-resource scenarios. Recent developments in building unsupervised bilingual word embeddings have made it possible to mine parallel sentences based on the cosine similarities of source and target language words. We show that relying only on this information is not enough, since sentences often have similar words but different meanings. We detect continuous parallel segments in sentence pair candidates and rely on them when mining parallel sentences. We show better mining accuracy on three language pairs in a standard shared task on artificial data. We also provide the first experiments showing that parallel sentences mined from real-life sources improve unsupervised MT. Our code is available, and we hope it will be used to support low-resource MT research.
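The contrast the abstract draws, between averaging word-level cosine similarities and requiring a continuous parallel segment, can be illustrated with a small sketch. This is not the authors' released code: the toy embedding table `EMB`, the helper names, and the 0.9 threshold are all assumptions made for illustration, and the longest-run detector is a simplified stand-in for whatever segment detection the paper actually uses.

```python
import math

# Hypothetical toy bilingual embeddings in a shared 4-d space (illustrative only).
EMB = {
    "the": (1.0, 0.0, 0.0, 0.0), "der": (1.0, 0.0, 0.0, 0.0),
    "dog": (0.0, 1.0, 0.0, 0.0), "hund": (0.0, 1.0, 0.0, 0.0),
    "drinks": (0.0, 0.0, 1.0, 0.0), "trinkt": (0.0, 0.0, 1.0, 0.0),
    "water": (0.0, 0.0, 0.0, 1.0), "wasser": (0.0, 0.0, 0.0, 1.0),
    "frisst": (0.0, 0.0, 0.5, 0.5),  # "eats": loosely similar to "drinks" and "water"
}


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def best_match_scores(src_words, tgt_words, emb):
    """For each source word, the highest cosine similarity to any target word."""
    scores = []
    for s in src_words:
        sims = [cosine(emb[s], emb[t]) for t in tgt_words if s in emb and t in emb]
        scores.append(max(sims, default=0.0))
    return scores


def average_similarity(scores):
    """Baseline sentence-pair score: mean of the per-word best-match similarities."""
    return sum(scores) / len(scores) if scores else 0.0


def longest_segment(scores, threshold=0.9):
    """Return (start, end) of the longest contiguous run of scores >= threshold.

    `end` is exclusive; (0, 0) means no word reached the threshold.
    """
    best = (0, 0)
    start = None
    # A sentinel below any threshold closes a run that reaches the last word.
    for i, s in enumerate(scores + [float("-inf")]):
        if s >= threshold:
            if start is None:
                start = i
        elif start is not None:
            if i - start > best[1] - best[0]:
                best = (start, i)
            start = None
    return best


src = ["the", "dog", "drinks", "water"]
parallel = best_match_scores(src, ["der", "hund", "trinkt", "wasser"], EMB)
partial = best_match_scores(src, ["der", "hund", "frisst"], EMB)

print(average_similarity(parallel), longest_segment(parallel))
print(average_similarity(partial), longest_segment(partial))
```

In this toy setup the non-parallel pair still averages above 0.8, while its longest high-similarity segment covers only the first two of the four source words. That gap is the kind of signal a segment-based criterion can use and a plain average cannot.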
Pages: 1224-1234
Page count: 11