Unsupervised Parallel Sentence Extraction with Parallel Segment Detection Helps Machine Translation

Cited by: 0
Authors
Hangya, Viktor [1]
Fraser, Alexander [1]
Affiliations
[1] Ludwig-Maximilians-Universität München, Center for Information and Language Processing, Munich, Germany
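Funding
European Research Council
Keywords
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract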
Mining parallel sentences from comparable corpora is important. Most previous work relies on supervised systems trained on parallel data, so their applicability is problematic in low-resource scenarios. Recent developments in building unsupervised bilingual word embeddings have made it possible to mine parallel sentences based on the cosine similarities of source and target language words. We show that relying only on this information is not enough, since sentences often have similar words but different meanings. We detect continuous parallel segments in sentence pair candidates and rely on them when mining parallel sentences. We show better mining accuracy on three language pairs in a standard shared task on artificial data. We also provide the first experiments showing that parallel sentences mined from real-life sources improve unsupervised MT. Our code is available; we hope it will be used to support low-resource MT research.
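
The idea in the abstract can be illustrated with a minimal sketch. This is not the authors' released code: the function names, the toy two-dimensional vectors, the similarity threshold of 0.8, and the minimum segment ratio of 0.6 are all assumptions chosen for illustration. Each source word is scored by its best cosine similarity to any target word in a shared bilingual embedding space, and a sentence pair is accepted only if a sufficiently long continuous run of source words is well matched.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def best_match_scores(src_vecs, tgt_vecs):
    """For each source word, the highest cosine similarity to any target word."""
    return [max(cosine(s, t) for t in tgt_vecs) for s in src_vecs]


def longest_parallel_segment(scores, threshold=0.8):
    """Return (start, length) of the longest continuous run with score >= threshold."""
    best_start, best_len = 0, 0
    start, length = 0, 0
    for i, score in enumerate(scores):
        if score >= threshold:
            if length == 0:
                start = i  # a new run begins here
            length += 1
            if length > best_len:
                best_start, best_len = start, length
        else:
            length = 0  # the run is broken by a poorly matched word
    return best_start, best_len


def is_parallel_candidate(src_vecs, tgt_vecs, threshold=0.8, min_ratio=0.6):
    """Accept the pair if the longest matched segment covers enough of the source.

    threshold and min_ratio are hypothetical values, not taken from the paper.
    """
    if not src_vecs or not tgt_vecs:
        return False
    scores = best_match_scores(src_vecs, tgt_vecs)
    _, seg_len = longest_parallel_segment(scores, threshold)
    return seg_len / len(scores) >= min_ratio


# Toy embeddings (hypothetical): one vector per word, already in a shared space.
src = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
tgt_good = [[0.9, 0.1], [0.1, 0.9], [1.0, 0.9]]    # every source word has a close match
tgt_bad = [[0.9, 0.1], [-1.0, 0.2], [-0.5, -1.0]]  # only the first source word matches

print(is_parallel_candidate(src, tgt_good))
print(is_parallel_candidate(src, tgt_bad))
```

In this toy setup the second pair shares one well-matched word with the source but has no long continuous matched segment, which is the kind of case the abstract says word-level similarity alone handles poorly.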
Pages: 1224-1234
Number of pages: 11