Low-Resource Corpus Filtering using Multilingual Sentence Embeddings

Cited by: 0
Authors
Chaudhary, Vishrav [1]
Tang, Yuqing [1]
Guzman, Francisco [1]
Schwenk, Holger [1]
Koehn, Philipp [2]
Affiliations
[1] Facebook AI, Menlo Park, CA 94025 USA
[2] Johns Hopkins University, Baltimore, MD 21218 USA
Keywords
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
In this paper, we describe our submission to the WMT19 low-resource parallel corpus filtering shared task. Our main approach is based on the LASER toolkit (Language-Agnostic SEntence Representations), which uses an encoder-decoder architecture trained on a parallel corpus to obtain multilingual sentence representations. We then use the representations directly to score and filter the noisy parallel sentences without additionally training a scoring function. We contrast our approach to other promising methods and show that LASER yields strong results. Finally, we produce an ensemble of different scoring methods and obtain additional gains. Our submission achieved the best overall performance for both the Nepali-English and Sinhala-English 1M tasks by a margin of 1.3 and 1.4 BLEU respectively, as compared to the second best systems. Moreover, our experiments show that this technique is promising for low and even no-resource scenarios.
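The idea of scoring noisy sentence pairs directly from multilingual embeddings, without training a separate scoring function, can be illustrated with a minimal sketch. It assumes the sentence vectors have already been produced by some multilingual encoder (the toy vectors below are invented), and it uses plain cosine similarity with a fixed threshold. The abstract does not state the exact scoring formula or cutoff used in the submission, so the functions `cosine`, `score_pairs`, and `filter_pairs` and the threshold value are hypothetical choices for illustration only.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def score_pairs(src_vecs, tgt_vecs):
    """Score each candidate pair by the similarity of its two sentence embeddings."""
    return [cosine(s, t) for s, t in zip(src_vecs, tgt_vecs)]


def filter_pairs(pairs, scores, threshold):
    """Keep only the pairs whose score reaches the (assumed) threshold."""
    return [p for p, s in zip(pairs, scores) if s >= threshold]


# Toy data: in practice these vectors would come from a multilingual
# sentence encoder; the values here are made up for the example.
pairs = [
    ("source sentence A", "target sentence A"),
    ("source sentence B", "target sentence B"),
    ("source sentence C", "target sentence C"),
]
src_vecs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 0.0]]
tgt_vecs = [[0.9, 0.1, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]

scores = score_pairs(src_vecs, tgt_vecs)
kept = filter_pairs(pairs, scores, threshold=0.5)
```

Only the first pair has closely aligned embeddings, so it is the only one that survives the filter here. A real pipeline would typically rank all pairs by score and keep the best ones up to a size budget rather than rely on a hand-picked threshold.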
Pages: 261-266
Page count: 6