Low-Resource Corpus Filtering using Multilingual Sentence Embeddings

Cited by: 0

Authors:

Chaudhary, Vishrav [1]

Tang, Yuqing [1]

Guzman, Francisco [1]

Schwenk, Holger [1]

Koehn, Philipp [2]

Affiliations:

[1] Facebook AI, Menlo Pk, CA 94025 USA

[2] Johns Hopkins Univ, Baltimore, MD 21218 USA

Source:

FOURTH CONFERENCE ON MACHINE TRANSLATION (WMT 2019), VOL 3: SHARED TASK PAPERS, DAY 2 | 2019

Keywords:

DOI:

Not available

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

In this paper, we describe our submission to the WMT19 low-resource parallel corpus filtering shared task. Our main approach is based on the LASER toolkit (Language-Agnostic SEntence Representations), which uses an encoder-decoder architecture trained on a parallel corpus to obtain multilingual sentence representations. We then use the representations directly to score and filter the noisy parallel sentences without additionally training a scoring function. We contrast our approach to other promising methods and show that LASER yields strong results. Finally, we produce an ensemble of different scoring methods and obtain additional gains. Our submission achieved the best overall performance for both the Nepali-English and Sinhala-English 1M tasks by a margin of 1.3 and 1.4 BLEU respectively, as compared to the second best systems. Moreover, our experiments show that this technique is promising for low and even no-resource scenarios.
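The idea of scoring sentence pairs directly from their embeddings, with no separately trained scoring function, can be illustrated with a minimal sketch. It assumes the source and target embeddings have already been computed by a multilingual encoder such as LASER, and it uses plain cosine similarity with a fixed threshold as a stand-in, since the abstract does not specify the scoring function used in the submission. The names `cosine` and `filter_pairs` and the threshold value are hypothetical, chosen only for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def filter_pairs(pairs, src_vecs, tgt_vecs, threshold=0.8):
    """Keep sentence pairs whose embedding similarity reaches the threshold.

    pairs    -- list of (source_sentence, target_sentence) tuples
    src_vecs -- precomputed embeddings of the source sentences (assumed input)
    tgt_vecs -- precomputed embeddings of the target sentences (assumed input)
    Returns (pair, score) tuples sorted from highest to lowest score.
    """
    kept = []
    for pair, u, v in zip(pairs, src_vecs, tgt_vecs):
        score = cosine(u, v)
        if score >= threshold:
            kept.append((pair, score))
    kept.sort(key=lambda item: item[1], reverse=True)
    return kept
```

In a shared-task setting, the sorted scores could also be used to select the top-ranked pairs up to a token budget instead of applying a fixed threshold; that choice is left open here.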


Pages: 261-266

Page count: 6
