Unsupervised Multilingual Sentence Embeddings for Parallel Corpus Mining

Cited by: 0
Authors
Kvapilikova, Ivana [1]
Artetxe, Mikel [2]
Labaka, Gorka [2]
Agirre, Eneko [2]
Bojar, Ondrej [1]
Affiliations
[1] Charles Univ MFF UK, Inst Formal & Appl Linguist, Prague, Czech Republic
[2] Univ Basque Country, UPV EHU, Ixa NLP Grp, Leioa, Spain
Keywords
DOI
Not available
Chinese Library Classification
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Existing models of multilingual sentence embeddings require large parallel data resources which are not available for low-resource languages. We propose a novel unsupervised method to derive multilingual sentence embeddings relying only on monolingual data. We first produce a synthetic parallel corpus using unsupervised machine translation, and use it to fine-tune a pretrained cross-lingual masked language model (XLM) to derive the multilingual sentence representations. The quality of the representations is evaluated on two parallel corpus mining tasks with improvements of up to 22 F1 points over vanilla XLM. In addition, we observe that a single synthetic bilingual corpus is able to improve results for other language pairs.
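The abstract does not say how candidate sentence pairs are scored during mining, so the following is only a minimal sketch of the general idea of parallel corpus mining with sentence embeddings and F1 evaluation. The mutual-nearest-neighbour rule, the cosine threshold of 0.8, the function names, and the toy vectors (standing in for embeddings from a fine-tuned XLM) are all assumptions made for illustration, not the authors' procedure.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def mine_pairs(src_vecs, tgt_vecs, threshold=0.8):
    """Return (src_index, tgt_index, score) for mutual nearest neighbours.

    Hypothetical mining rule: a pair is kept only if each sentence is the
    other's best match and their cosine similarity reaches the threshold.
    """
    if not src_vecs or not tgt_vecs:
        return []
    sims = [[cosine(s, t) for t in tgt_vecs] for s in src_vecs]
    # Best target for every source sentence, and best source for every target.
    best_tgt = [max(range(len(tgt_vecs)), key=lambda j: row[j]) for row in sims]
    best_src = [
        max(range(len(src_vecs)), key=lambda i: sims[i][j])
        for j in range(len(tgt_vecs))
    ]
    pairs = []
    for i, j in enumerate(best_tgt):
        if best_src[j] == i and sims[i][j] >= threshold:
            pairs.append((i, j, sims[i][j]))
    return pairs


def f1_score(predicted, gold):
    """F1 of predicted (src, tgt) index pairs against gold pairs."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Toy 3-dimensional "embeddings"; real ones would come from the encoder.
SRC = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
TGT = [[0.0, 0.9, 0.1], [0.9, 0.1, 0.0], [0.5, 0.5, 0.5]]
GOLD = [(0, 1), (1, 0), (2, 2)]

if __name__ == "__main__":
    mined = mine_pairs(SRC, TGT)
    for i, j, score in mined:
        print(f"src {i} <-> tgt {j}: {score:.3f}")
    print(f"F1 = {f1_score([(i, j) for i, j, _ in mined], GOLD):.2f}")
```

In this toy setup the third gold pair is missed because its similarity falls below the threshold, which lowers recall and therefore F1. Better-aligned embeddings would be expected to raise such similarities and hence the F1 figure that the abstract reports.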
Pages: 255-262
Number of pages: 8
Related papers
50 in total
  • [1] Margin-based Parallel Corpus Mining with Multilingual Sentence Embeddings
    Artetxe, Mikel
    Schwenk, Holger
    [J]. 57TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2019), 2019, : 3197 - 3203
  • [2] Learning Multilingual Sentence Embeddings from Monolingual Corpus
    Wang, Shuai
    Hou, Lei
    Li, Juanzi
    Tong, Meihan
    Jiang, Jiabo
    [J]. CHINESE COMPUTATIONAL LINGUISTICS, CCL 2019, 2019, 11856 : 346 - 357
  • [3] MUSS: Multilingual Unsupervised Sentence Simplification by Mining Paraphrases
    Martin, Louis
    Fan, Angela
    de la Clergerie, Eric
    Bordes, Antoine
    Sagot, Benoit
    [J]. LREC 2022: THIRTEENTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION, 2022, : 1651 - 1664
  • [4] Low-Resource Corpus Filtering using Multilingual Sentence Embeddings
    Chaudhary, Vishrav
    Tang, Yuqing
    Guzman, Francisco
    Schwenk, Holger
    Koehn, Philipp
    [J]. FOURTH CONFERENCE ON MACHINE TRANSLATION (WMT 2019), VOL 3: SHARED TASK PAPERS, DAY 2, 2019, : 261 - 266
  • [5] Unsupervised Multilingual Word Embeddings
    Chen, Xilun
    Cardie, Claire
    [J]. 2018 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP 2018), 2018, : 261 - 270
  • [6] Are the Best Multilingual Document Embeddings simply Based on Sentence Embeddings?
    Sannigrahi, Sonal
    van Genabith, Josef
    Espana-Bonet, Cristina
    [J]. 17TH CONFERENCE OF THE EUROPEAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, EACL 2023, 2023, : 2306 - 2316
  • [7] Unsupervised multilingual sentence boundary detection
    Kiss, Tibor
    Strunk, Jan
    [J]. COMPUTATIONAL LINGUISTICS, 2006, 32 (04) : 485 - 525
  • [8] Connecting Supervised and Unsupervised Sentence Embeddings
    Levi, Gil
    [J]. REPRESENTATION LEARNING FOR NLP, 2018, : 79 - 83
  • [9] Learning Unsupervised Multilingual Word Embeddings with Incremental Multilingual Hubs
    Heyman, Geert
    Verreet, Bregt
    Vulic, Ivan
    Moens, Marie-Francine
    [J]. 2019 CONFERENCE OF THE NORTH AMERICAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: HUMAN LANGUAGE TECHNOLOGIES (NAACL HLT 2019), VOL. 1, 2019, : 1890 - 1902
  • [10] Bilingual alignment transfers to multilingual alignment for unsupervised parallel text mining
    Tien, Chih-chan
    Steinert-Threlkeld, Shane
    [J]. PROCEEDINGS OF THE 60TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2022), VOL 1: (LONG PAPERS), 2022, : 8696 - 8706