Unsupervised Multilingual Sentence Embeddings for Parallel Corpus Mining

Authors
Kvapilikova, Ivana [1]
Artetxe, Mikel [2]
Labaka, Gorka [2]
Agirre, Eneko [2]
Bojar, Ondrej [1]
Affiliations
[1] Charles Univ MFF UK, Inst Formal & Appl Linguist, Prague, Czech Republic
[2] Univ Basque Country, UPV EHU, Ixa NLP Grp, Leioa, Spain
DOI
Not available
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Existing models of multilingual sentence embeddings require large parallel data resources which are not available for low-resource languages. We propose a novel unsupervised method to derive multilingual sentence embeddings relying only on monolingual data. We first produce a synthetic parallel corpus using unsupervised machine translation, and use it to fine-tune a pretrained cross-lingual masked language model (XLM) to derive the multilingual sentence representations. The quality of the representations is evaluated on two parallel corpus mining tasks with improvements of up to 22 F1 points over vanilla XLM. In addition, we observe that a single synthetic bilingual corpus is able to improve results for other language pairs.
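As a rough illustration of the evaluation setting named in the abstract (parallel corpus mining scored with F1), the sketch below pairs each source sentence with its nearest target sentence by cosine similarity and keeps pairs above a threshold. It is not the authors' implementation: the abstract does not state the scoring function, so plain cosine similarity, the threshold of 0.9, the helper names `cosine`, `mine_pairs`, and `f1_score`, and the toy 3-dimensional vectors are all assumptions made for this example. In practice the vectors would come from a sentence encoder such as the fine-tuned XLM model the abstract describes.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mine_pairs(src_vecs, tgt_vecs, threshold=0.9):
    """Pair each source vector with its nearest target vector.

    A pair (src_index, tgt_index) is kept only if its similarity
    reaches the threshold (an assumed, hand-picked value).
    """
    pairs = []
    for i, u in enumerate(src_vecs):
        scores = [cosine(u, v) for v in tgt_vecs]
        j = max(range(len(scores)), key=scores.__getitem__)
        if scores[j] >= threshold:
            pairs.append((i, j))
    return pairs


def f1_score(predicted, gold):
    """F1 of predicted sentence pairs against gold-standard pairs."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Toy, hand-made "embeddings" (hypothetical values, not model output).
src = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.5, 0.5, 0.7]]
tgt = [[0.0, 0.9, 0.1], [0.9, 0.1, 0.0], [0.0, 0.0, 1.0]]
gold = [(0, 1), (1, 0)]  # the third source sentence has no translation

mined = mine_pairs(src, tgt, threshold=0.9)
score = f1_score(mined, gold)
print(mined, score)
```

The threshold trades precision against recall: lowering it would also pair the third source sentence with a target, adding a false positive and reducing F1 on this toy data.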
Pages: 255-262
Page count: 8