Data Augmentation with Unsupervised Machine Translation Improves the Structural Similarity of Cross-lingual Word Embeddings

Authors
Nishikawa, Sosuke [1]
Ri, Ryokan [1]
Tsuruoka, Yoshimasa [1]
Affiliations
[1] Univ Tokyo, Bunkyo Ku, 7-3-1 Hongo, Tokyo, Japan
DOI
Not available
Chinese Library Classification
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Unsupervised cross-lingual word embedding (CLWE) methods learn a linear transformation matrix that maps between two monolingual embedding spaces trained separately on monolingual corpora. These methods rely on the assumption that the two embedding spaces are structurally similar, which does not necessarily hold in general. In this paper, we argue that using a pseudo-parallel corpus generated by an unsupervised machine translation model increases the structural similarity of the two embedding spaces and improves the quality of CLWEs in the unsupervised mapping method. We show that our approach outperforms alternative approaches given the same amount of data. Through detailed analysis, we also show that data augmentation with pseudo data from unsupervised machine translation is especially effective for mapping-based CLWEs because (1) the pseudo data makes the source and target corpora (partially) parallel, and (2) the pseudo data contains information on the original language that helps to learn similar embedding spaces for the source and target languages.
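
The mapping step that the abstract refers to can be illustrated with a small sketch. This is not the authors' implementation: it assumes toy 2-D embeddings, a rotation-only map (a special case of an orthogonal linear transformation), and a known seed pairing of words, whereas unsupervised methods induce that pairing themselves. All words, vectors, and function names below are hypothetical.

```python
import math

def fit_rotation(src_vecs, tgt_vecs):
    """Return the angle of the rotation that best maps src_vecs onto tgt_vecs.

    For 2-D rotations, minimising sum ||R x - y||^2 has a closed form:
    theta = atan2(sum of cross products, sum of dot products).
    """
    dot = sum(sx * tx + sy * ty for (sx, sy), (tx, ty) in zip(src_vecs, tgt_vecs))
    cross = sum(sx * ty - sy * tx for (sx, sy), (tx, ty) in zip(src_vecs, tgt_vecs))
    return math.atan2(cross, dot)

def rotate(vec, theta):
    """Apply the 2-D rotation matrix for angle theta to vec."""
    x, y = vec
    c, s = math.cos(theta), math.sin(theta)
    return (c * x - s * y, s * x + c * y)

def nearest(vec, space):
    """Return the word in `space` with the highest cosine similarity to vec."""
    def cosine(a, b):
        return (a[0] * b[0] + a[1] * b[1]) / (math.hypot(*a) * math.hypot(*b))
    return max(space, key=lambda word: cosine(vec, space[word]))

# Hypothetical source-language embeddings.
src = {"dog": (1.0, 0.0), "cat": (0.8, 0.6), "house": (0.0, 1.0)}

# Hypothetical target-language embeddings: the same geometry rotated by
# 40 degrees, i.e. the two spaces are perfectly structurally similar.
true_theta = math.radians(40)
tgt = {
    "Hund": rotate(src["dog"], true_theta),
    "Katze": rotate(src["cat"], true_theta),
    "Haus": rotate(src["house"], true_theta),
}

# Fit the map on two seed pairs only, then translate a held-out word.
theta = fit_rotation([src["dog"], src["cat"]], [tgt["Hund"], tgt["Katze"]])
translation = nearest(rotate(src["house"], theta), tgt)
```

Because the toy target space is an exact rotation of the source space, the fitted map recovers the rotation and the held-out word lands on its counterpart. When the two spaces are not structurally similar, no single linear map can align them this well, which is the failure mode the abstract describes.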
Pages: 163-173
Number of pages: 11