English-Welsh Cross-Lingual Embeddings

Cited: 4
Authors
Espinosa-Anke, Luis [1 ]
Palmer, Geraint [2 ]
Corcoran, Padraig [1 ]
Filimonov, Maxim [1 ]
Spasic, Irena [1 ]
Knight, Dawn [3 ]
Affiliations
[1] Cardiff Univ, Sch Comp Sci & Informat, Cardiff CF24 3AA, Wales
[2] Cardiff Univ, Sch Math, Cardiff CF24 4AG, Wales
[3] Cardiff Univ, Sch English Commun & Philosophy, Cardiff CF10 3EU, Wales
Source
APPLIED SCIENCES-BASEL | 2021, Vol. 11, Issue 14
Keywords
natural language processing; distributional semantics; machine learning; language model; word embeddings; machine translation; sentiment analysis
DOI
10.3390/app11146541
CLC number
O6 [Chemistry]
Discipline code
0703
Abstract
Cross-lingual embeddings are vector space representations in which word translations tend to be co-located. These representations enable learning transfer across languages, thus bridging the gap between data-rich languages such as English and others. In this paper, we present and evaluate a suite of cross-lingual embeddings for the English-Welsh language pair. To train the bilingual embeddings, a Welsh corpus of approximately 145 M words was combined with an English Wikipedia corpus. We used a bilingual dictionary to frame the problem of learning bilingual mappings as a supervised machine learning task: a word vector space is first learned independently on each monolingual corpus, after which a linear alignment strategy is applied to map the monolingual embeddings into a common bilingual vector space. Two approaches were used to learn monolingual embeddings, namely word2vec and fastText. Three cross-language alignment strategies were explored, namely cosine similarity, inverted softmax and cross-domain similarity local scaling (CSLS). We evaluated different combinations of these approaches using two tasks: bilingual dictionary induction and cross-lingual sentiment analysis. The best results were achieved using monolingual fastText embeddings and the CSLS metric. We also demonstrated that by including a few automatically translated training documents, the performance of a cross-lingual text classifier for Welsh can increase by approximately 20 percentage points.
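The pipeline described in the abstract (monolingual embeddings mapped into a shared space by a linear transform learned from a bilingual dictionary, then translation retrieval with CSLS) can be sketched as follows. This is a minimal illustration on toy matrices, not the authors' implementation: the orthogonal Procrustes solution shown here is one common choice of supervised linear alignment, and the function names are ours.

```python
import numpy as np

def procrustes_align(X, Y):
    """Orthogonal map W minimising ||X @ W - Y||_F over dictionary pairs.

    X, Y: (n, d) arrays holding the source and target vectors of n
    translation pairs from the bilingual dictionary. The closed-form
    solution comes from the SVD of X.T @ Y (orthogonal Procrustes).
    """
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

def csls(src, tgt, k=10):
    """CSLS scores between mapped source vectors and target vectors.

    Penalises similarity to 'hub' targets by subtracting each word's
    mean cosine similarity to its k nearest neighbours in the other
    language. Rows of src and tgt are assumed unit-normalised, so
    src @ tgt.T gives cosine similarities.
    """
    sims = src @ tgt.T
    k = min(k, sims.shape[0], sims.shape[1])
    r_src = np.sort(sims, axis=1)[:, -k:].mean(axis=1, keepdims=True)
    r_tgt = np.sort(sims, axis=0)[-k:, :].mean(axis=0, keepdims=True)
    return 2 * sims - r_src - r_tgt
```

For bilingual dictionary induction, each source word would then be translated as the target row with the highest CSLS score against its mapped vector (`np.argmax` over each row of the score matrix).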
Pages: 15
Related Papers
50 records in total
  • [41] Manipuri–English comparable corpus for cross-lingual studies
    Lenin Laitonjam
    Sanasam Ranbir Singh
    Language Resources and Evaluation, 2023, 57 : 377 - 413
  • [42] Do We Really Need Fully Unsupervised Cross-Lingual Embeddings?
    Vulic, Ivan
    Glavas, Goran
    Reichart, Roi
    Korhonen, Anna
    2019 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING AND THE 9TH INTERNATIONAL JOINT CONFERENCE ON NATURAL LANGUAGE PROCESSING (EMNLP-IJCNLP 2019): PROCEEDINGS OF THE CONFERENCE, 2019, : 4407 - 4418
  • [43] Non-Linearity in mapping based Cross-Lingual Word Embeddings
    Zhao, Jiawei
    Gilman, Andrew
    PROCEEDINGS OF THE 12TH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION (LREC 2020), 2020, : 3583 - 3589
  • [44] Adversarial training with Wasserstein distance for learning cross-lingual word embeddings
    Li, Yuling
    Zhang, Yuhong
    Yu, Kui
    Hu, Xuegang
    APPLIED INTELLIGENCE, 2021, 51 (11) : 7666 - 7678
  • [45] Learning Cross-Lingual IR from an English Retriever
    Li, Yulong
    Franz, Martin
    Sultan, Md Arafat
    Iyer, Bhavani
    Lee, Young-Suk
    Sil, Avirup
    NAACL 2022: THE 2022 CONFERENCE OF THE NORTH AMERICAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: HUMAN LANGUAGE TECHNOLOGIES, 2022, : 4428 - 4436
  • [46] Cross-Lingual Classification of Political Texts Using Multilingual Sentence Embeddings
    Licht, Hauke
    POLITICAL ANALYSIS, 2023, 31 (03) : 366 - 379
  • [47] Neural topic-enhanced cross-lingual word embeddings for CLIR
    Zhou, Dong
    Qu, Wei
    Li, Lin
    Tang, Mingdong
    Yang, Aimin
    INFORMATION SCIENCES, 2022, 608 : 809 - 824
  • [48] Cross-Lingual Neural Network Speech Synthesis Based on Multiple Embeddings
    Nosek, Tijana V.
    Suzic, Sinisa B.
    Pekar, Darko J.
    Obradovic, Radovan J.
    Secujski, Milan S.
    Delic, Vlado D.
    INTERNATIONAL JOURNAL OF INTERACTIVE MULTIMEDIA AND ARTIFICIAL INTELLIGENCE, 2021, 7 (02): : 110 - 120
  • [50] Cross-Lingual Word Embeddings for Low-Resource Language Modeling
    Adams, Oliver
    Makarucha, Adam
    Neubig, Graham
    Bird, Steven
    Cohn, Trevor
    15TH CONFERENCE OF THE EUROPEAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (EACL 2017), VOL 1: LONG PAPERS, 2017, : 937 - 947