Searching the Web for Cross-lingual Parallel Data

被引:4
|
作者
El-Kishky, Ahmed [1 ]
Koehn, Philipp [2 ]
Schwenk, Holger [1 ]
机构
[1] Facebook AI, Seattle, WA 98109 USA
[2] Johns Hopkins Univ, Baltimore, MD USA
关键词
cross-lingual document retrieval; cross-lingual sentence retrieval; machine translation; multilingual embedding; web mining;
D O I
10.1145/3397271.3401417
中图分类号
TP [自动化技术、计算机技术];
学科分类号
0812 ;
摘要
While the World Wide Web provides a large amount of text in many languages, cross-lingual parallel data is more difficult to obtain. Despite its scarcity, this parallel cross-lingual data plays a crucial role in a variety of tasks in natural language processing with applications in machine translation, cross-lingual information retrieval, and document classification, as well as learning cross-lingual representations. Here, we describe the end-to-end process of searching the web for parallel cross-lingual texts. We motivate obtaining parallel text as a retrieval problem whereby the goal is to retrieve cross-lingual parallel text from a large, multilingual web-crawled corpus. We introduce techniques for searching for cross-lingual parallel data based on language, content, and other metadata. We motivate and introduce multilingual sentence embeddings as a core tool and demonstrate techniques and models that leverage them for identifying parallel documents and sentences as well as techniques for retrieving and filtering this data. We describe several large-scale datasets curated using these techniques and show how training on sentences extracted from parallel or comparable documents mined from the Web can improve machine translation models and facilitate cross-lingual NLP.
引用
收藏
页码:2417 / 2420
页数:4
相关论文
共 50 条
  • [31] Cross-lingual talker discrimination
    Wester, Mirjam
    11TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2010 (INTERSPEECH 2010), VOLS 1-2, 2010, : 1253 - 1256
  • [32] Cross-Lingual Word Embeddings
    Søgaard A.
    Vulić I.
    Ruder S.
    Faruqui M.
    Synthesis Lectures on Human Language Technologies, 2019, 12 (02): : 1 - 132
  • [33] Cross-lingual Continual Learning
    M'hamdi, Meryem
    Ren, Xiang
    May, Jonathan
    PROCEEDINGS OF THE 61ST ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, ACL 2023, VOL 1, 2023, : 3908 - 3943
  • [34] Cross-Lingual Phrase Retrieval
    Zheng, Heqi
    Zhang, Xiao
    Chi, Zewen
    Huang, Heyan
    Yan, Tan
    Lan, Tian
    Wei, Wei
    Mao, Xian-Ling
    PROCEEDINGS OF THE 60TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2022), VOL 1: (LONG PAPERS), 2022, : 4193 - 4204
  • [35] Cross-lingual timeline summarization
    Cagliero, Luca
    La Quatra, Moreno
    Garza, Paolo
    Baralis, Elena
    2021 IEEE FOURTH INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE AND KNOWLEDGE ENGINEERING (AIKE 2021), 2021, : 45 - 53
  • [36] Cross-Lingual Sentiment Quantification
    Esuli, Andrea
    Moreo, Alejandro
    Sebastiani, Fabrizio
    IEEE INTELLIGENT SYSTEMS, 2020, 35 (03) : 106 - 113
  • [37] Cross-Lingual Document Similarity
    Muhic, Andrej
    Rupnik, Jan
    Skraba, Primoz
    PROCEEDINGS OF THE ITI 2012 34TH INTERNATIONAL CONFERENCE ON INFORMATION TECHNOLOGY INTERFACES (ITI), 2012, : 387 - 392
  • [38] Cross-Lingual Word Embeddings
    Corro, Caio Filippo
    TRAITEMENT AUTOMATIQUE DES LANGUES, 2019, 60 (01): : 46 - 48
  • [39] Cross-lingual document clustering
    Wu, Ke
    Lu, Bao-Liang
    ADVANCES IN KNOWLEDGE DISCOVERY AND DATA MINING, PROCEEDINGS, 2007, 4426 : 956 - +
  • [40] Cross-Lingual Word Embeddings
    Agirre, Eneko
    COMPUTATIONAL LINGUISTICS, 2020, 46 (01) : 245 - 248