Searching the Web for Cross-lingual Parallel Data

Cited by: 4
Authors
El-Kishky, Ahmed [1]
Koehn, Philipp [2]
Schwenk, Holger [1]
Affiliations
[1] Facebook AI, Seattle, WA 98109 USA
[2] Johns Hopkins Univ, Baltimore, MD USA
Keywords
cross-lingual document retrieval; cross-lingual sentence retrieval; machine translation; multilingual embedding; web mining
DOI
10.1145/3397271.3401417
Chinese Library Classification number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
While the World Wide Web provides a large amount of text in many languages, cross-lingual parallel data is more difficult to obtain. Despite its scarcity, this parallel cross-lingual data plays a crucial role in a variety of tasks in natural language processing with applications in machine translation, cross-lingual information retrieval, and document classification, as well as learning cross-lingual representations. Here, we describe the end-to-end process of searching the web for parallel cross-lingual texts. We motivate obtaining parallel text as a retrieval problem whereby the goal is to retrieve cross-lingual parallel text from a large, multilingual web-crawled corpus. We introduce techniques for searching for cross-lingual parallel data based on language, content, and other metadata. We motivate and introduce multilingual sentence embeddings as a core tool and demonstrate techniques and models that leverage them for identifying parallel documents and sentences as well as techniques for retrieving and filtering this data. We describe several large-scale datasets curated using these techniques and show how training on sentences extracted from parallel or comparable documents mined from the Web can improve machine translation models and facilitate cross-lingual NLP.
Pages: 2417-2420
Page count: 4