Searching the Web for Cross-lingual Parallel Data

Cited by: 4
Authors
El-Kishky, Ahmed [1]
Koehn, Philipp [2]
Schwenk, Holger [1]
Affiliations
[1] Facebook AI, Seattle, WA 98109 USA
[2] Johns Hopkins Univ, Baltimore, MD USA
Keywords
cross-lingual document retrieval; cross-lingual sentence retrieval; machine translation; multilingual embedding; web mining
DOI
10.1145/3397271.3401417
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
While the World Wide Web provides a large amount of text in many languages, cross-lingual parallel data is more difficult to obtain. Despite its scarcity, this parallel cross-lingual data plays a crucial role in a variety of tasks in natural language processing with applications in machine translation, cross-lingual information retrieval, and document classification, as well as learning cross-lingual representations. Here, we describe the end-to-end process of searching the web for parallel cross-lingual texts. We motivate obtaining parallel text as a retrieval problem whereby the goal is to retrieve cross-lingual parallel text from a large, multilingual web-crawled corpus. We introduce techniques for searching for cross-lingual parallel data based on language, content, and other metadata. We motivate and introduce multilingual sentence embeddings as a core tool and demonstrate techniques and models that leverage them for identifying parallel documents and sentences as well as techniques for retrieving and filtering this data. We describe several large-scale datasets curated using these techniques and show how training on sentences extracted from parallel or comparable documents mined from the Web can improve machine translation models and facilitate cross-lingual NLP.
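The retrieval-and-filtering idea in the abstract can be illustrated with a minimal sketch. It assumes sentences have already been mapped into a shared multilingual embedding space; the vectors, the `mine_pairs` helper, and the 0.8 threshold below are all hypothetical and chosen only for illustration. This is plain cosine nearest-neighbor search with a score cutoff, not necessarily the scoring the authors use.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mine_pairs(src_vecs, tgt_vecs, threshold=0.8):
    """Retrieve the nearest target for each source, then filter by score.

    Returns a list of (source_index, target_index, score) tuples.
    The threshold is a hypothetical value, not taken from the paper.
    Assumes tgt_vecs is non-empty.
    """
    pairs = []
    for i, u in enumerate(src_vecs):
        scores = [cosine(u, v) for v in tgt_vecs]
        # Retrieval step: pick the most similar target sentence.
        j = max(range(len(scores)), key=scores.__getitem__)
        # Filtering step: keep the candidate only if it is similar enough.
        if scores[j] >= threshold:
            pairs.append((i, j, scores[j]))
    return pairs


# Toy stand-ins for multilingual sentence embeddings (assumed, not real data).
src = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 1.0]]
tgt = [[0.0, 0.9, 0.1], [0.9, 0.1, 0.0], [-1.0, 0.0, 0.0]]

pairs = mine_pairs(src, tgt)
for i, j, score in pairs:
    print(f"source {i} <-> target {j} (cosine {score:.3f})")
```

In this toy input the third source vector has no sufficiently close target, so the filter drops it. At web scale the exhaustive inner loop would typically be replaced by an approximate nearest-neighbor index.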
Pages: 2417-2420
Page count: 4