Searching the Web for Cross-lingual Parallel Data

Cited by: 4
Authors
El-Kishky, Ahmed [1]
Koehn, Philipp [2]
Schwenk, Holger [1]
Affiliations
[1] Facebook AI, Seattle, WA 98109 USA
[2] Johns Hopkins Univ, Baltimore, MD USA
Keywords
cross-lingual document retrieval; cross-lingual sentence retrieval; machine translation; multilingual embedding; web mining
DOI
10.1145/3397271.3401417
Chinese Library Classification
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
While the World Wide Web provides a large amount of text in many languages, cross-lingual parallel data is more difficult to obtain. Despite its scarcity, this parallel cross-lingual data plays a crucial role in a variety of tasks in natural language processing with applications in machine translation, cross-lingual information retrieval, and document classification, as well as learning cross-lingual representations. Here, we describe the end-to-end process of searching the web for parallel cross-lingual texts. We motivate obtaining parallel text as a retrieval problem whereby the goal is to retrieve cross-lingual parallel text from a large, multilingual web-crawled corpus. We introduce techniques for searching for cross-lingual parallel data based on language, content, and other metadata. We motivate and introduce multilingual sentence embeddings as a core tool and demonstrate techniques and models that leverage them for identifying parallel documents and sentences as well as techniques for retrieving and filtering this data. We describe several large-scale datasets curated using these techniques and show how training on sentences extracted from parallel or comparable documents mined from the Web can improve machine translation models and facilitate cross-lingual NLP.
Pages: 2417-2420
Page count: 4