Cross-lingual document similarity estimation and dictionary generation with comparable corpora

Cited by: 7
Authors
Stajner, Tadej [1]
Mladenic, Dunja [1]
Affiliations
[1] Jozef Stefan Inst, Jozef Stefan Int Postgrad Sch, Jamova Ulica 39, Ljubljana 1000, Slovenia
Keywords
Cross-lingual text analysis; Vector space machine translation; Representation learning; Comparable corpora; Similarity learning; Dictionary generation;
DOI
10.1007/s10115-018-1179-9
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
This paper proposes an approach for performing bilingual dictionary generation even when trained on widely available comparable bilingual corpora. We also show its capability to provide cross-lingual similarity estimates that correlate well with human judgments. We implement an approach using a nonlinear bilingual translation model that we train using comparable corpora. We propose a method using word embeddings and kernel approximation to train scalable nonlinear transformations. We demonstrate that this novel method works better on a majority of evaluated language pairs.
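The abstract's combination of word embeddings with kernel approximation can be illustrated with a minimal sketch. It is an assumption-laden toy, not the authors' implementation: it uses random Fourier features to approximate an RBF kernel and closed-form ridge regression to map source-language vectors to target-language vectors. The record does not specify the paper's actual kernel, features, or training objective. The data are synthetic stand-ins for aligned embeddings, and all names (`rff_features`, `fit_ridge`, `gamma`, `n_feats`) are hypothetical.

```python
import numpy as np

def rff_features(X, W, b):
    """Random Fourier features whose inner products approximate an RBF kernel."""
    n_feats = W.shape[1]
    return np.sqrt(2.0 / n_feats) * np.cos(X @ W + b)

def fit_ridge(Z, Y, lam=1e-3):
    """Closed-form ridge regression: solve (Z^T Z + lam * I) A = Z^T Y for A."""
    n_feats = Z.shape[1]
    return np.linalg.solve(Z.T @ Z + lam * np.eye(n_feats), Z.T @ Y)

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

rng = np.random.default_rng(0)
d_src, d_tgt, n_pairs, n_feats, gamma = 8, 6, 200, 256, 0.5

# Synthetic stand-ins for aligned source/target word embeddings (assumption:
# real use would load embeddings for word pairs mined from comparable corpora).
X = rng.normal(size=(n_pairs, d_src))
Y = np.tanh(X @ rng.normal(size=(d_src, d_tgt)))  # a nonlinear relation

# For the kernel exp(-gamma * ||x - y||^2), frequencies are drawn from
# N(0, 2 * gamma * I) and phases uniformly from [0, 2*pi).
W = rng.normal(scale=np.sqrt(2 * gamma), size=(d_src, n_feats))
b = rng.uniform(0, 2 * np.pi, size=n_feats)

Z = rff_features(X, W, b)   # nonlinear feature map of the source vectors
A = fit_ridge(Z, Y)         # linear map in feature space = nonlinear map in input space
Y_hat = Z @ A               # predicted target-space vectors

# Mean cosine similarity between predicted and true target vectors (training set).
mean_cos = np.mean([cosine(Y_hat[i], Y[i]) for i in range(n_pairs)])
print(f"mean training cosine similarity: {mean_cos:.3f}")
```

In a sketch like this, the cost of fitting grows with the number of random features rather than with the square of the number of training pairs, which is the usual motivation for kernel approximation when scalability matters.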
Pages: 729 - 743
Number of pages: 15
Related papers
50 in total
  • [1] Cross-lingual document similarity estimation and dictionary generation with comparable corpora
    Tadej Štajner
    Dunja Mladenić
    [J]. Knowledge and Information Systems, 2019, 58 : 729 - 743
  • [2] Cross-Lingual Document Similarity
    Muhic, Andrej
    Rupnik, Jan
    Skraba, Primoz
    [J]. PROCEEDINGS OF THE ITI 2012 34TH INTERNATIONAL CONFERENCE ON INFORMATION TECHNOLOGY INTERFACES (ITI), 2012, : 387 - 392
  • [3] Improved Cross-Lingual Document Similarity Measurement
    Isuranga, Udhan
    Sandaruwan, Janaka
    Athukorala, Udesh
    Dias, Gihan
    [J]. 2020 INTERNATIONAL CONFERENCE ON ASIAN LANGUAGE PROCESSING (IALP 2020), 2020, : 45 - 49
  • [4] Cross-Lingual Semantic Similarity Measure for Comparable Articles
    Saad, Motaz
    Langlois, David
    Smaili, Kamel
    [J]. ADVANCES IN NATURAL LANGUAGE PROCESSING, 2014, 8686 : 105 - +
  • [5] Document Similarity for Arabic and Cross-Lingual Web Content
    Salhi, Ali
    Yahya, Adnan H.
    [J]. ARABIC LANGUAGE PROCESSING: FROM THEORY TO PRACTICE, 2018, 782 : 134 - 146
  • [6] Evaluating cross-lingual textual similarity on dictionary alignment problem
    Sever, Yigit
    Ercan, Gonenc
    [J]. LANGUAGE RESOURCES AND EVALUATION, 2020, 54 (04) : 1059 - 1078
  • [7] Evaluating cross-lingual textual similarity on dictionary alignment problem
    Yiğit Sever
    Gönenç Ercan
    [J]. Language Resources and Evaluation, 2020, 54 : 1059 - 1078
  • [8] The application of the comparable corpora in Chinese-English Cross-Lingual Information Retrieval
    Du, L
    Zhang, YB
    Sun, L
    Sun, YF
    [J]. JOURNAL OF COMPUTER SCIENCE AND TECHNOLOGY, 2001, 16 (04) : 351 - 358
  • [9] The Application of the Comparable Corpora in Chinese-English Cross-Lingual Information Retrieval
    杜林
    张毅波
    孙乐
    孙玉芳
    [J]. Journal of Computer Science & Technology, 2001, (04) : 351 - 358
  • [10] The application of the comparable corpora in Chinese-English Cross-Lingual Information Retrieval
    Lin Du
    Yibo Zhang
    Le Sun
    Yufang Sun
    [J]. Journal of Computer Science and Technology, 2001, 16 : 351 - 358