Extracting Multilingual Topics from Unaligned Comparable Corpora

Authors
Jagarlamudi, Jagadeesh [1]
Daume, Hal, III [1]
Affiliations
[1] Univ Utah, Sch Comp, Salt Lake City, UT 84112 USA
DOI
Not available
Chinese Library Classification number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
Topic models have been studied extensively in the context of monolingual corpora. Although there have been some attempts to mine topical structure from cross-lingual corpora, they require clues about document alignments. In this paper we present a generative model called JointLDA, which uses a bilingual dictionary to mine multilingual topics from an unaligned corpus. Experiments conducted on different data sets confirm our conjecture that jointly modeling the cross-lingual corpora offers several advantages over individual monolingual models. Because the JointLDA model merges related topics in different languages into a single multilingual topic, (a) it can fit the data with relatively fewer topics, and (b) it can predict related words from a language different from that of the given document. In fact, it has better predictive power than the bag-of-words based translation model, which leaves open the possibility that JointLDA could be preferred over the bag-of-words model for cross-lingual IR applications. We also found that the monolingual models learnt while optimizing over the cross-lingual corpora are more effective than the corresponding LDA models.
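As a rough illustration of how a bilingual dictionary can tie unaligned documents in two languages together, the sketch below maps words on both sides of a dictionary entry to one shared identifier, so that a single topic could in principle cover both languages. This is only a sketch of the general idea and not the JointLDA model itself, whose generative process is not given in the abstract. The function names, the language codes "en" and "de", and the toy dictionary are all assumptions made for the example.

```python
def build_concept_index(dictionary_pairs, src_lang, tgt_lang):
    """Map (language, word) to a shared id, one id per dictionary entry.

    Hypothetical helper: the first entry that mentions a word wins.
    """
    index = {}
    for concept_id, (src_word, tgt_word) in enumerate(dictionary_pairs):
        index.setdefault((src_lang, src_word), concept_id)
        index.setdefault((tgt_lang, tgt_word), concept_id)
    return index


def to_concepts(tokens, lang, index):
    """Replace each token by its shared id if the dictionary covers it.

    Words missing from the dictionary keep a language-tagged string form,
    so they stay language-specific.
    """
    return [index.get((lang, tok), f"{lang}:{tok}") for tok in tokens]


# Toy, made-up dictionary and two unaligned documents.
pairs = [("house", "haus"), ("water", "wasser")]
index = build_concept_index(pairs, "en", "de")

en_doc = to_concepts(["house", "water", "tree"], "en", index)
de_doc = to_concepts(["haus", "wasser", "baum"], "de", index)

# Identifiers that both documents now have in common.
shared = set(en_doc) & set(de_doc)
print(en_doc, de_doc, shared)
```

After this mapping, the two documents share vocabulary items even though no document alignment was supplied, which is the kind of signal a joint topic model could exploit.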
Pages: 444-456
Page count: 13