Crosslingual Topic Modeling with WikiPDA

Cited by: 4
Authors: Piccardi, Tiziano [1]; West, Robert [1]
Affiliations: [1] Ecole Polytech Fed Lausanne, Lausanne, Switzerland
Funding: Swiss National Science Foundation
DOI: 10.1145/3442381.3449805
Chinese Library Classification: TP18 [Artificial Intelligence Theory]
Subject Classification Codes: 081104; 0812; 0835; 1405
Abstract
We present Wikipedia-based Polyglot Dirichlet Allocation (WikiPDA), a crosslingual topic model that learns to represent Wikipedia articles written in any language as distributions over a common set of language-independent topics. It leverages the fact that Wikipedia articles link to each other and are mapped to concepts in the Wikidata knowledge base, such that, when represented as bags of links, articles are inherently language-independent. WikiPDA works in two steps, by first densifying bags of links using matrix completion and then training a standard monolingual topic model. A human evaluation shows that WikiPDA produces more coherent topics than monolingual text-based latent Dirichlet allocation (LDA), thus offering crosslinguality at no cost. We demonstrate WikiPDA's utility in two applications: a study of topical biases in 28 Wikipedia language editions, and crosslingual supervised document classification. Finally, we highlight WikiPDA's capacity for zero-shot language transfer, where a model is reused for new languages without any fine-tuning. Researchers can benefit from WikiPDA as a practical tool for studying Wikipedia's content across its 299 language editions in interpretable ways, via an easy-to-use library publicly available at https://github.com/epfl-dlab/WikiPDA.
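The two-step pipeline described above (bags of Wikidata concept links, then a standard topic model) can be illustrated with a minimal sketch. This is not the official WikiPDA library API from https://github.com/epfl-dlab/WikiPDA; the toy articles, concept IDs, and the `densify` placeholder are assumptions for illustration, and the matrix-completion densification step is stubbed out. It simply shows how language-independent bags of links can feed an off-the-shelf monolingual LDA (here, gensim's).

```python
# Minimal sketch of the idea in the abstract, NOT the official WikiPDA API:
# articles reduced to bags of Wikidata concept links, then a standard LDA
# trained on those bags. Toy data; densification is a no-op placeholder.

from gensim import corpora, models

# Each article, regardless of its language, is represented only by the
# Wikidata QIDs of the concepts it links to (language-independent).
articles_as_links = [
    ["Q7187", "Q8054", "Q11173"],      # e.g. a biology article in English
    ["Q7187", "Q11173", "Q420"],       # e.g. a related article in French
    ["Q11424", "Q2526255", "Q36180"],  # e.g. a film article in Japanese
]

def densify(bags):
    # Placeholder: the paper densifies sparse bags of links via matrix
    # completion before topic modeling; here we pass the bags through.
    return bags

bags = densify(articles_as_links)

# Train a standard monolingual topic model on the (densified) bags of links.
dictionary = corpora.Dictionary(bags)
corpus = [dictionary.doc2bow(bag) for bag in bags]
lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary, random_state=0)

# Any article in any language can now be mapped into the shared topic space.
print(lda.get_document_topics(corpus[0]))
```

Because the vocabulary consists of Wikidata concepts rather than words, the same trained model can score articles from a language edition never seen during training, which is the zero-shot transfer setting highlighted in the abstract.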
Pages: 3032 - 3041
Page count: 10