Group Matrix Factorization for Scalable Topic Modeling

Citations: 0
Authors
Wang, Quan [1]
Cao, Zheng [2]
Xu, Jun [3]
Li, Hang [3]
Affiliations
[1] Peking Univ, MOE Microsoft Key Lab Stat & Informat Technol, Beijing, Peoples R China
[2] Shanghai Jiao Tong Univ, Dept Comp Sci & Engn, Shanghai, Peoples R China
[3] Microsoft Res Asia, Beijing, Peoples R China
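Keywords
Matrix Factorization; Topic Modeling; Large Scale
DOI
Not available
Chinese Library Classification number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract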
Topic modeling can reveal the latent structure of text data and is useful for knowledge discovery, search relevance ranking, document classification, and related tasks. One of the major challenges in topic modeling is handling the large datasets and large numbers of topics found in real-world applications. In this paper, we investigate techniques for scaling up non-probabilistic topic modeling approaches such as Regularized Latent Semantic Indexing (RLSI) and Non-negative Matrix Factorization (NMF). We propose a general topic modeling method, referred to as Group Matrix Factorization (GMF), to enhance the scalability and efficiency of the non-probabilistic approaches. GMF assumes that the text documents have already been categorized into multiple semantic classes, and that there exist class-specific topics for each class as well as shared topics across all classes. Topic modeling is then formalized as the problem of minimizing a general objective function with regularizations and/or constraints on the class-specific topics and shared topics. Because the class-specific topics of one class do not depend on those of another, they can be learned in parallel, which greatly improves scalability and efficiency. We apply GMF to RLSI and NMF, obtaining Group RLSI (GRLSI) and Group NMF (GNMF), respectively. Experiments on a Wikipedia dataset and a real-world web dataset, each containing about 3 million documents, show that GRLSI and GNMF greatly improve on RLSI and NMF in terms of scalability and efficiency. The topics discovered by GRLSI and GNMF are coherent and readable. Further experiments on a search relevance dataset containing 30,000 labeled queries show that using the topics learned by GRLSI and GNMF significantly improves search relevance.
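The shared/class-specific decomposition described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' algorithm: it assumes a plain squared-error objective with non-negativity constraints only (no regularization), uses standard multiplicative updates, and every name in it (`group_nmf`, `docs_by_class`, `k_shared`, `k_class`) is hypothetical. Each class matrix `D` (terms by documents) is approximated as `[U_s, U_c] @ V`, where `U_s` holds the shared topics and `U_c` the topics of that class.

```python
import numpy as np

def group_nmf(docs_by_class, k_shared=2, k_class=3, n_iter=50, eps=1e-9, seed=0):
    """Illustrative group NMF: one shared topic block plus one block per class."""
    rng = np.random.default_rng(seed)
    n_terms = docs_by_class[0].shape[0]
    U_s = rng.random((n_terms, k_shared))                      # shared topics
    U_c = [rng.random((n_terms, k_class)) for _ in docs_by_class]  # class topics
    V = [rng.random((k_shared + k_class, D.shape[1])) for D in docs_by_class]

    def loss():
        # Sum of squared reconstruction errors over all classes.
        return sum(np.linalg.norm(D - np.hstack([U_s, Uc]) @ Vc) ** 2
                   for D, Uc, Vc in zip(docs_by_class, U_c, V))

    history = [loss()]
    for _ in range(n_iter):
        # Per-class step: with U_s fixed, each class only touches its own
        # data, so this loop could be distributed across workers.
        for c, D in enumerate(docs_by_class):
            U = np.hstack([U_s, U_c[c]])
            V[c] *= (U.T @ D) / (U.T @ U @ V[c] + eps)
            V_spec = V[c][k_shared:]            # rows for class-specific topics
            U_c[c] *= (D @ V_spec.T) / (U @ V[c] @ V_spec.T + eps)

        # Shared step: aggregate statistics from every class, then update U_s.
        num = np.zeros_like(U_s)
        den = np.zeros_like(U_s)
        for c, D in enumerate(docs_by_class):
            U = np.hstack([U_s, U_c[c]])
            V_sh = V[c][:k_shared]              # rows for shared topics
            num += D @ V_sh.T
            den += U @ V[c] @ V_sh.T
        U_s *= num / (den + eps)
        history.append(loss())
    return U_s, U_c, V, history

# Toy usage: three classes over a 30-term vocabulary (random data, not real text).
rng = np.random.default_rng(1)
docs_by_class = [rng.random((30, n)) for n in (20, 25, 15)]
U_s, U_c, V, history = group_nmf(docs_by_class)
```

In this sketch only the shared-topic update needs information from all classes; the per-class loop is the part that the abstract says can be run in parallel.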
Pages: 375-384
Page count: 10
Related papers
50 in total
  • [1] Stability of topic modeling via matrix factorization
    Belford, Mark
    Mac Namee, Brian
    Greene, Derek
    EXPERT SYSTEMS WITH APPLICATIONS, 2018, 91 : 159 - 169
  • [2] Coupled matrix factorization and topic modeling for aspect mining
    Xiao, Ding
    Ji, Yugang
    Li, Yitong
    Zhuang, Fuzhen
    Shi, Chuan
    INFORMATION PROCESSING & MANAGEMENT, 2018, 54 (06) : 861 - 873
  • [3] Neural nonnegative matrix factorization for hierarchical multilayer topic modeling
    Haddock, Jamie
    Will, Tyler
    Vendrow, Joshua
    Zhang, Runyu
    Molitor, Denali
    Needell, Deanna
    Gao, Mengdi
    Sadovnik, Eli
    SAMPLING THEORY SIGNAL PROCESSING AND DATA ANALYSIS, 2024, 22 (01):
  • [4] NEURAL NONNEGATIVE MATRIX FACTORIZATION FOR HIERARCHICAL MULTILAYER TOPIC MODELING
    Gao, M.
    Haddock, J.
    Molitor, D.
    Needell, D.
    Sadovnik, E.
    Will, T.
    Zhang, R.
    2019 IEEE 8TH INTERNATIONAL WORKSHOP ON COMPUTATIONAL ADVANCES IN MULTI-SENSOR ADAPTIVE PROCESSING (CAMSAP 2019), 2019, : 6 - 10
  • [5] Topic Modeling on Triage Notes With Semiorthogonal Nonnegative Matrix Factorization
    Li, Yutong
    Zhu, Ruoqing
    Qu, Annie
    Ye, Han
    Sun, Zhankun
    JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION, 2021, 116 (536) : 1609 - 1624
  • [6] Tutorial on Probabilistic Topic Modeling: Additive Regularization for Stochastic Matrix Factorization
    Vorontsov, Konstantin
    Potapenko, Anna
    ANALYSIS OF IMAGES, SOCIAL NETWORKS AND TEXTS, 2014, 436 : 29 - 46
  • [7] Lifelong Hierarchical Topic Modeling via Non-negative Matrix Factorization
    Lin, Zhicheng
    Yan, Jiaxing
    Lei, Zhiqi
    Rao, Yanghui
    WEB AND BIG DATA, PT IV, APWEB-WAIM 2023, 2024, 14334 : 155 - 170
  • [8] Affinity Regularized Non-Negative Matrix Factorization for Lifelong Topic Modeling
    Chen, Yong
    Wu, Junjie
    Lin, Jianying
    Liu, Rui
    Zhang, Hui
    Ye, Zhiwen
    IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2020, 32 (07) : 1249 - 1262
  • [9] Snapshot ensembles of non-negative matrix factorization for stability of topic modeling
    Qiang, Jipeng
    Li, Yun
    Yuan, Yunhao
    Liu, Wei
    APPLIED INTELLIGENCE, 2018, 48 (11) : 3963 - 3975