New metrics and tests for subject prevalence in documents based on topic modeling

Cited by: 0
Authors
Kontoghiorghes, Louisa [1]
Colubi, Ana [1]
Affiliations
[1] King's College London, London, England
Keywords
Text mining; Topic prevalence; High-dimensionality; Bayesian and frequentist statistics; Hypothesis testing; Bootstrapping; TEXT; LDA; CLASSIFICATION
DOI
10.1016/j.ijar.2023.02.009
Chinese Library Classification
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
The aim is to introduce a metric that quantifies the relevance of specific subjects within a text, and to develop a methodology for testing whether this relevance is the same across various written documents. The proposed metric can be used to track the evolution of a subject in a series of documents or to measure the impact of a given text on related literature. To this end, text mining tools are combined with Bayesian and frequentist statistical methods in an innovative way. First, topic modeling based on state-of-the-art techniques is proposed for identifying relevant topics. The derived models are then used, through Bayesian techniques, to quantify the relative importance of a subject defined by a given set of terms, or keywords. Next, a two-sample test statistic is proposed to compare the prevalence of subjects in two groups of documents. Given the complexity of the parametric distributions involved, a distribution-free bootstrap approach is suggested. The rationale of the approach is established, and the correctness and consistency of the proposed test are analyzed through simulations. The methodology is applied to assess the impact of EU investment through a project on the related scientific production, and to sentiment analysis. © 2023 Elsevier Inc. All rights reserved.
Pages: 49-69
Number of pages: 21
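Illustrative sketch
The abstract mentions a distribution-free bootstrap for a two-sample comparison of subject prevalence. The Python sketch below shows one generic way such a test could look. It is not the authors' statistic: it assumes each document has already been reduced to a single prevalence score (for example, a keyword-weighted topic probability), and it compares group means with a plain centered bootstrap. The function name `bootstrap_two_sample_test` and the toy scores are hypothetical.

```python
import numpy as np


def bootstrap_two_sample_test(x, y, n_boot=2000, seed=0):
    """Bootstrap test of H0: mean(x) == mean(y).

    x, y: per-document prevalence scores for two groups (assumed inputs).
    Returns the observed absolute mean difference and a bootstrap p-value.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    observed = abs(x.mean() - y.mean())

    # Impose the null hypothesis by centering each group at zero mean.
    x0 = x - x.mean()
    y0 = y - y.mean()

    exceed = 0
    for _ in range(n_boot):
        xb = rng.choice(x0, size=x.size, replace=True)
        yb = rng.choice(y0, size=y.size, replace=True)
        if abs(xb.mean() - yb.mean()) >= observed:
            exceed += 1

    # Add-one correction keeps the p-value strictly positive.
    p_value = (exceed + 1) / (n_boot + 1)
    return observed, p_value


# Toy scores, invented purely for illustration.
group_a = [0.10, 0.11, 0.09, 0.10, 0.12]
group_b = [0.50, 0.52, 0.48, 0.51, 0.49]
diff, p = bootstrap_two_sample_test(group_a, group_b)
print(f"observed difference = {diff:.3f}, bootstrap p-value = {p:.4f}")
```

Centering each group before resampling is one common way to generate bootstrap replicates under the null hypothesis. The statistic and resampling scheme in the paper may differ.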