New metrics and tests for subject prevalence in documents based on topic modeling

Authors
Kontoghiorghes, Louisa [1]
Colubi, Ana [1]
Affiliations
[1] Kings Coll London, London, England
Keywords
Text mining; Topic prevalence; High-dimensionality; Bayesian and frequentist statistics; Hypothesis testing; Bootstrapping; TEXT; LDA; CLASSIFICATION
DOI
10.1016/j.ijar.2023.02.009
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
The aim is to introduce a metric to quantify the relevance of specific subjects within a text and develop a methodology to test whether this relevance is the same or not in various written documents. The proposed metric can be used to track the evolution of a subject in a series of documents or to measure the impact of a given text in related literature. To this aim, text mining tools are combined with Bayesian and frequentist statistical methods innovatively. First, topic modeling based on state-of-the-art techniques is suggested to be employed to identify relevant topics. The derived models are used to quantify the relative importance of a subject defined through a given set of terms, or keywords, by employing Bayesian techniques. Then, a two-sample test statistic is proposed to compare subjects' prevalence in two groups of documents. Given the complexity of the involved parametric distributions, a distribution-free bootstrap approach is suggested. The rationale of the approach will be established. The correctness and consistency of the proposed test are analyzed through simulations. The methodology is used to assess the impact of the EU investment through a project on the related scientific production and for sentiment analysis. (c) 2023 Elsevier Inc. All rights reserved.
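The two-sample bootstrap idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' metric or test statistic: the abstract derives prevalence from a topic model with Bayesian techniques, whereas the sketch below swaps in a plain keyword token share per document and compares group means. The function names `keyword_prevalence` and `bootstrap_two_sample_test`, the recentring scheme, and the default of 2000 resamples are all assumptions made for illustration.

```python
import numpy as np


def keyword_prevalence(doc_tokens, keywords):
    """Share of tokens in one document that belong to the keyword set.

    Assumption: a simple stand-in for a topic-model-based prevalence score.
    """
    if not doc_tokens:
        return 0.0
    kw = set(keywords)
    return sum(tok in kw for tok in doc_tokens) / len(doc_tokens)


def bootstrap_two_sample_test(x, y, n_boot=2000, seed=0):
    """Distribution-free bootstrap p-value for H0: mean(x) == mean(y).

    x, y: per-document prevalence scores for the two groups.
    Returns (observed absolute mean difference, bootstrap p-value).
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    observed = abs(x.mean() - y.mean())

    # Recentre both groups on the pooled mean so that the null
    # hypothesis holds in the resampling world.
    pooled_mean = np.concatenate([x, y]).mean()
    x0 = x - x.mean() + pooled_mean
    y0 = y - y.mean() + pooled_mean

    exceed = 0
    for _ in range(n_boot):
        xb = rng.choice(x0, size=x0.size, replace=True)
        yb = rng.choice(y0, size=y0.size, replace=True)
        if abs(xb.mean() - yb.mean()) >= observed:
            exceed += 1

    # Add-one correction keeps the p-value strictly positive.
    return observed, (exceed + 1) / (n_boot + 1)
```

A small p-value suggests that the subject's prevalence differs between the two groups of documents; under these simplifying assumptions it says nothing about the properties of the test proposed in the paper itself.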
Pages: 49-69
Number of pages: 21