New metrics and tests for subject prevalence in documents based on topic modeling

Cited by: 1
Authors
Kontoghiorghes, Louisa [1]
Colubi, Ana [1]
Affiliations
[1] King's College London, London, England
Keywords
Text mining; Topic prevalence; High-dimensionality; Bayesian and frequentist statistics; Hypothesis testing; Bootstrapping; TEXT; LDA; CLASSIFICATION
DOI
10.1016/j.ijar.2023.02.009
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
The aim is to introduce a metric that quantifies the relevance of specific subjects within a text and to develop a methodology for testing whether this relevance is the same across different written documents. The proposed metric can be used to track the evolution of a subject in a series of documents or to measure the impact of a given text on related literature. To this end, text mining tools are combined with Bayesian and frequentist statistical methods in a novel way. First, topic modeling based on state-of-the-art techniques is used to identify relevant topics. The derived models are then used to quantify, through Bayesian techniques, the relative importance of a subject defined by a given set of terms, or keywords. Next, a two-sample test statistic is proposed to compare the prevalence of subjects in two groups of documents. Given the complexity of the parametric distributions involved, a distribution-free bootstrap approach is suggested, and its rationale is established. The correctness and consistency of the proposed test are analyzed through simulations. The methodology is applied to assess the impact of EU investment, made through a project, on the related scientific production, and to sentiment analysis. (c) 2023 Elsevier Inc. All rights reserved.
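The two-sample bootstrap idea mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the authors' test statistic or metric: it assumes that a prevalence score in [0, 1] has already been computed for each document (the names `bootstrap_two_sample_test`, `group_a` and `group_b` are hypothetical), and it compares the two groups through the absolute difference of their mean scores, which is only one possible choice. The null hypothesis is imposed by recentring both samples at the pooled mean before resampling.

```python
import numpy as np


def bootstrap_two_sample_test(x, y, n_boot=2000, seed=0):
    """Bootstrap test of H0: both groups have the same mean prevalence.

    x, y: 1-D arrays of per-document prevalence scores (assumed given).
    Returns the observed statistic and a bootstrap p-value.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)

    # Illustrative statistic: absolute difference of group means.
    observed = abs(x.mean() - y.mean())

    # Impose H0 by shifting both samples to the pooled mean.
    pooled_mean = np.concatenate([x, y]).mean()
    x0 = x - x.mean() + pooled_mean
    y0 = y - y.mean() + pooled_mean

    exceed = 0
    for _ in range(n_boot):
        # Resample each group with replacement, keeping its size.
        xb = rng.choice(x0, size=x0.size, replace=True)
        yb = rng.choice(y0, size=y0.size, replace=True)
        if abs(xb.mean() - yb.mean()) >= observed:
            exceed += 1

    # Add-one correction keeps the p-value strictly positive.
    p_value = (exceed + 1) / (n_boot + 1)
    return observed, p_value


# Synthetic scores for illustration only (not data from the paper).
data_rng = np.random.default_rng(42)
group_a = data_rng.beta(2, 8, size=60)  # low prevalence on average
group_b = data_rng.beta(6, 4, size=60)  # high prevalence on average
stat, p = bootstrap_two_sample_test(group_a, group_b)
```

Resampling within each group after recentring, rather than pooling the observations, lets the two groups keep different spreads under the null hypothesis; whether this matches the resampling scheme in the paper is not stated in the abstract.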
Pages: 49-69
Number of pages: 21