New metrics and tests for subject prevalence in documents based on topic modeling

Authors
Kontoghiorghes, Louisa [1]
Colubi, Ana [1]
Affiliations
[1] Kings Coll London, London, England
Keywords
Text mining; Topic prevalence; High-dimensionality; Bayesian and frequentist statistics; Hypothesis testing; Bootstrapping; TEXT; LDA; CLASSIFICATION
DOI
10.1016/j.ijar.2023.02.009
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
The aim is to introduce a metric that quantifies the relevance of specific subjects within a text, and to develop a methodology for testing whether this relevance is the same across different written documents. The proposed metric can be used to track the evolution of a subject over a series of documents or to measure the impact of a given text on related literature. To this end, text mining tools are combined with Bayesian and frequentist statistical methods in a novel way. First, topic modeling based on state-of-the-art techniques is used to identify relevant topics. The derived models are then used, by means of Bayesian techniques, to quantify the relative importance of a subject defined through a given set of terms, or keywords. Next, a two-sample test statistic is proposed to compare the prevalence of subjects in two groups of documents. Given the complexity of the parametric distributions involved, a distribution-free bootstrap approach is suggested, and the rationale of the approach is established. The correctness and consistency of the proposed test are analyzed through simulations. The methodology is applied to assess the impact of EU investment, made through a project, on the related scientific production, and to sentiment analysis. © 2023 Elsevier Inc. All rights reserved.
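The two-sample bootstrap idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' test statistic or procedure, which the abstract does not spell out. It is a generic bootstrap test for equal means, applied to hypothetical per-document prevalence scores. The function name `bootstrap_two_sample_test`, its parameters, and the synthetic Beta-distributed scores are all assumptions made for illustration.

```python
import random
from statistics import fmean


def bootstrap_two_sample_test(x, y, n_boot=2000, seed=0):
    """Generic bootstrap test of H0: equal mean prevalence in two groups.

    x, y: hypothetical per-document prevalence scores, e.g. the share of
    each document attributed to a subject by a fitted topic model.
    Returns the observed absolute mean difference and a bootstrap p-value.
    """
    rng = random.Random(seed)
    mean_x, mean_y = fmean(x), fmean(y)
    observed = abs(mean_x - mean_y)

    # Center each group at its own mean so that resampling mimics H0
    # (equal means) while keeping each group's own shape and spread.
    x0 = [v - mean_x for v in x]
    y0 = [v - mean_y for v in y]

    exceed = 0
    for _ in range(n_boot):
        bx = rng.choices(x0, k=len(x0))  # resample with replacement
        by = rng.choices(y0, k=len(y0))
        if abs(fmean(bx) - fmean(by)) >= observed:
            exceed += 1

    # Add-one correction keeps the p-value strictly positive.
    p_value = (exceed + 1) / (n_boot + 1)
    return observed, p_value


if __name__ == "__main__":
    # Synthetic scores in (0, 1); purely illustrative, not data from the paper.
    rng = random.Random(1)
    group_a = [rng.betavariate(2, 8) for _ in range(60)]
    group_b = [rng.betavariate(4, 6) for _ in range(60)]
    stat, p = bootstrap_two_sample_test(group_a, group_b)
    print(f"difference = {stat:.3f}, p-value = {p:.4f}")
```

Centering each group before resampling is one common way to impose the null hypothesis without assuming a parametric form for the scores, which is the general motivation for a distribution-free approach.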
Pages: 49-69
Page count: 21