Full-Text or Abstract? Examining Topic Coherence Scores Using Latent Dirichlet Allocation

Cited by: 192
Authors
Syed, Shaheen [1]
Spruit, Marco [1]
Affiliations
[1] Univ Utrecht, Dept Informat & Comp Sci, Utrecht, Netherlands
Funding
EU Horizon 2020
Keywords
SCIENCE; TRENDS; MODEL
DOI
10.1109/DSAA.2017.61
CLC Classification
TP [Automation Technology, Computer Technology]
Subject Classification Code
0812
Abstract
This paper assesses the topic coherence and human topic ranking of latent topics uncovered from scientific publications when applying the topic model latent Dirichlet allocation (LDA) to abstract and full-text data. The coherence of a topic, used as a proxy for topic quality, is based on the distributional hypothesis, which states that words with similar meanings tend to co-occur within similar contexts. Although LDA has gained much attention from machine-learning researchers, most notably through its adaptations and extensions, little is known about how different types of textual data affect the generated topics. Our research is the first to explore these effects and shows that document frequency, document word length, and vocabulary size have mixed practical effects on the topic coherence and human topic ranking of LDA topics. We furthermore show that large document collections are less affected by incorrect or noisy terms appearing in the topic-word distributions, which makes their topics more coherent and ranked higher. Differences between abstract and full-text data are most apparent within small document collections, where full-text data yields up to 90% high-quality topics compared to 50% for abstract data.
Pages: 165-174
Page count: 10
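To make the coherence measure described in the abstract concrete, the following is a minimal sketch of fitting LDA and scoring topic coherence with the gensim library. The toy corpus, the preprocessing, and the choice of the c_v coherence measure are illustrative assumptions, not the authors' actual pipeline or settings.

# Minimal illustrative sketch (not the authors' code): fit LDA and score
# topic coherence with gensim. The toy corpus and the c_v measure are
# assumptions made for illustration only.
from gensim.corpora import Dictionary
from gensim.models import CoherenceModel, LdaModel

# Toy tokenized documents standing in for abstracts or full texts.
texts = [
    ["topic", "model", "lda", "coherence", "words"],
    ["word", "cooccurrence", "context", "coherence", "topic"],
    ["abstract", "fulltext", "corpus", "topic", "model"],
]

dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(doc) for doc in texts]

# Fit LDA; coherence scoring inspects the resulting topic-word distributions.
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2,
               random_state=42, passes=10)

# c_v coherence scores each topic via co-occurrence statistics of its top
# words, operationalizing the distributional hypothesis from the abstract.
cm = CoherenceModel(model=lda, texts=texts, dictionary=dictionary,
                    coherence="c_v")
print("Mean coherence:", cm.get_coherence())
print("Per-topic coherence:", cm.get_coherence_per_topic())

On real data, computing these scores once for an abstract-only corpus and once for a full-text corpus of the same publications would mirror the comparison the paper reports.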