Evaluation of the Dirichlet Process Multinomial Mixture Model for Short-Text Topic Modeling

Cited by: 1
Authors
Karlsson, Alexander [1]
Duarte, Denio [2]
Mathiason, Gunnar [1]
Bae, Juhee [1]
Affiliations
[1] Univ Skovde, Sch Informat, Skovde, Sweden
[2] Fed Univ Fronteira Sul, Campus Chapeco, Chapeco, Brazil
Keywords
text analysis; topic modeling; Bayesian non-parametrics; Dirichlet process; short text;
DOI
10.1109/ISCBI.2018.00025
Chinese Library Classification (CLC) number
TP301 [Theory, Methods];
Discipline classification code
081202;
Abstract
Fast-moving trends, both in society and in highly competitive business areas, call for effective methods for automatic analysis. The availability of fast-moving sources in the form of short texts, such as social media and blogs, allows aggregation from a vast number of text sources, for an up-to-date view of trends and business insights. Topic modeling is established as an approach for analysis of large amounts of text, but the scarcity of statistical information in short texts is considered to be a major problem for obtaining reliable topics from traditional models such as LDA. A range of specialized topic models has been proposed, but a majority of these approaches rely on rather strong parametric assumptions, such as setting a fixed number of topics. In contrast, recent advances in the field of Bayesian non-parametrics suggest the Dirichlet process as a method that, given certain hyper-parameters, can self-adapt to the number of topics of the data at hand. We perform an empirical evaluation of the Dirichlet process multinomial (unigram) mixture model against several parametric topic models, initialized with different numbers of topics. The resulting models are evaluated using both direct and indirect measures that have been found to correlate well with human topic rankings. We show that the Dirichlet Process Multinomial Mixture model is a viable option for short text topic modeling since it on average performs better than, or nearly as well as, the parametric alternatives, while reducing parameter setting requirements and thereby eliminating the need for expensive preprocessing.
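The abstract's claim that a Dirichlet process can adapt the number of topics to the data can be illustrated with a minimal sketch of the Chinese restaurant process, a standard way to draw cluster assignments from a Dirichlet process prior. This is not the authors' implementation: it samples from the prior only and does not look at document words or perform inference. The function name `crp_assignments` and the parameters `n_docs`, `alpha`, and `seed` are hypothetical names chosen for illustration.

```python
import random


def crp_assignments(n_docs, alpha, seed=0):
    """Draw cluster labels for n_docs items from a Chinese restaurant process.

    alpha (> 0) is the concentration hyper-parameter: larger values make
    opening a new cluster more likely. No number of clusters is fixed up front.
    """
    rng = random.Random(seed)
    counts = []  # counts[k] = number of items currently in cluster k
    labels = []
    for _ in range(n_docs):
        # An existing cluster k is chosen with weight counts[k];
        # a new cluster is opened with weight alpha.
        weights = counts + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(counts):
            counts.append(1)  # open a new cluster
        else:
            counts[k] += 1
        labels.append(k)
    return labels, counts


if __name__ == "__main__":
    for alpha in (0.1, 1.0, 10.0):
        _, counts = crp_assignments(n_docs=1000, alpha=alpha)
        print(f"alpha={alpha}: {len(counts)} clusters")
```

In this sketch the number of clusters is an outcome of sampling rather than an input, and it tends to grow with `alpha` and with the number of documents. A full mixture model would additionally weight each choice by how well the document's words fit each cluster.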
Pages: 79-83
Page count: 5
Related papers
50 in total
  • [21] Exploiting Global Semantic Similarity Biterms for Short-text Topic Discovery
    Lu, Heng-yang
    Ge, Gao-jian
    Li, Yun
    Wang, Chong-jun
    Xie, Jun-yuan
    2018 IEEE 30TH INTERNATIONAL CONFERENCE ON TOOLS WITH ARTIFICIAL INTELLIGENCE (ICTAI), 2018, : 975 - 982
  • [22] Network Public Opinion Detection During the Coronavirus Pandemic: A Short-Text Relational Topic Model
    Jiang, Yuanchun
    Liang, Ruicheng
    Zhang, Ji
    Sun, Jianshan
    Liu, Yezheng
    Qian, Yang
    ACM TRANSACTIONS ON KNOWLEDGE DISCOVERY FROM DATA, 2022, 16 (03)
  • [23] Unsupervised Anomaly Detection in Multi-Topic Short-Text Corpora
    Ait-Saada, Mira
    Nadif, Mohamed
    17TH CONFERENCE OF THE EUROPEAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, EACL 2023, 2023, : 1392 - 1403
  • [24] Application of multinomial mixture model to text classification
    Novovicová, J
    Malík, A
    PATTERN RECOGNITION AND IMAGE ANALYSIS, PROCEEDINGS, 2003, 2652 : 646 - 653
  • [25] SenU-PTM: a novel phrase-based topic model for short-text topic discovery by exploiting word embeddings
    Lu, Heng-Yang
    Zhang, Yi
    Du, Yuntao
    DATA TECHNOLOGIES AND APPLICATIONS, 2021, 55 (05) : 643 - 660
  • [26] A Biterm-based Dirichlet Process Topic Model for Short Texts
    Pan, Yali
    Yin, Jian
    Liu, Shaopeng
    Li, Jing
    PROCEEDINGS OF THE 3RD INTERNATIONAL CONFERENCE ON COMPUTER SCIENCE AND SERVICE SYSTEM (CSSS), 2014, 109 : 301 - 304
  • [27] Topic modeling in short-text using non-negative matrix factorization based on deep reinforcement learning
    Shahbazi, Zeinab
    Byun, Yung-Cheol
    JOURNAL OF INTELLIGENT & FUZZY SYSTEMS, 2020, 39 (01) : 743 - 760
  • [28] On Clustering and Evaluation of Narrow Domain Short-Text Corpora
    Pinto Avendano, David Eduardo
    PROCESAMIENTO DEL LENGUAJE NATURAL, 2009, (42) : 129 - 130
  • [29] Evaluation of internal validity measures in short-text corpora
    Ingaramo, Diego
    Pinto, David
    Rosso, Paolo
    Errecalde, Marcelo
    COMPUTATIONAL LINGUISTICS AND INTELLIGENT TEXT PROCESSING, 2008, 4919 : 555 - 567
  • [30] Dirichlet Process Mixture of Mixtures Model for Unsupervised Subword Modeling
    Heck, Michael
    Sakti, Sakriani
    Nakamura, Satoshi
    IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2018, 26 (11) : 2027 - 2042