Using topic-noise models to generate domain-specific topics across data sources

Cited by: 3

Authors:

Churchill, Rob [1]

Singh, Lisa [1]

Affiliations:

[1] Georgetown Univ, Dept Comp Sci, 3700 O St, Washington, DC 20007 USA

Source:

KNOWLEDGE AND INFORMATION SYSTEMS | 2023 / Vol. 65 / Issue 05

Funding:

National Science Foundation (USA);

Keywords:

Generative topic modeling; Topic noise model; Topic blending;

DOI:

10.1007/s10115-022-01805-2

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Domain-specific document collections, such as data sets about the COVID-19 pandemic, politics, and sports, have become more common as platforms grow and develop better ways to connect people whose interests align. These data sets come from many different sources, ranging from traditional sources like open-ended surveys and newspaper articles to one of the dozens of online social media platforms. Most topic models are equipped to generate topics from one or more of these data sources, but models rarely work well across all types of documents. The main problem that many models face is the varying noise levels inherent in different types of documents. We propose topic-noise models, a new type of topic model that jointly models topic and noise distributions to produce a more accurate, flexible representation of documents regardless of their origin and varying qualities. Our topic-noise model, Topic Noise Discriminator (TND), approximates topic and noise distributions side by side with the help of word embedding spaces. While topic-noise models are important for the types of short, noisy documents that often originate on social media platforms, TND can also be used with more traditional data sources like newspapers. TND itself generates a noise distribution that, when ensembled with other generative topic models, can produce more coherent and diverse topic sets. We show the effectiveness of this approach using Latent Dirichlet Allocation (LDA), and demonstrate the ability of TND to improve the quality of LDA topics in noisy document collections. Finally, researchers are beginning to generate topics using multiple sources and finding that they need a way to identify a core set based on text from different sources. We propose using cross-source topic blending (CSTB), an approach that maps topic sets to an s-partite graph and identifies core topics that blend topics from across s sources by identifying subgraphs with certain linkage properties. We demonstrate the effectiveness of topic-noise models and CSTB empirically on large real-world data sets from multiple domains and data sources.

Pages: 2159-2186

Page count: 28

50 items in total

[1] Using topic-noise models to generate domain-specific topics across data sources
Rob Churchill
Lisa Singh
Knowledge and Information Systems, 2023, 65 : 2159 - 2186
[2] Domain-Specific Term Rankings Using Topic Models
Liu, Zhiyuan
Sun, Maosong
INFORMATION RETRIEVAL TECHNOLOGY, 2010, 6458 : 454 - 465
[3] Detecting Environmental, Social and Governance (ESG) Topics Using Domain-Specific Language Models and Data Augmentation
Nugent, Tim
Stelea, Nicole
Leidner, Jochen L.
FLEXIBLE QUERY ANSWERING SYSTEMS (FQAS 2021), 2021, 12871 : 157 - 169
[4] Domain-Specific Analysis of Mobile App Reviews Using Keyword-Assisted Topic Models
Tushev, Miroslav
Ebrahimi, Fahimeh
Mahmoud, Anas
2022 ACM/IEEE 44TH INTERNATIONAL CONFERENCE ON SOFTWARE ENGINEERING (ICSE 2022), 2022, : 762 - 773
[5] A platform for connecting social media data to domain-specific topics using large language models: an application to student mental health
Ruocco, Leonard
Zhuang, Yuqian
Ng, Raymond
Munthali, Richard J.
Hudec, Kristen L.
Wang, Angel Y.
Vereschagin, Melissa
Vigo, Daniel V.
JAMIA OPEN, 2024, 7 (01)
[6] Domain-specific Semantics and Data Refinement of Object Models
Davies, Jim
Faitelson, David
Welch, James
ELECTRONIC NOTES IN THEORETICAL COMPUTER SCIENCE, 2008, 195 (0C) : 151 - 170
[7] Customization of Domain-Specific Reference Models for Data Warehouses
Schuetz, Christoph
Schrefl, Michael
PROCEEDINGS OF THE 2014 IEEE 18TH INTERNATIONAL ENTERPRISE DISTRIBUTED OBJECT COMPUTING CONFERENCE (EDOC 2014), 2014, : 61 - 70
[8] Topic Model Based Adaptation Data Selection for Domain-Specific Machine Translation
Yao, Liang
Liu, Mengyi
Hong, Yu
Liu, Hao
Yao, Jianmin
SOCIAL MEDIA PROCESSING, SMP 2016, 2016, 669 : 162 - 171
[9] Facilitation of Domain-Specific Data Models Design using Semantic Web Technologies for Manufacturing
Jirkovsky, Vaclav
Sebek, Ondrej
Kadera, Petr
Burget, Pavel
Knoch, Soenke
Becker, Tilman
IIWAS2019: THE 21ST INTERNATIONAL CONFERENCE ON INFORMATION INTEGRATION AND WEB-BASED APPLICATIONS & SERVICES, 2019, : 649 - 653
[10] An automatic method to generate domain-specific investigator networks using PubMed abstracts
Wei Yu
Ajay Yesupriya
Anja Wulf
Junfeng Qu
Marta Gwinn
Muin J Khoury
BMC Medical Informatics and Decision Making, 7