Using topic-noise models to generate domain-specific topics across data sources

Cited by: 3
Authors
Churchill, Rob [1]
Singh, Lisa [1]
Affiliations
[1] Georgetown Univ, Dept Comp Sci, 3700 O St, Washington, DC 20007 USA
Funding
U.S. National Science Foundation
Keywords
Generative topic modeling; Topic noise model; Topic blending
DOI
10.1007/s10115-022-01805-2
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Domain-specific document collections, such as data sets about the COVID-19 pandemic, politics, and sports, have become more common as platforms grow and develop better ways to connect people whose interests align. These data sets come from many different sources, ranging from traditional sources like open-ended surveys and newspaper articles to one of the dozens of online social media platforms. Most topic models are equipped to generate topics from one or more of these data sources, but models rarely work well across all types of documents. The main problem that many models face is the varying noise levels inherent in different types of documents. We propose topic-noise models, a new type of topic model that jointly models topic and noise distributions to produce a more accurate, flexible representation of documents regardless of their origin and varying qualities. Our topic-noise model, Topic Noise Discriminator (TND), approximates topic and noise distributions side-by-side with the help of word embedding spaces. While topic-noise models are important for the types of short, noisy documents that often originate on social media platforms, TND can also be used with more traditional data sources like newspapers. TND itself generates a noise distribution that, when ensembled with other generative topic models, can produce more coherent and diverse topic sets. We show the effectiveness of this approach using Latent Dirichlet Allocation (LDA), and demonstrate the ability of TND to improve the quality of LDA topics in noisy document collections. Finally, researchers are beginning to generate topics using multiple sources and finding that they need a way to identify a core set based on text from different sources. We propose using cross-source topic blending (CSTB), an approach that maps topic sets to an s-partite graph and identifies core topics that blend topics from across s sources by identifying subgraphs with certain linkage properties. We demonstrate the effectiveness of topic-noise models and CSTB empirically on large real-world data sets from multiple domains and data sources.
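The s-partite graph idea in the abstract can be illustrated with a minimal sketch. This is not the authors' CSTB algorithm: the abstract does not specify the similarity measure or the "linkage properties" it uses, so the sketch assumes Jaccard overlap of topic words as the edge criterion and treats any connected component that contains at least one topic from every source as a core topic. The function name `blend_topics` and the `threshold` parameter are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard overlap between two word lists (assumed similarity measure)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0


def blend_topics(topic_sets, threshold=0.3):
    """Illustrative cross-source blending, not the published CSTB method.

    topic_sets maps a source name to a list of topics, each a list of words.
    Returns one sorted word list per group of linked topics that spans
    every source.
    """
    # Nodes are (source, topic index) pairs; one partition per source.
    nodes = [(s, i) for s, topics in topic_sets.items() for i in range(len(topics))]
    parent = {n: n for n in nodes}

    def find(n):
        # Union-find lookup with path halving.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    # Edges only join topics from different sources (s-partite constraint).
    for u, v in combinations(nodes, 2):
        if u[0] == v[0]:
            continue
        if jaccard(topic_sets[u[0]][u[1]], topic_sets[v[0]][v[1]]) >= threshold:
            parent[find(u)] = find(v)

    # Collect connected components.
    groups = {}
    for n in nodes:
        groups.setdefault(find(n), []).append(n)

    # Assumed linkage property: the component touches all s sources.
    core = []
    for members in groups.values():
        if {s for s, _ in members} == set(topic_sets):
            core.append(sorted({w for s, i in members for w in topic_sets[s][i]}))
    return core
```

Requiring a component to span all sources is only one possible reading of the linkage condition; a stricter variant could require every pair of topics in the group to be directly linked.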
Pages: 2159-2186
Page count: 28
Source
Knowledge and Information Systems, 2023, 65: 2159-2186