Exploiting the Functional and Taxonomic Structure of Genomic Data by Probabilistic Topic Modeling

Cited by: 13
Authors
Chen, Xin [1]
Hu, Xiaohua [1]
Lim, Tze Y. [2]
Shen, Xiajiong [3]
Park, E. K. [4]
Rosen, Gail L. [5]
Affiliations
[1] Drexel Univ, Coll Informat Sci & Technol, Philadelphia, PA 19104 USA
[2] Drexel Univ, Dept Phys, Philadelphia, PA 19104 USA
[3] Henan Univ, Coll Comp & Informat Engn, Kaifeng, Henan, Peoples R China
[4] Calif State Univ Chico, Chico, CA 95929 USA
[5] Drexel Univ, Dept Elect & Comp Engn, Philadelphia, PA 19104 USA
Funding
National Science Foundation (USA)
Keywords
Data mining; bioinformatics (genome or protein) databases; language models; metagenomics; classification; microbiota
DOI
10.1109/TCBB.2011.113
Chinese Library Classification
Q5 [Biochemistry]
Discipline classification codes
071010; 081704
Abstract
In this paper, we present a method that enables both the homology-based approach and the composition-based approach to further study the functional core (i.e., the microbial core and the gene core, respectively). In the proposed method, major functionality groups are identified by generative topic modeling, which can extract useful information from unlabeled data. We first show that a generative topic model can be used to model the taxon abundance information obtained by the homology-based approach and to study the microbial core. The model treats each sample as a "document" that has a mixture of functional groups, while each functional group (also known as a "latent topic") is a weighted mixture of species. Estimating the generative topic model for taxon abundance data therefore uncovers the distribution over latent functions (latent topics) in each sample. Second, we show that a generative topic model can also be used to study the genome-level composition of "N-mer" features (DNA subreads obtained by composition-based approaches). The model treats each genome as a mixture of latent genetic patterns (latent topics), while each latent genetic pattern is a weighted mixture of the "N-mer" features; the existence of core genomes can thus be indicated by a set of common N-mer features. After studying the mutual information between latent topics and gene regions, we provide an explanation of the functional roles of the uncovered latent genetic patterns. The experimental results demonstrate the effectiveness of the proposed method.
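As a rough illustration of the modeling idea in the abstract, the sketch below treats each sample (or genome) as a "document" of feature tokens and fits a small latent Dirichlet allocation (LDA) model by collapsed Gibbs sampling. LDA is assumed here as one common generative topic model; the abstract does not name the specific model or inference procedure. The function names (`nmer_counts`, `counts_to_tokens`, `lda_gibbs`), the hyperparameters, and the toy abundance table are hypothetical and chosen only for illustration.

```python
import random
from collections import Counter


def nmer_counts(seq, n=3):
    """Count overlapping N-mers in a DNA string, skipping windows with non-ACGT bases."""
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - n + 1):
        nmer = seq[i:i + n]
        if set(nmer) <= set("ACGT"):
            counts[nmer] += 1
    return counts


def counts_to_tokens(row):
    """Expand one row of a count table into a list of feature ids (one id per observation)."""
    return [j for j, c in enumerate(row) for _ in range(c)]


def lda_gibbs(docs, vocab_size, n_topics=2, alpha=0.1, beta=0.01, n_iter=200, seed=0):
    """Collapsed Gibbs sampling for LDA (an assumed model, not necessarily the paper's).

    docs: list of documents, each a list of integer feature ids in [0, vocab_size).
    Returns (theta, phi): per-document topic mixtures and per-topic feature mixtures.
    """
    rng = random.Random(seed)
    doc_topic = [[0] * n_topics for _ in docs]
    topic_word = [[0] * vocab_size for _ in range(n_topics)]
    topic_total = [0] * n_topics
    assign = []
    # Random initial topic assignment for every token.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            z = rng.randrange(n_topics)
            z_d.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assign.append(z_d)
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                z = assign[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Resample its topic from the collapsed conditional distribution.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(n_topics)
                ]
                z = rng.choices(range(n_topics), weights=weights)[0]
                assign[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1
    # Smoothed point estimates of the two sets of mixture weights.
    theta = [
        [(c + alpha) / (len(doc) + n_topics * alpha) for c in row]
        for row, doc in zip(doc_topic, docs)
    ]
    phi = [
        [(c + beta) / (topic_total[k] + vocab_size * beta) for c in topic_word[k]]
        for k in range(n_topics)
    ]
    return theta, phi


# Hypothetical toy abundance table: rows are samples, columns are taxa.
abundance = [[8, 7, 0, 1], [9, 6, 1, 0], [0, 1, 7, 9], [1, 0, 8, 8]]
sample_docs = [counts_to_tokens(row) for row in abundance]
theta, phi = lda_gibbs(sample_docs, vocab_size=4, n_topics=2)
```

In this sketch, `theta[d]` plays the role of the distribution over latent functions in sample `d`, and `phi[k]` is the weighted mixture of species (or N-mers) that defines latent topic `k`. For the genome-level case, the counts returned by `nmer_counts` could be mapped to integer feature ids and passed through the same `lda_gibbs` call.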
Pages: 980-991
Number of pages: 12