Exploiting the Functional and Taxonomic Structure of Genomic Data by Probabilistic Topic Modeling

Cited by: 13
Authors
Chen, Xin [1 ]
Hu, Xiaohua [1 ]
Lim, Tze Y. [2 ]
Shen, Xiajiong [3 ]
Park, E. K. [4 ]
Rosen, Gail L. [5 ]
Affiliations
[1] Drexel Univ, Coll Informat Sci & Technol, Philadelphia, PA 19104 USA
[2] Drexel Univ, Dept Phys, Philadelphia, PA 19104 USA
[3] Henan Univ, Coll Comp & Informat Engn, Kaifeng, Henan, Peoples R China
[4] Calif State Univ Chico, Chico, CA 95929 USA
[5] Drexel Univ, Dept Elect & Comp Engn, Philadelphia, PA 19104 USA
Funding
U.S. National Science Foundation;
Keywords
Data mining; bioinformatics (genome or protein) databases; language models; metagenomics; CLASSIFICATION; MICROBIOTA;
DOI
10.1109/TCBB.2011.113
Chinese Library Classification
Q5 [Biochemistry];
Discipline classification codes
071010 ; 081704 ;
Abstract
In this paper, we present a method that enables both the homology-based approach and the composition-based approach to further study the functional core (i.e., the microbial core and the gene core, respectively). In the proposed method, major functional groups are identified by generative topic modeling, which can extract useful information from unlabeled data. We first show that a generative topic model can be used to model the taxon abundance information obtained by the homology-based approach and to study the microbial core. The model considers each sample as a "document" that has a mixture of functional groups, while each functional group (also known as a "latent topic") is a weighted mixture of species. Therefore, estimating the generative topic model for taxon abundance data uncovers the distribution over latent functions (latent topics) in each sample. Second, we show that a generative topic model can also be used to study the genome-level composition of "N-mer" features (DNA subreads obtained by composition-based approaches). The model considers each genome as a mixture of latent genetic patterns (latent topics), while each functional pattern is a weighted mixture of the "N-mer" features; thus, the existence of core genomes can be indicated by a set of common N-mer features. After studying the mutual information between latent topics and gene regions, we provide an explanation of the functional roles of the uncovered latent genetic patterns. The experimental results demonstrate the effectiveness of the proposed method.
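The abstract treats each genome as a "document" whose "words" are N-mer features. As a minimal sketch of that input representation only, the Python below counts overlapping N-mers and assembles a genome-by-N-mer count matrix of the kind a topic model could be fitted to. The function names, the toy sequences, the choice of N = 3, and the rule of skipping windows with non-ACGT symbols are all assumptions made for illustration; the abstract does not describe the authors' actual preprocessing.

```python
from collections import Counter


def nmer_counts(sequence, n):
    """Count overlapping N-mers, skipping windows with non-ACGT symbols (assumed rule)."""
    sequence = sequence.upper()
    counts = Counter()
    for i in range(len(sequence) - n + 1):
        window = sequence[i:i + n]
        if set(window) <= set("ACGT"):
            counts[window] += 1
    return counts


def count_matrix(genomes, n):
    """Build a genome-by-N-mer count matrix: one row per genome, columns in sorted N-mer order."""
    per_genome = {name: nmer_counts(seq, n) for name, seq in genomes.items()}
    vocabulary = sorted(set().union(*per_genome.values()))
    rows = {name: [c[w] for w in vocabulary] for name, c in per_genome.items()}
    return vocabulary, rows


# Toy sequences invented for illustration; they are not data from the paper.
genomes = {"genome_a": "ACGTACGT", "genome_b": "ACGTTTTT"}
vocabulary, rows = count_matrix(genomes, 3)
```

In this framing, each row plays the role of a document's word counts, so N-mers that appear across many genomes are candidates for the "common N-mer features" the abstract mentions.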
Pages: 980-991
Number of pages: 12