Probabilistic Topic Modeling for Genomic Data Interpretation

Authors
Chen, Xin [1]
Hu, Xiaohua [1]
Shen, Xiajiong [3]
Rosen, Gail [2]
Affiliations
[1] Drexel Univ, Coll Informat Sci & Technol, Philadelphia, PA 19104 USA
[2] Drexel Univ, Dept Elect & Comp Engn, Philadelphia, PA 19104 USA
[3] Henan Univ, Coll Comp & Informat Engn, Kaifeng, Henan, Peoples R China
Funding
National Science Foundation (USA)
Keywords
genomic data formatting; N-mer feature; Latent Dirichlet Allocation; core and distributed genes; functional annotation
DOI
Not available
Chinese Library Classification number
TP39 [Computer applications]
Discipline classification code
081203; 0835
Abstract
Recently, the concept of a species containing both core and distributed genes, known as the supra- or pangenome theory, has been introduced. In this paper, we aim to develop a new method that can analyze the genome-level composition of DNA sequences, in order to characterize a set of common genomic features shared by the same species and identify their functional roles. To this end, we first apply a composition-based approach that breaks DNA sequences down into sub-reads called 'N-mers' and represents the sequences by their N-mer frequencies. We then introduce the Latent Dirichlet Allocation (LDA) model to study the genome-level statistical patterns (a.k.a. latent topics) of the N-mer features. Each estimated latent topic represents a certain component of the whole genome. With the help of the BioJava toolkit, we access the gene region information of reference sequences from the NCBI database. We use our data mining framework to investigate two questions: 1) do strains within a species share similar core and distributed topics? and 2) do genes with similar functional roles contain similar latent topics? After studying the mutual information between latent topics and gene regions, we provide examples of each, where the BioCyc database is used to correlate pathway and reaction information with the genes. The examples demonstrate the effectiveness of the proposed method.
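The composition-based step described in the abstract, representing a sequence by its N-mer frequencies, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function name `nmer_frequencies`, the overlapping sliding window, the default N of 3, and the choice to skip windows containing ambiguous bases are all assumptions made for illustration, since the abstract does not specify them.

```python
from collections import Counter
from typing import Dict


def nmer_frequencies(sequence: str, n: int = 3) -> Dict[str, float]:
    """Return relative frequencies of overlapping N-mers in a DNA sequence.

    Hypothetical helper: the window overlap, the default n, and the handling
    of ambiguous bases are assumptions, not details taken from the paper.
    """
    seq = sequence.upper()
    # Slide a window of width n along the sequence, one base at a time.
    windows = (seq[i:i + n] for i in range(len(seq) - n + 1))
    # Keep only windows made entirely of unambiguous bases.
    counts = Counter(w for w in windows if set(w) <= set("ACGT"))
    total = sum(counts.values())
    if total == 0:
        return {}
    # Normalize counts so the frequencies sum to 1.
    return {nmer: count / total for nmer, count in counts.items()}


freqs = nmer_frequencies("ACGTACGT", n=3)
```

A frequency vector of this kind plays the role of a document's word counts, which is the form of input a topic model such as LDA expects.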
Pages: 149-152
Page count: 4