Identification of transcription factor contexts in literature using machine learning approaches

Cited by: 6
Authors
Yang, Hui [1 ]
Nenadic, Goran [1 ]
Keane, John A. [1 ]
Affiliation
[1] Univ Manchester, Sch Comp Sci, Manchester, Lancs, England
Funding
Biotechnology and Biological Sciences Research Council (BBSRC), UK
DOI
10.1186/1471-2105-9-S3-S11
Chinese Library Classification number
Q5 [Biochemistry];
Discipline classification codes
071010 ; 081704 ;
Abstract
Background: Information about transcription factors (TFs) is crucial for genome biology, as TFs play a central role in the regulation of gene expression. Manual literature curation is expensive and labour-intensive, while the development of semi-automated text mining support is hindered by the unavailability of training data. There have been no studies on how existing data sources (e.g. TF-related data from the MeSH thesaurus and the GO ontology) or potentially noisy example data (e.g. protein-protein interaction, PPI) could be used to provide training data for identifying TF contexts in literature.

Results: In this paper we describe a text-classification system designed to automatically recognise contexts related to transcription factors in literature. The learning model is based on a set of biological features (e.g. protein and gene names, interaction words, other biological terms) that are deemed relevant for the task. We have exploited background knowledge from existing biological resources (MeSH and GO) to engineer these features. Weak and noisy training datasets have been collected from descriptions of TF-related concepts in MeSH and GO, from PPI data, and from data representing non-protein-function descriptions. Three machine-learning methods are investigated, along with vote-based merging of individual approaches and/or different training datasets. The system achieved highly encouraging results, with most classifiers achieving an F-measure above 90%.

Conclusions: The experimental results show that the proposed model can identify TF-related contexts (i.e. sentences) with high accuracy, using a significantly smaller feature set than a traditional bag-of-words approach. The results obtained with existing PPI data suggest that the similarity between TF and PPI contexts is not as high as we had expected. We have also shown that existing knowledge sources are useful both for feature engineering and for obtaining noisy positive training data.
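The abstract mentions vote-based merging of individual classifiers and evaluation by F-measure, but does not spell out either procedure. The following is a minimal sketch of one plausible reading, not the authors' implementation: it assumes simple majority voting over sentence-level labels, with ties broken in favour of the earliest classifier in the list, and the balanced F-measure (F1) for a single positive class. The function names, the labels "TF" and "OTHER", and the tie-breaking rule are all assumptions made for illustration.

```python
from collections import Counter


def majority_vote(predictions):
    """Merge label lists from several classifiers into one label per sentence.

    `predictions` holds one list per classifier, each with one label per
    sentence. Ties are broken in favour of the earliest classifier in the
    list (an assumed rule, not taken from the paper).
    """
    merged = []
    for labels in zip(*predictions):
        counts = Counter(labels)
        top_count = max(counts.values())
        # First label, in classifier order, that reaches the top vote count.
        merged.append(next(l for l in labels if counts[l] == top_count))
    return merged


def f_measure(gold, predicted, positive="TF"):
    """Balanced F-measure (F1) for the positive class."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Toy labels for four sentences from three hypothetical classifiers.
    classifier_outputs = [
        ["TF", "TF", "OTHER", "OTHER"],
        ["TF", "OTHER", "OTHER", "TF"],
        ["OTHER", "TF", "OTHER", "TF"],
    ]
    gold = ["TF", "TF", "OTHER", "OTHER"]
    merged = majority_vote(classifier_outputs)
    print(merged, f_measure(gold, merged))
```

The same voting function could merge classifiers trained with different learning methods or on different training datasets, since it only sees their output labels.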
Pages: 11