Scalable transcriptomics analysis with Dask: applications in data science and machine learning

Cited by: 1
Authors
Moreno, Marta [1, 2]
Vilaca, Ricardo [4, 5]
Ferreira, Pedro G. [1, 2, 3]
Affiliations
[1] Univ Porto, Fac Sci, Dept Comp Sci, Rua Campo Alegre, P-4169007 Porto, Portugal
[2] INESC TEC, Lab Artificial Intelligence & Decis Support, Rua Dr Roberto Frias, P-4200465 Porto, Portugal
[3] Univ Porto, Inst Mol Pathol & Immunol, Inst Res & Innovat Hlth i3s, R Alfredo Allen 208, P-4200135 Porto, Portugal
[4] INESC TEC, High Assurance Software Lab, Rua Dr Roberto Frias, P-4200465 Porto, Portugal
[5] Univ Minho, Minho Adv Comp Ctr, Dept Informat, P-4710070 Braga, Portugal
Keywords
Machine learning; Scalable data science; Gene expression; Transcriptomics; Data analysis; EXPRESSION; CLASSIFICATION; TUBERCULOSIS; PREDICTION; TRENDS
DOI
10.1186/s12859-022-05065-3
Chinese Library Classification
Q5 [Biochemistry]
Discipline classification codes
071010; 081704
Abstract
Background: Gene expression studies are an important tool in biological and biomedical research. The signal carried in expression profiles helps derive signatures for the prediction, diagnosis and prognosis of different diseases. Data science and specifically machine learning have many applications in gene expression analysis. However, as the dimensionality of genomics datasets grows, scalable solutions become necessary. Methods: In this paper we review the main steps and bottlenecks in machine learning pipelines, as well as the main concepts behind scalable data science including those of concurrent and parallel programming. We discuss the benefits of the Dask framework and how it can be integrated with the Python scientific environment to perform data analysis in computational biology and bioinformatics. Results: This review illustrates the role of Dask for boosting data science applications in different case studies. Detailed documentation and code on these procedures is made available at https://github.com/martaccmoreno/gexp-ml-dask. Conclusion: By showing when and how Dask can be used in transcriptomics analysis, this review will serve as an entry point to help genomic data scientists develop more scalable data analysis procedures.
Pages: 20