Scalable transcriptomics analysis with Dask: applications in data science and machine learning

Cited by: 1
Authors
Moreno, Marta [1, 2]
Vilaca, Ricardo [4, 5]
Ferreira, Pedro G. [1, 2, 3]
Affiliations
[1] Univ Porto, Fac Sci, Dept Comp Sci, Rua Campo Alegre, P-4169007 Porto, Portugal
[2] INESC TEC, Lab Artificial Intelligence & Decis Support, Rua Dr Roberto Frias, P-4200465 Porto, Portugal
[3] Univ Porto, Inst Mol Pathol & Immunol, Inst Res & Innovat Hlth i3s, R Alfredo Allen 208, P-4200135 Porto, Portugal
[4] INESC TEC, High Assurance Software Lab, Rua Dr Roberto Frias, P-4200465 Porto, Portugal
[5] Univ Minho, Minho Adv Comp Ctr, Dept Informat, P-4710070 Braga, Portugal
Keywords
Machine learning; Scalable data science; Gene expression; Transcriptomics; Data analysis; EXPRESSION; CLASSIFICATION; TUBERCULOSIS; PREDICTION; TRENDS
DOI
10.1186/s12859-022-05065-3
Chinese Library Classification (CLC) number
Q5 [Biochemistry]
Discipline classification code
071010; 081704
Abstract
Background: Gene expression studies are an important tool in biological and biomedical research. The signal carried in expression profiles helps derive signatures for the prediction, diagnosis and prognosis of different diseases. Data science and specifically machine learning have many applications in gene expression analysis. However, as the dimensionality of genomics datasets grows, scalable solutions become necessary. Methods: In this paper we review the main steps and bottlenecks in machine learning pipelines, as well as the main concepts behind scalable data science including those of concurrent and parallel programming. We discuss the benefits of the Dask framework and how it can be integrated with the Python scientific environment to perform data analysis in computational biology and bioinformatics. Results: This review illustrates the role of Dask for boosting data science applications in different case studies. Detailed documentation and code on these procedures is made available at https://github.com/martaccmoreno/gexp-ml-dask. Conclusion: By showing when and how Dask can be used in transcriptomics analysis, this review will serve as an entry point to help genomic data scientists develop more scalable data analysis procedures.
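The abstract refers to concurrent and parallel programming as the basis of scalable data science. As a rough illustration only, the sketch below uses the Python standard library (not Dask, and not the authors' code) to show the chunk-then-combine pattern that frameworks such as Dask automate: a samples-by-genes matrix is split into row chunks, each chunk is reduced independently, and the partial results are merged. The toy matrix and the names `chunk_sums`, `chunked_gene_means`, and `expression` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def chunk_sums(chunk):
    """Return per-gene sums and the sample count for one chunk of samples."""
    n_genes = len(chunk[0])
    sums = [0.0] * n_genes
    for sample in chunk:
        for g, value in enumerate(sample):
            sums[g] += value
    return sums, len(chunk)


def chunked_gene_means(samples, chunk_size=2, workers=2):
    """Per-gene mean expression, computed chunk by chunk and then combined."""
    # Split the rows (samples) into fixed-size chunks.
    chunks = [samples[i:i + chunk_size] for i in range(0, len(samples), chunk_size)]
    # Reduce each chunk independently; the chunks share no state.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(chunk_sums, chunks))
    # Combine the partial sums and counts into global means.
    n_genes = len(samples[0])
    totals = [sum(p[0][g] for p in partials) for g in range(n_genes)]
    count = sum(p[1] for p in partials)
    return [t / count for t in totals]


# Hypothetical toy matrix: 4 samples (rows) x 3 genes (columns).
expression = [
    [1.0, 10.0, 0.0],
    [3.0, 20.0, 0.0],
    [5.0, 30.0, 4.0],
    [7.0, 40.0, 4.0],
]
gene_means = chunked_gene_means(expression)
```

Threads are used here only to keep the sketch short; for CPU-bound work on real expression matrices, a process pool or a dedicated framework would be the more realistic choice.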
Pages: 20