GenSLMs: Genome-scale language models reveal SARS-CoV-2 evolutionary dynamics

Cited by: 6
Authors
Zvyagin, Maxim [1]
Brace, Alexander [1, 2]
Hippe, Kyle [1]
Deng, Yuntian [3, 4]
Zhang, Bin [5]
Bohorquez, Cindy Orozco [5]
Clyde, Austin [1, 2]
Kale, Bharat [6]
Perez-Rivera, Danilo [1, 7]
Ma, Heng [1]
Mann, Carla M. [1, 2]
Irvin, Michael [1]
Ozgulbas, Defne G. [8]
Vassilieva, Natalia [5]
Pauloski, James Gregory [2]
Ward, Logan [1]
Hayot-Sasson, Valerie [1, 2, 9]
Emani, Murali [1, 9]
Foreman, Sam [1, 9]
Xie, Zhen [1]
Lin, Diangen [1, 2]
Shukla, Maulik [1, 2]
Nie, Weili [3]
Romero, Josh [3]
Dallago, Christian [3, 10]
Vahdat, Arash [3]
Xiao, Chaowei [3, 8]
Gibbs, Thomas [3]
Foster, Ian [1, 2]
Davis, James J. [1, 2]
Papka, Michael E. [1, 9, 11]
Brettin, Thomas [1, 12]
Stevens, Rick [1, 2, 12]
Anandkumar, Anima [3, 13]
Vishwanath, Venkatram [1, 9, 14]
Ramanathan, Arvind [1]
Affiliations
[1] Argonne Natl Lab, Data Sci & Learning Div, Bldg 240, Lemont, IL 60439 USA
[2] Univ Chicago, Dept Comp Sci, Hyde Pk, IL USA
[3] NVIDIA Inc, Santa Clara, CA USA
[4] Harvard Univ, Cambridge, MA USA
[5] Cerebras Inc, San Jose, CA USA
[6] Northern Illinois Univ, Comp Sci Dept, De Kalb, IL USA
[7] NYU, New York, NY USA
[8] Univ Illinois, Dept Biochem, Champaign, IL USA
[9] Argonne Natl Lab, Argonne Leadership Comp Facil, Bldg 240, Lemont, IL 60439 USA
[10] Tech Univ Munich, Comp Sci Dept, Munich, Germany
[11] Univ Illinois, Comp Sci Dept, Chicago, IL USA
[12] Argonne Natl Lab, Comp Environm & Life Sci Directorate, Lemont, IL 60439 USA
[13] CALTECH, Comp Sci Dept, Pasadena, CA 91125 USA
[14] Argonne Natl Lab, Lemont, IL 60439 USA
Funding
National Institutes of Health (NIH); National Science Foundation (NSF)
Keywords
SARS-CoV-2; COVID-19; HPC; AI; large language models; whole-genome analyses; SEQUENCE;
DOI
10.1177/10943420231201154
Chinese Library Classification number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
We seek to transform how new and emergent variants of pandemic-causing viruses, specifically SARS-CoV-2, are identified and classified. By adapting large language models (LLMs) for genomic data, we build genome-scale language models (GenSLMs) which can learn the evolutionary landscape of SARS-CoV-2 genomes. By pre-training on over 110 million prokaryotic gene sequences and fine-tuning a SARS-CoV-2-specific model on 1.5 million genomes, we show that GenSLMs can accurately and rapidly identify variants of concern. Thus, to our knowledge, GenSLMs represents one of the first whole-genome scale foundation models which can generalize to other prediction tasks. We demonstrate scaling of GenSLMs on GPU-based supercomputers and AI-hardware accelerators utilizing 1.63 Zettaflops in training runs with a sustained performance of 121 PFLOPS in mixed precision and peak of 850 PFLOPS. We present initial scientific insights from examining GenSLMs in tracking evolutionary dynamics of SARS-CoV-2, paving the path to realizing this on large biological data.
Pages: 683-705
Number of pages: 23