GenSLMs: Genome-scale language models reveal SARS-CoV-2 evolutionary dynamics

Cited by: 6
Authors
Zvyagin, Maxim [1]
Brace, Alexander [1,2]
Hippe, Kyle [1]
Deng, Yuntian [3,4]
Zhang, Bin [5]
Bohorquez, Cindy Orozco [5]
Clyde, Austin [1,2]
Kale, Bharat [6]
Perez-Rivera, Danilo [1,7]
Ma, Heng [1]
Mann, Carla M. [1,2]
Irvin, Michael [1]
Ozgulbas, Defne G. [8]
Vassilieva, Natalia [5]
Pauloski, James Gregory [2]
Ward, Logan [1]
Hayot-Sasson, Valerie [1,2,9]
Emani, Murali [1,9]
Foreman, Sam [1,9]
Xie, Zhen [1]
Lin, Diangen [1,2]
Shukla, Maulik [1,2]
Nie, Weili [3]
Romero, Josh [3]
Dallago, Christian [3,10]
Vahdat, Arash [3]
Xiao, Chaowei [3,8]
Gibbs, Thomas [3]
Foster, Ian [1,2]
Davis, James J. [1,2]
Papka, Michael E. [1,9,11]
Brettin, Thomas [1,12]
Stevens, Rick [1,2,12]
Anandkumar, Anima [3,13]
Vishwanath, Venkatram [1,9,14]
Ramanathan, Arvind [1]
Affiliations
[1] Argonne Natl Lab, Data Sci & Learning Div, Bldg 240, Lemont, IL 60439 USA
[2] Univ Chicago, Dept Comp Sci, Hyde Pk, IL USA
[3] NVIDIA Inc, Santa Clara, CA USA
[4] Harvard Univ, Cambridge, MA USA
[5] Cerebras Inc, San Jose, CA USA
[6] Northern Illinois Univ, Comp Sci Dept, De Kalb, IL USA
[7] NYU, New York, NY USA
[8] Univ Illinois, Dept Biochem, Champaign, IL USA
[9] Argonne Natl Lab, Argonne Leadership Comp Facil, Bldg 240, Lemont, IL 60439 USA
[10] Tech Univ Munich, Comp Sci Dept, Munich, Germany
[11] Univ Illinois, Comp Sci Dept, Chicago, IL USA
[12] Argonne Natl Lab, Comp Environm & Life Sci Directorate, Lemont, IL 60439 USA
[13] CALTECH, Comp Sci Dept, Pasadena, CA 91125 USA
[14] Argonne Natl Lab, Lemont, IL 60439 USA
Funding
National Institutes of Health (USA); National Science Foundation (USA)
Keywords
SARS-CoV-2; COVID-19; HPC; AI; large language models; whole-genome analyses; SEQUENCE;
DOI
10.1177/10943420231201154
Chinese Library Classification (CLC) number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
We seek to transform how new and emergent variants of pandemic-causing viruses, specifically SARS-CoV-2, are identified and classified. By adapting large language models (LLMs) for genomic data, we build genome-scale language models (GenSLMs) that can learn the evolutionary landscape of SARS-CoV-2 genomes. By pre-training on over 110 million prokaryotic gene sequences and fine-tuning a SARS-CoV-2-specific model on 1.5 million genomes, we show that GenSLMs can accurately and rapidly identify variants of concern. Thus, to our knowledge, GenSLMs represent one of the first whole-genome-scale foundation models that can generalize to other prediction tasks. We demonstrate scaling of GenSLMs on GPU-based supercomputers and AI-hardware accelerators, utilizing 1.63 Zettaflops in training runs with a sustained performance of 121 PFLOPS in mixed precision and a peak of 850 PFLOPS. We present initial scientific insights from examining GenSLMs in tracking evolutionary dynamics of SARS-CoV-2, paving the path to realizing this on large biological data.
Pages: 683-705
Page count: 23