GenSLMs: Genome-scale language models reveal SARS-CoV-2 evolutionary dynamics

Cited by: 6
Authors
Zvyagin, Maxim [1]
Brace, Alexander [1, 2]
Hippe, Kyle [1]
Deng, Yuntian [3, 4]
Zhang, Bin [5]
Bohorquez, Cindy Orozco [5]
Clyde, Austin [1, 2]
Kale, Bharat [6]
Perez-Rivera, Danilo [1, 7]
Ma, Heng [1]
Mann, Carla M. [1, 2]
Irvin, Michael [1]
Ozgulbas, Defne G. [8]
Vassilieva, Natalia [5]
Pauloski, James Gregory [2]
Ward, Logan [1]
Hayot-Sasson, Valerie [1, 2, 9]
Emani, Murali [1, 9]
Foreman, Sam [1, 9]
Xie, Zhen [1]
Lin, Diangen [1, 2]
Shukla, Maulik [1, 2]
Nie, Weili [3]
Romero, Josh [3]
Dallago, Christian [3, 10]
Vahdat, Arash [3]
Xiao, Chaowei [3, 8]
Gibbs, Thomas [3]
Foster, Ian [1, 2]
Davis, James J. [1, 2]
Papka, Michael E. [1, 9, 11]
Brettin, Thomas [1, 12]
Stevens, Rick [1, 2, 12]
Anandkumar, Anima [3, 13]
Vishwanath, Venkatram [1, 9, 14]
Ramanathan, Arvind [1]
Affiliations
[1] Argonne Natl Lab, Data Sci & Learning Div, Bldg 240, Lemont, IL 60439 USA
[2] Univ Chicago, Dept Comp Sci, Hyde Pk, IL USA
[3] NVIDIA Inc, Santa Clara, CA USA
[4] Harvard Univ, Cambridge, MA USA
[5] Cerebras Inc, San Jose, CA USA
[6] Northern Illinois Univ, Comp Sci Dept, De Kalb, IL USA
[7] NYU, New York, NY USA
[8] Univ Illinois, Dept Biochem, Champaign, IL USA
[9] Argonne Natl Lab, Argonne Leadership Comp Facil, Bldg 240, Lemont, IL 60439 USA
[10] Tech Univ Munich, Comp Sci Dept, Munich, Germany
[11] Univ Illinois, Comp Sci Dept, Chicago, IL USA
[12] Argonne Natl Lab, Comp Environm & Life Sci Directorate, Lemont, IL 60439 USA
[13] CALTECH, Comp Sci Dept, Pasadena, CA 91125 USA
[14] Argonne Natl Lab, Lemont, IL 60439 USA
Funding
U.S. National Institutes of Health; U.S. National Science Foundation
Keywords
SARS-CoV-2; COVID-19; HPC; AI; large language models; whole-genome analyses; SEQUENCE;
DOI
10.1177/10943420231201154
Chinese Library Classification
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
We seek to transform how new and emergent variants of pandemic-causing viruses, specifically SARS-CoV-2, are identified and classified. By adapting large language models (LLMs) for genomic data, we build genome-scale language models (GenSLMs) which can learn the evolutionary landscape of SARS-CoV-2 genomes. By pre-training on over 110 million prokaryotic gene sequences and fine-tuning a SARS-CoV-2-specific model on 1.5 million genomes, we show that GenSLMs can accurately and rapidly identify variants of concern. Thus, to our knowledge, GenSLMs represents one of the first whole-genome scale foundation models which can generalize to other prediction tasks. We demonstrate scaling of GenSLMs on GPU-based supercomputers and AI-hardware accelerators utilizing 1.63 Zettaflops in training runs with a sustained performance of 121 PFLOPS in mixed precision and peak of 850 PFLOPS. We present initial scientific insights from examining GenSLMs in tracking evolutionary dynamics of SARS-CoV-2, paving the path to realizing this on large biological data.
Pages: 683-705
Page count: 23