Towards a Scalable and Distributed Infrastructure for Deep Learning Applications

Cited: 4
Authors
Hasheminezhad, Bita [1 ]
Shirzad, Shahrzad [1 ]
Wu, Nanmiao [1 ]
Diehl, Patrick [1 ]
Schulz, Hannes [2 ]
Kaiser, Hartmut [1 ]
Affiliations
[1] Louisiana State Univ, Ctr Computat & Technol, Baton Rouge, LA 70803 USA
[2] Microsoft Res Montreal, Montreal, PQ, Canada
Keywords
Distributed Deep Learning; High Performance Computing; HPX; Asynchronous Many-task System;
DOI
10.1109/DLS51937.2020.00008
CLC Classification Number
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Although recent scale-up approaches to training deep neural networks have proven effective, the computational intensity of large, complex models and the availability of large-scale datasets require deep learning frameworks to adopt scale-out techniques. Most available distributed deep learning frameworks did not consider parallelization approaches and distribution requirements in their primary designs, and most still cannot perform effective and efficient fine-grained inter-node communication. We present Phylanx, which has the potential to alleviate these shortcomings. Phylanx provides a productivity-oriented frontend in which user Python code is translated into a futurized execution tree that can be executed efficiently on multiple nodes using HPX, the C++ standard library for parallelism and concurrency, leveraging fine-grained threading and an active-messaging task-based runtime system.
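The "futurized execution tree" the abstract describes can be illustrated with a minimal conceptual sketch: each node of an expression tree becomes a future whose result depends on its children's futures, so independent subtrees evaluate concurrently. This is not the Phylanx/HPX API; the names here (`futurize`, `OPS`) are hypothetical, and Python's `concurrent.futures` stands in for HPX's futures merely to show the execution model.

```python
# Conceptual sketch of a futurized execution tree (NOT the Phylanx API).
# Each tree node is turned into a future; independent subtrees may run
# concurrently, and a parent only completes once its children resolve.
from concurrent.futures import ThreadPoolExecutor
import operator

OPS = {"+": operator.add, "*": operator.mul}

def futurize(node, pool):
    """Recursively turn an expression tree into a tree of futures."""
    if isinstance(node, (int, float)):      # leaf: a constant value
        return pool.submit(lambda: node)
    op, left, right = node                  # internal node: (op, lhs, rhs)
    lf = futurize(left, pool)               # children are submitted first,
    rf = futurize(right, pool)              # so sibling subtrees can overlap
    return pool.submit(lambda: OPS[op](lf.result(), rf.result()))

with ThreadPoolExecutor() as pool:
    # (2 + 3) * (4 + 5): the two additions are independent futures
    tree = ("*", ("+", 2, 3), ("+", 4, 5))
    print(futurize(tree, pool).result())    # 45
```

In Phylanx, by analogy, the tree is derived from the user's Python source and the futures are scheduled by HPX's fine-grained, task-based runtime across nodes rather than by a local thread pool.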
Pages: 20-30
Page count: 11
Related Papers
50 records total
  • [11] Scalable Malware Detection System Using Distributed Deep Learning
    Kumar, Manish
    CYBERNETICS AND SYSTEMS, 2023, 54 (05) : 619 - 647
  • [12] Towards Scalable Koopman Operator Learning: Convergence Rates and A Distributed Learning Algorithm
    Liu, Zhiyuan
    Ding, Guohui
    Chen, Lijun
    Yeung, Enoch
    2020 AMERICAN CONTROL CONFERENCE (ACC), 2020, : 3983 - 3990
  • [13] DeepFreeze: Towards Scalable Asynchronous Checkpointing of Deep Learning Models
    Nicolae, Bogdan
    Li, Jiali
    Wozniak, Justin M.
    Bosilca, George
    Dorier, Matthieu
    Cappello, Franck
    2020 20TH IEEE/ACM INTERNATIONAL SYMPOSIUM ON CLUSTER, CLOUD AND INTERNET COMPUTING (CCGRID 2020), 2020, : 172 - 181
  • [14] A scalable yet transparent infrastructure for distributed applications: Core design of Jasmine ii framework
    Leung, K
    Shim, J
    Tcherevik, D
    Vinberg, A
    PROCEEDINGS OF THE EIGHTH INTERNATIONAL CONFERENCE ON PARALLEL AND DISTRIBUTED SYSTEMS, 2001, : 31 - 36
  • [15] Scalable and Energy-Efficient Deep Learning for Distributed AIoT Applications Using Modular Cognitive IoT Hardware
    Abbasi, Maryam
    Cardoso, Filipe
    Silva, Jose
    Martins, Pedro
    NEW TRENDS IN DISRUPTIVE TECHNOLOGIES, TECH ETHICS AND ARTIFICIAL INTELLIGENCE, DITTET 2023, 2023, 1452 : 85 - 96
  • [16] Building a Distributed Infrastructure for Scalable Triple Stores
    Zhou, Jing
    Hall, Wendy
    De Roure, David
    JOURNAL OF COMPUTER SCIENCE AND TECHNOLOGY, 2009, 24 (03) : 447 - 462
  • [19] Deployment Service for Scalable Distributed Deep Learning Training on Multiple Clouds
    Jorge, Javier
    Molto, German
    Segrelles, Damian
    Fontes, Joao Pedro
    Guevara, Miguel Angel
    CLOSER: PROCEEDINGS OF THE 11TH INTERNATIONAL CONFERENCE ON CLOUD COMPUTING AND SERVICES SCIENCE, 2021, : 135 - 142
  • [20] BK.Synapse: A scalable distributed training framework for deep learning
    Dinh Viet Sang
    Phan Ngoc Lan
    SOICT 2019: PROCEEDINGS OF THE TENTH INTERNATIONAL SYMPOSIUM ON INFORMATION AND COMMUNICATION TECHNOLOGY, 2019, : 43 - 48