Towards a Scalable and Distributed Infrastructure for Deep Learning Applications

Cited: 4
Authors
Hasheminezhad, Bita [1 ]
Shirzad, Shahrzad [1 ]
Wu, Nanmiao [1 ]
Diehl, Patrick [1 ]
Schulz, Hannes [2 ]
Kaiser, Hartmut [1 ]
Affiliations
[1] Louisiana State Univ, Ctr Computat & Technol, Baton Rouge, LA 70803 USA
[2] Microsoft Res Montreal, Montreal, PQ, Canada
Keywords
Distributed Deep Learning; High Performance Computing; HPX; Asynchronous Many-task System;
DOI
10.1109/DLS51937.2020.00008
CLC Classification Number
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Although recent scale-up approaches to training deep neural networks have proven effective, the computational intensity of large, complex models and the availability of large-scale datasets require deep learning frameworks to adopt scale-out techniques. Most available distributed deep learning frameworks did not consider parallelization approaches and distribution requirements in their primary designs, and most still cannot perform effective and efficient fine-grained inter-node communication. We present Phylanx, which has the potential to alleviate these shortcomings. Phylanx provides a productivity-oriented frontend in which user Python code is translated into a futurized execution tree that can be executed efficiently on multiple nodes using HPX, the C++ standard library for parallelism and concurrency, leveraging fine-grained threading and an active-messaging task-based runtime system.
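The "futurized execution tree" the abstract describes can be illustrated with a minimal conceptual sketch: each node of an expression tree becomes a future whose result depends on its children's futures, so independent subtrees evaluate concurrently. This is not the Phylanx/HPX API; the names here (`futurize`, `OPS`) are hypothetical, and Python's `concurrent.futures` stands in for HPX's futures merely to show the execution model.

```python
# Conceptual sketch of a futurized execution tree (NOT the Phylanx API).
# Each tree node is turned into a future; independent subtrees may run
# concurrently, and a parent only completes once its children resolve.
from concurrent.futures import ThreadPoolExecutor
import operator

OPS = {"+": operator.add, "*": operator.mul}

def futurize(node, pool):
    """Recursively turn an expression tree into a tree of futures."""
    if isinstance(node, (int, float)):      # leaf: a constant value
        return pool.submit(lambda: node)
    op, left, right = node                  # internal node: (op, lhs, rhs)
    lf = futurize(left, pool)               # children are submitted first,
    rf = futurize(right, pool)              # so sibling subtrees can overlap
    return pool.submit(lambda: OPS[op](lf.result(), rf.result()))

with ThreadPoolExecutor() as pool:
    # (2 + 3) * (4 + 5): the two additions are independent futures
    tree = ("*", ("+", 2, 3), ("+", 4, 5))
    print(futurize(tree, pool).result())    # 45
```

In Phylanx, by analogy, the tree is derived from the user's Python source and the futures are scheduled by HPX's fine-grained, task-based runtime across nodes rather than by a local thread pool.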
Pages: 20-30
Page count: 11
Related Papers
50 records total
  • [11] Scalable Malware Detection System Using Distributed Deep Learning
    Kumar, Manish
    CYBERNETICS AND SYSTEMS, 2023, 54 (05) : 619 - 647
  • [12] Towards Scalable Koopman Operator Learning: Convergence Rates and A Distributed Learning Algorithm
    Liu, Zhiyuan
    Ding, Guohui
    Chen, Lijun
    Yeung, Enoch
    2020 AMERICAN CONTROL CONFERENCE (ACC), 2020, : 3983 - 3990
  • [13] DeepFreeze: Towards Scalable Asynchronous Checkpointing of Deep Learning Models
    Nicolae, Bogdan
    Li, Jiali
    Wozniak, Justin M.
    Bosilca, George
    Dorier, Matthieu
    Cappello, Franck
    2020 20TH IEEE/ACM INTERNATIONAL SYMPOSIUM ON CLUSTER, CLOUD AND INTERNET COMPUTING (CCGRID 2020), 2020, : 172 - 181
  • [14] A scalable yet transparent infrastructure for distributed applications: Core design of Jasmine ii framework
    Leung, K
    Shim, J
    Tcherevik, D
    Vinberg, A
    PROCEEDINGS OF THE EIGHTH INTERNATIONAL CONFERENCE ON PARALLEL AND DISTRIBUTED SYSTEMS, 2001, : 31 - 36
  • [15] Scalable and Energy-Efficient Deep Learning for Distributed AIoT Applications Using Modular Cognitive IoT Hardware
    Abbasi, Maryam
    Cardoso, Filipe
    Silva, Jose
    Martins, Pedro
    NEW TRENDS IN DISRUPTIVE TECHNOLOGIES, TECH ETHICS AND ARTIFICIAL INTELLIGENCE, DITTET 2023, 2023, 1452 : 85 - 96
  • [16] Building a Distributed Infrastructure for Scalable Triple Stores
    Zhou, Jing
    Hall, Wendy
    De Roure, David
    JOURNAL OF COMPUTER SCIENCE AND TECHNOLOGY, 2009, 24 (03) : 447 - 462
  • [19] Deployment Service for Scalable Distributed Deep Learning Training on Multiple Clouds
    Jorge, Javier
    Molto, German
    Segrelles, Damian
    Fontes, Joao Pedro
    Guevara, Miguel Angel
    CLOSER: PROCEEDINGS OF THE 11TH INTERNATIONAL CONFERENCE ON CLOUD COMPUTING AND SERVICES SCIENCE, 2021, : 135 - 142
  • [20] BK.Synapse: A scalable distributed training framework for deep learning
    Dinh Viet Sang
    Phan Ngoc Lan
    SOICT 2019: PROCEEDINGS OF THE TENTH INTERNATIONAL SYMPOSIUM ON INFORMATION AND COMMUNICATION TECHNOLOGY, 2019, : 43 - 48