Optimal distributed parallel algorithms for deep learning framework Tensorflow

Cited by: 9
Authors
Xie, Yuanlun [1 ]
He, Majun [1 ]
Ma, Tingsong [1 ]
Tian, Wenhong [1 ]
Affiliations
[1] Univ Elect Sci & Technol China, Sch Informat & Software Engn, Chengdu, Peoples R China
Keywords
Deep learning; Tensorflow; Data parallelism; Model parallelism; Optimal distributed parallel algorithms;
DOI
10.1007/s10489-021-02588-9
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
Since its release, the Tensorflow framework has been widely used in many fields owing to its strengths in deep learning. However, it is still at an early stage: its native distributed implementation scales poorly to large models because of low utilization of multiple GPUs and slow distributed execution compared with running on a single machine. Reducing training time through parallel models is therefore of great significance. In view of this, we first provide an in-depth analysis of the implementation principles of Tensorflow and identify the bottlenecks of its native distributed parallel models. Then, two optimal algorithms are designed and implemented, based on the data parallelism and model parallelism modes of Tensorflow. For data parallelism, the proposed algorithm replaces the native linear execution mode with a pipelined execution mode. For model parallelism, the native random partitioning mode is replaced by our proposed greedy algorithm. Finally, we built a homogeneous distributed cluster and a heterogeneous distributed cluster to verify the effectiveness of the proposed algorithms. Through a number of comparative experiments, we show that the proposed optimal parallel algorithms reduce model training time by an average of 26.5% (an average speedup of 1.5x over the native distributed algorithms) and improve cluster utilization while maintaining the same accuracy level as native Tensorflow.
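The abstract states that the data-parallel optimization replaces Tensorflow's native linear execution mode with a pipelined one. The exact mechanism is not described in this record; the following is a minimal, framework-agnostic Python sketch of the idea, in which preparation of batch k+1 overlaps with the training step on batch k instead of the two stages running back to back. The stage costs and the names prepare_batch and train_step are illustrative assumptions, not the paper's code.

import queue
import threading
import time

NUM_BATCHES = 8

def prepare_batch(i):
    # Stand-in for input loading + preprocessing (assumed fixed cost).
    time.sleep(0.05)
    return f"batch-{i}"

def train_step(batch):
    # Stand-in for the forward/backward pass on a device (assumed fixed cost).
    time.sleep(0.05)

def linear_mode():
    # Native linear execution: prepare, then train, strictly in sequence.
    for i in range(NUM_BATCHES):
        train_step(prepare_batch(i))

def pipeline_mode():
    # Pipelined execution: a producer thread prepares the next batch
    # while the main thread trains on the current one.
    q = queue.Queue(maxsize=2)  # small buffer that decouples the two stages

    def producer():
        for i in range(NUM_BATCHES):
            q.put(prepare_batch(i))
        q.put(None)  # sentinel: no more batches

    threading.Thread(target=producer, daemon=True).start()
    while (batch := q.get()) is not None:
        train_step(batch)

for mode in (linear_mode, pipeline_mode):
    start = time.perf_counter()
    mode()
    print(f"{mode.__name__}: {time.perf_counter() - start:.2f}s")

With equal stage costs, the pipelined version takes roughly half the wall-clock time of the linear one, which is the effect the paper exploits to raise device utilization.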
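For model parallelism, the abstract says the native random partitioning of the model across devices is replaced by a proposed greedy algorithm. The record does not spell out the greedy criterion; the sketch below shows one standard greedy load-balancing heuristic (heaviest operation first, always onto the least-loaded device) that a partitioner of this kind could use. The per-op cost table and device names are hypothetical.

import heapq

def greedy_partition(op_costs, devices):
    # Min-heap of (accumulated load, device) so the least-loaded
    # device is always at the top.
    loads = [(0.0, d) for d in devices]
    heapq.heapify(loads)
    placement = {}
    # Assign the most expensive remaining op to the least-loaded device.
    for op, cost in sorted(op_costs.items(), key=lambda kv: -kv[1]):
        load, device = heapq.heappop(loads)
        placement[op] = device
        heapq.heappush(loads, (load + cost, device))
    return placement

# Hypothetical per-op cost estimates (e.g., profiled runtimes in ms).
op_costs = {"conv1": 8.0, "conv2": 6.0, "fc1": 4.0, "fc2": 3.0, "softmax": 1.0}
print(greedy_partition(op_costs, ["/gpu:0", "/gpu:1"]))

A real graph partitioner would also weigh cross-device communication along graph edges, which this load-only sketch deliberately ignores.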
Pages: 3880-3900
Page count: 21
Related Papers (50 in total)
  • [1] Optimal distributed parallel algorithms for deep learning framework Tensorflow
    Yuanlun Xie
    Majun He
    Tingsong Ma
    Wenhong Tian
    [J]. Applied Intelligence, 2022, 52 : 3880 - 3900
  • [2] Detailed Performance Analysis of Distributed Tensorflow on a GPU Cluster using Deep Learning Algorithms
    Malik, Abid
    Lu, Micheal
    Wang, Nathenial
    Lin, Yeiwei
    Yoo, Shinjae
    [J]. 2018 NEW YORK SCIENTIFIC DATA SUMMIT (NYSDS), 2018,
  • [3] Distributed Deep Reinforcement Learning using TensorFlow
    Rao, P. Ajay
    Kumar, Navaneesh B.
    Cadabam, Siddharth
    Praveena, T.
    [J]. 2017 INTERNATIONAL CONFERENCE ON CURRENT TRENDS IN COMPUTER, ELECTRICAL, ELECTRONICS AND COMMUNICATION (CTCEEC), 2017, : 171 - 174
  • [4] Accelerating geostatistical seismic inversion using TensorFlow: A heterogeneous distributed deep learning framework
    Liu, Mingliang
    Grana, Dario
    [J]. COMPUTERS & GEOSCIENCES, 2019, 124 : 37 - 45
  • [5] OSDP: Optimal Sharded Data Parallel for Distributed Deep Learning
    Jiang, Youhe
    Fu, Fangcheng
    Miao, Xupeng
    Nie, Xiaonan
    Cui, Bin
    [J]. PROCEEDINGS OF THE THIRTY-SECOND INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, IJCAI 2023, 2023, : 2142 - 2150
  • [6] Boosting algorithms for parallel and distributed learning
    Lazarevic, A
    Obradovic, Z
    [J]. DISTRIBUTED AND PARALLEL DATABASES, 2002, 11 (02) : 203 - 229
  • [7] Boosting Algorithms for Parallel and Distributed Learning
    Aleksandar Lazarevic
    Zoran Obradovic
    [J]. Distributed and Parallel Databases, 2002, 11 : 203 - 229
  • [8] Deep Learning With TensorFlow: A Review
    Pang, Bo
    Nijkamp, Erik
    Wu, Ying Nian
    [J]. JOURNAL OF EDUCATIONAL AND BEHAVIORAL STATISTICS, 2020, 45 (02) : 227 - 248
  • [9] A Framework for Parallel Genetic Algorithms for Distributed Memory Architectures
    Georgiev, Dobromir
    Atanassov, Emanouil
    Alexandrov, Vassil
    [J]. 2014 5TH WORKSHOP ON LATEST ADVANCES IN SCALABLE ALGORITHMS FOR LARGE-SCALE SYSTEMS (SCALA), 2014, : 47 - 53
  • [10] PMA-DRL: A parallel model-augmented framework for deep reinforcement learning algorithms
    Luo, Xufang
    Wang, Yunhong
    [J]. NEUROCOMPUTING, 2020, 403 : 109 - 120