Scalable machine learning computing a data summarization matrix with a parallel array DBMS

Cited by: 0

Authors:

Carlos Ordonez

Yiqun Zhang

S. Lennart Johnsson

Affiliation:

[1] University of Houston, Department of Computer Science

Source:

Distributed and Parallel Databases | 2019 / Vol. 37

Keywords:

Matrix; Summarization; Parallel DBMS; Linear algebra

DOI:

Not available

Abstract:

Big data analytics requires scalable (beyond RAM limits) and highly parallel (exploiting many CPU cores) processing of machine learning models, which in general involve heavy matrix manipulation. Array DBMSs represent a promising system to manipulate large matrices. With that motivation in mind, we present a high-performance system exploiting a parallel array DBMS to evaluate a general, but compact, matrix summarization that benefits many machine learning models. We focus on two representative models: linear regression (supervised) and PCA (unsupervised). Our approach combines data summarization inside the parallel DBMS with further model computation in a mathematical language (e.g., R). We introduce a two-phase algorithm which first computes a general data summary in parallel and then evaluates matrix equations with reduced intermediate matrices in main memory on one node. We present theoretical results characterizing speedup and time/space complexity. From a parallel data system perspective, we consider scale-up and scale-out in a shared-nothing architecture. In contrast to most big data analytic systems, our system is based on array operators programmed in C++, working directly on the Unix file system instead of Java or Scala running on HDFS mounted on top of Unix, resulting in much faster processing. Experiments compare our system with Spark (parallel) and R (single machine), showing orders of magnitude time improvement. We present parallel benchmarks varying the number of threads and processing nodes. Our two-phase approach should motivate analysts to exploit a parallel array DBMS for matrix summarization.
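
The two-phase idea in the abstract can be illustrated with a small sketch. The abstract does not spell out the form of the summary, so the code below assumes it is the cross-product matrix of augmented points z = (1, x_1, ..., x_d, y), accumulated independently per partition and then added together. All function names (`summarize`, `merge`, `linear_regression`) are hypothetical, the "partitions" are plain Python lists standing in for DBMS nodes, and Python replaces the C++ operators and R used by the authors.

```python
from typing import List, Sequence

Matrix = List[List[float]]

def summarize(rows: Sequence[Sequence[float]]) -> Matrix:
    """Phase 1, per partition: accumulate the sum of z z^T, z = (1, x_1..x_d, y).
    Assumed form of the summary; requires a non-empty partition."""
    k = len(rows[0]) + 1
    gamma = [[0.0] * k for _ in range(k)]
    for row in rows:
        z = [1.0, *row]
        for i in range(k):
            for j in range(k):
                gamma[i][j] += z[i] * z[j]
    return gamma

def merge(a: Matrix, b: Matrix) -> Matrix:
    """Combine two partial summaries; they are additive across partitions."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def solve(a: Matrix, b: List[float]) -> List[float]:
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x

def linear_regression(gamma: Matrix) -> List[float]:
    """Phase 2, one node: solve the normal equations using only the summary.
    The leading block of gamma is X^T X (with intercept); the last column is X^T y."""
    d1 = len(gamma) - 1
    a = [row[:d1] for row in gamma[:d1]]
    b = [row[d1] for row in gamma[:d1]]
    return solve(a, b)

# Toy data generated from y = 1 + 2*x1 + 3*x2, split into two partitions.
points = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1), (1, 3)]
data = [(x1, x2, 1 + 2 * x1 + 3 * x2) for x1, x2 in points]
partitions = [data[:3], data[3:]]

gamma = summarize(partitions[0])
for part in partitions[1:]:
    gamma = merge(gamma, summarize(part))

beta = linear_regression(gamma)  # intercept, then one coefficient per feature
```

Under these assumptions, the second phase touches only a (d+2) x (d+2) matrix, so its cost does not depend on the number of data points. This matches the abstract's description of evaluating matrix equations on reduced intermediate matrices in main memory.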

Pages: 329-350

Page count: 21

50 records in total

[1] Scalable machine learning computing a data summarization matrix with a parallel array DBMS
Ordonez, Carlos
Zhang, Yiqun
Johnsson, S. Lennart
DISTRIBUTED AND PARALLEL DATABASES, 2019, 37 (03) : 329 - 350
[2] Scalable Machine Learning on Popular Analytic Languages with Parallel Data Summarization
Al-Amin, Sikder Tahsin
Ordonez, Carlos
BIG DATA ANALYTICS AND KNOWLEDGE DISCOVERY (DAWAK 2020), 2020, 12393 : 269 - 284
[3] A Cloud System for Machine Learning Exploiting a Parallel Array DBMS
Zhang, Yiqun
Ordonez, Carlos
Johnsson, Lennart
2017 28TH INTERNATIONAL WORKSHOP ON DATABASE AND EXPERT SYSTEMS APPLICATIONS (DEXA), 2017, : 22 - 26
[4] Scalable Machine Learning in the R Language Using a Summarization Matrix
Chebolu, Siva Uday Sampreeth
Ordonez, Carlos
Al-Amin, Sikder Tahsin
DATABASE AND EXPERT SYSTEMS APPLICATIONS, PT II, 2019, 11707 : 247 - 262
[5] Efficient machine learning on data science languages with parallel data summarization
Al-Amin, Sikder Tahsin
Ordonez, Carlos
DATA & KNOWLEDGE ENGINEERING, 2021, 136
[6] Parallel and Distributed Machine Learning Algorithms for Scalable Big Data Analytics
Bal, Henri
Pal, Arindam
FUTURE GENERATION COMPUTER SYSTEMS-THE INTERNATIONAL JOURNAL OF ESCIENCE, 2020, 108 : 1159 - 1161
[7] Towards Machine Learning in Distributed Array DBMS: Networking Considerations
Zalipynis, Ramon Antonio Rodriges
MACHINE LEARNING FOR NETWORKING, MLN 2020, 2021, 12629 : 284 - 304
[8] Scalable and Parallel Machine Learning Algorithms for Statistical Data Mining - Practice & Experience
Riedel, M.
Goetz, M.
Richerzhagen, M.
Glock, P.
Bodenstein, C.
Memon, A. S.
Memon, M. S.
2015 8TH INTERNATIONAL CONVENTION ON INFORMATION AND COMMUNICATION TECHNOLOGY, ELECTRONICS AND MICROELECTRONICS (MIPRO), 2015, : 204 - 209
[9] Scalable Random Forest with Data-Parallel Computing
Vazquez-Novoa, Fernando
Conejero, Javier
Tatu, Cristian
Badia, Rosa M.
EURO-PAR 2023: PARALLEL PROCESSING, 2023, 14100 : 397 - 410
[10] Scalable massively parallel computing using continuous-time data representation in nanoscale crossbar array
Wang, Cong
Liang, Shi-Jun
Wang, Chen-Yu
Yang, Zai-Zheng
Ge, Yingmeng
Pan, Chen
Shen, Xi
Wei, Wei
Zhao, Yichen
Zhang, Zaichen
Cheng, Bin
Zhang, Chuan
Miao, Feng
NATURE NANOTECHNOLOGY, 2021, 16 (10) : 1079 - +