Scalable machine learning computing a data summarization matrix with a parallel array DBMS

Authors
Carlos Ordonez
Yiqun Zhang
S. Lennart Johnsson
Affiliations
[1] University of Houston, Department of Computer Science
Keywords
Matrix; Summarization; Parallel DBMS; Linear algebra
DOI
Not available
Abstract
Big data analytics requires scalable (beyond RAM limits) and highly parallel (exploiting many CPU cores) processing of machine learning models, which in general involve heavy matrix manipulation. Array DBMSs are a promising class of systems for manipulating large matrices. With that motivation in mind, we present a high-performance system exploiting a parallel array DBMS to evaluate a general, but compact, matrix summarization that benefits many machine learning models. We focus on two representative models: linear regression (supervised) and PCA (unsupervised). Our approach combines data summarization inside the parallel DBMS with further model computation in a mathematical language (e.g., R). We introduce a two-phase algorithm which first computes a general data summary in parallel and then evaluates matrix equations with reduced intermediate matrices in main memory on one node. We present theoretical results characterizing speedup and time/space complexity. From a parallel data system perspective, we consider scale-up and scale-out in a shared-nothing architecture. In contrast to most big data analytic systems, our system is based on array operators programmed in C++, working directly on the Unix file system instead of Java or Scala running on HDFS mounted on top of Unix, resulting in much faster processing. Experiments compare our system with Spark (parallel) and R (single machine), showing orders of magnitude time improvement. We present parallel benchmarks varying the number of threads and processing nodes. Our two-phase approach should motivate analysts to exploit a parallel array DBMS for matrix summarization.
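The two-phase idea in the abstract can be illustrated with a minimal single-machine sketch. The abstract does not state the exact form of the summary, so the code below assumes one common choice: each row is augmented as z = [1, x, y] and the summary is the small matrix Z^T Z, accumulated block by block (each block standing in for one partition that a parallel worker could process). NumPy replaces the paper's C++ array operators and R, and every function name here (`partial_summary`, `summarize`, `linear_regression`, `pca`) is hypothetical, chosen only for illustration.

```python
import numpy as np

def partial_summary(Z_block):
    """Phase 1, per partition: summarize a block of rows as Z_block^T Z_block."""
    return Z_block.T @ Z_block

def summarize(X, y, n_partitions=4):
    """Phase 1: build the (d+2) x (d+2) summary from row partitions.

    Assumption: rows are augmented as z = [1, x_1..x_d, y]. The partial
    summaries are independent, so they could be computed in parallel;
    here they are computed one after another and added up.
    """
    Z = np.column_stack([np.ones(len(X)), X, y])
    blocks = np.array_split(Z, n_partitions)
    return sum(partial_summary(b) for b in blocks)

def linear_regression(G):
    """Phase 2: solve the normal equations using only the summary G."""
    d1 = G.shape[0] - 1          # number of columns in [1, X]
    A = G[:d1, :d1]              # [1, X]^T [1, X]
    b = G[:d1, d1]               # [1, X]^T y
    return np.linalg.solve(A, b) # intercept first, then slopes

def pca(G, k):
    """Phase 2: top-k principal components from the summary G."""
    d1 = G.shape[0] - 1
    n = G[0, 0]                  # row count
    L = G[0, 1:d1]               # column sums of X
    Q = G[1:d1, 1:d1]            # X^T X
    cov = Q / n - np.outer(L, L) / n**2   # population covariance
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1][:k] # largest eigenvalues first
    return eigvals[order], eigvecs[:, order]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 3))
    y = 1.5 + X @ np.array([2.0, -1.0, 0.5]) + 0.01 * rng.normal(size=1000)
    G = summarize(X, y, n_partitions=8)
    print("summary shape:", G.shape)
    print("regression coefficients:", linear_regression(G))
    print("top-2 eigenvalues:", pca(G, 2)[0])
```

The point of the design is that the second phase touches only a (d+2) x (d+2) matrix, so it fits in main memory on one node regardless of the number of rows. Solving the normal equations directly is used here for brevity; it can be numerically fragile when columns are nearly collinear.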
Pages: 329-350
Page count: 21
Related papers
  • [1] Scalable machine learning computing a data summarization matrix with a parallel array DBMS
    Ordonez, Carlos
    Zhang, Yiqun
    Johnsson, S. Lennart
    DISTRIBUTED AND PARALLEL DATABASES, 2019, 37 (03) : 329 - 350