Automated Debugging in Data-Intensive Scalable Computing

Cited by: 16
Authors
Gulzar, Muhammad Ali [1]
Interlandi, Matteo [2]
Han, Xueyuan [3]
Li, Mingda [1]
Condie, Tyson [1]
Kim, Miryung [1]
Affiliations
[1] Univ Calif Los Angeles, Los Angeles, CA 90095 USA
[2] Microsoft, Redmond, WA USA
[3] Harvard Univ, Cambridge, MA 02138 USA
Keywords
Automated debugging; fault localization; data provenance; data-intensive scalable computing (DISC); big data; data cleaning; provenance
DOI
10.1145/3127479.3131624
Chinese Library Classification
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
Developing Big Data Analytics workloads often involves trial-and-error debugging, due to the unclean nature of datasets or wrong assumptions made about the data. When errors (e.g., program crashes or outlier results) arise, developers are often interested in identifying a subset of the input data that is able to reproduce the problem. BIGSIFT is a new faulty data localization approach that combines insights from automated fault isolation in software engineering and data provenance in database systems to find a minimum set of failure-inducing inputs. BIGSIFT redefines data provenance for the purpose of debugging using a test oracle function and implements several unique optimizations specifically geared towards the iterative nature of automated debugging workloads. BIGSIFT improves the accuracy of fault localizability by several orders of magnitude (approximately 10^3 to 10^7 times) compared to Titian data provenance, and improves performance by up to 66 times compared to Delta Debugging, an automated fault-isolation technique. For each faulty output, BIGSIFT is able to localize fault-inducing data within 62% of the original job running time.
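The abstract names Delta Debugging as the baseline that searches for failure-inducing inputs with the help of a test oracle. The following is a minimal sketch of that generic technique applied to a list of input records. It is not BIGSIFT's implementation and includes none of its provenance-based optimizations; the names `ddmin` and `fails` and the sample data are hypothetical and chosen only for illustration.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def ddmin(records: Sequence[T], fails: Callable[[List[T]], bool]) -> List[T]:
    """Shrink `records` to a smaller subset on which the oracle `fails` still holds.

    `fails` is the test oracle: it returns True when running the job on the
    given records reproduces the error (crash, outlier result, and so on).
    """
    current = list(records)
    assert fails(current), "the full input must reproduce the failure"
    n = 2  # number of partitions to split the current input into
    while len(current) >= 2:
        chunk = -(-len(current) // n)  # ceiling division
        subsets = [current[i:i + chunk] for i in range(0, len(current), chunk)]
        reduced = False
        # Step 1: does any single partition reproduce the failure?
        for subset in subsets:
            if fails(subset):
                current, n, reduced = subset, 2, True
                break
        # Step 2: otherwise, does removing one partition still fail?
        if not reduced:
            for i in range(len(subsets)):
                complement = [r for j, s in enumerate(subsets) if j != i for r in s]
                if fails(complement):
                    current, n, reduced = complement, max(n - 1, 2), True
                    break
        # Step 3: otherwise, split more finely or stop at single records.
        if not reduced:
            if n >= len(current):
                break
            n = min(n * 2, len(current))
    return current


# Hypothetical usage: a negative value makes the (imaginary) job fail.
data = [20, 35, 41, -7, 52, 18, 63, 29]
culprit = ddmin(data, lambda rows: any(x < 0 for x in rows))
```

Every call to the oracle stands for one re-execution of the job on a candidate subset, which is why the abstract emphasizes the iterative nature of this kind of debugging workload.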
Pages: 520-534
Page count: 15
Related papers
50 in total
  • [31] In-Memory Data Rearrangement for Irregular, Data-Intensive Computing
    Lloyd, Scott
    Gokhale, Maya
    COMPUTER, 2015, 48 (08) : 18 - 25
  • [32] Data Allocation with Neural Similarity Estimation for Data-Intensive Computing
    Vamosi, Ralf
    Schikuta, Erich
    COMPUTATIONAL SCIENCE - ICCS 2022, PT III, 2022, 13352 : 534 - 546
  • [33] An Improved Bayesian Inference Method for Data-Intensive Computing
    Ma, Feng
    Liu, Weiyi
    COMPUTATIONAL INTELLIGENCE AND INTELLIGENT SYSTEMS, 2012, 316 : 134 - 144
  • [34] Innovative methods and algorithms for advanced data-intensive computing
    Cuzzocrea, A. (cuzzocrea@si.deis.unical.it), 1600, Elsevier B.V. (37):
  • [35] Enabling Trusted Data-Intensive Execution in Cloud Computing
    Zhang, Ning
    Lou, Wenjing
    Jiang, Xuxian
    Hou, Y. Thomas
    2014 IEEE CONFERENCE ON COMMUNICATIONS AND NETWORK SECURITY (CNS), 2014, : 355 - 363
  • [36] Data-intensive computing in the 21st century
    Gorton, Ian
    Greenfield, Paul
    Szalay, Alex
    Williams, Roy
    COMPUTER, 2008, 41 (04) : 30 - 32
  • [37] Data-Intensive Computing in Smart Microgrids: Volume II
    Herodotou, Herodotos
    Aslam, Sheraz
    ENERGIES, 2022, 15 (16)
  • [38] A Data-Intensive Workflow Scheduling Algorithm for Grid Computing
    Xu, Meng
    Cui, Lizhen
    Wang, Haiyang
    Bi, Yanbing
    Bian, Ji
    FOURTH CHINAGRID ANNUAL CONFERENCE, PROCEEDINGS, 2009, : 110 - 115
  • [39] Hyracks: A Flexible and Extensible Foundation for Data-Intensive Computing
    Borkar, Vinayak
    Carey, Michael
    Grover, Raman
    Onose, Nicola
    Vernica, Rares
    IEEE 27TH INTERNATIONAL CONFERENCE ON DATA ENGINEERING (ICDE 2011), 2011, : 1151 - 1162
  • [40] A new volunteer computing model for data-intensive applications
    Alonso-Monsalve, Saul
    Garcia-Carballeira, Felix
    Calderon, Alejandro
    CONCURRENCY AND COMPUTATION-PRACTICE & EXPERIENCE, 2017, 29 (24):