Automated Debugging in Data-Intensive Scalable Computing

Cited by: 16
Authors
Gulzar, Muhammad Ali [1]
Interlandi, Matteo [2]
Han, Xueyuan [3]
Li, Mingda [1]
Condie, Tyson [1]
Kim, Miryung [1]
Affiliations
[1] Univ Calif Los Angeles, Los Angeles, CA 90095 USA
[2] Microsoft, Redmond, WA USA
[3] Harvard Univ, Cambridge, MA 02138 USA
Keywords
Automated debugging; fault localization; data provenance; data-intensive scalable computing (DISC); big data; data cleaning; provenance
DOI
10.1145/3127479.3131624
Chinese Library Classification (CLC) number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
Developing Big Data Analytics workloads often involves trial-and-error debugging, due to the unclean nature of datasets or wrong assumptions made about data. When errors (e.g., program crashes or outlier results) arise, developers are often interested in identifying a subset of the input data that is able to reproduce the problem. BIGSIFT is a new faulty data localization approach that combines insights from automated fault isolation in software engineering and data provenance in database systems to find a minimum set of failure-inducing inputs. BIGSIFT redefines data provenance for the purpose of debugging using a test oracle function and implements several unique optimizations, specifically geared towards the iterative nature of automated debugging workloads. BIGSIFT improves the accuracy of fault localizability by several orders of magnitude (~10^3 to 10^7×) compared to Titian data provenance, and improves performance by up to 66× compared to Delta Debugging, an automated fault-isolation technique. For each faulty output, BIGSIFT is able to localize fault-inducing data within 62% of the original job running time.
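To make the idea of a test oracle and a minimum failure-inducing input set concrete, here is a minimal sketch of the classic delta debugging reduction loop (ddmin), the baseline technique the abstract compares against. It is not BIGSIFT's implementation and uses no data provenance or Spark APIs. The names `ddmin` and `fails`, and the sample records, are assumptions made for illustration only.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def ddmin(records: Sequence[T], fails: Callable[[List[T]], bool]) -> List[T]:
    """Shrink `records` to a 1-minimal subset on which the oracle `fails` is still True."""
    current = list(records)
    if not fails(current):
        raise ValueError("the full input must reproduce the failure")

    n = 2  # current number of partitions
    while len(current) >= 2:
        # Split the input into n roughly equal, non-empty chunks.
        size = len(current)
        bounds = [size * i // n for i in range(n + 1)]
        chunks = [current[bounds[i]:bounds[i + 1]] for i in range(n)]

        reduced = False
        # Step 1: does a single chunk reproduce the failure on its own?
        for chunk in chunks:
            if fails(chunk):
                current, n, reduced = chunk, 2, True
                break

        # Step 2: otherwise, does removing one chunk keep the failure?
        if not reduced:
            for i in range(n):
                complement = [r for j, c in enumerate(chunks) if j != i for r in c]
                if fails(complement):
                    current, n, reduced = complement, max(n - 1, 2), True
                    break

        # Step 3: otherwise, refine the granularity or stop at single records.
        if not reduced:
            if n >= len(current):
                break
            n = min(len(current), n * 2)

    return current


# Hypothetical example: a job that crashes on a non-numeric record.
records = [str(i) for i in range(16)]
records[11] = "n/a"


def crashes(subset: List[str]) -> bool:
    """Test oracle: True when the (toy) job would fail on this subset."""
    return any(not value.isdigit() for value in subset)


culprits = ddmin(records, crashes)
print(culprits)  # -> ['n/a']
```

Each call to the oracle stands in for one re-run of the job on a subset of the input, which is why the number of oracle calls dominates the cost of this kind of search on large datasets.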
Pages: 520-534
Number of pages: 15