Microservice Debugging with Checkpoint-Restart

Cited by: 1
Authors
Merino, Xavier [1]
Otero, Carlos E. [1]
Affiliations
[1] Florida Inst Technol, Dept Comp Engn & Sci, Melbourne, FL 32901 USA
Source
Keywords
checkpointing; debugging; microservices;
DOI
10.1109/CloudSummit57601.2023.00016
Chinese Library Classification number
TP301 [Theory, Methods];
Discipline classification code
081202;
Abstract
Debugging microservices in complex cloud-native deployments can be a daunting task due to interaction-based problems and challenges in reproducing such environments. Traditional fault localization approaches may be ineffective, leading to longer debugging times. To address these challenges, we propose utilizing checkpoint/restart (C/R) techniques to replicate buggy environments across different hardware configurations without code instrumentation or specialized kernels. Our approach integrates with existing debugging practices, making it adaptable and user-friendly. However, since C/R requires some downtime, we assess our approach's practicality by analyzing data from 13,000 observations and estimating the time required to capture a service's state. The minimal downtime introduced by our approach minimizes service interruption. This can be leveraged by operators to plan deployments, live debugging, maintenance, and game-day operations. By combining the power of C/R techniques with existing debugging practices, we aim to facilitate environment reproduction and reduce the iterative nature of the debugging process in complex cloud-native deployments.
Pages: 58-63
Number of pages: 6
Related papers
50 in total
  • [41] A SURVEY OF CHECKPOINT/RESTART TECHNIQUES ON DISTRIBUTED MEMORY SYSTEMS
    Shahzad, Faisal
    Wittmann, Markus
    Kreutzer, Moritz
    Zeiser, Thomas
    Hager, Georg
    Wellein, Gerhard
    PARALLEL PROCESSING LETTERS, 2013, 23 (04)
  • [42] Checkpoint and restart for distributed components in XCAT3
    Krishnan, S
    Gannon, D
    FIFTH IEEE/ACM INTERNATIONAL WORKSHOP ON GRID COMPUTING, PROCEEDINGS, 2004, : 281 - 288
  • [43] Job migration in HPC clusters by means of checkpoint/restart
    Manuel Rodríguez-Pascual
    Jiajun Cao
    José A. Moríñigo
    Gene Cooperman
    Rafael Mayo-García
    The Journal of Supercomputing, 2019, 75 : 6517 - 6541
  • [44] An Examination of the Impact of Failure Distribution on Coordinated Checkpoint/Restart
    Levy, Scott
    Ferreira, Kurt B.
    PROCEEDINGS OF THE ACM WORKSHOP ON FAULT-TOLERANCE FOR HPC AT EXTREME SCALE (FTXS'16), 2016, : 35 - 42
  • [45] Job migration in HPC clusters by means of checkpoint/restart
    Rodriguez-Pascual, Manuel
    Cao, Jiajun
    Morinigo, Jose A.
    Cooperman, Gene
    Mayo-Garcia, Rafael
    JOURNAL OF SUPERCOMPUTING, 2019, 75 (10): : 6517 - 6541
  • [46] Efficient Encoding and Reconstruction of HPC Datasets for Checkpoint/Restart
    Zhang, Jialing
    Zhuo, Xiaoyan
    Moon, Aekyeung
    Liu, Hang
    Son, Seung Woo
    2019 35TH SYMPOSIUM ON MASS STORAGE SYSTEMS AND TECHNOLOGIES (MSST 2019), 2019, : 79 - 91
  • [47] Fault Analysis and Debugging of Microservice Systems: Industrial Survey, Benchmark System, and Empirical Study
    Zhou, Xiang
    Peng, Xin
    Xie, Tao
    Sun, Jun
    Ji, Chao
    Li, Wenhai
    Ding, Dan
    IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, 2021, 47 (02) : 243 - 260
  • [48] CRState: checkpoint/restart of OpenCL program for in-kernel applications
    Chen, Genlang
    Zhang, Jiajian
    Zhu, Zufang
    Jiang, Qiangqiang
    Jiang, Hai
    Pang, Chaoyi
    JOURNAL OF SUPERCOMPUTING, 2021, 77 (06): : 5426 - 5467
  • [49] Combining XOR and Partner Checkpointing for Resilient Multilevel Checkpoint/Restart
    Gholami, Masoud
    Schintke, Florian
    2021 IEEE 35TH INTERNATIONAL PARALLEL AND DISTRIBUTED PROCESSING SYMPOSIUM (IPDPS), 2021, : 277 - 288
  • [50] CRState: checkpoint/restart of OpenCL program for in-kernel applications
    Genlang Chen
    Jiajian Zhang
    Zufang Zhu
    Qiangqiang Jiang
    Hai Jiang
    Chaoyi Pang
    The Journal of Supercomputing, 2021, 77 : 5426 - 5467