Microservice Debugging with Checkpoint-Restart

Cited by: 1

Authors:

Merino, Xavier [1]

Otero, Carlos E. [1]

Affiliation:

[1] Florida Inst Technol, Dept Comp Engn & Sci, Melbourne, FL 32901 USA

Source:

2023 IEEE CLOUD SUMMIT | 2023

Keywords:

checkpointing; debugging; microservices;

DOI:

10.1109/CloudSummit57601.2023.00016

Chinese Library Classification (CLC) number:

TP301 [Theory, Methods];

Discipline classification code:

081202 ;

Abstract:

Debugging microservices in complex cloud-native deployments can be a daunting task due to interaction-based problems and challenges in reproducing such environments. Traditional fault localization approaches may be ineffective, leading to longer debugging times. To address these challenges, we propose utilizing checkpoint/restart (C/R) techniques to replicate buggy environments across different hardware configurations without code instrumentation or specialized kernels. Our approach integrates with existing debugging practices, making it adaptable and user-friendly. However, since C/R requires some downtime, we assess our approach's practicality by analyzing data from 13,000 observations and estimating the time required to capture a service's state. The minimal downtime introduced by our approach minimizes service interruption. This can be leveraged by operators to plan deployments, live debugging, maintenance, and game-day operations. By combining the power of C/R techniques with existing debugging practices, we aim to facilitate environment reproduction and reduce the iterative nature of the debugging process in complex cloud-native deployments.

Pages: 58-63

Page count: 6
