Microservice Debugging with Checkpoint-Restart

Cited by: 1

Authors:

Merino, Xavier [1]

Otero, Carlos E. [1]

Affiliations:

[1] Florida Inst Technol, Dept Comp Engn & Sci, Melbourne, FL 32901 USA

Source:

2023 IEEE CLOUD SUMMIT | 2023

Keywords:

checkpointing; debugging; microservices

DOI:

10.1109/CloudSummit57601.2023.00016

Chinese Library Classification (CLC) number:

TP301 [Theory, Methods]

Discipline classification code:

081202

Abstract:

Debugging microservices in complex cloud-native deployments can be a daunting task due to interaction-based problems and challenges in reproducing such environments. Traditional fault localization approaches may be ineffective, leading to longer debugging times. To address these challenges, we propose utilizing checkpoint/restart (C/R) techniques to replicate buggy environments across different hardware configurations without code instrumentation or specialized kernels. Our approach integrates with existing debugging practices, making it adaptable and user-friendly. However, since C/R requires some downtime, we assess our approach's practicality by analyzing data from 13,000 observations and estimating the time required to capture a service's state. The minimal downtime introduced by our approach minimizes service interruption. This can be leveraged by operators to plan deployments, live debugging, maintenance, and game-day operations. By combining the power of C/R techniques with existing debugging practices, we aim to facilitate environment reproduction and reduce the iterative nature of the debugging process in complex cloud-native deployments.

Pages: 58-63

Page count: 6
