Microservice Debugging with Checkpoint-Restart

Cited by: 1
Authors
Merino, Xavier [1]
Otero, Carlos E. [1]
Affiliations
[1] Florida Inst Technol, Dept Comp Engn & Sci, Melbourne, FL 32901 USA
Source
Keywords
checkpointing; debugging; microservices
DOI
10.1109/CloudSummit57601.2023.00016
Chinese Library Classification (CLC) number
TP301 [Theory, methods]
Discipline classification code
081202
Abstract
Debugging microservices in complex cloud-native deployments can be a daunting task because many problems arise from interactions between services and because such environments are hard to reproduce. Traditional fault localization approaches may be ineffective, leading to longer debugging times. To address these challenges, we propose using checkpoint/restart (C/R) techniques to replicate buggy environments across different hardware configurations without code instrumentation or specialized kernels. Our approach integrates with existing debugging practices, making it adaptable and user-friendly. However, since C/R requires some downtime, we assess our approach's practicality by analyzing data from 13,000 observations and estimating the time required to capture a service's state. The downtime our approach introduces is small, which limits service interruption, and operators can take it into account when planning deployments, live debugging, maintenance, and game-day operations. By combining C/R techniques with existing debugging practices, we aim to facilitate environment reproduction and reduce the iterative nature of the debugging process in complex cloud-native deployments.
Pages: 58-63
Page count: 6