A Fault Avoidance Strategy Improving the Reliability of the EGI Production Grid Infrastructure

Authors
Palmieri, Francesco [1]
Pardi, Silvio [2]
Veronesi, Paolo [3]
Affiliations
[1] Univ Naples Federico II, Via Cinthia 5, I-80126 Naples, Italy
[2] INFN Istituto Nazionale Di Fisica Nucleare, INDAM, I-80126 Naples, Italy
[3] INFN CNAF, I-40127 Bologna, Italy
Keywords
Reliability; Fault Avoidance; Monitoring; Resource Management; Computing Systems
DOI
Not available
Chinese Library Classification number
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract
Reliability is a crucial issue for the development of stable and effective production grid infrastructures: grid users must be able to rely on the runtime services they request and receive from the underlying grid. Many runtime services and capabilities offered by modern grid infrastructures are not known in advance to application developers and are bound dynamically only at execution time, which increases the incidence of interaction faults. In this work we propose, implement and evaluate a novel low-impact fault-avoidance scheme, specifically conceived to improve grid reliability from the user/application point of view by providing proper service status information to the workload management system. In particular, starting from the EGEE experience, we designed a strategy that inhibits the use of specific runtime capabilities on the available resources as soon as the monitoring system detects any anomalous behavior associated with those capabilities, and re-integrates them once they work correctly again. The results of a significant set of tests run on the production EGEE infrastructure are presented to show the effectiveness of our approach.
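The inhibit/re-integrate idea described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `Resource` class, the `apply_probe` and `match` functions, and the capability tags such as `"MPI"` are all hypothetical names chosen for illustration, and the real monitoring system and workload management system are reduced here to a boolean probe result and a simple set comparison.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Resource:
    """Hypothetical grid resource with the runtime capabilities it publishes."""
    name: str
    capabilities: set[str]
    inhibited: set[str] = field(default_factory=set)

    def advertised(self) -> set[str]:
        # Only capabilities that are not currently inhibited are visible
        # to the matchmaking step.
        return self.capabilities - self.inhibited


def apply_probe(resource: Resource, capability: str, healthy: bool) -> None:
    """Inhibit a capability on a failed probe, re-integrate it on a good one."""
    if capability not in resource.capabilities:
        return  # the resource never offered this capability
    if healthy:
        resource.inhibited.discard(capability)
    else:
        resource.inhibited.add(capability)


def match(resources: list[Resource], required: set[str]) -> list[str]:
    """Return names of resources advertising every required capability."""
    return [r.name for r in resources if required <= r.advertised()]
```

In this sketch a failed probe hides only the affected capability, so the resource stays eligible for jobs that do not need it, which is one plausible reading of "low-impact" in the abstract.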
Pages: 159 / +
Number of pages: 3