Autonomic failure prediction based on manifold learning for large-scale distributed systems

Cited by: 6
Authors
Lu X. [1]
Wang H.-Q. [1]
Zhou R.-J. [1]
Ge B.-Y. [1]
Affiliation
[1] College of Computer Science and Technology, Harbin Engineering University
Funding
National Natural Science Foundation of China
Keywords
autonomic computing; failure prediction; locally linear embedding; manifold learning
DOI
10.1016/S1005-8885(09)60497-0
Abstract
This article investigates autonomic failure prediction in large-scale distributed systems, using nonlinear dimensionality reduction to extract failure features automatically. Most existing failure prediction methods focus on building prediction models or heuristic rules by discovering failure patterns, but the feature extraction step that precedes failure pattern recognition is rarely considered because of the increasing complexity of modern distributed systems. In this work, a novel performance-centric approach to automated failure prediction is proposed based on manifold learning (ML). The ML algorithm known as supervised locally linear embedding (SLLE) is applied to perform feature extraction. To generalize the dimensionality reduction mapping, a nonlinear mapping approximation and optimization solution is also proposed. For the experiments, a file transfer test bed with fault injection is developed that gathers multilevel performance metrics transparently. Based on runtime monitoring of these metrics, the SLLE method can automatically predict more than 50% of the central processing unit (CPU) and memory failures, and around 70% of the network failures. © 2010 The Journal of China Universities of Posts and Telecommunications.
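The abstract does not spell out how SLLE is computed. As a rough illustration only, the sketch below follows one commonly described formulation of supervised LLE: distances between samples with different class labels are inflated before neighbor selection, after which the usual LLE weight and embedding steps are applied. The function name `slle_embed`, the parameters `alpha` and `reg`, and the NumPy-only implementation are assumptions made for this example, not details taken from the article.

```python
import numpy as np

def slle_embed(X, y, n_neighbors=5, n_components=2, alpha=0.5, reg=1e-3):
    """Hypothetical supervised LLE sketch (not the article's implementation).

    X: (n_samples, n_features) performance metrics, y: class labels
    (for example normal vs. failure). alpha in [0, 1] controls how strongly
    label information is used; alpha=0 reduces to plain LLE.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    n = X.shape[0]

    # Pairwise Euclidean distances between all samples.
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))

    # Supervised step: push samples from different classes further apart.
    different = (y[:, None] != y[None, :]).astype(float)
    dist_sup = dist + alpha * dist.max() * different
    np.fill_diagonal(dist_sup, np.inf)  # a sample is never its own neighbor

    # Reconstruction weights: each sample as an affine combination of neighbors.
    W = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(dist_sup[i])[:n_neighbors]
        Z = X[nbrs] - X[i]
        G = Z @ Z.T
        # Regularize the local Gram matrix so the solve is well conditioned.
        G += reg * (np.trace(G) + 1e-12) * np.eye(n_neighbors)
        w = np.linalg.solve(G, np.ones(n_neighbors))
        W[i, nbrs] = w / w.sum()

    # Embedding: bottom eigenvectors of (I - W)^T (I - W), skipping the
    # first one, which corresponds to the constant vector.
    I_minus_W = np.eye(n) - W
    M = I_minus_W.T @ I_minus_W
    _, eigvecs = np.linalg.eigh(M)
    return eigvecs[:, 1:n_components + 1], W
```

In a failure prediction setting, the rows of `X` would be monitored metric vectors and `y` the injected-fault labels. Mapping unseen runtime samples into the embedding needs a separate out-of-sample step, which this sketch does not cover.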
Pages: 116-124
Page count: 8