Semisupervised Clustering Approach for Pipe Failure Prediction with Imbalanced Data Set

Cited by: 3
Authors
Zali, Ramiz Beig [1]
Latifi, Milad [1]
Javadi, Akbar A. [1]
Farmani, Raziyeh [1]
Affiliations
[1] Univ Exeter, Ctr Water Syst, NorthPark Rd, Exeter EX4 4PY, England
Funding
UK Research and Innovation (UKRI);
Keywords
Water distribution network (WDN); Pipe failure prediction; Semisupervised clustering; Class imbalance; Machine learning (ML); WATER; SELECTION; MODELS;
DOI
10.1061/JWRMD5.WRENG-6263
Chinese Library Classification (CLC) number
TU [Building Science];
Discipline classification code
0813;
Abstract
In recent years, machine learning (ML) approaches have been used widely for water pipe condition assessment and failure prediction. These methods require a considerable amount of data from water distribution networks (WDNs). Imbalanced and missing data, whether asset or failure data, compromise a model's prediction performance. In this research, using only 2 years of failure data in a real WDN, three ML methods (XGBoost, random forest, and logistic regression) were used to prioritize asset rehabilitation. To address the issue of imbalanced data, a novel method of semisupervised clustering is proposed that leverages domain knowledge in combination with unsupervised learning to divide the data set into homogeneous categories and enhance the classification accuracy. The introduced approach performed better than well-known data science class imbalance treatment techniques. Furthermore, analysis of the results indicated that classification evaluation metrics struggled to assess the effectiveness of the various methods in practical terms. To address this, an economic indicator is proposed to rank the pipes for rehabilitation based on their cost and likelihood of failure (LoF). Preventive maintenance using the results of the economic indicator reduces the number of failures at a small fraction of the total replacement cost. Moreover, another indicator was developed to consider the consequence of the failures and the LoF simultaneously. This indicator cost-effectively mitigates the flow capacity reductions in WDNs caused by failures. The results of this study provide asset managers with a powerful tool to prioritize assets for rehabilitation.
In recent years, machine learning algorithms have gained popularity for assessing water pipe conditions and predicting failures. However, their effectiveness relies on substantial data from water distribution networks (WDNs). Challenges arise with limited (imbalanced) data, affecting prediction accuracy. This study focused on a specific WDN with only 2 years of failure data, aiming to identify priority assets for rehabilitation. Three ML methods (XGBoost, random forest, and logistic regression) and a novel semisupervised clustering approach were employed. This approach combines expert knowledge with traditional techniques to divide the data into homogeneous clusters. Applying the ML algorithms within these clusters notably enhanced predictive accuracy. Two novel metrics were introduced for prioritizing pipe rehabilitation: one combining failure likelihood and replacement costs, and the other evaluating pipes based on their significance within the WDN and associated rehabilitation expenses. These models empower asset managers to optimize pipe replacement budget allocation and enhance network performance.
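The cost-and-likelihood ranking described in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the indicator's formula, so the ratio used here (LoF divided by replacement cost) is an assumption, as are the greedy budget rule, the `Pipe` record, the function names, and the sample values. It is meant only to show the general idea of ranking pipes so that a limited budget targets the most failure-prone pipes per unit cost, not to reproduce the authors' method.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pipe:
    # Hypothetical record; field names are illustrative, not from the paper.
    pipe_id: str
    lof: float               # predicted likelihood of failure, in [0, 1]
    replacement_cost: float  # cost of replacing the pipe, any currency unit


def rank_by_lof_per_cost(pipes):
    """Sort pipes so the highest LoF per unit cost comes first (assumed indicator)."""
    return sorted(pipes, key=lambda p: p.lof / p.replacement_cost, reverse=True)


def select_within_budget(pipes, budget):
    """Walk the ranked list and keep every pipe that still fits the budget."""
    selected, spent = [], 0.0
    for pipe in rank_by_lof_per_cost(pipes):
        if spent + pipe.replacement_cost <= budget:
            selected.append(pipe)
            spent += pipe.replacement_cost
    return selected, spent


# Made-up example values, for illustration only.
pipes = [
    Pipe("P1", lof=0.30, replacement_cost=10_000.0),
    Pipe("P2", lof=0.10, replacement_cost=2_000.0),
    Pipe("P3", lof=0.60, replacement_cost=40_000.0),
    Pipe("P4", lof=0.05, replacement_cost=5_000.0),
]

ranked_ids = [p.pipe_id for p in rank_by_lof_per_cost(pipes)]
chosen, total_spent = select_within_budget(pipes, budget=15_000.0)
chosen_ids = [p.pipe_id for p in chosen]
```

In this made-up example, the pipe with the highest LoF (P3) is not selected because its cost exceeds the budget, while cheaper pipes with a better LoF-to-cost ratio are. A consequence-aware variant, as the abstract mentions, would presumably weight each pipe by a measure of its importance to network flow capacity, but the abstract does not specify how.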
Pages: 15