MLOps FMEA: A Proactive & Structured Approach to Mitigate Failures and Ensure Success for Machine Learning Operations

Cited by: 0
Authors
Paul, Abhishek [1]
Son, Roderick Y. [1]
Balodi, Shiv A. [2]
Crooks, Kenney [3]
Affiliations
[1] Northrop Grumman, Chief Data Off, 1 Space Pk Dr, Redondo Beach, CA 90034 USA
[2] Northrop Grumman, Chief Data Off, 2980 Fairview Pk Dr, Falls Church, VA 22042 USA
[3] Northrop Grumman, Reliabil & Model Based Sustainment, Aeronaut Sect, 2000 W NASA Blvd, Melbourne, FL 32901 USA
Keywords
Machine Learning; Technology Readiness Levels; Natural Language Processing; Failure Modes and Effects Analysis; Predictive Maintenance
DOI
10.1109/RAMS51492.2024.10457600
Chinese Library Classification (CLC) number
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract
Machine learning applications have seen an exponential rise in prevalence across many industries, including healthcare, banking, manufacturing, and defense. While machine learning applications hold considerable potential, successful development and productionization are not assured. To prevent failures and ensure success, a Machine Learning Operations (MLOps) Failure Modes and Effects Analysis (FMEA) is proposed as a proactive, structured approach to risk identification and mitigation. The MLOps FMEA framework demonstrates an approach to enumerate, prioritize, and mitigate potential failure modes across the entire MLOps lifecycle. It tailors the classical FMEA to the risk assessment needs of machine learning projects. This work proposes developing templated MLOps failure modes by using CRISP-ML(Q) as a standardized representation of the MLOps workflow to identify categories of MLOps failure modes, and the NIST Artificial Intelligence Risk Management Framework (AI RMF 1.0) as the basis for principled MLOps Design Patterns from which specific failure modes are derived. Together, these standards establish a methodical and comprehensive foundation for identifying templated failure modes in the MLOps lifecycle. This work also proposes adaptations to the classical FMEA workflow and risk prioritization to support the MLOps FMEA framework. For prioritizing MLOps failure modes, MLOps-centric Severity, Occurrence, and Detection tables are proposed, Consequence Levels (Safe vs. Unsafe) are incorporated, and risks are categorized as intentional or unintentional failure modes. As a machine learning project transitions from a proof of concept to a production solution, the MLOps FMEA framework is applied at each Machine Learning Technology Readiness Level (MLTRL). The MLOps FMEA framework is demonstrated with a predictive maintenance case study. This framework has helped the organization deliver more impactful machine learning solutions to production successfully, with the added benefit of increased machine learning awareness and maturity in the organizational culture.
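
The prioritization step described in the abstract can be illustrated with a minimal sketch. It assumes the classical FMEA Risk Priority Number (RPN = Severity x Occurrence x Detection, each rated 1 to 10); the paper's own MLOps-centric rating tables and ranking rules are not given in this record, so the 1-to-10 scales, the example failure modes, the `FailureMode` and `prioritize` names, and the choice to rank Unsafe consequences ahead of Safe ones are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureMode:
    """One row of a hypothetical MLOps FMEA worksheet."""
    name: str
    severity: int      # 1 (negligible) .. 10 (catastrophic), assumed scale
    occurrence: int    # 1 (rare) .. 10 (frequent), assumed scale
    detection: int     # 1 (almost always caught) .. 10 (undetectable), assumed scale
    unsafe: bool       # Consequence Level: True = Unsafe, False = Safe
    intentional: bool  # True = intentional (e.g. attack), False = unintentional

    def __post_init__(self) -> None:
        # Reject ratings outside the assumed 1..10 scale.
        for field_name in ("severity", "occurrence", "detection"):
            value = getattr(self, field_name)
            if not 1 <= value <= 10:
                raise ValueError(f"{field_name} must be in 1..10, got {value}")

    @property
    def rpn(self) -> int:
        """Classical Risk Priority Number: S x O x D."""
        return self.severity * self.occurrence * self.detection


def prioritize(modes: list[FailureMode]) -> list[FailureMode]:
    """Rank Unsafe modes first (an assumption), then by descending RPN."""
    return sorted(modes, key=lambda m: (not m.unsafe, -m.rpn))


# Hypothetical failure modes for a predictive maintenance model.
modes = [
    FailureMode("Undetected training data drift", 7, 6, 8,
                unsafe=False, intentional=False),
    FailureMode("Label leakage in features", 8, 3, 5,
                unsafe=False, intentional=False),
    FailureMode("Adversarial input to deployed model", 9, 2, 7,
                unsafe=True, intentional=True),
]

for mode in prioritize(modes):
    level = "Unsafe" if mode.unsafe else "Safe"
    kind = "intentional" if mode.intentional else "unintentional"
    print(f"{mode.rpn:4d}  {level:6s}  {kind:13s}  {mode.name}")
```

Sorting on the consequence level before the RPN keeps a low-probability but unsafe failure mode from being buried beneath frequent, benign ones; a team could equally treat the Unsafe flag as a separate review gate rather than a sort key.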
Pages: 7