MLOps FMEA: A Proactive & Structured Approach to Mitigate Failures and Ensure Success for Machine Learning Operations

Authors
Paul, Abhishek [1]
Son, Roderick Y. [1]
Balodi, Shiv A. [2]
Crooks, Kenney [3]
Affiliations
[1] Northrop Grumman, Chief Data Off, 1 Space Pk Dr, Redondo Beach, CA 90034 USA
[2] Northrop Grumman, Chief Data Off, 2980 Fairview Pk Dr, Falls Church, VA 22042 USA
[3] Northrop Grumman, Reliabil & Model Based Sustainment, Aeronaut Sect, 2000 W NASA Blvd, Melbourne, FL 32901 USA
Keywords
Machine Learning; Technology Readiness Levels; Natural Language Processing; Failure Modes and Effects Analysis; Predictive Maintenance
DOI
10.1109/RAMS51492.2024.10457600
Chinese Library Classification (CLC) number
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract
Machine learning applications have seen an exponential rise in prevalence across many industries, including healthcare, banking, manufacturing, and defense. Although machine learning applications hold considerable potential, their successful development and productionization are not assured. To prevent failures and ensure success, a Machine Learning Operations (MLOps) Failure Modes and Effects Analysis (FMEA) is proposed as a proactive, structured approach to risk identification and mitigation. The MLOps FMEA framework demonstrates an approach to enumerate, prioritize, and mitigate potential failure modes across the entire MLOps lifecycle. It tailors the classical FMEA to the risk assessment needs of machine learning projects. This work proposes developing templated MLOps failure modes by using CRISP-ML(Q) as a standardized representation of the MLOps workflow to identify categories of MLOps failure modes, and the NIST Artificial Intelligence Risk Management Framework (AI RMF 1.0) as the basis for principled MLOps Design Patterns from which specific failure modes are derived. Together, these standards provide a methodical and comprehensive foundation for identifying and establishing templated failure modes in the MLOps lifecycle. This work also proposes adaptations to the classical FMEA workflow and risk prioritization to support the MLOps FMEA framework. To prioritize MLOps failure modes, MLOps-centric Severity, Occurrence, and Detection tables are proposed, Consequence Levels (Safe vs. Unsafe) are incorporated, and risks are categorized as intentional or unintentional failure modes. As a machine learning project transitions from a proof of concept to a production solution, the MLOps FMEA framework is applied at each Machine Learning Technology Readiness Level (MLTRL). The MLOps FMEA framework is demonstrated with a predictive maintenance case study. This framework has helped the organization deliver more impactful machine learning solutions to production successfully, with the added benefit of increased machine learning awareness and maturity in the organizational culture.
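The prioritization step described in the abstract can be illustrated with a minimal sketch. It uses the classical FMEA Risk Priority Number (RPN = Severity × Occurrence × Detection) and is not the paper's own scoring scheme: the 1 to 10 rating scales, the `FailureMode` class, the `unsafe` and `intentional` flags, the rule that Unsafe modes rank ahead of Safe ones, and the example failure modes with their ratings are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureMode:
    """One row of a hypothetical MLOps FMEA worksheet (illustrative names only)."""
    name: str
    severity: int      # assumed scale: 1 (negligible) .. 10 (catastrophic)
    occurrence: int    # assumed scale: 1 (rare) .. 10 (almost certain)
    detection: int     # assumed scale: 1 (almost always caught) .. 10 (undetectable)
    unsafe: bool       # assumed encoding of the Safe vs. Unsafe consequence level
    intentional: bool  # assumed encoding of intentional vs. unintentional failure

    def __post_init__(self):
        # Reject ratings outside the assumed 1..10 scale.
        for field_name in ("severity", "occurrence", "detection"):
            value = getattr(self, field_name)
            if not 1 <= value <= 10:
                raise ValueError(f"{field_name} must be in 1..10, got {value}")

    @property
    def rpn(self) -> int:
        """Classical FMEA Risk Priority Number."""
        return self.severity * self.occurrence * self.detection


def prioritize(modes):
    """Order failure modes: Unsafe ones first (an assumption), then descending RPN."""
    return sorted(modes, key=lambda m: (not m.unsafe, -m.rpn))


# Made-up failure modes and ratings, loosely themed on predictive maintenance.
example_modes = [
    FailureMode("Training/serving data drift", 7, 6, 5, unsafe=False, intentional=False),
    FailureMode("Poisoned training data", 9, 2, 8, unsafe=True, intentional=True),
    FailureMode("Stale model after sensor recalibration", 6, 4, 3, unsafe=False, intentional=False),
]

for mode in prioritize(example_modes):
    kind = "intentional" if mode.intentional else "unintentional"
    level = "Unsafe" if mode.unsafe else "Safe"
    print(f"{mode.name}: RPN={mode.rpn}, {level}, {kind}")
```

Sorting on the consequence level before the RPN is one possible design choice: it keeps a low-frequency but unsafe failure mode from being buried under high-RPN modes with benign consequences. The paper's actual tables and ranking rules may differ.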
Pages: 7