MLOps FMEA: A Proactive & Structured Approach to Mitigate Failures and Ensure Success for Machine Learning Operations

Cited by: 0
Authors
Paul, Abhishek [1]
Son, Roderick Y. [1]
Balodi, Shiv A. [2]
Crooks, Kenney [3]
Affiliations
[1] Northrop Grumman, Chief Data Off, 1 Space Pk Dr, Redondo Beach, CA 90034 USA
[2] Northrop Grumman, Chief Data Off, 2980 Fairview Pk Dr, Falls Church, VA 22042 USA
[3] Northrop Grumman, Reliabil & Model Based Sustainment, Aeronaut Sect, 2000 W NASA Blvd, Melbourne, FL 32901 USA
Keywords
Machine Learning; Technology Readiness Levels; Natural Language Processing; Failure Modes and Effects Analysis; Predictive Maintenance
DOI
10.1109/RAMS51492.2024.10457600
Chinese Library Classification number
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract
Machine learning applications have risen sharply in prevalence across many industries, including healthcare, banking, manufacturing, and defense. Despite this potential, successful development and productionization of machine learning applications is not assured. To prevent failures and ensure success, a Machine Learning Operations (MLOps) Failure Modes and Effects Analysis (FMEA) is proposed as a proactive, structured approach to risk identification and mitigation. The MLOps FMEA framework demonstrates an approach to enumerate, prioritize, and mitigate potential failure modes across the entire MLOps lifecycle, tailoring the classical FMEA to the risk assessment needs of machine learning projects. This work proposes developing templated MLOps failure modes by using CRISP-ML(Q) as a standardized representation of the MLOps workflow to identify categories of MLOps failure modes, and the NIST Artificial Intelligence Risk Management Framework (AI RMF 1.0) as the basis for principled MLOps Design Patterns from which specific failure modes are derived. Together, these standards establish a methodical and comprehensive foundation for identifying templated failure modes in the MLOps lifecycle. This work also proposes adaptations to the classical FMEA workflow and risk prioritization to support the MLOps FMEA framework. For prioritizing MLOps failure modes, MLOps-centric Severity, Occurrence, and Detection tables are proposed, Consequence Levels (Safe vs. Unsafe) are incorporated, and risks are categorized as intentional or unintentional failure modes. As a machine learning project transitions from a proof of concept to a production solution, the MLOps FMEA framework is applied at each Machine Learning Technology Readiness Level (MLTRL). The MLOps FMEA framework is demonstrated with a predictive maintenance case study.
This framework has aided the organization in increasing the successful delivery of impactful machine learning solutions to production, as well as providing the added benefit of increased machine learning awareness and maturity in the organizational culture.
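
The prioritization step described in the abstract can be illustrated with a minimal sketch. It uses the classical FMEA Risk Priority Number (RPN = Severity × Occurrence × Detection) and is not the paper's method: the MLOps-centric rating tables are not reproduced in this record, so the 1 to 10 scales, the failure mode names, the ratings, and the rule of ranking Unsafe consequences ahead of Safe ones are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureMode:
    name: str          # hypothetical failure mode label
    severity: int      # assumed scale: 1 (negligible) to 10 (catastrophic)
    occurrence: int    # assumed scale: 1 (rare) to 10 (frequent)
    detection: int     # assumed scale: 1 (almost always caught) to 10 (undetectable)
    unsafe: bool       # assumed encoding of the Safe vs. Unsafe consequence level
    intentional: bool  # assumed flag: intentional vs. unintentional failure mode

    def __post_init__(self):
        # Reject ratings outside the assumed 1-10 scale.
        for field_name in ("severity", "occurrence", "detection"):
            value = getattr(self, field_name)
            if not 1 <= value <= 10:
                raise ValueError(f"{field_name} must be in 1..10, got {value}")

    @property
    def rpn(self) -> int:
        # Classical FMEA Risk Priority Number.
        return self.severity * self.occurrence * self.detection


def prioritize(modes):
    # Assumed ordering rule: Unsafe consequences first, then descending RPN.
    return sorted(modes, key=lambda m: (not m.unsafe, -m.rpn))


# Illustrative entries only; names and ratings are invented for this sketch.
modes = [
    FailureMode("training-serving skew", 7, 5, 6, unsafe=False, intentional=False),
    FailureMode("poisoned training data", 9, 2, 8, unsafe=True, intentional=True),
    FailureMode("stale maintenance labels", 5, 4, 3, unsafe=False, intentional=False),
]

ranked = prioritize(modes)
for mode in ranked:
    kind = "intentional" if mode.intentional else "unintentional"
    level = "Unsafe" if mode.unsafe else "Safe"
    print(f"{mode.name}: RPN={mode.rpn}, {level}, {kind}")
```

Sorting on the consequence level before the RPN keeps an Unsafe failure mode at the top even when its RPN is lower than that of a Safe one, which is one plausible way to combine the two signals; the paper may combine them differently.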
Pages: 7