The Lifecycle of an AIOps System
Artificial Intelligence for ML Operations (AIOps) has emerged as a powerful approach to managing complex Machine Learning (ML) environments. By leveraging AI, machine learning, and data analytics, AIOps enables ML teams to automate and optimize various aspects of their operations, improving efficiency and reducing downtime. In this documentation, we’ll explore the lifecycle of an AIOps system, providing insights into how these systems are developed, implemented, and maintained.
In Google’s article “MLOps: Continuous delivery and automation pipelines in machine learning”, the authors define 9 key steps in the lifecycle of an AIOps system. We refine these steps below.
Model Development, Selection and Training: The data scientist trains several ML models by applying different algorithms to the prepared data, then tunes the hyperparameters of each algorithm to find the best-performing model. The output of this step is a trained model.
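As a rough illustration of training with hyperparameter tuning, the sketch below runs a small grid search over `k` for a toy one-dimensional k-nearest-neighbours classifier. The data, the candidate values and the helper names (`knn_predict`, `tune_k`) are all hypothetical and chosen only to keep the example self-contained. A real project would more likely use an ML library.

```python
import random
from collections import Counter


def knn_predict(train, x, k):
    """Predict the majority label among the k training points nearest to x."""
    nearest = sorted(train, key=lambda row: abs(row[0] - x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def accuracy(train, rows, k):
    """Fraction of rows whose label the k-NN model predicts correctly."""
    hits = sum(knn_predict(train, x, k) == y for x, y in rows)
    return hits / len(rows)


def tune_k(train, validation, candidates):
    """Grid search: return (best_k, best_score) measured on the validation set."""
    scores = {k: accuracy(train, validation, k) for k in candidates}
    best_k = max(scores, key=scores.get)
    return best_k, scores[best_k]


# Hypothetical prepared data: one feature, label 1 when the feature exceeds 0.5.
rng = random.Random(0)
data = [(x, int(x > 0.5)) for x in (rng.random() for _ in range(200))]
train, validation = data[:150], data[150:]

# Odd candidate values avoid tied votes between the two classes.
best_k, best_score = tune_k(train, validation, candidates=[1, 3, 5, 7])
```

The tuning uses a validation split rather than the test set, so the holdout data stays untouched for the evaluation step.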
Model Evaluation: The model is evaluated on a holdout test set to assess its quality. The output of this step is a set of metrics that quantify that quality.
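The evaluation step can be sketched as a function that turns holdout labels and predictions into a set of metrics. The choice of metrics (accuracy, precision, recall, F1) and the sample labels below are assumptions for illustration. The right metrics depend on the problem.

```python
def evaluate(y_true, y_pred):
    """Compute accuracy, precision, recall and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical holdout labels and model predictions.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
metrics = evaluate(y_true, y_pred)
```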
Model Validation, Registry and Pushing Model to Production: The model is confirmed to be adequate for deployment, meaning that its predictive performance is better than a certain baseline.
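One way to express the baseline comparison is a simple promotion gate. The sketch below is an assumption about how such a check might look: the function name `passes_validation`, the choice of F1 as the deciding metric and the `min_gain` margin are all hypothetical.

```python
def passes_validation(candidate, baseline, metric="f1", min_gain=0.0):
    """Return True when the candidate beats the baseline on `metric` by more than min_gain."""
    return candidate[metric] - baseline[metric] > min_gain


# Hypothetical metrics for the production baseline and the newly trained model.
baseline_metrics = {"f1": 0.71}
candidate_metrics = {"f1": 0.75}

# Require a margin so that noise-level improvements do not trigger a deployment.
promote = passes_validation(candidate_metrics, baseline_metrics, min_gain=0.02)
```

Only a model for which the gate returns `True` would move on to deployment.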
Model Deployment and Serving: The validated model is deployed to a target environment to serve predictions. The deployment can take one of the following forms:
- Microservices with a REST API to serve online predictions.
- A model embedded in an edge or mobile device.
- Part of a batch prediction system.
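To make the online and batch options concrete, here is a minimal sketch using only the Python standard library. The `predict` function is a hypothetical stand-in for a validated model, and the `/predict` route, the port and the JSON payload shape are assumptions, not a prescribed interface.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def predict(features):
    """Stand-in for a validated model: a hypothetical fixed linear scorer."""
    weights = [0.4, 0.6]
    score = sum(w * x for w, x in zip(weights, features))
    return {"score": score, "label": int(score > 0.5)}


def predict_batch(rows):
    """Batch prediction: score many stored rows in one offline pass."""
    return [predict(row) for row in rows]


class PredictHandler(BaseHTTPRequestHandler):
    """Online prediction: POST /predict with a JSON body {"features": [...]}."""

    def do_POST(self):
        if self.path != "/predict":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length))
        body = json.dumps(predict(payload["features"])).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Start the blocking HTTP server; call this from a script entry point."""
    HTTPServer(("127.0.0.1", port), PredictHandler).serve_forever()
```

The same `predict` function backs both paths, which keeps online and batch results consistent. Input validation and error handling for malformed requests are left out of this sketch.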
Model Monitoring: The model is monitored to ensure that it continues to perform as expected. Monitoring can include the following:
- Anomaly detection to detect unexpected behavior in the model.
- Drift detection to detect changes in the data distribution.
- Performance monitoring to detect changes in the model performance.
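Drift detection can be illustrated with the Population Stability Index (PSI), one of several possible drift measures and not necessarily the one a given system uses. In the sketch below the bin count, the `1e-6` floor for empty bins and the sample data are assumptions. A PSI above roughly 0.2 is a commonly quoted rule of thumb for a significant shift, and should be treated as a starting point, not a fixed threshold.

```python
import math


def psi(reference, current, bins=10):
    """Population Stability Index between two samples of one numeric feature."""
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if the reference is constant

    def proportions(sample):
        counts = [0] * bins
        for value in sample:
            index = int((value - lo) / width)
            counts[min(max(index, 0), bins - 1)] += 1  # clamp out-of-range values
        # A small floor avoids log(0) for empty bins.
        return [max(c / len(sample), 1e-6) for c in counts]

    ref_p, cur_p = proportions(reference), proportions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_p, cur_p))


# Hypothetical feature values seen at training time and in production.
reference = [i / 100 for i in range(100)]
shifted = [0.5 + i / 200 for i in range(100)]

no_drift = psi(reference, reference)
drift = psi(reference, shifted)
drift_detected = drift > 0.2
```

Identical samples give a PSI of zero, while the shifted sample, which occupies only the upper half of the reference range, scores far above the threshold.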
The lifecycle GIF below, by Deepak, captures a simplified lifecycle of an AIOps system.
He also has a DataOps lifecycle GIF, shown below.
Table of Contents
- Stage 1. Problem Formulation
- Stage 2. Project Scoping And Framing The Problem
- Stage 3. Data Pipeline (Data Engineering and DataOps)
- Stage 4. Data Extraction (MLOps), Data Analysis (Data Science), Data Preparation (Data Science)
- Stage 5. Model Development and Training (MLOps)
- Stage 6. Model Evaluation (MLOps)
- Stage 7. Model Validation, Registry and Pushing Model to Production (MLOps)
- Stage 8. Model Serving (MLOps)
- Stage 9. Model Monitoring (MLOps)
- Stage 10. Continuous Integration, Deployment, Learning and Training (DevOps, DataOps, MLOps)
References and Further Readings
Google Cloud. “MLOps: Continuous delivery and automation pipelines in machine learning.”
Huyen, Chip. “Chapter 2. Introduction to Machine Learning Systems Design.” In Designing Machine Learning Systems: An Iterative Process for Production-Ready Applications, O’Reilly Media, Inc., 2022.
Kleppmann, Martin. “Chapter 1. Reliable, Scalable, and Maintainable Applications.” In Designing Data-Intensive Applications. Beijing: O’Reilly, 2017.