Yet ANOther MAchine Learning OPerations Article (YANO MALOPA)
In this article, I will describe the Machine Learning Operations (MLOps) side of the project Aleksei Luchinsky and I built, "Brain Tumor Detection with Topological Data Analysis and Machine Learning Operations". If you want a more in-depth discussion of Topological Data Analysis (TDA), ask Aleksei Luchinsky to write that article.
For my part, the first question to answer is: What is MLOps? I find it helpful to break MLOps into its two pieces: 1) Machine Learning and 2) Operations.
Machine Learning
Let's first contrast machine learning with traditional programming/learning.
Traditional programming takes in data and rules, regulations, steps, and procedures to produce an outcome. In other words:
Traditional Programming = Data + Program ⇨ Outcomes.
Machine learning takes in data and outcomes to produce rules, regulations, steps, and procedures. In other words:
Machine Learning = Data + Outcome ⇨ Program.
Hopefully, this framing demystifies machine learning algorithms. When you hear a data scientist mention a random forest or a neural network, remember: the program/model is really a collection of rules, regulations, steps, and procedures describing the best mapping from the data to their respective outcomes.
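To make the contrast concrete, here is a minimal sketch (illustrative only, not code from our project) using scikit-learn; the brightness rule and the tiny dataset are made up:

```python
# Illustrative only: contrasting a hand-written rule with a learned one.
# The brightness threshold and the tiny dataset are made up.
from sklearn.linear_model import LogisticRegression

# Traditional programming: a human supplies the rule.
def hand_written_rule(brightness: float) -> str:
    return "tumor" if brightness > 0.7 else "no tumor"

# Machine learning: we supply data + outcomes; the algorithm produces the rule.
X = [[0.2], [0.3], [0.8], [0.9]]                 # data (brightness values)
y = ["no tumor", "no tumor", "tumor", "tumor"]   # outcomes
model = LogisticRegression().fit(X, y)           # the learned "program"
print(model.predict([[0.85]]))                   # most likely ['tumor']
```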
(More can be said about machine learning, such as "What criteria are used to define the 'best mapping'?", "What tuning can be done for different algorithms?", or "How does the program explain the mapping between data and output?" All good questions, none answered here.)
Operations
"But these people have never heard of Our Ford, and they aren't civilized," a quote from Brave New World. Aldous Huxley used Henry Ford as a metaphor for mass consumption, religious zeal, and technocracy. I include the quote as a homage to Fordism which I'll use, in turn, as an allusion to operations.
The evolution from traditional machine learning practices to MLOps mirrors the historical shift from artisanal/craft production to Fordism in manufacturing. Just as Fordism revolutionized industrial production with its emphasis on efficiency, standardization, and scalability, MLOps transforms machine learning by embedding similar principles into the development and operation of ML models.
Let's review assembly line production. An assembly line, in an MLOps context, is where different stages of the machine learning lifecycle are streamlined into a continuous integration and continuous deployment (CI/CD) pipeline. Each phase of the pipeline—from data collection and preprocessing to training, validating, and monitoring models—is optimized and automated.
Similar to the assembly line function, MLOps breaks down the machine learning lifecycle into distinct, standardized phases: data ingestion, model training, model tracking & registration, deployment, and monitoring. Each phase is designed to be repeatable and scalable, supported by automated pipelines.
Furthermore, through automation and the use of scalable technologies, models can be retrained and redeployed to handle growing data volumes or to adapt to new conditions without extensive manual effort. The operational efficiency of MLOps, with continuous monitoring and automated adjustments, ensures that models maintain high performance even as operational conditions change.
MLOps in Brain Tumor Detection
The architecture of our MLOps implementation attempts to use only open-source technologies. I believe many readers will be familiar with the technologies in the bottom-center and bottom-left sections of the diagram, so I will skip descriptions of those.
Docker is a tool to build, package, and deploy applications in a consistent, isolated environment. Docker Compose is a tool that lets you define and run multi-container Docker applications: you describe everything your app needs in a simple YAML file and start it all with one command. MLflow is a platform that helps you manage the whole machine learning lifecycle, including experiment tracking, model deployment, and workflow automation. Evidently is a tool that generates detailed reports to monitor your machine learning models in training and production, helping you keep an eye on things like data drift and model performance. Apache Airflow is the king of technologies in our particular MLOps setup. Airflow helps you plan, schedule, and monitor complex workflows: you set up tasks and dependencies in a clear, logical sequence, and Airflow manages the execution for you.
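To make that concrete, here is a minimal, hypothetical Airflow DAG sketch (Airflow 2.x TaskFlow style; the task names and paths are illustrative, not our production code):

```python
# Hypothetical Airflow DAG sketch (Airflow 2.x TaskFlow API).
# Task names and paths are illustrative, not our production code.
from datetime import datetime
from airflow.decorators import dag, task

@dag(schedule=None, start_date=datetime(2024, 1, 1), catchup=False)
def brain_tumor_pipeline():

    @task
    def ingest() -> str:
        # e.g., pull the Kaggle MRI archive into a local folder
        return "/data/raw"

    @task
    def preprocess(raw_dir: str) -> str:
        # trim, resize, and standardize the images
        return "/data/processed"

    @task
    def extract_features(processed_dir: str) -> str:
        # persistence diagrams -> tabular TDA features
        return "/data/features"

    # Dependencies are expressed by chaining the calls.
    extract_features(preprocess(ingest()))

brain_tumor_pipeline()
```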
Airflow is called an orchestrator. In Airflow, you write pipelines to perform particular tasks. These pipelines are called Directed Acyclic Graphs, or DAGs. Three DAGs compose our MLOps system:
1) Data Ingestion and Preprocessing
Sequentially, the first DAG takes in a Kaggle dataset of MRI images. The images undergo preprocessing, viz. trimming the image and standardizing color, shape, and file extension. Then, it creates each MRI's Persistence Diagrams (PD), from which it performs TDA extraction techniques such as Persistence Landscapes, Persistence Blocks, Functional Data Analysis, and a summary technique. This PD-to-feature-extraction step converts the MRI image into a structured tabular format conducive to a set of machine learning algorithms.
(You see why you should nudge Aleksei Luchinsky to write a post?)
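As a rough illustration of the PD-to-features step (assuming the giotto-tda library, which may differ from the exact code in our DAG), the sketch below computes cubical persistence diagrams and a Persistence Landscape feature table:

```python
# Rough illustration of the persistence-diagram-to-features step.
# Assumes the giotto-tda library; our actual DAG code differs and also
# covers Persistence Blocks, FDA, and the summary technique.
import numpy as np
from gtda.homology import CubicalPersistence
from gtda.diagrams import PersistenceLandscape

# images: array of shape (n_samples, height, width), already trimmed,
# grayscaled, and resized by the preprocessing step.
images = np.random.rand(4, 64, 64)  # placeholder data

diagrams = CubicalPersistence().fit_transform(images)        # persistence diagrams
landscapes = PersistenceLandscape().fit_transform(diagrams)  # landscape features
X = landscapes.reshape(len(images), -1)                      # one feature row per MRI
```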
2) Model Training and Experimentation
Once the data is ingested and preprocessed, our second DAG begins. This stage involves training various machine learning models on the extracted data: specifically, XGBoost, Gradient Boosting, Random Forest, Logistic Regression, and Support Vector Machine. The DAG manages the train/test splitting (80/20), k-fold (k=5) cross-validation, feature (regressor) standardization, and a grid search per algorithm. MLflow is used to track each experiment, capture performance metrics, handle model versioning, and serve predictions.
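A simplified sketch of how one such training run could be tracked with MLflow is below (one algorithm and a made-up hyperparameter grid; our DAG loops over all five algorithms and every extraction):

```python
# Simplified, illustrative sketch of the training + tracking step.
# The hyperparameter grid and placeholder data are made up.
import numpy as np
import mlflow
import mlflow.sklearn
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X = np.random.rand(100, 20)             # placeholder TDA feature matrix
y = np.random.randint(0, 2, size=100)   # placeholder tumor (1) / no-tumor (0) labels

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

with mlflow.start_run(run_name="random_forest_example"):
    search = GridSearchCV(
        RandomForestClassifier(random_state=42),
        param_grid={"n_estimators": [100, 300], "max_depth": [None, 10]},
        cv=5,               # k-fold cross-validation, k=5
        scoring="f1",
    )
    search.fit(X_train, y_train)

    mlflow.log_params(search.best_params_)
    mlflow.log_metric("test_f1", search.score(X_test, y_test))
    mlflow.sklearn.log_model(search.best_estimator_, "model")
```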
3) New Data Triggering Model Monitoring
The third DAG triggers when new data is placed into one of two folders. These folders represent a doctor recording whether she found a tumor in an MRI. Once an image is added, the DAG sets off a series of monitoring events.
The image above depicts the architecture just described. The blue blocks represent the first DAG, the red blocks represent the second DAG, and the yellow block represents the third DAG.
A final and important note that cannot be depicted in the image: when new data is added, each extraction is run and appended to its respective extraction's master table. In effect, when the Model Training and Experimentation DAG is rerun (which can be triggered after a certain amount of new data arrives, when data drift becomes unwieldy, etc.), the data being fed into the machine learning models includes all of the Kaggle data in addition to every new image that has been added by the doctor. This is part of the continuous-improvement aspect of MLOps.
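On the monitoring side, a hedged sketch of how a data-drift report can be generated with Evidently (Evidently 0.2+ Report API; the file paths are hypothetical):

```python
# Hedged sketch of the drift check (Evidently >= 0.2 Report API).
# File paths are hypothetical; our DAG wires this into the monitoring step.
import pandas as pd
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset

reference = pd.read_csv("extractions/persistence_blocks_master.csv")   # hypothetical path
current = pd.read_csv("extractions/persistence_blocks_new_batch.csv")  # hypothetical path

report = Report(metrics=[DataDriftPreset()])
report.run(reference_data=reference, current_data=current)
report.save_html("reports/data_drift.html")  # inspect drift per feature
```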
Results
Results for an MLOps process must demonstrate how the system performs when new data is introduced. The new data comes from another Kaggle MRI dataset. That dataset was meant as a multi-class classification problem: MRIs are labeled as no-tumor, glioma, meningioma, or pituitary tumor. For reference, the original dataset contained no-tumor and meningioma images only. We batch-introduced "yes" and "no" images according to their respective folders over four different batches.
Experiment 1: Subset of Sames
In the first experiment, we added a subset of the training images from the new data: 13 meningioma MRIs and 12 non-tumorous MRIs. Results are shown below:
"Dummy (by reference)" is a naive labeler according to the distribution of "yes" and "no" labels in the test data. "Dummy (by current)" is a naive labeler according to the distribution of "yes" and "no" labels in the new data. "Model" is how the model performed on the new data.
Experiment 2: Glioma Test
In the second experiment, two things were different.
In general, glioma does not look like meningioma. Meningiomas show bright spots that indicate a tumor, while gliomas are outlined by a mostly visible circle rather than by brightness. Results are shown below:
Given the "yes" folder MRIs are completely unseen by the model, unsurprisingly, the model fit was worse. I take this as a good sign if only because the results from are lining up with the intuition that the model won't perform as well.
Experiment 3: Triple Training
In the third experiment, we added the full training folder from the new data. This time, in the "yes" folder, we kept only meningioma instead of adding either glioma or pituitary tumors. In all, there are 252 "No" images and 214 "Yes" images, nearly tripling our original data. Results are shown below:
With experiment 3, we added many more images to our dataset. We did this to prepare for experiment 4. Before monitoring experiment 4, we reran the Model Training and Experimentation DAG (which can be triggered after a certain amount of new data arrives, when data drift becomes unwieldy, etc.). Recall the process of the Training and Experimentation DAG, effectively: 1) split each extraction into test and train, and 2) run, fit, and register each extraction's best model.
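As an illustration of how the monitoring side can kick off retraining, here is a hypothetical sketch using Airflow's TriggerDagRunOperator (the DAG ids, threshold, and row count are made up):

```python
# Hypothetical sketch: kicking off the retraining DAG once enough new data
# (or too much drift) has accumulated. DAG ids and threshold are made up.
from datetime import datetime
from airflow import DAG
from airflow.operators.python import ShortCircuitOperator
from airflow.operators.trigger_dagrun import TriggerDagRunOperator

def enough_new_data(threshold: int = 100) -> bool:
    # Hypothetical check: have enough new rows been appended to the
    # master extraction tables since the last retrain?
    new_rows = 120  # placeholder; the real check would query the tables
    return new_rows >= threshold

with DAG("monitoring_retrain_trigger", start_date=datetime(2024, 1, 1),
         schedule=None, catchup=False):
    check = ShortCircuitOperator(task_id="enough_new_data",
                                 python_callable=enough_new_data)
    retrain = TriggerDagRunOperator(
        task_id="trigger_retraining",
        trigger_dag_id="model_training_and_experimentation",  # hypothetical id
    )
    check >> retrain
```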
Experiment 4: CI/CD Re-train Re-monitor
In the fourth experiment, we added the full testing folder from the new data. Here again, in the "yes" folder, we kept only meningioma instead of adding glioma or pituitary tumors. In all, there are 89 "No" images and 79 "Yes" images. For reference data, we switched back to using the conventional testing data. Recall that testing comprises 20% of the reference data. This 20% is split between 86 "No" and 60 "Yes", after all previous batches were added to the original data.
The evaluative tables below are slightly different from the previous three. I wanted these tables to show 1) how the best model fits the new data's testing folder (current), and 2) how the newly trained best model fits the newly partitioned test data.
It turns out that, even with the near-tripling of our data, the best model still came from the Persistence Block extraction. Had it been from another extraction, the Monitoring DAG would have evaluated that extraction's best model on the new data's extraction of the same type.
Overall, the results are exciting; on the relatively small test data, we get an F1 score of 0.762. More importantly, on the new incoming data, the F1 score is 0.883. This is very good and, I think, a well-crafted test case for MLOps in the real world.
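For reference, the F1 score is the harmonic mean of precision and recall; a quick sketch with made-up labels:

```python
# F1 = 2 * precision * recall / (precision + recall); labels are made up.
from sklearn.metrics import f1_score

y_true = ["yes", "yes", "no", "yes", "no"]
y_pred = ["yes", "no",  "no", "yes", "no"]
print(f1_score(y_true, y_pred, pos_label="yes"))  # 0.8
```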
Summary
In conclusion, the successful implementation of MLOps was an exciting process. Not only was it exciting, but we demonstrated the effectiveness of our pipelines by introducing new data in batches and evaluating the performance of the best model on the new data. The results showed that our models maintained a high F1 score across many batches. We demonstrated the continuous-improvement aspect of our MLOps system by retraining models on all the new data the doctor input. Overall, this implementation showcases the power of MLOps to automate machine learning workflows, ensure reproducibility, and adapt to new data in real-world applications.
If you have any questions about the project or the process, feel free to write me; I'm happy to answer. Thanks for reading.