If you’ve been reading
for the past few months, you know he stresses the importance of continuous self-improvement: who you are today is not who you were 5 years ago, and it is not who you should be 5 years from now. In Machine Learning, we call this process CI/CD.
Table of Contents
CI/CD Pipeline for ML
Jenkins
Making a CI/CD Pipeline
1 - CI/CD Pipeline for ML
In Machine Learning Operations (MLOps), the CI/CD pipeline automates and streamlines the process of bringing machine learning models from development to production. Understanding the components of CI/CD and how they apply to ML is essential for any software development company looking to efficiently deploy and manage machine learning applications.
1.1 Continuous Integration (CI)
Continuous Integration (CI) is a development practice in which developers integrate code changes into a shared repository frequently, preferably several times a day. Each integration is then verified by an automated build and test process. The main goals of CI are to detect and resolve conflicts early, improve software quality, and reduce the time it takes to validate and release new software updates.
1.2 Continuous Delivery (CD)
Continuous Delivery (CD) extends CI by automating the rest of the software release process. Beyond testing, this includes pushing code changes to a repository, running automated tests, and deploying code to a production environment. CD minimizes the manual steps required to deploy new features, updates, or fixes, ensuring a smooth and consistent flow from development to deployment.
1.3 Continuous Training (CT)
Continuous Training (CT) is a concept that's gaining traction in the ML community. It refers to continuously retraining and updating machine learning models so that they incorporate new data, adapt to changing environments, and improve in accuracy over time. This involves automatically retraining models with new data, evaluating their performance, and deploying the updated models, all within the same automated workflow.
1.4 Why this matters
CI/CD processes are critical for software development companies for several reasons:
Enhanced Quality and Reliability: Frequent testing and integration help catch and fix issues early, leading to higher quality and more reliable software.
Faster Release Cycles: Automated pipelines speed up the process of getting new features and fixes into production, enabling a faster response to market demands.
Improved Collaboration and Efficiency: CI/CD fosters a collaborative environment where code changes are integrated, tested, and shared quickly, leading to more efficient development cycles.
Reduced Manual Work: Automation in CI/CD reduces manual, error-prone deployment processes, freeing up developers to focus on more strategic tasks.
1.5 The CI/CD Pipeline for Machine Learning Engineers
Building a CI/CD pipeline for ML involves several steps:
Commit Code Changes: Developers push their code changes to a version control system like Git. This could include new model code, updates to existing models, or infrastructure changes.
Automated Build and Test: Upon code commit, the CI system automatically builds the project and runs a series of tests. These tests can range from unit tests and integration tests to model validation tests.
Continuous Training: For ML applications, the CI process can also trigger retraining of models with new or updated datasets.
Deploy to Production: If the build and tests are successful, the CD process automatically deploys the changes to the production environment. This could involve updating a live ML model in a production system.
Monitoring and Feedback: Continuous monitoring of the application and model performance in production is crucial. Feedback from this monitoring can lead to further improvements or adjustments in the model.
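The steps above can be sketched as a sequence of stages where a failure in any stage stops everything after it, which is the behavior a CI/CD server provides. This is only an illustration of the control flow, not how Jenkins is implemented; the stage names and the run_pipeline helper are hypothetical.

```python
from typing import Callable, List, Tuple

# Each stage is a (name, function) pair; the function returns True on success.
Stage = Tuple[str, Callable[[], bool]]


def run_pipeline(stages: List[Stage]) -> List[str]:
    """Run stages in order and stop at the first failure.

    Returns the names of the stages that completed successfully.
    """
    completed = []
    for name, stage in stages:
        if not stage():
            print(f"Stage '{name}' failed; stopping the pipeline.")
            break
        completed.append(name)
    return completed


if __name__ == "__main__":
    # Hypothetical placeholder stages; real ones would call the build,
    # test, training, and deployment tooling.
    stages = [
        ("build", lambda: True),
        ("test", lambda: True),
        ("train", lambda: True),
        ("deploy", lambda: True),
    ]
    print(run_pipeline(stages))
```

If the test stage returned False here, the train and deploy stages would never run, which is the property that keeps a broken model out of production.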
With the basics out of the way, let’s talk about how it’s done.
2 - Jenkins
In MLOps, Jenkins is a widely used tool for automating the CI/CD pipeline. It automates the parts of software development related to building, testing, and deploying, which is what gives us continuous integration and continuous delivery.
Jenkins is an open-source automation server written in Java. It can automate many different stages of the development pipeline, allowing teams to build, test, and deploy applications efficiently.
2.1 Installing Jenkins
The installation process of Jenkins varies based on the operating system:
Windows: Download the Jenkins Windows installer from the Jenkins website, execute the installer, and follow the on-screen instructions.
Mac: Jenkins can be installed on macOS using Homebrew. Run
brew install jenkins-lts
then start Jenkins using
brew services start jenkins-lts
Linux: For most Linux distributions, Jenkins can be installed via the package manager. For instance, on Ubuntu/Debian, once the Jenkins apt repository has been added and a supported Java runtime is installed (see the official Jenkins installation guide), you can use:
sudo apt-get update
sudo apt-get install jenkins
sudo systemctl start jenkins
2.2 Preliminary work to set up a CI/CD Pipeline with Jenkins
Step 1: Develop the Codebase
First, create the necessary files for your ML application:
main.py: This is your application file. It could be a FastAPI instance with a data model and a /predict endpoint for making predictions.
requirements.txt: List all the Python dependencies your application needs.
Dockerfile: This file contains instructions for Docker to build an image of your application.
Step 2: Create a Personal Access Token on GitHub
You need a Personal Access Token to integrate Jenkins with GitHub:
Go to your GitHub account settings.
Under Developer settings, choose Personal access tokens, then generate a new token.
Make sure that you select the appropriate scopes/permissions for the token.
Step 3: Create a Webhook on the GitHub Repository
A webhook triggers Jenkins to start a build process whenever changes are pushed to the repository:
In your GitHub repository, go to Settings, then Webhooks.
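Click Add webhook and set the Payload URL to your Jenkins server's webhook endpoint. With the Jenkins GitHub plugin, this is your Jenkins URL followed by /github-webhook/.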
Choose the events that you want to trigger your webhook.
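The same webhook can also be registered through GitHub's REST API instead of the web UI, using the Personal Access Token from Step 2. The sketch below only builds the request and does not send it. The function name and all argument values are hypothetical, and it assumes the Jenkins GitHub plugin is installed so that the /github-webhook/ endpoint exists.

```python
import json
import urllib.request


def build_webhook_request(owner, repo, token, jenkins_url):
    """Build (but do not send) the request that registers a push webhook."""
    payload = {
        "name": "web",          # GitHub expects "web" for ordinary webhooks
        "active": True,
        "events": ["push"],     # trigger on pushes only
        "config": {
            # Endpoint exposed by the Jenkins GitHub plugin (assumed installed).
            "url": jenkins_url.rstrip("/") + "/github-webhook/",
            "content_type": "json",
        },
    }
    return urllib.request.Request(
        url=f"https://api.github.com/repos/{owner}/{repo}/hooks",
        data=json.dumps(payload).encode("utf-8"),
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/vnd.github+json",
        },
    )


if __name__ == "__main__":
    # Placeholder values; nothing is sent over the network here.
    req = build_webhook_request(
        "your-user", "your-repo", "YOUR_TOKEN", "http://localhost:8080"
    )
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
```

Passing the returned request to urllib.request.urlopen would actually create the webhook, so the token needs permission to manage hooks on that repository.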
Step 4: Configure Jenkins
Open Jenkins in your browser (usually on http://localhost:8080).
Install the necessary plugins (such as the GitHub and Docker plugins) if they are not already installed.
Create a new item (job) for your ML project.
In the job configuration, set up the GitHub repository URL and credentials using the Personal Access Token.
Define the build triggers, like triggering a build on a GitHub webhook.
In the build steps, define the commands or scripts to run tests, build the Docker image, and push it to a Docker registry.
Now that the setup work is out of the way, let’s use the configuration and files above to make a CI/CD pipeline with Jenkins.
3 - Making a CI/CD Pipeline
Every good CI/CD Pipeline for Machine Learning Engineers needs to have 4 things taken care of:
GitHub to Container: This process involves taking code changes from GitHub and automatically deploying them in a containerized environment
Continuous Model Training: This process will automate the training of your ML model with new data or code changes
Automated Model Testing: This process will ensure model accuracy and performance through automated testing
Deployment Status Email: This process will keep your team updated on the deployment status through automated emails
3.1 GitHub to Container
This process involves taking code changes from GitHub and automatically deploying them in a containerized environment
Jenkins Job Configuration:
In Jenkins, create a new job for your GitHub repository.
Under Source Code Management, link your GitHub repository.
Set up build triggers, for example, triggering a build on every push to the repository.
Build Script:
In the build section of your Jenkins job, create a script to build a Docker image.
Sample build script in Jenkins:
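#!/bin/bash
docker build -t yourdockerhubusername/ml-model:latest .
docker push yourdockerhubusername/ml-model:latest
This script builds a Docker image from your Dockerfile, tags it with your Docker Hub repository name, and pushes it to that repository. The image must be tagged with the repository name for the push to succeed, and the Jenkins agent needs to be logged in to Docker Hub (docker login) beforehand.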
3.2 Continuous Model Training
This process will automate the training of your ML model with new data or code changes
Training Script:
Add a script in your repository, say train_model.py, that trains your model.
Use data from accessible storage or download it within the script.
Jenkins Build Step for Training:
Add a step in your Jenkins pipeline to run this training script.
Example command in Jenkins build step:
python train_model.py
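As a rough idea of what train_model.py could contain, here is a dependency-free sketch that fits a straight line by least squares and writes the parameters to a file. The model, the inline data, and the model.json output path are all placeholders chosen for illustration; a real script would load your dataset and use your actual training code.

```python
import json
from typing import List, Tuple


def fit_line(xs: List[float], ys: List[float]) -> Tuple[float, float]:
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


def save_model(slope: float, intercept: float, path: str) -> None:
    """Write the fitted parameters to a JSON file for later stages to load."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump({"slope": slope, "intercept": intercept}, f)


if __name__ == "__main__":
    # Hypothetical inline data; a real script would read it from storage.
    xs = [1.0, 2.0, 3.0, 4.0]
    ys = [3.0, 5.0, 7.0, 9.0]
    slope, intercept = fit_line(xs, ys)
    save_model(slope, intercept, "model.json")  # placeholder artifact path
    print(f"slope={slope:.3f} intercept={intercept:.3f}")
```

Writing the trained model to a file lets the later testing and Docker build steps pick it up from the Jenkins workspace.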
3.3 Automated Model Testing
Testing Script:
Write a script, e.g., test_model.py, that tests your model's performance.
It should output pass/fail status based on predefined criteria.
Jenkins Build Step for Testing:
Add a step in your Jenkins pipeline to execute the testing script.
Jenkins command:
python test_model.py
If the script exits with a non-zero status, Jenkins marks the build as failed, halting the pipeline.
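A minimal sketch of how test_model.py might turn predefined criteria into a pass/fail result is shown below. The metric names, the threshold values, and the helper functions are assumptions made for illustration; a real script would compute the metrics from a held-out dataset.

```python
import sys
from typing import Dict, List


def failed_checks(metrics: Dict[str, float], minimums: Dict[str, float]) -> List[str]:
    """Return the names of metrics that are missing or below their minimum."""
    return [
        name
        for name, minimum in minimums.items()
        if metrics.get(name, float("-inf")) < minimum
    ]


def exit_code(metrics: Dict[str, float], minimums: Dict[str, float]) -> int:
    """Return 0 when every check passes and 1 otherwise."""
    failures = failed_checks(metrics, minimums)
    for name in failures:
        print(f"FAIL: {name}={metrics.get(name)} is below {minimums[name]}")
    return 1 if failures else 0


if __name__ == "__main__":
    # Hypothetical numbers; real metrics would come from evaluating the model.
    metrics = {"accuracy": 0.93}
    minimums = {"accuracy": 0.90}
    code = exit_code(metrics, minimums)
    if code != 0:
        sys.exit(code)  # a non-zero exit status fails the build step
    print("PASS")
```

Keeping the thresholds in one dictionary makes the acceptance criteria easy to review and change alongside the model code.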
3.4 Deployment Status Email
This process will keep your team updated on the deployment status through automated emails
Configure Email Notification in Jenkins:
Install the 'Email Extension Plugin' in Jenkins.
In post-build actions, configure the email notifications.
Set recipients, email content, and triggers (e.g., email on build failure or successful deployment).
Sample Email Content:
Customize the content to include build status, any error logs, and links to the Jenkins build page for more details.
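To make the suggested content concrete, the sketch below assembles a status message with the build status, a link to the build page, and an optional log excerpt. It is a standalone illustration using Python's standard library, not the Email Extension Plugin's own templating; the function name and the example values are hypothetical.

```python
from email.message import EmailMessage
from typing import List


def build_status_email(
    job: str,
    number: int,
    status: str,
    build_url: str,
    recipients: List[str],
    log_tail: str = "",
) -> EmailMessage:
    """Assemble a plain-text status email for one build."""
    msg = EmailMessage()
    msg["Subject"] = f"[{status}] {job} #{number}"
    msg["To"] = ", ".join(recipients)
    lines = [
        f"Job: {job}",
        f"Build: #{number}",
        f"Status: {status}",
        f"Details: {build_url}",
    ]
    if log_tail:
        # Include a short excerpt of the error log when one is available.
        lines += ["", "Last log lines:", log_tail]
    msg.set_content("\n".join(lines))
    return msg


if __name__ == "__main__":
    # Example values only; Jenkins exposes the job name, build number,
    # and build URL to build steps as environment variables.
    email = build_status_email(
        "ml-model", 42, "FAILURE",
        "http://localhost:8080/job/ml-model/42/",
        ["team@example.com"],
        log_tail="AssertionError in test_model.py",
    )
    print(email)
```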
[Figure: what the pipeline looks like in the real world]
Jenkins is well known, but I avoid it whenever possible and strongly prefer GitLab Runner or GitHub Actions.
Jenkins is a pain to install and administer. For one thing, you need to install about 100 plugins before it is useful. This is extremely difficult in corporate environments with strict computing policies and/or air-gapped networks. Administration tasks, such as management of secrets, group permissions, SSO integration, and RBAC policies, are all difficult compared to alternatives.
With GitHub Actions or GitLab Runner, you get tight integration with your VCS and much simpler installation and administration.