MLOps 5: Git & GitHub Fundamentals
Git Concepts, Git Workflow, Getting Started with Git & GitHub, Git Commands
We haven’t touched MLOps (Machine Learning Operations) in quite some time. It would be a good idea to look at some of the previous relevant posts as a nice refresher:
After finishing the previous Python series, you should be great at wrangling data. In the real world, that kind of work is automated under the umbrella of "MLOps", instead of keeping the same old Python file around and manually executing it over and over again.
Table of Contents
Git Concepts
Git Workflow
Getting Started with Git & GitHub
Git Commands
1 - Git Concepts
Machine Learning Operations (MLOps) involves the end-to-end machine learning lifecycle. So, when discussing MLOps, understanding version control is crucial. Git and GitHub are often pivotal in this journey. Git is a version control system and GitHub is a hosting platform for Git repositories. Let's dive into these concepts.
1.1 Git
Git is a distributed version control system designed to handle everything from small to very large projects with speed and efficiency. The primary reasons why developers use Git include:
Version Control: Git allows multiple people to work on the same project simultaneously. Each time a developer records a version of their work, it is stored as a "commit" with a unique ID, which makes it possible to retrieve that exact version of the project later.
Branching: Developers can branch off from the "main" line of development and work there without affecting it. This is particularly useful when developing features or trying out new ideas.
Merging: After finishing work on a branch, developers can merge their changes back to the main branch. Git has built-in mechanisms to help resolve potential conflicts.
Distributed Nature: Every Git directory on every computer is a full-fledged repository. It has a complete history and full version-tracking capabilities, and it works independently of network access or a central server.
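To make these ideas concrete, here is a minimal sketch that commits, branches, and merges inside a throwaway directory. The file name, branch name, and "Demo User" identity are placeholders I made up for illustration, and the merge here is the simplest possible case with no conflicts.

```shell
# Throwaway sandbox so nothing touches a real project.
demo="$(mktemp -d)"
cd "$demo"
git init --quiet
git symbolic-ref HEAD refs/heads/main     # call the first branch "main" on any Git version
git config user.name "Demo User"          # placeholder identity, local to this repository
git config user.email "demo@example.com"

# Version control: every commit gets its own unique ID.
echo "print('baseline model')" > train.py
git add train.py
git commit --quiet -m "Add baseline training script"

# Branching: try an idea without touching main.
git checkout --quiet -b feature/new-model
echo "print('new model')" >> train.py
git commit --quiet -am "Try a new model"

# Merging: bring the finished work back into main.
git checkout --quiet main
git merge --quiet feature/new-model

git log --oneline                         # two commits, each with its own short ID
```

Everything above runs offline, which is the "distributed" part in practice: the full history lives in the local directory.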
1.2 GitHub
GitHub is a cloud-based platform where individuals, teams, and organizations can store Git repositories, collaborate on code, and manage projects. It extends the capabilities of Git with a graphical interface, issue tracking, and continuous integration options. It also provides a community aspect where developers can collaborate on open-source projects.
Some unique features of GitHub include:
Forking: This is the process of creating a copy of a repository from one user's account to another. This allows you to make changes without affecting the original project.
Pull Requests: After you've forked a repository, made changes, and want those changes to be seen or incorporated in the original repository, you can create a pull request. The owners of the original repository can then review your changes and decide whether to merge them.
GitHub Actions: An automation feature that allows users to set up CI/CD workflows directly in their repository.
GitHub Pages: A feature that allows users to turn repositories into live websites to showcase their projects.
1.3 Differences
Nature: Git is a version control system. GitHub is a hosting service for Git repositories.
Accessibility: Git is primarily a command-line tool, while GitHub provides a web-based graphical interface.
Usage: You can use Git locally on your computer even without an internet connection. In contrast, while GitHub does have a desktop version, its primary utility is as a cloud service.
Functionality: Git focuses on version control and code sharing. GitHub introduces more features like pull requests, issue tracking, and a social aspect. With the social aspect, developers can follow each other, star repositories, and more.
To summarize, while Git and GitHub are distinct, they complement each other. Git provides the tools for tracking and managing code changes. GitHub offers a space for collaboration and broader project management in the cloud. In the realm of MLOps, understanding and utilizing both is essential. It allows for effective collaboration and seamless deployment of machine learning models.
2 - Git Workflow
A picture is worth a thousand words, so here’s a picture that describes the Git workflow.
And here are a few short paragraphs that explain each of these concepts in a bit more detail.
2.1 Components
Working Directory:
This is your local directory where you write, edit, and delete files. It's essentially your workspace.
Staging Area:
Before committing changes to the local repository, changes are first listed in what's called the staging area. This allows for a review of what will go into the next commit and ensures more granular control over versions.
Local Repository:
This is the Git repository stored on your local machine. It contains all the commit history of the project.
Remote Repository (e.g., GitHub):
This is the repository hosted on a remote server, allowing multiple people to collaborate on the same project.
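As a rough illustration of the first three components, the sketch below follows one file from the working directory to the staging area to the local repository. The file name and identity are placeholders, and the remote repository is left out on purpose.

```shell
repo="$(mktemp -d)"
cd "$repo"
git init --quiet
git config user.name "Demo User"          # placeholder identity, local to this repository
git config user.email "demo@example.com"

echo "learning_rate: 0.01" > config.yaml
git status --short        # "?? config.yaml": exists only in the working directory

git add config.yaml
git status --short        # "A  config.yaml": now listed in the staging area

git commit --quiet -m "Add training config"
git status --short        # no output: the change is stored in the local repository
```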
2.2 Operations
Create:
This operation initializes a new Git repository in an existing directory, so that Git can begin tracking the files you add.
Action:
git init
to initialize a new repository.
Update:
You can update your repository by adding new changes/files or by fetching changes from a remote repository.
Action:
git add [filename]
to add changes to the staging area.
git commit -m "Your message here"
to commit changes from the staging area to the local repository.
git pull
to fetch and merge changes from the remote repository.
git push
to push changes to the remote repository.
Changes:
View changes between the working directory, staging area, and local repository.
Action:
git status
gives a summary of all changes.
git diff
provides a detailed view of line-by-line changes.
Revert:
If you made an error, you can undo it or go back to a previous commit.
Action:
git revert [commit ID]
creates a new commit that undoes the changes from the specified commit.
git reset
can be used to reset the staging area, the local repository, or both to a previous state.
Difference:
Understand the differences between commits, branches, the staging area, and the working directory.
Action:
git diff
is a versatile tool to spot differences. For instance,
git diff HEAD
shows differences between the working directory and the last commit. You can also compare different commits using their commit IDs.
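Here is a small sketch of the Changes and Revert operations together: inspect a change with git diff HEAD, commit it, then undo it with git revert. The file, its contents, and the commit messages are invented for the example.

```shell
repo="$(mktemp -d)"
cd "$repo"
git init --quiet
git config user.name "Demo User"          # placeholder identity, local to this repository
git config user.email "demo@example.com"

echo "epochs=10" > params.txt
git add params.txt
git commit --quiet -m "Set epochs to 10"

echo "epochs=1000" > params.txt
git diff HEAD             # shows -epochs=10 / +epochs=1000 against the last commit
git commit --quiet -am "Set epochs to 1000 (a mistake)"

# Undo the mistake with a new commit instead of rewriting history.
git revert --no-edit HEAD
cat params.txt            # epochs=10
git log --oneline         # three commits: the original, the mistake, and the revert
```

I reached for git revert rather than git reset here because it keeps the mistake visible in the history, which tends to be the safer choice once commits have been shared with others.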
2.3 Recommended Workflow
Pull Latest Changes: Always start by pulling the latest changes from the remote repository to ensure you're working with the latest code.
Action:
git pull origin [branch-name]
Make Changes in Working Directory: Edit, add, or delete files in your working directory.
Move to Staging: Before committing, review changes and move them to the staging area.
Action:
git add .
to add all changes, or
git add [filename]
to add specific files.
Commit Locally: Commit the staged changes to your local repository with a meaningful message.
Action:
git commit -m "Detailed commit message"
Push to Remote: After ensuring your changes are as desired, push them to the remote repository.
Action:
git push origin [branch-name]
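The five steps above can be sketched end to end without a GitHub account by letting a local bare repository play the role of the remote. The paths, file names, and identity below are assumptions for the demo; with a real project, origin would point at your GitHub repository instead.

```shell
sandbox="$(mktemp -d)"
git init --quiet --bare "$sandbox/remote.git"     # stand-in for a GitHub repository

# One-time setup: a local repository with a first commit on "main", linked to the remote.
git init --quiet "$sandbox/work"
cd "$sandbox/work"
git symbolic-ref HEAD refs/heads/main
git config user.name "Demo User"                  # placeholder identity
git config user.email "demo@example.com"
echo "# demo project" > README.md
git add README.md
git commit --quiet -m "Initial commit"
git remote add origin "$sandbox/remote.git"
git push --quiet -u origin main

# The day-to-day cycle:
git pull --quiet origin main                      # 1. pull the latest changes
echo "notes for the next experiment" > notes.txt  # 2. make changes in the working directory
git add notes.txt                                 # 3. move them to staging
git commit --quiet -m "Add experiment notes"      # 4. commit locally
git push --quiet origin main                      # 5. push to the remote
```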
3 - Getting Started with Git & GitHub
In the world of MLOps, collaboration, version control, and code integration are crucial. Git and GitHub provide a seamless platform for these tasks. If you're new to Git or GitHub, this guide will walk you through the basics of setting them up and getting started.
3.1 Installing Git
Linux (Debian/Ubuntu):
sudo apt-get update
sudo apt-get install git
Mac:
Assuming you already have Homebrew installed:
brew install git
Windows:
Download and run the installer from Git for Windows.
I personally don’t like working with command lines, so I prefer having a graphical user interface (GUI). Here are a few GUI options for Git:
SourceTree: Available for Mac and Windows, this is a popular, free tool.
GitKraken: Cross-platform (Windows, Mac, and Linux) and offers a more visual representation of repositories.
TortoiseGit: Specifically for Windows, integrates directly into the Windows shell.
3.2 GitHub & GitHub Desktop
Creating a GitHub Account:
Go to GitHub's website.
Click the “Sign Up” button.
Provide the required details (username, email, password) and follow the on-screen instructions to complete the process.
It's recommended to verify your email address as some repositories and actions require a verified email.
If you prefer having a graphical interface for GitHub, you can grab a tool called GitHub Desktop:
Visit the official GitHub Desktop website.
Download the version suitable for your operating system.
Once downloaded, run the installer and follow the on-screen instructions.
After installation, open GitHub Desktop and sign in using your GitHub account details.
4 - Git Commands
As Machine Learning becomes increasingly interwoven with software development, understanding and using Git becomes paramount for MLOps. Git provides version control, allowing multiple contributors to work on a project without overwriting each other's contributions. Below, we'll cover essential Git commands that every Machine Learning Engineer (MLE) should know, both to understand their function and when to use them.
4.1 Setup
Purpose: Before you begin using Git, you need to identify yourself to ensure your commits are correctly attributed to you.
Command:
Set your name:
git config --global user.name "Your Full Name"
Set your email:
git config --global user.email "youremail@example.com"
When to use: You'll typically set this up when you first install Git on a new system or when you're configuring a new work environment.
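As a side note, the same settings can be scoped to a single repository by dropping --global, which can be handy when one project needs a different email. The sketch below does that in a throwaway directory and reads the values back; the name and email are the same placeholders used above.

```shell
repo="$(mktemp -d)"
cd "$repo"
git init --quiet

# Without --global, the identity applies to this repository only.
git config user.name "Your Full Name"
git config user.email "youremail@example.com"

git config user.name          # prints: Your Full Name
git config --list --local     # everything configured for this repository
```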
4.2 New Repository
Purpose: This is about starting a new project or obtaining a copy of an existing one.
git init: Initializes a new Git repository in the current directory. Files are only tracked once you add them.
git init
git clone [url]: Clone (or copy) a repository from an existing URL.
git clone https://github.com/username/repo-name.git
When to use:
git init: When starting a new project from scratch or when you want to start version-controlling an existing project.
git clone: When you want to work on a project that already exists on a platform like GitHub.
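To see both commands side by side, this sketch creates a repository with git init and then copies it with git clone. It clones from a local path rather than a GitHub URL so it runs offline; the directory names and file are made up for the example.

```shell
sandbox="$(mktemp -d)"

# git init: start version-controlling a directory.
git init --quiet "$sandbox/original"
cd "$sandbox/original"
git config user.name "Demo User"          # placeholder identity
git config user.email "demo@example.com"
echo "pandas" > requirements.txt
git add requirements.txt
git commit --quiet -m "Add requirements"

# git clone: copy an existing repository, history included.
git clone --quiet "$sandbox/original" "$sandbox/copy"
cd "$sandbox/copy"
git log --oneline          # the same commit as in the original
git remote -v              # "origin" points back at the source repository
```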
4.3 Update
Purpose: To synchronize your local repository with the remote repository.
git pull: Fetches the changes from a remote repository and merges them into your current branch.
git pull
git fetch [alias]: Downloads objects and refs from another repository. It fetches the changes without merging them.
git fetch origin
When to use:
git pull: When you know that changes on the remote repository need to be integrated into your current working branch.
git fetch: When you want to see the changes in the remote repository before merging them.
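The difference between the two is easier to see in action. In this sketch a second local repository stands in for the remote, a "teammate" adds a commit there, and we first fetch (look, don't touch) and then pull. All names and file contents are placeholders.

```shell
sandbox="$(mktemp -d)"
git init --quiet "$sandbox/upstream"      # plays the role of the remote repository
cd "$sandbox/upstream"
git symbolic-ref HEAD refs/heads/main
git config user.name "Demo User"          # placeholder identity
git config user.email "demo@example.com"
echo "v1" > model_card.md
git add model_card.md
git commit --quiet -m "First version"

git clone --quiet "$sandbox/upstream" "$sandbox/local"

# A teammate adds a commit upstream after we cloned.
echo "v2" > model_card.md
git commit --quiet -am "Second version"

cd "$sandbox/local"
git fetch --quiet origin
git log --oneline HEAD..origin/main   # lists "Second version": downloaded but not merged
cat model_card.md                     # still v1

git pull --quiet origin main          # fetch + merge in one step (a fast-forward here)
cat model_card.md                     # now v2
```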
4.4 Changes
Purpose: To track and update changes to your codebase.
git status: Shows the status of changes as untracked, modified, or staged.
git status
git add [file]: Adds changes in the working directory to the staging area. It tells Git you want to include updates to these files in the next commit.
git add filename
git commit: Records a snapshot of the staged changes in your local repository, together with a message describing them.
git commit -m "Descriptive message about the commit"
git push: Pushes commits to the remote repository.
git push origin branch-name
When to use:
After making any changes in your code, you'd generally follow the sequence: git status -> git add -> git commit -> git push. This sequence allows you to track, snapshot, and push your changes.
4.5 Revert
Purpose: To undo or revisit past changes.
git log: Shows a log of all the commits made.
git log
git checkout: Switches between branches or restores working tree files. It can also be used to view the project as it was at a previous commit.
git checkout commitID
When to use:
git log: When you want to view the commit history.
git checkout: When you want to switch to another branch, go back to a previous version of your code, or check out code from a previous commit.
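A quick sketch of the two commands together: list the history, jump back to an earlier commit to look around, then return to the branch. The file and commit messages are invented, and HEAD~1 is just one convenient way to name "the commit before the latest one"; a commit ID copied from git log works the same way.

```shell
repo="$(mktemp -d)"
cd "$repo"
git init --quiet
git symbolic-ref HEAD refs/heads/main
git config user.name "Demo User"          # placeholder identity
git config user.email "demo@example.com"

echo "v1" > model.txt
git add model.txt
git commit --quiet -m "Model v1"
echo "v2" > model.txt
git commit --quiet -am "Model v2"

git log --oneline                         # newest first: "Model v2", then "Model v1"

previous="$(git rev-parse HEAD~1)"        # ID of the commit before the latest one
git checkout --quiet "$previous"          # view the project as of that commit (detached HEAD)
cat model.txt                             # v1

git checkout --quiet main                 # return to the tip of the branch
cat model.txt                             # v2
```

Checking out a commit ID like this only moves your view of the project; the newer commits stay in the history and come back as soon as you check out the branch again.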
Final Note for Machine Learning Professionals:
For Machine Learning Engineers, Git is not only for code. Whether you are versioning your data, tracking experiments, or collaborating on model development, using Git will enhance the reproducibility of your projects and make collaboration easier. Always remember to commit small changes often instead of large, infrequent ones. This makes it easier to track progress, debug issues, and collaborate with others.