Graph Theory
Graphs are all around us. The social networks that connect us, the infrastructure that powers our cities, and even the circuit diagrams that bring our technology to life are all examples of graphs. But what exactly is graph theory and how can understanding it help us make sense of the connected world?
I first learned about graph theory in one of my college math classes. It was sort of an abstract concept at first - just nodes and edges on a page. Graphs are everywhere.
The subway map in the city was a graph, with stations as nodes and train lines as edges. Your Facebook friends list was a graph. Even the wiring under the hood of your car forms a graph of connections. Graphs are a simple yet powerful way to represent any system made up of interrelated parts.
Formally, a graph is made up of vertices (also called nodes or points) connected by lines or arcs called edges. The vertices can represent people, computers, molecules - anything that can be connected to something else. The edges show the relationships or interactions between those things. A graph gives us a structured way to model relationships visually. Once you have a representation of a system as a graph, you can start to analyze it mathematically.
One of the basic questions graph theorists ask is: How connected is this graph? One way to measure that is through graph invariants like diameter, density, and connectivity. The diameter tells you the longest distance between any two vertices when moving along edges. Density measures how many edges there are relative to the total possible. Connectivity reveals whether the graph is made of one connected component or multiple disconnected subgraphs. Understanding these metrics helps us evaluate things like network resilience, traffic patterns, and disease spread.
Algorithms that operate on graphs also come up frequently in computer science. Shortest path problems, minimum spanning trees, and graph traversals all use graph representations. The algorithms power everything from GPS route planning and social networking recommendations to VLSI circuit design. I find it amazing that many complex real-world issues can often be boiled down to a graph and solved through algorithmic graph operations.
Graphs with Git
Git stores content in a data structure called the Git object database. At its core, this database represents Git's revision history as a directed acyclic graph (DAG) of commits (DAG vs. tree using Git). Each commit in the history is a vertex in the graph, while the parent-child relationships between commits form the edges. For example, when you create a commit, it will reference the commit(s) that came directly before it as its parent(s) - this establishes edges in the commit-graph.
This commit-graph data structure allows Git to do really powerful things. Since it's a graph, it can represent complex commit histories involving branches and merges as they happen in the real world of development. Traditional version control systems at the time stored histories as linear sequences, limiting how history could evolve. But Git's DAG lets histories branch and converge freely, closely mirroring how codebases are developed and maintained in real-life projects.
Other objects in the Git database like trees, blobs, and tags end up forming secondary graphs connected to the commit-graph. For example, each commit points to a tree object containing its content, while trees point to blob files. So the trees and blobs together give a snapshot of the code tree structure at each commit vertex in the main graph. These connected subgraphs allow traversing between versions of content, authors, timestamps, and more through the graph connections.
VISUALIZING YOUR GIT REPOSITORY
The main point of using a visualizer is to help you make sense of your branch history. For example, to list all commits in your repository at the command line, you could do git log --oneline --abbrev-commit --all
which will get you this flattened view
git log --oneline --abbrev-commit --all --graph
GARBAGE COLLECTION and GIT AMEND
You write some code and check it in. Then, you realize you forgot to run the tests, so you run them, and they uncover a syntax error. Or you spot a typo. For whatever reason, you weren't done when you thought you were done.
Think like (a) git
When I just started learning git
, I added a new commit for each change but these changes are doing the same feature. The first one would say "Add feature X", and the next several would have messages like "Fix typo" "Fix bug" or "Fix sonarqube". That makes my git graph a mess, and unclear for each commit. And then I didn't know which this commit was for which feature.
Git gives you another option: you can tack the new change onto the previous commit using git commit --amend
. This keeps all of your related changes bundled together in one commit, so you can understand it more quickly when you're reviewing it later.
When you amend a commit in Git, it works by creating a brand new commit object that replaces the original. This new commit references the same tree as the old one but has changes like modified metadata or updated commit message. Behind the scenes, Git links the new commit back to the original's parent(s). But it doesn't delete the old commit object right away - that's where GC comes in.
Since Git is designed around its internal object database, it needs some housekeeping to avoid cluttering up disk space with stale, unreferenced objects over time. This is the job of Git's automatic garbage collector. By default, Git runs GC when running other commands if too many loose objects (not packed in a delta-compressed file) exist. It detects unreachable objects that aren't ancestors of any references like branches or tags.
These unreferenced commits from amend operations will be identified as garbage. GC then consolidates them into a packed file to reclaim disk space. This cleaning happens transparently without affecting your work.
You can rewrite the commit message with git commit --amend -m "new commit message"
or keep the old one git commit --amend --no-edit
GIT CHERRY-PICK
A Git commit's ID is a hash of both its contents and its history. So, even if you have two commits that introduce the exact same change, if they point to different parent commits, they'll have different IDs.
What git cherry-pick does is take a commit from somewhere else, and "play it back" wherever you are right now. Because this introduces the same change with a different parent, Git builds a new commit with a different ID
If you were at node H in this graph, and you typed git cherry-pick E
you'd wind up with a copy of commit E—let's call it "E prime" or E'—that pointed to H as its parent, like so:
Git Rules
- Do works in a feature branch (create feature branches from the
develop
branch. - Never push into the
develop
ormaster
branch. Make a Merger Request(Gitlab)/ Pull Request(GitHub). - Update your local
develop
branch and do an interactive rebase before pushing your feature and making a PR/ MR. - Resolve potential conflicts while rebasing and before making a PR/ MR.
- Delete local and remote feature branches after merging (Feature branches should only exist while the work is still in progress, avoid dead branches).
- Before making a PR/MR, make sure your feature branch builds successfully and passes all tests in CI/ CD.
- Use
.gitignore
file (system files/env files.. that should not be sent with your code into a remote repository). - Protect
master
,stage
,develop
branches (Avoid unexpected and irreversible changes).
Git Workflow
- Start with a new project.
cd <path_to_project_folder>
git init
git checkout -b master
git add .
git commit -m "Initial commit"
git remote add origin <git_url/ssh_url>
git push origin master
- Check out a new feature branch.
git checkout -b <branch_name>
- Add changes.
git add -p
git commit
git add -p
will give you a chance to review all of the introduced changes one by one and decide whether to include them in the commit or not. git commit
will start an editor that lets you separate the subject from the body.
- Sync with the remote repository to get changes of anyone in your team just merged into develop.
git checkout develop
git pull
- Update the feature branch with the latest changes from the
develop
branch by interactive rebase.
git checkout <branch_name>
git rebase -i --autosquash develop
You can use --autosquash
to squash all your commits to a single commit/ or not based on your team. For me, 1-4 commits for each feature.
- If you don’t have conflicts, just go ahead. If you have, resolve them and continue to rebase.
git add .
git rebase --continue
- Push to feature branch after rebasing. Rebase will change history, so you'll have to use
-f
to force changes into the remote branch.
git push origin <your_current_feature_branch_name> -f
- Create a PR/ MR.
- PR/MR will be accepted or commented on, merged, and closed by a reviewer. If the reviewer commented, based on it you have to update the code or not then use
git amend
to append modify to the previous commit or create a new commit. - Remove your local feature branch if you're done.
git branch -D <branch_name>
git fetch -p && for branch in git branch -vv --no-color | grep ': gone]' | awk '{print $1}'
; do git branch -D $branch; done
Conclusion
I hope this discussion gave you some useful insight into the interesting connections between Git, its underlying data model, graph theory concepts, and Git workflow. While version control systems and graph databases may seem like different topics, at their core they are both capturing and querying directed acyclic graphs of information.
Git's revolutionary approach of modeling code history as a graph is what allows it to represent complex project histories flexibly. This relies on graph theory constructs like vertices, edges, and parent-child relationships between commits. Secondary subgraphs and garbage collection keep the object database organized over time.
Top comments (4)
Nicely explained. I like when someone explains data structures connecting real world tech than just blindly doing leetcode.
Thanks for the kind feedback!
wonderful. You opened my brain
Yeah, it's a great article.