What does it mean that we talk about snapshots of our Git repository, while in Subversion we think in terms of file changes? For me at least, the key to understanding Git is that every commit is, in fact, a snapshot of the entire project. Not a list of patches. Not a difference to the previous commit. Just a snapshot of the whole thing.
Git snapshots everything, they said. Coming from Subversion, this is hard to believe. How would a version control system scale if it stored the entire project state again and again, with each and every commit?
First, let's do a little experiment on how one might approach version control intuitively, without considering either Git or Subversion.
Poor Man's Version Control
Let's say we have a project called myapp, stored in a directory of that name. All it contains is a main.c file:
myapp
main.c
Without version control, how would you track changes in order to be able to restore a particular state later? Easy enough, you might say: just create a copy of the entire myapp directory and call it something like myapp-<version>. After a while, you would end up with a bunch of backup directories:
myapp-01
main.c
myapp-<...>
main.c
myapp-<N>
main.c
To step back to a previous state in history, you might go and replace the entire myapp directory with one of these previously created snapshots.
In order to avoid wasting space by keeping so much redundant information, you might consider putting everything into a gzipped archive and deleting all the backup directories:
tar -cvzf myapp.tgz myapp-*
rm -r myapp-*
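Stepping back to an earlier state still works once the snapshots live in the archive. The following is a minimal sketch of that round trip; the two snapshot directories and their file contents are made up for illustration, and everything happens in a temporary directory:

```shell
#!/bin/sh
set -eu

# Work in a throwaway directory so nothing real is touched.
workdir=$(mktemp -d)
cd "$workdir"

# Two hypothetical snapshots of the project.
mkdir myapp-01 myapp-02
echo 'int main(void) { return 0; }' > myapp-01/main.c
echo 'int main(void) { return 1; }' > myapp-02/main.c

# Archive the snapshots and delete the backup directories, as above.
tar -czf myapp.tgz myapp-*
rm -r myapp-*

# Restore snapshot 01: extract only that directory from the archive
# and make it the current project directory.
tar -xzf myapp.tgz myapp-01
rm -rf myapp
mv myapp-01 myapp

restored=$(cat myapp/main.c)
echo "$restored"
```

Note that every snapshot in the archive is a complete copy of the project, which is exactly the property the rest of this post relies on.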
Interestingly, this naive approach is not completely different from the way Git actually works.
How Git does it
Every time you create a commit, Git takes the content of each added or modified file, compresses it, and stores it in an internal object database, together with a commit object that holds some meta information [1]. This approach is easy to reason about, as it is no more complicated than the simple attempt described above.
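To get a feeling for the object database, you can store a file's content by hand and read it back. This sketch assumes git is installed and uses a temporary repository; it relies on the plumbing commands git hash-object and git cat-file rather than on git commit, so it only shows the file-content part of the story:

```shell
#!/bin/sh
set -eu

# A throwaway repository for the experiment.
repo=$(mktemp -d)
cd "$repo"
git init -q .

printf 'int main(void) { return 0; }\n' > main.c

# Store the file content as an object; -w writes it into the object database.
blob=$(git hash-object -w main.c)

# The object ID is derived from the content, so hashing the same
# content again yields the same ID.
same=$(git hash-object main.c)

# Ask Git for the object's type and read the content back.
type=$(git cat-file -t "$blob")
content=$(git cat-file -p "$blob")

echo "id:      $blob"
echo "type:    $type"
echo "content: $content"
```

The object holds the complete file content, not a difference to anything that was stored before.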
You may notice a big drawback here, though: although individual file contents are compressed (which is fine), even a small change to a file causes its entire content to be stored again, so the object database accumulates a lot of near-duplicate data.
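The duplication is easy to observe. In this sketch (again assuming git is installed; data.txt and its content are made up), a single line is appended to a file of 1000 lines, and Git ends up with two full-size objects:

```shell
#!/bin/sh
set -eu

repo=$(mktemp -d)
cd "$repo"
git init -q .

# A hypothetical file with 1000 lines.
seq 1 1000 > data.txt
first=$(git hash-object -w data.txt)

# Append a single line and store the file again.
echo 'one more line' >> data.txt
second=$(git hash-object -w data.txt)

# cat-file -s prints the uncompressed size of an object in bytes.
size_first=$(git cat-file -s "$first")
size_second=$(git cat-file -s "$second")

# "count" is the number of loose (not yet packed) objects.
loose=$(git count-objects -v | awk '/^count:/ {print $2}')

echo "first:  $first ($size_first bytes)"
echo "second: $second ($size_second bytes)"
echo "loose objects: $loose"
```

The second object is not a patch against the first one; it contains all 1000 original lines plus the new one.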
In Git, this scalability problem is simply ignored at first and solved later on. In a process called packing, the objects are delta-compressed and moved into one or more packfiles. Git does this on several occasions; you can also trigger it manually using git gc --aggressive.
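You can watch packing happen with git count-objects. The sketch below assumes git is installed; the repository, the identity, and the three small commits are made up for the experiment:

```shell
#!/bin/sh
set -eu

repo=$(mktemp -d)
cd "$repo"
git init -q .

# Hypothetical identity so that committing works in a fresh environment.
git config user.name "Example"
git config user.email "example@example.com"

# Create a few commits, each changing main.c slightly.
for i in 1 2 3; do
    printf 'int main(void) { return %s; }\n' "$i" > main.c
    git add main.c
    git commit -q -m "Revision $i"
done

# "count" is the number of loose objects, "in-pack" the number of packed ones.
loose_before=$(git count-objects -v | awk '/^count:/ {print $2}')

# Trigger packing manually.
git gc --aggressive --quiet

loose_after=$(git count-objects -v | awk '/^count:/ {print $2}')
packed_after=$(git count-objects -v | awk '/^in-pack:/ {print $2}')

echo "loose before: $loose_before"
echo "loose after:  $loose_after"
echo "packed after: $packed_after"
```

Before packing, every object sits in its own file below .git/objects; afterwards, the objects live in a packfile below .git/objects/pack.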
The drawing below shows a simplified illustration of the storage of compressed file content into blob objects as well as the packfile generation.
Conclusion
Even for everyday Git usage it is vitally important to understand a bit of its inner workings; it's good to see that the basic idea is not inherently complex but more or less identical with what we might come up with anyway.
This understanding gives us the power to get a grasp of all the more advanced features like branching, merging and rebasing.
This post was originally published on steffen.ronalter.de
References
[1] For all the details, please refer to the section "Git Objects" of the excellent Pro Git book.
Top comments (2)
In my opinion, repacking is just a little optimization for saving some storage on the disk and not a fundamental concept of Git.
Git would work pretty well without this function. Git never stores the exact same object twice. So if you have two commits which differ by only one file change, every file except the one that changed is reused for the second commit. (Underneath, it's like a key-value store which uses the whole content to generate a unique key (SHA-1).)
The changed file will be stored a second time here as a completely new file. Git does not track file changes. So if this file is big, it could be a bit of a waste of disk space (normally not so important today). But to make such a situation more efficient, you can use the repack feature.
You're right. Repacking is just an optimization.
Speaking in terms of the naive analogy, you don't need to compress the backup folders after all. It's an optimization to save disk space.
As suggested in a comment on the original article, this analogy might be a bit misleading...