Catriel Lopez

Posted on Aug 28, 2020

Git, but not dumbed down: Introduction

#git #github #beginners #tutorial

Motivation

Tutorials on git are present everywhere. So, why write another? There’s a trend I’m not a big fan of: a lot of people like to divide the content between “basic” and “advanced” git.

Listen, you don’t have to know how a motor works to drive a car. I know. But sooner rather than later, something is gonna break. And if you don’t know how the thing works, other than how to use some basic commands, odds are that you are the one who’s gonna break it and will get stuck waiting for A Professional to show up.

This post is meant to explain Git in a clean and concise way, so you can be The Professional. I’m not going to try to explain the advantages or disadvantages of using git or any other VCS. I’m also not gonna divide the content into “basic” and “advanced”. To truly understand git, you must look below the surface, at the internals.

If you want to learn about why Git is such a great tool, Atlassian wrote this fantastic article about it.

If you want to learn about the implementation of Git, the source code is available here.

There’s also this great book about it, Pro Git, that I can’t recommend enough, and this post by Nico Riedmann.

Let’s start with the basics. We’ll create a new repository in any folder on your computer by opening a terminal in that folder and running the command git init.

When you create a new repository in a folder on your computer, your workspace is divided into three main areas:

Working directory

The folder itself, in which you will create, modify, and delete files.

Staging area or index

All the changes tracked by git are reflected in the .git directory that the git init command created, even before you commit them. The index lives there too, and you can use it to build up a set of changes that you want to commit together. When you save a snapshot, you save what is currently in the index, not what is in your working directory. How do we do this?

After making any changes to the working tree, and before running the commit command, you must use the add command to add any new or modified files to the index.

This command can be performed multiple times before a commit. It only adds the content of the specified file(s) at the time the add command is run; if you want subsequent changes included in the next commit, then you must run git add again to add the new content to the index.
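The behaviour described above can be modelled in a few lines of Python. This is only a toy sketch, not how Git is implemented: `working_dir`, `index`, `commits`, `add`, `commit` and the file name `notes.txt` are all made-up names, and plain dictionaries stand in for the real data structures.

```python
# Toy model of the staging area. Plain dicts stand in for Git's real
# data structures; every name here is hypothetical.
working_dir = {}   # path -> current content in the working directory
index = {}         # path -> content captured by the last add
commits = []       # list of snapshots, oldest first


def add(path):
    """Copy the file's content, as it is right now, into the index."""
    index[path] = working_dir[path]


def commit():
    """Save a snapshot of the index (not of the working directory)."""
    snapshot = dict(index)
    commits.append(snapshot)
    return snapshot


working_dir["notes.txt"] = "first draft\n"
add("notes.txt")

# Edit the file again, but do not run add a second time.
working_dir["notes.txt"] = "first draft, edited\n"

snapshot = commit()
print(repr(snapshot["notes.txt"]))  # the staged version, without the edit
```

The snapshot holds the content from the moment `add` was called. To get the edited version into the next snapshot, you would have to call `add("notes.txt")` again before `commit()`.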

Repository

The (local) repository is everything in your .git directory. What it mainly holds are all the snapshots of your project.

In a few paragraphs, we can see the main git workflow in action: you modify a file, prepare the changes to be saved, and save the snapshot (the current version of your project) to the repository.

As we said, the git init command has created a new folder .git in our working directory. What's inside?

As it turns out, a lot of things! We'll look at them as we need them. For now, we need content. We'll create a new file called new_file.txt and put a single sentence in it ("This is a new file").

The file new_file.txt now exists in the working directory, but it's not yet staged. We can see this in the output of the git status command.

Git knows of the existence of the file in the working directory, but it is not tracking any changes to it. We can make it do so by staging the file with git add.

Does this mean that git has stored the file in its history? Not yet. If we take a look at the output of git log, which shows all the commits made, we see that there's no snapshot of the project stored yet.

But! Changes have been made to the .git folder:

We have a new folder and file under .git/objects/, with a long and weird name. We only have one file, so this must be the snapshot, right? And if we look at what's inside, we should get a single sentence ("This is a new file"):

Clearly, it didn't work, as that is a binary file. And it gets worse! Look at what happens when we save the snapshot of our project to the repository:

The plot thickens! We have two new folders and files under .git/objects/ now. Let's take a moment to see what's going on here.

Git internals: How is this data saved?

After we staged new_file.txt, Git created a file under the objects folder with a weird name. This file is named with the SHA-1 checksum of the content and its header. The subdirectory is named with the first 2 characters of the SHA-1, and the filename is the remaining 38 characters.
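Here is a minimal Python sketch of that naming scheme, using only the standard library. It assumes the header for a file has the form `blob <size>` followed by a NUL byte, and that the file holds the sentence "This is a new file" with a trailing newline. The helper names `blob_id` and `object_path` are mine, not Git's.

```python
import hashlib


def blob_id(content: bytes) -> str:
    """Return the SHA-1 that names a blob holding `content`."""
    # The header is the object type, a space, the size in bytes, and a NUL.
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


def object_path(object_id: str) -> str:
    """Split the 40-character id into a subdirectory and a filename."""
    return ".git/objects/" + object_id[:2] + "/" + object_id[2:]


content = b"This is a new file\n"  # assumed content, with a trailing newline
oid = blob_id(content)
print(oid)
print(object_path(oid))
```

As a sanity check, `blob_id(b"")` should give e69de29bb2d1d6434b8b29ae775ad8c2e48c5391, the id Git uses for an empty file. If your file has different content (for example, no trailing newline), the id will be different too.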

Can we look at what's inside? Of course. Git has the cat-file command to inspect it:

Great! So the CONTENT of the file is stored in the repository. This is important, as a lot of people think that git stores only the diff between files.
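Why did the file look like binary garbage when we opened it directly? Git compresses each of these object files with zlib before writing it to disk. The sketch below shows roughly what a command like cat-file has to undo. It is an illustration under assumptions, not Git's code: `write_loose_object` and `read_loose_object` are hypothetical helper names, and the compressed bytes may not match Git's byte for byte, since that depends on the compression level.

```python
import zlib


def write_loose_object(obj_type: bytes, content: bytes) -> bytes:
    """Build compressed bytes in the same layout Git uses on disk."""
    header = obj_type + b" " + str(len(content)).encode("ascii") + b"\0"
    return zlib.compress(header + content)


def read_loose_object(raw: bytes):
    """Undo the compression and split the header from the content."""
    data = zlib.decompress(raw)
    header, _, content = data.partition(b"\0")
    obj_type, _, size = header.partition(b" ")
    if int(size) != len(content):
        raise ValueError("size in header does not match content length")
    return obj_type.decode("ascii"), content


raw = write_loose_object(b"blob", b"This is a new file\n")
obj_type, content = read_loose_object(raw)
print(obj_type)
print(content.decode("utf-8"), end="")
```

You could point `read_loose_object` at the bytes of a real file under .git/objects/ (opened in binary mode) to see the same header and content that Git stored.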

So, let's say I create a repository with a 50 GB file in it. And let's say that I modify it three times. Does this mean that I'll have a 150 GB repository? No. Git has a compression layer that we could talk about later.

Ok, git stored the content of new_file.txt, but just that and nothing else, not even its name! What about the other two files that were created after the commit?

The repository’s content is stored as tree and blob objects, with trees corresponding to directory entries and blobs corresponding more or less to file contents. A single tree object contains one or more entries, each of which is the SHA-1 hash of a blob or subtree with its associated mode, type, and filename.
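To make the tree format a bit more concrete, here is a rough Python sketch that builds a tree with a single entry. It assumes each entry is stored as the mode, a space, the filename, a NUL byte, and then the raw 20-byte SHA-1 of the blob or subtree (the type you see when inspecting a tree is derived from the mode). The helpers `object_id` and `tree_body` are hypothetical names, and sorting entries by plain filename is a simplification of Git's real ordering rules.

```python
import hashlib


def object_id(obj_type: bytes, body: bytes) -> str:
    """SHA-1 of the header plus the body, as used to name any object."""
    header = obj_type + b" " + str(len(body)).encode("ascii") + b"\0"
    return hashlib.sha1(header + body).hexdigest()


def tree_body(entries) -> bytes:
    """Serialize (mode, name, hex object id) tuples into a tree body."""
    body = b""
    # Simplification: sort by filename only.
    for mode, name, oid in sorted(entries, key=lambda e: e[1]):
        body += mode.encode("ascii") + b" " + name.encode("utf-8") + b"\0"
        body += bytes.fromhex(oid)  # 20 raw bytes, not 40 hex characters
    return body


# 100644 is the mode of a regular, non-executable file.
blob = object_id(b"blob", b"This is a new file\n")
body = tree_body([("100644", "new_file.txt", blob)])
tree = object_id(b"tree", body)
print(tree)
```

Notice that the filename lives in the tree, not in the blob. That is why two files with identical content can share a single blob while still having different names.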

Now we have that extra information we needed about the file. But, git is a collaboration tool, and as in any collaboration tool, we need some information about the user. We don’t have any information about who saved the snapshots, when they were saved, or why they were saved, yet. This is the basic information that the commit object stores for you.
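A commit object is plain text once decompressed. The following sketch builds one for a first commit (no parent), assuming the usual layout: a line pointing at the tree, an author line, a committer line, a blank line, and the message. Everything concrete here is invented for illustration: the name, the email, the timestamp, the message, and the helper names `object_id` and `commit_body`. The tree id used is the id of an empty tree, standing in for a real one.

```python
import hashlib


def object_id(obj_type: bytes, body: bytes) -> str:
    """SHA-1 of the header plus the body, as used to name any object."""
    header = obj_type + b" " + str(len(body)).encode("ascii") + b"\0"
    return hashlib.sha1(header + body).hexdigest()


def commit_body(tree_id, who, timestamp, tz, message) -> bytes:
    """Build the text of a commit without parents (a first commit)."""
    ident = f"{who} {timestamp} {tz}"
    lines = [
        f"tree {tree_id}",      # which snapshot this commit saves
        f"author {ident}",      # who wrote the change, and when
        f"committer {ident}",   # who saved it, and when
        "",                     # blank line before the message
        message,                # why the snapshot was saved
    ]
    return ("\n".join(lines) + "\n").encode("utf-8")


body = commit_body(
    tree_id="4b825dc642cb6eb9a060e54bf8d69288fbee4904",  # empty tree, as a stand-in
    who="Jane Doe <jane@example.com>",                    # hypothetical user
    timestamp=1598572800,                                 # hypothetical Unix time
    tz="+0000",
    message="Add new_file.txt",                           # hypothetical message
)
print(body.decode("utf-8"), end="")
print(object_id(b"commit", body))
```

Because the timestamp and the user are part of the hashed text, two people committing the exact same tree will still get different commit ids.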

To summarize: every time we stage a file, a blob object is created, storing that file's content. Its name is the SHA-1 of the file's content plus a short header, and it’s stored under .git/objects. Whenever we commit the changes to the repository, we create two new kinds of objects: a tree, which stores the mode and name of each file and points to its blob, and a commit, which holds all the information about who saved the snapshot, when, and why (the message!).

What’s next?

We’ll stop here for now, and we’ll get into branches in the next post. Let me know if you have any suggestions, or if I made any mistakes, as this is my first post. I hope you enjoyed it!

Original cover photo by Emile Perron on Unsplash.