Git: Behind The Scenes
Hello friends, hope you are doing well. Recently, I was studying some concepts of Git and I was amazed by the capability of this powerful, open-source tool. I understood the commands and operations I learned, but then I wondered: how does Git actually perform them? How does it store the data and keep track of thousands of commits at the same time? Isn’t it worth researching and studying? Don’t worry, I have already done this for you all. Let’s look under the hood of Git.
(This article assumes you already know the basics of Git; here we look into some more advanced Git concepts.)
Just for introduction, what is Git?
Git is a Distributed Version Control System (DVCS). Git keeps track of the different versions of our code along a timeline, so we can go back in time and work on a previous version anytime. It maintains local and remote repositories for data availability.
At its core, Git is nothing but a database that stores data as key-value pairs. The value is the content of an object, and the key is the SHA-1 hash of that content. A hash can identify any object Git stores, including a particular commit. We will see what this means in the upcoming sections. Okay, now let’s dive deep into this stuff.
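To make the key-value idea concrete, here is a tiny Python sketch of a store where the key is always the SHA-1 hash of the value. The names ContentStore, put and get are made up for this illustration, and real Git hashes a small header along with the content and keeps everything on disk, so treat this as a toy model only.

```python
import hashlib


class ContentStore:
    """Toy in-memory store: every key is the SHA-1 hash of its value."""

    def __init__(self):
        self._objects = {}

    def put(self, data: bytes) -> str:
        # The key is derived from the content, never chosen by the caller.
        key = hashlib.sha1(data).hexdigest()
        self._objects[key] = data
        return key

    def get(self, key: str) -> bytes:
        return self._objects[key]

    def __len__(self) -> int:
        return len(self._objects)


store = ContentStore()
key_a = store.put(b"Once upon a time\n")
key_b = store.put(b"Once upon a time\n")  # same content again

# Identical content gives an identical key, so it is stored only once.
print(key_a == key_b, len(store))
```

Because the key comes from the content, storing the same data twice does not create a second entry.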
Commit: Commit is a Git command that records a snapshot of the staged files in the “./.git” folder. This folder is created when we run the “git init” command in a directory, which turns it into an (empty) repository. A commit simply saves the current state of the branch/project in the local repository at that point in time. Every commit is atomic. With the commit command, the changes in the staging area are recorded in the local repository. We can still modify the files afterwards and commit the new changes again.
SHA-1 Algorithm (Secure Hash Algorithm-1):
This is a cryptographic hash function that takes some input data and produces a 160-bit (20-byte) hash value, usually written as 40 hexadecimal characters. It does not encrypt the data: the hash cannot be turned back into the input. So why is SHA-1 used? “Cryptographic” suggests security, and that does come with the algorithm, but Git uses it mainly for data integrity. The same input always gives the same hash, even years later, and any change to the data gives a different hash, so for practical purposes each hash stands for exactly one piece of content. So many users use it at the same time and Git is still able to maintain this integrity and provide the same data anytime in the future. Well, this is what power is!
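If you want to see these properties for yourself, a few lines of Python with the standard hashlib module are enough. This is only a quick sketch of plain SHA-1 (not yet the exact input Git feeds into it); sha1_hex is a helper name made up for this example.

```python
import hashlib


def sha1_hex(data: bytes) -> str:
    """Return the SHA-1 hash of data as 40 hexadecimal characters."""
    return hashlib.sha1(data).hexdigest()


digest = hashlib.sha1(b"abc")
print(digest.digest_size * 8)   # size of the hash in bits
print(sha1_hex(b"abc"))         # same input always gives the same hash
print(sha1_hex(b"abd"))         # a one-letter change gives a different hash
```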
Git does the hashing part internally. But can we see it? Yes, we can. Suppose my_story.txt is our file and we want to generate the hash for it. For this, git hash-object is used:
- git hash-object my_story.txt
This is the hash value you get: bc9655bcb38da73170a67a2bd6b82586978735d5
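For a file, Git does not hash the raw bytes alone. It first puts a small header in front (the word "blob", the size in bytes, and a zero byte) and hashes the whole thing. Here is a minimal Python sketch of that; git_blob_hash is a name made up for this example, and since we do not know what is inside my_story.txt, the sketch uses a simple "hello world" line instead.

```python
import hashlib


def git_blob_hash(content: bytes) -> str:
    """Compute the hash Git gives to a file (blob) with this content."""
    # Header format: "blob <size in bytes>" followed by a zero byte.
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


# Should match: echo "hello world" | git hash-object --stdin
print(git_blob_hash(b"hello world\n"))
```

To check a real file, read it in binary mode and pass the bytes to the function, then compare the result with what git hash-object prints.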
If we dig more into this, we can understand the directory and hash value storage.
.git/objects is the directory where all the objects are stored in a compressed (zlib) format. Let's see its contents by listing it.
- ls ./.git/objects
Each entry in the output is a directory named after the first two characters of a hash value.
For example, if we take the 08 directory from that output and list its contents, we can see:
- ls ./.git/objects/08
So, the original hash is 0837cb9b1877c9d7d314c566dc3ad0f6a76e45a9. From this, "08" is used as the directory name and the remaining 38 characters, "37cb9b1877c9d7d314c566dc3ad0f6a76e45a9", are used as the name of the file inside it. The whole hash value works as a reference to the actual file.
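The split between directory name and file name can be sketched in a few lines of Python. The function name loose_object_path is made up for this illustration, and it only covers objects stored as individual files (not ones Git has packed together).

```python
import string
from pathlib import PurePosixPath


def loose_object_path(object_hash: str, git_dir: str = ".git") -> PurePosixPath:
    """Map a full 40-character hash to its location under .git/objects."""
    if len(object_hash) != 40 or any(c not in string.hexdigits for c in object_hash):
        raise ValueError("expected a full 40-character hexadecimal hash")
    # First two characters: directory name. Remaining 38: file name.
    return PurePosixPath(git_dir) / "objects" / object_hash[:2] / object_hash[2:]


print(loose_object_path("0837cb9b1877c9d7d314c566dc3ad0f6a76e45a9"))
```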
To see the content of the actual file data, we need to use cat-file with the -p flag, which pretty-prints the data. In simple words, it shows the data in a human-readable format, because the stored object is compressed.
- git cat-file -p 0837
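In the above command, we can observe that we have used a short value (0837) of the hash to get the data. This works as long as the short value is at least four characters and matches only one object. We can use the full value also and get the same data.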
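To get a feel for what cat-file has to do, here is a rough Python sketch that writes a blob in the same layout and reads it back using a short hash. write_blob and read_object are names made up for this example, the sketch works in a temporary folder instead of a real repository, and real Git handles many more cases (packed objects, other object types, and so on).

```python
import hashlib
import tempfile
import zlib
from pathlib import Path


def write_blob(objects_dir: Path, content: bytes) -> str:
    """Store content as a zlib-compressed blob and return its hash."""
    data = b"blob " + str(len(content)).encode("ascii") + b"\0" + content
    object_hash = hashlib.sha1(data).hexdigest()
    target = objects_dir / object_hash[:2] / object_hash[2:]
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(zlib.compress(data))
    return object_hash


def read_object(objects_dir: Path, prefix: str) -> tuple[str, bytes]:
    """Find the one object whose hash starts with prefix and decode it."""
    if len(prefix) < 4:
        raise ValueError("use at least four characters of the hash")
    matches = list((objects_dir / prefix[:2]).glob(prefix[2:] + "*"))
    if len(matches) != 1:
        raise LookupError(f"{prefix} matches {len(matches)} objects")
    raw = zlib.decompress(matches[0].read_bytes())
    header, _, body = raw.partition(b"\0")   # header looks like b"blob 12"
    object_type, _size = header.split(b" ")
    return object_type.decode("ascii"), body


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        full_hash = write_blob(Path(tmp), b"hello world\n")
        print(read_object(Path(tmp), full_hash[:4]))
```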
Now we understand the concept of hashing and have seen commands such as cat-file and hash-object in use. But we do not come across these commands in our daily work with Git, so what are they? These are called plumbing commands.
Git has two types of commands:
1. Porcelain Commands: These are the human-friendly commands we use for everyday work with Git. Ex. git add, git commit, git status
2. Plumbing Commands: Used to directly work with the Git internal structure. Ex. git hash-object, git cat-file, git rev-parse
To understand Git in an effective way, it is necessary to understand porcelain as well as plumbing commands.
In the last few examples of plumbing commands, we used cat-file with the hash of a file object, like the one the hash-object command generates. We can use the same command with the hash of a commit once we add and commit this file. The commit gives us more details about the changes.
- git cat-file -p 91be
Here, the epoch timestamp is added to keep track of when the commit was made, and the commit message is displayed to maintain the working history.
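The epoch timestamp is just the number of seconds since 1 January 1970 (UTC), followed by a time zone offset. Below is a small Python sketch that pulls it out of a commit and turns it into a readable date. The commit text, the name Jane Doe and the function names are all invented for this illustration; only the general layout (tree, author, committer, blank line, message) follows what cat-file prints for a commit.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical commit text in the layout printed by `git cat-file -p <commit>`.
SAMPLE_COMMIT = """\
tree 4b825dc642cb6eb9a060e54bf8d69288fbee4904
author Jane Doe <jane@example.com> 1700000000 +0530
committer Jane Doe <jane@example.com> 1700000000 +0530

Add my_story.txt
"""


def parse_commit(text: str) -> dict:
    """Split commit text into its header fields and the message."""
    headers, _, message = text.partition("\n\n")
    info = {"message": message.strip()}
    for line in headers.splitlines():
        key, _, value = line.partition(" ")
        info[key] = value  # note: repeated keys (several parents) would overwrite
    return info


def author_time(author_field: str) -> datetime:
    """Convert the trailing '<epoch> <+hhmm>' of an author field to a datetime."""
    *_, epoch, offset = author_field.split(" ")
    sign = -1 if offset[0] == "-" else 1
    delta = timedelta(hours=int(offset[1:3]), minutes=int(offset[3:5]))
    return datetime.fromtimestamp(int(epoch), tz=timezone(sign * delta))


commit = parse_commit(SAMPLE_COMMIT)
print(commit["message"])
print(author_time(commit["author"]).isoformat())
```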
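These details about the commit are stored in Git objects. There are three basic object types in Git: the blob (the contents of a file), the tree (a directory listing that maps file names to blobs and to other trees), and the commit (a pointer to a tree together with the author, timestamp, message, and parent commits).
Each commit points to a tree, and the tree points to the blobs for the files in that snapshot.
All the commits that we perform are stored in this structure to maintain a clear history of all the changes made.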
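The chain from commit to tree to blob can be sketched in Python by hashing each object and placing its hash inside the next one. This is a simplified sketch under a few assumptions: the helper names, the author Jane Doe and the timestamp are invented, the tree holds only plain files (real Git has extra sorting rules for sub-directories), and nothing is written to disk.

```python
import hashlib


def hash_object(object_type: str, body: bytes) -> str:
    """Hash '<type> <size>' + zero byte + body, as Git does for every object."""
    header = f"{object_type} {len(body)}".encode("ascii") + b"\0"
    return hashlib.sha1(header + body).hexdigest()


def tree_body(entries: list[tuple[str, str, str]]) -> bytes:
    """Build a tree from (mode, file name, hex hash) entries, sorted by name."""
    body = b""
    for mode, name, hex_hash in sorted(entries, key=lambda entry: entry[1]):
        # Each entry: "<mode> <name>", a zero byte, then the 20 raw hash bytes.
        body += f"{mode} {name}".encode("utf-8") + b"\0" + bytes.fromhex(hex_hash)
    return body


def commit_body(tree_hash: str, message: str,
                who: str = "Jane Doe <jane@example.com> 1700000000 +0000") -> bytes:
    """Build the text of a first commit (no parent) pointing at tree_hash."""
    return f"tree {tree_hash}\nauthor {who}\ncommitter {who}\n\n{message}\n".encode("utf-8")


blob_hash = hash_object("blob", b"hello world\n")
tree_hash = hash_object("tree", tree_body([("100644", "my_story.txt", blob_hash)]))
commit_hash = hash_object("commit", commit_body(tree_hash, "Add my_story.txt"))

print("blob  ", blob_hash)
print("tree  ", tree_hash)
print("commit", commit_hash)
```

Notice that changing the file changes the blob hash, which changes the tree hash, which changes the commit hash. That is how one commit hash can vouch for the whole snapshot.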
This is most of what we need to know about what happens behind the scenes of Git, and it is what I understood from going through these concepts. Let me know in the comments if any point still needs to be covered.
Bye, for now, see you in the next blog with some new concepts and technology. Keep learning and keep blogging!