Visualize how Git works internally on your local machine (using a few commands)

I will be honest. The first time I came across Git, I found it quite complicated and difficult to understand. After going through some tutorial videos and blogs on how to work with Git, I was able to perform my tasks, but it still felt like nothing less than magic. You just write some commands in the terminal and repositories get created, files get moved locally or over the network, changes get reverted, and whatnot.

I decided to take some time to delve into it and understand what actually happens inside. This blog is a step-by-step record of what I learned, which you can also try on your own system. You will surely like it.

To us, Git is a “Version Control System”, but from an internal architecture point of view, it is actually a “Content Addressable System (CAS)”, where the “Content” is the data to be stored and “Addressable” means that the data can be accessed using a key (address). Git generates the key from the content itself and uses it as a unique ID for that content.

Note: We will use the “Git Bash Terminal” here (install Git if you haven’t already) and work with bash commands.

Let’s start by quickly creating a new folder “MyNewApp” for the project and move inside the folder.

-> mkdir MyNewApp

-> cd MyNewApp        

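Now, initialize a Git repository inside the folder by running the command “git init”. Also, run the command “du -c” to list all the directories and subdirectories inside the current “MyNewApp” directory (the command also prints the disk usage of each one and a grand total; the sizes are left out of the listings below).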

-> git init

-> du -c

./.git/hooks
./.git/info
./.git/objects/info
./.git/objects/pack
./.git/objects
./.git/refs/heads
./.git/refs/tags
./.git/refs
./.git
.
total        

You can see that many subdirectories got created inside the “.git” directory. You can think of this directory as a database containing all the information required to retain, manage and retrieve the revisions/history of a project. Git uses two data structures to achieve this: the “index” (the staging area) and the “object store”.
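To make the “database” idea a little more concrete, here is a minimal sketch, assuming a POSIX shell with “git” on the PATH. The scratch directory and the variable names are only for illustration and are not part of the walkthrough. It shows that a fresh repository holds no objects yet and that “HEAD” is a plain text file naming the current branch.

```shell
# Create a throwaway repository so the experiment does not touch MyNewApp.
scratch=$(mktemp -d)
git init -q "$scratch"
cd "$scratch"

# HEAD is a small text file holding a symbolic reference to the current branch.
head_content=$(cat .git/HEAD)
echo "$head_content"

# The object store stays empty until something is added: count the files in it.
object_files=$(find .git/objects -type f | wc -l | tr -d ' ')
echo "objects stored: $object_files"
```

The branch name printed after “ref: refs/heads/” depends on how Git is configured on your machine.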

Now, we will create a text file “FirstFile.txt” inside the project folder by using the command “touch FirstFile.txt” and write some content into it with “echo”. The command “cat FirstFile.txt” will print the file content.

-> touch FirstFile.txt

-> echo "This is a sample text" > FirstFile.txt

-> cat FirstFile.txt

This is a sample text        

Now, let’s add and commit the file created using the “git add” and “git commit” commands.

-> git add FirstFile.txt

-> git commit -am "First Commit"        

Here, it is important to note that:

Git stores the content of each file as an “object”, and the address (key) of every object is a hash calculated with “SHA”.

Now, what is SHA?

Objects in Git are identified by 40-character hexadecimal strings called “object names”, which are calculated by taking the SHA1 (a cryptographic hash function) hash of a short header followed by the contents of the object.

After the commit, running the “du -c” command in the terminal shows that three new directories (“00”, “6e” and “a5”) were created inside “./.git/objects/”, one for each new object.

-> du -c

./.git/hooks
./.git/info
./.git/logs/refs/heads
./.git/logs/refs
./.git/logs
./.git/objects/00
./.git/objects/6e
./.git/objects/a5
./.git/objects/info
./.git/objects/pack
./.git/objects
./.git/refs/heads
./.git/refs/tags
./.git/refs
./.git
.
total        

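We can view the rest of the hash for the content that has been added and committed using the command “ls ./.git/objects/a5”. Git uses the first two characters of the hash as the directory name and the remaining 38 characters as the file name.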

-> ls ./.git/objects/a5

3073ff9f6d50c5ead79b0232869618889fec70        
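The split between directory name and file name can be reproduced by hand. The sketch below assumes a POSIX shell, “git”, and the default SHA-1 object format; the scratch repository and the variable names are illustrative. It stores the same text as a loose object with the plumbing command “git hash-object -w” and derives the path where Git keeps it.

```shell
scratch=$(mktemp -d)
git init -q "$scratch"
cd "$scratch"

# -w writes the blob into the object store and prints its 40-character name.
hash=$(printf 'This is a sample text\n' | git hash-object -w --stdin)

# The first two characters name the directory, the remaining 38 name the file.
dir=$(printf '%s' "$hash" | cut -c1-2)
file=$(printf '%s' "$hash" | cut -c3-)
object_path=".git/objects/$dir/$file"

ls -l "$object_path"
```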

To verify the content of the file using the hash, let’s run the command “cat .git/objects/a5/3073ff9f6d50c5ead79b0232869618889fec70”.

This prints unreadable binary data, because Git compresses every object with the “zlib” library before storing it. The stored bytes are a zlib stream, not a complete “gzip” file.

We can still decompress the object with “gzip” by prepending a gzip header to it, as in the command below (gzip may warn about an unexpected end of file, since the zlib stream does not end with a gzip trailer):

-> printf "\x1f\x8b\x08\x00\x00\x00\x00\x00" | cat - .git/objects/a5/3073ff9f6d50c5ead79b0232869618889fec70 | gzip -dc

The output consists of 3 parts: “blob” (the object type), the size of the file content in bytes, and the actual file content. “blob” stands for “Binary Large Object”; a blob stores the data of a file.

Let’s now introduce the “Git Plumbing Commands”. The commands we generally use, like “git add”, “git commit” and “git log”, are called “Git Porcelain Commands”. The Porcelain commands are the human-friendly, high-level commands, while the Plumbing commands are used to directly inspect and manipulate the Git internals. One such Plumbing command is “git cat-file”, which shows the content, type and size of an object when given its hash (or a unique prefix of it, such as “a530”). Let’s look at the file we created by using “git cat-file” with the hash (“-p” pretty prints the content, “-t” gives the object type and “-s” gives the content size in bytes).

-> git cat-file -p a530
This is a sample text

-> git cat-file -t a530
blob

-> git cat-file -s a530
22        
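Abbreviated hashes such as “a530” work because Git accepts any unambiguous prefix of at least four characters. Here is a small sketch of that, assuming a POSIX shell, “git”, and a scratch repository that contains a single blob (all names are illustrative):

```shell
scratch=$(mktemp -d)
git init -q "$scratch"
cd "$scratch"

# Store one blob and keep only the first four characters of its name.
full=$(printf 'This is a sample text\n' | git hash-object -w --stdin)
short=$(printf '%s' "$full" | cut -c1-4)

# rev-parse expands the 4-character prefix back to the full object name.
expanded=$(git rev-parse "$short")

obj_type=$(git cat-file -t "$short")   # object type
obj_size=$(git cat-file -s "$short")   # size in bytes, newline included
echo "$short -> $expanded ($obj_type, $obj_size bytes)"
```

In a large repository four characters are often no longer unique, and Git then asks for a longer prefix.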

To calculate the “SHA hash” of the file ourselves, we hash the header “blob 22” and a NUL byte followed by the file content (“echo” appends the trailing newline, which is why the size is 22):

-> echo -e "blob 22\0This is a sample text" | shasum
a53073ff9f6d50c5ead79b0232869618889fec70 *-        
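The same check can be done without typing the size by hand. This sketch assumes a POSIX shell, “git”, and either “sha1sum” or “shasum”; the helper “sha1_hex” is a made-up name. It builds the “blob <size>” header itself and compares the result with the plumbing command “git hash-object”.

```shell
# Print the SHA-1 of stdin as hex, using whichever tool is installed.
sha1_hex() {
  if command -v sha1sum >/dev/null 2>&1; then sha1sum; else shasum; fi | cut -d' ' -f1
}

content_file=$(mktemp)
printf 'This is a sample text\n' > "$content_file"
size=$(wc -c < "$content_file" | tr -d ' ')

# Header ("blob", a space, the byte count, a NUL byte) followed by the content.
manual=$( { printf 'blob %s\0' "$size"; cat "$content_file"; } | sha1_hex )

# Git's own calculation for the same bytes.
from_git=$(git hash-object --stdin < "$content_file")

echo "manual:   $manual"
echo "from git: $from_git"
```

Both lines should print the same 40-character name, which is the key under which Git would store the content.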

When we ran “git commit”, the object under “./.git/objects/6e” was also created. It is the commit object, and it points to a “tree” object inside Git. Let’s run the commands below to view the details of the commit.

“ls ./.git/objects/6e” (to get the rest of the SHA)

“git cat-file -p 6e8d” (to get the commit details. Here “6e8d” is the first four characters of the hash: the directory name “6e” followed by the first two characters of the file name)

“git cat-file -t 6e8d” (to view the type of the object, which is “commit”)

-> ls ./.git/objects/6e
8de220f5bb9f953a627df0d78dabc1eea91071

-> git cat-file -p 6e8d
tree 006df9db2a4ba8462a03f2f4b6d6ac7020d9b6fb

First Commit

-> git cat-file -t 6e8d
commit        

Here we can see that the tree the commit points to is given by the hash “006df9db2a4ba8462a03f2f4b6d6ac7020d9b6fb”. So, we can pretty print the tree and view its object type using the two commands below:

git cat-file -p 006d

git cat-file -t 006d

-> git cat-file -p 006d
100644 blob a53073ff9f6d50c5ead79b0232869618889fec70    FirstFile.txt

-> git cat-file -t 006d
tree        
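Roughly speaking, “git add” and “git commit” can be imitated with three plumbing commands. The following is a sketch under assumptions (a POSIX shell, “git”, a scratch repository, and a placeholder identity passed with “-c”), not a description of exactly what the porcelain commands execute internally.

```shell
scratch=$(mktemp -d)
git init -q "$scratch"
cd "$scratch"
printf 'This is a sample text\n' > FirstFile.txt

# 1. Store the file as a blob and record it in the index (the "git add" part).
git update-index --add FirstFile.txt

# 2. Turn the current index into a tree object.
tree=$(git write-tree)

# 3. Wrap the tree in a commit object; the identity below is a placeholder.
commit=$(printf 'First Commit\n' |
  git -c user.name=Demo -c user.email=demo@example.com commit-tree "$tree")

git cat-file -p "$tree"     # mode, type, blob hash and file name
git cat-file -p "$commit"   # "tree <hash>", author, committer, message

# Pull the links out again: commit -> tree -> blob.
tree_in_commit=$(git cat-file -p "$commit" | sed -n 's/^tree //p')
blob_in_tree=$(git ls-tree "$tree" | awk '{print $3}')
```

Note that “git commit-tree” only creates the commit object. Unlike “git commit”, it does not move any branch to it.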

File Renaming

Let’s see how renaming a file works in Git internally. We will rename the file from “FirstFile.txt” to “FirstFileRenamed.txt” and add+commit the renamed file.

mv FirstFile.txt FirstFileRenamed.txt

git add FirstFileRenamed.txt

git commit -am "Second Commit with the file name renamed"

-> mv FirstFile.txt FirstFileRenamed.txt

-> git add FirstFileRenamed.txt

-> git commit -am "Second Commit with the file name renamed"        

You can see below that two new object directories (“26” and “ef”) got created due to this:

-> du -c

./.git/hooks
./.git/info
./.git/logs/refs/heads
./.git/logs/refs
./.git/logs
./.git/objects/00
./.git/objects/26
./.git/objects/6e
./.git/objects/a5
./.git/objects/ef
./.git/objects/info
./.git/objects/pack
./.git/objects
./.git/refs/heads
./.git/refs/tags
./.git/refs
./.git
.
total        

Below we can see that “26” is for the tree and “ef” is for the commit made after the file is renamed.

-> ls ./.git/objects/26
0e02825f9c119751f0890723ed6a79864029a8

-> ls ./.git/objects/ef
0b8aa8b3b74c95a64c131e8140d6e790950141

-> git cat-file -t 260e
tree

-> git cat-file -t ef0b
commit

-> git cat-file -p 260e
100644 blob a53073ff9f6d50c5ead79b0232869618889fec70  FirstFileRenamed.txt        

If we take the tree, we can see that the hash value of the blob hasn’t changed after renaming the file. The blob’s hash is calculated from the file content alone, while the file name is stored in the tree object. So Git reuses the existing blob and only creates a new tree and a new commit, which is why the renaming operation gets performed quickly in Git.
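The rename behaviour can be checked in isolation. The sketch below assumes a POSIX shell and “git”; the scratch repository, the “commit” helper function and the placeholder identity are illustrative. It records the blob and tree hashes before and after the rename.

```shell
scratch=$(mktemp -d)
git init -q "$scratch"
cd "$scratch"
# Helper so that commits work even when no Git identity is configured.
commit() { git -c user.name=Demo -c user.email=demo@example.com commit -q "$@"; }

printf 'This is a sample text\n' > FirstFile.txt
git add FirstFile.txt
commit -m "First Commit"
blob_before=$(git ls-tree HEAD | awk '{print $3}')
tree_before=$(git rev-parse 'HEAD^{tree}')

# Same steps as in the walkthrough: rename, add the new name, commit with -a.
mv FirstFile.txt FirstFileRenamed.txt
git add FirstFileRenamed.txt
commit -a -m "Second Commit with the file name renamed"
blob_after=$(git ls-tree HEAD | awk '{print $3}')
tree_after=$(git rev-parse 'HEAD^{tree}')

echo "blob: $blob_before -> $blob_after"   # same object, reused
echo "tree: $tree_before -> $tree_after"   # new object, holds the new name
```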

Content Modification

What does Git do when we modify the file content? Let’s add a semicolon at the end of the content of the file “FirstFileRenamed.txt” and add+commit the modified file.

-> git add FirstFileRenamed.txt

-> git commit -am "Added a semicolon at the end of the file"

-> du -c

./.git/hooks
./.git/info
./.git/logs/refs/heads
./.git/logs/refs
./.git/logs
./.git/objects/00
./.git/objects/26
./.git/objects/6e
./.git/objects/89
./.git/objects/8e
./.git/objects/a5
./.git/objects/a9
./.git/objects/ef
./.git/objects/info
./.git/objects/pack
./.git/objects
./.git/refs/heads
./.git/refs/tags
./.git/refs
./.git
.
total        

A total of 3 new objects got created (89, 8e and a9): “89” is the new blob, “a9” is the new tree and “8e” is the new commit object, created on top of the existing commit (“ef”). The blob’s hash changed because the content of the file changed.
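A change in content can be traced the same way. In this sketch the exact edit (rewriting the line with a trailing semicolon) is an assumption, as are the scratch repository, the “commit” helper and the placeholder identity. It assumes a POSIX shell and “git”.

```shell
scratch=$(mktemp -d)
git init -q "$scratch"
cd "$scratch"
commit() { git -c user.name=Demo -c user.email=demo@example.com commit -q "$@"; }

printf 'This is a sample text\n' > FirstFileRenamed.txt
git add FirstFileRenamed.txt
commit -m "Initial content"
old_blob=$(git ls-tree HEAD | awk '{print $3}')
old_commit=$(git rev-parse HEAD)

# Assumed edit: the same line with a semicolon appended.
printf 'This is a sample text;\n' > FirstFileRenamed.txt
commit -a -m "Added a semicolon at the end of the file"
new_blob=$(git ls-tree HEAD | awk '{print $3}')
parent=$(git cat-file -p HEAD | sed -n 's/^parent //p')

git cat-file -p "$old_blob"   # the old content is still stored
git cat-file -p "$new_blob"   # the new content lives in a new blob
echo "parent of the new commit: $parent"
```

The old blob is kept untouched, which is what lets Git return to the earlier version later.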

Now, we can do garbage collection by using the following Git command:

git gc --aggressive

Git also packs objects on its own, for example when many loose objects have accumulated or when objects are transferred over the network during “git push” and “git pull”. What the command does is compress the individual blob, tree and commit objects (using delta compression) and put them inside “./.git/objects/pack”.

-> git gc --aggressive

-> du -c

./.git/hooks
./.git/info
./.git/logs/refs/heads
./.git/logs/refs
./.git/logs
./.git/objects/info
./.git/objects/pack
./.git/objects
./.git/refs/heads
./.git/refs/tags
./.git/refs
./.git
.
total        

Inside “./.git/objects/pack”, two files get created: 1) an index file (with .idx extension) and 2) a pack file (with .pack extension). The pack file contains the compressed tree, blob and commit objects, and the index file records where each object sits inside the pack.

We can use the command below to view the information inside the pack file:

-> git verify-pack -v ./.git/objects/pack/pack-3df7d7e8a2fedfc57485ac1abe4b8327beda56a7.pack

The listing forms a kind of “graph”: an object that is stored as a delta points to the base object it was derived from. Storing only the differences this way helps the pack file reduce the storage requirement and take less bandwidth while communicating over the network.
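The effect of packing can also be read from “git count-objects -v”, which reports how many objects are loose and how many sit inside packs. Below is a minimal sketch, assuming a POSIX shell and “git”; the scratch repository with a single commit and the placeholder identity are illustrative.

```shell
scratch=$(mktemp -d)
git init -q "$scratch"
cd "$scratch"
printf 'This is a sample text\n' > FirstFile.txt
git add FirstFile.txt
git -c user.name=Demo -c user.email=demo@example.com commit -q -m "First Commit"

# Before packing: one loose object each for the blob, the tree and the commit.
loose_before=$(git count-objects -v | awk '/^count:/ {print $2}')

git gc --quiet

# After packing: no loose objects are left, everything is inside the pack.
loose_after=$(git count-objects -v | awk '/^count:/ {print $2}')
in_pack=$(git count-objects -v | awk '/^in-pack:/ {print $2}')
echo "loose: $loose_before -> $loose_after, packed: $in_pack"

# List the packed objects through the pack's index file.
git verify-pack -v .git/objects/pack/pack-*.idx
```

With so little history there is usually nothing worth storing as a delta, so delta entries are more likely to show up in repositories with several versions of larger files.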

Branching

To Git, branches are nothing but “references” pointing to commit objects. To understand how branching works internally in Git, let’s create a new branch from the default branch we were working on (“main” in the output below) and switch to it.

git checkout -b new_feature

Git keeps a log of updates (the reflog) for each local branch under “./.git/logs/refs/heads”, so we can view the available branches by running the command:

ls ./.git/logs/refs/heads

Then, we perform a “cat” operation on the log of the newly created branch and pretty print the object the branch points to.

“cat ./.git/logs/refs/heads/new_feature”

“git cat-file -p 8e8b”

-> git checkout -b new_feature
Switched to a new branch 'new_feature'

-> ls ./.git/logs/refs/heads
main new_feature

-> cat ./.git/logs/refs/heads/new_feature

-> git cat-file -p 8e8b
tree a9fbcb760b4a9e29dcca1eb7341f4743902a8048
parent ef0b8aa8b3b74c95a64c131e8140d6e790950141

Added a semicolon at the end of the file        

Here, we can see that the new branch points to an existing “commit” object (“8e8b”, our third commit), which in turn refers to its parent commit (“ef0b”). Creating the branch did not create any new object.
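That a branch is only a reference can be confirmed with a few plumbing commands. This sketch assumes a POSIX shell and “git”; the scratch repository and the placeholder identity are illustrative. It compares the commit and the number of loose objects before and after creating the branch.

```shell
scratch=$(mktemp -d)
git init -q "$scratch"
cd "$scratch"
printf 'This is a sample text;\n' > FirstFileRenamed.txt
git add FirstFileRenamed.txt
git -c user.name=Demo -c user.email=demo@example.com \
  commit -q -m "Added a semicolon at the end of the file"

tip_before=$(git rev-parse HEAD)
objects_before=$(git count-objects | cut -d' ' -f1)

git checkout -q -b new_feature

# The new branch resolves to the commit we were already on ...
tip_branch=$(git rev-parse refs/heads/new_feature)
# ... HEAD now names the new branch ...
current=$(git symbolic-ref HEAD)
# ... and nothing was written to the object store.
objects_after=$(git count-objects | cut -d' ' -f1)

echo "$current -> $tip_branch"
```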

------------------------------------------------------------------------------------------------------------

Thank you so much for reading the 5th edition of the #AutomationKaksha newsletter. Every week, I will be publishing articles on Automation, Framework design, ML, System Design, Web Development and Data Science.

If you found this article interesting, you may also love my other blogs:

  1. Builder Design Pattern: When to use?
  2. Why to use Dependency Injection for handling resources during object creation in Java?
  3. 25 cool Java things you may or may not have come across

Do subscribe to #AutomationKaksha and also share it with your colleagues, friends and connections who can benefit from it.

Keep Learning, and Keep Sharing.
