Where does git store my files?

Years ago I wrote about this same topic in Chinese: git的存储 ("Git's storage") | LI, Yu from 1981 (liyu1981.github.io). This is the English version of it.

If you want all the details, you can read them here: Git-Internals-Git-Objects.

But if that is TL;DR, or you are just feeling lazy, let me explain it here.

The best way to understand something is to use it, so next I will show you by doing (you can follow along too).

Preparation

Open a terminal window and run the following:

mkdir gittest
cd gittest
echo world >hello.txt
git init
git add .
git commit -a -m "init"
echo world2 >>hello.txt
git commit -a -m "step 2"

After this step, we should have a simple repo with 2 commits.

Check where git stores my files

The files must be somewhere on our hard disk. So let us find them step by step.

The first step is to find our commits’ ids, each of which is a sha1 value.

[liyu@hk153 ~/DevCamp/gittest]$ git log
commit f5bf4f39d9d0d86164e574850528e43c70f2bb0e
Author: LI Yu <liyu@clustertech.com>
Date:   Mon Mar 17 12:12:28 2014 +0800

    step 2

commit 5757f1fb992f9b8a13d2b3cbe426b48134777bb1
Author: LI Yu <liyu@clustertech.com>
Date:   Mon Mar 17 12:12:09 2014 +0800

    init

From the log (newest commit first), we know my commit 1 (init) has id 5757f1fb992f9b8a13d2b3cbe426b48134777bb1 and my commit 2 (step 2) has id f5bf4f39d9d0d86164e574850528e43c70f2bb0e. (If you follow my commands, your commit ids will definitely be different from mine.)

With the commit ids, we can then check the content of each commit in detail:

[liyu@hk153 ~/DevCamp/gittest]$ git cat-file -p 5757f1fb992f9b8a13d2b3cbe426b48134777bb1
tree 7f2dbfa479cbe99062de2ef82b713f044c4406d8
author LI Yu <liyu@clustertech.com> 1395029529 +0800
committer LI Yu <liyu@clustertech.com> 1395029529 +0800

init
[liyu@hk153 ~/DevCamp/gittest]$ git cat-file -p f5bf4f39d9d0d86164e574850528e43c70f2bb0e
tree 82e61d0a3922801e92d6912c009168345c268353
parent 5757f1fb992f9b8a13d2b3cbe426b48134777bb1
author LI Yu <liyu@clustertech.com> 1395029548 +0800
committer LI Yu <liyu@clustertech.com> 1395029548 +0800

step 2

These 2 queries tell us (looking at the tree value) that my commit 1’s content is stored in the record with id 7f2dbfa479cbe99062de2ef82b713f044c4406d8, and my commit 2’s content is stored in the record with id 82e61d0a3922801e92d6912c009168345c268353. Commit 2 also has a parent line, which points back to commit 1.

Then let us follow these leads to the real files git has stored.
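
To make the structure of a commit record concrete, here is a minimal Python sketch that splits the `git cat-file -p` output shown above into its header fields and its message. The function name `parse_commit` is hypothetical, and the sketch assumes simple one-line headers like the ones in this post; it is not how git itself parses commits.

```python
def parse_commit(text):
    """Split pretty-printed commit text into (headers, message).

    Assumes every header fits on one line, as in the output above.
    """
    header_part, _, message = text.partition("\n\n")
    headers = {"parent": []}  # a commit may have zero, one or more parents
    for line in header_part.splitlines():
        key, _, value = line.partition(" ")
        if key == "parent":
            headers["parent"].append(value)
        else:
            headers[key] = value
    return headers, message.strip()


# The output of `git cat-file -p` for the "step 2" commit, copied from above.
STEP2 = """tree 82e61d0a3922801e92d6912c009168345c268353
parent 5757f1fb992f9b8a13d2b3cbe426b48134777bb1
author LI Yu <liyu@clustertech.com> 1395029548 +0800
committer LI Yu <liyu@clustertech.com> 1395029548 +0800

step 2
"""

headers, message = parse_commit(STEP2)
print("tree:   ", headers["tree"])
print("parents:", headers["parent"])
print("message:", message)
```

The tree value is the lead to follow for the files, and the parent list is what chains the commits into a history.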

[liyu@hk153 ~/DevCamp/gittest]$ git cat-file -p 7f2dbfa479cbe99062de2ef82b713f044c4406d8
100644 blob cc628ccd10742baea8241c5924df992b5c019f71    hello.txt

[liyu@hk153 ~/DevCamp/gittest]$ git cat-file -p 82e61d0a3922801e92d6912c009168345c268353
100644 blob 5ad310361e95be08361d9ecd032ad66506d4c066    hello.txt

The outputs tell us that:

  1. my commit 1’s tree has one blob, with file mode (permission) 100644, sha1 (which is also the record id) cc628ccd10742baea8241c5924df992b5c019f71, and the name ‘hello.txt’;
  2. my commit 2’s tree has one blob too, with file mode 100644, sha1 5ad310361e95be08361d9ecd032ad66506d4c066, and the name ‘hello.txt’.
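
The tree record id is itself a sha1 over the tree’s content. The sketch below assumes the usual binary layout of a tree (for each entry: the mode, a space, the name, a NUL byte, then the 20 raw bytes of the blob sha1, all prefixed by a `tree <size>` header). The function name `tree_id` is hypothetical, and the plain sort by name is only valid for a flat list of files like ours.

```python
import hashlib


def tree_id(entries):
    """Compute the id of a tree from (mode, name, blob_sha1_hex) tuples.

    Sketch only: sorts by raw name bytes, which is enough when the tree
    contains files but no sub-folders.
    """
    body = b""
    for mode, name, sha_hex in sorted(entries, key=lambda e: e[1].encode()):
        body += mode.encode() + b" " + name.encode() + b"\0" + bytes.fromhex(sha_hex)
    header = b"tree " + str(len(body)).encode() + b"\0"
    return hashlib.sha1(header + body).hexdigest()


# The single entry of commit 1's tree, taken from the listing above.
commit1_tree = tree_id([
    ("100644", "hello.txt", "cc628ccd10742baea8241c5924df992b5c019f71"),
])
print(commit1_tree)
```

If the layout assumption holds, the printed value should be the same tree id that the first commit points to.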

And now we are at the final step: find all those files stored by git.

For commit 1

[liyu@hk153 ~/DevCamp/gittest]$ git cat-file -p cc628ccd10742baea8241c5924df992b5c019f71
world

For commit 2

[liyu@hk153 ~/DevCamp/gittest]$ git cat-file -p 5ad310361e95be08361d9ecd032ad66506d4c066
world
world2

Well, the conclusion: git has stored 2 full versions of our file, one for each commit.

I can further find them in the .git folder. Simply searching in .git/objects, I can see that my commit 2 file is in the .git/objects/5a folder, and that the file name is simply the sha1 value with its first 2 chars (‘5a’) removed.
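
The blob ids are not random: each one is the sha1 of a short header followed by the full file content. Here is a minimal sketch of that computation in Python; the function name `blob_id` is hypothetical, and the byte strings assume hello.txt holds exactly the lines shown above, each ending with a newline.

```python
import hashlib


def blob_id(content: bytes) -> str:
    """Return the id git gives to a blob with this exact content."""
    header = b"blob " + str(len(content)).encode() + b"\0"
    return hashlib.sha1(header + content).hexdigest()


print(blob_id(b""))                 # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 (the empty blob)
print(blob_id(b"world\n"))          # compare with the blob id of commit 1
print(blob_id(b"world\nworld2\n"))  # compare with the blob id of commit 2
```

Because the id depends only on the content, two files with identical content share one stored blob, and any change to the content produces a new, complete blob.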

[liyu@hk153 ~/DevCamp/gittest]$ find .git/objects/
.git/objects/
.git/objects/info
.git/objects/pack
.git/objects/57
.git/objects/57/57f1fb992f9b8a13d2b3cbe426b48134777bb1
.git/objects/cc
.git/objects/cc/628ccd10742baea8241c5924df992b5c019f71
.git/objects/f5
.git/objects/f5/bf4f39d9d0d86164e574850528e43c70f2bb0e
.git/objects/7f
.git/objects/7f/2dbfa479cbe99062de2ef82b713f044c4406d8
.git/objects/5a
.git/objects/5a/d310361e95be08361d9ecd032ad66506d4c066
.git/objects/82
.git/objects/82/e61d0a3922801e92d6912c009168345c268353
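
The files listed above are not plain text: git compresses each loose object with zlib. The sketch below shows, under that assumption, how an id maps to a path and how a record can be read back. The helper names `loose_object_path` and `read_loose_object` are hypothetical, and the demo writes into a temporary folder that stands in for .git, so it does not touch a real repo.

```python
import hashlib
import os
import tempfile
import zlib


def loose_object_path(git_dir, sha_hex):
    """First 2 chars of the sha1 are the folder, the rest is the file name."""
    return os.path.join(git_dir, "objects", sha_hex[:2], sha_hex[2:])


def read_loose_object(git_dir, sha_hex):
    """Decompress a loose object and split it into (type, content)."""
    with open(loose_object_path(git_dir, sha_hex), "rb") as f:
        raw = zlib.decompress(f.read())
    header, _, body = raw.partition(b"\0")
    obj_type, _, size = header.partition(b" ")
    if int(size) != len(body):
        raise ValueError("size in header does not match content length")
    return obj_type.decode(), body


# Demo: store one blob the same way in a stand-in folder, then read it back.
with tempfile.TemporaryDirectory() as git_dir:
    raw = b"blob 6\0world\n"
    sha_hex = hashlib.sha1(raw).hexdigest()
    path = loose_object_path(git_dir, sha_hex)
    os.makedirs(os.path.dirname(path))
    with open(path, "wb") as f:
        f.write(zlib.compress(raw))
    result = read_loose_object(git_dir, sha_hex)
    print(os.path.relpath(path, git_dir))
    print(result)
```

Pointing `read_loose_object` at a real .git folder and one of the ids from the listing should give the same content that `git cat-file -p` prints, as long as the object has not been moved into a pack file.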

Let us look at the pros and cons of storing files this way

Pros
  1. Git stores full copies of our modified files. Imagine that we need to compute the diff between two far-apart versions of a file. We only need to read those two full copies into memory (finding them with the same method we have just used), then run the diff algorithm once on the loaded copies. This is much simpler than storing only the diff of each modification: with that approach, if two versions are far apart, we need to load all the diffs between them and merge them one by one.
  2. Git stores each object with the first 2 chars of its sha1 as the folder name and the rest of the sha1 as the file name. This turns the file system into an index: knowing the sha1 of a record, we only need to find the folder first, then find the file inside that folder. Because the sha1 is computed from an object’s content, different contents get different ids in practice and we will not have conflicts, and because sha1 values are spread evenly over their first few chars, we will not have too many files in the same folder. In practice this has proved to be a good balance between folder depth and search efficiency.
  3. Keeping all full copies in local storage also makes sure that git functions normally even without internet access (which is usually where the central version control system lives). When Linus designed git to solve his own problem of managing the Linux kernel’s code, this was almost the No.1 requirement he had in mind.
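
To get a feeling for point 2, the following sketch hashes a number of made-up file contents the way git hashes blobs and counts how many objects land in each 2-char folder. The function name `fanout_counts` and the generated contents are purely illustrative assumptions, not data from a real repo.

```python
import hashlib
from collections import Counter


def fanout_counts(n_objects):
    """Count objects per 2-char folder for n made-up blobs."""
    counts = Counter()
    for i in range(n_objects):
        content = f"file version {i}\n".encode()
        header = b"blob " + str(len(content)).encode() + b"\0"
        sha_hex = hashlib.sha1(header + content).hexdigest()
        counts[sha_hex[:2]] += 1
    return counts


counts = fanout_counts(10000)
print("folders used:       ", len(counts))
print("fewest in a folder: ", min(counts.values()))
print("most in a folder:   ", max(counts.values()))
```

With 2 hex chars there are at most 256 folders, so 10000 objects average about 39 files per folder, and the counts stay close to that average.
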
Cons
  1. Of course, stored this way, git uses more storage than tools that save only diffs (such as svn/cvs). Storage was a precious resource some years ago, but it is generally not a problem nowadays, especially for code.
  2. There is no central server by default, so there is no built-in permission control either. As long as I know where someone else’s file system is, I can exchange data with them. Permission control is usually a hard requirement for an enterprise revision control system, but the original version of git did not target it (and it turns out to be easy to handle, as GitHub shows nowadays).

Finally

Below is the talk Linus himself gave about how he created git. It is a very inspiring talk and worth your time to watch.