Re: Understanding Git Under The Hood: Trees

git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

From: Andreas Ericsson <ae@op5.se>
To: Erik Bernoth <erik.bernoth@gmail.com>
Cc: git@vger.kernel.org
Subject: Re: Understanding Git Under The Hood: Trees
Date: Thu, 15 Aug 2013 14:40:51 +0200	[thread overview]
Message-ID: <520CCC53.4090308@op5.se> (raw)
In-Reply-To: <CAB46HOnsOdYyt3sEe=iv3AJu_BDpTqCLKUpTBFQSnVGMZc8r8A@mail.gmail.com>

On 2013-08-15 12:29, Erik Bernoth wrote:
> Hi,
>
> I'm currently trying to understand the inner workings of git better by
> writing a git clone in Python. I find it rather hard to understand how
> to efficiently use trees.
>
> What I understand is this: Trees are in essence blobs with a specific
> content. The content is a relation between other blobs names, types
> and modes. With these lines they can refer to other blobs and trees
> and thus make a tine filesystem.
>
> Now I constructed a system to read and write blobs and have an index
> file that potentially references to a top tree object, which
> represents the currently cached repository state. I can add and remove
> files from the index manually, by creating the blob of the file and
> working my way up adding or updating trees until I hit the one
> referenced in INDEX. My algorithm to automate it is really ugly, error
> prone and inefficient, though. I also have a hard time to find my way
> around in C files, so maybe some people here in the list could explain
> the algorithm in Git to me.
>
> suppose we have the following Index:
>
> INDEX
>   -> tree
>         -> tree "a/"
>             -> blob "b.txt"
>         -> tree "c/"
>             -> blob "d.txt"
>
> Now you want to stage a file with the following reference "a/e/g.txt".
>
> One approach would be to walk top-down, splitting the path into its
> elements and looking for the corresponding trees, either retrieving an
> existing tree or creating a new one. Then finally create the blob
> "g.txt" and be done with it. This seems rather inefficient, though,
> because each created or updated tree means that all trees way back up
> need to be updated as well, once for every step in the loop.
>
> The other way is to go bottom-up, first creating the blob, then
> creating trees up to the project root folder. But then I don't see a
> way to find which tree elements already exist and need to be updated.
>
> So the only algorithm I can come up with is this:
>   1. walk down the tree with help of the path string to the tree that
> is closest to the file I want to store. On the way remember all the
> trees on the path from INDEX to the resulting file. (In the example
> above I'd like to get the hash of the "a/" tree)
>   2. create the blob (in the example with the context of g.txt)
>   3. create the trees bottom-up until one step before the tree found in
> 1. (in the example create a "e/" tree, containing the "g.txt"'s blob)
>   4. Add the resulting tree from 3. to the one found in 1. and create
> the updated hash
>   5. Now with help of the list from 1. walk the existing trees
> bottom-up and update each one with the new hashes until INDEX is hit.
>   6. Update INDEX.
>
>
> Alltogether the idea of trees looked really simple and recursive which
> makes me quite unhappy with the algorithm I came up with.
>
>
> What is the algorithm to stage single files in Git itself?

You seem to believe that the in-memory representation of trees have to
be the same as the on-disk one. That's simply not true. Git cheats
outrageously with internal formats for pretty much everything in order
to squeeze out more performance.

You also seem to believe that a tree is more than one directory.
That's not true either.

So... Each tree that's updated can (and does) have an in-memory link
to its parent directory. Whenever we update a directory and create a
commit from it, we make sure to write them out from right to left (ie, 
children before parents), so that we're never in a state where a tree on 
disk can't find any of its content blobs in the same object storage.

With that information in hand, I'm sure you can quite easily create an
algorithm that turns out a bit prettier than what you have now.

>
> Also: I thought to myself, why not just make a function that returns
> the relative path to the repository root folder and consider that
> string the file name, drop the idea of trees at all and add the
> information that is traditionally stored in tree objects directly in a
> commit object.

Because commit objects must have parents and things like that. Also,
the idea of trees containing trees saves a *huge* amount of disk I/O
and space when updating a project with many subdirectories.

> Wouldn't that be much simpler and still accomplish the
> same?

Simpler; Yes. More efficient; No.

The commit that changed the tree structure from flat to hierarchical
dates back to April of 2005. It's commit number 25 in the history of
git and solves one of the first real problems from live-testing git
on the kernel repository, which was that tree-files got so huge when
they had to be rewritten in their entirety that creating a new commit
took several seconds.

> I think the idea of keeping information in separate small files
> instead of single big files was dropped at one point anyway, when
> pack-files were introduced.
>

Not really. We strive hard to minimize disk I/O whenever we can. The
packfiles actually reduce I/O, since it lets the kernel use the disk's
builtin caches a lot better, and read-ahead works to our advantage.

Btw, when I say "minimize disk I/O", I really mean "minimize user wait",
although disk I/O is certainly (often) the largest part of that.

-- 
Andreas Ericsson                   andreas.ericsson@op5.se
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231

Considering the successes of the wars on alcohol, poverty, drugs and
terror, I think we should give some serious thought to declaring war
on peace.

next prev parent reply	other threads:[~2013-08-15 12:41 UTC|newest]

Thread overview: 6+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2013-08-15 10:29 Understanding Git Under The Hood: Trees Erik Bernoth
2013-08-15 12:40 ` Andreas Ericsson [this message]
2013-08-15 17:31   ` Junio C Hamano
2013-08-15 19:32     ` Erik Bernoth
2013-08-16  9:12       ` Andreas Ericsson
2013-08-16 13:13         ` Erik Bernoth

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=520CCC53.4090308@op5.se \
    --to=ae@op5.se \
    --cc=erik.bernoth@gmail.com \
    --cc=git@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).