git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Git tree object storing policy
@ 2012-02-21  9:22 Ivan Tolstosheyev
  2012-02-21 10:18 ` Thomas Rast
  0 siblings, 1 reply; 2+ messages in thread
From: Ivan Tolstosheyev @ 2012-02-21  9:22 UTC (permalink / raw)
  To: git

Hello,

now tree object is a simple list of <attributes, hash, name>sorted by name
(tricky sorted, cause we assuming that directory name "$X" is actually "$X/"
in comparison function). The problem is, that if I want to insert 10k files
in empty git repository on / folder there will be 10k new 
trees with sizes from (1 to 10k)*(hash+name+attribute)+eps  .

itroot@localhost ~/tmp> cat git-test.sh 
#!/usr/bin/env bash

git init test
cd test
for i in `seq 1 10000` 
do
touch ${i} ; git add ${i} ; git commit -m "Add ${i}" ;
done
cd ..
du -hs test
itroot@localhost ~/tmp>

itroot@localhost ~/tmp> ./git-test.sh
...
180M	test
itroot@localhost ~/tmp>

180 MB!!!?? and 7.4M after `git gc` - thanks to delta compression!

Ok, you can say that this example is artificial, and I can add 10k files
with 1 commit. Thats true. But manipulating files in big tree objects
(in a big directories) is storage-expensive, and if I need to store a 
lot of files in one directory and frequently change them - git just
don't scales now properly at this use-case.

What do I propose? 
We can add another git object, named for example "btree" , 
that contains another "btree" objects or files.  This will be a simple
btree structure (tree entries sorted practically by name, BTW,
maybe it's time to fix sorting =] ), that allows us to do insertion,
removal, search in ln(n) time. But - we do not have troubles 
with big direcories now. BTW, if all directories are small, btree
will be tree-like - just btree pointing to  files.
So, one big tree with 10k files transforms to (hmm, for example...)
101 btrees - one, pointing to 100 btrees, and thay points to files.
(100 entries per btree is a wild guess =) )

Suggestions?

 

^ permalink raw reply	[flat|nested] 2+ messages in thread

* Re: Git tree object storing policy
  2012-02-21  9:22 Git tree object storing policy Ivan Tolstosheyev
@ 2012-02-21 10:18 ` Thomas Rast
  0 siblings, 0 replies; 2+ messages in thread
From: Thomas Rast @ 2012-02-21 10:18 UTC (permalink / raw)
  To: Ivan Tolstosheyev; +Cc: git

Ivan Tolstosheyev <ivan.tolstosheyev@gmail.com> writes:

> #!/usr/bin/env bash
>
> git init test
> cd test
> for i in `seq 1 10000` 
> do
> touch ${i} ; git add ${i} ; git commit -m "Add ${i}" ;
> done
> cd ..
> du -hs test
[...]
> 180 MB!!!?? and 7.4M after `git gc` - thanks to delta compression!

Most of those 180MB are waste from mostly unused 4KB (presumably) blocks
of your filesystem.  You should be looking at the post-gc'd numbers.

Let's see the breakdown of 'du -h .git':

0       .git/rr-cache
1.5M    .git/logs/refs/heads
1.5M    .git/logs/refs
2.9M    .git/logs
4.0K    .git/objects/info
2.8M    .git/objects/pack
2.8M    .git/objects
0       .git/branches
12K     .git/info
0       .git/remotes
88K     .git/hooks
0       .git/refs/tags
0       .git/refs/heads
0       .git/refs
6.5M    .git

So 2.9MB are git keeping a reflog of everything we did (on HEAD and on
master).  Since merely storing a SHA1 for each of your 10000 operations
already takes 200K, that's not so far off -- the factor of 10 is in the
email, date and log message.

In my case 704K went into the index (not directly visible above, it's
the bulk of the top level).  That's also not unreasonable: merely
storing the object SHA1 (20 bytes) and a bunch of timestamps for 10000
files also gets you into the 500K ballpark.

The pack index amazingly takes only about 500K, even though it is
indexing 10000 trees and 10000 commits, so again the SHA1s alone get you
into the 400K ballpark.

That leaves only 2.3MB for the actual pack (which contains all the
data!).  But every commit must store a tree and a parent, so there are
at least 2*10000*20 = 400K uncompressable bytes in the commits
already[*].  So we are within a factor of 6 of just the data required to
save the shape of your history DAG, no content included.  I'd say that's
not too bad.


[*] This is not quite true, the parents and trees might be pointers
within the pack.  AFAIK the proposed pack v4 format does this, and would
yield a more efficient compression.  So if you're going to waste energy
worrying about this, you should help with pack v4.

-- 
Thomas Rast
trast@{inf,student}.ethz.ch

^ permalink raw reply	[flat|nested] 2+ messages in thread

end of thread, other threads:[~2012-02-21 10:18 UTC | newest]

Thread overview: 2+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2012-02-21  9:22 Git tree object storing policy Ivan Tolstosheyev
2012-02-21 10:18 ` Thomas Rast

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).