Achieving efficient storage of weirdly structured repos

Git development
 help / color / mirror / Atom feed

* Achieving efficient storage of weirdly structured repos
@ 2008-04-03 19:42 Roman Shaposhnik
  2008-04-03 21:11 ` Linus Torvalds
  0 siblings, 1 reply; 14+ messages in thread
From: Roman Shaposhnik @ 2008-04-03 19:42 UTC (permalink / raw)
  To: git

Hi Guys!

First of all, I must admit that I got seriously hooked up on Git when  
I started to
use it internally for FFmpeg development and my life as far as non  
company
related projects are concerned hasn't been the same ever since. The way
Git addresses issues like branches (both local and remote), merges and
content tracking are, in my opinion, quite superior to the  
alternative SCMs
such as Mercurial. In fact, that's why when I saw my coworkers  
struggling
with exactly the same issues around NetBeans Mercurial repository my
natural instinct was to introduce them to Git and show them that life  
doesn't
have to be that complicated or troublesome.

But before I can do that, there's a little issue that I either need  
to address or
at least understand better. It seems that the storage requirements  
for the
repository that a friend of mine and I created from: http:// 
hg.netbeans.org/main/
got doubled under Git:
     $  du -sh Mercurial- nb-main
     491M   Mercurial- nb-main
     $ du -sh Git-nb-main
     1.1G     Git-nb-main

The repository was created using hg2git (the one based on git-fast- 
import)
and it was GC'ed and REPACK'ed just in case. I spent some time trying
to understand the issue and here's some statistics on the kind of  
objects
that this repository consists of:
     75053 commit
   334572 blob
   750003 tree

The last item (trees) also seem to take the most space and the most  
reasonable
explanation that I can offer is that NetBeans repository has a really  
weird
structure where they have approximately 700 (yes, seven hundred!) top- 
level
subdirectories there. They are clearly Submodules-shy, but that's  
another
issue that I will need to address with them.

So based on my preliminary analysis the biggest culprit seems to be
that for every commit at least one pretty hefty tree object needs to  
be created
(remember 700 top level subdirectories). These objects are all in 18-20K
range but they are almost identical to each other except for one or two
trees that they refer to.

So here's my question: is there anything in Git that I can use to  
accommodate
a repository like http://hg.netbeans.org/main in an efficient manner?  
Any
kind of ideas (like particular command line options for repack, etc.)  
would be
greatly appreciated!

Thanks,
Roman.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Achieving efficient storage of weirdly structured repos
  2008-04-03 19:42 Achieving efficient storage of weirdly structured repos Roman Shaposhnik
@ 2008-04-03 21:11 ` Linus Torvalds
  2008-04-04  6:21   ` Jakub Narebski
  2008-04-04 23:30   ` Roman Shaposhnik
  0 siblings, 2 replies; 14+ messages in thread
From: Linus Torvalds @ 2008-04-03 21:11 UTC (permalink / raw)
  To: Roman Shaposhnik; +Cc: git

On Thu, 3 Apr 2008, Roman Shaposhnik wrote:
> 
> The repository was created using hg2git (the one based on git-fast-import)
> and it was GC'ed and REPACK'ed just in case.

Before going any further - exactly _how_ was it repacked?

In particular, when using importers that do partial packing on their own 
(and any "git-fastimport" user is that by definition - and I think 
hg2git does that), at the end of it all you have to make sure to repack in 
a way where the repacking will totally discard the import-time packfiles.

IOW, that's one of the very few times you should use "-f" to git repack.

It's usually also a good place to make sure that since you ignore the old 
packing information, it's best to also make sure that the new packing info 
is good by using a bigger window (and perhaps a bigger depth). That makes 
the packing much slower, of course, but this is meant to be a one-time 
event.

So try something like

	git repack -a -d -f --depth=100 --window=100

if you have a good CPU and plenty of memory.

> The last item (trees) also seem to take the most space and the most 
> reasonable explanation that I can offer is that NetBeans repository has 
> a really weird structure where they have approximately 700 (yes, seven 
> hundred!) top-level subdirectories there. They are clearly 
> Submodules-shy, but that's another issue that I will need to address 
> with them.

Trees taking the biggest amount of space is not unheard of, and it may 
also be that the name heuristics (for finding good packing partners) could 
be failign, which would result in a much bigger pack than necessary. 

So if you already did an aggressive repack like the above, I'd happily 
take a look at whether maybe it's bad heuristics for finding tree objects 
to pair up for delta-compression. Do you have a place where you can put 
that repo for people to clone and look at? 

			Linus

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Achieving efficient storage of weirdly structured repos
  2008-04-03 21:11 ` Linus Torvalds
@ 2008-04-04  6:21   ` Jakub Narebski
  2008-04-04 13:11     ` Nicolas Pitre
  2008-04-04 23:30   ` Roman Shaposhnik
  1 sibling, 1 reply; 14+ messages in thread
From: Jakub Narebski @ 2008-04-04  6:21 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Roman Shaposhnik, git

Linus Torvalds <torvalds@linux-foundation.org> writes:

> On Thu, 3 Apr 2008, Roman Shaposhnik wrote:
>> 
>> The last item (trees) also seem to take the most space and the most 
>> reasonable explanation that I can offer is that NetBeans repository has 
>> a really weird structure where they have approximately 700 (yes, seven 
>> hundred!) top-level subdirectories there. They are clearly 
>> Submodules-shy, but that's another issue that I will need to address 
>> with them.
> 
> Trees taking the biggest amount of space is not unheard of, and it may 
> also be that the name heuristics (for finding good packing partners) could 
> be failign, which would result in a much bigger pack than necessary. 
> 
> So if you already did an aggressive repack like the above, I'd happily 
> take a look at whether maybe it's bad heuristics for finding tree objects 
> to pair up for delta-compression. Do you have a place where you can put 
> that repo for people to clone and look at? 

Hmmm... I wonder if it would be the case that would speed-up
development of pack v4.  If I remember correctly one of bigger changes
was the way trees were represented in pack; the biggest improvement
was for trees.

One of bigger hindrances, as I understand it, in developing pack v4
was the fact that it didn't offer that much of improvement in typical
cases for the work needed... but perhaps "your" repository would be
good showcase for pack v4.

Just my 2 eurocents...

-- 
Jakub Narebski
Poland
ShadeHawk on #git

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Achieving efficient storage of weirdly structured repos
  2008-04-04  6:21   ` Jakub Narebski
@ 2008-04-04 13:11     ` Nicolas Pitre
  2008-04-04 14:16       ` Pieter de Bie
  2008-04-05  3:24       ` Shawn O. Pearce
  0 siblings, 2 replies; 14+ messages in thread
From: Nicolas Pitre @ 2008-04-04 13:11 UTC (permalink / raw)
  To: Jakub Narebski; +Cc: Linus Torvalds, Roman Shaposhnik, git

On Thu, 3 Apr 2008, Jakub Narebski wrote:

> Linus Torvalds <torvalds@linux-foundation.org> writes:
> 
> > On Thu, 3 Apr 2008, Roman Shaposhnik wrote:
> >> 
> >> The last item (trees) also seem to take the most space and the most 
> >> reasonable explanation that I can offer is that NetBeans repository has 
> >> a really weird structure where they have approximately 700 (yes, seven 
> >> hundred!) top-level subdirectories there. They are clearly 
> >> Submodules-shy, but that's another issue that I will need to address 
> >> with them.
> > 
> > Trees taking the biggest amount of space is not unheard of, and it may 
> > also be that the name heuristics (for finding good packing partners) could 
> > be failign, which would result in a much bigger pack than necessary. 
> > 
> > So if you already did an aggressive repack like the above, I'd happily 
> > take a look at whether maybe it's bad heuristics for finding tree objects 
> > to pair up for delta-compression. Do you have a place where you can put 
> > that repo for people to clone and look at? 
> 
> Hmmm... I wonder if it would be the case that would speed-up
> development of pack v4.

Not really.  Pack v4 won't magically shrink a repository to less than 
half the pack v3 size.

I think we're simply facing the same situation as with the initial GCC 
repository which shrank from 3GB down to 300MB or so due to misfitted 
repacking parameters.

> If I remember correctly one of bigger changes
> was the way trees were represented in pack; the biggest improvement
> was for trees.

Yes, but that wasn't really so much about size but rather access speed 
by not deflating them. The pack v4 tree representation would certainly 
help, of course, but I suspect that simply repacking with more 
aggressive window/depth arguments would be even more effective in this 
case.

> One of bigger hindrances, as I understand it, in developing pack v4
> was the fact that it didn't offer that much of improvement in typical
> cases for the work needed... but perhaps "your" repository would be
> good showcase for pack v4.

The biggest hindrance for pack v4 is actually the lack of a native 
runtime tree walking, and having both tree object formats properly and 
optimally abstracted has not been looked at yet.

Speed is the primary goal for pack v4.  The fact that it also provides a 
10% pack reduction is only consequential.  But without native tree 
walking we must recreate the legacy tree format on the fly each time a 
tree object is loaded which dwarfs any improvements pack v4 is aiming 
for (yes it is still a little bit faster than pack v3 nevertheless, but 
not yet significantly enough to overcome the incompatibility costs).

Nicolas (who wishes he was still a student with plenty of hacking time)

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Achieving efficient storage of weirdly structured repos
  2008-04-04 13:11     ` Nicolas Pitre
@ 2008-04-04 14:16       ` Pieter de Bie
  2008-04-05  3:24       ` Shawn O. Pearce
  1 sibling, 0 replies; 14+ messages in thread
From: Pieter de Bie @ 2008-04-04 14:16 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: Jakub Narebski, Linus Torvalds, Roman Shaposhnik, git


On 4 apr 2008, at 15:11, Nicolas Pitre wrote:
> I think we're simply facing the same situation as with the initial GCC
> repository which shrank from 3GB down to 300MB or so due to misfitted
> repacking parameters.

I'd say so too. I imported the first 14000 revisions of that hg  
repository and the resulting pack file was just 25 MB after repacking.

- Pieter

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Achieving efficient storage of weirdly structured repos
  2008-04-03 21:11 ` Linus Torvalds
  2008-04-04  6:21   ` Jakub Narebski
@ 2008-04-04 23:30   ` Roman Shaposhnik
  2008-04-04 23:57     ` Linus Torvalds
  1 sibling, 1 reply; 14+ messages in thread
From: Roman Shaposhnik @ 2008-04-04 23:30 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: git

Hi Linus!

On Thu, 2008-04-03 at 14:11 -0700, Linus Torvalds wrote:
> 
> On Thu, 3 Apr 2008, Roman Shaposhnik wrote:
> > 
> > The repository was created using hg2git (the one based on git-fast-import)
> > and it was GC'ed and REPACK'ed just in case.
> 
> Before going any further - exactly _how_ was it repacked?

I believe it was the following two steps:
   $ git gc --aggressive
   $ git repack

> In particular, when using importers that do partial packing on their own 
> (and any "git-fastimport" user is that by definition - and I think 
> hg2git does that), at the end of it all you have to make sure to repack in 
> a way where the repacking will totally discard the import-time packfiles.

Good point. Speaking of which: do you have an FAQ for importers? The
entries in the official FAQ (http://git.or.cz/gitwiki/GitFaq#head-929a8825d04dde226c2530f5337d3b3ed8dcc7ce)
seem a bit stale for such an important issue. After all, importing from
an existing SCM is what usually forms a first time impression of Git's
effectiveness.

> IOW, that's one of the very few times you should use "-f" to git repack.

Got it!

> It's usually also a good place to make sure that since you ignore the old 
> packing information, it's best to also make sure that the new packing info 
> is good by using a bigger window (and perhaps a bigger depth). That makes 
> the packing much slower, of course, but this is meant to be a one-time 
> event.
> 
> So try something like
> 
> 	git repack -a -d -f --depth=100 --window=100
> 
> if you have a good CPU and plenty of memory.

That turned out to be a perfect suggestion. Thank you. I'm now the
happiest camper ever. And I'm also also pretty dumbfounded ;-)

Here's what happened. 

I started with a a repository filled with "loose" (one object per file)
objects (the reason I needed it was for the ease of sleuthing through
individual objects and it was created by git-unpack-objects from that
initial 1.1Gb pack). And I tried to pack it exactly like you
suggested:
   $ git-pack-objects --depth=100 --window=100 --delta-base-offset --progress pack < objects
   Generating pack...
   Counting objects: 1096305
   Done counting 1159628 objects.
   Deltifying 1159628 objects...
      100% (1159628/1159628) done
   Writing 1159628 objects...
   dd134c407324dc55b0cd2aa3a9e1b3420c2bba3f

   Total 1159628 (delta 386980), reused 0 (delta 0)

and it payed off reasonably well:
    $ du -s NB-clone
    670M NB-clone

It still was bigger than the Mercurial repository but at least it got
2 times smaller than the original result of hg2git. Now, if it wasn't
for a friend of mine, I probably would've stopped there. But he
showed up and saved the day ;-) His comments made me try something
that I didn't consider to be of any use -- repacking a freshly packed
pack with the *same* --depth=100 --window=100:
    $ git repack -a  -f --window=100 --depth=100 
    Generating pack...
    Counting objects: 1056829
    Done counting 1159628 objects.
    Deltifying 1159628 objects...
       100% (1159628/1159628) done
    Writing 1159628 objects...
       100% (1159628/1159628) done
    Total 1159628 (delta 614516), reused 0 (delta 0)
    Pack pack-dd134c407324dc55b0cd2aa3a9e1b3420c2bba3f created.
And then, a miracle occurred:
     $ du -sh NB-small 
     268M NB-small

Now, don't get me wrong: I'm as happy as a clam. The repository is now
*smaller* than the Mercurial's and because the structure of the
tree is so weird Git gets major points here. The only question that
is still bothering me is: how did it happen? Why did repacking 
a repository with exactly the same set of objects and the only
difference being where these objects resided (former case filesystem,
the later case an intermediate pack) made so huge a difference?

Please help!

> > The last item (trees) also seem to take the most space and the most 
> > reasonable explanation that I can offer is that NetBeans repository has 
> > a really weird structure where they have approximately 700 (yes, seven 
> > hundred!) top-level subdirectories there. They are clearly 
> > Submodules-shy, but that's another issue that I will need to address 
> > with them.
> 
> Trees taking the biggest amount of space is not unheard of, and it may 
> also be that the name heuristics (for finding good packing partners) could 
> be failign, which would result in a much bigger pack than necessary. 

Is there any documentation that describes the heuristics involved in
creating a pack?

> So if you already did an aggressive repack like the above, I'd happily 
> take a look at whether maybe it's bad heuristics for finding tree objects 
> to pair up for delta-compression. Do you have a place where you can put 
> that repo for people to clone and look at?

Unfortunately I don't. The only thing I can do is I can always create
a *.tar.bz2 and put and on  Sun's ftp server. Actually, that makes me
wonder: is there any public Git hosting available such that publishing
a hefty repository for the forensic purposes only wouldn't violate their
terms of use?

Thanks,
Roman.

P.S. Oh, and here's one extra tiny question that I also have: what
does the output:
   Total 1159628 (delta 614516), reused 0 (delta 0)
really mean?

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Achieving efficient storage of weirdly structured repos
  2008-04-04 23:30   ` Roman Shaposhnik
@ 2008-04-04 23:57     ` Linus Torvalds
  2008-04-06  0:13       ` Roman Shaposhnik
  0 siblings, 1 reply; 14+ messages in thread
From: Linus Torvalds @ 2008-04-04 23:57 UTC (permalink / raw)
  To: Roman Shaposhnik; +Cc: git

On Fri, 4 Apr 2008, Roman Shaposhnik wrote:
> 
> That turned out to be a perfect suggestion. Thank you. I'm now the
> happiest camper ever. And I'm also also pretty dumbfounded ;-)

Ok, the dumbfounded we can help with.

> Here's what happened. 
> 
> I started with a a repository filled with "loose" (one object per file)
> objects (the reason I needed it was for the ease of sleuthing through
> individual objects and it was created by git-unpack-objects from that
> initial 1.1Gb pack). And I tried to pack it exactly like you
> suggested:
>    $ git-pack-objects --depth=100 --window=100 --delta-base-offset --progress pack < objects

Well, that's not exactly like I suggested, that's a *really* old-fashioned 
way.

But the part you left off was what the "objects" file contained?

In particular, "git pack-objects" can take just a raw list of objects, and 
it will *work*, but without the naming information and without the 
ordering information on the objects, the end result will generally suck.

So how did you generate the "objects" file? You can do it by just listing 
every single loose object you have, and it will work, but the end result 
won't be very pretty.

>    Total 1159628 (delta 386980), reused 0 (delta 0)
> 
> and it payed off reasonably well:
>     $ du -s NB-clone
>     670M NB-clone

So what git-repack-objects will do is that it will *try* to figure out a 
good and likely pairing of objects, and it uses the type and the size of 
the objectf rot it, but it obviously didn't find very good deltas. In 
fact, only about 35% of all objects are deltas at all!

Which is why I suspect that your object list was just the raw list of 
objects.

> His comments made me try something that I didn't consider to be of any 
> use -- repacking a freshly packed pack with the *same* --depth=100 
> --window=100:
>     $ git repack -a  -f --window=100 --depth=100 

Well, "git repack" doesn't just list the objects and then repack them, it 
also

 - lists them with the filename they were reachable from (if they are 
   trees or blobs)
 - sorts them topologically (ie most reachable objects first).

And that first thing will generate a *much* better packing, because now 
the packing algorithm can look at the filename, and generate a much better 
guess about which objects to try to delta against each others. And now you 
have

>     Total 1159628 (delta 614516), reused 0 (delta 0)

Ie you got a much better 60% delta ratio, and obviously a much smaller 
pack. But it's not just smaller, because the resulting pack will also have 
the objects in topological order, so that objects that are "new" 
(ie more closely reachable from the top-of-tree) will be at the front of 
the pack, so you'll also generally have better IO patterns in the packfile 
itself.

> Now, don't get me wrong: I'm as happy as a clam. The repository is now
> *smaller* than the Mercurial's and because the structure of the
> tree is so weird Git gets major points here. The only question that
> is still bothering me is: how did it happen? Why did repacking 
> a repository with exactly the same set of objects and the only
> difference being where these objects resided (former case filesystem,
> the later case an intermediate pack) made so huge a difference?

That location thing really shouldn't have mattered (since you used "-f", 
which throws it away).  So I think it was all from how you generated the 
list of objects. But since I don't know how you did that, I cannot 
guarantee anything.

> Is there any documentation that describes the heuristics involved in
> creating a pack?

It's been explained a few times on the mailing list, but I don't know if 
there is some write-up.

The git pack format is a bit more relaxed than any other SCM format I know 
about, since it literally is just a "bunch of objects with deltas randomly 
between them". So a pack can look just about any way it wants - there are 
no real ordering requirements (--delta-base-offset does require that the 
delta follows the base, but even that is really just a trivial practical 
"because otherwise you could never know what the offset in the pack-file 
was" issue rather than anything else).

So you can order the objects and make deltas just about any way you want. 
What "git repack" will do is to sort objects by <type, namehash, length>, 
and then walk the list with the window of the specified size to see the 
best pairing it can do.

The "namehash" is just designed so that files that have the same name sort 
together (and it's also not a real "hash" - it's really a meaningless 
number, but one that is designed so that even if the names aren't exactly 
the same, they sort closer together if they end in similar character 
sequences - so a file that moves from one directory to another but keeps 
the same basename will sort source and destination together).

See "type_size_sort()" in builtin-pack-objects.c to see what's up.

(The preferred_base thing is just to keep objects we actually want to 
*keep* in the pack separated from the objects we may be using for 
references - but that is only used for transferring objects over the wire, 
never for standalone packs).

Oh. And now I notice that there is *some* documentation on this in 

	Documentation/technical/pack-heuristics.txt

but that's pretty old (not that I think it has really changed) and not 
very deep. Just taken from some #irc discussion.

> P.S. Oh, and here's one extra tiny question that I also have: what
> does the output:
>    Total 1159628 (delta 614516), reused 0 (delta 0)
> really mean?

It just means that there were a million objects total, the pack was 
created with 614k of them as deltas against other objects, and none of 
them were re-used from previous packs.

You'll see that when you do incremental repacks, 99% of all the objects 
and deltas get re-used from the old pack, which is how we can make 
repacking fairly efficient. That's imporant because the native git 
protocol format itself is just a pack-file (plus the handshake to figure 
out what both sides have, of course). So a "git clone" implies a repack.

		Linus

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Achieving efficient storage of weirdly structured repos
  2008-04-04 13:11     ` Nicolas Pitre
  2008-04-04 14:16       ` Pieter de Bie
@ 2008-04-05  3:24       ` Shawn O. Pearce
  1 sibling, 0 replies; 14+ messages in thread
From: Shawn O. Pearce @ 2008-04-05  3:24 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: Jakub Narebski, Linus Torvalds, Roman Shaposhnik, git

Nicolas Pitre <nico@cam.org> wrote:
> On Thu, 3 Apr 2008, Jakub Narebski wrote:
> 
> > One of bigger hindrances, as I understand it, in developing pack v4
> > was the fact that it didn't offer that much of improvement in typical
> > cases for the work needed... but perhaps "your" repository would be
> > good showcase for pack v4.
> 
> The biggest hindrance for pack v4 is actually the lack of a native 
> runtime tree walking, and having both tree object formats properly and 
> optimally abstracted has not been looked at yet.
> 
> Speed is the primary goal for pack v4.  The fact that it also provides a 
> 10% pack reduction is only consequential.  But without native tree 
> walking we must recreate the legacy tree format on the fly each time a 
> tree object is loaded which dwarfs any improvements pack v4 is aiming 
> for (yes it is still a little bit faster than pack v3 nevertheless, but 
> not yet significantly enough to overcome the incompatibility costs).

Even though we don't have native tree walking, I think the right
way to do this is to put in pack v4 with "canonical tree, canonical
commit" mode, where it inflates its native tree/commit encoding
into the canonical forms, then come back later with native walking.

Canonical mode is still faster than pack v2 inflate is for these
types, so it does (slightly) boost rev-list performance.  It might
chop a solid 30% off the CPU time jgit spends in its equivilant of
revision.c, and that's without teaching jgit to use the native pack
v4 encoding directly.

Once we have it in we can experiment with the necessary abstractions
to handle the two different available encodings, and allowing
higher level code to switch back and forth between them as objects
come from loose or pack v2, and from pack v4.  One of the things we
wanted to do was boost path limiter performance by matching on tree
name ids when walking a pack v4 native tree, but fall back to the
string based memcmp when walking a canonical tree.  That won't be
easy to design without the two different encodings being available
at the lower level in sha1_file.c.

Just my rapidly declining .02 bush peso.

> Nicolas (who wishes he was still a student with plenty of hacking time)

Don't we all.  :-)

-- 
Shawn.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Achieving efficient storage of weirdly structured repos
  2008-04-04 23:57     ` Linus Torvalds
@ 2008-04-06  0:13       ` Roman Shaposhnik
  2008-04-06  0:48         ` Linus Torvalds
  0 siblings, 1 reply; 14+ messages in thread
From: Roman Shaposhnik @ 2008-04-06  0:13 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: git

Hi Linus,

On Apr 4, 2008, at 4:57 PM, Linus Torvalds wrote:
> On Fri, 4 Apr 2008, Roman Shaposhnik wrote:
>>
>> That turned out to be a perfect suggestion. Thank you. I'm now the
>> happiest camper ever. And I'm also also pretty dumbfounded ;-)
>
> Ok, the dumbfounded we can help with.

Great!

>> Here's what happened.
>>
>> I started with a a repository filled with "loose" (one object per  
>> file)
>> objects (the reason I needed it was for the ease of sleuthing through
>> individual objects and it was created by git-unpack-objects from that
>> initial 1.1Gb pack). And I tried to pack it exactly like you
>> suggested:
>>    $ git-pack-objects --depth=100 --window=100 --delta-base-offset  
>> --progress pack < objects
>
> Well, that's not exactly like I suggested, that's a *really* old- 
> fashioned
> way.

It sure is old-fashioned. My only excuse is that since git repack is  
a shell
script around git-pack-objects I've always felt comfortable just using
the lower level utility.

> But the part you left off was what the "objects" file contained?

It was pretty much the result of (cd .git/objects ; find . -type f |  
tr -d './')
Omitting that was quite silly on my part, but I was really convinced
that because of the internal sorting the original ordering of objects
wouldn't matter at all.

> In particular, "git pack-objects" can take just a raw list of  
> objects, and
> it will *work*, but without the naming information and without the
> ordering information on the objects, the end result will generally  
> suck.
>
> So how did you generate the "objects" file? You can do it by just  
> listing
> every single loose object you have, and it will work, but the end  
> result
> won't be very pretty.

So it seems that my list of objects was different from what git-rev- 
list/setup_revisions()
would have provided in two ways:
     1. the order of objects was arbitrary
     2. the naming of blobs and trees was missing
#2 was, indeed, a huge oversight on my part. At the same time, I've  
always thought
that #1 shouldn't matter, because, as you pointed out, the objects get
sorted by <type, namehash, length> anyway. However, it seems that  
because
of the lack of naming there was much less sorting done by  
type_size_sort()
and the original order persevered (at least within type-size  
partitions).
It did matter after all!

Now, in my particular case, the ordering was braindead and it  
resulted in
a highly visible inefficiency. On the other hand, it seems that creative
ordering could very well be used to exploit localities which go beyond
how default name hashing in builtin-pack-objects.c works.

In fact, it starts to look awfully like custom MPEG profiles where if  
you know
your footage you can achieve a much higher compression ratio
compared to what the default might give you. It also makes it very
clear that Git's approach of doing away with per-file content tracking
is quite superior to the in-file deltas. Cool!

Here's my final question on that issue: wouldn't it be great to give  
users
a direct control over specifying the list of objects in exactly the  
order
they would like them to be tried for deltifying? Something like -- 
preserve-order
option available for git-pack-objects?

>>     Total 1159628 (delta 614516), reused 0 (delta 0)
>
> Ie you got a much better 60% delta ratio, and obviously a much smaller
> pack. But it's not just smaller, because the resulting pack will  
> also have
> the objects in topological order, so that objects that are "new"
> (ie more closely reachable from the top-of-tree) will be at the  
> front of
> the pack, so you'll also generally have better IO patterns in the  
> packfile
> itself.

Got it! This makes total sense.

>> Is there any documentation that describes the heuristics involved in
>> creating a pack?
>
> It's been explained a few times on the mailing list, but I don't  
> know if
> there is some write-up.
>
> The git pack format is a bit more relaxed than any other SCM format  
> I know
> about, since it literally is just a "bunch of objects with deltas  
> randomly
> between them". So a pack can look just about any way it wants -  
> there are
> no real ordering requirements (--delta-base-offset does require  
> that the
> delta follows the base, but even that is really just a trivial  
> practical
> "because otherwise you could never know what the offset in the pack- 
> file
> was" issue rather than anything else).
>
> So you can order the objects and make deltas just about any way you  
> want.
> What "git repack" will do is to sort objects by <type, namehash,  
> length>,
> and then walk the list with the window of the specified size to see  
> the
> best pairing it can do.
>
> The "namehash" is just designed so that files that have the same  
> name sort
> together (and it's also not a real "hash" - it's really a meaningless
> number, but one that is designed so that even if the names aren't  
> exactly
> the same, they sort closer together if they end in similar character
> sequences - so a file that moves from one directory to another but  
> keeps
> the same basename will sort source and destination together).
>
> See "type_size_sort()" in builtin-pack-objects.c to see what's up.
>
> (The preferred_base thing is just to keep objects we actually want to
> *keep* in the pack separated from the objects we may be using for
> references - but that is only used for transferring objects over  
> the wire,
> never for standalone packs).
>
> Oh. And now I notice that there is *some* documentation on this in
>
> 	Documentation/technical/pack-heuristics.txt
>
> but that's pretty old (not that I think it has really changed) and not
> very deep. Just taken from some #irc discussion.
>
>> P.S. Oh, and here's one extra tiny question that I also have: what
>> does the output:
>>    Total 1159628 (delta 614516), reused 0 (delta 0)
>> really mean?
>
> It just means that there were a million objects total, the pack was
> created with 614k of them as deltas against other objects, and none of
> them were re-used from previous packs.
>
> You'll see that when you do incremental repacks, 99% of all the  
> objects
> and deltas get re-used from the old pack, which is how we can make
> repacking fairly efficient. That's imporant because the native git
> protocol format itself is just a pack-file (plus the handshake to  
> figure
> out what both sides have, of course). So a "git clone" implies a  
> repack.

Very nice summary! Many, many thanks for providing it!

Thanks,
Roman.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Achieving efficient storage of weirdly structured repos
  2008-04-06  0:13       ` Roman Shaposhnik
@ 2008-04-06  0:48         ` Linus Torvalds
  2008-04-06 16:10           ` Jeff King
  0 siblings, 1 reply; 14+ messages in thread
From: Linus Torvalds @ 2008-04-06  0:48 UTC (permalink / raw)
  To: Roman Shaposhnik; +Cc: git

On Sat, 5 Apr 2008, Roman Shaposhnik wrote:
> 
> It was pretty much the result of (cd .git/objects ; find . -type f | tr -d
> './')

Ok, that explains it. And yes, it definitely works, but I think you now 
understand why it gave you that oddly pessimised pack-file.

That said, I think it's also a really good example of how the git 
pack-files really are just a "bag of objects", and the fact that it worked 
but was non-optimal is a very good way to show something very fundamental 
about git.

> So it seems that my list of objects was different from what
> git-rev-list/setup_revisions()
> would have provided in two ways:
>    1. the order of objects was arbitrary
>    2. the naming of blobs and trees was missing
> #2 was, indeed, a huge oversight on my part. At the same time, I've always
> thought
> that #1 shouldn't matter, because, as you pointed out, the objects get
> sorted by <type, namehash, length> anyway. However, it seems that because
> of the lack of naming there was much less sorting done by type_size_sort()
> and the original order persevered (at least within type-size partitions).
> It did matter after all!

Well, a pack actually as two *different* orderings, in that there's one 
ordering that is used for laying out the result in the pack-file, and 
another ordering that is used for actually finding the deltas with the 
sliding window.

And the input order is actually used for the layout ordering.

And both matter. The  <type, namehash, length> is the one that determines 
how large the resulting pack is, but the layout order is what causes the 
IO patterns for most common uses, so if the pack-file is cold in the 
cache, it can matter quite a bit for performance.

So the layout order is the one you give as input to "git pack-objects". 
And if that input has no particular ordering, then the final layout will 
also have no particular ordering and you get random IO patterns when 
loading from disk.

> Now, in my particular case, the ordering was braindead and it resulted in
> a highly visible inefficiency. On the other hand, it seems that creative
> ordering could very well be used to exploit localities which go beyond
> how default name hashing in builtin-pack-objects.c works.

Yes. We've changed some of the heuristics subtly over time (eg the whole 
name-hashing was a tweak that was added later), but you're absolutely 
right - it's an area where some *particular* usage model could come up 
with a special ordering to optimize a some special load. The defaults are 
meant to be "good enough", and there has been some thought put into them, 
but yes, there's probably room for specialization there.

In particular, one thing I've wanted to do is to do some "fingerprint 
hash" and perhaps use that as a sorting guide for finding deltas even more 
aggressively than we do now, but I've never really found the energy to try 
something like that out.

> In fact, it starts to look awfully like custom MPEG profiles where if you know
> your footage you can achieve a much higher compression ratio
> compared to what the default might give you. It also makes it very
> clear that Git's approach of doing away with per-file content tracking
> is quite superior to the in-file deltas. Cool!

Well, one thing that git doesn't do, but that I find intriguing is if it 
could be possible to do deltas not even between different files, but even 
*across* files. 

One thing that the git model sucks at is how it's not very good at 
handling large objects. I've often wondered if I should have made "object" 
be more fine-grained and tried to build up large files from multiple 
smaller objects.

[ That said, I think git does the right thing - for source code. The 
  blocking-up of files would cause a rather more complex model, and one of 
  the great things about git is how simple the basic model is. But the
  large-file thing does mean that git potentially sucks really badly for 
  some other loads ]

> Here's my final question on that issue: wouldn't it be great to give users
> a direct control over specifying the list of objects in exactly the order
> they would like them to be tried for deltifying? Something like
> --preserve-order option available for git-pack-objects?

See above: there's actually two orders, so you'd need to specify *both* 
orders some way, since the input order already has meaning (it's assumed 
to be the "topological ordering".

You *can* approximate what you descibe by playing games with the filenames 
(which you can give in the input too - they're just strings on the same 
line as the SHA1), and that would be a total hack to give that secondary 
ordering. And yes, I agree that it might be cool to do this explicitly 
some way. If only to give a way to test experimental versions of the type 
sorting.

[ To see how the filename thing works, do

	git rev-list --objects --all > object-list
	git pack-objects ... < object-list

  and you can look at the "object-list" file to see how it's not just the 
  list of SHA1, it now has the filename info in it too. Imagine changing 
  that filename to generate a "custom" pack order ]

				Linus

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Achieving efficient storage of weirdly structured repos
  2008-04-06  0:48         ` Linus Torvalds
@ 2008-04-06 16:10           ` Jeff King
  2008-04-07  0:13             ` Nicolas Pitre
  0 siblings, 1 reply; 14+ messages in thread
From: Jeff King @ 2008-04-06 16:10 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Roman Shaposhnik, git

On Sat, Apr 05, 2008 at 05:48:43PM -0700, Linus Torvalds wrote:

> One thing that the git model sucks at is how it's not very good at 
> handling large objects. I've often wondered if I should have made "object" 
> be more fine-grained and tried to build up large files from multiple 
> smaller objects.
> 
> [ That said, I think git does the right thing - for source code. The 
>   blocking-up of files would cause a rather more complex model, and one of 
>   the great things about git is how simple the basic model is. But the
>   large-file thing does mean that git potentially sucks really badly for 
>   some other loads ]

I have considered something like this for one of my repos, which is full
of images. The large image data very rarely changes, but the small EXIF
tags do.

My thought was something like:

  - add a new object type, multiblob; a multiblob contains zero or more
    "child" sha1s, each of which is another multiblob or a blob. The
    data in the multiblob is an in-order concatenation of its children.

  - you would create multiblobs with a "smart" git-add that understands
    the filetype and splits the file accordingly (in my case, probably a
    chunk of headers and EXIF data, and then a chunk with the image
    data).

  - in most of git, whenever you need a blob, you just "unwrap" the
    multiblob to get the original blob data

  - because they're separate objects, pack-objects automagically does
    the right thing

  - a few places would benefit from handling multiblobs specially. In
    particular:
      - the diff machinery could do much more efficient comparisons for
        some inexact renames. E.g., multiblob "1234\n5678" and multiblob
        "abcd\n5678" could ignore the "5678" id.
      - the diff machinery could show diffs that were more human
        readable (e.g., even without understanding what the chunks of
        the multiblob _mean_, it can still say "most of this image
        didn't change, but this textual part did").

Of course there are a few drawbacks:

  - one of git's strengths is that content is the same no matter who
    adds it or how. Now the same file has a different sha1 as a
    multiblob versus a regular blob.

  - it breaks the git model of "we store state in the simplest way, and
    figure everything out afterwards." IOW, you are stuck with whatever
    crappy multiblob split you did when you added or updated the file.
    The usual pattern in git is "dumb add, smart view". Now maybe it is
    worth breaking this for two reasons:

      - dumb add, smart view is often very resource intensive; we can
        get smaller packs and faster rename detection out of this

      - we might be losing information; in the case of renames, we can
        justify not explicitly recording because we can figure out
        later what actually happened. I don't know if there is a
        multiblob split that would encapsulate useful user input.
        My EXIF example doesn't; with a little more CPU time, you could
        just do the automated split at diff or delta time.

So it's an approach that I think would work, but I'm not sure it's worth
the effort unless somebody comes up with a compelling reason that you
can't just split the blobs up after the fact (and maybe the right
approach is that pack v5 can split blobs intelligently to get better
deltas, so they are still blobs, but we just store them differently).

-Peff

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Achieving efficient storage of weirdly structured repos
  2008-04-06 16:10           ` Jeff King
@ 2008-04-07  0:13             ` Nicolas Pitre
  2008-04-07  0:18               ` Jeff King
  0 siblings, 1 reply; 14+ messages in thread
From: Nicolas Pitre @ 2008-04-07  0:13 UTC (permalink / raw)
  To: Jeff King; +Cc: Linus Torvalds, Roman Shaposhnik, git

On Sun, 6 Apr 2008, Jeff King wrote:

> My thought was something like:
> 
>   - add a new object type, multiblob; a multiblob contains zero or more
>     "child" sha1s, each of which is another multiblob or a blob. The
>     data in the multiblob is an in-order concatenation of its children.
> 
>   - you would create multiblobs with a "smart" git-add that understands
>     the filetype and splits the file accordingly (in my case, probably a
>     chunk of headers and EXIF data, and then a chunk with the image
>     data).

Well, in your example, the large image part should already be common to 
many objects due to deltas if they're really the same: different objects 
will only have different EXIF data plus a delta reference to the same 
base image object. So in a way the split is already there.  Needs only 
that some applications exploit this information at runtime.


Nicolas

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Achieving efficient storage of weirdly structured repos
  2008-04-07  0:13             ` Nicolas Pitre
@ 2008-04-07  0:18               ` Jeff King
  2008-04-07  0:36                 ` Nicolas Pitre
  0 siblings, 1 reply; 14+ messages in thread
From: Jeff King @ 2008-04-07  0:18 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: Linus Torvalds, Roman Shaposhnik, git

On Sun, Apr 06, 2008 at 08:13:10PM -0400, Nicolas Pitre wrote:

> Well, in your example, the large image part should already be common to 
> many objects due to deltas if they're really the same: different objects 
> will only have different EXIF data plus a delta reference to the same 
> base image object. So in a way the split is already there.  Needs only 
> that some applications exploit this information at runtime.

Yes, the resulting packfiles find the deltas and are pretty efficient
(although it is quite slow to pack).  However, the delta information is
not used at all for inexact rename detection. Are you proposing to make
that information available to the rename detector?

-Peff

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Achieving efficient storage of weirdly structured repos
  2008-04-07  0:18               ` Jeff King
@ 2008-04-07  0:36                 ` Nicolas Pitre
  0 siblings, 0 replies; 14+ messages in thread
From: Nicolas Pitre @ 2008-04-07  0:36 UTC (permalink / raw)
  To: Jeff King; +Cc: Linus Torvalds, Roman Shaposhnik, git

On Sun, 6 Apr 2008, Jeff King wrote:

> On Sun, Apr 06, 2008 at 08:13:10PM -0400, Nicolas Pitre wrote:
> 
> > Well, in your example, the large image part should already be common to 
> > many objects due to deltas if they're really the same: different objects 
> > will only have different EXIF data plus a delta reference to the same 
> > base image object. So in a way the split is already there.  Needs only 
> > that some applications exploit this information at runtime.
> 
> Yes, the resulting packfiles find the deltas and are pretty efficient
> (although it is quite slow to pack).  However, the delta information is
> not used at all for inexact rename detection. Are you proposing to make
> that information available to the rename detector?

In practice I don't know how well that would work since the 
current heuristic groups deltas and their 
base according to the name under which those objects are known.  So it 
is possible that some inexact renames end up creating objects that 
currently never delta against each other even if that would be the right 
thing to do.

But in some cases, that might be beneficial to look at the delta object 
themselves when diffing files as the delta might already contain the 
information telling the upper layer that file A and B are in fact 90% 
the same and that they differ from offset X to Y only.

Nicolas

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2008-04-07  0:37 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2008-04-03 19:42 Achieving efficient storage of weirdly structured repos Roman Shaposhnik
2008-04-03 21:11 ` Linus Torvalds
2008-04-04  6:21   ` Jakub Narebski
2008-04-04 13:11     ` Nicolas Pitre
2008-04-04 14:16       ` Pieter de Bie
2008-04-05  3:24       ` Shawn O. Pearce
2008-04-04 23:30   ` Roman Shaposhnik
2008-04-04 23:57     ` Linus Torvalds
2008-04-06  0:13       ` Roman Shaposhnik
2008-04-06  0:48         ` Linus Torvalds
2008-04-06 16:10           ` Jeff King
2008-04-07  0:13             ` Nicolas Pitre
2008-04-07  0:18               ` Jeff King
2008-04-07  0:36                 ` Nicolas Pitre

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox