* Achieving efficient storage of weirdly structured repos
@ 2008-04-03 19:42 Roman Shaposhnik
2008-04-03 21:11 ` Linus Torvalds
0 siblings, 1 reply; 14+ messages in thread
From: Roman Shaposhnik @ 2008-04-03 19:42 UTC (permalink / raw)
To: git
Hi Guys!
First of all, I must admit that I got seriously hooked up on Git when
I started to
use it internally for FFmpeg development and my life as far as non
company
related projects are concerned hasn't been the same ever since. The way
Git addresses issues like branches (both local and remote), merges and
content tracking are, in my opinion, quite superior to the
alternative SCMs
such as Mercurial. In fact, that's why when I saw my coworkers
struggling
with exactly the same issues around NetBeans Mercurial repository my
natural instinct was to introduce them to Git and show them that life
doesn't
have to be that complicated or troublesome.
But before I can do that, there's a little issue that I either need
to address or
at least understand better. It seems that the storage requirements
for the
repository that a friend of mine and I created from: http://
hg.netbeans.org/main/
got doubled under Git:
$ du -sh Mercurial- nb-main
491M Mercurial- nb-main
$ du -sh Git-nb-main
1.1G Git-nb-main
The repository was created using hg2git (the one based on git-fast-
import)
and it was GC'ed and REPACK'ed just in case. I spent some time trying
to understand the issue and here's some statistics on the kind of
objects
that this repository consists of:
75053 commit
334572 blob
750003 tree
The last item (trees) also seem to take the most space and the most
reasonable
explanation that I can offer is that NetBeans repository has a really
weird
structure where they have approximately 700 (yes, seven hundred!) top-
level
subdirectories there. They are clearly Submodules-shy, but that's
another
issue that I will need to address with them.
So based on my preliminary analysis the biggest culprit seems to be
that for every commit at least one pretty hefty tree object needs to
be created
(remember 700 top level subdirectories). These objects are all in 18-20K
range but they are almost identical to each other except for one or two
trees that they refer to.
So here's my question: is there anything in Git that I can use to
accommodate
a repository like http://hg.netbeans.org/main in an efficient manner?
Any
kind of ideas (like particular command line options for repack, etc.)
would be
greatly appreciated!
Thanks,
Roman.
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: Achieving efficient storage of weirdly structured repos
2008-04-03 19:42 Achieving efficient storage of weirdly structured repos Roman Shaposhnik
@ 2008-04-03 21:11 ` Linus Torvalds
2008-04-04 6:21 ` Jakub Narebski
2008-04-04 23:30 ` Roman Shaposhnik
0 siblings, 2 replies; 14+ messages in thread
From: Linus Torvalds @ 2008-04-03 21:11 UTC (permalink / raw)
To: Roman Shaposhnik; +Cc: git
On Thu, 3 Apr 2008, Roman Shaposhnik wrote:
>
> The repository was created using hg2git (the one based on git-fast-import)
> and it was GC'ed and REPACK'ed just in case.
Before going any further - exactly _how_ was it repacked?
In particular, when using importers that do partial packing on their own
(and any "git-fastimport" user is that by definition - and I think
hg2git does that), at the end of it all you have to make sure to repack in
a way where the repacking will totally discard the import-time packfiles.
IOW, that's one of the very few times you should use "-f" to git repack.
It's usually also a good place to make sure that since you ignore the old
packing information, it's best to also make sure that the new packing info
is good by using a bigger window (and perhaps a bigger depth). That makes
the packing much slower, of course, but this is meant to be a one-time
event.
So try something like
git repack -a -d -f --depth=100 --window=100
if you have a good CPU and plenty of memory.
> The last item (trees) also seem to take the most space and the most
> reasonable explanation that I can offer is that NetBeans repository has
> a really weird structure where they have approximately 700 (yes, seven
> hundred!) top-level subdirectories there. They are clearly
> Submodules-shy, but that's another issue that I will need to address
> with them.
Trees taking the biggest amount of space is not unheard of, and it may
also be that the name heuristics (for finding good packing partners) could
be failign, which would result in a much bigger pack than necessary.
So if you already did an aggressive repack like the above, I'd happily
take a look at whether maybe it's bad heuristics for finding tree objects
to pair up for delta-compression. Do you have a place where you can put
that repo for people to clone and look at?
Linus
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: Achieving efficient storage of weirdly structured repos
2008-04-03 21:11 ` Linus Torvalds
@ 2008-04-04 6:21 ` Jakub Narebski
2008-04-04 13:11 ` Nicolas Pitre
2008-04-04 23:30 ` Roman Shaposhnik
1 sibling, 1 reply; 14+ messages in thread
From: Jakub Narebski @ 2008-04-04 6:21 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Roman Shaposhnik, git
Linus Torvalds <torvalds@linux-foundation.org> writes:
> On Thu, 3 Apr 2008, Roman Shaposhnik wrote:
>>
>> The last item (trees) also seem to take the most space and the most
>> reasonable explanation that I can offer is that NetBeans repository has
>> a really weird structure where they have approximately 700 (yes, seven
>> hundred!) top-level subdirectories there. They are clearly
>> Submodules-shy, but that's another issue that I will need to address
>> with them.
>
> Trees taking the biggest amount of space is not unheard of, and it may
> also be that the name heuristics (for finding good packing partners) could
> be failign, which would result in a much bigger pack than necessary.
>
> So if you already did an aggressive repack like the above, I'd happily
> take a look at whether maybe it's bad heuristics for finding tree objects
> to pair up for delta-compression. Do you have a place where you can put
> that repo for people to clone and look at?
Hmmm... I wonder if it would be the case that would speed-up
development of pack v4. If I remember correctly one of bigger changes
was the way trees were represented in pack; the biggest improvement
was for trees.
One of bigger hindrances, as I understand it, in developing pack v4
was the fact that it didn't offer that much of improvement in typical
cases for the work needed... but perhaps "your" repository would be
good showcase for pack v4.
Just my 2 eurocents...
--
Jakub Narebski
Poland
ShadeHawk on #git
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: Achieving efficient storage of weirdly structured repos
2008-04-04 6:21 ` Jakub Narebski
@ 2008-04-04 13:11 ` Nicolas Pitre
2008-04-04 14:16 ` Pieter de Bie
2008-04-05 3:24 ` Shawn O. Pearce
0 siblings, 2 replies; 14+ messages in thread
From: Nicolas Pitre @ 2008-04-04 13:11 UTC (permalink / raw)
To: Jakub Narebski; +Cc: Linus Torvalds, Roman Shaposhnik, git
On Thu, 3 Apr 2008, Jakub Narebski wrote:
> Linus Torvalds <torvalds@linux-foundation.org> writes:
>
> > On Thu, 3 Apr 2008, Roman Shaposhnik wrote:
> >>
> >> The last item (trees) also seem to take the most space and the most
> >> reasonable explanation that I can offer is that NetBeans repository has
> >> a really weird structure where they have approximately 700 (yes, seven
> >> hundred!) top-level subdirectories there. They are clearly
> >> Submodules-shy, but that's another issue that I will need to address
> >> with them.
> >
> > Trees taking the biggest amount of space is not unheard of, and it may
> > also be that the name heuristics (for finding good packing partners) could
> > be failign, which would result in a much bigger pack than necessary.
> >
> > So if you already did an aggressive repack like the above, I'd happily
> > take a look at whether maybe it's bad heuristics for finding tree objects
> > to pair up for delta-compression. Do you have a place where you can put
> > that repo for people to clone and look at?
>
> Hmmm... I wonder if it would be the case that would speed-up
> development of pack v4.
Not really. Pack v4 won't magically shrink a repository to less than
half the pack v3 size.
I think we're simply facing the same situation as with the initial GCC
repository which shrank from 3GB down to 300MB or so due to misfitted
repacking parameters.
> If I remember correctly one of bigger changes
> was the way trees were represented in pack; the biggest improvement
> was for trees.
Yes, but that wasn't really so much about size but rather access speed
by not deflating them. The pack v4 tree representation would certainly
help, of course, but I suspect that simply repacking with more
aggressive window/depth arguments would be even more effective in this
case.
> One of bigger hindrances, as I understand it, in developing pack v4
> was the fact that it didn't offer that much of improvement in typical
> cases for the work needed... but perhaps "your" repository would be
> good showcase for pack v4.
The biggest hindrance for pack v4 is actually the lack of a native
runtime tree walking, and having both tree object formats properly and
optimally abstracted has not been looked at yet.
Speed is the primary goal for pack v4. The fact that it also provides a
10% pack reduction is only consequential. But without native tree
walking we must recreate the legacy tree format on the fly each time a
tree object is loaded which dwarfs any improvements pack v4 is aiming
for (yes it is still a little bit faster than pack v3 nevertheless, but
not yet significantly enough to overcome the incompatibility costs).
Nicolas (who wishes he was still a student with plenty of hacking time)
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: Achieving efficient storage of weirdly structured repos
2008-04-04 13:11 ` Nicolas Pitre
@ 2008-04-04 14:16 ` Pieter de Bie
2008-04-05 3:24 ` Shawn O. Pearce
1 sibling, 0 replies; 14+ messages in thread
From: Pieter de Bie @ 2008-04-04 14:16 UTC (permalink / raw)
To: Nicolas Pitre; +Cc: Jakub Narebski, Linus Torvalds, Roman Shaposhnik, git
On 4 apr 2008, at 15:11, Nicolas Pitre wrote:
> I think we're simply facing the same situation as with the initial GCC
> repository which shrank from 3GB down to 300MB or so due to misfitted
> repacking parameters.
I'd say so too. I imported the first 14000 revisions of that hg
repository and the resulting pack file was just 25 MB after repacking.
- Pieter
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: Achieving efficient storage of weirdly structured repos
2008-04-03 21:11 ` Linus Torvalds
2008-04-04 6:21 ` Jakub Narebski
@ 2008-04-04 23:30 ` Roman Shaposhnik
2008-04-04 23:57 ` Linus Torvalds
1 sibling, 1 reply; 14+ messages in thread
From: Roman Shaposhnik @ 2008-04-04 23:30 UTC (permalink / raw)
To: Linus Torvalds; +Cc: git
Hi Linus!
On Thu, 2008-04-03 at 14:11 -0700, Linus Torvalds wrote:
>
> On Thu, 3 Apr 2008, Roman Shaposhnik wrote:
> >
> > The repository was created using hg2git (the one based on git-fast-import)
> > and it was GC'ed and REPACK'ed just in case.
>
> Before going any further - exactly _how_ was it repacked?
I believe it was the following two steps:
$ git gc --aggressive
$ git repack
> In particular, when using importers that do partial packing on their own
> (and any "git-fastimport" user is that by definition - and I think
> hg2git does that), at the end of it all you have to make sure to repack in
> a way where the repacking will totally discard the import-time packfiles.
Good point. Speaking of which: do you have an FAQ for importers? The
entries in the official FAQ (http://git.or.cz/gitwiki/GitFaq#head-929a8825d04dde226c2530f5337d3b3ed8dcc7ce)
seem a bit stale for such an important issue. After all, importing from
an existing SCM is what usually forms a first time impression of Git's
effectiveness.
> IOW, that's one of the very few times you should use "-f" to git repack.
Got it!
> It's usually also a good place to make sure that since you ignore the old
> packing information, it's best to also make sure that the new packing info
> is good by using a bigger window (and perhaps a bigger depth). That makes
> the packing much slower, of course, but this is meant to be a one-time
> event.
>
> So try something like
>
> git repack -a -d -f --depth=100 --window=100
>
> if you have a good CPU and plenty of memory.
That turned out to be a perfect suggestion. Thank you. I'm now the
happiest camper ever. And I'm also also pretty dumbfounded ;-)
Here's what happened.
I started with a a repository filled with "loose" (one object per file)
objects (the reason I needed it was for the ease of sleuthing through
individual objects and it was created by git-unpack-objects from that
initial 1.1Gb pack). And I tried to pack it exactly like you
suggested:
$ git-pack-objects --depth=100 --window=100 --delta-base-offset --progress pack < objects
Generating pack...
Counting objects: 1096305
Done counting 1159628 objects.
Deltifying 1159628 objects...
100% (1159628/1159628) done
Writing 1159628 objects...
dd134c407324dc55b0cd2aa3a9e1b3420c2bba3f
Total 1159628 (delta 386980), reused 0 (delta 0)
and it payed off reasonably well:
$ du -s NB-clone
670M NB-clone
It still was bigger than the Mercurial repository but at least it got
2 times smaller than the original result of hg2git. Now, if it wasn't
for a friend of mine, I probably would've stopped there. But he
showed up and saved the day ;-) His comments made me try something
that I didn't consider to be of any use -- repacking a freshly packed
pack with the *same* --depth=100 --window=100:
$ git repack -a -f --window=100 --depth=100
Generating pack...
Counting objects: 1056829
Done counting 1159628 objects.
Deltifying 1159628 objects...
100% (1159628/1159628) done
Writing 1159628 objects...
100% (1159628/1159628) done
Total 1159628 (delta 614516), reused 0 (delta 0)
Pack pack-dd134c407324dc55b0cd2aa3a9e1b3420c2bba3f created.
And then, a miracle occurred:
$ du -sh NB-small
268M NB-small
Now, don't get me wrong: I'm as happy as a clam. The repository is now
*smaller* than the Mercurial's and because the structure of the
tree is so weird Git gets major points here. The only question that
is still bothering me is: how did it happen? Why did repacking
a repository with exactly the same set of objects and the only
difference being where these objects resided (former case filesystem,
the later case an intermediate pack) made so huge a difference?
Please help!
> > The last item (trees) also seem to take the most space and the most
> > reasonable explanation that I can offer is that NetBeans repository has
> > a really weird structure where they have approximately 700 (yes, seven
> > hundred!) top-level subdirectories there. They are clearly
> > Submodules-shy, but that's another issue that I will need to address
> > with them.
>
> Trees taking the biggest amount of space is not unheard of, and it may
> also be that the name heuristics (for finding good packing partners) could
> be failign, which would result in a much bigger pack than necessary.
Is there any documentation that describes the heuristics involved in
creating a pack?
> So if you already did an aggressive repack like the above, I'd happily
> take a look at whether maybe it's bad heuristics for finding tree objects
> to pair up for delta-compression. Do you have a place where you can put
> that repo for people to clone and look at?
Unfortunately I don't. The only thing I can do is I can always create
a *.tar.bz2 and put and on Sun's ftp server. Actually, that makes me
wonder: is there any public Git hosting available such that publishing
a hefty repository for the forensic purposes only wouldn't violate their
terms of use?
Thanks,
Roman.
P.S. Oh, and here's one extra tiny question that I also have: what
does the output:
Total 1159628 (delta 614516), reused 0 (delta 0)
really mean?
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: Achieving efficient storage of weirdly structured repos
2008-04-04 23:30 ` Roman Shaposhnik
@ 2008-04-04 23:57 ` Linus Torvalds
2008-04-06 0:13 ` Roman Shaposhnik
0 siblings, 1 reply; 14+ messages in thread
From: Linus Torvalds @ 2008-04-04 23:57 UTC (permalink / raw)
To: Roman Shaposhnik; +Cc: git
On Fri, 4 Apr 2008, Roman Shaposhnik wrote:
>
> That turned out to be a perfect suggestion. Thank you. I'm now the
> happiest camper ever. And I'm also also pretty dumbfounded ;-)
Ok, the dumbfounded we can help with.
> Here's what happened.
>
> I started with a a repository filled with "loose" (one object per file)
> objects (the reason I needed it was for the ease of sleuthing through
> individual objects and it was created by git-unpack-objects from that
> initial 1.1Gb pack). And I tried to pack it exactly like you
> suggested:
> $ git-pack-objects --depth=100 --window=100 --delta-base-offset --progress pack < objects
Well, that's not exactly like I suggested, that's a *really* old-fashioned
way.
But the part you left off was what the "objects" file contained?
In particular, "git pack-objects" can take just a raw list of objects, and
it will *work*, but without the naming information and without the
ordering information on the objects, the end result will generally suck.
So how did you generate the "objects" file? You can do it by just listing
every single loose object you have, and it will work, but the end result
won't be very pretty.
> Total 1159628 (delta 386980), reused 0 (delta 0)
>
> and it payed off reasonably well:
> $ du -s NB-clone
> 670M NB-clone
So what git-repack-objects will do is that it will *try* to figure out a
good and likely pairing of objects, and it uses the type and the size of
the objectf rot it, but it obviously didn't find very good deltas. In
fact, only about 35% of all objects are deltas at all!
Which is why I suspect that your object list was just the raw list of
objects.
> His comments made me try something that I didn't consider to be of any
> use -- repacking a freshly packed pack with the *same* --depth=100
> --window=100:
> $ git repack -a -f --window=100 --depth=100
Well, "git repack" doesn't just list the objects and then repack them, it
also
- lists them with the filename they were reachable from (if they are
trees or blobs)
- sorts them topologically (ie most reachable objects first).
And that first thing will generate a *much* better packing, because now
the packing algorithm can look at the filename, and generate a much better
guess about which objects to try to delta against each others. And now you
have
> Total 1159628 (delta 614516), reused 0 (delta 0)
Ie you got a much better 60% delta ratio, and obviously a much smaller
pack. But it's not just smaller, because the resulting pack will also have
the objects in topological order, so that objects that are "new"
(ie more closely reachable from the top-of-tree) will be at the front of
the pack, so you'll also generally have better IO patterns in the packfile
itself.
> Now, don't get me wrong: I'm as happy as a clam. The repository is now
> *smaller* than the Mercurial's and because the structure of the
> tree is so weird Git gets major points here. The only question that
> is still bothering me is: how did it happen? Why did repacking
> a repository with exactly the same set of objects and the only
> difference being where these objects resided (former case filesystem,
> the later case an intermediate pack) made so huge a difference?
That location thing really shouldn't have mattered (since you used "-f",
which throws it away). So I think it was all from how you generated the
list of objects. But since I don't know how you did that, I cannot
guarantee anything.
> Is there any documentation that describes the heuristics involved in
> creating a pack?
It's been explained a few times on the mailing list, but I don't know if
there is some write-up.
The git pack format is a bit more relaxed than any other SCM format I know
about, since it literally is just a "bunch of objects with deltas randomly
between them". So a pack can look just about any way it wants - there are
no real ordering requirements (--delta-base-offset does require that the
delta follows the base, but even that is really just a trivial practical
"because otherwise you could never know what the offset in the pack-file
was" issue rather than anything else).
So you can order the objects and make deltas just about any way you want.
What "git repack" will do is to sort objects by <type, namehash, length>,
and then walk the list with the window of the specified size to see the
best pairing it can do.
The "namehash" is just designed so that files that have the same name sort
together (and it's also not a real "hash" - it's really a meaningless
number, but one that is designed so that even if the names aren't exactly
the same, they sort closer together if they end in similar character
sequences - so a file that moves from one directory to another but keeps
the same basename will sort source and destination together).
See "type_size_sort()" in builtin-pack-objects.c to see what's up.
(The preferred_base thing is just to keep objects we actually want to
*keep* in the pack separated from the objects we may be using for
references - but that is only used for transferring objects over the wire,
never for standalone packs).
Oh. And now I notice that there is *some* documentation on this in
Documentation/technical/pack-heuristics.txt
but that's pretty old (not that I think it has really changed) and not
very deep. Just taken from some #irc discussion.
> P.S. Oh, and here's one extra tiny question that I also have: what
> does the output:
> Total 1159628 (delta 614516), reused 0 (delta 0)
> really mean?
It just means that there were a million objects total, the pack was
created with 614k of them as deltas against other objects, and none of
them were re-used from previous packs.
You'll see that when you do incremental repacks, 99% of all the objects
and deltas get re-used from the old pack, which is how we can make
repacking fairly efficient. That's imporant because the native git
protocol format itself is just a pack-file (plus the handshake to figure
out what both sides have, of course). So a "git clone" implies a repack.
Linus
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: Achieving efficient storage of weirdly structured repos
2008-04-04 13:11 ` Nicolas Pitre
2008-04-04 14:16 ` Pieter de Bie
@ 2008-04-05 3:24 ` Shawn O. Pearce
1 sibling, 0 replies; 14+ messages in thread
From: Shawn O. Pearce @ 2008-04-05 3:24 UTC (permalink / raw)
To: Nicolas Pitre; +Cc: Jakub Narebski, Linus Torvalds, Roman Shaposhnik, git
Nicolas Pitre <nico@cam.org> wrote:
> On Thu, 3 Apr 2008, Jakub Narebski wrote:
>
> > One of bigger hindrances, as I understand it, in developing pack v4
> > was the fact that it didn't offer that much of improvement in typical
> > cases for the work needed... but perhaps "your" repository would be
> > good showcase for pack v4.
>
> The biggest hindrance for pack v4 is actually the lack of a native
> runtime tree walking, and having both tree object formats properly and
> optimally abstracted has not been looked at yet.
>
> Speed is the primary goal for pack v4. The fact that it also provides a
> 10% pack reduction is only consequential. But without native tree
> walking we must recreate the legacy tree format on the fly each time a
> tree object is loaded which dwarfs any improvements pack v4 is aiming
> for (yes it is still a little bit faster than pack v3 nevertheless, but
> not yet significantly enough to overcome the incompatibility costs).
Even though we don't have native tree walking, I think the right
way to do this is to put in pack v4 with "canonical tree, canonical
commit" mode, where it inflates its native tree/commit encoding
into the canonical forms, then come back later with native walking.
Canonical mode is still faster than pack v2 inflate is for these
types, so it does (slightly) boost rev-list performance. It might
chop a solid 30% off the CPU time jgit spends in its equivilant of
revision.c, and that's without teaching jgit to use the native pack
v4 encoding directly.
Once we have it in we can experiment with the necessary abstractions
to handle the two different available encodings, and allowing
higher level code to switch back and forth between them as objects
come from loose or pack v2, and from pack v4. One of the things we
wanted to do was boost path limiter performance by matching on tree
name ids when walking a pack v4 native tree, but fall back to the
string based memcmp when walking a canonical tree. That won't be
easy to design without the two different encodings being available
at the lower level in sha1_file.c.
Just my rapidly declining .02 bush peso.
> Nicolas (who wishes he was still a student with plenty of hacking time)
Don't we all. :-)
--
Shawn.
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: Achieving efficient storage of weirdly structured repos
2008-04-04 23:57 ` Linus Torvalds
@ 2008-04-06 0:13 ` Roman Shaposhnik
2008-04-06 0:48 ` Linus Torvalds
0 siblings, 1 reply; 14+ messages in thread
From: Roman Shaposhnik @ 2008-04-06 0:13 UTC (permalink / raw)
To: Linus Torvalds; +Cc: git
Hi Linus,
On Apr 4, 2008, at 4:57 PM, Linus Torvalds wrote:
> On Fri, 4 Apr 2008, Roman Shaposhnik wrote:
>>
>> That turned out to be a perfect suggestion. Thank you. I'm now the
>> happiest camper ever. And I'm also also pretty dumbfounded ;-)
>
> Ok, the dumbfounded we can help with.
Great!
>> Here's what happened.
>>
>> I started with a a repository filled with "loose" (one object per
>> file)
>> objects (the reason I needed it was for the ease of sleuthing through
>> individual objects and it was created by git-unpack-objects from that
>> initial 1.1Gb pack). And I tried to pack it exactly like you
>> suggested:
>> $ git-pack-objects --depth=100 --window=100 --delta-base-offset
>> --progress pack < objects
>
> Well, that's not exactly like I suggested, that's a *really* old-
> fashioned
> way.
It sure is old-fashioned. My only excuse is that since git repack is
a shell
script around git-pack-objects I've always felt comfortable just using
the lower level utility.
> But the part you left off was what the "objects" file contained?
It was pretty much the result of (cd .git/objects ; find . -type f |
tr -d './')
Omitting that was quite silly on my part, but I was really convinced
that because of the internal sorting the original ordering of objects
wouldn't matter at all.
> In particular, "git pack-objects" can take just a raw list of
> objects, and
> it will *work*, but without the naming information and without the
> ordering information on the objects, the end result will generally
> suck.
>
> So how did you generate the "objects" file? You can do it by just
> listing
> every single loose object you have, and it will work, but the end
> result
> won't be very pretty.
So it seems that my list of objects was different from what git-rev-
list/setup_revisions()
would have provided in two ways:
1. the order of objects was arbitrary
2. the naming of blobs and trees was missing
#2 was, indeed, a huge oversight on my part. At the same time, I've
always thought
that #1 shouldn't matter, because, as you pointed out, the objects get
sorted by <type, namehash, length> anyway. However, it seems that
because
of the lack of naming there was much less sorting done by
type_size_sort()
and the original order persevered (at least within type-size
partitions).
It did matter after all!
Now, in my particular case, the ordering was braindead and it
resulted in
a highly visible inefficiency. On the other hand, it seems that creative
ordering could very well be used to exploit localities which go beyond
how default name hashing in builtin-pack-objects.c works.
In fact, it starts to look awfully like custom MPEG profiles where if
you know
your footage you can achieve a much higher compression ratio
compared to what the default might give you. It also makes it very
clear that Git's approach of doing away with per-file content tracking
is quite superior to the in-file deltas. Cool!
Here's my final question on that issue: wouldn't it be great to give
users
a direct control over specifying the list of objects in exactly the
order
they would like them to be tried for deltifying? Something like --
preserve-order
option available for git-pack-objects?
>> Total 1159628 (delta 614516), reused 0 (delta 0)
>
> Ie you got a much better 60% delta ratio, and obviously a much smaller
> pack. But it's not just smaller, because the resulting pack will
> also have
> the objects in topological order, so that objects that are "new"
> (ie more closely reachable from the top-of-tree) will be at the
> front of
> the pack, so you'll also generally have better IO patterns in the
> packfile
> itself.
Got it! This makes total sense.
>> Is there any documentation that describes the heuristics involved in
>> creating a pack?
>
> It's been explained a few times on the mailing list, but I don't
> know if
> there is some write-up.
>
> The git pack format is a bit more relaxed than any other SCM format
> I know
> about, since it literally is just a "bunch of objects with deltas
> randomly
> between them". So a pack can look just about any way it wants -
> there are
> no real ordering requirements (--delta-base-offset does require
> that the
> delta follows the base, but even that is really just a trivial
> practical
> "because otherwise you could never know what the offset in the pack-
> file
> was" issue rather than anything else).
>
> So you can order the objects and make deltas just about any way you
> want.
> What "git repack" will do is to sort objects by <type, namehash,
> length>,
> and then walk the list with the window of the specified size to see
> the
> best pairing it can do.
>
> The "namehash" is just designed so that files that have the same
> name sort
> together (and it's also not a real "hash" - it's really a meaningless
> number, but one that is designed so that even if the names aren't
> exactly
> the same, they sort closer together if they end in similar character
> sequences - so a file that moves from one directory to another but
> keeps
> the same basename will sort source and destination together).
>
> See "type_size_sort()" in builtin-pack-objects.c to see what's up.
>
> (The preferred_base thing is just to keep objects we actually want to
> *keep* in the pack separated from the objects we may be using for
> references - but that is only used for transferring objects over
> the wire,
> never for standalone packs).
>
> Oh. And now I notice that there is *some* documentation on this in
>
> Documentation/technical/pack-heuristics.txt
>
> but that's pretty old (not that I think it has really changed) and not
> very deep. Just taken from some #irc discussion.
>
>> P.S. Oh, and here's one extra tiny question that I also have: what
>> does the output:
>> Total 1159628 (delta 614516), reused 0 (delta 0)
>> really mean?
>
> It just means that there were a million objects total, the pack was
> created with 614k of them as deltas against other objects, and none of
> them were re-used from previous packs.
>
> You'll see that when you do incremental repacks, 99% of all the
> objects
> and deltas get re-used from the old pack, which is how we can make
> repacking fairly efficient. That's imporant because the native git
> protocol format itself is just a pack-file (plus the handshake to
> figure
> out what both sides have, of course). So a "git clone" implies a
> repack.
Very nice summary! Many, many thanks for providing it!
Thanks,
Roman.
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: Achieving efficient storage of weirdly structured repos
2008-04-06 0:13 ` Roman Shaposhnik
@ 2008-04-06 0:48 ` Linus Torvalds
2008-04-06 16:10 ` Jeff King
0 siblings, 1 reply; 14+ messages in thread
From: Linus Torvalds @ 2008-04-06 0:48 UTC (permalink / raw)
To: Roman Shaposhnik; +Cc: git
On Sat, 5 Apr 2008, Roman Shaposhnik wrote:
>
> It was pretty much the result of (cd .git/objects ; find . -type f | tr -d
> './')
Ok, that explains it. And yes, it definitely works, but I think you now
understand why it gave you that oddly pessimised pack-file.
That said, I think it's also a really good example of how the git
pack-files really are just a "bag of objects", and the fact that it worked
but was non-optimal is a very good way to show something very fundamental
about git.
> So it seems that my list of objects was different from what
> git-rev-list/setup_revisions()
> would have provided in two ways:
> 1. the order of objects was arbitrary
> 2. the naming of blobs and trees was missing
> #2 was, indeed, a huge oversight on my part. At the same time, I've always
> thought
> that #1 shouldn't matter, because, as you pointed out, the objects get
> sorted by <type, namehash, length> anyway. However, it seems that because
> of the lack of naming there was much less sorting done by type_size_sort()
> and the original order persevered (at least within type-size partitions).
> It did matter after all!
Well, a pack actually as two *different* orderings, in that there's one
ordering that is used for laying out the result in the pack-file, and
another ordering that is used for actually finding the deltas with the
sliding window.
And the input order is actually used for the layout ordering.
And both matter. The <type, namehash, length> is the one that determines
how large the resulting pack is, but the layout order is what causes the
IO patterns for most common uses, so if the pack-file is cold in the
cache, it can matter quite a bit for performance.
So the layout order is the one you give as input to "git pack-objects".
And if that input has no particular ordering, then the final layout will
also have no particular ordering and you get random IO patterns when
loading from disk.
> Now, in my particular case, the ordering was braindead and it resulted in
> a highly visible inefficiency. On the other hand, it seems that creative
> ordering could very well be used to exploit localities which go beyond
> how default name hashing in builtin-pack-objects.c works.
Yes. We've changed some of the heuristics subtly over time (eg the whole
name-hashing was a tweak that was added later), but you're absolutely
right - it's an area where some *particular* usage model could come up
with a special ordering to optimize a some special load. The defaults are
meant to be "good enough", and there has been some thought put into them,
but yes, there's probably room for specialization there.
In particular, one thing I've wanted to do is to do some "fingerprint
hash" and perhaps use that as a sorting guide for finding deltas even more
aggressively than we do now, but I've never really found the energy to try
something like that out.
> In fact, it starts to look awfully like custom MPEG profiles where if you know
> your footage you can achieve a much higher compression ratio
> compared to what the default might give you. It also makes it very
> clear that Git's approach of doing away with per-file content tracking
> is quite superior to the in-file deltas. Cool!
Well, one thing that git doesn't do, but that I find intriguing is if it
could be possible to do deltas not even between different files, but even
*across* files.
One thing that the git model sucks at is how it's not very good at
handling large objects. I've often wondered if I should have made "object"
be more fine-grained and tried to build up large files from multiple
smaller objects.
[ That said, I think git does the right thing - for source code. The
blocking-up of files would cause a rather more complex model, and one of
the great things about git is how simple the basic model is. But the
large-file thing does mean that git potentially sucks really badly for
some other loads ]
> Here's my final question on that issue: wouldn't it be great to give users
> a direct control over specifying the list of objects in exactly the order
> they would like them to be tried for deltifying? Something like
> --preserve-order option available for git-pack-objects?
See above: there's actually two orders, so you'd need to specify *both*
orders some way, since the input order already has meaning (it's assumed
to be the "topological ordering".
You *can* approximate what you descibe by playing games with the filenames
(which you can give in the input too - they're just strings on the same
line as the SHA1), and that would be a total hack to give that secondary
ordering. And yes, I agree that it might be cool to do this explicitly
some way. If only to give a way to test experimental versions of the type
sorting.
[ To see how the filename thing works, do
git rev-list --objects --all > object-list
git pack-objects ... < object-list
and you can look at the "object-list" file to see how it's not just the
list of SHA1, it now has the filename info in it too. Imagine changing
that filename to generate a "custom" pack order ]
Linus
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: Achieving efficient storage of weirdly structured repos
2008-04-06 0:48 ` Linus Torvalds
@ 2008-04-06 16:10 ` Jeff King
2008-04-07 0:13 ` Nicolas Pitre
0 siblings, 1 reply; 14+ messages in thread
From: Jeff King @ 2008-04-06 16:10 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Roman Shaposhnik, git
On Sat, Apr 05, 2008 at 05:48:43PM -0700, Linus Torvalds wrote:
> One thing that the git model sucks at is how it's not very good at
> handling large objects. I've often wondered if I should have made "object"
> be more fine-grained and tried to build up large files from multiple
> smaller objects.
>
> [ That said, I think git does the right thing - for source code. The
> blocking-up of files would cause a rather more complex model, and one of
> the great things about git is how simple the basic model is. But the
> large-file thing does mean that git potentially sucks really badly for
> some other loads ]
I have considered something like this for one of my repos, which is full
of images. The large image data very rarely changes, but the small EXIF
tags do.
My thought was something like:
- add a new object type, multiblob; a multiblob contains zero or more
"child" sha1s, each of which is another multiblob or a blob. The
data in the multiblob is an in-order concatenation of its children.
- you would create multiblobs with a "smart" git-add that understands
the filetype and splits the file accordingly (in my case, probably a
chunk of headers and EXIF data, and then a chunk with the image
data).
- in most of git, whenever you need a blob, you just "unwrap" the
multiblob to get the original blob data
- because they're separate objects, pack-objects automagically does
the right thing
- a few places would benefit from handling multiblobs specially. In
particular:
- the diff machinery could do much more efficient comparisons for
some inexact renames. E.g., multiblob "1234\n5678" and multiblob
"abcd\n5678" could ignore the "5678" id.
- the diff machinery could show diffs that were more human
readable (e.g., even without understanding what the chunks of
the multiblob _mean_, it can still say "most of this image
didn't change, but this textual part did").
Of course there are a few drawbacks:
- one of git's strengths is that content is the same no matter who
adds it or how. Now the same file has a different sha1 as a
multiblob versus a regular blob.
- it breaks the git model of "we store state in the simplest way, and
figure everything out afterwards." IOW, you are stuck with whatever
crappy multiblob split you did when you added or updated the file.
The usual pattern in git is "dumb add, smart view". Now maybe it is
worth breaking this for two reasons:
- dumb add, smart view is often very resource intensive; we can
get smaller packs and faster rename detection out of this
- we might be losing information; in the case of renames, we can
justify not explicitly recording because we can figure out
later what actually happened. I don't know if there is a
multiblob split that would encapsulate useful user input.
My EXIF example doesn't; with a little more CPU time, you could
just do the automated split at diff or delta time.
So it's an approach that I think would work, but I'm not sure it's worth
the effort unless somebody comes up with a compelling reason that you
can't just split the blobs up after the fact (and maybe the right
approach is that pack v5 can split blobs intelligently to get better
deltas, so they are still blobs, but we just store them differently).
-Peff
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: Achieving efficient storage of weirdly structured repos
2008-04-06 16:10 ` Jeff King
@ 2008-04-07 0:13 ` Nicolas Pitre
2008-04-07 0:18 ` Jeff King
0 siblings, 1 reply; 14+ messages in thread
From: Nicolas Pitre @ 2008-04-07 0:13 UTC (permalink / raw)
To: Jeff King; +Cc: Linus Torvalds, Roman Shaposhnik, git
On Sun, 6 Apr 2008, Jeff King wrote:
> My thought was something like:
>
> - add a new object type, multiblob; a multiblob contains zero or more
> "child" sha1s, each of which is another multiblob or a blob. The
> data in the multiblob is an in-order concatenation of its children.
>
> - you would create multiblobs with a "smart" git-add that understands
> the filetype and splits the file accordingly (in my case, probably a
> chunk of headers and EXIF data, and then a chunk with the image
> data).
Well, in your example, the large image part should already be common to
many objects due to deltas if they're really the same: different objects
will only have different EXIF data plus a delta reference to the same
base image object. So in a way the split is already there. Needs only
that some applications exploit this information at runtime.
Nicolas
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: Achieving efficient storage of weirdly structured repos
2008-04-07 0:13 ` Nicolas Pitre
@ 2008-04-07 0:18 ` Jeff King
2008-04-07 0:36 ` Nicolas Pitre
0 siblings, 1 reply; 14+ messages in thread
From: Jeff King @ 2008-04-07 0:18 UTC (permalink / raw)
To: Nicolas Pitre; +Cc: Linus Torvalds, Roman Shaposhnik, git
On Sun, Apr 06, 2008 at 08:13:10PM -0400, Nicolas Pitre wrote:
> Well, in your example, the large image part should already be common to
> many objects due to deltas if they're really the same: different objects
> will only have different EXIF data plus a delta reference to the same
> base image object. So in a way the split is already there. Needs only
> that some applications exploit this information at runtime.
Yes, the resulting packfiles find the deltas and are pretty efficient
(although it is quite slow to pack). However, the delta information is
not used at all for inexact rename detection. Are you proposing to make
that information available to the rename detector?
-Peff
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: Achieving efficient storage of weirdly structured repos
2008-04-07 0:18 ` Jeff King
@ 2008-04-07 0:36 ` Nicolas Pitre
0 siblings, 0 replies; 14+ messages in thread
From: Nicolas Pitre @ 2008-04-07 0:36 UTC (permalink / raw)
To: Jeff King; +Cc: Linus Torvalds, Roman Shaposhnik, git
On Sun, 6 Apr 2008, Jeff King wrote:
> On Sun, Apr 06, 2008 at 08:13:10PM -0400, Nicolas Pitre wrote:
>
> > Well, in your example, the large image part should already be common to
> > many objects due to deltas if they're really the same: different objects
> > will only have different EXIF data plus a delta reference to the same
> > base image object. So in a way the split is already there. Needs only
> > that some applications exploit this information at runtime.
>
> Yes, the resulting packfiles find the deltas and are pretty efficient
> (although it is quite slow to pack). However, the delta information is
> not used at all for inexact rename detection. Are you proposing to make
> that information available to the rename detector?
In practice I don't know how well that would work since the
current heuristic groups deltas and their
base according to the name under which those objects are known. So it
is possible that some inexact renames end up creating objects that
currently never delta against each other even if that would be the right
thing to do.
But in some cases, that might be beneficial to look at the delta object
themselves when diffing files as the delta might already contain the
information telling the upper layer that file A and B are in fact 90%
the same and that they differ from offset X to Y only.
Nicolas
^ permalink raw reply [flat|nested] 14+ messages in thread
end of thread, other threads:[~2008-04-07 0:37 UTC | newest]
Thread overview: 14+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2008-04-03 19:42 Achieving efficient storage of weirdly structured repos Roman Shaposhnik
2008-04-03 21:11 ` Linus Torvalds
2008-04-04 6:21 ` Jakub Narebski
2008-04-04 13:11 ` Nicolas Pitre
2008-04-04 14:16 ` Pieter de Bie
2008-04-05 3:24 ` Shawn O. Pearce
2008-04-04 23:30 ` Roman Shaposhnik
2008-04-04 23:57 ` Linus Torvalds
2008-04-06 0:13 ` Roman Shaposhnik
2008-04-06 0:48 ` Linus Torvalds
2008-04-06 16:10 ` Jeff King
2008-04-07 0:13 ` Nicolas Pitre
2008-04-07 0:18 ` Jeff King
2008-04-07 0:36 ` Nicolas Pitre
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox