* Achieving efficient storage of weirdly structured repos
@ 2008-04-03 19:42 Roman Shaposhnik
2008-04-03 21:11 ` Linus Torvalds
0 siblings, 1 reply; 14+ messages in thread
From: Roman Shaposhnik @ 2008-04-03 19:42 UTC (permalink / raw)
To: git
Hi Guys!
First of all, I must admit that I got seriously hooked up on Git when
I started to
use it internally for FFmpeg development and my life as far as non
company
related projects are concerned hasn't been the same ever since. The way
Git addresses issues like branches (both local and remote), merges and
content tracking are, in my opinion, quite superior to the
alternative SCMs
such as Mercurial. In fact, that's why when I saw my coworkers
struggling
with exactly the same issues around NetBeans Mercurial repository my
natural instinct was to introduce them to Git and show them that life
doesn't
have to be that complicated or troublesome.
But before I can do that, there's a little issue that I either need
to address or
at least understand better. It seems that the storage requirements
for the
repository that a friend of mine and I created from: http://
hg.netbeans.org/main/
got doubled under Git:
$ du -sh Mercurial- nb-main
491M Mercurial- nb-main
$ du -sh Git-nb-main
1.1G Git-nb-main
The repository was created using hg2git (the one based on git-fast-
import)
and it was GC'ed and REPACK'ed just in case. I spent some time trying
to understand the issue and here's some statistics on the kind of
objects
that this repository consists of:
75053 commit
334572 blob
750003 tree
The last item (trees) also seem to take the most space and the most
reasonable
explanation that I can offer is that NetBeans repository has a really
weird
structure where they have approximately 700 (yes, seven hundred!) top-
level
subdirectories there. They are clearly Submodules-shy, but that's
another
issue that I will need to address with them.
So based on my preliminary analysis the biggest culprit seems to be
that for every commit at least one pretty hefty tree object needs to
be created
(remember 700 top level subdirectories). These objects are all in 18-20K
range but they are almost identical to each other except for one or two
trees that they refer to.
So here's my question: is there anything in Git that I can use to
accommodate
a repository like http://hg.netbeans.org/main in an efficient manner?
Any
kind of ideas (like particular command line options for repack, etc.)
would be
greatly appreciated!
Thanks,
Roman.
^ permalink raw reply [flat|nested] 14+ messages in thread* Re: Achieving efficient storage of weirdly structured repos 2008-04-03 19:42 Achieving efficient storage of weirdly structured repos Roman Shaposhnik @ 2008-04-03 21:11 ` Linus Torvalds 2008-04-04 6:21 ` Jakub Narebski 2008-04-04 23:30 ` Roman Shaposhnik 0 siblings, 2 replies; 14+ messages in thread From: Linus Torvalds @ 2008-04-03 21:11 UTC (permalink / raw) To: Roman Shaposhnik; +Cc: git On Thu, 3 Apr 2008, Roman Shaposhnik wrote: > > The repository was created using hg2git (the one based on git-fast-import) > and it was GC'ed and REPACK'ed just in case. Before going any further - exactly _how_ was it repacked? In particular, when using importers that do partial packing on their own (and any "git-fastimport" user is that by definition - and I think hg2git does that), at the end of it all you have to make sure to repack in a way where the repacking will totally discard the import-time packfiles. IOW, that's one of the very few times you should use "-f" to git repack. It's usually also a good place to make sure that since you ignore the old packing information, it's best to also make sure that the new packing info is good by using a bigger window (and perhaps a bigger depth). That makes the packing much slower, of course, but this is meant to be a one-time event. So try something like git repack -a -d -f --depth=100 --window=100 if you have a good CPU and plenty of memory. > The last item (trees) also seem to take the most space and the most > reasonable explanation that I can offer is that NetBeans repository has > a really weird structure where they have approximately 700 (yes, seven > hundred!) top-level subdirectories there. They are clearly > Submodules-shy, but that's another issue that I will need to address > with them. Trees taking the biggest amount of space is not unheard of, and it may also be that the name heuristics (for finding good packing partners) could be failign, which would result in a much bigger pack than necessary. So if you already did an aggressive repack like the above, I'd happily take a look at whether maybe it's bad heuristics for finding tree objects to pair up for delta-compression. Do you have a place where you can put that repo for people to clone and look at? Linus ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: Achieving efficient storage of weirdly structured repos 2008-04-03 21:11 ` Linus Torvalds @ 2008-04-04 6:21 ` Jakub Narebski 2008-04-04 13:11 ` Nicolas Pitre 2008-04-04 23:30 ` Roman Shaposhnik 1 sibling, 1 reply; 14+ messages in thread From: Jakub Narebski @ 2008-04-04 6:21 UTC (permalink / raw) To: Linus Torvalds; +Cc: Roman Shaposhnik, git Linus Torvalds <torvalds@linux-foundation.org> writes: > On Thu, 3 Apr 2008, Roman Shaposhnik wrote: >> >> The last item (trees) also seem to take the most space and the most >> reasonable explanation that I can offer is that NetBeans repository has >> a really weird structure where they have approximately 700 (yes, seven >> hundred!) top-level subdirectories there. They are clearly >> Submodules-shy, but that's another issue that I will need to address >> with them. > > Trees taking the biggest amount of space is not unheard of, and it may > also be that the name heuristics (for finding good packing partners) could > be failign, which would result in a much bigger pack than necessary. > > So if you already did an aggressive repack like the above, I'd happily > take a look at whether maybe it's bad heuristics for finding tree objects > to pair up for delta-compression. Do you have a place where you can put > that repo for people to clone and look at? Hmmm... I wonder if it would be the case that would speed-up development of pack v4. If I remember correctly one of bigger changes was the way trees were represented in pack; the biggest improvement was for trees. One of bigger hindrances, as I understand it, in developing pack v4 was the fact that it didn't offer that much of improvement in typical cases for the work needed... but perhaps "your" repository would be good showcase for pack v4. Just my 2 eurocents... -- Jakub Narebski Poland ShadeHawk on #git ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: Achieving efficient storage of weirdly structured repos 2008-04-04 6:21 ` Jakub Narebski @ 2008-04-04 13:11 ` Nicolas Pitre 2008-04-04 14:16 ` Pieter de Bie 2008-04-05 3:24 ` Shawn O. Pearce 0 siblings, 2 replies; 14+ messages in thread From: Nicolas Pitre @ 2008-04-04 13:11 UTC (permalink / raw) To: Jakub Narebski; +Cc: Linus Torvalds, Roman Shaposhnik, git On Thu, 3 Apr 2008, Jakub Narebski wrote: > Linus Torvalds <torvalds@linux-foundation.org> writes: > > > On Thu, 3 Apr 2008, Roman Shaposhnik wrote: > >> > >> The last item (trees) also seem to take the most space and the most > >> reasonable explanation that I can offer is that NetBeans repository has > >> a really weird structure where they have approximately 700 (yes, seven > >> hundred!) top-level subdirectories there. They are clearly > >> Submodules-shy, but that's another issue that I will need to address > >> with them. > > > > Trees taking the biggest amount of space is not unheard of, and it may > > also be that the name heuristics (for finding good packing partners) could > > be failign, which would result in a much bigger pack than necessary. > > > > So if you already did an aggressive repack like the above, I'd happily > > take a look at whether maybe it's bad heuristics for finding tree objects > > to pair up for delta-compression. Do you have a place where you can put > > that repo for people to clone and look at? > > Hmmm... I wonder if it would be the case that would speed-up > development of pack v4. Not really. Pack v4 won't magically shrink a repository to less than half the pack v3 size. I think we're simply facing the same situation as with the initial GCC repository which shrank from 3GB down to 300MB or so due to misfitted repacking parameters. > If I remember correctly one of bigger changes > was the way trees were represented in pack; the biggest improvement > was for trees. Yes, but that wasn't really so much about size but rather access speed by not deflating them. The pack v4 tree representation would certainly help, of course, but I suspect that simply repacking with more aggressive window/depth arguments would be even more effective in this case. > One of bigger hindrances, as I understand it, in developing pack v4 > was the fact that it didn't offer that much of improvement in typical > cases for the work needed... but perhaps "your" repository would be > good showcase for pack v4. The biggest hindrance for pack v4 is actually the lack of a native runtime tree walking, and having both tree object formats properly and optimally abstracted has not been looked at yet. Speed is the primary goal for pack v4. The fact that it also provides a 10% pack reduction is only consequential. But without native tree walking we must recreate the legacy tree format on the fly each time a tree object is loaded which dwarfs any improvements pack v4 is aiming for (yes it is still a little bit faster than pack v3 nevertheless, but not yet significantly enough to overcome the incompatibility costs). Nicolas (who wishes he was still a student with plenty of hacking time) ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: Achieving efficient storage of weirdly structured repos 2008-04-04 13:11 ` Nicolas Pitre @ 2008-04-04 14:16 ` Pieter de Bie 2008-04-05 3:24 ` Shawn O. Pearce 1 sibling, 0 replies; 14+ messages in thread From: Pieter de Bie @ 2008-04-04 14:16 UTC (permalink / raw) To: Nicolas Pitre; +Cc: Jakub Narebski, Linus Torvalds, Roman Shaposhnik, git On 4 apr 2008, at 15:11, Nicolas Pitre wrote: > I think we're simply facing the same situation as with the initial GCC > repository which shrank from 3GB down to 300MB or so due to misfitted > repacking parameters. I'd say so too. I imported the first 14000 revisions of that hg repository and the resulting pack file was just 25 MB after repacking. - Pieter ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: Achieving efficient storage of weirdly structured repos 2008-04-04 13:11 ` Nicolas Pitre 2008-04-04 14:16 ` Pieter de Bie @ 2008-04-05 3:24 ` Shawn O. Pearce 1 sibling, 0 replies; 14+ messages in thread From: Shawn O. Pearce @ 2008-04-05 3:24 UTC (permalink / raw) To: Nicolas Pitre; +Cc: Jakub Narebski, Linus Torvalds, Roman Shaposhnik, git Nicolas Pitre <nico@cam.org> wrote: > On Thu, 3 Apr 2008, Jakub Narebski wrote: > > > One of bigger hindrances, as I understand it, in developing pack v4 > > was the fact that it didn't offer that much of improvement in typical > > cases for the work needed... but perhaps "your" repository would be > > good showcase for pack v4. > > The biggest hindrance for pack v4 is actually the lack of a native > runtime tree walking, and having both tree object formats properly and > optimally abstracted has not been looked at yet. > > Speed is the primary goal for pack v4. The fact that it also provides a > 10% pack reduction is only consequential. But without native tree > walking we must recreate the legacy tree format on the fly each time a > tree object is loaded which dwarfs any improvements pack v4 is aiming > for (yes it is still a little bit faster than pack v3 nevertheless, but > not yet significantly enough to overcome the incompatibility costs). Even though we don't have native tree walking, I think the right way to do this is to put in pack v4 with "canonical tree, canonical commit" mode, where it inflates its native tree/commit encoding into the canonical forms, then come back later with native walking. Canonical mode is still faster than pack v2 inflate is for these types, so it does (slightly) boost rev-list performance. It might chop a solid 30% off the CPU time jgit spends in its equivilant of revision.c, and that's without teaching jgit to use the native pack v4 encoding directly. Once we have it in we can experiment with the necessary abstractions to handle the two different available encodings, and allowing higher level code to switch back and forth between them as objects come from loose or pack v2, and from pack v4. One of the things we wanted to do was boost path limiter performance by matching on tree name ids when walking a pack v4 native tree, but fall back to the string based memcmp when walking a canonical tree. That won't be easy to design without the two different encodings being available at the lower level in sha1_file.c. Just my rapidly declining .02 bush peso. > Nicolas (who wishes he was still a student with plenty of hacking time) Don't we all. :-) -- Shawn. ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: Achieving efficient storage of weirdly structured repos 2008-04-03 21:11 ` Linus Torvalds 2008-04-04 6:21 ` Jakub Narebski @ 2008-04-04 23:30 ` Roman Shaposhnik 2008-04-04 23:57 ` Linus Torvalds 1 sibling, 1 reply; 14+ messages in thread From: Roman Shaposhnik @ 2008-04-04 23:30 UTC (permalink / raw) To: Linus Torvalds; +Cc: git Hi Linus! On Thu, 2008-04-03 at 14:11 -0700, Linus Torvalds wrote: > > On Thu, 3 Apr 2008, Roman Shaposhnik wrote: > > > > The repository was created using hg2git (the one based on git-fast-import) > > and it was GC'ed and REPACK'ed just in case. > > Before going any further - exactly _how_ was it repacked? I believe it was the following two steps: $ git gc --aggressive $ git repack > In particular, when using importers that do partial packing on their own > (and any "git-fastimport" user is that by definition - and I think > hg2git does that), at the end of it all you have to make sure to repack in > a way where the repacking will totally discard the import-time packfiles. Good point. Speaking of which: do you have an FAQ for importers? The entries in the official FAQ (http://git.or.cz/gitwiki/GitFaq#head-929a8825d04dde226c2530f5337d3b3ed8dcc7ce) seem a bit stale for such an important issue. After all, importing from an existing SCM is what usually forms a first time impression of Git's effectiveness. > IOW, that's one of the very few times you should use "-f" to git repack. Got it! > It's usually also a good place to make sure that since you ignore the old > packing information, it's best to also make sure that the new packing info > is good by using a bigger window (and perhaps a bigger depth). That makes > the packing much slower, of course, but this is meant to be a one-time > event. > > So try something like > > git repack -a -d -f --depth=100 --window=100 > > if you have a good CPU and plenty of memory. That turned out to be a perfect suggestion. Thank you. I'm now the happiest camper ever. And I'm also also pretty dumbfounded ;-) Here's what happened. I started with a a repository filled with "loose" (one object per file) objects (the reason I needed it was for the ease of sleuthing through individual objects and it was created by git-unpack-objects from that initial 1.1Gb pack). And I tried to pack it exactly like you suggested: $ git-pack-objects --depth=100 --window=100 --delta-base-offset --progress pack < objects Generating pack... Counting objects: 1096305 Done counting 1159628 objects. Deltifying 1159628 objects... 100% (1159628/1159628) done Writing 1159628 objects... dd134c407324dc55b0cd2aa3a9e1b3420c2bba3f Total 1159628 (delta 386980), reused 0 (delta 0) and it payed off reasonably well: $ du -s NB-clone 670M NB-clone It still was bigger than the Mercurial repository but at least it got 2 times smaller than the original result of hg2git. Now, if it wasn't for a friend of mine, I probably would've stopped there. But he showed up and saved the day ;-) His comments made me try something that I didn't consider to be of any use -- repacking a freshly packed pack with the *same* --depth=100 --window=100: $ git repack -a -f --window=100 --depth=100 Generating pack... Counting objects: 1056829 Done counting 1159628 objects. Deltifying 1159628 objects... 100% (1159628/1159628) done Writing 1159628 objects... 100% (1159628/1159628) done Total 1159628 (delta 614516), reused 0 (delta 0) Pack pack-dd134c407324dc55b0cd2aa3a9e1b3420c2bba3f created. And then, a miracle occurred: $ du -sh NB-small 268M NB-small Now, don't get me wrong: I'm as happy as a clam. The repository is now *smaller* than the Mercurial's and because the structure of the tree is so weird Git gets major points here. The only question that is still bothering me is: how did it happen? Why did repacking a repository with exactly the same set of objects and the only difference being where these objects resided (former case filesystem, the later case an intermediate pack) made so huge a difference? Please help! > > The last item (trees) also seem to take the most space and the most > > reasonable explanation that I can offer is that NetBeans repository has > > a really weird structure where they have approximately 700 (yes, seven > > hundred!) top-level subdirectories there. They are clearly > > Submodules-shy, but that's another issue that I will need to address > > with them. > > Trees taking the biggest amount of space is not unheard of, and it may > also be that the name heuristics (for finding good packing partners) could > be failign, which would result in a much bigger pack than necessary. Is there any documentation that describes the heuristics involved in creating a pack? > So if you already did an aggressive repack like the above, I'd happily > take a look at whether maybe it's bad heuristics for finding tree objects > to pair up for delta-compression. Do you have a place where you can put > that repo for people to clone and look at? Unfortunately I don't. The only thing I can do is I can always create a *.tar.bz2 and put and on Sun's ftp server. Actually, that makes me wonder: is there any public Git hosting available such that publishing a hefty repository for the forensic purposes only wouldn't violate their terms of use? Thanks, Roman. P.S. Oh, and here's one extra tiny question that I also have: what does the output: Total 1159628 (delta 614516), reused 0 (delta 0) really mean? ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: Achieving efficient storage of weirdly structured repos 2008-04-04 23:30 ` Roman Shaposhnik @ 2008-04-04 23:57 ` Linus Torvalds 2008-04-06 0:13 ` Roman Shaposhnik 0 siblings, 1 reply; 14+ messages in thread From: Linus Torvalds @ 2008-04-04 23:57 UTC (permalink / raw) To: Roman Shaposhnik; +Cc: git On Fri, 4 Apr 2008, Roman Shaposhnik wrote: > > That turned out to be a perfect suggestion. Thank you. I'm now the > happiest camper ever. And I'm also also pretty dumbfounded ;-) Ok, the dumbfounded we can help with. > Here's what happened. > > I started with a a repository filled with "loose" (one object per file) > objects (the reason I needed it was for the ease of sleuthing through > individual objects and it was created by git-unpack-objects from that > initial 1.1Gb pack). And I tried to pack it exactly like you > suggested: > $ git-pack-objects --depth=100 --window=100 --delta-base-offset --progress pack < objects Well, that's not exactly like I suggested, that's a *really* old-fashioned way. But the part you left off was what the "objects" file contained? In particular, "git pack-objects" can take just a raw list of objects, and it will *work*, but without the naming information and without the ordering information on the objects, the end result will generally suck. So how did you generate the "objects" file? You can do it by just listing every single loose object you have, and it will work, but the end result won't be very pretty. > Total 1159628 (delta 386980), reused 0 (delta 0) > > and it payed off reasonably well: > $ du -s NB-clone > 670M NB-clone So what git-repack-objects will do is that it will *try* to figure out a good and likely pairing of objects, and it uses the type and the size of the objectf rot it, but it obviously didn't find very good deltas. In fact, only about 35% of all objects are deltas at all! Which is why I suspect that your object list was just the raw list of objects. > His comments made me try something that I didn't consider to be of any > use -- repacking a freshly packed pack with the *same* --depth=100 > --window=100: > $ git repack -a -f --window=100 --depth=100 Well, "git repack" doesn't just list the objects and then repack them, it also - lists them with the filename they were reachable from (if they are trees or blobs) - sorts them topologically (ie most reachable objects first). And that first thing will generate a *much* better packing, because now the packing algorithm can look at the filename, and generate a much better guess about which objects to try to delta against each others. And now you have > Total 1159628 (delta 614516), reused 0 (delta 0) Ie you got a much better 60% delta ratio, and obviously a much smaller pack. But it's not just smaller, because the resulting pack will also have the objects in topological order, so that objects that are "new" (ie more closely reachable from the top-of-tree) will be at the front of the pack, so you'll also generally have better IO patterns in the packfile itself. > Now, don't get me wrong: I'm as happy as a clam. The repository is now > *smaller* than the Mercurial's and because the structure of the > tree is so weird Git gets major points here. The only question that > is still bothering me is: how did it happen? Why did repacking > a repository with exactly the same set of objects and the only > difference being where these objects resided (former case filesystem, > the later case an intermediate pack) made so huge a difference? That location thing really shouldn't have mattered (since you used "-f", which throws it away). So I think it was all from how you generated the list of objects. But since I don't know how you did that, I cannot guarantee anything. > Is there any documentation that describes the heuristics involved in > creating a pack? It's been explained a few times on the mailing list, but I don't know if there is some write-up. The git pack format is a bit more relaxed than any other SCM format I know about, since it literally is just a "bunch of objects with deltas randomly between them". So a pack can look just about any way it wants - there are no real ordering requirements (--delta-base-offset does require that the delta follows the base, but even that is really just a trivial practical "because otherwise you could never know what the offset in the pack-file was" issue rather than anything else). So you can order the objects and make deltas just about any way you want. What "git repack" will do is to sort objects by <type, namehash, length>, and then walk the list with the window of the specified size to see the best pairing it can do. The "namehash" is just designed so that files that have the same name sort together (and it's also not a real "hash" - it's really a meaningless number, but one that is designed so that even if the names aren't exactly the same, they sort closer together if they end in similar character sequences - so a file that moves from one directory to another but keeps the same basename will sort source and destination together). See "type_size_sort()" in builtin-pack-objects.c to see what's up. (The preferred_base thing is just to keep objects we actually want to *keep* in the pack separated from the objects we may be using for references - but that is only used for transferring objects over the wire, never for standalone packs). Oh. And now I notice that there is *some* documentation on this in Documentation/technical/pack-heuristics.txt but that's pretty old (not that I think it has really changed) and not very deep. Just taken from some #irc discussion. > P.S. Oh, and here's one extra tiny question that I also have: what > does the output: > Total 1159628 (delta 614516), reused 0 (delta 0) > really mean? It just means that there were a million objects total, the pack was created with 614k of them as deltas against other objects, and none of them were re-used from previous packs. You'll see that when you do incremental repacks, 99% of all the objects and deltas get re-used from the old pack, which is how we can make repacking fairly efficient. That's imporant because the native git protocol format itself is just a pack-file (plus the handshake to figure out what both sides have, of course). So a "git clone" implies a repack. Linus ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: Achieving efficient storage of weirdly structured repos 2008-04-04 23:57 ` Linus Torvalds @ 2008-04-06 0:13 ` Roman Shaposhnik 2008-04-06 0:48 ` Linus Torvalds 0 siblings, 1 reply; 14+ messages in thread From: Roman Shaposhnik @ 2008-04-06 0:13 UTC (permalink / raw) To: Linus Torvalds; +Cc: git Hi Linus, On Apr 4, 2008, at 4:57 PM, Linus Torvalds wrote: > On Fri, 4 Apr 2008, Roman Shaposhnik wrote: >> >> That turned out to be a perfect suggestion. Thank you. I'm now the >> happiest camper ever. And I'm also also pretty dumbfounded ;-) > > Ok, the dumbfounded we can help with. Great! >> Here's what happened. >> >> I started with a a repository filled with "loose" (one object per >> file) >> objects (the reason I needed it was for the ease of sleuthing through >> individual objects and it was created by git-unpack-objects from that >> initial 1.1Gb pack). And I tried to pack it exactly like you >> suggested: >> $ git-pack-objects --depth=100 --window=100 --delta-base-offset >> --progress pack < objects > > Well, that's not exactly like I suggested, that's a *really* old- > fashioned > way. It sure is old-fashioned. My only excuse is that since git repack is a shell script around git-pack-objects I've always felt comfortable just using the lower level utility. > But the part you left off was what the "objects" file contained? It was pretty much the result of (cd .git/objects ; find . -type f | tr -d './') Omitting that was quite silly on my part, but I was really convinced that because of the internal sorting the original ordering of objects wouldn't matter at all. > In particular, "git pack-objects" can take just a raw list of > objects, and > it will *work*, but without the naming information and without the > ordering information on the objects, the end result will generally > suck. > > So how did you generate the "objects" file? You can do it by just > listing > every single loose object you have, and it will work, but the end > result > won't be very pretty. So it seems that my list of objects was different from what git-rev- list/setup_revisions() would have provided in two ways: 1. the order of objects was arbitrary 2. the naming of blobs and trees was missing #2 was, indeed, a huge oversight on my part. At the same time, I've always thought that #1 shouldn't matter, because, as you pointed out, the objects get sorted by <type, namehash, length> anyway. However, it seems that because of the lack of naming there was much less sorting done by type_size_sort() and the original order persevered (at least within type-size partitions). It did matter after all! Now, in my particular case, the ordering was braindead and it resulted in a highly visible inefficiency. On the other hand, it seems that creative ordering could very well be used to exploit localities which go beyond how default name hashing in builtin-pack-objects.c works. In fact, it starts to look awfully like custom MPEG profiles where if you know your footage you can achieve a much higher compression ratio compared to what the default might give you. It also makes it very clear that Git's approach of doing away with per-file content tracking is quite superior to the in-file deltas. Cool! Here's my final question on that issue: wouldn't it be great to give users a direct control over specifying the list of objects in exactly the order they would like them to be tried for deltifying? Something like -- preserve-order option available for git-pack-objects? >> Total 1159628 (delta 614516), reused 0 (delta 0) > > Ie you got a much better 60% delta ratio, and obviously a much smaller > pack. But it's not just smaller, because the resulting pack will > also have > the objects in topological order, so that objects that are "new" > (ie more closely reachable from the top-of-tree) will be at the > front of > the pack, so you'll also generally have better IO patterns in the > packfile > itself. Got it! This makes total sense. >> Is there any documentation that describes the heuristics involved in >> creating a pack? > > It's been explained a few times on the mailing list, but I don't > know if > there is some write-up. > > The git pack format is a bit more relaxed than any other SCM format > I know > about, since it literally is just a "bunch of objects with deltas > randomly > between them". So a pack can look just about any way it wants - > there are > no real ordering requirements (--delta-base-offset does require > that the > delta follows the base, but even that is really just a trivial > practical > "because otherwise you could never know what the offset in the pack- > file > was" issue rather than anything else). > > So you can order the objects and make deltas just about any way you > want. > What "git repack" will do is to sort objects by <type, namehash, > length>, > and then walk the list with the window of the specified size to see > the > best pairing it can do. > > The "namehash" is just designed so that files that have the same > name sort > together (and it's also not a real "hash" - it's really a meaningless > number, but one that is designed so that even if the names aren't > exactly > the same, they sort closer together if they end in similar character > sequences - so a file that moves from one directory to another but > keeps > the same basename will sort source and destination together). > > See "type_size_sort()" in builtin-pack-objects.c to see what's up. > > (The preferred_base thing is just to keep objects we actually want to > *keep* in the pack separated from the objects we may be using for > references - but that is only used for transferring objects over > the wire, > never for standalone packs). > > Oh. And now I notice that there is *some* documentation on this in > > Documentation/technical/pack-heuristics.txt > > but that's pretty old (not that I think it has really changed) and not > very deep. Just taken from some #irc discussion. > >> P.S. Oh, and here's one extra tiny question that I also have: what >> does the output: >> Total 1159628 (delta 614516), reused 0 (delta 0) >> really mean? > > It just means that there were a million objects total, the pack was > created with 614k of them as deltas against other objects, and none of > them were re-used from previous packs. > > You'll see that when you do incremental repacks, 99% of all the > objects > and deltas get re-used from the old pack, which is how we can make > repacking fairly efficient. That's imporant because the native git > protocol format itself is just a pack-file (plus the handshake to > figure > out what both sides have, of course). So a "git clone" implies a > repack. Very nice summary! Many, many thanks for providing it! Thanks, Roman. ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: Achieving efficient storage of weirdly structured repos 2008-04-06 0:13 ` Roman Shaposhnik @ 2008-04-06 0:48 ` Linus Torvalds 2008-04-06 16:10 ` Jeff King 0 siblings, 1 reply; 14+ messages in thread From: Linus Torvalds @ 2008-04-06 0:48 UTC (permalink / raw) To: Roman Shaposhnik; +Cc: git On Sat, 5 Apr 2008, Roman Shaposhnik wrote: > > It was pretty much the result of (cd .git/objects ; find . -type f | tr -d > './') Ok, that explains it. And yes, it definitely works, but I think you now understand why it gave you that oddly pessimised pack-file. That said, I think it's also a really good example of how the git pack-files really are just a "bag of objects", and the fact that it worked but was non-optimal is a very good way to show something very fundamental about git. > So it seems that my list of objects was different from what > git-rev-list/setup_revisions() > would have provided in two ways: > 1. the order of objects was arbitrary > 2. the naming of blobs and trees was missing > #2 was, indeed, a huge oversight on my part. At the same time, I've always > thought > that #1 shouldn't matter, because, as you pointed out, the objects get > sorted by <type, namehash, length> anyway. However, it seems that because > of the lack of naming there was much less sorting done by type_size_sort() > and the original order persevered (at least within type-size partitions). > It did matter after all! Well, a pack actually as two *different* orderings, in that there's one ordering that is used for laying out the result in the pack-file, and another ordering that is used for actually finding the deltas with the sliding window. And the input order is actually used for the layout ordering. And both matter. The <type, namehash, length> is the one that determines how large the resulting pack is, but the layout order is what causes the IO patterns for most common uses, so if the pack-file is cold in the cache, it can matter quite a bit for performance. So the layout order is the one you give as input to "git pack-objects". And if that input has no particular ordering, then the final layout will also have no particular ordering and you get random IO patterns when loading from disk. > Now, in my particular case, the ordering was braindead and it resulted in > a highly visible inefficiency. On the other hand, it seems that creative > ordering could very well be used to exploit localities which go beyond > how default name hashing in builtin-pack-objects.c works. Yes. We've changed some of the heuristics subtly over time (eg the whole name-hashing was a tweak that was added later), but you're absolutely right - it's an area where some *particular* usage model could come up with a special ordering to optimize a some special load. The defaults are meant to be "good enough", and there has been some thought put into them, but yes, there's probably room for specialization there. In particular, one thing I've wanted to do is to do some "fingerprint hash" and perhaps use that as a sorting guide for finding deltas even more aggressively than we do now, but I've never really found the energy to try something like that out. > In fact, it starts to look awfully like custom MPEG profiles where if you know > your footage you can achieve a much higher compression ratio > compared to what the default might give you. It also makes it very > clear that Git's approach of doing away with per-file content tracking > is quite superior to the in-file deltas. Cool! Well, one thing that git doesn't do, but that I find intriguing is if it could be possible to do deltas not even between different files, but even *across* files. One thing that the git model sucks at is how it's not very good at handling large objects. I've often wondered if I should have made "object" be more fine-grained and tried to build up large files from multiple smaller objects. [ That said, I think git does the right thing - for source code. The blocking-up of files would cause a rather more complex model, and one of the great things about git is how simple the basic model is. But the large-file thing does mean that git potentially sucks really badly for some other loads ] > Here's my final question on that issue: wouldn't it be great to give users > a direct control over specifying the list of objects in exactly the order > they would like them to be tried for deltifying? Something like > --preserve-order option available for git-pack-objects? See above: there's actually two orders, so you'd need to specify *both* orders some way, since the input order already has meaning (it's assumed to be the "topological ordering". You *can* approximate what you descibe by playing games with the filenames (which you can give in the input too - they're just strings on the same line as the SHA1), and that would be a total hack to give that secondary ordering. And yes, I agree that it might be cool to do this explicitly some way. If only to give a way to test experimental versions of the type sorting. [ To see how the filename thing works, do git rev-list --objects --all > object-list git pack-objects ... < object-list and you can look at the "object-list" file to see how it's not just the list of SHA1, it now has the filename info in it too. Imagine changing that filename to generate a "custom" pack order ] Linus ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: Achieving efficient storage of weirdly structured repos 2008-04-06 0:48 ` Linus Torvalds @ 2008-04-06 16:10 ` Jeff King 2008-04-07 0:13 ` Nicolas Pitre 0 siblings, 1 reply; 14+ messages in thread From: Jeff King @ 2008-04-06 16:10 UTC (permalink / raw) To: Linus Torvalds; +Cc: Roman Shaposhnik, git On Sat, Apr 05, 2008 at 05:48:43PM -0700, Linus Torvalds wrote: > One thing that the git model sucks at is how it's not very good at > handling large objects. I've often wondered if I should have made "object" > be more fine-grained and tried to build up large files from multiple > smaller objects. > > [ That said, I think git does the right thing - for source code. The > blocking-up of files would cause a rather more complex model, and one of > the great things about git is how simple the basic model is. But the > large-file thing does mean that git potentially sucks really badly for > some other loads ] I have considered something like this for one of my repos, which is full of images. The large image data very rarely changes, but the small EXIF tags do. My thought was something like: - add a new object type, multiblob; a multiblob contains zero or more "child" sha1s, each of which is another multiblob or a blob. The data in the multiblob is an in-order concatenation of its children. - you would create multiblobs with a "smart" git-add that understands the filetype and splits the file accordingly (in my case, probably a chunk of headers and EXIF data, and then a chunk with the image data). - in most of git, whenever you need a blob, you just "unwrap" the multiblob to get the original blob data - because they're separate objects, pack-objects automagically does the right thing - a few places would benefit from handling multiblobs specially. In particular: - the diff machinery could do much more efficient comparisons for some inexact renames. E.g., multiblob "1234\n5678" and multiblob "abcd\n5678" could ignore the "5678" id. - the diff machinery could show diffs that were more human readable (e.g., even without understanding what the chunks of the multiblob _mean_, it can still say "most of this image didn't change, but this textual part did"). Of course there are a few drawbacks: - one of git's strengths is that content is the same no matter who adds it or how. Now the same file has a different sha1 as a multiblob versus a regular blob. - it breaks the git model of "we store state in the simplest way, and figure everything out afterwards." IOW, you are stuck with whatever crappy multiblob split you did when you added or updated the file. The usual pattern in git is "dumb add, smart view". Now maybe it is worth breaking this for two reasons: - dumb add, smart view is often very resource intensive; we can get smaller packs and faster rename detection out of this - we might be losing information; in the case of renames, we can justify not explicitly recording because we can figure out later what actually happened. I don't know if there is a multiblob split that would encapsulate useful user input. My EXIF example doesn't; with a little more CPU time, you could just do the automated split at diff or delta time. So it's an approach that I think would work, but I'm not sure it's worth the effort unless somebody comes up with a compelling reason that you can't just split the blobs up after the fact (and maybe the right approach is that pack v5 can split blobs intelligently to get better deltas, so they are still blobs, but we just store them differently). -Peff ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: Achieving efficient storage of weirdly structured repos 2008-04-06 16:10 ` Jeff King @ 2008-04-07 0:13 ` Nicolas Pitre 2008-04-07 0:18 ` Jeff King 0 siblings, 1 reply; 14+ messages in thread From: Nicolas Pitre @ 2008-04-07 0:13 UTC (permalink / raw) To: Jeff King; +Cc: Linus Torvalds, Roman Shaposhnik, git On Sun, 6 Apr 2008, Jeff King wrote: > My thought was something like: > > - add a new object type, multiblob; a multiblob contains zero or more > "child" sha1s, each of which is another multiblob or a blob. The > data in the multiblob is an in-order concatenation of its children. > > - you would create multiblobs with a "smart" git-add that understands > the filetype and splits the file accordingly (in my case, probably a > chunk of headers and EXIF data, and then a chunk with the image > data). Well, in your example, the large image part should already be common to many objects due to deltas if they're really the same: different objects will only have different EXIF data plus a delta reference to the same base image object. So in a way the split is already there. Needs only that some applications exploit this information at runtime. Nicolas ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: Achieving efficient storage of weirdly structured repos 2008-04-07 0:13 ` Nicolas Pitre @ 2008-04-07 0:18 ` Jeff King 2008-04-07 0:36 ` Nicolas Pitre 0 siblings, 1 reply; 14+ messages in thread From: Jeff King @ 2008-04-07 0:18 UTC (permalink / raw) To: Nicolas Pitre; +Cc: Linus Torvalds, Roman Shaposhnik, git On Sun, Apr 06, 2008 at 08:13:10PM -0400, Nicolas Pitre wrote: > Well, in your example, the large image part should already be common to > many objects due to deltas if they're really the same: different objects > will only have different EXIF data plus a delta reference to the same > base image object. So in a way the split is already there. Needs only > that some applications exploit this information at runtime. Yes, the resulting packfiles find the deltas and are pretty efficient (although it is quite slow to pack). However, the delta information is not used at all for inexact rename detection. Are you proposing to make that information available to the rename detector? -Peff ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: Achieving efficient storage of weirdly structured repos 2008-04-07 0:18 ` Jeff King @ 2008-04-07 0:36 ` Nicolas Pitre 0 siblings, 0 replies; 14+ messages in thread From: Nicolas Pitre @ 2008-04-07 0:36 UTC (permalink / raw) To: Jeff King; +Cc: Linus Torvalds, Roman Shaposhnik, git On Sun, 6 Apr 2008, Jeff King wrote: > On Sun, Apr 06, 2008 at 08:13:10PM -0400, Nicolas Pitre wrote: > > > Well, in your example, the large image part should already be common to > > many objects due to deltas if they're really the same: different objects > > will only have different EXIF data plus a delta reference to the same > > base image object. So in a way the split is already there. Needs only > > that some applications exploit this information at runtime. > > Yes, the resulting packfiles find the deltas and are pretty efficient > (although it is quite slow to pack). However, the delta information is > not used at all for inexact rename detection. Are you proposing to make > that information available to the rename detector? In practice I don't know how well that would work since the current heuristic groups deltas and their base according to the name under which those objects are known. So it is possible that some inexact renames end up creating objects that currently never delta against each other even if that would be the right thing to do. But in some cases, that might be beneficial to look at the delta object themselves when diffing files as the delta might already contain the information telling the upper layer that file A and B are in fact 90% the same and that they differ from offset X to Y only. Nicolas ^ permalink raw reply [flat|nested] 14+ messages in thread
end of thread, other threads:[~2008-04-07 0:37 UTC | newest] Thread overview: 14+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2008-04-03 19:42 Achieving efficient storage of weirdly structured repos Roman Shaposhnik 2008-04-03 21:11 ` Linus Torvalds 2008-04-04 6:21 ` Jakub Narebski 2008-04-04 13:11 ` Nicolas Pitre 2008-04-04 14:16 ` Pieter de Bie 2008-04-05 3:24 ` Shawn O. Pearce 2008-04-04 23:30 ` Roman Shaposhnik 2008-04-04 23:57 ` Linus Torvalds 2008-04-06 0:13 ` Roman Shaposhnik 2008-04-06 0:48 ` Linus Torvalds 2008-04-06 16:10 ` Jeff King 2008-04-07 0:13 ` Nicolas Pitre 2008-04-07 0:18 ` Jeff King 2008-04-07 0:36 ` Nicolas Pitre
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox