* 1.3.0 creating bigger packs than 1.2.3 @ 2006-04-20 13:36 Shawn Pearce 2006-04-20 14:47 ` Linus Torvalds 2006-04-20 16:09 ` Nicolas Pitre 0 siblings, 2 replies; 30+ messages in thread From: Shawn Pearce @ 2006-04-20 13:36 UTC (permalink / raw) To: git Apparently I have created a repository which v1.2.3 packs about 50% smaller than 'next' does: v1.2.3 (tag): 60M pack-7f766f5af5547554bacb28c0294bd562589dc5e7.pack 1.2.3.gf3a4 (an older 'next'): 128M pack-7f766f5af5547554bacb28c0294bd562589dc5e7.pack 1.3.0.rc4.g8060 (a fairly recent 'next'): 118M pack-7f766f5af5547554bacb28c0294bd562589dc5e7.pack Repeated packing with 1.3.0.rc4.g8060 doesn't seem to change the size of the pack file, its pretty consistent at 118M. Given that disk is pretty cheap these days I'm not concerned about the 2x increase but thought I'd let folks know that the packing improvements in 1.3.0 seem to have taken a small step backwards with regards to this particular dataset. I can make the repository available if somebody wants to look at it. -- Shawn. ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: 1.3.0 creating bigger packs than 1.2.3 2006-04-20 13:36 1.3.0 creating bigger packs than 1.2.3 Shawn Pearce @ 2006-04-20 14:47 ` Linus Torvalds 2006-04-20 15:03 ` Shawn Pearce 2006-04-20 16:09 ` Nicolas Pitre 1 sibling, 1 reply; 30+ messages in thread From: Linus Torvalds @ 2006-04-20 14:47 UTC (permalink / raw) To: Shawn Pearce; +Cc: git On Thu, 20 Apr 2006, Shawn Pearce wrote: > Apparently I have created a repository which v1.2.3 packs about 50% > smaller than 'next' does: > > v1.2.3 (tag): > 60M pack-7f766f5af5547554bacb28c0294bd562589dc5e7.pack > > 1.2.3.gf3a4 (an older 'next'): > 128M pack-7f766f5af5547554bacb28c0294bd562589dc5e7.pack > > 1.3.0.rc4.g8060 (a fairly recent 'next'): > 118M pack-7f766f5af5547554bacb28c0294bd562589dc5e7.pack > > Repeated packing with 1.3.0.rc4.g8060 doesn't seem to change the > size of the pack file, its pretty consistent at 118M. First try "git repack -a -d f", where the "-f" is the magic one. Without the -f, git repack will re-use old pack information, which is much much faster, but not as space-efficient. If that doesn't help, it might be time to look at the actual repo, but try that first. Linus ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: 1.3.0 creating bigger packs than 1.2.3 2006-04-20 14:47 ` Linus Torvalds @ 2006-04-20 15:03 ` Shawn Pearce 2006-04-20 16:07 ` Linus Torvalds 0 siblings, 1 reply; 30+ messages in thread From: Shawn Pearce @ 2006-04-20 15:03 UTC (permalink / raw) To: Linus Torvalds; +Cc: git Linus Torvalds <torvalds@osdl.org> wrote: > > On Thu, 20 Apr 2006, Shawn Pearce wrote: > > > Apparently I have created a repository which v1.2.3 packs about 50% > > smaller than 'next' does: > > > > v1.2.3 (tag): > > 60M pack-7f766f5af5547554bacb28c0294bd562589dc5e7.pack > > > > 1.2.3.gf3a4 (an older 'next'): > > 128M pack-7f766f5af5547554bacb28c0294bd562589dc5e7.pack > > > > 1.3.0.rc4.g8060 (a fairly recent 'next'): > > 118M pack-7f766f5af5547554bacb28c0294bd562589dc5e7.pack > > > > Repeated packing with 1.3.0.rc4.g8060 doesn't seem to change the > > size of the pack file, its pretty consistent at 118M. > > First try "git repack -a -d f", where the "-f" is the magic one. > > Without the -f, git repack will re-use old pack information, which is much > much faster, but not as space-efficient. > > If that doesn't help, it might be time to look at the actual repo, but try > that first. So with 1.3.0.g56c1 "git repack -a -d -f" did worse: Total 46391, written 46391 (delta 6649), reused 39742 (delta 0) 129M pack-7f766f5af5547554bacb28c0294bd562589dc5e7.pack I just tried -f on v1.2.3 and it did slightly better then before: Total 46391, written 46391 (delta 6847), reused 38012 (delta 0) 59M pack-7f766f5af5547554bacb28c0294bd562589dc5e7.pack -- Shawn. ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: 1.3.0 creating bigger packs than 1.2.3 2006-04-20 15:03 ` Shawn Pearce @ 2006-04-20 16:07 ` Linus Torvalds 2006-04-20 16:43 ` Shawn Pearce 0 siblings, 1 reply; 30+ messages in thread From: Linus Torvalds @ 2006-04-20 16:07 UTC (permalink / raw) To: Shawn Pearce; +Cc: git On Thu, 20 Apr 2006, Shawn Pearce wrote: > > So with 1.3.0.g56c1 "git repack -a -d -f" did worse: > > Total 46391, written 46391 (delta 6649), reused 39742 (delta 0) > 129M pack-7f766f5af5547554bacb28c0294bd562589dc5e7.pack > > I just tried -f on v1.2.3 and it did slightly better then before: > > Total 46391, written 46391 (delta 6847), reused 38012 (delta 0) > 59M pack-7f766f5af5547554bacb28c0294bd562589dc5e7.pack Interesting. The bigger packs do generate fewer deltas, but they don't seem to be _that_ much fewer. And the deltas themselves certainly shouldn't be bigger. It almost sounds like there's a problem with choosing what to delta against, not necessarily a delta algorithm problem. Although that sounds a bit strange, because I wouldn't have thought we actually changed the packing algorithm noticeably since 1.2.3. Hmm. Doing "gitk v1.2.3.. -- pack-objects.c" shows that I was wrong. Junio did the "hash basename and direname a bit differently" thing, which would appear to change the "find objects to delta against" a lot. That could be it. You could try to revert that change: git revert eeef7135fed9b8784627c4c96e125241c06c65e1 which needs a trivial manual fixup (remove the conflict entirely: everything between the "<<<<" and ">>>>>" lines should go), and see if that's it. You can also try to see if git repack -a -d -f --window=50 makes for a better pack (at the cost of a much slower repack). It makes git try more objects to delta against, and can thus hide a bad sort order. Junio, any other suggestions? Linus ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: 1.3.0 creating bigger packs than 1.2.3 2006-04-20 16:07 ` Linus Torvalds @ 2006-04-20 16:43 ` Shawn Pearce 2006-04-20 17:03 ` Linus Torvalds 0 siblings, 1 reply; 30+ messages in thread From: Shawn Pearce @ 2006-04-20 16:43 UTC (permalink / raw) To: Linus Torvalds; +Cc: git Linus Torvalds <torvalds@osdl.org> wrote: > > > On Thu, 20 Apr 2006, Shawn Pearce wrote: > > > > So with 1.3.0.g56c1 "git repack -a -d -f" did worse: > > > > Total 46391, written 46391 (delta 6649), reused 39742 (delta 0) > > 129M pack-7f766f5af5547554bacb28c0294bd562589dc5e7.pack > > > > I just tried -f on v1.2.3 and it did slightly better then before: > > > > Total 46391, written 46391 (delta 6847), reused 38012 (delta 0) > > 59M pack-7f766f5af5547554bacb28c0294bd562589dc5e7.pack Oddly enough repacking the v1.2.3 pack using 1.3.0.g56c1 created an even smaller pack ("git-repack -a -d"): Total 46391, written 46391 (delta 8253), reused 44985 (delta 6847) 49M pack-7f766f5af5547554bacb28c0294bd562589dc5e7.pack and repacking again with "git-repack -a -d" chopped another 1M: Total 46391, written 46391 (delta 8258), reused 46386 (delta 8253) 48M pack-7f766f5af5547554bacb28c0294bd562589dc5e7.pac but then adding -f definately gives us the 2x explosion again: Total 46391, written 46391 (delta 6649), reused 37894 (delta 0) 129M pack-7f766f5af5547554bacb28c0294bd562589dc5e7.pack > Interesting. The bigger packs do generate fewer deltas, but they don't > seem to be _that_ much fewer. And the deltas themselves certainly > shouldn't be bigger. > > It almost sounds like there's a problem with choosing what to delta > against, not necessarily a delta algorithm problem. Although that sounds a > bit strange, because I wouldn't have thought we actually changed the > packing algorithm noticeably since 1.2.3. > > Hmm. Doing "gitk v1.2.3.. -- pack-objects.c" shows that I was wrong. Junio > did the "hash basename and direname a bit differently" thing, which would > appear to change the "find objects to delta against" a lot. That could be > it. > > You could try to revert that change: > > git revert eeef7135fed9b8784627c4c96e125241c06c65e1 > > which needs a trivial manual fixup (remove the conflict entirely: > everything between the "<<<<" and ">>>>>" lines should go), and see if > that's it. Whoa. I did that revert and fixup on top of 'next'. The pack from "git-repack -a -d -f" is now even larger due to even less delta reuse: Total 46391, written 46391 (delta 5148), reused 39565 (delta 0) 171M pack-7f766f5af5547554bacb28c0294bd562589dc5e7.pack > You can also try to see if > > git repack -a -d -f --window=50 > > makes for a better pack (at the cost of a much slower repack). It makes > git try more objects to delta against, and can thus hide a bad sort order. With --window=50 on 'next' (without the revert'): Total 46391, written 46391 (delta 6666), reused 39723 (delta 0) 129M pack-7f766f5af5547554bacb28c0294bd562589dc5e7.pack For added measure I tried --window=100 and 500 with pretty much the same result (slightly higher delta but still a 129M pack). -- Shawn. ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: 1.3.0 creating bigger packs than 1.2.3 2006-04-20 16:43 ` Shawn Pearce @ 2006-04-20 17:03 ` Linus Torvalds 2006-04-20 17:24 ` Junio C Hamano ` (3 more replies) 0 siblings, 4 replies; 30+ messages in thread From: Linus Torvalds @ 2006-04-20 17:03 UTC (permalink / raw) To: Shawn Pearce; +Cc: Git Mailing List, Nicolas Pitre On Thu, 20 Apr 2006, Shawn Pearce wrote: > > Oddly enough repacking the v1.2.3 pack using 1.3.0.g56c1 created an > even smaller pack ("git-repack -a -d"): That's "normal". Repacking without -f will always pack _more_, never less. So a different packing algorithm can only improve (of course, usually not by a huge margin, and it quickly diminishes). > but then adding -f definately gives us the 2x explosion again: > > Total 46391, written 46391 (delta 6649), reused 37894 (delta 0) > 129M pack-7f766f5af5547554bacb28c0294bd562589dc5e7.pack Right. Doing the -f will discard any old packing info, so if the new packing algorithm has problems (and it obviously does), then using -f will show them. > > You could try to revert that change: > > > > git revert eeef7135fed9b8784627c4c96e125241c06c65e1 > > Whoa. I did that revert and fixup on top of 'next'. The pack > from "git-repack -a -d -f" is now even larger due to even less > delta reuse: Ok, so that wasn't it, and the new sort order is superior. That means that it probably _is_ the delta changes themselves (probably commit c13c6bf7 "diff-delta: bound hash list length to avoid O(m*n) behavior". You can try git revert c13c6bf7 to see if that's it. Although Nico already showed interest, and if you make the archive available to him, he's sure to figure it out. > With --window=50 on 'next' (without the revert'): > > Total 46391, written 46391 (delta 6666), reused 39723 (delta 0) > 129M pack-7f766f5af5547554bacb28c0294bd562589dc5e7.pack Yeah, that didn't do much. Slightly more deltas than without, but not a lot, and it didn't matter much size-wise. You can try "--depth=50" (slogan: more "hot delta on delta action"), but it's looking less and less like a delta selection issue, and more and more like the deltas themselves are deproved. Linus ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: 1.3.0 creating bigger packs than 1.2.3 2006-04-20 17:03 ` Linus Torvalds @ 2006-04-20 17:24 ` Junio C Hamano 2006-04-20 17:31 ` Shawn Pearce ` (2 subsequent siblings) 3 siblings, 0 replies; 30+ messages in thread From: Junio C Hamano @ 2006-04-20 17:24 UTC (permalink / raw) To: Linus Torvalds; +Cc: git Linus Torvalds <torvalds@osdl.org> writes: > You can try "--depth=50" (slogan: more "hot delta on delta action"), but > it's looking less and less like a delta selection issue, and more and more > like the deltas themselves are deproved. I do not think I have time to look into this today until late night, but one thing I noticed is that trying to delta more things sometimes tend to produce bigger result X-<. At the end of pack-objects.c::find_deltas(), there is a code that is commented out, which is remnant from a failed experiment. What it tried to do was to avoid placing an object whose delta depth is already window-size back into the candidates list, which means the next object gets compared with one object more than otherwise would be (the extra one being the oldest one in the window -- which might not produce better delta than the maxed out one, but the delta with the maxed out one would not be used anyway). The result was noticeably worse overall packsize with more deltified objects. We might be better off favoring compressed undeltified representation over deltified representation a bit more aggressively. Currently we allow delta as big as half the uncompressed size minus 20-byte overhead or something like that; tweaking that limit might show improvements. ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: 1.3.0 creating bigger packs than 1.2.3 2006-04-20 17:03 ` Linus Torvalds 2006-04-20 17:24 ` Junio C Hamano @ 2006-04-20 17:31 ` Shawn Pearce 2006-04-20 17:54 ` Nicolas Pitre 2006-04-20 21:31 ` Junio C Hamano 2006-04-20 17:41 ` Nicolas Pitre 2006-04-20 17:55 ` Shawn Pearce 3 siblings, 2 replies; 30+ messages in thread From: Shawn Pearce @ 2006-04-20 17:31 UTC (permalink / raw) To: Linus Torvalds; +Cc: Git Mailing List, Nicolas Pitre I just spent some time bisecting this issue and it looks like the following change by Junio may be the culprit: commit 1d6b38cc76c348e2477506ca9759fc241e3d0d46 Author: Junio C Hamano <junkio@cox.net> Date: Wed Feb 22 22:10:24 2006 -0800 pack-objects: use full pathname to help hashing with "thin" pack. This uses the same hashing algorithm to the "preferred base tree" objects and the incoming pathnames, to group the same files from different revs together, while spreading files with the same basename in different directories. Signed-off-by: Junio C Hamano <junkio@cox.net> :100644 100644 af3bdf5d358b8a47ed23bcb7e9721e956eb59d60 3a16b7e4ce25ec05c64817dfd92dd9d517ab9dd3 M pack-objects.c Linus Torvalds <torvalds@osdl.org> wrote: > > > On Thu, 20 Apr 2006, Shawn Pearce wrote: > > > > Oddly enough repacking the v1.2.3 pack using 1.3.0.g56c1 created an > > even smaller pack ("git-repack -a -d"): > > That's "normal". Repacking without -f will always pack _more_, never less. > So a different packing algorithm can only improve (of course, usually not > by a huge margin, and it quickly diminishes). > > > but then adding -f definately gives us the 2x explosion again: > > > > Total 46391, written 46391 (delta 6649), reused 37894 (delta 0) > > 129M pack-7f766f5af5547554bacb28c0294bd562589dc5e7.pack > > Right. Doing the -f will discard any old packing info, so if the new > packing algorithm has problems (and it obviously does), then using -f will > show them. > > > > You could try to revert that change: > > > > > > git revert eeef7135fed9b8784627c4c96e125241c06c65e1 > > > > Whoa. I did that revert and fixup on top of 'next'. The pack > > from "git-repack -a -d -f" is now even larger due to even less > > delta reuse: > > Ok, so that wasn't it, and the new sort order is superior. > > That means that it probably _is_ the delta changes themselves (probably > commit c13c6bf7 "diff-delta: bound hash list length to avoid O(m*n) > behavior". You can try > > git revert c13c6bf7 > > to see if that's it. Although Nico already showed interest, and if you > make the archive available to him, he's sure to figure it out. > > > With --window=50 on 'next' (without the revert'): > > > > Total 46391, written 46391 (delta 6666), reused 39723 (delta 0) > > 129M pack-7f766f5af5547554bacb28c0294bd562589dc5e7.pack > > Yeah, that didn't do much. Slightly more deltas than without, but not a > lot, and it didn't matter much size-wise. > > You can try "--depth=50" (slogan: more "hot delta on delta action"), but > it's looking less and less like a delta selection issue, and more and more > like the deltas themselves are deproved. > > Linus -- Shawn. ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: 1.3.0 creating bigger packs than 1.2.3 2006-04-20 17:31 ` Shawn Pearce @ 2006-04-20 17:54 ` Nicolas Pitre 2006-04-20 21:31 ` Junio C Hamano 1 sibling, 0 replies; 30+ messages in thread From: Nicolas Pitre @ 2006-04-20 17:54 UTC (permalink / raw) To: Shawn Pearce; +Cc: Linus Torvalds, Git Mailing List On Thu, 20 Apr 2006, Shawn Pearce wrote: > I just spent some time bisecting this issue and it looks like the > following change by Junio may be the culprit: > > commit 1d6b38cc76c348e2477506ca9759fc241e3d0d46 > Author: Junio C Hamano <junkio@cox.net> > Date: Wed Feb 22 22:10:24 2006 -0800 > > pack-objects: use full pathname to help hashing with "thin" pack. > > This uses the same hashing algorithm to the "preferred base > tree" objects and the incoming pathnames, to group the same > files from different revs together, while spreading files with > the same basename in different directories. > > Signed-off-by: Junio C Hamano <junkio@cox.net> > > :100644 100644 af3bdf5d358b8a47ed23bcb7e9721e956eb59d60 3a16b7e4ce25ec05c64817dfd92dd9d517ab9dd3 M pack-objects.c Hmmm... This one is for Junio to fix I'd say. Not sure what it does. Nicolas ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: 1.3.0 creating bigger packs than 1.2.3 2006-04-20 17:31 ` Shawn Pearce 2006-04-20 17:54 ` Nicolas Pitre @ 2006-04-20 21:31 ` Junio C Hamano 2006-04-20 21:53 ` Shawn Pearce 2006-04-20 21:56 ` Jakub Narebski 1 sibling, 2 replies; 30+ messages in thread From: Junio C Hamano @ 2006-04-20 21:31 UTC (permalink / raw) To: Shawn Pearce; +Cc: git, Nicolas Pitre, Linus Torvalds Shawn Pearce <spearce@spearce.org> writes: > I just spent some time bisecting this issue and it looks like the > following change by Junio may be the culprit: > > commit 1d6b38cc76c348e2477506ca9759fc241e3d0d46 > Author: Junio C Hamano <junkio@cox.net> > Date: Wed Feb 22 22:10:24 2006 -0800 > > pack-objects: use full pathname to help hashing with "thin" pack. > > This uses the same hashing algorithm to the "preferred base > tree" objects and the incoming pathnames, to group the same > files from different revs together, while spreading files with > the same basename in different directories. > > Signed-off-by: Junio C Hamano <junkio@cox.net> > Unfortunately, that is not the same hash we use in v1.3.0, so we need to look elsewhere for interactions. v1.2.3 hash was base-name only. doc/Makefile and t/Makefile were thrown in the same bin and sorted by size. When the history you are packing is deep, and doc/Makefile and t/Makefile are not related at all, this made effective size of delta window 1/N where N is the number of such duplicates. The one you found above uses a hash that is fully full-path. The two are in completely different bins, and bins are totally random. This was not a good strategy. v1.3.0 hash is base-name hash concatenated with leading-path has. t/Makefile and doc/Makefile go in separate bins, but the bins are close to each other; this avoids the problem in v1.2.3 when you have deep history, but at the same time if you do not have many many versions of t/Makefile to overflow the delta window, it gives t/Makefile a chance to delta with doc/Makefile. The earlier observation by Linus on reverting eeef7135 is consistent with it; that commit was the one that introduced v1.3.0 hash. You could try this patch to resurrect the hash used in v1.2.3, and you may get better packing for your particular repository; but I am not sure if it gives better results in the general case. I am running the test myself now while waiting for my day-job database to load X-<. NOTE NOTE NOTE. The hash in v1.2.3 was done with the basename (relying on rev-list --objects to only show the basename) and hashed from front to back. The current one uses the same hash scrambling function, but it hashes from back to front, and it knows rev-list --objects gives it a full path. What this patch does is to stop the hashing after we are done with the basename part. So it still gives different hash value to the same path from v1.2.3 version, but the distribution should be equivalent. NOTE 2. Feeding output from the current "rev-list --objects" to v1.2.3 pack-object is the same as "hash full path and spread things out" intermediate version, which is the worst performer. -- >8 -- git diff diff --git a/pack-objects.c b/pack-objects.c index 09f4f2c..e58e169 100644 --- a/pack-objects.c +++ b/pack-objects.c @@ -492,6 +492,8 @@ static unsigned name_hash(struct name_pa name_hash = hash; hash = 0; } + return name_hash; + for (p = path; p; p = p->up) { hash = hash * 11 + '/'; n = p->elem + p->len; ^ permalink raw reply related [flat|nested] 30+ messages in thread
* Re: 1.3.0 creating bigger packs than 1.2.3 2006-04-20 21:31 ` Junio C Hamano @ 2006-04-20 21:53 ` Shawn Pearce 2006-04-20 21:56 ` Jakub Narebski 1 sibling, 0 replies; 30+ messages in thread From: Shawn Pearce @ 2006-04-20 21:53 UTC (permalink / raw) To: Junio C Hamano; +Cc: git, Nicolas Pitre, Linus Torvalds Junio C Hamano <junkio@cox.net> wrote: > Shawn Pearce <spearce@spearce.org> writes: > > > I just spent some time bisecting this issue and it looks like the > > following change by Junio may be the culprit: > > > > commit 1d6b38cc76c348e2477506ca9759fc241e3d0d46 [snip] > Unfortunately, that is not the same hash we use in v1.3.0, so we > need to look elsewhere for interactions. Pity. Then either bisect goofed or there was a goof in meatspace while using bisect. I honestly expected bisect to point at the problem commit. I tried reverting 1d6b38cc but it didn't apply cleanly and I didn't feel like working through all of the conflicts at the time. [snip] > The earlier observation by Linus on reverting eeef7135 is > consistent with it; that commit was the one that introduced > v1.3.0 hash. Yet reverting that didn't help either. [snip] > You could try this patch to resurrect the hash used in v1.2.3, > and you may get better packing for your particular repository; > but I am not sure if it gives better results in the general > case. I am running the test myself now while waiting for my > day-job database to load X-<. [snip] Nope. When applied to 'next' it didn't help very much: Total 46391, written 46391 (delta 6466), reused 38662 (delta 0) 118M pack-7f766f5af5547554bacb28c0294bd562589dc5e7.pack Just to note: the 1.3.0 packer is saving 1M in the GIT repository over the 1.2.3 packer. So for a real project it does seem to have some benefit. And if you benchmarked the 1.3.0 packer against the Linux kernel and found it to be better than the 1.2.3 packer that's even better. I think this repository of mine may just be a degenerate case which GIT doesn't pack very well. GIT can't be all things to all people! -- Shawn. ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: 1.3.0 creating bigger packs than 1.2.3 2006-04-20 21:31 ` Junio C Hamano 2006-04-20 21:53 ` Shawn Pearce @ 2006-04-20 21:56 ` Jakub Narebski 1 sibling, 0 replies; 30+ messages in thread From: Jakub Narebski @ 2006-04-20 21:56 UTC (permalink / raw) To: git Junio C Hamano wrote: > v1.2.3 hash was base-name only. doc/Makefile and t/Makefile > were thrown in the same bin and sorted by size. When the > history you are packing is deep, and doc/Makefile and t/Makefile > are not related at all, this made effective size of delta window > 1/N where N is the number of such duplicates. > > The one you found above uses a hash that is fully full-path. > The two are in completely different bins, and bins are totally > random. This was not a good strategy. > > v1.3.0 hash is base-name hash concatenated with leading-path > has. t/Makefile and doc/Makefile go in separate bins, but the > bins are close to each other; this avoids the problem in v1.2.3 > when you have deep history, but at the same time if you do not > have many many versions of t/Makefile to overflow the delta > window, it gives t/Makefile a chance to delta with doc/Makefile. [...] > You could try this patch to resurrect the hash used in v1.2.3, > and you may get better packing for your particular repository; > but I am not sure if it gives better results in the general > case. I am running the test myself now while waiting for my > day-job database to load X-<. Perhaps the packing code could check which version gives smaller pack, or at least be instructed that one might want different packing heuristic for specific repository? Surely 2x difference in size is worth considering (and complication)... -- Jakub Narebski Warsaw, Poland ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: 1.3.0 creating bigger packs than 1.2.3 2006-04-20 17:03 ` Linus Torvalds 2006-04-20 17:24 ` Junio C Hamano 2006-04-20 17:31 ` Shawn Pearce @ 2006-04-20 17:41 ` Nicolas Pitre 2006-04-20 17:55 ` Shawn Pearce 3 siblings, 0 replies; 30+ messages in thread From: Nicolas Pitre @ 2006-04-20 17:41 UTC (permalink / raw) To: Linus Torvalds; +Cc: Shawn Pearce, Git Mailing List On Thu, 20 Apr 2006, Linus Torvalds wrote: > That means that it probably _is_ the delta changes themselves (probably > commit c13c6bf7 "diff-delta: bound hash list length to avoid O(m*n) > behavior". You can try > > git revert c13c6bf7 > > to see if that's it. Although Nico already showed interest, and if you > make the archive available to him, he's sure to figure it out. It is not that. With that code disabled there is still a 2x pack size. Substituting diff-delta.c from the version in 1.2.3 doesn't solve the issue either. Nicolas ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: 1.3.0 creating bigger packs than 1.2.3 2006-04-20 17:03 ` Linus Torvalds ` (2 preceding siblings ...) 2006-04-20 17:41 ` Nicolas Pitre @ 2006-04-20 17:55 ` Shawn Pearce 2006-04-20 18:24 ` Nicolas Pitre 3 siblings, 1 reply; 30+ messages in thread From: Shawn Pearce @ 2006-04-20 17:55 UTC (permalink / raw) To: Linus Torvalds; +Cc: Git Mailing List, Nicolas Pitre Linus Torvalds <torvalds@osdl.org> wrote: > Ok, so that wasn't it, and the new sort order is superior. > > That means that it probably _is_ the delta changes themselves (probably > commit c13c6bf7 "diff-delta: bound hash list length to avoid O(m*n) > behavior". You can try > > git revert c13c6bf7 No effect. > to see if that's it. Although Nico already showed interest, and if you > make the archive available to him, he's sure to figure it out. I sent the URL privately to Nico as I did not want the repository to be publically available before next Tuesday. > You can try "--depth=50" (slogan: more "hot delta on delta action"), but > it's looking less and less like a delta selection issue, and more and more > like the deltas themselves are deproved. No effect at either 50 or 100. The more that I think about it the more it seems possible that the pathname hashing is what may be causing the problem. Not only did bisect point to 1d6b38cc76c348e2477506ca9759fc241e3d0d46 but the directory which contains the bulk of the space has many files with the same name located in different directories: results/MT/Math/10000/0-11-AdjLite.deg results/MT/Math/10000/0-12-AdjLite.deg ... results/MT/Math/30000/2-11-AdjLite.deg results/MT/Math/30000/2-12-AdjLite.deg ... results/Rand48/Math/10000/2-11-AdjLite.deg results/Rand48/Math/10000/2-12-AdjLite.deg ... results/Rand48/Math/30000/2-11-AdjLite.deg results/Rand48/Math/30000/2-12-AdjLite.deg ... For example the name '0-11-AdjLite.deg' occurs in 63 directories and none of those occurrances are likely to delta against one another very well. Also most of these files only have 1 or 2 revisions, so there is very little per-file history. -- Shawn. ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: 1.3.0 creating bigger packs than 1.2.3 2006-04-20 17:55 ` Shawn Pearce @ 2006-04-20 18:24 ` Nicolas Pitre 2006-04-20 18:49 ` Junio C Hamano 0 siblings, 1 reply; 30+ messages in thread From: Nicolas Pitre @ 2006-04-20 18:24 UTC (permalink / raw) To: Shawn Pearce; +Cc: Linus Torvalds, Git Mailing List On Thu, 20 Apr 2006, Shawn Pearce wrote: > The more that I think about it the more it seems possible that the > pathname hashing is what may be causing the problem. Not only did > bisect point to 1d6b38cc76c348e2477506ca9759fc241e3d0d46 but the > directory which contains the bulk of the space has many files with > the same name located in different directories: [...] But the bad commit according to your bisection talks about "thin" packs which are not involved in your case. So something looks fishy with that commit which should not have touched path hashing in the non-thin pack case... I think... Nicolas ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: 1.3.0 creating bigger packs than 1.2.3 2006-04-20 18:24 ` Nicolas Pitre @ 2006-04-20 18:49 ` Junio C Hamano 2006-04-20 21:02 ` Nicolas Pitre 0 siblings, 1 reply; 30+ messages in thread From: Junio C Hamano @ 2006-04-20 18:49 UTC (permalink / raw) To: Nicolas Pitre; +Cc: git Nicolas Pitre <nico@cam.org> writes: > On Thu, 20 Apr 2006, Shawn Pearce wrote: > >> The more that I think about it the more it seems possible that the >> pathname hashing is what may be causing the problem. Not only did >> bisect point to 1d6b38cc76c348e2477506ca9759fc241e3d0d46 but the >> directory which contains the bulk of the space has many files with >> the same name located in different directories: > [...] > > But the bad commit according to your bisection talks about "thin" packs > which are not involved in your case. So something looks fishy with that > commit which should not have touched path hashing in the non-thin pack > case... I think... I think this explains it. The new code hashes full-path, but places bins for the paths with the same basename next to each other, so before Makefile and doc/Makefile and t/Makefile were all in the same bin, but now they are in three different bins next to each other. I originally thought, with one single notable exception of Makefile, having the identically named file in many different directories is not common nor sane, and the new code favors to delta with the exact same path for deeper history over wasting delta window for making delta with objects with the same name in different places in more recent history. I think I benched this with kernel repository (git.git was too small for that). But I suspect we have a built-in "we sort bigger to smaller, and we cut off when we switch bins" somewhere in find_delta() loop, which I do not recall touching when I did that change, so that may be interfering and preventing 0-11-AdjLite.deg from all over the place to delta against each other. ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: 1.3.0 creating bigger packs than 1.2.3 2006-04-20 18:49 ` Junio C Hamano @ 2006-04-20 21:02 ` Nicolas Pitre 2006-04-20 21:40 ` Junio C Hamano 2006-04-20 23:02 ` Junio C Hamano 0 siblings, 2 replies; 30+ messages in thread From: Nicolas Pitre @ 2006-04-20 21:02 UTC (permalink / raw) To: Junio C Hamano; +Cc: git On Thu, 20 Apr 2006, Junio C Hamano wrote: > Nicolas Pitre <nico@cam.org> writes: > > > On Thu, 20 Apr 2006, Shawn Pearce wrote: > > > >> The more that I think about it the more it seems possible that the > >> pathname hashing is what may be causing the problem. Not only did > >> bisect point to 1d6b38cc76c348e2477506ca9759fc241e3d0d46 but the > >> directory which contains the bulk of the space has many files with > >> the same name located in different directories: > > [...] > > > > But the bad commit according to your bisection talks about "thin" packs > > which are not involved in your case. So something looks fishy with that > > commit which should not have touched path hashing in the non-thin pack > > case... I think... > > I think this explains it. The new code hashes full-path, but > places bins for the paths with the same basename next to each > other, so before Makefile and doc/Makefile and t/Makefile were > all in the same bin, but now they are in three different bins > next to each other. That is fine. In fact I did try with a tweaked name_hash() that completely ignored all directory components and the resulting pack was even bigger, much bigger, when repacking Shawn's repo. > I originally thought, with one single notable exception of > Makefile, having the identically named file in many different > directories is not common nor sane, I'd tend to disagree with that but... > and the new code favors to > delta with the exact same path for deeper history over wasting > delta window for making delta with objects with the same name in > different places in more recent history. I think I benched this > with kernel repository (git.git was too small for that). This is obviously fine. And if a file in a given directory has few revisions then the delta window will consider objects for a file with the same name in other directories as well, which is also sensible. So if files of the same name are located in different directories they should delta well against each other if they're similar enough. This should cover Shawn's repo layout. > But I suspect we have a built-in "we sort bigger to smaller, and > we cut off when we switch bins" somewhere in find_delta() loop, > which I do not recall touching when I did that change, so that > may be interfering and preventing 0-11-AdjLite.deg from all over > the place to delta against each other. I just cannot find something that would do that in the code. When --no-reuse-delta is specified, the only things that will break the loop in find_delta() is when try_delta() returns -1, and that happens only when changing object type or when the size difference is too big, but nothing looks at the name hash. It is also hard to corelate it with commit 1d6b38cc which is the one that introduced the regression. Nicolas ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: 1.3.0 creating bigger packs than 1.2.3 2006-04-20 21:02 ` Nicolas Pitre @ 2006-04-20 21:40 ` Junio C Hamano 2006-04-20 22:02 ` Shawn Pearce ` (2 more replies) 2006-04-20 23:02 ` Junio C Hamano 1 sibling, 3 replies; 30+ messages in thread From: Junio C Hamano @ 2006-04-20 21:40 UTC (permalink / raw) To: Nicolas Pitre; +Cc: git Nicolas Pitre <nico@cam.org> writes: >> But I suspect we have a built-in "we sort bigger to smaller, and >> we cut off when we switch bins" somewhere in find_delta() loop, >> which I do not recall touching when I did that change, so that >> may be interfering and preventing 0-11-AdjLite.deg from all over >> the place to delta against each other. > > I just cannot find something that would do that in the code. When > --no-reuse-delta is specified, the only things that will break the loop > in find_delta() is when try_delta() returns -1, and that happens only > when changing object type or when the size difference is too big, but > nothing looks at the name hash. The list is sorted by type then hash then size (type_size_sort), so if you have t/Makefile that are big medium small too-small and then doc/Makefile that are big medium, once you see the too-small t/Makefile it would not consider the big doc/Makefile as a candidate X-<. This _might_ improve things: diff --git a/pack-objects.c b/pack-objects.c index 09f4f2c..0c6abe9 100644 --- a/pack-objects.c +++ b/pack-objects.c @@ -1037,7 +1039,7 @@ static int try_delta(struct unpacked *cu sizediff = oldsize > size ? oldsize - size : size - oldsize; if (size < 50) - return -1; + return 0; if (old_entry->depth >= max_depth) return 0; @@ -1052,7 +1054,7 @@ static int try_delta(struct unpacked *cu if (cur_entry->delta) max_size = cur_entry->delta_size-1; if (sizediff >= max_size) - return -1; + return 0; delta_buf = diff_delta(old->data, oldsize, cur->data, size, &delta_size, max_size); if (!delta_buf) ^ permalink raw reply related [flat|nested] 30+ messages in thread
* Re: 1.3.0 creating bigger packs than 1.2.3 2006-04-20 21:40 ` Junio C Hamano @ 2006-04-20 22:02 ` Shawn Pearce 2006-04-20 22:35 ` Junio C Hamano 2006-04-20 22:59 ` Linus Torvalds 2006-04-21 0:52 ` Nicolas Pitre 2006-04-21 1:20 ` Shawn Pearce 2 siblings, 2 replies; 30+ messages in thread From: Shawn Pearce @ 2006-04-20 22:02 UTC (permalink / raw) To: Junio C Hamano; +Cc: Nicolas Pitre, git Junio C Hamano <junkio@cox.net> wrote: > Nicolas Pitre <nico@cam.org> writes: > > >> But I suspect we have a built-in "we sort bigger to smaller, and > >> we cut off when we switch bins" somewhere in find_delta() loop, > >> which I do not recall touching when I did that change, so that > >> may be interfering and preventing 0-11-AdjLite.deg from all over > >> the place to delta against each other. > > > > I just cannot find something that would do that in the code. When > > --no-reuse-delta is specified, the only things that will break the loop > > in find_delta() is when try_delta() returns -1, and that happens only > > when changing object type or when the size difference is too big, but > > nothing looks at the name hash. > > The list is sorted by type then hash then size (type_size_sort), > so if you have t/Makefile that are big medium small too-small > and then doc/Makefile that are big medium, once you see the > too-small t/Makefile it would not consider the big doc/Makefile > as a candidate X-<. > > This _might_ improve things: > > diff --git a/pack-objects.c b/pack-objects.c > index 09f4f2c..0c6abe9 100644 > --- a/pack-objects.c > +++ b/pack-objects.c > @@ -1037,7 +1039,7 @@ static int try_delta(struct unpacked *cu > sizediff = oldsize > size ? oldsize - size : size - oldsize; > > if (size < 50) > - return -1; > + return 0; > if (old_entry->depth >= max_depth) > return 0; > > @@ -1052,7 +1054,7 @@ static int try_delta(struct unpacked *cu > if (cur_entry->delta) > max_size = cur_entry->delta_size-1; > if (sizediff >= max_size) > - return -1; > + return 0; > delta_buf = diff_delta(old->data, oldsize, > cur->data, size, &delta_size, max_size); > if (!delta_buf) Holy cow, it did: Total 46391, written 46391 (delta 8391), reused 37774 (delta 0) 46M pack-7f766f5af5547554bacb28c0294bd562589dc5e7.pack That's the smallest packing I've seen yet. And it doesn't have a negative affect on repacking GIT either. -- Shawn. ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: 1.3.0 creating bigger packs than 1.2.3 2006-04-20 22:02 ` Shawn Pearce @ 2006-04-20 22:35 ` Junio C Hamano 2006-04-21 1:01 ` Shawn Pearce 2006-04-20 22:59 ` Linus Torvalds 1 sibling, 1 reply; 30+ messages in thread From: Junio C Hamano @ 2006-04-20 22:35 UTC (permalink / raw) To: Shawn Pearce; +Cc: Nicolas Pitre, git Shawn Pearce <spearce@spearce.org> writes: > Junio C Hamano <junkio@cox.net> wrote: > >> The list is sorted by type then hash then size (type_size_sort), >> so if you have t/Makefile that are big medium small too-small >> and then doc/Makefile that are big medium, once you see the >> too-small t/Makefile it would not consider the big doc/Makefile >> as a candidate X-<. >> >> This _might_ improve things: >>... > > Holy cow, it did: > > Total 46391, written 46391 (delta 8391), reused 37774 (delta 0) > 46M pack-7f766f5af5547554bacb28c0294bd562589dc5e7.pack > > That's the smallest packing I've seen yet. And it doesn't have a > negative affect on repacking GIT either. Thanks for trying. Mind trying one more? I suspect the test patch makes pack-objects a lot more expensive. The code before the test patch said "if the size is very small or size difference is too great, do not consider this, and do not consider any more objects in the delta window, because we know they are either even smaller of the same path, they have different names, or they are of different type". The test patch you tried was a quick and dirty hack that said "under the too-small condition, skip this one, but keep trying the rest of the delta window". Here is a cleaned up patch. What it does is "under the too-small condition, see if the object has the same basename, and if so keep going, but otherwise skip the rest as before". If you have objects like this and are trying to pack the first object (this list is sorted in the order pack-object tries): (size) (path) 1000 t/0-11-AdjLite.deg 10 t/0-11-AdjLite.deg 800 s/0-11-AdjLite.deg 20 t/0-12-AdjLite.deg the current code stops after checking t/0-11-AdjLite.deg. The test patch tries all of them. This patch skips that file, but tries "s/0-11-AdjLite.deg", and then stops at the next one. -- >8 -- diff --git a/pack-objects.c b/pack-objects.c index 09f4f2c..2173709 100644 --- a/pack-objects.c +++ b/pack-objects.c @@ -1036,8 +1036,6 @@ static int try_delta(struct unpacked *cu oldsize = old_entry->size; sizediff = oldsize > size ? oldsize - size : size - oldsize; - if (size < 50) - return -1; if (old_entry->depth >= max_depth) return 0; @@ -1048,20 +1046,27 @@ static int try_delta(struct unpacked *cu * more space-efficient (deletes don't have to say _what_ they * delete). */ - max_size = size / 2 - 20; - if (cur_entry->delta) - max_size = cur_entry->delta_size-1; - if (sizediff >= max_size) - return -1; - delta_buf = diff_delta(old->data, oldsize, - cur->data, size, &delta_size, max_size); - if (!delta_buf) + if (50 <= size) { + max_size = size / 2 - 20; + if (cur_entry->delta) + max_size = cur_entry->delta_size-1; + if (sizediff < max_size) { + delta_buf = diff_delta(old->data, oldsize, + cur->data, size, + &delta_size, max_size); + if (!delta_buf) + return 0; + cur_entry->delta = old_entry; + cur_entry->delta_size = delta_size; + cur_entry->depth = old_entry->depth + 1; + free(delta_buf); + return 0; + } + } + /* Keep going as long as the basename matches */ + if (((cur_entry->hash ^ old_entry->hash) >>DIRBITS) == 0) return 0; - cur_entry->delta = old_entry; - cur_entry->delta_size = delta_size; - cur_entry->depth = old_entry->depth + 1; - free(delta_buf); - return 0; + return -1; } static void progress_interval(int signum) ^ permalink raw reply related [flat|nested] 30+ messages in thread
* Re: 1.3.0 creating bigger packs than 1.2.3 2006-04-20 22:35 ` Junio C Hamano @ 2006-04-21 1:01 ` Shawn Pearce 0 siblings, 0 replies; 30+ messages in thread From: Shawn Pearce @ 2006-04-21 1:01 UTC (permalink / raw) To: Junio C Hamano; +Cc: Nicolas Pitre, git Junio C Hamano <junkio@cox.net> wrote: [snip] > I suspect the test patch makes pack-objects a lot more > expensive. Which patch are you talking about the previous patch or the one in the message I'm now replying to? > The code before the test patch said "if the size is very small > or size difference is too great, do not consider this, and do > not consider any more objects in the delta window, because we > know they are either even smaller of the same path, they have > different names, or they are of different type". The test patch > you tried was a quick and dirty hack that said "under the > too-small condition, skip this one, but keep trying the rest of > the delta window". > > Here is a cleaned up patch. What it does is "under the > too-small condition, see if the object has the same basename, > and if so keep going, but otherwise skip the rest as before". [snip] The patch below does not help very much: Total 46391, written 46391 (delta 6686), reused 37979 (delta 0) 129M pack-7f766f5af5547554bacb28c0294bd562589dc5e7.pack > diff --git a/pack-objects.c b/pack-objects.c > index 09f4f2c..2173709 100644 > --- a/pack-objects.c > +++ b/pack-objects.c > @@ -1036,8 +1036,6 @@ static int try_delta(struct unpacked *cu > oldsize = old_entry->size; > sizediff = oldsize > size ? oldsize - size : size - oldsize; > > - if (size < 50) > - return -1; > if (old_entry->depth >= max_depth) > return 0; > > @@ -1048,20 +1046,27 @@ static int try_delta(struct unpacked *cu > * more space-efficient (deletes don't have to say _what_ they > * delete). > */ > - max_size = size / 2 - 20; > - if (cur_entry->delta) > - max_size = cur_entry->delta_size-1; > - if (sizediff >= max_size) > - return -1; > - delta_buf = diff_delta(old->data, oldsize, > - cur->data, size, &delta_size, max_size); > - if (!delta_buf) > + if (50 <= size) { > + max_size = size / 2 - 20; > + if (cur_entry->delta) > + max_size = cur_entry->delta_size-1; > + if (sizediff < max_size) { > + delta_buf = diff_delta(old->data, oldsize, > + cur->data, size, > + &delta_size, max_size); > + if (!delta_buf) > + return 0; > + cur_entry->delta = old_entry; > + cur_entry->delta_size = delta_size; > + cur_entry->depth = old_entry->depth + 1; > + free(delta_buf); > + return 0; > + } > + } > + /* Keep going as long as the basename matches */ > + if (((cur_entry->hash ^ old_entry->hash) >>DIRBITS) == 0) > return 0; > - cur_entry->delta = old_entry; > - cur_entry->delta_size = delta_size; > - cur_entry->depth = old_entry->depth + 1; > - free(delta_buf); > - return 0; > + return -1; > } > > static void progress_interval(int signum) > -- Shawn. ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: 1.3.0 creating bigger packs than 1.2.3 2006-04-20 22:02 ` Shawn Pearce 2006-04-20 22:35 ` Junio C Hamano @ 2006-04-20 22:59 ` Linus Torvalds 1 sibling, 0 replies; 30+ messages in thread From: Linus Torvalds @ 2006-04-20 22:59 UTC (permalink / raw) To: Shawn Pearce; +Cc: Junio C Hamano, Nicolas Pitre, git On Thu, 20 Apr 2006, Shawn Pearce wrote: > Junio C Hamano <junkio@cox.net> wrote: > > > > This _might_ improve things: > > > > diff --git a/pack-objects.c b/pack-objects.c > > index 09f4f2c..0c6abe9 100644 > > --- a/pack-objects.c > > +++ b/pack-objects.c > > @@ -1037,7 +1039,7 @@ static int try_delta(struct unpacked *cu > > sizediff = oldsize > size ? oldsize - size : size - oldsize; > > > > if (size < 50) > > - return -1; > > + return 0; > > if (old_entry->depth >= max_depth) > > return 0; > > > > @@ -1052,7 +1054,7 @@ static int try_delta(struct unpacked *cu > > if (cur_entry->delta) > > max_size = cur_entry->delta_size-1; > > if (sizediff >= max_size) > > - return -1; > > + return 0; > > delta_buf = diff_delta(old->data, oldsize, > > cur->data, size, &delta_size, max_size); > > if (!delta_buf) > > Holy cow, it did: > > Total 46391, written 46391 (delta 8391), reused 37774 (delta 0) > 46M pack-7f766f5af5547554bacb28c0294bd562589dc5e7.pack > > That's the smallest packing I've seen yet. And it doesn't have a > negative affect on repacking GIT either. I think I know what's going on, and why your bisection claimed it was the re-hashing change that was the problem, even though it really wasn't. That if (sizediff >= max_size) return -1; check is actually fairly _old_. It's from June 2005, ie it's from pretty much the first two days of that packing thing existing in the first place. (The initial repacking was done June 25th, with a lot of tweaking over the next few days. That sizediff thing was part of the fairly early tweaking). The thing is, that check made sense back then. Why? Because we sorted things in decreasing size order back then (I think this was before we even did any name-based heuristic sorting at all), so that when we tried the delta algorithm, and the size diff was bigger than the last delta size, we pretty much _knew_ the new delta would be bigger still, and there was no point in continuing with try_delta. HOWEVER. We have since changed the sorting to sort according to name before it sorts according to size, so that old heuristic that depended on the size being monotonically increasing simply doesn't make any sense any more. So I think at that second "return -1" really _should_ be changed to a "return 0", and not just because it helps your particular case. It's literally a bug these days, because the assumptions that caused it to return -1 simply aren't true any more. (It wasn't _strictly_ true even originally: even if the sizediff is huge, the _delta_ may not be huge, since we can delete data with a small delta. So it's quite likely that we should compare the "old is bigger than new" and "new is bigger than old" separately and have different heuristics for them. Again, that was simply not much of an issue back when we sorted just by size). So even the "return 0" might not be completely right. We might actually want to look at how big the delta is, and return only once that fails. Linus ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: 1.3.0 creating bigger packs than 1.2.3 2006-04-20 21:40 ` Junio C Hamano 2006-04-20 22:02 ` Shawn Pearce @ 2006-04-21 0:52 ` Nicolas Pitre 2006-04-21 1:20 ` Shawn Pearce 2 siblings, 0 replies; 30+ messages in thread From: Nicolas Pitre @ 2006-04-21 0:52 UTC (permalink / raw) To: Junio C Hamano; +Cc: git On Thu, 20 Apr 2006, Junio C Hamano wrote: > Nicolas Pitre <nico@cam.org> writes: > > >> But I suspect we have a built-in "we sort bigger to smaller, and > >> we cut off when we switch bins" somewhere in find_delta() loop, > >> which I do not recall touching when I did that change, so that > >> may be interfering and preventing 0-11-AdjLite.deg from all over > >> the place to delta against each other. > > > > I just cannot find something that would do that in the code. When > > --no-reuse-delta is specified, the only things that will break the loop > > in find_delta() is when try_delta() returns -1, and that happens only > > when changing object type or when the size difference is too big, but > > nothing looks at the name hash. > > The list is sorted by type then hash then size (type_size_sort), > so if you have t/Makefile that are big medium small too-small > and then doc/Makefile that are big medium, once you see the > too-small t/Makefile it would not consider the big doc/Makefile > as a candidate X-<. Bingo! I didn't think it all through before. Nicolas ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: 1.3.0 creating bigger packs than 1.2.3 2006-04-20 21:40 ` Junio C Hamano 2006-04-20 22:02 ` Shawn Pearce 2006-04-21 0:52 ` Nicolas Pitre @ 2006-04-21 1:20 ` Shawn Pearce 2006-04-21 2:28 ` Nicolas Pitre 2006-04-21 2:32 ` Shawn Pearce 2 siblings, 2 replies; 30+ messages in thread From: Shawn Pearce @ 2006-04-21 1:20 UTC (permalink / raw) To: Junio C Hamano; +Cc: Nicolas Pitre, git Based on Linus' comment I changed your patch to just the following. It still produced the 46M pack file, so the first hunk apears to not have had much of an affect with this data. >From a running time perspective it appears as though this patch is making things slightly better, not worse. I ran it a few times for each case always using the 46M pack as input for "git-repack -a -d -f". 'next' 137.13 real 95.82 user 25.24 sys 'next'+patch 131.62 real 89.35 user 28.56 sys but even if the running time was an extra 6 seconds I'd still rather spend 4% more running time to use 1/2 the storage space. diff --git a/pack-objects.c b/pack-objects.c index 09f4f2c..f7d6217 100644 --- a/pack-objects.c +++ b/pack-objects.c @@ -1052,7 +1052,7 @@ static int try_delta(struct unpacked *cu if (cur_entry->delta) max_size = cur_entry->delta_size-1; if (sizediff >= max_size) - return -1; + return 0; delta_buf = diff_delta(old->data, oldsize, cur->data, size, &delta_size, max_size); if (!delta_buf) ^ permalink raw reply related [flat|nested] 30+ messages in thread
* Re: 1.3.0 creating bigger packs than 1.2.3 2006-04-21 1:20 ` Shawn Pearce @ 2006-04-21 2:28 ` Nicolas Pitre 2006-04-21 2:40 ` Shawn Pearce 2006-04-21 2:32 ` Shawn Pearce 1 sibling, 1 reply; 30+ messages in thread From: Nicolas Pitre @ 2006-04-21 2:28 UTC (permalink / raw) To: Shawn Pearce; +Cc: Junio C Hamano, git On Thu, 20 Apr 2006, Shawn Pearce wrote: > Based on Linus' comment I changed your patch to just the following. > It still produced the 46M pack file, so the first hunk apears to > not have had much of an affect with this data. > > From a running time perspective it appears as though this patch is > making things slightly better, not worse. I ran it a few times > for each case always using the 46M pack as input for > "git-repack -a -d -f". > > 'next' 137.13 real 95.82 user 25.24 sys > 'next'+patch 131.62 real 89.35 user 28.56 sys > > but even if the running time was an extra 6 seconds I'd still rather > spend 4% more running time to use 1/2 the storage space. > > > diff --git a/pack-objects.c b/pack-objects.c > index 09f4f2c..f7d6217 100644 > --- a/pack-objects.c > +++ b/pack-objects.c > @@ -1052,7 +1052,7 @@ static int try_delta(struct unpacked *cu > if (cur_entry->delta) > max_size = cur_entry->delta_size-1; > if (sizediff >= max_size) > - return -1; > + return 0; > delta_buf = diff_delta(old->data, oldsize, > cur->data, size, &delta_size, max_size); > if (!delta_buf) I can confirm this is indeed the best fix so far. Any "smarter" solution I could think of did increase the size of the final pack quite spectacularly and rather unexpectedly with Shawn's repository. Of course removing the if (sizediff >= max_size) entirely does produce a smaller pack (39MB) but it takes about twice the CPU. With the patch above the Linux kernel pack is 0.3% smaller with 1% more CPU usage. But like for the diff-delta hash list limiting code this small overhead is certainly a good compromize to avoid big degradations in some other cases. Nicolas ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: 1.3.0 creating bigger packs than 1.2.3 2006-04-21 2:28 ` Nicolas Pitre @ 2006-04-21 2:40 ` Shawn Pearce 2006-04-21 3:07 ` Nicolas Pitre 0 siblings, 1 reply; 30+ messages in thread From: Shawn Pearce @ 2006-04-21 2:40 UTC (permalink / raw) To: Nicolas Pitre; +Cc: Junio C Hamano, git Nicolas Pitre <nico@cam.org> wrote: > On Thu, 20 Apr 2006, Shawn Pearce wrote: > > > Based on Linus' comment I changed your patch to just the following. > > It still produced the 46M pack file, so the first hunk apears to > > not have had much of an affect with this data. > > > > From a running time perspective it appears as though this patch is > > making things slightly better, not worse. I ran it a few times > > for each case always using the 46M pack as input for > > "git-repack -a -d -f". > > > > 'next' 137.13 real 95.82 user 25.24 sys > > 'next'+patch 131.62 real 89.35 user 28.56 sys > > > > but even if the running time was an extra 6 seconds I'd still rather > > spend 4% more running time to use 1/2 the storage space. > > > > > > diff --git a/pack-objects.c b/pack-objects.c > > index 09f4f2c..f7d6217 100644 > > --- a/pack-objects.c > > +++ b/pack-objects.c > > @@ -1052,7 +1052,7 @@ static int try_delta(struct unpacked *cu > > if (cur_entry->delta) > > max_size = cur_entry->delta_size-1; > > if (sizediff >= max_size) > > - return -1; > > + return 0; > > delta_buf = diff_delta(old->data, oldsize, > > cur->data, size, &delta_size, max_size); > > if (!delta_buf) > > I can confirm this is indeed the best fix so far. Any "smarter" > solution I could think of did increase the size of the final pack quite > spectacularly and rather unexpectedly with Shawn's repository. Wow. I'm such a trouble maker. *grin* > Of course removing the if (sizediff >= max_size) entirely does produce a > smaller pack (39MB) but it takes about twice the CPU. Eh, that's not worth it. 7M disk space saved for twice the work isn't that good of a tradeoff. I'm not in favor of that version. > With the patch above the Linux kernel pack is 0.3% smaller with 1% more > CPU usage. But like for the diff-delta hash list limiting code this > small overhead is certainly a good compromize to avoid big degradations > in some other cases. Hmm. See the email I just sent. I was seeing a good 10% increase in my own tests on a Linux kernel repository. But I guess I can hope that my test was flawed somehow and it really is closer to a 1% increase in running time, making it more likely that the above fix makes it into GIT. -- Shawn. ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: 1.3.0 creating bigger packs than 1.2.3 2006-04-21 2:40 ` Shawn Pearce @ 2006-04-21 3:07 ` Nicolas Pitre 0 siblings, 0 replies; 30+ messages in thread From: Nicolas Pitre @ 2006-04-21 3:07 UTC (permalink / raw) To: Shawn Pearce; +Cc: Junio C Hamano, git On Thu, 20 Apr 2006, Shawn Pearce wrote: > Nicolas Pitre <nico@cam.org> wrote: > > With the patch above the Linux kernel pack is 0.3% smaller with 1% more > > CPU usage. But like for the diff-delta hash list limiting code this > > small overhead is certainly a good compromize to avoid big degradations > > in some other cases. > > Hmm. See the email I just sent. I was seeing a good 10% increase > in my own tests on a Linux kernel repository. But I guess I can > hope that my test was flawed somehow and it really is closer to a 1% > increase in running time, making it more likely that the above fix > makes it into GIT. Well, I repeated the kernel run and this time it took 2.5% more CPU with the patch. But the thing is that I get a +/- 1% difference between successive runs. So while the patch does add a certain overhead, it appears to be in the same range as noise here. Nicolas ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: 1.3.0 creating bigger packs than 1.2.3 2006-04-21 1:20 ` Shawn Pearce 2006-04-21 2:28 ` Nicolas Pitre @ 2006-04-21 2:32 ` Shawn Pearce 1 sibling, 0 replies; 30+ messages in thread From: Shawn Pearce @ 2006-04-21 2:32 UTC (permalink / raw) To: Junio C Hamano; +Cc: Nicolas Pitre, git I just tried the patch below on a couple-month-old Linux 2.6 repository from Linus (last commit: Feb 14 2006). It did not decrease the pack file size by much despite the higher delta: 'next' Total 189435, written 189435 (delta 142093), reused 44057 (delta 0) 'next'+patch Total 189435, written 189435 (delta 142712), reused 43954 (delta 0) 'next' 104464297 bytes 'next'+patch 104092920 bytes (99.6% of 'next') 'next' 328.98 real 206.02 user 93.60 sys 'next'+patch 363.06 real 218.98 user 94.72 sys So it looks like the patch is taking longer to run, and by about 10%. An expensive price to pay for what amounts to only a 0.4% reduction in pack size on the kernel. Shawn Pearce <spearce@spearce.org> wrote: > Based on Linus' comment I changed your patch to just the following. > It still produced the 46M pack file, so the first hunk apears to > not have had much of an affect with this data. > > From a running time perspective it appears as though this patch is > making things slightly better, not worse. I ran it a few times > for each case always using the 46M pack as input for > "git-repack -a -d -f". > > 'next' 137.13 real 95.82 user 25.24 sys > 'next'+patch 131.62 real 89.35 user 28.56 sys > > but even if the running time was an extra 6 seconds I'd still rather > spend 4% more running time to use 1/2 the storage space. > > > diff --git a/pack-objects.c b/pack-objects.c > index 09f4f2c..f7d6217 100644 > --- a/pack-objects.c > +++ b/pack-objects.c > @@ -1052,7 +1052,7 @@ static int try_delta(struct unpacked *cu > if (cur_entry->delta) > max_size = cur_entry->delta_size-1; > if (sizediff >= max_size) > - return -1; > + return 0; > delta_buf = diff_delta(old->data, oldsize, > cur->data, size, &delta_size, max_size); > if (!delta_buf) -- Shawn. ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: 1.3.0 creating bigger packs than 1.2.3 2006-04-20 21:02 ` Nicolas Pitre 2006-04-20 21:40 ` Junio C Hamano @ 2006-04-20 23:02 ` Junio C Hamano 1 sibling, 0 replies; 30+ messages in thread From: Junio C Hamano @ 2006-04-20 23:02 UTC (permalink / raw) To: Nicolas Pitre; +Cc: git Nicolas Pitre <nico@cam.org> writes: >> I originally thought, with one single notable exception of >> Makefile, having the identically named file in many different >> directories is not common nor sane, > > I'd tend to disagree with that but... I disagree with that myself now. The kernel tree has many files with the same basename (e.g. arch/*/kernel/irq.c). It is a different issue if they are good delta base candidates with each other, though. ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: 1.3.0 creating bigger packs than 1.2.3 2006-04-20 13:36 1.3.0 creating bigger packs than 1.2.3 Shawn Pearce 2006-04-20 14:47 ` Linus Torvalds @ 2006-04-20 16:09 ` Nicolas Pitre 1 sibling, 0 replies; 30+ messages in thread From: Nicolas Pitre @ 2006-04-20 16:09 UTC (permalink / raw) To: Shawn Pearce; +Cc: git On Thu, 20 Apr 2006, Shawn Pearce wrote: > Given that disk is pretty cheap these days I'm not concerned about > the 2x increase but thought I'd let folks know that the packing > improvements in 1.3.0 seem to have taken a small step backwards > with regards to this particular dataset. 2x is a huge regression. > I can make the repository available if somebody wants to look at it. Please. Nicolas ^ permalink raw reply [flat|nested] 30+ messages in thread
end of thread, other threads:[~2006-04-21 3:07 UTC | newest] Thread overview: 30+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2006-04-20 13:36 1.3.0 creating bigger packs than 1.2.3 Shawn Pearce 2006-04-20 14:47 ` Linus Torvalds 2006-04-20 15:03 ` Shawn Pearce 2006-04-20 16:07 ` Linus Torvalds 2006-04-20 16:43 ` Shawn Pearce 2006-04-20 17:03 ` Linus Torvalds 2006-04-20 17:24 ` Junio C Hamano 2006-04-20 17:31 ` Shawn Pearce 2006-04-20 17:54 ` Nicolas Pitre 2006-04-20 21:31 ` Junio C Hamano 2006-04-20 21:53 ` Shawn Pearce 2006-04-20 21:56 ` Jakub Narebski 2006-04-20 17:41 ` Nicolas Pitre 2006-04-20 17:55 ` Shawn Pearce 2006-04-20 18:24 ` Nicolas Pitre 2006-04-20 18:49 ` Junio C Hamano 2006-04-20 21:02 ` Nicolas Pitre 2006-04-20 21:40 ` Junio C Hamano 2006-04-20 22:02 ` Shawn Pearce 2006-04-20 22:35 ` Junio C Hamano 2006-04-21 1:01 ` Shawn Pearce 2006-04-20 22:59 ` Linus Torvalds 2006-04-21 0:52 ` Nicolas Pitre 2006-04-21 1:20 ` Shawn Pearce 2006-04-21 2:28 ` Nicolas Pitre 2006-04-21 2:40 ` Shawn Pearce 2006-04-21 3:07 ` Nicolas Pitre 2006-04-21 2:32 ` Shawn Pearce 2006-04-20 23:02 ` Junio C Hamano 2006-04-20 16:09 ` Nicolas Pitre
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).