* Preferring shallower deltas on repack @ 2007-07-09 4:43 Brian Downing 2007-07-09 4:45 ` [PATCH] pack-objects: Prefer shallower deltas if the size is equal Brian Downing ` (2 more replies) 0 siblings, 3 replies; 19+ messages in thread From: Brian Downing @ 2007-07-09 4:43 UTC (permalink / raw) To: git SBCL, a Lisp implementation, has a file in their (CVS) repository called "version.lisp-expr" which is updated every commit. This file looks like: ------------------------------------------------------------------------ ;;; This is the master value for LISP-IMPLEMENTATION-VERSION. It's ;;; separated into its own file here so that it's easy for ;;; text-munging make-ish or cvs-ish scripts to find and tweak it. For ;;; the convenience of such scripts, only a simple subset of Lisp ;;; reader syntax should be used here: semicolon-delimited comments, ;;; possible blank lines or other whitespace, and a single ;;; double-quoted string value alone on its own line. ;;; ;;; ANSI says LISP-IMPLEMENTATION-VERSION can be NIL "if no ;;; appropriate and relevant result can be produced", but as long as ;;; we control the build, we can always assign an appropriate and ;;; relevant result, so this must be a string, not NIL. ;;; ;;; Conventionally a string like "0.6.6", with three numeric fields, ;;; is used for released versions, and a string like "0.6.5.xyzzy", ;;; with something arbitrary in the fourth field, is used for CVS ;;; checkins which aren't released. (And occasionally for internal ;;; versions, especially for internal versions off the main CVS ;;; branch, it gets hairier, e.g. "0.pre7.14.flaky4.13".) "1.0.7.10" ------------------------------------------------------------------------ Only very rarely does anything but the last line change. The current repack implementation, when given values like "--window=100 --depth=1000", happily generates an incredibly deep tree for this file, maxing out the depth. (I'm only using such a high depth to demonstrate the pessimal behavior.) I noticed that it just takes the first delta it finds that is the smallest; it then only looks for deltas with a max_size of the old size - 1, so it can never find better matches with regard to depth. I modified this to prefer shallower deltas of the same size. This made the deltas for this file a very wide tree with a maximum depth of about 65. Other (much smaller) improvements were seen elsewhere in the pack. Runtime does not seem to have been affected, as most of the work had already been done when it was tossing deltas before. Some simple statistics: SBCL, standard pack-objects, window 100, depth 1000: Max depth: 980 Mean depth: 100.223622114502 Median depth: 12 Standard deviation: 188.214331919176 SBCL, patched pack-objects, window 100, depth 1000: Max depth: 787 Mean depth: 61.5669990817656 Median depth: 11 Standard deviation: 127.644652607399 git, standard pack-objects, window 100, depth 1000: Max depth: 925 Mean depth: 77.184264479754 Median depth: 8 Standard deviation: 150.112998198182 git, patched pack-objects, window 100, depth 1000: Max depth: 913 Mean depth: 74.9981877425496 Median depth: 7 Standard deviation: 147.900721785959 The only negative effect I could see from this patch might be pack locality. Unfortunately I don't know enough (read: anything) about pack access patterns to determine if this is the case. Patch to follow. -bcd ^ permalink raw reply [flat|nested] 19+ messages in thread
* [PATCH] pack-objects: Prefer shallower deltas if the size is equal 2007-07-09 4:43 Preferring shallower deltas on repack Brian Downing @ 2007-07-09 4:45 ` Brian Downing 2007-07-09 5:31 ` Preferring shallower deltas on repack Junio C Hamano 2007-07-09 5:41 ` Preferring shallower deltas on repack Linus Torvalds 2 siblings, 0 replies; 19+ messages in thread From: Brian Downing @ 2007-07-09 4:45 UTC (permalink / raw) To: git Change "try_delta" so that if it finds a delta that has the same size but shallower depth than the existing delta, it will prefer the shallower one. This makes certain delta trees vastly less deep. Signed-off-by: Brian Downing <bdowning@lavos.net> --- builtin-pack-objects.c | 8 +++++++- 1 files changed, 7 insertions(+), 1 deletions(-) diff --git a/builtin-pack-objects.c b/builtin-pack-objects.c index 3d396ca..54b9d26 100644 --- a/builtin-pack-objects.c +++ b/builtin-pack-objects.c @@ -1337,7 +1337,7 @@ static int try_delta(struct unpacked *trg, struct unpacked *src, if (max_size == 0) return 0; if (trg_entry->delta && trg_entry->delta_size <= max_size) - max_size = trg_entry->delta_size-1; + max_size = trg_entry->delta_size; src_size = src_entry->size; sizediff = src_size < trg_size ? trg_size - src_size : 0; if (sizediff >= max_size) @@ -1371,6 +1371,12 @@ static int try_delta(struct unpacked *trg, struct unpacked *src, return 0; if (trg_entry->delta_data) { + /* Prefer only shallower same-sized deltas. */ + if (delta_size == trg_entry->delta_size && + src_entry->depth + 1 >= trg_entry->depth) { + free(delta_buf); + return 0; + } delta_cache_size -= trg_entry->delta_size; free(trg_entry->delta_data); } -- 1.5.2.GIT ^ permalink raw reply related [flat|nested] 19+ messages in thread
* Re: Preferring shallower deltas on repack 2007-07-09 4:43 Preferring shallower deltas on repack Brian Downing 2007-07-09 4:45 ` [PATCH] pack-objects: Prefer shallower deltas if the size is equal Brian Downing @ 2007-07-09 5:31 ` Junio C Hamano 2007-07-09 5:43 ` Junio C Hamano ` (2 more replies) 2007-07-09 5:41 ` Preferring shallower deltas on repack Linus Torvalds 2 siblings, 3 replies; 19+ messages in thread From: Junio C Hamano @ 2007-07-09 5:31 UTC (permalink / raw) To: Brian Downing; +Cc: git bdowning@lavos.net (Brian Downing) writes: > I modified this to prefer shallower deltas of the same size. This made > the deltas for this file a very wide tree with a maximum depth of about > 65. Other (much smaller) improvements were seen elsewhere in the pack. > Runtime does not seem to have been affected, as most of the work had > already been done when it was tossing deltas before. > > Some simple statistics: > > SBCL, standard pack-objects, window 100, depth 1000: > Max depth: 980 > Mean depth: 100.223622114502 > Median depth: 12 > Standard deviation: 188.214331919176 > > SBCL, patched pack-objects, window 100, depth 1000: > Max depth: 787 > Mean depth: 61.5669990817656 > Median depth: 11 > Standard deviation: 127.644652607399 Putting aside a potential argument that the way the file in question, version.lisp-expr, is kept track of might be insane, this is an interesting topic. In addition to the above stats, it may be interesting to know: - pack generation time and memory footprint (/usr/bin/time); I suspect you would have to try_delta more candidates, so this may degrade a bit, but that is done for getting a better deltification, so we would need to see if the extra cost is within reason and worth spending. - resulting pack size (ls -l pack-*.pack) I do not expect your change would degrade in this area, as you are currently not trading size with shallower delta depth. Regarding your patch, I think it does not look too bad, as you never pick delta that is larger than the best-so-far in favor of shallower depth. It would become worrysome (*BUT* infinitely more interesting) once you start talking about a tradeoff between slightly larger delta and much shorter delta. Such a tradeoff, if done right, would make a lot of sense, but I do not offhand think of a way to strike a proper balance between them efficiently. ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Preferring shallower deltas on repack 2007-07-09 5:31 ` Preferring shallower deltas on repack Junio C Hamano @ 2007-07-09 5:43 ` Junio C Hamano 2007-07-09 6:52 ` Brian Downing 2007-07-09 15:58 ` Nicolas Pitre 2 siblings, 0 replies; 19+ messages in thread From: Junio C Hamano @ 2007-07-09 5:43 UTC (permalink / raw) To: Brian Downing; +Cc: git Junio C Hamano <gitster@pobox.com> writes: > It would become worrysome (*BUT* infinitely more interesting) > once you start talking about a tradeoff between slightly larger > delta and much shorter delta. Such a tradeoff, if done right, s/and much shorter delta/and much shallower depth/. Should be obvious from the context, but just in case -- without the above correction what I said does not make any sense ;-). > would make a lot of sense, but I do not offhand think of a way > to strike a proper balance between them efficiently. ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Preferring shallower deltas on repack 2007-07-09 5:31 ` Preferring shallower deltas on repack Junio C Hamano 2007-07-09 5:43 ` Junio C Hamano @ 2007-07-09 6:52 ` Brian Downing 2007-07-09 7:27 ` Junio C Hamano 2007-07-09 15:58 ` Nicolas Pitre 2 siblings, 1 reply; 19+ messages in thread From: Brian Downing @ 2007-07-09 6:52 UTC (permalink / raw) To: Junio C Hamano; +Cc: git On Sun, Jul 08, 2007 at 10:31:43PM -0700, Junio C Hamano wrote: > Putting aside a potential argument that the way the file in > question, version.lisp-expr, is kept track of might be insane, > this is an interesting topic. Yeah, that version numbering system worked quite well for CVS, given its lack of any other kind of useful whole-tree versioning, and the fact that there wasn't much branching and merging, due to it being a pain in the ass. If an when we move to something like Git, something else will have to be done, as that file will /always/ be in conflict. > In addition to the above stats, it may be interesting to know: > > - pack generation time and memory footprint (/usr/bin/time); > > I suspect you would have to try_delta more candidates, so > this may degrade a bit, but that is done for getting a better > deltification, so we would need to see if the extra cost is > within reason and worth spending. It was already try_delta'ing everything in the window. The only difference now is that create_delta may generate one more byte of delta before giving up. That doesn't seem to have affected things at all outside of sampling noise: (These timings are for the Git pack on Linux/amd64, --window and --depth both 100. Since /usr/bin/time doesn't seem to report any useful memory statistics on Linux, I also have a "ps aux" line from when the memory size looked stable. This was different from run to run but it shows the two are in the same order of magnitude.) Unpatched: 54.99user 0.18system 0:56.80elapsed 97%CPU (0avgtext+0avgdata 0maxresident)k 0inputs+0outputs (14major+32417minor)pagefaults 0swaps bdowning 5290 98.7 4.5 106788 92900 pts/1 R+ 01:26 0:49 git pack-obj Patched: 55.37user 0.19system 0:56.35elapsed 98%CPU (0avgtext+0avgdata 0maxresident)k 0inputs+0outputs (0major+32249minor)pagefaults 0swaps bdowning 6086 100 4.5 106880 92996 pts/1 R+ 01:29 0:49 git pack-obj > - resulting pack size (ls -l pack-*.pack) > > I do not expect your change would degrade in this area, as > you are currently not trading size with shallower delta > depth. The patched version is actually smaller in both SBCL's and Git's case (again, --window 100 and --depth 100): SBCL: 61696 bytes smaller (13294225-13232529) Git: 16010 bytes smaller (12690424-12674414) I believe the reason for this is that more deltas can get in under the depth limit. If I repack the Git pack with --depth=999999999, the patched version generates a pack that is 1793 bytes smaller. (12334183-12332390) (Hmm, I was expecting that to be the same, I'm not sure why it's not. Padding?) > Regarding your patch, I think it does not look too bad, as you > never pick delta that is larger than the best-so-far in favor of > shallower depth. > > It would become worrysome (*BUT* infinitely more interesting) > once you start talking about a tradeoff between slightly larger > delta and much shorter delta. Such a tradeoff, if done right, > would make a lot of sense, but I do not offhand think of a way > to strike a proper balance between them efficiently. Yeah, I was thinking about that too, and came to the same conclusion. I suspect you'd have to save a /lot/ of delta depth to want to pay any more I/O, though. Another thing that might be iffy (and complicated) is that if you keep making a good low-depth delta off of a particular object, it might be good to promote it so it stays in the window for longer. -bcd ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Preferring shallower deltas on repack 2007-07-09 6:52 ` Brian Downing @ 2007-07-09 7:27 ` Junio C Hamano 2007-07-09 7:36 ` Brian Downing 0 siblings, 1 reply; 19+ messages in thread From: Junio C Hamano @ 2007-07-09 7:27 UTC (permalink / raw) To: Brian Downing; +Cc: git bdowning@lavos.net (Brian Downing) writes: > (These timings are for the Git pack on Linux/amd64, --window and --depth > both 100. Since /usr/bin/time doesn't seem to report any useful memory > statistics on Linux, I also have a "ps aux" line from when the memory > size looked stable. This was different from run to run but it shows the > two are in the same order of magnitude.) > > Unpatched: > 54.99user 0.18system 0:56.80elapsed 97%CPU (0avgtext+0avgdata 0maxresident)k > 0inputs+0outputs (14major+32417minor)pagefaults 0swaps > bdowning 5290 98.7 4.5 106788 92900 pts/1 R+ 01:26 0:49 git pack-obj > > Patched: > 55.37user 0.19system 0:56.35elapsed 98%CPU (0avgtext+0avgdata 0maxresident)k > 0inputs+0outputs (0major+32249minor)pagefaults 0swaps > bdowning 6086 100 4.5 106880 92996 pts/1 R+ 01:29 0:49 git pack-obj The number of minor faults are comparable (slightly favorable), which is a good sign. > The patched version is actually smaller in both SBCL's and Git's case > (again, --window 100 and --depth 100): > > SBCL: 61696 bytes smaller (13294225-13232529) > Git: 16010 bytes smaller (12690424-12674414) > > I believe the reason for this is that more deltas can get in under the > depth limit. Very sensible indeed. >> It would become worrysome (*BUT* infinitely more interesting) >> once you start talking about a tradeoff between slightly larger >> delta and much shorter delta. Such a tradeoff, if done right, >> would make a lot of sense, but I do not offhand think of a way >> to strike a proper balance between them efficiently. > > Yeah, I was thinking about that too, and came to the same conclusion. > I suspect you'd have to save a /lot/ of delta depth to want to pay any > more I/O, though. That may not be so. Deeper delta also means more I/O (and worse, because they can be from discontiguous areas) plus delta application. ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Preferring shallower deltas on repack 2007-07-09 7:27 ` Junio C Hamano @ 2007-07-09 7:36 ` Brian Downing 0 siblings, 0 replies; 19+ messages in thread From: Brian Downing @ 2007-07-09 7:36 UTC (permalink / raw) To: Junio C Hamano; +Cc: git On Mon, Jul 09, 2007 at 12:27:24AM -0700, Junio C Hamano wrote: > >> It would become worrysome (*BUT* infinitely more interesting) > >> once you start talking about a tradeoff between slightly larger > >> delta and much shorter delta. Such a tradeoff, if done right, > >> would make a lot of sense, but I do not offhand think of a way > >> to strike a proper balance between them efficiently. > > > > Yeah, I was thinking about that too, and came to the same conclusion. > > I suspect you'd have to save a /lot/ of delta depth to want to pay any > > more I/O, though. > > That may not be so. Deeper delta also means more I/O (and > worse, because they can be from discontiguous areas) plus delta > application. Good point. I guess if I were to think about a metric for deciding whether to go for a small deep delta or a larger shallow one, I would probably keep track of the total path size (the sum of the base size and all the delta sizes) for each entry, and play with a weighting formula of single delta size verses total path size. -bcd ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Preferring shallower deltas on repack 2007-07-09 5:31 ` Preferring shallower deltas on repack Junio C Hamano 2007-07-09 5:43 ` Junio C Hamano 2007-07-09 6:52 ` Brian Downing @ 2007-07-09 15:58 ` Nicolas Pitre 2007-07-09 16:39 ` Junio C Hamano 2 siblings, 1 reply; 19+ messages in thread From: Nicolas Pitre @ 2007-07-09 15:58 UTC (permalink / raw) To: Junio C Hamano; +Cc: Brian Downing, git On Sun, 8 Jul 2007, Junio C Hamano wrote: > It would become worrysome (*BUT* infinitely more interesting) > once you start talking about a tradeoff between slightly larger > delta and much shorter delta. Such a tradeoff, if done right, > would make a lot of sense, but I do not offhand think of a way > to strike a proper balance between them efficiently. We already do something similar to that with max_size being a function of the delta depth. This already has the effect of favoring shalower deltas even if deeper deltas could be smaller, and therefore creating a wider delta tree (see commits 4e8da195 and c3b06a69). Maybe this max_size curve could be modified a bit to increase the bias towards shallow deltas even in the case where a delta was already found instead of the current constant flat curve. [...] OK here it is. And results on the GIT repo and another patalogical test repo I keep around are actually really nice! Not only the pack itself is a bit smaller, but the delta depth distribution as shown by git-verify-pack -v is much nicer with the bulk of deltas in the low depth end of the spectrum and no more peak at the max depth level. Signed-off-by: Nicolas Pitre <nico@cam.org> --- diff --git a/builtin-pack-objects.c b/builtin-pack-objects.c index 54b9d26..1eb86ed 100644 --- a/builtin-pack-objects.c +++ b/builtin-pack-objects.c @@ -1332,12 +1332,14 @@ static int try_delta(struct unpacked *trg, struct unpacked *src, /* Now some size filtering heuristics. */ trg_size = trg_entry->size; - max_size = trg_size/2 - 20; + if (!trg_entry->delta) + max_size = trg_size/2 - 20; + else + max_size = trg_entry->delta_size * max_depth / + (max_depth - trg_entry->depth + 1); max_size = max_size * (max_depth - src_entry->depth) / max_depth; if (max_size == 0) return 0; - if (trg_entry->delta && trg_entry->delta_size <= max_size) - max_size = trg_entry->delta_size; src_size = src_entry->size; sizediff = src_size < trg_size ? trg_size - src_size : 0; if (sizediff >= max_size) ^ permalink raw reply related [flat|nested] 19+ messages in thread
* Re: Preferring shallower deltas on repack 2007-07-09 15:58 ` Nicolas Pitre @ 2007-07-09 16:39 ` Junio C Hamano 2007-07-09 18:53 ` Brian Downing 0 siblings, 1 reply; 19+ messages in thread From: Junio C Hamano @ 2007-07-09 16:39 UTC (permalink / raw) To: Nicolas Pitre; +Cc: Brian Downing, git Nicolas Pitre <nico@cam.org> writes: > On Sun, 8 Jul 2007, Junio C Hamano wrote: > >> It would become worrysome (*BUT* infinitely more interesting) >> once you start talking about a tradeoff between slightly larger >> delta and much shallower depth. Such a tradeoff, if done right, >> would make a lot of sense, but I do not offhand think of a way >> to strike a proper balance between them efficiently. > > We already do something similar to that with max_size being a function > of the delta depth. ... Yeah, we had a fun thread about that; I forgot about it until now. > OK here it is. And results on the GIT repo and another patalogical test > repo I keep around are actually really nice! Not only the pack itself > is a bit smaller, but the delta depth distribution as shown by > git-verify-pack -v is much nicer with the bulk of deltas in the low > depth end of the spectrum and no more peak at the max depth level. Looks obviously correct. Brian, it would be very interesting to see what Nico's patch does to your dataset. ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Preferring shallower deltas on repack 2007-07-09 16:39 ` Junio C Hamano @ 2007-07-09 18:53 ` Brian Downing 2007-07-09 19:13 ` Nicolas Pitre 2007-07-09 19:30 ` [PATCH] Shoddy pack information tool Brian Downing 0 siblings, 2 replies; 19+ messages in thread From: Brian Downing @ 2007-07-09 18:53 UTC (permalink / raw) To: Junio C Hamano; +Cc: Nicolas Pitre, git On Mon, Jul 09, 2007 at 09:39:44AM -0700, Junio C Hamano wrote: > > OK here it is. And results on the GIT repo and another patalogical test > > repo I keep around are actually really nice! Not only the pack itself > > is a bit smaller, but the delta depth distribution as shown by > > git-verify-pack -v is much nicer with the bulk of deltas in the low > > depth end of the spectrum and no more peak at the max depth level. > > Looks obviously correct. Brian, it would be very interesting to > see what Nico's patch does to your dataset. Nico's patch makes the overall statistics look better, but the version.lisp-expr file still goes 593 levels deep, as opposed to about 65 with my patch. (That's better than 980 with stock Git, though.) Pack statistics from my shoddy analysis tool (I'll post it a bit later): "sizes" are all object sizes in the pack. For deltas this is just the delta size. "path sizes" is the size of the /path/ to each object in the file; this is the size of the base and each patch in the chain to the object. This is approximately how much data you have to read to get to an object. "depths" should be obvious. SBCL, stock git: all sizes: count 46829 total 30256118 min 0 max 1012295 mean 646.10 median 45 std_dev 9555.48 all path sizes: count 46829 total 1551200401 min 0 max 1012295 mean 33124.78 median 11661 std_dev 55310.88 depths: count 46829 total 4693372 min 0 max 980 mean 100.22 median 12 std_dev 188.21 SBCL, my patch: all sizes: count 46829 total 30251762 min 0 max 1012295 mean 646.00 median 45 std_dev 9555.48 all path sizes: count 46829 total 1529629918 min 0 max 1012295 mean 32664.16 median 11213 std_dev 54930.06 depths: count 46829 total 2883121 min 0 max 787 mean 61.57 median 11 std_dev 127.64 SBCL, Nico's patch: all sizes: count 46829 total 30253345 min 0 max 1012295 mean 646.04 median 45 std_dev 9555.49 all path sizes: count 46829 total 1518730701 min 0 max 1012295 mean 32431.41 median 10819 std_dev 54751.35 depths: count 46829 total 3694511 min 0 max 699 mean 78.89 median 12 std_dev 141.53 I'm vaguely working on an alternate weighting mechanism based on path sizes, but so far all I've been able to do is generate some really strange packs. :) -bcd ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Preferring shallower deltas on repack 2007-07-09 18:53 ` Brian Downing @ 2007-07-09 19:13 ` Nicolas Pitre 2007-07-09 19:24 ` Brian Downing 2007-07-09 19:30 ` [PATCH] Shoddy pack information tool Brian Downing 1 sibling, 1 reply; 19+ messages in thread From: Nicolas Pitre @ 2007-07-09 19:13 UTC (permalink / raw) To: Brian Downing; +Cc: Junio C Hamano, git On Mon, 9 Jul 2007, Brian Downing wrote: > On Mon, Jul 09, 2007 at 09:39:44AM -0700, Junio C Hamano wrote: > > > OK here it is. And results on the GIT repo and another patalogical test > > > repo I keep around are actually really nice! Not only the pack itself > > > is a bit smaller, but the delta depth distribution as shown by > > > git-verify-pack -v is much nicer with the bulk of deltas in the low > > > depth end of the spectrum and no more peak at the max depth level. > > > > Looks obviously correct. Brian, it would be very interesting to > > see what Nico's patch does to your dataset. > > Nico's patch makes the overall statistics look better, but the > version.lisp-expr file still goes 593 levels deep, as opposed to about > 65 with my patch. (That's better than 980 with stock Git, though.) Tuning for such an extreme without impacting normal cases is rather hard. My patch was meant to be used on top of yours. Is that what you tested? Also I'd suggest you do not use a max depth of 1000. It is simply insane and might possibly make the existing logic less effective. Even for runtime pack access you want it to be reasonably short, say 100 maximum, or even the current default of 50. Any improvements you might come with (like automatic depth determined on replay cost) should be compared with that default which is known to work well already. Nicolas ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Preferring shallower deltas on repack 2007-07-09 19:13 ` Nicolas Pitre @ 2007-07-09 19:24 ` Brian Downing 2007-07-09 19:49 ` Brian Downing 0 siblings, 1 reply; 19+ messages in thread From: Brian Downing @ 2007-07-09 19:24 UTC (permalink / raw) To: Nicolas Pitre; +Cc: Junio C Hamano, git On Mon, Jul 09, 2007 at 03:13:54PM -0400, Nicolas Pitre wrote: > Tuning for such an extreme without impacting normal cases is rather > hard. > > My patch was meant to be used on top of yours. Is that what you tested? > > Also I'd suggest you do not use a max depth of 1000. It is simply > insane and might possibly make the existing logic less effective. Even > for runtime pack access you want it to be reasonably short, say 100 > maximum, or even the current default of 50. Any improvements you might > come with (like automatic depth determined on replay cost) should be > compared with that default which is known to work well already. No, I didn't try it on top of mine; sorry. I'll try that out. I realize a depth of 1000 is nuts, I'm just using it to expose any suboptimal delta selection for the nasty versions.lisp-expr case, since I know that with a window size of 100 it should be easy to keep it from growing too deep. -bcd ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Preferring shallower deltas on repack 2007-07-09 19:24 ` Brian Downing @ 2007-07-09 19:49 ` Brian Downing 2007-07-09 20:22 ` Nicolas Pitre 2007-07-09 20:23 ` Brian Downing 0 siblings, 2 replies; 19+ messages in thread From: Brian Downing @ 2007-07-09 19:49 UTC (permalink / raw) To: Nicolas Pitre; +Cc: Junio C Hamano, git On Mon, Jul 09, 2007 at 02:24:03PM -0500, Brian Downing wrote: > No, I didn't try it on top of mine; sorry. I'll try that out. The results with both your patch and mine are exactly the same as yours applied to master (c956395e). The only thing mine adds on top of yours is: diff --git a/builtin-pack-objects.c b/builtin-pack-objects.c index 9a33698..2da78b4 100644 --- a/builtin-pack-objects.c +++ b/builtin-pack-objects.c @@ -1373,6 +1373,12 @@ static int try_delta(struct unpacked *trg, struct unpacke return 0; if (trg_entry->delta_data) { + /* Prefer only shallower same-sized deltas. */ + if (delta_size == trg_entry->delta_size && + src_entry->depth + 1 >= trg_entry->depth) { + free(delta_buf); + return 0; + } delta_cache_size -= trg_entry->delta_size; free(trg_entry->delta_data); } Which was meant to pick off the cases where I got an equivalently sized patch that was as deep or deeper. This was neccessary due to my changing: --- a/builtin-pack-objects.c +++ b/builtin-pack-objects.c @@ -1337,7 +1337,7 @@ static int try_delta(struct unpacked *trg, struct unpacked if (max_size == 0) return 0; if (trg_entry->delta && trg_entry->delta_size <= max_size) - max_size = trg_entry->delta_size-1; + max_size = trg_entry->delta_size; src_size = src_entry->size; sizediff = src_size < trg_size ? trg_size - src_size : 0; if (sizediff >= max_size) max_size = trg_entry->delta_size; Your patch changed the max_size selection logic, so I'm not sure the rest of mine will do anything anymore. -bcd ^ permalink raw reply related [flat|nested] 19+ messages in thread
* Re: Preferring shallower deltas on repack 2007-07-09 19:49 ` Brian Downing @ 2007-07-09 20:22 ` Nicolas Pitre 2007-07-09 20:23 ` Brian Downing 1 sibling, 0 replies; 19+ messages in thread From: Nicolas Pitre @ 2007-07-09 20:22 UTC (permalink / raw) To: Brian Downing; +Cc: Junio C Hamano, git On Mon, 9 Jul 2007, Brian Downing wrote: > On Mon, Jul 09, 2007 at 02:24:03PM -0500, Brian Downing wrote: > > No, I didn't try it on top of mine; sorry. I'll try that out. > > The results with both your patch and mine are exactly the same as > yours applied to master (c956395e). The only thing mine adds on top of > yours is: > > diff --git a/builtin-pack-objects.c b/builtin-pack-objects.c > index 9a33698..2da78b4 100644 > --- a/builtin-pack-objects.c > +++ b/builtin-pack-objects.c > @@ -1373,6 +1373,12 @@ static int try_delta(struct unpacked *trg, struct unpacke > return 0; > > if (trg_entry->delta_data) { > + /* Prefer only shallower same-sized deltas. */ > + if (delta_size == trg_entry->delta_size && > + src_entry->depth + 1 >= trg_entry->depth) { > + free(delta_buf); > + return 0; > + } > delta_cache_size -= trg_entry->delta_size; > free(trg_entry->delta_data); > } > > Which was meant to pick off the cases where I got an equivalently > sized patch that was as deep or deeper. This was neccessary due to > my changing: > > --- a/builtin-pack-objects.c > +++ b/builtin-pack-objects.c > @@ -1337,7 +1337,7 @@ static int try_delta(struct unpacked *trg, struct unpacked > if (max_size == 0) > return 0; > if (trg_entry->delta && trg_entry->delta_size <= max_size) > - max_size = trg_entry->delta_size-1; > + max_size = trg_entry->delta_size; > src_size = src_entry->size; > sizediff = src_size < trg_size ? trg_size - src_size : 0; > if (sizediff >= max_size) > > max_size = trg_entry->delta_size; > > Your patch changed the max_size selection logic, so I'm not sure the > rest of mine will do anything anymore. It is still valid nevertheless. Nicolas ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Preferring shallower deltas on repack 2007-07-09 19:49 ` Brian Downing 2007-07-09 20:22 ` Nicolas Pitre @ 2007-07-09 20:23 ` Brian Downing 1 sibling, 0 replies; 19+ messages in thread From: Brian Downing @ 2007-07-09 20:23 UTC (permalink / raw) To: Nicolas Pitre; +Cc: Junio C Hamano, git On Mon, Jul 09, 2007 at 02:49:32PM -0500, Brian Downing wrote: > The results with both your patch and mine are exactly the same as > yours applied to master (c956395e). Sorry, that's not quite right. The statistics were close enough that my vdiff failed, though. :) Here are all the diffs through my pack analyzer. There's amazingly few and they very nearly cancel each other out: --- treetree-nico 2007-07-09 15:19:10.000000000 -0500 +++ treetree-nicobd 2007-07-09 15:18:45.000000000 -0500 @@ -25392,12 +25392,15 @@ 0 commit 7cde9fabcd145901785a468a87108f7d9c4291fc 544 544 0 blob 7ce2b21be4ac425a8423ec69c6036a53311df5cb 2347 2347 + 1 blob 25140bfb264e7b9c8fe4b97318297630cc93b112 313 2660 + 2 blob ba80ac333acefc68362e66d6f8e61531263e135a 15 2675 + 3 blob 6f18434f6815a037d6fa15c38d3f460b530e585b 161 2836 1 blob 4dbd333da74ad866f0bff154d6cffea11b9a841a 9 2356 2 blob ddb93ae086bf0393c9d53247732a2c29af33c80c 313 2669 3 blob d34e346a0f2ea6fdb2acc414d10635dc4d326d25 11 2680 4 blob 368243c09993f7ee8b56c7ba184bef134a91607c 11 2691 - size: count 5 total 2691 min 9 max 2347 mean 538.20 median 11 std_dev 911.97 -path size: count 5 total 12743 min 2347 max 2691 mean 2548.60 median 2669 std_dev 161.11 + size: count 8 total 3180 min 9 max 2347 mean 397.50 median 161 std_dev 747.23 +path size: count 8 total 20914 min 2347 max 2836 mean 2614.25 median 2675 std_dev 160.58 0 commit 7ce2c42adf3d62f03086de940adaee48e6161a40 579 579 @@ -47249,6 +47252,9 @@ 0 commit d5319592583dda6833b74b34b52dbd2aa3109948 468 468 0 tree d53239e82aaed72745097f00b32807f2eb71997d 104 104 + 1 tree 9c01548d14e5b766397e6d5595566269094057bb 5 109 + size: count 2 total 109 min 5 max 104 mean 54.50 median 104 std_dev 49.50 +path size: count 2 total 213 min 104 max 109 mean 106.50 median 109 std_dev 2.50 0 commit d5393dd736972a5c84cd97fec9892cd3da80b669 450 450 @@ -48702,9 +48708,6 @@ path size: count 16 total 111347 min 6404 max 7298 mean 6959.19 median 7142 std_dev 333.27 0 tree df1a6239457293813f34eee001187d725e718062 104 104 - 1 tree 9c01548d14e5b766397e6d5595566269094057bb 5 109 - size: count 2 total 109 min 5 max 104 mean 54.50 median 104 std_dev 49.50 -path size: count 2 total 213 min 104 max 109 mean 106.50 median 109 std_dev 2.50 0 commit df1eebdf125384f3bf7eb2b15b874043504797e6 364 364 @@ -49407,10 +49410,14 @@ 0 blob e4285f4b5739d4a2385b28c4e4fd60cc40beb2a1 217 217 0 blob e42a4ea85cbac543f20fba2e1cb95b35c3a8a553 1021 1021 + 1 blob 9a7565236131c7c2b0f4bc5ae8ddb79cfc0eb418 112 1133 + 2 blob 9d8583de4217d0515125464300321f20539355e1 65 1198 + 2 blob f8994669d23fa5c008cfdf58ea63a0b88e46f651 14 1147 + 3 blob 3eec762b65518ffe72ffb29cd30a1bb481f6e898 19 1166 1 blob ae1a1001c0d71a8407a3432ff9e2ba7dab59166b 43 1064 2 blob 3127f77e001d0b45359254ca59af9f6a62401d77 12 1076 - size: count 3 total 1076 min 12 max 1021 mean 358.67 median 43 std_dev 468.51 -path size: count 3 total 3161 min 1021 max 1076 mean 1053.67 median 1064 std_dev 23.61 + size: count 7 total 1286 min 12 max 1021 mean 183.71 median 43 std_dev 343.41 +path size: count 7 total 7805 min 1021 max 1198 mean 1115.00 median 1133 std_dev 58.30 0 blob e431cf36086c97b6e22eb741e7566100d3133c42 4766 4766 1 blob 344136d826427f400d93240f12b8c070d700ece2 448 5214 @@ -52503,13 +52510,9 @@ 2 blob 651668cdb7db56dba200db8a53732ac984aaac04 12 4524 3 blob b31d0f38830ff8fdcdfd81b386993bc2b2216a38 16 4540 4 blob ec7899967381c5993e45df411077b598c7d25831 12 4552 - 1 blob 9a7565236131c7c2b0f4bc5ae8ddb79cfc0eb418 112 4567 - 2 blob 9d8583de4217d0515125464300321f20539355e1 65 4632 - 2 blob f8994669d23fa5c008cfdf58ea63a0b88e46f651 14 4581 - 3 blob 3eec762b65518ffe72ffb29cd30a1bb481f6e898 19 4600 1 blob e8a23b634fef5a9d906e3185d65fae8979612851 20 4475 - size: count 60 total 7628 min 11 max 4455 mean 127.13 median 20 std_dev 569.75 -path size: count 60 total 298378 min 4455 max 5455 mean 4972.97 median 5017 std_dev 318.40 + size: count 56 total 7418 min 11 max 4455 mean 132.46 median 20 std_dev 589.29 +path size: count 56 total 279998 min 4455 max 5455 mean 4999.96 median 5031 std_dev 312.48 0 tree ee1ece9ab33edbe7ec1eebcbe9b259c7a9dded4c 196 196 @@ -54809,16 +54812,13 @@ 0 commit fb76e3acd8b8a53cdadaa65bce1d090d99e004a0 324 324 0 blob fb7b9c91f11921e2d3ac5b20d2c10728dd5ba173 2417 2417 - 1 blob 25140bfb264e7b9c8fe4b97318297630cc93b112 313 2730 - 2 blob ba80ac333acefc68362e66d6f8e61531263e135a 15 2745 - 3 blob 6f18434f6815a037d6fa15c38d3f460b530e585b 161 2906 1 blob f0d45243029e1e1a9ef4d499dcb5934de4c523b9 11 2428 2 blob c9eab11115a061a457bb84e0a4b0fa1a140cf0da 41 2469 3 blob b54b3745c995f8284d1700a8f9b61eed0ed7548e 7 2476 1 blob fb107f406660f5faff8d00d060af1478ad21a6bf 15 2432 2 blob 829950c079edad38eeaeb4663ea8e96b4d6fedb7 7 2439 - size: count 9 total 2987 min 7 max 2417 mean 331.89 median 15 std_dev 743.62 -path size: count 9 total 23042 min 2417 max 2906 mean 2560.22 median 2469 std_dev 172.26 + size: count 6 total 2498 min 7 max 2417 mean 416.33 median 15 std_dev 894.80 +path size: count 6 total 14661 min 2417 max 2476 mean 2443.50 median 2439 std_dev 21.61 0 commit fb8533122551bbb7aea669f40bc91c1211809b58 778 778 @@ -56646,7 +56646,7 @@ 0 commit ffe8d65266ed7c2c67a0a6ce7ff0de633000037e 474 474 all sizes: count 46829 total 30253345 min 0 max 1012295 mean 646.04 median 45 std_dev 9555.49 - all path sizes: count 46829 total 1518730701 min 0 max 1012295 mean 32431.41 median 10819 std_dev 54751.35 + all path sizes: count 46829 total 1518716755 min 0 max 1012295 mean 32431.12 median 10819 std_dev 54751.51 tree sizes: count 6442 total 30253345 min 0 max 1012295 mean 4696.27 median 422 std_dev 28603.08 -tree path sizes: count 6442 total 1518730701 min 0 max 306510684 mean 235754.53 median 458 std_dev 4198348.32 +tree path sizes: count 6442 total 1518716755 min 0 max 306510684 mean 235752.37 median 458 std_dev 4198348.25 depths: count 46829 total 3694511 min 0 max 699 mean 78.89 median 12 std_dev 141.53 ^ permalink raw reply [flat|nested] 19+ messages in thread
* [PATCH] Shoddy pack information tool 2007-07-09 18:53 ` Brian Downing 2007-07-09 19:13 ` Nicolas Pitre @ 2007-07-09 19:30 ` Brian Downing 2007-07-11 21:55 ` Junio C Hamano 1 sibling, 1 reply; 19+ messages in thread From: Brian Downing @ 2007-07-09 19:30 UTC (permalink / raw) To: Junio C Hamano; +Cc: Nicolas Pitre, git This tool will print vaguely pretty information about a pack. It expects the output of "git-verify-pack -v" as input on stdin. $ git-verify-pack -v pack | packinfo.pl This prints some full-pack statistics; currently "all sizes", "all path sizes", "tree sizes", "tree path sizes", and "depths". * "all sizes" stats are across every object size in the file; full sizes for base objects, and delta size for deltas. * "all path sizes" stats are across all object's "path sizes". A path size is the sum of the size of the delta chain, including the base object. In other words, it's how many bytes need be read to reassemble the file from deltas. * "tree sizes" are object sizes grouped into delta trees. * "tree path sizes" are path sizes grouped into delta trees. * "depths" should be obvious. When run as: $ git-verify-pack -v pack | packinfo.pl -tree The trees of objects are output along with the stats. This looks like: 0 commit 031321c6... 803 803 0 blob 03156f21... 1767 1767 1 blob f52a9d7f... 10 1777 2 blob a8cc5739... 51 1828 3 blob 660e90b1... 15 1843 4 blob 0cb8e3bb... 33 1876 2 blob e48607f0... 311 2088 size: count 6 total 2187 min 10 max 1767 mean 364.50 median 51 std_dev 635.85 path size: count 6 total 11179 min 1767 max 2088 mean 1863.17 median 1843 std_dev 107.26 The first number after the sha1 is the object size, the second number is the path size. The statistics are across all objects in the previous delta tree. Obviously they are omitted for trees of one object. When run as: $ git-verify-pack -v pack | packinfo.pl -tree -filenames It adds filenames to the tree. Getting this information is somewhat slow: 0 blob 03156f21... 1767 1767 Documentation/git-lost-found.txt @ tags/v1.2.0~142 1 blob f52a9d7f... 10 1777 Documentation/git-lost-found.txt @ tags/v1.5.0-rc1~74 2 blob a8cc5739... 51 1828 Documentation/git-lost+found.txt @ tags/v0.99.9h^0 3 blob 660e90b1... 15 1843 Documentation/git-lost+found.txt @ master~3222^2~2 4 blob 0cb8e3bb... 33 1876 Documentation/git-lost+found.txt @ master~3222^2~3 2 blob e48607f0... 311 2088 Documentation/git-lost-found.txt @ tags/v1.5.2-rc3~4 size: count 6 total 2187 min 10 max 1767 mean 364.50 median 51 std_dev 635.85 path size: count 6 total 11179 min 1767 max 2088 mean 1863.17 median 1843 std_dev 107.26 When run as: $ git-verify-pack -v pack | packinfo.pl -dump It prints out "sha1 size pathsize depth" for each sha1 in lexical order. 000079a2eaef17b7eae70e1f0f635557ea67b644 30 472 7 00013cafe6980411aa6fdd940784917b5ff50f0a 44 1542 4 000182eacf99cde27d5916aa415921924b82972c 499 499 0 ... This is handy for comparing two packs. --- packinfo.pl | 141 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 1 files changed, 141 insertions(+), 0 deletions(-) create mode 100644 packinfo.pl diff --git a/packinfo.pl b/packinfo.pl new file mode 100644 index 0000000..6abb7f2 --- /dev/null +++ b/packinfo.pl @@ -0,0 +1,141 @@ +#!/bin/perl + +use strict; +use Getopt::Long; + +my $filenames = 0; +my $tree = 0; +my $dump = 0; +GetOptions("tree" => \$tree, + "filenames" => \$filenames, + "dump" => \$dump); + +my $parents = {}; +my $children = {}; +my $sizes = {}; +my $roots = []; +my $paths = {}; +my $types = {}; +my $commits = []; +my $names = {}; +my $depths = {}; +my @depths; + +while (<STDIN>) { + my ($sha1, $type, $size, $offset, $depth, $parent) = split(/\s+/, $_); + next unless ($sha1 =~ /^[0-9a-f]{40}$/); + $depths->{$sha1} = $depth || 0; + push(@depths, $depth || 0); + push(@$commits, $sha1) if ($type eq 'commit'); + push(@$roots, $sha1) unless $parent; + $parents->{$sha1} = $parent; + $types->{$sha1} = $type; + push(@{$children->{$parent}}, $sha1); + $sizes->{$sha1} = $size; +} + +if ($filenames && $tree) { + open(NAMES, "git-name-rev --all|"); + while (<NAMES>) { + if (/^(\S+)\s+(.*)$/) { + my ($sha1, $name) = ($1, $2); + $names->{$sha1} = $name; + } + } + close NAMES; + + for my $commit (@$commits) { + my $name = $names->{$commit}; + open(TREE, "git-ls-tree -t -r $commit|"); + print STDERR "Plumbing tree $name\n"; + while (<TREE>) { + if (/^(\S+)\s+(\S+)\s+(\S+)\s+(.*)$/) { + my ($mode, $type, $sha1, $path) = ($1, $2, $3, $4); + $paths->{$sha1} = "$path @ $name"; + } + } + close TREE; + } +} + +sub stats { + my @data = sort {$a <=> $b} @_; + my $min = $data[0]; + my $max = $data[$#data]; + my $total = 0; + my $count = scalar @data; + for my $datum (@data) { + $total += $datum; + } + my $mean = $total / $count; + my $median = $data[int(@data / 2)]; + my $diff_sum = 0; + for my $datum (@data) { + $diff_sum += ($datum - $mean)**2; + } + my $std_dev = sqrt($diff_sum / $count); + return ($count, $total, $min, $max, $mean, $median, $std_dev); +} + +sub print_stats { + my $name = shift; + my ($count, $total, $min, $max, $mean, $median, $std_dev) = stats(@_); + printf("%s: count %s total %s min %s max %s mean %.2f median %s std_dev %.2f\n", + $name, $count, $total, $min, $max, $mean, $median, $std_dev); +} + +my @sizes; +my @path_sizes; +my @all_sizes; +my @all_path_sizes; +my $path_sizes = {}; + +sub dig { + my ($sha1, $depth, $path_size) = @_; + $path_size += $sizes->{$sha1}; + push(@sizes, $sizes->{$sha1}); + push(@all_sizes, $sizes->{$sha1}); + push(@path_sizes, $path_size); + push(@all_path_sizes, $path_size); + $path_sizes->{$sha1} = $path_size; + if ($tree) { + printf("%3d%s %6s %s %8d %8d %s\n", + $depth, (" " x $depth), $types->{$sha1}, + $sha1, $sizes->{$sha1}, $path_size, $paths->{$sha1}); + } + for my $child (@{$children->{$sha1}}) { + dig($child, $depth + 1, $path_size); + } +} + +my @tree_sizes; +my @tree_path_sizes; + +for my $root (@$roots) { + undef @sizes; + undef @path_sizes; + dig($root, 0, 0); + my ($aa, $sz_total) = stats(@sizes); + my ($bb, $psz_total) = stats(@path_sizes); + push(@tree_sizes, $sz_total); + push(@tree_path_sizes, $psz_total); + if ($tree) { + if (@sizes > 1) { + print_stats(" size", @sizes); + print_stats("path size", @path_sizes); + } + print "\n"; + } +} + +if ($dump) { + for my $sha1 (sort keys %$sizes) { + print "$sha1 $sizes->{$sha1} $path_sizes->{$sha1} $depths->{$sha1}\n"; + } +} else { + print_stats(" all sizes", @all_sizes); + print_stats(" all path sizes", @all_path_sizes); + print_stats(" tree sizes", @tree_sizes); + print_stats("tree path sizes", @tree_path_sizes); + print_stats(" depths", @depths); +} -- 1.5.2.GIT ^ permalink raw reply related [flat|nested] 19+ messages in thread
* Re: [PATCH] Shoddy pack information tool 2007-07-09 19:30 ` [PATCH] Shoddy pack information tool Brian Downing @ 2007-07-11 21:55 ` Junio C Hamano 2007-07-12 3:02 ` [PATCH] Pack " Brian Downing 0 siblings, 1 reply; 19+ messages in thread From: Junio C Hamano @ 2007-07-11 21:55 UTC (permalink / raw) To: Brian Downing; +Cc: Nicolas Pitre, git I do not think this is particularlly Shoddy. Care to move it somewhere in contrib/ (say, contrib/stats/packinfo.pl) and sign off the patch? ^ permalink raw reply [flat|nested] 19+ messages in thread
* [PATCH] Pack information tool 2007-07-11 21:55 ` Junio C Hamano @ 2007-07-12 3:02 ` Brian Downing 0 siblings, 0 replies; 19+ messages in thread From: Brian Downing @ 2007-07-12 3:02 UTC (permalink / raw) To: Junio C Hamano; +Cc: Brian Downing, Nicolas Pitre, git This tool will print vaguely pretty information about a pack. It expects the output of "git-verify-pack -v" as input on stdin. $ git-verify-pack -v | packinfo.pl See the documentation in the script (contrib/stats/packinfo.pl) for more information. Signed-off-by: Brian Downing <bdowning@lavos.net> --- I've been fighting with the mail configuraton on my laptop so that "git send-email" can send email that vger will accept. My apologies if anybody got dupes or mangled messages. contrib/stats/packinfo.pl | 212 +++++++++++++++++++++++++++++++++++++++++++++ 1 files changed, 212 insertions(+), 0 deletions(-) create mode 100755 contrib/stats/packinfo.pl diff --git a/contrib/stats/packinfo.pl b/contrib/stats/packinfo.pl new file mode 100755 index 0000000..792673a --- /dev/null +++ b/contrib/stats/packinfo.pl @@ -0,0 +1,212 @@ +#!/bin/perl +# +# This tool will print vaguely pretty information about a pack. It +# expects the output of "git-verify-pack -v" as input on stdin. +# +# $ git-verify-pack -v | packinfo.pl +# +# This prints some full-pack statistics; currently "all sizes", "all +# path sizes", "tree sizes", "tree path sizes", and "depths". +# +# * "all sizes" stats are across every object size in the file; +# full sizes for base objects, and delta size for deltas. +# * "all path sizes" stats are across all object's "path sizes". +# A path size is the sum of the size of the delta chain, including the +# base object. In other words, it's how many bytes need be read to +# reassemble the file from deltas. +# * "tree sizes" are object sizes grouped into delta trees. +# * "tree path sizes" are path sizes grouped into delta trees. +# * "depths" should be obvious. +# +# When run as: +# +# $ git-verify-pack -v | packinfo.pl -tree +# +# the trees of objects are output along with the stats. This looks +# like: +# +# 0 commit 031321c6... 803 803 +# +# 0 blob 03156f21... 1767 1767 +# 1 blob f52a9d7f... 10 1777 +# 2 blob a8cc5739... 51 1828 +# 3 blob 660e90b1... 15 1843 +# 4 blob 0cb8e3bb... 33 1876 +# 2 blob e48607f0... 311 2088 +# size: count 6 total 2187 min 10 max 1767 mean 364.50 median 51 std_dev 635.85 +# path size: count 6 total 11179 min 1767 max 2088 mean 1863.17 median 1843 std_dev 107.26 +# +# The first number after the sha1 is the object size, the second +# number is the path size. The statistics are across all objects in +# the previous delta tree. Obviously they are omitted for trees of +# one object. +# +# When run as: +# +# $ git-verify-pack -v | packinfo.pl -tree -filenames +# +# it adds filenames to the tree. Getting this information is slow: +# +# 0 blob 03156f21... 1767 1767 Documentation/git-lost-found.txt @ tags/v1.2.0~142 +# 1 blob f52a9d7f... 10 1777 Documentation/git-lost-found.txt @ tags/v1.5.0-rc1~74 +# 2 blob a8cc5739... 51 1828 Documentation/git-lost+found.txt @ tags/v0.99.9h^0 +# 3 blob 660e90b1... 15 1843 Documentation/git-lost+found.txt @ master~3222^2~2 +# 4 blob 0cb8e3bb... 33 1876 Documentation/git-lost+found.txt @ master~3222^2~3 +# 2 blob e48607f0... 311 2088 Documentation/git-lost-found.txt @ tags/v1.5.2-rc3~4 +# size: count 6 total 2187 min 10 max 1767 mean 364.50 median 51 std_dev 635.85 +# path size: count 6 total 11179 min 1767 max 2088 mean 1863.17 median 1843 std_dev 107.26 +# +# When run as: +# +# $ git-verify-pack -v | packinfo.pl -dump +# +# it prints out "sha1 size pathsize depth" for each sha1 in lexical +# order. +# +# 000079a2eaef17b7eae70e1f0f635557ea67b644 30 472 7 +# 00013cafe6980411aa6fdd940784917b5ff50f0a 44 1542 4 +# 000182eacf99cde27d5916aa415921924b82972c 499 499 0 +# ... +# +# This is handy for comparing two packs. Adding "-filenames" will add +# filenames, as per "-tree -filenames" above. + +use strict; +use Getopt::Long; + +my $filenames = 0; +my $tree = 0; +my $dump = 0; +GetOptions("tree" => \$tree, + "filenames" => \$filenames, + "dump" => \$dump); + +my %parents; +my %children; +my %sizes; +my @roots; +my %paths; +my %types; +my @commits; +my %names; +my %depths; +my @depths; + +while (<STDIN>) { + my ($sha1, $type, $size, $offset, $depth, $parent) = split(/\s+/, $_); + next unless ($sha1 =~ /^[0-9a-f]{40}$/); + $depths{$sha1} = $depth || 0; + push(@depths, $depth || 0); + push(@commits, $sha1) if ($type eq 'commit'); + push(@roots, $sha1) unless $parent; + $parents{$sha1} = $parent; + $types{$sha1} = $type; + push(@{$children{$parent}}, $sha1); + $sizes{$sha1} = $size; +} + +if ($filenames && ($tree || $dump)) { + open(NAMES, "git-name-rev --all|"); + while (<NAMES>) { + if (/^(\S+)\s+(.*)$/) { + my ($sha1, $name) = ($1, $2); + $names{$sha1} = $name; + } + } + close NAMES; + + for my $commit (@commits) { + my $name = $names{$commit}; + open(TREE, "git-ls-tree -t -r $commit|"); + print STDERR "Plumbing tree $name\n"; + while (<TREE>) { + if (/^(\S+)\s+(\S+)\s+(\S+)\s+(.*)$/) { + my ($mode, $type, $sha1, $path) = ($1, $2, $3, $4); + $paths{$sha1} = "$path @ $name"; + } + } + close TREE; + } +} + +sub stats { + my @data = sort {$a <=> $b} @_; + my $min = $data[0]; + my $max = $data[$#data]; + my $total = 0; + my $count = scalar @data; + for my $datum (@data) { + $total += $datum; + } + my $mean = $total / $count; + my $median = $data[int(@data / 2)]; + my $diff_sum = 0; + for my $datum (@data) { + $diff_sum += ($datum - $mean)**2; + } + my $std_dev = sqrt($diff_sum / $count); + return ($count, $total, $min, $max, $mean, $median, $std_dev); +} + +sub print_stats { + my $name = shift; + my ($count, $total, $min, $max, $mean, $median, $std_dev) = stats(@_); + printf("%s: count %s total %s min %s max %s mean %.2f median %s std_dev %.2f\n", + $name, $count, $total, $min, $max, $mean, $median, $std_dev); +} + +my @sizes; +my @path_sizes; +my @all_sizes; +my @all_path_sizes; +my %path_sizes; + +sub dig { + my ($sha1, $depth, $path_size) = @_; + $path_size += $sizes{$sha1}; + push(@sizes, $sizes{$sha1}); + push(@all_sizes, $sizes{$sha1}); + push(@path_sizes, $path_size); + push(@all_path_sizes, $path_size); + $path_sizes{$sha1} = $path_size; + if ($tree) { + printf("%3d%s %6s %s %8d %8d %s\n", + $depth, (" " x $depth), $types{$sha1}, + $sha1, $sizes{$sha1}, $path_size, $paths{$sha1}); + } + for my $child (@{$children{$sha1}}) { + dig($child, $depth + 1, $path_size); + } +} + +my @tree_sizes; +my @tree_path_sizes; + +for my $root (@roots) { + undef @sizes; + undef @path_sizes; + dig($root, 0, 0); + my ($aa, $sz_total) = stats(@sizes); + my ($bb, $psz_total) = stats(@path_sizes); + push(@tree_sizes, $sz_total); + push(@tree_path_sizes, $psz_total); + if ($tree) { + if (@sizes > 1) { + print_stats(" size", @sizes); + print_stats("path size", @path_sizes); + } + print "\n"; + } +} + +if ($dump) { + for my $sha1 (sort keys %sizes) { + print "$sha1 $sizes{$sha1} $path_sizes{$sha1} $depths{$sha1} $paths{$sha1}\n"; + } +} else { + print_stats(" all sizes", @all_sizes); + print_stats(" all path sizes", @all_path_sizes); + print_stats(" tree sizes", @tree_sizes); + print_stats("tree path sizes", @tree_path_sizes); + print_stats(" depths", @depths); +} -- 1.5.2.GIT ^ permalink raw reply related [flat|nested] 19+ messages in thread
* Re: Preferring shallower deltas on repack 2007-07-09 4:43 Preferring shallower deltas on repack Brian Downing 2007-07-09 4:45 ` [PATCH] pack-objects: Prefer shallower deltas if the size is equal Brian Downing 2007-07-09 5:31 ` Preferring shallower deltas on repack Junio C Hamano @ 2007-07-09 5:41 ` Linus Torvalds 2 siblings, 0 replies; 19+ messages in thread From: Linus Torvalds @ 2007-07-09 5:41 UTC (permalink / raw) To: Brian Downing; +Cc: git On Sun, 8 Jul 2007, Brian Downing wrote: > > I modified this to prefer shallower deltas of the same size. This made > the deltas for this file a very wide tree with a maximum depth of about > 65. Other (much smaller) improvements were seen elsewhere in the pack. > Runtime does not seem to have been affected, as most of the work had > already been done when it was tossing deltas before. This seems like a good thing to do, and on the face of it I think it's worth it. I can't see any real downsides, at least, and while the upside doesn't sound huge, it sounds real enough. So here's at least an initial tentative "ack" from me. Linus ^ permalink raw reply [flat|nested] 19+ messages in thread
end of thread, other threads:[~2007-07-12 3:10 UTC | newest] Thread overview: 19+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2007-07-09 4:43 Preferring shallower deltas on repack Brian Downing 2007-07-09 4:45 ` [PATCH] pack-objects: Prefer shallower deltas if the size is equal Brian Downing 2007-07-09 5:31 ` Preferring shallower deltas on repack Junio C Hamano 2007-07-09 5:43 ` Junio C Hamano 2007-07-09 6:52 ` Brian Downing 2007-07-09 7:27 ` Junio C Hamano 2007-07-09 7:36 ` Brian Downing 2007-07-09 15:58 ` Nicolas Pitre 2007-07-09 16:39 ` Junio C Hamano 2007-07-09 18:53 ` Brian Downing 2007-07-09 19:13 ` Nicolas Pitre 2007-07-09 19:24 ` Brian Downing 2007-07-09 19:49 ` Brian Downing 2007-07-09 20:22 ` Nicolas Pitre 2007-07-09 20:23 ` Brian Downing 2007-07-09 19:30 ` [PATCH] Shoddy pack information tool Brian Downing 2007-07-11 21:55 ` Junio C Hamano 2007-07-12 3:02 ` [PATCH] Pack " Brian Downing 2007-07-09 5:41 ` Preferring shallower deltas on repack Linus Torvalds
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).