git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Preferring shallower deltas on repack
@ 2007-07-09  4:43 Brian Downing
  2007-07-09  4:45 ` [PATCH] pack-objects: Prefer shallower deltas if the size is equal Brian Downing
                   ` (2 more replies)
  0 siblings, 3 replies; 19+ messages in thread
From: Brian Downing @ 2007-07-09  4:43 UTC (permalink / raw)
  To: git

SBCL, a Lisp implementation, has a file in their (CVS) repository called
"version.lisp-expr" which is updated every commit.  This file looks like:

------------------------------------------------------------------------
;;; This is the master value for LISP-IMPLEMENTATION-VERSION. It's
;;; separated into its own file here so that it's easy for
;;; text-munging make-ish or cvs-ish scripts to find and tweak it. For
;;; the convenience of such scripts, only a simple subset of Lisp
;;; reader syntax should be used here: semicolon-delimited comments,
;;; possible blank lines or other whitespace, and a single
;;; double-quoted string value alone on its own line.
;;;
;;; ANSI says LISP-IMPLEMENTATION-VERSION can be NIL "if no
;;; appropriate and relevant result can be produced", but as long as
;;; we control the build, we can always assign an appropriate and
;;; relevant result, so this must be a string, not NIL.
;;;
;;; Conventionally a string like "0.6.6", with three numeric fields,
;;; is used for released versions, and a string like "0.6.5.xyzzy",
;;; with something arbitrary in the fourth field, is used for CVS
;;; checkins which aren't released. (And occasionally for internal
;;; versions, especially for internal versions off the main CVS
;;; branch, it gets hairier, e.g. "0.pre7.14.flaky4.13".)
"1.0.7.10"
------------------------------------------------------------------------

Only very rarely does anything but the last line change.

The current repack implementation, when given values like "--window=100
--depth=1000", happily generates an incredibly deep tree for this file,
maxing out the depth.  (I'm only using such a high depth to demonstrate
the pessimal behavior.)  I noticed that it just takes the first delta
it finds that is the smallest; it then only looks for deltas with a
max_size of the old size - 1, so it can never find better matches with
regard to depth.

I modified this to prefer shallower deltas of the same size.  This made
the deltas for this file a very wide tree with a maximum depth of about
65.  Other (much smaller) improvements were seen elsewhere in the pack.
Runtime does not seem to have been affected, as most of the work had
already been done when it was tossing deltas before.

Some simple statistics:

SBCL, standard pack-objects, window 100, depth 1000:
  Max depth: 980
  Mean depth: 100.223622114502
  Median depth: 12
  Standard deviation: 188.214331919176

SBCL, patched pack-objects, window 100, depth 1000:
  Max depth: 787
  Mean depth: 61.5669990817656
  Median depth: 11
  Standard deviation: 127.644652607399

git, standard pack-objects, window 100, depth 1000:
  Max depth: 925
  Mean depth: 77.184264479754
  Median depth: 8
  Standard deviation: 150.112998198182

git, patched pack-objects, window 100, depth 1000:
  Max depth: 913
  Mean depth: 74.9981877425496
  Median depth: 7
  Standard deviation: 147.900721785959

The only negative effect I could see from this patch might be pack
locality.  Unfortunately I don't know enough (read: anything) about pack
access patterns to determine if this is the case.

Patch to follow.

-bcd

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH] pack-objects: Prefer shallower deltas if the size is equal
  2007-07-09  4:43 Preferring shallower deltas on repack Brian Downing
@ 2007-07-09  4:45 ` Brian Downing
  2007-07-09  5:31 ` Preferring shallower deltas on repack Junio C Hamano
  2007-07-09  5:41 ` Preferring shallower deltas on repack Linus Torvalds
  2 siblings, 0 replies; 19+ messages in thread
From: Brian Downing @ 2007-07-09  4:45 UTC (permalink / raw)
  To: git

Change "try_delta" so that if it finds a delta that has the same size
but shallower depth than the existing delta, it will prefer the
shallower one.  This makes certain delta trees vastly less deep.

Signed-off-by: Brian Downing <bdowning@lavos.net>
---
 builtin-pack-objects.c |    8 +++++++-
 1 files changed, 7 insertions(+), 1 deletions(-)

diff --git a/builtin-pack-objects.c b/builtin-pack-objects.c
index 3d396ca..54b9d26 100644
--- a/builtin-pack-objects.c
+++ b/builtin-pack-objects.c
@@ -1337,7 +1337,7 @@ static int try_delta(struct unpacked *trg, struct unpacked *src,
 	if (max_size == 0)
 		return 0;
 	if (trg_entry->delta && trg_entry->delta_size <= max_size)
-		max_size = trg_entry->delta_size-1;
+		max_size = trg_entry->delta_size;
 	src_size = src_entry->size;
 	sizediff = src_size < trg_size ? trg_size - src_size : 0;
 	if (sizediff >= max_size)
@@ -1371,6 +1371,12 @@ static int try_delta(struct unpacked *trg, struct unpacked *src,
 		return 0;
 
 	if (trg_entry->delta_data) {
+		/* Prefer only shallower same-sized deltas. */
+		if (delta_size == trg_entry->delta_size &&
+		    src_entry->depth + 1 >= trg_entry->depth) {
+			free(delta_buf);
+			return 0;
+		}
 		delta_cache_size -= trg_entry->delta_size;
 		free(trg_entry->delta_data);
 	}
-- 
1.5.2.GIT

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: Preferring shallower deltas on repack
  2007-07-09  4:43 Preferring shallower deltas on repack Brian Downing
  2007-07-09  4:45 ` [PATCH] pack-objects: Prefer shallower deltas if the size is equal Brian Downing
@ 2007-07-09  5:31 ` Junio C Hamano
  2007-07-09  5:43   ` Junio C Hamano
                     ` (2 more replies)
  2007-07-09  5:41 ` Preferring shallower deltas on repack Linus Torvalds
  2 siblings, 3 replies; 19+ messages in thread
From: Junio C Hamano @ 2007-07-09  5:31 UTC (permalink / raw)
  To: Brian Downing; +Cc: git

bdowning@lavos.net (Brian Downing) writes:

> I modified this to prefer shallower deltas of the same size.  This made
> the deltas for this file a very wide tree with a maximum depth of about
> 65.  Other (much smaller) improvements were seen elsewhere in the pack.
> Runtime does not seem to have been affected, as most of the work had
> already been done when it was tossing deltas before.
>
> Some simple statistics:
>
> SBCL, standard pack-objects, window 100, depth 1000:
>   Max depth: 980
>   Mean depth: 100.223622114502
>   Median depth: 12
>   Standard deviation: 188.214331919176
>
> SBCL, patched pack-objects, window 100, depth 1000:
>   Max depth: 787
>   Mean depth: 61.5669990817656
>   Median depth: 11
>   Standard deviation: 127.644652607399

Putting aside a potential argument that the way the file in
question, version.lisp-expr, is kept track of might be insane,
this is an interesting topic.

In addition to the above stats, it may be interesting to know:

 - pack generation time and memory footprint (/usr/bin/time);

   I suspect you would have to try_delta more candidates, so
   this may degrade a bit, but that is done for getting a better
   deltification, so we would need to see if the extra cost is
   within reason and worth spending.

 - resulting pack size (ls -l pack-*.pack)

   I do not expect your change would degrade in this area, as
   you are currently not trading size with shallower delta
   depth.

Regarding your patch, I think it does not look too bad, as you
never pick delta that is larger than the best-so-far in favor of
shallower depth.

It would become worrysome (*BUT* infinitely more interesting)
once you start talking about a tradeoff between slightly larger
delta and much shorter delta.  Such a tradeoff, if done right,
would make a lot of sense, but I do not offhand think of a way
to strike a proper balance between them efficiently.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Preferring shallower deltas on repack
  2007-07-09  4:43 Preferring shallower deltas on repack Brian Downing
  2007-07-09  4:45 ` [PATCH] pack-objects: Prefer shallower deltas if the size is equal Brian Downing
  2007-07-09  5:31 ` Preferring shallower deltas on repack Junio C Hamano
@ 2007-07-09  5:41 ` Linus Torvalds
  2 siblings, 0 replies; 19+ messages in thread
From: Linus Torvalds @ 2007-07-09  5:41 UTC (permalink / raw)
  To: Brian Downing; +Cc: git



On Sun, 8 Jul 2007, Brian Downing wrote:
> 
> I modified this to prefer shallower deltas of the same size.  This made
> the deltas for this file a very wide tree with a maximum depth of about
> 65.  Other (much smaller) improvements were seen elsewhere in the pack.
> Runtime does not seem to have been affected, as most of the work had
> already been done when it was tossing deltas before.

This seems like a good thing to do, and on the face of it I think it's 
worth it. I can't see any real downsides, at least, and while the upside 
doesn't sound huge, it sounds real enough.

So here's at least an initial tentative "ack" from me.

		Linus

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Preferring shallower deltas on repack
  2007-07-09  5:31 ` Preferring shallower deltas on repack Junio C Hamano
@ 2007-07-09  5:43   ` Junio C Hamano
  2007-07-09  6:52   ` Brian Downing
  2007-07-09 15:58   ` Nicolas Pitre
  2 siblings, 0 replies; 19+ messages in thread
From: Junio C Hamano @ 2007-07-09  5:43 UTC (permalink / raw)
  To: Brian Downing; +Cc: git

Junio C Hamano <gitster@pobox.com> writes:

> It would become worrysome (*BUT* infinitely more interesting)
> once you start talking about a tradeoff between slightly larger
> delta and much shorter delta.  Such a tradeoff, if done right,

s/and much shorter delta/and much shallower depth/.

Should be obvious from the context, but just in case -- without
the above correction what I said does not make any sense ;-).

> would make a lot of sense, but I do not offhand think of a way
> to strike a proper balance between them efficiently.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Preferring shallower deltas on repack
  2007-07-09  5:31 ` Preferring shallower deltas on repack Junio C Hamano
  2007-07-09  5:43   ` Junio C Hamano
@ 2007-07-09  6:52   ` Brian Downing
  2007-07-09  7:27     ` Junio C Hamano
  2007-07-09 15:58   ` Nicolas Pitre
  2 siblings, 1 reply; 19+ messages in thread
From: Brian Downing @ 2007-07-09  6:52 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

On Sun, Jul 08, 2007 at 10:31:43PM -0700, Junio C Hamano wrote:
> Putting aside a potential argument that the way the file in
> question, version.lisp-expr, is kept track of might be insane,
> this is an interesting topic.

Yeah, that version numbering system worked quite well for CVS, given its
lack of any other kind of useful whole-tree versioning, and the fact
that there wasn't much branching and merging, due to it being a pain in
the ass.  If an when we move to something like Git, something else will
have to be done, as that file will /always/ be in conflict.

> In addition to the above stats, it may be interesting to know:
> 
>  - pack generation time and memory footprint (/usr/bin/time);
> 
>    I suspect you would have to try_delta more candidates, so
>    this may degrade a bit, but that is done for getting a better
>    deltification, so we would need to see if the extra cost is
>    within reason and worth spending.

It was already try_delta'ing everything in the window.  The only
difference now is that create_delta may generate one more byte of delta
before giving up.  That doesn't seem to have affected things at all
outside of sampling noise:

(These timings are for the Git pack on Linux/amd64, --window and --depth
both 100.  Since /usr/bin/time doesn't seem to report any useful memory
statistics on Linux, I also have a "ps aux" line from when the memory
size looked stable.  This was different from run to run but it shows the
two are in the same order of magnitude.)

Unpatched:
54.99user 0.18system 0:56.80elapsed 97%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (14major+32417minor)pagefaults 0swaps
bdowning  5290 98.7  4.5 106788 92900 pts/1    R+   01:26   0:49 git pack-obj

Patched:
55.37user 0.19system 0:56.35elapsed 98%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+32249minor)pagefaults 0swaps
bdowning  6086  100  4.5 106880 92996 pts/1    R+   01:29   0:49 git pack-obj

>  - resulting pack size (ls -l pack-*.pack)
> 
>    I do not expect your change would degrade in this area, as
>    you are currently not trading size with shallower delta
>    depth.

The patched version is actually smaller in both SBCL's and Git's case
(again, --window 100 and --depth 100):

SBCL: 61696 bytes smaller (13294225-13232529)
Git:  16010 bytes smaller (12690424-12674414)

I believe the reason for this is that more deltas can get in under the
depth limit.  If I repack the Git pack with --depth=999999999, the patched
version generates a pack that is 1793 bytes smaller.  (12334183-12332390)
(Hmm, I was expecting that to be the same, I'm not sure why it's not.
Padding?)

> Regarding your patch, I think it does not look too bad, as you
> never pick delta that is larger than the best-so-far in favor of
> shallower depth.
> 
> It would become worrysome (*BUT* infinitely more interesting)
> once you start talking about a tradeoff between slightly larger
> delta and much shorter delta.  Such a tradeoff, if done right,
> would make a lot of sense, but I do not offhand think of a way
> to strike a proper balance between them efficiently.

Yeah, I was thinking about that too, and came to the same conclusion.
I suspect you'd have to save a /lot/ of delta depth to want to pay any
more I/O, though.

Another thing that might be iffy (and complicated) is that if you keep
making a good low-depth delta off of a particular object, it might be
good to promote it so it stays in the window for longer.

-bcd

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Preferring shallower deltas on repack
  2007-07-09  6:52   ` Brian Downing
@ 2007-07-09  7:27     ` Junio C Hamano
  2007-07-09  7:36       ` Brian Downing
  0 siblings, 1 reply; 19+ messages in thread
From: Junio C Hamano @ 2007-07-09  7:27 UTC (permalink / raw)
  To: Brian Downing; +Cc: git

bdowning@lavos.net (Brian Downing) writes:

> (These timings are for the Git pack on Linux/amd64, --window and --depth
> both 100.  Since /usr/bin/time doesn't seem to report any useful memory
> statistics on Linux, I also have a "ps aux" line from when the memory
> size looked stable.  This was different from run to run but it shows the
> two are in the same order of magnitude.)
>
> Unpatched:
> 54.99user 0.18system 0:56.80elapsed 97%CPU (0avgtext+0avgdata 0maxresident)k
> 0inputs+0outputs (14major+32417minor)pagefaults 0swaps
> bdowning  5290 98.7  4.5 106788 92900 pts/1    R+   01:26   0:49 git pack-obj
>
> Patched:
> 55.37user 0.19system 0:56.35elapsed 98%CPU (0avgtext+0avgdata 0maxresident)k
> 0inputs+0outputs (0major+32249minor)pagefaults 0swaps
> bdowning  6086  100  4.5 106880 92996 pts/1    R+   01:29   0:49 git pack-obj

The number of minor faults are comparable (slightly favorable),
which is a good sign.

> The patched version is actually smaller in both SBCL's and Git's case
> (again, --window 100 and --depth 100):
>
> SBCL: 61696 bytes smaller (13294225-13232529)
> Git:  16010 bytes smaller (12690424-12674414)
>
> I believe the reason for this is that more deltas can get in under the
> depth limit.

Very sensible indeed.

>> It would become worrysome (*BUT* infinitely more interesting)
>> once you start talking about a tradeoff between slightly larger
>> delta and much shorter delta.  Such a tradeoff, if done right,
>> would make a lot of sense, but I do not offhand think of a way
>> to strike a proper balance between them efficiently.
>
> Yeah, I was thinking about that too, and came to the same conclusion.
> I suspect you'd have to save a /lot/ of delta depth to want to pay any
> more I/O, though.

That may not be so.  Deeper delta also means more I/O (and
worse, because they can be from discontiguous areas) plus delta
application.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Preferring shallower deltas on repack
  2007-07-09  7:27     ` Junio C Hamano
@ 2007-07-09  7:36       ` Brian Downing
  0 siblings, 0 replies; 19+ messages in thread
From: Brian Downing @ 2007-07-09  7:36 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

On Mon, Jul 09, 2007 at 12:27:24AM -0700, Junio C Hamano wrote:
> >> It would become worrysome (*BUT* infinitely more interesting)
> >> once you start talking about a tradeoff between slightly larger
> >> delta and much shorter delta.  Such a tradeoff, if done right,
> >> would make a lot of sense, but I do not offhand think of a way
> >> to strike a proper balance between them efficiently.
> >
> > Yeah, I was thinking about that too, and came to the same conclusion.
> > I suspect you'd have to save a /lot/ of delta depth to want to pay any
> > more I/O, though.
> 
> That may not be so.  Deeper delta also means more I/O (and
> worse, because they can be from discontiguous areas) plus delta
> application.

Good point.

I guess if I were to think about a metric for deciding whether to go for
a small deep delta or a larger shallow one, I would probably keep track
of the total path size (the sum of the base size and all the delta sizes)
for each entry, and play with a weighting formula of single delta size
verses total path size.

-bcd

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Preferring shallower deltas on repack
  2007-07-09  5:31 ` Preferring shallower deltas on repack Junio C Hamano
  2007-07-09  5:43   ` Junio C Hamano
  2007-07-09  6:52   ` Brian Downing
@ 2007-07-09 15:58   ` Nicolas Pitre
  2007-07-09 16:39     ` Junio C Hamano
  2 siblings, 1 reply; 19+ messages in thread
From: Nicolas Pitre @ 2007-07-09 15:58 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Brian Downing, git

On Sun, 8 Jul 2007, Junio C Hamano wrote:

> It would become worrysome (*BUT* infinitely more interesting)
> once you start talking about a tradeoff between slightly larger
> delta and much shorter delta.  Such a tradeoff, if done right,
> would make a lot of sense, but I do not offhand think of a way
> to strike a proper balance between them efficiently.

We already do something similar to that with max_size being a function 
of the delta depth.  This already has the effect of favoring shalower 
deltas even if deeper deltas could be smaller, and therefore creating a 
wider delta tree (see commits 4e8da195 and c3b06a69).

Maybe this max_size curve could be modified a bit to increase the bias 
towards shallow deltas even in the case where a delta was already found 
instead of the current constant flat curve.

[...]

OK here it is.  And results on the GIT repo and another patalogical test 
repo I keep around are actually really nice!  Not only the pack itself 
is a bit smaller, but the delta depth distribution as shown by 
git-verify-pack -v is much nicer with the bulk of deltas in the low 
depth end of the spectrum and no more peak at the max depth level.

Signed-off-by: Nicolas Pitre <nico@cam.org>
---

diff --git a/builtin-pack-objects.c b/builtin-pack-objects.c
index 54b9d26..1eb86ed 100644
--- a/builtin-pack-objects.c
+++ b/builtin-pack-objects.c
@@ -1332,12 +1332,14 @@ static int try_delta(struct unpacked *trg, struct unpacked *src,
 
 	/* Now some size filtering heuristics. */
 	trg_size = trg_entry->size;
-	max_size = trg_size/2 - 20;
+	if (!trg_entry->delta)
+		max_size = trg_size/2 - 20;
+	else
+		max_size = trg_entry->delta_size * max_depth /
+				(max_depth - trg_entry->depth + 1);
 	max_size = max_size * (max_depth - src_entry->depth) / max_depth;
 	if (max_size == 0)
 		return 0;
-	if (trg_entry->delta && trg_entry->delta_size <= max_size)
-		max_size = trg_entry->delta_size;
 	src_size = src_entry->size;
 	sizediff = src_size < trg_size ? trg_size - src_size : 0;
 	if (sizediff >= max_size)

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: Preferring shallower deltas on repack
  2007-07-09 15:58   ` Nicolas Pitre
@ 2007-07-09 16:39     ` Junio C Hamano
  2007-07-09 18:53       ` Brian Downing
  0 siblings, 1 reply; 19+ messages in thread
From: Junio C Hamano @ 2007-07-09 16:39 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: Brian Downing, git

Nicolas Pitre <nico@cam.org> writes:

> On Sun, 8 Jul 2007, Junio C Hamano wrote:
>
>> It would become worrysome (*BUT* infinitely more interesting)
>> once you start talking about a tradeoff between slightly larger
>> delta and much shallower depth.  Such a tradeoff, if done right,
>> would make a lot of sense, but I do not offhand think of a way
>> to strike a proper balance between them efficiently.
>
> We already do something similar to that with max_size being a function 
> of the delta depth.  ...

Yeah, we had a fun thread about that; I forgot about it until
now.

> OK here it is.  And results on the GIT repo and another patalogical test 
> repo I keep around are actually really nice!  Not only the pack itself 
> is a bit smaller, but the delta depth distribution as shown by 
> git-verify-pack -v is much nicer with the bulk of deltas in the low 
> depth end of the spectrum and no more peak at the max depth level.

Looks obviously correct.  Brian, it would be very interesting to
see what Nico's patch does to your dataset.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Preferring shallower deltas on repack
  2007-07-09 16:39     ` Junio C Hamano
@ 2007-07-09 18:53       ` Brian Downing
  2007-07-09 19:13         ` Nicolas Pitre
  2007-07-09 19:30         ` [PATCH] Shoddy pack information tool Brian Downing
  0 siblings, 2 replies; 19+ messages in thread
From: Brian Downing @ 2007-07-09 18:53 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Nicolas Pitre, git

On Mon, Jul 09, 2007 at 09:39:44AM -0700, Junio C Hamano wrote:
> > OK here it is.  And results on the GIT repo and another patalogical test 
> > repo I keep around are actually really nice!  Not only the pack itself 
> > is a bit smaller, but the delta depth distribution as shown by 
> > git-verify-pack -v is much nicer with the bulk of deltas in the low 
> > depth end of the spectrum and no more peak at the max depth level.
> 
> Looks obviously correct.  Brian, it would be very interesting to
> see what Nico's patch does to your dataset.

Nico's patch makes the overall statistics look better, but the
version.lisp-expr file still goes 593 levels deep, as opposed to about
65 with my patch.  (That's better than 980 with stock Git, though.)

Pack statistics from my shoddy analysis tool (I'll post it a bit later):

"sizes" are all object sizes in the pack.  For deltas this is just the
delta size.  "path sizes" is the size of the /path/ to each object in
the file; this is the size of the base and each patch in the chain to
the object.  This is approximately how much data you have to read to
get to an object.  "depths" should be obvious.

SBCL, stock git:
      all sizes: count 46829 total 30256118 min 0 max 1012295 mean 646.10 median 45 std_dev 9555.48
 all path sizes: count 46829 total 1551200401 min 0 max 1012295 mean 33124.78 median 11661 std_dev 55310.88
         depths: count 46829 total 4693372 min 0 max 980 mean 100.22 median 12 std_dev 188.21

SBCL, my patch:
      all sizes: count 46829 total 30251762 min 0 max 1012295 mean 646.00 median 45 std_dev 9555.48
 all path sizes: count 46829 total 1529629918 min 0 max 1012295 mean 32664.16 median 11213 std_dev 54930.06
         depths: count 46829 total 2883121 min 0 max 787 mean 61.57 median 11 std_dev 127.64

SBCL, Nico's patch:
      all sizes: count 46829 total 30253345 min 0 max 1012295 mean 646.04 median 45 std_dev 9555.49
 all path sizes: count 46829 total 1518730701 min 0 max 1012295 mean 32431.41 median 10819 std_dev 54751.35
         depths: count 46829 total 3694511 min 0 max 699 mean 78.89 median 12 std_dev 141.53

I'm vaguely working on an alternate weighting mechanism based on path
sizes, but so far all I've been able to do is generate some really
strange packs.  :)

-bcd

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Preferring shallower deltas on repack
  2007-07-09 18:53       ` Brian Downing
@ 2007-07-09 19:13         ` Nicolas Pitre
  2007-07-09 19:24           ` Brian Downing
  2007-07-09 19:30         ` [PATCH] Shoddy pack information tool Brian Downing
  1 sibling, 1 reply; 19+ messages in thread
From: Nicolas Pitre @ 2007-07-09 19:13 UTC (permalink / raw)
  To: Brian Downing; +Cc: Junio C Hamano, git

On Mon, 9 Jul 2007, Brian Downing wrote:

> On Mon, Jul 09, 2007 at 09:39:44AM -0700, Junio C Hamano wrote:
> > > OK here it is.  And results on the GIT repo and another patalogical test 
> > > repo I keep around are actually really nice!  Not only the pack itself 
> > > is a bit smaller, but the delta depth distribution as shown by 
> > > git-verify-pack -v is much nicer with the bulk of deltas in the low 
> > > depth end of the spectrum and no more peak at the max depth level.
> > 
> > Looks obviously correct.  Brian, it would be very interesting to
> > see what Nico's patch does to your dataset.
> 
> Nico's patch makes the overall statistics look better, but the
> version.lisp-expr file still goes 593 levels deep, as opposed to about
> 65 with my patch.  (That's better than 980 with stock Git, though.)

Tuning for such an extreme without impacting normal cases is rather 
hard.

My patch was meant to be used on top of yours.  Is that what you tested?

Also I'd suggest you do not use a max depth of 1000.  It is simply 
insane and might possibly make the existing logic less effective.  Even 
for runtime pack access you want it to be reasonably short, say 100 
maximum, or even the current default of 50.  Any improvements you might 
come with (like automatic depth determined on replay cost) should be 
compared with that default which is known to work well already.


Nicolas

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Preferring shallower deltas on repack
  2007-07-09 19:13         ` Nicolas Pitre
@ 2007-07-09 19:24           ` Brian Downing
  2007-07-09 19:49             ` Brian Downing
  0 siblings, 1 reply; 19+ messages in thread
From: Brian Downing @ 2007-07-09 19:24 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: Junio C Hamano, git

On Mon, Jul 09, 2007 at 03:13:54PM -0400, Nicolas Pitre wrote:
> Tuning for such an extreme without impacting normal cases is rather 
> hard.
> 
> My patch was meant to be used on top of yours.  Is that what you tested?
> 
> Also I'd suggest you do not use a max depth of 1000.  It is simply 
> insane and might possibly make the existing logic less effective.  Even 
> for runtime pack access you want it to be reasonably short, say 100 
> maximum, or even the current default of 50.  Any improvements you might 
> come with (like automatic depth determined on replay cost) should be 
> compared with that default which is known to work well already.

No, I didn't try it on top of mine; sorry.  I'll try that out.

I realize a depth of 1000 is nuts, I'm just using it to expose any
suboptimal delta selection for the nasty versions.lisp-expr case, since
I know that with a window size of 100 it should be easy to keep it from
growing too deep.

-bcd

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH] Shoddy pack information tool
  2007-07-09 18:53       ` Brian Downing
  2007-07-09 19:13         ` Nicolas Pitre
@ 2007-07-09 19:30         ` Brian Downing
  2007-07-11 21:55           ` Junio C Hamano
  1 sibling, 1 reply; 19+ messages in thread
From: Brian Downing @ 2007-07-09 19:30 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Nicolas Pitre, git

This tool will print vaguely pretty information about a pack.  It
expects the output of "git-verify-pack -v" as input on stdin.

$ git-verify-pack -v pack | packinfo.pl

This prints some full-pack statistics; currently "all sizes", "all path
sizes", "tree sizes", "tree path sizes", and "depths".

* "all sizes" stats are across every object size in the file;
  full sizes for base objects, and delta size for deltas.
* "all path sizes" stats are across all object's "path sizes".
  A path size is the sum of the size of the delta chain, including the
  base object.  In other words, it's how many bytes need be read to
  reassemble the file from deltas.
* "tree sizes" are object sizes grouped into delta trees.
* "tree path sizes" are path sizes grouped into delta trees.
* "depths" should be obvious.

When run as:

$ git-verify-pack -v pack | packinfo.pl -tree

The trees of objects are output along with the stats.  This looks like:

  0 commit 031321c6...      803      803

  0   blob 03156f21...     1767     1767
  1    blob f52a9d7f...       10     1777
  2     blob a8cc5739...       51     1828
  3      blob 660e90b1...       15     1843
  4       blob 0cb8e3bb...       33     1876
  2     blob e48607f0...      311     2088
     size: count 6 total 2187 min 10 max 1767 mean 364.50 median 51 std_dev 635.85
path size: count 6 total 11179 min 1767 max 2088 mean 1863.17 median 1843 std_dev 107.26

The first number after the sha1 is the object size, the second number
is the path size.  The statistics are across all objects in the previous
delta tree.  Obviously they are omitted for trees of one object.

When run as:

$ git-verify-pack -v pack | packinfo.pl -tree -filenames

It adds filenames to the tree.  Getting this information is somewhat
slow:

  0   blob 03156f21...     1767     1767 Documentation/git-lost-found.txt @ tags/v1.2.0~142
  1    blob f52a9d7f...       10     1777 Documentation/git-lost-found.txt @ tags/v1.5.0-rc1~74
  2     blob a8cc5739...       51     1828 Documentation/git-lost+found.txt @ tags/v0.99.9h^0
  3      blob 660e90b1...       15     1843 Documentation/git-lost+found.txt @ master~3222^2~2
  4       blob 0cb8e3bb...       33     1876 Documentation/git-lost+found.txt @ master~3222^2~3
  2     blob e48607f0...      311     2088 Documentation/git-lost-found.txt @ tags/v1.5.2-rc3~4
     size: count 6 total 2187 min 10 max 1767 mean 364.50 median 51 std_dev 635.85
path size: count 6 total 11179 min 1767 max 2088 mean 1863.17 median 1843 std_dev 107.26

When run as:

$ git-verify-pack -v pack | packinfo.pl -dump

It prints out "sha1 size pathsize depth" for each sha1 in lexical order.

000079a2eaef17b7eae70e1f0f635557ea67b644 30 472 7
00013cafe6980411aa6fdd940784917b5ff50f0a 44 1542 4
000182eacf99cde27d5916aa415921924b82972c 499 499 0
...

This is handy for comparing two packs.
---
 packinfo.pl |  141 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 141 insertions(+), 0 deletions(-)
 create mode 100644 packinfo.pl

diff --git a/packinfo.pl b/packinfo.pl
new file mode 100644
index 0000000..6abb7f2
--- /dev/null
+++ b/packinfo.pl
@@ -0,0 +1,141 @@
+#!/bin/perl
+
+use strict;
+use Getopt::Long;
+
+my $filenames = 0;
+my $tree = 0;
+my $dump = 0;
+GetOptions("tree" => \$tree,
+           "filenames" => \$filenames,
+           "dump" => \$dump);
+
+my $parents = {};
+my $children = {};
+my $sizes = {};
+my $roots = [];
+my $paths = {};
+my $types = {};
+my $commits = [];
+my $names = {};
+my $depths = {};
+my @depths;
+
+while (<STDIN>) {
+    my ($sha1, $type, $size, $offset, $depth, $parent) = split(/\s+/, $_);
+    next unless ($sha1 =~ /^[0-9a-f]{40}$/);
+    $depths->{$sha1} = $depth || 0;
+    push(@depths, $depth || 0);
+    push(@$commits, $sha1) if ($type eq 'commit');
+    push(@$roots, $sha1) unless $parent;
+    $parents->{$sha1} = $parent;
+    $types->{$sha1} = $type;
+    push(@{$children->{$parent}}, $sha1);
+    $sizes->{$sha1} = $size;
+}
+
+if ($filenames && $tree) {
+    open(NAMES, "git-name-rev --all|");
+    while (<NAMES>) {
+        if (/^(\S+)\s+(.*)$/) {
+            my ($sha1, $name) = ($1, $2);
+            $names->{$sha1} = $name;
+        }
+    }
+    close NAMES;
+
+    for my $commit (@$commits) {
+        my $name = $names->{$commit};
+        open(TREE, "git-ls-tree -t -r $commit|");
+        print STDERR "Plumbing tree $name\n";
+        while (<TREE>) {
+            if (/^(\S+)\s+(\S+)\s+(\S+)\s+(.*)$/) {
+                my ($mode, $type, $sha1, $path) = ($1, $2, $3, $4);
+                $paths->{$sha1} = "$path @ $name";
+            }
+        }
+        close TREE;
+    }
+}
+
+sub stats {
+    my @data = sort {$a <=> $b} @_;
+    my $min = $data[0];
+    my $max = $data[$#data];
+    my $total = 0;
+    my $count = scalar @data;
+    for my $datum (@data) {
+        $total += $datum;
+    }
+    my $mean = $total / $count;
+    my $median = $data[int(@data / 2)];
+    my $diff_sum = 0;
+    for my $datum (@data) {
+        $diff_sum += ($datum - $mean)**2;
+    }
+    my $std_dev = sqrt($diff_sum / $count);
+    return ($count, $total, $min, $max, $mean, $median, $std_dev);
+}
+
+sub print_stats {
+    my $name = shift;
+    my ($count, $total, $min, $max, $mean, $median, $std_dev) = stats(@_);
+    printf("%s: count %s total %s min %s max %s mean %.2f median %s std_dev %.2f\n",
+           $name, $count, $total, $min, $max, $mean, $median, $std_dev);
+}
+
+my @sizes;
+my @path_sizes;
+my @all_sizes;
+my @all_path_sizes;
+my $path_sizes = {};
+
+sub dig {
+    my ($sha1, $depth, $path_size) = @_;
+    $path_size += $sizes->{$sha1};
+    push(@sizes, $sizes->{$sha1});
+    push(@all_sizes, $sizes->{$sha1});
+    push(@path_sizes, $path_size);
+    push(@all_path_sizes, $path_size);
+    $path_sizes->{$sha1} = $path_size;
+    if ($tree) {
+        printf("%3d%s %6s %s %8d %8d %s\n",
+               $depth, (" " x $depth), $types->{$sha1}, 
+               $sha1, $sizes->{$sha1}, $path_size, $paths->{$sha1});
+    }
+    for my $child (@{$children->{$sha1}}) {
+        dig($child, $depth + 1, $path_size);
+    }
+}
+
+my @tree_sizes;
+my @tree_path_sizes;
+
+for my $root (@$roots) {
+    undef @sizes;
+    undef @path_sizes;
+    dig($root, 0, 0);
+    my ($aa, $sz_total) = stats(@sizes);
+    my ($bb, $psz_total) = stats(@path_sizes);
+    push(@tree_sizes, $sz_total);
+    push(@tree_path_sizes, $psz_total);
+    if ($tree) {
+        if (@sizes > 1) {
+            print_stats("     size", @sizes);
+            print_stats("path size", @path_sizes);
+        }
+        print "\n";
+    }
+}
+
+if ($dump) {
+    for my $sha1 (sort keys %$sizes) {
+        print "$sha1 $sizes->{$sha1} $path_sizes->{$sha1} $depths->{$sha1}\n";
+    }
+} else {
+    print_stats("      all sizes", @all_sizes);
+    print_stats(" all path sizes", @all_path_sizes);
+    print_stats("     tree sizes", @tree_sizes);
+    print_stats("tree path sizes", @tree_path_sizes);
+    print_stats("         depths", @depths);
+}
-- 
1.5.2.GIT

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: Preferring shallower deltas on repack
  2007-07-09 19:24           ` Brian Downing
@ 2007-07-09 19:49             ` Brian Downing
  2007-07-09 20:22               ` Nicolas Pitre
  2007-07-09 20:23               ` Brian Downing
  0 siblings, 2 replies; 19+ messages in thread
From: Brian Downing @ 2007-07-09 19:49 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: Junio C Hamano, git

On Mon, Jul 09, 2007 at 02:24:03PM -0500, Brian Downing wrote:
> No, I didn't try it on top of mine; sorry.  I'll try that out.

The results with both your patch and mine are exactly the same as
yours applied to master (c956395e).  The only thing mine adds on top of
yours is:

diff --git a/builtin-pack-objects.c b/builtin-pack-objects.c
index 9a33698..2da78b4 100644
--- a/builtin-pack-objects.c
+++ b/builtin-pack-objects.c
@@ -1373,6 +1373,12 @@ static int try_delta(struct unpacked *trg, struct unpacke
                return 0;

        if (trg_entry->delta_data) {
+               /* Prefer only shallower same-sized deltas. */
+               if (delta_size == trg_entry->delta_size &&
+                   src_entry->depth + 1 >= trg_entry->depth) {
+                       free(delta_buf);
+                       return 0;
+               }
                delta_cache_size -= trg_entry->delta_size;
                free(trg_entry->delta_data);
        }

Which was meant to pick off the cases where I got an equivalently
sized patch that was as deep or deeper.  This was neccessary due to
my changing:

--- a/builtin-pack-objects.c
+++ b/builtin-pack-objects.c
@@ -1337,7 +1337,7 @@ static int try_delta(struct unpacked *trg, struct unpacked
        if (max_size == 0)
                return 0;
        if (trg_entry->delta && trg_entry->delta_size <= max_size)
-               max_size = trg_entry->delta_size-1;
+               max_size = trg_entry->delta_size;
        src_size = src_entry->size;
        sizediff = src_size < trg_size ? trg_size - src_size : 0;
        if (sizediff >= max_size)

	max_size = trg_entry->delta_size;

Your patch changed the max_size selection logic, so I'm not sure the
rest of mine will do anything anymore.

-bcd

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: Preferring shallower deltas on repack
  2007-07-09 19:49             ` Brian Downing
@ 2007-07-09 20:22               ` Nicolas Pitre
  2007-07-09 20:23               ` Brian Downing
  1 sibling, 0 replies; 19+ messages in thread
From: Nicolas Pitre @ 2007-07-09 20:22 UTC (permalink / raw)
  To: Brian Downing; +Cc: Junio C Hamano, git

On Mon, 9 Jul 2007, Brian Downing wrote:

> On Mon, Jul 09, 2007 at 02:24:03PM -0500, Brian Downing wrote:
> > No, I didn't try it on top of mine; sorry.  I'll try that out.
> 
> The results with both your patch and mine are exactly the same as
> yours applied to master (c956395e).  The only thing mine adds on top of
> yours is:
> 
> diff --git a/builtin-pack-objects.c b/builtin-pack-objects.c
> index 9a33698..2da78b4 100644
> --- a/builtin-pack-objects.c
> +++ b/builtin-pack-objects.c
> @@ -1373,6 +1373,12 @@ static int try_delta(struct unpacked *trg, struct unpacke
>                 return 0;
> 
>         if (trg_entry->delta_data) {
> +               /* Prefer only shallower same-sized deltas. */
> +               if (delta_size == trg_entry->delta_size &&
> +                   src_entry->depth + 1 >= trg_entry->depth) {
> +                       free(delta_buf);
> +                       return 0;
> +               }
>                 delta_cache_size -= trg_entry->delta_size;
>                 free(trg_entry->delta_data);
>         }
> 
> Which was meant to pick off the cases where I got an equivalently
> sized patch that was as deep or deeper.  This was neccessary due to
> my changing:
> 
> --- a/builtin-pack-objects.c
> +++ b/builtin-pack-objects.c
> @@ -1337,7 +1337,7 @@ static int try_delta(struct unpacked *trg, struct unpacked
>         if (max_size == 0)
>                 return 0;
>         if (trg_entry->delta && trg_entry->delta_size <= max_size)
> -               max_size = trg_entry->delta_size-1;
> +               max_size = trg_entry->delta_size;
>         src_size = src_entry->size;
>         sizediff = src_size < trg_size ? trg_size - src_size : 0;
>         if (sizediff >= max_size)
> 
> 	max_size = trg_entry->delta_size;
> 
> Your patch changed the max_size selection logic, so I'm not sure the
> rest of mine will do anything anymore.

It is still valid nevertheless.


Nicolas

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Preferring shallower deltas on repack
  2007-07-09 19:49             ` Brian Downing
  2007-07-09 20:22               ` Nicolas Pitre
@ 2007-07-09 20:23               ` Brian Downing
  1 sibling, 0 replies; 19+ messages in thread
From: Brian Downing @ 2007-07-09 20:23 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: Junio C Hamano, git

On Mon, Jul 09, 2007 at 02:49:32PM -0500, Brian Downing wrote:
> The results with both your patch and mine are exactly the same as
> yours applied to master (c956395e).

Sorry, that's not quite right.  The statistics were close enough that my
vdiff failed, though.  :)

Here are all the diffs through my pack analyzer.  There's amazingly few
and they very nearly cancel each other out:

--- treetree-nico	2007-07-09 15:19:10.000000000 -0500
+++ treetree-nicobd	2007-07-09 15:18:45.000000000 -0500
@@ -25392,12 +25392,15 @@
   0 commit 7cde9fabcd145901785a468a87108f7d9c4291fc      544      544 
 
   0   blob 7ce2b21be4ac425a8423ec69c6036a53311df5cb     2347     2347 
+  1    blob 25140bfb264e7b9c8fe4b97318297630cc93b112      313     2660 
+  2     blob ba80ac333acefc68362e66d6f8e61531263e135a       15     2675 
+  3      blob 6f18434f6815a037d6fa15c38d3f460b530e585b      161     2836 
   1    blob 4dbd333da74ad866f0bff154d6cffea11b9a841a        9     2356 
   2     blob ddb93ae086bf0393c9d53247732a2c29af33c80c      313     2669 
   3      blob d34e346a0f2ea6fdb2acc414d10635dc4d326d25       11     2680 
   4       blob 368243c09993f7ee8b56c7ba184bef134a91607c       11     2691 
-     size: count 5 total 2691 min 9 max 2347 mean 538.20 median 11 std_dev 911.97
-path size: count 5 total 12743 min 2347 max 2691 mean 2548.60 median 2669 std_dev 161.11
+     size: count 8 total 3180 min 9 max 2347 mean 397.50 median 161 std_dev 747.23
+path size: count 8 total 20914 min 2347 max 2836 mean 2614.25 median 2675 std_dev 160.58
 
   0 commit 7ce2c42adf3d62f03086de940adaee48e6161a40      579      579 
 
@@ -47249,6 +47252,9 @@
   0 commit d5319592583dda6833b74b34b52dbd2aa3109948      468      468 
 
   0   tree d53239e82aaed72745097f00b32807f2eb71997d      104      104 
+  1    tree 9c01548d14e5b766397e6d5595566269094057bb        5      109 
+     size: count 2 total 109 min 5 max 104 mean 54.50 median 104 std_dev 49.50
+path size: count 2 total 213 min 104 max 109 mean 106.50 median 109 std_dev 2.50
 
   0 commit d5393dd736972a5c84cd97fec9892cd3da80b669      450      450 
 
@@ -48702,9 +48708,6 @@
 path size: count 16 total 111347 min 6404 max 7298 mean 6959.19 median 7142 std_dev 333.27
 
   0   tree df1a6239457293813f34eee001187d725e718062      104      104 
-  1    tree 9c01548d14e5b766397e6d5595566269094057bb        5      109 
-     size: count 2 total 109 min 5 max 104 mean 54.50 median 104 std_dev 49.50
-path size: count 2 total 213 min 104 max 109 mean 106.50 median 109 std_dev 2.50
 
   0 commit df1eebdf125384f3bf7eb2b15b874043504797e6      364      364 
 
@@ -49407,10 +49410,14 @@
   0   blob e4285f4b5739d4a2385b28c4e4fd60cc40beb2a1      217      217 
 
   0   blob e42a4ea85cbac543f20fba2e1cb95b35c3a8a553     1021     1021 
+  1    blob 9a7565236131c7c2b0f4bc5ae8ddb79cfc0eb418      112     1133 
+  2     blob 9d8583de4217d0515125464300321f20539355e1       65     1198 
+  2     blob f8994669d23fa5c008cfdf58ea63a0b88e46f651       14     1147 
+  3      blob 3eec762b65518ffe72ffb29cd30a1bb481f6e898       19     1166 
   1    blob ae1a1001c0d71a8407a3432ff9e2ba7dab59166b       43     1064 
   2     blob 3127f77e001d0b45359254ca59af9f6a62401d77       12     1076 
-     size: count 3 total 1076 min 12 max 1021 mean 358.67 median 43 std_dev 468.51
-path size: count 3 total 3161 min 1021 max 1076 mean 1053.67 median 1064 std_dev 23.61
+     size: count 7 total 1286 min 12 max 1021 mean 183.71 median 43 std_dev 343.41
+path size: count 7 total 7805 min 1021 max 1198 mean 1115.00 median 1133 std_dev 58.30
 
   0   blob e431cf36086c97b6e22eb741e7566100d3133c42     4766     4766 
   1    blob 344136d826427f400d93240f12b8c070d700ece2      448     5214 
@@ -52503,13 +52510,9 @@
   2     blob 651668cdb7db56dba200db8a53732ac984aaac04       12     4524 
   3      blob b31d0f38830ff8fdcdfd81b386993bc2b2216a38       16     4540 
   4       blob ec7899967381c5993e45df411077b598c7d25831       12     4552 
-  1    blob 9a7565236131c7c2b0f4bc5ae8ddb79cfc0eb418      112     4567 
-  2     blob 9d8583de4217d0515125464300321f20539355e1       65     4632 
-  2     blob f8994669d23fa5c008cfdf58ea63a0b88e46f651       14     4581 
-  3      blob 3eec762b65518ffe72ffb29cd30a1bb481f6e898       19     4600 
   1    blob e8a23b634fef5a9d906e3185d65fae8979612851       20     4475 
-     size: count 60 total 7628 min 11 max 4455 mean 127.13 median 20 std_dev 569.75
-path size: count 60 total 298378 min 4455 max 5455 mean 4972.97 median 5017 std_dev 318.40
+     size: count 56 total 7418 min 11 max 4455 mean 132.46 median 20 std_dev 589.29
+path size: count 56 total 279998 min 4455 max 5455 mean 4999.96 median 5031 std_dev 312.48
 
   0   tree ee1ece9ab33edbe7ec1eebcbe9b259c7a9dded4c      196      196 
 
@@ -54809,16 +54812,13 @@
   0 commit fb76e3acd8b8a53cdadaa65bce1d090d99e004a0      324      324 
 
   0   blob fb7b9c91f11921e2d3ac5b20d2c10728dd5ba173     2417     2417 
-  1    blob 25140bfb264e7b9c8fe4b97318297630cc93b112      313     2730 
-  2     blob ba80ac333acefc68362e66d6f8e61531263e135a       15     2745 
-  3      blob 6f18434f6815a037d6fa15c38d3f460b530e585b      161     2906 
   1    blob f0d45243029e1e1a9ef4d499dcb5934de4c523b9       11     2428 
   2     blob c9eab11115a061a457bb84e0a4b0fa1a140cf0da       41     2469 
   3      blob b54b3745c995f8284d1700a8f9b61eed0ed7548e        7     2476 
   1    blob fb107f406660f5faff8d00d060af1478ad21a6bf       15     2432 
   2     blob 829950c079edad38eeaeb4663ea8e96b4d6fedb7        7     2439 
-     size: count 9 total 2987 min 7 max 2417 mean 331.89 median 15 std_dev 743.62
-path size: count 9 total 23042 min 2417 max 2906 mean 2560.22 median 2469 std_dev 172.26
+     size: count 6 total 2498 min 7 max 2417 mean 416.33 median 15 std_dev 894.80
+path size: count 6 total 14661 min 2417 max 2476 mean 2443.50 median 2439 std_dev 21.61
 
   0 commit fb8533122551bbb7aea669f40bc91c1211809b58      778      778 
 
@@ -56646,7 +56646,7 @@
   0 commit ffe8d65266ed7c2c67a0a6ce7ff0de633000037e      474      474 
 
       all sizes: count 46829 total 30253345 min 0 max 1012295 mean 646.04 median 45 std_dev 9555.49
- all path sizes: count 46829 total 1518730701 min 0 max 1012295 mean 32431.41 median 10819 std_dev 54751.35
+ all path sizes: count 46829 total 1518716755 min 0 max 1012295 mean 32431.12 median 10819 std_dev 54751.51
      tree sizes: count 6442 total 30253345 min 0 max 1012295 mean 4696.27 median 422 std_dev 28603.08
-tree path sizes: count 6442 total 1518730701 min 0 max 306510684 mean 235754.53 median 458 std_dev 4198348.32
+tree path sizes: count 6442 total 1518716755 min 0 max 306510684 mean 235752.37 median 458 std_dev 4198348.25
          depths: count 46829 total 3694511 min 0 max 699 mean 78.89 median 12 std_dev 141.53

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] Shoddy pack information tool
  2007-07-09 19:30         ` [PATCH] Shoddy pack information tool Brian Downing
@ 2007-07-11 21:55           ` Junio C Hamano
  2007-07-12  3:02             ` [PATCH] Pack " Brian Downing
  0 siblings, 1 reply; 19+ messages in thread
From: Junio C Hamano @ 2007-07-11 21:55 UTC (permalink / raw)
  To: Brian Downing; +Cc: Nicolas Pitre, git

I do not think this is particularlly Shoddy.  Care to move it
somewhere in contrib/ (say, contrib/stats/packinfo.pl) and sign
off the patch?

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH] Pack information tool
  2007-07-11 21:55           ` Junio C Hamano
@ 2007-07-12  3:02             ` Brian Downing
  0 siblings, 0 replies; 19+ messages in thread
From: Brian Downing @ 2007-07-12  3:02 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Brian Downing, Nicolas Pitre, git

This tool will print vaguely pretty information about a pack.  It
expects the output of "git-verify-pack -v" as input on stdin.

$ git-verify-pack -v | packinfo.pl

See the documentation in the script (contrib/stats/packinfo.pl)
for more information.

Signed-off-by: Brian Downing <bdowning@lavos.net>
---
   I've been fighting with the mail configuraton on my laptop so that 
   "git send-email" can send email that vger will accept.  My apologies 
   if anybody got dupes or mangled messages.

 contrib/stats/packinfo.pl |  212 +++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 212 insertions(+), 0 deletions(-)
 create mode 100755 contrib/stats/packinfo.pl

diff --git a/contrib/stats/packinfo.pl b/contrib/stats/packinfo.pl
new file mode 100755
index 0000000..792673a
--- /dev/null
+++ b/contrib/stats/packinfo.pl
@@ -0,0 +1,212 @@
+#!/bin/perl
+#
+# This tool will print vaguely pretty information about a pack.  It
+# expects the output of "git-verify-pack -v" as input on stdin.
+#
+# $ git-verify-pack -v | packinfo.pl
+#
+# This prints some full-pack statistics; currently "all sizes", "all
+# path sizes", "tree sizes", "tree path sizes", and "depths".
+#
+# * "all sizes" stats are across every object size in the file;
+#   full sizes for base objects, and delta size for deltas.
+# * "all path sizes" stats are across all object's "path sizes".
+#   A path size is the sum of the size of the delta chain, including the
+#   base object.  In other words, it's how many bytes need be read to
+#   reassemble the file from deltas.
+# * "tree sizes" are object sizes grouped into delta trees.
+# * "tree path sizes" are path sizes grouped into delta trees.
+# * "depths" should be obvious.
+#
+# When run as:
+#
+# $ git-verify-pack -v | packinfo.pl -tree
+#
+# the trees of objects are output along with the stats.  This looks
+# like:
+#
+#   0 commit 031321c6...      803      803
+#
+#   0   blob 03156f21...     1767     1767
+#   1    blob f52a9d7f...       10     1777
+#   2     blob a8cc5739...       51     1828
+#   3      blob 660e90b1...       15     1843
+#   4       blob 0cb8e3bb...       33     1876
+#   2     blob e48607f0...      311     2088
+#      size: count 6 total 2187 min 10 max 1767 mean 364.50 median 51 std_dev 635.85
+# path size: count 6 total 11179 min 1767 max 2088 mean 1863.17 median 1843 std_dev 107.26
+#
+# The first number after the sha1 is the object size, the second
+# number is the path size.  The statistics are across all objects in
+# the previous delta tree.  Obviously they are omitted for trees of
+# one object.
+#
+# When run as:
+#
+# $ git-verify-pack -v | packinfo.pl -tree -filenames
+#
+# it adds filenames to the tree.  Getting this information is slow:
+#
+#   0   blob 03156f21...     1767     1767 Documentation/git-lost-found.txt @ tags/v1.2.0~142
+#   1    blob f52a9d7f...       10     1777 Documentation/git-lost-found.txt @ tags/v1.5.0-rc1~74
+#   2     blob a8cc5739...       51     1828 Documentation/git-lost+found.txt @ tags/v0.99.9h^0
+#   3      blob 660e90b1...       15     1843 Documentation/git-lost+found.txt @ master~3222^2~2
+#   4       blob 0cb8e3bb...       33     1876 Documentation/git-lost+found.txt @ master~3222^2~3
+#   2     blob e48607f0...      311     2088 Documentation/git-lost-found.txt @ tags/v1.5.2-rc3~4
+#      size: count 6 total 2187 min 10 max 1767 mean 364.50 median 51 std_dev 635.85
+# path size: count 6 total 11179 min 1767 max 2088 mean 1863.17 median 1843 std_dev 107.26
+#
+# When run as:
+#
+# $ git-verify-pack -v | packinfo.pl -dump
+#
+# it prints out "sha1 size pathsize depth" for each sha1 in lexical
+# order.
+#
+# 000079a2eaef17b7eae70e1f0f635557ea67b644 30 472 7
+# 00013cafe6980411aa6fdd940784917b5ff50f0a 44 1542 4
+# 000182eacf99cde27d5916aa415921924b82972c 499 499 0
+# ...
+#
+# This is handy for comparing two packs.  Adding "-filenames" will add
+# filenames, as per "-tree -filenames" above.
+
+use strict;
+use Getopt::Long;
+
+my $filenames = 0;
+my $tree = 0;
+my $dump = 0;
+GetOptions("tree" => \$tree,
+           "filenames" => \$filenames,
+           "dump" => \$dump);
+
+my %parents;
+my %children;
+my %sizes;
+my @roots;
+my %paths;
+my %types;
+my @commits;
+my %names;
+my %depths;
+my @depths;
+
+while (<STDIN>) {
+    my ($sha1, $type, $size, $offset, $depth, $parent) = split(/\s+/, $_);
+    next unless ($sha1 =~ /^[0-9a-f]{40}$/);
+    $depths{$sha1} = $depth || 0;
+    push(@depths, $depth || 0);
+    push(@commits, $sha1) if ($type eq 'commit');
+    push(@roots, $sha1) unless $parent;
+    $parents{$sha1} = $parent;
+    $types{$sha1} = $type;
+    push(@{$children{$parent}}, $sha1);
+    $sizes{$sha1} = $size;
+}
+
+if ($filenames && ($tree || $dump)) {
+    open(NAMES, "git-name-rev --all|");
+    while (<NAMES>) {
+        if (/^(\S+)\s+(.*)$/) {
+            my ($sha1, $name) = ($1, $2);
+            $names{$sha1} = $name;
+        }
+    }
+    close NAMES;
+
+    for my $commit (@commits) {
+        my $name = $names{$commit};
+        open(TREE, "git-ls-tree -t -r $commit|");
+        print STDERR "Plumbing tree $name\n";
+        while (<TREE>) {
+            if (/^(\S+)\s+(\S+)\s+(\S+)\s+(.*)$/) {
+                my ($mode, $type, $sha1, $path) = ($1, $2, $3, $4);
+                $paths{$sha1} = "$path @ $name";
+            }
+        }
+        close TREE;
+    }
+}
+
+sub stats {
+    my @data = sort {$a <=> $b} @_;
+    my $min = $data[0];
+    my $max = $data[$#data];
+    my $total = 0;
+    my $count = scalar @data;
+    for my $datum (@data) {
+        $total += $datum;
+    }
+    my $mean = $total / $count;
+    my $median = $data[int(@data / 2)];
+    my $diff_sum = 0;
+    for my $datum (@data) {
+        $diff_sum += ($datum - $mean)**2;
+    }
+    my $std_dev = sqrt($diff_sum / $count);
+    return ($count, $total, $min, $max, $mean, $median, $std_dev);
+}
+
+sub print_stats {
+    my $name = shift;
+    my ($count, $total, $min, $max, $mean, $median, $std_dev) = stats(@_);
+    printf("%s: count %s total %s min %s max %s mean %.2f median %s std_dev %.2f\n",
+           $name, $count, $total, $min, $max, $mean, $median, $std_dev);
+}
+
+my @sizes;
+my @path_sizes;
+my @all_sizes;
+my @all_path_sizes;
+my %path_sizes;
+
+sub dig {
+    my ($sha1, $depth, $path_size) = @_;
+    $path_size += $sizes{$sha1};
+    push(@sizes, $sizes{$sha1});
+    push(@all_sizes, $sizes{$sha1});
+    push(@path_sizes, $path_size);
+    push(@all_path_sizes, $path_size);
+    $path_sizes{$sha1} = $path_size;
+    if ($tree) {
+        printf("%3d%s %6s %s %8d %8d %s\n",
+               $depth, (" " x $depth), $types{$sha1},
+               $sha1, $sizes{$sha1}, $path_size, $paths{$sha1});
+    }
+    for my $child (@{$children{$sha1}}) {
+        dig($child, $depth + 1, $path_size);
+    }
+}
+
+my @tree_sizes;
+my @tree_path_sizes;
+
+for my $root (@roots) {
+    undef @sizes;
+    undef @path_sizes;
+    dig($root, 0, 0);
+    my ($aa, $sz_total) = stats(@sizes);
+    my ($bb, $psz_total) = stats(@path_sizes);
+    push(@tree_sizes, $sz_total);
+    push(@tree_path_sizes, $psz_total);
+    if ($tree) {
+        if (@sizes > 1) {
+            print_stats("     size", @sizes);
+            print_stats("path size", @path_sizes);
+        }
+        print "\n";
+    }
+}
+
+if ($dump) {
+    for my $sha1 (sort keys %sizes) {
+        print "$sha1 $sizes{$sha1} $path_sizes{$sha1} $depths{$sha1} $paths{$sha1}\n";
+    }
+} else {
+    print_stats("      all sizes", @all_sizes);
+    print_stats(" all path sizes", @all_path_sizes);
+    print_stats("     tree sizes", @tree_sizes);
+    print_stats("tree path sizes", @tree_path_sizes);
+    print_stats("         depths", @depths);
+}
-- 
1.5.2.GIT

^ permalink raw reply related	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2007-07-12  3:10 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2007-07-09  4:43 Preferring shallower deltas on repack Brian Downing
2007-07-09  4:45 ` [PATCH] pack-objects: Prefer shallower deltas if the size is equal Brian Downing
2007-07-09  5:31 ` Preferring shallower deltas on repack Junio C Hamano
2007-07-09  5:43   ` Junio C Hamano
2007-07-09  6:52   ` Brian Downing
2007-07-09  7:27     ` Junio C Hamano
2007-07-09  7:36       ` Brian Downing
2007-07-09 15:58   ` Nicolas Pitre
2007-07-09 16:39     ` Junio C Hamano
2007-07-09 18:53       ` Brian Downing
2007-07-09 19:13         ` Nicolas Pitre
2007-07-09 19:24           ` Brian Downing
2007-07-09 19:49             ` Brian Downing
2007-07-09 20:22               ` Nicolas Pitre
2007-07-09 20:23               ` Brian Downing
2007-07-09 19:30         ` [PATCH] Shoddy pack information tool Brian Downing
2007-07-11 21:55           ` Junio C Hamano
2007-07-12  3:02             ` [PATCH] Pack " Brian Downing
2007-07-09  5:41 ` Preferring shallower deltas on repack Linus Torvalds

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).