git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] Add --no-reuse-delta, --window, and --depth options to git-gc
@ 2007-05-08  2:54 Theodore Ts'o
  2007-05-08  3:13 ` Nicolas Pitre
  2007-05-08 13:28 ` [PATCH] Add --no-reuse-delta, --window, and --depth options to Theodore Ts'o
  0 siblings, 2 replies; 40+ messages in thread
From: Theodore Ts'o @ 2007-05-08  2:54 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

Sometimes users might want to use more aggressive packing options
when doing a git-gc.  This allows them to do so without having
to use the low-level plumbing commands.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
---
 Documentation/git-gc.txt           |    3 ++-
 Documentation/git-pack-objects.txt |   19 +------------------
 Documentation/pack-options.txt     |   18 ++++++++++++++++++
 builtin-gc.c                       |   30 ++++++++++++++++++++++++++++--
 4 files changed, 49 insertions(+), 21 deletions(-)
 create mode 100644 Documentation/pack-options.txt

diff --git a/Documentation/git-gc.txt b/Documentation/git-gc.txt
index bc16584..3c807c2 100644
--- a/Documentation/git-gc.txt
+++ b/Documentation/git-gc.txt
@@ -8,7 +8,7 @@ git-gc - Cleanup unnecessary files and optimize the local repository
 
 SYNOPSIS
 --------
-'git-gc' [--prune]
+'git-gc' [--prune] [--no-reuse-delta] [--window=N] [--depth=N]
 
 DESCRIPTION
 -----------
@@ -35,6 +35,7 @@ OPTIONS
 	repository at the same time (e.g. never use this option
 	in a cron script).
 
+include::pack-options.txt[]
 
 Configuration
 -------------
diff --git a/Documentation/git-pack-objects.txt b/Documentation/git-pack-objects.txt
index d9e11c6..aac71f6 100644
--- a/Documentation/git-pack-objects.txt
+++ b/Documentation/git-pack-objects.txt
@@ -73,18 +73,6 @@ base-name::
 	as if all refs under `$GIT_DIR/refs` are specified to be
 	included.
 
---window=[N], --depth=[N]::
-	These two options affect how the objects contained in
-	the pack are stored using delta compression.  The
-	objects are first internally sorted by type, size and
-	optionally names and compared against the other objects
-	within --window to see if using delta compression saves
-	space.  --depth limits the maximum delta depth; making
-	it too deep affects the performance on the unpacker
-	side, because delta data needs to be applied that many
-	times to get to the necessary object.
-	The default value for both --window and --depth is 10.
-
 --incremental::
 	This flag causes an object already in a pack ignored
 	even if it appears in the standard input.
@@ -120,12 +108,7 @@ base-name::
 	This flag makes the command not to report its progress
 	on the standard error stream.
 
---no-reuse-delta::
-	When creating a packed archive in a repository that
-	has existing packs, the command reuses existing deltas.
-	This sometimes results in a slightly suboptimal pack.
-	This flag tells the command not to reuse existing deltas
-	but compute them from scratch.
+include::pack-options.txt[]
 
 --delta-base-offset::
 	A packed archive can express base object of a delta as
diff --git a/Documentation/pack-options.txt b/Documentation/pack-options.txt
new file mode 100644
index 0000000..7b0ae5f
--- /dev/null
+++ b/Documentation/pack-options.txt
@@ -0,0 +1,18 @@
+--window=[N], --depth=[N]::
+	These two options affect how the objects contained in
+	the pack are stored using delta compression.  The
+	objects are first internally sorted by type, size and
+	optionally names and compared against the other objects
+	within --window to see if using delta compression saves
+	space.  --depth limits the maximum delta depth; making
+	it too deep affects the performance on the unpacker
+	side, because delta data needs to be applied that many
+	times to get to the necessary object.
+	The default value for both --window and --depth is 10.
+
+--no-reuse-delta::
+	When creating a packed archive in a repository that
+	has existing packs, the command reuses existing deltas.
+	This sometimes results in a slightly suboptimal pack.
+	This flag tells the command not to reuse existing deltas
+	but compute them from scratch.
diff --git a/builtin-gc.c b/builtin-gc.c
index 3b1f8c2..7e7775d 100644
--- a/builtin-gc.c
+++ b/builtin-gc.c
@@ -15,13 +15,15 @@
 
 #define FAILED_RUN "failed to run %s"
 
-static const char builtin_gc_usage[] = "git-gc [--prune]";
+static const char builtin_gc_usage[] = "git-gc [--prune] [--no-reuse-delta] [--window=N] [--depth=N]";
 
 static int pack_refs = -1;
 
+#define MAX_ADD 10
+
 static const char *argv_pack_refs[] = {"pack-refs", "--prune", NULL};
 static const char *argv_reflog[] = {"reflog", "expire", "--all", NULL};
-static const char *argv_repack[] = {"repack", "-a", "-d", "-l", NULL};
+static const char *argv_repack[MAX_ADD] = {"repack", "-a", "-d", "-l", NULL};
 static const char *argv_prune[] = {"prune", NULL};
 static const char *argv_rerere[] = {"rerere", "gc", NULL};
 
@@ -37,6 +39,21 @@ static int gc_config(const char *var, const char *value)
 	return git_default_config(var, value);
 }
 
+static append_option(const char **cmd, const char *opt, int max_length)
+{
+	int	i;
+
+	for (i=0; cmd[i]; i++)
+		;
+
+	if (i+2 >= max_length) {
+		fprintf(stderr, "Too many options specified\n");
+		exit(1);
+	}
+	cmd[i++] = opt;
+	cmd[i] = 0;
+}
+
 int cmd_gc(int argc, const char **argv, const char *prefix)
 {
 	int i;
@@ -53,6 +70,15 @@ int cmd_gc(int argc, const char **argv, const char *prefix)
 			prune = 1;
 			continue;
 		}
+		if (!strcmp(arg, "--no-reuse-delta")) {
+			append_option(argv_repack, "-f", MAX_ADD);
+			continue;
+		}
+		if (!strncmp(arg, "--window", 8) ||
+		    !strncmp(arg, "--depth", 7)) {
+			append_option(argv_repack, arg, MAX_ADD);
+			continue;
+		}
 		/* perhaps other parameters later... */
 		break;
 	}
-- 
1.5.2.rc1.20.g86b9-dirty

^ permalink raw reply related	[flat|nested] 40+ messages in thread

* Re: [PATCH] Add --no-reuse-delta, --window, and --depth options to git-gc
  2007-05-08  2:54 [PATCH] Add --no-reuse-delta, --window, and --depth options to git-gc Theodore Ts'o
@ 2007-05-08  3:13 ` Nicolas Pitre
  2007-05-08  3:21   ` Theodore Tso
  2007-05-08 13:28 ` [PATCH] Add --no-reuse-delta, --window, and --depth options to Theodore Ts'o
  1 sibling, 1 reply; 40+ messages in thread
From: Nicolas Pitre @ 2007-05-08  3:13 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: Junio C Hamano, git

On Mon, 7 May 2007, Theodore Ts'o wrote:

> Sometimes users might want to use more aggressive packing options
> when doing a git-gc.  This allows them to do so without having
> to use the low-level plumbing commands.

The 'git repack' command isn't _that_ low level, is it?  
git-pack-objects is plumbing for sure, but not git-repack?

Especially if you're aware and interested in those options, you won't be 
afraid of 'git repack -a -f -d --window=...".

In the context of "gc", having an option that reads "window" looks a bit 
strange too.

Maybe it's just me...


Nicolas

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH] Add --no-reuse-delta, --window, and --depth options to git-gc
  2007-05-08  3:13 ` Nicolas Pitre
@ 2007-05-08  3:21   ` Theodore Tso
  2007-05-08  3:38     ` Dana How
  2007-05-08  4:43     ` Junio C Hamano
  0 siblings, 2 replies; 40+ messages in thread
From: Theodore Tso @ 2007-05-08  3:21 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: Junio C Hamano, git

On Mon, May 07, 2007 at 11:13:58PM -0400, Nicolas Pitre wrote:
> On Mon, 7 May 2007, Theodore Ts'o wrote:
> 
> > Sometimes users might want to use more aggressive packing options
> > when doing a git-gc.  This allows them to do so without having
> > to use the low-level plumbing commands.
> 
> The 'git repack' command isn't _that_ low level, is it?  
> git-pack-objects is plumbing for sure, but not git-repack?
> 
> Especially if you're aware and interested in those options, you won't be 
> afraid of 'git repack -a -f -d --window=...".
> 
> In the context of "gc", having an option that reads "window" looks a bit 
> strange too.

I suppose, but you either need to then know all of the other commands
which git-gc runs, and do them manually, skipping git-gc altogether,
or use git-gc, and end up rewriting the pack twice, ince using the
git-repack in git-gc, and then once manually so you can give the
options that you really want to give to git-repack.

Maybe the right approach is to have a way to specify default --window
and --depth as git configuration variables?  Looks like there is a
pack.window already, but not a pack.depth.

What if we add a pack.depth configuration option, and add only
--no-reuse-delta to git-gc?   Would that be better?

						- Ted

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH] Add --no-reuse-delta, --window, and --depth options to git-gc
  2007-05-08  3:21   ` Theodore Tso
@ 2007-05-08  3:38     ` Dana How
  2007-05-08  4:43     ` Junio C Hamano
  1 sibling, 0 replies; 40+ messages in thread
From: Dana How @ 2007-05-08  3:38 UTC (permalink / raw)
  To: Theodore Tso; +Cc: Nicolas Pitre, Junio C Hamano, git, danahow

On 5/7/07, Theodore Tso <tytso@mit.edu> wrote:
> On Mon, May 07, 2007 at 11:13:58PM -0400, Nicolas Pitre wrote:
> > On Mon, 7 May 2007, Theodore Ts'o wrote:
> > > Sometimes users might want to use more aggressive packing options
> > > when doing a git-gc.  This allows them to do so without having
> > > to use the low-level plumbing commands.
> > In the context of "gc", having an option that reads "window" looks a bit
> > strange too.
> I suppose, but you either need to then know all of the other commands
> which git-gc runs, and do them manually, skipping git-gc altogether,
> or use git-gc, and end up rewriting the pack twice, ince using the
> git-repack in git-gc, and then once manually so you can give the
> options that you really want to give to git-repack.
>
> Maybe the right approach is to have a way to specify default --window
> and --depth as git configuration variables?  Looks like there is a
> pack.window already, but not a pack.depth.
>
> What if we add a pack.depth configuration option, and add only
> --no-reuse-delta to git-gc?   Would that be better?

I would use pack.depth .

Thanks,
-- 
Dana L. How  danahow@gmail.com  +1 650 804 5991 cell

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH] Add --no-reuse-delta, --window, and --depth options to git-gc
  2007-05-08  3:21   ` Theodore Tso
  2007-05-08  3:38     ` Dana How
@ 2007-05-08  4:43     ` Junio C Hamano
  2007-05-08 13:46       ` Nicolas Pitre
  1 sibling, 1 reply; 40+ messages in thread
From: Junio C Hamano @ 2007-05-08  4:43 UTC (permalink / raw)
  To: Theodore Tso; +Cc: Nicolas Pitre, git

Theodore Tso <tytso@mit.edu> writes:

> On Mon, May 07, 2007 at 11:13:58PM -0400, Nicolas Pitre wrote:
>> ... 
>> Especially if you're aware and interested in those options, you won't be 
>> afraid of 'git repack -a -f -d --window=...".
>> 
>> In the context of "gc", having an option that reads "window" looks a bit 
>> strange too.
>
> I suppose, but you either need to then know all of the other commands
> which git-gc runs, and do them manually, skipping git-gc altogether,
> or use git-gc, and end up rewriting the pack twice,...

If the user really wants to tweak the parameters that much and
that often, I think what Nico says, plus your pack.depth/window
configuration variables, make more sense.  git-gc is meant to be
a shorthand with reasonable "one size fits all" default, and
there is something wrong if a user has to give customization
option every time it is run.  It could be that the default
parameters are grossly off for _everybody_, in which case we
should fix the default.

With the recent introduction of delta base caching code, we
might want to tweak the default pack depth to larger value for
everybody, by the way.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH] Add --no-reuse-delta, --window, and --depth options to
  2007-05-08  2:54 [PATCH] Add --no-reuse-delta, --window, and --depth options to git-gc Theodore Ts'o
  2007-05-08  3:13 ` Nicolas Pitre
@ 2007-05-08 13:28 ` Theodore Ts'o
  2007-05-08 13:28   ` [PATCH] Add pack.depth option to git-pack-objects and change default depth to 50 Theodore Ts'o
  2007-05-08 15:30   ` [PATCH] Add --no-reuse-delta, --window, and --depth options to Nicolas Pitre
  1 sibling, 2 replies; 40+ messages in thread
From: Theodore Ts'o @ 2007-05-08 13:28 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Git Mailing List

OK, here's a patch to implement pack.depth (with the default tweaked
to 50 --- is that too high?), followed by a simplified and reworked
patch to git-gc that only implements --no-reuse-delta.

I don't imagine that most users will want to use that feature most of
the time, hence the long option name, but occasionally, it might be
useful.  Yes, the user could just run "git-repack -a -d -f -l" after
running git-gc, but then the "git-repack -a -d -l" in git-gc is just a
wasted disk i/o.  I don't know if I'll manage to convince you, if not,
just drop the second patch, I guess.  :-)

						 - Ted

^ permalink raw reply	[flat|nested] 40+ messages in thread

* [PATCH] Add pack.depth option to git-pack-objects and change default depth to 50
  2007-05-08 13:28 ` [PATCH] Add --no-reuse-delta, --window, and --depth options to Theodore Ts'o
@ 2007-05-08 13:28   ` Theodore Ts'o
  2007-05-08 13:28     ` [PATCH] Add --no-reuse-delta option to git-gc Theodore Ts'o
  2007-05-08 15:38     ` [PATCH] Add pack.depth option to git-pack-objects and change default depth to 50 Nicolas Pitre
  2007-05-08 15:30   ` [PATCH] Add --no-reuse-delta, --window, and --depth options to Nicolas Pitre
  1 sibling, 2 replies; 40+ messages in thread
From: Theodore Ts'o @ 2007-05-08 13:28 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Git Mailing List, Theodore Ts'o

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
---
 Documentation/config.txt           |    6 +++++-
 Documentation/git-pack-objects.txt |    2 +-
 Documentation/git-repack.txt       |    2 +-
 builtin-pack-objects.c             |    6 +++++-
 4 files changed, 12 insertions(+), 4 deletions(-)

diff --git a/Documentation/config.txt b/Documentation/config.txt
index 24f9655..c7674c2 100644
--- a/Documentation/config.txt
+++ b/Documentation/config.txt
@@ -544,7 +544,11 @@ merge.<driver>.recursive::
 
 pack.window::
 	The size of the window used by gitlink:git-pack-objects[1] when no
-	window size is given on the command line. Defaults to 10.
+	window size is given on the command line.  Defaults to 10.
+
+pack.depth::
+	The maximum delta depth used by gitlink:git-pack-objects[1] when no
+	maximum depth is given on the command line.  Defaults to 50.
 
 pull.octopus::
 	The default merge strategy to use when pulling multiple branches
diff --git a/Documentation/git-pack-objects.txt b/Documentation/git-pack-objects.txt
index d9e11c6..bd3ee45 100644
--- a/Documentation/git-pack-objects.txt
+++ b/Documentation/git-pack-objects.txt
@@ -83,7 +83,7 @@ base-name::
 	it too deep affects the performance on the unpacker
 	side, because delta data needs to be applied that many
 	times to get to the necessary object.
-	The default value for both --window and --depth is 10.
+	The default value for --window is 10 and --depth is 50.
 
 --incremental::
 	This flag causes an object already in a pack ignored
diff --git a/Documentation/git-repack.txt b/Documentation/git-repack.txt
index d39abc1..cc3b0b2 100644
--- a/Documentation/git-repack.txt
+++ b/Documentation/git-repack.txt
@@ -63,7 +63,7 @@ OPTIONS
 	space. `--depth` limits the maximum delta depth; making it too deep
 	affects the performance on the unpacker side, because delta data needs
 	to be applied that many times to get to the necessary object.
-	The default value for both --window and --depth is 10.
+	The default value for --window is 10 and --depth is 50.
 
 
 Configuration
diff --git a/builtin-pack-objects.c b/builtin-pack-objects.c
index 7bff8ea..966f843 100644
--- a/builtin-pack-objects.c
+++ b/builtin-pack-objects.c
@@ -64,6 +64,7 @@ static char tmpname[PATH_MAX];
 static unsigned char pack_file_sha1[20];
 static int progress = 1;
 static int window = 10;
+static int depth = 50;
 static int pack_to_stdout;
 static int num_preferred_base;
 static struct progress progress_state;
@@ -1489,6 +1490,10 @@ static int git_pack_config(const char *k, const char *v)
 		window = git_config_int(k, v);
 		return 0;
 	}
+	if(!strcmp(k, "pack.depth")) {
+		depth = git_config_int(k, v);
+		return 0;
+	}
 	return git_default_config(k, v);
 }
 
@@ -1584,7 +1589,6 @@ static int adjust_perm(const char *path, mode_t mode)
 
 int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 {
-	int depth = 10;
 	int use_internal_rev_list = 0;
 	int thin = 0;
 	uint32_t i;
-- 
1.5.2.rc2.22.ga39d

^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH] Add --no-reuse-delta option to git-gc
  2007-05-08 13:28   ` [PATCH] Add pack.depth option to git-pack-objects and change default depth to 50 Theodore Ts'o
@ 2007-05-08 13:28     ` Theodore Ts'o
  2007-05-08 15:35       ` Nicolas Pitre
  2007-05-09  5:05       ` Daniel Barkalow
  2007-05-08 15:38     ` [PATCH] Add pack.depth option to git-pack-objects and change default depth to 50 Nicolas Pitre
  1 sibling, 2 replies; 40+ messages in thread
From: Theodore Ts'o @ 2007-05-08 13:28 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Git Mailing List, Theodore Ts'o

This allows the user to regenerate the deltas in packs while doing
a git-gc.  The user could just run git-repack -a -d -f -l after
running git-gc, but then the first git-repack run by git-gc is
a bit of waste.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
---
 Documentation/git-gc.txt |    7 ++++++-
 builtin-gc.c             |   24 ++++++++++++++++++++++--
 2 files changed, 28 insertions(+), 3 deletions(-)

diff --git a/Documentation/git-gc.txt b/Documentation/git-gc.txt
index bc16584..0493d06 100644
--- a/Documentation/git-gc.txt
+++ b/Documentation/git-gc.txt
@@ -8,7 +8,7 @@ git-gc - Cleanup unnecessary files and optimize the local repository
 
 SYNOPSIS
 --------
-'git-gc' [--prune]
+'git-gc' [--prune] [--no-reuse-delta]
 
 DESCRIPTION
 -----------
@@ -35,6 +35,11 @@ OPTIONS
 	repository at the same time (e.g. never use this option
 	in a cron script).
 
+--no-reuse-delta::
+	This causes deltas in existing packs to be recalculated instead
+	of reusing the existing deltas.  This can save disk space at
+	the cost of taking more time to recalculate them from scratch.
+
 
 Configuration
 -------------
diff --git a/builtin-gc.c b/builtin-gc.c
index 3b1f8c2..5cb7ffd 100644
--- a/builtin-gc.c
+++ b/builtin-gc.c
@@ -15,13 +15,14 @@
 
 #define FAILED_RUN "failed to run %s"
 
-static const char builtin_gc_usage[] = "git-gc [--prune]";
+static const char builtin_gc_usage[] = "git-gc [--prune] [--no-reuse-delta]";
 
 static int pack_refs = -1;
 
+#define MAX_ADD 10
 static const char *argv_pack_refs[] = {"pack-refs", "--prune", NULL};
 static const char *argv_reflog[] = {"reflog", "expire", "--all", NULL};
-static const char *argv_repack[] = {"repack", "-a", "-d", "-l", NULL};
+static const char *argv_repack[MAX_ADD] = {"repack", "-a", "-d", "-l", NULL};
 static const char *argv_prune[] = {"prune", NULL};
 static const char *argv_rerere[] = {"rerere", "gc", NULL};
 
@@ -37,6 +38,21 @@ static int gc_config(const char *var, const char *value)
 	return git_default_config(var, value);
 }
 
+static append_option(const char **cmd, const char *opt, int max_length)
+{
+	int	i;
+
+	for (i=0; cmd[i]; i++)
+		;
+
+	if (i+2 >= max_length) {
+		fprintf(stderr, "Too many options specified\n");
+		exit(1);
+	}
+	cmd[i++] = opt;
+	cmd[i] = 0;
+}
+
 int cmd_gc(int argc, const char **argv, const char *prefix)
 {
 	int i;
@@ -53,6 +69,10 @@ int cmd_gc(int argc, const char **argv, const char *prefix)
 			prune = 1;
 			continue;
 		}
+		if (!strcmp(arg, "--no-reuse-delta")) {
+			append_option(argv_repack, "-f", MAX_ADD);
+			continue;
+		}
 		/* perhaps other parameters later... */
 		break;
 	}
-- 
1.5.2.rc2.22.ga39d

^ permalink raw reply related	[flat|nested] 40+ messages in thread

* Re: [PATCH] Add --no-reuse-delta, --window, and --depth options to git-gc
  2007-05-08  4:43     ` Junio C Hamano
@ 2007-05-08 13:46       ` Nicolas Pitre
  0 siblings, 0 replies; 40+ messages in thread
From: Nicolas Pitre @ 2007-05-08 13:46 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Theodore Tso, git

On Mon, 7 May 2007, Junio C Hamano wrote:

> Theodore Tso <tytso@mit.edu> writes:
> 
> > On Mon, May 07, 2007 at 11:13:58PM -0400, Nicolas Pitre wrote:
> >> ... 
> >> Especially if you're aware and interested in those options, you won't be 
> >> afraid of 'git repack -a -f -d --window=...".
> >> 
> >> In the context of "gc", having an option that reads "window" looks a bit 
> >> strange too.
> >
> > I suppose, but you either need to then know all of the other commands
> > which git-gc runs, and do them manually, skipping git-gc altogether,
> > or use git-gc, and end up rewriting the pack twice,...
> 
> If the user really wants to tweak the parameters that much and
> that often, I think what Nico says, plus your pack.depth/window
> configuration variables, make more sense.

Also, once you've run git-repack with -f and the window/depth params you 
want, those deltas will be reused as is by default afterwards even if 
you use git-gc.  So the big repack to tighten a large repo needs only to 
be done once.


Nicolas

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH] Add --no-reuse-delta, --window, and --depth options to
  2007-05-08 13:28 ` [PATCH] Add --no-reuse-delta, --window, and --depth options to Theodore Ts'o
  2007-05-08 13:28   ` [PATCH] Add pack.depth option to git-pack-objects and change default depth to 50 Theodore Ts'o
@ 2007-05-08 15:30   ` Nicolas Pitre
  2007-05-08 21:12     ` Junio C Hamano
  1 sibling, 1 reply; 40+ messages in thread
From: Nicolas Pitre @ 2007-05-08 15:30 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: Junio C Hamano, Git Mailing List

On Tue, 8 May 2007, Theodore Ts'o wrote:

> OK, here's a patch to implement pack.depth (with the default tweaked
> to 50 --- is that too high?),

This is most likely to affect runtime performances, but some benchmarks 
would be needed to find out how much.

Having both pack.depth and pack.window as config options (with current 
defaults of 10) would certainly be a good thing.  Tweaking those 
defaults should probably be investigated separately.

> followed by a simplified and reworked
> patch to git-gc that only implements --no-reuse-delta.
> 
> I don't imagine that most users will want to use that feature most of
> the time, hence the long option name, but occasionally, it might be
> useful.  Yes, the user could just run "git-repack -a -d -f -l" after
> running git-gc, but then the "git-repack -a -d -l" in git-gc is just a
> wasted disk i/o.

In which case, it is git-gc that needs to get a bit smarter.  Maybe 
something like this:

----- >*
Avoid running git-repack from git-gc if there is evidently nothing to 
repack.

Signed-off-by: Nicolas Pitre <nico@cam.org>
---

diff --git a/builtin-count-objects.c b/builtin-count-objects.c
index ff90ebd..8d79764 100644
--- a/builtin-count-objects.c
+++ b/builtin-count-objects.c
@@ -67,13 +67,40 @@ static void count_objects(DIR *d, char *path, int len, int verbose,
 	}
 }
 
-int cmd_count_objects(int ac, const char **av, const char *prefix)
+static void count_loose_objects(int verbose,
+				unsigned long *loose,
+				unsigned long *loose_size,
+				unsigned long *packed_loose,
+				unsigned long *garbage)
 {
-	int i;
-	int verbose = 0;
 	const char *objdir = get_object_directory();
-	int len = strlen(objdir);
+	int i, len = strlen(objdir);
 	char *path = xmalloc(len + 50);
+	memcpy(path, objdir, len);
+	if (len && objdir[len-1] != '/')
+		path[len++] = '/';
+	for (i = 0; i < 256; i++) {
+		DIR *d;
+		sprintf(path + len, "%02x", i);
+		d = opendir(path);
+		if (!d)
+			continue;
+		count_objects(d, path, len, verbose,
+			      loose, loose_size, packed_loose, garbage);
+		closedir(d);
+	}
+}
+
+unsigned long num_loose_objects(void)
+{
+	unsigned long loose = 0, packed_loose = 0, garbage = 0, loose_size = 0;
+	count_loose_objects(0, &loose, &loose_size, &packed_loose, &garbage);
+	return loose;
+}
+
+int cmd_count_objects(int ac, const char **av, const char *prefix)
+{
+	int i, verbose = 0;
 	unsigned long loose = 0, packed = 0, packed_loose = 0, garbage = 0;
 	unsigned long loose_size = 0;
 
@@ -90,19 +117,8 @@ int cmd_count_objects(int ac, const char **av, const char *prefix)
 	/* we do not take arguments other than flags for now */
 	if (i < ac)
 		usage(count_objects_usage);
-	memcpy(path, objdir, len);
-	if (len && objdir[len-1] != '/')
-		path[len++] = '/';
-	for (i = 0; i < 256; i++) {
-		DIR *d;
-		sprintf(path + len, "%02x", i);
-		d = opendir(path);
-		if (!d)
-			continue;
-		count_objects(d, path, len, verbose,
-			      &loose, &loose_size, &packed_loose, &garbage);
-		closedir(d);
-	}
+
+	count_loose_objects(verbose, &loose, &loose_size, &packed_loose, &garbage);
 	if (verbose) {
 		struct packed_git *p;
 		unsigned long num_pack = 0;
diff --git a/builtin-gc.c b/builtin-gc.c
index 3b1f8c2..b9b3c05 100644
--- a/builtin-gc.c
+++ b/builtin-gc.c
@@ -40,7 +40,7 @@ static int gc_config(const char *var, const char *value)
 int cmd_gc(int argc, const char **argv, const char *prefix)
 {
 	int i;
-	int prune = 0;
+	int prune = 0, do_repack = 0;
 
 	git_config(gc_config);
 
@@ -65,7 +65,20 @@ int cmd_gc(int argc, const char **argv, const char *prefix)
 	if (run_command_v_opt(argv_reflog, RUN_GIT_CMD))
 		return error(FAILED_RUN, argv_reflog[0]);
 
-	if (run_command_v_opt(argv_repack, RUN_GIT_CMD))
+	if (num_loose_objects() > 0) {
+		do_repack = 1;
+	} else {
+		struct packed_git *p;
+		unsigned long num_pack = 0;
+		if (!packed_git)
+			prepare_packed_git();
+		for (p = packed_git; p; p = p->next)
+			if (p->pack_local)
+				num_pack++;
+		if (num_pack > 1)
+			do_repack = 1;
+	}
+	if (do_repack && run_command_v_opt(argv_repack, RUN_GIT_CMD))
 		return error(FAILED_RUN, argv_repack[0]);
 
 	if (prune && run_command_v_opt(argv_prune, RUN_GIT_CMD))
diff --git a/cache.h b/cache.h
index 8e76152..3a140f1 100644
--- a/cache.h
+++ b/cache.h
@@ -359,6 +359,8 @@ extern int legacy_loose_object(unsigned char *);
 extern int has_pack_file(const unsigned char *sha1);
 extern int has_pack_index(const unsigned char *sha1);
 
+extern unsigned long num_loose_objects(void);
+
 extern signed char hexval_table[256];
 static inline unsigned int hexval(unsigned int c)
 {

^ permalink raw reply related	[flat|nested] 40+ messages in thread

* Re: [PATCH] Add --no-reuse-delta option to git-gc
  2007-05-08 13:28     ` [PATCH] Add --no-reuse-delta option to git-gc Theodore Ts'o
@ 2007-05-08 15:35       ` Nicolas Pitre
  2007-05-09  5:05       ` Daniel Barkalow
  1 sibling, 0 replies; 40+ messages in thread
From: Nicolas Pitre @ 2007-05-08 15:35 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: Junio C Hamano, Git Mailing List

On Tue, 8 May 2007, Theodore Ts'o wrote:

> This allows the user to regenerate the deltas in packs while doing
> a git-gc.  The user could just run git-repack -a -d -f -l after
> running git-gc, but then the first git-repack run by git-gc is
> a bit of waste.

Given the patch I just posted and my previous arguments I don't think 
this patch is in the spirit of git-gc.  I'd prefer if git-gc remained as 
simple with the least options as possible, but I don't care that much 
either.


Nicolas

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH] Add pack.depth option to git-pack-objects and change default depth to 50
  2007-05-08 13:28   ` [PATCH] Add pack.depth option to git-pack-objects and change default depth to 50 Theodore Ts'o
  2007-05-08 13:28     ` [PATCH] Add --no-reuse-delta option to git-gc Theodore Ts'o
@ 2007-05-08 15:38     ` Nicolas Pitre
  2007-05-08 16:30       ` Theodore Tso
  1 sibling, 1 reply; 40+ messages in thread
From: Nicolas Pitre @ 2007-05-08 15:38 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: Junio C Hamano, Git Mailing List

On Tue, 8 May 2007, Theodore Ts'o wrote:

> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

I'd prefer if tests were performed on the performance impact before 
changing the default depth.  If done separately from this patch then the 
commit log could contain those results as well.


Nicolas

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH] Add pack.depth option to git-pack-objects and change default depth to 50
  2007-05-08 15:38     ` [PATCH] Add pack.depth option to git-pack-objects and change default depth to 50 Nicolas Pitre
@ 2007-05-08 16:30       ` Theodore Tso
  2007-05-08 16:49         ` Johannes Schindelin
                           ` (2 more replies)
  0 siblings, 3 replies; 40+ messages in thread
From: Theodore Tso @ 2007-05-08 16:30 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: Junio C Hamano, Git Mailing List

On Tue, May 08, 2007 at 11:38:46AM -0400, Nicolas Pitre wrote:
> On Tue, 8 May 2007, Theodore Ts'o wrote:
> 
> > Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
> 
> I'd prefer if tests were performed on the performance impact before 
> changing the default depth.  If done separately from this patch then the 
> commit log could contain those results as well.

The following results are on a recent git repository, using time to
record the real, user, and sys times on the two commands: "git-gc
--no-reuse-delta --window=X --depth=Y" and "git log --pretty=oneline
-S'object' > /dev/null".  All of these tests were done with a hot
cache, so disk speed didn't enter into the calculations.

                git-gc                     git log -S'object'   pack size
w=10,d=10       27.1s/25.2s/0.3s           20.8s/20.4s/0.1s     15292k
w=10,d=30	23.8s/22.3s/0.2s           21.2s/20.9s/0.1s     12996k
w=10,d=50       24.8s/22.4s/0.4s           21.8s/21.2s/0.1s     12340k
w=100,d=100     24.1s/22.8s/0.3s           22.4s/21.8s/0.2s     11772k

w=30,d=10       45.0s/43.1s/0.4s           20.8s/20.5s/0.1s     14388k
w=30,d=30       35.8s/34.1s/0.3s           21.6s/21.1s/0.1s     11800k
w=30,d=50       34.6s/33.0s/0.3s           22.1s/21.4s/0.1s     11376k
w=30,d=100      34.0s/32.2s/0.3s           22.2s/21.6s/0.1s     11012k

w=50,d=10       56.1s/54.3s/0.4s           21.3s/20.5s/0.1s     14224k
w=50,d=30       47.2s/45.4s/0.4s           21.6s/21.0s/0.1s     11496k      
w=50,d=50       44.5s/43.0s/0.3s           21.7s/21.2s/0.1s     11108k
w=50,d=100      44.3s/42.7s/0.4s           22.4s/21.7s/0.1s     10824k

So a couple of things immediately become evident.  First of all, as
Junio predicted, changing --depth makes no difference to the git-gc or
git log times.  The latter is thanks to the delta chaching.  Secondly,
changing --depth does make a signficiant difference to the pack size.

Finally, --window does help somewhat in reducing the pack size, but it
_significantly_ increases the time to calculate the pack.

My conclusion given this quick benchmark is that it seems to me that
changing the defaults of --depth to 50, and keeping --window at 10, is
a reasonable thing to do.

Regards,

						- Ted

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH] Add pack.depth option to git-pack-objects and change default depth to 50
  2007-05-08 16:30       ` Theodore Tso
@ 2007-05-08 16:49         ` Johannes Schindelin
  2007-05-08 18:09           ` Theodore Tso
  2007-05-08 17:07         ` Dana How
  2007-05-08 17:35         ` Nicolas Pitre
  2 siblings, 1 reply; 40+ messages in thread
From: Johannes Schindelin @ 2007-05-08 16:49 UTC (permalink / raw)
  To: Theodore Tso; +Cc: Nicolas Pitre, Junio C Hamano, Git Mailing List

Hi,

On Tue, 8 May 2007, Theodore Tso wrote:

> On Tue, May 08, 2007 at 11:38:46AM -0400, Nicolas Pitre wrote:
> > On Tue, 8 May 2007, Theodore Ts'o wrote:
> > 
> > > Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
> > 
> > I'd prefer if tests were performed on the performance impact before 
> > changing the default depth.  If done separately from this patch then the 
> > commit log could contain those results as well.
> 
> The following results are on a recent git repository, using time to
> record the real, user, and sys times on the two commands: "git-gc
> --no-reuse-delta --window=X --depth=Y" and "git log --pretty=oneline
> -S'object' > /dev/null".  All of these tests were done with a hot
> cache, so disk speed didn't enter into the calculations.
> 
>                 git-gc                     git log -S'object'   pack size
> w=10,d=10       27.1s/25.2s/0.3s           20.8s/20.4s/0.1s     15292k
> w=10,d=30	23.8s/22.3s/0.2s           21.2s/20.9s/0.1s     12996k
> w=10,d=50       24.8s/22.4s/0.4s           21.8s/21.2s/0.1s     12340k
> w=100,d=100     24.1s/22.8s/0.3s           22.4s/21.8s/0.2s     11772k
> 
> w=30,d=10       45.0s/43.1s/0.4s           20.8s/20.5s/0.1s     14388k
> w=30,d=30       35.8s/34.1s/0.3s           21.6s/21.1s/0.1s     11800k
> w=30,d=50       34.6s/33.0s/0.3s           22.1s/21.4s/0.1s     11376k
> w=30,d=100      34.0s/32.2s/0.3s           22.2s/21.6s/0.1s     11012k
> 
> w=50,d=10       56.1s/54.3s/0.4s           21.3s/20.5s/0.1s     14224k
> w=50,d=30       47.2s/45.4s/0.4s           21.6s/21.0s/0.1s     11496k      
> w=50,d=50       44.5s/43.0s/0.3s           21.7s/21.2s/0.1s     11108k
> w=50,d=100      44.3s/42.7s/0.4s           22.4s/21.7s/0.1s     10824k
> 
> So a couple of things immediately become evident.  First of all, as
> Junio predicted, changing --depth makes no difference to the git-gc or
> git log times.  The latter is thanks to the delta chaching.  Secondly,
> changing --depth does make a signficiant difference to the pack size.
> 
> Finally, --window does help somewhat in reducing the pack size, but it
> _significantly_ increases the time to calculate the pack.
> 
> My conclusion given this quick benchmark is that it seems to me that
> changing the defaults of --depth to 50, and keeping --window at 10, is
> a reasonable thing to do.

I'd be happier if that test was done on _at least_ the kernel repo, if not 
something larger, _plus_ having the numbers on page faults. Swapping can 
kill performance substantially...

git.git is small.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH] Add pack.depth option to git-pack-objects and change default depth to 50
  2007-05-08 16:30       ` Theodore Tso
  2007-05-08 16:49         ` Johannes Schindelin
@ 2007-05-08 17:07         ` Dana How
  2007-05-08 17:35         ` Nicolas Pitre
  2 siblings, 0 replies; 40+ messages in thread
From: Dana How @ 2007-05-08 17:07 UTC (permalink / raw)
  To: Theodore Tso; +Cc: Nicolas Pitre, Junio C Hamano, Git Mailing List, danahow

On 5/8/07, Theodore Tso <tytso@mit.edu> wrote:
> The following results are on a recent git repository, using time to
> record the real, user, and sys times on the two commands: "git-gc
> --no-reuse-delta --window=X --depth=Y" and "git log --pretty=oneline
> -S'object' > /dev/null".  All of these tests were done with a hot
> cache, so disk speed didn't enter into the calculations.
>
> ...
>
> So a couple of things immediately become evident.  First of all, as
> Junio predicted, changing --depth makes no difference to the git-gc or
> git log times.  The latter is thanks to the delta chaching.  Secondly,
> changing --depth does make a signficiant difference to the pack size.
>
> Finally, --window does help somewhat in reducing the pack size, but it
> _significantly_ increases the time to calculate the pack.
>
> My conclusion given this quick benchmark is that it seems to me that
> changing the defaults of --depth to 50, and keeping --window at 10, is
> a reasonable thing to do.

If you still have the packfiles around,  the times for some non-pickaxe
git-log commands would be interesting,  like from git-log's man page:
       git log v2.6.12.. include/scsi drivers/scsi
       git log --since="2 weeks ago" -- gitk
These operations would be more dominated by processing smaller objects.

Thanks,
-- 
Dana L. How  danahow@gmail.com  +1 650 804 5991 cell

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH] Add pack.depth option to git-pack-objects and change default depth to 50
  2007-05-08 16:30       ` Theodore Tso
  2007-05-08 16:49         ` Johannes Schindelin
  2007-05-08 17:07         ` Dana How
@ 2007-05-08 17:35         ` Nicolas Pitre
  2007-05-09  5:03           ` Junio C Hamano
  2 siblings, 1 reply; 40+ messages in thread
From: Nicolas Pitre @ 2007-05-08 17:35 UTC (permalink / raw)
  To: Theodore Tso; +Cc: Junio C Hamano, Git Mailing List

On Tue, 8 May 2007, Theodore Tso wrote:

> On Tue, May 08, 2007 at 11:38:46AM -0400, Nicolas Pitre wrote:
> > On Tue, 8 May 2007, Theodore Ts'o wrote:
> > 
> > > Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
> > 
> > I'd prefer if tests were performed on the performance impact before 
> > changing the default depth.  If done separately from this patch then the 
> > commit log could contain those results as well.
> 
> The following results are on a recent git repository, using time to
> record the real, user, and sys times on the two commands: "git-gc
> --no-reuse-delta --window=X --depth=Y" and "git log --pretty=oneline
> -S'object' > /dev/null".  All of these tests were done with a hot
> cache, so disk speed didn't enter into the calculations.
[...]
> My conclusion given this quick benchmark is that it seems to me that
> changing the defaults of --depth to 50, and keeping --window at 10, is
> a reasonable thing to do.

Effectively.

I'd still prefer to see the default changed in a patch of its own 
though.


Nicolas

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH] Add pack.depth option to git-pack-objects and change default depth to 50
  2007-05-08 16:49         ` Johannes Schindelin
@ 2007-05-08 18:09           ` Theodore Tso
  2007-05-08 18:46             ` Nicolas Pitre
  0 siblings, 1 reply; 40+ messages in thread
From: Theodore Tso @ 2007-05-08 18:09 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Nicolas Pitre, Junio C Hamano, Git Mailing List

On Tue, May 08, 2007 at 06:49:47PM +0200, Johannes Schindelin wrote:
> I'd be happier if that test was done on _at least_ the kernel repo, if not 
> something larger, _plus_ having the numbers on page faults. Swapping can 
> kill performance substantially...

Given how the delta cache works, I really don't think it's going to
matter.  In any case, my laptop has 2gigs of memory, and the kernel
pack file is only 134megs, so you're not going to see any major page
faults....

In any case, here is a quick run:

                git-gc                     git-log -S'object'     
             real/user/sys/min.faults   real/user/sys/min.faults  pack size
w=10,d=10    4:31/257.7/6.2/391711      5:53/326.9/1.7/255156     155940k
w=10,d=30    4:16/242.7/6.5/378193      5:39/331.6/2.3/437283     143144k
w=10,d=50    4:29/250.1/6.7/554493      5:43/334.5/1.9/362574     140080k

You'll note that it's the same thing; git-gc, git-log doesn't change
much, while the pack size decreases as --depth increases.  We're only
seeing at 10% decrease in the pack size, compared to the 20% decrease
with the git repository, but that's probably because of the HTML and
man branches, which no doubt delta compress really, really well.

I can run a full set of benchmarks, varying both --window and --depth,
and also including a non-pickaxe git-log test as requested, but not
until tonight.  I really don't think we'll see any surprises compared
to the earlier runs, though.

After all, if we just stop and think about how the delta caching
works, and how the repacking algorithm works, it's pretty clear that
there shouldn't be any scaling issues with increasing --depth, and
that increasing --window is just going to be painful, and these should
hold true regardless of the size of the repo.

					- Ted

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH] Add pack.depth option to git-pack-objects and change default depth to 50
  2007-05-08 18:09           ` Theodore Tso
@ 2007-05-08 18:46             ` Nicolas Pitre
  2007-05-09 13:49               ` Theodore Tso
  0 siblings, 1 reply; 40+ messages in thread
From: Nicolas Pitre @ 2007-05-08 18:46 UTC (permalink / raw)
  To: Theodore Tso; +Cc: Johannes Schindelin, Junio C Hamano, Git Mailing List

On Tue, 8 May 2007, Theodore Tso wrote:

> After all, if we just stop and think about how the delta caching
> works, and how the repacking algorithm works, it's pretty clear that
> there shouldn't be any scaling issues with increasing --depth, and
> that increasing --window is just going to be painful, and these should
> hold true regardless of the size of the repo.

The window size has absolutely no effect on the runtime pack access, 
except maybe for the increased number of deltas.  It is really a pack 
time cost.  The delta depth is the opposite: it has no effect on the 
packing time, but it has the potential to slow down runtime access.  But 
the delta base cache is apparently working really well to mitigate that 
cost, as long as it is big enough of course.


Nicolas

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH] Add --no-reuse-delta, --window, and --depth options to
  2007-05-08 15:30   ` [PATCH] Add --no-reuse-delta, --window, and --depth options to Nicolas Pitre
@ 2007-05-08 21:12     ` Junio C Hamano
  2007-05-08 23:59       ` Nicolas Pitre
  0 siblings, 1 reply; 40+ messages in thread
From: Junio C Hamano @ 2007-05-08 21:12 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: Theodore Ts'o, Git Mailing List

Nicolas Pitre <nico@cam.org> writes:

> In which case, it is git-gc that needs to get a bit smarter.  Maybe 
> something like this:

I agree git-gc should be tuned to "one size fits all well enough"
default, rather than getting more complicated parameters to fine
tune its behaviour to satisfy power users.

> @@ -65,7 +65,20 @@ int cmd_gc(int argc, const char **argv, const char *prefix)
>  	if (run_command_v_opt(argv_reflog, RUN_GIT_CMD))
>  		return error(FAILED_RUN, argv_reflog[0]);
>  
> -	if (run_command_v_opt(argv_repack, RUN_GIT_CMD))
> +	if (num_loose_objects() > 0) {
> +		do_repack = 1;
> +	} else {
> +		struct packed_git *p;
> +		unsigned long num_pack = 0;
> +		if (!packed_git)
> +			prepare_packed_git();
> +		for (p = packed_git; p; p = p->next)
> +			if (p->pack_local)
> +				num_pack++;
> +		if (num_pack > 1)
> +			do_repack = 1;
> +	}
> +	if (do_repack && run_command_v_opt(argv_repack, RUN_GIT_CMD))
>  		return error(FAILED_RUN, argv_repack[0]);
>  
>  	if (prune && run_command_v_opt(argv_prune, RUN_GIT_CMD))

Is this even correct?

When your repository is fully packed, if you decided to discard
one of your topic branches with "git branch -D", what does this
code do?  We see no loose objects, we see only one pack, so the
unreachable objects are left in the pack?

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH] Add --no-reuse-delta, --window, and --depth options to
  2007-05-08 21:12     ` Junio C Hamano
@ 2007-05-08 23:59       ` Nicolas Pitre
  0 siblings, 0 replies; 40+ messages in thread
From: Nicolas Pitre @ 2007-05-08 23:59 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Theodore Ts'o, Git Mailing List

On Tue, 8 May 2007, Junio C Hamano wrote:

> Nicolas Pitre <nico@cam.org> writes:
> 
> > @@ -65,7 +65,20 @@ int cmd_gc(int argc, const char **argv, const char *prefix)
> >  	if (run_command_v_opt(argv_reflog, RUN_GIT_CMD))
> >  		return error(FAILED_RUN, argv_reflog[0]);
> >  
> > -	if (run_command_v_opt(argv_repack, RUN_GIT_CMD))
> > +	if (num_loose_objects() > 0) {
> > +		do_repack = 1;
> > +	} else {
> > +		struct packed_git *p;
> > +		unsigned long num_pack = 0;
> > +		if (!packed_git)
> > +			prepare_packed_git();
> > +		for (p = packed_git; p; p = p->next)
> > +			if (p->pack_local)
> > +				num_pack++;
> > +		if (num_pack > 1)
> > +			do_repack = 1;
> > +	}
> > +	if (do_repack && run_command_v_opt(argv_repack, RUN_GIT_CMD))
> >  		return error(FAILED_RUN, argv_repack[0]);
> >  
> >  	if (prune && run_command_v_opt(argv_prune, RUN_GIT_CMD))
> 
> Is this even correct?
> 
> When your repository is fully packed, if you decided to discard
> one of your topic branches with "git branch -D", what does this
> code do?  We see no loose objects, we see only one pack, so the
> unreachable objects are left in the pack?

Right.  OK, scrap that.


Nicolas

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH] Add pack.depth option to git-pack-objects and change default depth to 50
  2007-05-08 17:35         ` Nicolas Pitre
@ 2007-05-09  5:03           ` Junio C Hamano
  0 siblings, 0 replies; 40+ messages in thread
From: Junio C Hamano @ 2007-05-09  5:03 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: Theodore Tso, Git Mailing List

Nicolas Pitre <nico@cam.org> writes:

> On Tue, 8 May 2007, Theodore Tso wrote:
>
>> On Tue, May 08, 2007 at 11:38:46AM -0400, Nicolas Pitre wrote:
>> > On Tue, 8 May 2007, Theodore Ts'o wrote:
>> > 
>> > > Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
>> > 
>> > I'd prefer if tests were performed on the performance impact before 
>> > changing the default depth.  If done separately from this patch then the 
>> > commit log could contain those results as well.
>> 
>> The following results are on a recent git repository, using time to
>> record the real, user, and sys times on the two commands: "git-gc
>> --no-reuse-delta --window=X --depth=Y" and "git log --pretty=oneline
>> -S'object' > /dev/null".  All of these tests were done with a hot
>> cache, so disk speed didn't enter into the calculations.
> [...]
>> My conclusion given this quick benchmark is that it seems to me that
>> changing the defaults of --depth to 50, and keeping --window at 10, is
>> a reasonable thing to do.
>
> Effectively.
>
> I'd still prefer to see the default changed in a patch of its own 
> though.

I'll split the patch into two and apply them separately.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH] Add --no-reuse-delta option to git-gc
  2007-05-08 13:28     ` [PATCH] Add --no-reuse-delta option to git-gc Theodore Ts'o
  2007-05-08 15:35       ` Nicolas Pitre
@ 2007-05-09  5:05       ` Daniel Barkalow
  2007-05-09  8:15         ` Junio C Hamano
  1 sibling, 1 reply; 40+ messages in thread
From: Daniel Barkalow @ 2007-05-09  5:05 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: Junio C Hamano, Git Mailing List

On Tue, 8 May 2007, Theodore Ts'o wrote:

> This allows the user to regenerate the deltas in packs while doing
> a git-gc.  The user could just run git-repack -a -d -f -l after
> running git-gc, but then the first git-repack run by git-gc is
> a bit of waste.

Maybe git-gc should have an option for "compress hard"? It seems to me 
like a two-sizes-fit-all solution would be good here; "git gc" for daily 
use, and "git gc --squeeze" for when you want to make the result as small 
as possible, with compute time not being a major factor. If all you know 
is that you're going to burn this repository onto a stack of CDs and mail 
it to somebody (but don't know about delta reuse, window sizes, and 
depths, let alone good values for these), it would be good to have an 
option where it picks an appropriate different set of defaults for you.

	-Daniel
*This .sig left intentionally blank*

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH] Add --no-reuse-delta option to git-gc
  2007-05-09  5:05       ` Daniel Barkalow
@ 2007-05-09  8:15         ` Junio C Hamano
  2007-05-09  9:02           ` Steven Grimm
  2007-05-09 19:48           ` [PATCH] Add --aggressive option to 'git gc' Theodore Tso
  0 siblings, 2 replies; 40+ messages in thread
From: Junio C Hamano @ 2007-05-09  8:15 UTC (permalink / raw)
  To: Daniel Barkalow; +Cc: Theodore Ts'o, Git Mailing List

Daniel Barkalow <barkalow@iabervon.org> writes:

> On Tue, 8 May 2007, Theodore Ts'o wrote:
>
>> This allows the user to regenerate the deltas in packs while doing
>> a git-gc.  The user could just run git-repack -a -d -f -l after
>> running git-gc, but then the first git-repack run by git-gc is
>> a bit of waste.
>
> Maybe git-gc should have an option for "compress hard"? It seems to me 
> like a two-sizes-fit-all solution would be good here; "git gc" for daily 
> use, and "git gc --squeeze" for when you want to make the result as small 
> as possible, with compute time not being a major factor.

I think that sounds saner and more user friendly than specific
knob to tune "window", "depth" and friends which are too
technical.  It has an added attraction that we can redefine what
exactly "hard" means later.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH] Add --no-reuse-delta option to git-gc
  2007-05-09  8:15         ` Junio C Hamano
@ 2007-05-09  9:02           ` Steven Grimm
  2007-05-09 11:35             ` Other compression?, was " Johannes Schindelin
                               ` (2 more replies)
  2007-05-09 19:48           ` [PATCH] Add --aggressive option to 'git gc' Theodore Tso
  1 sibling, 3 replies; 40+ messages in thread
From: Steven Grimm @ 2007-05-09  9:02 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Daniel Barkalow, Theodore Ts'o, Git Mailing List

Junio C Hamano wrote:
> I think that sounds saner and more user friendly than specific
> knob to tune "window", "depth" and friends which are too
> technical.  It has an added attraction that we can redefine what
> exactly "hard" means later.
>   

On that note, has any thought been given to looking at other compression 
algorithms? Gzip is a great high-speed compressor, but there are others 
out there (some a bit slower, some much slower at both compression and 
decompression) that produce substantially smaller output.

One could even, if one were in a particularly twisted state of mind, 
envision using CPU-intensive compression for less frequently-accessed 
objects and using gzip for active ones, on the theory that the best 
time/space tradeoff is not uniform across all the objects in a git 
repository. Presumably most of us never actually unpack the vast 
majority of objects in a git repository of reasonable age, so the fact 
that it'd take a little longer if we *did* want to unpack them isn't 
much of a downside compared to the upside of reclaiming disk space. That 
would mitigate the impact of using an algorithm that's slow at 
decompression.

I think it'd be kind of neat to have my .git directory shrink by another 
20+%. That's conservative; on maximumcompression.com's test of a mix of 
different file types including images, gzip compresses 64% and the 
best-scoring one does 80%. On English text gzip does 71% and the top 
scorer does 89%. Most of the top-tier compressors are proprietary, but 
there are some open-source ones that do pretty well.

Maybe not worth the added complexity, but I thought I'd toss it out 
there. It probably makes more sense (if it makes any at all) after 
Linus's suggestion to not unpack after cloning is in place. Once the 
upstream has gone to the trouble of CPU-intensive compressing, you 
certainly don't want to force clones to have to spend the time repeating 
the same work.

-Steve (who suspects this is a "yes, we talked this over early in git's 
history" question, but what the heck)

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: Other compression?, was Re: [PATCH] Add --no-reuse-delta option to git-gc
  2007-05-09  9:02           ` Steven Grimm
@ 2007-05-09 11:35             ` Johannes Schindelin
  2007-05-09 15:15             ` Junio C Hamano
  2007-05-09 19:10             ` Shawn O. Pearce
  2 siblings, 0 replies; 40+ messages in thread
From: Johannes Schindelin @ 2007-05-09 11:35 UTC (permalink / raw)
  To: Steven Grimm; +Cc: Git Mailing List

Hi,

On Wed, 9 May 2007, Steven Grimm wrote:

> On that note, has any thought been given to looking at other compression
> algorithms?

I think you could do that. But you would lose compatibility with all 
existing Git clients. IOW nobody could pull from you.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH] Add pack.depth option to git-pack-objects and change default depth to 50
  2007-05-08 18:46             ` Nicolas Pitre
@ 2007-05-09 13:49               ` Theodore Tso
  2007-05-09 14:17                 ` Johannes Schindelin
  0 siblings, 1 reply; 40+ messages in thread
From: Theodore Tso @ 2007-05-09 13:49 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: Johannes Schindelin, Junio C Hamano, Git Mailing List

On Tue, May 08, 2007 at 02:46:35PM -0400, Nicolas Pitre wrote:
> The window size has absolutely no effect on the runtime pack access, 
> except maybe for the increased number of deltas.  It is really a pack 
> time cost.  The delta depth is the opposite: it has no effect on the 
> packing time, but it has the potential to slow down runtime access.  But 
> the delta base cache is apparently working really well to mitigate that 
> cost, as long as it is big enough of course.

Exactly, and the numbers bear out with the theory.  Junio's already
applied the change to up the default to 50, but for the record, here
are the results using a kernel git repository, with times for doing a
git-gc (with the --no-reuse-delta, although I like the suggestion to
change it to just be "--hard", or maybe "--agressive"), and with a git
pickaxe and a git-log with pathname restrictions.  It essentially
confirms that the delta base cache is doing the job just fine, up to
even a depth of 100.  However, there isn't much difference in pack
size between a depth of 50 and 100.

Increasing the window size from 10 to 30 increases the pack run time
by roughly 40%, and saves an extra 5% or so on the pack size.
Increasing the window beyond 30 has ever-smaller diminishing returns,
while the time to repack gets bigger and bigger.

The timing information is real/user/system/minor pagefaults, and as
before, the results are done using a hot cache.  Apologies in advance
for the long lines in the benchmark results.  These results were
generated using a relatively recent, post-2.6.21 kernel git tree on a
Thinkpad T60p laptop with a 2GHz T2500 Intel Core Duo processor.

Regards,

             git-gc                      pack size   git-log -S'object'    git-log include/scsi drivers/scsi
w=10,d=10   5:08.70/293.20/5.56/1027802  162316k  5:31.12/322.20/4.14/153298  0:01.80/1.72/0.03/19077
w=10,d=30   4:14.79/245.46/2.57/398754   149636k  5:41.56/331.21/4.86/517220  0:01.83/1.74/0.05/17880
w=10,d=50   4:31.89/257.16/3.50/576538   146608k  5:51.58/336.63/5.07/259643  0:01.88/1.80/0.04/17658
w=10,d=100  4:35.93/262.77/4.08/715195   144152k  5:58.23/341.26/6.22/624510  0:01.89/1.80/0.04/17571

w=30,d=10   7:23.06/424.08/6.35/1222640  153752k  5:32.85/323.96/2.13/213150  0:01.73/1.64/0.06/18429
w=30,d=30   6:15.76/364.27/3.53/407546   141200k  5:42.29/333.53/2.10/338301  0:01.81/1.71/0.04/17237
w=30,d=50   6:26.85/372.24/5.38/578343   139408k  5:42.21/336.65/1.30/260234  0:01.77/1.70/0.05/17108
w=30,d=100  6:34.31/381.72/5.40/744886   138040k  5:59.03/342.65/4.90/607681  0:01.93/1.79/0.06/17043

w=50,d=10   8:51.08/508.80/5.75/1050358  152168k  5:35.43/327.48/3.70/209655  0:01.70/1.64/0.04/18264
w=50,d=30   8:04.53/471.65/5.58/423755   139000k  5:42.06/335.67/1.55/335333  0:01.78/1.72/0.04/17046
w=50,d=50   8:14.87/479.93/7.09/617310   137244k  5:47.22/339.26/2.73/462412  0:01.79/1.72/0.04/16871
w=50,d=100  8:23.13/490.09/4.10/751742   136152k  5:51.10/343.89/2.29/503573  0:01.83/1.73/0.06/16920

w=100,d=10  12:00.34/702.07/6.27/1167403 150736k  5:34.42/328.88/1.09/207522  0:01.72/1.63/0.07/18024
w=100,d=30  11:56.34/702.36/2.93/436950  137240k  5:42.64/335.82/2.07/422743  0:01.77/1.67/0.05/16852
w=100,d=50  12:20.39/722.99/3.94/671655  135488k  5:47.20/339.38/2.44/468974  0:01.78/1.68/0.03/16621
w=100,d=100 12:37.69/740.63/4.12/733593  134492k  5:52.26/344.16/3.00/636120  0:01.85/1.80/0.03/16681

							- Ted

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH] Add pack.depth option to git-pack-objects and change default depth to 50
  2007-05-09 13:49               ` Theodore Tso
@ 2007-05-09 14:17                 ` Johannes Schindelin
  0 siblings, 0 replies; 40+ messages in thread
From: Johannes Schindelin @ 2007-05-09 14:17 UTC (permalink / raw)
  To: Theodore Tso; +Cc: Nicolas Pitre, Junio C Hamano, Git Mailing List

Hi,

On Wed, 9 May 2007, Theodore Tso wrote:

> [...] here are the results using a kernel git repository [...]

Thank you very much,
Dscho

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH] Add --no-reuse-delta option to git-gc
  2007-05-09  9:02           ` Steven Grimm
  2007-05-09 11:35             ` Other compression?, was " Johannes Schindelin
@ 2007-05-09 15:15             ` Junio C Hamano
  2007-05-09 19:10             ` Shawn O. Pearce
  2 siblings, 0 replies; 40+ messages in thread
From: Junio C Hamano @ 2007-05-09 15:15 UTC (permalink / raw)
  To: Steven Grimm; +Cc: Daniel Barkalow, Theodore Ts'o, Git Mailing List

Steven Grimm <koreth@midwinter.com> writes:

> -Steve (who suspects this is a "yes, we talked this over early in
> git's history" question, but what the heck)

AFAICR, the answer is "heck, we have never bothered about such
low level implementation details as gzip has been good enough
for us.  If you feel so inclined, go wild."

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH] Add --no-reuse-delta option to git-gc
  2007-05-09  9:02           ` Steven Grimm
  2007-05-09 11:35             ` Other compression?, was " Johannes Schindelin
  2007-05-09 15:15             ` Junio C Hamano
@ 2007-05-09 19:10             ` Shawn O. Pearce
  2007-06-10  7:40               ` Sam Vilain
  2 siblings, 1 reply; 40+ messages in thread
From: Shawn O. Pearce @ 2007-05-09 19:10 UTC (permalink / raw)
  To: Steven Grimm
  Cc: Junio C Hamano, Daniel Barkalow, Theodore Ts'o,
	Git Mailing List

Steven Grimm <koreth@midwinter.com> wrote:
> On that note, has any thought been given to looking at other compression 
> algorithms? Gzip is a great high-speed compressor, but there are others 
> out there (some a bit slower, some much slower at both compression and 
> decompression) that produce substantially smaller output.

Its been discussed once before on the list, in very recent history,
but not by a whole lot.  As Junio pointed out, I don't think there
ever really was any discussion of is gzip the best way to deflate the
objects.  I think gzip was just chosen simply because it was readily
available in libz, stable, and has a pretty decent speed/size ratio.
 
> I think it'd be kind of neat to have my .git directory shrink by another 
> 20+%. That's conservative; on maximumcompression.com's test of a mix of 
> different file types including images, gzip compresses 64% and the 
> best-scoring one does 80%. On English text gzip does 71% and the top 
> scorer does 89%. Most of the top-tier compressors are proprietary, but 

Yes.  But in many cases we might actually be able to do even better
by going with a pack-wide dictionary.  Why?

Think about source code structure.  E.g.

  $ git grep --cached 'struct object'| cut -d: -f1|wc -l
     402

So 402 files in git.git use the term 'struct object', and that's just
the current revision I had in my index.  With our current packfile
organization we are likely to store this string at least 402 times.
We'll store it once in each file's delta chain, assuming each
file's blobs largely fall into a single delta chain for that file
(reasonable assumption, but certainly not always true).

That's just one string that does appear somewhat frequently in any
file its used in.  Now try 'unsigned char' (its 944 files, but an
even higher frequency-per-file).

So anyway, for the past year I've been thinking about trying to
implement a blob-level dictionary prototype to see if it helps on a
project like linux-2.6.git, but I haven't gotten to it.  The pack v4
work was about applying that basic dicationary principal to trees
and commits, and I think it pays off nicely there.  Just need to
get it cleaned up, rebased onto current master, and submitted to
the list for wider testing.  ;-)

-- 
Shawn.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* [PATCH] Add --aggressive option to 'git gc'
  2007-05-09  8:15         ` Junio C Hamano
  2007-05-09  9:02           ` Steven Grimm
@ 2007-05-09 19:48           ` Theodore Tso
  2007-05-09 20:19             ` Junio C Hamano
  2007-05-10  7:38             ` Junio C Hamano
  1 sibling, 2 replies; 40+ messages in thread
From: Theodore Tso @ 2007-05-09 19:48 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Daniel Barkalow, Git Mailing List

On Wed, May 09, 2007 at 01:15:07AM -0700, Junio C Hamano wrote:
> > Maybe git-gc should have an option for "compress hard"? It seems to me 
> > like a two-sizes-fit-all solution would be good here; "git gc" for daily 
> > use, and "git gc --squeeze" for when you want to make the result as small 
> > as possible, with compute time not being a major factor.
> 
> I think that sounds saner and more user friendly than specific
> knob to tune "window", "depth" and friends which are too
> technical.  It has an added attraction that we can redefine what
> exactly "hard" means later.

OK, here's a patch that does exactly that.  I choose git-gc
--aggressive, since I thought that was more descriptive than --hard or
--squeeze.  Junio, would you be willing to apply this?

					- Ted

=== Cut here ===

Add --aggressive option to 'git gc'

This option causes 'git gc' to more aggressively optimize the
repository at the cost of taking much more wall clock and CPU time.

Today this option causes git-pack-objects to use --no-use-delta
option, and it allows the --window parameter to be set via the
gc.aggressiveWindow configuration parameter.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
---
 Documentation/config.txt |    5 +++++
 Documentation/git-gc.txt |   16 +++++++++++++++-
 builtin-gc.c             |   35 +++++++++++++++++++++++++++++++++--
 3 files changed, 53 insertions(+), 3 deletions(-)

diff --git a/Documentation/config.txt b/Documentation/config.txt
index ea434af..efcf301 100644
--- a/Documentation/config.txt
+++ b/Documentation/config.txt
@@ -384,6 +384,11 @@ format.suffix::
 	`.patch`. Use this variable to change that suffix (make sure to
 	include the dot if you want it).
 
+gc.aggressiveWindow::
+	The window size parameter used in the delta compression
+	algorithm used by 'git gc --aggressive'.  This defaults
+	to 10.
+
 gc.packrefs::
 	`git gc` does not run `git pack-refs` in a bare repository by
 	default so that older dumb-transport clients can still fetch
diff --git a/Documentation/git-gc.txt b/Documentation/git-gc.txt
index bc16584..56575e8 100644
--- a/Documentation/git-gc.txt
+++ b/Documentation/git-gc.txt
@@ -8,7 +8,7 @@ git-gc - Cleanup unnecessary files and optimize the local repository
 
 SYNOPSIS
 --------
-'git-gc' [--prune]
+'git-gc' [--prune] [--aggressive]
 
 DESCRIPTION
 -----------
@@ -35,6 +35,13 @@ OPTIONS
 	repository at the same time (e.g. never use this option
 	in a cron script).
 
+--aggressive::
+	Usually 'git-gc' runs very quickly while providing good disk
+	space utilization and performance.   This option will cause
+	git-gc to more aggressive optimize the repository at the expense
+	of taking much more time.  The effects of this optimization are
+	persistent, so this option only needs to be sporadically; every
+	few hundred changesets or so.
 
 Configuration
 -------------
@@ -67,6 +74,13 @@ The optional configuration variable 'gc.packrefs' determines if
 is not run in bare repositories by default, to allow older dumb-transport
 clients fetch from the repository,  but this will change in the future.
 
+The optional configuration variable 'gc.aggressiveWindow' controls how
+much time is spent optimizing the delta compression of the objects in
+the repository when the --aggressive option is specified.  The larger
+the value, the more time is spent optimizing the delta compression.  See
+the documentation for the --window' option in gitlink:git-repack[1] for
+more details.  This defaults to 10.
+
 See Also
 --------
 gitlink:git-prune[1]
diff --git a/builtin-gc.c b/builtin-gc.c
index 3b1f8c2..10f92f1 100644
--- a/builtin-gc.c
+++ b/builtin-gc.c
@@ -15,13 +15,15 @@
 
 #define FAILED_RUN "failed to run %s"
 
-static const char builtin_gc_usage[] = "git-gc [--prune]";
+static const char builtin_gc_usage[] = "git-gc [--prune] [--aggressive]";
 
 static int pack_refs = -1;
+static int aggressive_window = -1;
 
+#define MAX_ADD 10
 static const char *argv_pack_refs[] = {"pack-refs", "--prune", NULL};
 static const char *argv_reflog[] = {"reflog", "expire", "--all", NULL};
-static const char *argv_repack[] = {"repack", "-a", "-d", "-l", NULL};
+static const char *argv_repack[MAX_ADD] = {"repack", "-a", "-d", "-l", NULL};
 static const char *argv_prune[] = {"prune", NULL};
 static const char *argv_rerere[] = {"rerere", "gc", NULL};
 
@@ -34,13 +36,34 @@ static int gc_config(const char *var, const char *value)
 			pack_refs = git_config_bool(var, value);
 		return 0;
 	}
+	if (!strcmp(var, "gc.aggressiveWindow")) {
+		aggressive_window = git_config_int(var, value);
+		printf("aggressive_window = %d\n", aggressive_window);
+		return 0;
+	}
 	return git_default_config(var, value);
 }
 
+static append_option(const char **cmd, const char *opt, int max_length)
+{
+	int	i;
+
+	for (i=0; cmd[i]; i++)
+		;
+
+	if (i+2 >= max_length) {
+		fprintf(stderr, "Too many options specified\n");
+		exit(1);
+	}
+	cmd[i++] = opt;
+	cmd[i] = 0;
+}
+
 int cmd_gc(int argc, const char **argv, const char *prefix)
 {
 	int i;
 	int prune = 0;
+	char buf[80];
 
 	git_config(gc_config);
 
@@ -53,6 +76,14 @@ int cmd_gc(int argc, const char **argv, const char *prefix)
 			prune = 1;
 			continue;
 		}
+		if (!strcmp(arg, "--aggressive")) {
+			append_option(argv_repack, "-f", MAX_ADD);
+			if (aggressive_window > 0) {
+				sprintf(buf, "--window=%d", aggressive_window);
+				append_option(argv_repack, buf, MAX_ADD);
+			}
+			continue;
+		}
 		/* perhaps other parameters later... */
 		break;
 	}
-- 
1.5.2.rc2.22.ga39d

^ permalink raw reply related	[flat|nested] 40+ messages in thread

* Re: [PATCH] Add --aggressive option to 'git gc'
  2007-05-09 19:48           ` [PATCH] Add --aggressive option to 'git gc' Theodore Tso
@ 2007-05-09 20:19             ` Junio C Hamano
  2007-05-09 22:22               ` Theodore Tso
  2007-05-10  7:38             ` Junio C Hamano
  1 sibling, 1 reply; 40+ messages in thread
From: Junio C Hamano @ 2007-05-09 20:19 UTC (permalink / raw)
  To: Theodore Tso; +Cc: Daniel Barkalow, Git Mailing List

Theodore Tso <tytso@mit.edu> writes:

> On Wed, May 09, 2007 at 01:15:07AM -0700, Junio C Hamano wrote:
>> > Maybe git-gc should have an option for "compress hard"? It seems to me 
>> > like a two-sizes-fit-all solution would be good here; "git gc" for daily 
>> > use, and "git gc --squeeze" for when you want to make the result as small 
>> > as possible, with compute time not being a major factor.
>> 
>> I think that sounds saner and more user friendly than specific
>> knob to tune "window", "depth" and friends which are too
>> technical.  It has an added attraction that we can redefine what
>> exactly "hard" means later.
>
> OK, here's a patch that does exactly that.  I choose git-gc
> --aggressive, since I thought that was more descriptive than --hard or
> --squeeze.  Junio, would you be willing to apply this?

Willing?  Yes.

It's tricky that it defaults to 10 and still called aggressive.
When the configuration variable is left unspecified, the only
reason it is called aggressive is because it passes '-f' to
repack, right?  It was not very clear at the first sight and I
was about to ask why the default is 10, not higher.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH] Add --aggressive option to 'git gc'
  2007-05-09 20:19             ` Junio C Hamano
@ 2007-05-09 22:22               ` Theodore Tso
  0 siblings, 0 replies; 40+ messages in thread
From: Theodore Tso @ 2007-05-09 22:22 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Daniel Barkalow, Git Mailing List

On Wed, May 09, 2007 at 01:19:54PM -0700, Junio C Hamano wrote:
> It's tricky that it defaults to 10 and still called aggressive.
> When the configuration variable is left unspecified, the only
> reason it is called aggressive is because it passes '-f' to
> repack, right?  It was not very clear at the first sight and I
> was about to ask why the default is 10, not higher.

Yes, it's called aggressive because it causes git-gc to recalculate
the delta's.  So that means that git-gc --aggressive will take around
9-10 times longer than git-gc.  If the user changes
gc.aggressive-window to some larger value, say like 30, then git-gc
--aggressive would take around 20 times longer than git-gc.

So that's why I didn't make the default bigger, but instead left it as
something which could be configured by the user.  Maybe the default
should be larger; I don't strong opposition towards making it be more
like 30, but it seemed to me that doubling the run time for a 5%
decrease in pack size wasn't worth it.

Regards,

					- Ted

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH] Add --aggressive option to 'git gc'
  2007-05-09 19:48           ` [PATCH] Add --aggressive option to 'git gc' Theodore Tso
  2007-05-09 20:19             ` Junio C Hamano
@ 2007-05-10  7:38             ` Junio C Hamano
  1 sibling, 0 replies; 40+ messages in thread
From: Junio C Hamano @ 2007-05-10  7:38 UTC (permalink / raw)
  To: Theodore Tso; +Cc: Daniel Barkalow, Git Mailing List

Theodore Tso <tytso@mit.edu> writes:

> Junio, would you be willing to apply this?

Yes, but ;-).

>  static int pack_refs = -1;
> +static int aggressive_window = -1;
>  
> +#define MAX_ADD 10
>  static const char *argv_pack_refs[] = {"pack-refs", "--prune", NULL};
>  static const char *argv_reflog[] = {"reflog", "expire", "--all", NULL};
> -static const char *argv_repack[] = {"repack", "-a", "-d", "-l", NULL};
> +static const char *argv_repack[MAX_ADD] = {"repack", "-a", "-d", "-l", NULL};
>  static const char *argv_prune[] = {"prune", NULL};
>  static const char *argv_rerere[] = {"rerere", "gc", NULL};
>  
> @@ -34,13 +36,34 @@ static int gc_config(const char *var, const char *value)
>  			pack_refs = git_config_bool(var, value);
>  		return 0;
>  	}
> +	if (!strcmp(var, "gc.aggressiveWindow")) {

Callbacks to git_config() are called with variable names
downcased (except for the 2nd level for 3-level variables.  E.g.
[REMOTE "Foo"] URL = ...; becomes var = "remote.Foo.url", val =
...).

> +		aggressive_window = git_config_int(var, value);
> +		printf("aggressive_window = %d\n", aggressive_window);

Did you mean to leave this in?  Looks like a debug remnant...

> +		return 0;
> +	}
>  	return git_default_config(var, value);
>  }
>  
> +static append_option(const char **cmd, const char *opt, int max_length)

Type is "static void" I presume.

> +{
> +	int	i;

Funny tab here.

> +
> +	for (i=0; cmd[i]; i++)
> +		;

SP around operator '='.

> +
> +	if (i+2 >= max_length) {

Same for '+'

> +		fprintf(stderr, "Too many options specified\n");
> +		exit(1);

die("Too many...specified"); /* note the lack of \n at the end */

> +	}
> +	cmd[i++] = opt;
> +	cmd[i] = 0;

We tend to spell out NULL, although we all are aware that C says
literal '0' is the null pointer.

All fixups are trivial so I'd take the patch and amend locally.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH] Add --no-reuse-delta option to git-gc
  2007-05-09 19:10             ` Shawn O. Pearce
@ 2007-06-10  7:40               ` Sam Vilain
  2007-06-11  1:51                 ` Nicolas Pitre
  0 siblings, 1 reply; 40+ messages in thread
From: Sam Vilain @ 2007-06-10  7:40 UTC (permalink / raw)
  To: Steven Grimm
  Cc: Shawn O. Pearce, Junio C Hamano, Daniel Barkalow,
	Theodore Ts'o, Git Mailing List

Shawn O. Pearce wrote:
>> On that note, has any thought been given to looking at other compression 
>> algorithms? Gzip is a great high-speed compressor, but there are others 
>> out there (some a bit slower, some much slower at both compression and 
>> decompression) that produce substantially smaller output.
>>     
> Its been discussed once before on the list, in very recent history,
> but not by a whole lot.  As Junio pointed out, I don't think there
> ever really was any discussion of is gzip the best way to deflate the
> objects.  I think gzip was just chosen simply because it was readily
> available in libz, stable, and has a pretty decent speed/size ratio.
>   

I think it's the right tool. I just don't see any point in changing to
anything slower for the sake of 20% space saving. Especially bzip2.

Consider this.

Compression works primarily through two things: huffman coding and
string matching. The larger the window for your string matching, the
slower the compression and the more memory you need thrashing your CPU
memory cache when decompressing.

Now I'm not an expert on compression algorithms but I think a large part
of the reason gzip is blindingly faster than bzip2 is because gzip uses
a 64k buffer and bzip2 a 900k one. Only now are CPUs getting caches
large enough to deal with that size of buffer, the rest of the time
you're waiting for your RAM. Moore's law was supposed to make bzip2 fast
one of these days but I'm still waiting.

But with git-repack the window is effectively the size of your
repository. So that blows bzip2 out of the water. Why else can git make
compressed packs smaller than a .bz2 of the raw files? This is the same
observation Shawn makes with the pack-wide dictionary, but he sounds
like he wants to apply it to the huffman coding stage as well as the
current delta/string matching stage. Now that would be interesting...

Anyway it's a free world so be my guest to implement it, I guess if this
was selectable it would only be a minor annoyance waiting a bit longer
pulling from from some repositories, and it would be interesting to see
if it did make a big difference with pack file sizes.

Sam

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH] Add --no-reuse-delta option to git-gc
  2007-06-10  7:40               ` Sam Vilain
@ 2007-06-11  1:51                 ` Nicolas Pitre
  2007-06-11  6:20                   ` Steven Grimm
  2007-06-11 10:20                   ` Johannes Schindelin
  0 siblings, 2 replies; 40+ messages in thread
From: Nicolas Pitre @ 2007-06-11  1:51 UTC (permalink / raw)
  To: Sam Vilain
  Cc: Steven Grimm, Shawn O. Pearce, Junio C Hamano, Daniel Barkalow,
	Theodore Ts'o, Git Mailing List

On Sun, 10 Jun 2007, Sam Vilain wrote:

> Now I'm not an expert on compression algorithms but I think a large part
> of the reason gzip is blindingly faster than bzip2 is because gzip uses
> a 64k buffer and bzip2 a 900k one. Only now are CPUs getting caches
> large enough to deal with that size of buffer, the rest of the time
> you're waiting for your RAM. Moore's law was supposed to make bzip2 fast
> one of these days but I'm still waiting.
> 
> Anyway it's a free world so be my guest to implement it, I guess if this
> was selectable it would only be a minor annoyance waiting a bit longer
> pulling from from some repositories, and it would be interesting to see
> if it did make a big difference with pack file sizes.

It won't happen for a simple reason: to be backward compatible with 
older GIT clients.  If you have your repo compressed with bzip2 and an 
old client pulls it then the server would have to decompress and 
recompress everything with gzip.  If instead your repo remains with gzip 
and a new client asks for bzip2 then you have to recompress as well 
(slow).  So in practice it is best to remain with a single compression 
method.


Nicolas

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH] Add --no-reuse-delta option to git-gc
  2007-06-11  1:51                 ` Nicolas Pitre
@ 2007-06-11  6:20                   ` Steven Grimm
  2007-06-11  6:31                     ` Shawn O. Pearce
  2007-06-11 10:20                   ` Johannes Schindelin
  1 sibling, 1 reply; 40+ messages in thread
From: Steven Grimm @ 2007-06-11  6:20 UTC (permalink / raw)
  To: Nicolas Pitre
  Cc: Sam Vilain, Shawn O. Pearce, Junio C Hamano, Daniel Barkalow,
	Theodore Ts'o, Git Mailing List

Nicolas Pitre wrote:
>
> It won't happen for a simple reason: to be backward compatible with 
> older GIT clients.  If you have your repo compressed with bzip2 and an 
> old client pulls it then the server would have to decompress and 
> recompress everything with gzip.  If instead your repo remains with gzip 
> and a new client asks for bzip2 then you have to recompress as well 
> (slow).  So in practice it is best to remain with a single compression 
> method.
>   

Not that I really think this is all that important (my original question 
was more out of curiosity than anything) but I don't think those are 
really issues.

Even with a better compression scheme, nobody would want to remove gzip 
support; if you are creating a repo that needs to be compatible with old 
clients -- and of course not all repositories need that -- you'd just 
configure it to use gzip (or more likely, *not* configure it to use the 
other method.) There are already options whose documentation says, 
"Don't enable this if you want your repo to be accessed by git clients 
older than version X." In other words, git already has an established 
approach for adding new non-backward-compatible features. Some projects, 
e.g. internal corporate ones, can mandate that everyone upgrade their 
clients, and more importantly, once any new git feature has been out for 
a long enough time, even public projects can reasonably say, "You need 
version X to access our repo." If alternate compression were introduced 
today, in five years would anyone care that there'd be some ancient, 
ancient clients that couldn't use it?

And since the hypothetical new client would have support for both 
compression types, pulling from a gzip-based repo could be accomplished 
totally transparently; you could, one assumes, even pull from a gzip 
repo and pack locally using the other scheme if you felt like it.

-Steve

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH] Add --no-reuse-delta option to git-gc
  2007-06-11  6:20                   ` Steven Grimm
@ 2007-06-11  6:31                     ` Shawn O. Pearce
  0 siblings, 0 replies; 40+ messages in thread
From: Shawn O. Pearce @ 2007-06-11  6:31 UTC (permalink / raw)
  To: Steven Grimm
  Cc: Nicolas Pitre, Sam Vilain, Junio C Hamano, Daniel Barkalow,
	Theodore Ts'o, Git Mailing List

Steven Grimm <koreth@midwinter.com> wrote:
> Nicolas Pitre wrote:
> >
> >It won't happen for a simple reason: to be backward compatible with 
> >older GIT clients.  If you have your repo compressed with bzip2 and an 
> >old client pulls it then the server would have to decompress and 
> >recompress everything with gzip.  If instead your repo remains with gzip 
> >and a new client asks for bzip2 then you have to recompress as well 
> >(slow).  So in practice it is best to remain with a single compression 
> >method.
> >  
> 
> Not that I really think this is all that important (my original question 
> was more out of curiosity than anything) but I don't think those are 
> really issues.
...
> And since the hypothetical new client would have support for both 
> compression types, pulling from a gzip-based repo could be accomplished 
> totally transparently; you could, one assumes, even pull from a gzip 
> repo and pack locally using the other scheme if you felt like it.

Right, I agree with Steven here.

Pack v4 (which is rapidly becoming vaporware it seems) uses such
a radically different encoding for its commit and tree objects
that you cannot send them as-is to a client unless that client
also understands pack v4.  Consequently you need to decompress and
recompress those objects when talking to an older peer; this isn't
very different from a switch to bzip2.

That said, I think switching to a different generic compressor isn't
that interesting.  We can probably do a lot better by organizing
files into clusters, where those clusters have large number of
symbols in common, and then compress everything in the same cluster
with the same pack-wide dictionary.

Specifically I'm thinking that you may be able to cluster the *.h/*.c
into one cluster, and the Makefile into another, and the .gitignore
into a third cluster, and then leverage the large commonalities
between those blobs when building a dictionary and compressing them.
No, I haven't prototyped it out yet.

-- 
Shawn.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH] Add --no-reuse-delta option to git-gc
  2007-06-11  1:51                 ` Nicolas Pitre
  2007-06-11  6:20                   ` Steven Grimm
@ 2007-06-11 10:20                   ` Johannes Schindelin
  2007-06-11 14:01                     ` Nicolas Pitre
  1 sibling, 1 reply; 40+ messages in thread
From: Johannes Schindelin @ 2007-06-11 10:20 UTC (permalink / raw)
  To: Nicolas Pitre
  Cc: Sam Vilain, Steven Grimm, Shawn O. Pearce, Junio C Hamano,
	Daniel Barkalow, Theodore Ts'o, Git Mailing List

Hi,

On Sun, 10 Jun 2007, Nicolas Pitre wrote:

> On Sun, 10 Jun 2007, Sam Vilain wrote:
> 
> > Anyway it's a free world so be my guest to implement it, I guess if 
> > this was selectable it would only be a minor annoyance waiting a bit 
> > longer pulling from from some repositories, and it would be 
> > interesting to see if it did make a big difference with pack file 
> > sizes.
> 
> It won't happen for a simple reason: to be backward compatible with 
> older GIT clients.  If you have your repo compressed with bzip2 and an 
> old client pulls it then the server would have to decompress and 
> recompress everything with gzip.  If instead your repo remains with gzip 
> and a new client asks for bzip2 then you have to recompress as well 
> (slow).  So in practice it is best to remain with a single compression 
> method.

With the extension mechanism we have in place, the client can send what 
kind of compression it supports, and the server can actually refuse to 
send anything if it does not want to recompress.

What I am trying to say: you do not necessarily have to allow every client 
to access that particular repository. I agree that mixed-compression repos 
are evil, but nothing stands in the way of a flag allowing (or 
disallowing) recompression in a different format when fetching.

So if you should decide someday to track data with Git (remember: Generic 
Information Tracker, not just source code), that is particularly unfit for 
compression with gzip, but that you _need_ to store in a different 
compressed manner, you can set up a repository which will _only_ _ever_ 
use that compression.

Of course, you'd need to prepare Git for that, but I could imagine 
something like a music library, which stores everything as ogg encoded 
snippets. It might even use some perception-based hash on small chunks of 
the music, and store the music as tree objects, which concatenate the 
small chunks. I might even try to do this for fun, some day in the distant 
future.

It's a wild world,
Dscho

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH] Add --no-reuse-delta option to git-gc
  2007-06-11 10:20                   ` Johannes Schindelin
@ 2007-06-11 14:01                     ` Nicolas Pitre
  2007-06-11 21:40                       ` Johannes Schindelin
  0 siblings, 1 reply; 40+ messages in thread
From: Nicolas Pitre @ 2007-06-11 14:01 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: Sam Vilain, Steven Grimm, Shawn O. Pearce, Junio C Hamano,
	Daniel Barkalow, Theodore Ts'o, Git Mailing List

On Mon, 11 Jun 2007, Johannes Schindelin wrote:

> Hi,
> 
> On Sun, 10 Jun 2007, Nicolas Pitre wrote:
> 
> > On Sun, 10 Jun 2007, Sam Vilain wrote:
> > 
> > > Anyway it's a free world so be my guest to implement it, I guess if 
> > > this was selectable it would only be a minor annoyance waiting a bit 
> > > longer pulling from from some repositories, and it would be 
> > > interesting to see if it did make a big difference with pack file 
> > > sizes.
> > 
> > It won't happen for a simple reason: to be backward compatible with 
> > older GIT clients.  If you have your repo compressed with bzip2 and an 
> > old client pulls it then the server would have to decompress and 
> > recompress everything with gzip.  If instead your repo remains with gzip 
> > and a new client asks for bzip2 then you have to recompress as well 
> > (slow).  So in practice it is best to remain with a single compression 
> > method.
> 
> With the extension mechanism we have in place, the client can send what 
> kind of compression it supports, and the server can actually refuse to 
> send anything if it does not want to recompress.
> 
> What I am trying to say: you do not necessarily have to allow every client 
> to access that particular repository. I agree that mixed-compression repos 
> are evil, but nothing stands in the way of a flag allowing (or 
> disallowing) recompression in a different format when fetching.

I know.

But is it worthwhile?  I think not.

However I won't stand in the way of anyone who wants to try and provide 
numbers.  I just don't believe this is worthwhile and am not inclined to 
do it.

OK... Well, I just performed a really quick test:

$ mkdir test-bzip2
$ mkdir test-gzip
$ cp git/*.[cho] test-bzip2
$ cp git/*.[cho] test-gzip
$ bzip2 test-bzip2/*
$ gzip test-gzip/*
$ du -s test-bzip2 test-gzip
5016    test-bzip2
4956    test-gzip

It is true that bzip2 is better with large files, but we typically have 
very few of them in a Git repo, and in the presence of large files bzip2 
then becomes _much_ slower than gzip. So, given that the nature of Git 
objects are likely to be small in 98% of the cases due to deltas, it 
appears that bzip2 won't be a gain at all but rather a waste, making a 
poor case for supporting it forever afterwards.

> So if you should decide someday to track data with Git (remember: Generic 
> Information Tracker, not just source code),

Bah... if you please.

> that is particularly unfit for 
> compression with gzip, but that you _need_ to store in a different 
> compressed manner, you can set up a repository which will _only_ _ever_ 
> use that compression.

Maybe.  But you'd better have a concrete data set and result numbers to 
convince me.  Designing software for hypothetical situations before they 
actually exist leads to bloatware.


Nicolas

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH] Add --no-reuse-delta option to git-gc
  2007-06-11 14:01                     ` Nicolas Pitre
@ 2007-06-11 21:40                       ` Johannes Schindelin
  0 siblings, 0 replies; 40+ messages in thread
From: Johannes Schindelin @ 2007-06-11 21:40 UTC (permalink / raw)
  To: Nicolas Pitre
  Cc: Sam Vilain, Steven Grimm, Shawn O. Pearce, Junio C Hamano,
	Daniel Barkalow, Theodore Ts'o, Git Mailing List

Hi,

On Mon, 11 Jun 2007, Nicolas Pitre wrote:

> But you'd better have a concrete data set and result numbers to convince 
> me.  Designing software for hypothetical situations before they actually 
> exist leads to bloatware.

Right you are. So I'll not mention it until I have it ;-)

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 40+ messages in thread

end of thread, other threads:[~2007-06-11 21:44 UTC | newest]

Thread overview: 40+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2007-05-08  2:54 [PATCH] Add --no-reuse-delta, --window, and --depth options to git-gc Theodore Ts'o
2007-05-08  3:13 ` Nicolas Pitre
2007-05-08  3:21   ` Theodore Tso
2007-05-08  3:38     ` Dana How
2007-05-08  4:43     ` Junio C Hamano
2007-05-08 13:46       ` Nicolas Pitre
2007-05-08 13:28 ` [PATCH] Add --no-reuse-delta, --window, and --depth options to Theodore Ts'o
2007-05-08 13:28   ` [PATCH] Add pack.depth option to git-pack-objects and change default depth to 50 Theodore Ts'o
2007-05-08 13:28     ` [PATCH] Add --no-reuse-delta option to git-gc Theodore Ts'o
2007-05-08 15:35       ` Nicolas Pitre
2007-05-09  5:05       ` Daniel Barkalow
2007-05-09  8:15         ` Junio C Hamano
2007-05-09  9:02           ` Steven Grimm
2007-05-09 11:35             ` Other compression?, was " Johannes Schindelin
2007-05-09 15:15             ` Junio C Hamano
2007-05-09 19:10             ` Shawn O. Pearce
2007-06-10  7:40               ` Sam Vilain
2007-06-11  1:51                 ` Nicolas Pitre
2007-06-11  6:20                   ` Steven Grimm
2007-06-11  6:31                     ` Shawn O. Pearce
2007-06-11 10:20                   ` Johannes Schindelin
2007-06-11 14:01                     ` Nicolas Pitre
2007-06-11 21:40                       ` Johannes Schindelin
2007-05-09 19:48           ` [PATCH] Add --aggressive option to 'git gc' Theodore Tso
2007-05-09 20:19             ` Junio C Hamano
2007-05-09 22:22               ` Theodore Tso
2007-05-10  7:38             ` Junio C Hamano
2007-05-08 15:38     ` [PATCH] Add pack.depth option to git-pack-objects and change default depth to 50 Nicolas Pitre
2007-05-08 16:30       ` Theodore Tso
2007-05-08 16:49         ` Johannes Schindelin
2007-05-08 18:09           ` Theodore Tso
2007-05-08 18:46             ` Nicolas Pitre
2007-05-09 13:49               ` Theodore Tso
2007-05-09 14:17                 ` Johannes Schindelin
2007-05-08 17:07         ` Dana How
2007-05-08 17:35         ` Nicolas Pitre
2007-05-09  5:03           ` Junio C Hamano
2007-05-08 15:30   ` [PATCH] Add --no-reuse-delta, --window, and --depth options to Nicolas Pitre
2007-05-08 21:12     ` Junio C Hamano
2007-05-08 23:59       ` Nicolas Pitre

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).