* [PATCH v3] Prevent megablobs from gunking up git packs
From: Dana How @ 2007-05-26 19:16 UTC
To: Junio C Hamano; +Cc: Git Mailing List, danahow
Extremely large blobs distort general-purpose git packfiles.
These megablobs can be either stored in separate "kept" packfiles,
or left as loose objects. Here we add some features to help
either approach.
This patch implements the following:
1. git pack-objects accepts --max-blob-size=N, with the effect that
only loose blobs smaller than N kB are written to the packfile(s).
If an already packed blob violates this limit (perhaps these are
fast-import packs or max-blob-size was reduced), it _is_ passed
through if from a local pack and no loose copy exists.
2. git repack inspects repack.maxblobsize . If set, its
value is passed to git pack-objects on the command line.
--max-blob-size=N is also accepted by git repack.
3. No other git pack-objects caller uses this feature or sees any change.
During pack *creation* this minimizes the inclusion and deltification of megablobs.
During pack *use* this feature helps performance by keeping metadata
in a single smaller packfile, and possibly reducing the number of index
files that must be read. Megablobs could be separately packed, or
left as loose objects.
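For illustration, here is a minimal usage sketch with this patch applied
(the 512 kB threshold and the pack base name are arbitrary):

  # keep blobs of 512 kB and larger out of packs created by git-repack
  git config repack.maxblobsize 512
  git repack -a -d                     # reads repack.maxblobsize

  # or override the configured value for a single run
  git repack -a -d --max-blob-size=1024

  # git-pack-objects itself takes the same switch (but not with --stdout)
  git rev-list --objects --all |
  git pack-objects --max-blob-size=512 /tmp/mypack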
Documentation has been updated and operation with git-pack-objects'
--stdout is prevented. This patch is based on "next".
Signed-off-by: Dana L. How <danahow@gmail.com>
---
Documentation/config.txt | 6 ++++++
Documentation/git-pack-objects.txt | 5 +++++
Documentation/git-repack.txt | 9 +++++++++
builtin-pack-objects.c | 33 ++++++++++++++++++++++++++++-----
cache.h | 1 +
git-repack.sh | 9 ++++++++-
sha1_file.c | 2 +-
7 files changed, 58 insertions(+), 7 deletions(-)
diff --git a/Documentation/config.txt b/Documentation/config.txt
index 179cb17..4a14f05 100644
--- a/Documentation/config.txt
+++ b/Documentation/config.txt
@@ -599,6 +599,12 @@ remotes.<group>::
The list of remotes which are fetched by "git remote update
<group>". See gitlink:git-remote[1].
+repack.maxblobsize::
+ Prevent gitlink:git-repack[1] from newly packing blobs larger than
+ the specified number in kB, unless overridden by --max-blob-size=N switch.
+ Affected blobs will still be repacked if from a local pack and no loose
+ copy exists. Defaults to zero which means no maximum size is in effect.
+
repack.usedeltabaseoffset::
Allow gitlink:git-repack[1] to create packs that uses
delta-base offset. Defaults to false.
diff --git a/Documentation/git-pack-objects.txt b/Documentation/git-pack-objects.txt
index cfe127a..9b2e33d 100644
--- a/Documentation/git-pack-objects.txt
+++ b/Documentation/git-pack-objects.txt
@@ -85,6 +85,11 @@ base-name::
times to get to the necessary object.
The default value for --window is 10 and --depth is 50.
+--max-blob-size=<n>::
+ Maximum size of newly packed blobs, expressed in kB.
+ The default is unlimited. Affected blobs will still be repacked
+ if from a local pack and no loose copy exists.
+
--max-pack-size=<n>::
Maximum size of each output packfile, expressed in MiB.
If specified, multiple packfiles may be created.
diff --git a/Documentation/git-repack.txt b/Documentation/git-repack.txt
index 2847c9b..b9d47e1 100644
--- a/Documentation/git-repack.txt
+++ b/Documentation/git-repack.txt
@@ -65,6 +65,11 @@ OPTIONS
to be applied that many times to get to the necessary object.
The default value for --window is 10 and --depth is 50.
+--max-blob-size=<n>::
+ Maximum size of newly packed blobs, expressed in kB.
+ The default is unlimited. Affected blobs will still be repacked
+ if from a local pack and no loose copy exists.
+
--max-pack-size=<n>::
Maximum size of each output packfile, expressed in MiB.
If specified, multiple packfiles may be created.
@@ -84,6 +89,10 @@ be able to read (this includes repositories from which packs can
be copied out over http or rsync, and people who obtained packs
that way can try to use older git with it).
+The configuration variable `repack.MaxBlobSize` provides the
+default for the --max-blob-size option if set. The latter
+takes precedence.
+
Author
------
diff --git a/builtin-pack-objects.c b/builtin-pack-objects.c
index 19b0aa1..59be849 100644
--- a/builtin-pack-objects.c
+++ b/builtin-pack-objects.c
@@ -17,7 +17,7 @@
static const char pack_usage[] = "\
git-pack-objects [{ -q | --progress | --all-progress }] [--max-pack-size=N] \n\
- [--local] [--incremental] [--window=N] [--depth=N] \n\
+ [--local] [--incremental] [--window=N] [--depth=N] [--max-blob-size=N]\n\
[--no-reuse-delta] [--no-reuse-object] [--delta-base-offset] \n\
[--non-empty] [--revs [--unpacked | --all]*] [--reflog] \n\
[--stdout | base-name] [<ref-list | <object-list]";
@@ -75,6 +75,7 @@ static int num_preferred_base;
static struct progress progress_state;
static int pack_compression_level = Z_DEFAULT_COMPRESSION;
static int pack_compression_seen;
+static uint32_t max_blob_size;
/*
* The object names in objects array are hashed with this hashtable,
@@ -371,8 +372,6 @@ static unsigned long write_object(struct sha1file *f,
pack_size_limit - write_offset : 0;
/* no if no delta */
int usable_delta = !entry->delta ? 0 :
- /* yes if unlimited packfile */
- !pack_size_limit ? 1 :
/* no if base written to previous pack */
entry->delta->offset == (off_t)-1 ? 0 :
/* otherwise double-check written to this
@@ -408,7 +407,7 @@ static unsigned long write_object(struct sha1file *f,
buf = read_sha1_file(entry->sha1, &type, &size);
if (!buf)
die("unable to read %s", sha1_to_hex(entry->sha1));
- if (size != entry->size)
+ if (size != entry->size && type == obj_type)
die("object %s size inconsistency (%lu vs %lu)",
sha1_to_hex(entry->sha1), size, entry->size);
if (usable_delta) {
@@ -564,6 +563,17 @@ static off_t write_one(struct sha1file *f,
return 0;
}
+ /* refuse to include as many megablobs as possible */
+ if (max_blob_size && e->size >= max_blob_size) {
+ struct stat st;
+ /* skip if unpacked, remotely packed, or loose anywhere */
+ if (!e->in_pack || !e->in_pack->pack_local || find_sha1_file(e->sha1, &st)) {
+ e->offset = (off_t)-1; /* might drop reused delta base if mbs less */
+ written++;
+ return offset;
+ }
+ }
+
e->offset = offset;
size = write_object(f, e, offset);
if (!size) {
@@ -1422,13 +1432,16 @@ static int try_delta(struct unpacked *trg, struct unpacked *src,
/* Now some size filtering heuristics. */
trg_size = trg_entry->size;
+ src_size = src_entry->size;
+ /* prevent use if could be later dropped from packfile */
+ if (max_blob_size && (trg_size >= max_blob_size || src_size >= max_blob_size))
+ return 0;
max_size = trg_size/2 - 20;
max_size = max_size * (max_depth - src_entry->depth) / max_depth;
if (max_size == 0)
return 0;
if (trg_entry->delta && trg_entry->delta_size <= max_size)
max_size = trg_entry->delta_size-1;
- src_size = src_entry->size;
sizediff = src_size < trg_size ? trg_size - src_size : 0;
if (sizediff >= max_size)
return 0;
@@ -1735,6 +1748,13 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
incremental = 1;
continue;
}
+ if (!prefixcmp(arg, "--max-blob-size=")) {
+ char *end;
+ max_blob_size = strtoul(arg+16, &end, 0) * 1024;
+ if (!arg[16] || *end)
+ usage(pack_usage);
+ continue;
+ }
if (!prefixcmp(arg, "--compression=")) {
char *end;
int level = strtoul(arg+14, &end, 0);
@@ -1855,6 +1875,9 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
if (!pack_to_stdout && thin)
die("--thin cannot be used to build an indexable pack.");
+ if (pack_to_stdout && max_blob_size)
+ die("--max-blob-size cannot be used to build a pack for transfer.");
+
prepare_packed_git();
if (progress)
diff --git a/cache.h b/cache.h
index 4994d03..424b321 100644
--- a/cache.h
+++ b/cache.h
@@ -356,6 +356,7 @@ extern int move_temp_to_file(const char *tmpfile, const char *filename);
extern int has_sha1_pack(const unsigned char *sha1, const char **ignore);
extern int has_sha1_file(const unsigned char *sha1);
+extern char *find_sha1_file(const unsigned char *sha1, struct stat *st);
extern void *map_sha1_file(const unsigned char *sha1, unsigned long *);
extern int has_pack_file(const unsigned char *sha1);
diff --git a/git-repack.sh b/git-repack.sh
index 4ea6e5b..6b4e1af 100755
--- a/git-repack.sh
+++ b/git-repack.sh
@@ -8,7 +8,7 @@ SUBDIRECTORY_OK='Yes'
. git-sh-setup
no_update_info= all_into_one= remove_redundant=
-local= quiet= no_reuse= extra=
+local= quiet= no_reuse= extra= max_blob_size=
while case "$#" in 0) break ;; esac
do
case "$1" in
@@ -18,6 +18,7 @@ do
-q) quiet=-q ;;
-f) no_reuse=--no-reuse-object ;;
-l) local=--local ;;
+ --max-blob-size=*) extra="$extra $1" max_blob_size=t ;;
--max-pack-size=*) extra="$extra $1" ;;
--window=*) extra="$extra $1" ;;
--depth=*) extra="$extra $1" ;;
@@ -35,6 +36,12 @@ true)
extra="$extra --delta-base-offset" ;;
esac
+# handle blob limiting
+if [ -z "$max_blob_size" ]; then
+ mbs="`git config --int repack.maxblobsize`"
+ [ -n "$mbs" ] && extra="$extra --max-blob-size=$mbs"
+fi
+
PACKDIR="$GIT_OBJECT_DIRECTORY/pack"
PACKTMP="$GIT_OBJECT_DIRECTORY/.tmp-$$-pack"
rm -f "$PACKTMP"-*
diff --git a/sha1_file.c b/sha1_file.c
index e4c3288..17e9dbf 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -387,7 +387,7 @@ void prepare_alt_odb(void)
read_info_alternates(get_object_directory(), 0);
}
-static char *find_sha1_file(const unsigned char *sha1, struct stat *st)
+char *find_sha1_file(const unsigned char *sha1, struct stat *st)
{
char *name = sha1_file_name(sha1);
struct alternate_object_database *alt;
--
1.5.2.764.g7ae34
* Re: [PATCH v3] Prevent megablobs from gunking up git packs
From: Junio C Hamano @ 2007-05-26 22:51 UTC
To: Dana How; +Cc: Git Mailing List
Dana How <danahow@gmail.com> writes:
> diff --git a/builtin-pack-objects.c b/builtin-pack-objects.c
> index 19b0aa1..59be849 100644
> --- a/builtin-pack-objects.c
> +++ b/builtin-pack-objects.c
> ...
> @@ -371,8 +372,6 @@ static unsigned long write_object(struct sha1file *f,
> pack_size_limit - write_offset : 0;
> /* no if no delta */
> int usable_delta = !entry->delta ? 0 :
> - /* yes if unlimited packfile */
> - !pack_size_limit ? 1 :
> /* no if base written to previous pack */
> entry->delta->offset == (off_t)-1 ? 0 :
> /* otherwise double-check written to this
> @@ -408,7 +407,7 @@ static unsigned long write_object(struct sha1file *f,
> buf = read_sha1_file(entry->sha1, &type, &size);
> if (!buf)
> die("unable to read %s", sha1_to_hex(entry->sha1));
> - if (size != entry->size)
> + if (size != entry->size && type == obj_type)
> die("object %s size inconsistency (%lu vs %lu)",
> sha1_to_hex(entry->sha1), size, entry->size);
> if (usable_delta) {
I do not quite get how these two hunks relate to the topic of
this patch. Care to enlighten?
> @@ -564,6 +563,17 @@ static off_t write_one(struct sha1file *f,
> return 0;
> }
>
> + /* refuse to include as many megablobs as possible */
> + if (max_blob_size && e->size >= max_blob_size) {
> + struct stat st;
> + /* skip if unpacked, remotely packed, or loose anywhere */
> + if (!e->in_pack || !e->in_pack->pack_local || find_sha1_file(e->sha1, &st)) {
> + e->offset = (off_t)-1; /* might drop reused delta base if mbs less */
> + written++;
> + return offset;
> + }
> + }
> +
> e->offset = offset;
> size = write_object(f, e, offset);
> if (!size) {
I thought that you are simply ignoring the "naughty blobs"---why
should it be done this late in the call sequence? I haven't
followed the existing code nor your patch closely, but I wonder
why the filtering is simply done inside (or by the caller of)
add_object_entry(). You would need to do sha1_object_info()
much earlier than the current code does, though.
* Re: [PATCH v3] Prevent megablobs from gunking up git packs
From: Dana How @ 2007-05-26 23:48 UTC
To: Junio C Hamano; +Cc: Git Mailing List, danahow
On 5/26/07, Junio C Hamano <junkio@cox.net> wrote:
> Dana How <danahow@gmail.com> writes:
> > diff --git a/builtin-pack-objects.c b/builtin-pack-objects.c
> > @@ -371,8 +372,6 @@ static unsigned long write_object(struct sha1file *f,
> > /* no if no delta */
> > int usable_delta = !entry->delta ? 0 :
> > - /* yes if unlimited packfile */
> > - !pack_size_limit ? 1 :
> > /* no if base written to previous pack */
> > entry->delta->offset == (off_t)-1 ? 0 :
> > /* otherwise double-check written to this
> > @@ -408,7 +407,7 @@ static unsigned long write_object(struct sha1file *f,
> > buf = read_sha1_file(entry->sha1, &type, &size);
> > if (!buf)
> > die("unable to read %s", sha1_to_hex(entry->sha1));
> > - if (size != entry->size)
> > + if (size != entry->size && type == obj_type)
> > die("object %s size inconsistency (%lu vs %lu)",
> > sha1_to_hex(entry->sha1), size, entry->size);
>
> I do not quite get how these two hunks relate to the topic of
> this patch. Care to enlighten?
No problem.
When the code decides that a blob should not be written to the output file,
then I must make sure it is not used as a delta base. A large blob
that triggered the size test and _was_ a delta base could be the result
of maxblobsize decreasing or being newly specified,
in either case without -f/--no-reuse-object,
and we need to tolerate the user forgetting that option.
To make sure that it is not so used, I re-use the trick from maxpacksize
which ensures that a delta base is not in the previous split pack:
I set the offset field to -1. Unfortunately, I only checked for this magic
value when computing usable_delta if pack_size_limit was set. It turns
out that the test doesn't need to be conditional on pack_size_limit;
it works in all cases. Since I need the test when maxblobsize is
specified and maxpacksize isn't, I deleted the pack_size_limit condition.
Now for the second hunk. The facts above mean we could have marked
this entry as a re-used delta, but we are unable to re-use the delta
because its delta base is not being written to this pack. So we fall into
the !to_reuse case even though the size field in the object_entry is the
size of the delta, not the object. We can detect this by the type coming
from read_sha1_file being unequal to the type set from the pack (which is
one of OBJ_{REF,OFS}_DELTA). So I disable the size matching
test in this case.
> > @@ -564,6 +563,17 @@ static off_t write_one(struct sha1file *f,
> > + /* refuse to include as many megablobs as possible */
> > + if (max_blob_size && e->size >= max_blob_size) {
> > + struct stat st;
> > + /* skip if unpacked, remotely packed, or loose anywhere */
> > + if (!e->in_pack || !e->in_pack->pack_local || find_sha1_file(e->sha1, &st)) {
> > + e->offset = (off_t)-1; /* might drop reused delta base if mbs less */
> > + written++;
> > + return offset;
> > + }
> > + }
> > +
>
> I thought that you are simply ignoring the "naughty blobs"---why
> should it be done this late in the call sequence? I haven't
> followed the existing code or your patch closely, but I wonder
> why the filtering is not simply done inside (or by the caller of)
> add_object_entry(). You would need to do sha1_object_info()
> much earlier than the current code does, though.
Recently Nicolas Pitre improved the code as follows:
(1) tree-walking etc. which calls add_object_entry.
We learn sha1, type, name(path), pack&offset, no_try_delta
during this step.
(2) NEW: sort a table of pointers to these objects by pack_offset.
(3) Now call check_object on each object, but in the order
determined in (2). We learn each object's size during
this step. This requires us to inspect each object's header
in the pack(s).
The result is that we smoothly scan through the pack(s),
instead of jumping all over the place.
If I move sha1_object_info earlier, before (2), then I undo
his optimization. This fact ultimately justifies the first two
hunks that you commented on, since it means the objects must
appear in the object list _before_ we can decide not to write them,
and thus we need to handle objects that are not written and all
their consequences (which didn't seem too strange to me, since the
code already has preferred bases).
Thanks,
--
Dana L. How danahow@gmail.com +1 650 804 5991 cell
* Re: [PATCH v3] Prevent megablobs from gunking up git packs
From: Nicolas Pitre @ 2007-05-27 3:15 UTC
To: Dana How; +Cc: Junio C Hamano, Git Mailing List
On Sat, 26 May 2007, Dana How wrote:
>
> Extremely large blobs distort general-purpose git packfiles.
> These megablobs can be either stored in separate "kept" packfiles,
> or left as loose objects. Here we add some features to help
> either approach.
>
> This patch implements the following:
> 1. git pack-objects accepts --max-blob-size=N, with the effect that
> only loose blobs smaller than N kB are written to the packfile(s).
> If an already packed blob violates this limit (perhaps these are
> fast-import packs or max-blob-size was reduced), it _is_ passed
> through if from a local pack and no loose copy exists.
I'm still not convinced by this feature. Is it really necessary?
Wouldn't it be better if the --max-blob-size=N was instead a
--trailing-blob-size=N to specify which blobs are considered "naughty"
per our previous discussion? This way there is no incoherency with
already packed blobs larger than the threshold that you have to pass
through.
This, combined with the option to disable deltification of large blobs
(both options can be specified with the same size), and possibly the
pack size limit, would solve your large blob issue, wouldn't it?
Nicolas
* Re: [PATCH v3] Prevent megablobs from gunking up git packs
From: Dana How @ 2007-05-27 5:46 UTC
To: Nicolas Pitre; +Cc: Junio C Hamano, Git Mailing List, danahow
On 5/26/07, Nicolas Pitre <nico@cam.org> wrote:
> On Sat, 26 May 2007, Dana How wrote:
> > Extremely large blobs distort general-purpose git packfiles.
> > These megablobs can be either stored in separate "kept" packfiles,
> > or left as loose objects. Here we add some features to help
> > either approach.
> >
> > This patch implements the following:
> > 1. git pack-objects accepts --max-blob-size=N, with the effect that
> > only loose blobs smaller than N kB are written to the packfile(s).
> > If an already packed blob violates this limit (perhaps these are
> > fast-import packs or max-blob-size was reduced), it _is_ passed
> > through if from a local pack and no loose copy exists.
>
> I'm still not convinced by this feature. Is it really necessary?
>
> Wouldn't it be better if the --max-blob-size=N was instead a
> --trailing-blob-size=N to specify which blobs are considered "naughty"
> per our previous discussion? This way there is no incoherency with
> already packed blobs larger than the threshold that you have to pass
> through.
>
> This, combined with the option to disable deltification of large blobs
> (both options can be specified with the same size), and possibly the
> pack size limit, would solve your large blob issue, wouldn't it?
Unfortunately, it doesn't.
There are at least three reasonable ways to handle large blobs:
(1) git-repack -a repacks everything. Naughty blobs get pushed to
the end as discussed (possibly dominating later split packs).
(2) Naughty blobs accumulate in separate "kept" packs.
git-repack -a only repacks nice blobs. Separate scripts,
or new options to git-repack, are needed to repack the "kept" packs.
A number of people have discussed ideas like this.
(3) Naughty blobs are kept loose.
We have 255GB compressed in our Perforce repository and
it grows by 2GB+ per week. Although I'm only considering bringing ~10%
of this into git, it would be good for me to be able to argue that
I could bring more. Every day the equivalent of ~1K+ blobs is committed.
How often should I repack the shared repository [that replaces Perforce]?
With this level of traffic I believe I should do it every night.
I've been discussing these plans with IT here since they maintain
everything else.
They would like any part of the database that is going to be reorganized
and replaced to be backed up first. If only (1) is available, and I
repack every night, then I need to back up the entire repository
every night as well.
If I use (2) or (3), then I back up just the repacked portion each night,
back up the kept packs only when they are repacked (on a slower schedule),
and/or back up the loose blobs on a similar schedule.
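For example, the nightly job could then copy only the non-kept packs
(a sketch only; the object-directory and backup paths are placeholders,
and kept packs are assumed to carry the usual .keep marker files):

  objects=/path/to/repo.git/objects    # placeholder

  # nightly: copy packs that lack a .keep marker, plus their indexes
  cd "$objects/pack" &&
  for p in pack-*.pack
  do
      base=${p%.pack}
      [ -e "$base.keep" ] && continue  # kept packs: slower schedule
      cp -p "$base.pack" "$base.idx" /backup/nightly/
  done

  # slower schedule: archive the loose objects (the megablobs, under (3))
  cd "$objects" &&
  tar cf /backup/weekly/loose-objects.tar [0-9a-f][0-9a-f]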
Besides this backup issue, I simply don't want to have to repack _all_
of such a large repository each night. With (1), nightly repacks get longer
and longer, and harder to schedule.
I think the minimum features needed to support (2) and (3) are the same:
(a) An easy way to prevent loose blobs exceeding some size limit
from migrating into "nice" packs;
(b) A way to prevent packed objects from being copied when
(i) they no longer meet the (new or reduced) size limit AND
(ii) they exist in some other safe form in the repository.
The behavior of --max-blob-size=N in this patch provides both of these
while deleting other behavior people didn't like.
You mentioned "incoherency" above;
I'm not too sure how to proceed on that.
If you have a more coherent way to provide (a) and (b) above,
please let me know.
Thanks,
--
Dana L. How danahow@gmail.com +1 650 804 5991 cell
* Re: [PATCH v3] Prevent megablobs from gunking up git packs
From: Nicolas Pitre @ 2007-05-27 15:09 UTC
To: Dana How; +Cc: Junio C Hamano, Git Mailing List
On Sat, 26 May 2007, Dana How wrote:
> I've been discussing these plans with IT here since they maintain
> everything else.
> They would like any part of the database that is going to be reorganized
> and replaced to be backed up first. If only (1) is available, and I
> repack every
> night, then I need to back up the entire repository every night as well.
Why so? The initial repack would create a set of packs where the last
packs to be produced will contain large blobs that you don't have to
ever repack. Or maybe you produce large blobs every day and you want to
prevent those from entering the pack up front?
> If I use (2) or (3), then I back up just the repacked portion each night,
> back up the kept packs only when they are repacked (on a slower schedule),
> and/or back up the loose blobs on a similar schedule.
>
> Besides this back up issue, I simply don't want to have to repack _all_
> of such a large repository each night. With (1), nightly repacks get longer
> and longer, and harder to schedule.
>
> I think the minimum features needed to support (2) and (3) are the same:
> (a) An easy way to prevent loose blobs exceeding some size limit
> from migrating into "nice" packs;
> (b) A way to prevent packed objects from being copied when
> (i) they no longer meet the (new or reduced) size limit AND
> (ii) they exist in some other safe form in the repository.
> The behavior of --max-blob-size=N in this patch provides both of these
> while deleting other behavior people didn't like.
>
> You mentioned "incoherency" above;
> I'm not too sure how to proceed on that.
> If you have a more coherent way to provide (a) and (b) above,
> please let me know.
I think it boils down to a question of proper wording. Describing this
as max-blob-size is misleading if in the end you can still end up with
larger blobs in your pack. I think there are two solutions to this
incoherency: either the feature is called something else to reflect the
fact that it concerns itself only with the migration of loose blobs into
the packed space (I cannot come up with a good name, though), or the whole
pack-objects process is aborted with an error whenever the max-blob-size
condition cannot be satisfied because large blobs exist in packed form
only, indicating that a separate large-blob extraction step is required.
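For what it's worth, checking a pack for over-threshold blobs after the
fact is straightforward (a sketch; the 512 kB limit and repository path
are arbitrary, and later columns of git-verify-pack -v output vary between
git versions, but the first three are sha1, type and size):

  # list packed blobs whose uncompressed size is 512 kB or more
  for idx in /path/to/repo.git/objects/pack/pack-*.idx
  do
      git verify-pack -v "$idx"
  done | awk '$2 == "blob" && $3 >= 524288 { print $1, $3 }'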
Nicolas