* [PATCH] Prevent megablobs from gunking up git packs
From: Dana How @ 2007-05-22 6:14 UTC (permalink / raw)
To: Junio C Hamano; +Cc: Git Mailing List, danahow
Using fast-import and repack with the max-pack-size patch,
3628 commits were imported from Perforce comprising
100.35GB (uncompressed) in 38829 blobs, and saved in
7 packfiles of 12.5GB total (--window=0 and --depth=0 were
used due to runtime limits). When using these packfiles,
several git commands showed very large process sizes,
and some slowdowns (compared to comparable operations
on the linux kernel repo) were also apparent.
git stores data in loose blobs or in packfiles. The former
has essentially now become an exception mechanism, to store
exceptionally *young* blobs. Why not use this to store
exceptionally *large* blobs as well? This allows us to
re-use all the "exception" machinery with only a small change.
Repacking the entire repository with a max-blob-size of 256KB
resulted in a single 13.1MB packfile, as well as 2853 loose
objects totaling 15.4GB compressed and 100.08GB uncompressed,
11 files per objects/xx directory on average. All of this was created
in half the runtime of the previous attempt, yet with the standard
--window=10 and --depth=50 parameters. The data in the
packfile was 270MB uncompressed in 35976 blobs. Operations
such as "git-log --pretty=oneline" were about 30X faster
on a cold cache and 2 to 3X faster otherwise. Process sizes
remained reasonable.
This patch implements the following:
1. git pack-objects takes a new --max-blob-size=N flag,
with the effect that only blobs smaller than N KB are written
to the packfile(s). If a blob was in a pack but violates
this limit (perhaps the packs were created by fast-import
or max-blob-size was reduced), then a new loose object
is written out if needed so the data is not lost.
2. git repack inspects repack.maxblobsize. If set, its
value is passed to git pack-objects on the command line
(see the example below). The user should change
repack.maxblobsize, NOT specify --max-blob-size=N directly.
3. No other caller of git pack-objects supplies this new flag,
so other callers see no change.
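For illustration, the intended use (with this patch applied) looks
something like the following; 256 is just the KB threshold used in the
experiment above, not a recommended value:
$ git config repack.maxblobsize 256   # limit in KB (example value), per point 2
$ git repack -a -d                    # repack passes --max-blob-size=256 along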
This patch is on top of the earlier max-pack-size patch,
because I thought I needed some behavior it supplied,
but could be rebased on master if desired.
Signed-off-by: Dana L. How <danahow@gmail.com>
---
builtin-pack-objects.c | 33 +++++++++++++++++++++++++++++++--
cache.h | 2 ++
git-repack.sh | 4 ++++
sha1_file.c | 17 +++++++++++++++--
4 files changed, 52 insertions(+), 4 deletions(-)
diff --git a/builtin-pack-objects.c b/builtin-pack-objects.c
index 930b57a..e88f6b7 100644
--- a/builtin-pack-objects.c
+++ b/builtin-pack-objects.c
@@ -16,7 +16,7 @@
static const char pack_usage[] = "\
git-pack-objects [{ -q | --progress | --all-progress }] [--max-pack-size=N] \n\
- [--local] [--incremental] [--window=N] [--depth=N] \n\
+ [--local] [--incremental] [--window=N] [--depth=N] [--max-blob-size=N]\n\
[--no-reuse-delta] [--no-reuse-object] [--delta-base-offset] \n\
[--non-empty] [--revs [--unpacked | --all]*] [--reflog] \n\
[--stdout | base-name] [<ref-list | <object-list]";
@@ -73,6 +73,7 @@ static int num_preferred_base;
static struct progress progress_state;
static int pack_compression_level = Z_DEFAULT_COMPRESSION;
static int pack_compression_seen;
+static uint32_t max_blob_size;
/*
* The object names in objects array are hashed with this hashtable,
@@ -562,6 +563,24 @@ static off_t write_one(struct sha1file *f,
return 0;
}
+ /* refuse to include megablobs */
+ if (max_blob_size && e->size >= max_blob_size) {
+ if (e->in_pack) {
+ /* rewrite as loose object so git-repack doesn't lose data */
+ void *buf;
+ enum object_type type;
+ unsigned long size;
+ buf = read_sha1_file(e->sha1, &type, &size);
+ if (!buf)
+ die("unable to read %s", sha1_to_hex(e->sha1));
+ if (write_sha1_file_ignore_packs(buf, size, typename(type), NULL) < 0)
+ die("failed to write object");
+ free(buf);
+ }
+ written++;
+ return offset;
+ }
+
e->offset = offset;
size = write_object(f, e, offset);
if (!size) {
@@ -1391,13 +1410,16 @@ static int try_delta(struct unpacked *trg, struct unpacked *src,
/* Now some size filtering heuristics. */
trg_size = trg_entry->size;
+ src_size = src_entry->size;
+ /* prevent use if later dropped from packfile */
+ if (max_blob_size && (trg_size >= max_blob_size || src_size >= max_blob_size))
+ return 0;
max_size = trg_size/2 - 20;
max_size = max_size * (max_depth - src_entry->depth) / max_depth;
if (max_size == 0)
return 0;
if (trg_entry->delta && trg_entry->delta_size <= max_size)
max_size = trg_entry->delta_size-1;
- src_size = src_entry->size;
sizediff = src_size < trg_size ? trg_size - src_size : 0;
if (sizediff >= max_size)
return 0;
@@ -1701,6 +1723,13 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
incremental = 1;
continue;
}
+ if (!prefixcmp(arg, "--max-blob-size=")) {
+ char *end;
+ max_blob_size = strtoul(arg+16, &end, 0) * 1024;
+ if (!arg[16] || *end)
+ usage(pack_usage);
+ continue;
+ }
if (!prefixcmp(arg, "--compression=")) {
char *end;
int level = strtoul(arg+14, &end, 0);
diff --git a/cache.h b/cache.h
index 3143853..ad39f67 100644
--- a/cache.h
+++ b/cache.h
@@ -343,6 +343,8 @@ extern int sha1_object_info(const unsigned char *, unsigned long *);
extern void * read_sha1_file(const unsigned char *sha1, enum object_type *type, unsigned long *size);
extern int hash_sha1_file(const void *buf, unsigned long len, const char *type, unsigned char *sha1);
extern int write_sha1_file(void *buf, unsigned long len, const char *type, unsigned char *return_sha1);
+extern int write_sha1_file_ignore_packs(void *buf, unsigned long len, const char *type,
+ unsigned char *return_sha1);
extern int pretend_sha1_file(void *, unsigned long, enum object_type, unsigned char *);
extern int check_sha1_signature(const unsigned char *sha1, void *buf, unsigned long size, const char *type);
diff --git a/git-repack.sh b/git-repack.sh
index 4ea6e5b..c2b2112 100755
--- a/git-repack.sh
+++ b/git-repack.sh
@@ -35,6 +35,10 @@ true)
extra="$extra --delta-base-offset" ;;
esac
+# handle blob limiting
+mbs="`git config --int repack.maxblobsize`"
+[ -n "$mbs" ] && extra="$extra --max-blob-size=$mbs"
+
PACKDIR="$GIT_OBJECT_DIRECTORY/pack"
PACKTMP="$GIT_OBJECT_DIRECTORY/.tmp-$$-pack"
rm -f "$PACKTMP"-*
diff --git a/sha1_file.c b/sha1_file.c
index e715527..8786af7 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -1979,7 +1979,8 @@ int hash_sha1_file(const void *buf, unsigned long len, const char *type,
return 0;
}
-int write_sha1_file(void *buf, unsigned long len, const char *type, unsigned char *returnsha1)
+static int write_sha1_file_core(void *buf, unsigned long len, const char *type,
+ int checkpacks, unsigned char *returnsha1)
{
int size, ret;
unsigned char *compressed;
@@ -1997,7 +1998,7 @@ int write_sha1_file(void *buf, unsigned long len, const char *type, unsigned cha
filename = sha1_file_name(sha1);
if (returnsha1)
hashcpy(returnsha1, sha1);
- if (has_sha1_file(sha1))
+ if (checkpacks && has_sha1_file(sha1))
return 0;
fd = open(filename, O_RDONLY);
if (fd >= 0) {
@@ -2062,6 +2063,18 @@ int write_sha1_file(void *buf, unsigned long len, const char *type, unsigned cha
return move_temp_to_file(tmpfile, filename);
}
+int write_sha1_file(void *buf, unsigned long len, const char *type,
+ unsigned char *returnsha1)
+{
+ return write_sha1_file_core(buf, len, type, 1, returnsha1);
+}
+
+int write_sha1_file_ignore_packs(void *buf, unsigned long len, const char *type,
+ unsigned char *returnsha1)
+{
+ return write_sha1_file_core(buf, len, type, 0, returnsha1);
+}
+
/*
* We need to unpack and recompress the object for writing
* it out to a different file.
--
1.5.2.rc3.726.g279d-dirty
* Re: [PATCH] Prevent megablobs from gunking up git packs
From: Shawn O. Pearce @ 2007-05-22 6:30 UTC (permalink / raw)
To: Dana How; +Cc: Junio C Hamano, Git Mailing List
Dana How <danahow@gmail.com> wrote:
...
> 7 packfiles of 12.5GB total (--window=0 and --depth=0 were
> used due to runtime limits). When using these packfiles,
...
> Repacking the entire repository with a max-blob-size of 256KB
> resulted in a single 13.1MB packfile, as well as 2853 loose
> objects totaling 15.4GB compressed and 100.08GB uncompressed,
> 11 files per objects/xx directory on average. All was created
> in half the runtime of the previous yet with standard
> --window=10 and --depth=50 parameters. The data in the
> packfile was 270MB uncompressed in 35976 blobs. Operations
> such as "git-log --pretty=oneline" were about 30X faster
> on a cold cache and 2 to 3X faster otherwise. Process sizes
> remained reasonable.
Can you give me details about your system? Is this a 64 bit binary?
What is your core.packedGitWindowSize and core.packedGitLimit set to?
It sounds like the packed version was almost 3 GiB smaller, but
was slower because we were mmap'ing far too much data at startup
and that was making your OS page in things that you didn't really
need to have.
Mind trying git-log with a smaller core.packedGitWindow{Size,Limit}?
Perhaps it's just as simple as our defaults being far too high for
your workload...
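For example (these particular values are just a guess to experiment
with, not a recommendation):
$ git config core.packedGitWindowSize 32m   # illustrative values
$ git config core.packedGitLimit 256m
$ time git log --pretty=oneline >/dev/null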
--
Shawn.
* Re: [PATCH] Prevent megablobs from gunking up git packs
From: Dana How @ 2007-05-22 7:33 UTC (permalink / raw)
To: Shawn O. Pearce; +Cc: Junio C Hamano, Git Mailing List, danahow
On 5/21/07, Shawn O. Pearce <spearce@spearce.org> wrote:
> Dana How <danahow@gmail.com> wrote:
> > ... Operations
> > such as "git-log --pretty=oneline" were about 30X faster
> > on a cold cache and 2 to 3X faster otherwise. Process sizes
> > remained reasonable.
>
> Can you give me details about your system? Is this a 64 bit binary?
RHEL4/Nahant on an Opteron. Yes.
> What is your core.packedGitWindowSize and core.packedGitLimit set to?
I didn't change the default.
> It sounds like the packed version was almost 3 GiB smaller, but
> was slower because we were mmap'ing far too much data at startup
> and that was making your OS page in things that you didn't really
> need to have.
The difference in size is because of the "Custom compression levels"
patch -- now the loose objects use Z_BEST_SPEED, whereas the packs
use Z_DEFAULT_COMPRESSION.
> Mind trying git-log with a smaller core.packedGitWindow{Size,Limit}?
> Perhaps its just as simple as our defaults are far far too high for
> your workload...
I think that's a good idea and it should be easy to try tomorrow.
It will definitely improve the cold-cache case.
But we need to consider both *read* and *creation* performance.
The portion of the repo I imported to git grows at about 500MB/week
(compressed). Should I repack -a every week? Every month? In any case,
should I use default window/depth, or 0/0? If default, run-times are
prohibitive (in fact, I've always killed each attempt so the machine
could be used for "real" work), and if 0/0, then I lose deltification
on all objects.
These megablobs really are outliers and stress the "one size fits
all" approach of packing in git. As a thought experiment,
let's (1) pretend git-repack takes --max-blob-size= and --max-pack-size= ,
(2) pretend the patch doesn't add the repack.maxblobsize variable,
and (3) do the following:
% git-repack -a -d --max-blob-size=256
% git-repack --max-pack-size=2047 --window=0 --depth=0
The first step makes a digestible 13MB packfile, and the second
puts all the megablobs in 6+ 2GB packfiles. Is there really any
advantage to carrying out the second step? If I'm processing
a 100MB+ blob, do I really care about an extra open(2) call?
Thanks,
--
Dana L. How danahow@gmail.com +1 650 804 5991 cell
* Re: [PATCH] Prevent megablobs from gunking up git packs
From: Junio C Hamano @ 2007-05-22 6:52 UTC (permalink / raw)
To: Dana How; +Cc: Git Mailing List
Dana How <danahow@gmail.com> writes:
> Using fast-import and repack with the max-pack-size patch,
> 3628 commits were imported from Perforce comprising
> 100.35GB (uncompressed) in 38829 blobs, and saved in
> 7 packfiles of 12.5GB total (--window=0 and --depth=0 were
> used due to runtime limits). When using these packfiles,
> several git commands showed very large process sizes,
> and some slowdowns (compared to comparable operations
> on the linux kernel repo) were also apparent.
>
> git stores data in loose blobs or in packfiles. The former
> has essentially now become an exception mechanism, to store
> exceptionally *young* blobs. Why not use this to store
> exceptionally *large* blobs as well? This allows us to
> re-use all the "exception" machinery with only a small change.
Well, I had an impression that mmapping a single loose object
(and then munmapping it when done) would be more expensive than
mmapping a whole pack and accessing that object through a window,
as long as you touch the same set of objects and the object in
the pack is not deltified.
> Repacking the entire repository with a max-blob-size of 256KB
> resulted in a single 13.1MB packfile, as well as 2853 loose
> objects totaling 15.4GB compressed and 100.08GB uncompressed,
> 11 files per objects/xx directory on average. All was created
> in half the runtime of the previous yet with standard
> --window=10 and --depth=50 parameters. The data in the
> packfile was 270MB uncompressed in 35976 blobs. Operations
> such as "git-log --pretty=oneline" were about 30X faster
> on a cold cache and 2 to 3X faster otherwise. Process sizes
> remained reasonable.
I think more reasonable comparison to figure out what is really
going on would be to create such a pack with the same 0/0 window
and depth (i.e. "keeping the huge objects out of the pack" would
be the only difference with the "horrible" case). With huge
packs, I wouldn't be surprised if seeking to extract base object
from a far away part of a packfile takes a lot longer than
reading delta and applying the delta to base object that is kept
in the in-core delta base cache.
Also, if by "process size" you mean the total VM size, not RSS, I
think it is the wrong measure. As long as you do not touch the
rest of the pack, even if you mmap a huge packfile, you would
not actually bring that much data into your main memory, would
you? Well, assuming that your mmap() implementation and virtual
memory subsystem do a decent job... maybe we are spoiled by
Linux here...
* Re: [PATCH] Prevent megablobs from gunking up git packs
From: Dana How @ 2007-05-22 8:00 UTC (permalink / raw)
To: Junio C Hamano; +Cc: Git Mailing List, danahow
On 5/21/07, Junio C Hamano <junkio@cox.net> wrote:
> Dana How <danahow@gmail.com> writes:
> > git stores data in loose blobs or in packfiles. The former
> > has essentially now become an exception mechanism, to store
> > exceptionally *young* blobs. Why not use this to store
> > exceptionally *large* blobs as well? This allows us to
> > re-use all the "exception" machinery with only a small change.
> Well, I had an impression that mmapping a single loose object
> (and then munmapping it after done) would be more expensive than
> mmapping a whole pack and accessing that object through window,
> as long as you touch the same set of objects and the object in
> the pack is not deltified.
I agree with your comparison. However, if I'm processing a 100MB+
blob, I doubt the extra open/mmap/munmap/close calls are going
to matter to me. What I think _helped_ me was that, with the megablobs
pushed out of the pack, git-log etc could play around inside a
"tiny" 13MB packfile very quickly. This packfile contained all the
commits, all the trees, and all the blobs < 256KB.
> > Repacking the entire repository with a max-blob-size of 256KB
> > resulted in a single 13.1MB packfile, as well as 2853 loose
> > objects totaling 15.4GB compressed and 100.08GB uncompressed,
> > 11 files per objects/xx directory on average. All was created
> > in half the runtime of the previous yet with standard
> > --window=10 and --depth=50 parameters. The data in the
> > packfile was 270MB uncompressed in 35976 blobs. Operations
> > such as "git-log --pretty=oneline" were about 30X faster
> > on a cold cache and 2 to 3X faster otherwise. Process sizes
> > remained reasonable.
>
> I think more reasonable comparison to figure out what is really
> going on would be to create such a pack with the same 0/0 window
> and depth (i.e. "keeping the huge objects out of the pack" would
> be the only difference with the "horrible" case). With huge
> packs, I wouldn't be surprised if seeking to extract base object
> from a far away part of a packfile takes a lot longer than
> reading delta and applying the delta to base object that is kept
> in the in-core delta base cache.
Yes, changing only one variable at a time would be better.
I will do that experiment. However, the huge pack _did_ have
0/0, and the small pack had default/default, which I think is the
reverse of what you concluded above, so the experiment should
make things no better for the huge-pack case.
> Also if you mean by "process size" the total VM size, not RSS, I
> think it is a wrong measure. As long as you do not touch the
> rest of the pack, even if you mmap a huge packfile, you would
> not bring that much data actually into your main memory, would
> you? Well, assuming that your mmap() implementation and virtual
> memory subsystem does a descent job... maybe we are spoiled by
> Linux here...
You are right that the VM number was more shocking, but both
were too high. Still, let's compare using 12GB+ of packfiles versus 13MB.
In the former case, I'm depending on the sliding mmap windows doing
the right thing in an operating regime no one uses (which is why
Shawn was asking about my packedGitLimit settings etc), and in the
latter case, the packfile is <10% of the linux2.6 packfile but I have
to endure an extra open/mmap/munmap/close sequence when accessing
enormous files. The small extra cost of the latter is more attractive
to me than an unknown amount of tuning to get the former right,
and in the former case I still have to figure out how to *create*
the packfiles efficiently.
There's actually an even more extreme example from my day job.
The software team has a project whose files/revisions would be
similar to those in the linux kernel (larger commits, I'm sure).
But they have *ONE* 500MB file they check in because it takes
2 or 3 days to generate and different people use different versions of it.
I'm sure it has 50+ revisions now. If they converted to git and included
these blobs in their packfile, that's a 25GB uncompressed increase!
*Every* git operation must wade through 10X -- 100X more packfile.
Or it could be kept in 50+ loose objects in objects/xx ,
requiring a few extra syscalls by each user to get a new version.
Thanks,
--
Dana L. How danahow@gmail.com +1 650 804 5991 cell
* Re: [PATCH] Prevent megablobs from gunking up git packs
From: Jakub Narebski @ 2007-05-22 11:05 UTC (permalink / raw)
To: git
Dana How wrote:
> There's actually an even more extreme example from my day job.
> The software team has a project whose files/revisions would be
> similar to those in the linux kernel (larger commits, I'm sure).
> But they have *ONE* 500MB file they check in because it takes
> 2 or 3 days to generate and different people use different versions of it.
> I'm sure it has 50+ revisions now. If they converted to git and included
> these blobs in their packfile, that's a 25GB uncompressed increase!
> *Every* git operation must wade through 10X -- 100X more packfile.
> Or it could be kept in 50+ loose objects in objects/xx ,
> requiring a few extra syscalls by each user to get a new version.
Or keep those large objects in a separate, _kept_ packfile containing
only those objects (which can delta well, even if they are large).
--
Jakub Narebski
Warsaw, Poland
ShadeHawk on #git
* Re: [PATCH] Prevent megablobs from gunking up git packs
From: Dana How @ 2007-05-22 16:59 UTC (permalink / raw)
To: Jakub Narebski; +Cc: git, danahow, Junio C Hamano
On 5/22/07, Jakub Narebski <jnareb@gmail.com> wrote:
> Dana How wrote:
> > There's actually an even more extreme example from my day job.
> > The software team has a project whose files/revisions would be
> > similar to those in the linux kernel (larger commits, I'm sure).
> > But they have *ONE* 500MB file they check in because it takes
> > 2 or 3 days to generate and different people use different versions of it.
> > I'm sure it has 50+ revisions now. If they converted to git and included
> > these blobs in their packfile, that's a 25GB uncompressed increase!
> > *Every* git operation must wade through 10X -- 100X more packfile.
> > Or it could be kept in 50+ loose objects in objects/xx ,
> > requiring a few extra syscalls by each user to get a new version.
> Or keeping those large objects in separate, _kept_ packfile, containing
> only those objects (which can delta well, even if they are large).
Yes, I experimented with various changes to git-repack and
having it create .keep files just before coming up with the maxblobsize
approach. The problem with a 12GB+ repo is not only the large
repack time, but the fact that the repack time keeps growing with
the repo size. So, with split packs, I had repack create .keep
files for all new packs except the last (fragmentary) one. The next
repack would then only repack new stuff plus the single fragmentary
pack, keeping repack time from growing (until you deleted the .keep
files [just the ones with "repack" in them] to start over from scratch).
But this approach is not going to distribute commits and trees all that well.
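A rough sketch of that marking step (not the actual code; $PACKDIR and
$last are placeholders for the pack directory and the base name of the
final, fragmentary pack just written):
# keep every freshly written pack except the last, fragmentary one;
# the marker text "repack" is what lets a later cleanup find them
for p in "$PACKDIR"/pack-*.pack
do
	base=${p%.pack}
	test "$base" = "$last" && continue
	test -f "$base.keep" || echo repack >"$base.keep"
done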
Last night before signing off Junio proposed some partitioning ideas.
He presented them as ordering things *within* one pack; what I had
tried was making repack operate in 2 passes: the first one would create
pack(s) containing commits+trees+tags, the 2nd would create
pack(s) containing only blobs. Of course the first group would contain
only 1 tiny pack, and the latter 6 or 7 enormous packs. I also combined
this with the previous paragraph, putting .keep files on all but the last
pack in each group. Then the metadata always got repacked,
and the blob data only got its "tail" repacked.
Let's just stipulate that you've convinced me that putting everything
in packs, and not ejecting megablobs, is better or equivalent on
the "central" git repository which will replace (part of) our Perforce
repository. What about the users' repositories?
Each person at my day job has his own workstation. They are all
on a grid and are constantly running jobs in the background.
Each person would have at least one personal repo. What should the
packing strategy be there?
(1) If we must put everything in packs, then we could:
(1a) Repack everything in local repos, incurring large local runtimes.
This extra work then denies the CPU cycles to the grid,
which WILL be noticed and cause much whining.
So the response will be to reduce window and/or turn
on nodelta for some group of objects, worsening packing
and failing to squash the whining. This happens across
20 to 30 workstations. Or we reduce the frequency of
repacking and stagger it across the network. Since the daily
pull/fetch/checkout ("sync" in p4 parlance) grabs 400+ new
revisions, making repacking weekly gives us a policy that
results in 400*5/2=1000 extra loose blobs on average,
and there will still be whining. Why not just set maxblobsize
to some size resulting in ~1000 loose blobs, leave window/depth
at default, and enjoy <1hr repacking?
(1b) Repack everything ONLY in the central repo, and have the users' repos
point to it as an alternate. Now we have enormous network traffic.
However, this is better than (1a), and was what I thought I'd be
stuck with. We still do have the possible problem of excessive
packing time on the central repo, but it's easier to solve/hide
in just one place.
(2) We repack everything but leave megablobs loose. Now packfiles
are 13MB, repack time with default window/depth is <1hr, and we
can repack each users' repository from his own cron job. This will
be noticed, but it won't cause too much complaining. Most git
operations by users will be against their local repos, but the
server's object db will still be available as an alternate, at least
for fetching megablobs (see the one-liner below).
This is not a problem compared to Perforce, which stores *NO*
repository state locally at all.
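For what it's worth, pointing a workstation repo at the central object
db as an alternate is a one-liner (the path below is hypothetical):
$ echo /net/central/project.git/objects >>.git/objects/info/alternates  # hypothetical central path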
I really think megablob ejection from packs makes a lot of sense for local
repos on a network of workstations. It lets me keep almost all repo
state locally very cheaply. It is just another consequence of the tendency
that an adequate solution that operates principally on only 13MB of data
doesn't have to work as hard or as carefully as something
operating on the full 12GB -- three orders of magnitude larger.
If there's interest, I could submit my other alterations to git-repack.
They still have bugs which would take a while to work out since
each run operates on 12GB of data. With quicker runtimes,
maxblobsize was much quicker to debug even though I made
more stupid mistakes at first ;-)
Thanks,
--
Dana L. How danahow@gmail.com +1 650 804 5991 cell
* Re: [PATCH] Prevent megablobs from gunking up git packs
From: Jakub Narebski @ 2007-05-22 23:44 UTC (permalink / raw)
To: Dana How; +Cc: git, Junio C Hamano
Dana How wrote:
> On 5/22/07, Jakub Narebski <jnareb@gmail.com> wrote:
>> Dana How wrote:
>>> There's actually an even more extreme example from my day job.
>>> The software team has a project whose files/revisions would be
>>> similar to those in the linux kernel (larger commits, I'm sure).
>>> But they have *ONE* 500MB file they check in because it takes
>>> 2 or 3 days to generate and different people use different versions of it.
>>> I'm sure it has 50+ revisions now. If they converted to git and included
>>> these blobs in their packfile, that's a 25GB uncompressed increase!
>>> *Every* git operation must wade through 10X -- 100X more packfile.
>>> Or it could be kept in 50+ loose objects in objects/xx ,
>>> requiring a few extra syscalls by each user to get a new version.
>>
>> Or keeping those large objects in separate, _kept_ packfile, containing
>> only those objects (which can delta well, even if they are large).
>
> Yes, I experimented with various changes to git-repack and
> having it create .keep files just before coming up with the maxblobsize
> approach. The problem with a 12GB+ repo is not only the large
> repack time, but the fact that the repack time keeps growing with
> the repo size. So, with split packs, I had repack create .keep
> files for all new packs except the last (fragmentary) one. The next
> repack would then only repack new stuff plus the single fragmentary
> pack, keeping repack time from growing (until you deleted the .keep
> files [just the ones with "repack" in them] to start over from scratch).
> But this approach is not going to distribute commits and trees all that well.
No, I was thinking about a separate _kept_ pack (so it would not be
repacked unless the -f option is given) containing _only_ the large blobs.
The only difference between this and your proposal is that megablobs
would be in their own megablob pack, rather than loose.
--
Jakub Narebski
Poland
* Re: [PATCH] Prevent megablobs from gunking up git packs
From: Junio C Hamano @ 2007-05-23 0:28 UTC (permalink / raw)
To: Jakub Narebski; +Cc: Dana How, git
Jakub Narebski <jnareb@gmail.com> writes:
> No, I was thinking about separate _kept_ pack (so it would be not
> repacked unless -f option is given) containing _only_ the large blobs.
> The only difference between this and your proposal is that megablobs
> would be in their mergablobs pack, but not loose.
I am not sure about the "unless -f option is given" part, but a
single .kept pack that contains only problematic blobs would be
an interesting experiment.
(0) prepare object names of problematic blobs, in huge.txt, one
object name per line;
(1) prepare a single pack that has them:
$ N=$(git-pack-objects --depth=0 --window=0 pack <huge.txt)
$ echo 'Huge blobs -- do not repack' >pack-$N.keep
$ mv pack-$N.* .git/objects/pack/.
(2) repack the remainder, with the default depth/window:
$ git repack -a -d
$ git prune
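For step (0), one way to build huge.txt -- shown here with a newer git,
since the cat-file --batch-check format option used below postdates the
1.5.x of this thread -- could be:
# 262144 bytes = 256KB threshold (example); needs a newer git's cat-file
$ git rev-list --all --objects | cut -d' ' -f1 |
  git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize)' |
  awk '$1 == "blob" && $3 >= 262144 { print $2 }' >huge.txt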
* Re: [PATCH] Prevent megablobs from gunking up git packs
From: Nicolas Pitre @ 2007-05-23 1:58 UTC (permalink / raw)
To: Junio C Hamano; +Cc: Jakub Narebski, Dana How, git
On Tue, 22 May 2007, Junio C Hamano wrote:
> Jakub Narebski <jnareb@gmail.com> writes:
>
> > No, I was thinking about separate _kept_ pack (so it would be not
> > repacked unless -f option is given) containing _only_ the large blobs.
> > The only difference between this and your proposal is that megablobs
> > would be in their mergablobs pack, but not loose.
>
> I am not sure about the "unless -f option is given" part, but a
> single .kept pack that contains only problematic blobs would be
> an interesting experiment.
>
> (0) prepare object names of problematic blobs, in huge.txt, one
> object name per line;
>
> (1) prepare a single pack that has them:
>
> $ N=$(git-pack-object --depth=0 --window=0 pack <huge.txt)
> $ echo 'Huge blobs -- do not repack' >pack-$N.keep
> $ mv pack-$N.* .git/object/pack/.
If you're going to keep this pack, I think it might be worth attempting
deltas between those blobs anyway. If they ever deltify you'll gain in
disk space. And if they don't, well, you wasted the CPU cycles only
once. Unless you know for sure they're unlikely to deltify well.
Nicolas
* Re: [PATCH] Prevent megablobs from gunking up git packs
From: Nicolas Pitre @ 2007-05-22 17:38 UTC (permalink / raw)
To: Dana How; +Cc: Junio C Hamano, Git Mailing List
On Mon, 21 May 2007, Dana How wrote:
>
> Using fast-import and repack with the max-pack-size patch,
> 3628 commits were imported from Perforce comprising
> 100.35GB (uncompressed) in 38829 blobs, and saved in
> 7 packfiles of 12.5GB total (--window=0 and --depth=0 were
> used due to runtime limits). When using these packfiles,
> several git commands showed very large process sizes,
> and some slowdowns (compared to comparable operations
> on the linux kernel repo) were also apparent.
>
> git stores data in loose blobs or in packfiles. The former
> has essentially now become an exception mechanism, to store
> exceptionally *young* blobs. Why not use this to store
> exceptionally *large* blobs as well? This allows us to
> re-use all the "exception" machinery with only a small change.
>
> Repacking the entire repository with a max-blob-size of 256KB
> resulted in a single 13.1MB packfile, as well as 2853 loose
> objects totaling 15.4GB compressed and 100.08GB uncompressed,
> 11 files per objects/xx directory on average. All was created
> in half the runtime of the previous yet with standard
> --window=10 and --depth=50 parameters. The data in the
> packfile was 270MB uncompressed in 35976 blobs. Operations
> such as "git-log --pretty=oneline" were about 30X faster
> on a cold cache and 2 to 3X faster otherwise. Process sizes
> remained reasonable.
>
> This patch implements the following:
> 1. git pack-objects takes a new --max-blob-size=N flag,
> with the effect that only blobs less than N KB are written
> to the packfiles(s). If a blob was in a pack but violates
> this limit (perhaps the packs were created by fast-import
> or max-blob-size was reduced), then a new loose object
> is written out if needed so the data is not lost.
> 2. git repack inspects repack.maxblobsize . If set, its
> value is passed to git pack-objects on the command line.
> The user should change repack.maxblobsize , NOT specify
> --max-blob-size=N .
> 3. No other caller of git pack-objects supplies this new flag,
> so other callers see no change.
>
> This patch is on top of the earlier max-pack-size patch,
> because I thought I needed some behavior it supplied,
> but could be rebased on master if desired.
I think what this patch is missing is a check, after all options have been
parsed, to prevent --stdout and --max-blob-size from being used together.
Nicolas
* Re: [PATCH] Prevent megablobs from gunking up git packs
From: Dana How @ 2007-05-22 18:07 UTC (permalink / raw)
To: Nicolas Pitre; +Cc: Junio C Hamano, Git Mailing List, danahow
On 5/22/07, Nicolas Pitre <nico@cam.org> wrote:
> On Mon, 21 May 2007, Dana How wrote:
> > This patch implements the following:
> > 1. git pack-objects takes a new --max-blob-size=N flag,
> > with the effect that only blobs less than N KB are written
> > to the packfiles(s). If a blob was in a pack but violates
> > this limit (perhaps the packs were created by fast-import
> > or max-blob-size was reduced), then a new loose object
> > is written out if needed so the data is not lost.
> > 2. git repack inspects repack.maxblobsize . If set, its
> > value is passed to git pack-objects on the command line.
> > The user should change repack.maxblobsize , NOT specify
> > --max-blob-size=N .
> > 3. No other caller of git pack-objects supplies this new flag,
> > so other callers see no change.
> >
> > This patch is on top of the earlier max-pack-size patch,
> > because I thought I needed some behavior it supplied,
> > but could be rebased on master if desired.
>
> I think what this patch is missing is a test after all options have been
> parsed to prevent --stdout and --max-blob-size to be used together.
Yes. I will also update the documentation.
Thanks,
--
Dana L. How danahow@gmail.com +1 650 804 5991 cell
* Re: [PATCH] Prevent megablobs from gunking up git packs
From: Junio C Hamano @ 2007-05-23 22:08 UTC (permalink / raw)
To: Dana How; +Cc: Git Mailing List
Dana How <danahow@gmail.com> writes:
> This patch implements the following:
> 1. git pack-objects takes a new --max-blob-size=N flag,
> with the effect that only blobs less than N KB are written
> to the packfiles(s). If a blob was in a pack but violates
> this limit (perhaps the packs were created by fast-import
> or max-blob-size was reduced), then a new loose object
> is written out if needed so the data is not lost.
Why?
I really do not like that "write a new loose object" part
without proper justification. From your description, I thought
the most natural way to do this is to pretend you did not hear
about large objects at all, by rejecting them early, perhaps
inside add_object_entry() or inside get_object_details() --
in either case you would do sha1_object_info() early instead of
doing it in check_object().
By the way, is there fundamental reason that this needs to be
"blob size" limit? Wouldn't "max-object-size" be more clean in
theory, and work the same way in practice?
> 2. git repack inspects repack.maxblobsize . If set, its
> value is passed to git pack-objects on the command line.
> The user should change repack.maxblobsize , NOT specify
> --max-blob-size=N .
Why not?
> This patch is on top of the earlier max-pack-size patch,
> because I thought I needed some behavior it supplied,
> but could be rebased on master if desired.
Your earlier "split according to max-pack-size" will hopefully be
on master shortly.
* Re: [PATCH] Prevent megablobs from gunking up git packs
From: Dana How @ 2007-05-23 23:55 UTC (permalink / raw)
To: Junio C Hamano; +Cc: Git Mailing List, danahow
On 5/23/07, Junio C Hamano <junkio@cox.net> wrote:
> Dana How <danahow@gmail.com> writes:
> > This patch implements the following:
> > 1. git pack-objects takes a new --max-blob-size=N flag,
> > with the effect that only blobs less than N KB are written
> > to the packfiles(s). If a blob was in a pack but violates
> > this limit (perhaps the packs were created by fast-import
> > or max-blob-size was reduced), then a new loose object
> > is written out if needed so the data is not lost.
>
> Why?
>
> I really do not like that "write a new loose object" part
> without proper justification. From your description, I thought
> the most natural way to do this is to pretend you did not hear
> about large objects at all, by rejecting them early, perhaps
> inside add_object_entry() or inside get_object_details() --
> either case you would do sha1_object_info() early instead of
> doing it in check_object().
>
> By the way, is there fundamental reason that this needs to be
> "blob size" limit? Wouldn't "max-object-size" be more clean in
> theory, and work the same way in practice?
I agree with your sentiments. Some nasty details pushed
me to implement it this way. Let me explain them and perhaps
you can come up with a different combination of solutions.
Each object to be packed can be in one of 4 states:
{loose or packed} X {small or too big}.
Regardless of the {loose or packed} selection,
a "small" object can always be placed in the pack.
A loose object which is too big does not require
any special action either -- if we drop it from the pack,
we don't lose anything since it already exists in objects/xx .
The packed X too big combination is the problem. As the
commit message says, this could happen if the packs
came from fast-import, or if the max-blob-size were reduced.
We have three options in this case:
(1) Drop the object (do not put it in the new pack(s)).
(2) Pass the object into the new pack(s).
(3) Write out the object as a new loose object.
Option (1) is unacceptable. When you call git-repack -a,
it blindly deletes all the non-kept packs at the end. So
the megablobs would be lost.
Let's suppose we always used Option (2), and Option (3)
were never available. Then once a large object got into
a pack, the only way to get it out would be to explode
the pack using git-unpack-objects -- but hey, that doesn't
work because they all already exist in this repository packed.
So we make a temporary repository to explode the pack,
then copy over the loose objects, then delete the pack,
then repack with the correct/updated max-blob-size specification.
I would claim this is even more ugly, and more importantly,
more error-prone than supporting Option (3) in at least some cases.
The way large objects get into a pack is that you realize after
the fact that you need a max-blob-size, or that you need to
decrease it since your pack files have become unwieldy.
I think I've shown we need Option (3) in at least some cases.
To better address _which_ cases, let's move on to your
second point:: why did I implement --max-blob-size instead
of --max-object-size? I take this to mean that I should use
the blob size if undeltified, and the delta size if previously deltified?
That's actually what I do internally, but the total effect
over many repackings is what one would expect from --max-blob-size.
In normal use on a personal repository starting from scratch,
all blobs are first packed from loose objects. When I am
deciding if I want to include the loose object or not, I want
to use the object's direct size, not its deltified size. If I use
the latter, then I have to deltify all the megablobs to see
if their deltas are small enough. This takes a long time
and is one of the things this patch aims to eliminate.
So I take the list of loose objects, throw out the megablobs,
try to deltify the rest, and write out the pack.
When I repack, any deltas are smaller than the original
objects, so in any sequence of packings from loose
objects and repackings of packs, in which all packs
were created with a max-blob-size limit which stayed the
same or increased over time, there is no difference
between max-blob-size or max-object-size: checking loose
object size or in-pack size gives the same pack contents.
Now let's say you decrease max-blob-size and repack,
or repack from some packs that had no max-blob-size
(e.g. from fast-import). Now there is some difficulty in
a clean definition when you follow Option (3). You might
have a base object B in an old pack, and a deltified object
D based on it, where B is now larger than the newly-decreased
max-blob-size while D's deltified size is still smaller.
So you write out B, but D can no longer be stored deltified
and may become bigger as well.
But in this case you should do the following:
Since you are going to decrease max-blob-size when you decide
your repository packs have become "gunked up" with megablobs,
I would propose git-repack -a -f (git-pack-objects --no-reuse-object)
be used. This means all deltas are recomputed
from the new, smaller universe of blobs, which makes sense,
and since no deltas are reused, all blob-size limits can only
be based on the actual blob size by the arguments 3 paragraphs back.
(Note: I made sure the current patch loses no data if you decrease
max-blob-size and forget to specify -f -- you just get a few large
packed objects resulting from "lost" deltifications.)
In actuality, the object_entry.size field is checked, so objects
_are_ being filtered by delta size if deltified. But I went through
the (suggested) usage scenarios above to show that the *result* is that
you get a pack limited in raw object size. --max-object-size
is what I implemented because it's easier, but in use it acts
like --max-blob-size so I named it that.
Now, concerning Option (2) vs Option (3): I need to write
out loose objects in *some* cases. Right now I always write
out a large object as a loose one. It would be reasonable to
change this so that it only happens with --no-object-reuse
(i.e. git-repack -f). This is what you should specify when you're
decreasing max-blob-size. Shall I change it so that large objects
are passed to the result pack without --no-object-reuse,
and written out as loose objects with --no-object-reuse?
I did not do this because it's a strange interaction --
just look how long it took me to build up to explaining it!
> > 2. git repack inspects repack.maxblobsize . If set, its
> > value is passed to git pack-objects on the command line.
> > The user should change repack.maxblobsize , NOT specify
> > --max-blob-size=N .
> Why not?
Well, indeed. That's why [PATCH *v2*] lets you specify
--max-blob-size to git-repack, and gives an example of
why you might want to do that [I don't].
> > This patch is on top of the earlier max-pack-size patch,
> > because I thought I needed some behavior it supplied,
> > but could be rebased on master if desired.
>
> Your earlier "split according to max-pack-size" will hopefully be
> on master shortly.
Thanks!
--
Dana L. How danahow@gmail.com +1 650 804 5991 cell
* Re: [PATCH] Prevent megablobs from gunking up git packs
From: Junio C Hamano @ 2007-05-24 1:44 UTC (permalink / raw)
To: Dana How; +Cc: Junio C Hamano, Git Mailing List
"Dana How" <danahow@gmail.com> writes:
> The packed X too big combination is the problem. As the
> commit message says, this could happen if the packs
> came from fast-import,...
> We have three options in this case:
> (1) Drop the object (do not put it in the new pack(s)).
> (2) Pass the object into the new pack(s).
> (3) Write out the object as a new loose object.
>
> Option (1) is unacceptable. When you call git-repack -a,
> it blindly deletes all the non-kept packs at the end. So
> the megablobs would be lost.
Ok, I can buy that -- neither (1) nor (2) is acceptable, and (3) is
the only sane thing to do for previously packed objects that
exceed the size limit.
Since you have to handle that case _anyway_, I think it makes
sense to always say "Ok, we will write it out if there is no
loose representation already available".
That is, unless somebody smarter than me, like Nico or Shawn,
comes up with better ideas to do this ;-).
> ... why did I implement --max-blob-size instead
> of --max-object-size? I take this to mean that I should use
> the blob size if undeltified, and the delta size if previously deltified?
No, I think the only sensible way for the end user to specify
the size is uncompressed size of the object. For a blob, that
is the size of checked-out file. IOW:
$ git cat-file $type $sha | wc -c
Nothing else would make any sense.
* Re: [PATCH] Prevent megablobs from gunking up git packs
From: Shawn O. Pearce @ 2007-05-24 7:12 UTC (permalink / raw)
To: Junio C Hamano; +Cc: Dana How, Git Mailing List
Junio C Hamano <junkio@cox.net> wrote:
> "Dana How" <danahow@gmail.com> writes:
>
> > The packed X too big combination is the problem. As the
> > commit message says, this could happen if the packs
> > came from fast-import,...
> > We have three options in this case:
> > (1) Drop the object (do not put it in the new pack(s)).
> > (2) Pass the object into the new pack(s).
> > (3) Write out the object as a new loose object.
> >
> > Option (1) is unacceptable. When you call git-repack -a,
> > it blindly deletes all the non-kept packs at the end. So
> > the megablobs would be lost.
>
> Ok, I can buy that -- (1) nor (2) are unacceptable and (3) is
> the only sane thing to do for a previously packed objects that
> exceed the size limit.
I still don't buy the idea that these megablobs shouldn't be packed.
I understand Dana's pain here (at least a little bit, my problems
aren't as bad as his are), but I also hate to see us run away from
packfiles for these really sick cases just because we have some
issues in our current packfile handling.
Packfiles give us a lot of benefits:
1) less inode usage;
2) transport can write directly to local disk;
3) transport can (quickly) copy from local disk;
4) testing for existence is *much* faster;
5) deltafication is possible;
Now #3 is actually really important here. Don't forget that we
*just* disabled the fancy "new loose object format". It doesn't
exist. We can read the packfile-like loose objects, but we cannot
write them anymore. So let's say we explode a megablob into a loose
object, and it's 800 MiB by itself. Now we have to send that object
to a client. Yes, that's right, we must *RECOMPRESS* 800 MiB for
no reason. Not the best choice. Maybe we shouldn't have deleted
that packfile-formatted loose object writer...
Now one argument to work around that recompression problem would
be to NFS-share the loose objects directory, and let clients
mount that volume and add it to their .git/objects/info/alternates
list. But this doesn't work in the very general distributed case,
such as me getting huge files from kernel.org. Last I checked,
the kernel.org admins did not offer up NFS mounts. Besides, the
round-trip latency between me and kernel.org is too large for it
to be useful anyway over NFS. :)
So I think this "explode out megablobs" is a bad idea. Its violating
other things that make us fast, like #3's being able to reuse large
parts of an existing packfile during transfer.
Dana pointed out the megablobs make access slower because their
packfile indexes must still be searched to locate a commit; but if
the megablob packfile(s) contain only blobs then there is no value
in looking at those packfiles.
We might be able to fix this by altering the sort_pack function
in sha1_file.c to not only order by mtime, but also by the ratio
of the size of the .pack to the number of objects stored in it.
Any packfile with a low size/object ratio is likely to be what
Dana has been calling a "metadata" pack, holding things like tags,
commits, trees and small blobs. It's these packfiles that we want
to search first, as they are the most likely to be accessed.
By pushing the megablob packs to the end of our packed_git search
list we won't tend to scan their indexes, as most of our objects
will be found earlier in the search list. Hence we will generally
avoid any costs associated with their indexes.
Huge packfiles probably should be scheduled for keeping with a .keep
automatically. We probably should teach pack-objects to generate a
.keep file if the resulting .pack was over a certain size threshold
(say 1.5 GiB by default) and teach git-repack to rename the .keep
file as it also renames the .idx and .pack.
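Until pack-objects learns that, the same policy could be approximated
after the fact with a sketch like this (the 1.5 GiB threshold is the
example figure above; illustrative only):
# mark any pack over ~1.5 GiB with a .keep so git-repack leaves it alone
for p in .git/objects/pack/pack-*.pack
do
	test $(wc -c <"$p") -gt $((1536 * 1024 * 1024)) || continue
	test -f "${p%.pack}.keep" ||
		echo 'oversized pack -- do not repack' >"${p%.pack}.keep"
done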
Better that we degrade gracefully when faced with massive inputs
than we do something stupid by default and make the poor user pay
for their mistake of not thoroughly reading the plumbing documentation
before use.
Now I would agree that we should punt on deltification of anything
that is just too large, and let the user decide what too large means,
and default it around 500 or 1024 MiB. But I would still stuff it
into a packfile.
Maybe it still makes sense to have a limit on the maximum size of a
loose object to pack, but I think that's only to avoid the sick case
of a very simple no-argument "git repack" taking a long while because
there's 8 loose objects and 6 of them are 900 MiB image files.
Once in a packfile, I'd keep it there, even if the user decreases
the threshold, as the advantages of keeping it in the packfile
outweigh the disadvantages. And there's almost no
advantage to being loose once packed.
All of that is actually a very minor set of changes to the system,
and doesn't create odd corner cases. It should also degrade better
out of the box.
> > ... why did I implement --max-blob-size instead
> > of --max-object-size? I take this to mean that I should use
> > the blob size if undeltified, and the delta size if previously deltified?
>
> No, I think the only sensible way for the end user to specify
> the size is uncompressed size of the object. For a blob, that
> is the size of checked-out file. IOW:
>
> $ git cat-file $type $sha | wc -c
>
> Nothing else would make any sense.
I agree. And when you combine it with what I'm saying above about
only applying this to loose objects, it's really quite easy to fetch
that value from the header and perform the test.
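For what it's worth, the same number is available without streaming the
whole blob through a pipe:
$ git cat-file -s $sha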
--
Shawn.
* Re: [PATCH] Prevent megablobs from gunking up git packs
From: Johannes Schindelin @ 2007-05-24 9:38 UTC (permalink / raw)
To: Shawn O. Pearce; +Cc: Junio C Hamano, Dana How, Git Mailing List
Hi,
On Thu, 24 May 2007, Shawn O. Pearce wrote:
> Junio C Hamano <junkio@cox.net> wrote:
> > "Dana How" <danahow@gmail.com> writes:
> >
> > > The packed X too big combination is the problem. As the
> > > commit message says, this could happen if the packs
> > > came from fast-import,...
> > > We have three options in this case:
> > > (1) Drop the object (do not put it in the new pack(s)).
> > > (2) Pass the object into the new pack(s).
> > > (3) Write out the object as a new loose object.
> > >
> > > Option (1) is unacceptable. When you call git-repack -a,
> > > it blindly deletes all the non-kept packs at the end. So
> > > the megablobs would be lost.
> >
> > Ok, I can buy that -- (1) nor (2) are unacceptable and (3) is
> > the only sane thing to do for a previously packed objects that
> > exceed the size limit.
>
> I still don't buy the idea that these megablobs shouldn't be packed.
> I understand Dana's pain here (at least a little bit, my problems
> aren't as bad as his are), but I also hate to see us run away from
> packfiles for these really sick cases just because we have some
> issues in our current packfile handling.
Couldn't this issue be helped by the "-delta" attribute?
Ciao,
Dscho
* Re: [PATCH] Prevent megablobs from gunking up git packs
From: david @ 2007-05-24 17:23 UTC (permalink / raw)
To: Shawn O. Pearce; +Cc: Junio C Hamano, Dana How, Git Mailing List
On Thu, 24 May 2007, Shawn O. Pearce wrote:
> Now #3 is actually really important here. Don't forget that we
> *just* disabled the fancy "new loose object format". It doesn't
> exist. We can read the packfile-like loose objects, but we cannot
> write them anymore. So lets say we explode a megablob into a loose
> object, and its 800 MiB by itself. Now we have to send that object
> to a client. Yes, that's right, we must *RECOMPRESS* 800 MiB for
> no reason. Not the best choice. Maybe we shouldn't have deleted
> that packfile formatted loose object writer...
When did the object store get changed so that loose objects aren't
compressed?
If the problem is that the codepath for fetching does an uncompress
followed by a compress, then this seems like a fairly easy
problem to fix (how hard would it be to add headers around the
compressed object to make it look to the receiver like a pack with
only one thing in it?)
David Lang
* Re: [PATCH] Prevent megablobs from gunking up git packs
From: Johannes Schindelin @ 2007-05-24 17:29 UTC (permalink / raw)
To: david; +Cc: Shawn O. Pearce, Junio C Hamano, Dana How, Git Mailing List
Hi,
On Thu, 24 May 2007, david@lang.hm wrote:
> On Thu, 24 May 2007, Shawn O. Pearce wrote:
>
> > Now #3 is actually really important here. Don't forget that we
> > *just* disabled the fancy "new loose object format". It doesn't
> > exist. We can read the packfile-like loose objects, but we cannot
> > write them anymore. So lets say we explode a megablob into a loose
> > object, and its 800 MiB by itself. Now we have to send that object
> > to a client. Yes, that's right, we must *RECOMPRESS* 800 MiB for
> > no reason. Not the best choice. Maybe we shouldn't have deleted
> > that packfile formatted loose object writer...
>
> when did the object store get changed so that loose objects aren't
> compressed?
That never happened. But we had a different file format for loose objects,
which was meant to make it easier to copy as-is into a pack. That file
format went away, since it was not as useful as we hoped.
Ciao,
Dscho
* Re: [PATCH] Prevent megablobs from gunking up git packs
2007-05-24 17:29 ` Johannes Schindelin
@ 2007-05-25 0:55 ` Shawn O. Pearce
0 siblings, 0 replies; 24+ messages in thread
From: Shawn O. Pearce @ 2007-05-25 0:55 UTC (permalink / raw)
To: Johannes Schindelin; +Cc: david, Junio C Hamano, Dana How, Git Mailing List
Johannes Schindelin <Johannes.Schindelin@gmx.de> wrote:
> On Thu, 24 May 2007, david@lang.hm wrote:
> > On Thu, 24 May 2007, Shawn O. Pearce wrote:
> >
> > > Now #3 is actually really important here. Don't forget that we
> > > *just* disabled the fancy "new loose object format". It doesn't
> > > exist. We can read the packfile-like loose objects, but we cannot
> > > write them anymore. So let's say we explode a megablob into a loose
> > > object, and it's 800 MiB by itself. Now we have to send that object
> > > to a client. Yes, that's right, we must *RECOMPRESS* 800 MiB for
> > > no reason. Not the best choice. Maybe we shouldn't have deleted
> > > that packfile formatted loose object writer...
> >
> > when did the object store get changed so that loose objects aren't
> > compressed?
>
> That never happened. But we had a different file format for loose objects,
> which was meant to make it easier to copy as-is into a pack. That file
> format went away, since it was not as useful as we hoped.
That "different file format" thing was added exactly for this type
of problem. Someone added a bunch of large blobs to their repository
and then spent a lot of time decompressing and recompressing them
during their next repack.
The reason that recompress must happen is that the deflate stream in a
standard (aka legacy) loose object contains both the Git object
header and the raw data; in a packfile the Git object header is
stored outside the deflate stream. The "different file format"
used the packfile format, allowing us to store the Git object header
outside the deflate stream. That meant we could just copy
the raw bytes as-is from the loose object into the packfile.
So we still store loose objects compressed; it's just that we can
no longer create loose objects that can be copied directly into
a packfile without recompression. And that is sort of Dana's
problem here. OK, not entirely, but whatever.
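A quick way to see this on disk, assuming bash and the zlib-flate utility
from qpdf (any zlib inflater would do):

  # A legacy loose object is a single zlib stream whose first bytes are the
  # "<type> <size>\0" header, followed immediately by the payload.
  blob=$(echo hello | git hash-object -w --stdin)
  zlib-flate -uncompress < ".git/objects/${blob:0:2}/${blob:2}" | head -c 16 | xxd
  # The output starts with "blob 6", a NUL byte, then "hello": header and
  # data share one deflate stream, so it cannot be reused verbatim in a pack.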
--
Shawn.
* Re: [PATCH] Prevent megablobs from gunking up git packs
2007-05-24 7:12 ` Shawn O. Pearce
2007-05-24 9:38 ` Johannes Schindelin
2007-05-24 17:23 ` david
@ 2007-05-24 20:43 ` Geert Bosch
2007-05-24 23:29 ` Dana How
3 siblings, 0 replies; 24+ messages in thread
From: Geert Bosch @ 2007-05-24 20:43 UTC (permalink / raw)
To: Shawn O. Pearce, Geert Bosch; +Cc: Git Mailing List, Junio C Hamano, Dana How
[resent because of malformed headers causing rejection]
On May 24, 2007, at 03:12, Shawn O. Pearce wrote:
> I still don't buy the idea that these megablobs shouldn't be packed.
> I understand Dana's pain here (at least a little bit, my problems
> aren't as bad as his are), but I also hate to see us run away from
> packfiles for these really sick cases just because we have some
> issues in our current packfile handling.
>
> Packfiles give us a lot of benefits:
>
> 1) less inode usage;
Using 1 inode per huge blob can never be an issue
> 2) transport can write directly to local disk;
> 3) transport can (quickly) copy from local disk;
Can do these by re-enabling the new loose object format
> 4) testing for existence is *much* faster;
> 5) deltafication is possible;
Look at it the other way. If we have huge objects (say >1GB),
we should put them in a pack of their own anyway. What's better:
having a pack with a separate index file or just a loose object?
While the one object per file model is awful for many small files
with lots of similarity, it is really quite efficient for large
objects, and the most reasonable model for huge objects.
Such blobs are just too large to do anything useful with.
The only operations done on them will be to check them in
or check them out. Ideally, we should never try to have them
in memory at all, but just stream them to/from disk while
compressing or decompressing.
Trying to deltify huge objects just takes too much time.
Similarly, we don't want to read 100MB to then apply a delta
and maybe throw out half of the data we read in the first place.
It's just too inefficient. Even if we read the huge blobs only once
during "git repack", we'll waste so much time that we're unlikely
to ever gain it back in any real-world scenario.
-Geert
* Re: [PATCH] Prevent megablobs from gunking up git packs
2007-05-24 7:12 ` Shawn O. Pearce
` (2 preceding siblings ...)
2007-05-24 20:43 ` Geert Bosch
@ 2007-05-24 23:29 ` Dana How
2007-05-25 2:06 ` Shawn O. Pearce
3 siblings, 1 reply; 24+ messages in thread
From: Dana How @ 2007-05-24 23:29 UTC (permalink / raw)
To: Shawn O. Pearce; +Cc: Junio C Hamano, Git Mailing List, danahow
On 5/24/07, Shawn O. Pearce <spearce@spearce.org> wrote:
> Junio C Hamano <junkio@cox.net> wrote:
> > "Dana How" <danahow@gmail.com> writes:
> > > We have three options in this case:
> > > (1) Drop the object (do not put it in the new pack(s)).
> > > (2) Pass the object into the new pack(s).
> > > (3) Write out the object as a new loose object.
> > > Option (1) is unacceptable. When you call git-repack -a,
> > > it blindly deletes all the non-kept packs at the end. So
> > > the megablobs would be lost.
> > Ok, I can buy that -- neither (1) nor (2) is acceptable and (3) is
> > the only sane thing to do for previously packed objects that
> > exceed the size limit.
>
> I still don't buy the idea that these megablobs shouldn't be packed.
> I understand Dana's pain here (at least a little bit, my problems
> aren't as bad as his are), but I also hate to see us run away from
> packfiles for these really sick cases just because we have some
> issues in our current packfile handling.
>
> Packfiles give us a lot of benefits:
>
> 1) less inode usage;
I agree with Geert that blowing an inode on a 100MB+ object
is no big deal.
> 2) transport can write directly to local disk;
> 3) transport can (quickly) copy from local disk;
For (2) and (3) see comments on next para plus NFS discussion.
> 4) testing for existence is *much* faster;
This is true. But I don't care about this cost if it
is only incurred on large objects which are leaf nodes
in the git "data relationship tree" (tags->commits->trees->blobs) anyway.
> 5) deltafication is possible;
Again, Geert made a good argument that hadn't occurred to me:
you definitely DON'T want to do deltification on such large objects.
Junio recently added a delta/nodelta attribute; this would be useful
to me, but unfortunately I have several continua of files, each with
the same suffix but with widely varying sizes, so attributes won't
help me unless the name globs in .gitattributes are expanded to full
expressions similar to find(1) [i.e. include testing based on size,
perms, type], which I think would be insane.
> Now #3 is actually really important here. Don't forget that we
> *just* disabled the fancy "new loose object format". It doesn't
> exist. We can read the packfile-like loose objects, but we cannot
> write them anymore. So let's say we explode a megablob into a loose
> object, and it's 800 MiB by itself. Now we have to send that object
> to a client. Yes, that's right, we must *RECOMPRESS* 800 MiB for
> no reason. Not the best choice. Maybe we shouldn't have deleted
> that packfile formatted loose object writer...
I completely agree with your argument. I do not suggest that repositories
that communicate with others via packs should use this feature.
Our repositories will communicate via alternates/NFS in one direction
and probably packs in the other. In the latter case the packs would
be generated with maxblobsize=0. See comments on next para.
> Now one argument to work around that recompression problem would
> be to NFS share out the loose objects directory, and let clients
> mount that volume and add it to their .git/objects/info/alternates
> list. But this doesn't work in the very general distributed case,
> such as me getting huge files from kernel.org. Last I checked,
> the kernel.org admins did not offer up NFS mounts. Besides, the
> round-trip latency between me and kernel.org is too large for it
> to be useful anyway over NFS. :)
The NFS case is exactly what I want to use. I want each repo to
have its own packfiles to reduce load on the central alternate,
but these local repos would not include megablobs. I do not have
as strong a feeling about whether the central alternate should
pack its megablobs or not [but I don't want to do it if it costs me
deltification for everybody], but I need a way to exclude megablobs
from getting into local packs. With such exclusion, git-gc/repack
is extremely quick. There is NO WAY this can be true if
several GB have to be copied around, which again comes from Geert.
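Concretely, the setup described here would look roughly like this (the NFS
path is invented, and repack.maxblobsize is the knob added by this patch,
with its value in KB):

  # Each user's repository borrows objects from the central store over NFS.
  echo /net/central/project.git/objects >> .git/objects/info/alternates

  # Keep megablobs out of the user's local packs (value in KB).
  git config repack.maxblobsize 2000

  # -l/--local repacks only objects that live in this repository, so blobs
  # reachable through the alternate are never copied into local packs.
  git repack -a -d -l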
I think this conversation does suggest one alteration to my patch:
as submitted it writes out a loose object if the object is packed
and has no loose object. It should really do this only if the
object is packed LOCALLY and has no loose object.
> So I think this "explode out megablobs" is a bad idea. It's violating
> other things that make us fast, like #3's being able to reuse large
> parts of an existing packfile during transfer.
As I said, your real argument doesn't apply in my case.
> Dana pointed out the megablobs make access slower because their
> packfile indexes must still be searched to locate a commit; but if
> the megablob packfile(s) contain only blobs then there is no value
> in looking at those packfiles.
>
> We might be able to fix this by altering the sort_pack function
> in sha1_file.c to not only order by mtime, but also by the ratio
> of the size of the .pack to the number of objects stored in it.
> Any packfile with a high size/object ratio is likely to be what
> Dana has been calling a "metadata" pack, holding things like tags,
> commits, trees and small blobs. It's these packfiles that we want
> to search first, as they are the most likely to be accessed.
>
> By pushing the megablob packs to the end of our packed_git search
> list we won't tend to scan their indexes, as most of our objects
> will be found earlier in the search list. Hence we will generally
> avoid any costs associated with their indexes.
Good argument and I submitted a patch to do this.
Let's see who chokes on the floating arithmetic ;-)
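For reference, a rough shell approximation of that heuristic, needing only
integer arithmetic (it ranks packs by bytes per object):

  # Rank the local packs by bytes per object: dense "metadata" packs sort
  # first, sparse megablob packs sort last.
  for idx in .git/objects/pack/pack-*.idx; do
      objects=$(git show-index < "$idx" | wc -l)
      bytes=$(wc -c < "${idx%.idx}.pack")
      echo "$((bytes / objects)) ${idx%.idx}.pack"
  done | sort -n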
> Huge packfiles probably should be scheduled for keeping with a .keep
> automatically. We probably should teach pack-objects to generate a
> .keep file if the resulting .pack was over a certain size threshold
> (say 1.5 GiB by default) and teach git-repack to rename the .keep
> file as it also renames the .idx and .pack.
I have experimented with this, and Jakub Narebski made related
suggestions. I find this quite hokey, but even if I do it in my central
alternate, I still do not want to be packing megablobs in individual users'
repos EVER, and need some way to exclude them.
> Better that we degrade gracefully when faced with massive inputs
> than we do something stupid by default and make the poor user pay
> for their mistake of not thoroughly reading plumbing documentation
> before use.
Unnecessary copying of several GB is not degrading gracefully in my view.
In fact having repack.maxblobsize = 2000 (K) in the default "config"
strikes me as degrading much more gracefully than what the code
would currently do.
This silly patch took my packfile sets from 12GB+ to 13MB,
and it's difficult to describe how relieved I now feel.
It's also difficult for me to believe that a setup that treats 12GB
(almost) equally is going to be as efficient as one which concentrates on 13MB.
That's three orders of magnitude, a point I've made before.
But I think you have an understandable motivation:
you want packfiles to be as good as possible, and any escape
mechanism from them decreases the motivation to "fix" packing.
Now I agree with this, which is why I just submitted some other patches,
but I don't share your goal of the universality of all packfiles --
just the ones used for transport. Don't your packv4 plans introduce
mods which won't be used for transport as well?
> Now I would agree that we should punt on deltification of anything
> that is just too large, and let the user decide what too large means,
> and default it around 500 or 1024 MiB. But I would still stuff it
> into a packfile.
>
> Maybe it still makes sense to have a limit on the maximum size of a
> loose object to pack, but I think that's only to avoid the sick case
> of a very simple no-argument "git repack" taking a long while because
> there's 8 loose objects and 6 of them are 900 MiB image files.
Perhaps we _will_ make progress if we all agree to describe
my situation as "sick" ;-) . In this paragraph you seem to agree that
there is some argument for keeping megablobs from _entering_ packs?
One reason I like my patch is because I do view megablobs
as perverting the system, and just keeping them out of the optimized
packfile system is a big step forward.
> Once in a packfile, I'd keep it there, even if the user decreases
> the threshold, as the advantages of it being in the packfile outweigh
> the disadvantages of it being in the packfile. And there's like no
> advantage to being loose once packed.
To (almost) follow this suggestion I would need git-fast-import to respect
repack.maxblobsize as well. Is that OK with you?
I previously offered to Junio that the "write loose object" thing
could be restricted: it would only happen if -f were supplied to
git-repack, otherwise the bad blob would pass through to the new pack.
Does this "reduction in strength" make this feature more palatable to you?
If the stats on a repo change significantly, "write loose object"
becomes more important if you have to make a significant reduction
to repack.maxblobsize (or specify it for the first time).
I don't agree that once in a packfile, a blob should stay there.
Its presence is degrading access to "normal" blobs co-habiting with it.
So you will want to repack to separate them in different packs
(the various .keep-related ideas) or just write them out loose.
To conclude:
the patch wrote out a new loose object when it was previously
packed and is larger than repack.maxblobsize.
I could change this to only happen when the object is
(1) packed AND
(2) locally packed AND
(3) -f/--no-object-reuse was specified to git-repack/git-pack-objects.
The previous behavior that a megablob never _enters_ the pack
would remain unchanged.
This more restrictive behavior would be sufficient for me,
and I think I *need* it at least in the users' repositories
in an NFS/alternates setup.
What do you think?
Thanks,
--
Dana L. How danahow@gmail.com +1 650 804 5991 cell
* Re: [PATCH] Prevent megablobs from gunking up git packs
2007-05-24 23:29 ` Dana How
@ 2007-05-25 2:06 ` Shawn O. Pearce
2007-05-25 5:44 ` Nicolas Pitre
0 siblings, 1 reply; 24+ messages in thread
From: Shawn O. Pearce @ 2007-05-25 2:06 UTC (permalink / raw)
To: Dana How; +Cc: Junio C Hamano, Git Mailing List
Dana How <danahow@gmail.com> wrote:
> On 5/24/07, Shawn O. Pearce <spearce@spearce.org> wrote:
> >Junio C Hamano <junkio@cox.net> wrote:
> >> "Dana How" <danahow@gmail.com> writes:
> >> > We have three options in this case:
> >> > (1) Drop the object (do not put it in the new pack(s)).
> >> > (2) Pass the object into the new pack(s).
> >> > (3) Write out the object as a new loose object.
> >> > Option (1) is unacceptable. When you call git-repack -a,
> >> > it blindly deletes all the non-kept packs at the end. So
> >> > the megablobs would be lost.
> >> Ok, I can buy that -- neither (1) nor (2) is acceptable and (3) is
> >> the only sane thing to do for previously packed objects that
> >> exceed the size limit.
> >
> >I still don't buy the idea that these megablobs shouldn't be packed.
> >I understand Dana's pain here (at least a little bit, my problems
> >aren't as bad as his are), but I also hate to see us run away from
> >packfiles for these really sick cases just because we have some
> >issues in our current packfile handling.
> >
> >Packfiles give us a lot of benefits:
> >
> > 1) less inode usage;
> I agree with Geert that blowing an inode on a 100MB+ object
> is no big deal.
You and me both. If the size of the blob is high enough, the
overhead associated with the inode and any tail-block wastage is
noise. Filesystems are pretty good at tracking large-ish files.
My comment here wasn't so much about blowing an inode on a 100MiB+
object, but just in general that packfiles reduce inode usage,
which in the common metadata case (300,000 small objects) is a
big difference.
I think you are right; if we get an object in the >100MiB size
range we can certainly afford an inode for it.
> > 2) transport can write directly to local disk;
> > 3) transport can (quickly) copy from local disk;
> For (2) and (3) see comments on next para plus NFS discussion.
> > 4) testing for existence is *much* faster;
> This is true. But I don't care about this cost if it
> is only incurred on large objects which are leaf nodes
> in the git "data relationship tree" (tags->commits->trees->blobs) anyway.
Yes, that's true.
> > 5) deltafication is possible;
> Again, Geert made a good argument that hadn't occurred to me:
> you definitely DON'T want to do deltification on such large objects.
> Junio recently added a delta/nodelta attribute; this would be useful
> to me, but unfortunately I have several continua of files, each with
> the same suffix but with widely varying sizes, so attributes won't
> help me unless the name globs in .gitattributes are expanded to full
> expressions similar to find(1) [i.e. include testing based on size,
> perms, type], which I think would be insane.
Which brings up the comment I think I made (below) about skipping
deltas on very large objects. Things over a certain size are not
likely to delta well, or in any reasonable time. We probably should
default to not trying to delta those, but let the user force us to
do so with a .gitattributes option. Maybe.
> >By pushing the megablob packs to the end of our packed_git search
> >list we won't tend to scan their indexes, as most of our objects
> >will be found earlier in the search list. Hence we will generally
> >avoid any costs associated with their indexes.
> Good argument and I submitted a patch to do this.
> Let's see who chokes on the floating arithmetic ;-)
I actually had another thought in this area. I'll try to work up a
patch to accompany yours. I think we can avoid even touching the
alternate object databases half of the time, and I'd like to be
able to do that. Why? Because I started to set up this megablob
approach on my own Windows-based repositories. Unfortunately it
makes git-log about 1 second slower, and I suspect it's in the
alternate repository initialization.
> >Huge packfiles probably should be scheduled for keeping with a .keep
> >automatically. We probably should teach pack-objects to generate a
> >.keep file if the resulting .pack was over a certain size threshold
> >(say 1.5 GiB by default) and teach git-repack to rename the .keep
> >file as it also renames the .idx and .pack.
> I have experimented with this, and Jakub Narebski made related
> suggestions. I find this quite hokey, but even if I do it in my central
> alternate, I still do not want to be packing megablobs in individual users'
> repos EVER, and need some way to exclude them.
Yes, that makes a lot of sense.
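Until pack-objects learns to do this itself, the automatic .keep behavior
can be approximated by hand; a sketch, with the 1.5 GiB threshold taken
from the suggestion above:

  # Mark every pack above the threshold as kept, so later "git repack -a -d"
  # runs leave it alone.
  threshold=$((1536 * 1024 * 1024))
  for pack in .git/objects/pack/pack-*.pack; do
      if [ "$(wc -c < "$pack")" -gt "$threshold" ]; then
          touch "${pack%.pack}.keep"
      fi
  done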
> >Better that we degrade gracefully when faced with massive inputs
> >than we do something stupid by default and make the poor user pay
> >for their mistake of not thoroughly reading plumbing documentation
> >before use.
> Unnecessary copying of several GB is not degrading gracefully in my view.
> In fact having repack.maxblobsize = 2000 (K) in the default "config"
> strikes me as degrading much more gracefully than what the code
> would currently do.
Sure. But I think this goes back to #3 (network transport) and
how our loose object format now doesn't support it well. And even
if that's fixed I don't think 2 MiB is a good default; it's *far*
too low. I have a number of blobs that are in the 12-16 MiB range
and they delta very well in a pretty reasonable time.
> This silly patch took my packfile sets from 12GB+ to 13MB,
> and it's difficult to describe how relieved I now feel.
I think I understand a little bit. Today I took 3 repositories
that were about 70 MiB each and dropped them down to 16 MiB, 2
MiB and 4.5 MiB by creating a single 120 MiB "megablob" packfile
that spanned all 3 of them. This isn't the same scale as what you
are dealing with, but now I have a current metadata pack for each
that isn't gummed up with large blobs, making repacking faster.
I also have a smaller working set size. :-)
> But I think you have an understandable motivation:
> you want packfiles to be as good as possible, and any escape
> mechanism from them decreases the motivation to "fix" packing.
Yes, that's correct. I'm not against stepping outside of
packfiles and making usage of loose objects for megablobs easier.
I just want to make sure it's the best way to handle these things.
Generally we've made major improvements in things when we've been
pushed by large repositories/datasets.
> Now I agree with this, which is why I just submitted some other patches,
> but I don't share your goal of the universality of all packfiles --
> just the ones used for transport. Don't your packv4 plans introduce
> mods which won't be used for transport as well?
Yes, at least initially we'd reencode from pack v4 down to pack v2
for transport, because transporting the dictionary with delta reuse
is an interesting problem. However Nico and I have discussed it
at length and have plans for how to code a pack v4 based transport,
and pack v4's file format concepts are partially based upon making
pack v4 transport easier to implement. But from a "start simple and
keep it simple, stupid" principle we'd like to avoid the complexity
early on.
> >Now I would agree that we should punt on deltification of anything
> >that is just too large, and let the user decide what too large means,
> >and default it around 500 or 1024 MiB. But I would still stuff it
> >into a packfile.
> >
> >Maybe it still makes sense to have a limit on the maximum size of a
> >loose object to pack, but I think that's only to avoid the sick case
> >of a very simple no-argument "git repack" taking a long while because
> >there's 8 loose objects and 6 of them are 900 MiB image files.
> Perhaps we _will_ make progress if we all agree to describe
> my situation as "sick" ;-) . In this paragraph you seem to agree that
> there is some argument for keeping megablobs from _entering_ packs?
Yes. If your workflow is basically "git add HUGE; git commit; git push;
git prune-alternates" then you only have to pack the huge object once,
and can remove the loose object from the user's .git/objects directory
pretty quickly, because it's available via your NFS alternate. In such
a configuration, yes, it does make some sense to never allow a megablob
to enter a pack.
So I guess I'm partially in agreement with you...
> >Once in a packfile, I'd keep it there, even if the user decreases
> >the threshold, as the advantages of it being in the packfile outweigh
> >the disadvantages of it being in the packfile. And there's like no
> >advantage to being loose once packed.
> To (almost) follow this suggestion I would need git-fast-import to respect
> repack.maxblobsize as well. Is that OK with you?
Yes I could implement that (or better accept a patch that does so)
but I'd actually wonder why not just categorize the objects into
two different packfiles. Have one for "small stuff" and another
for "everything larger than small stuff". Split the two packfiles
independently of each other. Hence fast-import would produce more
packfiles, but each output packfile would probably have a couple
of megablobs in it, and you'd have one single packfile with all of
the smaller metadata.
And actually if you are trying to shove large objects through
fast-import we really can do a lot better. Like avoiding
deltification attempts, adjusting the compression level to something
better suited to your blob (I don't know if it's compressible or not)
and streaming to the output packfile, rather than holding the entire
thing in memory before writing the first byte.
Given the advantages discussed above about being in a packfile, and
that fast-import was written specifically for creating packfiles
instead of loose objects during large IO transfers into Git, I
think it is sort of anti-fast-import to have it create loose objects.
> I previously offered to Junio that the "write loose object" thing
> could be restricted: it would only happen if -f were supplied to
> git-repack, otherwise the bad blob would pass through to the new pack.
> Does this "reduction in strength" make this feature more palatable to you?
Yes. But go back to the .keep discussion above where I suggest
we automatically .keep any "large" packfile. Once you get a huge
packfile you probably don't want to redo the disk IO associated with
repacking it, unless you really are trying to force a large reorg.
So I'd agree with the idea of making a -f needed to eject a megablob,
but I think you'd need more than that as you'd also be trying to
bypass the usual .keep logic.
> If the stats on a repo change significantly, "write loose object"
> becomes more important if you have to make a significant reduction
> to repack.maxblobsize (or specify it for the first time).
I'm not sure users should be tweaking this... It's fine to make
the knob available, but we really should have the knob adjust
itself somewhat intelligently. Don't ever add a knob that you
cannot write a rule to control; if you can write a rule, then write
it, dammit, and don't make the user do your job for you. Knobs are
good for when your case is just so far away from the normal that
the rule is utterly wrong...
Guess what, your repository is such a repository (it's far away from
our normal rules). But I think its problems are also common enough
that we really should attempt to make our rules handle it better.
> I don't agree that once in a packfile, a blob should stay there.
> Its presence is degrading access to "normal" blobs co-habiting with it.
> So you will want to repack to separate them in different packs
> (the various .keep-related ideas) or just write them out loose.
If we're evicting a megablob from one packfile to a loose object
(because it's degrading access to the other objects in its current
packfile), we're already committed to doing the massive disk IO
required for that eviction. We might as well write it back out
to a file format that we can more easily work with, rather than one
that we cannot.
But I have to circle back here and say "why is a megablob degrading
access in a packfile"? This goes right back to your point above
about my wanting to stay in the packfile format just to make the
packfile format better. What's so wrong with the packfile format
(and the code that writes/reads it) that makes access to the small
metadata objects more expensive when there are megablobs stored in the
same packfile?
Or is it just because we like to repack the smaller metadata
frequently, but that's horribly expensive because the megablobs
are in the same packfile? If it's really just about repacking, then
.keep-marked megablob packs are the way to go.
--
Shawn.
* Re: [PATCH] Prevent megablobs from gunking up git packs
2007-05-25 2:06 ` Shawn O. Pearce
@ 2007-05-25 5:44 ` Nicolas Pitre
0 siblings, 0 replies; 24+ messages in thread
From: Nicolas Pitre @ 2007-05-25 5:44 UTC (permalink / raw)
To: Shawn O. Pearce; +Cc: Dana How, Junio C Hamano, Git Mailing List
OK..... I ignore git@vger.kernel.org for a day or two and things really
start to go wild! ;-)
I'll try to cover only those points that are still debatable. I think
everybody agrees that huge blobs stored as loose objects using extra
inodes are the least of our worries.
On Thu, 24 May 2007, Shawn O. Pearce wrote:
> Dana How <danahow@gmail.com> wrote:
> > On 5/24/07, Shawn O. Pearce <spearce@spearce.org> wrote:
> > >Junio C Hamano <junkio@cox.net> wrote:
> > >> "Dana How" <danahow@gmail.com> writes:
> > >> > We have three options in this case:
> > >> > (1) Drop the object (do not put it in the new pack(s)).
> > >> > (2) Pass the object into the new pack(s).
> > >> > (3) Write out the object as a new loose object.
> > >> > Option (1) is unacceptable. When you call git-repack -a,
> > >> > it blindly deletes all the non-kept packs at the end. So
> > >> > the megablobs would be lost.
> > >> Ok, I can buy that -- neither (1) nor (2) is acceptable and (3) is
> > >> the only sane thing to do for previously packed objects that
> > >> exceed the size limit.
OK... I sort of agree, but not entirely.
First, let's examine the reasons for wanting to expel a big blob out
of a pack.
The first reason I've seen is that big blobs put surrounding objects way
apart and pack access performance gets bad, especially tree walking.
The solution to this problem is trivial: let's simply store big blobs
together at the end of the pack! Problem solved.
The other reason for keeping huge blobs out is that they bring repack
performance down and create unnecessary IO. Well, in that case I think
that you should simply avoid (re)packing them in the first place. I
think it should be possible to combine both features: the split packs
and the big-blobs-go-at-the-end solution I mentioned above so that those
big blobs could end up in one or more packs of their own.
But writing loose objects from git-pack-objects... Nah, this is just too
hacky and ugly. The tool is about packing objects and starting to
create loose objects from there is pushing the packing concept a bit too
far for my taste.
I wouldn't mind a _separate_ tool that would load a pack index,
determine object sizes from it, and then extract big objects to write
them as loose objects (although I question its usefulness). But not
within pack-objects please.
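The inspection half of such a tool can already be scripted with existing
plumbing; a sketch (the pack name and the 2000 KB threshold are
placeholders):

  # List blobs in one pack whose canonical size exceeds the threshold, using
  # the per-object sizes that git verify-pack reports.
  git verify-pack -v .git/objects/pack/pack-XXXXXXXX.idx |
      awk '$2 == "blob" && $3 > 2000 * 1024 { print $3, $1 }' |
      sort -rn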
So I think the best solution really involves a new parameter to
git-pack-objects allowing objects whose size exceeds a certain
threshold to go at the end of the pack. If they end up in a different
pack because of the pack size limit then so be it, at which point you could
always explode that huge-blob pack into loose objects, avoiding the need
for the extra tool I mention above, but again I don't think that would
be that useful.
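And the "explode that huge-blob pack" step also needs no new tool; a
sketch, again with a placeholder pack name:

  # Move the pack out of the object store first; git unpack-objects skips
  # objects that are already present, so an in-place pack would unpack nothing.
  mv .git/objects/pack/pack-XXXXXXXX.pack /tmp/
  mv .git/objects/pack/pack-XXXXXXXX.idx /tmp/
  git unpack-objects < /tmp/pack-XXXXXXXX.pack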
> > Again, Geert made a good argument that hadn't occurred to me:
> > you definitely DON'T want to do deltification on such large objects.
> > Junio recently added a delta/nodelta attribute; this would be useful
> > to me, but unfortunately I have several continua of files, each with
> > the same suffix but with widely varying sizes, so attributes won't
> > help me unless the name globs in .gitattributes are expanded to full
> > expressions similar to find(1) [i.e. include testing based on size,
> > perms, type], which I think would be insane.
I think having a parameter to exclude objects whose size exceeds a
specified threshold from deltification attempts would also be a
valid option. But...
> Which brings up the comment I think I made (below) about skipping
> deltas on very large objects. Things over a certain size are not
> likely to delta well, or in any reasonable time. We probably should
> default to not trying to delta those, but let the user force us to
> do so with a .gitattributes option. Maybe.
I don't agree with the presumption that huge objects are unlikely to
delta well. It really depends on the data you have. If, for example,
you want to store, say, different versions of a filesystem image, then
those different images have the potential to be really huge. Yet they
might delta extremely well against each other and provide a tremendous
space saving.
It all depends on the kind of data you work with.
It is good to have the possibility to skip deltification based on a file
attribute. It is also good to have the possibility to skip
deltification based on object size (through a command line switch or
config entry). But those must remain _options_.
> > >Huge packfiles probably should be scheduled for keeping with a .keep
> > >automatically. We probably should teach pack-objects to generate a
> > >.keep file if the resulting .pack was over a certain size threshold
> > >(say 1.5 GiB by default) and teach git-repack to rename the .keep
> > >file as it also renames the .idx and .pack.
Nah. That kind of arbitrary default is most likely to be fine for
some cases and bad for many others. These "sick" cases such as Dana's
are so special that they had better be manually tuned for best operation
according to the data set, and more importantly to the workflow used,
because different workflows are likely to require different "defaults".
Better not to put in any arbitrary default, and instead create an
"Advanced tuning for best performance with insane repositories" section
in the documentation.
> > >Better that we degrade gracefully when faced with massive inputs
> > >than we do something stupid by default and make the poor user pay
> > >for their mistake of not thoroughly reading plumbing documentation
> > >before use.
Well, I think that if someone is seriously considering GIT for a
multi-gigabyte repository, that person had better read a little
documentation before starting to play. Of course, the advanced tuning
section for huge repositories I'm suggesting should stand out in the main
index. And most "poor users" usually don't have such a big repo to
fool themselves with.
> > >Now I would agree that we should punt on deltification of anything
> > >that is just too large, and let the user decide what too large means,
> > >and default it around 500 or 1024 MiB. But I would still stuff it
> > >into a packfile.
Well, the thing is, once deltified, those huge objects won't be subject to
deltification attempts anymore, unless -f is used. So the deltification
cost will happen only once anyway. Then it is only a matter of
flagging a particular pack with .keep to exclude it from any further
repacking, which would simply end up wasting disk IO anyway.
There is certainly a hard default on deltification attempts that we
should impose right now though, which is 4GB. The reason is that the
delta encoding doesn't do offsets larger than 32 bits at the moment.
> > I previously offered to Junio that the "write loose object" thing
> > could be restricted: it would only happen if -f were supplied to
> > git-repack, otherwise the bad blob would pass through to the new pack.
> > Does this "reduction in strength" make this feature more palatable to you?
Not really.
Like I said before, I'd much prefer to have a split pack for huge
objects, and a separate unpack-objects pass on it if you really want them
loose. If you want to deny entry of loose objects into a pack based on
their size that's understandable, but only if they're already loose.
> > I don't agree that once in a packfile, a blob should stay there.
> > Its presence is degrading access to "normal" blobs co-habiting with it.
As mentioned at the top I don't think this is a big issue.
> Or is it just because we like to repack the smaller metadata
> frequently, but that's horribly expensive because the megablobs
> are in the same packfile? If its really just about repacking then
> .keep marked megablob packs are the way to go.
I think so as well.
Nicolas
end of thread
Thread overview: 24+ messages
2007-05-22 6:14 [PATCH] Prevent megablobs from gunking up git packs Dana How
2007-05-22 6:30 ` Shawn O. Pearce
2007-05-22 7:33 ` Dana How
2007-05-22 6:52 ` Junio C Hamano
2007-05-22 8:00 ` Dana How
2007-05-22 11:05 ` Jakub Narebski
2007-05-22 16:59 ` Dana How
2007-05-22 23:44 ` Jakub Narebski
2007-05-23 0:28 ` Junio C Hamano
2007-05-23 1:58 ` Nicolas Pitre
2007-05-22 17:38 ` Nicolas Pitre
2007-05-22 18:07 ` Dana How
2007-05-23 22:08 ` Junio C Hamano
2007-05-23 23:55 ` Dana How
2007-05-24 1:44 ` Junio C Hamano
2007-05-24 7:12 ` Shawn O. Pearce
2007-05-24 9:38 ` Johannes Schindelin
2007-05-24 17:23 ` david
2007-05-24 17:29 ` Johannes Schindelin
2007-05-25 0:55 ` Shawn O. Pearce
2007-05-24 20:43 ` Geert Bosch
2007-05-24 23:29 ` Dana How
2007-05-25 2:06 ` Shawn O. Pearce
2007-05-25 5:44 ` Nicolas Pitre