* Horrible re-packing?
From: Linus Torvalds @ 2006-06-05 17:08 UTC
To: Junio C Hamano, Nicolas Pitre; +Cc: Git Mailing List

Junio, Nico,

I just tried doing a "git repack -a -d -f" to verify the pack generation,
because I expected a full re-pack to do _better_ than doing occasional
incrementals, but imagine my shock when IT SUCKS.

I didn't look at where the suckage started, but look at this:

	[torvalds@g5 git]$ git repack -a -d
	Generating pack...
	Done counting 21322 objects.
	Deltifying 21322 objects.
	 100% (21322/21322) done
	Writing 21322 objects.
	 100% (21322/21322) done
	Total 21322, written 21322 (delta 14489), reused 21319 (delta 14486)
	Pack pack-fe4ff117c9959ead3443b826a777423b3062b666 created.

	[torvalds@g5 git]$ ll .git/objects/pack/
	total 7008
	-rw-r--r-- 1 torvalds torvalds  512792 Jun  5 09:41 pack-fe4ff117c9959ead3443b826a777423b3062b666.idx
	-rw-r--r-- 1 torvalds torvalds 6643695 Jun  5 09:41 pack-fe4ff117c9959ead3443b826a777423b3062b666.pack

Ie, we have a nice 6.33MB pack-file. Now:

	[torvalds@g5 git]$ git repack -a -d -f
	Generating pack...
	Done counting 21322 objects.
	Deltifying 21322 objects.
	 100% (21322/21322) done
	Writing 21322 objects.
	 100% (21322/21322) done
	Total 21322, written 21322 (delta 10187), reused 6777 (delta 0)
	Pack pack-fe4ff117c9959ead3443b826a777423b3062b666 created.

	[torvalds@g5 git]$ ll .git/objects/pack/
	total 15352
	-rw-r--r-- 1 torvalds torvalds   512792 Jun  5 09:41 pack-fe4ff117c9959ead3443b826a777423b3062b666.idx
	-rw-r--r-- 1 torvalds torvalds 15176139 Jun  5 09:41 pack-fe4ff117c9959ead3443b826a777423b3062b666.pack

Whaah! That nice 6.33MB pack-file exploded to 14.5MB!

Doing repeated "git repack -a -d" to try to do incrementals, it stopped
improving after the sixth one, at which point it was down to 11.7MB,
still almost twice as big as before.

Re-doing it with

	git repack -a -d -f --depth=100 --window=100

got me back to 6.94MB, but that's still 10% larger than the pack-file I
had before.

Interestingly, it's the "window" that matters more. The depth part didn't
make that huge of a difference, so it looks like it's the sorting
heuristic that may be broken again. And it's possibly broken by the fact
that we've been renaming things lately (ie the "rev-list.c" ->
"builtin-rev-list.c" thing ends up not finding things).

Nico? Any ideas?

		Linus

* Re: Horrible re-packing?
From: Linus Torvalds @ 2006-06-05 18:44 UTC
To: Junio C Hamano, Nicolas Pitre; +Cc: Git Mailing List

On Mon, 5 Jun 2006, Linus Torvalds wrote:
>
> Whaah! That nice 6.33MB pack-file exploded to 14.5MB!
>
> And it's possibly broken by the fact that we've been renaming things
> lately (ie the "rev-list.c" -> "builtin-rev-list.c" thing ends up not
> finding things)

No, it's even simpler. The breakage is entirely mine, and due to the
tree-walking conversion of the "process_tree()" function.

In that function, we used to have a local "const char *name" that
_shadowed_ the incoming _argument_ with the same type, and the
tree-walking conversion did not notice that the inner "name" should have
been converted to "entry.path" - so it used the outer-level "name".

Gaah. We should probably use -Wshadow or something, which would hopefully
have warned about the re-use of the same variable name in two different
scopes.

Regardless, this fixes it.

		Linus

---
diff --git a/builtin-rev-list.c b/builtin-rev-list.c
index 17c04b9..e885624 100644
--- a/builtin-rev-list.c
+++ b/builtin-rev-list.c
@@ -135,9 +135,9 @@ static struct object_list **process_tree
 
 	while (tree_entry(&desc, &entry)) {
 		if (S_ISDIR(entry.mode))
-			p = process_tree(lookup_tree(entry.sha1), p, &me, name);
+			p = process_tree(lookup_tree(entry.sha1), p, &me, entry.path);
 		else
-			p = process_blob(lookup_blob(entry.sha1), p, &me, name);
+			p = process_blob(lookup_blob(entry.sha1), p, &me, entry.path);
 	}
 	free(tree->buffer);
 	tree->buffer = NULL;

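A short aside, not part of the original message: a minimal, made-up sketch
of the kind of shadowing bug described above (the function and variable
names here are invented, not git code). Compiled with "gcc -Wshadow", the
inner declaration draws a warning along the lines of "declaration of
'name' shadows a parameter":

	#include <stdio.h>

	static void process(const char *name)
	{
		{
			const char *name = "inner";	/* shadows the parameter */
			printf("inner sees: %s\n", name);	/* prints "inner" */
		}
		printf("outer sees: %s\n", name);	/* the parameter again */
	}

	int main(void)
	{
		process("outer");
		return 0;
	}

Without -Wshadow this compiles silently, which is exactly how the
process_tree() conversion slipped through.
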
* Re: Horrible re-packing?
From: Linus Torvalds @ 2006-06-05 19:03 UTC
To: Junio C Hamano, Nicolas Pitre; +Cc: Git Mailing List

On this same thread..

This trivial patch not only simplifies the name hashing, it actually
improves packing for both git and the kernel.

The git archive pack shrinks from 6824090->6622627 bytes (a 3%
improvement), and the kernel pack shrinks from 108756213 to 108219021 (a
mere 0.5% improvement, but still, it's an improvement from making the
hashing much simpler!)

I think the hash function with its comment is self-explanatory:

	/*
	 * This effectively just creates a sortable number from the
	 * last sixteen non-whitespace characters. Last characters
	 * count "most", so things that end in ".c" sort together.
	 */
	while ((c = *name++) != 0) {
		if (isspace(c))
			continue;
		hash = (hash >> 2) + (c << 24);
	}
	return hash;

ie we just create a 32-bit hash, where we "age" previous characters by two
bits, so the last characters in a filename count most. So when we then
compare the hashes in the sort routine, filenames that end the same way
sort the same way.

It takes the subdirectory into account (unless the filename is > 16
characters), but files with the same name within the same subdirectory
will obviously sort closer than files in different subdirectories.

And, incidentally (which is why I tried the hash change in the first
place, of course) builtin-rev-list.c will sort fairly close to rev-list.c.

And no, it's not a "good hash" in the sense of being secure or unique, but
that's not what we're looking for. The whole "hash" thing is misnamed
here. It's not so much a hash as a "sorting number".

Comments?

		Linus

----
 pack-objects.c |   41 +++++++++++------------------------------
 1 files changed, 11 insertions(+), 30 deletions(-)

diff --git a/pack-objects.c b/pack-objects.c
index 3590cd5..ae49fba 100644
--- a/pack-objects.c
+++ b/pack-objects.c
@@ -473,38 +473,19 @@
 #define DIRBITS 12
 
 static unsigned name_hash(struct name_path *path, const char *name)
 {
-	struct name_path *p = path;
-	const char *n = name + strlen(name);
-	unsigned hash = 0, name_hash = 0, name_done = 0;
-
-	if (n != name && n[-1] == '\n')
-		n--;
-	while (name <= --n) {
-		unsigned char c = *n;
-		if (c == '/' && !name_done) {
-			name_hash = hash;
-			name_done = 1;
-			hash = 0;
-		}
-		hash = hash * 11 + c;
-	}
-	if (!name_done) {
-		name_hash = hash;
-		hash = 0;
-	}
-	for (p = path; p; p = p->up) {
-		hash = hash * 11 + '/';
-		n = p->elem + p->len;
-		while (p->elem <= --n) {
-			unsigned char c = *n;
-			hash = hash * 11 + c;
-		}
-	}
+	unsigned char c;
+	unsigned hash = 0;
+
 	/*
-	 * Make sure "Makefile" and "t/Makefile" are hashed separately
-	 * but close enough.
+	 * This effectively just creates a sortable number from the
+	 * last sixteen non-whitespace characters. Last characters
+	 * count "most", so things that end in ".c" sort together.
 	 */
-	hash = (name_hash<<DIRBITS) | (hash & ((1U<<DIRBITS )-1));
+	while ((c = *name++) != 0) {
+		if (isspace(c))
+			continue;
+		hash = (hash >> 2) + (c << 24);
+	}
 	return hash;
 }

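A short aside, not part of the original message: the loop above can be
dropped into a standalone program to see the "sorting number" behaviour.
The hash body is copied from the patch; the driver and the example paths
are arbitrary and only illustrate that names sharing a tail end up
numerically close (their high bits agree), so they sort next to each
other:

	#include <ctype.h>
	#include <stdio.h>

	/* the hash from the patch above, as a standalone function */
	static unsigned name_hash(const char *name)
	{
		unsigned char c;
		unsigned hash = 0;

		while ((c = *name++) != 0) {
			if (isspace(c))
				continue;
			hash = (hash >> 2) + (c << 24);
		}
		return hash;
	}

	int main(void)
	{
		/* arbitrary example paths */
		printf("%08x rev-list.c\n", name_hash("rev-list.c"));
		printf("%08x builtin-rev-list.c\n", name_hash("builtin-rev-list.c"));
		printf("%08x Makefile\n", name_hash("Makefile"));
		printf("%08x t/Makefile\n", name_hash("t/Makefile"));
		return 0;
	}

Because each later character shifts the earlier ones right by two bits,
the last ten identical characters of "rev-list.c" and
"builtin-rev-list.c" pin down the top ~20 bits of both hashes, so the two
values differ only in the low bits.
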
* Re: Horrible re-packing?
From: Junio C Hamano @ 2006-06-05 19:37 UTC
To: Linus Torvalds; +Cc: Nicolas Pitre, Git Mailing List

Linus Torvalds <torvalds@osdl.org> writes:

> I think the hash function with its comment is self-explanatory:
>
>	/*
>	 * This effectively just creates a sortable number from the
>	 * last sixteen non-whitespace characters. Last characters
>	 * count "most", so things that end in ".c" sort together.
>	 */
>	while ((c = *name++) != 0) {
>		if (isspace(c))
>			continue;
>		hash = (hash >> 2) + (c << 24);
>	}
>	return hash;
>
> ie we just create a 32-bit hash, where we "age" previous characters by
> two bits, so the last characters in a filename count most. So when we
> then compare the hashes in the sort routine, filenames that end the
> same way sort the same way.

IIRC, sometimes this function is called with path and name split, and
sometimes with the full path in name, depending on who calls you (the
latter happens for rev-list --objects generated names, and the former is
for objects we extract ourselves from the --thin base tree, or something
like that). I suspect your patch may break paths whose filename after the
last slash is shorter than 16 bytes.

* Re: Horrible re-packing?
From: Linus Torvalds @ 2006-06-05 19:57 UTC
To: Junio C Hamano; +Cc: Nicolas Pitre, Git Mailing List

On Mon, 5 Jun 2006, Junio C Hamano wrote:
>
> IIRC, sometimes this function is called with path and name split, and
> sometimes with the full path in name

Yeah, I was pretty confused by the whole hashing thing. Are you sure that
complexity is needed? It seems a bit overkill.

		Linus

* Re: Horrible re-packing?
From: Junio C Hamano @ 2006-06-05 23:54 UTC
To: Linus Torvalds; +Cc: git

Linus Torvalds <torvalds@osdl.org> writes:

> On Mon, 5 Jun 2006, Junio C Hamano wrote:
>>
>> IIRC, sometimes this function is called with path and name split, and
>> sometimes with the full path in name
>
> Yeah, I was pretty confused by the whole hashing thing. Are you sure
> that complexity is needed? It seems a bit overkill.

Two issues in the code confuse any reader of that function.

 - The code wants to hash Makefile from different revisions together,
   and Makefile and t/Makefile close to each other. The current code did
   it by treating '/' specially, using the basename hash as the upper
   bits of the resulting hash and the dirname hash as the lower bits.
   It's my tendency to treat slashes too specially, which is one of your
   favorite things to pick me on.

   This is not needed with your change anymore -- by only using the tail
   of the filename, and making sure the tail part weighs more in the
   resulting group number, the new code gives the desired grouping
   characteristics in a much simpler way.

 - The output from rev-list --objects is fed to the function as its name
   parameter while path is set to NULL. When we allow a thin pack to be
   generated, rev-list --objects output also contains
   "-<commit-object-name>" lines. We read trees for these commits that
   are not going to be sent but can be used as base objects, and pass
   the pathname discovered from the tree using the path and name pair
   (path is set to the linked list of struct name_path that describes
   the dirname, and name is set to the basename). This was done to
   reduce the need for allocating and copying the pathname in
   preparation for calling the name_hash() function.

   If you use only the "name" variable in your group number computation,
   and suppose we are doing send-pack to send updates between rev A..B,
   then contrib/git-svn/Makefile from rev B will use git-svn/Makefile
   (its tail 16 characters) to compute the number, but the blob from rev
   A (which we are not going to send but would want to use as a
   potential delta base) will have contrib/git-svn in "path" (the
   element points at the string "git-svn", and its uplink points at
   another element that points at "contrib" with an uplink that says it
   is at the root level), and Makefile in "name". They will be hashed
   slightly differently.

* Re: Horrible re-packing?
From: Junio C Hamano @ 2006-06-06 0:14 UTC
To: Linus Torvalds; +Cc: git

Junio C Hamano <junkio@cox.net> writes:

> - The output from rev-list --objects is fed to the function as its
>   name parameter while path is set to NULL. When we allow a thin pack
>   to be generated, rev-list --objects output also contains
>   "-<commit-object-name>" lines. We read trees for these commits that
>   are not going to be sent but can be used as base objects, and pass
>   the pathname discovered from the tree using the path and name pair
>   (path is set to the linked list of struct name_path that describes
>   the dirname, and name is set to the basename). This was done to
>   reduce the need for allocating and copying the pathname in
>   preparation for calling the name_hash() function.

... but it can be trivially fixed by doing something like this.

---
diff --git a/pack-objects.c b/pack-objects.c
index 3590cd5..e0328f8 100644
--- a/pack-objects.c
+++ b/pack-objects.c
@@ -463,17 +463,10 @@ static void rehash_objects(void)
 	}
 }
 
-struct name_path {
-	struct name_path *up;
-	const char *elem;
-	int len;
-};
-
 #define DIRBITS 12
 
-static unsigned name_hash(struct name_path *path, const char *name)
+static unsigned name_hash(const char *name)
 {
-	struct name_path *p = path;
 	const char *n = name + strlen(name);
 	unsigned hash = 0, name_hash = 0, name_done = 0;
 
@@ -492,14 +485,6 @@ static unsigned name_hash(struct name_pa
 		name_hash = hash;
 		hash = 0;
 	}
-	for (p = path; p; p = p->up) {
-		hash = hash * 11 + '/';
-		n = p->elem + p->len;
-		while (p->elem <= --n) {
-			unsigned char c = *n;
-			hash = hash * 11 + c;
-		}
-	}
 	/*
 	 * Make sure "Makefile" and "t/Makefile" are hashed separately
 	 * but close enough.
@@ -686,9 +671,9 @@ static int name_cmp_len(const char *name
 }
 
 static void add_pbase_object(struct tree_desc *tree,
-			     struct name_path *up,
 			     const char *name,
-			     int cmplen)
+			     int cmplen,
+			     const char *fullname)
 {
 	struct name_entry entry;
 
@@ -702,13 +687,12 @@ static void add_pbase_object(struct tree
 		    sha1_object_info(entry.sha1, type, &size))
 			continue;
 		if (name[cmplen] != '/') {
-			unsigned hash = name_hash(up, name);
+			unsigned hash = name_hash(fullname);
 			add_object_entry(entry.sha1, hash, 1);
 			return;
 		}
 		if (!strcmp(type, tree_type)) {
 			struct tree_desc sub;
-			struct name_path me;
 			struct pbase_tree_cache *tree;
 			const char *down = name+cmplen+1;
 			int downlen = name_cmp_len(down);
@@ -719,10 +703,7 @@ static void add_pbase_object(struct tree
 
 			sub.buf = tree->tree_data;
 			sub.size = tree->tree_size;
-			me.up = up;
-			me.elem = entry.path;
-			me.len = entry.pathlen;
-			add_pbase_object(&sub, &me, down, downlen);
+			add_pbase_object(&sub, down, downlen, fullname);
 			pbase_tree_put(tree);
 		}
 	}
@@ -778,14 +759,14 @@ static void add_preferred_base_object(ch
 
 	for (it = pbase_tree; it; it = it->next) {
 		if (cmplen == 0) {
-			hash = name_hash(NULL, "");
+			hash = name_hash("");
 			add_object_entry(it->pcache.sha1, hash, 1);
 		}
 		else {
 			struct tree_desc tree;
 			tree.buf = it->pcache.tree_data;
 			tree.size = it->pcache.tree_size;
-			add_pbase_object(&tree, NULL, name, cmplen);
+			add_pbase_object(&tree, name, cmplen, name);
 		}
 	}
 }
@@ -1328,7 +1309,7 @@ int main(int argc, char **argv)
 		}
 		if (get_sha1_hex(line, sha1))
 			die("expected sha1, got garbage:\n %s", line);
-		hash = name_hash(NULL, line+41);
+		hash = name_hash(line+41);
 		add_preferred_base_object(line+41, hash);
 		add_object_entry(sha1, hash, 0);
 	}

* Re: Horrible re-packing?
From: Olivier Galibert @ 2006-06-05 21:14 UTC
To: Linus Torvalds; +Cc: Junio C Hamano, Nicolas Pitre, Git Mailing List

On Mon, Jun 05, 2006 at 12:03:31PM -0700, Linus Torvalds wrote:
> Comments?

Why don't you just sort the full path+filename with a strcmp variant that
starts from the end of the string for comparison? It may at least be
simpler to understand.

  OG.

* Re: Horrible re-packing?
From: Nicolas Pitre @ 2006-06-05 21:22 UTC
To: Olivier Galibert; +Cc: Linus Torvalds, Junio C Hamano, Git Mailing List

On Mon, 5 Jun 2006, Olivier Galibert wrote:

> On Mon, Jun 05, 2006 at 12:03:31PM -0700, Linus Torvalds wrote:
> > Comments?
>
> Why don't you just sort the full path+filename with a strcmp variant
> that starts from the end of the string for comparison? It may at least
> be simpler to understand.

Much more expensive for both memory usage and CPU cycles.


Nicolas

* Re: Horrible re-packing?
From: Chris Wedgwood @ 2006-06-06 0:18 UTC
To: Nicolas Pitre
Cc: Olivier Galibert, Linus Torvalds, Junio C Hamano, Git Mailing List

On Mon, Jun 05, 2006 at 05:22:02PM -0400, Nicolas Pitre wrote:

> Much more expensive for both memory usage and CPU cycles.

On a modern CPU I'm not sure you would notice unless the dataset was
insanely large.

* Re: Horrible re-packing?
From: Linus Torvalds @ 2006-06-06 0:35 UTC
To: Chris Wedgwood
Cc: Nicolas Pitre, Olivier Galibert, Junio C Hamano, Git Mailing List

On Mon, 5 Jun 2006, Chris Wedgwood wrote:
>
> On Mon, Jun 05, 2006 at 05:22:02PM -0400, Nicolas Pitre wrote:
>
> > Much more expensive for both memory usage and CPU cycles.
>
> On a modern CPU I'm not sure you would notice unless the dataset was
> insanely large.

The dataset really _is_ pretty large. For the kernel, we're talking about
a quarter million strings, and it's not shrinking. Other repositories
(imported from long CVS histories) are in the several-million-object
range.

The real problem, btw, is not the CPU cost of sorting them (hey, O(nlogn)
works ok), but the memory use. We have to keep those quarter million
names in memory too.

At a "pointer + average of ~30 bytes of full pathname allocation + malloc
overhead", the strings would basically take about 40 bytes of memory
each, and cause constant cache-misses. In contrast, the "hash" value is a
32-bit unsigned, with no pointer overhead.

It's not the biggest part to keep track of (we obviously need to keep
track of the 20-byte SHA1 and the pointers between objects), but if we
had the full string info, it would be quite noticeable overhead, on the
order of several tens of megabytes.

Now, "tens of megabytes" is not a make-or-break issue any more (you
definitely want lots of memory if you want to develop with large trees),
but in a very real sense, the memory footprint of an app is often very
closely correlated with its performance these days. Trying to keep things
within the L2 cache can help a lot, and even if we expect 2MB and 4MB
L2's to get more and more common, it means that we should _strive_ to
keep datasets down.

As it is, we have no _chance_ of staying in the L2 cache on the kernel,
but for smaller projects we hopefully can still do so ;)

		Linus

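A rough back-of-envelope check of those numbers (not part of the original
mail; it simply takes the ~40-bytes-per-string and quarter-million-object
figures above at face value):

	250,000 names x ~40 bytes  ~= 10 MB   (full pathname strings)
	250,000 names x   4 bytes  =   1 MB   (32-bit "sorting numbers")

So keeping full names would cost roughly an extra 9 MB for the kernel,
and correspondingly more for the multi-million-object CVS imports, which
is consistent with the "several tens of megabytes" estimate.
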
* Re: Horrible re-packing?
From: Linus Torvalds @ 2006-06-05 21:27 UTC
To: Olivier Galibert; +Cc: Junio C Hamano, Nicolas Pitre, Git Mailing List

On Mon, 5 Jun 2006, Olivier Galibert wrote:
>
> Why don't you just sort the full path+filename with a strcmp variant
> that starts from the end of the string for comparison? It may at least
> be simpler to understand.

That's actually what I was going to do, but we don't save the whole name,
just the sorting number. (This is actually an area where saving space is
important - we can easily be working with hundreds of thousands or
millions of objects, and we don't want to keep the name of each of them
around).

So the suggested hash sort is designed exactly to end up approximating
that ascii sort-from-end-of-string.

		Linus

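For reference, a minimal sketch of the comparison Olivier describes and
that the hash approximates - a strcmp variant that walks both strings
from the end. This is illustrative only, not code from git, and the
function name is made up:

	#include <string.h>

	/* compare two paths starting from their last characters, so
	 * "builtin-rev-list.c" and "rev-list.c" sort next to each other */
	static int reverse_name_cmp(const char *a, const char *b)
	{
		size_t la = strlen(a), lb = strlen(b);

		while (la && lb) {
			unsigned char ca = a[--la], cb = b[--lb];
			if (ca != cb)
				return ca < cb ? -1 : 1;
		}
		/* the shorter name (a pure suffix of the other) sorts first */
		if (la != lb)
			return la < lb ? -1 : 1;
		return 0;
	}

Sorting the full path list with such a comparator gives essentially the
ordering the 32-bit sorting number approximates, at the cost of keeping
every name in memory - which is the objection raised above.
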
* Re: Horrible re-packing?
From: Nicolas Pitre @ 2006-06-05 21:20 UTC
To: Linus Torvalds; +Cc: Junio C Hamano, Git Mailing List

On Mon, 5 Jun 2006, Linus Torvalds wrote:

> On this same thread..
>
> This trivial patch not only simplifies the name hashing, it actually
> improves packing for both git and the kernel.
>
> The git archive pack shrinks from 6824090->6622627 bytes (a 3%
> improvement), and the kernel pack shrinks from 108756213 to 108219021
> (a mere 0.5% improvement, but still, it's an improvement from making
> the hashing much simpler!)

OK here's the scoop.

I still have a sample repo (I forget who it was from) that used to
exhibit a big packing size regression which was fixed a while ago. I tend
to test new packing strategies on that repo as well, since it has rather
interesting characteristics that make it pretty sensitive to changes in
the name hashing and size filtering heuristics.

Before this hashing patch (including the rev-list fix):

	$ git repack -a -f
	Generating pack...
	Done counting 46391 objects.
	Deltifying 46391 objects.
	 100% (46391/46391) done
	Writing 46391 objects.
	 100% (46391/46391) done
	Total 46391, written 46391 (delta 7457), reused 38934 (delta 0)
	Pack pack-7f766f5af5547554bacb28c0294bd562589dc5e7 created.

	$ ll .git/objects/pack/pack-7f766f5af5547554bacb28c0294bd562589dc5e7.pack
	-rw-rw-r-- 1 nico nico 39486095 Jun  5 16:28 .git/objects/pack/pack-7f766f5af5547554bacb28c0294bd562589dc5e7.pack

Now with this patch applied:

	$ git repack -a -f
	Generating pack...
	Done counting 46391 objects.
	Deltifying 46391 objects.
	 100% (46391/46391) done
	Writing 46391 objects.
	 100% (46391/46391) done
	Total 46391, written 46391 (delta 9920), reused 36447 (delta 0)
	Pack pack-7f766f5af5547554bacb28c0294bd562589dc5e7 created.

	$ ll .git/objects/pack/pack-7f766f5af5547554bacb28c0294bd562589dc5e7.pack
	-rw-rw-r-- 1 nico nico 16150417 Jun  5 16:31 .git/objects/pack/pack-7f766f5af5547554bacb28c0294bd562589dc5e7.pack

In other words, the pack shrunk to less than half the size of the
previous one!

And yes, fsck-objects still passes (I was doubtful at first).


Nicolas

* Re: Horrible re-packing?
From: Linus Torvalds @ 2006-06-05 21:40 UTC
To: Nicolas Pitre; +Cc: Junio C Hamano, Git Mailing List

On Mon, 5 Jun 2006, Nicolas Pitre wrote:
>
> In other words, the pack shrunk to less than half the size of the
> previous one!

Ok, that's a bit more extreme than expected.

It's obviously great news, and says that the approach of sorting by
"reversed name" is a great heuristic, but at the same time it makes me
worry a bit that this thing that is supposed to be a heuristic ends up
being _so_ important from a pack size standpoint. I was happier when it
was more about saving a couple of percent.

Now, your repo may be a strange case, and it just happens to fit the
suggested hash, but on the other hand it's nice to see three totally
different repositories that all improve, albeit with wildly different
numbers.

I'm wondering if we could have some "incremental optimizer" thing that
would take a potentially badly packed archive, and just start looking for
better delta chain possibilities? That way we would still try to get a
good initial pack with some heuristic, but we could have people run the
incremental improver every once in a while, looking for good deltas that
it missed due to the project not fitting the heuristics.

The fact that we normally do incremental repacking (and "-f" is unusual)
is obviously one thing that makes us less susceptible to bad patterns
(and is also what allows us to run the incremental optimizer - any good
delta choice will automatically percolate into subsequent versions,
including packs that have been cloned).

So the packing strategy itself seems to be very stable (partly _due_ to
the "optimization" of re-using earlier pack choices), but we currently
lack the thing that fixes up any initial bad assumptions in case they
happen.

		Linus

* Re: Horrible re-packing?
From: Nicolas Pitre @ 2006-06-05 23:13 UTC
To: Linus Torvalds; +Cc: Junio C Hamano, Git Mailing List

On Mon, 5 Jun 2006, Linus Torvalds wrote:

> On Mon, 5 Jun 2006, Nicolas Pitre wrote:
> >
> > In other words, the pack shrunk to less than half the size of the
> > previous one!
>
> Ok, that's a bit more extreme than expected.
>
> It's obviously great news, and says that the approach of sorting by
> "reversed name" is a great heuristic, but at the same time it makes me
> worry a bit that this thing that is supposed to be a heuristic ends up
> being _so_ important from a pack size standpoint. I was happier when
> it was more about saving a couple of percent.

Well... this is the repository that exhibited a repack regression a while
ago, going from something like ~40MB to ~160MB when Junio initially added
the directory to the name hash. No other popular repositories had that
problem. Which is why I said this repo is particularly sensitive to
heuristic changes.

So I wouldn't worry too much about your proposed patch doing so well in
this particular case. It certainly didn't cause any (significant)
regression overall, which is what matters.

We already get surprising results when combining two heuristics together,
even though used separately they do worse. So trying to have
fallback/incremental heuristics is going to make things simply too
complicated for when it breaks. Better to experiment with new ideas and
adopt them when they do a better job universally.

... which your proposed hashing change does.


Nicolas

Thread overview: 15+ messages

2006-06-05 17:08 Horrible re-packing? Linus Torvalds
2006-06-05 18:44 ` Linus Torvalds
2006-06-05 19:03   ` Linus Torvalds
2006-06-05 19:37     ` Junio C Hamano
2006-06-05 19:57       ` Linus Torvalds
2006-06-05 23:54         ` Junio C Hamano
2006-06-06  0:14           ` Junio C Hamano
2006-06-05 21:14     ` Olivier Galibert
2006-06-05 21:22       ` Nicolas Pitre
2006-06-06  0:18         ` Chris Wedgwood
2006-06-06  0:35           ` Linus Torvalds
2006-06-05 21:27       ` Linus Torvalds
2006-06-05 21:20     ` Nicolas Pitre
2006-06-05 21:40       ` Linus Torvalds
2006-06-05 23:13         ` Nicolas Pitre