* Horrible re-packing?
From: Linus Torvalds @ 2006-06-05 17:08 UTC
To: Junio C Hamano, Nicolas Pitre; +Cc: Git Mailing List

Junio, Nico,

I just tried doing a "git repack -a -d -f" to verify the pack generation,
because I expected a full re-pack to do _better_ than doing occasional
incrementals, but imagine my shock when IT SUCKS.

I didn't look at where the suckage started, but look at this:

	[torvalds@g5 git]$ git repack -a -d
	Generating pack...
	Done counting 21322 objects.
	Deltifying 21322 objects.
	 100% (21322/21322) done
	Writing 21322 objects.
	 100% (21322/21322) done
	Total 21322, written 21322 (delta 14489), reused 21319 (delta 14486)
	Pack pack-fe4ff117c9959ead3443b826a777423b3062b666 created.

	[torvalds@g5 git]$ ll .git/objects/pack/
	total 7008
	-rw-r--r-- 1 torvalds torvalds  512792 Jun  5 09:41 pack-fe4ff117c9959ead3443b826a777423b3062b666.idx
	-rw-r--r-- 1 torvalds torvalds 6643695 Jun  5 09:41 pack-fe4ff117c9959ead3443b826a777423b3062b666.pack

Ie, we have a nice 6.33MB pack-file. Now:

	[torvalds@g5 git]$ git repack -a -d -f
	Generating pack...
	Done counting 21322 objects.
	Deltifying 21322 objects.
	 100% (21322/21322) done
	Writing 21322 objects.
	 100% (21322/21322) done
	Total 21322, written 21322 (delta 10187), reused 6777 (delta 0)
	Pack pack-fe4ff117c9959ead3443b826a777423b3062b666 created.

	[torvalds@g5 git]$ ll .git/objects/pack/
	total 15352
	-rw-r--r-- 1 torvalds torvalds   512792 Jun  5 09:41 pack-fe4ff117c9959ead3443b826a777423b3062b666.idx
	-rw-r--r-- 1 torvalds torvalds 15176139 Jun  5 09:41 pack-fe4ff117c9959ead3443b826a777423b3062b666.pack

Whaah! That nice 6.33MB pack-file exploded to 14.5MB!

Doing repeated "git repack -a -d" to try to do incrementals, it stopped
improving after the sixth one, at which point it was down to 11.7MB,
still almost twice as big as before.

Re-doing it with

	git repack -a -d -f --depth=100 --window=100

got me back to 6.94MB, but that's still 10% larger than the pack-file I
had before.

Interestingly, it's the "window" that matters more. The depth part didn't
make that huge of a difference, so it looks like it's the sorting
heuristic that may be broken again. And it's possibly broken by the fact
that we've been renaming things lately (ie the "rev-list.c" ->
"builtin-rev-list.c" thing ends up not finding things).

Nico? Any ideas?

		Linus

* Re: Horrible re-packing?
From: Linus Torvalds @ 2006-06-05 18:44 UTC
To: Junio C Hamano, Nicolas Pitre; +Cc: Git Mailing List

On Mon, 5 Jun 2006, Linus Torvalds wrote:
>
> Whaah! That nice 6.33MB pack-file exploded to 14.5MB!
>
> And it's possibly broken by the fact that we've been renaming things
> lately (ie the "rev-list.c" -> "builtin-rev-list.c" thing ends up not
> finding things)

No, it's even simpler. The breakage is entirely mine, and due to the
tree-walking conversion of the "process_tree()" function.

In that function, we used to have a local "const char *name" that
_shadowed_ the incoming _argument_ with the same type, and the
tree-walking conversion did not notice that the inner "name" should have
been converted to "entry.path" - so it used the outer-level "name".

Gaah. We should probably use -Wshadow or something, which would hopefully
have warned about the re-use of the same variable name in two different
scopes.

Regardless, this fixes it.

		Linus

---
diff --git a/builtin-rev-list.c b/builtin-rev-list.c
index 17c04b9..e885624 100644
--- a/builtin-rev-list.c
+++ b/builtin-rev-list.c
@@ -135,9 +135,9 @@ static struct object_list **process_tree
 
 	while (tree_entry(&desc, &entry)) {
 		if (S_ISDIR(entry.mode))
-			p = process_tree(lookup_tree(entry.sha1), p, &me, name);
+			p = process_tree(lookup_tree(entry.sha1), p, &me, entry.path);
 		else
-			p = process_blob(lookup_blob(entry.sha1), p, &me, name);
+			p = process_blob(lookup_blob(entry.sha1), p, &me, entry.path);
 	}
 	free(tree->buffer);
 	tree->buffer = NULL;

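A short aside, not part of the original message: a minimal, made-up sketch
of the kind of shadowing bug described above (the function and variable
names here are invented, not git code). Compiled with "gcc -Wshadow", the
inner declaration draws a warning along the lines of "declaration of
'name' shadows a parameter":

	#include <stdio.h>

	static void process(const char *name)
	{
		{
			const char *name = "inner";	/* shadows the parameter */
			printf("inner sees: %s\n", name);	/* prints "inner" */
		}
		printf("outer sees: %s\n", name);	/* the parameter again */
	}

	int main(void)
	{
		process("outer");
		return 0;
	}

Without -Wshadow this compiles silently, which is exactly how the
process_tree() conversion slipped through.
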
* Re: Horrible re-packing?
From: Linus Torvalds @ 2006-06-05 19:03 UTC
To: Junio C Hamano, Nicolas Pitre; +Cc: Git Mailing List

On this same thread..

This trivial patch not only simplifies the name hashing, it actually
improves packing for both git and the kernel.

The git archive pack shrinks from 6824090->6622627 bytes (a 3%
improvement), and the kernel pack shrinks from 108756213 to 108219021 (a
mere 0.5% improvement, but still, it's an improvement from making the
hashing much simpler!)

I think the hash function with its comment is self-explanatory:

	/*
	 * This effectively just creates a sortable number from the
	 * last sixteen non-whitespace characters. Last characters
	 * count "most", so things that end in ".c" sort together.
	 */
	while ((c = *name++) != 0) {
		if (isspace(c))
			continue;
		hash = (hash >> 2) + (c << 24);
	}
	return hash;

ie we just create a 32-bit hash, where we "age" previous characters by two
bits, so the last characters in a filename count most. So when we then
compare the hashes in the sort routine, filenames that end the same way
sort the same way.

It takes the subdirectory into account (unless the filename is > 16
characters), but files with the same name within the same subdirectory
will obviously sort closer than files in different subdirectories.

And, incidentally (which is why I tried the hash change in the first
place, of course) builtin-rev-list.c will sort fairly close to rev-list.c.

And no, it's not a "good hash" in the sense of being secure or unique, but
that's not what we're looking for. The whole "hash" thing is misnamed
here. It's not so much a hash as a "sorting number".

Comments?

		Linus

----
 pack-objects.c |   41 +++++++++++------------------------------
 1 files changed, 11 insertions(+), 30 deletions(-)

diff --git a/pack-objects.c b/pack-objects.c
index 3590cd5..ae49fba 100644
--- a/pack-objects.c
+++ b/pack-objects.c
@@ -473,38 +473,19 @@
 #define DIRBITS 12
 
 static unsigned name_hash(struct name_path *path, const char *name)
 {
-	struct name_path *p = path;
-	const char *n = name + strlen(name);
-	unsigned hash = 0, name_hash = 0, name_done = 0;
-
-	if (n != name && n[-1] == '\n')
-		n--;
-	while (name <= --n) {
-		unsigned char c = *n;
-		if (c == '/' && !name_done) {
-			name_hash = hash;
-			name_done = 1;
-			hash = 0;
-		}
-		hash = hash * 11 + c;
-	}
-	if (!name_done) {
-		name_hash = hash;
-		hash = 0;
-	}
-	for (p = path; p; p = p->up) {
-		hash = hash * 11 + '/';
-		n = p->elem + p->len;
-		while (p->elem <= --n) {
-			unsigned char c = *n;
-			hash = hash * 11 + c;
-		}
-	}
+	unsigned char c;
+	unsigned hash = 0;
+
 	/*
-	 * Make sure "Makefile" and "t/Makefile" are hashed separately
-	 * but close enough.
+	 * This effectively just creates a sortable number from the
+	 * last sixteen non-whitespace characters. Last characters
+	 * count "most", so things that end in ".c" sort together.
 	 */
-	hash = (name_hash<<DIRBITS) | (hash & ((1U<<DIRBITS )-1));
+	while ((c = *name++) != 0) {
+		if (isspace(c))
+			continue;
+		hash = (hash >> 2) + (c << 24);
+	}
 	return hash;
 }

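A short aside, not part of the original message: the loop above can be
dropped into a standalone program to see the "sorting number" behaviour.
The hash body is copied from the patch; the driver and the example paths
are arbitrary and only illustrate that names sharing a tail end up
numerically close (their high bits agree), so they sort next to each
other:

	#include <ctype.h>
	#include <stdio.h>

	/* the hash from the patch above, as a standalone function */
	static unsigned name_hash(const char *name)
	{
		unsigned char c;
		unsigned hash = 0;

		while ((c = *name++) != 0) {
			if (isspace(c))
				continue;
			hash = (hash >> 2) + (c << 24);
		}
		return hash;
	}

	int main(void)
	{
		/* arbitrary example paths */
		printf("%08x rev-list.c\n", name_hash("rev-list.c"));
		printf("%08x builtin-rev-list.c\n", name_hash("builtin-rev-list.c"));
		printf("%08x Makefile\n", name_hash("Makefile"));
		printf("%08x t/Makefile\n", name_hash("t/Makefile"));
		return 0;
	}

Because each later character shifts the earlier ones right by two bits,
the last ten identical characters of "rev-list.c" and
"builtin-rev-list.c" pin down the top ~20 bits of both hashes, so the two
values differ only in the low bits.
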
* Re: Horrible re-packing?
From: Junio C Hamano @ 2006-06-05 19:37 UTC
To: Linus Torvalds; +Cc: Nicolas Pitre, Git Mailing List

Linus Torvalds <torvalds@osdl.org> writes:

> I think the hash function with its comment is self-explanatory:
>
>	/*
>	 * This effectively just creates a sortable number from the
>	 * last sixteen non-whitespace characters. Last characters
>	 * count "most", so things that end in ".c" sort together.
>	 */
>	while ((c = *name++) != 0) {
>		if (isspace(c))
>			continue;
>		hash = (hash >> 2) + (c << 24);
>	}
>	return hash;
>
> ie we just create a 32-bit hash, where we "age" previous characters by
> two bits, so the last characters in a filename count most. So when we
> then compare the hashes in the sort routine, filenames that end the
> same way sort the same way.

IIRC, sometimes this function is called with path and name split, and
sometimes with the full path in name, depending on who calls you (the
latter happens for rev-list --objects generated names, and the former is
for objects we extract ourselves from the --thin base tree, or something
like that). I suspect your patch may break paths whose filename after the
last slash is shorter than 16 bytes.

* Re: Horrible re-packing?
From: Linus Torvalds @ 2006-06-05 19:57 UTC
To: Junio C Hamano; +Cc: Nicolas Pitre, Git Mailing List

On Mon, 5 Jun 2006, Junio C Hamano wrote:
>
> IIRC, sometimes this function is called with path and name split, and
> sometimes with the full path in name

Yeah, I was pretty confused by the whole hashing thing. Are you sure that
complexity is needed? It seems a bit overkill.

		Linus

* Re: Horrible re-packing?
From: Junio C Hamano @ 2006-06-05 23:54 UTC
To: Linus Torvalds; +Cc: git

Linus Torvalds <torvalds@osdl.org> writes:

> On Mon, 5 Jun 2006, Junio C Hamano wrote:
>>
>> IIRC, sometimes this function is called with path and name split, and
>> sometimes with the full path in name
>
> Yeah, I was pretty confused by the whole hashing thing. Are you sure
> that complexity is needed? It seems a bit overkill.

Two issues in the code confuse any reader of that function.

 - The code wants to hash Makefile from different revisions together,
   and Makefile and t/Makefile close to each other. The current code did
   it by treating '/' specially, using the basename hash as the upper
   bits of the resulting hash and the dirname hash as the lower bits.
   It's my tendency to treat slashes too specially, which is one of your
   favorite things to pick me on.

   This is not needed with your change anymore -- by only using the tail
   of the filename, and making sure the tail part weighs more in the
   resulting group number, the new code gives the desired grouping
   characteristics in a much simpler way.

 - The output from rev-list --objects is fed to the function as its name
   parameter while path is set to NULL. When we allow a thin pack to be
   generated, rev-list --objects output also contains
   "-<commit-object-name>" lines. We read trees for these commits that
   are not going to be sent but can be used as base objects, and pass
   the pathname discovered from the tree using the path and name pair
   (path is set to the linked list of struct name_path that describes
   the dirname, and name is set to the basename). This was done to
   reduce the need for allocating and copying the pathname in
   preparation for calling the name_hash() function.

   If you use only the "name" variable in your group number computation,
   and suppose we are doing send-pack to send updates between rev A..B,
   then contrib/git-svn/Makefile from rev B will use git-svn/Makefile
   (its tail 16 characters) to compute the number, but the blob from rev
   A (which we are not going to send but would want to use as a
   potential delta base) will have contrib/git-svn in "path" (the
   element points at the string "git-svn", and its uplink points at
   another element that points at "contrib" with an uplink that says it
   is at the root level), and Makefile in "name". They will be hashed
   slightly differently.

* Re: Horrible re-packing?
From: Junio C Hamano @ 2006-06-06 0:14 UTC
To: Linus Torvalds; +Cc: git

Junio C Hamano <junkio@cox.net> writes:

> - The output from rev-list --objects is fed to the function as its
>   name parameter while path is set to NULL. When we allow a thin pack
>   to be generated, rev-list --objects output also contains
>   "-<commit-object-name>" lines. We read trees for these commits that
>   are not going to be sent but can be used as base objects, and pass
>   the pathname discovered from the tree using the path and name pair
>   (path is set to the linked list of struct name_path that describes
>   the dirname, and name is set to the basename). This was done to
>   reduce the need for allocating and copying the pathname in
>   preparation for calling the name_hash() function.

... but it can be trivially fixed by doing something like this.

---
diff --git a/pack-objects.c b/pack-objects.c
index 3590cd5..e0328f8 100644
--- a/pack-objects.c
+++ b/pack-objects.c
@@ -463,17 +463,10 @@ static void rehash_objects(void)
 	}
 }
 
-struct name_path {
-	struct name_path *up;
-	const char *elem;
-	int len;
-};
-
 #define DIRBITS 12
 
-static unsigned name_hash(struct name_path *path, const char *name)
+static unsigned name_hash(const char *name)
 {
-	struct name_path *p = path;
 	const char *n = name + strlen(name);
 	unsigned hash = 0, name_hash = 0, name_done = 0;
 
@@ -492,14 +485,6 @@ static unsigned name_hash(struct name_pa
 		name_hash = hash;
 		hash = 0;
 	}
-	for (p = path; p; p = p->up) {
-		hash = hash * 11 + '/';
-		n = p->elem + p->len;
-		while (p->elem <= --n) {
-			unsigned char c = *n;
-			hash = hash * 11 + c;
-		}
-	}
 	/*
 	 * Make sure "Makefile" and "t/Makefile" are hashed separately
 	 * but close enough.
@@ -686,9 +671,9 @@ static int name_cmp_len(const char *name
 }
 
 static void add_pbase_object(struct tree_desc *tree,
-			     struct name_path *up,
 			     const char *name,
-			     int cmplen)
+			     int cmplen,
+			     const char *fullname)
 {
 	struct name_entry entry;
 
@@ -702,13 +687,12 @@ static void add_pbase_object(struct tree
 		    sha1_object_info(entry.sha1, type, &size))
 			continue;
 		if (name[cmplen] != '/') {
-			unsigned hash = name_hash(up, name);
+			unsigned hash = name_hash(fullname);
 			add_object_entry(entry.sha1, hash, 1);
 			return;
 		}
 		if (!strcmp(type, tree_type)) {
 			struct tree_desc sub;
-			struct name_path me;
 			struct pbase_tree_cache *tree;
 			const char *down = name+cmplen+1;
 			int downlen = name_cmp_len(down);
@@ -719,10 +703,7 @@ static void add_pbase_object(struct tree
 
 			sub.buf = tree->tree_data;
 			sub.size = tree->tree_size;
-			me.up = up;
-			me.elem = entry.path;
-			me.len = entry.pathlen;
-			add_pbase_object(&sub, &me, down, downlen);
+			add_pbase_object(&sub, down, downlen, fullname);
 			pbase_tree_put(tree);
 		}
 	}
@@ -778,14 +759,14 @@ static void add_preferred_base_object(ch
 
 	for (it = pbase_tree; it; it = it->next) {
 		if (cmplen == 0) {
-			hash = name_hash(NULL, "");
+			hash = name_hash("");
 			add_object_entry(it->pcache.sha1, hash, 1);
 		}
 		else {
 			struct tree_desc tree;
 			tree.buf = it->pcache.tree_data;
 			tree.size = it->pcache.tree_size;
-			add_pbase_object(&tree, NULL, name, cmplen);
+			add_pbase_object(&tree, name, cmplen, name);
 		}
 	}
 }
@@ -1328,7 +1309,7 @@ int main(int argc, char **argv)
 		}
 		if (get_sha1_hex(line, sha1))
 			die("expected sha1, got garbage:\n %s", line);
-		hash = name_hash(NULL, line+41);
+		hash = name_hash(line+41);
 		add_preferred_base_object(line+41, hash);
 		add_object_entry(sha1, hash, 0);
 	}

* Re: Horrible re-packing?
From: Olivier Galibert @ 2006-06-05 21:14 UTC
To: Linus Torvalds; +Cc: Junio C Hamano, Nicolas Pitre, Git Mailing List

On Mon, Jun 05, 2006 at 12:03:31PM -0700, Linus Torvalds wrote:
> Comments?

Why don't you just sort the full path+filename with a strcmp variant that
starts from the end of the string for comparison? It may at least be
simpler to understand.

  OG.

* Re: Horrible re-packing?
From: Nicolas Pitre @ 2006-06-05 21:22 UTC
To: Olivier Galibert; +Cc: Linus Torvalds, Junio C Hamano, Git Mailing List

On Mon, 5 Jun 2006, Olivier Galibert wrote:

> On Mon, Jun 05, 2006 at 12:03:31PM -0700, Linus Torvalds wrote:
> > Comments?
>
> Why don't you just sort the full path+filename with a strcmp variant
> that starts from the end of the string for comparison? It may at least
> be simpler to understand.

Much more expensive for both memory usage and CPU cycles.


Nicolas

* Re: Horrible re-packing?
From: Chris Wedgwood @ 2006-06-06 0:18 UTC
To: Nicolas Pitre
Cc: Olivier Galibert, Linus Torvalds, Junio C Hamano, Git Mailing List

On Mon, Jun 05, 2006 at 05:22:02PM -0400, Nicolas Pitre wrote:

> Much more expensive for both memory usage and CPU cycles.

On a modern CPU I'm not sure you would notice unless the dataset was
insanely large.

* Re: Horrible re-packing?
From: Linus Torvalds @ 2006-06-06 0:35 UTC
To: Chris Wedgwood
Cc: Nicolas Pitre, Olivier Galibert, Junio C Hamano, Git Mailing List

On Mon, 5 Jun 2006, Chris Wedgwood wrote:
>
> On Mon, Jun 05, 2006 at 05:22:02PM -0400, Nicolas Pitre wrote:
>
> > Much more expensive for both memory usage and CPU cycles.
>
> On a modern CPU I'm not sure you would notice unless the dataset was
> insanely large.

The dataset really _is_ pretty large. For the kernel, we're talking about
a quarter million strings, and it's not shrinking. Other repositories
(imported from long CVS histories) are in the several-million-object
range.

The real problem, btw, is not the CPU cost of sorting them (hey, O(nlogn)
works ok), but the memory use. We have to keep those quarter million
names in memory too.

At a "pointer + average of ~30 bytes of full pathname allocation + malloc
overhead", the strings would basically take about 40 bytes of memory
each, and cause constant cache-misses. In contrast, the "hash" value is a
32-bit unsigned, with no pointer overhead.

It's not the biggest part to keep track of (we obviously need to keep
track of the 20-byte SHA1 and the pointers between objects), but if we
had the full string info, it would be quite noticeable overhead, on the
order of several tens of megabytes.

Now, "tens of megabytes" is not a make-or-break issue any more (you
definitely want lots of memory if you want to develop with large trees),
but in a very real sense, the memory footprint of an app is often very
closely correlated with its performance these days. Trying to keep things
within the L2 cache can help a lot, and even if we expect 2MB and 4MB
L2's to get more and more common, it means that we should _strive_ to
keep datasets down.

As it is, we have no _chance_ of staying in the L2 cache on the kernel,
but for smaller projects we hopefully can still do so ;)

		Linus

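A rough back-of-envelope check of those numbers (not part of the original
mail; it simply takes the ~40-bytes-per-string and quarter-million-object
figures above at face value):

	250,000 names x ~40 bytes  ~= 10 MB   (full pathname strings)
	250,000 names x   4 bytes  =   1 MB   (32-bit "sorting numbers")

So keeping full names would cost roughly an extra 9 MB for the kernel,
and correspondingly more for the multi-million-object CVS imports, which
is consistent with the "several tens of megabytes" estimate.
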
* Re: Horrible re-packing?
From: Linus Torvalds @ 2006-06-05 21:27 UTC
To: Olivier Galibert; +Cc: Junio C Hamano, Nicolas Pitre, Git Mailing List

On Mon, 5 Jun 2006, Olivier Galibert wrote:
>
> Why don't you just sort the full path+filename with a strcmp variant
> that starts from the end of the string for comparison? It may at least
> be simpler to understand.

That's actually what I was going to do, but we don't save the whole name,
just the sorting number. (This is actually an area where saving space is
important - we can easily be working with hundreds of thousands or
millions of objects, and we don't want to keep the name of each of them
around).

So the suggested hash sort is designed exactly to end up approximating
that ascii sort-from-end-of-string.

		Linus

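For reference, a minimal sketch of the comparison Olivier describes and
that the hash approximates - a strcmp variant that walks both strings
from the end. This is illustrative only, not code from git, and the
function name is made up:

	#include <string.h>

	/* compare two paths starting from their last characters, so
	 * "builtin-rev-list.c" and "rev-list.c" sort next to each other */
	static int reverse_name_cmp(const char *a, const char *b)
	{
		size_t la = strlen(a), lb = strlen(b);

		while (la && lb) {
			unsigned char ca = a[--la], cb = b[--lb];
			if (ca != cb)
				return ca < cb ? -1 : 1;
		}
		/* the shorter name (a pure suffix of the other) sorts first */
		if (la != lb)
			return la < lb ? -1 : 1;
		return 0;
	}

Sorting the full path list with such a comparator gives essentially the
ordering the 32-bit sorting number approximates, at the cost of keeping
every name in memory - which is the objection raised above.
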
* Re: Horrible re-packing?
From: Nicolas Pitre @ 2006-06-05 21:20 UTC
To: Linus Torvalds; +Cc: Junio C Hamano, Git Mailing List

On Mon, 5 Jun 2006, Linus Torvalds wrote:

> On this same thread..
>
> This trivial patch not only simplifies the name hashing, it actually
> improves packing for both git and the kernel.
>
> The git archive pack shrinks from 6824090->6622627 bytes (a 3%
> improvement), and the kernel pack shrinks from 108756213 to 108219021
> (a mere 0.5% improvement, but still, it's an improvement from making
> the hashing much simpler!)

OK here's the scoop.

I still have a sample repo (I forget who it was from) that used to
exhibit a big packing size regression which was fixed a while ago. I tend
to test new packing strategies on that repo as well, since it has rather
interesting characteristics that make it pretty sensitive to changes in
the name hashing and size filtering heuristics.

Before this hashing patch (including the rev-list fix):

	$ git repack -a -f
	Generating pack...
	Done counting 46391 objects.
	Deltifying 46391 objects.
	 100% (46391/46391) done
	Writing 46391 objects.
	 100% (46391/46391) done
	Total 46391, written 46391 (delta 7457), reused 38934 (delta 0)
	Pack pack-7f766f5af5547554bacb28c0294bd562589dc5e7 created.

	$ ll .git/objects/pack/pack-7f766f5af5547554bacb28c0294bd562589dc5e7.pack
	-rw-rw-r-- 1 nico nico 39486095 Jun  5 16:28 .git/objects/pack/pack-7f766f5af5547554bacb28c0294bd562589dc5e7.pack

Now with this patch applied:

	$ git repack -a -f
	Generating pack...
	Done counting 46391 objects.
	Deltifying 46391 objects.
	 100% (46391/46391) done
	Writing 46391 objects.
	 100% (46391/46391) done
	Total 46391, written 46391 (delta 9920), reused 36447 (delta 0)
	Pack pack-7f766f5af5547554bacb28c0294bd562589dc5e7 created.

	$ ll .git/objects/pack/pack-7f766f5af5547554bacb28c0294bd562589dc5e7.pack
	-rw-rw-r-- 1 nico nico 16150417 Jun  5 16:31 .git/objects/pack/pack-7f766f5af5547554bacb28c0294bd562589dc5e7.pack

In other words, the pack shrunk to less than half the size of the
previous one!

And yes, fsck-objects still passes (I was doubtful at first).


Nicolas

* Re: Horrible re-packing?
From: Linus Torvalds @ 2006-06-05 21:40 UTC
To: Nicolas Pitre; +Cc: Junio C Hamano, Git Mailing List

On Mon, 5 Jun 2006, Nicolas Pitre wrote:
>
> In other words, the pack shrunk to less than half the size of the
> previous one!

Ok, that's a bit more extreme than expected.

It's obviously great news, and says that the approach of sorting by
"reversed name" is a great heuristic, but at the same time it makes me
worry a bit that this thing that is supposed to be a heuristic ends up
being _so_ important from a pack size standpoint. I was happier when it
was more about saving a couple of percent.

Now, your repo may be a strange case, and it just happens to fit the
suggested hash, but on the other hand it's nice to see three totally
different repositories that all improve, albeit with wildly different
numbers.

I'm wondering if we could have some "incremental optimizer" thing that
would take a potentially badly packed archive, and just start looking for
better delta chain possibilities? That way we would still try to get a
good initial pack with some heuristic, but we could have people run the
incremental improver every once in a while, looking for good deltas that
it missed due to the project not fitting the heuristics.

The fact that we normally do incremental repacking (and "-f" is unusual)
is obviously one thing that makes us less susceptible to bad patterns
(and is also what allows us to run the incremental optimizer - any good
delta choice will automatically percolate into subsequent versions,
including packs that have been cloned).

So the packing strategy itself seems to be very stable (partly _due_ to
the "optimization" of re-using earlier pack choices), but we currently
lack the thing that fixes up any initial bad assumptions in case they
happen.

		Linus

* Re: Horrible re-packing?
From: Nicolas Pitre @ 2006-06-05 23:13 UTC
To: Linus Torvalds; +Cc: Junio C Hamano, Git Mailing List

On Mon, 5 Jun 2006, Linus Torvalds wrote:

> On Mon, 5 Jun 2006, Nicolas Pitre wrote:
> >
> > In other words, the pack shrunk to less than half the size of the
> > previous one!
>
> Ok, that's a bit more extreme than expected.
>
> It's obviously great news, and says that the approach of sorting by
> "reversed name" is a great heuristic, but at the same time it makes me
> worry a bit that this thing that is supposed to be a heuristic ends up
> being _so_ important from a pack size standpoint. I was happier when
> it was more about saving a couple of percent.

Well... this is the repository that exhibited a repack regression a while
ago, going from something like ~40MB to ~160MB when Junio initially added
the directory to the name hash. No other popular repositories had that
problem. Which is why I said this repo is particularly sensitive to
heuristic changes.

So I wouldn't worry too much about your proposed patch doing so well in
this particular case. It certainly didn't cause any (significant)
regression overall, which is what matters.

We already get surprising results when combining two heuristics together,
even though used separately they do worse. So trying to have
fallback/incremental heuristics is going to make things simply too
complicated for when it breaks. Better to experiment with new ideas and
adopt them when they do a better job universally.

... which your proposed hashing change does.


Nicolas

Thread overview: 15+ messages

2006-06-05 17:08 Horrible re-packing? Linus Torvalds
2006-06-05 18:44 ` Linus Torvalds
2006-06-05 19:03   ` Linus Torvalds
2006-06-05 19:37     ` Junio C Hamano
2006-06-05 19:57       ` Linus Torvalds
2006-06-05 23:54         ` Junio C Hamano
2006-06-06  0:14           ` Junio C Hamano
2006-06-05 21:14     ` Olivier Galibert
2006-06-05 21:22       ` Nicolas Pitre
2006-06-06  0:18         ` Chris Wedgwood
2006-06-06  0:35           ` Linus Torvalds
2006-06-05 21:27       ` Linus Torvalds
2006-06-05 21:20     ` Nicolas Pitre
2006-06-05 21:40       ` Linus Torvalds
2006-06-05 23:13         ` Nicolas Pitre