* Horrible re-packing?
@ 2006-06-05 17:08 Linus Torvalds
2006-06-05 18:44 ` Linus Torvalds
0 siblings, 1 reply; 15+ messages in thread
From: Linus Torvalds @ 2006-06-05 17:08 UTC (permalink / raw)
To: Junio C Hamano, Nicolas Pitre; +Cc: Git Mailing List
Junio, Nico,
I just tried doing a "git repack -a -d -f" to because I expected a full
re-pack to do _better_ than doing occasional incrementals, and verify the
pack generation, but imagine my shock when IT SUCKS.
I didn't look at where the suckage started, but look at this:
[torvalds@g5 git]$ git repack -a -d
Generating pack...
Done counting 21322 objects.
Deltifying 21322 objects.
100% (21322/21322) done
Writing 21322 objects.
100% (21322/21322) done
Total 21322, written 21322 (delta 14489), reused 21319 (delta 14486)
Pack pack-fe4ff117c9959ead3443b826a777423b3062b666 created.
[torvalds@g5 git]$ ll .git/objects/pack/
total 7008
-rw-r--r-- 1 torvalds torvalds 512792 Jun 5 09:41 pack-fe4ff117c9959ead3443b826a777423b3062b666.idx
-rw-r--r-- 1 torvalds torvalds 6643695 Jun 5 09:41 pack-fe4ff117c9959ead3443b826a777423b3062b666.pack
Ie, we have anice 6.33MB pack-file.
Now:
[torvalds@g5 git]$ git repack -a -d -f
Generating pack...
Done counting 21322 objects.
Deltifying 21322 objects.
100% (21322/21322) done
Writing 21322 objects.
100% (21322/21322) done
Total 21322, written 21322 (delta 10187), reused 6777 (delta 0)
Pack pack-fe4ff117c9959ead3443b826a777423b3062b666 created.
[torvalds@g5 git]$ ll .git/objects/pack/
total 15352
-rw-r--r-- 1 torvalds torvalds 512792 Jun 5 09:41 pack-fe4ff117c9959ead3443b826a777423b3062b666.idx
-rw-r--r-- 1 torvalds torvalds 15176139 Jun 5 09:41 pack-fe4ff117c9959ead3443b826a777423b3062b666.pack
Whaah! That nice 6.33MB pack-file exploded to 14.5MB!
Doing repeated "git repack -a -d" to try to do incrementals, it stopped
improving after the sixth one, at which point it was down to 11.7MB, still
almost twice as big as before.
Re-doing it with
git repack -a -d -f --depth=100 --window=100
got me back to 6.94MB, but that's still 10% larger than the pack-file I
had before.
Interestingly, it's the "window" that matters more. The depth part didn't
make that huge of a difference, so it looks like it's the sorting
heuristic that may be broken again.
And it's possibly broken by the fact that we've been renaming things
lately (ie the "rev-list.c" -> "builtin-rev-list.c" thing ends up not
finding things)
Nico? Any ideas?
Linus
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: Horrible re-packing?
2006-06-05 17:08 Horrible re-packing? Linus Torvalds
@ 2006-06-05 18:44 ` Linus Torvalds
2006-06-05 19:03 ` Linus Torvalds
0 siblings, 1 reply; 15+ messages in thread
From: Linus Torvalds @ 2006-06-05 18:44 UTC (permalink / raw)
To: Junio C Hamano, Nicolas Pitre; +Cc: Git Mailing List
On Mon, 5 Jun 2006, Linus Torvalds wrote:
>
> Whaah! That nice 6.33MB pack-file exploded to 14.5MB!
>
> And it's possibly broken by the fact that we've been renaming things
> lately (ie the "rev-list.c" -> "builtin-rev-list.c" thing ends up not
> finding things)
No, it's even simpler.
The breakage is entirely mine, and due to the tree-walking conversion of
the "process_tree()" function.
In that function, we used to have a local "const char *name" that
_shadowed_ the incoming _argument_ with the same type, and the
tree-walking conversion did not notice that the inner "name" should have
been converted to "entry.path" - so it used the outer-level "name".
Gaah. We should probably use -Wshadow or something, which would hopefully
have warned about the re-use of the same variable name in two different
scopes.
Regardless, this fixes it.
Linus
---
diff --git a/builtin-rev-list.c b/builtin-rev-list.c
index 17c04b9..e885624 100644
--- a/builtin-rev-list.c
+++ b/builtin-rev-list.c
@@ -135,9 +135,9 @@ static struct object_list **process_tree
while (tree_entry(&desc, &entry)) {
if (S_ISDIR(entry.mode))
- p = process_tree(lookup_tree(entry.sha1), p, &me, name);
+ p = process_tree(lookup_tree(entry.sha1), p, &me, entry.path);
else
- p = process_blob(lookup_blob(entry.sha1), p, &me, name);
+ p = process_blob(lookup_blob(entry.sha1), p, &me, entry.path);
}
free(tree->buffer);
tree->buffer = NULL;
^ permalink raw reply related [flat|nested] 15+ messages in thread
* Re: Horrible re-packing?
2006-06-05 18:44 ` Linus Torvalds
@ 2006-06-05 19:03 ` Linus Torvalds
2006-06-05 19:37 ` Junio C Hamano
` (2 more replies)
0 siblings, 3 replies; 15+ messages in thread
From: Linus Torvalds @ 2006-06-05 19:03 UTC (permalink / raw)
To: Junio C Hamano, Nicolas Pitre; +Cc: Git Mailing List
On this same thread..
This trivial patch not only simplifies the name hashing, it actually
improves packing for both git and the kernel.
The git archive pack shrinks from 6824090->6622627 bytes (a 3%
improvement), and the kernel pack shrinks from 108756213 to 108219021 (a
mere 0.5% improvement, but still, it's an improvement from making the
hashing much simpler!)
I think the hash function with its comment is self-explanatory:
/*
* This effectively just creates a sortable number from the
* last sixteen non-whitespace characters. Last characters
* count "most", so things that end in ".c" sort together.
*/
while ((c = *name++) != 0) {
if (isspace(c))
continue;
hash = (hash >> 2) + (c << 24);
}
return hash;
ie we just create a 32-bit hash, where we "age" previous characters by two
bits, so the last characters in a filename count most. So when we then
compare the hashes in the sort routine, filenames that end the same way
sort the same way.
It takes the subdirectory into account (unless the filename is > 16
characters), but files with the same name within the same subdirectory
will obviously sort closer than files in different subdirectories.
And, incidentally (which is why I tried the hash change in the first
place, of course) builtin-rev-list.c will sort fairly close to rev-list.c.
And no, it's not a "good hash" in the sense of being secure or unique, but
that's not what we're looking for. The whole "hash" thing is misnamed
here. It's not so much a hash as a "sorting number".
Comments?
Linus
----
pack-objects.c | 41 +++++++++++------------------------------
1 files changed, 11 insertions(+), 30 deletions(-)
diff --git a/pack-objects.c b/pack-objects.c
index 3590cd5..ae49fba 100644
--- a/pack-objects.c
+++ b/pack-objects.c
@@ -473,38 +473,19 @@ #define DIRBITS 12
static unsigned name_hash(struct name_path *path, const char *name)
{
- struct name_path *p = path;
- const char *n = name + strlen(name);
- unsigned hash = 0, name_hash = 0, name_done = 0;
-
- if (n != name && n[-1] == '\n')
- n--;
- while (name <= --n) {
- unsigned char c = *n;
- if (c == '/' && !name_done) {
- name_hash = hash;
- name_done = 1;
- hash = 0;
- }
- hash = hash * 11 + c;
- }
- if (!name_done) {
- name_hash = hash;
- hash = 0;
- }
- for (p = path; p; p = p->up) {
- hash = hash * 11 + '/';
- n = p->elem + p->len;
- while (p->elem <= --n) {
- unsigned char c = *n;
- hash = hash * 11 + c;
- }
- }
+ unsigned char c;
+ unsigned hash = 0;
+
/*
- * Make sure "Makefile" and "t/Makefile" are hashed separately
- * but close enough.
+ * This effectively just creates a sortable number from the
+ * last sixteen non-whitespace characters. Last characters
+ * count "most", so things that end in ".c" sort together.
*/
- hash = (name_hash<<DIRBITS) | (hash & ((1U<<DIRBITS )-1));
+ while ((c = *name++) != 0) {
+ if (isspace(c))
+ continue;
+ hash = (hash >> 2) + (c << 24);
+ }
return hash;
}
^ permalink raw reply related [flat|nested] 15+ messages in thread
* Re: Horrible re-packing?
2006-06-05 19:03 ` Linus Torvalds
@ 2006-06-05 19:37 ` Junio C Hamano
2006-06-05 19:57 ` Linus Torvalds
2006-06-05 21:14 ` Olivier Galibert
2006-06-05 21:20 ` Nicolas Pitre
2 siblings, 1 reply; 15+ messages in thread
From: Junio C Hamano @ 2006-06-05 19:37 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Nicolas Pitre, Git Mailing List
Linus Torvalds <torvalds@osdl.org> writes:
> I think the hash function with its comment is self-explanatory:
>
> /*
> * This effectively just creates a sortable number from the
> * last sixteen non-whitespace characters. Last characters
> * count "most", so things that end in ".c" sort together.
> */
> while ((c = *name++) != 0) {
> if (isspace(c))
> continue;
> hash = (hash >> 2) + (c << 24);
> }
> return hash;
>
> ie we just create a 32-bit hash, where we "age" previous characters by two
> bits, so the last characters in a filename count most. So when we then
> compare the hashes in the sort routine, filenames that end the same way
> sort the same way.
IIRC, sometimes this function is called with path and name split
and sometimes with full path in name, depending on who calls you
(the latter happens for rev-list --object generated names, and
the former is for objects we extract ourselves from the --thin
base tree, or something like that). I suspect your patch may
break paths whose filename after the last slash is shorter than
16 bytes.
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: Horrible re-packing?
2006-06-05 19:37 ` Junio C Hamano
@ 2006-06-05 19:57 ` Linus Torvalds
2006-06-05 23:54 ` Junio C Hamano
0 siblings, 1 reply; 15+ messages in thread
From: Linus Torvalds @ 2006-06-05 19:57 UTC (permalink / raw)
To: Junio C Hamano; +Cc: Nicolas Pitre, Git Mailing List
On Mon, 5 Jun 2006, Junio C Hamano wrote:
>
> IIRC, sometimes this function is called with path and name split
> and sometimes with full path in name
Yeah, I was pretty confused by the whole hashing thing. Are you sure that
complexity is needed, it seems a bit overkill.
Linus
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: Horrible re-packing?
2006-06-05 19:03 ` Linus Torvalds
2006-06-05 19:37 ` Junio C Hamano
@ 2006-06-05 21:14 ` Olivier Galibert
2006-06-05 21:22 ` Nicolas Pitre
2006-06-05 21:27 ` Linus Torvalds
2006-06-05 21:20 ` Nicolas Pitre
2 siblings, 2 replies; 15+ messages in thread
From: Olivier Galibert @ 2006-06-05 21:14 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Junio C Hamano, Nicolas Pitre, Git Mailing List
On Mon, Jun 05, 2006 at 12:03:31PM -0700, Linus Torvalds wrote:
> Comments?
Why don't you just sort the full path+filename with a strcmp variant
that starts by the end of the string for comparison? May at least be
simpler to understand.
OG.
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: Horrible re-packing?
2006-06-05 19:03 ` Linus Torvalds
2006-06-05 19:37 ` Junio C Hamano
2006-06-05 21:14 ` Olivier Galibert
@ 2006-06-05 21:20 ` Nicolas Pitre
2006-06-05 21:40 ` Linus Torvalds
2 siblings, 1 reply; 15+ messages in thread
From: Nicolas Pitre @ 2006-06-05 21:20 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Junio C Hamano, Git Mailing List
On Mon, 5 Jun 2006, Linus Torvalds wrote:
>
>
> On this same thread..
>
> This trivial patch not only simplifies the name hashing, it actually
> improves packing for both git and the kernel.
>
> The git archive pack shrinks from 6824090->6622627 bytes (a 3%
> improvement), and the kernel pack shrinks from 108756213 to 108219021 (a
> mere 0.5% improvement, but still, it's an improvement from making the
> hashing much simpler!)
OK here's the scoop. I still have a sample repo (I forget who it was
from) that used to exhibit a big packing size regression which was fixed
a while ago. I tend to test new packing strategies on that repo as well
since it has rather interesting characteristics that makes it pretty
sensitive to changes to name hashing and size filtering heuristics.
Before this hashing patch (including the rev-list fix):
$ git repack -a -f
Generating pack...
Done counting 46391 objects.
Deltifying 46391 objects.
100% (46391/46391) done
Writing 46391 objects.
100% (46391/46391) done
Total 46391, written 46391 (delta 7457), reused 38934 (delta 0)
Pack pack-7f766f5af5547554bacb28c0294bd562589dc5e7 created.
$ ll .git/objects/pack/pack-7f766f5af5547554bacb28c0294bd562589dc5e7.pack
-rw-rw-r-- 1 nico nico 39486095 Jun 5 16:28 .git/objects/pack/pack-7f766f5af5547554bacb28c0294bd562589dc5e7.pack
Now with this patch applied:
$ git repack -a -f
Generating pack...
Done counting 46391 objects.
Deltifying 46391 objects.
100% (46391/46391) done
Writing 46391 objects.
100% (46391/46391) done
Total 46391, written 46391 (delta 9920), reused 36447 (delta 0)
Pack pack-7f766f5af5547554bacb28c0294bd562589dc5e7 created.
$ ll .git/objects/pack/pack-7f766f5af5547554bacb28c0294bd562589dc5e7.pack
-rw-rw-r-- 1 nico nico 16150417 Jun 5 16:31 .git/objects/pack/pack-7f766f5af5547554bacb28c0294bd562589dc5e7.pack
In other words, the pack shrunk to less than half the size of the
previous one !
And yes fsck-objects still pass (I was doubtful at first).
Nicolas
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: Horrible re-packing?
2006-06-05 21:14 ` Olivier Galibert
@ 2006-06-05 21:22 ` Nicolas Pitre
2006-06-06 0:18 ` Chris Wedgwood
2006-06-05 21:27 ` Linus Torvalds
1 sibling, 1 reply; 15+ messages in thread
From: Nicolas Pitre @ 2006-06-05 21:22 UTC (permalink / raw)
To: Olivier Galibert; +Cc: Linus Torvalds, Junio C Hamano, Git Mailing List
On Mon, 5 Jun 2006, Olivier Galibert wrote:
> On Mon, Jun 05, 2006 at 12:03:31PM -0700, Linus Torvalds wrote:
> > Comments?
>
> Why don't you just sort the full path+filename with a strcmp variant
> that starts by the end of the string for comparison? May at least be
> simpler to understand.
Much more expensive for both memory usage and CPU cycles.
Nicolas
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: Horrible re-packing?
2006-06-05 21:14 ` Olivier Galibert
2006-06-05 21:22 ` Nicolas Pitre
@ 2006-06-05 21:27 ` Linus Torvalds
1 sibling, 0 replies; 15+ messages in thread
From: Linus Torvalds @ 2006-06-05 21:27 UTC (permalink / raw)
To: Olivier Galibert; +Cc: Junio C Hamano, Nicolas Pitre, Git Mailing List
On Mon, 5 Jun 2006, Olivier Galibert wrote:
>
> Why don't you just sort the full path+filename with a strcmp variant
> that starts by the end of the string for comparison? May at least be
> simpler to understand.
That's actually what I was going to do, but we don't save the whole name,
just the sorting number.
(This is actually an area where saving space is important - we can easily
be working with hundreds of thousands or millions of objects, and we don't
want to keep the name of each of them around).
So the suggested hash sort is designed exactly to end up approximating
that ascii sort-from-end-of-string.
Linus
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: Horrible re-packing?
2006-06-05 21:20 ` Nicolas Pitre
@ 2006-06-05 21:40 ` Linus Torvalds
2006-06-05 23:13 ` Nicolas Pitre
0 siblings, 1 reply; 15+ messages in thread
From: Linus Torvalds @ 2006-06-05 21:40 UTC (permalink / raw)
To: Nicolas Pitre; +Cc: Junio C Hamano, Git Mailing List
On Mon, 5 Jun 2006, Nicolas Pitre wrote:
>
> In other words, the pack shrunk to less than half the size of the
> previous one !
Ok, that's a bit more extreme than expected.
It's obviously great news, and says that the approach of sorting by
"reversed name" is a great heuristic, but at the same time it makes me
worry a bit that this thing that is supposed to be a heuristic ends up
being _so_ important from a pack size standpoint. I was happier when it
was more about saving a couple of percent.
Now, your repo may be a strange case, and it just happens to fit the
suggested hash, but on the other hand it's nice to see three totally
different repositories that all improve, albeit with wildly different
numbers.
I'm wondering if we could have some "incremental optimizer" thing that
would take a potentially badly packed archive, and just start looking for
better delta chain possibilities? That way we would still try to get a
good initial pack with some heuristic, but we could have people run the
incremental improver every once in a while looking for good deltas that it
missed due to the project not fitting the heuristics..
The fact that we normally do incremental repacking (and "-f" is unusual)
is obviously one thing that makes us less susceptible to bad patterns (and
is also what allows us to run the incremental optimizer - any good delta
choice will automatically percolate into subsequent versions, including
packs that have been cloned).
So the packing strategy itself seems to be very stable (and partly _due_
to the "optimization" to re-use earlier pack choices), but we currently
lack the thing that fixes up any initial bad assumptions in case they
happen.
Linus
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: Horrible re-packing?
2006-06-05 21:40 ` Linus Torvalds
@ 2006-06-05 23:13 ` Nicolas Pitre
0 siblings, 0 replies; 15+ messages in thread
From: Nicolas Pitre @ 2006-06-05 23:13 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Junio C Hamano, Git Mailing List
On Mon, 5 Jun 2006, Linus Torvalds wrote:
>
>
> On Mon, 5 Jun 2006, Nicolas Pitre wrote:
> >
> > In other words, the pack shrunk to less than half the size of the
> > previous one !
>
> Ok, that's a bit more extreme than expected.
>
> It's obviously great news, and says that the approach of sorting by
> "reversed name" is a great heuristic, but at the same time it makes me
> worry a bit that this thing that is supposed to be a heuristic ends up
> being _so_ important from a pack size standpoint. I was happier when it
> was more about saving a couple of percent.
Well... this is the repository that exhibited a repack regression a
while ago, going from something like ~40MB to ~160MB when Junio
initially added the directory in the name hash. No other popular
repositories had that problem.
Which is why I said this repo is particularly sensitive to heuristic
changes. So I wouldn't worry too much about your proposed patch making
it too great in this case. It certainly didn't cause any (significant)
regression overall which is what matters.
We already have surprizing results when combining two heuristics
together although if used separately they do worse. So trying to have
fallback/incremental heuristics is going to make things simply too
complicated for when it breaks. Better experiment with new ideas and
adopt them when they do a better job universally.
... which your proposed hashing change does.
Nicolas
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: Horrible re-packing?
2006-06-05 19:57 ` Linus Torvalds
@ 2006-06-05 23:54 ` Junio C Hamano
2006-06-06 0:14 ` Junio C Hamano
0 siblings, 1 reply; 15+ messages in thread
From: Junio C Hamano @ 2006-06-05 23:54 UTC (permalink / raw)
To: Linus Torvalds; +Cc: git
Linus Torvalds <torvalds@osdl.org> writes:
> On Mon, 5 Jun 2006, Junio C Hamano wrote:
>>
>> IIRC, sometimes this function is called with path and name split
>> and sometimes with full path in name
>
> Yeah, I was pretty confused by the whole hashing thing. Are you sure that
> complexity is needed, it seems a bit overkill.
Two issues in the code confuses any reader of that function.
- The code wants to hash Makefile from different revisions
together, and Makefile and t/Makefile close to each other.
The current code did it by treating '/' specially, used
basename hash as the upper bits of the resulting hash and
dirname hash as the lower bits. It's my tendency to treat
slashes specially too much, which is one of your favorite
things to pick me on.
This is not needed by your change anymore -- by only using
the tail of the filename, and making sure tail part weighs
more in the resulting group number, the new code gives the
desired grouping characteristics in a much simpler way.
- The output from rev-list --objects is fed to the function as
its name parameter while path is set to NULL. When we allow
a thin pack to be generated, rev-list --objects output also
contains "-<commit-object-name>" lines. We read trees for
these commits that are not going to be sent but can be used
as base objects, and pass the pathname discovered from the
tree using path and name pair (path is set to the linked list
of struct name_path that describes the dirname, and name is
set to the basename). This was done to reduce the need for
allocating and copying the pathname in preparation for
calling name_hash() function.
If you use only the "name" variable in your group number
computation, and suppose we are doing send-pack to send
updates between rev A..B, contrib/git-svn/Makefile from rev B
will use git-svn/Makefile (tail 16 characters) to compute the
number, but the blob from rev A (which we are not going to
send but would want to use as a potential delta base) will
have contrib/git-svn part in "path" (the element points at
string "git-svn", and its uplink points at another element
that points at "contrib" with an uplink that says it is at
the root level), and Makefile in "name". They will be hashed
slightly differently.
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: Horrible re-packing?
2006-06-05 23:54 ` Junio C Hamano
@ 2006-06-06 0:14 ` Junio C Hamano
0 siblings, 0 replies; 15+ messages in thread
From: Junio C Hamano @ 2006-06-06 0:14 UTC (permalink / raw)
To: Linus Torvalds; +Cc: git
Junio C Hamano <junkio@cox.net> writes:
> - The output from rev-list --objects is fed to the function as
> its name parameter while path is set to NULL. When we allow
> a thin pack to be generated, rev-list --objects output also
> contains "-<commit-object-name>" lines. We read trees for
> these commits that are not going to be sent but can be used
> as base objects, and pass the pathname discovered from the
> tree using path and name pair (path is set to the linked list
> of struct name_path that describes the dirname, and name is
> set to the basename). This was done to reduce the need for
> allocating and copying the pathname in preparation for
> calling name_hash() function.
but it can trivially fixed by doing something like this.
---
diff --git a/pack-objects.c b/pack-objects.c
index 3590cd5..e0328f8 100644
--- a/pack-objects.c
+++ b/pack-objects.c
@@ -463,17 +463,10 @@ static void rehash_objects(void)
}
}
-struct name_path {
- struct name_path *up;
- const char *elem;
- int len;
-};
-
#define DIRBITS 12
-static unsigned name_hash(struct name_path *path, const char *name)
+static unsigned name_hash(const char *name)
{
- struct name_path *p = path;
const char *n = name + strlen(name);
unsigned hash = 0, name_hash = 0, name_done = 0;
@@ -492,14 +485,6 @@ static unsigned name_hash(struct name_pa
name_hash = hash;
hash = 0;
}
- for (p = path; p; p = p->up) {
- hash = hash * 11 + '/';
- n = p->elem + p->len;
- while (p->elem <= --n) {
- unsigned char c = *n;
- hash = hash * 11 + c;
- }
- }
/*
* Make sure "Makefile" and "t/Makefile" are hashed separately
* but close enough.
@@ -686,9 +671,9 @@ static int name_cmp_len(const char *name
}
static void add_pbase_object(struct tree_desc *tree,
- struct name_path *up,
const char *name,
- int cmplen)
+ int cmplen,
+ const char *fullname)
{
struct name_entry entry;
@@ -702,13 +687,12 @@ static void add_pbase_object(struct tree
sha1_object_info(entry.sha1, type, &size))
continue;
if (name[cmplen] != '/') {
- unsigned hash = name_hash(up, name);
+ unsigned hash = name_hash(fullname);
add_object_entry(entry.sha1, hash, 1);
return;
}
if (!strcmp(type, tree_type)) {
struct tree_desc sub;
- struct name_path me;
struct pbase_tree_cache *tree;
const char *down = name+cmplen+1;
int downlen = name_cmp_len(down);
@@ -719,10 +703,7 @@ static void add_pbase_object(struct tree
sub.buf = tree->tree_data;
sub.size = tree->tree_size;
- me.up = up;
- me.elem = entry.path;
- me.len = entry.pathlen;
- add_pbase_object(&sub, &me, down, downlen);
+ add_pbase_object(&sub, down, downlen, fullname);
pbase_tree_put(tree);
}
}
@@ -778,14 +759,14 @@ static void add_preferred_base_object(ch
for (it = pbase_tree; it; it = it->next) {
if (cmplen == 0) {
- hash = name_hash(NULL, "");
+ hash = name_hash("");
add_object_entry(it->pcache.sha1, hash, 1);
}
else {
struct tree_desc tree;
tree.buf = it->pcache.tree_data;
tree.size = it->pcache.tree_size;
- add_pbase_object(&tree, NULL, name, cmplen);
+ add_pbase_object(&tree, name, cmplen, name);
}
}
}
@@ -1328,7 +1309,7 @@ int main(int argc, char **argv)
}
if (get_sha1_hex(line, sha1))
die("expected sha1, got garbage:\n %s", line);
- hash = name_hash(NULL, line+41);
+ hash = name_hash(line+41);
add_preferred_base_object(line+41, hash);
add_object_entry(sha1, hash, 0);
}
^ permalink raw reply related [flat|nested] 15+ messages in thread
* Re: Horrible re-packing?
2006-06-05 21:22 ` Nicolas Pitre
@ 2006-06-06 0:18 ` Chris Wedgwood
2006-06-06 0:35 ` Linus Torvalds
0 siblings, 1 reply; 15+ messages in thread
From: Chris Wedgwood @ 2006-06-06 0:18 UTC (permalink / raw)
To: Nicolas Pitre
Cc: Olivier Galibert, Linus Torvalds, Junio C Hamano,
Git Mailing List
On Mon, Jun 05, 2006 at 05:22:02PM -0400, Nicolas Pitre wrote:
> Much more expensive for both memory usage and CPU cycles.
On a modern CPU I'm not sure you would notice unless the dataset was
insanely large.
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: Horrible re-packing?
2006-06-06 0:18 ` Chris Wedgwood
@ 2006-06-06 0:35 ` Linus Torvalds
0 siblings, 0 replies; 15+ messages in thread
From: Linus Torvalds @ 2006-06-06 0:35 UTC (permalink / raw)
To: Chris Wedgwood
Cc: Nicolas Pitre, Olivier Galibert, Junio C Hamano, Git Mailing List
On Mon, 5 Jun 2006, Chris Wedgwood wrote:
>
> On Mon, Jun 05, 2006 at 05:22:02PM -0400, Nicolas Pitre wrote:
>
> > Much more expensive for both memory usage and CPU cycles.
>
> On a modern CPU I'm not sure you would notice unless the dataset was
> insanely large.
The dataset really _is_ pretty large.
For the kernel, we're talking about a quarter million strings, and it's
not shrinking. Other (CVS imported from long histories) are in the several
million object range.
The real problem, btw, is not the CPU cost of sorting them (hey, O(nlogn)
works ok), but the memory use. We have to keep those quarter million names
in memory too. At a "pointer + average of ~30 bytes of full pathname
allocation + malloc overhead", the strings would basically take about 40
bytes of memory each, and cause constant cache-misses. In contrast, the
"hash" value is a 32-bit unsigned, with no pointer overhead.
It's not the biggest part to keep track of (we need to obviously keep
track of the 20-byte SHA1 and the pointers between objects), but if we had
the full string info, it would be quite noticeable overhead, on the order
of several tens of megabytes.
Now, "tens of megabytes" is not a make-or-break issue any more (you
definitely want lots of memory if you want to develop with large trees),
but in a very real sense, the memory footprint of an app is often very
closely correlated with its performance these days.
Trying to keep things within the L2 cache can help a lot, and even if we
expect 2MB and 4MB L2's to get more and more common, it means that we
should _strive_ to keep datasets down. As it is, we have no _chance_ of
staying in the L2 cache on the kernel, but for smaller projects we
hopefully can still do so ;)
Linus
^ permalink raw reply [flat|nested] 15+ messages in thread
end of thread, other threads:[~2006-06-06 0:35 UTC | newest]
Thread overview: 15+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2006-06-05 17:08 Horrible re-packing? Linus Torvalds
2006-06-05 18:44 ` Linus Torvalds
2006-06-05 19:03 ` Linus Torvalds
2006-06-05 19:37 ` Junio C Hamano
2006-06-05 19:57 ` Linus Torvalds
2006-06-05 23:54 ` Junio C Hamano
2006-06-06 0:14 ` Junio C Hamano
2006-06-05 21:14 ` Olivier Galibert
2006-06-05 21:22 ` Nicolas Pitre
2006-06-06 0:18 ` Chris Wedgwood
2006-06-06 0:35 ` Linus Torvalds
2006-06-05 21:27 ` Linus Torvalds
2006-06-05 21:20 ` Nicolas Pitre
2006-06-05 21:40 ` Linus Torvalds
2006-06-05 23:13 ` Nicolas Pitre
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).