* Fix object re-hashing
@ 2006-02-12 18:04 Linus Torvalds
2006-02-12 18:10 ` Linus Torvalds
2006-02-12 18:16 ` Linus Torvalds
0 siblings, 2 replies; 13+ messages in thread
From: Linus Torvalds @ 2006-02-12 18:04 UTC (permalink / raw)
To: Junio C Hamano, Git Mailing List, Johannes Schindelin
The hashed object lookup had a subtle bug in re-hashing: it did
	for (i = 0; i < count; i++)
		if (objs[i]) {
			.. rehash ..
where "count" was the old hash couny. Oon the face of it is obvious, since
it clearly re-hashes all the old objects.
However, it's wrong.
If the last old hash entry before re-hashing was in use (or became in use
by the re-hashing), then the re-hashing could have inserted an object
into the hash entries with idx >= count due to overflow. When we then
rehash the last old entry, that old entry might become empty, which means
that the overflow entries should be re-hashed again.
In other words, the loop has to be fixed to traverse the whole array, rather
than just the old count.
(There's room for a slight optimization: instead of counting all the way
up, we can break when we see the first empty slot that is above the old
"count". At that point we know we don't have any collissions that we might
have to fix up any more. This patch only does the trivial fix)
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
---
I actually didn't see any of this trigger in real life, so maybe my
analysis is wrong. Junio? Johannes?
diff --git a/object.c b/object.c
index 59e5e36..aeda228 100644
--- a/object.c
+++ b/object.c
@@ -65,7 +65,7 @@ void created_object(const unsigned char
 		objs = xrealloc(objs, obj_allocs * sizeof(struct object *));
 		memset(objs + count, 0, (obj_allocs - count)
 				* sizeof(struct object *));
-		for (i = 0; i < count; i++)
+		for (i = 0; obj_allocs ; i++)
 			if (objs[i]) {
 				int j = find_object(objs[i]->sha1);
 				if (j != i) {
* Re: Fix object re-hashing
2006-02-12 18:04 Fix object re-hashing Linus Torvalds
@ 2006-02-12 18:10 ` Linus Torvalds
2006-02-12 18:32 ` Junio C Hamano
2006-02-12 18:16 ` Linus Torvalds
1 sibling, 1 reply; 13+ messages in thread
From: Linus Torvalds @ 2006-02-12 18:10 UTC (permalink / raw)
To: Junio C Hamano, Git Mailing List, Johannes Schindelin
On Sun, 12 Feb 2006, Linus Torvalds wrote:
>
> I actually didn't see any of this trigger in real life, so maybe my
> analysis is wrong. Junio? Johannes?
Btw, if it does trigger, the behaviour would be that a subsequent object
lookup will fail, because the last old slot would be NULL, and a few
entries following it (likely just a couple - never mind that the event
triggering in the first place is probably fairly rare) wouldn't have
gotten re-hashed down.
As a result, we'd allocate a new object, and have _two_ "struct object"s
that describe the same real object. I don't know what would get upset, but
git-fsck-index certainly would be (one of them would likely be marked
unreachable, because lookup wouldn't find it, but you might have other
issues too).
Linus
* Re: Fix object re-hashing
2006-02-12 18:04 Fix object re-hashing Linus Torvalds
2006-02-12 18:10 ` Linus Torvalds
@ 2006-02-12 18:16 ` Linus Torvalds
2006-02-12 18:18 ` Linus Torvalds
1 sibling, 1 reply; 13+ messages in thread
From: Linus Torvalds @ 2006-02-12 18:16 UTC (permalink / raw)
To: Junio C Hamano, Git Mailing List, Johannes Schindelin
On Sun, 12 Feb 2006, Linus Torvalds wrote:
> - for (i = 0; i < count; i++)
> + for (i = 0; obj_allocs ; i++)
GAAH.
That should obviously be "i < obj_allocs".
That's what I get for editing the patch in-place to remove the optimized
version that I felt wasn't worth worrying about due to being subtle. So
instead I sent out a patch that was not-so-subtly obvious crap!
Sorry about that.
Linus
* Re: Fix object re-hashing
2006-02-12 18:16 ` Linus Torvalds
@ 2006-02-12 18:18 ` Linus Torvalds
0 siblings, 0 replies; 13+ messages in thread
From: Linus Torvalds @ 2006-02-12 18:18 UTC (permalink / raw)
To: Junio C Hamano, Git Mailing List, Johannes Schindelin
On Sun, 12 Feb 2006, Linus Torvalds wrote:
>
> That's what I get for editing the patch in-place to remove the optimized
> version that I felt wasn't worth worrying about due to being subtle. So
> instead I sent out a patch that was not-so-subtly obvious crap!
Btw: the reason I edited out the optimization is that it doesn't actually
matter. Re-hashing the whole thing is a trivial thing, and has basically
zero overhead in my testing. The costs are all elsewhere now.
Linus
* Re: Fix object re-hashing
2006-02-12 18:10 ` Linus Torvalds
@ 2006-02-12 18:32 ` Junio C Hamano
2006-02-12 18:53 ` Linus Torvalds
0 siblings, 1 reply; 13+ messages in thread
From: Junio C Hamano @ 2006-02-12 18:32 UTC (permalink / raw)
To: Linus Torvalds; +Cc: git
Linus Torvalds <torvalds@osdl.org> writes:
> On Sun, 12 Feb 2006, Linus Torvalds wrote:
>>
>> I actually didn't see any of this trigger in real life, so maybe my
>> analysis is wrong. Junio? Johannes?
>
> Btw, if it does trigger, the behaviour would be that a subsequent object
> lookup will fail, because the last old slot would be NULL, and a few
> entries following it (likely just a couple - never mind that the event
> triggering in the first place is probably fairly rare) wouldn't have
> gotten re-hashed down.
>
> As a result, we'd allocate a new object, and have _two_ "struct object"s
> that describe the same real object. I don't know what would get upset, but
> git-fsck-index certainly would be (one of them would likely be marked
> unreachable, because lookup wouldn't find it, but you might have other
> issues too).
This "fix" makes the symptom that me fire two (maybe three)
Grrrrr messages earlier this morning disappear. I haven't had
my caffeine nor nicotine yet after my short sleep, so I need to
take some time understanding your explanation first, but I am
reasonably sure this must be it (not that I do not trust you,
not at all -- it is that I do not trust *me* applying a patch
without understanding when I have a bug reproducible).
Thanks.
* Re: Fix object re-hashing
2006-02-12 18:32 ` Junio C Hamano
@ 2006-02-12 18:53 ` Linus Torvalds
2006-02-12 19:10 ` Linus Torvalds
2006-02-12 19:13 ` Junio C Hamano
0 siblings, 2 replies; 13+ messages in thread
From: Linus Torvalds @ 2006-02-12 18:53 UTC (permalink / raw)
To: Junio C Hamano; +Cc: git
On Sun, 12 Feb 2006, Junio C Hamano wrote:
>
> This "fix" makes the symptom that me fire two (maybe three)
> Grrrrr messages earlier this morning disappear.
Goodie. I assume that was the fixed fix, not my original "edit out the
useless optimization and then break it totally" fix ;)
> I haven't had my caffeine nor nicotine yet after my short sleep, so I
> need to take some time understanding your explanation first, but I am
> reasonably sure this must be it (not that I do not trust you, not at all
> -- it is that I do not trust *me* applying a patch without understanding
> when I have a bug reproducible).
The basic notion is that this hashing algorithm uses a normal "linear
probing" overflow approach, which basically means that overflows in
a hash bucket always just probe the next few buckets to find an empty one.
That's a really simple (and fairly cache-friendly) approach, and it makes
tons of sense, especially since we always re-size the hash to guarantee
that we'll have empty slots. It's a bit more subtle - especially when
re-hashing - than the probably more common "collision chain" approach,
though.
Now, when we re-hash, the important rule is:
- the re-hashing has to walk in the same direction as the overflow.
This is important, because when we move a hashed entry, that automatically
means that even otherwise _already_correctly_ hashed entries may need to
be moved down (ie even if their "inherent hash" does not change, their
_effective_ hash address changes because their overflow position needs to
be fixed up).
There are two interesting cases:
- the "overflow of the overflow": when the linear probing itself
overflows the size of the hash queue, it will "change direction" by
overflowing back to index zero.
Happily, the re-hashing does not need to care about this case, because
the new hash is bigger: the rule we have when doing the re-hashing is
that as we re-hash, the "i" entries we have already re-hashed are all
valid in the new hash, so even if overflow occurs, it will occur the
right way (and if it overflows all the way past the current "i", we'll
re-hash the already re-hashed entry anyway).
- the old/new border case. In particular, the trivial logic says that we
only need to re-hash entries that were hashed with the old hash. That's
what the broken code did: it only traversed "0..oldcount-1", because
any entries that had an index bigger than or equal to "oldcount" were
obviously _already_ re-hashed.
That logic sounds obvious, but it falls down on exactly the fact that
we may indeed have to re-hash even entries that already were re-hashed
with the new algorithm, exactly because of the overflow changes.
So the boundary for old/new is really: "you need to rehash all entries
that were old, but then you _also_ need to rehash the list of entries that
you rehashed that might need to be moved down to an empty spot vacated by
an old hash".
So the stop condition really ends up being: "stop when you have seen all
old hash entries _and_ at least one empty entry after that", since an
empty entry means that there was no overflow from earlier positions past
that position. But it's just simpler to walk the whole damn new thing and
not worry about it.
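(To make that stop condition concrete, here is a minimal sketch of an
in-place rehash for a generic linear-probing table of nonzero integer keys.
None of these names come from object.c; "probe" and "rehash_in_place" are
hypothetical, and the sketch only illustrates the full walk plus the early
break described above.)

    #include <stddef.h>

    /* Return the slot holding "key", or the first empty slot in its probe
     * run.  Assumes the table always has at least one empty slot. */
    static size_t probe(const unsigned *table, size_t size, unsigned key)
    {
        size_t i = key % size;
        while (table[i] && table[i] != key)
            i = (i + 1) % size;     /* linear probing, wrapping at the end */
        return i;
    }

    /* table[] has already been grown from old_size to new_size slots,
     * with the new slots zeroed (0 means "empty"). */
    static void rehash_in_place(unsigned *table, size_t old_size, size_t new_size)
    {
        for (size_t i = 0; i < new_size; i++) {
            unsigned key = table[i];
            if (!key) {
                /* first empty slot above the old boundary: no overflow
                 * run further down can still need fixing up */
                if (i >= old_size)
                    break;
                continue;
            }
            size_t j = probe(table, new_size, key);
            if (j != i) {           /* entry has to move to keep probing valid */
                table[j] = key;
                table[i] = 0;
            }
        }
    }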
Linus
* Re: Fix object re-hashing
2006-02-12 18:53 ` Linus Torvalds
@ 2006-02-12 19:10 ` Linus Torvalds
2006-02-12 19:21 ` Junio C Hamano
2006-02-12 23:55 ` Johannes Schindelin
2006-02-12 19:13 ` Junio C Hamano
1 sibling, 2 replies; 13+ messages in thread
From: Linus Torvalds @ 2006-02-12 19:10 UTC (permalink / raw)
To: Junio C Hamano; +Cc: git
On Sun, 12 Feb 2006, Linus Torvalds wrote:
>
> - the "overflow of the overflow": when the linear probing itself
> overflows the size of the hash queue, it will "change direction" by
> overflowing back to index zero.
>
> Happily, the re-hashing does not need to care about this case, because
> the new hash is bigger: the rule we have when doing the re-hashing is
> that as we re-hash, the "i" entries we have already re-hashed are all
> valid in the new hash, so even if overflow occurs, it will occur the
> right way (and if it overflows all the way past the current "i", we'll
> re-hash the already re-hashed entry anyway).
Btw, this is only always true if the new hash is at least twice the size
of the old hash, I think. Otherwise a re-hash can fill up the new entries
and overflow entirely before we've actually even re-hashed all the old
entries, and then we'd need to re-hash even the overflowed entries (which
are now below "i").
If the new size is at least twice the old size, the "upper area" cannot
overflow completely (there has to be empty room), and we cannot be in the
situation that we need to move even the overflowed entries when we remove
an old hash entry.
Anyway, if all this makes you nervous, the conceptually much simpler way
to do the re-sizing is to not do the in-place re-hashing. Instead of doing
the xrealloc(), just do an "xmalloc()" of the new area, do the re-hashing
(which now _must_ re-hash in just the "0..oldcount-1" old area) into the
new area, and then free the old area after rehashing.
That would make things more obviously correct, and perhaps simpler.
Johannes, do you want to try that?
Btw, as it currently stands, I worry a tiny tiny bit about the
	obj_allocs = (obj_allocs < 32 ? 32 : 2 * obj_allocs)
thing, because I think that second "32" needs to be a "64" to be really
safe (ie guarantee that the new obj_allocs value is always at least twice
the old one).
Anyway, I'm pretty sure people smarter than me have already codified
exactly what needs to be done for an in-place rehash of a linear probe hash
overflow algorithm. This must all be in some "hashing 101" book. I had to
think it through from first principles rather than "knowing" what the
right answer was (which probably means that I slept through some
fundamental algorithms class in University ;)
Linus
* Re: Fix object re-hashing
2006-02-12 18:53 ` Linus Torvalds
2006-02-12 19:10 ` Linus Torvalds
@ 2006-02-12 19:13 ` Junio C Hamano
1 sibling, 0 replies; 13+ messages in thread
From: Junio C Hamano @ 2006-02-12 19:13 UTC (permalink / raw)
To: Linus Torvalds; +Cc: git
Linus Torvalds <torvalds@osdl.org> writes:
> On Sun, 12 Feb 2006, Junio C Hamano wrote:
>>
>> This "fix" makes the symptom that me fire two (maybe three)
>> Grrrrr messages earlier this morning disappear.
>
> Goodie. I assume that was the fixed fix, not my original "edit out the
> useless optimization and then break it totally" fix ;)
>
>> I haven't had my caffeine nor nicotine yet after my short sleep, so I
>> need to take some time understanding your explanation first, but I am
>> reasonably sure this must be it (not that I do not trust you, not at all
>> -- it is that I do not trust *me* applying a patch without understanding
>> when I have a bug reproducible).
Your explanation finally made sense to me, still without caffeine or
nicotine, once I tried to do an illustration.
If the initial obj_allocs were 4 instead of 32, we may have
something like this before rehashing.
    slot  value
      0      3
      1      -
      2      -
      3      7
Rehash to double the hash goes like this:
          step1    step2    step3    fixup rehash
          enlarge  rehash   rehash   missing from
    slot  array    "3%8"    "7%8"    the original
      0   3        -        -        -
      1   -        -        -        -
      2   -        -        -        -
      3   7        7        -        3
      4   -        3        3        -
      5   -        -        -        -
      6   -        -        -        -
      7   -        -        7        7
We cannot find "3%8" without the fix.
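(For completeness, the same 4-to-8 example can be fed through the toy
linear-probing sketch included earlier in the thread -- illustration only,
not git's code.)

    /* after enlarging from 4 to 8 slots, as in the table above */
    unsigned table[8] = { 3, 0, 0, 7, 0, 0, 0, 0 };

    rehash_in_place(table, 4, 8);
    /* now table[3] == 3 and table[7] == 7 again, so a lookup that
     * probes from "3 % 8" finds the entry instead of failing */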
Thanks for the fix. Will do an updated "master" soon.
* Re: Fix object re-hashing
2006-02-12 19:10 ` Linus Torvalds
@ 2006-02-12 19:21 ` Junio C Hamano
2006-02-12 19:39 ` Linus Torvalds
2006-02-12 23:55 ` Johannes Schindelin
1 sibling, 1 reply; 13+ messages in thread
From: Junio C Hamano @ 2006-02-12 19:21 UTC (permalink / raw)
To: Linus Torvalds; +Cc: git
Linus Torvalds <torvalds@osdl.org> writes:
> Anyway, if all this makes you nervous,...
I did draw an illustration like the one I sent in my previous
message when I received the first patch from Johannes, and it was
reasonably obvious to me that it was meant to redistribute about
half of the existing entries to the upper area, always going
upwards, so modulo that wraparound corner case you fixed, I
think doubling is fine.
> Btw, as it currently stands, I worry a tiny tiny bit about the
>
> obj_allocs = (obj_allocs < 32 ? 32 : 2 * obj_allocs)
>
> thing, because I think that second "32" needs to be a "64" to be really
> safe (ie guarantee that the new obj_allocs value is always at least twice
> the old one).
obj_allocs starts out as 0 so the first value it gets is 32 when
you need to insert the first element.
* Re: Fix object re-hashing
2006-02-12 19:21 ` Junio C Hamano
@ 2006-02-12 19:39 ` Linus Torvalds
0 siblings, 0 replies; 13+ messages in thread
From: Linus Torvalds @ 2006-02-12 19:39 UTC (permalink / raw)
To: Junio C Hamano; +Cc: git
On Sun, 12 Feb 2006, Junio C Hamano wrote:
>
> > Btw, as it currently stands, I worry a tiny tiny bit about the
> >
> > obj_allocs = (obj_allocs < 32 ? 32 : 2 * obj_allocs)
> >
> > thing, because I think that second "32" needs to be a "64" to be really
> > safe (ie guarantee that the new obj_allocs value is always at least twice
> > the old one).
>
> obj_allocs starts out as 0 so the first value it gets is 32 when
> you need to insert the first element.
Yes. The point being that the code is "conceptually wrong", not that it
doesn't work in practice. If we somehow could get into the situation that
we had a hash size of 31, resizing it to 32 would be incorrect.
Of course, if we just make it a rule that the hash size must always be a
power-of-two (add a comment, and enforce the rule by changing the modulus
into a bitwise "and"), then that issue too goes away.
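(As an illustration of that rule -- a hypothetical helper, not the code in
object.c: with a power-of-two table size the modulus becomes a simple mask.)

    #include <string.h>

    /* Hypothetical index helper; n must be a power of two. */
    static unsigned int hash_index(const unsigned char *sha1, unsigned int n)
    {
        unsigned int hash;
        memcpy(&hash, sha1, sizeof(unsigned int)); /* first bytes of the SHA-1 */
        return hash & (n - 1);  /* same as "hash % n" when n is a power of two */
    }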
Linus
* Re: Fix object re-hashing
2006-02-12 19:10 ` Linus Torvalds
2006-02-12 19:21 ` Junio C Hamano
@ 2006-02-12 23:55 ` Johannes Schindelin
2006-02-13 0:16 ` Linus Torvalds
1 sibling, 1 reply; 13+ messages in thread
From: Johannes Schindelin @ 2006-02-12 23:55 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Junio C Hamano, git
Hi,
On Sun, 12 Feb 2006, Linus Torvalds wrote:
> [something about the overflow in another mail]
Thank you for thinking it through! I was soooo stuck with my original
idea: Ideally (i.e. if there are no collisions), if the hashtable is
doubled in size, then each offset should either stay the same, or be just
incremented by the original size (since the index is the hash modulo the
hashtable size).
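(To make that observation concrete -- illustration only, not code from this
thread: for any key k and old size n, k % (2*n) is either k % n or k % n + n.)

    #include <assert.h>

    /* The doubled table's home slot is the old home slot, possibly
     * shifted up by the old table size. */
    static void check_doubling(unsigned k, unsigned n)
    {
        unsigned old_slot = k % n;
        unsigned new_slot = k % (2 * n);
        assert(new_slot == old_slot || new_slot == old_slot + n);
    }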
So I wanted to be clever about resizing, and just increment the offset if
necessary. As it turns out, it's more complicated than that. You have to
make sure that those entries which collided with another entry, but no
longer do, are adjusted appropriately.
And the overflow problem eluded my attention entirely. (I feel quite silly
about it, because I fixed so many buffer-overflow problems myself, and
the cause of the problem is the same there.)
> On Sun, 12 Feb 2006, Linus Torvalds wrote:
> >
> > - the "overflow of the overflow": when the linear probing itself
> > overflows the size of the hash queue, it will "change direction" by
> > overflowing back to index zero.
> >
> > Happily, the re-hashing does not need to care about this case, because
> > the new hash is bigger: the rule we have when doing the re-hashing is
> > that as we re-hash, the "i" entries we have already re-hashed are all
> > valid in the new hash, so even if overflow occurs, it will occur the
> > right way (and if it overflows all the way past the current "i", we'll
> > re-hash the already re-hashed entry anyway).
>
> Btw, this is only always true if the new hash is at least twice the size
> of the old hash, I think. Otherwise a re-hash can fill up the new entries
> and overflow entirely before we've actually even re-hashed all the old
> entries, and then we'd need to re-hash even the overflowed entries (which
> are now below "i").
After thinking long and hard about it, I tend to agree.
Note: I chose the factor 2 because hashtables tend to have *awful*
performance when space becomes scarce. So, 2 is not only a wise choice for
rehashing, but for the operation in general.
> Anyway, if all this makes you nervous, the conceptually much simpler way
> to do the re-sizing is to not do the in-place re-hashing. Instead of doing
> the xrealloc(), just do a "xmalloc()" of the new area, do the re-hashing
> (which now _must_ re-hash in just the "0..oldcount-1" old area) into the
> new area, and then free the old area after rehashing.
>
> That would make things more obviously correct, and perhaps simpler.
>
> Johannes, do you want to try that?
I do not particularly like it, since doubling the hashtable size is not
particularly space efficient, and this makes it worse. Anyway, see below.
> Btw, as it currently stands, I worry a tiny tiny bit about the
>
> obj_allocs = (obj_allocs < 32 ? 32 : 2 * obj_allocs)
>
> thing, because I think that second "32" needs to be a "64" to be really
> safe (ie guarantee that the new obj_allocs value is always at least twice
> the old one).
As Junio already pointed out: obj_allocs is initially set to 0. But you're
right, it is conceptually wrong.
> Anyway, I'm pretty sure people smarter than me have already codified
> exactly what needs to be done for an in-place rehash of a linear probe hash
> overflow algorithm. This must all be in some "hashing 101" book. I had to
> think it through from first principles rather than "knowing" what the
> right answer was (which probably means that I slept through some
> fundamental algorithms class in University ;)
Well, it seems like a long time, doesn't it? But I always liked the
Fibonacci numbers, and therefore the Fibonacci heap.
---
Make hashtable resizing more robust AKA do not resize in-place
diff --git a/object.c b/object.c
index c9ca481..94f0f5d 100644
--- a/object.c
+++ b/object.c
@@ -56,18 +56,14 @@ void created_object(const unsigned char
 	if (obj_allocs - 1 <= nr_objs * 2) {
 		int i, count = obj_allocs;
-		obj_allocs = (obj_allocs < 32 ? 32 : 2 * obj_allocs);
-		objs = xrealloc(objs, obj_allocs * sizeof(struct object *));
-		memset(objs + count, 0, (obj_allocs - count)
-				* sizeof(struct object *));
-		for (i = 0; i < obj_allocs; i++)
-			if (objs[i]) {
-				int j = find_object(objs[i]->sha1);
-				if (j != i) {
-					j = -1 - j;
-					objs[j] = objs[i];
-					objs[i] = NULL;
-				}
+		struct object** old_objs = objs;
+		obj_allocs = (obj_allocs < 32 ? 64 : 2 * obj_allocs);
+		objs = xcalloc(obj_allocs, sizeof(struct object *));
+		for (i = 0; i < count; i++)
+			if (old_objs[i]) {
+				/* it is guaranteed to be new */
+				int j = -1 - find_object(old_objs[i]->sha1);
+				objs[j] = old_objs[i];
 			}
 	}
* Re: Fix object re-hashing
2006-02-12 23:55 ` Johannes Schindelin
@ 2006-02-13 0:16 ` Linus Torvalds
2006-02-13 0:31 ` Johannes Schindelin
0 siblings, 1 reply; 13+ messages in thread
From: Linus Torvalds @ 2006-02-13 0:16 UTC (permalink / raw)
To: Johannes Schindelin; +Cc: Junio C Hamano, git
On Mon, 13 Feb 2006, Johannes Schindelin wrote:
>
> Make hashtable resizing more robust AKA do not resize in-place
You forgot to release the old array afterwards.
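(For illustration, a minimal sketch of the same resize with the old array
released at the end; the variable names follow the patch above, but this is
not the code as posted or committed.)

    struct object **old_objs = objs;

    obj_allocs = (obj_allocs < 32 ? 64 : 2 * obj_allocs);
    objs = xcalloc(obj_allocs, sizeof(struct object *));
    for (i = 0; i < count; i++)
        if (old_objs[i]) {
            /* the object cannot already be in the new, empty table */
            int j = -1 - find_object(old_objs[i]->sha1);
            objs[j] = old_objs[i];
        }
    free(old_objs); /* the release that was missing from the patch */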
Anyway, I think the in-place version is fine now, even if it has a few
subtleties. So this isn't needed, but keep it in mind if we find another
bug, or if somebody wants to grow the hash table less aggressively than
with doubling it every time.
Linus
* Re: Fix object re-hashing
2006-02-13 0:16 ` Linus Torvalds
@ 2006-02-13 0:31 ` Johannes Schindelin
0 siblings, 0 replies; 13+ messages in thread
From: Johannes Schindelin @ 2006-02-13 0:31 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Junio C Hamano, git
Hi,
On Sun, 12 Feb 2006, Linus Torvalds wrote:
> On Mon, 13 Feb 2006, Johannes Schindelin wrote:
> >
> > Make hashtable resizing more robust AKA do not resize in-place
>
> You forgot to release the old array afterwards.
D'oh! I am going to bed now.
> Anyway, I think the in-place version is fine now, even if it has a few
> subtleties. So this isn't needed, but keep it in mind if we find another
> bug, or if somebody wants to shrink the hash table less aggressively than
> with doubling it every time.
Sounds fine to me.
Good night,
Dscho