* Fix object re-hashing
From: Linus Torvalds @ 2006-02-12 18:04 UTC
To: Junio C Hamano, Git Mailing List, Johannes Schindelin

The hashed object lookup had a subtle bug in re-hashing: it did

	for (i = 0; i < count; i++)
		if (objs[i]) {
			.. rehash ..

where "count" was the old hash count.

On the face of it this is obvious, since it clearly re-hashes all the old
objects. However, it's wrong.

If the last old hash entry before re-hashing was in use (or became in use
by the re-hashing), then the re-hashing could have inserted an object into
the hash entries with idx >= count due to overflow. When we then rehash
the last old entry, that old entry might become empty, which means that
the overflow entries should be re-hashed again.

In other words, the loop has to be fixed to traverse the whole array,
rather than just the old count.

(There's room for a slight optimization: instead of counting all the way
up, we can break when we see the first empty slot that is above the old
"count". At that point we know we don't have any collisions that we might
have to fix up any more. This patch only does the trivial fix.)

Signed-off-by: Linus Torvalds <torvalds@osdl.org>
---

I actually didn't see any of this trigger in real life, so maybe my
analysis is wrong. Junio? Johannes?

diff --git a/object.c b/object.c
index 59e5e36..aeda228 100644
--- a/object.c
+++ b/object.c
@@ -65,7 +65,7 @@ void created_object(const unsigned char
 		objs = xrealloc(objs, obj_allocs * sizeof(struct object *));
 		memset(objs + count, 0, (obj_allocs - count)
 				* sizeof(struct object *));
-		for (i = 0; i < count; i++)
+		for (i = 0; obj_allocs ; i++)
 			if (objs[i]) {
 				int j = find_object(objs[i]->sha1);
 				if (j != i) {
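As an aside, the optimization mentioned in the parentheses above could
look roughly like the following sketch. It reuses object.c's objs,
obj_allocs, count, and find_object(), and only illustrates the stop
condition; it is not part of the actual patch:

	for (i = 0; i < obj_allocs; i++) {
		if (!objs[i]) {
			if (i >= count)
				break;	/* an empty slot above the old count:
					 * no overflow run can reach past it */
			continue;
		}
		int j = find_object(objs[i]->sha1);
		if (j != i) {
			j = -1 - j;	/* "not found" comes back as -1-slot */
			objs[j] = objs[i];
			objs[i] = NULL;
		}
	}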
* Re: Fix object re-hashing
From: Linus Torvalds @ 2006-02-12 18:10 UTC
To: Junio C Hamano, Git Mailing List, Johannes Schindelin

On Sun, 12 Feb 2006, Linus Torvalds wrote:
>
> I actually didn't see any of this trigger in real life, so maybe my
> analysis is wrong. Junio? Johannes?

Btw, if it does trigger, the behaviour would be that a subsequent object
lookup will fail, because the last old slot would be NULL, and a few
entries following it (likely just a couple - never mind that the event
triggering in the first place is probably fairly rare) wouldn't have
gotten re-hashed down.

As a result, we'd allocate a new object, and have _two_ "struct object"s
that describe the same real object. I don't know what would get upset, but
git-fsck-index certainly would be (one of them would likely be marked
unreachable, because lookup wouldn't find it, but you might have other
issues too).

		Linus
* Re: Fix object re-hashing
From: Junio C Hamano @ 2006-02-12 18:32 UTC
To: Linus Torvalds
Cc: git

Linus Torvalds <torvalds@osdl.org> writes:

> On Sun, 12 Feb 2006, Linus Torvalds wrote:
>>
>> I actually didn't see any of this trigger in real life, so maybe my
>> analysis is wrong. Junio? Johannes?
>
> Btw, if it does trigger, the behaviour would be that a subsequent object
> lookup will fail, because the last old slot would be NULL, and a few
> entries following it (likely just a couple - never mind that the event
> triggering in the first place is probably fairly rare) wouldn't have
> gotten re-hashed down.
>
> As a result, we'd allocate a new object, and have _two_ "struct object"s
> that describe the same real object. I don't know what would get upset, but
> git-fsck-index certainly would be (one of them would likely be marked
> unreachable, because lookup wouldn't find it, but you might have other
> issues too).

This "fix" makes the symptom that made me fire two (maybe three)
Grrrrr messages earlier this morning disappear.

I haven't had my caffeine nor nicotine yet after my short sleep, so I
need to take some time understanding your explanation first, but I am
reasonably sure this must be it (not that I do not trust you, not at all
-- it is that I do not trust *me* applying a patch without understanding
it when I have a reproducible bug).

Thanks.
* Re: Fix object re-hashing
From: Linus Torvalds @ 2006-02-12 18:53 UTC
To: Junio C Hamano
Cc: git

On Sun, 12 Feb 2006, Junio C Hamano wrote:
>
> This "fix" makes the symptom that made me fire two (maybe three)
> Grrrrr messages earlier this morning disappear.

Goodie. I assume that was the fixed fix, not my original "edit out the
useless optimization and then break it totally" fix ;)

> I haven't had my caffeine nor nicotine yet after my short sleep, so I
> need to take some time understanding your explanation first, but I am
> reasonably sure this must be it (not that I do not trust you, not at all
> -- it is that I do not trust *me* applying a patch without understanding
> it when I have a reproducible bug).

The basic notion is that this hashing algorithm uses a normal "linear
probing" overflow approach, which basically means that overflows in a
hash bucket always just probe the next few buckets to find an empty one.

That's a really simple (and fairly cache-friendly) approach, and it makes
tons of sense, especially since we always re-size the hash to guarantee
that we'll have empty slots. It's a bit more subtle - especially when
re-hashing - than the probably more common "collision chain" approach,
though.

Now, when we re-hash, the important rule is:

 - the re-hashing has to walk in the same direction as the overflow.

   This is important, because when we move a hashed entry, that
   automatically means that even otherwise _already_correctly_ hashed
   entries may need to be moved down (ie even if their "inherent hash"
   does not change, their _effective_ hash address changes because their
   overflow position needs to be fixed up).

There are two interesting cases:

 - the "overflow of the overflow": when the linear probing itself
   overflows the size of the hash queue, it will "change direction" by
   overflowing back to index zero.

   Happily, the re-hashing does not need to care about this case, because
   the new hash is bigger: the rule we have when doing the re-hashing is
   that as we re-hash, the "i" entries we have already re-hashed are all
   valid in the new hash, so even if overflow occurs, it will occur the
   right way (and if it overflows all the way past the current "i", we'll
   re-hash the already re-hashed entry anyway).

 - the old/new border case. In particular, the trivial logic says that we
   only need to re-hash entries that were hashed with the old hash. That's
   what the broken code did: it only traversed "0..oldcount-1", because
   any entries that had an index bigger than or equal to "oldcount" were
   obviously _already_ re-hashed.

   That logic sounds obvious, but it falls down on exactly the fact that
   we may indeed have to re-hash even entries that already were re-hashed
   with the new algorithm, exactly because of the overflow changes.

So the boundary for old/new is really: "you need to rehash all entries
that were old, but then you _also_ need to rehash the list of entries that
you rehashed that might need to be moved down to an empty spot vacated by
an old hash".

So the stop condition really ends up being: "stop when you have seen all
old hash entries _and_ at least one empty entry after that", since an
empty entry means that there was no overflow from earlier positions past
that position.

But it's just simpler to walk the whole damn new thing and not worry
about it.
		Linus
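For readers who want the probing rule above spelled out, here is a
minimal, self-contained sketch of a find_object()-style lookup; the
cut-down struct and the hash computation are illustrative, not the real
object.c code:

	#include <string.h>

	struct object { unsigned char sha1[20]; };	/* cut down for illustration */

	static struct object **objs;	/* hash table of pointers, NULL = empty slot */
	static int obj_allocs;		/* current table size */

	static int find_slot(const unsigned char *sha1)
	{
		unsigned int hash;
		int i;

		memcpy(&hash, sha1, sizeof(hash));	/* illustrative: reuse SHA-1 bits */
		i = hash % obj_allocs;
		while (objs[i]) {
			if (!memcmp(objs[i]->sha1, sha1, 20))
				return i;		/* found at slot i */
			i = (i + 1) % obj_allocs;	/* linear probe, wraps to slot 0 */
		}
		return -1 - i;	/* not found; encodes the empty slot to use */
	}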
* Re: Fix object re-hashing
From: Linus Torvalds @ 2006-02-12 19:10 UTC
To: Junio C Hamano
Cc: git

On Sun, 12 Feb 2006, Linus Torvalds wrote:
>
>  - the "overflow of the overflow": when the linear probing itself
>    overflows the size of the hash queue, it will "change direction" by
>    overflowing back to index zero.
>
>    Happily, the re-hashing does not need to care about this case, because
>    the new hash is bigger: the rule we have when doing the re-hashing is
>    that as we re-hash, the "i" entries we have already re-hashed are all
>    valid in the new hash, so even if overflow occurs, it will occur the
>    right way (and if it overflows all the way past the current "i", we'll
>    re-hash the already re-hashed entry anyway).

Btw, this is only always true if the new hash is at least twice the size
of the old hash, I think. Otherwise a re-hash can fill up the new entries
and overflow entirely before we've actually even re-hashed all the old
entries, and then we'd need to re-hash even the overflowed entries (which
are now below "i").

If the new size is at least twice the old size, the "upper area" cannot
overflow completely (there has to be empty room), and we cannot be in the
situation that we need to move even the overflowed entries when we remove
an old hash entry.

Anyway, if all this makes you nervous, the conceptually much simpler way
to do the re-sizing is to not do the in-place re-hashing. Instead of doing
the xrealloc(), just do an xmalloc() of the new area, do the re-hashing
(which now _must_ re-hash just the "0..oldcount-1" old area) into the new
area, and then free the old area after rehashing.

That would make things more obviously correct, and perhaps simpler.
Johannes, do you want to try that?

Btw, as it currently stands, I worry a tiny tiny bit about the

	obj_allocs = (obj_allocs < 32 ? 32 : 2 * obj_allocs)

thing, because I think that second "32" needs to be a "64" to be really
safe (ie guarantee that the new obj_allocs value is always at least twice
the old one).

Anyway, I'm pretty sure people smarter than me have already codified
exactly what needs to be done for an in-place rehash of a linear-probing
hash overflow algorithm. This must all be in some "hashing 101" book. I
had to think it through from first principles rather than "knowing" what
the right answer was (which probably means that I slept through some
fundamental algorithms class in University ;)

		Linus
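A growth rule that makes the "at least twice the old size" guarantee
explicit might look like the sketch below; it is only an illustration of
the invariant being discussed, not a proposed patch:

	/* 0 means "not allocated yet", so the first table is 32 slots;
	 * every later step exactly doubles, never less. */
	static int grow_allocs(int current)
	{
		return current ? 2 * current : 32;
	}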
* Re: Fix object re-hashing
From: Junio C Hamano @ 2006-02-12 19:21 UTC
To: Linus Torvalds
Cc: git

Linus Torvalds <torvalds@osdl.org> writes:

> Anyway, if all this makes you nervous,...

I did draw an illustration like the one I sent in my previous message
when I received the first patch from Johannes, and it was reasonably
obvious to me that it was meant to redistribute about half of the
existing entries to the upper area, always going upwards, so modulo that
wraparound corner case you fixed, I think doubling is fine.

> Btw, as it currently stands, I worry a tiny tiny bit about the
>
>	obj_allocs = (obj_allocs < 32 ? 32 : 2 * obj_allocs)
>
> thing, because I think that second "32" needs to be a "64" to be really
> safe (ie guarantee that the new obj_allocs value is always at least twice
> the old one).

obj_allocs starts out as 0, so the first value it gets is 32 when
you need to insert the first element.
* Re: Fix object re-hashing
From: Linus Torvalds @ 2006-02-12 19:39 UTC
To: Junio C Hamano
Cc: git

On Sun, 12 Feb 2006, Junio C Hamano wrote:
>
> > Btw, as it currently stands, I worry a tiny tiny bit about the
> >
> >	obj_allocs = (obj_allocs < 32 ? 32 : 2 * obj_allocs)
> >
> > thing, because I think that second "32" needs to be a "64" to be really
> > safe (ie guarantee that the new obj_allocs value is always at least twice
> > the old one).
>
> obj_allocs starts out as 0, so the first value it gets is 32 when
> you need to insert the first element.

Yes. The point being that the code is "conceptually wrong", not that it
doesn't work in practice. If we somehow could get into the situation that
we had a hash size of 31, resizing it to 32 would be incorrect.

Of course, if we just make it a rule that the hash size must always be a
power of two (add a comment, and enforce the rule by changing the modulus
into a bitwise "and"), then that issue too goes away.

		Linus
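The power-of-two rule can be enforced exactly as described; a sketch of
what that might look like follows (the hash shown is illustrative, not
necessarily what object.c computes):

	#include <assert.h>
	#include <string.h>

	static unsigned int first_slot_masked(const unsigned char *sha1, unsigned int size)
	{
		unsigned int hash;

		assert(size && !(size & (size - 1)));	/* size must be a power of two */
		memcpy(&hash, sha1, sizeof(hash));	/* illustrative hash */
		return hash & (size - 1);		/* same as hash % size for 2^k sizes */
	}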
* Re: Fix object re-hashing
From: Johannes Schindelin @ 2006-02-12 23:55 UTC
To: Linus Torvalds
Cc: Junio C Hamano, git

Hi,

On Sun, 12 Feb 2006, Linus Torvalds wrote:

> [something about the overflow in another mail]

Thank you for thinking it through! I was soooo stuck with my original
idea: ideally (i.e. if there are no collisions), if the hashtable is
doubled in size, then each offset should either stay the same, or just be
incremented by the original size (since the index is the hash modulo the
hashtable size). So I wanted to be clever about resizing, and just
increment the offset if necessary.

As it turns out, it's more complicated than that. You have to make sure
that those entries which collided with another entry, but no longer do,
are adjusted appropriately. And the overflow problem eluded my attention
entirely. (I feel quite silly about it, because I fixed so many
buffer-overflow problems myself, and the cause of the problem is the same
there.)

> On Sun, 12 Feb 2006, Linus Torvalds wrote:
> >
> >  - the "overflow of the overflow": when the linear probing itself
> >    overflows the size of the hash queue, it will "change direction" by
> >    overflowing back to index zero.
> >
> >    Happily, the re-hashing does not need to care about this case, because
> >    the new hash is bigger: the rule we have when doing the re-hashing is
> >    that as we re-hash, the "i" entries we have already re-hashed are all
> >    valid in the new hash, so even if overflow occurs, it will occur the
> >    right way (and if it overflows all the way past the current "i", we'll
> >    re-hash the already re-hashed entry anyway).
>
> Btw, this is only always true if the new hash is at least twice the size
> of the old hash, I think. Otherwise a re-hash can fill up the new entries
> and overflow entirely before we've actually even re-hashed all the old
> entries, and then we'd need to re-hash even the overflowed entries (which
> are now below "i").

After thinking long and hard about it, I tend to agree.

Note: I chose the factor 2 because hashtables tend to have *awful*
performance when space becomes scarce. So, 2 is not only a wise choice
for rehashing, but for the operation in general.

> Anyway, if all this makes you nervous, the conceptually much simpler way
> to do the re-sizing is to not do the in-place re-hashing. Instead of doing
> the xrealloc(), just do an xmalloc() of the new area, do the re-hashing
> (which now _must_ re-hash just the "0..oldcount-1" old area) into the new
> area, and then free the old area after rehashing.
>
> That would make things more obviously correct, and perhaps simpler.
>
> Johannes, do you want to try that?

I do not particularly like it, since doubling the hashtable size is not
particularly space efficient, and this makes it worse. Anyway, see below.

> Btw, as it currently stands, I worry a tiny tiny bit about the
>
>	obj_allocs = (obj_allocs < 32 ? 32 : 2 * obj_allocs)
>
> thing, because I think that second "32" needs to be a "64" to be really
> safe (ie guarantee that the new obj_allocs value is always at least twice
> the old one).

As Junio already pointed out: obj_allocs is initially set to 0. But
you're right, it is conceptually wrong.
> Anyway, I'm pretty sure people smarter than me have already codified
> exactly what needs to be done for an in-place rehash of a linear-probing
> hash overflow algorithm. This must all be in some "hashing 101" book. I
> had to think it through from first principles rather than "knowing" what
> the right answer was (which probably means that I slept through some
> fundamental algorithms class in University ;)

Well, it seems like a long time, doesn't it? But I always liked the
Fibonacci numbers, and therefore the Fibonacci heap.

---

Make hashtable resizing more robust AKA do not resize in-place

diff --git a/object.c b/object.c
index c9ca481..94f0f5d 100644
--- a/object.c
+++ b/object.c
@@ -56,18 +56,14 @@ void created_object(const unsigned char
 	if (obj_allocs - 1 <= nr_objs * 2) {
 		int i, count = obj_allocs;
-		obj_allocs = (obj_allocs < 32 ? 32 : 2 * obj_allocs);
-		objs = xrealloc(objs, obj_allocs * sizeof(struct object *));
-		memset(objs + count, 0, (obj_allocs - count)
-				* sizeof(struct object *));
-		for (i = 0; i < obj_allocs; i++)
-			if (objs[i]) {
-				int j = find_object(objs[i]->sha1);
-				if (j != i) {
-					j = -1 - j;
-					objs[j] = objs[i];
-					objs[i] = NULL;
-				}
+		struct object** old_objs = objs;
+		obj_allocs = (obj_allocs < 32 ? 64 : 2 * obj_allocs);
+		objs = xcalloc(obj_allocs, sizeof(struct object *));
+		for (i = 0; i < count; i++)
+			if (old_objs[i]) {
+				/* it is guaranteed to be new */
+				int j = -1 - find_object(old_objs[i]->sha1);
+				objs[j] = old_objs[i];
 			}
 	}
* Re: Fix object re-hashing
From: Linus Torvalds @ 2006-02-13 0:16 UTC
To: Johannes Schindelin
Cc: Junio C Hamano, git

On Mon, 13 Feb 2006, Johannes Schindelin wrote:
>
> Make hashtable resizing more robust AKA do not resize in-place

You forgot to release the old array afterwards.

Anyway, I think the in-place version is fine now, even if it has a few
subtleties. So this isn't needed, but keep it in mind if we find another
bug, or if somebody wants to grow the hash table less aggressively than
by doubling it every time.

		Linus
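For completeness, the resize loop from the patch with the old array
released afterwards might look like the fragment below. It reuses the
declarations from Johannes's patch (objs, obj_allocs, xcalloc(),
find_object()) and is a sketch, not a tested replacement:

		struct object **old_objs = objs;
		int i, old_count = obj_allocs;

		obj_allocs = (obj_allocs < 32 ? 64 : 2 * obj_allocs);
		objs = xcalloc(obj_allocs, sizeof(struct object *));
		for (i = 0; i < old_count; i++)
			if (old_objs[i]) {
				/* guaranteed to be absent from the new table, so
				 * find_object() returns -1-slot of a free slot */
				int j = -1 - find_object(old_objs[i]->sha1);
				objs[j] = old_objs[i];
			}
		free(old_objs);	/* the step the original patch left out;
				 * NULL on the very first growth is fine */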
* Re: Fix object re-hashing
From: Johannes Schindelin @ 2006-02-13 0:31 UTC
To: Linus Torvalds
Cc: Junio C Hamano, git

Hi,

On Sun, 12 Feb 2006, Linus Torvalds wrote:

> On Mon, 13 Feb 2006, Johannes Schindelin wrote:
> >
> > Make hashtable resizing more robust AKA do not resize in-place
>
> You forgot to release the old array afterwards.

D'oh! I am going to bed now.

> Anyway, I think the in-place version is fine now, even if it has a few
> subtleties. So this isn't needed, but keep it in mind if we find another
> bug, or if somebody wants to grow the hash table less aggressively than
> by doubling it every time.

Sounds fine to me.

Good night,
Dscho
* Re: Fix object re-hashing
From: Junio C Hamano @ 2006-02-12 19:13 UTC
To: Linus Torvalds
Cc: git

Linus Torvalds <torvalds@osdl.org> writes:

> On Sun, 12 Feb 2006, Junio C Hamano wrote:
>>
>> This "fix" makes the symptom that made me fire two (maybe three)
>> Grrrrr messages earlier this morning disappear.
>
> Goodie. I assume that was the fixed fix, not my original "edit out the
> useless optimization and then break it totally" fix ;)
>
>> I haven't had my caffeine nor nicotine yet after my short sleep, so I
>> need to take some time understanding your explanation first, but I am
>> reasonably sure this must be it (not that I do not trust you, not at all
>> -- it is that I do not trust *me* applying a patch without understanding
>> it when I have a reproducible bug).

Your explanation finally made sense to me, without caffeine nor nicotine
yet, but only when I tried to do an illustration.

If the initial obj_allocs were 4 instead of 32, we may have something
like this before rehashing:

	slot	value
	   0	3
	   1	-
	   2	-
	   3	7

Rehash to double the hash goes like this:

		step1	step2	step3	fixup
					rehash
		enlarge	rehash	rehash	missing from
		array	"3%8"	"7%8"	the original
	0	3	-	-	-
	1	-	-	-	-
	2	-	-	-	-
	3	7	7	-	3
	4	-	3	3	-
	5	-	-	-	-
	6	-	-	-	-
	7	-	-	7	7

We cannot find "3%8" without the fix.

Thanks for the fix. Will do an updated "master" soon.
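The illustration can also be reproduced by a small standalone program;
the toy table below stores plain ints instead of struct object pointers,
with -1 marking an empty slot, but the probing and the buggy
old-count-only rehash have the same shape:

	#include <stdio.h>

	#define OLD 4
	#define NEW 8
	#define EMPTY (-1)

	static int probe(const int *table, int size, int value)
	{
		int i = value % size;

		while (table[i] != EMPTY && table[i] != value)
			i = (i + 1) % size;
		return i;	/* slot holding value, or the empty slot to use */
	}

	int main(void)
	{
		/* After enlarging: 3 and 7 sit where the 4-slot table left them. */
		int table[NEW] = { 3, EMPTY, EMPTY, 7, EMPTY, EMPTY, EMPTY, EMPTY };
		int i;

		/* Buggy rehash: only walk the old count. */
		for (i = 0; i < OLD; i++) {
			int j;

			if (table[i] == EMPTY)
				continue;
			j = probe(table, NEW, table[i]);
			if (j != i) {
				table[j] = table[i];
				table[i] = EMPTY;
			}
		}

		for (i = 0; i < NEW; i++)
			printf("slot %d: %d\n", i, table[i]);

		/* 3 is stranded in slot 4 while slot 3 is now empty, so probing
		 * for 3 stops at the empty slot 3 and never sees it. */
		i = probe(table, NEW, 3);
		printf("lookup of 3 stops at slot %d, which holds %d\n", i, table[i]);
		return 0;
	}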
* Re: Fix object re-hashing
From: Linus Torvalds @ 2006-02-12 18:16 UTC
To: Junio C Hamano, Git Mailing List, Johannes Schindelin

On Sun, 12 Feb 2006, Linus Torvalds wrote:
>
> -	for (i = 0; i < count; i++)
> +	for (i = 0; obj_allocs ; i++)

GAAH. That should obviously be "i < obj_allocs".

That's what I get for editing the patch in-place to remove the optimized
version that I felt wasn't worth worrying about due to being subtle. So
instead I sent out a patch that was not-so-subtly obvious crap!

Sorry about that.

		Linus
* Re: Fix object re-hashing
From: Linus Torvalds @ 2006-02-12 18:18 UTC
To: Junio C Hamano, Git Mailing List, Johannes Schindelin

On Sun, 12 Feb 2006, Linus Torvalds wrote:
>
> That's what I get for editing the patch in-place to remove the optimized
> version that I felt wasn't worth worrying about due to being subtle. So
> instead I sent out a patch that was not-so-subtly obvious crap!

Btw: the reason I edited out the optimization is that it doesn't actually
matter. Re-hashing the whole thing is a trivial thing, and has basically
zero overhead in my testing. The costs are all elsewhere now.

		Linus