* bug: git-repack -a -d produces broken pack on NFS @ 2006-04-27 21:32 Alex Riesen 2006-04-27 22:11 ` Linus Torvalds 2006-04-27 23:54 ` Linus Torvalds 0 siblings, 2 replies; 9+ messages in thread From: Alex Riesen @ 2006-04-27 21:32 UTC (permalink / raw) To: git NFS server: 2.6.15 Client: 2.6.17-rc2 mount options: tigra:/home /net/home nfs rw,nosuid,nodev,noatime,vers=3,rsize=8192,wsize=32768,hard,intr,proto=udp,timeo=7,retrans=3,addr=tigra 0 0 Repack protocol ($SRC is /net/home/src): $SRC/linux.git$ git repack -a -d Generating pack... Done counting 235947 objects. Deltifying 235947 objects. 100% (235947/235947) done Writing 235947 objects. 100% (235947/235947) done Total 235947, written 235947 (delta 182131), reused 235466 (delta 181650) Pack pack-6dcda5a7782864d57ec44bd30ebec13b07df2c87 created. $SRC/linux.git$ git fsck-objects --full git-fsck-objects: error: Packfile .git/objects/pack/pack-6dcda5a7782864d57ec44bd 30ebec13b07df2c87.pack SHA1 mismatch with idx git-fsck-objects: fatal: delta data unpack failed $SRC/linux.git$ git fsck-objects --full git-fsck-objects: error: Packfile .git/objects/pack/pack-6dcda5a7782864d57ec44bd 30ebec13b07df2c87.pack SHA1 mismatch with idx git-fsck-objects: fatal: delta data unpack failed $SRC/linux.git$ du -sh . 124M . $SRC/linux.git$ cp . -a $BIG/linux.git $SRC/linux.git$ cd $BIG/linux.git $BIG/linux.git$ git fsck-objects --full git-fsck-objects: error: Packfile .git/objects/pack/pack-6dcda5a7782864d57ec44bd 30ebec13b07df2c87.pack SHA1 mismatch with idx git-fsck-objects: fatal: delta data unpack failed $BIG/linux.git$ git clone -n . ../tmp Generating pack... Done counting 235947 objects. Deltifying 235947 objects. 100% (235947/235947) done Total 235947, written 235947 (delta 182131), reused 235947 (delta 182131) git-index-pack: fatal: packfile '/mnt/large/tmp/raa/tmp/.git/objects/pack/tmp-wc Rvk5': bad object at offset 102601801: inflate returned -3 git-fetch-pack: error: git-fetch-pack: unable to read from git-index-pack git-fetch-pack: error: git-index-pack died with error code 128 fetch-pack from '/mnt/large/tmp/raa/linux.git/.git' failed. So the repository is now _really_ broken. I didn't noticed when it started to happen, sorry. ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: bug: git-repack -a -d produces broken pack on NFS 2006-04-27 21:32 bug: git-repack -a -d produces broken pack on NFS Alex Riesen @ 2006-04-27 22:11 ` Linus Torvalds 2006-04-27 22:17 ` Junio C Hamano ` (2 more replies) 2006-04-27 23:54 ` Linus Torvalds 1 sibling, 3 replies; 9+ messages in thread From: Linus Torvalds @ 2006-04-27 22:11 UTC (permalink / raw) To: Alex Riesen, Junio C Hamano; +Cc: Git Mailing List On Thu, 27 Apr 2006, Alex Riesen wrote: > > NFS server: 2.6.15 > Client: 2.6.17-rc2 > mount options: tigra:/home /net/home nfs rw,nosuid,nodev,noatime,vers=3,rsize=8192,wsize=32768,hard,intr,proto=udp,timeo=7,retrans=3,addr=tigra 0 0 It's repeatable? Can you check if it goes away if your remove "intr"? That said, the pack-file should all be written with the "sha1write()" interface, which is very careful indeed. I wonder if the _pack-file_ itself might be ok, and the problem is an index file corruption. For some reason we check the index file first, which is insane. We should check that the pack-file matches its _own_ SHA1 first, and check the index file second. So the appended patch would be sensible: before we bother to look at the index file at all, we should check the pack-file itself. If it's just the index file that is corrupt, you may even have a chance to recover the data. The index file is also written with sha1write(), though, so I really don't see where it would break. Unless you just simply literally have data corruption on the server for some strange reason. Linus --- diff --git a/pack-check.c b/pack-check.c index 84ed90d..e575879 100644 --- a/pack-check.c +++ b/pack-check.c @@ -29,12 +29,12 @@ static int verify_packfile(struct packed pack_base = p->pack_base; SHA1_Update(&ctx, pack_base, pack_size - 20); SHA1_Final(sha1, &ctx); - if (memcmp(sha1, index_base + index_size - 40, 20)) - return error("Packfile %s SHA1 mismatch with idx", - p->pack_name); if (memcmp(sha1, pack_base + pack_size - 20, 20)) return error("Packfile %s SHA1 mismatch with itself", p->pack_name); + if (memcmp(sha1, index_base + index_size - 40, 20)) + return error("Packfile %s SHA1 mismatch with idx", + p->pack_name); /* Make sure everything reachable from idx is valid. Since we * have verified that nr_objects matches between idx and pack, ^ permalink raw reply related [flat|nested] 9+ messages in thread
* Re: bug: git-repack -a -d produces broken pack on NFS 2006-04-27 22:11 ` Linus Torvalds @ 2006-04-27 22:17 ` Junio C Hamano 2006-04-27 22:29 ` Linus Torvalds 2006-04-27 22:18 ` Linus Torvalds 2006-04-28 22:27 ` Alex Riesen 2 siblings, 1 reply; 9+ messages in thread From: Junio C Hamano @ 2006-04-27 22:17 UTC (permalink / raw) To: Linus Torvalds; +Cc: git Linus Torvalds <torvalds@osdl.org> writes: > That said, the pack-file should all be written with the "sha1write()" > interface, which is very careful indeed. > > I wonder if the _pack-file_ itself might be ok, and the problem is an > index file corruption. For some reason we check the index file first, > which is insane. We should check that the pack-file matches its _own_ SHA1 > first, and check the index file second. We need to check both, so I fail to see why the order matters. > If it's just the index file that is corrupt, you may even have a chance to > recover the data. > > The index file is also written with sha1write(), though, so I really don't > see where it would break. Unless you just simply literally have data > corruption on the server for some strange reason. I haven't seen this, and was wondering why. Independently, and probably unrelated, but another person reported failure while cloning, but the log appeared it had trouble spawning the git-index-pack executable for some reason. ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: bug: git-repack -a -d produces broken pack on NFS 2006-04-27 22:17 ` Junio C Hamano @ 2006-04-27 22:29 ` Linus Torvalds 2006-04-27 22:44 ` Junio C Hamano 0 siblings, 1 reply; 9+ messages in thread From: Linus Torvalds @ 2006-04-27 22:29 UTC (permalink / raw) To: Junio C Hamano; +Cc: git On Thu, 27 Apr 2006, Junio C Hamano wrote: > Linus Torvalds <torvalds@osdl.org> writes: > > > That said, the pack-file should all be written with the "sha1write()" > > interface, which is very careful indeed. > > > > I wonder if the _pack-file_ itself might be ok, and the problem is an > > index file corruption. For some reason we check the index file first, > > which is insane. We should check that the pack-file matches its _own_ SHA1 > > first, and check the index file second. > > We need to check both, so I fail to see why the order matters. It's insane to do any _cross_-file checking before you've even verified that the files themselves are valid. Another way of saying the same thing: checking whether the SHA1 of the pack-file matches the index file is pointless before you've verified that the SHA1 itself is valid. Basically, if the pack-file is corrupt, you want to know that. You don't want to know that it's SHA1 doesn't match the index file - that's a "secondary" issue to the fact that the SHA1 wasn't correct in the first place. Right now, if the pack-file is corrupt, it doesn't actually tell us so. It says that it doesn't match the index file. Which is likely wrong - it probably _does_ match the index file, but it's been corrupted. See the difference? Linus ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: bug: git-repack -a -d produces broken pack on NFS 2006-04-27 22:29 ` Linus Torvalds @ 2006-04-27 22:44 ` Junio C Hamano 0 siblings, 0 replies; 9+ messages in thread From: Junio C Hamano @ 2006-04-27 22:44 UTC (permalink / raw) To: Linus Torvalds; +Cc: git Linus Torvalds <torvalds@osdl.org> writes: > Right now, if the pack-file is corrupt, it doesn't actually tell us so. It > says that it doesn't match the index file. Which is likely wrong - it > probably _does_ match the index file, but it's been corrupted. > > See the difference? Makes perfect sense. Thanks. ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: bug: git-repack -a -d produces broken pack on NFS 2006-04-27 22:11 ` Linus Torvalds 2006-04-27 22:17 ` Junio C Hamano @ 2006-04-27 22:18 ` Linus Torvalds 2006-04-28 22:27 ` Alex Riesen 2 siblings, 0 replies; 9+ messages in thread From: Linus Torvalds @ 2006-04-27 22:18 UTC (permalink / raw) To: Alex Riesen, Junio C Hamano; +Cc: Git Mailing List On Thu, 27 Apr 2006, Linus Torvalds wrote: > > I wonder if the _pack-file_ itself might be ok, and the problem is an > index file corruption. Hmm. verify_pack() actually checks that the index file matches its own SHA1 earlier, so the index file will have passed (my suggested patch is still correct: the same way we check the index file internal integrity first, we should also check the pack-file internal integrity before we bother to cross-check them with each other). Anyway, the index file SHA1 check means that it's unlikely that the index file was corrupt. But it would be interesting to hear if the pack-file was internally consistent or not.. (Something that git-pack-check didn't check in your case, because it checked the pack-file against the index file data first). Linus ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: bug: git-repack -a -d produces broken pack on NFS 2006-04-27 22:11 ` Linus Torvalds 2006-04-27 22:17 ` Junio C Hamano 2006-04-27 22:18 ` Linus Torvalds @ 2006-04-28 22:27 ` Alex Riesen 2006-04-28 23:18 ` Linus Torvalds 2 siblings, 1 reply; 9+ messages in thread From: Alex Riesen @ 2006-04-28 22:27 UTC (permalink / raw) To: Linus Torvalds; +Cc: Junio C Hamano, Git Mailing List Linus Torvalds, Fri, Apr 28, 2006 00:11:13 +0200: > > NFS server: 2.6.15 > > Client: 2.6.17-rc2 > > mount options: tigra:/home /net/home nfs rw,nosuid,nodev,noatime,vers=3,rsize=8192,wsize=32768,hard,intr,proto=udp,timeo=7,retrans=3,addr=tigra 0 0 > > It's repeatable? Can you check if it goes away if your remove "intr"? It does not go away if I remove intr: $ grep 'nfs\>' /proc/mounts tigra:/home /net/home nfs rw,nosuid,nodev,noatime,vers=3,rsize=8192,wsize=32768,hard,proto=udp,timeo=7,retrans=3,addr=tigra 0 0 And this is really a broken packfile: $ git fsck-objects --full git-fsck-objects: error: Packfile .git/objects/pack/pack-9021635f04e29bb9f3313a54124f64589eca5764.pack SHA1 mismatch with itself git-fsck-objects: fatal: failed to read delta-pack base object a23816d3e9a1684794c8e5a8f1cc0cce26fb61d8 And I actually was kind of sure about the hardware (like in: "it worked flawlessly for in the past 2 years"). Until looked today in the logs and saw this: Apr 19 11:49:35 tigra kernel: eth1: tx underrun with maximum tx threshold, txcfg 0xd0f0102e. Apr 19 11:49:35 tigra kernel: eth1: Link wake-up event 0xffffffff Apr 19 11:49:35 tigra kernel: eth1: PCI error 0xf00000 Well, this is actually not _that_ day. And this: Apr 28 23:42:19 tigra kernel: eth1: tx underrun with maximum tx threshold, txcfg 0xd0f0102e. is not exactly the time of most recent test (the one without the "hard" mount option). But this _is_ that very same interface, and "PCI error" looks nasty. Ok, looking at the card... Seats kinda skewed in the slot, pressing on it... Wow! (lights go out): Apr 29 00:13:35 tigra kernel: eth1: Link wake-up event 0x00020b Apr 29 00:13:35 tigra kernel: eth1: PCI error 0xf00000 Apr 29 00:13:39 tigra kernel: NETDEV WATCHDOG: eth1: transmit timed out Apr 29 00:13:39 tigra kernel: eth1: Transmit timed out, status 0x000000, resetting... Apr 29 00:13:39 tigra kernel: eth1: DSPCFG accepted after 0 usec. Apr 29 00:13:39 tigra kernel: eth1: Setting full-duplex based on negotiated link capability. Redoing test... (Two times only, it's late already): $SRC/test2.git$ git repack -a -d Generating pack... Done counting 235775 objects. Deltifying 235775 objects. 100% (235775/235775) done Writing 235775 objects. 100% (235775/235775) done Total 235775, written 235775 (delta 181885), reused 223766 (delta 171462) Pack pack-9021635f04e29bb9f3313a54124f64589eca5764 created. $SRC/test2.git$ git fsck-objects --full dangling blob 419301f9bff67932cb9551f2d8436b277a3022b0 $SRC/test2.git$ git repack -a -d Generating pack... Done counting 235775 objects. Deltifying 235775 objects. 100% (235775/235775) done Writing 235775 objects. 100% (235775/235775) done Total 235775, written 235775 (delta 181958), reused 235702 (delta 181885) Pack pack-9021635f04e29bb9f3313a54124f64589eca5764 created. $SRC/test2.git$ git fsck-objects --full dangling blob 419301f9bff67932cb9551f2d8436b277a3022b0 Hmm... Ok, apologies everyone, I'm just lazy and stupid. Still, would be nice not to loose a repository just because user is an idiot. ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: bug: git-repack -a -d produces broken pack on NFS 2006-04-28 22:27 ` Alex Riesen @ 2006-04-28 23:18 ` Linus Torvalds 0 siblings, 0 replies; 9+ messages in thread From: Linus Torvalds @ 2006-04-28 23:18 UTC (permalink / raw) To: Alex Riesen; +Cc: Junio C Hamano, Git Mailing List On Sat, 29 Apr 2006, Alex Riesen wrote: > > mount option). But this _is_ that very same interface, and "PCI error" > looks nasty. Ok, looking at the card... Seats kinda skewed in the > slot, pressing on it... Wow! (lights go out): > > Apr 29 00:13:35 tigra kernel: eth1: Link wake-up event 0x00020b > Apr 29 00:13:35 tigra kernel: eth1: PCI error 0xf00000 > Apr 29 00:13:39 tigra kernel: NETDEV WATCHDOG: eth1: transmit timed out > Apr 29 00:13:39 tigra kernel: eth1: Transmit timed out, status 0x000000, resetting... > Apr 29 00:13:39 tigra kernel: eth1: DSPCFG accepted after 0 usec. > Apr 29 00:13:39 tigra kernel: eth1: Setting full-duplex based on negotiated link capability. Ok, that "PCI error" meaning is not entirely clear, but it sounds like noise on the line. The driver just has a comment saying /* Hmmmmm, it's not clear how to recover from PCI faults. */ and it's actually pretty possible (and even likely) that whatever corrupted the pack-file happened on the _send_ side, so by the time we get a PCI error report, there's already a packet out the door that likely has corrupted data. Now, corrupted data would normally be caught by the UDP checksum, but: - a lot of modern cards do the IP checksum on the card. Whether the natsemi driver does or not, I have no clue. If it does, it would always make the checksum match the (corrupted) data. - the IP-level checksums are really quite weak. Realistically, you really really _really_ want to have link-layer checksums working, but since the link-level checksum is _always_ computed by the card (and since a PCI error would have corrupted the packet data that the card saw), link-level checksums will always have the same error that the data got, and pass. So even if we did the UDP checksum in software, it has a reasonable chance of not triggering. It's just 16 bits, and not a _good_ 16 bits at that. Anyway, definitely looks hw-induced. > Hmm... Ok, apologies everyone, I'm just lazy and stupid. Having flaky hw is just unhappy. I'm glad git caught it quickly. > Still, would be nice not to loose a repository just because > user is an idiot. Well, there's two sides to this: - packs are very much designed to be very very dense. That means that even the _slightest_ corruption will likely totally destroy them. Even a single-bit flip likely causes massive damage to an archive. This is a direct and unavoidable consequence of compression and density. A less dense format automatically has a lot more redundant data, and the less dense it is, the more likely you can recover from single-bit errors (or even worse ones). So this is the downside to compression and packing. All forms of compression (and we do both traditional stream-compression with zlib and aggressive delta compression between objects) inevitably make the effects of corruption much worse, and much harder to recover from. That's the bad part. - The good part is that replication is obviously trivial, and git is inherently very good at making incremental backups. Now, we could try to have something that tries to recover as much of a corrupted pack-file as possible. Right now, git-unpack-objects just dies (as early as possible) if the pack is corrupt. Having some mode where it tries to recover from errors and continue would probably be a good debugging and recovery tool. Linus ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: bug: git-repack -a -d produces broken pack on NFS 2006-04-27 21:32 bug: git-repack -a -d produces broken pack on NFS Alex Riesen 2006-04-27 22:11 ` Linus Torvalds @ 2006-04-27 23:54 ` Linus Torvalds 1 sibling, 0 replies; 9+ messages in thread From: Linus Torvalds @ 2006-04-27 23:54 UTC (permalink / raw) To: Alex Riesen; +Cc: git Ok, trying to think some more about this.. On Thu, 27 Apr 2006, Alex Riesen wrote: > > $SRC/linux.git$ git repack -a -d > Generating pack... > Done counting 235947 objects. > Deltifying 235947 objects. > 100% (235947/235947) done > Writing 235947 objects. > 100% (235947/235947) done > Total 235947, written 235947 (delta 182131), reused 235466 (delta 181650) > Pack pack-6dcda5a7782864d57ec44bd30ebec13b07df2c87 created. > $SRC/linux.git$ git fsck-objects --full > git-fsck-objects: error: Packfile .git/objects/pack/pack-6dcda5a7782864d57ec44bd30ebec13b07df2c87.pack SHA1 mismatch with idx This is interesting on so many levels. First off, the index file or the pack-file is clearly somehow corrupt, because when you then try to do the "git clone" off the result later on (which won't actually check the SHA1's), it gets > git-index-pack: fatal: packfile '/mnt/large/tmp/raa/tmp/.git/objects/pack/tmp-wcRvk5': bad object at offset 102601801: inflate returned -3 which means that either the offset was wrong, or the data at that offset was wrong. That made me suspect the object re-use code - it might have been broken in the original pack, and then on re-use the broken data would have been just copied over. HOWEVER - that doesn't actually fly as an explanation, because even if the data itself was broken, the repack would have re-generated the SHA1, so if the problem had been about copying an already broken pack over, you'd have gotten the "git clone" error, but you would _not_ have gotten the "pack SHA1 does not match index" error. So in order for the SHA1 to not match, we literally must have corrupted things when we created the pack-file. However, I've stared and stared at the sha1file writing code, and I don't see how you _could_ corrupt it. We use it with interruptible file descriptors all the time (sockets - the exact same code is used to transfer packs over the network), and that "intr" shouldn't matter one whit. We're doing very safe things, as far as I can tell. The thing is, even if a wild pointer corrupts the write buffer for the sha1file writing code somehow, we actually always do the "calculate the SHA1" and "flush the buffer to the file" together. So even if somebody corrupted the buffer, we'd still generate the "right" SHA1 (of the corrupted buffer). So the only thing that I can see that can generate bad SHA1 checksums is - actual problem in the SHA1 buffers themselves (ie a wild pointer corrupting the "SHA1_CTX" thing itself) - real filesystem corruption. With NFS, the UDP checksums aren't all that strong, but the ethernet CRC should catch things (there have been reports of network cards that don't check the CRC well, but quite frankly, I haven't seen one in a _loong_ time) - RAM corruption and/or kernel NFS bugs. I'll continue to stare at the code, but I can't see anything even remotely suspicious in git itself so far. Linus ^ permalink raw reply [flat|nested] 9+ messages in thread
end of thread, other threads:[~2006-04-28 23:19 UTC | newest] Thread overview: 9+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2006-04-27 21:32 bug: git-repack -a -d produces broken pack on NFS Alex Riesen 2006-04-27 22:11 ` Linus Torvalds 2006-04-27 22:17 ` Junio C Hamano 2006-04-27 22:29 ` Linus Torvalds 2006-04-27 22:44 ` Junio C Hamano 2006-04-27 22:18 ` Linus Torvalds 2006-04-28 22:27 ` Alex Riesen 2006-04-28 23:18 ` Linus Torvalds 2006-04-27 23:54 ` Linus Torvalds
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).