git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* bug: git-repack -a -d produces broken pack on NFS
@ 2006-04-27 21:32 Alex Riesen
  2006-04-27 22:11 ` Linus Torvalds
  2006-04-27 23:54 ` Linus Torvalds
  0 siblings, 2 replies; 9+ messages in thread
From: Alex Riesen @ 2006-04-27 21:32 UTC (permalink / raw)
  To: git

NFS server: 2.6.15
Client: 2.6.17-rc2
mount options: tigra:/home /net/home nfs rw,nosuid,nodev,noatime,vers=3,rsize=8192,wsize=32768,hard,intr,proto=udp,timeo=7,retrans=3,addr=tigra 0 0

Repack protocol ($SRC is /net/home/src):

$SRC/linux.git$ git repack -a -d
Generating pack...
Done counting 235947 objects.
Deltifying 235947 objects.
 100% (235947/235947) done
Writing 235947 objects.
 100% (235947/235947) done
Total 235947, written 235947 (delta 182131), reused 235466 (delta 181650)
Pack pack-6dcda5a7782864d57ec44bd30ebec13b07df2c87 created.
$SRC/linux.git$ git fsck-objects --full
git-fsck-objects: error: Packfile .git/objects/pack/pack-6dcda5a7782864d57ec44bd
30ebec13b07df2c87.pack SHA1 mismatch with idx
git-fsck-objects: fatal: delta data unpack failed
$SRC/linux.git$ git fsck-objects --full
git-fsck-objects: error: Packfile .git/objects/pack/pack-6dcda5a7782864d57ec44bd
30ebec13b07df2c87.pack SHA1 mismatch with idx
git-fsck-objects: fatal: delta data unpack failed
$SRC/linux.git$ du -sh .
124M    .
$SRC/linux.git$ cp . -a $BIG/linux.git
$SRC/linux.git$ cd $BIG/linux.git
$BIG/linux.git$ git fsck-objects --full
git-fsck-objects: error: Packfile .git/objects/pack/pack-6dcda5a7782864d57ec44bd
30ebec13b07df2c87.pack SHA1 mismatch with idx
git-fsck-objects: fatal: delta data unpack failed
$BIG/linux.git$ git clone -n . ../tmp
Generating pack...
Done counting 235947 objects.
Deltifying 235947 objects.
 100% (235947/235947) done
Total 235947, written 235947 (delta 182131), reused 235947 (delta 182131)

git-index-pack: fatal: packfile '/mnt/large/tmp/raa/tmp/.git/objects/pack/tmp-wc
Rvk5': bad object at offset 102601801: inflate returned -3
git-fetch-pack: error: git-fetch-pack: unable to read from git-index-pack
git-fetch-pack: error: git-index-pack died with error code 128
fetch-pack from '/mnt/large/tmp/raa/linux.git/.git' failed.

So the repository is now _really_ broken.
I didn't noticed when it started to happen, sorry.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: bug: git-repack -a -d produces broken pack on NFS
  2006-04-27 21:32 bug: git-repack -a -d produces broken pack on NFS Alex Riesen
@ 2006-04-27 22:11 ` Linus Torvalds
  2006-04-27 22:17   ` Junio C Hamano
                     ` (2 more replies)
  2006-04-27 23:54 ` Linus Torvalds
  1 sibling, 3 replies; 9+ messages in thread
From: Linus Torvalds @ 2006-04-27 22:11 UTC (permalink / raw)
  To: Alex Riesen, Junio C Hamano; +Cc: Git Mailing List



On Thu, 27 Apr 2006, Alex Riesen wrote:
>
> NFS server: 2.6.15
> Client: 2.6.17-rc2
> mount options: tigra:/home /net/home nfs rw,nosuid,nodev,noatime,vers=3,rsize=8192,wsize=32768,hard,intr,proto=udp,timeo=7,retrans=3,addr=tigra 0 0

It's repeatable? Can you check if it goes away if your remove "intr"?

That said, the pack-file should all be written with the "sha1write()" 
interface, which is very careful indeed.

I wonder if the _pack-file_ itself might be ok, and the problem is an 
index file corruption. For some reason we check the index file first, 
which is insane. We should check that the pack-file matches its _own_ SHA1 
first, and check the index file second.

So the appended patch would be sensible: before we bother to look at the 
index file at all, we should check the pack-file itself.

If it's just the index file that is corrupt, you may even have a chance to 
recover the data.

The index file is also written with sha1write(), though, so I really don't 
see where it would break. Unless you just simply literally have data 
corruption on the server for some strange reason.

		Linus
---
diff --git a/pack-check.c b/pack-check.c
index 84ed90d..e575879 100644
--- a/pack-check.c
+++ b/pack-check.c
@@ -29,12 +29,12 @@ static int verify_packfile(struct packed
 	pack_base = p->pack_base;
 	SHA1_Update(&ctx, pack_base, pack_size - 20);
 	SHA1_Final(sha1, &ctx);
-	if (memcmp(sha1, index_base + index_size - 40, 20))
-		return error("Packfile %s SHA1 mismatch with idx",
-			     p->pack_name);
 	if (memcmp(sha1, pack_base + pack_size - 20, 20))
 		return error("Packfile %s SHA1 mismatch with itself",
 			     p->pack_name);
+	if (memcmp(sha1, index_base + index_size - 40, 20))
+		return error("Packfile %s SHA1 mismatch with idx",
+			     p->pack_name);
 
 	/* Make sure everything reachable from idx is valid.  Since we
 	 * have verified that nr_objects matches between idx and pack,

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: bug: git-repack -a -d produces broken pack on NFS
  2006-04-27 22:11 ` Linus Torvalds
@ 2006-04-27 22:17   ` Junio C Hamano
  2006-04-27 22:29     ` Linus Torvalds
  2006-04-27 22:18   ` Linus Torvalds
  2006-04-28 22:27   ` Alex Riesen
  2 siblings, 1 reply; 9+ messages in thread
From: Junio C Hamano @ 2006-04-27 22:17 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: git

Linus Torvalds <torvalds@osdl.org> writes:

> That said, the pack-file should all be written with the "sha1write()" 
> interface, which is very careful indeed.
>
> I wonder if the _pack-file_ itself might be ok, and the problem is an 
> index file corruption. For some reason we check the index file first, 
> which is insane. We should check that the pack-file matches its _own_ SHA1 
> first, and check the index file second.

We need to check both, so I fail to see why the order matters.

> If it's just the index file that is corrupt, you may even have a chance to 
> recover the data.
>
> The index file is also written with sha1write(), though, so I really don't 
> see where it would break. Unless you just simply literally have data 
> corruption on the server for some strange reason.

I haven't seen this, and was wondering why.

Independently, and probably unrelated, but another person
reported failure while cloning, but the log appeared it had
trouble spawning the git-index-pack executable for some reason.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: bug: git-repack -a -d produces broken pack on NFS
  2006-04-27 22:11 ` Linus Torvalds
  2006-04-27 22:17   ` Junio C Hamano
@ 2006-04-27 22:18   ` Linus Torvalds
  2006-04-28 22:27   ` Alex Riesen
  2 siblings, 0 replies; 9+ messages in thread
From: Linus Torvalds @ 2006-04-27 22:18 UTC (permalink / raw)
  To: Alex Riesen, Junio C Hamano; +Cc: Git Mailing List



On Thu, 27 Apr 2006, Linus Torvalds wrote:
> 
> I wonder if the _pack-file_ itself might be ok, and the problem is an 
> index file corruption.

Hmm. verify_pack() actually checks that the index file matches its own 
SHA1 earlier, so the index file will have passed (my suggested patch is 
still correct: the same way we check the index file internal integrity 
first, we should also check the pack-file internal integrity before we 
bother to cross-check them with each other).

Anyway, the index file SHA1 check means that it's unlikely that the index 
file was corrupt. But it would be interesting to hear if the pack-file was 
internally consistent or not.. (Something that git-pack-check didn't check 
in your case, because it checked the pack-file against the index file data 
first).

		Linus

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: bug: git-repack -a -d produces broken pack on NFS
  2006-04-27 22:17   ` Junio C Hamano
@ 2006-04-27 22:29     ` Linus Torvalds
  2006-04-27 22:44       ` Junio C Hamano
  0 siblings, 1 reply; 9+ messages in thread
From: Linus Torvalds @ 2006-04-27 22:29 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git



On Thu, 27 Apr 2006, Junio C Hamano wrote:

> Linus Torvalds <torvalds@osdl.org> writes:
> 
> > That said, the pack-file should all be written with the "sha1write()" 
> > interface, which is very careful indeed.
> >
> > I wonder if the _pack-file_ itself might be ok, and the problem is an 
> > index file corruption. For some reason we check the index file first, 
> > which is insane. We should check that the pack-file matches its _own_ SHA1 
> > first, and check the index file second.
> 
> We need to check both, so I fail to see why the order matters.

It's insane to do any _cross_-file checking before you've even verified 
that the files themselves are valid.

Another way of saying the same thing: checking whether the SHA1 of the 
pack-file matches the index file is pointless before you've verified that 
the SHA1 itself is valid.

Basically, if the pack-file is corrupt, you want to know that. You don't 
want to know that it's SHA1 doesn't match the index file - that's a 
"secondary" issue to the fact that the SHA1 wasn't correct in the first 
place.

Right now, if the pack-file is corrupt, it doesn't actually tell us so. It 
says that it doesn't match the index file. Which is likely wrong - it 
probably _does_ match the index file, but it's been corrupted.

See the difference?

			Linus

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: bug: git-repack -a -d produces broken pack on NFS
  2006-04-27 22:29     ` Linus Torvalds
@ 2006-04-27 22:44       ` Junio C Hamano
  0 siblings, 0 replies; 9+ messages in thread
From: Junio C Hamano @ 2006-04-27 22:44 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: git

Linus Torvalds <torvalds@osdl.org> writes:

> Right now, if the pack-file is corrupt, it doesn't actually tell us so. It 
> says that it doesn't match the index file. Which is likely wrong - it 
> probably _does_ match the index file, but it's been corrupted.
>
> See the difference?

Makes perfect sense.  Thanks.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: bug: git-repack -a -d produces broken pack on NFS
  2006-04-27 21:32 bug: git-repack -a -d produces broken pack on NFS Alex Riesen
  2006-04-27 22:11 ` Linus Torvalds
@ 2006-04-27 23:54 ` Linus Torvalds
  1 sibling, 0 replies; 9+ messages in thread
From: Linus Torvalds @ 2006-04-27 23:54 UTC (permalink / raw)
  To: Alex Riesen; +Cc: git


Ok, trying to think some more about this..

On Thu, 27 Apr 2006, Alex Riesen wrote:
> 
> $SRC/linux.git$ git repack -a -d
> Generating pack...
> Done counting 235947 objects.
> Deltifying 235947 objects.
>  100% (235947/235947) done
> Writing 235947 objects.
>  100% (235947/235947) done
> Total 235947, written 235947 (delta 182131), reused 235466 (delta 181650)
> Pack pack-6dcda5a7782864d57ec44bd30ebec13b07df2c87 created.
> $SRC/linux.git$ git fsck-objects --full
> git-fsck-objects: error: Packfile .git/objects/pack/pack-6dcda5a7782864d57ec44bd30ebec13b07df2c87.pack SHA1 mismatch with idx

This is interesting on so many levels.

First off, the index file or the pack-file is clearly somehow corrupt, 
because when you then try to do the "git clone" off the result later on 
(which won't actually check the SHA1's), it gets

> git-index-pack: fatal: packfile '/mnt/large/tmp/raa/tmp/.git/objects/pack/tmp-wcRvk5': bad object at offset 102601801: inflate returned -3

which means that either the offset was wrong, or the data at that offset 
was wrong.

That made me suspect the object re-use code - it might have been broken in 
the original pack, and then on re-use the broken data would have been just 
copied over.

HOWEVER - that doesn't actually fly as an explanation, because even if the 
data itself was broken, the repack would have re-generated the SHA1, so if 
the problem had been about copying an already broken pack over, you'd have 
gotten the "git clone" error, but you would _not_ have gotten the "pack 
SHA1 does not match index" error.

So in order for the SHA1 to not match, we literally must have corrupted 
things when we created the pack-file.

However, I've stared and stared at the sha1file writing code, and I don't 
see how you _could_ corrupt it. We use it with interruptible file 
descriptors all the time (sockets - the exact same code is used to 
transfer packs over the network), and that "intr" shouldn't matter one 
whit. We're doing very safe things, as far as I can tell.

The thing is, even if a wild pointer corrupts the write buffer for the 
sha1file writing code somehow, we actually always do the "calculate the 
SHA1" and "flush the buffer to the file" together. So even if somebody 
corrupted the buffer, we'd still generate the "right" SHA1 (of the 
corrupted buffer).

So the only thing that I can see that can generate bad SHA1 checksums is
 - actual problem in the SHA1 buffers themselves (ie a wild pointer 
   corrupting the "SHA1_CTX" thing itself)
 - real filesystem corruption. With NFS, the UDP checksums aren't all that 
   strong, but the ethernet CRC should catch things (there have been 
   reports of network cards that don't check the CRC well, but quite 
   frankly, I haven't seen one in a _loong_ time)
 - RAM corruption and/or kernel NFS bugs.

I'll continue to stare at the code, but I can't see anything even remotely 
suspicious in git itself so far.

		Linus

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: bug: git-repack -a -d produces broken pack on NFS
  2006-04-27 22:11 ` Linus Torvalds
  2006-04-27 22:17   ` Junio C Hamano
  2006-04-27 22:18   ` Linus Torvalds
@ 2006-04-28 22:27   ` Alex Riesen
  2006-04-28 23:18     ` Linus Torvalds
  2 siblings, 1 reply; 9+ messages in thread
From: Alex Riesen @ 2006-04-28 22:27 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Junio C Hamano, Git Mailing List

Linus Torvalds, Fri, Apr 28, 2006 00:11:13 +0200:
> > NFS server: 2.6.15
> > Client: 2.6.17-rc2
> > mount options: tigra:/home /net/home nfs rw,nosuid,nodev,noatime,vers=3,rsize=8192,wsize=32768,hard,intr,proto=udp,timeo=7,retrans=3,addr=tigra 0 0
> 
> It's repeatable? Can you check if it goes away if your remove "intr"?

It does not go away if I remove intr:

    $ grep 'nfs\>' /proc/mounts
    tigra:/home /net/home nfs rw,nosuid,nodev,noatime,vers=3,rsize=8192,wsize=32768,hard,proto=udp,timeo=7,retrans=3,addr=tigra 0 0

And this is really a broken packfile:

    $ git fsck-objects --full
    git-fsck-objects: error: Packfile .git/objects/pack/pack-9021635f04e29bb9f3313a54124f64589eca5764.pack SHA1 mismatch with itself
    git-fsck-objects: fatal: failed to read delta-pack base object a23816d3e9a1684794c8e5a8f1cc0cce26fb61d8

And I actually was kind of sure about the hardware (like in: "it
worked flawlessly for in the past 2 years"). Until looked today in the
logs and saw this:

    Apr 19 11:49:35 tigra kernel: eth1: tx underrun with maximum tx threshold, txcfg 0xd0f0102e.
    Apr 19 11:49:35 tigra kernel: eth1: Link wake-up event 0xffffffff
    Apr 19 11:49:35 tigra kernel: eth1: PCI error 0xf00000

Well, this is actually not _that_ day. And this:

    Apr 28 23:42:19 tigra kernel: eth1: tx underrun with maximum tx threshold, txcfg 0xd0f0102e.

is not exactly the time of most recent test (the one without the "hard"
mount option). But this _is_ that very same interface, and "PCI error"
looks nasty. Ok, looking at the card... Seats kinda skewed in the
slot, pressing on it... Wow! (lights go out):

    Apr 29 00:13:35 tigra kernel: eth1: Link wake-up event 0x00020b
    Apr 29 00:13:35 tigra kernel: eth1: PCI error 0xf00000
    Apr 29 00:13:39 tigra kernel: NETDEV WATCHDOG: eth1: transmit timed out
    Apr 29 00:13:39 tigra kernel: eth1: Transmit timed out, status 0x000000, resetting...
    Apr 29 00:13:39 tigra kernel: eth1: DSPCFG accepted after 0 usec.
    Apr 29 00:13:39 tigra kernel: eth1: Setting full-duplex based on negotiated link capability.

Redoing test... (Two times only, it's late already):

    $SRC/test2.git$ git repack -a -d
    Generating pack...
    Done counting 235775 objects.
    Deltifying 235775 objects.
     100% (235775/235775) done
    Writing 235775 objects.
     100% (235775/235775) done
    Total 235775, written 235775 (delta 181885), reused 223766 (delta 171462)
    Pack pack-9021635f04e29bb9f3313a54124f64589eca5764 created.
    $SRC/test2.git$ git fsck-objects --full
    dangling blob 419301f9bff67932cb9551f2d8436b277a3022b0
    $SRC/test2.git$ git repack -a -d
    Generating pack...
    Done counting 235775 objects.
    Deltifying 235775 objects.
     100% (235775/235775) done
    Writing 235775 objects.
     100% (235775/235775) done
    Total 235775, written 235775 (delta 181958), reused 235702 (delta 181885)
    Pack pack-9021635f04e29bb9f3313a54124f64589eca5764 created.
    $SRC/test2.git$ git fsck-objects --full
    dangling blob 419301f9bff67932cb9551f2d8436b277a3022b0

Hmm... Ok, apologies everyone, I'm just lazy and stupid.

Still, would be nice not to loose a repository just because
user is an idiot.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: bug: git-repack -a -d produces broken pack on NFS
  2006-04-28 22:27   ` Alex Riesen
@ 2006-04-28 23:18     ` Linus Torvalds
  0 siblings, 0 replies; 9+ messages in thread
From: Linus Torvalds @ 2006-04-28 23:18 UTC (permalink / raw)
  To: Alex Riesen; +Cc: Junio C Hamano, Git Mailing List



On Sat, 29 Apr 2006, Alex Riesen wrote:
>
> mount option). But this _is_ that very same interface, and "PCI error"
> looks nasty. Ok, looking at the card... Seats kinda skewed in the
> slot, pressing on it... Wow! (lights go out):
> 
>     Apr 29 00:13:35 tigra kernel: eth1: Link wake-up event 0x00020b
>     Apr 29 00:13:35 tigra kernel: eth1: PCI error 0xf00000
>     Apr 29 00:13:39 tigra kernel: NETDEV WATCHDOG: eth1: transmit timed out
>     Apr 29 00:13:39 tigra kernel: eth1: Transmit timed out, status 0x000000, resetting...
>     Apr 29 00:13:39 tigra kernel: eth1: DSPCFG accepted after 0 usec.
>     Apr 29 00:13:39 tigra kernel: eth1: Setting full-duplex based on negotiated link capability.

Ok, that "PCI error" meaning is not entirely clear, but it sounds like 
noise on the line. The driver just has a comment saying

        /* Hmmmmm, it's not clear how to recover from PCI faults. */

and it's actually pretty possible (and even likely) that whatever 
corrupted the pack-file happened on the _send_ side, so by the time we get 
a PCI error report, there's already a packet out the door that likely has 
corrupted data.

Now, corrupted data would normally be caught by the UDP checksum, but:

 - a lot of modern cards do the IP checksum on the card. Whether the 
   natsemi driver does or not, I have no clue. If it does, it would always 
   make the checksum match the (corrupted) data.

 - the IP-level checksums are really quite weak. Realistically, you really 
   really _really_ want to have link-layer checksums working, but since 
   the link-level checksum is _always_ computed by the card (and since a 
   PCI error would have corrupted the packet data that the card saw), 
   link-level checksums will always have the same error that the data got, 
   and pass.

   So even if we did the UDP checksum in software, it has a reasonable 
   chance of not triggering. It's just 16 bits, and not a _good_ 16 bits 
   at that.

Anyway, definitely looks hw-induced.

> Hmm... Ok, apologies everyone, I'm just lazy and stupid.

Having flaky hw is just unhappy. I'm glad git caught it quickly.

> Still, would be nice not to loose a repository just because
> user is an idiot.

Well, there's two sides to this:

 - packs are very much designed to be very very dense. That means that 
   even the _slightest_ corruption will likely totally destroy them. Even 
   a single-bit flip likely causes massive damage to an archive.

   This is a direct and unavoidable consequence of compression and 
   density.  A less dense format automatically has a lot more redundant 
   data, and the less dense it is, the more likely you can recover from 
   single-bit errors (or even worse ones).

   So this is the downside to compression and packing. All forms of 
   compression (and we do both traditional stream-compression with zlib 
   and aggressive delta compression between objects) inevitably make the 
   effects of corruption much worse, and much harder to recover from.

   That's the bad part.

 - The good part is that replication is obviously trivial, and git is 
   inherently very good at making incremental backups.

Now, we could try to have something that tries to recover as much of a 
corrupted pack-file as possible. Right now, git-unpack-objects just dies 
(as early as possible) if the pack is corrupt. Having some mode where it 
tries to recover from errors and continue would probably be a good 
debugging and recovery tool.

		Linus

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2006-04-28 23:19 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2006-04-27 21:32 bug: git-repack -a -d produces broken pack on NFS Alex Riesen
2006-04-27 22:11 ` Linus Torvalds
2006-04-27 22:17   ` Junio C Hamano
2006-04-27 22:29     ` Linus Torvalds
2006-04-27 22:44       ` Junio C Hamano
2006-04-27 22:18   ` Linus Torvalds
2006-04-28 22:27   ` Alex Riesen
2006-04-28 23:18     ` Linus Torvalds
2006-04-27 23:54 ` Linus Torvalds

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).