* Superblock checksum problems
@ 2006-09-02 22:36 Josh Litherland
2006-09-04 5:12 ` Neil Brown
0 siblings, 1 reply; 14+ messages in thread
From: Josh Litherland @ 2006-09-02 22:36 UTC (permalink / raw)
To: linux-raid
Attempting to build a new raid5 md array across 4 hard drives. At the
exact moment that the drive finishes rebuilding, the superblock checksum
changes to an invalid value. During the rebuild, mdadm -E for the 4
drives shows:
Checksum : 70c0863a - correct
Checksum : 70c0864c - correct
Checksum : 70c0865e - correct
Checksum : 70c08670 - correct
The instant it finishes:
Checksum : 70c0bc47 - expected 70c0bc27
Checksum : 70c0bc59 - expected 70c0bc39
Checksum : 70c0bc6b - expected 70c0bc4b
Checksum : 70c0bc7d - expected 70c0bc5d
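One way to look for a pattern in the mismatches above is to subtract each expected value from the stored one. The following is a minimal sketch, not part of mdadm: the function name `csum_deltas` is made up for illustration, and it assumes input lines shaped like the `mdadm -E` output quoted in this message.

```shell
# Read "mdadm -E"-style Checksum lines on stdin; for each mismatch print
# the stored value, the expected value, and stored minus expected (hex).
# csum_deltas is a hypothetical helper name, not an mdadm feature.
csum_deltas() {
    while read -r label sep1 stored sep2 word expected; do
        [ "$label" = "Checksum" ] || continue
        [ "$word" = "expected" ] || continue      # skip "correct" lines
        printf '%s %s 0x%x\n' "$stored" "$expected" \
            "$(( (0x$stored - 0x$expected) & 0xffffffff ))"
    done
}
```

Fed the four mismatching lines above, this reports a delta of 0x20 for every drive. If the checksum is additive, a constant small delta like that would be consistent with one word of the summed data differing by a single bit, though that is only one possible reading.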
Kernel is 2.6.17.11.
Controller is 4-port silicon image:
0000:00:08.0 RAID bus controller: Silicon Image, Inc. SiI 3114
[SATALink/SATARaid] Serial ATA Controller (rev 02)
Drives are seagate 250G sata ST3250824AS
--
Josh Litherland (josh@temp123.org)
* Re: Superblock checksum problems
2006-09-02 22:36 Superblock checksum problems Josh Litherland
@ 2006-09-04 5:12 ` Neil Brown
2006-09-04 16:00 ` Josh Litherland
0 siblings, 1 reply; 14+ messages in thread
From: Neil Brown @ 2006-09-04 5:12 UTC (permalink / raw)
To: josh; +Cc: linux-raid
On Saturday September 2, josh@temp123.org wrote:
>
> Attempting to build a new raid5 md array across 4 hard drives. At the
> exact moment that the drive finishes rebuilding, the superblock checksum
> changes to an invalid value. During the rebuild, mdadm -E for the 4
> drives shows:
>
> Checksum : 70c0863a - correct
> Checksum : 70c0864c - correct
> Checksum : 70c0865e - correct
> Checksum : 70c08670 - correct
>
> The instant it finishes:
>
> Checksum : 70c0bc47 - expected 70c0bc27
> Checksum : 70c0bc59 - expected 70c0bc39
> Checksum : 70c0bc6b - expected 70c0bc4b
> Checksum : 70c0bc7d - expected 70c0bc5d
>
> Kernel is 2.6.17.11.
Very odd.
Is this repeatable?
Does the checksum stay wrong?
i.e. if you run 'mdadm -E' again, a few seconds later, does it still
report the wrong number?
If you stop the array and run 'mdadm -E' on the drives, what values do
they have now?
NeilBrown
* Re: Superblock checksum problems
2006-09-04 5:12 ` Neil Brown
@ 2006-09-04 16:00 ` Josh Litherland
2006-09-04 20:55 ` Josh Litherland
0 siblings, 1 reply; 14+ messages in thread
From: Josh Litherland @ 2006-09-04 16:00 UTC (permalink / raw)
To: Neil Brown; +Cc: linux-raid
On Mon, 2006-09-04 at 15:12 +1000, Neil Brown wrote:
> Is this repeatable?
100% repeatable.
> Does the checksum stay wrong?
Yes, once they have changed to the bad value, they don't change again as
far as I've seen (I've done several trials over the past few days).
> If you stop the array and run 'mdadm -E' on the drives, what values do
> they have now?
I'll test to see if they actually change values, but I can say for
certain that the checksums are still invalid, i.e. once I stop the array
I have to assemble it with -U resync to get it back online (and of
course it rebuilds).
I'm starting to suspect the card's BIOS; it might either be bad or be
some version which interacts strangely with the sata_sil driver. I am
going to hunt for a flash update for it soon.
Thanks for the reply!
--
Josh Litherland (josh@temp123.org)
* Re: Superblock checksum problems
2006-09-04 16:00 ` Josh Litherland
@ 2006-09-04 20:55 ` Josh Litherland
2006-09-04 21:13 ` Josh Litherland
0 siblings, 1 reply; 14+ messages in thread
From: Josh Litherland @ 2006-09-04 20:55 UTC (permalink / raw)
To: Neil Brown; +Cc: linux-raid
On Mon, 2006-09-04 at 12:00 -0400, Josh Litherland wrote:
> I'll test to see if they actually change values, but I can say for
> certain that the checksums are still invalid, i.e. once I stop the array
> I have to assemble it with -U resync to get it back online (and of
> course it rebuilds).
Some real strangeness here... So, while the array was up and running
(after mdadm -A -f -U resync ...) I checked the checksums:
Checksum : 70c30de5 - expected 70c30db5
Checksum : 70c30df7 - expected 70c30dd7
Checksum : 70c30e09 - expected 70c30de9
Checksum : 70c30e1b - expected 70c30deb
Then I unmounted and checked again... here's where things get weird:
Checksum : 70c352e8 - correct
Checksum : 70c352fa - expected 70c352da
Checksum : 70c3530c - expected 70c352ec
Checksum : 70c3531e - correct
Went ahead and issued mdadm -S:
Checksum : 70c352e8 - expected 70c352c8
Checksum : 70c352fa - expected 70c352da
Checksum : 70c3530c - expected 70c352ec
Checksum : 70c3531e - expected 70c352fe
Now utterly perplexed, I went ahead and ran mdadm -E several more
times. The actual stored checksum value isn't changing, but the output
keeps switching between "expected <something else>" and "correct"... on
ALL 4 drives, at different times.
Anybody have a clue what's going on here? How does mdadm (or the
kernel, for that matter) decide what the checksum SHOULD be? I'll
code-dive to see if I can answer that myself, but if anyone knows, I'd
appreciate a pointer.
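As a rough illustration of where an "expected" value usually comes from: the checker re-reads the superblock, recomputes a checksum over it with the stored checksum field treated as zero, and compares the result with the stored field. The sketch below assumes a plain additive checksum over 32-bit words with carries folded back in; the name `sum_words` and its SKIP argument are hypothetical, and the real md field layout and algorithm should be taken from the mdadm and kernel sources rather than from this example.

```shell
# sum_words FILE SKIP: add up FILE's 32-bit words (host byte order),
# treating the word at 0-based index SKIP as zero, fold any carries back
# into 32 bits, and print the result as 8 hex digits.
# Assumes a shell with 64-bit arithmetic; sum_words is a made-up name.
sum_words() {
    sum=0
    n=0
    for w in $(od -An -v -t u4 "$1"); do
        [ "$n" -eq "$2" ] || sum=$(( sum + w ))
        n=$(( n + 1 ))
    done
    # End-around carry: add anything above bit 31 back into the low word.
    while [ "$sum" -gt 4294967295 ]; do
        sum=$(( (sum & 0xffffffff) + (sum >> 32) ))
    done
    printf '%08x\n' "$sum"
}
```

Under that assumption, a stored value that stays put while the recomputed value moves between runs would mean that some other word of the superblock is coming back different from one read to the next.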
Thanks!
--
Josh Litherland (josh@temp123.org)
* Re: Superblock checksum problems
2006-09-04 20:55 ` Josh Litherland
@ 2006-09-04 21:13 ` Josh Litherland
2006-09-04 21:35 ` Neil Brown
0 siblings, 1 reply; 14+ messages in thread
From: Josh Litherland @ 2006-09-04 21:13 UTC (permalink / raw)
To: Neil Brown; +Cc: linux-raid
This one will really curl your hair. So, operating with the knowledge
that the checksum's state of correctness or incorrectness was changing
all the time, I did this:
  while [ $? != 0 ] ; do
    mdadm -A /dev/md0 /dev/sd[abcd]1
  done
after 1,518 trials it successfully assembled.
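The loop above only starts if the previous command happened to fail, and it does not count attempts by itself. A bounded variant that reports the attempt count could look like the sketch below; `retry_count` is a hypothetical helper name, and the limit passed to it is whatever the caller chooses.

```shell
# retry_count MAX CMD...: run CMD until it succeeds or MAX attempts have
# been made. Prints the number of attempts on stdout (CMD's own output is
# sent to stderr so the count stays clean). Returns 0 on success, 1 if
# every attempt failed.
retry_count() {
    max=$1
    shift
    tries=0
    while [ "$tries" -lt "$max" ]; do
        tries=$(( tries + 1 ))
        if "$@" >&2; then
            echo "$tries"
            return 0
        fi
    done
    echo "$tries"
    return 1
}

# Example (limit of 2000 is arbitrary):
#   retry_count 2000 mdadm -A /dev/md0 /dev/sd[abcd]1
```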
*beats head against wall*
--
Josh Litherland (josh@temp123.org)
* Re: Superblock checksum problems
2006-09-04 21:13 ` Josh Litherland
@ 2006-09-04 21:35 ` Neil Brown
2006-09-04 21:46 ` Josh Litherland
2006-09-04 21:54 ` Josh Litherland
0 siblings, 2 replies; 14+ messages in thread
From: Neil Brown @ 2006-09-04 21:35 UTC (permalink / raw)
To: josh; +Cc: linux-raid
On Monday September 4, josh@temp123.org wrote:
>
> This one will really curl your hair. So, operating with the knowledge
> that the checksum's state of correctness or incorrectness was changing
> all the time, I did this:
>
> while [ $? != 0 ] ; do
> mdadm -A /dev/md0 /dev/sd[abcd]1
> done
>
> after 1,518 trials it successfully assembled.
>
> *beats head against wall*
I think maybe you should beat the computer against the wall, and then
get it replaced under warranty ...
Something is SERIOUSLY wrong.
As it affects all drives, I suspect the drives are fine.
As that machine doesn't crash instantly, I suspect the cpu/memory is
fine.
Which leaves the controller and cables.
Try
  for i in `seq 1 20`; do
    dd if=/dev/sda of=/tmp/try-$i iflag=direct
  done
  for i in `seq 2 20`; do
    cmp -l /tmp/try-1 /tmp/try-$i
  done
and look for a pattern.
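To make a pattern easier to spot in `cmp -l` output, the two differing byte values can be XORed so that the flipped bits show up directly. This is a small sketch built on the standard `cmp -l` output format (decimal offset followed by two octal byte values); the function name `diff_masks` is made up for illustration.

```shell
# diff_masks A B: for every byte that differs between files A and B, print
# the 1-based byte offset and the XOR of the two byte values in hex.
# A mask with a single bit set (0x01, 0x02, ... 0x80) means one flipped bit.
diff_masks() {
    cmp -l "$1" "$2" | while read -r off a b; do
        # cmp -l prints the byte values in octal; a leading 0 makes the
        # shell parse them as octal too.
        printf '%s 0x%02x\n' "$off" "$(( 0$a ^ 0$b ))"
    done
}

# Example: diff_masks /tmp/try-1 /tmp/try-2
```

If the same mask or the same offsets keep recurring across files, that would point at a systematic fault rather than random noise.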
NeilBrown
* Re: Superblock checksum problems
2006-09-04 21:35 ` Neil Brown
@ 2006-09-04 21:46 ` Josh Litherland
2006-09-04 23:11 ` Josh Litherland
2006-09-04 21:54 ` Josh Litherland
1 sibling, 1 reply; 14+ messages in thread
From: Josh Litherland @ 2006-09-04 21:46 UTC (permalink / raw)
To: Neil Brown; +Cc: linux-raid
On Tue, 2006-09-05 at 07:35 +1000, Neil Brown wrote:
> Something is SERIOUSLY wrong.
> As it affects all drives, I suspect the drives are fine.
> As that machine doesn't crash instantly, I suspect the cpu/memory is
> fine.
> Which leaves the controller and cables.
> for i in `seq 1 20`; do
>   dd if=/dev/sda of=/tmp/try-$i iflag=direct
> done
> for i in `seq 2 20`; do
>   cmp -l /tmp/try-1 /tmp/try-$i
> done
>
> and look for a pattern.
-nod- You're thinking the card is reading/writing different values
intermittently, I'm guessing. The only thing that makes me dubious
about that is that, once the md is up and running, it seems to do
perfectly fine. I've only used it for a couple days, but never got any
read errors or invalid file problems. I was doing pretty heavy IO on it,
and these header checksum readings are changing several times a SECOND.
The other thing is that the actual checksum value being generated never
changes, just the kernel's idea of whether it's valid or invalid.
Weird, weird, weird. I will probably get a new card, but I'd like to
keep examining this one for SCIENCE for a little while longer.
--
Josh Litherland (josh@temp123.org)
* Re: Superblock checksum problems
2006-09-04 21:35 ` Neil Brown
2006-09-04 21:46 ` Josh Litherland
@ 2006-09-04 21:54 ` Josh Litherland
1 sibling, 0 replies; 14+ messages in thread
From: Josh Litherland @ 2006-09-04 21:54 UTC (permalink / raw)
To: Neil Brown; +Cc: linux-raid
On Tue, 2006-09-05 at 07:35 +1000, Neil Brown wrote:
> Try
> for i in `seq 1 20`; do
>   dd if=/dev/sda of=/tmp/try-$i iflag=direct
> done
> for i in `seq 2 20`; do
>   cmp -l /tmp/try-1 /tmp/try-$i
> done
I tried a variant of this:
  for i in `seq 1 20`; do
    dd if=/dev/sda1 of=/tmp/try-$i count=4 bs=4096 iflag=direct
    sleep 0.25
  done
Tried it with and without the sleep, on all 4 drives. In all cases, the
20 "try" files were identical.
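The same check can be done without keeping 20 files around, by checksumming each read and counting how many distinct results turn up. A sketch under a few assumptions: `distinct_reads` is a hypothetical name, the 16 KiB read size simply mirrors the `count=4 bs=4096` used above, and `iflag=direct` is a GNU dd flag that the caller passes in when the source supports it.

```shell
# distinct_reads SRC TRIES [DD_ARGS...]: read the first 16 KiB of SRC
# TRIES times and print how many different checksums were seen.
# A result of 1 means every read returned the same bytes.
distinct_reads() {
    src=$1
    tries=$2
    shift 2
    i=0
    while [ "$i" -lt "$tries" ]; do
        dd if="$src" bs=4096 count=4 "$@" 2>/dev/null | cksum
        i=$(( i + 1 ))
    done | sort -u | wc -l | tr -d ' '
}

# Example: distinct_reads /dev/sda1 20 iflag=direct
```

One caveat, offered as a guess rather than a diagnosis: if these are 0.90-format superblocks, they sit near the end of each member device, so reads from the start of the partition would not cover the region whose checksum is misbehaving.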
--
Josh Litherland (josh@temp123.org)
* Re: Superblock checksum problems
2006-09-04 21:46 ` Josh Litherland
@ 2006-09-04 23:11 ` Josh Litherland
2006-09-04 23:17 ` Neil Brown
` (2 more replies)
0 siblings, 3 replies; 14+ messages in thread
From: Josh Litherland @ 2006-09-04 23:11 UTC (permalink / raw)
To: Neil Brown; +Cc: linux-raid
On Mon, 2006-09-04 at 17:46 -0400, Josh Litherland wrote:
> I've only used it for a couple days, but never got any read errors or invalid file problems.
Feh, disregard that. I've beaten it up some more, and occasional errors
are cropping up. Bad card. Nothing more to see here, move along.
--
Josh Litherland (josh@temp123.org)
* Re: Superblock checksum problems
2006-09-04 23:11 ` Josh Litherland
@ 2006-09-04 23:17 ` Neil Brown
2006-09-04 23:43 ` Henrik Holst
2006-09-06 13:26 ` Josh Litherland
2 siblings, 0 replies; 14+ messages in thread
From: Neil Brown @ 2006-09-04 23:17 UTC (permalink / raw)
To: josh; +Cc: linux-raid
On Monday September 4, josh@temp123.org wrote:
> On Mon, 2006-09-04 at 17:46 -0400, Josh Litherland wrote:
>
> > I've only used it for a couple days, but never got any read errors or invalid file problems.
>
> Feh, disregard that. I've beaten it up some more, and occasional errors
> are cropping up. Bad card. Nothing more to see here, move along.
Good! Thanks for keeping us informed. It is good to have closure.
NeilBrown
* Re: Superblock checksum problems
2006-09-04 23:11 ` Josh Litherland
2006-09-04 23:17 ` Neil Brown
@ 2006-09-04 23:43 ` Henrik Holst
2006-09-06 7:07 ` Mario 'BitKoenig' Holbe
2006-09-06 13:26 ` Josh Litherland
2 siblings, 1 reply; 14+ messages in thread
From: Henrik Holst @ 2006-09-04 23:43 UTC (permalink / raw)
To: linux-raid
Josh Litherland wrote:
> Feh, disregard that. I've beaten it up some more, and occasional errors
> are cropping up. Bad card. Nothing more to see here, move along.
It would be good to have an analog to "memtest" but for PATA and SATA
ports. Anyone seen something like that out there on the web?
Henrik Holst
* Re: Superblock checksum problems
2006-09-04 23:43 ` Henrik Holst
@ 2006-09-06 7:07 ` Mario 'BitKoenig' Holbe
0 siblings, 0 replies; 14+ messages in thread
From: Mario 'BitKoenig' Holbe @ 2006-09-06 7:07 UTC (permalink / raw)
To: linux-raid
Henrik Holst <henrik.holst@idgmail.se> wrote:
> It would be good to have an analog to "memtest" but for PATA and SATA
> ports. Anyone seen something like that out there on the web?
Are you looking for `badblocks'?
There is also a `memtest.sh' from Doug Ledford. Its main intention is,
as the name suggests, to find memory problems. However, it does also
stress the disk a lot. In fact, it finds problems that occur somewhere
on the path from a disk to a CPU. If you increase its $NR_SIMULTANEOUS
to fill up your disk entirely, it should nearly do what you are looking
for (except for some blocks that are never touched because of filesystem
organization) :)
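The general idea of a write-and-compare test can be sketched in a few lines. To be clear, this is not memtest.sh and not badblocks, only an illustration of the technique; `write_verify` is a made-up name, and a serious test would also need to defeat the page cache (direct I/O or dropping caches), which this sketch does not do.

```shell
# write_verify DIR MB: write MB mebibytes of random data into DIR, copy
# it, flush to disk, and compare the two files byte for byte.
# Returns 0 if they match, 1 if they differ, 2 if setup failed.
write_verify() {
    ref=$(mktemp "$1/ref.XXXXXX") || return 2
    copy=$(mktemp "$1/copy.XXXXXX") || { rm -f "$ref"; return 2; }
    dd if=/dev/urandom of="$ref" bs=1048576 count="$2" 2>/dev/null
    cp "$ref" "$copy"
    sync
    rc=0
    cmp -s "$ref" "$copy" || rc=1
    rm -f "$ref" "$copy"
    return "$rc"
}

# Example: write_verify /mnt/test 512 || echo "mismatch detected"
```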
regards
Mario
--
I thought the only thing the internet was good for was porn. -- Futurama
* Re: Superblock checksum problems
2006-09-04 23:11 ` Josh Litherland
2006-09-04 23:17 ` Neil Brown
2006-09-04 23:43 ` Henrik Holst
@ 2006-09-06 13:26 ` Josh Litherland
2006-09-07 20:40 ` Josh Litherland
2 siblings, 1 reply; 14+ messages in thread
From: Josh Litherland @ 2006-09-06 13:26 UTC (permalink / raw)
To: Neil Brown; +Cc: linux-raid
Stranger than we dreamt it...
I got a new, identical card (the Syba 4-port PCI card) and had EXACTLY
the same problems. Tried with all new cables: exact same problems.
Moved the card into each PCI slot in the system. Same problems.
I put the drives and cables into a different system onto an onboard SATA
controller, and they work fine, so drives and cables are "known good".
It seems unlikely that two identical cards would be broken in exactly
the same way, so I have to return to my previous suspicion: bad
interaction between the card's firmware and the driver.
This evening, I will try out the weird card in the new system's
motherboard, to see if it's something in the old mobo that doesn't like
this card.
I will hold on to this card at least until the weekend for testing, and
longer if anyone thinks they might have a patch for me to test.
My thanks to everyone who's still paying attention to this thread. =)
--
Josh Litherland (josh@temp123.org)
* Re: Superblock checksum problems
2006-09-06 13:26 ` Josh Litherland
@ 2006-09-07 20:40 ` Josh Litherland
0 siblings, 0 replies; 14+ messages in thread
From: Josh Litherland @ 2006-09-07 20:40 UTC (permalink / raw)
To: Neil Brown; +Cc: linux-raid
Apparently, the card I had is incompatible in some way with the
motherboard I was using, a Via chipset board. Both cards work fine
in a different motherboard. I guess this is case closed.
Thanks to everyone!
--
Josh Litherland (josh@temp123.org)