* Debugging a strange array corruption
From: Brad Campbell @ 2010-12-14 8:10 UTC (permalink / raw)
To: RAID Linux
G'day all,
I have a 10 x 1TB drive RAID-6 here. It's been great for ages, but recently I've seen nasty random
corruption across the entire array that I cannot pin down.
The machine also has a number of RAID-1 and a RAID-5 which are all behaving perfectly.
The machine has 16GB of RAM, so all my read tests are done with dd bs=1G count=20 to make sure I'm
actually hitting the disk somewhere.
The array is partitioned into three approximately equal partitions.
If I do something like -
for i in `seq 3` ; do dd if=/dev/md0p1 bs=1G count=20 | md5sum ; done
- I get three completely different checksums.
The filesystems are unmounted and the array is idle.
I've run the same test individually on all 10 disks in the array and they all appear to give
consistent data. Reading anything from the array gives me mostly correct data with intermittent garbage.
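The repeated-read check described above can be sketched as a small shell function. This is only an illustration under stated assumptions: the function name `repeat_read_check` and its argument order are hypothetical, GNU `dd` and `md5sum` are assumed to be available, and the default read size (20 GiB) simply mirrors the "read more than the 16GB of RAM" reasoning in this message.

```shell
#!/bin/sh
# Hypothetical helper (not from the original message): read the same region
# of TARGET several times and report whether the MD5 sums agree.
# Usage: repeat_read_check TARGET [RUNS] [BLOCK_SIZE] [COUNT]
repeat_read_check() {
    target=$1
    runs=${2:-3}          # number of passes, 3 as in the original one-liner
    bs=${3:-1M}           # dd block size
    count=${4:-20480}     # 20480 x 1M = 20 GiB, larger than the 16GB of RAM

    if [ ! -r "$target" ]; then
        echo "cannot read $target" >&2
        return 2
    fi

    first=""
    i=1
    while [ "$i" -le "$runs" ]; do
        # Keep only the hash field of the md5sum output.
        sum=$(dd if="$target" bs="$bs" count="$count" 2>/dev/null | md5sum | cut -d' ' -f1)
        echo "run $i: $sum"
        if [ -z "$first" ]; then
            first=$sum
        elif [ "$sum" != "$first" ]; then
            echo "MISMATCH on run $i"
            return 1
        fi
        i=$((i + 1))
    done
    echo "consistent"
}
```

Something like `repeat_read_check /dev/md0p1` would repeat the array test, and pointing it at each member disk in turn would repeat the per-disk test. Reading more data than fits in RAM matters here, otherwise later passes may be served from the page cache rather than from the disks.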
I've tried both a 2.6.36.[12] kernel, and I'm currently running 2.6.37-rc5-git3 with the same odd
results.
All the disks pass long SMART tests. They all checksum correctly from end to end with repeated
sequential runs.
No libata errors in the logs.
The drives are all on separate channels. 8 are on a pair of Marvell 88SX7042 controllers and 2 are
on a SIL3132. This has occurred since I upgraded the mainboard (and kernel at the same time -
nothing like throwing more variables in the mix) and its effects were subtle enough that I missed
them until it had successfully rotated out all of my good backups with broken data. Lesson learned.
I'm stumped and I don't even know where to begin. I've never seen something like this happen without
a bad disk, controller or cable and they are easy to diagnose.
Regards,
--
Dolphins are so intelligent that within a few weeks they can
train Americans to stand at the edge of the pool and throw them
fish.
* Re: Debugging a strange array corruption
From: Roman Mamedov @ 2010-12-14 9:22 UTC (permalink / raw)
To: Brad Campbell; +Cc: RAID Linux
On Tue, 14 Dec 2010 16:10:07 +0800
Brad Campbell <brad@wasp.net.au> wrote:
> The drives are all on separate channels. 8 are on a pair of Marvell 88SX7042
> controllers and 2 are on a SIL3132. This has occurred since I upgraded the
> mainboard (and kernel at the same time - nothing like throwing more
> variables in the mix) and its effects were subtle enough that I missed them
> until it had successfully rotated out all of my good backups with broken
> data. Lesson learned.
I'd suggest that you try moving two disks away from SiI3132, change your
setup so that at most ONE port on that controller is used, or none at all.
Some time ago there was a report of data corruption with controllers using
that chip when both ports simultaneously read at full speed:
http://forum.ixbt.com/topic.cgi?id=11:35147:1200#1200 (in Russian)
Perhaps the problem is not in the chip itself, but in some variations of
schematics/components/soldering, because only two of five supposedly identical
boards the reporter bought were corrupting data in that way, one much
more often than the other.
--
With respect,
Roman
* Re: Debugging a strange array corruption
From: Brad Campbell @ 2010-12-14 9:37 UTC (permalink / raw)
To: Roman Mamedov; +Cc: RAID Linux
On 14/12/10 17:22, Roman Mamedov wrote:
> On Tue, 14 Dec 2010 16:10:07 +0800
> Brad Campbell <brad@wasp.net.au> wrote:
>
>> The drives are all on separate channels. 8 are on a pair of Marvell 88SX7042
>> controllers and 2 are on a SIL3132. This has occurred since I upgraded the
>> mainboard (and kernel at the same time - nothing like throwing more
>> variables in the mix) and its effects were subtle enough that I missed them
>> until it had successfully rotated out all of my good backups with broken
>> data. Lesson learned.
>
> I'd suggest that you try moving two disks away from SiI3132, change your
> setup so that at most ONE port on that controller is used, or none at all.
>
> Some time ago there was a report of data corruption with controllers using
> that chip when both ports simultaneously read at full speed:
> http://forum.ixbt.com/topic.cgi?id=11:35147:1200#1200 (in Russian)
> Perhaps the problem is not in the chip itself, but in some variations of
> schematics/components/soldering, because only two of five supposedly identical
> boards the reporter bought were corrupting data in that way, one much
> more often than the other.
>
And in the prior incarnation I was only using 1 port on that controller so the problem would never
have manifested itself. Thanks, at least I have something to try.
Regards,
* Re: Debugging a strange array corruption
From: Roman Mamedov @ 2010-12-14 9:42 UTC (permalink / raw)
To: Brad Campbell; +Cc: RAID Linux
On Tue, 14 Dec 2010 17:37:53 +0800
Brad Campbell <brad@wasp.net.au> wrote:
> And in the prior incarnation I was only using 1 port on that controller so
> the problem would never have manifested itself. Thanks, at least I have
> something to try.
To add another idea to my previous reply, it should be pretty easy to test for
the presence of this particular problem by running your dd+md5sum test on just
the two physical disks which are plugged into that controller, starting the
tests simultaneously on both disks (in two terminal windows). See if you get
varying sums over several runs that way.
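A rough sketch of that simultaneous test, run from one script instead of two terminal windows. The function name `concurrent_read_check` and its arguments are hypothetical, the two device paths are whatever the disks on the SiI3132 happen to be called on the system in question, and GNU `dd` and `md5sum` are assumed.

```shell
#!/bin/sh
# Hypothetical helper (not from the original thread): read two devices at the
# same time for several passes and flag any device whose checksum changes
# between passes.
# Usage: concurrent_read_check DISK_A DISK_B [PASSES] [BLOCK_SIZE] [COUNT]
concurrent_read_check() {
    disk_a=$1
    disk_b=$2
    passes=${3:-3}
    bs=${4:-1M}
    count=${5:-20480}     # 20 GiB per disk per pass by default

    tmpdir=$(mktemp -d) || return 2
    status=0
    pass=1
    while [ "$pass" -le "$passes" ]; do
        # Start both reads in the background so they overlap, then wait
        # for both pipelines to finish.
        dd if="$disk_a" bs="$bs" count="$count" 2>/dev/null \
            | md5sum | cut -d' ' -f1 > "$tmpdir/a.$pass" &
        dd if="$disk_b" bs="$bs" count="$count" 2>/dev/null \
            | md5sum | cut -d' ' -f1 > "$tmpdir/b.$pass" &
        wait

        echo "pass $pass: a=$(cat "$tmpdir/a.$pass") b=$(cat "$tmpdir/b.$pass")"

        # Compare every later pass against the first pass of the same disk.
        if [ "$pass" -gt 1 ]; then
            cmp -s "$tmpdir/a.1" "$tmpdir/a.$pass" \
                || { echo "disk A differs on pass $pass"; status=1; }
            cmp -s "$tmpdir/b.1" "$tmpdir/b.$pass" \
                || { echo "disk B differs on pass $pass"; status=1; }
        fi
        pass=$((pass + 1))
    done
    rm -rf "$tmpdir"
    return "$status"
}
```

If each disk gives stable sums when read alone but the sums vary when both are read together like this, that would point at the shared controller rather than at either disk.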
--
With respect,
Roman
* Re: Debugging a strange array corruption
From: Brad Campbell @ 2010-12-14 10:29 UTC (permalink / raw)
To: Roman Mamedov; +Cc: RAID Linux
On 14/12/10 17:42, Roman Mamedov wrote:
> On Tue, 14 Dec 2010 17:37:53 +0800
> Brad Campbell <brad@wasp.net.au> wrote:
>
>> And in the prior incarnation I was only using 1 port on that controller so
>> the problem would never have manifested itself. Thanks, at least I have
>> something to try.
>
> To add another idea to my previous reply, it should be pretty easy to test for
> the presence of this particular problem by running your dd+md5sum test on just
> the two physical disks which are plugged into that controller, starting the
> tests simultaneously on both disks (in two terminal windows). See if you get
> varying sums over several runs that way.
I just finished that test. Can I cry now?
Thanks Roman, I'd *never* have pegged it otherwise.
Anyone want a slightly second hand SIL 2 port controller?
Regards,
* Re: Debugging a strange array corruption
From: David W. @ 2010-12-14 11:59 UTC (permalink / raw)
To: Roman Mamedov; +Cc: Brad Campbell, RAID Linux
On Tue, Dec 14, 2010 at 03:22, Roman Mamedov <roman@rm.pp.ru> wrote:
> On Tue, 14 Dec 2010 16:10:07 +0800
> Brad Campbell <brad@wasp.net.au> wrote:
>
>> The drives are all on separate channels. 8 are on a pair of Marvell 88SX7042
>> controllers and 2 are on a SIL3132. This has occurred since I upgraded the
>> mainboard (and kernel at the same time - nothing like throwing more
>> variables in the mix) and its effects were subtle enough that I missed them
>> until it had successfully rotated out all of my good backups with broken
>> data. Lesson learned.
>
> I'd suggest that you try moving two disks away from SiI3132, change your
> setup so that at most ONE port on that controller is used, or none at all.
>
> Some time ago there was a report of data corruption with controllers using
> that chip when both ports simultaneously read at full speed:
> http://forum.ixbt.com/topic.cgi?id=11:35147:1200#1200 (in Russian)
> Perhaps the problem is not in the chip itself, but in some variations of
> schematics/components/soldering, because only two of five supposedly identical
> boards the reporter bought were corrupting data in that way, one much
> more often than the other.
>
Interesting... Any idea if the problem affects the SiI 3114 chipset as
well? I've been seeing some similar problems, but haven't had enough
time to dig into it to query the list yet.
--
david williams
<darewi@gmail.com>
* Re: Debugging a strange array corruption
From: Roman Mamedov @ 2010-12-14 12:07 UTC (permalink / raw)
To: David W.; +Cc: Brad Campbell, RAID Linux
On Tue, 14 Dec 2010 05:59:17 -0600
"David W." <darewi@gmail.com> wrote:
> Interesting... Any idea if the problem affects the SiI 3114 chipset as
> well? I've been seeing some similar problems, but haven't had enough
> time to dig into it to query the list yet.
There are data corruption issues reported with 3114 as well, not sure how many
of those have been worked around, traced down to something else, or still
remain. See https://encrypted.google.com/search?q=3114+data+corruption
--
With respect,
Roman