linux-raid.vger.kernel.org archive mirror
* silent corruption with RAID1
@ 2006-02-26  2:40 Moses Leslie
  2006-02-26  5:07 ` Bill Davidsen
                   ` (2 more replies)
  0 siblings, 3 replies; 7+ messages in thread
From: Moses Leslie @ 2006-02-26  2:40 UTC (permalink / raw)
  To: linux-raid

Hi,

I have a machine that currently has 4 drives in it (currently running
2.6.15.4). The first two drives are on the onboard SATA controller (VIA)
in a RAID-1.  I haven't had any issues with these.

The other two drives were added recently, along with an SiL PCI SATA card
to put them on.  lspci reports this card as:

0000:00:0a.0 Unknown mass storage controller: Silicon Image, Inc.
(formerly CMD Technology Inc) SiI 3112 [SATALink/SATARaid] Serial ATA
Controller (rev 02)

I initially used mdadm to create a new RAID1 of the two new drives, and
added them into the LVM group that the other ones were in to expand the
drive, but pretty quickly noticed (via rsync -c) that all new files were
corrupted.

I've since pulled the 2nd set of drives out of the LVM to test.  It's only
when using a RAID-1 that I get occasional corruption.  I split the drives
(each 300GB) into 4 75GB partitions each, and created 3 md devices.   One
75GB raid1, one 150GB raid0, and 1 225GB raid5.

I used a script that newfs'd each one, dd'd multiple copies of files (one
run with a 1GB file, one with 3GB, one with 6GB), md5'd those files, then
umounted.

At least once in each test run, there was a file with the wrong checksum
when on the RAID-1 part of the test.
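
For anyone wanting to repeat a test of this kind, the write-and-verify loop
described above might be sketched roughly as follows in Python. This is not
the script that was actually run: the function names, file names and chunk
size are all made up for illustration, and the newfs/mount/umount steps are
left out, so a read-back may be served from the page cache rather than from
the disks.

```python
import hashlib
import os
import shutil
import tempfile

CHUNK = 1024 * 1024  # read and write in 1 MiB pieces (arbitrary choice)


def md5_of(path):
    """Return the hex MD5 digest of the file at path."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(CHUNK), b""):
            digest.update(block)
    return digest.hexdigest()


def write_and_verify(directory, size_bytes, copies):
    """Write one random source file, copy it, and return the copies whose MD5 differs."""
    source = os.path.join(directory, "source.bin")  # hypothetical file name
    expected = hashlib.md5()
    remaining = size_bytes
    with open(source, "wb") as f:
        while remaining > 0:
            block = os.urandom(min(CHUNK, remaining))
            expected.update(block)  # checksum the data before it is written
            f.write(block)
            remaining -= len(block)
        f.flush()
        os.fsync(f.fileno())

    bad = []
    for i in range(copies):
        target = os.path.join(directory, "copy%d.bin" % i)
        shutil.copyfile(source, target)
        # NOTE: this read may come from the page cache; the original test
        # unmounted the filesystem, which this sketch does not do.
        if md5_of(target) != expected.hexdigest():
            bad.append(target)
    return bad


if __name__ == "__main__":
    # Hypothetical usage: replace the scratch directory with a mount point
    # on the array under test, and use much larger sizes.
    with tempfile.TemporaryDirectory() as scratch:
        print(write_and_verify(scratch, 4 * CHUNK, copies=3))
```

The expected digest is computed from the data in memory, so a mismatch points
at something between the write call and the read-back rather than at the
source data.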

After completing all the tests, I redid the md devices such that none
of them used any of the same partitions that they had used in the first
test (i.e. the RAID1 was sda1 and sdb1 in the first one, and sda4 and
sdb4 in the second one).

I also did the same test using each of the regular partitions as well
(sda1-4 and sdb1-4).

I was never able to reproduce any corruption except with the RAID1.

There are never any error messages in dmesg or syslog.

Is there anything I can do to help track down where the problem is?

Thanks!

Moses


* Re: silent corruption with RAID1
  2006-02-26  2:40 silent corruption with RAID1 Moses Leslie
@ 2006-02-26  5:07 ` Bill Davidsen
  2006-02-26  8:22 ` Andre Noll
  2006-02-27  0:55 ` Moses Leslie
  2 siblings, 0 replies; 7+ messages in thread
From: Bill Davidsen @ 2006-02-26  5:07 UTC (permalink / raw)
  To: Moses Leslie; +Cc: linux-raid

Moses Leslie wrote:

>Is there anything I can do to help track down where the problem is?
>

Based on my own experience, I would suspect hardware. I can't swear that 
you don't have buggy software of some kind, but I've been running for 
over a year on RAID-1 with critical data on the volume, and haven't seen 
any indication of problems. Because of the data, the files get checked 
against md5sums daily and sha1sums monthly. Some files are old, some are 
added almost every day; files are seldom updated, but it does happen,
and they are moved to new directories fairly frequently (2-3
times/mo). The checkfiles are run against an archival copy on
another system about once a month, so I'm pretty sure there is no 
corruption happening.
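
A periodic check of this sort could be sketched as below, assuming a simple
manifest that maps each relative path to its digest. The function names and
the manifest layout are illustrative only; the actual checkfiles mentioned
above are not shown in this thread.

```python
import hashlib
import os


def file_digest(path, algorithm="sha1", chunk=1024 * 1024):
    """Return the hex digest of a file, read in fixed-size pieces."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            digest.update(block)
    return digest.hexdigest()


def build_manifest(root, algorithm="sha1"):
    """Map each file's path relative to root to its digest."""
    manifest = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            manifest[os.path.relpath(full, root)] = file_digest(full, algorithm)
    return manifest


def verify_manifest(root, manifest, algorithm="sha1"):
    """Return (missing, changed) lists of relative paths, each sorted."""
    missing, changed = [], []
    for rel, expected in sorted(manifest.items()):
        full = os.path.join(root, rel)
        if not os.path.isfile(full):
            missing.append(rel)
        elif file_digest(full, algorithm) != expected:
            changed.append(rel)
    return missing, changed
```

Note that a manifest keyed by path reports a moved file as missing, so a
setup where files change directories would need to rebuild or re-key the
manifest after each move.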

Cables are my favorite source of intermittent evil; memory problems are
next, but those usually show up everywhere if you look hard. Hope some of
this is useful.

-- 
bill davidsen <davidsen@tmr.com>
  CTO TMR Associates, Inc
  Doing interesting things with small computers since 1979



* Re: silent corruption with RAID1
  2006-02-26  2:40 silent corruption with RAID1 Moses Leslie
  2006-02-26  5:07 ` Bill Davidsen
@ 2006-02-26  8:22 ` Andre Noll
  2006-02-26  8:57   ` Moses Leslie
  2006-02-27  0:55 ` Moses Leslie
  2 siblings, 1 reply; 7+ messages in thread
From: Andre Noll @ 2006-02-26  8:22 UTC (permalink / raw)
  To: Moses Leslie; +Cc: linux-raid

On 18:40, Moses Leslie wrote:
> 0000:00:0a.0 Unknown mass storage controller: Silicon Image, Inc.
> (formerly CMD Technology Inc) SiI 3112 [SATALink/SATARaid] Serial ATA
> Controller (rev 02)

Do you have Seagate drives? Some models have problems with this controller.

Andre
-- 
The only person who always got his work done by Friday was Robinson Crusoe


* Re: silent corruption with RAID1
  2006-02-26  8:22 ` Andre Noll
@ 2006-02-26  8:57   ` Moses Leslie
  2006-02-26  9:05     ` Andre Noll
  0 siblings, 1 reply; 7+ messages in thread
From: Moses Leslie @ 2006-02-26  8:57 UTC (permalink / raw)
  To: Andre Noll; +Cc: linux-raid

On Sun, 26 Feb 2006, Andre Noll wrote:

> On 18:40, Moses Leslie wrote:
> > 0000:00:0a.0 Unknown mass storage controller: Silicon Image, Inc.
> > (formerly CMD Technology Inc) SiI 3112 [SATALink/SATARaid] Serial ATA
> > Controller (rev 02)
>
> Do you have Seagate drives? Some models have problems with this controller.

They are indeed 300GB Seagate drives.

Are there any workarounds?  It seems odd that raid1 would be the only
thing that has a problem.

The other two drives are Western Digitals. Is there a blacklist somewhere
I could check?

Thanks for the reply,

Moses


* Re: silent corruption with RAID1
  2006-02-26  8:57   ` Moses Leslie
@ 2006-02-26  9:05     ` Andre Noll
  2006-02-26 17:39       ` Moses Leslie
  0 siblings, 1 reply; 7+ messages in thread
From: Andre Noll @ 2006-02-26  9:05 UTC (permalink / raw)
  To: Moses Leslie; +Cc: linux-raid

On 00:57, Moses Leslie wrote:
> On Sun, 26 Feb 2006, Andre Noll wrote:
> 
> > On 18:40, Moses Leslie wrote:
> > > 0000:00:0a.0 Unknown mass storage controller: Silicon Image, Inc.
> > > (formerly CMD Technology Inc) SiI 3112 [SATALink/SATARaid] Serial ATA
> > > Controller (rev 02)
> >
> > Do you have Seagate drives? Some models have problems with this controller.
> 
> They are indeed 300GB Seagate drives.
> 
> Are there any workarounds?  It seems odd that raid1 would be the only
> thing that has a problem.

You could add your drive to the blacklist just to see if that makes any
difference.

> The other two drives are Western Digitals. Is there a blacklist somewhere
> I could check?

Just look at the top of drivers/scsi/sata_sil.c

Regards
Andre
-- 
The only person who always got his work done by Friday was Robinson Crusoe


* Re: silent corruption with RAID1
  2006-02-26  9:05     ` Andre Noll
@ 2006-02-26 17:39       ` Moses Leslie
  0 siblings, 0 replies; 7+ messages in thread
From: Moses Leslie @ 2006-02-26 17:39 UTC (permalink / raw)
  To: Andre Noll; +Cc: linux-raid

On Sun, 26 Feb 2006, Andre Noll wrote:

> > Are there any workarounds?  It seems odd that raid1 would be the only
> > thing that has a problem.
>
> You could add your drive to the blacklist just to see if that makes any
> difference.

Unfortunately, it didn't make any difference.  After adding in the drive
IDs, they're identified correctly:

ata1(0): applying Seagate errata fix
ata2(0): applying Seagate errata fix

but I still get the corruption on md1 (sda1 + sdb1 in raid 1) but not on
either sda1 or sdb1 when the raid is shut down and the partitions are used
individually.

As a new test I made two raid1s, sda1+sda2 and sdb1+sdb2, and did the same
tests with both of them at once and there was no corruption.

Moses



* Re: silent corruption with RAID1
  2006-02-26  2:40 silent corruption with RAID1 Moses Leslie
  2006-02-26  5:07 ` Bill Davidsen
  2006-02-26  8:22 ` Andre Noll
@ 2006-02-27  0:55 ` Moses Leslie
  2 siblings, 0 replies; 7+ messages in thread
From: Moses Leslie @ 2006-02-27  0:55 UTC (permalink / raw)
  To: linux-raid

Turns out this is indeed DMA corruption that only happens under high load.
I'm guessing that raid1 must just do more DMA, so it shows up more often
there.

I can turn the corruption on and off by starting other high-ish bandwidth
processes on the machine (backing up to a remote server, etc).
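
When a corrupted copy and a known-good copy of the same file are both
available, listing where they differ may help characterise the damage. The
following is a minimal Python sketch under that assumption; the function name
and chunk size are hypothetical, and it was not part of the testing described
in this thread.

```python
def differing_ranges(path_a, path_b, chunk=64 * 1024):
    """Return a list of (offset, length) byte runs where the two files differ.

    Only the common prefix of the two files is compared; a difference in
    file length is not reported by this function.
    """
    ranges = []
    start = None  # offset where the current run of differing bytes began
    offset = 0    # offset of the chunk currently being compared
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            a = fa.read(chunk)
            b = fb.read(chunk)
            n = min(len(a), len(b))
            if n == 0:
                break
            if a[:n] == b[:n]:
                # Whole chunk identical: close any run left open by the
                # previous chunk.
                if start is not None:
                    ranges.append((start, offset - start))
                    start = None
            else:
                for i in range(n):
                    if a[i] != b[i]:
                        if start is None:
                            start = offset + i
                    elif start is not None:
                        ranges.append((start, offset + i - start))
                        start = None
            offset += n
    if start is not None:
        ranges.append((start, offset - start))
    return ranges
```

Each returned pair is the starting byte offset and the length of one run of
differing bytes, which can then be compared against sector or page boundaries.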

Thanks for all the suggestions,

Moses

