Silent Corruption on RAID5

linux-raid.vger.kernel.org archive mirror

* Silent Corruption on RAID5
@ 2006-01-22 16:44 Michael Barnwell
  2006-01-22 18:42 ` Mitchell Laks
  2006-01-27 12:29 ` Molle Bestefich
  0 siblings, 2 replies; 4+ messages in thread
From: Michael Barnwell @ 2006-01-22 16:44 UTC
  To: linux-raid

Hi,

I'm experiencing silent data corruption on my RAID 5 set of four 400GB 
SATA disks.

I first had the problem a couple of weeks ago and thought it was related
to using reiserfs, which I hadn't used before, since I have another
perfectly functional RAID 5 array running ext3. After lots of testing I
found that the problem happens with ext3 on the array as well, and after
even more testing I found that it only occurs on the array, not on the
individual hard disks.

My test consists of making a ~10GB file of zeros, then checking it for
non-zero bytes. I've also tried creating the file of zeros on a
functional array and copying it across, with the same results.

dd bs=1024 count=10000k if=/dev/zero of=./10GB.tst
od -t x1 s0/10GB.tst

These commands give me one row of zeros on my other RAID 5 set on the
same box, and on each individual hard disk of the array when I put ext3
on them all to see if one was faulty. When the disks are in RAID,
however, od spouts lots of non-zeros at me.

<snip>
21524747740 00 00 00 00 00 00 00 00 00 00 00 00 00 50 5c 36
21524747760 00 10 00 00 00 00 a7 23 00 10 00 80 00 00 00 00
21524750000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
*
21525147740 00 00 00 00 00 00 00 00 00 00 00 00 00 50 5c 36
21525147760 00 10 00 00 00 00 a7 23 00 10 00 80 00 00 00 00
21525150000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
<snip>

There is a fair bit of that, and the last time I ran it I got an I/O
error and the mount went into read-only mode.
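
The left-hand column of the od output above is a byte offset in octal, which makes the spacing between the corrupted rows hard to read at a glance. The following is a minimal sketch that converts the two offsets from the dump to decimal and prints their distance. The function name oct2dec is made up for illustration, and the script assumes a shell with 64-bit arithmetic.

```shell
#!/bin/sh
# Convert an od(1) offset, which is octal by default, to a decimal byte count.
# oct2dec is a hypothetical helper name, not part of any tool.
oct2dec() {
    # A leading 0 makes POSIX shell arithmetic parse the constant as octal.
    echo "$((0$1))"
}

first=$(oct2dec 21524747740)    # first corrupted row in the dump above
second=$(oct2dec 21525147740)   # the row where the same pattern repeats

echo "first:   $first bytes"
echo "second:  $second bytes"
echo "spacing: $((second - first)) bytes"
echo "offset within a 4096-byte block: $((first % 4096))"
```

For these two rows the spacing comes out as 65536 bytes, which is the same size as the 64k chunk reported for md2. File offsets do not map directly onto array offsets, so this is a hint about where to look rather than a diagnosis.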

I figure the problem is either with the Silicon Image 3114 hardware or
its driver, or with the RAID subsystem, but as I mentioned earlier, my
other RAID 5 set of three 120GB drives on two IDE controllers works
fine.

I'm running Debian sarge with a 2.6.15-1 kernel. The machine has an Athlon
XP2200, 1GB of RAM, an Asus A7N8X-Deluxe motherboard, 2 Maxtor IDE
controllers, one Silicon Image 3114 PCI adapter, and the on-board
Silicon Image 3112 controller. The drives are attached as follows:

 * 2x 10GB IDE disks and a DVD ROM drive on the on-board IDE controller
 * 3x 120GB Seagate hard disks on the PCI IDE adapters
 * 2x 80GB Seagate disks on the on-board SilImg 3112 controller
 * 4x 400GB disks on the SilImg 3114 PCI adapter

biggs:/mnt/test/s0# uname -a
Linux biggs 2.6.15.1.060121 #1 Sat Jan 21 17:01:30 GMT 2006 i686 GNU/Linux
biggs:/mnt/test/s0# cat /proc/mdstat
Personalities : [raid1] [raid5]
md2 : active raid5 sdd1[4] sdc1[2] sdb1[1] sda1[0]
       1172126208 blocks level 5, 64k chunk, algorithm 2 [4/3] [UUU_]
       [=>...................]  recovery =  8.3% (32500608/390708736) finish=253.3min speed=23564K/sec

md1 : active raid5 hdg1[0] hde1[2] hdi1[1]
       234436352 blocks level 5, 64k chunk, algorithm 2 [3/3] [UUU]

md0 : active raid1 hdb2[0] hda2[1]
       9502336 blocks [2/2] [UU]

unused devices: <none>

(Note: md2 is the array with problems, and I've done the tests when it's
been fully synced, with the same results.)
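
Whether an array is fully synced can be read off the member map at the end of its status line in /proc/mdstat, where "_" marks a missing or rebuilding member (as in "[UUU_]" above). A minimal sketch of that check follows. The function name degraded_arrays is hypothetical, and the pattern assumes the mdstat layout shown in this message.

```shell
#!/bin/sh
# Print the name of every md array whose member map contains "_",
# i.e. an array that is degraded or still rebuilding.
# Hypothetical helper; reads /proc/mdstat unless another file is given.
degraded_arrays() {
    awk '
        # Remember the array name, e.g. "md2 : active raid5 ..."
        /^md[0-9]+ :/ { name = $1 }
        # A member map such as "[UUU_]" with at least one "_"
        /\[[U_]*_[U_]*\]/ { print name }
    ' "${1:-/proc/mdstat}"
}
```

Run against the output above, this would name md2 only. An empty result means every array reports all members up, which is the state the test should be repeated in.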

So, does anyone have any suggestions or tests I could perform to narrow 
down where my problem is?

Regards,

Michael Barnwell.


* Re: Silent Corruption on RAID5
  2006-01-22 16:44 Silent Corruption on RAID5 Michael Barnwell
@ 2006-01-22 18:42 ` Mitchell Laks
  2006-01-22 20:58   ` Michael Barnwell
  2006-01-27 12:29 ` Molle Bestefich
  1 sibling, 1 reply; 4+ messages in thread
From: Mitchell Laks @ 2006-01-22 18:42 UTC
  To: linux-raid

On Sunday 22 January 2006 11:44 am, Michael Barnwell wrote:
> Hi,
>
> I'm experiencing silent data corruption on my RAID 5 set of four 400GB
> SATA disks.

> dd bs=1024 count=10000k if=/dev/zero of=./10GB.tst
> od -t x1 s0/10GB.tst
>
> These commands give me one row of zeros on my other RAID 5 set on the

> I'm running Debian sarge with a 2.6.15-1 kernel, it has an Athlon
> XP2200, 1GB of RAM, Asus A7N8X-Deluxe motherboard, 2 Maxtor IDE
> controllers, one Silicon Image 3114 PCI adapter, along with the on-board
> Silicon Image 3112 controller - 2x 10GB IDE disks and a DVD ROM drive on
> the on-board IDE controller, 3x 120GB Seagate hard disks on the PCI IDE
> adapters, 2x 80GB Seagate disks on the on-board SilImg 3112 controller
> and finally 4x 400GB disks on the SilImg 3114 PCI adapter.
>

Dear Michael,

If you look at my recent post and the response from David Greaves, I suspect
it is because of the presence of multiple different SATA controllers.

Could you try running your test with ONLY the SilImg 3114 adapter
populated with disks? Also, I don't know whether the 3112 and 3114 use different
kernel modules; if they do, make sure the other one is not loaded.

I ran your test on my RAID 1 system with the Debian sid 2.6.15 kernel, on both
the motherboard (sata_via) and PCI card (sata_promise) controlled
RAID devices (I have RAID 1, though), and had no problems.

I could only run od -t x1 10GB.tst.
What is the "s0" for?
I tried s0 and -s0, and od didn't accept either.

od -t x1 -s0 10GB.tst 
"od: no type may be specified when dumping strings"

For what it's worth, on my system the Promise controller wipes out the
VIA VT8237 onboard controller. You seem to have the opposite problem.

I am afraid that SATA controllers may not yet be stable enough for
production.

Mitchell Laks

> So, does anyone have any suggestions or tests I could perform to narrow
> down where my problem is?
>
> Regards,
>
> Michael Barnwell.


* Re: Silent Corruption on RAID5
  2006-01-22 18:42 ` Mitchell Laks
@ 2006-01-22 20:58   ` Michael Barnwell
  0 siblings, 0 replies; 4+ messages in thread
From: Michael Barnwell @ 2006-01-22 20:58 UTC
  To: Mitchell Laks; +Cc: linux-raid

Hi,

Mitchell Laks wrote:
> On Sunday 22 January 2006 11:44 am, Michael Barnwell wrote:
>> Hi,
>>
>> I'm experiencing silent data corruption on my RAID 5 set of four 400GB
>> SATA disks.
> 
>> dd bs=1024 count=10000k if=/dev/zero of=./10GB.tst
>> od -t x1 s0/10GB.tst
>>
>> These commands give me one row of zeros on my other RAID 5 set on the
> 
>> I'm running Debian sarge with a 2.6.15-1 kernel, it has an Athlon
>> XP2200, 1GB of RAM, Asus A7N8X-Deluxe motherboard, 2 Maxtor IDE
>> controllers, one Silicon Image 3114 PCI adapter, along with the on-board
>> Silicon Image 3112 controller - 2x 10GB IDE disks and a DVD ROM drive on
>> the on-board IDE controller, 3x 120GB Seagate hard disks on the PCI IDE
>> adapters, 2x 80GB Seagate disks on the on-board SilImg 3112 controller
>> and finally 4x 400GB disks on the SilImg 3114 PCI adapter.
>>
> 
> Dear Michael,
> 
> If you look at my recent post and the response from David Greaves, I suspect
> it is because of the presence of multiple different SATA controllers.

I just tried disabling the on-board SATA controller via the jumper on
the motherboard and then recreating the array and file system, and the
problem happened again.

> Could you try running your test with ONLY the SilImg 3114 adapter
> populated with disks? Also, I don't know whether the 3112 and 3114 use different
> kernel modules; if they do, make sure the other one is not loaded.

They use the same module.

> I ran your test on my RAID 1 system with the Debian sid 2.6.15 kernel, on both
> the motherboard (sata_via) and PCI card (sata_promise) controlled
> RAID devices (I have RAID 1, though), and had no problems.
>
> I could only run od -t x1 10GB.tst.
> What is the "s0" for?
> I tried s0 and -s0, and od didn't accept either.
> 
> od -t x1 -s0 10GB.tst 
> "od: no type may be specified when dumping strings"

That was a copy and paste error; it's just od -t x1 10GB.tst

> For what it's worth, on my system the Promise controller wipes out the
> VIA VT8237 onboard controller. You seem to have the opposite problem.

I tried a BIOS update this morning; it updated the SATA BIOS on the
on-board controller and lets me see both controllers during boot
(the PCI one finds its drives and lets me access the SilImg BIOS,
then the on-board one does the same).

> I am afraid that SATA controllers may not yet be stable enough for
> production.

Are other chipsets better supported?

> Mitchell Laks
> 
<snip>

Thanks,

Michael Barnwell.


* Re: Silent Corruption on RAID5
  2006-01-22 16:44 Silent Corruption on RAID5 Michael Barnwell
  2006-01-22 18:42 ` Mitchell Laks
@ 2006-01-27 12:29 ` Molle Bestefich
  1 sibling, 0 replies; 4+ messages in thread
From: Molle Bestefich @ 2006-01-27 12:29 UTC
  To: Michael Barnwell; +Cc: linux-raid

Michael Barnwell wrote:
> I'm experiencing silent data corruption
> on my RAID 5 set of four 400GB SATA disks.

I have roughly the same hardware:
 * AMD Opteron 250
 * Silicon Image 3114
 * 300 GB Maxtor SATA

Just to add a data point, I've run your test on my RAID 1 (not RAID 5!)
without problems.

localhost ~ # dd bs=1024 count=10000k if=/dev/zero of=./10GB.tst
10240000+0 records in
10240000+0 records out
localhost ~ # od -t x1 ./10GB.tst
0000000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
*
116100000000
localhost ~ # uname -a
Linux localhost 2.6.12.6-xen #6 SMP Fri Jan 6 06:49:53 CET 2006 x86_64
AMD Opteron(tm) Processor 250 AuthenticAMD GNU/Linux
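
For a file of this size it can be easier to have a tool list only the bytes that are not zero than to read through od output. The following is a minimal sketch of such a check, assuming GNU cmp (for the -l and -n options). The function name find_nonzero is made up for illustration.

```shell
#!/bin/sh
# List the bytes of a file that are not zero.
# Prints one line per offending byte: "<1-based offset> <value in octal>".
# Prints nothing if the file is all zeros.
# find_nonzero is a hypothetical helper name.
find_nonzero() {
    # Arithmetic expansion strips any padding some wc versions add.
    size=$(($(wc -c < "$1")))
    # cmp -l lists every differing byte; -n stops the comparison at the
    # end of the file so the endless stream from /dev/zero is not read past it.
    cmp -l -n "$size" "$1" /dev/zero | awk '{ print $1, $2 }'
}
```

It could be run as "find_nonzero ./10GB.tst | head" to see the first few bad offsets. On a healthy array the command should print nothing.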
