linux-ide.vger.kernel.org archive mirror
* Re: Corrupt data - RAID sata_sil 3114 chip
       [not found] ` <alpine.DEB.1.10.0901020741200.11852@p34.internal.lan>
@ 2009-01-02 21:30   ` Bernd Schubert
  2009-01-02 21:47     ` Twigathy
                       ` (3 more replies)
  0 siblings, 4 replies; 29+ messages in thread
From: Bernd Schubert @ 2009-01-02 21:30 UTC (permalink / raw)
  To: Justin Piszcz; +Cc: bengt, debian-user, linux-raid, linux-ide

Hello Bengt,

sil3114 is known to cause data corruption with some disks. So far I only know
about Seagate, but maybe there are issues with newer Samsungs as well?

http://lkml.indiana.edu/hypermail/linux/kernel/0710.2/2035.html

Unfortunately this issue has simply been ignored by the SATA developers :(
So if you want to be on the safe side, go and get another controller.

I hope I won't frighten you too much, but it might also be that one of
your disks has a problem; I have also seen a few broken disks which don't
return what you write to them...


Cheers,
Bernd


On Fri, Jan 02, 2009 at 07:42:30AM -0500, Justin Piszcz wrote:
>
>
> On Fri, 2 Jan 2009, Bengt Samuelsson wrote:
>
>>
>> Hi,
>>
>> I need some support for this soft-RAID system.
>>
>> I am running it as RAID5 with 4 Samsung SpinPoint 500G SATA300 disks, 1.3T in total.
>>
>> It runs on http://sm7jqb.dnsalias.com
>> I use mdadm on a Debian Linux system.
>> CPU 1.2GHz, 1G memory (my older 433MHz / 512M machine doesn't work at all).
>>
>> I have some corrupt data, and I don't understand why or how to fix it.
>> Maybe slowing it down more would help, but how do I slow it down?
>>
>> Anyone with experience of this cheap kind of RAID system?
>>
>> Ask for more information and I can get it: logs, setup files and
>> whatever you want to know.
>>
>> -- 
>> Bengt Samuelsson
>> Nydalavägen 30 A
>> 352 48 Växjö
>>
>> +46(0)703686441
>>
>> http://sm7jqb.se
>>
>>
>> -- 
>> To UNSUBSCRIBE, email to debian-user-REQUEST@lists.debian.org with a 
>> subject of "unsubscribe". Trouble? Contact listmaster@lists.debian.org
>>
>
> If this is an mdadm-related RAID (not dmraid), please show all relevant md
> info (mdadm -D /dev/md0). I have cc'd linux-raid on this thread for you.
>
> You'll want to read md.txt in /usr/src/linux/Documentation and read up on
> the check and repair commands.
>
> In addition, have you run memtest86 on your system first, to make sure it's
> not memory related?
>
> Justin.

--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Corrupt data - RAID sata_sil 3114 chip
  2009-01-02 21:30   ` Bernd Schubert
@ 2009-01-02 21:47     ` Twigathy
  2009-01-03  2:31     ` Redeeman
                       ` (2 subsequent siblings)
  3 siblings, 0 replies; 29+ messages in thread
From: Twigathy @ 2009-01-02 21:47 UTC (permalink / raw)
  To: Bernd Schubert; +Cc: Justin Piszcz, bengt, debian-user, linux-raid, linux-ide

Hi,

I also had problems with the sata_sil driver, with more than one
Silicon Image card in the same machine, about a year or two back. I don't
remember the specifics, but basically the cards would occasionally
drop the SATA link. This was with Western Digital drives. With a
Samsung 750GB disk, the disk and controller absolutely refused to talk
to each other.

I've since got rid of all but one Silicon Image card and swapped out the
cables, and haven't had problems since. Coincidence? No idea.

04:01.0 RAID bus controller: Silicon Image, Inc. SiI 3512
[SATALink/SATARaid] Serial ATA Controller (rev 01)
Currently running kernel 2.6.24-21

Not much fun when disks don't work properly, is it? :-(

T

2009/1/2 Bernd Schubert <bs@q-leap.de>:
> Hello Bengt,
>
> sil3114 is known to cause data corruption with some disks. So far I only know
> about Seagate, but maybe there are issues with newer Samsungs as well?
>
> http://lkml.indiana.edu/hypermail/linux/kernel/0710.2/2035.html
>
> Unfortunately this issue has simply been ignored by the SATA developers :(
> So if you want to be on the safe side, go and get another controller.
>
> I hope I won't frighten you too much, but it might also be that one of
> your disks has a problem; I have also seen a few broken disks which don't
> return what you write to them...
>
>
> Cheers,
> Bernd
>
>
> On Fri, Jan 02, 2009 at 07:42:30AM -0500, Justin Piszcz wrote:
>>
>>
>> On Fri, 2 Jan 2009, Bengt Samuelsson wrote:
>>
>>>
>>> Hi,
>>>
>>> I need some support for this soft-RAID system.
>>>
>>> I am running it as RAID5 with 4 Samsung SpinPoint 500G SATA300 disks, 1.3T in total.
>>>
>>> It runs on http://sm7jqb.dnsalias.com
>>> I use mdadm on a Debian Linux system.
>>> CPU 1.2GHz, 1G memory (my older 433MHz / 512M machine doesn't work at all).
>>>
>>> I have some corrupt data, and I don't understand why or how to fix it.
>>> Maybe slowing it down more would help, but how do I slow it down?
>>>
>>> Anyone with experience of this cheap kind of RAID system?
>>>
>>> Ask for more information and I can get it: logs, setup files and
>>> whatever you want to know.
>>>
>>> --
>>> Bengt Samuelsson
>>> Nydalavägen 30 A
>>> 352 48 Växjö
>>>
>>> +46(0)703686441
>>>
>>> http://sm7jqb.se
>>>
>>>
>>>
>>
>> If this is an mdadm-related RAID (not dmraid), please show all relevant md
>> info (mdadm -D /dev/md0). I have cc'd linux-raid on this thread for you.
>>
>> You'll want to read md.txt in /usr/src/linux/Documentation and read up on
>> the check and repair commands.
>>
>> In addition, have you run memtest86 on your system first, to make sure it's
>> not memory related?
>>
>> Justin.
>
>


* Re: Corrupt data - RAID sata_sil 3114 chip
  2009-01-02 21:30   ` Bernd Schubert
  2009-01-02 21:47     ` Twigathy
@ 2009-01-03  2:31     ` Redeeman
  2009-01-03 13:13       ` Bernd Schubert
  2009-01-03 13:39     ` Alan Cox
  2009-01-03 22:19     ` James Youngman
  3 siblings, 1 reply; 29+ messages in thread
From: Redeeman @ 2009-01-03  2:31 UTC (permalink / raw)
  To: Bernd Schubert; +Cc: Justin Piszcz, bengt, debian-user, linux-raid, linux-ide

On Fri, 2009-01-02 at 22:30 +0100, Bernd Schubert wrote:
> Hello Bengt,
> 
> sil3114 is known to cause data corruption with some disks. So far I only know
> about Seagate, but maybe there are issues with newer Samsungs as well?
>
> http://lkml.indiana.edu/hypermail/linux/kernel/0710.2/2035.html
>
> Unfortunately this issue has simply been ignored by the SATA developers :(
> So if you want to be on the safe side, go and get another controller.

Are you sure? Is this not the "15" or "slow_down" thing mentioned here:
http://ata.wiki.kernel.org/index.php/Sata_sil ?
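For reference, slow_down is a regular sata_sil module parameter. A minimal sketch of how one might inspect it and persist a setting follows; the sketch writes to a local file so it is safe to run anywhere, and the real config location (assumed here) would be /etc/modprobe.d/sata_sil.conf:

```shell
#!/bin/sh
# Sketch: inspect and persist the sata_sil "slow_down" parameter.
# CONF defaults to a local file so this is safe to run anywhere;
# installing it as /etc/modprobe.d/sata_sil.conf is an assumption
# about the local modprobe setup.
CONF=${CONF:-./sata_sil.conf}

p=/sys/module/sata_sil/parameters/slow_down
if [ -r "$p" ]; then
    echo "slow_down is currently: $(cat "$p")"
else
    echo "sata_sil not loaded on this machine; cannot read $p"
fi

# Persist the setting for the next time the module is loaded.
echo 'options sata_sil slow_down=1' > "$CONF"
echo "wrote $CONF (install under /etc/modprobe.d/ and reload sata_sil)"
```

Loading the module by hand with `modprobe sata_sil slow_down=1` has the same effect for the current session.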
> 
> I hope I won't frighten you too much, but it might also be that one of
> your disks has a problem; I have also seen a few broken disks which don't
> return what you write to them...
> 
> 
> Cheers,
> Bernd
> 
> 
> On Fri, Jan 02, 2009 at 07:42:30AM -0500, Justin Piszcz wrote:
> >
> >
> > On Fri, 2 Jan 2009, Bengt Samuelsson wrote:
> >
> >>
> >> Hi,
> >>
> >> I need some support for this soft-RAID system.
> >>
> >> I am running it as RAID5 with 4 Samsung SpinPoint 500G SATA300 disks, 1.3T in total.
> >>
> >> It runs on http://sm7jqb.dnsalias.com
> >> I use mdadm on a Debian Linux system.
> >> CPU 1.2GHz, 1G memory (my older 433MHz / 512M machine doesn't work at all).
> >>
> >> I have some corrupt data, and I don't understand why or how to fix it.
> >> Maybe slowing it down more would help, but how do I slow it down?
> >>
> >> Anyone with experience of this cheap kind of RAID system?
> >>
> >> Ask for more information and I can get it: logs, setup files and
> >> whatever you want to know.
> >>
> >> -- 
> >> Bengt Samuelsson
> >> Nydalavägen 30 A
> >> 352 48 Växjö
> >>
> >> +46(0)703686441
> >>
> >> http://sm7jqb.se
> >>
> >>
> >>
> >
> > If this is an mdadm-related RAID (not dmraid), please show all relevant md
> > info (mdadm -D /dev/md0). I have cc'd linux-raid on this thread for you.
> >
> > You'll want to read md.txt in /usr/src/linux/Documentation and read up on
> > the check and repair commands.
> >
> > In addition, have you run memtest86 on your system first, to make sure it's
> > not memory related?
> >
> > Justin.
> 



* Re: Corrupt data - RAID sata_sil 3114 chip
  2009-01-03  2:31     ` Redeeman
@ 2009-01-03 13:13       ` Bernd Schubert
  0 siblings, 0 replies; 29+ messages in thread
From: Bernd Schubert @ 2009-01-03 13:13 UTC (permalink / raw)
  To: Redeeman; +Cc: Justin Piszcz, bengt, debian-user, linux-raid, linux-ide

On Saturday 03 January 2009 03:31:57 Redeeman wrote:
> On Fri, 2009-01-02 at 22:30 +0100, Bernd Schubert wrote:
> > Hello Bengt,
> >
> > sil3114 is known to cause data corruption with some disks. So far I only
> > know about Seagate, but maybe there are issues with newer Samsungs as well?
> >
> > http://lkml.indiana.edu/hypermail/linux/kernel/0710.2/2035.html
> >
> > Unfortunately this issue has simply been ignored by the SATA developers :(
> > So if you want to be on the safe side, go and get another controller.
>
> Are you sure? Is this not the "15" or "slow_down" thing mentioned here:
> http://ata.wiki.kernel.org/index.php/Sata_sil ?
>

According to Jeff Garzik and Tejun Heo, the 3114 is not affected by the mod15
bug. The mod15 workaround also helps in our case, but probably we are just lucky.

https://kerneltrap.org/mailarchive/linux-kernel/2007/10/11/334985/thread


Cheers,
Bernd

-- 
Bernd Schubert
Q-Leap Networks GmbH


* Re: Corrupt data - RAID sata_sil 3114 chip
  2009-01-02 21:30   ` Bernd Schubert
  2009-01-02 21:47     ` Twigathy
  2009-01-03  2:31     ` Redeeman
@ 2009-01-03 13:39     ` Alan Cox
  2009-01-03 16:20       ` Bernd Schubert
  2009-01-03 22:19     ` James Youngman
  3 siblings, 1 reply; 29+ messages in thread
From: Alan Cox @ 2009-01-03 13:39 UTC (permalink / raw)
  To: Bernd Schubert; +Cc: Justin Piszcz, bengt, debian-user, linux-raid, linux-ide

On Fri, 2 Jan 2009 22:30:07 +0100
Bernd Schubert <bs@q-leap.de> wrote:

> Hello Bengt,
> 
> sil3114 is known to cause data corruption with some disks. 

News to me. There are a few people with lots of SI and other devices
jammed into the same mainboard who had problems but that doesn't appear
to be an SI problem as far as I can tell.

There are some incompatibilities between certain silicon image chips and
Nvidia chipsets needing BIOS workarounds according to the errata docs.

Alan


* Re: Corrupt data - RAID sata_sil 3114 chip
  2009-01-03 13:39     ` Alan Cox
@ 2009-01-03 16:20       ` Bernd Schubert
  2009-01-03 18:31         ` Robert Hancock
  0 siblings, 1 reply; 29+ messages in thread
From: Bernd Schubert @ 2009-01-03 16:20 UTC (permalink / raw)
  To: Alan Cox; +Cc: Justin Piszcz, bengt, debian-user, linux-raid, linux-ide

On Sat, Jan 03, 2009 at 01:39:36PM +0000, Alan Cox wrote:
> On Fri, 2 Jan 2009 22:30:07 +0100
> Bernd Schubert <bs@q-leap.de> wrote:
> 
> > Hello Bengt,
> > 
> > sil3114 is known to cause data corruption with some disks. 
> 
> News to me. There are a few people with lots of SI and other devices

No no, you just forgot about it, since you even reviewed the patches ;)

http://lkml.org/lkml/2007/10/11/137

> jammed into the same mainboard who had problems but that doesn't appear
> to be an SI problem as far as I can tell.
> 
> There are some incompatibilities between certain silicon image chips and
> Nvidia chipsets needing BIOS workarounds according to the errata docs.

Well, I already posted the links to the discussion we had in the past.
The corruption issue is easily reproducible on a Tyan S2882 with AMD-8111,
SiI 3114 and ST3250820AS disks. This is on a compute cluster, and we ran into
the problem when a few ST3200822AS disks failed and were replaced by newer
250GB disks. The 200GB ST3200822AS work perfectly fine, while the 250GB
ST3250820AS disks cause data corruption.

Presently the cluster is idle, so your help in properly solving the issue
would be highly appreciated (*).


Cheers,
Bernd

PS: The patches I posted work fine on these systems, but they are not upstream,
and I would really prefer to find a way in vanilla Linux to prevent this
data corruption.

PPS: It's a bit funny with this cluster, since it is located at my university
group and I did, and still do, many calculations on it myself. But presently I
work for the company we bought it from, which is responsible for maintaining it... ;)


* Re: Corrupt data - RAID sata_sil 3114 chip
  2009-01-03 16:20       ` Bernd Schubert
@ 2009-01-03 18:31         ` Robert Hancock
  0 siblings, 0 replies; 29+ messages in thread
From: Robert Hancock @ 2009-01-03 18:31 UTC (permalink / raw)
  To: linux-ide; +Cc: debian-user, linux-raid

Bernd Schubert wrote:
> On Sat, Jan 03, 2009 at 01:39:36PM +0000, Alan Cox wrote:
>> On Fri, 2 Jan 2009 22:30:07 +0100
>> Bernd Schubert <bs@q-leap.de> wrote:
>>
>>> Hello Bengt,
>>>
>>> sil3114 is known to cause data corruption with some disks. 
>> News to me. There are a few people with lots of SI and other devices
> 
> No no, you just forgot about it, since you even reviewed the patches ;)
> 
> http://lkml.org/lkml/2007/10/11/137

And Jeff explained why they were not merged:

http://lkml.org/lkml/2007/10/11/166

All the patches do is try to reduce the speed impact of the workaround.
But as was pointed out, they don't reliably solve the problem the
workaround is trying to fix, and besides, the workaround is already not
applied to the SiI 3114 at all, as it is apparently not applicable on that
controller (only the 3112).

> 
>> jammed into the same mainboard who had problems but that doesn't appear
>> to be an SI problem as far as I can tell.
>>
>> There are some incompatibilities between certain silicon image chips and
>> Nvidia chipsets needing BIOS workarounds according to the errata docs.

Do you have details of these Alan?

> 
> Well, I already posted the links to the discussion we had in the past.
> The corruption issue is easily reproducible on a Tyan S2882 with AMD-8111,
> SiI 3114 and ST3250820AS disks. This is on a compute cluster, and we ran into
> the problem when a few ST3200822AS disks failed and were replaced by newer
> 250GB disks. The 200GB ST3200822AS work perfectly fine, while the 250GB
> ST3250820AS disks cause data corruption.
>
> Presently the cluster is idle, so your help in properly solving the issue
> would be highly appreciated (*).
> 
> 
> Cheers,
> Bernd
> 
> PS: The patches I posted work fine on these systems, but they are not upstream,
> and I would really prefer to find a way in vanilla Linux to prevent this
> data corruption.

Some people have tried turning on the slow_down option or adding their 
drive to the mod15 blacklist and found that problems went away, but that 
in no way implies that their setup actually needs this workaround, only 
that it slows down the IO enough that the problem no longer shows up. 
It's a big hammer that can cover up all kinds of other issues and has 
confused a lot of people into thinking the mod15write problem is bigger 
than it actually is.

> 
> PPS: It's a bit funny with this cluster, since it is located at my university
> group and I did, and still do, many calculations on it myself. But presently I
> work for the company we bought it from, which is responsible for maintaining it... ;)



* Re: Corrupt data - RAID sata_sil 3114 chip
@ 2009-01-03 20:04 Bernd Schubert
  2009-01-03 20:53 ` Robert Hancock
  2009-01-07  4:59 ` Tejun Heo
  0 siblings, 2 replies; 29+ messages in thread
From: Bernd Schubert @ 2009-01-03 20:04 UTC (permalink / raw)
  To: Robert Hancock
  Cc: Alan Cox, Justin Piszcz, debian-user, linux-raid, linux-ide

[sorry, sent again, since Robert dropped all the mailing list CCs and I didn't
notice at first]

On Sat, Jan 03, 2009 at 12:31:12PM -0600, Robert Hancock wrote:
> Bernd Schubert wrote:
>> On Sat, Jan 03, 2009 at 01:39:36PM +0000, Alan Cox wrote:
>>> On Fri, 2 Jan 2009 22:30:07 +0100
>>> Bernd Schubert <bs@q-leap.de> wrote:
>>>
>>>> Hello Bengt,
>>>>
>>>> sil3114 is known to cause data corruption with some disks. 
>>> News to me. There are a few people with lots of SI and other devices
>>
>> No no, you just forgot about it, since you even reviewed the patches ;)
>>
>> http://lkml.org/lkml/2007/10/11/137
>
> And Jeff explained why they were not merged:
>
> http://lkml.org/lkml/2007/10/11/166
>
> All the patches do is try to reduce the speed impact of the workaround.
> But as was pointed out, they don't reliably solve the problem the
> workaround is trying to fix, and besides, the workaround is already not
> applied to the SiI 3114 at all, as it is apparently not applicable on that
> controller (only the 3112).

Well, they do reliably solve the problem in our case (before taking the patch
into production I ran checksum tests for about two weeks). Anyway, I entirely
understand that the patches didn't get accepted.

But now more than a year has passed without anything being done about it,
and that is what I strongly criticize. Most people don't know about issues
like this and don't run file checksum tests, as I now always do before
taking a disk into production. So users are exposed to known data corruption
problems without even being warned about them. Usually even backups don't
help, since one creates a backup of the corrupted data.
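A minimal sketch of the kind of write-and-verify checksum test described above (an assumed procedure, not Bernd's actual script; file count and sizes are kept small for illustration):

```shell
#!/bin/sh
# Fill a directory on the array under test with pseudo-random files,
# record their SHA-1 sums, flush to disk, then verify what reads back.
# Any "FAILED" line from sha1sum -c indicates silent corruption.
set -e
TESTDIR=${1:-./checksum-test}
mkdir -p "$TESTDIR"
for i in 1 2 3 4 5 6 7 8; do
    dd if=/dev/urandom of="$TESTDIR/file$i" bs=1M count=4 2>/dev/null
done
( cd "$TESTDIR" && sha1sum file? > SHA1SUMS )
sync
# Drop the page cache so the verify pass really hits the disks
# (needs root; silently skipped otherwise).
echo 3 > /proc/sys/vm/drop_caches 2>/dev/null || true
( cd "$TESTDIR" && sha1sum -c SHA1SUMS )
```

In practice one would point TESTDIR at the suspect array, use much larger files, and repeat the verify pass for days, as described above.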

So IMHO, the driver should be deactivated for sil3114 until a real solution is
found, and it should only be possible to force-activate it with a kernel flag,
which would then also print a huuuge warning about possible data corruption
(unfortunately most distributions disable initial kernel messages *grumble*).


Cheers,
Bernd





* Re: Corrupt data - RAID sata_sil 3114 chip
  2009-01-03 20:04 Bernd Schubert
@ 2009-01-03 20:53 ` Robert Hancock
  2009-01-03 21:11   ` Bernd Schubert
  2009-01-07  4:59 ` Tejun Heo
  1 sibling, 1 reply; 29+ messages in thread
From: Robert Hancock @ 2009-01-03 20:53 UTC (permalink / raw)
  To: Bernd Schubert
  Cc: Alan Cox, Justin Piszcz, debian-user, linux-raid, linux-ide

Bernd Schubert wrote:
> [sorry sent again, since Robert dropped all mailing list CCs and I didn't 
> notice first]
> 
> On Sat, Jan 03, 2009 at 12:31:12PM -0600, Robert Hancock wrote:
>> Bernd Schubert wrote:
>>> On Sat, Jan 03, 2009 at 01:39:36PM +0000, Alan Cox wrote:
>>>> On Fri, 2 Jan 2009 22:30:07 +0100
>>>> Bernd Schubert <bs@q-leap.de> wrote:
>>>>
>>>>> Hello Bengt,
>>>>>
>>>>> sil3114 is known to cause data corruption with some disks. 
>>>> News to me. There are a few people with lots of SI and other devices
>>> No no, you just forgot about it, since you even reviewed the patches ;)
>>>
>>> http://lkml.org/lkml/2007/10/11/137
>> And Jeff explained why they were not merged:
>>
>> http://lkml.org/lkml/2007/10/11/166
>>
>> All the patch does is try to reduce the speed impact of the workaround.  
>> But as was pointed out, they don't reliably solve the problem the  
>> workaround is trying to fix, and besides, the workaround is already not  
>> applied to SiI3114 at all, as it is apparently not applicable on that  
>> controller (only 3112).
> 
> Well, they do reliably solve the problem in our case (before taking the patch
> into production I ran checksum tests for about two weeks). Anyway, I entirely
> understand that the patches didn't get accepted.
>
> But now more than a year has passed without anything being done about it,
> and that is what I strongly criticize. Most people don't know about issues
> like this and don't run file checksum tests, as I now always do before
> taking a disk into production. So users are exposed to known data corruption
> problems without even being warned about them. Usually even backups don't
> help, since one creates a backup of the corrupted data.
>
> So IMHO, the driver should be deactivated for sil3114 until a real solution is
> found, and it should only be possible to force-activate it with a kernel flag,
> which would then also print a huuuge warning about possible data corruption
> (unfortunately most distributions disable initial kernel messages *grumble*).

If the corruption was happening on all such controllers then people 
would have been complaining in droves and something would have been 
done. It seems much more likely that in this case the problem is some 
kind of hardware fault or combination of hardware which is causing the 
problem. Unfortunately these kinds of not-easily-reproducible issues tend 
to be very hard to track down.


* Re: Corrupt data - RAID sata_sil 3114 chip
  2009-01-03 20:53 ` Robert Hancock
@ 2009-01-03 21:11   ` Bernd Schubert
  2009-01-03 23:23     ` Robert Hancock
  0 siblings, 1 reply; 29+ messages in thread
From: Bernd Schubert @ 2009-01-03 21:11 UTC (permalink / raw)
  To: Robert Hancock
  Cc: Alan Cox, Justin Piszcz, debian-user, linux-raid, linux-ide

On Sat, Jan 03, 2009 at 02:53:09PM -0600, Robert Hancock wrote:
> Bernd Schubert wrote:
>> [sorry sent again, since Robert dropped all mailing list CCs and I 
>> didn't notice first]
>>
>> On Sat, Jan 03, 2009 at 12:31:12PM -0600, Robert Hancock wrote:
>>> Bernd Schubert wrote:
>>>> On Sat, Jan 03, 2009 at 01:39:36PM +0000, Alan Cox wrote:
>>>>> On Fri, 2 Jan 2009 22:30:07 +0100
>>>>> Bernd Schubert <bs@q-leap.de> wrote:
>>>>>
>>>>>> Hello Bengt,
>>>>>>
>>>>>> sil3114 is known to cause data corruption with some disks. 
>>>>> News to me. There are a few people with lots of SI and other devices
>>>> No no, you just forgot about it, since you even reviewed the patches ;)
>>>>
>>>> http://lkml.org/lkml/2007/10/11/137
>>> And Jeff explained why they were not merged:
>>>
>>> http://lkml.org/lkml/2007/10/11/166
>>>
>>> All the patches do is try to reduce the speed impact of the
>>> workaround. But as was pointed out, they don't reliably solve the
>>> problem the workaround is trying to fix, and besides, the workaround
>>> is already not applied to the SiI 3114 at all, as it is apparently not
>>> applicable on that controller (only the 3112).
>>
>> Well, they do reliably solve the problem in our case (before taking the patch
>> into production I ran checksum tests for about two weeks). Anyway, I entirely
>> understand that the patches didn't get accepted.
>>
>> But now more than a year has passed without anything being done about it,
>> and that is what I strongly criticize. Most people don't know about issues
>> like this and don't run file checksum tests, as I now always do before
>> taking a disk into production. So users are exposed to known data corruption
>> problems without even being warned about them. Usually even backups don't
>> help, since one creates a backup of the corrupted data.
>>
>> So IMHO, the driver should be deactivated for sil3114 until a real
>> solution is found, and it should only be possible to force-activate it
>> with a kernel flag, which would then also print a huuuge warning about
>> possible data corruption (unfortunately most distributions disable
>> initial kernel messages *grumble*).
>
> If the corruption was happening on all such controllers then people  
> would have been complaining in droves and something would have been  
> done. It seems much more likely that in this case the problem is some  
> kind of hardware fault or combination of hardware which is causing the  
>> problem. Unfortunately these kinds of not-easily-reproducible issues tend 
> to be very hard to track down.
>

Well yes, it only happens with certain drives, and those drives work fine on
other controllers. But these are by now known issues, and nothing is being
done about them. I would happily help to solve the problem; I just don't have
any knowledge of hardware programming. What would be your next step if you
had remote access to such a system?

Thanks,
Bernd


* Re: Corrupt data - RAID sata_sil 3114 chip
  2009-01-02 21:30   ` Bernd Schubert
                       ` (2 preceding siblings ...)
  2009-01-03 13:39     ` Alan Cox
@ 2009-01-03 22:19     ` James Youngman
  3 siblings, 0 replies; 29+ messages in thread
From: James Youngman @ 2009-01-03 22:19 UTC (permalink / raw)
  To: Bernd Schubert; +Cc: Justin Piszcz, bengt, debian-user, linux-raid, linux-ide

On Fri, Jan 2, 2009 at 9:30 PM, Bernd Schubert <bs@q-leap.de> wrote:
> Hello Bengt,
>
> sil3114 is known to cause data corruption with some disks. So far I only know
> about Seagate, but maybe there are issues with newer Samsungs as well?

I've experienced data corruption with a SII 0680 ACLU144 (on an ST
Labs' A-132 card) with a pair of Seagate ST3300622A drives.  I was
using them with MD in a RAID1 configuration.

James.


* Re: Corrupt data - RAID sata_sil 3114 chip
  2009-01-03 21:11   ` Bernd Schubert
@ 2009-01-03 23:23     ` Robert Hancock
  0 siblings, 0 replies; 29+ messages in thread
From: Robert Hancock @ 2009-01-03 23:23 UTC (permalink / raw)
  To: Bernd Schubert
  Cc: Alan Cox, Justin Piszcz, debian-user, linux-raid, linux-ide

Bernd Schubert wrote:
> On Sat, Jan 03, 2009 at 02:53:09PM -0600, Robert Hancock wrote:
>> Bernd Schubert wrote:
>>> [sorry sent again, since Robert dropped all mailing list CCs and I 
>>> didn't notice first]
>>>
>>> On Sat, Jan 03, 2009 at 12:31:12PM -0600, Robert Hancock wrote:
>>>> Bernd Schubert wrote:
>>>>> On Sat, Jan 03, 2009 at 01:39:36PM +0000, Alan Cox wrote:
>>>>>> On Fri, 2 Jan 2009 22:30:07 +0100
>>>>>> Bernd Schubert <bs@q-leap.de> wrote:
>>>>>>
>>>>>>> Hello Bengt,
>>>>>>>
>>>>>>> sil3114 is known to cause data corruption with some disks. 
>>>>>> News to me. There are a few people with lots of SI and other devices
>>>>> No no, you just forgot about it, since you even reviewed the patches ;)
>>>>>
>>>>> http://lkml.org/lkml/2007/10/11/137
>>>> And Jeff explained why they were not merged:
>>>>
>>>> http://lkml.org/lkml/2007/10/11/166
>>>>
>>>> All the patches do is try to reduce the speed impact of the
>>>> workaround. But as was pointed out, they don't reliably solve the
>>>> problem the workaround is trying to fix, and besides, the workaround
>>>> is already not applied to the SiI 3114 at all, as it is apparently not
>>>> applicable on that controller (only the 3112).
>>> Well, they do reliably solve the problem in our case (before taking the patch
>>> into production I ran checksum tests for about two weeks). Anyway, I entirely
>>> understand that the patches didn't get accepted.
>>>
>>> But now more than a year has passed without anything being done about it,
>>> and that is what I strongly criticize. Most people don't know about issues
>>> like this and don't run file checksum tests, as I now always do before
>>> taking a disk into production. So users are exposed to known data corruption
>>> problems without even being warned about them. Usually even backups don't
>>> help, since one creates a backup of the corrupted data.
>>>
>>> So IMHO, the driver should be deactivated for sil3114 until a real
>>> solution is found, and it should only be possible to force-activate it
>>> with a kernel flag, which would then also print a huuuge warning about
>>> possible data corruption (unfortunately most distributions disable
>>> initial kernel messages *grumble*).
>> If the corruption was happening on all such controllers then people  
>> would have been complaining in droves and something would have been  
>> done. It seems much more likely that in this case the problem is some  
>> kind of hardware fault or combination of hardware which is causing the  
>> problem. Unfortunately these kinds of not-easily-reproducible issues tend 
>> to be very hard to track down.
>>
> 
> Well yes, it only happens with certain drives, and those drives work fine on
> other controllers. But these are by now known issues, and nothing is being
> done about them. I would happily help to solve the problem; I just don't have
> any knowledge of hardware programming. What would be your next step if you
> had remote access to such a system?

Have you been able to track down exactly what kind of corruption is
occurring, i.e. what is happening to the data? Is data being zeroed out,
random bits flipped, chunks of a certain size corrupted, etc.? That would
likely be useful in determining where to go next.
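One way to answer that question (an assumed diagnostic recipe, not something prescribed in the thread) is to compare a known-good copy of a file against the copy read back from the suspect array with cmp -l and look at the pattern of differing bytes. The two sample files below stand in for a good and a corrupted copy:

```shell
#!/bin/sh
# Two stand-in files: the "suspect" copy has bytes 6-7 zeroed out,
# mimicking a zeroed region on disk.
printf 'hello world' > good.bin
printf 'hello\0\0orld' > suspect.bin

# cmp -l lists every differing byte: offset (1-based), then the value
# in each file, in octal. Runs of differences ending in "0" suggest
# zeroing; offsets clustered at multiples of a power of two suggest
# sector- or chunk-sized damage; single-bit changes suggest bus or
# memory errors.
cmp -l good.bin suspect.bin || true   # cmp exits non-zero on difference

# A quick count of how many bytes differ:
cmp -l good.bin suspect.bin | wc -l
```

With real data, good.bin would be the original file and suspect.bin the copy read back from the array.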


* Re: Corrupt data - RAID sata_sil 3114 chip
       [not found]             ` <4963306F.4060504@sm7jqb.se>
@ 2009-01-06 10:48               ` Justin Piszcz
  0 siblings, 0 replies; 29+ messages in thread
From: Justin Piszcz @ 2009-01-06 10:48 UTC (permalink / raw)
  To: Bengt Samuelsson; +Cc: debian-user, linux-ide, linux-raid


cc: linux-ide, linux-raid

There was some talk about corruption on these chips, I believe; hopefully
someone on the list can offer further insight.

On Tue, 6 Jan 2009, Bengt Samuelsson wrote:

> Justin Piszcz skrev:
>> 
>> 
>> On Mon, 5 Jan 2009, Bengt Samuelsson wrote:
>> 
>>> Justin Piszcz skrev:
>>>> 
>>>> 
>>>> On Sun, 4 Jan 2009, Bengt Samuelsson wrote:
>>>> 
>>>>> Bengt Samuelsson skrev:
>>>>> 
>>>>> ~# mdadm -D /dev/md0
>>>>> ------------------------------
>>>>> /dev/md0:
>>>>>         Version : 00.90.03
>>>>>   Creation Time : Fri Sep 12 19:08:22 2008
>>>>>      Raid Level : raid5
>>>>>      Array Size : 1465151616 (1397.28 GiB 1500.32 GB)
>>>>>     Device Size : 488383872 (465.76 GiB 500.11 GB)
>>>>>    Raid Devices : 4
>>>>>   Total Devices : 4
>>>>> Preferred Minor : 0
>>>>>     Persistence : Superblock is persistent
>>>>>
>>>>>     Update Time : Fri Jan  2 16:53:10 2009
>>>>>           State : clean
>>>>>  Active Devices : 4
>>>>> Working Devices : 4
>>>>>  Failed Devices : 0
>>>>>   Spare Devices : 0
>>>>>
>>>>>          Layout : left-symmetric
>>>>>      Chunk Size : 128K
>>>>>
>>>>>            UUID : 68439662:90431c4a:5e66217b:5a1a585f (local to host 
>>>>> sm7jqb.dnsalias.com)
>>>>>          Events : 0.13406
>>>>>
>>>>>     Number   Major   Minor   RaidDevice State
>>>>>        0       8        1        0      active sync   /dev/sda1
>>>>>        1       8       17        1      active sync   /dev/sdb1
>>>>>        2       8       33        2      active sync   /dev/sdc1
>>>>>        3       8       49        3      active sync   /dev/sdd1
>>>>> ------------------------------
>>>>> 
>>>>> 
>>>>> 
>>> 
>>>>> No memory error
>>>>> 
>>>>> L1 Cache 128 7361MB/s
>>>>> L2 Cache 64k 3260MB/s
>>>>> Mem 1024M 275MB/s
>>>>> 
>>>>> Next to check for?
>>>>> 
>>>>> -- 
>>>> 
>>>> You ran a check on the array and then checked mismatch_cnt?
>>> Like this?
>>> ~# fsck.ext3 -y -v /dev/md0
>>> Do you want to see any log? Maybe I do not understand.
>>> It works for 95%; I want it to work 100%.
>>> 
>>> /var/log/fsck/ceheks
>>> Log of fsck -C -R -A -a
>>> Sun Jan  4 16:30:05 2009
>>> 
>>> fsck 1.40-WIP (14-Nov-2006)
>>> /: clean, 21179/987712 files, 648652/1973160 blocks
>>> boot: clean, 30/32128 files, 22378/128488 blocks
>>> /dev/md0: clean, 142094/183156736 files, 23162450/366287904 blocks (check 
>>> after next mount)
>>> 
>>> Sun Jan  4 16:30:06 2009
>>> ----------------
>>> 
>>> Can I see the sata_sil parameters?
>>> Or test something there?
>>> For me it should slow down a bit more.
>>> 
>>> 
>>> 
>> 
>> Run a check on the array:
>> p34:~# echo check > /sys/devices/virtual/block/md0/md/sync_action
>
> I found /sys/block/md0/md/sync_action
> idle
>
> I don't find 'check' in my box, but I run this every 2nd day, it helps a bit.
> /etc/cron.d/mdadm/
> ...
> 5 0 * * 1,3,5 root [ -x /usr/share/mdadm/checkarray ] \
> && /usr/share/mdadm/checkarray --cron --all --quiet
> ...
>
>> p34:~#
>> 
>> Watch the status:
>> p34:~# cat /proc/mdstat
> ---
> Personalities : [raid6] [raid5] [raid4]
> md0 : active raid5 sda1[0] sdd1[3] sdc1[2] sdb1[1]
>      1465151616 blocks level 5, 128k chunk, algorithm 2 [4/4] [UUUU]
>
> unused devices: <none>
> ---
>> 
>> When its done, run:
>> 
>> p34:~# cat /sys/devices/virtual/block/md0/md/mismatch_cnt
> /sys/block/md0/md/mismatch_cnt 0
>
>> p34:~#
>> 
>> And show the output.
>> 
>> Justin.
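[Editor's note: the check-and-verify procedure quoted above can be sketched end to end; paths assume the usual md sysfs layout, and the script deliberately skips itself on machines without a writable md0 array.]

```shell
# End-to-end RAID consistency check, per the quoted instructions.
# Guarded so it is a no-op without an md0 array (or without root).
MD=/sys/block/md0/md
if [ -w "$MD/sync_action" ]; then
    echo check > "$MD/sync_action"          # start a background scrub
    while grep -q check "$MD/sync_action"; do
        cat /proc/mdstat                    # watch progress
        sleep 60
    done
    result="mismatch_cnt=$(cat "$MD/mismatch_cnt")"  # nonzero => blocks disagreed
else
    result="md0 not present or not writable; skipping check"
fi
echo "$result"
```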
> I find
> /sys/module/sata_sil/parameters/slow_down
> 0
>
> and some more, all looks ok to me!
>
> My motherboard is KT7A-RAID (not using that raid)
> AMD 1.2Ghz
> I have a pdf manual.
>
> SATA controller board SA3114-4IR
> http://it.us.syba.com/product/43/02/05/index.html
>
> Now I have -3C and sunshine outside, I have to go out in the sun now!
>
> -- 
> Bengt Samuelsson
> Nydalavägen 30 A
> 352 48 Växjö
>
> +46(0)703686441
>
> http://sm7jqb.se
>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Corrupt data - RAID sata_sil 3114 chip
  2009-01-03 20:04 Bernd Schubert
  2009-01-03 20:53 ` Robert Hancock
@ 2009-01-07  4:59 ` Tejun Heo
  2009-01-07  5:38   ` Robert Hancock
  1 sibling, 1 reply; 29+ messages in thread
From: Tejun Heo @ 2009-01-07  4:59 UTC (permalink / raw)
  To: Bernd Schubert
  Cc: Robert Hancock, Alan Cox, Justin Piszcz, debian-user, linux-raid,
	linux-ide

Hello,

Bernd Schubert wrote:
> But now more than a year has passed again without doing anything
> about it and actually this is what I strongly criticize. Most people
> don't know about issues like that and don't run file checksum tests
> as I now always do before taking a disk into production. So users
> are exposed to known data corruption problems without even being
> warned about it. Usually even backups don't help, since one creates
> a backup of the corrupted data.

sata_sil is one of the most popular controllers, and the data corruption
reports seem to be concentrated on certain chipsets, so I don't think
it's a widespread problem.  In some cases, the corruption was very
reproducible, too.

I think it's something related to setting up the PCI side of things.
There have been hints that an incorrect CLS setting was the culprit, and I
tried the combinations, but without any success; unfortunately the
problem wasn't reproducible with the hardware I have here.  :-(

Anyways, there was an interesting report that updating the BIOS on the
controller fixed the problem.

  http://bugzilla.kernel.org/show_bug.cgi?id=10480  

Taking "lspci -nnvvvxxx" output of before and after such BIOS update
will shed some light on what's really going on.  Can you please try
that?
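[Editor's note: that comparison might be sketched like so, with stand-in data since real `lspci -nnvvvxxx` output needs the hardware in front of you.]

```shell
# Capture PCI config space before and after the BIOS update, then diff.
# On the real machine you would run, e.g.:
#   lspci -nnvvvxxx -s 03:05.0 > before.txt   (and again into after.txt)
# The printf lines below are stand-ins illustrating the kind of change
# to look for (a cache line size the old BIOS left unset).
printf 'Latency: 64, Cache Line Size: 0 bytes\n'  > before.txt
printf 'Latency: 64, Cache Line Size: 64 bytes\n' > after.txt
diff -u before.txt after.txt || true    # diff exits 1 when the files differ
```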

> So IMHO, the driver should be deactivated for sil3114 until a real
> solution is found. It should only be possible to force-activate
> it by a kernel flag, which would then also print a huge warning
> about possible data corruption (unfortunately most distributions
> disable initial kernel messages *grumble*).

The problem is serious but the scope is quite limited and we can't
tell where the problem lies, so I'm not too sure about taking such a
drastic measure.  Grumble...

Yeah, I really want to see this long standing problem fixed.  To my
knowledge, this is one of two still-open data corruption bugs - the
other one being VIA putting CDB bytes into burnt CD/DVDs.

So, if you can try the BIOS update thing, please give it a shot.

Thanks.

-- 
tejun


* Re: Corrupt data - RAID sata_sil 3114 chip
  2009-01-07  4:59 ` Tejun Heo
@ 2009-01-07  5:38   ` Robert Hancock
  2009-01-07 15:31     ` Bernd Schubert
  0 siblings, 1 reply; 29+ messages in thread
From: Robert Hancock @ 2009-01-07  5:38 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Bernd Schubert, Alan Cox, Justin Piszcz, debian-user, linux-raid,
	linux-ide

Tejun Heo wrote:
> Hello,
> 
> Bernd Schubert wrote:
>> But now more than a year has passed again without doing anything
>> about it and actually this is what I strongly criticize. Most people
>> don't know about issues like that and don't run file checksum tests
>> as I now always do before taking a disk into production. So users
>> are exposed to known data corruption problems without even being
>> warned about it. Usually even backups don't help, since one creates
>> a backup of the corrupted data.
> 
> sata_sil being one of the most popular controllers && data corruption
> reports seem to be concentrated on certain chipsets, I don't think
> it's a wide spread problem.  In some cases, the corruption was very
> reproducible too.
> 
> I think it's something related to setting up the PCI side of things.
> There have been hints that incorrect CLS setting was the culprit and I
> tried the combinations but without any success and unfortunately the
> problem wasn't reproducible with the hardware I have here.  :-(

As far as the cache line size register goes, the only thing the documentation 
says it controls _directly_ is "With the SiI3114 as a master, initiating 
a read transaction, it issues PCI command Read Multiple in place, when 
empty space in its FIFO is larger than the value programmed in this 
register."

The interesting thing is the commit (log below) that added code to the 
driver to check the PCI cache line size register and set up the FIFO 
thresholds:

   2005/03/24 23:32:42-05:00 Carlos.Pardo
   [PATCH] sata_sil: Fix FIFO PCI Bus Arbitration

   This patch set default values for the FIFO PCI Bus Arbitration to
   avoid data corruption. The root cause is due to our PCI bus master
   handling mismatch with the chipset PCI bridge during DMA xfer (write
   data to the device). The patch is to setup the DMA fifo threshold so
   that there is no chance for the DMA engine to change protocol. We have
   seen this problem only on one motherboard.

   Signed-off-by: Silicon Image Corporation <cpardo@siliconimage.com>
   Signed-off-by: Jeff Garzik <jgarzik@pobox.com>

What the code's doing is setting the FIFO thresholds, used to assign 
priority when requesting a PCI bus read or write operation, based on the 
cache line size somehow. It seems to be trusting that the chip's cache 
line size register has been set properly by the BIOS. The kernel should 
know what the cache line size is but AFAIK normally only sets it when 
the driver requests MWI. This chip doesn't support MWI, but it looks 
like pci_set_mwi would fix up the CLS register as a side effect..
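[Editor's note: as a rough sketch of the arithmetic involved, based on a reading of the sata_sil code rather than chip documentation: the CLS register counts 32-bit words, so a 64-byte cache line reads back as 0x10, and the driver derives a small per-port FIFO threshold from it.]

```shell
# FIFO threshold derivation as sata_sil.c appears to do it (sketch only).
# PCI_CACHE_LINE_SIZE is in 32-bit words: a 64-byte line -> 0x10.
cls=$((0x10))
thr=$(( (cls >> 3) + 1 ))          # cls/8 + 1
fifo_cfg=$(( (thr << 8) | thr ))   # same threshold for read and write FIFOs
printf 'threshold=%d fifo_cfg=0x%04x\n' "$thr" "$fifo_cfg"
```

If the BIOS leaves CLS at 0, the driver has nothing sensible to derive the thresholds from, which is one way a BIOS update could plausibly change behaviour.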

> 
> Anyways, there was an interesting report that updating the BIOS on the
> controller fixed the problem.
> 
>   http://bugzilla.kernel.org/show_bug.cgi?id=10480  
> 
> Taking "lspci -nnvvvxxx" output of before and after such BIOS update
> will shed some light on what's really going on.  Can you please try
> that?

Yes, that would be quite interesting.. the output even with the current 
BIOS would be useful to see if the BIOS set some stupid cache line size 
value..

> 
>> So IMHO, the driver should be deactived for sil3114 until a real
>> solution is found. And it only should be possible to force activate
>> it by a kernel flag, which then also would print a huuuge warning
>> about possible data corruption (unfortunately most distributions
>> disables inital kernel messages *grumble*).
> 
> The problem is serious but the scope is quite limited and we can't
> tell where the problem lies, so I'm not too sure about taking such
> drastic measure.  Grumble...
> 
> Yeah, I really want to see this long standing problem fixed.  To my
> knowledge, this is one of two still open data corruption bugs - the
> other one being via putting CDB bytes into burnt CD/DVDs.
> 
> So, if you can try the BIOS update thing, please give it a shot.
> 
> Thanks.
> 


* Re: Corrupt data - RAID sata_sil 3114 chip
  2009-01-07  5:38   ` Robert Hancock
@ 2009-01-07 15:31     ` Bernd Schubert
  2009-01-11  0:32       ` Robert Hancock
  0 siblings, 1 reply; 29+ messages in thread
From: Bernd Schubert @ 2009-01-07 15:31 UTC (permalink / raw)
  To: Robert Hancock
  Cc: Tejun Heo, Alan Cox, Justin Piszcz, debian-user, linux-raid,
	linux-ide

On Wednesday 07 January 2009 06:38:28 Robert Hancock wrote:
> Tejun Heo wrote:
> > Hello,
> >
> > Bernd Schubert wrote:
> >> But now more than a year has passed again without doing anything
> >> about it and actually this is what I strongly criticize. Most people
> >> don't know about issues like that and don't run file checksum tests
> >> as I now always do before taking a disk into production. So users
> >> are exposed to known data corruption problems without even being
> >> warned about it. Usually even backups don't help, since one creates
> >> a backup of the corrupted data.
> >
> > sata_sil being one of the most popular controllers && data corruption
> > reports seem to be concentrated on certain chipsets, I don't think
> > it's a wide spread problem.  In some cases, the corruption was very
> > reproducible too.
> >
> > I think it's something related to setting up the PCI side of things.
> > There have been hints that incorrect CLS setting was the culprit and I
> > tried the combinations but without any success and unfortunately the
> > problem wasn't reproducible with the hardware I have here.  :-(
>
> As far as the cache line size register, the only thing the documentation
> says it controls _directly_ is "With the SiI3114 as a master, initiating
> a read transaction, it issues PCI command Read Multiple in place, when
> empty space in its FIFO is larger than the value programmed in this
> register."
>
> The interesting thing is the commit (log below) that added code to the
> driver to check the PCI cache line size register and set up the FIFO
> thresholds:
>
>    2005/03/24 23:32:42-05:00 Carlos.Pardo
>    [PATCH] sata_sil: Fix FIFO PCI Bus Arbitration
>
>    This patch set default values for the FIFO PCI Bus Arbitration to
>    avoid data corruption. The root cause is due to our PCI bus master
>    handling mismatch with the chipset PCI bridge during DMA xfer (write
>    data to the device). The patch is to setup the DMA fifo threshold so
>    that there is no chance for the DMA engine to change protocol. We have
>    seen this problem only on one motherboard.
>
>    Signed-off-by: Silicon Image Corporation <cpardo@siliconimage.com>
>    Signed-off-by: Jeff Garzik <jgarzik@pobox.com>
>
> What the code's doing is setting the FIFO thresholds, used to assign
> priority when requesting a PCI bus read or write operation, based on the
> cache line size somehow. It seems to be trusting that the chip's cache
> line size register has been set properly by the BIOS. The kernel should
> know what the cache line size is but AFAIK normally only sets it when
> the driver requests MWI. This chip doesn't support MWI, but it looks
> like pci_set_mwi would fix up the CLS register as a side effect..
>
> > Anyways, there was an interesting report that updating the BIOS on the
> > controller fixed the problem.
> >
> >   http://bugzilla.kernel.org/show_bug.cgi?id=10480
> >
> > Taking "lspci -nnvvvxxx" output of before and after such BIOS update
> > will shed some light on what's really going on.  Can you please try
> > that?
>
> Yes, that would be quite interesting.. the output even with the current
> BIOS would be useful to see if the BIOS set some stupid cache line size
> value..

Unfortunately I can't update the bios/firmware of the Sil3114 directly, it is 
onboard and the firmware is included into the mainboard bios. There is not 
the most recent bios version installed, but when we initially had the 
problems, we first tried a bios update, but it didn't help.

As suggested by Robert, I'm presently trying to figure out the corruption 
pattern. Our test tool easily provides this data. Unfortunately, it hasn't 
reported anything so far, although the reiserfs already got corrupted. 
My colleague, who wrote that tool, might recently have broken something 
(this is the second time it has failed to report corruptions); in the past 
it worked reliably. Please give me a few more days...


03:05.0 Mass storage controller [0180]: Silicon Image, Inc. SiI 3114 
[SATALink/SATARaid] Serial ATA Controller [1095:3114] (rev 02)
        Subsystem: Silicon Image, Inc. SiI 3114 SATALink Controller 
[1095:3114]
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- 
Stepping- SERR+ FastB2B-
        Status: Cap+ 66MHz+ UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- 
<TAbort- <MAbort- >SERR- <PERR-
        Latency: 64, Cache Line Size: 64 bytes
        Interrupt: pin A routed to IRQ 19
        Region 0: I/O ports at bc00 [size=8]
        Region 1: I/O ports at b880 [size=4]
        Region 2: I/O ports at b800 [size=8]
        Region 3: I/O ports at ac00 [size=4]
        Region 4: I/O ports at a880 [size=16]
        Region 5: Memory at feafec00 (32-bit, non-prefetchable) [size=1K]
        Expansion ROM at fea00000 [disabled] [size=512K]
        Capabilities: [60] Power Management version 2
                Flags: PMEClk- DSI+ D1+ D2+ AuxCurrent=0mA 
PME(D0-,D1-,D2-,D3hot-,D3cold-)
                Status: D0 PME-Enable- DSel=0 DScale=2 PME-
00: 95 10 14 31 07 01 b0 02 02 00 80 01 10 40 00 00
10: 01 bc 00 00 81 b8 00 00 01 b8 00 00 01 ac 00 00
20: 81 a8 00 00 00 ec af fe 00 00 00 00 95 10 14 31
30: 00 00 a0 fe 60 00 00 00 00 00 00 00 0a 01 00 00
40: 02 00 00 00 01 00 00 00 00 00 00 00 00 00 00 00
50: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
60: 01 00 22 06 00 40 00 64 00 00 00 00 00 00 00 00
70: 00 00 60 00 d0 d0 09 00 00 00 60 00 00 00 00 00
80: 03 00 00 00 22 00 00 00 00 00 00 00 c8 93 7f ef
90: 00 00 00 09 ff ff 00 00 00 00 00 19 00 00 00 00
a0: 01 31 15 65 dd 62 dd 62 92 43 92 43 09 40 09 40
b0: 01 21 15 65 dd 62 dd 62 92 43 92 43 09 40 09 40
c0: 84 03 00 00 00 00 00 00 00 00 00 00 00 00 00 00
d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00



Cheers,
Bernd

-- 
Bernd Schubert
Q-Leap Networks GmbH


* Re: Corrupt data - RAID sata_sil 3114 chip
  2009-01-07 15:31     ` Bernd Schubert
@ 2009-01-11  0:32       ` Robert Hancock
  2009-01-11  0:43         ` Robert Hancock
  0 siblings, 1 reply; 29+ messages in thread
From: Robert Hancock @ 2009-01-11  0:32 UTC (permalink / raw)
  To: Bernd Schubert
  Cc: Tejun Heo, Alan Cox, Justin Piszcz, debian-user, linux-raid,
	linux-ide, cpardo

Bernd Schubert wrote:
>>> I think it's something related to setting up the PCI side of things.
>>> There have been hints that incorrect CLS setting was the culprit and I
>>> tried the combinations but without any success and unfortunately the
>>> problem wasn't reproducible with the hardware I have here.  :-(
>> As far as the cache line size register, the only thing the documentation
>> says it controls _directly_ is "With the SiI3114 as a master, initiating
>> a read transaction, it issues PCI command Read Multiple in place, when
>> empty space in its FIFO is larger than the value programmed in this
>> register."
>>
>> The interesting thing is the commit (log below) that added code to the
>> driver to check the PCI cache line size register and set up the FIFO
>> thresholds:
>>
>>    2005/03/24 23:32:42-05:00 Carlos.Pardo
>>    [PATCH] sata_sil: Fix FIFO PCI Bus Arbitration
>>
>>    This patch set default values for the FIFO PCI Bus Arbitration to
>>    avoid data corruption. The root cause is due to our PCI bus master
>>    handling mismatch with the chipset PCI bridge during DMA xfer (write
>>    data to the device). The patch is to setup the DMA fifo threshold so
>>    that there is no chance for the DMA engine to change protocol. We have
>>    seen this problem only on one motherboard.
>>
>>    Signed-off-by: Silicon Image Corporation <cpardo@siliconimage.com>
>>    Signed-off-by: Jeff Garzik <jgarzik@pobox.com>
>>
>> What the code's doing is setting the FIFO thresholds, used to assign
>> priority when requesting a PCI bus read or write operation, based on the
>> cache line size somehow. It seems to be trusting that the chip's cache
>> line size register has been set properly by the BIOS. The kernel should
>> know what the cache line size is but AFAIK normally only sets it when
>> the driver requests MWI. This chip doesn't support MWI, but it looks
>> like pci_set_mwi would fix up the CLS register as a side effect..
>>
>>> Anyways, there was an interesting report that updating the BIOS on the
>>> controller fixed the problem.
>>>
>>>   http://bugzilla.kernel.org/show_bug.cgi?id=10480
>>>
>>> Taking "lspci -nnvvvxxx" output of before and after such BIOS update
>>> will shed some light on what's really going on.  Can you please try
>>> that?
>> Yes, that would be quite interesting.. the output even with the current
>> BIOS would be useful to see if the BIOS set some stupid cache line size
>> value..
> 
> Unfortunately I can't update the bios/firmware of the Sil3114 directly, it is 
> onboard and the firmware is included into the mainboard bios. There is not 
> the most recent bios version installed, but when we initially had the 
> problems, we first tried a bios update, but it didn't help.

Well if one is really adventurous one can sometimes use some BIOS image 
editing tools to install an updated flash image for such integrated 
chips into the main BIOS image. This is definitely for advanced users 
only though..

> 
> As suggested by Robert, I'm presently trying to figure out the corruption 
> pattern. Actually our test tool easily provides these data. Unfortunately, it 
> so far didn't report anything, although the reiserfs already got corrupted. 
> Might be my colleague, who wrote that tool, recently broke something (as it 
> is the second time, it doesn't report corruptions), in the past it did work 
> reliably. Please give me a few more days...
> 
> 
> 03:05.0 Mass storage controller [0180]: Silicon Image, Inc. SiI 3114 
> [SATALink/SATARaid] Serial ATA Controller [1095:3114] (rev 02)
>         Subsystem: Silicon Image, Inc. SiI 3114 SATALink Controller 
> [1095:3114]
>         Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- 
> Stepping- SERR+ FastB2B-
>         Status: Cap+ 66MHz+ UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- 
> <TAbort- <MAbort- >SERR- <PERR-
>         Latency: 64, Cache Line Size: 64 bytes

Well, 64 seems quite reasonable, so that doesn't really give any more 
useful information.

I'm CCing Carlos Pardo at Silicon Image who wrote the patch above, maybe 
he has some insight.. Carlos, we have a case here where Bernd is 
reporting seeing corruption on an integrated SiI3114 on a Tyan Thunder 
K8S Pro (S2882) board, AMD 8111 chipset. This is reportedly occurring 
only with certain Seagate drives. Do you have any insight into this 
problem, in particular as far as whether the problem worked around in 
the patch mentioned above might be related?

There are apparently some reports of issues on NVidia chipsets as well, 
though I don't have any details at hand.

>         Interrupt: pin A routed to IRQ 19
>         Region 0: I/O ports at bc00 [size=8]
>         Region 1: I/O ports at b880 [size=4]
>         Region 2: I/O ports at b800 [size=8]
>         Region 3: I/O ports at ac00 [size=4]
>         Region 4: I/O ports at a880 [size=16]
>         Region 5: Memory at feafec00 (32-bit, non-prefetchable) [size=1K]
>         Expansion ROM at fea00000 [disabled] [size=512K]
>         Capabilities: [60] Power Management version 2
>                 Flags: PMEClk- DSI+ D1+ D2+ AuxCurrent=0mA 
> PME(D0-,D1-,D2-,D3hot-,D3cold-)
>                 Status: D0 PME-Enable- DSel=0 DScale=2 PME-
> 00: 95 10 14 31 07 01 b0 02 02 00 80 01 10 40 00 00
> 10: 01 bc 00 00 81 b8 00 00 01 b8 00 00 01 ac 00 00
> 20: 81 a8 00 00 00 ec af fe 00 00 00 00 95 10 14 31
> 30: 00 00 a0 fe 60 00 00 00 00 00 00 00 0a 01 00 00
> 40: 02 00 00 00 01 00 00 00 00 00 00 00 00 00 00 00
> 50: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 60: 01 00 22 06 00 40 00 64 00 00 00 00 00 00 00 00
> 70: 00 00 60 00 d0 d0 09 00 00 00 60 00 00 00 00 00
> 80: 03 00 00 00 22 00 00 00 00 00 00 00 c8 93 7f ef
> 90: 00 00 00 09 ff ff 00 00 00 00 00 19 00 00 00 00
> a0: 01 31 15 65 dd 62 dd 62 92 43 92 43 09 40 09 40
> b0: 01 21 15 65 dd 62 dd 62 92 43 92 43 09 40 09 40
> c0: 84 03 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 
> 
> 
> Cheers,
> Bernd
> 



* Re: Corrupt data - RAID sata_sil 3114 chip
  2009-01-11  0:32       ` Robert Hancock
@ 2009-01-11  0:43         ` Robert Hancock
  2009-01-12  1:30           ` Tejun Heo
  0 siblings, 1 reply; 29+ messages in thread
From: Robert Hancock @ 2009-01-11  0:43 UTC (permalink / raw)
  Cc: Bernd Schubert, Tejun Heo, Alan Cox, Justin Piszcz, debian-user,
	linux-raid, linux-ide

Robert Hancock wrote:
> Bernd Schubert wrote:
>>>> I think it's something related to setting up the PCI side of things.
>>>> There have been hints that incorrect CLS setting was the culprit and I
>>>> tried the combinations but without any success and unfortunately the
>>>> problem wasn't reproducible with the hardware I have here.  :-(
>>> As far as the cache line size register, the only thing the documentation
>>> says it controls _directly_ is "With the SiI3114 as a master, initiating
>>> a read transaction, it issues PCI command Read Multiple in place, when
>>> empty space in its FIFO is larger than the value programmed in this
>>> register."
>>>
>>> The interesting thing is the commit (log below) that added code to the
>>> driver to check the PCI cache line size register and set up the FIFO
>>> thresholds:
>>>
>>>    2005/03/24 23:32:42-05:00 Carlos.Pardo
>>>    [PATCH] sata_sil: Fix FIFO PCI Bus Arbitration
>>>
>>>    This patch set default values for the FIFO PCI Bus Arbitration to
>>>    avoid data corruption. The root cause is due to our PCI bus master
>>>    handling mismatch with the chipset PCI bridge during DMA xfer (write
>>>    data to the device). The patch is to setup the DMA fifo threshold so
>>>    that there is no chance for the DMA engine to change protocol. We 
>>> have
>>>    seen this problem only on one motherboard.
>>>
>>>    Signed-off-by: Silicon Image Corporation <cpardo@siliconimage.com>
>>>    Signed-off-by: Jeff Garzik <jgarzik@pobox.com>
>>>
>>> What the code's doing is setting the FIFO thresholds, used to assign
>>> priority when requesting a PCI bus read or write operation, based on the
>>> cache line size somehow. It seems to be trusting that the chip's cache
>>> line size register has been set properly by the BIOS. The kernel should
>>> know what the cache line size is but AFAIK normally only sets it when
>>> the driver requests MWI. This chip doesn't support MWI, but it looks
>>> like pci_set_mwi would fix up the CLS register as a side effect..
>>>
>>>> Anyways, there was an interesting report that updating the BIOS on the
>>>> controller fixed the problem.
>>>>
>>>>   http://bugzilla.kernel.org/show_bug.cgi?id=10480
>>>>
>>>> Taking "lspci -nnvvvxxx" output of before and after such BIOS update
>>>> will shed some light on what's really going on.  Can you please try
>>>> that?
>>> Yes, that would be quite interesting.. the output even with the current
>>> BIOS would be useful to see if the BIOS set some stupid cache line size
>>> value..
>>
>> Unfortunately I can't update the bios/firmware of the Sil3114 
>> directly, it is onboard and the firmware is included into the 
>> mainboard bios. There is not the most recent bios version installed, 
>> but when we initially had the problems, we first tried a bios update, 
>> but it didn't help.
> 
> Well if one is really adventurous one can sometimes use some BIOS image 
> editing tools to install an updated flash image for such integrated 
> chips into the main BIOS image. This is definitely for advanced users 
> only though..
> 
>>
>> As suggested by Robert, I'm presently trying to figure out the 
>> corruption pattern. Actually our test tool easily provides these data. 
>> Unfortunately, it so far didn't report anything, although the reiserfs 
>> already got corrupted. Might be my colleague, who wrote that tool, 
>> recently broke something (as it is the second time, it doesn't report 
>> corruptions), in the past it did work reliably. Please give me a few 
>> more days...
>>
>>
>> 03:05.0 Mass storage controller [0180]: Silicon Image, Inc. SiI 3114 
>> [SATALink/SATARaid] Serial ATA Controller [1095:3114] (rev 02)
>>         Subsystem: Silicon Image, Inc. SiI 3114 SATALink Controller 
>> [1095:3114]
>>         Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- 
>> ParErr- Stepping- SERR+ FastB2B-
>>         Status: Cap+ 66MHz+ UDF- FastB2B+ ParErr- DEVSEL=medium 
>> >TAbort- <TAbort- <MAbort- >SERR- <PERR-
>>         Latency: 64, Cache Line Size: 64 bytes
> 
> Well, 64 seems quite reasonable, so that doesn't really give any more 
> useful information.
> 
> I'm CCing Carlos Pardo at Silicon Image who wrote the patch above, maybe 
> he has some insight.. Carlos, we have a case here where Bernd is 
> reporting seeing corruption on an integrated SiI3114 on a Tyan Thunder 
> K8S Pro (S2882) board, AMD 8111 chipset. This is reportedly occurring 
> only with certain Seagate drives. Do you have any insight into this 
> problem, in particular as far as whether the problem worked around in 
> the patch mentioned above might be related?
> 
> There are apparently some reports of issues on NVidia chipsets as well, 
> though I don't have any details at hand.

Well, Carlos' email bounces, so much for that one. Anyone have any other 
contacts at Silicon Image?

> 
>>         Interrupt: pin A routed to IRQ 19
>>         Region 0: I/O ports at bc00 [size=8]
>>         Region 1: I/O ports at b880 [size=4]
>>         Region 2: I/O ports at b800 [size=8]
>>         Region 3: I/O ports at ac00 [size=4]
>>         Region 4: I/O ports at a880 [size=16]
>>         Region 5: Memory at feafec00 (32-bit, non-prefetchable) [size=1K]
>>         Expansion ROM at fea00000 [disabled] [size=512K]
>>         Capabilities: [60] Power Management version 2
>>                 Flags: PMEClk- DSI+ D1+ D2+ AuxCurrent=0mA 
>> PME(D0-,D1-,D2-,D3hot-,D3cold-)
>>                 Status: D0 PME-Enable- DSel=0 DScale=2 PME-
>> 00: 95 10 14 31 07 01 b0 02 02 00 80 01 10 40 00 00
>> 10: 01 bc 00 00 81 b8 00 00 01 b8 00 00 01 ac 00 00
>> 20: 81 a8 00 00 00 ec af fe 00 00 00 00 95 10 14 31
>> 30: 00 00 a0 fe 60 00 00 00 00 00 00 00 0a 01 00 00
>> 40: 02 00 00 00 01 00 00 00 00 00 00 00 00 00 00 00
>> 50: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>> 60: 01 00 22 06 00 40 00 64 00 00 00 00 00 00 00 00
>> 70: 00 00 60 00 d0 d0 09 00 00 00 60 00 00 00 00 00
>> 80: 03 00 00 00 22 00 00 00 00 00 00 00 c8 93 7f ef
>> 90: 00 00 00 09 ff ff 00 00 00 00 00 19 00 00 00 00
>> a0: 01 31 15 65 dd 62 dd 62 92 43 92 43 09 40 09 40
>> b0: 01 21 15 65 dd 62 dd 62 92 43 92 43 09 40 09 40
>> c0: 84 03 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>> d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>> e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>> f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>>
>>
>>
>> Cheers,
>> Bernd
>>
> 
> -- 
> To unsubscribe from this list: send the line "unsubscribe linux-ide" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 



* Re: Corrupt data - RAID sata_sil 3114 chip
  2009-01-11  0:43         ` Robert Hancock
@ 2009-01-12  1:30           ` Tejun Heo
  2009-01-19 18:43             ` Dave Jones
  0 siblings, 1 reply; 29+ messages in thread
From: Tejun Heo @ 2009-01-12  1:30 UTC (permalink / raw)
  To: Robert Hancock
  Cc: Bernd Schubert, Alan Cox, Justin Piszcz, debian-user, linux-raid,
	linux-ide

Robert Hancock wrote:
>> There are apparently some reports of issues on NVidia chipsets as
>> well, though I don't have any details at hand.
> 
> Well, Carlos' email bounces, so much for that one. Anyone have any other
> contacts at Silicon Image?

I'll ping my SIMG contacts, though I've pinged them about this problem
in the past and it didn't get anywhere.

Thanks.

-- 
tejun


* Re: Corrupt data - RAID sata_sil 3114 chip
  2009-01-12  1:30           ` Tejun Heo
@ 2009-01-19 18:43             ` Dave Jones
  2009-01-20  2:50               ` Robert Hancock
  0 siblings, 1 reply; 29+ messages in thread
From: Dave Jones @ 2009-01-19 18:43 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Robert Hancock, Bernd Schubert, Alan Cox, Justin Piszcz,
	debian-user, linux-raid, linux-ide

On Mon, Jan 12, 2009 at 10:30:42AM +0900, Tejun Heo wrote:
 > Robert Hancock wrote:
 > >> There are apparently some reports of issues on NVidia chipsets as
 > >> well, though I don't have any details at hand.
 > > 
 > > Well, Carlos' email bounces, so much for that one. Anyone have any other
 > > contacts at Silicon Image?
 > 
 > I'll ping my SIMG contacts but I've pinged about this problem in the
 > past but it didn't get anywhere.

I wish I'd read this thread last week.. I've been beating my head
against this problem all weekend.

I picked up a cheap 3114 card, and found that when I created a filesystem
with it on a 250GB disk, it got massive corruption very quickly.

My experience echoes that of most of the other people in this thread, but
here are a few data points I've been able to figure out.

I ran badblocks -v -w -s on the disk, and after running
for nearly 24 hours, it reported a huge number of blocks
failing at the upper part of the disk.

I created a partition in this bad area to speed up testing..

   Device Boot      Start       End      Blocks   Id  System
/dev/sde1               1     30000   240974968+  83  Linux
/dev/sde2           30001     30200     1606500   83  Linux
/dev/sde3           30201     30401     1614532+  83  Linux

Rerunning badblocks on /dev/sde2 consistently fails when
it gets to the reading back 0x00 stage.
(Somehow it passes reading back 0xff, 0xaa and 0x55)

I was beginning to suspect the disk may be bad, but when I
moved it to a box with Intel sata, the badblocks run on that
same partition succeeds with no problems at all.

Given the corruption happens at high block numbers, I'm wondering
if maybe there's some kind of wraparound bug happening here.
(Though why only the 0x00 pattern fails would still be a mystery).
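[Editor's note: to shorten the reproduce cycle, badblocks can be told to test only the single failing pattern via `-t`. The sketch below is guarded behind a variable because the write-mode test destroys everything on the target partition.]

```shell
# Rerun only the failing 0x00 pattern on the scratch partition.
# DESTROYS all data on $DEV; left unset here so the sketch is a no-op.
DEV=${DEV:-}
if [ -n "$DEV" ]; then
    badblocks -v -w -s -t 0 "$DEV"   # -t 0: write/verify only the 0x00 pattern
    msg="badblocks 0x00 pass complete on $DEV"
else
    msg="set DEV=/dev/sdXN (a scratch partition) before running"
fi
echo "$msg"
```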


After reading about the firmware update fixing it, I thought I'd
give that a shot.  This was pretty much complete fail.

The DOS utility for flashing claims I'm running BIOS 5.0.39,
which looking at http://www.siliconimage.com/support/searchresults.aspx?pid=28&cat=15
is quite ancient.  So I tried the newer ones.
Same experience with both 5.4.0.3 and 5.0.73:

"BIOS version in the input file is not a newer version"

Forcing it to write anyway gets..

"Data is different at address 65f6h"




	Dave 


-- 
http://www.codemonkey.org.uk

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Corrupt data - RAID sata_sil 3114 chip
  2009-01-19 18:43             ` Dave Jones
@ 2009-01-20  2:50               ` Robert Hancock
  2009-01-20 20:07                 ` Dave Jones
  0 siblings, 1 reply; 29+ messages in thread
From: Robert Hancock @ 2009-01-20  2:50 UTC (permalink / raw)
  To: Dave Jones
  Cc: Tejun Heo, Bernd Schubert, Alan Cox, Justin Piszcz, debian-user,
	linux-raid, linux-ide

Dave Jones wrote:
> On Mon, Jan 12, 2009 at 10:30:42AM +0900, Tejun Heo wrote:
>  > Robert Hancock wrote:
>  > >> There are apparently some reports of issues on NVidia chipsets as
>  > >> well, though I don't have any details at hand.
>  > > 
>  > > Well, Carlos' email bounces, so much for that one. Anyone have any other
>  > > contacts at Silicon Image?
>  > 
>  > I'll ping my SIMG contacts but I've pinged about this problem in the
>  > past but it didn't get anywhere.
> 
> I wish I'd read this thread last week.. I've been beating my head
> against this problem all weekend.
> 
> I picked up a cheap 3114 card, and found that when I created a filesystem
> with it on a 250GB disk, it got massive corruption very quickly.
> 
> My experience echoes most of the other reports in this thread, but here
> are a few data points I've been able to figure out.
> 
> I ran badblocks -v -w -s on the disk, and after running
> for nearly 24 hours, it reported a huge number of blocks
> failing at the upper part of the disk.
> 
> I created a partition in this bad area to speed up testing:
> 
>    Device Boot      Start         End      Blocks   Id  System
> /dev/sde1               1       30000   240974968+  83  Linux
> /dev/sde2           30001       30200     1606500   83  Linux
> /dev/sde3           30201       30401     1614532+  83  Linux
> 
> Rerunning badblocks on /dev/sde2 consistently fails when
> it gets to the reading back 0x00 stage.
> (Somehow it passes reading back 0xff, 0xaa and 0x55)
> 
> I was beginning to suspect the disk may be bad, but when I
> moved it to a box with an Intel SATA controller, the badblocks run on
> that same partition succeeded with no problems at all.
> 
> Given the corruption happens at high block numbers, I'm wondering
> if maybe there's some kind of wraparound bug happening here.
> (Though why only the 0x00 pattern fails would still be a mystery).

Yeah, that seems a bit bizarre... apparently zeros are somehow being 
converted into non-zero values. Can you try zeroing out the partition by 
dd'ing into it from /dev/zero or something, then dumping it back out to 
see what kind of data is showing up?

> 
> 
> After reading about the firmware update fixing it, I thought I'd
> give that a shot.  This was pretty much complete fail.
> 
> The DOS utility for flashing claims I'm running BIOS 5.0.39,
> which, looking at http://www.siliconimage.com/support/searchresults.aspx?pid=28&cat=15,
> is quite ancient.  So I tried the newer ones.
> Same experience with both 5.4.0.3 and 5.0.73:
> 
> "BIOS version in the input file is not a newer version"
> 
> Forcing it to write anyway gets..
> 
> "Data is different at address 65f6h"
> 
> 
> 
> 
> 	Dave 
> 
> 


* Re: Corrupt data - RAID sata_sil 3114 chip
  2009-01-20  2:50               ` Robert Hancock
@ 2009-01-20 20:07                 ` Dave Jones
  0 siblings, 0 replies; 29+ messages in thread
From: Dave Jones @ 2009-01-20 20:07 UTC (permalink / raw)
  To: Robert Hancock
  Cc: Tejun Heo, Bernd Schubert, Alan Cox, Justin Piszcz, debian-user,
	linux-raid, linux-ide

On Mon, Jan 19, 2009 at 08:50:06PM -0600, Robert Hancock wrote:
 
 > > Given the corruption happens at high block numbers, I'm wondering
 > > if maybe there's some kind of wraparound bug happening here.
 > > (Though why only the 0x00 pattern fails would still be a mystery).
 > 
 > Yeah, that seems a bit bizarre... apparently zeros are somehow being 
 > converted into non-zero values. Can you try zeroing out the partition by 
 > dd'ing into it from /dev/zero or something, then dumping it back out to 
 > see what kind of data is showing up?

Hmm, it seems the failed firmware update has killed the EEPROM.
It no longer reports the right PCI vendor ID.

	Dave

-- 
http://www.codemonkey.org.uk


* Re: Corrupt data - RAID sata_sil 3114 chip
@ 2010-01-29 16:13 Ulli.Brennenstuhl
  2010-01-29 19:37 ` Robert Hancock
  0 siblings, 1 reply; 29+ messages in thread
From: Ulli.Brennenstuhl @ 2010-01-29 16:13 UTC (permalink / raw)
  To: linux-ide

The last message of this discussion is more than one year old, but there 
is still no solution to this problem.

I recently encountered the same problem: a RAID created with mdadm 
from three Samsung HD154UI SATA hard disks had random errors, and 
mdadm --examine would randomly report the checksums as wrong or correct.

The SATA controller with the SIL 3114 chipset runs on an old Epox 8K3A 
board with a VIA KT133 chipset. I noticed that placing the controller in 
another PCI slot would change the results of mdadm --examine: 
in one slot the checksums randomly changed between correct and wrong, 
while in another slot they were always reported as wrong.
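The flip-flopping can be made visible by sampling the checksum line repeatedly. A hedged sketch (the member-device path is hypothetical; the helper just counts distinct lines seen, so more than one means the result is unstable):

```shell
# Sample mdadm --examine several times and count how many distinct
# "Checksum" lines appear across the runs.
distinct_lines() { sort -u | wc -l; }

DEV=${DEV:-/dev/sdb}   # hypothetical RAID member device
if command -v mdadm >/dev/null 2>&1 && [ -b "$DEV" ]; then
    for i in 1 2 3 4 5; do
        mdadm --examine "$DEV" | grep -i '^ *Checksum'
        sleep 1
    done | distinct_lines
else
    echo "mdadm or $DEV not available; skipping live check"
fi
```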

After deactivating every single BIOS option that somehow optimizes the 
PCI bus, the problem seems to be gone. After some more testing I could 
narrow the problem down to the option "PCI Master 0 WS Write", which 
controls whether writes to the PCI bus are executed immediately (with 
zero wait states) or each write request is delayed by one wait state.

Obviously this reduces performance. I didn't run formal benchmarks, but 
the resync speed of the RAID dropped from ~28 MB/s to ~17 MB/s.

I hope this also solves the problem for other people; it would be 
interesting to know whether any change to the driver would allow 
re-enabling the "PCI Master 0 WS Write" option.


Regards,
Ulli Brennenstuhl



* Re: Corrupt data - RAID sata_sil 3114 chip
  2010-01-29 16:13 Corrupt data - RAID sata_sil 3114 chip Ulli.Brennenstuhl
@ 2010-01-29 19:37 ` Robert Hancock
  2010-02-06  3:54   ` Tejun Heo
  0 siblings, 1 reply; 29+ messages in thread
From: Robert Hancock @ 2010-01-29 19:37 UTC (permalink / raw)
  To: Ulli.Brennenstuhl; +Cc: linux-ide

On 01/29/2010 10:13 AM, Ulli.Brennenstuhl wrote:
> The last message of this discussion is more than one year old, but there
> is still no solution to this problem.
>
> I recently encountered the same problem: a RAID created with mdadm
> from three Samsung HD154UI SATA hard disks had random errors, and
> mdadm --examine would randomly report the checksums as wrong or correct.
>
> The SATA controller with the SIL 3114 chipset runs on an old Epox 8K3A
> board with a VIA KT133 chipset. I noticed that placing the controller in
> another PCI slot would change the results of mdadm --examine:
> in one slot the checksums randomly changed between correct and wrong,
> while in another slot they were always reported as wrong.
>
> After deactivating every single BIOS option that somehow optimizes the
> PCI bus, the problem seems to be gone. After some more testing I could
> narrow the problem down to the option "PCI Master 0 WS Write", which
> controls whether writes to the PCI bus are executed immediately (with
> zero wait states) or each write request is delayed by one wait state.
>
> Obviously this reduces performance. I didn't run formal benchmarks, but
> the resync speed of the RAID dropped from ~28 MB/s to ~17 MB/s.
>
> I hope this also solves the problem for other people; it would be
> interesting to know whether any change to the driver would allow
> re-enabling the "PCI Master 0 WS Write" option.

I don't imagine there's anything the driver's likely to be able to do to 
avoid it. That sounds like a definite chipset bug. The PCI interface on 
older VIA chipsets was pretty notorious for them.


* Re: Corrupt data - RAID sata_sil 3114 chip
  2010-01-29 19:37 ` Robert Hancock
@ 2010-02-06  3:54   ` Tejun Heo
  2010-02-06 15:16     ` Tim Small
  0 siblings, 1 reply; 29+ messages in thread
From: Tejun Heo @ 2010-02-06  3:54 UTC (permalink / raw)
  To: Robert Hancock; +Cc: Ulli.Brennenstuhl, linux-ide

Hello,

On 01/30/2010 04:37 AM, Robert Hancock wrote:
> On 01/29/2010 10:13 AM, Ulli.Brennenstuhl wrote:
>> After deactivating every single BIOS option that somehow optimizes the
>> PCI bus, the problem seems to be gone. After some more testing I could
>> narrow the problem down to the option "PCI Master 0 WS Write", which
>> controls whether writes to the PCI bus are executed immediately (with
>> zero wait states) or each write request is delayed by one wait state.
>>
>> Obviously this reduces performance. I didn't run formal benchmarks, but
>> the resync speed of the RAID dropped from ~28 MB/s to ~17 MB/s.
>>
>> I hope this also solves the problem for other people; it would be
>> interesting to know whether any change to the driver would allow
>> re-enabling the "PCI Master 0 WS Write" option.
> 
> I don't imagine there's anything the driver's likely to be able to do to
> avoid it. That sounds like a definite chipset bug. The PCI interface on
> older VIA chipsets was pretty notorious for them.

Sil3112/3114 are now virtually the only controllers with occasional
and unresolved data corruption issues.  The other one was sata_via
with ATAPI devices, which recently got a possible fix (it needed a
delay or flush during command issue, or the CDB leaked into the write
buffer).

Given that these sil chips are used very widely, that the failure
frequency is relatively low, that certain traits (like putting two
controllers on the bus) trigger the problem more easily, and that
fiddling with this PCI bus option resolves it, it's conceivable that
the signal tx/rx quality of sil3112/4 is borderline in certain cases.
That would usually not show up, but would cause problems when there
are also other weaknesses on the bus (heavy loading, a bus controller
without exactly the best signal quality, and so on).

It would be great if there were some knob we could turn in the
controller's PCI config space, but I really have no idea whatsoever.  :-(

Thanks.

-- 
tejun


* Re: Corrupt data - RAID sata_sil 3114 chip
  2010-02-06  3:54   ` Tejun Heo
@ 2010-02-06 15:16     ` Tim Small
  2010-02-07 16:09       ` Robert Hancock
  0 siblings, 1 reply; 29+ messages in thread
From: Tim Small @ 2010-02-06 15:16 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Robert Hancock, Ulli.Brennenstuhl, linux-ide

Tejun Heo wrote:
> It would be great if there were some knob we could turn in the
> controller's PCI config space, but I really have no idea whatsoever.  :-(
>   

I wonder if enabling EDAC PCI parity error detection would show up these
problems - either on the controller itself, or its upstream PCI bridge chip?

modprobe edac_core check_pci_errors=1

alternatively, "setpci -s <slot> STATUS" should also show status
register bit 15 having been asserted, I think, and lspci -vv should say
"<PERR+"
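Tim's checks might be scripted like this. The slot address is a hypothetical example (use `lspci` to find the 3114 card's real one); bit 15 of the PCI Status register is the Detected Parity Error flag:

```shell
# Return success if bit 15 (Detected Parity Error) is set in a hex
# STATUS word as printed by setpci.
perr_detected() { [ $(( 0x$1 & 0x8000 )) -ne 0 ]; }

SLOT=${SLOT:-02:04.0}   # hypothetical slot of the 3114 card
if command -v setpci >/dev/null 2>&1 \
   && status=$(setpci -s "$SLOT" STATUS 2>/dev/null) && [ -n "$status" ]; then
    if perr_detected "$status"; then
        echo "PERR latched on $SLOT (STATUS=$status)"
    else
        echo "no parity error latched on $SLOT (STATUS=$status)"
    fi
else
    echo "setpci or slot $SLOT not available; skipping"
fi
```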

Cheers,

Tim.


* Re: Corrupt data - RAID sata_sil 3114 chip
  2010-02-06 15:16     ` Tim Small
@ 2010-02-07 16:09       ` Robert Hancock
  2010-02-08  2:31         ` Tejun Heo
  2010-02-08 14:25         ` Tim Small
  0 siblings, 2 replies; 29+ messages in thread
From: Robert Hancock @ 2010-02-07 16:09 UTC (permalink / raw)
  To: Tim Small; +Cc: Tejun Heo, Ulli.Brennenstuhl, linux-ide

On 02/06/2010 09:16 AM, Tim Small wrote:
> Tejun Heo wrote:
>> It would be great if there were some knob we could turn in the
>> controller's PCI config space, but I really have no idea whatsoever.  :-(
>>
>
> I wonder if enabling EDAC PCI parity error detection would show up these
> problems - either on the controller itself, or its upstream PCI bridge chip?
>
> modprobe edac_core check_pci_errors=1
>
> alternatively, "setpci -s <slot> STATUS" should also show status
> register bit 15 having been asserted, I think, and lspci -vv should say
> "<PERR+"

It's something to check, yes. It would be somewhat surprising if that 
were occurring - detected PCI parity errors should cause a target abort 
and cause a transfer failure, not silent data corruption. But again, on 
an old VIA chipset, for this to be handled improperly wouldn't be 
shocking :-)


* Re: Corrupt data - RAID sata_sil 3114 chip
  2010-02-07 16:09       ` Robert Hancock
@ 2010-02-08  2:31         ` Tejun Heo
  2010-02-08 14:25         ` Tim Small
  1 sibling, 0 replies; 29+ messages in thread
From: Tejun Heo @ 2010-02-08  2:31 UTC (permalink / raw)
  To: Robert Hancock; +Cc: Tim Small, Ulli.Brennenstuhl, linux-ide

On 02/08/2010 01:09 AM, Robert Hancock wrote:
> It's something to check, yes. It would be somewhat surprising if that
> were occurring - detected PCI parity errors should cause a target abort
> and cause a transfer failure, not silent data corruption. But again, on
> an old VIA chipset, for this to be handled improperly wouldn't be
> shocking :-)

Oh... some VIA PCI bridges are known to ignore PCI parity errors.  They
will just happily proceed with corrupt data.

-- 
tejun


* Re: Corrupt data - RAID sata_sil 3114 chip
  2010-02-07 16:09       ` Robert Hancock
  2010-02-08  2:31         ` Tejun Heo
@ 2010-02-08 14:25         ` Tim Small
  1 sibling, 0 replies; 29+ messages in thread
From: Tim Small @ 2010-02-08 14:25 UTC (permalink / raw)
  To: Robert Hancock; +Cc: Tejun Heo, Ulli.Brennenstuhl, linux-ide

Robert Hancock wrote:
> It's something to check, yes. It would be somewhat surprising if that 
> were occurring - detected PCI parity errors should cause a target 
> abort and cause a transfer failure, not silent data corruption. But 
> again, on an old VIA chipset, for this to be handled improperly 
> wouldn't be shocking :-)

I believe that this is only the case if the motherboard BIOS sets bit 6 
in the PCI Command register to tell the device to respond to parity errors.  
Most of the "server" motherboards I've seen do.  None of the non-server 
motherboards that I've just looked at do....

The Status flag "should" still work whether or not the "parity error 
response" flag is set...

Cheers,

Tim.


end of thread, other threads:[~2010-02-08 14:26 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-01-29 16:13 Corrupt data - RAID sata_sil 3114 chip Ulli.Brennenstuhl
2010-01-29 19:37 ` Robert Hancock
2010-02-06  3:54   ` Tejun Heo
2010-02-06 15:16     ` Tim Small
2010-02-07 16:09       ` Robert Hancock
2010-02-08  2:31         ` Tejun Heo
2010-02-08 14:25         ` Tim Small
     [not found] <bQVFb-3SB-37@gated-at.bofh.it>
     [not found] ` <bQVFb-3SB-39@gated-at.bofh.it>
     [not found]   ` <bQVFb-3SB-41@gated-at.bofh.it>
     [not found]     ` <bQVFc-3SB-43@gated-at.bofh.it>
     [not found]       ` <bQVFc-3SB-45@gated-at.bofh.it>
     [not found]         ` <bQVFc-3SB-47@gated-at.bofh.it>
     [not found]           ` <bQVFb-3SB-35@gated-at.bofh.it>
     [not found]             ` <4963306F.4060504@sm7jqb.se>
2009-01-06 10:48               ` Justin Piszcz
  -- strict thread matches above, loose matches on Subject: below --
2009-01-03 20:04 Bernd Schubert
2009-01-03 20:53 ` Robert Hancock
2009-01-03 21:11   ` Bernd Schubert
2009-01-03 23:23     ` Robert Hancock
2009-01-07  4:59 ` Tejun Heo
2009-01-07  5:38   ` Robert Hancock
2009-01-07 15:31     ` Bernd Schubert
2009-01-11  0:32       ` Robert Hancock
2009-01-11  0:43         ` Robert Hancock
2009-01-12  1:30           ` Tejun Heo
2009-01-19 18:43             ` Dave Jones
2009-01-20  2:50               ` Robert Hancock
2009-01-20 20:07                 ` Dave Jones
     [not found] <495E01E3.9060903@sm7jqb.se>
     [not found] ` <alpine.DEB.1.10.0901020741200.11852@p34.internal.lan>
2009-01-02 21:30   ` Bernd Schubert
2009-01-02 21:47     ` Twigathy
2009-01-03  2:31     ` Redeeman
2009-01-03 13:13       ` Bernd Schubert
2009-01-03 13:39     ` Alan Cox
2009-01-03 16:20       ` Bernd Schubert
2009-01-03 18:31         ` Robert Hancock
2009-01-03 22:19     ` James Youngman
