* Re: Corrupt data - RAID sata_sil 3114 chip
@ 2010-01-29 16:13 Ulli.Brennenstuhl
2010-01-29 19:37 ` Robert Hancock
0 siblings, 1 reply; 29+ messages in thread
From: Ulli.Brennenstuhl @ 2010-01-29 16:13 UTC (permalink / raw)
To: linux-ide
The last message in this discussion is more than a year old, but the
problem was never solved.
I recently ran into the same problem: a RAID created with mdadm from
three Samsung HD154UI SATA hard disks showed random errors, and
mdadm --examine would randomly report the checksums as correct or wrong.
The SATA controller with the SiI 3114 chipset runs on an old Epox 8K3A
board with a VIA KT133 chipset. I noticed that placing the controller in
another PCI slot changed the results of mdadm --examine: in one slot the
checksums flipped randomly between correct and wrong, while in another
slot they were always reported as wrong.
After deactivating every BIOS option that somehow optimizes the PCI bus,
the problem seems to be gone. After some more testing I could narrow it
down to the option "PCI Master 0 WS Write", which controls whether write
requests to the PCI bus are executed immediately (with zero wait states)
or each write is delayed by one wait state.
Obviously this reduces performance; I didn't run dedicated benchmarks,
but the resync speed of the RAID dropped from ~28 MB/s to ~17 MB/s.
I hope this also solves the problem for other people, and it would be
interesting to know whether any change to the driver could allow
re-enabling the "PCI Master 0 WS Write" option.
Regards,
Ulli Brennenstuhl
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: Corrupt data - RAID sata_sil 3114 chip
2010-01-29 16:13 Corrupt data - RAID sata_sil 3114 chip Ulli.Brennenstuhl
@ 2010-01-29 19:37 ` Robert Hancock
2010-02-06 3:54 ` Tejun Heo
0 siblings, 1 reply; 29+ messages in thread
From: Robert Hancock @ 2010-01-29 19:37 UTC (permalink / raw)
To: Ulli.Brennenstuhl; +Cc: linux-ide
On 01/29/2010 10:13 AM, Ulli.Brennenstuhl wrote:
> The last message of this discussion is more than one year old, but still
> there was no solution to this problem.
>
> I recently encountered the same problem that a raid created with mdadm
> consisting of three SAMSUNG HD154UI sata harddisks had random errors and
> mdadm --examine would randomly report that checksums are wrong/correct.
>
> The sata controller with the SIL 3114 chipset runs on an old Epox 8K3A
> board with a VIA KT133 chipset. I noticed that placing the controller in
> another pci slot would change the results of mdadm --examine.
> In one slot the checksums flipped randomly between correct and wrong,
> while in another slot they were always reported as wrong.
>
> After deactivating every single bios option that somehow optimizes the
> pci bus the problem seems to be gone. After some more testing I could
> narrow the problem down to the option "PCI Master 0 WS Write", which
> controls if requests to the pci bus are executed immediately (with zero
> wait states) or if every write request will be delayed by one wait state.
>
> Obviously this reduces the performance. I didn't perform tests but the
> resync speed of the raid dropped from ~ 28mb/s to ~ 17mb/s.
>
> I hope this also solves the problems for other people and it would be
> interesting if any change to the driver would allow to reenable the "PCI
> Master 0 WS Write" option.
I don't imagine there's anything the driver's likely to be able to do to
avoid it. That sounds like a definite chipset bug. The PCI interface on
older VIA chipsets was pretty notorious for them.
* Re: Corrupt data - RAID sata_sil 3114 chip
2010-01-29 19:37 ` Robert Hancock
@ 2010-02-06 3:54 ` Tejun Heo
2010-02-06 15:16 ` Tim Small
0 siblings, 1 reply; 29+ messages in thread
From: Tejun Heo @ 2010-02-06 3:54 UTC (permalink / raw)
To: Robert Hancock; +Cc: Ulli.Brennenstuhl, linux-ide
Hello,
On 01/30/2010 04:37 AM, Robert Hancock wrote:
> On 01/29/2010 10:13 AM, Ulli.Brennenstuhl wrote:
>> After deactivating every single bios option that somehow optimizes the
>> pci bus the problem seems to be gone. After some more testing I could
>> narrow the problem down to the option "PCI Master 0 WS Write", which
>> controls if requests to the pci bus are executed immediately (with zero
>> wait states) or if every write request will be delayed by one wait state.
>>
>> Obviously this reduces the performance. I didn't perform tests but the
>> resync speed of the raid dropped from ~ 28mb/s to ~ 17mb/s.
>>
>> I hope this also solves the problems for other people and it would be
>> interesting if any change to the driver would allow to reenable the "PCI
>> Master 0 WS Write" option.
>
> I don't imagine there's anything the driver's likely to be able to do to
> avoid it. That sounds like a definite chipset bug. The PCI interface on
> older VIA chipsets was pretty notorious for them.
Sil3112/3114 are now virtually the only controllers with occasional
and unresolved data corruption issues. The other one was sata_via
with ATAPI devices, which got a possible fix recently (it needed a
delay or flush during command issue, or the CDB leaked into the write
buffer).
Given that these SiI chips are very widely used, that the failure
frequency is relatively low, that certain configurations (such as
putting two controllers on the same bus) trigger the problem more
easily, and that fiddling with this PCI bus option resolves it, it's
conceivable that the signal tx/rx quality of the sil3112/4 is
borderline in certain cases: it usually doesn't show up, but causes
problems when there are also other weaknesses on the bus (heavy
loading, a bus controller without the best signal quality, and so on).
It would be great if there were some knob we could turn in the
controller's PCI config space, but I really have no idea whatsoever. :-(
Thanks.
--
tejun
* Re: Corrupt data - RAID sata_sil 3114 chip
2010-02-06 3:54 ` Tejun Heo
@ 2010-02-06 15:16 ` Tim Small
2010-02-07 16:09 ` Robert Hancock
0 siblings, 1 reply; 29+ messages in thread
From: Tim Small @ 2010-02-06 15:16 UTC (permalink / raw)
To: Tejun Heo; +Cc: Robert Hancock, Ulli.Brennenstuhl, linux-ide
Tejun Heo wrote:
> It would be great if there's some knob we can turn in the controller
> PCI config space but I really have no idea whatsoever. :-(
>
I wonder if enabling EDAC PCI parity error detection would show up these
problems - either on the controller itself or on its upstream PCI bridge chip?
modprobe edac_core check_pci_errors=1
Alternatively, "setpci -s <slot> STATUS" should show bit 15 of the status
register having been set, I think, and lspci -vv should say "<PERR+".
Cheers,
Tim.
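Tim's setpci check can be sketched as a tiny helper. This is only an illustration: the helper name is made up, the bit position follows Tim's description (bit 15 of the PCI status register is "Detected Parity Error"), and the slot address in the comment is an example.

```shell
# check_dpe takes the 16-bit STATUS value in hex, as printed by
# "setpci -s <slot> STATUS", and reports whether bit 15 (Detected
# Parity Error) has been latched.
check_dpe() {
    if [ $(( 0x$1 & 0x8000 )) -ne 0 ]; then
        echo "DPE set"
    else
        echo "DPE clear"
    fi
}

# live usage sketch:  check_dpe "$(setpci -s 03:05.0 STATUS)"
check_dpe 02b0
check_dpe 82b0
```

Running this against the live value for both the controller and its upstream bridge would show whether a parity error was ever detected.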
* Re: Corrupt data - RAID sata_sil 3114 chip
2010-02-06 15:16 ` Tim Small
@ 2010-02-07 16:09 ` Robert Hancock
2010-02-08 2:31 ` Tejun Heo
2010-02-08 14:25 ` Tim Small
0 siblings, 2 replies; 29+ messages in thread
From: Robert Hancock @ 2010-02-07 16:09 UTC (permalink / raw)
To: Tim Small; +Cc: Tejun Heo, Ulli.Brennenstuhl, linux-ide
On 02/06/2010 09:16 AM, Tim Small wrote:
> Tejun Heo wrote:
>> It would be great if there's some knob we can turn in the controller
>> PCI config space but I really have no idea whatsoever. :-(
>>
>
> I wonder if enabling EDAC PCI parity error detection would show up these
> problems - either on the controller itself, or its upstream PCI bridge chip?
>
> modprobe edac_core check_pci_errors=1
>
> alternatively, "setpci -s<slot> STATUS" should also show status
> register bit 15 having been asserted, I think, and lspci -vv should say
> "<PERR+"
It's something to check, yes. It would be somewhat surprising if that
were occurring - detected PCI parity errors should cause a target abort
and cause a transfer failure, not silent data corruption. But again, on
an old VIA chipset, for this to be handled improperly wouldn't be
shocking :-)
* Re: Corrupt data - RAID sata_sil 3114 chip
2010-02-07 16:09 ` Robert Hancock
@ 2010-02-08 2:31 ` Tejun Heo
2010-02-08 14:25 ` Tim Small
1 sibling, 0 replies; 29+ messages in thread
From: Tejun Heo @ 2010-02-08 2:31 UTC (permalink / raw)
To: Robert Hancock; +Cc: Tim Small, Ulli.Brennenstuhl, linux-ide
On 02/08/2010 01:09 AM, Robert Hancock wrote:
> It's something to check, yes. It would be somewhat surprising if that
> were occurring - detected PCI parity errors should cause a target abort
> and cause a transfer failure, not silent data corruption. But again, on
> an old VIA chipset, for this to be handled improperly wouldn't be
> shocking :-)
Oh... some VIA PCI bridges are known to ignore PCI parity errors. They
will just happily proceed with corrupt data.
--
tejun
* Re: Corrupt data - RAID sata_sil 3114 chip
2010-02-07 16:09 ` Robert Hancock
2010-02-08 2:31 ` Tejun Heo
@ 2010-02-08 14:25 ` Tim Small
1 sibling, 0 replies; 29+ messages in thread
From: Tim Small @ 2010-02-08 14:25 UTC (permalink / raw)
To: Robert Hancock; +Cc: Tejun Heo, Ulli.Brennenstuhl, linux-ide
Robert Hancock wrote:
> It's something to check, yes. It would be somewhat surprising if that
> were occurring - detected PCI parity errors should cause a target
> abort and cause a transfer failure, not silent data corruption. But
> again, on an old VIA chipset, for this to be handled improperly
> wouldn't be shocking :-)
I believe that this is only the case if the motherboard BIOS sets bit 6
in the Command register (the "Control" line in lspci) to tell the device
to respond to parity errors. Most of the "server" motherboards I've seen
do; none of the non-server motherboards that I've just looked at do....
The Status flag "should" still be set whether or not the "parity error
response" flag is enabled...
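As a sketch of what flipping that bit would look like (the helper name is mine, the slot address is an example, and actually writing the value back pokes real hardware, so that part is shown only as a comment):

```shell
# Compute a new PCI COMMAND register value with bit 6
# ("Parity Error Response") set, given the current value in hex
# as printed by "setpci -s <slot> COMMAND".
enable_parerr_value() {
    printf '%04x\n' $(( 0x$1 | 0x40 ))
}

# usage sketch (example slot address; this writes to the device):
#   setpci -s 03:05.0 COMMAND=$(enable_parerr_value "$(setpci -s 03:05.0 COMMAND)")
enable_parerr_value 0107
```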
Cheers,
Tim.
[parent not found: <bQVFb-3SB-37@gated-at.bofh.it>]
* Re: Corrupt data - RAID sata_sil 3114 chip
@ 2009-01-03 20:04 Bernd Schubert
2009-01-03 20:53 ` Robert Hancock
2009-01-07 4:59 ` Tejun Heo
0 siblings, 2 replies; 29+ messages in thread
From: Bernd Schubert @ 2009-01-03 20:04 UTC (permalink / raw)
To: Robert Hancock
Cc: Alan Cox, Justin Piszcz, debian-user, linux-raid, linux-ide
[sorry, sent again, since Robert dropped all mailing-list CCs and I didn't
notice at first]
On Sat, Jan 03, 2009 at 12:31:12PM -0600, Robert Hancock wrote:
> Bernd Schubert wrote:
>> On Sat, Jan 03, 2009 at 01:39:36PM +0000, Alan Cox wrote:
>>> On Fri, 2 Jan 2009 22:30:07 +0100
>>> Bernd Schubert <bs@q-leap.de> wrote:
>>>
>>>> Hello Bengt,
>>>>
>>>> sil3114 is known to cause data corruption with some disks.
>>> News to me. There are a few people with lots of SI and other devices
>>
>> No no, you just forgot about it, since you even reviewed the patches ;)
>>
>> http://lkml.org/lkml/2007/10/11/137
>
> And Jeff explained why they were not merged:
>
> http://lkml.org/lkml/2007/10/11/166
>
> All the patch does is try to reduce the speed impact of the workaround.
> But as was pointed out, they don't reliably solve the problem the
> workaround is trying to fix, and besides, the workaround is already not
> applied to SiI3114 at all, as it is apparently not applicable on that
> controller (only 3112).
Well, they do reliably solve the problem in our case (before taking the patch
into production I ran checksum tests for about two weeks). Anyway, I entirely
understand why the patches weren't accepted.
But now more than a year has passed again without anything being done
about it, and that is what I strongly criticize. Most people don't know
about issues like this and don't run file checksum tests, as I now always
do before taking a disk into production. So users are exposed to known
data corruption problems without even being warned about them. Usually
even backups don't help, since one ends up backing up the corrupted data.
So IMHO, the driver should be deactivated for the sil3114 until a real
solution is found, and it should only be possible to force-activate it
with a kernel flag, which would then also print a huuuge warning about
possible data corruption (unfortunately most distributions disable
initial kernel messages *grumble*).
Cheers,
Bernd
* Re: Corrupt data - RAID sata_sil 3114 chip
2009-01-03 20:04 Bernd Schubert
@ 2009-01-03 20:53 ` Robert Hancock
2009-01-03 21:11 ` Bernd Schubert
2009-01-07 4:59 ` Tejun Heo
1 sibling, 1 reply; 29+ messages in thread
From: Robert Hancock @ 2009-01-03 20:53 UTC (permalink / raw)
To: Bernd Schubert
Cc: Alan Cox, Justin Piszcz, debian-user, linux-raid, linux-ide
Bernd Schubert wrote:
> [sorry sent again, since Robert dropped all mailing list CCs and I didn't
> notice first]
>
> On Sat, Jan 03, 2009 at 12:31:12PM -0600, Robert Hancock wrote:
>> Bernd Schubert wrote:
>>> On Sat, Jan 03, 2009 at 01:39:36PM +0000, Alan Cox wrote:
>>>> On Fri, 2 Jan 2009 22:30:07 +0100
>>>> Bernd Schubert <bs@q-leap.de> wrote:
>>>>
>>>>> Hello Bengt,
>>>>>
>>>>> sil3114 is known to cause data corruption with some disks.
>>>> News to me. There are a few people with lots of SI and other devices
>>> No no, you just forgot about it, since you even reviewed the patches ;)
>>>
>>> http://lkml.org/lkml/2007/10/11/137
>> And Jeff explained why they were not merged:
>>
>> http://lkml.org/lkml/2007/10/11/166
>>
>> All the patch does is try to reduce the speed impact of the workaround.
>> But as was pointed out, they don't reliably solve the problem the
>> workaround is trying to fix, and besides, the workaround is already not
>> applied to SiI3114 at all, as it is apparently not applicable on that
>> controller (only 3112).
>
> Well, they do reliably solve the problem in our case (before taking the patch
> into production I ran checksum tests for about two weeks). Anyway, I entirely
> understand why the patches weren't accepted.
>
> But now more than a year has passed again without doing anything
> about it and actually this is what I strongly criticize. Most people don't
> know about issues like that and don't run file checksum tests as I now always
> do before taking a disk into production. So users are exposed to known
> data corruption problems without even being warned about it. Usually
> even backups don't help, since one creates a backup of the corrupted data.
>
> So IMHO, the driver should be deactivated for sil3114 until a real solution is
> found. And it only should be possible to force-activate it by a kernel flag,
> which then also would print a huuuge warning about possible data corruption
> (unfortunately most distributions disable initial kernel messages *grumble*).
If the corruption were happening on all such controllers, then people
would have been complaining in droves and something would have been
done. It seems much more likely that in this case the problem is some
kind of hardware fault, or combination of hardware, that is causing it.
Unfortunately, these kinds of not-easily-reproducible issues tend to be
very hard to track down.
* Re: Corrupt data - RAID sata_sil 3114 chip
2009-01-03 20:53 ` Robert Hancock
@ 2009-01-03 21:11 ` Bernd Schubert
2009-01-03 23:23 ` Robert Hancock
0 siblings, 1 reply; 29+ messages in thread
From: Bernd Schubert @ 2009-01-03 21:11 UTC (permalink / raw)
To: Robert Hancock
Cc: Alan Cox, Justin Piszcz, debian-user, linux-raid, linux-ide
On Sat, Jan 03, 2009 at 02:53:09PM -0600, Robert Hancock wrote:
> Bernd Schubert wrote:
>> [sorry sent again, since Robert dropped all mailing list CCs and I
>> didn't notice first]
>>
>> On Sat, Jan 03, 2009 at 12:31:12PM -0600, Robert Hancock wrote:
>>> Bernd Schubert wrote:
>>>> On Sat, Jan 03, 2009 at 01:39:36PM +0000, Alan Cox wrote:
>>>>> On Fri, 2 Jan 2009 22:30:07 +0100
>>>>> Bernd Schubert <bs@q-leap.de> wrote:
>>>>>
>>>>>> Hello Bengt,
>>>>>>
>>>>>> sil3114 is known to cause data corruption with some disks.
>>>>> News to me. There are a few people with lots of SI and other devices
>>>> No no, you just forgot about it, since you even reviewed the patches ;)
>>>>
>>>> http://lkml.org/lkml/2007/10/11/137
>>> And Jeff explained why they were not merged:
>>>
>>> http://lkml.org/lkml/2007/10/11/166
>>>
>>> All the patch does is try to reduce the speed impact of the
>>> workaround. But as was pointed out, they don't reliably solve the
>>> problem the workaround is trying to fix, and besides, the workaround
>>> is already not applied to SiI3114 at all, as it is apparently not
>>> applicable on that controller (only 3112).
>>
>> Well, they do reliably solve the problem in our case (before taking the patch
>> into production I ran checksum tests for about two weeks). Anyway, I entirely
>> understand why the patches weren't accepted.
>>
>> But now more than a year has passed again without doing anything
>> about it and actually this is what I strongly criticize. Most people don't
>> know about issues like that and don't run file checksum tests as I now always
>> do before taking a disk into production. So users are exposed to known
>> data corruption problems without even being warned about it. Usually
>> even backups don't help, since one creates a backup of the corrupted data.
>>
>> So IMHO, the driver should be deactivated for sil3114 until a real
>> solution is found. And it only should be possible to force-activate it
>> by a kernel flag, which then also would print a huuuge warning about
>> possible data corruption (unfortunately most distributions disable
>> initial kernel messages *grumble*).
>
> If the corruption was happening on all such controllers then people
> would have been complaining in droves and something would have been
> done. It seems much more likely that in this case the problem is some
> kind of hardware fault or combination of hardware which is causing the
> problem. Unfortunately these kind of not-easily-reproducible issues tend
> to be very hard to track down.
>
Well yes, it only happens with certain drives, but these drives work fine
on other controllers. Still, these are by now known issues, and nothing is
being done about them.
I would happily help to solve the problem; I just don't have any knowledge
of hardware programming. What would be your next step if you had remote
access to such a system?
Thanks,
Bernd
* Re: Corrupt data - RAID sata_sil 3114 chip
2009-01-03 21:11 ` Bernd Schubert
@ 2009-01-03 23:23 ` Robert Hancock
0 siblings, 0 replies; 29+ messages in thread
From: Robert Hancock @ 2009-01-03 23:23 UTC (permalink / raw)
To: Bernd Schubert
Cc: Alan Cox, Justin Piszcz, debian-user, linux-raid, linux-ide
Bernd Schubert wrote:
> On Sat, Jan 03, 2009 at 02:53:09PM -0600, Robert Hancock wrote:
>> Bernd Schubert wrote:
>>> [sorry sent again, since Robert dropped all mailing list CCs and I
>>> didn't notice first]
>>>
>>> On Sat, Jan 03, 2009 at 12:31:12PM -0600, Robert Hancock wrote:
>>>> Bernd Schubert wrote:
>>>>> On Sat, Jan 03, 2009 at 01:39:36PM +0000, Alan Cox wrote:
>>>>>> On Fri, 2 Jan 2009 22:30:07 +0100
>>>>>> Bernd Schubert <bs@q-leap.de> wrote:
>>>>>>
>>>>>>> Hello Bengt,
>>>>>>>
>>>>>>> sil3114 is known to cause data corruption with some disks.
>>>>>> News to me. There are a few people with lots of SI and other devices
>>>>> No no, you just forgot about it, since you even reviewed the patches ;)
>>>>>
>>>>> http://lkml.org/lkml/2007/10/11/137
>>>> And Jeff explained why they were not merged:
>>>>
>>>> http://lkml.org/lkml/2007/10/11/166
>>>>
>>>> All the patch does is try to reduce the speed impact of the
>>>> workaround. But as was pointed out, they don't reliably solve the
>>>> problem the workaround is trying to fix, and besides, the workaround
>>>> is already not applied to SiI3114 at all, as it is apparently not
>>>> applicable on that controller (only 3112).
>>> Well, they do reliably solve the problem in our case (before taking the patch
>>> into production I ran checksum tests for about two weeks). Anyway, I entirely
>>> understand why the patches weren't accepted.
>>>
>>> But now more than a year has passed again without doing anything
>>> about it and actually this is what I strongly criticize. Most people don't
>>> know about issues like that and don't run file checksum tests as I now always
>>> do before taking a disk into production. So users are exposed to known
>>> data corruption problems without even being warned about it. Usually
>>> even backups don't help, since one creates a backup of the corrupted data.
>>>
>>> So IMHO, the driver should be deactivated for sil3114 until a real
>>> solution is found. And it only should be possible to force-activate it
>>> by a kernel flag, which then also would print a huuuge warning about
>>> possible data corruption (unfortunately most distributions disable
>>> initial kernel messages *grumble*).
>> If the corruption was happening on all such controllers then people
>> would have been complaining in droves and something would have been
>> done. It seems much more likely that in this case the problem is some
>> kind of hardware fault or combination of hardware which is causing the
>> problem. Unfortunately these kind of not-easily-reproducible issues tend
>> to be very hard to track down.
>>
>
> Well yes, it only happens with certain drives. But these drives work fine on
> other controllers. But still these are by now
> known issues and nothing is done for that.
> I would happily help to solve the problem, I just don't have any knowledge
> about hardware programming. What would be your next step, if you had remote
> access to such a system?
Have you been able to track down what kind of corruption is occurring
exactly, i.e. what is happening to the data? Is data being zeroed out,
are random bits being flipped, are chunks of a certain size being
corrupted, etc.? That would likely be useful in determining where to go
next.
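One way to gather exactly that, sketched here with standard tools (the function and file names are illustrative, not from the thread), is to byte-compare a known-good copy of a file against the copy read back through the suspect controller:

```shell
# "cmp -l" prints one line per differing byte: offset, old value,
# new value (values in octal).  A handful of isolated diffs points at
# bit flips, long runs of consecutive offsets at chunk-sized
# corruption, and new values of 000 at zeroed-out data.
corruption_summary() {
    good=$1 bad=$2
    echo "differing bytes: $(cmp -l "$good" "$bad" | wc -l)"
    cmp -l "$good" "$bad" | head -5   # first few diffs for a closer look
}
```

Writing a large test file onto the array, dropping caches, and reading it back would feed this directly.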
* Re: Corrupt data - RAID sata_sil 3114 chip
2009-01-03 20:04 Bernd Schubert
2009-01-03 20:53 ` Robert Hancock
@ 2009-01-07 4:59 ` Tejun Heo
2009-01-07 5:38 ` Robert Hancock
1 sibling, 1 reply; 29+ messages in thread
From: Tejun Heo @ 2009-01-07 4:59 UTC (permalink / raw)
To: Bernd Schubert
Cc: Robert Hancock, Alan Cox, Justin Piszcz, debian-user, linux-raid,
linux-ide
Hello,
Bernd Schubert wrote:
> But now more than a year has passed again without doing anything
> about it and actually this is what I strongly criticize. Most people
> don't know about issues like that and don't run file checksum tests
> as I now always do before taking a disk into production. So users
> are exposed to known data corruption problems without even being
> warned about it. Usually even backups don't help, since one creates
> a backup of the corrupted data.
With sata_sil being one of the most popular controllers and the data
corruption reports concentrated on certain chipsets, I don't think
it's a widespread problem. In some cases, the corruption was very
reproducible too.
I think it's something related to setting up the PCI side of things.
There have been hints that an incorrect CLS setting was the culprit,
and I tried the combinations, but without any success; unfortunately
the problem wasn't reproducible with the hardware I have here. :-(
Anyway, there was an interesting report that updating the BIOS on the
controller fixed the problem:
http://bugzilla.kernel.org/show_bug.cgi?id=10480
Capturing "lspci -nnvvvxxx" output before and after such a BIOS update
should shed some light on what's really going on. Can you please try
that?
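A sketch of the comparison itself (the slot address is an example, and diff_cfg is just a hypothetical helper that filters the config-space hex-dump lines):

```shell
# Save the full config-space dump before and after the BIOS update:
#   lspci -s 03:05.0 -nnvvvxxx > before.txt    # then flash, reboot
#   lspci -s 03:05.0 -nnvvvxxx > after.txt
# diff_cfg then shows only the hex-dump lines ("00:", "10:", ...)
# that changed between the two dumps.
diff_cfg() {
    diff -u "$1" "$2" | grep '^[+-][0-9a-f]0:'
}
```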
> So IMHO, the driver should be deactivated for sil3114 until a real
> solution is found. And it only should be possible to force-activate
> it by a kernel flag, which then also would print a huuuge warning
> about possible data corruption (unfortunately most distributions
> disable initial kernel messages *grumble*).
The problem is serious, but the scope is quite limited and we can't
tell where the problem lies, so I'm not too sure about taking such a
drastic measure. Grumble...
Yeah, I really want to see this long-standing problem fixed. To my
knowledge, this is one of two still-open data corruption bugs, the
other being sata_via putting CDB bytes into burned CDs/DVDs.
So, if you can try the BIOS update thing, please give it a shot.
Thanks.
--
tejun
* Re: Corrupt data - RAID sata_sil 3114 chip
2009-01-07 4:59 ` Tejun Heo
@ 2009-01-07 5:38 ` Robert Hancock
2009-01-07 15:31 ` Bernd Schubert
0 siblings, 1 reply; 29+ messages in thread
From: Robert Hancock @ 2009-01-07 5:38 UTC (permalink / raw)
To: Tejun Heo
Cc: Bernd Schubert, Alan Cox, Justin Piszcz, debian-user, linux-raid,
linux-ide
Tejun Heo wrote:
> Hello,
>
> Bernd Schubert wrote:
>> But now more than a year has passed again without doing anything
>> about it and actually this is what I strongly criticize. Most people
>> don't know about issues like that and don't run file checksum tests
>> as I now always do before taking a disk into production. So users
>> are exposed to known data corruption problems without even being
>> warned about it. Usually even backups don't help, since one creates
>> a backup of the corrupted data.
>
> sata_sil being one of the most popular controllers && data corruption
> reports seem to be concentrated on certain chipsets, I don't think
> it's a widespread problem. In some cases, the corruption was very
> reproducible too.
>
> I think it's something related to setting up the PCI side of things.
> There have been hints that incorrect CLS setting was the culprit and I
> tried the combinations but without any success and unfortunately the
> problem wasn't reproducible with the hardware I have here. :-(
As far as the cache line size register goes, the only thing the
documentation says it controls _directly_ is: "With the SiI3114 as a
master, initiating a read transaction, it issues PCI command Read
Multiple in place, when empty space in its FIFO is larger than the
value programmed in this register."
The interesting thing is the commit (log below) that added code to the
driver to check the PCI cache line size register and set up the FIFO
thresholds:
2005/03/24 23:32:42-05:00 Carlos.Pardo
[PATCH] sata_sil: Fix FIFO PCI Bus Arbitration
This patch set default values for the FIFO PCI Bus Arbitration to
avoid data corruption. The root cause is due to our PCI bus master
handling mismatch with the chipset PCI bridge during DMA xfer (write
data to the device). The patch is to setup the DMA fifo threshold so
that there is no chance for the DMA engine to change protocol. We have
seen this problem only on one motherboard.
Signed-off-by: Silicon Image Corporation <cpardo@siliconimage.com>
Signed-off-by: Jeff Garzik <jgarzik@pobox.com>
What the code does is set the FIFO thresholds, used to assign priority
when requesting a PCI bus read or write operation, based on the cache
line size. It seems to trust that the chip's cache line size register
has been set properly by the BIOS. The kernel should know what the
cache line size is, but AFAIK it normally only sets the register when a
driver requests MWI. This chip doesn't support MWI, but it looks like
pci_set_mwi would fix up the CLS register as a side effect.
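The arithmetic that commit applies can be sketched as follows. This is a reconstruction from the description above: the function name is mine, and the register layout (write threshold in the high byte, read threshold in the low byte) is my reading, so treat it as illustrative.

```shell
# Sketch of the FIFO threshold derivation: the CLS register holds the
# cache line size in 32-bit dwords; the driver computes a threshold of
# (cls >> 3) + 1 and programs that value into both halves of the
# per-port FIFO configuration word.
fifo_cfg_from_cls() {
    thr=$(( ($1 >> 3) + 1 ))
    printf '0x%04x\n' $(( (thr << 8) | thr ))
}

fifo_cfg_from_cls 0x10   # CLS = 16 dwords = 64 bytes
```

With the 64-byte cache line size shown in the lspci dump later in this thread, this yields 0x0303.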
>
> Anyways, there was an interesting report that updating the BIOS on the
> controller fixed the problem.
>
> http://bugzilla.kernel.org/show_bug.cgi?id=10480
>
> Taking "lspci -nnvvvxxx" output of before and after such BIOS update
> will shed some light on what's really going on. Can you please try
> that?
Yes, that would be quite interesting. Even the output with the current
BIOS would be useful, to see whether the BIOS set some stupid cache
line size value.
>
>> So IMHO, the driver should be deactivated for sil3114 until a real
>> solution is found. And it only should be possible to force-activate
>> it by a kernel flag, which then also would print a huuuge warning
>> about possible data corruption (unfortunately most distributions
>> disable initial kernel messages *grumble*).
>
> The problem is serious but the scope is quite limited and we can't
> tell where the problem lies, so I'm not too sure about taking such
> drastic measure. Grumble...
>
> Yeah, I really want to see this long standing problem fixed. To my
> knowledge, this is one of two still open data corruption bugs - the
> other one being via putting CDB bytes into burnt CD/DVDs.
>
> So, if you can try the BIOS update thing, please give it a shot.
>
> Thanks.
>
* Re: Corrupt data - RAID sata_sil 3114 chip
2009-01-07 5:38 ` Robert Hancock
@ 2009-01-07 15:31 ` Bernd Schubert
2009-01-11 0:32 ` Robert Hancock
0 siblings, 1 reply; 29+ messages in thread
From: Bernd Schubert @ 2009-01-07 15:31 UTC (permalink / raw)
To: Robert Hancock
Cc: Tejun Heo, Alan Cox, Justin Piszcz, debian-user, linux-raid,
linux-ide
On Wednesday 07 January 2009 06:38:28 Robert Hancock wrote:
> Tejun Heo wrote:
> > Hello,
> >
> > Bernd Schubert wrote:
> >> But now more than a year has passed again without doing anything
> >> about it and actually this is what I strongly criticize. Most people
> >> don't know about issues like that and don't run file checksum tests
> >> as I now always do before taking a disk into production. So users
> >> are exposed to known data corruption problems without even being
> >> warned about it. Usually even backups don't help, since one creates
> >> a backup of the corrupted data.
> >
> > sata_sil being one of the most popular controllers && data corruption
> > reports seem to be concentrated on certain chipsets, I don't think
> > it's a widespread problem. In some cases, the corruption was very
> > reproducible too.
> >
> > I think it's something related to setting up the PCI side of things.
> > There have been hints that incorrect CLS setting was the culprit and I
> > tried the combinations but without any success and unfortunately the
> > problem wasn't reproducible with the hardware I have here. :-(
>
> As far as the cache line size register, the only thing the documentation
> says it controls _directly_ is "With the SiI3114 as a master, initiating
> a read transaction, it issues PCI command Read Multiple in place, when
> empty space in its FIFO is larger than the value programmed in this
> register."
>
> The interesting thing is the commit (log below) that added code to the
> driver to check the PCI cache line size register and set up the FIFO
> thresholds:
>
> 2005/03/24 23:32:42-05:00 Carlos.Pardo
> [PATCH] sata_sil: Fix FIFO PCI Bus Arbitration
>
> This patch set default values for the FIFO PCI Bus Arbitration to
> avoid data corruption. The root cause is due to our PCI bus master
> handling mismatch with the chipset PCI bridge during DMA xfer (write
> data to the device). The patch is to setup the DMA fifo threshold so
> that there is no chance for the DMA engine to change protocol. We have
> seen this problem only on one motherboard.
>
> Signed-off-by: Silicon Image Corporation <cpardo@siliconimage.com>
> Signed-off-by: Jeff Garzik <jgarzik@pobox.com>
>
> What the code's doing is setting the FIFO thresholds, used to assign
> priority when requesting a PCI bus read or write operation, based on the
> cache line size somehow. It seems to be trusting that the chip's cache
> line size register has been set properly by the BIOS. The kernel should
> know what the cache line size is but AFAIK normally only sets it when
> the driver requests MWI. This chip doesn't support MWI, but it looks
> like pci_set_mwi would fix up the CLS register as a side effect..
>
> > Anyways, there was an interesting report that updating the BIOS on the
> > controller fixed the problem.
> >
> > http://bugzilla.kernel.org/show_bug.cgi?id=10480
> >
> > Taking "lspci -nnvvvxxx" output of before and after such BIOS update
> > will shed some light on what's really going on. Can you please try
> > that?
>
> Yes, that would be quite interesting.. the output even with the current
> BIOS would be useful to see if the BIOS set some stupid cache line size
> value..
Unfortunately I can't update the BIOS/firmware of the SiI3114 directly; it is
onboard and the firmware is included in the mainboard BIOS. The installed BIOS
is not the most recent version, but when we initially had the problems we
first tried a BIOS update, and it didn't help.
As suggested by Robert, I'm presently trying to figure out the corruption
pattern. Our test tool readily provides this data. Unfortunately, it hasn't
reported anything so far, although the reiserfs has already been corrupted.
It may be that my colleague, who wrote the tool, recently broke something
(this is the second time it has failed to report corruptions); in the past it
worked reliably. Please give me a few more days...
03:05.0 Mass storage controller [0180]: Silicon Image, Inc. SiI 3114
[SATALink/SATARaid] Serial ATA Controller [1095:3114] (rev 02)
Subsystem: Silicon Image, Inc. SiI 3114 SATALink Controller
[1095:3114]
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr-
Stepping- SERR+ FastB2B-
Status: Cap+ 66MHz+ UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort-
<TAbort- <MAbort- >SERR- <PERR-
Latency: 64, Cache Line Size: 64 bytes
Interrupt: pin A routed to IRQ 19
Region 0: I/O ports at bc00 [size=8]
Region 1: I/O ports at b880 [size=4]
Region 2: I/O ports at b800 [size=8]
Region 3: I/O ports at ac00 [size=4]
Region 4: I/O ports at a880 [size=16]
Region 5: Memory at feafec00 (32-bit, non-prefetchable) [size=1K]
Expansion ROM at fea00000 [disabled] [size=512K]
Capabilities: [60] Power Management version 2
Flags: PMEClk- DSI+ D1+ D2+ AuxCurrent=0mA
PME(D0-,D1-,D2-,D3hot-,D3cold-)
Status: D0 PME-Enable- DSel=0 DScale=2 PME-
00: 95 10 14 31 07 01 b0 02 02 00 80 01 10 40 00 00
10: 01 bc 00 00 81 b8 00 00 01 b8 00 00 01 ac 00 00
20: 81 a8 00 00 00 ec af fe 00 00 00 00 95 10 14 31
30: 00 00 a0 fe 60 00 00 00 00 00 00 00 0a 01 00 00
40: 02 00 00 00 01 00 00 00 00 00 00 00 00 00 00 00
50: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
60: 01 00 22 06 00 40 00 64 00 00 00 00 00 00 00 00
70: 00 00 60 00 d0 d0 09 00 00 00 60 00 00 00 00 00
80: 03 00 00 00 22 00 00 00 00 00 00 00 c8 93 7f ef
90: 00 00 00 09 ff ff 00 00 00 00 00 19 00 00 00 00
a0: 01 31 15 65 dd 62 dd 62 92 43 92 43 09 40 09 40
b0: 01 21 15 65 dd 62 dd 62 92 43 92 43 09 40 09 40
c0: 84 03 00 00 00 00 00 00 00 00 00 00 00 00 00 00
d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Cheers,
Bernd
--
Bernd Schubert
Q-Leap Networks GmbH
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: Corrupt data - RAID sata_sil 3114 chip
2009-01-07 15:31 ` Bernd Schubert
@ 2009-01-11 0:32 ` Robert Hancock
2009-01-11 0:43 ` Robert Hancock
0 siblings, 1 reply; 29+ messages in thread
From: Robert Hancock @ 2009-01-11 0:32 UTC (permalink / raw)
To: Bernd Schubert
Cc: Tejun Heo, Alan Cox, Justin Piszcz, debian-user, linux-raid,
linux-ide, cpardo
Bernd Schubert wrote:
>>> I think it's something related to setting up the PCI side of things.
>>> There have been hints that incorrect CLS setting was the culprit and I
>>> tried the combinations but without any success and unfortunately the
>>> problem wasn't reproducible with the hardware I have here. :-(
>> As far as the cache line size register, the only thing the documentation
>> says it controls _directly_ is "With the SiI3114 as a master, initiating
>> a read transaction, it issues PCI command Read Multiple in place, when
>> empty space in its FIFO is larger than the value programmed in this
>> register."
>>
>> The interesting thing is the commit (log below) that added code to the
>> driver to check the PCI cache line size register and set up the FIFO
>> thresholds:
>>
>> 2005/03/24 23:32:42-05:00 Carlos.Pardo
>> [PATCH] sata_sil: Fix FIFO PCI Bus Arbitration
>>
>> This patch set default values for the FIFO PCI Bus Arbitration to
>> avoid data corruption. The root cause is due to our PCI bus master
>> handling mismatch with the chipset PCI bridge during DMA xfer (write
>> data to the device). The patch is to setup the DMA fifo threshold so
>> that there is no chance for the DMA engine to change protocol. We have
>> seen this problem only on one motherboard.
>>
>> Signed-off-by: Silicon Image Corporation <cpardo@siliconimage.com>
>> Signed-off-by: Jeff Garzik <jgarzik@pobox.com>
>>
>> What the code's doing is setting the FIFO thresholds, used to assign
>> priority when requesting a PCI bus read or write operation, based on the
>> cache line size somehow. It seems to be trusting that the chip's cache
>> line size register has been set properly by the BIOS. The kernel should
>> know what the cache line size is but AFAIK normally only sets it when
>> the driver requests MWI. This chip doesn't support MWI, but it looks
>> like pci_set_mwi would fix up the CLS register as a side effect..
>>
>>> Anyways, there was an interesting report that updating the BIOS on the
>>> controller fixed the problem.
>>>
>>> http://bugzilla.kernel.org/show_bug.cgi?id=10480
>>>
>>> Taking "lspci -nnvvvxxx" output of before and after such BIOS update
>>> will shed some light on what's really going on. Can you please try
>>> that?
>> Yes, that would be quite interesting.. the output even with the current
>> BIOS would be useful to see if the BIOS set some stupid cache line size
>> value..
>
> Unfortunately I can't update the bios/firmware of the Sil3114 directly, it is
> onboard and the firmware is included into the mainboard bios. There is not
> the most recent bios version installed, but when we initially had the
> problems, we first tried a bios update, but it didn't help.
Well if one is really adventurous one can sometimes use some BIOS image
editing tools to install an updated flash image for such integrated
chips into the main BIOS image. This is definitely for advanced users
only though..
>
> As suggested by Robert, I'm presently trying to figure out the corruption
> pattern. Actually our test tool easily provides these data. Unfortunately, it
> so far didn't report anything, although the reiserfs already got corrupted.
> Might be my colleague, who wrote that tool, recently broke something (as it
> is the second time, it doesn't report corruptions), in the past it did work
> reliably. Please give me a few more days...
>
>
> 03:05.0 Mass storage controller [0180]: Silicon Image, Inc. SiI 3114
> [SATALink/SATARaid] Serial ATA Controller [1095:3114] (rev 02)
> Subsystem: Silicon Image, Inc. SiI 3114 SATALink Controller
> [1095:3114]
> Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr-
> Stepping- SERR+ FastB2B-
> Status: Cap+ 66MHz+ UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort-
> <TAbort- <MAbort- >SERR- <PERR-
> Latency: 64, Cache Line Size: 64 bytes
Well, 64 seems quite reasonable, so that doesn't really give any more
useful information.
I'm CCing Carlos Pardo at Silicon Image who wrote the patch above, maybe
he has some insight.. Carlos, we have a case here where Bernd is
reporting seeing corruption on an integrated SiI3114 on a Tyan Thunder
K8S Pro (S2882) board, AMD 8111 chipset. This is reportedly occurring
only with certain Seagate drives. Do you have any insight into this
problem, in particular as far as whether the problem worked around in
the patch mentioned above might be related?
There are apparently some reports of issues on NVidia chipsets as well,
though I don't have any details at hand.
> Interrupt: pin A routed to IRQ 19
> Region 0: I/O ports at bc00 [size=8]
> Region 1: I/O ports at b880 [size=4]
> Region 2: I/O ports at b800 [size=8]
> Region 3: I/O ports at ac00 [size=4]
> Region 4: I/O ports at a880 [size=16]
> Region 5: Memory at feafec00 (32-bit, non-prefetchable) [size=1K]
> Expansion ROM at fea00000 [disabled] [size=512K]
> Capabilities: [60] Power Management version 2
> Flags: PMEClk- DSI+ D1+ D2+ AuxCurrent=0mA
> PME(D0-,D1-,D2-,D3hot-,D3cold-)
> Status: D0 PME-Enable- DSel=0 DScale=2 PME-
> 00: 95 10 14 31 07 01 b0 02 02 00 80 01 10 40 00 00
> 10: 01 bc 00 00 81 b8 00 00 01 b8 00 00 01 ac 00 00
> 20: 81 a8 00 00 00 ec af fe 00 00 00 00 95 10 14 31
> 30: 00 00 a0 fe 60 00 00 00 00 00 00 00 0a 01 00 00
> 40: 02 00 00 00 01 00 00 00 00 00 00 00 00 00 00 00
> 50: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 60: 01 00 22 06 00 40 00 64 00 00 00 00 00 00 00 00
> 70: 00 00 60 00 d0 d0 09 00 00 00 60 00 00 00 00 00
> 80: 03 00 00 00 22 00 00 00 00 00 00 00 c8 93 7f ef
> 90: 00 00 00 09 ff ff 00 00 00 00 00 19 00 00 00 00
> a0: 01 31 15 65 dd 62 dd 62 92 43 92 43 09 40 09 40
> b0: 01 21 15 65 dd 62 dd 62 92 43 92 43 09 40 09 40
> c0: 84 03 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>
>
>
> Cheers,
> Bernd
>
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: Corrupt data - RAID sata_sil 3114 chip
2009-01-11 0:32 ` Robert Hancock
@ 2009-01-11 0:43 ` Robert Hancock
2009-01-12 1:30 ` Tejun Heo
0 siblings, 1 reply; 29+ messages in thread
From: Robert Hancock @ 2009-01-11 0:43 UTC (permalink / raw)
Cc: Bernd Schubert, Tejun Heo, Alan Cox, Justin Piszcz, debian-user,
linux-raid, linux-ide
Robert Hancock wrote:
> Bernd Schubert wrote:
>>>> I think it's something related to setting up the PCI side of things.
>>>> There have been hints that incorrect CLS setting was the culprit and I
>>>> tried the combinations but without any success and unfortunately the
>>>> problem wasn't reproducible with the hardware I have here. :-(
>>> As far as the cache line size register, the only thing the documentation
>>> says it controls _directly_ is "With the SiI3114 as a master, initiating
>>> a read transaction, it issues PCI command Read Multiple in place, when
>>> empty space in its FIFO is larger than the value programmed in this
>>> register."
>>>
>>> The interesting thing is the commit (log below) that added code to the
>>> driver to check the PCI cache line size register and set up the FIFO
>>> thresholds:
>>>
>>> 2005/03/24 23:32:42-05:00 Carlos.Pardo
>>> [PATCH] sata_sil: Fix FIFO PCI Bus Arbitration
>>>
>>> This patch set default values for the FIFO PCI Bus Arbitration to
>>> avoid data corruption. The root cause is due to our PCI bus master
>>> handling mismatch with the chipset PCI bridge during DMA xfer (write
>>> data to the device). The patch is to setup the DMA fifo threshold so
>>> that there is no chance for the DMA engine to change protocol. We
>>> have
>>> seen this problem only on one motherboard.
>>>
>>> Signed-off-by: Silicon Image Corporation <cpardo@siliconimage.com>
>>> Signed-off-by: Jeff Garzik <jgarzik@pobox.com>
>>>
>>> What the code's doing is setting the FIFO thresholds, used to assign
>>> priority when requesting a PCI bus read or write operation, based on the
>>> cache line size somehow. It seems to be trusting that the chip's cache
>>> line size register has been set properly by the BIOS. The kernel should
>>> know what the cache line size is but AFAIK normally only sets it when
>>> the driver requests MWI. This chip doesn't support MWI, but it looks
>>> like pci_set_mwi would fix up the CLS register as a side effect..
>>>
>>>> Anyways, there was an interesting report that updating the BIOS on the
>>>> controller fixed the problem.
>>>>
>>>> http://bugzilla.kernel.org/show_bug.cgi?id=10480
>>>>
>>>> Taking "lspci -nnvvvxxx" output of before and after such BIOS update
>>>> will shed some light on what's really going on. Can you please try
>>>> that?
>>> Yes, that would be quite interesting.. the output even with the current
>>> BIOS would be useful to see if the BIOS set some stupid cache line size
>>> value..
>>
>> Unfortunately I can't update the bios/firmware of the Sil3114
>> directly, it is onboard and the firmware is included into the
>> mainboard bios. There is not the most recent bios version installed,
>> but when we initially had the problems, we first tried a bios update,
>> but it didn't help.
>
> Well if one is really adventurous one can sometimes use some BIOS image
> editing tools to install an updated flash image for such integrated
> chips into the main BIOS image. This is definitely for advanced users
> only though..
>
>>
>> As suggested by Robert, I'm presently trying to figure out the
>> corruption pattern. Actually our test tool easily provides these data.
>> Unfortunately, it so far didn't report anything, although the reiserfs
>> already got corrupted. Might be my colleague, who wrote that tool,
>> recently broke something (as it is the second time, it doesn't report
>> corruptions), in the past it did work reliably. Please give me a few
>> more days...
>>
>>
>> 03:05.0 Mass storage controller [0180]: Silicon Image, Inc. SiI 3114
>> [SATALink/SATARaid] Serial ATA Controller [1095:3114] (rev 02)
>> Subsystem: Silicon Image, Inc. SiI 3114 SATALink Controller
>> [1095:3114]
>> Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop-
>> ParErr- Stepping- SERR+ FastB2B-
>> Status: Cap+ 66MHz+ UDF- FastB2B+ ParErr- DEVSEL=medium
>> >TAbort- <TAbort- <MAbort- >SERR- <PERR-
>> Latency: 64, Cache Line Size: 64 bytes
>
> Well, 64 seems quite reasonable, so that doesn't really give any more
> useful information.
>
> I'm CCing Carlos Pardo at Silicon Image who wrote the patch above, maybe
> he has some insight.. Carlos, we have a case here where Bernd is
> reporting seeing corruption on an integrated SiI3114 on a Tyan Thunder
> K8S Pro (S2882) board, AMD 8111 chipset. This is reportedly occurring
> only with certain Seagate drives. Do you have any insight into this
> problem, in particular as far as whether the problem worked around in
> the patch mentioned above might be related?
>
> There are apparently some reports of issues on NVidia chipsets as well,
> though I don't have any details at hand.
Well, Carlos' email bounces, so much for that one. Anyone have any other
contacts at Silicon Image?
>
>> Interrupt: pin A routed to IRQ 19
>> Region 0: I/O ports at bc00 [size=8]
>> Region 1: I/O ports at b880 [size=4]
>> Region 2: I/O ports at b800 [size=8]
>> Region 3: I/O ports at ac00 [size=4]
>> Region 4: I/O ports at a880 [size=16]
>> Region 5: Memory at feafec00 (32-bit, non-prefetchable) [size=1K]
>> Expansion ROM at fea00000 [disabled] [size=512K]
>> Capabilities: [60] Power Management version 2
>> Flags: PMEClk- DSI+ D1+ D2+ AuxCurrent=0mA
>> PME(D0-,D1-,D2-,D3hot-,D3cold-)
>> Status: D0 PME-Enable- DSel=0 DScale=2 PME-
>> 00: 95 10 14 31 07 01 b0 02 02 00 80 01 10 40 00 00
>> 10: 01 bc 00 00 81 b8 00 00 01 b8 00 00 01 ac 00 00
>> 20: 81 a8 00 00 00 ec af fe 00 00 00 00 95 10 14 31
>> 30: 00 00 a0 fe 60 00 00 00 00 00 00 00 0a 01 00 00
>> 40: 02 00 00 00 01 00 00 00 00 00 00 00 00 00 00 00
>> 50: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>> 60: 01 00 22 06 00 40 00 64 00 00 00 00 00 00 00 00
>> 70: 00 00 60 00 d0 d0 09 00 00 00 60 00 00 00 00 00
>> 80: 03 00 00 00 22 00 00 00 00 00 00 00 c8 93 7f ef
>> 90: 00 00 00 09 ff ff 00 00 00 00 00 19 00 00 00 00
>> a0: 01 31 15 65 dd 62 dd 62 92 43 92 43 09 40 09 40
>> b0: 01 21 15 65 dd 62 dd 62 92 43 92 43 09 40 09 40
>> c0: 84 03 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>> d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>> e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>> f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>>
>>
>>
>> Cheers,
>> Bernd
>>
>
>
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: Corrupt data - RAID sata_sil 3114 chip
2009-01-11 0:43 ` Robert Hancock
@ 2009-01-12 1:30 ` Tejun Heo
2009-01-19 18:43 ` Dave Jones
0 siblings, 1 reply; 29+ messages in thread
From: Tejun Heo @ 2009-01-12 1:30 UTC (permalink / raw)
To: Robert Hancock
Cc: Bernd Schubert, Alan Cox, Justin Piszcz, debian-user, linux-raid,
linux-ide
Robert Hancock wrote:
>> There are apparently some reports of issues on NVidia chipsets as
>> well, though I don't have any details at hand.
>
> Well, Carlos' email bounces, so much for that one. Anyone have any other
> contacts at Silicon Image?
I'll ping my SIMG contacts, though I've asked them about this problem in the
past and it didn't get anywhere.
Thanks.
--
tejun
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: Corrupt data - RAID sata_sil 3114 chip
2009-01-12 1:30 ` Tejun Heo
@ 2009-01-19 18:43 ` Dave Jones
2009-01-20 2:50 ` Robert Hancock
0 siblings, 1 reply; 29+ messages in thread
From: Dave Jones @ 2009-01-19 18:43 UTC (permalink / raw)
To: Tejun Heo
Cc: Robert Hancock, Bernd Schubert, Alan Cox, Justin Piszcz,
debian-user, linux-raid, linux-ide
On Mon, Jan 12, 2009 at 10:30:42AM +0900, Tejun Heo wrote:
> Robert Hancock wrote:
> >> There are apparently some reports of issues on NVidia chipsets as
> >> well, though I don't have any details at hand.
> >
> > Well, Carlos' email bounces, so much for that one. Anyone have any other
> > contacts at Silicon Image?
>
> I'll ping my SIMG contacts, though I've asked them about this problem in the
> past and it didn't get anywhere.
I wish I'd read this thread last week.. I've been beating my head
against this problem all weekend.
I picked up a cheap 3114 card, and found that when I created a filesystem
with it on a 250GB disk, it got massive corruption very quickly.
My experience echoes most of the other people's in this thread, but here are
a few data points I've been able to figure out..
I ran badblocks -v -w -s on the disk, and after running
for nearly 24 hours, it reported a huge number of blocks
failing at the upper part of the disk.
I created a partition in this bad area to speed up testing..
Device Boot Start End Blocks Id System
/dev/sde1 1 30000 240974968+ 83 Linux
/dev/sde2 30001 30200 1606500 83 Linux
/dev/sde3 30201 30401 1614532+ 83 Linux
Rerunning badblocks on /dev/sde2 consistently fails when
it gets to the reading back 0x00 stage.
(Somehow it passes reading back 0xff, 0xaa and 0x55)
I was beginning to suspect the disk may be bad, but when I
moved it to a box with Intel sata, the badblocks run on that
same partition succeeds with no problems at all.
Given the corruption happens at high block numbers, I'm wondering
if maybe there's some kind of wraparound bug happening here.
(Though why only the 0x00 pattern fails would still be a mystery).
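The badblocks -w passes Dave describes (0xaa, 0x55, 0xff, 0x00, each written then read back) can be mimicked against an ordinary file to show the shape of the test; this is a file-backed stand-in and not badblocks itself, the function name and block size are mine, and pointing it at a real device node would be just as destructive as badblocks -w.

```python
import os, tempfile

PATTERNS = (0xaa, 0x55, 0xff, 0x00)   # the four write-mode passes
BLOCK = 4096

def pattern_test(path, nblocks):
    """Write each pattern across the region, read it back, and return a
    dict mapping a failing pattern byte to the block numbers that
    mismatched -- empty on sane storage."""
    bad = {}
    for p in PATTERNS:
        buf = bytes([p]) * BLOCK
        with open(path, "r+b") as f:
            for i in range(nblocks):
                f.seek(i * BLOCK)
                f.write(buf)
            f.flush()
            os.fsync(f.fileno())
        with open(path, "rb") as f:
            for i in range(nblocks):
                f.seek(i * BLOCK)
                if f.read(BLOCK) != buf:
                    bad.setdefault(p, []).append(i)
    return bad
```

On the failing controller the interesting outcome would be an entry keyed by 0x00 only, matching the symptom above; on a healthy path the dict stays empty.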
After reading about the firmware update fixing it, I thought I'd
give that a shot. This was pretty much complete fail.
The DOS utility for flashing claims I'm running BIOS 5.0.39,
which looking at http://www.siliconimage.com/support/searchresults.aspx?pid=28&cat=15
is quite ancient. So I tried the newer ones.
Same experience with both 5.4.0.3 and 5.0.73
"BIOS version in the input file is not a newer version"
Forcing it to write anyway gets..
"Data is different at address 65f6h"
Dave
--
http://www.codemonkey.org.uk
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: Corrupt data - RAID sata_sil 3114 chip
2009-01-19 18:43 ` Dave Jones
@ 2009-01-20 2:50 ` Robert Hancock
2009-01-20 20:07 ` Dave Jones
0 siblings, 1 reply; 29+ messages in thread
From: Robert Hancock @ 2009-01-20 2:50 UTC (permalink / raw)
To: Dave Jones
Cc: Tejun Heo, Bernd Schubert, Alan Cox, Justin Piszcz, debian-user,
linux-raid, linux-ide
Dave Jones wrote:
> On Mon, Jan 12, 2009 at 10:30:42AM +0900, Tejun Heo wrote:
> > Robert Hancock wrote:
> > >> There are apparently some reports of issues on NVidia chipsets as
> > >> well, though I don't have any details at hand.
> > >
> > > Well, Carlos' email bounces, so much for that one. Anyone have any other
> > > contacts at Silicon Image?
> >
> > I'll ping my SIMG contacts, though I've asked them about this problem in the
> > past and it didn't get anywhere.
>
> I wish I'd read this thread last week.. I've been beating my head
> against this problem all weekend.
>
> I picked up a cheap 3114 card, and found that when I created a filesystem
> with it on a 250GB disk, it got massive corruption very quickly.
>
> My experience echoes most of the other people's in this thread, but here are
> a few data points I've been able to figure out..
>
> I ran badblocks -v -w -s on the disk, and after running
> for nearly 24 hours, it reported a huge number of blocks
> failing at the upper part of the disk.
>
> I created a partition in this bad area to speed up testing..
>
> Device Boot Start End Blocks Id System
> /dev/sde1 1 30000 240974968+ 83 Linux
> /dev/sde2 30001 30200 1606500 83 Linux
> /dev/sde3 30201 30401 1614532+ 83 Linux
>
> Rerunning badblocks on /dev/sde2 consistently fails when
> it gets to the reading back 0x00 stage.
> (Somehow it passes reading back 0xff, 0xaa and 0x55)
>
> I was beginning to suspect the disk may be bad, but when I
> moved it to a box with Intel sata, the badblocks run on that
> same partition succeeds with no problems at all.
>
> Given the corruption happens at high block numbers, I'm wondering
> if maybe there's some kind of wraparound bug happening here.
> (Though why only the 0x00 pattern fails would still be a mystery).
Yeah, that seems a bit bizarre.. Apparently somehow zeros are being
converted into non-zero.. Can you try zeroing out the partition by
dd'ing into it from /dev/zero or something, then dumping it back out to
see what kind of data is showing up?
>
>
> After reading about the firmware update fixing it, I thought I'd
> give that a shot. This was pretty much complete fail.
>
> The DOS utility for flashing claims I'm running BIOS 5.0.39,
> which looking at http://www.siliconimage.com/support/searchresults.aspx?pid=28&cat=15
> is quite ancient. So I tried the newer ones.
> Same experience with both 5.4.0.3, and 5.0.73
>
> "BIOS version in the input file is not a newer version"
>
> Forcing it to write anyway gets..
>
> "Data is different at address 65f6h"
>
>
>
>
> Dave
>
>
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: Corrupt data - RAID sata_sil 3114 chip
2009-01-20 2:50 ` Robert Hancock
@ 2009-01-20 20:07 ` Dave Jones
0 siblings, 0 replies; 29+ messages in thread
From: Dave Jones @ 2009-01-20 20:07 UTC (permalink / raw)
To: Robert Hancock
Cc: Tejun Heo, Bernd Schubert, Alan Cox, Justin Piszcz, debian-user,
linux-raid, linux-ide
On Mon, Jan 19, 2009 at 08:50:06PM -0600, Robert Hancock wrote:
> > Given the corruption happens at high block numbers, I'm wondering
> > if maybe there's some kind of wraparound bug happening here.
> > (Though why only the 0x00 pattern fails would still be a mystery).
>
> Yeah, that seems a bit bizarre.. Apparently somehow zeros are being
> converted into non-zero.. Can you try zeroing out the partition by
> dd'ing into it from /dev/zero or something, then dumping it back out to
> see what kind of data is showing up?
Hmm, it seems the failed firmware update has killed the eeprom.
It no longer reports the right PCI vendor ID.
Dave
--
http://www.codemonkey.org.uk
^ permalink raw reply [flat|nested] 29+ messages in thread
[parent not found: <495E01E3.9060903@sm7jqb.se>]
end of thread, other threads:[~2010-02-08 14:26 UTC | newest]
Thread overview: 29+ messages -- links below jump to the message on this page --
2010-01-29 16:13 Corrupt data - RAID sata_sil 3114 chip Ulli.Brennenstuhl
2010-01-29 19:37 ` Robert Hancock
2010-02-06 3:54 ` Tejun Heo
2010-02-06 15:16 ` Tim Small
2010-02-07 16:09 ` Robert Hancock
2010-02-08 2:31 ` Tejun Heo
2010-02-08 14:25 ` Tim Small
[not found] <bQVFb-3SB-37@gated-at.bofh.it>
[not found] ` <bQVFb-3SB-39@gated-at.bofh.it>
[not found] ` <bQVFb-3SB-41@gated-at.bofh.it>
[not found] ` <bQVFc-3SB-43@gated-at.bofh.it>
[not found] ` <bQVFc-3SB-45@gated-at.bofh.it>
[not found] ` <bQVFc-3SB-47@gated-at.bofh.it>
[not found] ` <bQVFb-3SB-35@gated-at.bofh.it>
[not found] ` <4963306F.4060504@sm7jqb.se>
2009-01-06 10:48 ` Justin Piszcz
-- strict thread matches above, loose matches on Subject: below --
2009-01-03 20:04 Bernd Schubert
2009-01-03 20:53 ` Robert Hancock
2009-01-03 21:11 ` Bernd Schubert
2009-01-03 23:23 ` Robert Hancock
2009-01-07 4:59 ` Tejun Heo
2009-01-07 5:38 ` Robert Hancock
2009-01-07 15:31 ` Bernd Schubert
2009-01-11 0:32 ` Robert Hancock
2009-01-11 0:43 ` Robert Hancock
2009-01-12 1:30 ` Tejun Heo
2009-01-19 18:43 ` Dave Jones
2009-01-20 2:50 ` Robert Hancock
2009-01-20 20:07 ` Dave Jones
[not found] <495E01E3.9060903@sm7jqb.se>
[not found] ` <alpine.DEB.1.10.0901020741200.11852@p34.internal.lan>
2009-01-02 21:30 ` Bernd Schubert
2009-01-02 21:47 ` Twigathy
2009-01-03 2:31 ` Redeeman
2009-01-03 13:13 ` Bernd Schubert
2009-01-03 13:39 ` Alan Cox
2009-01-03 16:20 ` Bernd Schubert
2009-01-03 18:31 ` Robert Hancock
2009-01-03 22:19 ` James Youngman