Re: Corrupt data - RAID sata_sil 3114 chip

From: Robert Hancock <hancockr@shaw.ca>
To: Tejun Heo <tj@kernel.org>
Cc: Bernd Schubert <bs@q-leap.de>,
	Alan Cox <alan@lxorguk.ukuu.org.uk>,
	Justin Piszcz <jpiszcz@lucidpixels.com>,
	debian-user@lists.debian.org, linux-raid@vger.kernel.org,
	linux-ide@vger.kernel.org
Subject: Re: Corrupt data - RAID sata_sil 3114 chip
Date: Tue, 06 Jan 2009 23:38:28 -0600
Message-ID: <49643FD4.9080100@shaw.ca>
In-Reply-To: <496436C4.4070305@kernel.org>

Tejun Heo wrote:
> Hello,
> 
> Bernd Schubert wrote:
>> But now more than a year has passed again without doing anything
>> about it and actually this is what I strongly criticize. Most people
>> don't know about issues like that and don't run file checksum tests
>> as I now always do before taking a disk into production. So users
>> are exposed to known data corruption problems without even being
>> warned about it. Usually even backups don't help, since one creates
>> a backup of the corrupted data.
> 
> sata_sil being one of the most popular controllers && data corruption
> reports seem to be concentrated on certain chipsets, I don't think
> it's a widespread problem.  In some cases, the corruption was very
> reproducible too.
> 
> I think it's something related to setting up the PCI side of things.
> There have been hints that incorrect CLS setting was the culprit and I
> tried the combinations but without any success and unfortunately the
> problem wasn't reproducible with the hardware I have here.  :-(

As for the cache line size register, the only thing the documentation 
says it controls _directly_ is "With the SiI3114 as a master, initiating 
a read transaction, it issues PCI command Read Multiple in place, when 
empty space in its FIFO is larger than the value programmed in this 
register."

The interesting thing is the commit (log below) that added code to the 
driver to check the PCI cache line size register and set up the FIFO 
thresholds:

   2005/03/24 23:32:42-05:00 Carlos.Pardo
   [PATCH] sata_sil: Fix FIFO PCI Bus Arbitration

   This patch set default values for the FIFO PCI Bus Arbitration to
   avoid data corruption. The root cause is due to our PCI bus master
   handling mismatch with the chipset PCI bridge during DMA xfer (write
   data to the device). The patch is to setup the DMA fifo threshold so
   that there is no chance for the DMA engine to change protocol. We have
   seen this problem only on one motherboard.

   Signed-off-by: Silicon Image Corporation <cpardo@siliconimage.com>
   Signed-off-by: Jeff Garzik <jgarzik@pobox.com>

The code sets the FIFO thresholds, which are used to assign priority 
when requesting a PCI bus read or write operation, based in some way on 
the cache line size. It seems to trust that the BIOS has set the chip's 
cache line size register properly. The kernel should know what the cache 
line size is, but AFAIK it normally only writes the register when the 
driver requests MWI (Memory Write and Invalidate). This chip doesn't 
support MWI, but it looks like pci_set_mwi would fix up the CLS register 
as a side effect.
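
To make the register itself concrete, here is a minimal C sketch, not 
taken from the driver. It relies on two facts from the PCI 
specification: the Cache Line Size register is the byte at offset 0x0c 
of the configuration header, and its value counts 32-bit dwords. The 
function names and the sanity rule (non-zero and a multiple of the 
expected line size) are assumptions made for illustration only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Offset of the Cache Line Size register in the standard PCI header. */
#define CLS_OFFSET 0x0c

/*
 * Return the cache line size in bytes from a raw config-space image.
 * The register holds the size in units of 32-bit dwords.
 */
unsigned int cls_bytes(const uint8_t *cfg, size_t len)
{
    assert(cfg != NULL && len > CLS_OFFSET);
    return (unsigned int)cfg[CLS_OFFSET] * 4u;
}

/*
 * Hypothetical sanity check (an assumption, not the driver's logic):
 * accept the value only if it is non-zero and a multiple of the line
 * size we expect for the platform.
 */
int cls_looks_sane(const uint8_t *cfg, size_t len, unsigned int expected)
{
    unsigned int bytes = cls_bytes(cfg, len);

    assert(expected != 0);
    return bytes != 0 && bytes % expected == 0;
}
```

With this, a register value of 0x10 decodes to 64 bytes, while a value 
of 0x00 suggests the register was never programmed.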

> 
> Anyways, there was an interesting report that updating the BIOS on the
> controller fixed the problem.
> 
>   http://bugzilla.kernel.org/show_bug.cgi?id=10480  
> 
> Taking "lspci -nnvvvxxx" output of before and after such BIOS update
> will shed some light on what's really going on.  Can you please try
> that?

Yes, that would be quite interesting. Even the output with the current 
BIOS would be useful, to see whether the BIOS set some stupid cache line 
size value.
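
A comparison of two such dumps could be automated along these lines. 
This is only a sketch in C: it assumes the hex bytes printed by 
"lspci -xxx" have already been parsed into two equal-length byte arrays, 
and the function name diff_config is made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Record the offsets at which two config-space images differ.
 * Stores at most max offsets in out and returns the total number of
 * differing bytes, which may exceed max.
 */
size_t diff_config(const uint8_t *before, const uint8_t *after, size_t len,
                   size_t *out, size_t max)
{
    size_t count = 0;

    assert(before != NULL && after != NULL);
    assert(out != NULL || max == 0);

    for (size_t i = 0; i < len; i++) {
        if (before[i] != after[i]) {
            if (count < max)
                out[count] = i;   /* remember where the images differ */
            count++;
        }
    }
    return count;
}
```

If offset 0x0c shows up in the result, the BIOS update changed the cache 
line size register.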

> 
>> So IMHO, the driver should be deactivated for sil3114 until a real
>> solution is found. And it only should be possible to force activate
>> it by a kernel flag, which then also would print a huuuge warning
>> about possible data corruption (unfortunately most distributions
>> disable initial kernel messages *grumble*).
> 
> The problem is serious but the scope is quite limited and we can't
> tell where the problem lies, so I'm not too sure about taking such
> drastic measure.  Grumble...
> 
> Yeah, I really want to see this long standing problem fixed.  To my
> knowledge, this is one of two still open data corruption bugs - the
> other one being via putting CDB bytes into burnt CD/DVDs.
> 
> So, if you can try the BIOS update thing, please give it a shot.
> 
> Thanks.
>
