linux-ide.vger.kernel.org archive mirror
* sata_sil, writing bug with multiple cards?
@ 2007-07-03  6:42 7091
  2007-07-03  8:51 ` Tejun Heo
  0 siblings, 1 reply; 18+ messages in thread
From: 7091 @ 2007-07-03  6:42 UTC (permalink / raw)
  To: linux-ide; +Cc: jgarzik

Greetings, 

I have been troubleshooting a problem for over a year now, and to make a
long story short, I think the sata_sil driver has a bug during writing when
there are multiple cards that are using different models of SiI chips in the
system. 

I will be watching the list, although cc'ing me over email will be useful
for speeding up replies. 

Longer version:
I've been having problems with my Linux server corrupting data.  Not just a
little - it can't copy a 700 meg ISO file and end up with the same checksum
(and usually corrupts the filesystem in the process). 

Hardware:
Asus A7N8X-Deluxe motherboard.  This has 2 parallel IDE connectors, each
with a 40 gig IDE HD hanging off it, and 2 SATA connectors (driven by a SiI
3112 chip) with (right now) 1 300G SATA HD and 1 250G SATA HD.  All of these
facts are important. 

On my PCI bus, I have a SiI 3114A card with 3 more 300G SATA HDs.  It should
be noted that only drives on the PCI card have corruption.  Neither the
parallel IDE HDs, nor the SATA drives on the motherboard experience the
problem. I have also tried replacing this card, which did not fix the
problem.  Also, a drive placed on the add-on card shows corruption, while
the same drive, cable, power, etc. on the motherboard works fine. 

I've already swapped motherboards, CPU, and RAM. 

Booting to a Knoppix 5.1 CD shows the same problems. 

Reading is fine (i.e. I can read the same file 50 times and get the same
md5sum).  Writing causes the corruption. 

The corruption happens no matter what filesystem I try (ext2, ext3, reiser,
xfs).  (This does mean I've reformatted, etc. several times, so it's not a
metadata problem) 

This happens with at least 3 different drives (the 300 and the 250,
different manufacturers), with different SATA data cables, power supplies,
power cables, etc. 

Now, here's the kicker.  Booting to an OpenBSD kernel, and using one of the
300G drives off the 3114A card (the one that shows corruption under Linux)
works fine. 

This happens with the Knoppix 5.1 kernel (2.6.19), my own compiled 2.6.20.3,
and Debian kernel 2.6.18-4-k7. 

More spammy data:
# lspci
00:00.0 Host bridge: nVidia Corporation nForce2 AGP (different version?) (rev a2)
00:00.1 RAM memory: nVidia Corporation nForce2 Memory Controller 1 (rev a2)
00:00.2 RAM memory: nVidia Corporation nForce2 Memory Controller 4 (rev a2)
00:00.3 RAM memory: nVidia Corporation nForce2 Memory Controller 3 (rev a2)
00:00.4 RAM memory: nVidia Corporation nForce2 Memory Controller 2 (rev a2)
00:00.5 RAM memory: nVidia Corporation nForce2 Memory Controller 5 (rev a2)
00:01.0 ISA bridge: nVidia Corporation nForce2 ISA Bridge (rev a3)
00:01.1 SMBus: nVidia Corporation nForce2 SMBus (MCP) (rev a2)
00:02.0 USB Controller: nVidia Corporation nForce2 USB Controller (rev a3)
00:02.1 USB Controller: nVidia Corporation nForce2 USB Controller (rev a3)
00:02.2 USB Controller: nVidia Corporation nForce2 USB Controller (rev a3)
00:04.0 Ethernet controller: nVidia Corporation nForce2 Ethernet Controller (rev a1)
00:08.0 PCI bridge: nVidia Corporation nForce2 External PCI Bridge (rev a3)
00:09.0 IDE interface: nVidia Corporation nForce2 IDE (rev a2)
00:0c.0 PCI bridge: nVidia Corporation nForce2 PCI Bridge (rev a3)
00:1e.0 PCI bridge: nVidia Corporation nForce2 AGP (rev a2)
01:06.0 VGA compatible controller: Matrox Graphics, Inc. MGA 2164W [Millennium II]
01:07.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL-8169 Gigabit Ethernet (rev 10)
01:09.0 Mass storage controller: Silicon Image, Inc. SiI 3114 [SATALink/SATARaid] Serial ATA Controller (rev 02)
01:0b.0 RAID bus controller: Silicon Image, Inc. SiI 3112 [SATALink/SATARaid] Serial ATA Controller (rev 01)
02:01.0 Ethernet controller: 3Com Corporation 3C920B-EMB Integrated Fast Ethernet Controller [Tornado] (rev 40)

Any assistance, input, etc. appreciated.  Thanks.


* Re: sata_sil, writing bug with multiple cards?
  2007-07-03  6:42 sata_sil, writing bug with multiple cards? 7091
@ 2007-07-03  8:51 ` Tejun Heo
  2007-07-04  1:40   ` 7091
  2007-07-04  2:05   ` 7091
  0 siblings, 2 replies; 18+ messages in thread
From: Tejun Heo @ 2007-07-03  8:51 UTC (permalink / raw)
  To: 7091; +Cc: linux-ide, jgarzik

7091@blargh.com wrote:
> Greetings,
> I have been troubleshooting a problem for over a year now, and to make a
> long story short, I think the sata_sil driver has a bug during writing when
> there are multiple cards that are using different models of SiI chips in
> the
> system.
> I will be watching the list, although cc'ing me over email will be useful
> for speeding up replies.
> Longer version:
> I've been having problems with my Linux server corrupting data.  Not just a
> little - it can't copy a 700 meg ISO file and end up with the same checksum
> (and usually corrupts the filesystem in the process).
> Hardware:
> Asus A7N8X-Deluxe motherboard.  This has 2 parallel IDE connectors, each
> with a 40 gig IDE HD hanging off it, and 2 SATA connectors (driven by a SiI
> 3112 chip) with (right now) 1 300G SATA HD and 1 250G SATA HD.  All of

Sounds awfully similar to the recent nvidia data corruption issue.  It
involved the IOMMU, and one of the workarounds was to not use the IOMMU,
IIRC.  Please turn off the IOMMU and see what happens.  You can turn it
off by passing "iommu=off" as a kernel parameter.
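
To confirm after the reboot that the parameter actually reached the running
kernel, the boot command line can be inspected.  A minimal sketch, assuming a
POSIX shell; the helper name cmdline_has_param is made up for illustration,
and /proc/cmdline is where Linux exposes the boot command line:

```shell
# Hypothetical helper: succeed if a kernel parameter appears as a whole
# word in a command-line file (default: /proc/cmdline).
cmdline_has_param() {
    param=$1
    file=${2:-/proc/cmdline}
    # Put each word on its own line, then require an exact full-line match.
    tr ' ' '\n' < "$file" | grep -qxF -- "$param"
}
```

For example, `cmdline_has_param iommu=off && echo "iommu=off is active"`
prints the message only when the parameter was passed at boot.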

-- 
tejun


* Re: sata_sil, writing bug with multiple cards?
  2007-07-03  8:51 ` Tejun Heo
@ 2007-07-04  1:40   ` 7091
  2007-07-04  1:48     ` Tejun Heo
  2007-07-04  2:05   ` 7091
  1 sibling, 1 reply; 18+ messages in thread
From: 7091 @ 2007-07-04  1:40 UTC (permalink / raw)
  To: Tejun Heo; +Cc: 7091, linux-ide, jgarzik

Tejun Heo writes: 

> Sounds awfully similar to the recent nvidia data corruption issue.  It
> was involving the IOMMU and one of the workarounds was not using IOMMU,
> IIRC.  Please turn off IOMMU and see what happens.  You can turn IOMMU
> off by passing "iommu=off" as kernel parameter. 
> 
> -- 
> tejun

That indeed seems to have fixed the problem - I've been pulling my hair out 
forever about this! 

Is this something I'll need to leave in forever, is it supposed to get 
fixed, etc.? 

I'm going to continue doing testing and really try to thrash it tonight (I 
just did the 'copy an ISO 5 times' test that always failed before - it 
succeeds now). 

Thanks for the help.


* Re: sata_sil, writing bug with multiple cards?
  2007-07-04  1:40   ` 7091
@ 2007-07-04  1:48     ` Tejun Heo
  0 siblings, 0 replies; 18+ messages in thread
From: Tejun Heo @ 2007-07-04  1:48 UTC (permalink / raw)
  To: 7091; +Cc: linux-ide, jgarzik

7091@blargh.com wrote:
> That indeed seems to have fixed the problem - I've been pulling my hair
> out forever about this!
> Is this something I'll need to leave in forever, is it supposed to get
> fixed, etc.?
> I'm going to continue doing testing and really try to thrash it tonight
> (I just did the 'copy an ISO 5 times' test that always failed before -
> it succeeds now).
> Thanks for the help.

I think the fix went in during the 2.6.22 devel cycle; it basically
turns off the IOMMU by default on the affected chipset, but I haven't
really followed it after that.

-- 
tejun


* Re: sata_sil, writing bug with multiple cards?
  2007-07-03  8:51 ` Tejun Heo
  2007-07-04  1:40   ` 7091
@ 2007-07-04  2:05   ` 7091
  2007-07-04  3:22     ` 7091
                       ` (2 more replies)
  1 sibling, 3 replies; 18+ messages in thread
From: 7091 @ 2007-07-04  2:05 UTC (permalink / raw)
  To: Tejun Heo; +Cc: 7091, linux-ide, jgarzik

Tejun Heo writes: 

> Sounds awfully similar to the recent nvidia data corruption issue.  It
> was involving the IOMMU and one of the workarounds was not using IOMMU,
> IIRC.  Please turn off IOMMU and see what happens.  You can turn IOMMU
> off by passing "iommu=off" as kernel parameter. 
> 
> -- 
> tejun

Blast.  I stand corrected - it didn't fix it. 

I now have 3 300G drives hanging off the add-in card, sda sdb sdc.  I did 
the following: 

# Make a RAID5 array of 3 out of the 4 drives I'll eventually be using
mdadm --create /dev/md6 --chunk=64 --level=raid5 --raid-devices=4 \
    /dev/sda1 /dev/sdb1 /dev/sdc1 missing
# Make the FS
mkfs.ext3 /dev/md6
# Test
cp KNOPPIX_V5.1.0CD-2006-12-30-EN.iso kn1.iso
cp kn1.iso kn2.iso
cp kn2.iso kn3.iso
cp kn3.iso kn4.iso
# Check
md5sum *.iso
eea5ecb53f1c6a397bcfeedc2fd42c64  kn1.iso
0360941210aa2d7159999e37c636f8cb  kn2.iso
md5sum: kn3.iso: Input/output error
86b008915fe02569a513b6c5ec45a523  kn4.iso 

In the dmesg:
[ 2619.483783] attempt to access beyond end of device
[ 2619.483956] md6: rw=0, want=25890291976, limit=1758201216
[ 2619.484121] attempt to access beyond end of device
[ 2619.484284] md6: rw=0, want=30654872552, limit=1758201216
[ 2619.484452] attempt to access beyond end of device
[ 2619.484614] md6: rw=0, want=33748316016, limit=1758201216
[ 2619.484780] attempt to access beyond end of device
[ 2619.484943] md6: rw=0, want=7818926368, limit=1758201216
[ 2619.485122] attempt to access beyond end of device
[ 2619.485285] md6: rw=0, want=32912340352, limit=1758201216
[ 2619.485451] attempt to access beyond end of device
[ 2619.496228] md6: rw=0, want=2277136440, limit=1758201216
[ 2619.496394] attempt to access beyond end of device
[ 2619.496557] md6: rw=0, want=2128357688, limit=1758201216
[ 2619.529625] attempt to access beyond end of device
[ 2619.530392] md6: rw=0, want=25890291976, limit=1758201216
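
The want= and limit= values in the dmesg lines above can be put into
perspective with a little arithmetic.  A rough sketch, assuming both numbers
are counts of 512-byte sectors; the function names are made up for
illustration:

```shell
# Convert a count of 512-byte sectors to whole gigabytes (10^9 bytes).
sectors_to_gb() {
    echo $(( $1 * 512 / 1000000000 ))
}

# Succeed when a request (want) reaches past the end of the device (limit).
beyond_end() {
    want=$1
    limit=$2
    [ "$want" -gt "$limit" ]
}

# sectors_to_gb 1758201216    -> 900    (size of md6)
# sectors_to_gb 25890291976   -> 13255  (first rejected request)
```

Under that assumption the limit works out to roughly 900 GB, which fits three
300G data members, while the first rejected request sits past 13 TB.  That
suggests the block numbers being read back are themselves garbage, not just
slightly out of range.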


* Re: sata_sil, writing bug with multiple cards?
  2007-07-04  2:05   ` 7091
@ 2007-07-04  3:22     ` 7091
  2007-07-04  3:44       ` 7091
  2007-07-04  3:41     ` Tejun Heo
  2007-07-04  9:18     ` Alan Cox
  2 siblings, 1 reply; 18+ messages in thread
From: 7091 @ 2007-07-04  3:22 UTC (permalink / raw)
  To: 7091; +Cc: Tejun Heo, linux-ide, jgarzik

7091@blargh.com writes: 

> Tejun Heo writes:  
> 
>> Sounds awfully similar to the recent nvidia data corruption issue.  It
>> was involving the IOMMU and one of the workarounds was not using IOMMU,
>> IIRC.  Please turn off IOMMU and see what happens.  You can turn IOMMU
>> off by passing "iommu=off" as kernel parameter.  
>> 
>> -- 
>> tejun
> 
> Blast.  I stand corrected - it didn't fix it.  
> 
> I now have 3 300G drives hanging off the add-in card, sda sdb sdc.  I did 
> the following:  
> 
> # Make a RAID5 array of 3 out of the 4 drives I'll eventually be using
> mdadm --create /dev/md6 --chunk=64 --level=raid5 --raid-devices=4 
> /dev/sda1 /dev/sdb1 /dev/sdc1 missing

Here's an odd data point. 

I just broke that array, formatted all three of those partitions separately, 
mounted and did my ISO copy test. 

All three drives, run one at a time, function fine.  No corruption.


* Re: sata_sil, writing bug with multiple cards?
  2007-07-04  2:05   ` 7091
  2007-07-04  3:22     ` 7091
@ 2007-07-04  3:41     ` Tejun Heo
  2007-07-04  9:18     ` Alan Cox
  2 siblings, 0 replies; 18+ messages in thread
From: Tejun Heo @ 2007-07-04  3:41 UTC (permalink / raw)
  To: 7091; +Cc: linux-ide, jgarzik

7091@blargh.com wrote:
> Tejun Heo writes:
>> Sounds awfully similar to the recent nvidia data corruption issue.  It
>> was involving the IOMMU and one of the workarounds was not using IOMMU,
>> IIRC.  Please turn off IOMMU and see what happens.  You can turn IOMMU
>> off by passing "iommu=off" as kernel parameter.
>> -- 
>> tejun
> 
> Blast.  I stand corrected - it didn't fix it.
> I now have 3 300G drives hanging off the add-in card, sda sdb sdc.  I
> did the following:
> # Make a RAID5 array of 3 out of the 4 drives I'll eventually be using
> mdadm --create /dev/md6 --chunk=64 --level=raid5 --raid-devices=4
> /dev/sda1 /dev/sdb1 /dev/sdc1 missing
> # Make the FS
> mkfs.ext3 /dev/md6
> # Test
> cp KNOPPIX_V5.1.0CD-2006-12-30-EN.iso kn1.iso
> cp kn1.iso kn2.iso
> cp kn2.iso kn3.iso
> cp kn3.iso kn4.iso
> # Check
> md5sum *.iso
> eea5ecb53f1c6a397bcfeedc2fd42c64  kn1.iso
> 0360941210aa2d7159999e37c636f8cb  kn2.iso
> md5sum: kn3.iso: Input/output error
> 86b008915fe02569a513b6c5ec45a523  kn4.iso
> In the dmesg:
> [ 2619.483783] attempt to access beyond end of device
> [ 2619.483956] md6: rw=0, want=25890291976, limit=1758201216

Something went very wrong here and it probably doesn't have much to do
with IOMMU corruption.  Care to post full dmesg?

-- 
tejun


* Re: sata_sil, writing bug with multiple cards?
  2007-07-04  3:22     ` 7091
@ 2007-07-04  3:44       ` 7091
  2007-07-04  3:53         ` Tejun Heo
  0 siblings, 1 reply; 18+ messages in thread
From: 7091 @ 2007-07-04  3:44 UTC (permalink / raw)
  To: 7091; +Cc: Tejun Heo, linux-ide, jgarzik

Apologies for the chain-replying to myself, just replying as I think of 
things to try. 

7091@blargh.com writes: 

> Here's an odd data point.  
> 
> I just broke that array, formatted all three of those partitions 
> separately, mounted and did my ISO copy test.  
> 
> All three drives, run one at a time, function fine.  No corruption.

Here's another odd one.  I did the following:
# Mount all 3 drives as individuals...
mount /dev/sda1 /mnt/a
mount /dev/sdb1 /mnt/b
mount /dev/sdc1 /mnt/c
# Copy the same file to all three drives at the same time
cp KNOPPIX_V5.1.0CD-2006-12-30-EN.iso a/kn10.iso &
cp KNOPPIX_V5.1.0CD-2006-12-30-EN.iso b/kn10.iso &
cp KNOPPIX_V5.1.0CD-2006-12-30-EN.iso c/kn10.iso & 

Got massive corruption.
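
The concurrent test above can be made repeatable with a small script that
also does the checksum comparison.  A minimal sketch, assuming a POSIX shell
and coreutils md5sum; the function name parallel_copy_check is hypothetical:

```shell
# Hypothetical helper: copy one source file to several directories at
# the same time, then compare each copy's MD5 sum with the source.
# Returns 0 only if every copy matches.
parallel_copy_check() {
    src=$1
    shift
    name=$(basename "$src")
    want=$(md5sum < "$src")

    # Start one background copy per target directory, then wait for all.
    for dir in "$@"; do
        cp "$src" "$dir/$name" &
    done
    wait

    status=0
    for dir in "$@"; do
        # An unreadable copy yields an empty sum and counts as a mismatch.
        got=$(md5sum < "$dir/$name")
        if [ "$got" != "$want" ]; then
            echo "MISMATCH: $dir/$name" >&2
            status=1
        fi
    done
    return $status
}
```

It would be called as
`parallel_copy_check KNOPPIX_V5.1.0CD-2006-12-30-EN.iso /mnt/a /mnt/b /mnt/c`.
Note that a checksum taken right after the copy may be served partly from the
page cache, so unmounting and remounting the targets before checking gives a
more trustworthy read from disk.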


* Re: sata_sil, writing bug with multiple cards?
  2007-07-04  3:44       ` 7091
@ 2007-07-04  3:53         ` Tejun Heo
  2007-07-04  7:08           ` Andi Kleen
  0 siblings, 1 reply; 18+ messages in thread
From: Tejun Heo @ 2007-07-04  3:53 UTC (permalink / raw)
  To: 7091; +Cc: linux-ide, jgarzik, ak, Linux Kernel Mailing List

7091@blargh.com wrote:
> Apologies for the chain-replying to myself, just replying as I think of
> things to try.
> 7091@blargh.com writes:
>> Here's an odd data point. 
>> I just broke that array, formatted all three of those partitions
>> separately, mounted and did my ISO copy test. 
>> All three drives, run one at a time, function fine.  No corruption.
> 
> Here's another odd one.  I did the following:
> # Mount all 3 drives as individuals...
> mount /dev/sda1 /mnt/a
> mount /dev/sdb1 /mnt/b
> mount /dev/sdc1 /mnt/c
> # Copy the same file to all three drives at the same time
> cp KNOPPIX_V5.1.0CD-2006-12-30-EN.iso a/kn10.iso &
> cp KNOPPIX_V5.1.0CD-2006-12-30-EN.iso b/kn10.iso &
> cp KNOPPIX_V5.1.0CD-2006-12-30-EN.iso c/kn10.iso &
> Got massive corruption.

Hmmm... I don't think this is sata_sil driver bug.  cc'ing Andi Kleen
and lkml.  Andi, the original thread can be read from

http://thread.gmane.org/gmane.linux.ide/20213

Any ideas?

-- 
tejun


* Re: sata_sil, writing bug with multiple cards?
  2007-07-04  3:53         ` Tejun Heo
@ 2007-07-04  7:08           ` Andi Kleen
  2007-07-04  8:17             ` 7091
  0 siblings, 1 reply; 18+ messages in thread
From: Andi Kleen @ 2007-07-04  7:08 UTC (permalink / raw)
  To: Tejun Heo; +Cc: 7091, linux-ide, jgarzik, Linux Kernel Mailing List

On Wednesday 04 July 2007 05:53:30 Tejun Heo wrote:
> 7091@blargh.com wrote:
> > Apologies for the chain-replying to myself, just replying as I think of
> > things to try.
> > 7091@blargh.com writes:
> >> Here's an odd data point. 
> >> I just broke that array, formatted all three of those partitions
> >> separately, mounted and did my ISO copy test. 
> >> All three drives, run one at a time, function fine.  No corruption.
> > 
> > Here's another odd one.  I did the following:
> > # Mount all 3 drives as individuals...
> > mount /dev/sda1 /mnt/a
> > mount /dev/sdb1 /mnt/b
> > mount /dev/sdc1 /mnt/c
> > # Copy the same file to all three drives at the same time
> > cp KNOPPIX_V5.1.0CD-2006-12-30-EN.iso a/kn10.iso &
> > cp KNOPPIX_V5.1.0CD-2006-12-30-EN.iso b/kn10.iso &
> > cp KNOPPIX_V5.1.0CD-2006-12-30-EN.iso c/kn10.iso &
> > Got massive corruption.
> 
> Hmmm... I don't think this is sata_sil driver bug.  cc'ing Andi Kleen
> and lkml.  Andi, the original thread can be read from
> 
> http://thread.gmane.org/gmane.linux.ide/20213

It seems to be a 32bit system. There is no IOMMU. 

If it has >2GB or so it might be worth trying booting it with mem=2G
and see if it goes away. However if it was the standard VIA DAC issue
you should get problems even with a single interface, so probably
that's not it either.

Most likely it is some sort of hardware bug that we might
not be able to do much about. Have you tried contacting SIL or VIA? 
e.g. if you have some other system with a different chipset it might
be useful to test the SIL controllers in those.

I would perhaps also try a newer kernel.

-Andi


* Re: sata_sil, writing bug with multiple cards?
  2007-07-04  7:08           ` Andi Kleen
@ 2007-07-04  8:17             ` 7091
  2007-07-04  8:38               ` Andi Kleen
  0 siblings, 1 reply; 18+ messages in thread
From: 7091 @ 2007-07-04  8:17 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Tejun Heo, 7091, linux-ide, jgarzik, Linux Kernel Mailing List

Andi Kleen writes:

> If it has >2GB or so it might be worth trying booting it with mem=2G

Nope, only 1GB of RAM.

> Most likely it is some sort of hardware bug that we might
> not be able to do much about. Have you tried contacting SIL or VIA? 

No, I haven't.  Like I mentioned above, the OpenBSD drivers seemed to work, 
or at least did with similar tests.  I would need to run the more extensive 
checks to be positive, but those take a lot of time, obviously.  And 
downtime for the box, a lot of which isn't really manageable, at the moment. 

> e.g. if you have some other system with a different chipset it might
> be useful to test the SIL controllers in those.

The previous motherboard was an AMD 760 chipset, and it had the same 
problem. 

> I would perhaps also try a newer kernel.

I can certainly try that - I admit 2.6.20.3 is a little old now.  This will 
probably take me a couple days - tomorrow is the 4th of July and a holiday 
for me.


* Re: sata_sil, writing bug with multiple cards?
  2007-07-04  8:17             ` 7091
@ 2007-07-04  8:38               ` Andi Kleen
  2007-07-04  8:52                 ` Tejun Heo
  0 siblings, 1 reply; 18+ messages in thread
From: Andi Kleen @ 2007-07-04  8:38 UTC (permalink / raw)
  To: 7091; +Cc: Tejun Heo, linux-ide, jgarzik, Linux Kernel Mailing List

On Wednesday 04 July 2007 10:17:34 7091@blargh.com wrote:

> > Most likely it is some sort of hardware bug that we might
> > not be able to do much about. Have you tried contacting SIL or VIA? 
> 
> No, I haven't.  Like I mentioned above, the OpenBSD drivers seemed to work, 
> or at least did with similar tests.  I would need to run the more extensive 
> checks to be positive, but those take a lot of time, obviously.  And 
> downtime for the box, a lot of which isn't really manageable, at the moment. 

Perhaps the OpenBSD drivers program the SIL chip in a different way
that avoids this problem. 

> 
> > e.g. if you have some other system with a different chipset it might
> > be useful to test the SIL controllers in those.
> 
> The previous motherboard was an AMD 760 chipset, and it had the same 
> problem. 

Ok this means it's likely a SIL issue, not a chipset issue.

-Andi


* Re: sata_sil, writing bug with multiple cards?
  2007-07-04  8:38               ` Andi Kleen
@ 2007-07-04  8:52                 ` Tejun Heo
  2007-07-10 10:55                   ` Tejun Heo
  0 siblings, 1 reply; 18+ messages in thread
From: Tejun Heo @ 2007-07-04  8:52 UTC (permalink / raw)
  To: Andi Kleen; +Cc: 7091, linux-ide, jgarzik, Linux Kernel Mailing List

Andi Kleen wrote:
> On Wednesday 04 July 2007 10:17:34 7091@blargh.com wrote:
> 
>>> Most likely it is some sort of hardware bug that we might
>>> not be able to do much about. Have you tried contacting SIL or VIA? 
>> No, I haven't.  Like I mentioned above, the OpenBSD drivers seemed to work, 
>> or at least did with similar tests.  I would need to run the more extensive 
>> checks to be positive, but those take a lot of time, obviously.  And 
>> downtime for the box, a lot of which isn't really manageable, at the moment. 
> 
> Perhaps the OpenBSD drivers program the SIL chip in a different way
> that avoids this problem. 
> 
>>> e.g. if you have some other system with a different chipset it might
>>> be useful to test the SIL controllers in those.
>> The previous motherboard was an AMD 760 chipset, and it had the same 
>> problem. 
> 
> Ok this means it's likely a SIL issue, not a chipset issue.

Hmmmm... okay.  I'll take a look at the OpenBSD driver.  I still have no
idea what it can be tho.  Maybe, FIFO setup?

Thanks.

-- 
tejun


* Re: sata_sil, writing bug with multiple cards?
  2007-07-04  9:18     ` Alan Cox
@ 2007-07-04  9:14       ` Tejun Heo
  2007-07-04  9:26       ` 7091
  1 sibling, 0 replies; 18+ messages in thread
From: Tejun Heo @ 2007-07-04  9:14 UTC (permalink / raw)
  To: Alan Cox; +Cc: 7091, linux-ide, jgarzik, Andi Kleen

Alan Cox wrote:
>> # Make a RAID5 array of 3 out of the 4 drives I'll eventually be using
>> mdadm --create /dev/md6 --chunk=64 --level=raid5 --raid-devices=4 /dev/sda1 
>> /dev/sdb1 /dev/sdc1 missing
>> # Make the FS
>> mkfs.ext3 /dev/md6
>> # Test
>> cp KNOPPIX_V5.1.0CD-2006-12-30-EN.iso kn1.iso
>> cp kn1.iso kn2.iso
>> cp kn2.iso kn3.iso
>> cp kn3.iso kn4.iso
>> # Check
>> md5sum *.iso
>> eea5ecb53f1c6a397bcfeedc2fd42c64  kn1.iso
>> 0360941210aa2d7159999e37c636f8cb  kn2.iso
>> md5sum: kn3.iso: Input/output error
>> 86b008915fe02569a513b6c5ec45a523  kn4.iso 
> 
> Not surprised to be honest. We have a large number of reports that are all
> of the following form
> 
> 	"Nvidia chipset, Silicon Image SATA, corruption"
> 
> and several reports that BIOS updates fixed it. Unfortunately we don't
> know what the BIOS updates do (or indeed if what they do is board
> specific) or how to work around it otherwise.

I thought a good number of those were the IOMMU bug but I could be
mistaken.  Considering the large number of sata_sil's in the field, it's
a bit surprising to see we still have this kind of problem.  :-(

-- 
tejun


* Re: sata_sil, writing bug with multiple cards?
  2007-07-04  2:05   ` 7091
  2007-07-04  3:22     ` 7091
  2007-07-04  3:41     ` Tejun Heo
@ 2007-07-04  9:18     ` Alan Cox
  2007-07-04  9:14       ` Tejun Heo
  2007-07-04  9:26       ` 7091
  2 siblings, 2 replies; 18+ messages in thread
From: Alan Cox @ 2007-07-04  9:18 UTC (permalink / raw)
  Cc: Tejun Heo, 7091, linux-ide, jgarzik

> # Make a RAID5 array of 3 out of the 4 drives I'll eventually be using
> mdadm --create /dev/md6 --chunk=64 --level=raid5 --raid-devices=4 /dev/sda1 
> /dev/sdb1 /dev/sdc1 missing
> # Make the FS
> mkfs.ext3 /dev/md6
> # Test
> cp KNOPPIX_V5.1.0CD-2006-12-30-EN.iso kn1.iso
> cp kn1.iso kn2.iso
> cp kn2.iso kn3.iso
> cp kn3.iso kn4.iso
> # Check
> md5sum *.iso
> eea5ecb53f1c6a397bcfeedc2fd42c64  kn1.iso
> 0360941210aa2d7159999e37c636f8cb  kn2.iso
> md5sum: kn3.iso: Input/output error
> 86b008915fe02569a513b6c5ec45a523  kn4.iso 

Not surprised to be honest. We have a large number of reports that are all
of the following form

	"Nvidia chipset, Silicon Image SATA, corruption"

and several reports that BIOS updates fixed it. Unfortunately we don't
know what the BIOS updates do (or indeed if what they do is board
specific) or how to work around it otherwise.

Alan


* Re: sata_sil, writing bug with multiple cards?
  2007-07-04  9:18     ` Alan Cox
  2007-07-04  9:14       ` Tejun Heo
@ 2007-07-04  9:26       ` 7091
  1 sibling, 0 replies; 18+ messages in thread
From: 7091 @ 2007-07-04  9:26 UTC (permalink / raw)
  To: Alan Cox; +Cc: 7091, Tejun Heo, linux-ide, jgarzik

Alan Cox writes: 

> Not surprised to be honest. We have a large number of reports that are all
> of the following form 
> 
> 	"Nvidia chipset, Silicon Image SATA, corruption" 
> 
> and several reports that BIOS updates fixed it. Unfortunately we don't
> know what the BIOS updates do (or indeed if what they do is board
> specific) or how to work around it otherwise. 
> 
> Alan

Well, that's the thing, and one of the details that may have gotten a little 
lost in the long reply chain. 

The Silicon Image SATA ports on the motherboard work fine. 

Silicon Image SATA ports on addon cards are showing corruption. 

I have also flashed the Motherboard BIOS to the newest available version, 
which got the onboard ports, and the addon card got flashed as well. 

:(  A very frustrating problem.


* Re: sata_sil, writing bug with multiple cards?
  2007-07-04  8:52                 ` Tejun Heo
@ 2007-07-10 10:55                   ` Tejun Heo
  2007-07-12  3:21                     ` 7091
  0 siblings, 1 reply; 18+ messages in thread
From: Tejun Heo @ 2007-07-10 10:55 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Andi Kleen, 7091, linux-ide, jgarzik, Linux Kernel Mailing List

[-- Attachment #1: Type: text/plain, Size: 1141 bytes --]

Tejun Heo wrote:
> Andi Kleen wrote:
>> On Wednesday 04 July 2007 10:17:34 7091@blargh.com wrote:
>>
>>>> Most likely it is some sort of hardware bug that we might
>>>> not be able to do much about. Have you tried contacting SIL or VIA? 
>>> No, I haven't.  Like I mentioned above, the OpenBSD drivers seemed to work, 
>>> or at least did with similar tests.  I would need to run the more extensive 
>>> checks to be positive, but those take a lot of time, obviously.  And 
>>> downtime for the box, a lot of which isn't really manageable, at the moment. 
>> Perhaps the OpenBSD drivers program the SIL chip in a different way
>> that avoids this problem. 
>>
>>>> e.g. if you have some other system with a different chipset it might
>>>> be useful to test the SIL controllers in those.
>>> The previous motherboard was an AMD 760 chipset, and it had the same 
>>> problem. 
>> Ok this means it's likely a SIL issue, not a chipset issue.
> 
> Hmmmm... okay.  I'll take a look at the OpenBSD driver.  I still have no
> idea what it can be tho.  Maybe, FIFO setup?

Please give a shot at the attached patch on top of 2.6.22.  Thanks.

-- 
tejun

[-- Attachment #2: sata_sil-update-cache-line-programming.patch --]
[-- Type: text/x-patch, Size: 2085 bytes --]

diff --git a/drivers/ata/sata_sil.c b/drivers/ata/sata_sil.c
index 2a86dc4..6c0fe7e 100644
--- a/drivers/ata/sata_sil.c
+++ b/drivers/ata/sata_sil.c
@@ -280,14 +280,6 @@ static int slow_down = 0;
 module_param(slow_down, int, 0444);
 MODULE_PARM_DESC(slow_down, "Sledgehammer used to work around random problems, by limiting commands to 15 sectors (0=off, 1=on)");
 
-
-static unsigned char sil_get_device_cache_line(struct pci_dev *pdev)
-{
-	u8 cache_line = 0;
-	pci_read_config_byte(pdev, PCI_CACHE_LINE_SIZE, &cache_line);
-	return cache_line;
-}
-
 /**
  *	sil_set_mode		-	wrap set_mode functions
  *	@ap: port to set up
@@ -597,17 +589,29 @@ static void sil_init_controller(struct ata_host *host)
 	u32 tmp;
 	int i;
 
-	/* Initialize FIFO PCI bus arbitration */
-	cls = sil_get_device_cache_line(pdev);
-	if (cls) {
-		cls >>= 3;
-		cls++;  /* cls = (line_size/8)+1 */
-		for (i = 0; i < host->n_ports; i++)
-			writew(cls << 8 | cls,
-			       mmio_base + sil_port[i].fifo_cfg);
-	} else
-		dev_printk(KERN_WARNING, &pdev->dev,
-			   "cache line size not set.  Driver may not function\n");
+	/* When the Silicon Image 3112/4 retries a PCI memory read
+	 * command, it may retry it as a memory read multiple command
+	 * under some circumstances.  This can totally confuse some
+	 * PCI controllers, so ensure that it will never do this by
+	 * making sure that the Read Threshold (FIFO Read Request
+	 * Control) field of the FIFO Valid Byte Count and Control
+	 * registers for all the channels are set to be at least as
+	 * large as the cacheline size register.
+	 *
+	 * tj - code and comment shamelessly taken from NetBSD.
+	 */
+	pci_read_config_byte(pdev, PCI_CACHE_LINE_SIZE, &cls);
+	cls *= 4;
+
+	if (cls > 224) {
+		pci_write_config_byte(pdev, PCI_CACHE_LINE_SIZE, 224 / 4);
+		cls = 224;
+	} else if (cls < 32)
+		cls = 32;
+
+	cls = DIV_ROUND_UP(cls, 32);
+	for (i = 0; i < host->n_ports; i++)
+		writeb(cls, mmio_base + sil_port[i].fifo_cfg);
 
 	/* Apply R_ERR on DMA activate FIS errata workaround */
 	if (host->ports[0]->flags & SIL_FLAG_RERR_ON_DMA_ACT) {
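
The threshold arithmetic in the patch can be checked by hand.  The sketch
below mirrors it in shell, assuming the PCI cache line size register counts
32-bit words and ignoring any integer-width truncation that the C code might
be subject to; both function names are made up for illustration:

```shell
# New computation from the patch. Argument: raw value of the PCI cache
# line size register (units of 4 bytes).
fifo_read_threshold() {
    bytes=$(( $1 * 4 ))            # register units -> bytes
    if [ "$bytes" -gt 224 ]; then
        bytes=224                  # the patch also rewrites the register to 224 / 4
    elif [ "$bytes" -lt 32 ]; then
        bytes=32
    fi
    echo $(( (bytes + 31) / 32 ))  # DIV_ROUND_UP(bytes, 32)
}

# Removed computation, for comparison: (register / 8) + 1, with nothing
# programmed at all when the register reads zero.
old_fifo_threshold() {
    [ "$1" -eq 0 ] && return 1
    echo $(( ($1 >> 3) + 1 ))
}
```

For a 64-byte cache line (register value 16) the new code yields 2 where the
removed code yielded 3.  The removed code also wrote its value into both
bytes of the 16-bit FIFO register with writew(), while the patch writes a
single byte with writeb().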


* Re: sata_sil, writing bug with multiple cards?
  2007-07-10 10:55                   ` Tejun Heo
@ 2007-07-12  3:21                     ` 7091
  0 siblings, 0 replies; 18+ messages in thread
From: 7091 @ 2007-07-12  3:21 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Tejun Heo, Andi Kleen, 7091, linux-ide, jgarzik,
	Linux Kernel Mailing List

Tejun Heo writes: 

> Please give a shot at the attached patch on top of 2.6.22.  Thanks.

Patch applied, but still getting the corruption.
