* Re: sata_sil, writing bug with multiple cards?
From: Tejun Heo @ 2007-07-04 3:53 UTC
To: 7091; +Cc: linux-ide, jgarzik, ak, Linux Kernel Mailing List

7091@blargh.com wrote:
> Apologies for the chain-replying to myself, just replying as I think of
> things to try.
> 7091@blargh.com writes:
>> Here's an odd data point.
>> I just broke that array, formatted all three of those partitions
>> separately, mounted and did my ISO copy test.
>> All three drives, run one at a time, function fine. No corruption.
>
> Here's another odd one. I did the following:
> # Mount all 3 drives as individuals...
> mount /dev/sda1 /mnt/a
> mount /dev/sdb1 /mnt/b
> mount /dev/sdc1 /mnt/c
> # Copy the same file to all three drives at the same time
> cp KNOPPIX_V5.1.0CD-2006-12-30-EN.iso a/kn10.iso &
> cp KNOPPIX_V5.1.0CD-2006-12-30-EN.iso b/kn10.iso &
> cp KNOPPIX_V5.1.0CD-2006-12-30-EN.iso c/kn10.iso &
> Got massive corruption.

Hmmm... I don't think this is a sata_sil driver bug. Cc'ing Andi Kleen
and lkml. Andi, the original thread can be read at

http://thread.gmane.org/gmane.linux.ide/20213

Any ideas?

--
tejun

* Re: sata_sil, writing bug with multiple cards?
From: Andi Kleen @ 2007-07-04 7:08 UTC
To: Tejun Heo; +Cc: 7091, linux-ide, jgarzik, Linux Kernel Mailing List

On Wednesday 04 July 2007 05:53:30 Tejun Heo wrote:
> 7091@blargh.com wrote:
> > Apologies for the chain-replying to myself, just replying as I think of
> > things to try.
> > 7091@blargh.com writes:
> >> Here's an odd data point.
> >> I just broke that array, formatted all three of those partitions
> >> separately, mounted and did my ISO copy test.
> >> All three drives, run one at a time, function fine. No corruption.
> >
> > Here's another odd one. I did the following:
> > # Mount all 3 drives as individuals...
> > mount /dev/sda1 /mnt/a
> > mount /dev/sdb1 /mnt/b
> > mount /dev/sdc1 /mnt/c
> > # Copy the same file to all three drives at the same time
> > cp KNOPPIX_V5.1.0CD-2006-12-30-EN.iso a/kn10.iso &
> > cp KNOPPIX_V5.1.0CD-2006-12-30-EN.iso b/kn10.iso &
> > cp KNOPPIX_V5.1.0CD-2006-12-30-EN.iso c/kn10.iso &
> > Got massive corruption.
>
> Hmmm... I don't think this is a sata_sil driver bug. Cc'ing Andi Kleen
> and lkml. Andi, the original thread can be read at
>
> http://thread.gmane.org/gmane.linux.ide/20213

It seems to be a 32-bit system. There is no IOMMU.

If it has >2GB or so, it might be worth booting it with mem=2G
to see if the problem goes away.

However, if it was the standard VIA DAC issue you should get problems
even with a single interface, so probably that's not it either.

Most likely it is some sort of hardware bug that we might
not be able to do much about. Have you tried contacting SIL or VIA?

E.g. if you have some other system with a different chipset, it might
be useful to test the SIL controllers in it.

I would perhaps also try a newer kernel.

-Andi

* Re: sata_sil, writing bug with multiple cards?
From: 7091 @ 2007-07-04 8:17 UTC
To: Andi Kleen; +Cc: Tejun Heo, 7091, linux-ide, jgarzik, Linux Kernel Mailing List

Andi Kleen writes:

> If it has >2GB or so it might be worth trying booting it with mem=2G

Nope, only 1GB of RAM.

> Most likely it is some sort of hardware bug that we might
> not be able to do much about. Have you tried contacting SIL or VIA?

No, I haven't. Like I mentioned above, the OpenBSD drivers seemed to work,
or at least did with similar tests. I would need to run the more extensive
checks to be positive, but those take a lot of time, obviously. And
downtime for the box, a lot of which isn't really manageable, at the moment.

> e.g. if you have some other system with a different chipset it might
> be useful to test the SIL controllers in those.

The previous motherboard was an AMD 760 chipset, and it had the same
problem.

> I would perhaps also try a newer kernel.

I can certainly try that - I admit 2.6.20.3 is a little old now. This will
probably take me a couple days - tomorrow is the 4th of July and a holiday
for me.

* Re: sata_sil, writing bug with multiple cards?
From: Andi Kleen @ 2007-07-04 8:38 UTC
To: 7091; +Cc: Tejun Heo, linux-ide, jgarzik, Linux Kernel Mailing List

On Wednesday 04 July 2007 10:17:34 7091@blargh.com wrote:

> > Most likely it is some sort of hardware bug that we might
> > not be able to do much about. Have you tried contacting SIL or VIA?
>
> No, I haven't. Like I mentioned above, the OpenBSD drivers seemed to work,
> or at least did with similar tests. I would need to run the more extensive
> checks to be positive, but those take a lot of time, obviously. And
> downtime for the box, a lot of which isn't really manageable, at the moment.

Perhaps the OpenBSD drivers program the SIL chip in a different way
that avoids this problem.

> > e.g. if you have some other system with a different chipset it might
> > be useful to test the SIL controllers in those.
>
> The previous motherboard was an AMD 760 chipset, and it had the same
> problem.

Ok this means it's likely a SIL issue, not a chipset issue.

-Andi

* Re: sata_sil, writing bug with multiple cards?
From: Tejun Heo @ 2007-07-04 8:52 UTC
To: Andi Kleen; +Cc: 7091, linux-ide, jgarzik, Linux Kernel Mailing List

Andi Kleen wrote:
> On Wednesday 04 July 2007 10:17:34 7091@blargh.com wrote:
>
>>> Most likely it is some sort of hardware bug that we might
>>> not be able to do much about. Have you tried contacting SIL or VIA?
>> No, I haven't. Like I mentioned above, the OpenBSD drivers seemed to work,
>> or at least did with similar tests. I would need to run the more extensive
>> checks to be positive, but those take a lot of time, obviously. And
>> downtime for the box, a lot of which isn't really manageable, at the moment.
>
> Perhaps the OpenBSD drivers program the SIL chip in a different way
> that avoids this problem.
>
>>> e.g. if you have some other system with a different chipset it might
>>> be useful to test the SIL controllers in those.
>> The previous motherboard was an AMD 760 chipset, and it had the same
>> problem.
>
> Ok this means it's likely a SIL issue, not a chipset issue.

Hmmmm... okay. I'll take a look at the OpenBSD driver. I still have no
idea what it can be tho. Maybe FIFO setup?

Thanks.

--
tejun

* Re: sata_sil, writing bug with multiple cards?
From: Tejun Heo @ 2007-07-10 10:55 UTC
To: Tejun Heo; +Cc: Andi Kleen, 7091, linux-ide, jgarzik, Linux Kernel Mailing List

Tejun Heo wrote:
> Andi Kleen wrote:
>> On Wednesday 04 July 2007 10:17:34 7091@blargh.com wrote:
>>
>>>> Most likely it is some sort of hardware bug that we might
>>>> not be able to do much about. Have you tried contacting SIL or VIA?
>>> No, I haven't. Like I mentioned above, the OpenBSD drivers seemed to work,
>>> or at least did with similar tests. I would need to run the more extensive
>>> checks to be positive, but those take a lot of time, obviously. And
>>> downtime for the box, a lot of which isn't really manageable, at the moment.
>> Perhaps the OpenBSD drivers program the SIL chip in a different way
>> that avoids this problem.
>>
>>>> e.g. if you have some other system with a different chipset it might
>>>> be useful to test the SIL controllers in those.
>>> The previous motherboard was an AMD 760 chipset, and it had the same
>>> problem.
>> Ok this means it's likely a SIL issue, not a chipset issue.
>
> Hmmmm... okay. I'll take a look at the OpenBSD driver. I still have no
> idea what it can be tho. Maybe FIFO setup?

Please give a shot at the attached patch on top of 2.6.22. Thanks.

--
tejun

[-- Attachment #2: sata_sil-update-cache-line-programming.patch --]
[-- Type: text/x-patch, Size: 2085 bytes --]

diff --git a/drivers/ata/sata_sil.c b/drivers/ata/sata_sil.c
index 2a86dc4..6c0fe7e 100644
--- a/drivers/ata/sata_sil.c
+++ b/drivers/ata/sata_sil.c
@@ -280,14 +280,6 @@ static int slow_down = 0;
 module_param(slow_down, int, 0444);
 MODULE_PARM_DESC(slow_down, "Sledgehammer used to work around random problems, by limiting commands to 15 sectors (0=off, 1=on)");
 
-
-static unsigned char sil_get_device_cache_line(struct pci_dev *pdev)
-{
-	u8 cache_line = 0;
-	pci_read_config_byte(pdev, PCI_CACHE_LINE_SIZE, &cache_line);
-	return cache_line;
-}
-
 /**
  * sil_set_mode - wrap set_mode functions
  * @ap: port to set up
@@ -597,17 +589,29 @@ static void sil_init_controller(struct ata_host *host)
 	u32 tmp;
 	int i;
 
-	/* Initialize FIFO PCI bus arbitration */
-	cls = sil_get_device_cache_line(pdev);
-	if (cls) {
-		cls >>= 3;
-		cls++; /* cls = (line_size/8)+1 */
-		for (i = 0; i < host->n_ports; i++)
-			writew(cls << 8 | cls,
-			       mmio_base + sil_port[i].fifo_cfg);
-	} else
-		dev_printk(KERN_WARNING, &pdev->dev,
-			   "cache line size not set. Driver may not function\n");
+	/* When the Silicon Image 3112/4 retries a PCI memory read
+	 * command, it may retry it as a memory read multiple command
+	 * under some circumstances. This can totally confuse some
+	 * PCI controllers, so ensure that it will never do this by
+	 * making sure that the Read Threshold (FIFO Read Request
+	 * Control) field of the FIFO Valid Byte Count and Control
+	 * registers for all the channels are set to be at least as
+	 * large as the cacheline size register.
+	 *
+	 * tj - code and comment shamelessly taken from NetBSD.
+	 */
+	pci_read_config_byte(pdev, PCI_CACHE_LINE_SIZE, &cls);
+	cls *= 4;
+
+	if (cls > 224) {
+		pci_write_config_byte(pdev, PCI_CACHE_LINE_SIZE, 224 / 4);
+		cls = 224;
+	} else if (cls < 32)
+		cls = 32;
+
+	cls = DIV_ROUND_UP(cls, 32);
+	for (i = 0; i < host->n_ports; i++)
+		writeb(cls, mmio_base + sil_port[i].fifo_cfg);
 
 	/* Apply R_ERR on DMA activate FIS errata workaround */
 	if (host->ports[0]->flags & SIL_FLAG_RERR_ON_DMA_ACT) {

* Re: sata_sil, writing bug with multiple cards?
From: 7091 @ 2007-07-12 3:21 UTC
To: Tejun Heo; +Cc: Tejun Heo, Andi Kleen, 7091, linux-ide, jgarzik, Linux Kernel Mailing List

Tejun Heo writes:

> Please give a shot at the attached patch on top of 2.6.22. Thanks.

Patch applied, but still getting the corruption.

* sata_sil, writing bug with multiple cards?
From: 7091 @ 2007-06-27 21:36 UTC
To: linux-kernel

Greetings,

I have been troubleshooting a problem for over a year now, and to make a
long story short, I think the sata_sil driver has a bug during writing when
there are multiple cards that are using different models of SiI chips in the
system.

I will be watching the list, although cc'ing me over email will be useful
for speeding up replies.

Longer version:
I've been having problems with my Linux server corrupting data. Not just a
little - it can't copy a 700 meg ISO file and end up with the same checksum
(and usually corrupts the filesystem in the process).

Hardware:
Asus A7N8X-Deluxe motherboard. This has 2 parallel IDE connectors, each
with a 40 gig IDE HD hanging off it, and 2 SATA connectors (driven by a SiI
3112 chip) with (right now) 1 300G SATA HD and 1 250G SATA HD. All of these
facts are important.

On my PCI bus, I have a SiI 3114A card with 3 more 300G SATA HDs. It should
be noted that only drives on the PCI card have corruption. Neither the
parallel IDE HDs, nor the SATA drives on the motherboard experience the
problem. I have also tried replacing this card, which did not fix the
problem. Also, a drive placed on the add-on card shows corruption, while
the same drive, cable, power, etc. on the motherboard works fine.

I've already swapped motherboards, CPU, and RAM.

Booting to a Knoppix 5.1 CD shows the same problems.

Reading is fine (i.e. I can read the same file 50 times and get the same
md5sum). Writing causes the corruption.

The corruption happens no matter what filesystem I try (ext2, ext3, reiser,
xfs). (This does mean I've reformatted, etc. several times, so it's not a
metadata problem.)

This happens with at least 3 different drives (the 300 and the 250,
different manufacturers), with different SATA data cables, power supplies,
power cables, etc.

Now, here's the kicker. Booting to an OpenBSD kernel, and using one of the
300G drives off the 3114A card (the one that shows corruption under Linux)
works fine.

This happens with the Knoppix 5.1 kernel (2.6.19), my own compiled 2.6.20.3,
and Debian kernel 2.6.18-4-k7.

More spammy data:
# lspci
00:00.0 Host bridge: nVidia Corporation nForce2 AGP (different version?) (rev a2)
00:00.1 RAM memory: nVidia Corporation nForce2 Memory Controller 1 (rev a2)
00:00.2 RAM memory: nVidia Corporation nForce2 Memory Controller 4 (rev a2)
00:00.3 RAM memory: nVidia Corporation nForce2 Memory Controller 3 (rev a2)
00:00.4 RAM memory: nVidia Corporation nForce2 Memory Controller 2 (rev a2)
00:00.5 RAM memory: nVidia Corporation nForce2 Memory Controller 5 (rev a2)
00:01.0 ISA bridge: nVidia Corporation nForce2 ISA Bridge (rev a3)
00:01.1 SMBus: nVidia Corporation nForce2 SMBus (MCP) (rev a2)
00:02.0 USB Controller: nVidia Corporation nForce2 USB Controller (rev a3)
00:02.1 USB Controller: nVidia Corporation nForce2 USB Controller (rev a3)
00:02.2 USB Controller: nVidia Corporation nForce2 USB Controller (rev a3)
00:04.0 Ethernet controller: nVidia Corporation nForce2 Ethernet Controller (rev a1)
00:08.0 PCI bridge: nVidia Corporation nForce2 External PCI Bridge (rev a3)
00:09.0 IDE interface: nVidia Corporation nForce2 IDE (rev a2)
00:0c.0 PCI bridge: nVidia Corporation nForce2 PCI Bridge (rev a3)
00:1e.0 PCI bridge: nVidia Corporation nForce2 AGP (rev a2)
01:06.0 VGA compatible controller: Matrox Graphics, Inc. MGA 2164W [Millennium II]
01:07.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL-8169 Gigabit Ethernet (rev 10)
01:09.0 Mass storage controller: Silicon Image, Inc. SiI 3114 [SATALink/SATARaid] Serial ATA Controller (rev 02)
01:0b.0 RAID bus controller: Silicon Image, Inc. SiI 3112 [SATALink/SATARaid] Serial ATA Controller (rev 01)
02:01.0 Ethernet controller: 3Com Corporation 3C920B-EMB Integrated Fast Ethernet Controller [Tornado] (rev 40)

Any assistance, input, etc. appreciated. Thanks.

* Re: sata_sil, writing bug with multiple cards?
From: Janos Haar @ 2007-07-08 8:01 UTC
To: 7091; +Cc: linux-kernel

Hello, list,

I have a little note on this.

About 1 - 1.5 years ago I had the same problem.
I have 4 servers, each with 3 SATA SiI cards, and 2 of the 4 servers froze
the first time, during the RAID5 resync!
After a restart, the box kicked 2 HDDs from the array, both of them on the
same card.
I tried replacing the card with a new one; nothing changed, and the resync
stopped again.
I tried swapping the 2x3 cards between the 2 servers, and nothing changed!
When I tried to read the stripe (RAID5), everything worked fine.
But when I tried to write a large amount of data, these two boxes froze
(or rebooted, I can't remember now).

Finally I bought Promise cards for all the boxes, and all the problems
were gone. :-)

The problem is real, and it looks like it is in the driver (or in the
workaround for a hardware problem?).

Cheers,
Janos Haar