From mboxrd@z Thu Jan 1 00:00:00 1970
From: Peter Favrholdt
Subject: Re: sata_promise SATA300TX4 "intermittent problems"
Date: Thu, 08 Mar 2007 17:26:37 +0100
Message-ID: <45F0393D.80301@how.dk>
References: <45EECD12.1000506@how.dk> <17903.7366.875191.751728@alkaid.it.uu.se>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Return-path:
Received: from pfepc.post.tele.dk ([195.41.46.237]:38156 "EHLO pfepc.post.tele.dk" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752430AbXCHQ0k (ORCPT ); Thu, 8 Mar 2007 11:26:40 -0500
In-Reply-To: <17903.7366.875191.751728@alkaid.it.uu.se>
Sender: linux-ide-owner@vger.kernel.org
List-Id: linux-ide@vger.kernel.org
To: Mikael Pettersson
Cc: linux-ide@vger.kernel.org

Hi Mikael,

Thanks for the reply, I've commented below:

Mikael Pettersson wrote:
> SErr 0x01380000 would indicate:
> transport state transmission error (bit 24)
> CRC error (bit 21)
> disparity error (bit 20) [whatever that is]
> 10b_to_8b decoding error (bit 19)
>
> I.e., serious transmission issues.

:-)

> > [52849.930755] pdc_error_intr: port_status 0x00001000 serror 0x00000000
> > [52849.930880] ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2
> > frozen
> > [52849.930883] ata2.00: (port_status 0x00001000)
>
> "host bus timeout error" (bit 12).
> I wonder why SError was clear now.

I can't say - this whole ata thing is much too complex for me ;-)

> > I would be very happy to help debug this issue. Any suggestions on what
> > I should try next?
>
> Well, at the moment I have only one possible cure: to forcibly
> limit 3Gbps drives to 1.5Gbps operation, as the patch below does.

I haven't tried your 1.5Gbps patch (yet). But I have been running more
tests on my experiment system with the kernels I have handy.

My procedure is as follows:

1. power cycle
2. boot selected kernel
3. start dd if=/dev/sdx of=/dev/null bs=1M for x=a,b,c,d
4. wait until one fails
5. record dmesg output

So far here are my results:

2.6.18.1 fails (in 25 minutes)
2.6.19 fails (in 4 minutes)
2.6.19.2 fails (in 5 minutes)
2.6.20.1 fails (in 48 minutes)
2.6.21-rc2+p (with additional patches) doesn't fail

This is very consistent. 2.6.21-rc2+p has been tested for more than 10
hours without a hiccup :-)

In the above tests it is always ata3 or ata4 (sdc or sdd) which fails.

Another strange thing which happens on 2.6.21-rc2+p but not on the other
kernels: running smartctl -a -d ata while dd is running gives errors (I
also mentioned this in my first mail, but wasn't sure then):

[11046.005178] pdc_error_intr: port_status 0x00001000 serror 0x00000000
[11046.005286] ata4.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
[11046.005374] ata4.00: (port_status 0x00001000)
[11046.005383] ata4.00: cmd 25/00:00:00:3b:a0/00:01:27:00:00/e0 tag 0 cdb 0x0 data 131072 in
[11046.005385] res 50/00:00:ff:3b:a0/00:00:00:00:00/e0 Emask 0x4 (timeout)
[11046.313769] ata4: soft resetting port
[11046.469806] ata4: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[11046.496254] pdc_error_intr: port_status 0x00001000 serror 0x00000000
[11046.496580] ata4.00: failed to set xfermode (err_mask=0x104)
[11046.496585] ata4: failed to recover some devices, retrying in 5 secs
[11051.495393] ata4: hard resetting port
[11051.971276] ata4: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[11052.005267] ata4.00: configured for UDMA/133
[11052.005285] ata4: EH complete
[11052.042615] SCSI device sdd: 976773168 512-byte hdwr sectors (500108 MB)
[11052.051769] sdd: Write Protect is off
[11052.051778] sdd: Mode Sense: 00 3a 00 00
[11052.059455] SCSI device sdd: write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[11052.066354] SCSI device sdd: 976773168 512-byte hdwr sectors (500108 MB)
[11052.070822] sdd: Write Protect is off
[11052.070830] sdd: Mode Sense: 00 3a 00 00
[11052.073297] SCSI device sdd: write cache: enabled, read cache: enabled, doesn't support DPO or FUA
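
For anyone following along, the status words in these logs can be
decoded mechanically. Below is a minimal Python sketch, assuming only
the bit meanings quoted earlier in this mail (SErr bits 24, 21, 20 and
19, and port_status bit 12). The names decode_bits, SERROR_BITS and
PORT_STATUS_BITS are made up for illustration and are not part of
sata_promise or libata; any bit not named in this thread is reported by
number only.

```python
# Bit names below are copied from the quoted explanation in this thread.
# Bits that the thread does not name are deliberately left undescribed.
SERROR_BITS = {
    24: "transport state transmission error",
    21: "CRC error",
    20: "disparity error",
    19: "10b_to_8b decoding error",
}

PORT_STATUS_BITS = {
    12: "host bus timeout error",
}


def decode_bits(value, names):
    """Return (bit, description) for every set bit, highest bit first."""
    found = []
    for bit in range(value.bit_length() - 1, -1, -1):
        if value & (1 << bit):
            found.append((bit, names.get(bit, "(not named in this thread)")))
    return found


if __name__ == "__main__":
    samples = (
        ("SErr", 0x01380000, SERROR_BITS),
        ("port_status", 0x00001000, PORT_STATUS_BITS),
    )
    for label, value, names in samples:
        print(f"{label} 0x{value:08x}:")
        for bit, name in decode_bits(value, names):
            print(f"  bit {bit}: {name}")
```

Run on the two values from this thread, it lists bits 24, 21, 20 and 19
for SErr 0x01380000 and bit 12 for port_status 0x00001000. A value of 0,
as in the serror field of the log above, yields no set bits.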

After the hard reset the drive recovers and dd continues :-)

Note that using smartctl this way on the other kernels does not show
this problem!

> On one of my test machines (an old UltraSPARC), a SATA300 TX2plus
> with a Seagate 3Gbps drive (don't have the model number handy),
> will quickly experience "DMA S/G overrun" errors during an fsck
> of a large but clean ext3 partition. With the patch below things
> work solidly on that particular machine. OTOH, on another test
> machine (a 440BX chipset Intel PIII), the same card/cable/disk
> combination works flawlessly at 3Gbps. Mysterious.

My feeling is that this is not caused by 1.5Gbps vs. 3.0Gbps operation.
I was thinking about setting the speed selection jumpers on the hard
drives, but so far I'm not touching the system as I don't want hardware
problems (e.g. a loose cable) disturbing the test results. I'll stick to
replacing software.

My next test will be a plain 2.6.21-rc2. Then I'll apply the patches one
by one.

One thought is that this could be a bug/race condition which only shows
up under certain circumstances - maybe the robustness of 2.6.21-rc2+p is
due to local-apic not being enabled or some other subtle kernel build
difference?

Any suggestion on what I could do to help track this down is much
appreciated.

Best regards,

Peter Favrholdt