From: Peter Favrholdt <linux-ide@how.dk>
To: linux-ide@vger.kernel.org
Subject: sata_promise SATA300TX4 "intermittent problems"
Date: Wed, 07 Mar 2007 15:32:50 +0100
Message-ID: <45EECD12.1000506@how.dk>

Hi,

I've seen "intermittent problems" with Promise SATA300 TX4 controllers
and Linux kernel 2.6.19 (through 2.6.20-rc2 with some additional
patches).

Sometimes the TX4 will lose a port; a reboot brings the drive back up
again. I'm quite sure the hard drives are not at fault.

I have experienced this using "plain vanilla" Linux 2.6.19.2 and
2.6.20.1. Today I have tested using Linux 2.6.21-rc2 with Mikael
Pettersson's patches (more on that further down).

Yesterday (using 2.6.20.1) I could fail two out of four drives by doing:
dd if=/dev/sda of=/dev/null bs=1M &
dd if=/dev/sdb of=/dev/null bs=1M &
dd if=/dev/sdc of=/dev/null bs=1M &
dd if=/dev/sdd of=/dev/null bs=1M &

sdd would fail first, then after a while sdc. Here is the dmesg output
from when sdd failed:

[14895.092650] ata4.00: exception Emask 0x0 SAct 0x0 SErr 0x1380000 action 0x2 frozen
[14895.092664] ata4.00: cmd 25/00:00:00:3e:1a/00:02:05:00:00/e0 tag 0 cdb 0x0 data 262144 in
[14895.092666]          res 40/00:01:09:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
[14895.404597] ata4: soft resetting port
[14895.560511] ata4: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[14925.555206] ata4.00: qc timeout (cmd 0xec)
[14925.555437] ata4.00: failed to IDENTIFY (I/O error, err_mask=0x104)
[14925.555441] ata4.00: revalidation failed (errno=-5)
[14925.555452] ata4: failed to recover some devices, retrying in 5 secs
[14930.556912] ata4: hard resetting port
[14930.876763] ata4: COMRESET failed (device not ready)
[14930.876772] ata4: hardreset failed, retrying in 5 secs
[14935.878525] ata4: hard resetting port
[14936.198407] ata4: COMRESET failed (device not ready)
[14936.198416] ata4: hardreset failed, retrying in 5 secs
[14941.200169] ata4: hard resetting port
[14941.520051] ata4: COMRESET failed (device not ready)
[14941.520060] ata4: reset failed, giving up
[14941.520063] ata4.00: disabled
[14941.520075] ata4: EH complete
[14941.520567] sd 4:0:0:0: SCSI error: return code = 0x00040000
[14941.520572] end_request: I/O error, dev sdd, sector 85605888
[14941.520577] Buffer I/O error on device sdd, logical block 10700736
[14941.520582] Buffer I/O error on device sdd, logical block 10700737
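
As a cross-check on the log above, the taskfile bytes in the "cmd" line can be decoded by hand. The sketch below assumes the field order libata uses when it prints a taskfile (command/feature:nsect:lbal:lbam:lbah/hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah/device); decode_taskfile is a made-up helper name, not an existing tool. For the failing command it gives LBA 85605888, which is the sector reported by end_request, and 512 sectors, which is the 262144 bytes shown after "data".

```shell
#!/usr/bin/env bash
# Decode the "cmd" taskfile printed by the libata error handler.
# Assumed field order:
#   command/feature:nsect:lbal:lbam:lbah/hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah/device
# decode_taskfile is a hypothetical helper for illustration only.
decode_taskfile() {
    local tf=$1 cmd feat nsect lbal lbam lbah hfeat hnsect hlbal hlbam hlbah dev
    # Split on both '/' and ':' into the twelve register bytes.
    IFS='/:' read -r cmd feat nsect lbal lbam lbah hfeat hnsect hlbal hlbam hlbah dev <<< "$tf"
    # 48-bit LBA: the "hob" (high order byte) registers supply bits 24..47.
    local lba=$(( (0x$hlbah << 40) | (0x$hlbam << 32) | (0x$hlbal << 24) |
                  (0x$lbah << 16) | (0x$lbam << 8) | 0x$lbal ))
    # 16-bit sector count; 0 means 65536 sectors for LBA48 commands.
    local count=$(( (0x$hnsect << 8) | 0x$nsect ))
    if (( count == 0 )); then
        count=65536
    fi
    echo "lba=$lba sectors=$count bytes=$(( count * 512 ))"
}

decode_taskfile "25/00:00:00:3e:1a/00:02:05:00:00/e0"
# prints: lba=85605888 sectors=512 bytes=262144
```

The two "Buffer I/O error" lines fit the same arithmetic: 85605888 / 8 = 10700736, i.e. 4096-byte logical blocks on 512-byte sectors.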

After a reboot the drives are operating again. But with an entry in the 
SMART log, e.g.:

Error 6 occurred at disk power-on lifetime: 353 hours (14 days + 17 hours)
   When the command that caused the error occurred, the device was active or idle.

   After command completion occurred, registers were:
   ER ST SC SN CL CH DH
   -- -- -- -- -- -- --
   84 51 ef 11 3e 1a e0  Error: ICRC, ABRT 239 sectors at LBA = 0x001a3e11 = 1719825

   Commands leading to the command that caused the error were:
   CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
   -- -- -- -- -- -- -- --  ----------------  --------------------
   25 00 00 00 3e 1a e0 00      04:08:17.774  READ DMA EXT
   25 00 00 00 3c 1a e0 00      04:08:17.764  READ DMA EXT
   25 00 00 00 3a 1a e0 00      04:08:17.753  READ DMA EXT
   25 00 00 00 38 1a e0 00      04:08:17.743  READ DMA EXT
   25 00 00 00 36 1a e0 00      04:08:17.734  READ DMA EXT
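
The ICRC flag in this SMART entry can be compared with the SErr value 0x1380000 from the dmesg output. The following is a minimal sketch that decodes the diagnostic half (bits 16 to 26) of a SATA SError value; the bit names are shortened labels for the SError register bits and decode_serr is a hypothetical helper. For 0x1380000 it reports a 10b/8b decode error, a disparity error, a CRC error and a transport state transition error, which would be consistent with a link-level problem rather than a media error.

```shell
#!/usr/bin/env bash
# Decode the DIAG bits (16..26) of a SATA SError register value.
# The lower 16 (ERR) bits are ignored in this sketch.
# decode_serr is a hypothetical helper for illustration only.
decode_serr() {
    local val=$(( $1 )) i
    local out=()
    # Index i corresponds to bit 16 + i of the register.
    local names=(PhyRdyChg PhyIntErr CommWake 10B8BErr Disparity CRC
                 Handshake LinkSeqErr TransStateErr UnrecogFIS DevExchanged)
    for i in "${!names[@]}"; do
        if (( val & (1 << (16 + i)) )); then
            out+=("${names[i]}")
        fi
    done
    echo "${out[*]}"
}

decode_serr 0x1380000
# prints: 10B8BErr Disparity CRC TransStateErr
```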


Today I have tested using Linux 2.6.21-rc2 with Mikael Pettersson's
patches. In order to make it build I had to disable local-apic. So far
it seems to work better, but doing

dd if=/dev/sda of=/dev/null bs=1M &
dd if=/dev/sdb of=/dev/null bs=1M &
dd if=/dev/sdc of=/dev/null bs=1M &
dd if=/dev/sdd of=/dev/null bs=1M &

and then a couple of times:

for each in /dev/sd[abcd]; do smartctl -d ata -a $each | awk '/194/{print $10}'; done

will trigger the error again:

[52849.930755] pdc_error_intr: port_status 0x00001000 serror 0x00000000
[52849.930880] ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
[52849.930883] ata2.00: (port_status 0x00001000)
[52849.930892] ata2.00: cmd 25/00:00:00:f7:1e/00:02:1b:00:00/e0 tag 0 cdb 0x0 data 262144 in
[52849.930894]          res 50/00:00:ff:f8:1e/00:00:ff:59:c8/e0 Emask 0x4 (timeout)
[52850.241962] ata2: soft resetting port
[52850.397984] ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[52850.424344] pdc_error_intr: port_status 0x00001000 serror 0x00000000
[52850.424639] ata2.00: failed to set xfermode (err_mask=0x104)
[52850.424643] ata2: failed to recover some devices, retrying in 5 secs
[52855.423576] ata2: hard resetting port
[52855.899453] ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[52855.933438] ata2.00: configured for UDMA/133
[52855.933456] ata2: EH complete
[52855.973979] SCSI device sdb: 976773168 512-byte hdwr sectors (500108 MB)
[52856.022739] sdb: Write Protect is off
[52856.022747] sdb: Mode Sense: 00 3a 00 00
[52856.085241] SCSI device sdb: write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[52856.089287] SCSI device sdb: 976773168 512-byte hdwr sectors (500108 MB)
[52856.092552] sdb: Write Protect is off
[52856.092560] sdb: Mode Sense: 00 3a 00 00
[52856.099067] SCSI device sdb: write cache: enabled, read cache: enabled, doesn't support DPO or FUA

although this time the hard reset works, and the port comes back
up and continues reading. This is of course much better, because a RAID
device would not fail. But I still think the reset should not be
necessary?
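
For anyone who wants to repeat the test, the reproduction (four parallel sequential reads plus repeated SMART polling) could be scripted roughly as below. This is only a sketch: the function names and the round count are illustrative assumptions, and the attribute filter matches on the ID column instead of the plain /194/ pattern, so that other lines which merely contain "194" are not printed.

```shell
#!/usr/bin/env bash
# Sketch of the reproduction: parallel sequential reads plus SMART polling.
# All function names here are hypothetical helpers for illustration.

# Print the raw value (10th column) of SMART attribute 194 from
# smartctl attribute-table text on stdin, matching the ID column exactly.
attr194_raw() {
    awk '$1 == 194 { print $10 }'
}

# Start one background sequential read per device given as argument.
start_reads() {
    local dev
    for dev in "$@"; do
        dd if="$dev" of=/dev/null bs=1M &
    done
}

# Poll attribute 194 on every device, repeated <rounds> times.
poll_temps() {
    local rounds=$1 dev i
    shift
    for (( i = 0; i < rounds; i++ )); do
        for dev in "$@"; do
            smartctl -d ata -a "$dev" | attr194_raw
        done
    done
}

# Example invocation (not executed here):
#   start_reads /dev/sd[abcd]; poll_temps 5 /dev/sd[abcd]; wait
```

Running the polling while the reads are in flight is the combination that triggered the error here; running start_reads alone would show whether the SMART queries are a necessary ingredient.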

I wonder if the earlier problems I've seen have been due to my own poking
around with smartctl during heavy load. I'll try to test this some more.

I would be very happy to help debug this issue. Any suggestions on what 
I should try next?

Some background info:

I have three systems with SATA300TX4s:

System 1 (can be used for testing):
Linux 2.6.21-rc2+Mikael_Pettersson
AMD Athlon(tm) XP 2500+ on a Nvidia nForce2 motherboard.
4 harddrives all connected to the TX4 in a normal PCI slot 133MHz
Seagate ST3500630NS (Barracuda 500GB ES) Firmware 3.AEE

System 2 (production system)
Dell PowerEdge 2800
Linux 2.6.19.5
Identical harddrives all connected to TX4 in a PCI-X slot 266MHz.

System 3 (production backup):
Linux 2.6.15
Identical to System 2 except only two disks. These are Barracuda 500GB
(non ES version).

Best regards,

Peter
