Disk stuck in error recovery loop with AHCI

linux-ide.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* Disk stuck in error recovery loop with AHCI
@ 2007-02-21  5:20 Jim Paris
  2007-02-23  7:28 ` Jim Paris
  0 siblings, 1 reply; 4+ messages in thread
From: Jim Paris @ 2007-02-21  5:20 UTC (permalink / raw)
  To: linux-ide

Hi,

For the past month or so I've been trying to track down data
corruption I'm seeing on my server.  It's something like 2-3 bits per
100 gigs of data, but only shows up on my 4-disk RAID5.  RAM is ECC
and fine.  Anyway, as part of my testing I swapped SATA controllers
many times (sata_nv, sata_sil24, sata_sil) then eventually switched
the motherboard/cpu/ram entirely and am now using an Intel chipset
with AHCI.  Turns out the corruption is still there (my current theory
is a bad disk or cables, doing more testing at the moment), but I saw
an error with AHCI that I hadn't seen before with the other
controllers.

Kernel is 2.6.20.  The onboard 6-port controller is:

00:1f.2 SATA controller: Intel Corporation 631xESB/632xESB SATA Storage Controller AHCI (rev 09)
00:1f.2 0106: 8086:2681 (rev 09)

It is set up in "Enhanced AHCI" mode in BIOS, because that's the only
mode in which all the disks are recognized.

After setting up the new machine, I was doing a MD "repair" operation on
the first two disks (RAID1) and the last four disks (RAID5):
   echo repair > /sys/block/md1/md/sync_action
   echo repair > /sys/block/md2/md/sync_action

Eventually I got the following errors on one of the disks in the raid1: 
(full dmesg is attached)

...
Feb 20 01:05:07 psychosis kernel: [11473.137130]          res 40/00:08:10:d7:4e/00:00:07:00:00/40 Emask 0x10 (ATA bus error)
Feb 20 01:05:07 psychosis kernel: [11473.449587] ata3: soft resetting port
Feb 20 01:05:07 psychosis kernel: [11473.952880] ata3: softreset failed, retrying in 5 secs
Feb 20 01:05:12 psychosis kernel: [11478.947007] ata3: hard resetting port
Feb 20 01:05:20 psychosis kernel: [11486.256966] ata3: port is slow to respond, please be patient (Status 0x80)
Feb 20 01:05:43 psychosis kernel: [11509.228781] ata3: hardreset failed, retrying in 5 secs
Feb 20 01:05:48 psychosis kernel: [11514.223105] ata3: hard resetting port
Feb 20 01:05:48 psychosis kernel: [11514.542508] ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Feb 20 01:05:48 psychosis kernel: [11514.544548] ata3.00: configured for UDMA/133
Feb 20 01:05:48 psychosis kernel: [11514.544559] ata3: EH complete
Feb 20 01:05:48 psychosis kernel: [11514.544685] SCSI device sda: 625142448 512-byte hdwr sectors (320073 MB)
Feb 20 01:05:48 psychosis kernel: [11514.544709] sda: Write Protect is off
Feb 20 01:05:48 psychosis kernel: [11514.544743] SCSI device sda: write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Feb 20 01:05:57 psychosis kernel: [11523.816196]          res 40/00:00:10:7d:5f/00:00:07:00:00/40 Emask 0x10 (ATA bus error)
Feb 20 01:05:58 psychosis kernel: [11524.128039] ata3: soft resetting port
Feb 20 01:05:58 psychosis kernel: [11524.631101] ata3: softreset failed, retrying in 5 secs
Feb 20 01:06:03 psychosis kernel: [11529.625457] ata3: hard resetting port
Feb 20 01:06:10 psychosis kernel: [11536.935417] ata3: port is slow to respond, please be patient (Status 0x80)
Feb 20 01:06:33 psychosis kernel: [11559.907230] ata3: hardreset failed, retrying in 5 secs
Feb 20 01:06:39 psychosis kernel: [11564.901852] ata3: hard resetting port
Feb 20 01:06:39 psychosis kernel: [11565.221011] ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Feb 20 01:06:39 psychosis kernel: [11565.223107] ata3.00: configured for UDMA/133
Feb 20 01:06:39 psychosis kernel: [11565.223118] ata3: EH complete
Feb 20 01:06:39 psychosis kernel: [11565.223245] SCSI device sda: 625142448 512-byte hdwr sectors (320073 MB)
Feb 20 01:06:39 psychosis kernel: [11565.223268] sda: Write Protect is off
Feb 20 01:06:39 psychosis kernel: [11565.223301] SCSI device sda: write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Feb 20 01:07:02 psychosis kernel: [11588.192660]          res 40/00:50:10:76:88/00:00:07:00:00/40 Emask 0x10 (ATA bus error)
Feb 20 01:07:02 psychosis kernel: [11588.192788]          res 40/00:50:10:76:88/00:00:07:00:00/40 Emask 0x10 (ATA bus error)
      (repeated 30 times total)
Feb 20 01:07:02 psychosis kernel: [11588.194209]          res 40/00:50:10:76:88/00:00:07:00:00/40 Emask 0x10 (ATA bus error)
Feb 20 01:07:02 psychosis kernel: [11588.504333] ata3: soft resetting port
Feb 20 01:07:03 psychosis kernel: [11589.007184] ata3: softreset failed, retrying in 5 secs
Feb 20 01:07:08 psychosis kernel: [11594.001540] ata3: hard resetting port
Feb 20 01:07:15 psychosis kernel: [11601.311499] ata3: port is slow to respond, please be patient (Status 0x80)
Feb 20 01:07:38 psychosis kernel: [11624.283314] ata3: hardreset failed, retrying in 5 secs
Feb 20 01:07:43 psychosis kernel: [11629.277637] ata3: hard resetting port
Feb 20 01:07:43 psychosis kernel: [11629.597039] ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Feb 20 01:07:43 psychosis kernel: [11629.598746] ata3.00: configured for UDMA/133
Feb 20 01:07:43 psychosis kernel: [11629.598898] ata3: EH complete
...

This kept happening over and over, reducing the speed each time:

Feb 20 01:18:59 psychosis kernel: [12304.433025] ata3.00: limiting speed to UDMA/66
Feb 20 01:20:42 psychosis kernel: [12406.788586] ata3.00: limiting speed to UDMA/44
Feb 20 01:22:06 psychosis kernel: [12490.909166] ata3.00: limiting speed to UDMA/33
Feb 20 01:23:24 psychosis kernel: [12568.798521] ata3.00: limiting speed to UDMA/25
Feb 20 01:24:36 psychosis kernel: [12640.161835] ata3.00: limiting speed to UDMA/16
Feb 20 01:26:05 psychosis kernel: [12729.369247] ata3.00: limiting speed to PIO4
Feb 20 01:27:17 psychosis kernel: [12800.722750] ata3.00: limiting speed to PIO3
Feb 20 01:28:28 psychosis kernel: [12871.570376] ata3.00: limiting speed to PIO2
Feb 20 01:29:39 psychosis kernel: [12942.417995] ata3.00: limiting speed to PIO1
Feb 20 01:58:27 psychosis kernel: [13013.265629] ata3.00: limiting speed to PIO0

At this point I remember seeing on the console an error about how
speed-down was requested, but no lower speed modes were available; not
sure why that didn't get into the log.  But the messages kept coming:

Feb 20 01:58:27 psychosis kernel: [13650.934246]          res 40/00:20:10:ae:40/00:00:0d:00:00/40 Emask 0x4 (timeout)
Feb 20 01:58:27 psychosis kernel: [13651.245577] ata3: soft resetting port
Feb 20 01:58:27 psychosis kernel: [13651.245611] ata3: softreset failed, retrying in 5 secs
Feb 20 01:58:27 psychosis kernel: [13656.239963] ata3: hard resetting port
Feb 20 01:58:27 psychosis kernel: [13663.549923] ata3: port is slow to respond, please be patient (Status 0x80)
Feb 20 01:58:27 psychosis kernel: [13686.521729] ata3: hardreset failed, retrying in 5 secs
Feb 20 01:58:27 psychosis kernel: [13691.516061] ata3: hard resetting port
Feb 20 01:58:27 psychosis kernel: [13691.835461] ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Feb 20 01:58:27 psychosis kernel: [13691.837090] ata3.00: configured for PIO0
Feb 20 01:58:27 psychosis kernel: [13691.837106] ata3: EH complete
Feb 20 01:58:27 psychosis kernel: [13721.785875]          res 40/00:20:10:ae:40/00:00:0d:00:00/40 Emask 0x4 (timeout)
Feb 20 01:58:27 psychosis kernel: [13722.097199] ata3: soft resetting port
Feb 20 01:58:27 psychosis kernel: [13722.097236] ata3: softreset failed, retrying in 5 secs
Feb 20 01:58:27 psychosis kernel: [13727.091584] ata3: hard resetting port
Feb 20 01:58:27 psychosis kernel: [13734.401547] ata3: port is slow to respond, please be patient (Status 0x80)
Feb 20 01:58:27 psychosis kernel: [13757.373357] ata3: hardreset failed, retrying in 5 secs
Feb 20 01:58:27 psychosis kernel: [13762.367683] ata3: hard resetting port
Feb 20 01:58:27 psychosis kernel: [13762.687085] ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Feb 20 01:58:27 psychosis kernel: [13762.689015] ata3.00: configured for PIO0
Feb 20 01:58:27 psychosis kernel: [13762.689027] ata3: EH complete

etc etc.  Since this was one of the two disks in the RAID1 root
filesystem, all processes just froze while it tried over and over to
access the disk and went through the error handling sequence.  I guess
since the EH thought it completed successfully, it was never reported
to the MD layer as being a bad disk.

At this point I realized things were stuck.  I plugged a keyboard in,
then removed all 6 disks and replugged them (not knowing which one was
ata3).  The last one I tried was finally ata3, and I left that one out
so it would finally fail the disk:

...
Feb 20 01:58:27 psychosis kernel: [14665.593082] ata3: SATA link down (SStatus 0 SControl 300)
Feb 20 01:58:27 psychosis kernel: [14665.593089] ata3.00: disabled
Feb 20 01:58:27 psychosis kernel: [14666.096260] sd 2:0:0:0: SCSI error: return code = 0x08000002
Feb 20 01:58:27 psychosis kernel: [14666.096264] sda: Current [descriptor]: sense key: Aborted Command
Feb 20 01:58:27 psychosis kernel: [14666.096267]     Additional sense: No additional sense information
Feb 20 01:58:27 psychosis kernel: [14666.096270] Descriptor sense data with sense descriptors (in hex):
Feb 20 01:58:27 psychosis kernel: [14666.096272]         72 0b 00 00 00 00 00 0c 00 0a 80 00 00 00 00 00 
Feb 20 01:58:27 psychosis kernel: [14666.096278]         00 00 00 00 
Feb 20 01:58:27 psychosis kernel: [14666.096282] end_request: I/O error, dev sda, sector 222342032
Feb 20 01:58:27 psychosis kernel: [14666.096731] sd 2:0:0:0: SCSI error: return code = 0x00010000
Feb 20 01:58:27 psychosis kernel: [14666.096734] end_request: I/O error, dev sda, sector 222342800
Feb 20 01:58:27 psychosis kernel: [14666.097039] md: super_written gets error=-5, uptodate=0
Feb 20 01:58:27 psychosis kernel: [14666.097044] ^IOperation continuing on 1 devices
..

Now that md failed the disk, the system came back to life and was
usable again.  I plugged ata3 back in, and it came up fine:

Feb 20 02:02:04 psychosis kernel: [14883.936792] ata3: waiting for device to spin up (8 secs)
Feb 20 02:02:12 psychosis kernel: [14892.125029] ata3: soft resetting port
Feb 20 02:02:12 psychosis kernel: [14892.125131] ata3: softreset failed, retrying in 5 secs
Feb 20 02:02:17 psychosis kernel: [14897.119415] ata3: hard resetting port
Feb 20 02:02:25 psychosis kernel: [14904.832598] ata3: port is slow to respond, please be patient (Status 0x80)
Feb 20 02:02:48 psychosis kernel: [14927.804508] ata3: hardreset failed, retrying in 5 secs
Feb 20 02:02:53 psychosis kernel: [14932.798737] ata3: hard resetting port
Feb 20 02:02:54 psychosis kernel: [14933.521359] ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Feb 20 02:02:54 psychosis kernel: [14933.522135] ata3.00: ATA-7, max UDMA/133, 625142448 sectors: LBA48 NCQ (depth 31/32)
Feb 20 02:02:54 psychosis kernel: [14933.523485] ata3.00: configured for UDMA/133
Feb 20 02:02:54 psychosis kernel: [14933.523494] ata3: EH complete
Feb 20 02:02:54 psychosis kernel: [14933.523577] scsi 2:0:0:0: Direct-Access     ATA      ST3320620AS      3.AA PQ: 0 ANSI: 5
Feb 20 02:02:54 psychosis kernel: [14933.523661] SCSI device sdh: 625142448 512-byte hdwr sectors (320073 MB)
Feb 20 02:02:54 psychosis kernel: [14933.523672] sdh: Write Protect is off
Feb 20 02:02:54 psychosis kernel: [14933.523687] SCSI device sdh: write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Feb 20 02:02:54 psychosis kernel: [14933.523741] SCSI device sdh: 625142448 512-byte hdwr sectors (320073 MB)
Feb 20 02:02:54 psychosis kernel: [14933.523750] sdh: Write Protect is off
Feb 20 02:02:54 psychosis kernel: [14933.523764] SCSI device sdh: write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Feb 20 02:02:54 psychosis kernel: [14933.523768]  sdh: sdh1 sdh2
Feb 20 02:02:54 psychosis kernel: [14933.539428] sd 2:0:0:0: Attached scsi disk sdh

Because the error at [11588.19xx] was repeated 30 times, I suspected
NCQ.  I set the queue_depth on all 6 disks down to 1, and haven't seen
the same problem since, although once I did get:

Feb 20 22:01:01 psychosis kernel: [86680.771935]          res 50/00:00:57:58:70/00:00:00:00:00/e2 Emask 0x10 (ATA bus error)
Feb 20 22:01:01 psychosis kernel: [86681.083496] ata5: soft resetting port
Feb 20 22:01:01 psychosis kernel: [86681.083540] ata5: softreset failed, retrying in 5 secs
Feb 20 22:01:06 psychosis kernel: [86686.077882] ata5: hard resetting port
Feb 20 22:01:13 psychosis kernel: [86693.387842] ata5: port is slow to respond, please be patient (Status 0x80)
Feb 20 22:01:36 psychosis kernel: [86716.359653] ata5: hardreset failed, retrying in 5 secs
Feb 20 22:01:41 psychosis kernel: [86721.353980] ata5: hard resetting port
Feb 20 22:01:41 psychosis kernel: [86721.673381] ata5: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Feb 20 22:01:41 psychosis kernel: [86721.675047] ata5.00: configured for UDMA/133
Feb 20 22:01:41 psychosis kernel: [86721.675057] ata5: EH complete
Feb 20 22:01:42 psychosis kernel: [86721.687806] SCSI device sdc: 625142448 512-byte hdwr sectors (320073 MB)
Feb 20 22:01:42 psychosis kernel: [86721.690191] sdc: Write Protect is off
Feb 20 22:01:42 psychosis kernel: [86721.694914] SCSI device sdc: write cache: enabled, read cache: enabled, doesn't support DPO or FUA

I know it's been mentioned before that some disks have broken NCQ, but
it doesn't look like there's much in the kernel NCQ blacklist at the
moment, and the strange part is that I've never seen this problem with
the same disks and any of the other controllers I've tried that
supported NCQ (sata_nv, sata_sil24).  So I guess my concern here is
that this is actually related to AHCI in some way?

BTW - I just realized that dmesg has some error messages that are
missing from /var/log/messages, like the actual "exception Emask.."
lines.  Not sure what's up with that, but I'll give dmesg output too.

Full logs at
  http://jim.sh/~jim/ide/
Note that there's an identical disk that I sometimes had plugged into
the sata_sil24 controller; it wasn't being used at the time.

-jim

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: Disk stuck in error recovery loop with AHCI
  2007-02-21  5:20 Disk stuck in error recovery loop with AHCI Jim Paris
@ 2007-02-23  7:28 ` Jim Paris
  2007-02-24  4:22   ` Tejun Heo
  0 siblings, 1 reply; 4+ messages in thread
From: Jim Paris @ 2007-02-23  7:28 UTC (permalink / raw)
  To: linux-ide

I wrote:
> I've been trying to track down data corruption I'm seeing on my
> server.

Turns out it was a bad disk.  Not a media error, but maybe bad RAM or
logic on the drive.

> I saw an error with AHCI that I hadn't seen before with the other
> controllers.
...
> Because the error at [11588.19xx] was repeated 30 times, I suspected
> NCQ.  I set the queue_depth on all 6 disks down to 1, and haven't seen
> the same problem since

It's not related to NCQ.  I still saw the problem with it disabled,
and it finally went away when I enabled spread-spectrum clocking in
BIOS, even once I turned NCQ back on.  So this report is bogus.

Still, it seems that some improvements could be made to the EH when
this sort of thing happens.  For example, after "speed down requested
but no transfer mode left" a few times in a row, maybe it would make
sense to just fail the disk and give up.  That would have allowed
higher layers like MD to recover.

-jim

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: Disk stuck in error recovery loop with AHCI
  2007-02-23  7:28 ` Jim Paris
@ 2007-02-24  4:22   ` Tejun Heo
  2007-02-24  8:00     ` Jim Paris
  0 siblings, 1 reply; 4+ messages in thread
From: Tejun Heo @ 2007-02-24  4:22 UTC (permalink / raw)
  To: Jim Paris; +Cc: linux-ide

Jim Paris wrote:
> Still, it seems that some improvements could be made to the EH when
> this sort of thing happens.  For example, after "speed down requested
> but no transfer mode left" a few times in a row, maybe it would make
> sense to just fail the disk and give up.  That would have allowed
> higher layers like MD to recover.

That's SD retrying sector-by-sector a large request as libata doesn't
report proper failed sector info.  Please try 2.6.20.  md will fail the
drive after a few errors.

Patches to improve EH behavior further went into 2.6.21 devel tree and
some are pending.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: Disk stuck in error recovery loop with AHCI
  2007-02-24  4:22   ` Tejun Heo
@ 2007-02-24  8:00     ` Jim Paris
  0 siblings, 0 replies; 4+ messages in thread
From: Jim Paris @ 2007-02-24  8:00 UTC (permalink / raw)
  To: Tejun Heo; +Cc: linux-ide

Tejun Heo wrote:
> Jim Paris wrote:
> > Still, it seems that some improvements could be made to the EH when
> > this sort of thing happens.  For example, after "speed down requested
> > but no transfer mode left" a few times in a row, maybe it would make
> > sense to just fail the disk and give up.  That would have allowed
> > higher layers like MD to recover.
> 
> That's SD retrying sector-by-sector a large request as libata doesn't
> report proper failed sector info.  Please try 2.6.20.  md will fail the
> drive after a few errors.

Hi Tejun,

It was 2.6.20.  The system was stuck repeating EH for about half an
hour before I physically pulled the disk to get MD to fail the device
and to bring the system back to a responsive state.

> Patches to improve EH behavior further went into 2.6.21 devel tree and
> some are pending.

Great, I've hopefully eliminated the cause of my errors but I'll give
that a try if I run into this again.

-jim

^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2007-02-24  8:00 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2007-02-21  5:20 Disk stuck in error recovery loop with AHCI Jim Paris
2007-02-23  7:28 ` Jim Paris
2007-02-24  4:22   ` Tejun Heo
2007-02-24  8:00     ` Jim Paris

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).