ahci timeouts, retries etc.


* ahci timeouts, retries etc.
@ 2009-10-14 16:51 Tim Small
  2009-10-16  0:52 ` Robert Hancock
From: Tim Small @ 2009-10-14 16:51 UTC
  To: linux-ide

Hi,

I have a Tyan S5375 (BIOS v1.03, ICH9) which periodically (approx. twice a 
week) logs timeouts like this:

[6475755.652262] ata2.00: exception Emask 0x0 SAct 0x3832 SErr 0x0 action 0x6 frozen
[6475755.652262] ata2.00: cmd 60/18:08:2a:90:ee/00:00:12:00:00/40 tag 1 ncq 12288 in
[6475755.652262]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[6475755.652262] ata2.00: status: { DRDY }
[6475755.652262] ata2.00: cmd 61/60:20:6a:8c:ee/00:00:12:00:00/40 tag 4 ncq 49152 out
[6475755.652262]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
...
[6475755.652262] ata2.00: cmd 60/10:68:6a:65:ee/00:00:12:00:00/40 tag 13 ncq 8192 in
[6475755.652262]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[6475755.652262] ata2.00: status: { DRDY }
[6475755.652262] ata2: hard resetting link
[6475756.009863] ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
[6475756.040731] ata2.00: configured for UDMA/133
[6475756.040731] sd 1:0:0:0: [sdb] 1953525168 512-byte hardware sectors (1000205 MB)
[6475756.040731] sd 1:0:0:0: [sdb] Write Protect is off
[6475756.040731] sd 1:0:0:0: [sdb] Mode Sense: 00 3a 00 00
[6475756.040731] sd 1:0:0:0: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
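
For anyone reading the log above, the exception and cmd lines can be decoded by hand. The sketch below is not kernel code. It assumes libata's usual message layout: SAct is a bitmask with one bit per outstanding NCQ tag, and the cmd string is command/feature:nsect:lbal:lbam:lbah, then the five high-order (HOB) bytes in the same order, then the device byte. For the FPDMA QUEUED commands (0x60 read, 0x61 write) the sector count travels in the feature bytes and the tag in bits 7:3 of the count field. The function names and the 512-byte sector size default are illustrative assumptions.

```python
def sact_tags(sact):
    """List the NCQ tags whose bits are set in an SAct bitmask."""
    return [tag for tag in range(32) if sact & (1 << tag)]


def decode_fpdma(taskfile, sector_size=512):
    """Decode a libata 'cmd' string for READ/WRITE FPDMA QUEUED."""
    cmd, low, hob, _device = taskfile.split("/")
    command = int(cmd, 16)
    if command not in (0x60, 0x61):
        raise ValueError("only FPDMA QUEUED (0x60/0x61) is handled here")
    feature, nsect, lbal, lbam, lbah = (int(b, 16) for b in low.split(":"))
    hob_feature, _hob_nsect, hob_lbal, hob_lbam, hob_lbah = (
        int(b, 16) for b in hob.split(":")
    )
    # For FPDMA commands the sector count is carried in the feature bytes.
    sectors = (hob_feature << 8) | feature
    # 48-bit LBA: HOB bytes are the upper three, current bytes the lower three.
    lba = ((hob_lbah << 40) | (hob_lbam << 32) | (hob_lbal << 24)
           | (lbah << 16) | (lbam << 8) | lbal)
    return {
        "direction": "in" if command == 0x60 else "out",
        "tag": nsect >> 3,  # tag sits in bits 7:3 of the count field
        "sectors": sectors,
        "bytes": sectors * sector_size,
        "lba": lba,
    }


print(sact_tags(0x3832))
print(decode_fpdma("60/18:08:2a:90:ee/00:00:12:00:00/40"))
```

With the values from the log, SAct 0x3832 gives tags 1, 4, 5, 11, 12 and 13, so six commands were outstanding when the port was frozen. The first cmd line decodes to tag 1 and 24 sectors (12288 bytes), which agrees with the "tag 1 ncq 12288 in" text printed next to it.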


A look at the libata wiki suggests interrupt delivery problems as a 
possible explanation, but is this likely to be the case here?  Since 
several requests were queued for the drive, I'm guessing that several 
interrupts must have been dropped by the time this error occurred?

I'm assuming that the kernel will retry these requests after the SATA 
link has been reset?

The errors appear to be randomly distributed over the four drives on 
this machine - all are Seagate ST31000340NS with either firmware version 
SN05 or SN16...
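
One way to back up the "randomly distributed" impression is to tally the exception lines per ATA device over a longer stretch of kernel log. The sketch below is only an illustration: the regular expression is written against the message format shown in the log above, and the function name is hypothetical. It reads any iterable of log lines, so it works equally on saved dmesg output or a syslog file.

```python
import re
from collections import Counter

# Matches e.g. "ata2.00: exception Emask 0x0 SAct 0x3832 ..."
EXCEPTION_RE = re.compile(r"\b(ata\d+(?:\.\d+)?): exception Emask")


def count_exceptions(lines):
    """Count libata exception reports per device (ata2.00, ata3.00, ...)."""
    counts = Counter()
    for line in lines:
        match = EXCEPTION_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    import sys
    # Usage: python count_exceptions.py < kern.log
    for device, n in sorted(count_exceptions(sys.stdin).items()):
        print(device, n)
```

Each timeout event produces one "exception" line followed by one cmd/res pair per outstanding command, so counting only the exception lines gives events rather than commands.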

Cheers,

Tim.


-- 
South East Open Source Solutions Limited
Registered in England and Wales with company number 06134732.  
Registered Office: 2 Powell Gardens, Redhill, Surrey, RH1 1TQ
VAT number: 900 6633 53  http://seoss.co.uk/ +44-(0)1273-808309



* Re: ahci timeouts, retries etc.
  2009-10-14 16:51 ahci timeouts, retries etc Tim Small
@ 2009-10-16  0:52 ` Robert Hancock
  2009-10-23 11:26   ` Tim Small
From: Robert Hancock @ 2009-10-16  0:52 UTC
  To: Tim Small; +Cc: linux-ide

On 10/14/2009 10:51 AM, Tim Small wrote:
> Hi,
>
> I have a Tyan S5375 (BIOS v1.03) ICH9 which periodically (approx twice a
> week) logs timeouts like this:
>
> [6475755.652262] ata2.00: exception Emask 0x0 SAct 0x3832 SErr 0x0
> action 0x6 frozen
> [6475755.652262] ata2.00: cmd 60/18:08:2a:90:ee/00:00:12:00:00/40 tag 1
> ncq 12288 in
> [6475755.652262] res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4
> (timeout)
> [6475755.652262] ata2.00: status: { DRDY }
> [6475755.652262] ata2.00: cmd 61/60:20:6a:8c:ee/00:00:12:00:00/40 tag 4
> ncq 49152 out
> [6475755.652262] res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4
> (timeout)
> ...
> [6475755.652262] ata2.00: cmd 60/10:68:6a:65:ee/00:00:12:00:00/40 tag 13
> ncq 8192 in
> [6475755.652262] res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4
> (timeout)
> [6475755.652262] ata2.00: status: { DRDY }
> [6475755.652262] ata2: hard resetting link
> [6475756.009863] ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
> [6475756.040731] ata2.00: configured for UDMA/133
> [6475756.040731] sd 1:0:0:0: [sdb] 1953525168 512-byte hardware sectors
> (1000205 MB)
> [6475756.040731] sd 1:0:0:0: [sdb] Write Protect is off
> [6475756.040731] sd 1:0:0:0: [sdb] Mode Sense: 00 3a 00 00
> [6475756.040731] sd 1:0:0:0: [sdb] Write cache: enabled, read cache:
> enabled, doesn't support DPO or FUA
>
>
> A look at the libata wiki suggests interrupt delivery problems as a
> possible explanation, but is this likely to be the case here? I'm
> guessing that multiple interrupts must have been dropped by the time
> this error has occurred, as multiple requests are queued for the drive?

Interrupt delivery doesn't seem too likely here. It normally either 
works or it doesn't; it doesn't randomly fail once in a while.

>
> I'm assuming that the kernel will retry these requests after the sata
> link has been reset?

Yes.

>
> The errors appear to be randomly distributed over the four drives on
> this machine - all are Seagate ST31000340NS with either firmware version
> SN05 or SN16...

This kind of problem often seems to be due to signal integrity or power 
problems. For whatever reason, an insufficient power supply (or 
something like overloading one power cable) tends to trigger SATA 
errors as an early symptom.
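
Since signal integrity is one of the suspects, the "SATA link up" line in the quoted log is worth decoding too. The sketch below assumes the field layout from the SATA/AHCI specifications, where both SStatus and SControl carry DET in bits 3:0, SPD in bits 7:4 and IPM in bits 11:8, and that libata prints both values in hex without a 0x prefix. The helper name is made up for illustration.

```python
def decode_sreg(value):
    """Split an SStatus or SControl value into its DET, SPD and IPM fields."""
    return {
        "det": value & 0xF,          # device detection
        "spd": (value >> 4) & 0xF,   # negotiated speed / speed limit
        "ipm": (value >> 8) & 0xF,   # interface power management
    }


# Values as printed in the log: "(SStatus 113 SControl 310)"
sstatus = decode_sreg(int("113", 16))
scontrol = decode_sreg(int("310", 16))
print("SStatus ", sstatus)
print("SControl", scontrol)
```

For SStatus 113 this gives DET=3 (device present and communication established), SPD=1 (1.5 Gbps) and IPM=1 (interface active). For SControl 310 it gives DET=0, SPD=1 and IPM=3. In SControl, SPD=1 asks for negotiation to be limited to 1.5 Gbps and IPM=3 disables the Partial and Slumber power states. The log excerpt does not show what set that speed limit, so it is only a hint about where to look next.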


* Re: ahci timeouts, retries etc.
  2009-10-16  0:52 ` Robert Hancock
@ 2009-10-23 11:26   ` Tim Small
From: Tim Small @ 2009-10-23 11:26 UTC
  To: Robert Hancock; +Cc: linux-ide

Robert Hancock wrote:
>>
>> I'm assuming that the kernel will retry these requests after the sata
>> link has been reset?
>
> Yes.
>
>>
>> The errors appear to be randomly distributed over the four drives on
>> this machine - all are Seagate ST31000340NS with either firmware version
>> SN05 or SN16...
>
> This kind of problem often seems to be due to signal integrity or 
> power problems. For whatever reason, an insufficient power supply (or 
> something like overloading one power cable) can tend to trigger SATA 
> errors as an early symptom..

Thanks for the reply, Robert.  The power (and SATA signal) delivery to 
these drives is via a hot-swap backplane which is built into the 
chassis.  I had considered some sort of hardware fault here, and that 
would seem possible, but I don't really have any way to check, as I 
don't have access to another one of these machines in order to swap 
out parts etc.  IPMI info looks OK (although I realise this may not 
catch transient power problems at the drives etc.).

The timeouts appear to happen about 4 times per month.  In the absence 
of any other easy strategies, I've disabled SMART data collection on 
this machine, on the off-chance that it makes a difference.

Cheers,

Tim.
