* ahci timeouts, retries etc.
From: Tim Small @ 2009-10-14 16:51 UTC
To: linux-ide
Hi,
I have a Tyan S5375 (BIOS v1.03) ICH9 which periodically (approx twice a
week) logs timeouts like this:
[6475755.652262] ata2.00: exception Emask 0x0 SAct 0x3832 SErr 0x0 action 0x6 frozen
[6475755.652262] ata2.00: cmd 60/18:08:2a:90:ee/00:00:12:00:00/40 tag 1 ncq 12288 in
[6475755.652262] res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[6475755.652262] ata2.00: status: { DRDY }
[6475755.652262] ata2.00: cmd 61/60:20:6a:8c:ee/00:00:12:00:00/40 tag 4 ncq 49152 out
[6475755.652262] res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
...
[6475755.652262] ata2.00: cmd 60/10:68:6a:65:ee/00:00:12:00:00/40 tag 13 ncq 8192 in
[6475755.652262] res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[6475755.652262] ata2.00: status: { DRDY }
[6475755.652262] ata2: hard resetting link
[6475756.009863] ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
[6475756.040731] ata2.00: configured for UDMA/133
[6475756.040731] sd 1:0:0:0: [sdb] 1953525168 512-byte hardware sectors (1000205 MB)
[6475756.040731] sd 1:0:0:0: [sdb] Write Protect is off
[6475756.040731] sd 1:0:0:0: [sdb] Mode Sense: 00 3a 00 00
[6475756.040731] sd 1:0:0:0: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
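For readers decoding the log above: the slash-separated bytes after `cmd` are the ATA taskfile registers. The Python sketch below unpacks them for the NCQ commands shown. It assumes the field order libata uses in its error report (command, then feature:nsect:lbal:lbam:lbah, then the matching high-order bytes, then device). The function name `decode_ncq_cmd` is illustrative and not part of the original post; the 512-byte sector size is taken from the `sd` line in the log.

```python
# Sketch: decode a libata "cmd" taskfile dump for an NCQ read/write.
# Assumed field order (from libata's error report format):
#   CMD/FEAT:NSECT:LBAL:LBAM:LBAH/HOB_FEAT:HOB_NSECT:HOB_LBAL:HOB_LBAM:HOB_LBAH/DEV
NCQ_OPS = {0x60: "read", 0x61: "write"}  # READ / WRITE FPDMA QUEUED


def decode_ncq_cmd(dump, sector_size=512):
    """Return op, tag, starting LBA and transfer size for one dump string."""
    cmd, low, high, _dev = [
        [int(b, 16) for b in group.split(":")] for group in dump.split("/")
    ]
    feat, nsect, lbal, lbam, lbah = low
    hob_feat, _hob_nsect, hob_lbal, hob_lbam, hob_lbah = high
    # NCQ commands carry the sector count in the feature registers
    # (a count of 0 means 65536) and the tag in bits 7:3 of the count register.
    sectors = ((hob_feat << 8) | feat) or 65536
    lba = 0
    for byte in (hob_lbah, hob_lbam, hob_lbal, lbah, lbam, lbal):
        lba = (lba << 8) | byte
    return {
        "op": NCQ_OPS.get(cmd[0], "unknown"),
        "tag": nsect >> 3,
        "lba": lba,
        "bytes": sectors * sector_size,
    }


# First failed command from the log above.
print(decode_ncq_cmd("60/18:08:2a:90:ee/00:00:12:00:00/40"))
```

For the first command this gives a read of 12288 bytes with tag 1, which agrees with the `tag 1 ncq 12288 in` suffix that libata prints on the same line.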
A look at the libata wiki suggests interrupt delivery problems as a
possible explanation, but is this likely to be the case here? I'm
guessing that multiple interrupts must have been dropped by the time
this error has occurred, as multiple requests are queued for the drive?
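The number of requests that were queued can be read from the `SAct` value in the exception line: with NCQ, each set bit corresponds to one outstanding command tag. A minimal sketch, assuming the usual 32-tag limit (the helper name `active_tags` is illustrative):

```python
# Sketch: list the NCQ tags marked outstanding in an SActive (SAct) bitmask.
def active_tags(sact):
    """Return the tag numbers whose bits are set in the 32-bit SAct value."""
    return [tag for tag in range(32) if sact & (1 << tag)]


tags = active_tags(0x3832)  # value from the "exception" line in the log
print(tags)       # [1, 4, 5, 11, 12, 13]
print(len(tags))  # 6 commands were outstanding when the port was frozen
```

Tags 1, 4 and 13 are the ones visible in the excerpt; the others fall in the part elided by `...`.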
I'm assuming that the kernel will retry these requests after the SATA
link has been reset?
The errors appear to be randomly distributed over the four drives on
this machine - all are Seagate ST31000340NS with either firmware version
SN05 or SN16...
Cheers,
Tim.
--
South East Open Source Solutions Limited
Registered in England and Wales with company number 06134732.
Registered Office: 2 Powell Gardens, Redhill, Surrey, RH1 1TQ
VAT number: 900 6633 53 http://seoss.co.uk/ +44-(0)1273-808309
* Re: ahci timeouts, retries etc.
From: Robert Hancock @ 2009-10-16 0:52 UTC
To: Tim Small; +Cc: linux-ide
On 10/14/2009 10:51 AM, Tim Small wrote:
> Hi,
>
> I have a Tyan S5375 (BIOS v1.03) ICH9 which periodically (approx twice a
> week) logs timeouts like this:
>
> [6475755.652262] ata2.00: exception Emask 0x0 SAct 0x3832 SErr 0x0 action 0x6 frozen
> [6475755.652262] ata2.00: cmd 60/18:08:2a:90:ee/00:00:12:00:00/40 tag 1 ncq 12288 in
> [6475755.652262] res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
> [6475755.652262] ata2.00: status: { DRDY }
> [6475755.652262] ata2.00: cmd 61/60:20:6a:8c:ee/00:00:12:00:00/40 tag 4 ncq 49152 out
> [6475755.652262] res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
> ...
> [6475755.652262] ata2.00: cmd 60/10:68:6a:65:ee/00:00:12:00:00/40 tag 13 ncq 8192 in
> [6475755.652262] res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
> [6475755.652262] ata2.00: status: { DRDY }
> [6475755.652262] ata2: hard resetting link
> [6475756.009863] ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
> [6475756.040731] ata2.00: configured for UDMA/133
> [6475756.040731] sd 1:0:0:0: [sdb] 1953525168 512-byte hardware sectors (1000205 MB)
> [6475756.040731] sd 1:0:0:0: [sdb] Write Protect is off
> [6475756.040731] sd 1:0:0:0: [sdb] Mode Sense: 00 3a 00 00
> [6475756.040731] sd 1:0:0:0: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
>
>
> A look at the libata wiki suggests interrupt delivery problems as a
> possible explanation, but is this likely to be the case here? I'm
> guessing that multiple interrupts must have been dropped by the time
> this error has occurred, as multiple requests are queued for the drive?
An interrupt delivery problem doesn't seem too likely here: interrupt
delivery normally either works or it doesn't, rather than failing at
random once in a while.
>
> I'm assuming that the kernel will retry these requests after the sata
> link has been reset?
Yes.
>
> The errors appear to be randomly distributed over the four drives on
> this machine - all are Seagate ST31000340NS with either firmware version
> SN05 or SN16...
This kind of problem often seems to be due to signal integrity or power
problems. For whatever reason, an insufficient power supply (or
something like overloading one power cable) tends to trigger SATA
errors as an early symptom.
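On the link side, the `SStatus 113 SControl 310` values printed after the hard reset in the quoted log can be split into their fields to see what the link negotiated and what the host is requesting. This is a sketch under the assumption that the registers follow the layout in the SATA specification (DET in bits 3:0, SPD in bits 7:4, IPM in bits 11:8); the helper name `decode_scr` is illustrative.

```python
# Sketch: split a SATA status/control register value into its low three fields.
# Layout assumed from the SATA specification:
#   DET bits 3:0, SPD bits 7:4, IPM bits 11:8.
def decode_scr(value):
    """Return the DET, SPD and IPM fields of an SStatus or SControl value."""
    return {
        "det": value & 0xF,
        "spd": (value >> 4) & 0xF,
        "ipm": (value >> 8) & 0xF,
    }


# libata prints these registers in hex without a 0x prefix.
sstatus = decode_scr(int("113", 16))
scontrol = decode_scr(int("310", 16))
print(sstatus)   # det=3 (device present, link established), spd=1 (1.5 Gbps), ipm=1 (active)
print(scontrol)  # det=0, spd=1 (speed limited to 1.5 Gbps), ipm=3 (partial/slumber disabled)
```

If this reading is right, the SControl speed field of 1 means the host side is restricting the link to 1.5 Gbps. The excerpt alone does not show why that limit is in place.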
* Re: ahci timeouts, retries etc.
From: Tim Small @ 2009-10-23 11:26 UTC
To: Robert Hancock; +Cc: linux-ide
Robert Hancock wrote:
>>
>> I'm assuming that the kernel will retry these requests after the sata
>> link has been reset?
>
> Yes.
>
>>
>> The errors appear to be randomly distributed over the four drives on
>> this machine - all are Seagate ST31000340NS with either firmware version
>> SN05 or SN16...
>
> This kind of problem often seems to be due to signal integrity or
> power problems. For whatever reason, an insufficient power supply (or
> something like overloading one power cable) can tend to trigger SATA
> errors as an early symptom..
Thanks for the reply, Robert. The power (and SATA signal) delivery to
these drives is via a hot-swap backplane which is built into the
chassis. I had considered some sort of hardware fault here, and that
seems possible, but I have no real way to check, as I don't have access
to another one of these machines to swap parts with.
The IPMI info looks OK (although I realise this may not catch transient
power problems at the drives).
The timeouts appear to happen about 4 times per month. In the absence
of any other easy strategies, I've disabled SMART data collection on
this machine, on the off-chance that it makes a difference.
Cheers,
Tim.