* SATA errors
@ 2009-08-09 10:36 Miah Gregory
2009-08-09 17:24 ` Robert Hancock
2009-08-09 19:57 ` Mikael Pettersson
0 siblings, 2 replies; 5+ messages in thread
From: Miah Gregory @ 2009-08-09 10:36 UTC (permalink / raw)
To: linux-ide
Hi,
One of my servers has started to log slightly odd errors following one
of the software RAID arrays having been degraded due to an error on sdb.
ata4.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
ata4.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
res 40/00:00:09:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
ata4.00: status: { DRDY }
ata4: hard resetting link
ata4: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata4.00: configured for UDMA/133
ata4: EH complete
sd 3:0:0:0: [sdb] 976773168 512-byte hardware sectors: (500 GB/465 GiB)
end_request: I/O error, dev sdb, sector 976767834
md: super_written gets error=-5, uptodate=0
raid1: Disk failure on sdb3, disabling device.
raid1: Operation continuing on 1 devices.
sd 3:0:0:0: [sdb] Write Protect is off
sd 3:0:0:0: [sdb] Mode Sense: 00 3a 00 00
sd 3:0:0:0: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
RAID1 conf printout:
--- wd:1 rd:2
disk 0, wo:0, o:1, dev:sda3
disk 1, wo:1, o:0, dev:sdb3
RAID1 conf printout:
--- wd:1 rd:2
disk 0, wo:0, o:1, dev:sda3
I tried running various SMART tests and have run badblocks in R/W mode
across the whole surface of sdb, but did not find any obvious cause.
Do the following error logs, all of which have been seen since the
above, point at anything specific?
-snip-
ata4.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
ata4.00: cmd b0/d5:01:09:4f:c2/00:00:00:00:00/00 tag 0 pio 512 in
res 40/00:00:06:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
ata4.00: status: { DRDY }
ata4: hard resetting link
ata4: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata4.00: configured for UDMA/133
ata4: EH complete
sd 3:0:0:0: [sdb] 976773168 512-byte hardware sectors: (500 GB/465 GiB)
sd 3:0:0:0: [sdb] Write Protect is off
sd 3:0:0:0: [sdb] Mode Sense: 00 3a 00 00
sd 3:0:0:0: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
-snip-
-snip-
ata4.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
ata4.00: cmd b0/d5:01:09:4f:c2/00:00:00:00:00/00 tag 0 pio 512 in
res 40/00:00:06:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
ata4.00: status: { DRDY }
ata4: hard resetting link
ata4: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata4.00: configured for UDMA/133
ata4: EH complete
sd 3:0:0:0: [sdb] 976773168 512-byte hardware sectors: (500 GB/465 GiB)
sd 3:0:0:0: [sdb] Write Protect is off
sd 3:0:0:0: [sdb] Mode Sense: 00 3a 00 00
sd 3:0:0:0: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
-snip-
-snip-
ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
ata3.00: cmd b0/d5:01:09:4f:c2/00:00:00:00:00/00 tag 0 pio 512 in
res 40/00:00:06:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
ata3.00: status: { DRDY }
ata3: hard resetting link
ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata3.00: configured for UDMA/133
sd 2:0:0:0: [sda] 976773168 512-byte hardware sectors: (500 GB/465 GiB)
sd 2:0:0:0: [sda] Write Protect is off
sd 2:0:0:0: [sda] Mode Sense: 00 3a 00 00
sd 2:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
-snip-
-snip-
ata5.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
ata5.00: cmd b0/d5:01:09:4f:c2/00:00:00:00:00/00 tag 0 pio 512 in
res 40/00:00:06:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
ata5.00: status: { DRDY }
ata5: hard resetting link
ata5: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata5.00: configured for UDMA/133
sd 4:0:0:0: [sdc] 976773168 512-byte hardware sectors: (500 GB/465 GiB)
sd 4:0:0:0: [sdc] Write Protect is off
sd 4:0:0:0: [sdc] Mode Sense: 00 3a 00 00
sd 4:0:0:0: [sdc] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
-snip-
All of these errors might be related to SMART self-checking, which
I've now set to run more regularly, although I wouldn't really expect a
reset to be required to get the disk to respond during one.
I'm running a 2.6.28.2 kernel in a machine with 4 WDC WD5000AACS-0 500GB
disks.
I'm unable to check how the disks are physically wired at the moment,
however they're probably all on the Promise controller:
00:1f.2 IDE interface: Intel Corporation 82801EB (ICH5) SATA Controller (rev 02)
01:04.0 Mass storage controller: Promise Technology, Inc. PDC40718 (SATA 300 TX4) (rev 02)
Any assistance gratefully received - please let me know if further
information is needed.
--
Miah Gregory
* Re: SATA errors
2009-08-09 10:36 SATA errors Miah Gregory
@ 2009-08-09 17:24 ` Robert Hancock
2009-08-09 19:57 ` Mikael Pettersson
1 sibling, 0 replies; 5+ messages in thread
From: Robert Hancock @ 2009-08-09 17:24 UTC (permalink / raw)
To: Miah Gregory; +Cc: linux-ide
On 08/09/2009 04:36 AM, Miah Gregory wrote:
> Hi,
>
> One of my servers has started to log slightly odd errors following one
> of the software RAID arrays having been degraded due to an error on sdb.
>
> ata4.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
> ata4.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
> res 40/00:00:09:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
> ata4.00: status: { DRDY }
> ata4: hard resetting link
> ata4: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
> ata4.00: configured for UDMA/133
> ata4: EH complete
> sd 3:0:0:0: [sdb] 976773168 512-byte hardware sectors: (500 GB/465 GiB)
> end_request: I/O error, dev sdb, sector 976767834
This is a cache flush request (command 0xea) that's timing out; the
others are SMART requests (command 0xb0). Random timeouts are rather
hard to diagnose, but they are often caused by hardware issues. You
mentioned you have 4 drives in the machine. SATA can be rather sensitive
to power issues, and it's possible the power supply can't handle peak
demand from all 4 drives simultaneously. That might be something to look
into first.
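
For anyone reading similar logs, here is a minimal Python sketch of how
the "cmd" lines above can be told apart. It assumes the libata layout
where the first two hex bytes of the "cmd" field are the ATA command and
feature registers, and it only knows the few opcodes relevant to this
thread. The function and table names are made up for illustration; they
are not part of any kernel or smartmontools interface.

```python
import re

# ATA command opcodes relevant to this thread (not a complete table).
ATA_COMMANDS = {
    0xE7: "FLUSH CACHE",
    0xEA: "FLUSH CACHE EXT",
    0xB0: "SMART",
}

# SMART subcommands, selected by the feature register when command is 0xB0.
SMART_FEATURES = {
    0xD0: "READ DATA",
    0xD4: "EXECUTE OFF-LINE IMMEDIATE",
    0xD5: "READ LOG",
    0xD6: "WRITE LOG",
    0xD8: "ENABLE OPERATIONS",
    0xDA: "RETURN STATUS",
}

# Matches e.g. "ata4.00: cmd b0/d5:01:09:4f:c2/..." and captures the
# device name, the command byte and the feature byte.
CMD_RE = re.compile(r"(ata\d+\.\d+): cmd ([0-9a-f]{2})/([0-9a-f]{2}):")


def decode_cmd_line(line):
    """Return (device, description) for a libata "cmd" line, else None."""
    match = CMD_RE.search(line)
    if match is None:
        return None
    device = match.group(1)
    command = int(match.group(2), 16)
    feature = int(match.group(3), 16)
    name = ATA_COMMANDS.get(command, "unknown command 0x%02x" % command)
    if command == 0xB0:
        # For SMART, the feature register says which subcommand was issued.
        name += " " + SMART_FEATURES.get(feature, "feature 0x%02x" % feature)
    return device, name


if __name__ == "__main__":
    samples = [
        "ata4.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0",
        "ata3.00: cmd b0/d5:01:09:4f:c2/00:00:00:00:00/00 tag 0 pio 512 in",
    ]
    for sample in samples:
        print(decode_cmd_line(sample))
```

Feeding it the full log would also show which ports (ata3, ata4, ata5)
saw which kind of timed-out command.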
* Re: SATA errors
2009-08-09 10:36 SATA errors Miah Gregory
2009-08-09 17:24 ` Robert Hancock
@ 2009-08-09 19:57 ` Mikael Pettersson
2009-08-09 21:11 ` Miah Gregory
1 sibling, 1 reply; 5+ messages in thread
From: Mikael Pettersson @ 2009-08-09 19:57 UTC (permalink / raw)
To: Miah Gregory; +Cc: linux-ide
Miah Gregory writes:
> Hi,
>
> One of my servers has started to log slightly odd errors following one
> of the software RAID arrays having been degraded due to an error on sdb.
...
> All of these errors might be related to SMART self-checking, which
> I've now set to run more regularly, although I wouldn't really expect a
> reset to be required to get the disk to respond during one.
>
> I'm running a 2.6.28.2 kernel in a machine with 4 WDC WD5000AACS-0 500GB
> disks.
>
> I'm unable to check how the disks are physically wired at the moment,
> however they're probably all on the promise controller:
>
> 00:1f.2 IDE interface: Intel Corporation 82801EB (ICH5) SATA Controller (rev 02)
> 01:04.0 Mass storage controller: Promise Technology, Inc. PDC40718 (SATA 300 TX4) (rev 02)
>
> Any assistance gratefully received - please let me know if further
> information is needed.
1. Did these problems start spontaneously, or did they follow some
system change like installing more disks or booting a newer kernel?
2. If there's reason to suspect a kernel issue, the disk-to-controller
mapping in this machine will tell us which driver may be at fault.
Please post a complete kernel boot log from e.g. `dmesg'.
3. If the disks are attached to the Promise controller, please try this patch:
<http://user.it.uu.se/~mikpe/linux/patches/sata_promise/2.6.28/patch-sata_promise-reset-updates-v1-2.6.28>
It improved error recovery in a case where smart commands to a sleeping
disk of a particular model timed out.
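
Regarding point 2, the disk-to-controller mapping can sometimes be read
from a running system as well. The following is only a sketch under the
assumption that sysfs is mounted and that /sys/block/<disk>/device links
into the device tree, with a "driver" symlink on the controller's PCI
device. The function name and the sysfs_root parameter are hypothetical,
and this does not replace the full boot log asked for above.

```python
import os


def driver_for_block_device(name, sysfs_root="/sys"):
    """Return the controller driver behind a block device, or None.

    Walks up from /sys/block/<name>/device until a "driver" symlink is
    found that is not the SCSI disk driver ("sd") itself.
    """
    root = os.path.realpath(sysfs_root)
    path = os.path.realpath(os.path.join(root, "block", name, "device"))
    while path.startswith(root) and path != root:
        driver_link = os.path.join(path, "driver")
        if os.path.islink(driver_link):
            driver = os.path.basename(os.path.realpath(driver_link))
            if driver != "sd":
                # First non-"sd" driver on the way up is assumed to be
                # the host controller driver (e.g. sata_promise).
                return driver
        path = os.path.dirname(path)
    return None


if __name__ == "__main__":
    # Disk names taken from the logs in this thread.
    for disk in ("sda", "sdb", "sdc"):
        print(disk, driver_for_block_device(disk))
```

If a disk reports sata_promise here, it would be on the Promise
controller and the patch in point 3 applies to it.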
/Mikael
* Re: SATA errors
2009-08-09 19:57 ` Mikael Pettersson
@ 2009-08-09 21:11 ` Miah Gregory
2009-08-09 21:35 ` Mikael Pettersson
0 siblings, 1 reply; 5+ messages in thread
From: Miah Gregory @ 2009-08-09 21:11 UTC (permalink / raw)
To: Linux IDE
On Sun, 2009-08-09 at 21:57 +0200, Mikael Pettersson wrote:
> Miah Gregory writes:
> > One of my servers has started to log slightly odd errors following one
> > of the software RAID arrays having been degraded due to an error on sdb.
> ...
> 1. Did these problems start spontaneously, or did they follow some
> system change like installing more disks or booting a newer kernel?
Spontaneously; the machine has been running the current kernel since the
end of January, with some brief excursions to newer kernels which were
reverted due to the XFS/NFS interaction problems which haven't yet been
pinned down. No hardware changes etc. Current uptime just over 61 days.
> 2. If there's reason to suspect a kernel issue, the disk-to-controller
> mapping in this machine will tell us which driver may be at fault.
> Please post a complete kernel boot log from e.g. `dmesg'.
That will need a reboot, as the logs from the previous boot are long
since rotated; will organise this as time permits.
> 3. If the disks are attached to the Promise controller, please try this patch:
> <http://user.it.uu.se/~mikpe/linux/patches/sata_promise/2.6.28/patch-sata_promise-reset-updates-v1-2.6.28>
> It improved error recovery in a case where smart commands to a sleeping
> disk of a particular model timed out.
Could you clarify sleeping in this context? None of the disks are being
spun down, if that is what is meant?
Cheers.
ps. am subscribed to list, no need to cc
--
Miah Gregory
* Re: SATA errors
2009-08-09 21:11 ` Miah Gregory
@ 2009-08-09 21:35 ` Mikael Pettersson
0 siblings, 0 replies; 5+ messages in thread
From: Mikael Pettersson @ 2009-08-09 21:35 UTC (permalink / raw)
To: Linux IDE
Miah Gregory writes:
> On Sun, 2009-08-09 at 21:57 +0200, Mikael Pettersson wrote:
> > Miah Gregory writes:
>
> > > One of my servers has started to log slightly odd errors following one
> > > of the software RAID arrays having been degraded due to an error on sdb.
> > ...
>
> > 1. Did these problems start spontaneously, or did they follow some
> > system change like installing more disks or booting a newer kernel?
>
> Spontaneously; the machine has been running the current kernel since the
> end of January, with some brief excursions to newer kernels which were
> reverted due to the XFS/NFS interaction problems which haven't yet been
> pinned down. No hardware changes etc. Current uptime just over 61 days.
That is a strong indication that the core problem is hardware.
The kernel problem would be inadequate error recovery causing a
disk to be offlined after what should have been a recoverable event.
> > 2. If there's reason to suspect a kernel issue, the disk-to-controller
> > mapping in this machine will tell us which driver may be at fault.
> > Please post a complete kernel boot log from e.g. `dmesg'.
>
> That will need a reboot, as the logs from the previous boot are long
> since rotated; will organise this as time permits.
>
> > 3. If the disks are attached to the Promise controller, please try this patch:
> > <http://user.it.uu.se/~mikpe/linux/patches/sata_promise/2.6.28/patch-sata_promise-reset-updates-v1-2.6.28>
> > It improved error recovery in a case where smart commands to a sleeping
> > disk of a particular model timed out.
>
> Could you clarify sleeping in this context? None of the disks are being
> spun down, if that is what is meant?
The important part here is "It improved error recovery"; the rest just
describes that other case where error recovery needed improving.
(A specific disk was spun down with hdparm -Y and then subjected to SMART
commands. It did not like that, triggering timeouts and other errors.)