PMP SMART error recovery and failure code decoding help

linux-ide.vger.kernel.org archive mirror

* PMP SMART error recovery and failure code decoding help
From: Marc MERLIN @ 2011-01-16 16:39 UTC (permalink / raw)
  To: Tejun Heo, linux-ide

I have 2 sets of 5 drives, each set behind a PMP.

- 2.6.36.0 kernel
- sata_sil24 card
- Port Multiplier 1.1, 0x1095:0x3726 r23, 6 ports, feat 0x1/0x9

All 10 log errors on a schedule after being queried by some SMART tool
(I run at least hddtemp and smartmontools).

Error is:
ata10.02: failed command: SMART
ata10.02: cmd b0/d8:00:00:4f:c2/00:00:00:00:00/00 tag 0
         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
ata10.02: status: { DRDY }

I think it's due to a script that I wrote that uses hdparm -y to spin
drives down after an idle period (because at least 5 of my drives, WDC
WD20EADS, have stupid green firmware that prevents auto spindowns with
hdparm -S).
My swdisksusp script is at:
http://marc.merlins.org/perso/linux/post_2010-08-03_Spinning-Down-WD20EADS-Drives-and-Fixing-Load-Cycle.html

Anyway, the problem happens both with drives that I manually spin down and
with drives that spin down on their own. I think it's not actually a 'real'
error, but more an issue where drives cannot answer some SMART command when
they are spun down.
That said, is it normal/expected for the PMP code to do a full bus reset
because of a SMART command that couldn't go through?

Thanks,
Marc

Jan 16 05:54:23 gargamel kernel: ata9.00: failed command: SMART
Jan 16 05:54:31 gargamel kernel: ata9.01: failed command: SMART
Jan 16 05:54:39 gargamel kernel: ata9.02: failed command: SMART
Jan 16 05:54:47 gargamel kernel: ata9.03: failed command: SMART
Jan 16 05:54:55 gargamel kernel: ata9.04: failed command: SMART
Jan 16 06:05:04 gargamel kernel: ata9.00: failed command: SMART
Jan 16 06:05:12 gargamel kernel: ata9.01: failed command: SMART
Jan 16 06:05:20 gargamel kernel: ata9.02: failed command: SMART
Jan 16 06:05:28 gargamel kernel: ata9.03: failed command: SMART
Jan 16 06:05:36 gargamel kernel: ata9.04: failed command: SMART
Jan 16 06:05:44 gargamel kernel: ata10.00: failed command: SMART
Jan 16 06:06:01 gargamel kernel: ata10.01: failed command: SMART
Jan 16 06:06:18 gargamel kernel: ata10.02: failed command: SMART
Jan 16 06:16:35 gargamel kernel: ata9.00: failed command: SMART
Jan 16 06:16:52 gargamel kernel: ata9.01: failed command: SMART
Jan 16 06:17:01 gargamel kernel: ata9.02: failed command: SMART
Jan 16 06:17:08 gargamel kernel: ata9.03: failed command: SMART
Jan 16 06:17:16 gargamel kernel: ata9.04: failed command: SMART
Jan 16 06:27:25 gargamel kernel: ata10.00: failed command: SMART
Jan 16 06:27:42 gargamel kernel: ata10.01: failed command: SMART
Jan 16 06:27:59 gargamel kernel: ata10.02: failed command: SMART
Jan 16 06:38:16 gargamel kernel: ata9.00: failed command: SMART
Jan 16 06:38:24 gargamel kernel: ata9.01: failed command: SMART
Jan 16 06:38:32 gargamel kernel: ata9.02: failed command: SMART
Jan 16 06:38:40 gargamel kernel: ata9.03: failed command: SMART
Jan 16 06:38:48 gargamel kernel: ata9.04: failed command: SMART
Jan 16 06:59:05 gargamel kernel: ata10.00: failed command: SMART
Jan 16 06:59:19 gargamel kernel: ata10.01: failed command: SMART
Jan 16 06:59:36 gargamel kernel: ata10.02: failed command: SMART
Jan 16 07:29:58 gargamel kernel: ata10.00: failed command: SMART
Jan 16 07:30:15 gargamel kernel: ata10.01: failed command: SMART
Jan 16 07:30:32 gargamel kernel: ata10.02: failed command: SMART
Jan 16 08:00:55 gargamel kernel: ata10.00: failed command: SMART
Jan 16 08:01:12 gargamel kernel: ata10.01: failed command: SMART
Jan 16 08:01:29 gargamel kernel: ata10.02: failed command: SMART

A full error looks like this:
ata10.00: failed to read SCR 1 (Emask=0x40)
ata10.01: failed to read SCR 1 (Emask=0x40)
ata10.02: failed to read SCR 1 (Emask=0x40)
ata10.03: failed to read SCR 1 (Emask=0x40)
ata10.04: failed to read SCR 1 (Emask=0x40)
ata10.05: failed to read SCR 1 (Emask=0x40)
ata10.15: exception Emask 0x4 SAct 0x0 SErr 0x0 action 0x6 frozen
ata10.00: exception Emask 0x100 SAct 0x0 SErr 0x0 action 0x6 frozen
ata10.01: exception Emask 0x100 SAct 0x0 SErr 0x0 action 0x6 frozen
ata10.02: exception Emask 0x100 SAct 0x0 SErr 0x0 action 0x6 frozen
ata10.02: failed command: SMART
ata10.02: cmd b0/d8:00:00:4f:c2/00:00:00:00:00/00 tag 0
         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
ata10.02: status: { DRDY }
ata10.03: exception Emask 0x100 SAct 0x0 SErr 0x0 action 0x6 frozen
ata10.04: exception Emask 0x100 SAct 0x0 SErr 0x0 action 0x6 frozen
ata10.05: exception Emask 0x100 SAct 0x0 SErr 0x0 action 0x6 frozen
ata10.15: hard resetting link
ata10.15: SATA link up 3.0 Gbps (SStatus 123 SControl 0)
ata10.05: limiting SATA link speed to 1.5 Gbps
ata10.00: hard resetting link
ata10.00: SATA link up 3.0 Gbps (SStatus 123 SControl 320)
ata10.01: hard resetting link
ata10.01: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata10.02: hard resetting link
ata10.02: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata10.03: hard resetting link
ata10.03: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata10.04: hard resetting link
ata10.04: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata10.05: hard resetting link
ata10.05: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
ata10.00: configured for UDMA/100
ata10.01: configured for UDMA/100
ata10.02: qc timeout (cmd 0xec)
ata10.02: failed to IDENTIFY (I/O error, err_mask=0x5)
ata10.02: revalidation failed (errno=-5)
ata10.15: hard resetting link
ata10.15: SATA link up 3.0 Gbps (SStatus 123 SControl 0)
ata10.00: hard resetting link
ata10.00: SATA link up 3.0 Gbps (SStatus 123 SControl 320)
ata10.01: hard resetting link
ata10.01: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata10.02: hard resetting link
ata10.02: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata10.03: hard resetting link
ata10.03: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata10.04: hard resetting link
ata10.04: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata10.05: hard resetting link
ata10.05: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
ata10.00: configured for UDMA/100
ata10.01: configured for UDMA/100
ata10.02: configured for UDMA/100
ata10.03: configured for UDMA/100
ata10.04: configured for UDMA/100
ata10: EH complete
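
For readers decoding the log above, here is a rough Python sketch of how the Emask, action, and SStatus values can be read. The bit names are paraphrased from the AC_ERR_* and ATA_EH_* flags in the kernel's include/linux/libata.h. The helper names (decode_bits, decode_sstatus) are hypothetical, and only the three lowest action bits are covered.

```python
# Error-mask bits, paraphrased from the AC_ERR_* flags in include/linux/libata.h.
AC_ERR_BITS = {
    0x001: "device error",
    0x002: "host state machine violation",
    0x004: "timeout",
    0x008: "media error",
    0x010: "ATA bus error",
    0x020: "host bus error",
    0x040: "internal/system error",
    0x080: "invalid argument",
    0x100: "unknown/other",
    0x200: "no-device hint",
    0x400: "NCQ error",
}

# Lowest three EH action bits (ATA_EH_*); higher bits are deliberately left out.
EH_ACTION_BITS = {0x1: "revalidate", 0x2: "softreset", 0x4: "hardreset"}

# SPD field of SStatus: negotiated link speed.
LINK_SPEEDS = {0: "no negotiated speed", 1: "1.5 Gbps", 2: "3.0 Gbps", 3: "6.0 Gbps"}


def decode_bits(value, table):
    """Return the names of all bits of `value` that appear in `table`."""
    return [name for bit, name in sorted(table.items()) if value & bit]


def decode_sstatus(text):
    """Split an SStatus value, printed by the kernel as bare hex, into its fields."""
    value = int(text, 16)
    return {
        "det": value & 0xF,  # 3 = device present, PHY communication established
        "speed": LINK_SPEEDS.get((value >> 4) & 0xF, "unknown"),
        "ipm": (value >> 8) & 0xF,  # 1 = interface active
    }


if __name__ == "__main__":
    print("Emask 0x4    ->", decode_bits(0x4, AC_ERR_BITS))
    print("err_mask 0x5 ->", decode_bits(0x5, AC_ERR_BITS))
    print("action 0x6   ->", decode_bits(0x6, EH_ACTION_BITS))
    print("SStatus 113  ->", decode_sstatus("113"))
```

Read this way, Emask 0x4 is a timeout, err_mask=0x5 on the failed IDENTIFY is a device error plus a timeout, and action 0x6 asks for both a soft and a hard reset, which is consistent with the "hard resetting link" lines that follow.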


-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems & security ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/  


* Re: PMP SMART error recovery and failure code decoding help
From: Tejun Heo @ 2011-01-17 13:26 UTC (permalink / raw)
  To: Marc MERLIN; +Cc: linux-ide

Hello,

On Sun, Jan 16, 2011 at 08:39:50AM -0800, Marc MERLIN wrote:
> Anyway, the problem happens both with drives that I manually spin down and
> with drives that spin down on their own. I think it's not actually a 'real'
> error, but more an issue where drives cannot answer some SMART command when
> they are spun down.

It could be that the drives need to spin up to answer the smart
command and the timeout on the smart commands is a bit too short for
that to happen.  Forcing a disk access before issuing the smart
command could work around the problem.
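
The workaround suggested above could be sketched in Python roughly as follows. This is only an illustration under assumptions: the helper names are hypothetical, /dev/sdX is a placeholder, and it assumes that one uncached read via dd with iflag=direct is enough to spin the drive up (a drive may also answer such a read from its own cache).

```python
import subprocess


def wake_then_query_commands(device):
    """Build the two commands: an uncached one-block read, then a SMART query."""
    wake = ["dd", f"if={device}", "of=/dev/null", "bs=4096", "count=1", "iflag=direct"]
    query = ["smartctl", "-A", device]
    return [wake, query]


def wake_then_query(device, timeout=60):
    """Spin the drive up with a read, then run the SMART query and return its output."""
    wake, query = wake_then_query_commands(device)
    # The read blocks until the drive has spun up or the timeout expires.
    subprocess.run(wake, check=True, timeout=timeout,
                   stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    # smartctl uses its exit status as a bit mask, so a non-zero status is not raised here.
    result = subprocess.run(query, capture_output=True, text=True, timeout=timeout)
    return result.stdout


if __name__ == "__main__":
    for cmd in wake_then_query_commands("/dev/sdX"):  # placeholder device name
        print(" ".join(cmd))
```

Running the module as a script only prints the commands, so nothing touches a real device until wake_then_query is called explicitly.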

> That said, is it normal/expected for the PMP code to do a full bus reset
> because of a SMART command that couldn't go through?

Yeah, after a timeout, the driver doesn't know what state the
controller / PMP / devices are in, so it's kind of forced to do a full
reset.

Thanks.

-- 
tejun


* Re: PMP SMART error recovery and failure code decoding help
From: Marc MERLIN @ 2011-01-17 16:43 UTC (permalink / raw)
  To: Tejun Heo; +Cc: linux-ide

On Mon, Jan 17, 2011 at 02:26:23PM +0100, Tejun Heo wrote:
> Hello,

Hi (sorry that got cut out of my original message :) ).

> On Sun, Jan 16, 2011 at 08:39:50AM -0800, Marc MERLIN wrote:
> > Anyway, the problem happens both with drives that I manually spin down and
> > with drives that spin down on their own. I think it's not actually a 'real'
> > error, but more an issue where drives cannot answer some SMART command when
> > they are spun down.
> 
> It could be that the drives need to spin up to answer the smart
> command and the timeout on the smart commands is a bit too short for
> that to happen.  Forcing a disk access before issuing the smart
> command could work around the problem.
 
Right, although the idea is of course to keep the drives spun down :)
I haven't been able to find which SMART call is causing those errors yet.
Does cmd b0/d8:00:00:4f:c2/00:00:00:00:00/00 translate to anything useful?

> > That said, is it normal/expected for the PMP code to do a full bus reset
> > because of a SMART command that couldn't go through?
> 
> Yeah, after a timeout, the driver doesn't know what state the
> controller / PMP / devices are in, so it's kind of forced to do full
> reset.

Fair enough. I guess it's one of the downsides of PMP.

Thanks for the answer,
Marc


* Re: PMP SMART error recovery and failure code decoding help
From: Tejun Heo @ 2011-01-17 17:12 UTC (permalink / raw)
  To: Marc MERLIN; +Cc: linux-ide

Hello,

On Mon, Jan 17, 2011 at 08:43:40AM -0800, Marc MERLIN wrote:
> > It could be that the drives need to spin up to answer the smart
> > command and the timeout on the smart commands is a bit too short for
> > that to happen.  Forcing a disk access before issuing the smart
> > command could work around the problem.
>  
> Right, although the idea is of course to keep the drives spun down :)
> I haven't been able to find which SMART call is causing those errors yet.
> Does cmd b0/d8:00:00:4f:c2/00:00:00:00:00/00 translate to anything useful?

That's SMART ENABLE OPERATIONS.  It turns on SMART.
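
As an illustration of how that answer can be derived, here is a small Python sketch that splits the "cmd" string into taskfile registers and looks up the SMART subcommand in the feature register. The field order follows the format libata uses when printing a failed command (command/feature:nsect:lbal:lbam:lbah/hob fields/device). The function names are hypothetical, and the subcommand table lists only the common ATA SMART feature codes.

```python
import re

# Register order used by libata when it prints the "cmd" line of a failed command.
FIELDS = ("command", "feature", "nsect", "lbal", "lbam", "lbah",
          "hob_feature", "hob_nsect", "hob_lbal", "hob_lbam", "hob_lbah", "device")

# A few ATA command opcodes that appear in, or relate to, this thread.
COMMANDS = {0xB0: "SMART", 0xEC: "IDENTIFY DEVICE",
            0xE0: "STANDBY IMMEDIATE", 0xE5: "CHECK POWER MODE"}

# SMART subcommands, selected by the feature register.
SMART_SUBCOMMANDS = {
    0xD0: "SMART READ DATA",
    0xD2: "SMART ENABLE/DISABLE ATTRIBUTE AUTOSAVE",
    0xD4: "SMART EXECUTE OFF-LINE IMMEDIATE",
    0xD5: "SMART READ LOG",
    0xD6: "SMART WRITE LOG",
    0xD8: "SMART ENABLE OPERATIONS",
    0xD9: "SMART DISABLE OPERATIONS",
    0xDA: "SMART RETURN STATUS",
}


def parse_taskfile(text):
    """Turn 'b0/d8:00:00:4f:c2/00:00:00:00:00/00' into a dict of register values."""
    parts = re.split(r"[/:]", text.strip())
    if len(parts) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(parts)}")
    return dict(zip(FIELDS, (int(p, 16) for p in parts)))


def describe(tf):
    """Name the command; for SMART (0xb0), name the subcommand instead."""
    if tf["command"] == 0xB0:
        # SMART commands carry the signature 0x4f/0xc2 in the LBA mid/high registers.
        if (tf["lbam"], tf["lbah"]) != (0x4F, 0xC2):
            return "SMART (missing 4f/c2 signature)"
        return SMART_SUBCOMMANDS.get(tf["feature"], "unknown SMART subcommand")
    return COMMANDS.get(tf["command"], "unknown command")


if __name__ == "__main__":
    print(describe(parse_taskfile("b0/d8:00:00:4f:c2/00:00:00:00:00/00")))
```

Here b0 is the SMART opcode, d8 in the feature register selects ENABLE OPERATIONS, and 4f:c2 is the fixed signature every SMART command carries.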

> > > That said, is it normal/expected for the PMP code to do a full bus reset
> > > because of a SMART command that couldn't go through?
> > 
> > Yeah, after a timeout, the driver doesn't know what state the
> > controller / PMP / devices are in, so it's kind of forced to do full
> > reset.
> 
> Fair enough. I guess it's one of the downsides of PMP.

The device would still be reset even if it's attached directly.  The
only difference is that everything under the PMP is reset together
instead of individually.

Thanks.

-- 
tejun


* Re: PMP SMART error recovery and failure code decoding help
From: Marc MERLIN @ 2011-01-17 17:29 UTC (permalink / raw)
  To: Tejun Heo; +Cc: linux-ide

On Mon, Jan 17, 2011 at 06:12:09PM +0100, Tejun Heo wrote:
> Hello,
> 
> On Mon, Jan 17, 2011 at 08:43:40AM -0800, Marc MERLIN wrote:
> > > It could be that the drives need to spin up to answer the smart
> > > command and the timeout on the smart commands is a bit too short for
> > > that to happen.  Forcing a disk access before issuing the smart
> > > command could work around the problem.
> >  
> > Right, although the idea is of course to keep the drives spun down :)
> > I haven't been able to find which SMART call is causing those errors yet.
> > Does cmd b0/d8:00:00:4f:c2/00:00:00:00:00/00 translate to anything useful?
> 
> That's SMART ENABLE OPERATIONS.  It turns on SMART.
 
Haha, ok, not as useful as I thought :)

> > > > That said, is it normal/expected for the PMP code to do a full bus reset
> > > > because of a SMART command that couldn't go through?
> > > 
> > > Yeah, after a timeout, the driver doesn't know what state the
> > > controller / PMP / devices are in, so it's kind of forced to do full
> > > reset.
> > 
> > Fair enough. I guess it's one of the downsides of PMP.
> 
> The device would still be reset even if it's attached directly.  The
> only difference is that everything under the PMP is reset together
> instead of individually.

That's absolutely correct. I was kind of trying to avoid unnecessary full
PMP resets: they always make me nervous with software raid on top, but so
far no real disasters have happened :)

Thanks for your answers,
Marc


* Re: PMP SMART error recovery and failure code decoding help
From: Marc MERLIN @ 2011-06-29 17:14 UTC (permalink / raw)
  To: Tejun Heo; +Cc: linux-ide

On Mon, Jan 17, 2011 at 09:29:24AM -0800, Marc MERLIN wrote:
> On Mon, Jan 17, 2011 at 06:12:09PM +0100, Tejun Heo wrote:
> > Hello,
> > 
> > On Mon, Jan 17, 2011 at 08:43:40AM -0800, Marc MERLIN wrote:
> > > > It could be that the drives need to spin up to answer the smart
> > > > command and the timeout on the smart commands is a bit too short for
> > > > that to happen.  Forcing a disk access before issuing the smart
> > > > command could work around the problem.
> > >  
> > > Right, although the idea is of course to keep the drives spun down :)
> > > I haven't been able to find which SMART call is causing those errors yet.
> > > Does cmd b0/d8:00:00:4f:c2/00:00:00:00:00/00 translate to anything useful?
> > 
> > That's SMART ENABLE OPERATIONS.  It turns on SMART.
>  
> Haha, ok, not as useful as I thought :)

As an update for the list archives, more recent smartmontools (5.40 and
maybe older) allow this:
DEVICESCAN -n standby

This tells smartmontools not to talk to drives that are sleeping so that
it does not spin them back up.

In turn it fixed the related PMP timeout and PMP bus reset issues I was getting
from those commands.
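
For scripts that query drives directly rather than through smartd, the same idea can be sketched in Python: ask for the power state first and skip drives that are not spinning. The sketch assumes that hdparm -C prints a line of the form "drive state is:  standby"; the function names are hypothetical.

```python
import re
import subprocess


def parse_power_state(hdparm_output):
    """Extract the state reported by `hdparm -C`, or None if it is not found."""
    match = re.search(r"drive state is:\s*(\S+)", hdparm_output)
    return match.group(1) if match else None


def is_spun_down(device):
    """Run `hdparm -C` (needs root) and report whether the drive is in standby or sleeping."""
    result = subprocess.run(["hdparm", "-C", device],
                            capture_output=True, text=True, check=True)
    return parse_power_state(result.stdout) in ("standby", "sleeping")


def drives_safe_to_query(devices):
    """Return only the drives that are already spinning, so SMART queries do not wake the rest."""
    return [dev for dev in devices if not is_spun_down(dev)]
```

hdparm -C issues CHECK POWER MODE, which a drive should be able to answer without spinning up. smartctl also accepts -n standby on its own command line, which may be simpler than a separate check.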

Hope this helps someone.

Marc
