SCSI error handling -- one error blocks the whole SCSI host

public inbox for linux-scsi@vger.kernel.org

* SCSI error handling -- one error blocks the whole SCSI host
@ 2013-05-23 18:14 Roland Dreier
  2013-05-25 18:07 ` James Smart
  2013-05-26 22:44 ` James Bottomley
  0 siblings, 2 replies; 8+ messages in thread
From: Roland Dreier @ 2013-05-23 18:14 UTC
  To: linux-scsi, Hannes Reinecke, Jej B

At LSF this year, we had a discussion about error handling and in
particular the problem that SCSI midlayer error handling waits for the
entire SCSI host (HBA) to quiesce before it starts to abort commands
etc.

James made the suggestion that FC should handle things the way SAS
does, because SAS has a strategy handler that does things the right
way.  However, now that I finally sit down and look at the code, I
don't see how this is the case.  It seems inherent in the way that
scsi_eh_scmd_add() and the thread in scsi_error_handler() work (in
particular the strategy handler can't even be called until host_failed
== host_busy; we don't bump host_failed without SHOST_RECOVERY set,
which stops queueing commands to any devices attached to the whole
HBA).
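
A toy Python model of the gating described above may make it concrete.
It is only a sketch: the class and method names are made up for
illustration, the fields merely mirror host_busy, host_failed and
SHOST_RECOVERY, and none of this is the actual midlayer code.

```python
from dataclasses import dataclass


@dataclass
class Host:
    """Toy model of the host-wide counters (hypothetical, not kernel code)."""
    host_busy: int = 0         # commands currently outstanding on the HBA
    host_failed: int = 0       # commands handed over to error handling
    in_recovery: bool = False  # stands in for the SHOST_RECOVERY state

    def queue_command(self) -> bool:
        # Once recovery is flagged, nothing new is queued to ANY device.
        if self.in_recovery:
            return False
        self.host_busy += 1
        return True

    def complete_command(self) -> None:
        # A healthy command finishes normally.
        self.host_busy -= 1

    def eh_scmd_add(self) -> None:
        # A failed command stays counted in host_busy.
        self.in_recovery = True
        self.host_failed += 1

    def eh_can_run(self) -> bool:
        # The strategy handler only runs once every outstanding command
        # has either completed or failed.
        return self.in_recovery and self.host_failed == self.host_busy
```

With three commands outstanding and one of them failing, eh_can_run()
stays False until the other two complete, and queue_command() is
refused for every device on the host in the meantime.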

James, am I understanding your suggestion properly?  If so can you
explain what you meant about the libsas code -- I see that it has its
own strategy handler but as I said before we've already stopped every
device attached to the HBA before we ever get there.

To recapitulate the problem here, we might have a whole fabric
attached to an HBA via SAS or FC, and be doing 500K IOPS happily to 50
devices.  Then a single LUN goes wonky and all the IO stops while we
try to recover that single device, which might take minutes.

I know this has been discussed before, but can we find a way forward
here?  Is there some way we can start with per-device error recovery
and avoid disrupting IO that we can see is working fine?

Thanks,
  Roland


* Re: SCSI error handling -- one error blocks the whole SCSI host
  2013-05-23 18:14 SCSI error handling -- one error blocks the whole SCSI host Roland Dreier
@ 2013-05-25 18:07 ` James Smart
  2013-05-26 22:44 ` James Bottomley
  1 sibling, 0 replies; 8+ messages in thread
From: James Smart @ 2013-05-25 18:07 UTC
  To: Roland Dreier; +Cc: linux-scsi, Hannes Reinecke, Jej B

Roland,

I agree, and am already working around that limitation.

-- james s


On 5/23/2013 2:14 PM, Roland Dreier wrote:
> At LSF this year, we had a discussion about error handling and in
> particular the problem that SCSI midlayer error handling waits for the
> entire SCSI host (HBA) to quiesce before it starts to abort commands
> etc.
>
> James made the suggestion that FC should handle things the way SAS
> does, because SAS has a strategy handler that does things the right
> way.  However, now that I finally sit down and look at the code, I
> don't see how this is the case.  It seems inherent in the way that
> scsi_eh_scmd_add() and the thread in scsi_error_handler() work (in
> particular the strategy handler can't even be called until host_failed
> == host_busy; we don't bump host_failed without SHOST_RECOVERY set,
> which stops queueing commands to any devices attached to the whole
> HBA).
>
> James, am I understanding your suggestion properly?  If so can you
> explain what you meant about the libsas code -- I see that it has its
> own strategy handler but as I said before we've already stopped every
> device attached to the HBA before we ever get there.
>
> To recapitulate the problem here, we might have a whole fabric
> attached to an HBA via SAS or FC, and be doing 500K IOPS happily to 50
> devices.  Then a single LUN goes wonky and all the IO stops while we
> try to recover that single device, which might take minutes.
>
> I know this has been discussed before, but can we find a way forward
> here?  Is there some way we can start with per-device error recovery
> and avoid disrupting IO that we can see is working fine?
>
> Thanks,
>    Roland
> --
> To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
>



* Re: SCSI error handling -- one error blocks the whole SCSI host
  2013-05-23 18:14 SCSI error handling -- one error blocks the whole SCSI host Roland Dreier
  2013-05-25 18:07 ` James Smart
@ 2013-05-26 22:44 ` James Bottomley
  2013-05-27 14:39   ` Hannes Reinecke
  1 sibling, 1 reply; 8+ messages in thread
From: James Bottomley @ 2013-05-26 22:44 UTC
  To: Roland Dreier; +Cc: linux-scsi, Hannes Reinecke

On Thu, 2013-05-23 at 11:14 -0700, Roland Dreier wrote:
> At LSF this year, we had a discussion about error handling and in
> particular the problem that SCSI midlayer error handling waits for the
> entire SCSI host (HBA) to quiesce before it starts to abort commands
> etc.
> 
> James made the suggestion that FC should handle things the way SAS
> does, because SAS has a strategy handler that does things the right
> way.  However, now that I finally sit down and look at the code, I
> don't see how this is the case.  It seems inherent in the way that
> scsi_eh_scmd_add() and the thread in scsi_error_handler() work (in
> particular the strategy handler can't even be called until host_failed
> == host_busy; we don't bump host_failed without SHOST_RECOVERY set,
> which stops queueing commands to any devices attached to the whole
> HBA).
> 
> James, am I understanding your suggestion properly?  If so can you
> explain what you meant about the libsas code -- I see that it has its
> own strategy handler but as I said before we've already stopped every
> device attached to the HBA before we ever get there.

It is, but I checked: Apparently it's not implemented in the sas
transport class.  The original discussion when libsas was constructed,
as I remember it, was about using the scsi timeout handler to implement
a running abort.  The idea is fairly simple: you use the first fire of
eh_timed_out to trigger the abort (or LUN reset) while simultaneously
returning BLK_EH_RESET_TIMER.  If the timer fires again and the abort
hasn't returned, you escalate, otherwise you resend the command when the
abort returns.  This allows you to handle single command failures (up to
LUN reset) without stopping the host.  Obviously, if you have to
escalate to device reset, then you need to start the eh thread.
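
The escalation can be sketched as a small state machine. This is a toy
Python model under stated assumptions, not libsas or midlayer code:
Command, TimerAction, LADDER and action_returned are hypothetical names,
RESET_TIMER merely stands in for BLK_EH_RESET_TIMER, and START_EH stands
in for handing the command to the eh thread.

```python
from enum import Enum, auto


class TimerAction(Enum):
    RESET_TIMER = auto()  # stands in for BLK_EH_RESET_TIMER
    START_EH = auto()     # hypothetical: give up and wake the eh thread


# Actions that can be tried while the host keeps running.
LADDER = ["abort", "lun_reset"]


class Command:
    def __init__(self, name):
        self.name = name
        self.step = -1  # index into LADDER of the action in flight
        self.log = []


def eh_timed_out(cmd):
    """Toy timeout handler: called each time the command timer fires."""
    cmd.step += 1
    if cmd.step < len(LADDER):
        # Kick off the next action and re-arm the timer.
        cmd.log.append("send " + LADDER[cmd.step])
        return TimerAction.RESET_TIMER
    # Ladder exhausted: fall back to the host-wide error handler.
    cmd.log.append("start eh thread")
    return TimerAction.START_EH


def action_returned(cmd):
    """The in-flight abort or reset came back in time: resend the command."""
    if 0 <= cmd.step < len(LADDER):
        cmd.log.append(LADDER[cmd.step] + " done, resend " + cmd.name)
        cmd.step = -1
```

A command whose abort returns before the timer fires again never leaves
this path; only a command that times out a third time reaches START_EH.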

James


* Re: SCSI error handling -- one error blocks the whole SCSI host
  2013-05-26 22:44 ` James Bottomley
@ 2013-05-27 14:39   ` Hannes Reinecke
  2013-05-27 20:41     ` James Bottomley
  0 siblings, 1 reply; 8+ messages in thread
From: Hannes Reinecke @ 2013-05-27 14:39 UTC
  To: James Bottomley; +Cc: Roland Dreier, linux-scsi

On 05/27/2013 12:44 AM, James Bottomley wrote:
> 
> On Thu, 2013-05-23 at 11:14 -0700, Roland Dreier wrote:
>> At LSF this year, we had a discussion about error handling and in
>> particular the problem that SCSI midlayer error handling waits for the
>> entire SCSI host (HBA) to quiesce before it starts to abort commands
>> etc.
>>
>> James made the suggestion that FC should handle things the way SAS
>> does, because SAS has a strategy handler that does things the right
>> way.  However, now that I finally sit down and look at the code, I
>> don't see how this is the case.  It seems inherent in the way that
>> scsi_eh_scmd_add() and the thread in scsi_error_handler() work (in
>> particular the strategy handler can't even be called until host_failed
>> == host_busy; we don't bump host_failed without SHOST_RECOVERY set,
>> which stops queueing commands to any devices attached to the whole
>> HBA).
>>
>> James, am I understanding your suggestion properly?  If so can you
>> explain what you meant about the libsas code -- I see that it has its
>> own strategy handler but as I said before we've already stopped every
>> device attached to the HBA before we ever get there.
> 
> It is, but I checked: Apparently it's not implemented in the sas
> transport class.  The original discussion when libsas was constructed,
> as I remember it, was about using the scsi timeout handler to implement
> a running abort.  The idea is fairly simple: you use the first fire of
> eh_timed_out to trigger the abort (or LUN reset) while simultaneously
> returning BLK_EH_RESET_TIMER.  If the timer fires again and the abort
> hasn't returned, you escalate, otherwise you resend the command when the
> abort returns.  This allows you to handle single command failures (up to
> LUN reset) without stopping the host.  Obviously, if you have to
> escalate to device reset, then you need to start the eh thread.
> 
There are some problems with that:

- Returning BLK_EH_RESET_TIMER will restart the timer with the
  _default_ blk timeout, whereas the _abort_ timeout might be
  (and, for some LLDDs, definitely is) different from
  that.
- Leaving the command running while abort is active will
  inevitably risk a double completion on the original command;
  the command abort might terminate the command at the
  same time as the (real) completion comes in.
  'Normal' command timeouts are protected against this via
  REQ_ATOM_COMPLETE; commands aborted via scsi_finish_cmnd()
  are not.
- LLDDs typically won't return a command status even for a
  command which has been aborted via ABORT TASK TMF.
  So the midlayer probably will never get notified if
  the command got aborted via ABORT TASK.
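
The double-completion concern in the second point comes down to an
atomic test-and-set on a per-request flag. A toy Python model, assuming
a lock in place of an atomic bit operation (Request, mark_complete and
finish are hypothetical names, not block layer API):

```python
import threading


class Request:
    def __init__(self):
        self._lock = threading.Lock()
        self._complete = False  # stands in for the REQ_ATOM_COMPLETE bit
        self.finished_by = []

    def mark_complete(self) -> bool:
        """Atomically set the flag; returns True only for the first caller."""
        with self._lock:
            if self._complete:
                return False
            self._complete = True
            return True


def finish(req, who):
    # Both the real completion and the abort path call this; only the
    # one that wins the test-and-set actually finishes the request.
    if req.mark_complete():
        req.finished_by.append(who)
```

Whichever of the two paths loses the race simply does nothing. A path
that finishes the command without going through such a check has no
protection of this kind.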

The last point in particular made me abandon this idea for my EH
rewrite. We would see a real benefit if we could somehow get
the command status _from the target_ for an aborted command.
But as it appears we won't.
So as any status is made up anyway I'd very much prefer to have it
set by the midlayer. Which renders the whole operation quite
pointless and we're better off using the existing syntax for command
aborts.
Plus it makes life _so much_ easier for the implementation ...

But to answer Roland: Have you checked my patchset?
It should help for command timeouts ...

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		      zSeries & Storage
hare@suse.de			      +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)


* Re: SCSI error handling -- one error blocks the whole SCSI host
  2013-05-27 14:39   ` Hannes Reinecke
@ 2013-05-27 20:41     ` James Bottomley
  2013-05-28  1:32       ` Baruch Even
  0 siblings, 1 reply; 8+ messages in thread
From: James Bottomley @ 2013-05-27 20:41 UTC
  To: Hannes Reinecke; +Cc: Roland Dreier, linux-scsi

On Mon, 2013-05-27 at 16:39 +0200, Hannes Reinecke wrote:
> On 05/27/2013 12:44 AM, James Bottomley wrote:
> > 
> > On Thu, 2013-05-23 at 11:14 -0700, Roland Dreier wrote:
> >> At LSF this year, we had a discussion about error handling and in
> >> particular the problem that SCSI midlayer error handling waits for the
> >> entire SCSI host (HBA) to quiesce before it starts to abort commands
> >> etc.
> >>
> >> James made the suggestion that FC should handle things the way SAS
> >> does, because SAS has a strategy handler that does things the right
> >> way.  However, now that I finally sit down and look at the code, I
> >> don't see how this is the case.  It seems inherent in the way that
> >> scsi_eh_scmd_add() and the thread in scsi_error_handler() work (in
> >> particular the strategy handler can't even be called until host_failed
> >> == host_busy; we don't bump host_failed without SHOST_RECOVERY set,
> >> which stops queueing commands to any devices attached to the whole
> >> HBA).
> >>
> >> James, am I understanding your suggestion properly?  If so can you
> >> explain what you meant about the libsas code -- I see that it has its
> >> own strategy handler but as I said before we've already stopped every
> >> device attached to the HBA before we ever get there.
> > 
> > It is, but I checked: Apparently it's not implemented in the sas
> > transport class.  The original discussion when libsas was constructed,
> > as I remember it, was about using the scsi timeout handler to implement
> > a running abort.  The idea is fairly simple: you use the first fire of
> > eh_timed_out to trigger the abort (or LUN reset) while simultaneously
> > returning BLK_EH_RESET_TIMER.  If the timer fires again and the abort
> > hasn't returned, you escalate, otherwise you resend the command when the
> > abort returns.  This allows you to handle single command failures (up to
> > LUN reset) without stopping the host.  Obviously, if you have to
> > escalate to device reset, then you need to start the eh thread.
> > 
> There are some problems with that:
> 
> - Returning BLK_EH_RESET_TIMER will restart the timer with the
>   _default_ blk timeout, whereas the _abort_ timeout might be
>   (and, for some LLDDs, definitely is) different from
>   that.

Right ... you don't reuse the command, you have to start a new one.
libsas actually has a task abstraction, which is what you use to send
TMFs.

> - Leaving the command running while abort is active will
>   inevitably risk a double completion on the original command;
>   the command abort might terminate the command at the
>   same time as the (real) completion comes in.
>   'Normal' command timeouts are protected against this via
>   REQ_ATOM_COMPLETE; commands aborted via scsi_finish_cmnd()
>   are not.

That's not a bug, it's a requirement.  The way you handle commands in a
running abort or LUN reset is only in the status return code from the
command, so you have to tie the success of the eh action to the base
command and return DID_ABORT (or DID_RESET) in the actual command ...
this is how retries get done without troubling the error handler.
Essentially, this requires a low level tie with the HBA machine
description of the command, which is what avoids double completion.

> - LLDDs typically won't return a command status even for a
>   command which has been aborted via ABORT TASK TMF.
>   So the midlayer probably will never get notified if
>   the command got aborted via ABORT TASK.

Well, that's true, but irrelevant.  If the HBA can't inform you of the
status of the abort, then abort is useless as a first step in the
traditional eh as well as in this method, so you just don't do that and
proceed to resets.

There's actually a school of thought that says even if the HBA *can*
give you all the status you need, aborts are still pointless because
it's sending in yet another state transition to an already failed state
machine (because the device is timing out).  Therefore, since the chance
of recovering the state machine with an abort is so tiny, you should
start with the lowest reset anyway because that takes the state machine
to a known state.

James

> Especially the last point made me abandon this idea for my EH
> rewrite. We would be having a real benefit if we somehow could get
> the command status _from the target_ for an aborted command.
> But as it appears we won't.
> So as any status is made up anyway I'd very much prefer to have it
> set by the midlayer. Which renders the whole operation quite
> pointless and we're better off using the existing syntax for command
> aborts.
> Plus it makes life _so much_ easier for the implementation ...
> 
> But to answer Roland: Have you checked my patchset?
> It should help for command timeouts ...
> 
> Cheers,
> 
> Hannes





* Re: SCSI error handling -- one error blocks the whole SCSI host
  2013-05-27 20:41     ` James Bottomley
@ 2013-05-28  1:32       ` Baruch Even
  2013-05-28 14:38         ` Jeremy Linton
  0 siblings, 1 reply; 8+ messages in thread
From: Baruch Even @ 2013-05-28  1:32 UTC
  To: James Bottomley; +Cc: Hannes Reinecke, Roland Dreier, linux-scsi

On Mon, May 27, 2013 at 11:41 PM, James Bottomley
<James.Bottomley@hansenpartnership.com> wrote:
> On Mon, 2013-05-27 at 16:39 +0200, Hannes Reinecke wrote:
>
>> - LLDDs typically won't return a command status even for a
>>   command which has been aborted via ABORT TASK TMF.
>>   So the midlayer probably will never get notified if
>>   the command got aborted via ABORT TASK.
>
> Well, that's true, but irrelevant.  If the HBA can't inform you of the
> status of the abort, then abort is useless as a first step in the
> traditional eh as well as in this method, so you just don't do that and
> proceed to resets.
>
> There's actually a school of thought that says even if the HBA *can*
> give you all the status you need, aborts are still pointless because
> it's sending in yet another state transition to an already failed state
> machine (because the device is timing out).  Therefore, since the chance
> of recovering the state machine with an abort is so tiny, you should
> start with the lowest reset anyway because that takes the state machine
> to a known state.

Most devices I know do not really abort the command in any normal sense
anyhow, not even when doing a reset. Disks (HDD & SSD) and also SAN
systems normally just treat an abort or a reset as a signal that no
real reply is necessary, but the command itself, if it is already being
actively handled, continues on its path. The abort only cancels those
commands that are still in the queue, and if there really was a problem
and the disk is engaging in error recovery of its own, you'll just get
no response from it and it will seem dead (the abort may time out).

The one thing aborts/resets help with is to clear your HBA of any pending
commands, so that your DMA buffers will no longer be affected and you can
forget the command and do your application-level recovery (RAID, or lose
data and panic).

It is also an important part of handling bad links but at least in SAS that is
done internally in the HBA anyway.

This view of aborts also means that reducing timeouts for commands and
TMFs is mostly useless and sometimes even a really bad idea. I prefer
to just let the device go on with its error recovery and just forget about the
command. I want to forget about the DMA so I issue an abort but anything
higher than that means a link is dead to me.

Baruch


* Re: SCSI error handling -- one error blocks the whole SCSI host
  2013-05-28  1:32       ` Baruch Even
@ 2013-05-28 14:38         ` Jeremy Linton
  2013-05-28 16:22           ` Baruch Even
  0 siblings, 1 reply; 8+ messages in thread
From: Jeremy Linton @ 2013-05-28 14:38 UTC
  To: Baruch Even; +Cc: James Bottomley, Hannes Reinecke, Roland Dreier, linux-scsi

On 5/27/2013 8:32 PM, Baruch Even wrote:

> necessary but the command itself if it is already actively handled
> continues in its path. The abort only cancels those commands that are in
> the queue and if there really was a problem and the disk is engaging in
> error recovery of its own you'll just have no response from it and it will
> seem dead (abort may timeout).

	Yes, the abort seems to be handled more like a "hint" in many cases. Having
coded a couple of targets, I can say abort handling is often _REALLY_ hard to
get 100% right, especially when it's an actual error that is causing the
delay rather than a correctly functioning long-running command. That said,
I've seen devices respond to aborts on tape ERASE and similar commands by
actually aborting the command as one would expect. So it does sometimes work.

	Besides abort timeouts (which are major bad karma), the abort may be accepted,
and the next non-INQUIRY/TUR type command that gets queued simply blocks
waiting for the abort to complete internally. From the target device's
perspective, if you don't send a response for ABTS within 2*RA_TOV then your
problems start to multiply. So it encourages the target devices to treat
aborts in an async manner. As you said, the device simply finds the indicated
command on a queue, marks it as being aborted and hopes whatever is processing
the command notices and terminates its operation. On subsequent commands the
nicer devices will notice the abort hasn't completed and return "becoming
ready" or similar in response to TUR/etc. for some number of minutes.
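
	A toy Python model of that async handling, as a sketch only: Target and its
methods are hypothetical names, and the status strings are illustrative rather
than real sense data.

```python
class Target:
    """Toy target: aborts are acknowledged at once and finish later."""

    def __init__(self):
        # tag -> {"steps_left": int, "aborting": bool}
        self.tasks = {}

    def submit(self, tag, steps):
        self.tasks[tag] = {"steps_left": steps, "aborting": False}

    def abort(self, tag):
        # Only mark the task; the response goes out before the worker
        # has actually stopped processing it.
        if tag in self.tasks:
            self.tasks[tag]["aborting"] = True
        return "abort accepted"

    def worker_tick(self):
        # The worker notices the mark only between processing steps.
        for tag in list(self.tasks):
            task = self.tasks[tag]
            if task["aborting"] or task["steps_left"] <= 1:
                del self.tasks[tag]  # aborted, or finished its last step
            else:
                task["steps_left"] -= 1

    def test_unit_ready(self):
        # A "nicer" device reports that an abort is still in progress.
        pending = any(t["aborting"] for t in self.tasks.values())
        return "not ready, becoming ready" if pending else "good"
```

	Between abort() returning and the next worker_tick(), the initiator already
has its response while the command is still alive inside the target.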


	

> 
> This view of aborts also means that reducing timeouts for commands and TMFs
> is mostly useless and sometimes even a really bad idea. I prefer to just
> let the device go on with its error recovery and just forget about the 
> command. I want to forget about the DMA so I issue an abort but anything 
> higher than that means a link is dead to me.

	Well, invariably the manufacturers have timeouts that are really long and
based on internal error recovery logic. See
http://www-01.ibm.com/support/docview.wss?uid=ssg1S7003556&aid=1 page 468.
Notice the timeouts are specified in minutes, not seconds. Furthermore, the
commands that normally complete in fractions of a second have actual timeouts
that can be tens of minutes (READ/WRITE for example). So, doing anything
before that timeout has expired is a good way to knock the device offline.
Some of the newer disks have mode page options to shorten their read/write
error recovery, but "short" error recovery can still be many tens of seconds
rather than a couple minutes. Plus, it doesn't help compound commands like
"SYNCHRONIZE CACHE" which may take multiple errors during operation.

	This is another part of what formed my opinions about error isolation. If one
of your devices goes out to lunch and isn't recovering via abort/LUN reset,
it's done! Wrecking the rest of the SAN doing "bus resets" and HBA resets is a
good way to take a serious problem and turn it into a full-blown catastrophe.






* Re: SCSI error handling -- one error blocks the whole SCSI host
  2013-05-28 14:38         ` Jeremy Linton
@ 2013-05-28 16:22           ` Baruch Even
  0 siblings, 0 replies; 8+ messages in thread
From: Baruch Even @ 2013-05-28 16:22 UTC
  To: Jeremy Linton; +Cc: James Bottomley, Hannes Reinecke, Roland Dreier, linux-scsi

On Tue, May 28, 2013 at 5:38 PM, Jeremy Linton <jlinton@tributary.com> wrote:
>         This is another part of what formed my opinions about error isolation. If one
> of your devices goes out to lunch and isn't recovering via abort/LUN reset,
> it's done! Wrecking the rest of the SAN doing "bus resets" and HBA resets is a
> good way to take a serious problem and turn it into a full-blown catastrophe.

This is the gist of the issue: once you get to an abort you are screwed
already. You need the abort, but anything else should be reserved for when
things are really dead (the HBA might still recover on a host reset, but
only do it if the host is really unresponsive).

That's why I prefer to have a long timeout for the command and a long
timeout for the abort. The application above should handle itself with its
own timeout once the abort was sent (the buffer remains locked until the
abort returns). The device itself is likely stuck in error recovery and it
will come out of it when its own internal timeouts are exhausted, which can
be infinite and will generally be very large.

Baruch

