error handler scheduling


* error handler scheduling
@ 2013-03-27  2:11 James Smart
  2013-03-27 14:35 ` Hannes Reinecke
  2013-03-27 14:39 ` Douglas Gilbert
  0 siblings, 2 replies; 7+ messages in thread
From: James Smart @ 2013-03-27  2:11 UTC
  To: linux-scsi@vger.kernel.org

In looking through the error handler, if a command times out and is 
added to the eh_cmd_q for the shost, the error handler is only awakened 
once shost->host_busy (total number of i/os posted to the shost) is 
equal to shost->host_failed (number of i/o that have been failed and put 
on the eh_cmd_q).  Which means, any other i/o that was outstanding must 
either complete or have their timeout fire.  Additionally, as all 
further i/o is held off at the block layer as the shost is in recovery, 
new i/o cannot be submitted until the error handler runs and resolves 
the errored i/os.

Is this true ?

I take it is also true that the midlayer thus expects every i/o to have 
an i/o timeout.  True ?

The crux of this point is that the moment the recovery thread runs to 
abort the timed out i/os is at the mercy of the last command to complete 
or time out. Additionally, as all further i/o is held off at the block layer 
as the shost is in recovery, new i/o cannot be submitted until the error 
handler runs and resolves the errored i/os. So all I/O on the host is 
stopped until that last i/o completes/times out.   The timeouts may be 
eons later.  Consider SCSI format commands or verify commands that can 
take hours to complete.
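
To make the condition concrete, here is a minimal sketch of the wakeup
rule as described above. It is a toy model, not kernel code: the struct
and the toy_* helpers are made up for illustration, and only the two
counter names (host_busy, host_failed) come from the midlayer.

```c
#include <stdbool.h>

/* Toy stand-in for the two per-host counters discussed above.
 * This is NOT the kernel's struct Scsi_Host. */
struct toy_shost {
	unsigned int host_busy;   /* i/os posted to the shost, not yet finished */
	unsigned int host_failed; /* i/os failed and parked on eh_cmd_q */
};

/* The error handler is only woken once every outstanding i/o has
 * either completed or failed. The host_failed > 0 guard is an
 * assumption added so an idle host does not count as "in recovery". */
static bool toy_eh_should_wake(const struct toy_shost *sh)
{
	return sh->host_failed > 0 && sh->host_busy == sh->host_failed;
}

/* A command completes normally: it leaves the busy count. */
static void toy_complete(struct toy_shost *sh)
{
	sh->host_busy--;
}

/* A command times out: it stays counted as busy, but is now failed. */
static void toy_timeout(struct toy_shost *sh)
{
	sh->host_failed++;
}
```

With three commands outstanding, one timeout is not enough to wake the
handler; it only wakes after the other two have completed or timed out
as well. A command whose timeout never fires keeps the predicate false
indefinitely.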

Specifically, I'm in a situation currently, where an application is 
using sg to send a command to a target. The app selected no-timeout - by 
setting timeout to MAX_INT. Effectively it's so large it's infinite.   
This I/O was one of those "lost" on the storage fabric. There was 
another command that long ago timed out and is sitting on the error 
handler's queue. But nothing is happening - no new i/o, and no error 
handler run to resolve the failed i/o - until that infinite i/o completes.

I'm hoping I hear that I just misunderstand things.  If not,  is there a 
suggestion for how to resolve this predicament ?    IMHO, I'm surprised 
we stop all i/o for error handling, and that it can be so long later...  
I would assume there's a minimum bound we would wait in the error 
handler (30s?) before we unconditionally run it and abort anything that 
was outstanding.

-- james s


* Re: error handler scheduling
  2013-03-27  2:11 error handler scheduling James Smart
@ 2013-03-27 14:35 ` Hannes Reinecke
  2013-04-02  7:43   ` Bhanu Prakash Gollapudi
  2013-03-27 14:39 ` Douglas Gilbert
  1 sibling, 1 reply; 7+ messages in thread
From: Hannes Reinecke @ 2013-03-27 14:35 UTC
  To: James.Smart; +Cc: linux-scsi@vger.kernel.org

On 03/27/2013 03:11 AM, James Smart wrote:
> In looking through the error handler, if a command times out and is
> added to the eh_cmd_q for the shost, the error handler is only
> awakened once shost->host_busy (total number of i/os posted to the
> shost) is equal to shost->host_failed (number of i/o that have been
> failed and put on the eh_cmd_q).  Which means, any other i/o that
> was outstanding must either complete or have their timeout fire.
> Additionally, as all further i/o is held off at the block layer as
> the shost is in recovery, new i/o cannot be submitted until the
> error handler runs and resolves the errored i/os.
>
> Is this true ?
>
Yes.

> I take it is also true that the midlayer thus expects every i/o to
> have an i/o timeout.  True ?
>
Yes. But this is guaranteed by the block-layer:

void blk_add_timer(struct request *req)
{
	struct request_queue *q = req->q;
	unsigned long expiry;

	if (!q->rq_timed_out_fn)
		return;

	BUG_ON(!list_empty(&req->timeout_list));
	BUG_ON(test_bit(REQ_ATOM_COMPLETE, &req->atomic_flags));

	/*
	 * Some LLDs, like scsi, peek at the timeout to prevent a
	 * command from being retried forever.
	 */
	if (!req->timeout)
		req->timeout = q->rq_timeout;

	[...]
}

So every request will have a timeout, either the default 
request_queue timeout or an individual one.
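
Reduced to its core, the fallback in the excerpt above looks roughly
like the sketch below. The helper name toy_effective_timeout is made up
for illustration; only the "zero means use the queue default" rule is
taken from blk_add_timer().

```c
/* Hypothetical helper mirroring the fallback shown above: a request
 * with no timeout of its own (zero) inherits the queue default; any
 * nonzero value, however large, is used as given. */
static unsigned long toy_effective_timeout(unsigned long req_timeout,
					   unsigned long queue_default)
{
	return req_timeout ? req_timeout : queue_default;
}
```

Note that a huge value such as MAX_INT is nonzero, so it is kept as is
and never replaced by the default.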

> The crux of this point is that the moment the recovery thread runs
> to abort the timed out i/os is at the mercy of the last command to
> complete or time out. Additionally, as all further i/o is held off at
> the block layer as the shost is in recovery, new i/o cannot be
> submitted until the error handler runs and resolves the errored
> i/os. So all I/O on the host is stopped until that last i/o
> completes/times out.   The timeouts may be eons later.  Consider
> SCSI format commands or verify commands that can take hours to
> complete.
>
Yes, that's true. Unfortunately.

> Specifically, I'm in a situation currently, where an application is
> using sg to send a command to a target. The app selected no-timeout
> - by setting timeout to MAX_INT. Effectively it's so large it's
> infinite. This I/O was one of those "lost" on the storage fabric.
> There was another command that long ago timed out and is sitting on
> the error handler's queue. But nothing is happening - no new i/o, and
> no error handler run to resolve the failed i/o - until that infinite i/o
> completes.
>
Hehe. no timeout != MAX_INT.

It's easy to apply a timeout if none is set. But how do we determine 
what constitutes a valid timeout?

As mentioned, some commands can literally take forever _and_ be 
fully legit. So who are we to decide?

> I'm hoping I hear that I just misunderstand things.  If not,  is
> there a suggestion for how to resolve this predicament ?    IMHO,
> I'm surprised we stop all i/o for error handling, and that it can be
> so long later... I would assume there's a minimum bound we would
> wait in the error handler (30s?) before we unconditionally run it
> and abort anything that was outstanding.
>
Ah, the joys of error recovery.

Incidentally, that'll be one of the topics I'll be discussing at 
LSF; I've been bitten by this on various other occasions.

AFAIK the reasoning behind the current error recovery strategy is 
that it's modelled after SCSI parallel behaviour, where you 
basically have to stop the entire bus, figure out which state it's 
in, and then take corrective action.
And you typically don't have any LUNs to deal with.
_And_ SPI is essentially single-threaded when it comes to target 
access, so in effect you cannot send commands over the bus when 
resetting a target.
So there it makes sense.

Less so for modern fabrics, where target access is governed by an 
I_T nexus, each of which is largely independent of the others.

Actually there is another issue with the error handler:
The commands will only be released after eh is done.

If you look at the eh sequence
-> eh_abort
   -> eh_lun_reset
     -> eh_target_reset
       -> eh_bus_reset
         -> eh_host_reset
the command itself is only meaningful until lun_reset() has 
completed; after lun_reset() the command is invalid.
Every other stage still uses the scsi command as an argument,
but only as a place holder to figure out which device it should act 
upon.
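
As a rough illustration of that ladder, here is a small sketch in plain
C. Everything prefixed toy_ is hypothetical and only models the order
of the stages and the point after which the command is no more than a
place holder; it is not how scsi_error.c is structured.

```c
#include <stdbool.h>
#include <stddef.h>

/* The eh stages in escalation order, as listed above. */
enum toy_eh_stage {
	TOY_EH_ABORT,
	TOY_EH_LUN_RESET,
	TOY_EH_TARGET_RESET,
	TOY_EH_BUS_RESET,
	TOY_EH_HOST_RESET,
	TOY_EH_STAGES
};

/* One handler per stage; returns true if recovery succeeded. */
typedef bool (*toy_eh_fn)(void *ctx);

/* Try each stage in order and return the first one that succeeds,
 * or TOY_EH_STAGES if every stage failed. */
static int toy_eh_escalate(const toy_eh_fn handlers[TOY_EH_STAGES], void *ctx)
{
	for (int i = 0; i < TOY_EH_STAGES; i++) {
		if (handlers[i] && handlers[i](ctx))
			return i;
	}
	return TOY_EH_STAGES;
}

/* The command is only meaningful up to and including lun reset;
 * later stages use it purely to find the device to act upon. */
static bool toy_cmd_still_meaningful(enum toy_eh_stage stage)
{
	return stage <= TOY_EH_LUN_RESET;
}

/* Trivial example handlers. */
static bool toy_eh_fail(void *ctx) { (void)ctx; return false; }
static bool toy_eh_ok(void *ctx)   { (void)ctx; return true; }
```

If abort and lun reset both fail and target reset succeeds, the ladder
stops at the target reset stage, by which point the original command
has long stopped being meaningful.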

So we _could_ speed things up quite a lot if we were able to 
call ->done() on the command after lun reset; then the command would 
be returned to the upper layers.
And things like multipath could kick in and move I/O to other
devices.

However, this is a daunting task.
I've tried, and it's far from easy.
_Especially_ due to some FC HBAs insisting on using scmds for sending 
TARGET RESET TMFs.
If we could just do a LOGO for target reset things would become so 
much easier ...

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		      zSeries & Storage
hare@suse.de			      +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: error handler scheduling
  2013-03-27 14:35 ` Hannes Reinecke
@ 2013-04-02  7:43   ` Bhanu Prakash Gollapudi
  0 siblings, 0 replies; 7+ messages in thread
From: Bhanu Prakash Gollapudi @ 2013-04-02  7:43 UTC
  To: Hannes Reinecke; +Cc: James.Smart, linux-scsi@vger.kernel.org

On 03/27/2013 07:35 AM, Hannes Reinecke wrote:
> However, this is a daunting task.
> I've tried, and it's far from easy.
> _Especially_ due to some FC HBAs insisting on using scmds for sending 
> TARGET RESET TMFs.
> If we just could do a LOGO for target reset things would become so 
> much easier ...

For FC HBAs, as per FCP-4:
"12.5.1 ABTS error recovery
If a response to an ABTS is not received within 2 times R_A_TOV_ELS, the 
initiator FCP_Port may transmit the ABTS again, attempt other retry 
operations allowed by FC-FS-3, or explicitly logout the target FCP_Port. 
If those retry operations attempted are unsuccessful, the initiator 
FCP_Port shall explicitly logout (i.e., transmit a LOGO ELS) the target 
FCP_Port. All outstanding Exchanges with that target FCP_Port are 
terminated at the initiator FCP_Port."

So, for FC HBAs, if a command times out we don't have to escalate the 
error recovery from lun reset to host reset.
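
A minimal sketch of that policy, under the assumption that the driver
gets a simple yes/no answer per ABTS attempt. All toy_* names and the
max_tries knob are made up for illustration; FCP-4 does not prescribe
a retry count.

```c
#include <stdbool.h>

enum toy_abts_result {
	TOY_ABTS_OK,   /* an ABTS was answered in time */
	TOY_ABTS_LOGO  /* retries exhausted: log out the target port */
};

/* Sends one ABTS and returns true if a response arrived within
 * 2 * R_A_TOV_ELS. Hypothetical callback supplied by the caller. */
typedef bool (*toy_send_abts_fn)(void *ctx);

/* Retry ABTS up to max_tries times, then fall back to LOGO, which
 * terminates all outstanding exchanges with that target port. */
static enum toy_abts_result toy_abts_recover(toy_send_abts_fn send,
					     void *ctx, unsigned int max_tries)
{
	for (unsigned int i = 0; i < max_tries; i++) {
		if (send(ctx))
			return TOY_ABTS_OK;
	}
	return TOY_ABTS_LOGO;
}

/* Example sender: fails while the counter in ctx is positive. */
static bool toy_abts_countdown(void *ctx)
{
	int *failures_left = ctx;

	if (*failures_left > 0) {
		(*failures_left)--;
		return false;
	}
	return true;
}
```

Either outcome ends recovery at the port level, which is the point
above: no walk through lun, target, bus and host reset is needed.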

Thanks,
Bhanu

* Re: error handler scheduling
  2013-03-27  2:11 error handler scheduling James Smart
  2013-03-27 14:35 ` Hannes Reinecke
@ 2013-03-27 14:39 ` Douglas Gilbert
  2013-03-28 16:02   ` Elliott, Robert (Server Storage)
  1 sibling, 1 reply; 7+ messages in thread
From: Douglas Gilbert @ 2013-03-27 14:39 UTC
  To: James.Smart; +Cc: linux-scsi@vger.kernel.org

James,
After many encounters with the Linux SCSI mid-level error
handler I have concluded it is uncontrollable and
seemingly random, as seen from user space. Interestingly,
several attempts to add finer grained controls over
lu/target/host resets have been rebuffed.

So my policy is to avoid timeout induced resets (like the
plague). Hence the default with sg_format is to set the IMMED
bit and use TEST UNIT READY or REQUEST SENSE polling to
monitor progress **. With commands like VERIFY, send many
reasonably sized commands, not one big one. And a special
mention for the SCSI WRITE SAME command which probably
has T10's silliest definition: if the NUMBER OF
LOGICAL BLOCKS field is set to zero it means keep writing
until the end of the disk *** and that might be 20 hours
later! The equivalent field set to zero in a SCSI VERIFY
or WRITE **** command means do nothing.
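
For the "many reasonably sized commands" approach, splitting the range
could look roughly like the sketch below. The struct and function names
are made up for illustration, and nothing here issues a SCSI command;
it only computes the (lba, length) pairs one would then send.

```c
#include <stddef.h>
#include <stdint.h>

/* One piece of a larger range: starting LBA and block count. */
struct toy_chunk {
	uint64_t lba;
	uint32_t nblocks;
};

/* Split [start, start + total) into pieces of at most max_blocks
 * blocks each. Returns the number of chunks written to out[], or 0
 * if max_blocks is zero, total is zero, or out[] is too small. */
static size_t toy_split_range(uint64_t start, uint64_t total,
			      uint32_t max_blocks,
			      struct toy_chunk *out, size_t max_chunks)
{
	size_t n = 0;

	if (max_blocks == 0)
		return 0;

	while (total > 0) {
		uint32_t len = total < max_blocks ? (uint32_t)total : max_blocks;

		if (n == max_chunks)
			return 0;	/* caller's array is too small */

		out[n].lba = start;
		out[n].nblocks = len;
		n++;

		start += len;
		total -= len;
	}
	return n;
}
```

Each chunk can then be sent with an ordinary, short timeout, so no
single command has to run for hours.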

Doug Gilbert


**   You can still run into problems with a SCSI FORMAT UNIT
      with the IMMED bit set: some other kernel subsystem or
      user space program may decide to send a SCSI command to the
      disk during format. Then said code may not comprehend why
      the disk in question is not ready and ends up triggering
      mid-level error handling which blows the format out of
      the water. That leaves the disk in the "format corrupt"
      state.

***  recently the Block Limits VPD page has (knee-)capped this
      with the WSNZ bit

**** apart from the obsolete WRITE(6) command which found
      another non obvious interpretation for a zero transfer
      length


* RE: error handler scheduling
  2013-03-27 14:39 ` Douglas Gilbert
@ 2013-03-28 16:02   ` Elliott, Robert (Server Storage)
  2013-04-12  9:42     ` Ren Mingxin
  0 siblings, 1 reply; 7+ messages in thread
From: Elliott, Robert (Server Storage) @ 2013-03-28 16:02 UTC
  To: dgilbert@interlog.com, James.Smart@emulex.com; +Cc: linux-scsi@vger.kernel.org

There are several possible reasons for SCSI command timeouts:
    a) the command request did not get to the SCSI target port and logical
       unit (e.g., error on the wire)
    b) logical unit is still working on the command
    c) the command completed, but status didn't get to the SCSI initiator port 
       and application client (e.g., error on the wire)

SCSI doesn't have a good way to detect case (c). For status delivery errors
detected by the logical unit, I once proposed that the logical unit establish
a unit attention condition and record the status delivery problem in a log
page (T10 proposal 04-072) but this proposal didn't draw much interest. The 
QUERY TASK task management function can detect case (b) vs. the other cases.

With SSDs, a lengthy timeout derived from ancient SCSI floppy drives doesn't
make sense. Timeouts should scale automatically based on the device type
(e.g., use microseconds for SSDs and seconds for HDDs). The REPORT
SUPPORTED OPERATION CODES command provides some command timeout values
to facilitate this.

For Base feature set drives I'm encouraging an approach like this for 
handling command timeouts:

1) at discovery time:
    1a) send REPORT SUPPORTED OPERATION CODES to determine the nominal
        and maximum command timeouts
    1b) send REPORT SUPPORTED TASK MANAGEMENT FUNCTIONS to determine 
        the TMF timeouts

2) send the command (e.g., READ, WRITE, FORMAT UNIT, ...)

If status arrives for the command at any time, exit out of this procedure. 
If an I_T nexus loss occurs, then that handling overrides this procedure
as well. Otherwise:

3) if the nominal command timeout is long (e.g., for a command like FORMAT
UNIT with IMMED=0, but not for IO commands like READ and WRITE), then wait
a short time and send QUERY TASK to ensure the command got there:
    3a) if the command is not there (probably lost in delivery, but
        possibly lost status), go to step (2) to resend the command
    3b) if the command is still being processed, keep waiting

4) if the nominal command timeout is reached, send QUERY TASK to determine
what is happening:
    4a) if the command is not there (if step (3) was run, then this
        probably means lost status), go to step (2) to resend the command
    4b) if the command is still being processed, keep waiting

5) if the maximum command timeout is reached, send QUERY TASK to determine
what is happening:
    5a) if the command is not there (since step (4) was run, this
         probably means lost status), go to step (2) to resend the command
    5b) if the command is still being processed, proceed to step (6)
        to abort the command

6) send ABORT TASK to abort the command

7) If ABORT TASK succeeds, either:
    7a) escalate to a stronger TMF or hard reset if this command
       keeps having repeated problems; or
    7b) go to step (2) to resend the command

8) If the ABORT TASK timeout is reached, either:
    8a) escalate to a stronger TMF or hard reset, then go to step (2) 
        to resend the command; or
    8b) declare the logical unit is unavailable
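
The decision made at each QUERY TASK checkpoint in steps (3) to (5)
can be sketched as a small pure function. The enum and function names
are made up for illustration; the mapping itself follows the steps
above.

```c
/* Which point in the procedure triggered the QUERY TASK. */
enum toy_checkpoint {
	TOY_EARLY_CHECK,      /* step 3: short wait on a long command */
	TOY_NOMINAL_TIMEOUT,  /* step 4 */
	TOY_MAXIMUM_TIMEOUT   /* step 5 */
};

/* What QUERY TASK reported about the command. */
enum toy_query_result {
	TOY_TASK_ABSENT,      /* command is not there */
	TOY_TASK_PRESENT      /* command is still being processed */
};

enum toy_action {
	TOY_RESEND,           /* go to step 2 */
	TOY_KEEP_WAITING,
	TOY_SEND_ABORT_TASK   /* go to step 6 */
};

static enum toy_action toy_on_checkpoint(enum toy_checkpoint cp,
					 enum toy_query_result q)
{
	if (q == TOY_TASK_ABSENT)
		return TOY_RESEND;           /* steps 3a, 4a, 5a */
	if (cp == TOY_MAXIMUM_TIMEOUT)
		return TOY_SEND_ABORT_TASK;  /* step 5b */
	return TOY_KEEP_WAITING;             /* steps 3b, 4b */
}
```

Only a command that is still present at the maximum timeout gets
aborted; a command that has gone missing is resent at every checkpoint.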

Doug: for ***, in addition to the WSNZ bit now letting the drive not support
the value of zero, T10 proposal 13-052 changes WRITE SAME so that a NUMBER 
OF LOGICAL BLOCKS field set to zero (if supported) must honor the MAXIMUM WRITE
SAME LENGTH field, so the drive can provide a reasonable timeout value
for the command (and need not worry that the entire capacity might be specified).

---
Rob Elliott    HP Server Storage




* Re: error handler scheduling
  2013-03-28 16:02   ` Elliott, Robert (Server Storage)
@ 2013-04-12  9:42     ` Ren Mingxin
  2013-04-12 19:20       ` Baruch Even
  0 siblings, 1 reply; 7+ messages in thread
From: Ren Mingxin @ 2013-04-12  9:42 UTC
  To: Elliott, Robert (Server Storage), dgilbert@interlog.com,
	James.Smart@emulex.com, hare
  Cc: linux-scsi@vger.kernel.org

Please let me summarize what this thread has said about scsi
eh latency:

1) some scsi cmds' timeout values are inappropriate; we can avoid
    timeouts by:
    a) having sg_format set the IMMED bit and use TEST UNIT READY or
       REQUEST SENSE polling to monitor progress - by Douglas
    b) cutting a big cmd into several reasonably sized ones - by Douglas
    c) improving timeout values according to device type - by Elliott
2) call ->done() on the command after lun reset - by Hannes

And, my question is:
- could we wake up the eh thread ASAP, instead of waiting for all cmds
   to complete, so that it is scheduled faster?
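
To show what is being asked, here is a sketch of a wakeup test with an
upper bound on the wait. It describes a hypothetical policy, not
existing midlayer behaviour: the struct, the first_failure timestamp
and the max_wait parameter are all assumptions for illustration.

```c
#include <stdbool.h>

/* Toy host state; times are plain seconds for simplicity. */
struct toy_shost {
	unsigned int host_busy;       /* outstanding cmds */
	unsigned int host_failed;     /* cmds waiting on eh_cmd_q */
	unsigned long first_failure;  /* when the first cmd failed */
};

/* Wake the eh thread either when all outstanding cmds have failed
 * (the condition discussed in this thread) or when the first failed
 * cmd has already waited max_wait seconds. */
static bool toy_eh_should_wake_bounded(const struct toy_shost *sh,
				       unsigned long now,
				       unsigned long max_wait)
{
	if (sh->host_failed == 0)
		return false;
	if (sh->host_busy == sh->host_failed)
		return true;
	return now - sh->first_failure >= max_wait;
}
```

With a bound of 30 seconds, a host with one failed cmd and one cmd that
never times out would still enter recovery 30 seconds after the first
failure.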

BTW: my original question is here:
http://www.spinics.net/lists/linux-scsi/msg65107.html

Thanks,
Ren

> ---
> Rob Elliott    HP Server Storage
>
>
>
>> -----Original Message-----
>> From: linux-scsi-owner@vger.kernel.org [mailto:linux-scsi-
>> owner@vger.kernel.org] On Behalf Of Douglas Gilbert
>> Sent: Wednesday, 27 March, 2013 9:39 AM
>> To: James.Smart@emulex.com
>> Cc: linux-scsi@vger.kernel.org
>> Subject: Re: error handler scheduling
>>
>> On 13-03-26 10:11 PM, James Smart wrote:
>>> In looking through the error handler, if a command times out and is added
>>> to the eh_cmd_q for the shost, the error handler is only awakened once
>>> shost->host_busy (total number of i/os posted to the shost) is equal to
>>> shost->host_failed (number of i/o that have been failed and put on the
>>> eh_cmd_q).  Which means, any other i/o that was outstanding must either
>>> complete or have their timeout fire.  Additionally, as all further i/o is
>>> held off at the block layer as the shost is in recovery, new i/o cannot be
>>> submitted until the error handler runs and resolves the errored i/os.
>>>
>>> Is this true ?
>>>
>>> I take it is also true that the midlayer thus expects every i/o to have an
>>> i/o timeout.  True ?
>>>
>>> The crux of this point is that when the recovery thread runs to aborts the
>>> timed out i/os, is at the mercy of the last command to complete or timeout.
>>> Additionally, as all further i/o is held off at the block layer as the
>>> shost is in recovery, new i/o cannot be submitted until the error handler
>>> runs and resolves the errored i/os. So all I/O on the host is stopped until
>>> that last i/o completes/times out.   The timeouts may be eons later.
>>> Consider SCSI format commands or verify commands that can take hours to
>>> complete.
>>>
>>> Specifically, I'm in a situation currently, where an application is using
>>> sg to send a command to a target. The app selected no-timeout - by setting
>>> timeout to MAX_INT. Effectively it's so large its infinite. This I/O was
>>> one of those "lost" on the storage fabric. There was another command that
>>> long ago timed out and is sitting on the error handlers queue. But nothing
>>> is happening - new i/o, or error handler to resolve the failed i/o, until
>>> that inifinite i/o completes.
>>>
>>> I'm hoping I hear that I just misunderstand things.  If not,  is there a
>>> suggestion for how to resolve this predicament ?    IMHO, I'm surprised we
>>> stop all i/o for error handling, and that it can be so long later... I
>>> would assume there's a minimum bound we would wait in the error handler
>>> (30s?) before we unconditionally run it and abort anything that was
>>> outstanding.
>> James,
>> After many encounters with the Linux SCSI mid-level error
>> handler I have concluded it is uncontrollable and
>> seemingly random, seen from the user space. Interestingly,
>> several attempts to add finer grained controls over
>> lu/target/host resets have been rebuffed.
>>
>> So my policy is to avoid timeout induced resets (like the
>> plague). Hence the default with sg_format is to set the IMMED
>> bit and use TEST UNIT READY or REQUEST SENSE polling to
>> monitor progress **. With commands like VERIFY, send many
>> reasonably sized commands, not one big one. And a special
>> mention for the SCSI WRITE SAME command which probably
>> has T10's silliest definition: if the NUMBER OF
>> LOGICAL BLOCKS field is set to zero it means keep writing
>> until the end of the disk *** and that might be 20 hours
>> later! The equivalent field set to zero in a SCSI VERIFY
>> or WRITE **** command means do nothing.
>>
>> Doug Gilbert
>>
>>
>> **   You can still run into problems when a SCSI FORMAT UNIT
>>        with the IMMED bit set: some other kernel subsystem or
>>        user space program may decide to send a SCSI command to the
>>        disk during format. Then said code may not comprehend why
>>        the disk in question is not ready and ends up triggering
>>        mid-level error handling which blows the format out of
>>        the water. That leaves the disk in the "format corrupt"
>>        state.
>>
>> ***  recently the Block Limits VPD has (knee-)capped this
>>        with the WSNZ bit
>>
>> **** apart from the obsolete WRITE(6) command which found
>>        another non obvious interpretation for a zero transfer
>>        length



* Re: error handler scheduling
  2013-04-12  9:42     ` Ren Mingxin
@ 2013-04-12 19:20       ` Baruch Even
  0 siblings, 0 replies; 7+ messages in thread
From: Baruch Even @ 2013-04-12 19:20 UTC (permalink / raw)
  To: Ren Mingxin
  Cc: Elliott, Robert (Server Storage), dgilbert@interlog.com,
	James.Smart@emulex.com, hare, linux-scsi@vger.kernel.org

On Fri, Apr 12, 2013 at 12:42 PM, Ren Mingxin <renmx@cn.fujitsu.com> wrote:
>
> Let me summarize what this thread has discussed so far about SCSI
> eh latency:
>
> 1) some SCSI cmds' timeout values are inappropriate; we can avoid
>    timeouts by:
>    a) having sg_format set the IMMED bit and poll with TEST UNIT READY
>       or REQUEST SENSE to monitor progress - by Douglas
>    b) cutting a big cmd into several reasonably sized ones - by Douglas
>    c) improving timeout values according to device type - by Elliott
> 2) call ->done() on the command after lun reset - by Hannes
>
> And my question is:
> - could we wake up the eh thread as soon as possible, instead of
>   waiting for all cmds to complete, so that error handling is
>   scheduled sooner?
>
> BTW: my original question is here:
> http://www.spinics.net/lists/linux-scsi/msg65107.html

I don't think you can simply stop waiting for all commands to complete or
time out. The problem starts when your abort fails and you are forced to
escalate to higher-level actions such as a target reset. Since a target reset
takes out all active requests, you'll want to wait for every active request
to either return or time out first, and during that time you also don't want
to send any new commands down the pipe.

Of course, there are situations where you don't care enough about the other
commands and can handle their cancellation with an immediate target reset,
or where you may prefer not to abort that single command at all. I've seen
and implemented changes (internal to my employer) that do just that, but the
code changes are non-trivial.
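
To make the concern above concrete, here is a toy model of why an early
target reset is costly. It is only a sketch under stated assumptions:
enum cmd_state and model_target_reset() are hypothetical names invented
for this example and are not midlayer or driver interfaces.

```c
#include <stddef.h>

/* Hypothetical per-command state for the model. */
enum cmd_state { CMD_ACTIVE, CMD_TIMED_OUT, CMD_CANCELLED };

/*
 * Model of a target reset: every command still active on the target is
 * taken out along with the one that timed out. Returns the number of
 * healthy commands that were cancelled as collateral damage.
 */
static size_t model_target_reset(enum cmd_state *cmds, size_t n)
{
    size_t cancelled = 0;

    for (size_t i = 0; i < n; i++) {
        if (cmds[i] == CMD_ACTIVE) {
            cmds[i] = CMD_CANCELLED;
            cancelled++;
        }
    }
    return cancelled;
}
```

If the reset runs while two commands are still active next to one timed
out command, both active commands are cancelled. Waiting for them to
return or time out first is what avoids that, at the price of the
latency this thread is about.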

Baruch


end of thread, other threads:[~2013-04-12 19:20 UTC | newest]

Thread overview: 7+ messages
2013-03-27  2:11 error handler scheduling James Smart
2013-03-27 14:35 ` Hannes Reinecke
2013-04-02  7:43   ` Bhanu Prakash Gollapudi
2013-03-27 14:39 ` Douglas Gilbert
2013-03-28 16:02   ` Elliott, Robert (Server Storage)
2013-04-12  9:42     ` Ren Mingxin
2013-04-12 19:20       ` Baruch Even
