scsi error handling thread and REQUEST SENSE


* scsi error handling thread and REQUEST SENSE
@ 2014-05-16 19:02 Elliott, Robert (Server Storage)
  2014-05-16 20:05 ` Ewan Milne
  2014-05-19 11:40 ` Bart Van Assche
  0 siblings, 2 replies; 9+ messages in thread
From: Elliott, Robert (Server Storage) @ 2014-05-16 19:02 UTC (permalink / raw)
  To: James Bottomley (jbottomley@parallels.com), Hannes Reinecke,
	Christoph Hellwig, scameron@beardog.cce.hp.com,
	linux-scsi@vger.kernel.org

There is an issue with a command timeout followed by a failed
abort in the linux SCSI stack.

After triggering a timeout on a command like:
[ 5454.196861] sd 2:0:0:1: [sds] Done: TIMEOUT
[ 5454.196863] sd 2:0:0:1: [sds] Result: hostbyte=DID_OK driverbyte=DRIVER_OK
[ 5454.196866] sd 2:0:0:1: [sds] CDB: Mode Sense(10): 5a 00 03 00 00 00 00 00 04 00

scsi_times_out() invokes scsi_abort_command():
[ 5454.196880] sd 2:0:0:1: [sds] scmd ffff880428963a70 abort scheduled

and scmd_eh_abort_handler() tries to abort the command:
[ 5454.206828] sd 2:0:0:1: [sds] aborting command ffff880428963a70

If the abort fails (with return value FAILED (0x2003 == 8195)):
[ 5454.206832] sd 2:0:0:1: [sds] scmd ffff880428963a70 abort failed, rtn 8195

then scmd_eh_abort_handler() just gives up and expects the error 
handler thread to deal with the problem.
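
As a rough illustration of the hand-off described above, here is a minimal user-space sketch. SUCCESS and FAILED carry the values from include/scsi/scsi.h; the enum and the after_abort() helper are hypothetical names made up for this example and do not exist in the kernel.

```c
#include <assert.h>

/* Mid-layer disposition codes, values as in include/scsi/scsi.h. */
#define SUCCESS 0x2002
#define FAILED  0x2003

/* The log prints the code in decimal: 0x2003 is 8195. */
static_assert(FAILED == 8195, "FAILED is printed as 8195 in the abort log");

/* Hypothetical model of the two outcomes of the asynchronous abort. */
enum next_step {
	ABORT_DONE,        /* abort worked, command can be finished or retried */
	HAND_TO_EH_THREAD  /* abort did not work, leave it to scsi_error_handler() */
};

/* Anything other than SUCCESS is treated as "let the EH thread sort it out". */
static enum next_step after_abort(int rtn)
{
	return rtn == SUCCESS ? ABORT_DONE : HAND_TO_EH_THREAD;
}
```

With the value from the log, after_abort(8195) yields HAND_TO_EH_THREAD, which is the path the rest of this report follows.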

When that thread (scsi_error_handler()) wakes up later on, it finds 
this command (and others) outstanding:
[ 5454.373581] scsi_eh_2: waking up 0/3/3
[ 5454.375037] sd 2:0:0:1: scsi_eh_prt_fail_stats: cmds failed: 1, cancel: 0
[ 5454.377332] sd 2:0:0:11: scsi_eh_prt_fail_stats: cmds failed: 2, cancel: 0
[ 5454.379779] Total of 3 commands on 2 devices require eh work

For each command, it starts with this check:

#define SCSI_SENSE_VALID(scmd) \
        (((scmd)->sense_buffer[0] & 0x70) == 0x70)

                if ((scmd->eh_eflags & SCSI_EH_CANCEL_CMD) ||
                    SCSI_SENSE_VALID(scmd))
                        continue;

In this case, that if statement fails.  The eflags bit is not 
set, and the sense data buffer still contains zeros or garbage - 
the command is still outstanding, so the buffer might be written 
at any time.

(the sense buffer shouldn't be read unless a valid bit says
it's filled in, and this lacks support for descriptor format 
sense data (type 0x72), but those are side issues)
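
For reference, the two sense data layouts keep the sense key in different bytes, so code that reads the buffer has to look at the response code first. The following is only a sketch under that assumption; sense_key() is a hypothetical helper written for this example, not a kernel function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Return the sense key (0..15), or -1 if byte 0 does not hold one of the
 * four sense data response codes or the buffer is too short.
 */
static int sense_key(const uint8_t *sense, size_t len)
{
	assert(sense != NULL);
	if (len < 3)
		return -1;

	switch (sense[0] & 0x7f) {
	case 0x70: /* fixed format, current error */
	case 0x71: /* fixed format, deferred error */
		return sense[2] & 0x0f;
	case 0x72: /* descriptor format, current error */
	case 0x73: /* descriptor format, deferred error */
		return sense[1] & 0x0f;
	default:   /* zeros or garbage: nothing to report */
		return -1;
	}
}
```

A buffer that is still all zeros, as in the case described here, gives -1 rather than a bogus sense key.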

Strangely, the error handler code (scsi_unjam_host()) proceeds 
to send a REQUEST SENSE command and sees the resulting sense 
key of NO SENSE:

[ 5454.381659] sd 2:0:0:1: [sds] scsi_eh_2: requesting sense
[ 5454.383597] scsi_eh_done scmd: ffff880428963a70 result: 0
[ 5454.385457] sd 2:0:0:1: [sds] Done: UNKNOWN
[ 5454.387430] sd 2:0:0:1: [sds] Result: hostbyte=DID_OK driverbyte=DRIVER_OK 
[ 5454.390450] sd 2:0:0:1: [sds] CDB: Request Sense: 03 00 00 00 60 00
[ 5454.393497] scsi_send_eh_cmnd: scmd: ffff880428963a70, timeleft: 9998
[ 5454.395667] scsi_send_eh_cmnd: scsi_eh_completed_normally 2002
[ 5454.397842] sense requested for ffff880428963a70 result 0
[ 5454.399675] sd 2:0:0:1: [sds] Sense Key : No Sense [current]
[ 5454.402570] sd 2:0:0:1: [sds] Add. Sense: No additional sense information

The bogus "UNKNOWN" print is being fixed by Hannes' logging 
patch. It just means the REQUEST SENSE command was submitted 
successfully.

This device uses autosense, so REQUEST SENSE is not a valid way 
to find out any information for the timed out command. There is
no contingent allegiance condition stalling the queue until 
REQUEST SENSE comes along to collect the sense data - that
parallel SCSI concept went obsolete in SAM-3 revision 5 in 
January 2003.

The command is still outstanding; data transfers might still occur, 
and a completion using its tag could still appear at any time. 
However, the error handler declares that the command is done, 
so all the buffers are freed and the tag is reused.

The SCSI error handler needs to escalate this to a reset that 
ensures that the command is no longer outstanding: ABORT
TASK (which already didn't work), ABORT TASK SET, LOGICAL 
UNIT RESET, I_T NEXUS RESET, or hard reset.

If those fail, then it needs to escalate to resetting or disabling
the controller - disable the Bus Master Enable bit in its PCIe 
interface so the controller cannot write into host memory and
report the device as gone.  It's not safe to proceed while
hardware is still able to write to host memory for those old
commands.
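
The escalation argued for above could be modelled roughly as follows. This is a user-space sketch only: every name in it (eh_escalate(), the step stubs, the result enum) is hypothetical, and each step stands in for one rung such as ABORT TASK SET, LOGICAL UNIT RESET, I_T NEXUS RESET or a hard reset.

```c
#include <assert.h>
#include <stddef.h>

enum eh_result {
	EH_RECOVERED,      /* some step guaranteed the command is gone */
	EH_FENCE_HOST      /* nothing worked: fence the controller, report the device gone */
};

/* A step returns nonzero only if the command can no longer touch host memory. */
typedef int (*eh_step_fn)(void *ctx);

/* Stub steps for illustration. */
static int step_fails(void *ctx)    { (void)ctx; return 0; }
static int step_succeeds(void *ctx) { (void)ctx; return 1; }

/*
 * Walk the ladder from least to most drastic step and stop at the first
 * one that succeeds. *used (if non-NULL) receives the index of that step,
 * or nsteps when every step failed.
 */
static enum eh_result eh_escalate(const eh_step_fn *steps, size_t nsteps,
				  void *ctx, size_t *used)
{
	assert(steps != NULL || nsteps == 0);
	for (size_t i = 0; i < nsteps; i++) {
		if (steps[i](ctx)) {
			if (used)
				*used = i;
			return EH_RECOVERED;
		}
	}
	if (used)
		*used = nsteps;
	return EH_FENCE_HOST;
}
```

The point of the model is the final return value: the command is never reported as done unless one of the steps has positively confirmed it, and if none does, the outcome is fencing rather than completion.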

The error handler thread does let a transport layer replace 
scsi_unjam_host(), but not all drivers have a transport
layer assisting them:
                if (shost->transportt->eh_strategy_handler)
                        shost->transportt->eh_strategy_handler(shost);
                else
                        scsi_unjam_host(shost);

libsas, for example, provides sas_scsi_recover_host() as that
function.  It does try more things, but appears to give up 
if they don't work and eventually calls scsi_eh_get_sense() 
just like scsi_unjam_host(), so may suffer from the same 
problem.
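
To make the dispatch above concrete, here is a small stand-alone model of it. The struct and function names are simplified stand-ins invented for this sketch (they are not the kernel's struct Scsi_Host or struct scsi_transport_template); only the "use the transport's handler if present, otherwise the default" shape is taken from the code quoted above.

```c
#include <assert.h>
#include <stddef.h>

enum eh_path { EH_NONE, EH_DEFAULT_UNJAM, EH_TRANSPORT_STRATEGY };

struct host_model;

/* Stand-in for the transport template: the handler is optional. */
struct transport_model {
	void (*eh_strategy_handler)(struct host_model *h);
};

struct host_model {
	const struct transport_model *transportt;
	enum eh_path last_path;   /* records which recovery path ran */
};

/* Stand-in for scsi_unjam_host(). */
static void default_unjam(struct host_model *h)
{
	h->last_path = EH_DEFAULT_UNJAM;
}

/* Stand-in for a transport-provided handler such as the libsas one. */
static void sas_like_strategy(struct host_model *h)
{
	h->last_path = EH_TRANSPORT_STRATEGY;
}

static void run_error_handler(struct host_model *h)
{
	assert(h != NULL && h->transportt != NULL);
	if (h->transportt->eh_strategy_handler)
		h->transportt->eh_strategy_handler(h);
	else
		default_unjam(h);
}
```

Either branch ends up in the same place if both eventually fall back to requesting sense, which is why replacing the strategy handler alone does not necessarily avoid the problem.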

Any suggestions for how to fix this?

---
Rob Elliott    HP Server Storage

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: scsi error handling thread and REQUEST SENSE
  2014-05-16 19:02 scsi error handling thread and REQUEST SENSE Elliott, Robert (Server Storage)
@ 2014-05-16 20:05 ` Ewan Milne
  2014-05-19  8:32   ` Hannes Reinecke
  2014-05-19 11:40 ` Bart Van Assche
  1 sibling, 1 reply; 9+ messages in thread
From: Ewan Milne @ 2014-05-16 20:05 UTC (permalink / raw)
  To: Elliott, Robert (Server Storage)
  Cc: James Bottomley (jbottomley@parallels.com), Hannes Reinecke,
	Christoph Hellwig, scameron@beardog.cce.hp.com,
	linux-scsi@vger.kernel.org

On Fri, 2014-05-16 at 19:02 +0000, Elliott, Robert (Server Storage)
wrote:
> There is an issue with a command timeout followed by a failed
> abort in the linux SCSI stack.

This might explain some odd crashes I've seen, where it looks like
a command might have completed *long* after it should have timed out.
I have a few questions, see below:

> 
> After triggering a timeout on a command like:
> [ 5454.196861] sd 2:0:0:1: [sds] Done: TIMEOUT
> [ 5454.196863] sd 2:0:0:1: [sds] Result: hostbyte=DID_OK driverbyte=DRIVER_OK
> [ 5454.196866] sd 2:0:0:1: [sds] CDB: Mode Sense(10): 5a 00 03 00 00 00 00 00 04 00
> 
> scsi_times_out() invokes scsi_abort_command():
> [ 5454.196880] sd 2:0:0:1: [sds] scmd ffff880428963a70 abort scheduled
> 
> and scmd_eh_abort_handler() tries to abort the command:
> [ 5454.206828] sd 2:0:0:1: [sds] aborting command ffff880428963a70
> 
> If the abort fails (with return value FAILED (0x2003 == 8195)):
> [ 5454.206832] sd 2:0:0:1: [sds] scmd ffff880428963a70 abort failed, rtn 8195
> 
> then scmd_eh_abort_handler() just gives up and expects the error 
> handler thread to deal with the problem.
> 
> When that thread (scsi_error_handler()) wakes up later on, it finds 
> this command (and others) outstanding:
> [ 5454.373581] scsi_eh_2: waking up 0/3/3
> [ 5454.375037] sd 2:0:0:1: scsi_eh_prt_fail_stats: cmds failed: 1, cancel: 0
> [ 5454.377332] sd 2:0:0:11: scsi_eh_prt_fail_stats: cmds failed: 2, cancel: 0
> [ 5454.379779] Total of 3 commands on 2 devices require eh work
> 
> For each command, it starts with this check:
> 
> #define SCSI_SENSE_VALID(scmd) \
>         (((scmd)->sense_buffer[0] & 0x70) == 0x70)
> 
>                 if ((scmd->eh_eflags & SCSI_EH_CANCEL_CMD) ||
>                     SCSI_SENSE_VALID(scmd))
>                         continue;
> 
> In this case, that if statement fails.  The eflags bit is not 
> set, and the sense data buffer still contains zeros or garbage - 
> the command is still outstanding, so the buffer might be written 
> at any time.
> 
> (the sense buffer shouldn't be read unless a valid bit says
> it's filled in, and this lacks support for descriptor format 
> sense data (type 0x72), but those are side issues)

Doesn't the check for:   (byte[0] & 0x70) == 0x70   cover 0x70 - 0x73?
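
The mask question can be checked mechanically. The sketch below applies the same expression as SCSI_SENSE_VALID() to a bare byte; sense_byte0_valid() and count_accepted_values() are names made up for this example.

```c
#include <assert.h>
#include <stdint.h>

/* 0x72 has bits 4..6 set, so it passes the mask at compile time already. */
static_assert((0x72 & 0x70) == 0x70, "descriptor format passes the mask");

/* Same test as SCSI_SENSE_VALID(), applied to byte 0 only. */
static int sense_byte0_valid(uint8_t byte0)
{
	return (byte0 & 0x70) == 0x70;
}

/* Count how many of the 256 possible byte-0 values the mask accepts. */
static int count_accepted_values(void)
{
	int n = 0;

	for (int v = 0; v <= 0xff; v++)
		n += sense_byte0_valid((uint8_t)v);
	return n;
}
```

The mask ignores bits 0-3 and bit 7, so it accepts 0x70 through 0x7f (and the same values with bit 7 set), 32 values in total. That includes 0x72 and 0x73, while an all-zero byte is rejected.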

> 
> Strangely, the error handler code (scsi_unjam_host()) proceeds 
> to send a REQUEST SENSE command and sees the resulting sense 
> key of NO SENSE:
> 
> [ 5454.381659] sd 2:0:0:1: [sds] scsi_eh_2: requesting sense
> [ 5454.383597] scsi_eh_done scmd: ffff880428963a70 result: 0
> [ 5454.385457] sd 2:0:0:1: [sds] Done: UNKNOWN
> [ 5454.387430] sd 2:0:0:1: [sds] Result: hostbyte=DID_OK driverbyte=DRIVER_OK 
> [ 5454.390450] sd 2:0:0:1: [sds] CDB: Request Sense: 03 00 00 00 60 00
> [ 5454.393497] scsi_send_eh_cmnd: scmd: ffff880428963a70, timeleft: 9998
> [ 5454.395667] scsi_send_eh_cmnd: scsi_eh_completed_normally 2002
> [ 5454.397842] sense requested for ffff880428963a70 result 0
> [ 5454.399675] sd 2:0:0:1: [sds] Sense Key : No Sense [current]
> [ 5454.402570] sd 2:0:0:1: [sds] Add. Sense: No additional sense information

So, a command timed out, the abort didn't succeed, but a
REQUEST SENSE completed normally?

What kernel was this?  Did it have the change to issue the abort
in the timeout handler rather than the EH thread?  It seems like
it does, based on your description above.  However, I'm wondering
because I have seen crashes on kernels both with and without that
change.

> 
> The bogus "UNKNOWN" print is being fixed by Hannes' logging 
> patch. It just means the REQUEST SENSE command was submitted 
> successfully.
> 
> This device uses autosense, so REQUEST SENSE is not a valid way 
> to find out any information for the timed out command. There is
> no contingent allegiance condition stalling the queue until 
> REQUEST SENSE comes along to collect the sense data - that
> parallel SCSI concept went obsolete in SAM-3 revision 5 in 
> January 2003.
> 
> The command is still outstanding; data transfers might still occur, 
> and a completion using its tag could still appear at any time. 
> However, the error handler declares that the command is done, 
> so all the buffers are freed and the tag is reused.
> 
> The SCSI error handler needs to escalate this to a reset that 
> ensures that the command is no longer outstanding: ABORT
> TASK (which already didn't work), ABORT TASK SET, LOGICAL 
> UNIT RESET, I_T NEXUS RESET, or hard reset.

What is supposed to happen is that the EH will escalate and
eventually reset the HBA if all else fails.  It definitely
should not be returning the scmd if the LLD is still using it.

> 
> If those fail, then it needs to escalate to resetting or disabling
> the controller - disable the Bus Master Enable bit in its PCIe 
> interface so the controller cannot write into host memory and
> report the device as gone.  It's not safe to proceed while
> hardware is still able to write to host memory for those old
> commands.
> 
> The error handler thread does let a transport layer replace 
> scsi_unjam_host(), but not all drivers have a transport
> layer assisting them:
>                 if (shost->transportt->eh_strategy_handler)
>                         shost->transportt->eh_strategy_handler(shost);
>                 else
>                         scsi_unjam_host(shost);
> 
> libsas, for example, provides sas_scsi_recover_host() as that
> function.  It does try more things, but appears to give up 
> if they don't work and eventually calls scsi_eh_get_sense() 
> just like scsi_unjam_host(), so may suffer from the same 
> problem.
> 
> Any suggestions for how to fix this?
> 
> ---
> Rob Elliott    HP Server Storage
> 
> 




* Re: scsi error handling thread and REQUEST SENSE
  2014-05-16 20:05 ` Ewan Milne
@ 2014-05-19  8:32   ` Hannes Reinecke
  2014-05-19 10:29     ` Bart Van Assche
  2014-05-19 13:41     ` James Bottomley
  0 siblings, 2 replies; 9+ messages in thread
From: Hannes Reinecke @ 2014-05-19  8:32 UTC (permalink / raw)
  To: emilne, Elliott, Robert (Server Storage)
  Cc: James Bottomley (jbottomley@parallels.com), Christoph Hellwig,
	scameron@beardog.cce.hp.com, linux-scsi@vger.kernel.org

[-- Attachment #1: Type: text/plain, Size: 5805 bytes --]

On 05/16/2014 10:05 PM, Ewan Milne wrote:
> On Fri, 2014-05-16 at 19:02 +0000, Elliott, Robert (Server Storage)
> wrote:
>> There is an issue with a command timeout followed by a failed
>> abort in the linux SCSI stack.
>
> This might explain some odd crashes I've seen, where it looks like
> a command might have completed *long* after it should have timed out.
> I have a few questions, see below:
>
>>
>> After triggering a timeout on a command like:
>> [ 5454.196861] sd 2:0:0:1: [sds] Done: TIMEOUT
>> [ 5454.196863] sd 2:0:0:1: [sds] Result: hostbyte=DID_OK driverbyte=DRIVER_OK
>> [ 5454.196866] sd 2:0:0:1: [sds] CDB: Mode Sense(10): 5a 00 03 00 00 00 00 00 04 00
>>
>> scsi_times_out() invokes scsi_abort_command():
>> [ 5454.196880] sd 2:0:0:1: [sds] scmd ffff880428963a70 abort scheduled
>>
>> and scmd_eh_abort_handler() tries to abort the command:
>> [ 5454.206828] sd 2:0:0:1: [sds] aborting command ffff880428963a70
>>
>> If the abort fails (with return value FAILED (0x2003 == 8195)):
>> [ 5454.206832] sd 2:0:0:1: [sds] scmd ffff880428963a70 abort failed, rtn 8195
>>
>> then scmd_eh_abort_handler() just gives up and expects the error
>> handler thread to deal with the problem.
>>
>> When that thread (scsi_error_handler()) wakes up later on, it finds
>> this command (and others) outstanding:
>> [ 5454.373581] scsi_eh_2: waking up 0/3/3
>> [ 5454.375037] sd 2:0:0:1: scsi_eh_prt_fail_stats: cmds failed: 1, cancel: 0
>> [ 5454.377332] sd 2:0:0:11: scsi_eh_prt_fail_stats: cmds failed: 2, cancel: 0
>> [ 5454.379779] Total of 3 commands on 2 devices require eh work
>>
>> For each command, it starts with this check:
>>
>> #define SCSI_SENSE_VALID(scmd) \
>>          (((scmd)->sense_buffer[0] & 0x70) == 0x70)
>>
>>                  if ((scmd->eh_eflags & SCSI_EH_CANCEL_CMD) ||
>>                      SCSI_SENSE_VALID(scmd))
>>                          continue;
>>
>> In this case, that if statement fails.  The eflags bit is not
>> set, and the sense data buffer still contains zeros or garbage -
>> the command is still outstanding, so the buffer might be written
>> at any time.
>>
>> (the sense buffer shouldn't be read unless a valid bit says
>> it's filled in, and this lacks support for descriptor format
>> sense data (type 0x72), but those are side issues)
>
> Doesn't the check for:   (byte[0] & 0x70) == 0x70   cover 0x70 - 0x73?
>
>>
>> Strangely, the error handler code (scsi_unjam_host()) proceeds
>> to send a REQUEST SENSE command and sees the resulting sense
>> key of NO SENSE:
>>
>> [ 5454.381659] sd 2:0:0:1: [sds] scsi_eh_2: requesting sense
>> [ 5454.383597] scsi_eh_done scmd: ffff880428963a70 result: 0
>> [ 5454.385457] sd 2:0:0:1: [sds] Done: UNKNOWN
>> [ 5454.387430] sd 2:0:0:1: [sds] Result: hostbyte=DID_OK driverbyte=DRIVER_OK
>> [ 5454.390450] sd 2:0:0:1: [sds] CDB: Request Sense: 03 00 00 00 60 00
>> [ 5454.393497] scsi_send_eh_cmnd: scmd: ffff880428963a70, timeleft: 9998
>> [ 5454.395667] scsi_send_eh_cmnd: scsi_eh_completed_normally 2002
>> [ 5454.397842] sense requested for ffff880428963a70 result 0
>> [ 5454.399675] sd 2:0:0:1: [sds] Sense Key : No Sense [current]
>> [ 5454.402570] sd 2:0:0:1: [sds] Add. Sense: No additional sense information
>
> So, a command timed out, the abort didn't succeed, but a
> REQUEST SENSE completed normally?
>
> What kernel was this?  Did it have the change to issue the abort
> in the timeout handler rather than the EH thread?  It seems like
> it does, based on your description above.  However, I'm wondering
> because I have seen crashes on kernels both with and without that
> change.
>
>>
>> The bogus "UNKNOWN" print is being fixed by Hannes' logging
>> patch. It just means the REQUEST SENSE command was submitted
>> successfully.
>>
>> This device uses autosense, so REQUEST SENSE is not a valid way
>> to find out any information for the timed out command. There is
>> no contingent allegiance condition stalling the queue until
>> REQUEST SENSE comes along to collect the sense data - that
>> parallel SCSI concept went obsolete in SAM-3 revision 5 in
>> January 2003.
>>
>> The command is still outstanding; data transfers might still occur,
>> and a completion using its tag could still appear at any time.
>> However, the error handler declares that the command is done,
>> so all the buffers are freed and the tag is reused.
>>
>> The SCSI error handler needs to escalate this to a reset that
>> ensures that the command is no longer outstanding: ABORT
>> TASK (which already didn't work), ABORT TASK SET, LOGICAL
>> UNIT RESET, I_T NEXUS RESET, or hard reset.
>
> What is supposed to happen is that the EH will escalate and
> eventually reset the HBA if all else fails.  It definitely
> should not be returning the scmd if the LLD is still using it.
>
Well, the 'REQUEST SENSE' command has two problems here:
a) Most modern HBAs (i.e. all non-SPI HBAs) use autosense, i.e. the
sense code is returned with the command. So issuing 'REQUEST SENSE'
here is pointless.
b) The sense code (when retrieved via 'REQUEST SENSE') relates to
the most recently processed command (from the target perspective).
That is hard to pin down, as by the time SCSI EH starts
several other commands might have been processed already, so any
sense we'd be retrieving most likely does not relate to the failed
command.

I would propose to disable the 'REQUEST SENSE' step whenever the
HBA is capable of autosensing. This requires us to add another flag
to the scsi_host structure.

What about the attached patch? That should roughly do what's 
required here, right?

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		      zSeries & Storage
hare@suse.de			      +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)

[-- Attachment #2: scsi-host-autosense.patch --]
[-- Type: text/x-patch, Size: 2924 bytes --]

From 585c989a6534fba358de9783c8a410e10d31812b Mon Sep 17 00:00:00 2001
From: Hannes Reinecke <hare@suse.de>
Date: Mon, 19 May 2014 10:26:57 +0200
Subject: [PATCH] Add 'autosense' flag to scsi_host structure

Some HBAs support autosense, so we should skip the 'REQUEST SENSE'
step during SCSI EH.
This patch adds an 'autosense' flag to the scsi_host structure
and enables it for FC, iSCSI, and SAS HBAs.
Other HBAs should enable it once we figure out whether they
support autosense.

Signed-off-by: Hannes Reinecke <hare@suse.de>

diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
index f17aa7a..db0abed 100644
--- a/drivers/scsi/scsi_error.c
+++ b/drivers/scsi/scsi_error.c
@@ -1160,6 +1160,12 @@ int scsi_eh_get_sense(struct list_head *work_q,
 					     __func__));
 			break;
 		}
+		if (shost->autosense)
+			/*
+			 * don't request sense if the HBA supports autosense
+			 */
+			continue;
+
 		if (status_byte(scmd->result) != CHECK_CONDITION)
 			/*
 			 * don't request sense if there's no check condition
diff --git a/drivers/scsi/scsi_transport_fc.c b/drivers/scsi/scsi_transport_fc.c
index f80908f..2e7b4be 100644
--- a/drivers/scsi/scsi_transport_fc.c
+++ b/drivers/scsi/scsi_transport_fc.c
@@ -452,6 +452,8 @@ static int fc_host_setup(struct transport_container *tc, struct device *dev,
 		return -ENOMEM;
 	}
 
+	shost->autosense = true;
+
 	fc_bsg_hostadd(shost, fc_host);
 	/* ignore any bsg add error - we just can't do sgio */
 
diff --git a/drivers/scsi/scsi_transport_iscsi.c b/drivers/scsi/scsi_transport_iscsi.c
index 0102a2d..917f474 100644
--- a/drivers/scsi/scsi_transport_iscsi.c
+++ b/drivers/scsi/scsi_transport_iscsi.c
@@ -1568,7 +1568,7 @@ static int iscsi_setup_host(struct transport_container *tc, struct device *dev,
 	memset(ihost, 0, sizeof(*ihost));
 	atomic_set(&ihost->nr_scans, 0);
 	mutex_init(&ihost->mutex);
-
+	shost->autosense = true;
 	iscsi_bsg_host_add(shost, ihost);
 	/* ignore any bsg add error - we just can't do sgio */
 
diff --git a/drivers/scsi/scsi_transport_sas.c b/drivers/scsi/scsi_transport_sas.c
index 1b68142..942f8e1 100644
--- a/drivers/scsi/scsi_transport_sas.c
+++ b/drivers/scsi/scsi_transport_sas.c
@@ -288,6 +288,7 @@ static int sas_host_setup(struct transport_container *tc, struct device *dev,
 	sas_host->next_target_id = 0;
 	sas_host->next_expander_id = 0;
 	sas_host->next_port_id = 0;
+	shost->autosense = true;
 
 	if (sas_bsg_initialize(shost, NULL))
 		dev_printk(KERN_ERR, dev, "fail to a bsg device %d\n",
diff --git a/include/scsi/scsi_host.h b/include/scsi/scsi_host.h
index 94844fc..d2019e2 100644
--- a/include/scsi/scsi_host.h
+++ b/include/scsi/scsi_host.h
@@ -485,6 +485,11 @@ struct scsi_host_template {
 	unsigned no_async_abort:1;
 
 	/*
+	 * True if the HBA supports autosense
+	 */
+	unsigned autosense:1;
+
+	/*
 	 * Countdown for host blocking with no commands outstanding.
 	 */
 	unsigned int max_host_blocked;


* Re: scsi error handling thread and REQUEST SENSE
  2014-05-19  8:32   ` Hannes Reinecke
@ 2014-05-19 10:29     ` Bart Van Assche
  2014-05-19 10:37       ` Hannes Reinecke
  2014-05-19 13:41     ` James Bottomley
  1 sibling, 1 reply; 9+ messages in thread
From: Bart Van Assche @ 2014-05-19 10:29 UTC (permalink / raw)
  To: Hannes Reinecke, emilne, Elliott, Robert (Server Storage)
  Cc: James Bottomley (jbottomley@parallels.com), Christoph Hellwig,
	scameron@beardog.cce.hp.com, linux-scsi@vger.kernel.org

On 05/19/14 10:32, Hannes Reinecke wrote:
> Well, the 'REQUEST SENSE' command has two problems here:
> a) Most modern HBAs (i.e. all non-SPI HBAs) use autosense, i.e. the sense
> code is returned with the command. So issuing 'REQUEST SENSE' here is
> pointless.
> b) The sense code (when retrieved via 'REQUEST SENSE') relates to the
> most recently processed command (from the target perspective).
> That is hard to pin down, as by the time SCSI EH starts
> several other commands might have been processed already, so any
> sense we'd be retrieving most likely does not relate to the failed command.
> 
> I would propose to disable the 'REQUEST SENSE' step whenever the HBA
> is capable of autosensing. This requires us to add another flag
> to the scsi_host structure.
> 
> What about the attached patch? That should roughly do what's required
> here, right?

This patch does not address the SRP initiator. There might be more SCSI
initiator drivers that support autosense but that are not addressed by
this patch. Has it been considered to set the autosense flag for all
HBAs and clear it in those SCSI initiator drivers that do not support
autosense?
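
A minimal model of the default-on alternative might look like the following. It is only a sketch: struct host_model and the three functions are hypothetical names for this example, and the decision in eh_should_request_sense() mirrors the order of checks in the posted hunk (autosense first, then CHECK CONDITION).

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the host structure, carrying only the proposed flag. */
struct host_model {
	unsigned autosense:1;
};

/* Default-on: every host starts out with autosense assumed. */
static void host_alloc_default_on(struct host_model *h)
{
	assert(h != NULL);
	h->autosense = 1;
}

/* A driver without autosense (e.g. an SPI HBA) clears the flag at probe time. */
static void spi_like_driver_probe(struct host_model *h)
{
	assert(h != NULL);
	h->autosense = 0;
}

/* REQUEST SENSE is only worth sending without autosense and after CHECK CONDITION. */
static int eh_should_request_sense(const struct host_model *h, int check_condition)
{
	assert(h != NULL);
	return !h->autosense && check_condition;
}
```

The trade-off is where the audit burden lands: with default-on, a driver that lacks autosense but is never updated silently loses its REQUEST SENSE step, while with the opt-in patch a driver that has autosense but is never updated just keeps the current behaviour.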

Bart.



* Re: scsi error handling thread and REQUEST SENSE
  2014-05-19 10:29     ` Bart Van Assche
@ 2014-05-19 10:37       ` Hannes Reinecke
  2014-05-19 11:26         ` Bart Van Assche
  0 siblings, 1 reply; 9+ messages in thread
From: Hannes Reinecke @ 2014-05-19 10:37 UTC (permalink / raw)
  To: Bart Van Assche, emilne, Elliott, Robert (Server Storage)
  Cc: James Bottomley (jbottomley@parallels.com), Christoph Hellwig,
	scameron@beardog.cce.hp.com, linux-scsi@vger.kernel.org

On 05/19/2014 12:29 PM, Bart Van Assche wrote:
> On 05/19/14 10:32, Hannes Reinecke wrote:
>> Well, the 'REQUEST SENSE' command has two problems here:
>> a) Most modern HBAs (i.e. all non-SPI HBAs) use autosense, i.e. the sense
>> code is returned with the command. So issuing 'REQUEST SENSE' here is
>> pointless.
>> b) The sense code (when retrieved via 'REQUEST SENSE') relates to the
>> most recently processed command (from the target perspective).
>> That is hard to pin down, as by the time SCSI EH starts
>> several other commands might have been processed already, so any
>> sense we'd be retrieving most likely does not relate to the failed command.
>>
>> I would propose to disable the 'REQUEST SENSE' step whenever the HBA
>> is capable of autosensing. This requires us to add another flag
>> to the scsi_host structure.
>>
>> What about the attached patch? That should roughly do what's required
>> here, right?
>
> This patch does not address the SRP initiator. There might be more SCSI
> initiator drivers that support autosense but that are not addressed by
> this patch. Has it been considered to set the autosense flag for all
> HBAs and clear it in those SCSI initiator drivers that do not support
> autosense?
>
Correct.
I haven't looked at the SRP spec to figure out if it does autosense
or not. If it does, we should update the patch.

And yes, it has been considered.
However, I decided against it because it would require checking
each and every driver to figure out whether it does autosense or not.
So as a first test I decided it would be quicker to take a 'whitelist'
approach and enable only those for which autosense is
required by protocol.

This also excludes all RAID HBAs, which need to be checked 
individually. And the SPI HBAs, of course.

Plus this is just a test patch, nothing official yet.
I'm happy to include the changes for SRP if you can confirm that SRP 
requires autosense.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		      zSeries & Storage
hare@suse.de			      +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)


* Re: scsi error handling thread and REQUEST SENSE
  2014-05-19 10:37       ` Hannes Reinecke
@ 2014-05-19 11:26         ` Bart Van Assche
  0 siblings, 0 replies; 9+ messages in thread
From: Bart Van Assche @ 2014-05-19 11:26 UTC (permalink / raw)
  To: Hannes Reinecke, Bart Van Assche, emilne,
	Elliott, Robert (Server Storage)
  Cc: James Bottomley (jbottomley@parallels.com), Christoph Hellwig,
	scameron@beardog.cce.hp.com, linux-scsi@vger.kernel.org

On 05/19/14 12:37, Hannes Reinecke wrote:
> Plus this is just a test patch, nothing official yet.
> I'm happy to include the changes for SRP if you can confirm that SRP
> requires autosense.

Hello Hannes,

Since the SRP protocol supports returning sense data in the SRP response
message and since every SRP target system I'm familiar with supports
command queueing I think it is safe to enable the autosense feature in
the SRP initiator driver.

Bart.


* Re: scsi error handling thread and REQUEST SENSE
  2014-05-16 19:02 scsi error handling thread and REQUEST SENSE Elliott, Robert (Server Storage)
  2014-05-16 20:05 ` Ewan Milne
@ 2014-05-19 11:40 ` Bart Van Assche
  1 sibling, 0 replies; 9+ messages in thread
From: Bart Van Assche @ 2014-05-19 11:40 UTC (permalink / raw)
  To: Elliott, Robert (Server Storage),
	James Bottomley (jbottomley@parallels.com), Hannes Reinecke,
	Christoph Hellwig, scameron@beardog.cce.hp.com,
	linux-scsi@vger.kernel.org

On 05/16/14 21:02, Elliott, Robert (Server Storage) wrote:
> The command is still outstanding; data transfers might still occur, 
> and a completion using its tag could still appear at any time. 
> However, the error handler declares that the command is done, 
> so all the buffers are freed and the tag is reused.
> 
> The SCSI error handler needs to escalate this to a reset that 
> ensures that the command is no longer outstanding: ABORT
> TASK (which already didn't work), ABORT TASK SET, LOGICAL 
> UNIT RESET, I_T NEXUS RESET, or hard reset.

If my interpretation of the SCSI mid-layer source code is correct then
even with the patch "improved eh timeout handler" applied the SCSI
mid-layer still guarantees for each SCSI host that at most one
eh_abort_handler() call is active at any given time (since tmf_work_q is
created with max_active = 1) and also that at least one of the eh_*
functions is invoked before the SCSI mid-layer finishes a command. Does
your comment mean that you have found a scenario in which none of the
LLD eh_* callback functions was invoked before the SCSI mid-layer
finished a SCSI command ?

Thanks,

Bart.


* Re: scsi error handling thread and REQUEST SENSE
  2014-05-19  8:32   ` Hannes Reinecke
  2014-05-19 10:29     ` Bart Van Assche
@ 2014-05-19 13:41     ` James Bottomley
  2014-05-19 15:15       ` Elliott, Robert (Server Storage)
  1 sibling, 1 reply; 9+ messages in thread
From: James Bottomley @ 2014-05-19 13:41 UTC (permalink / raw)
  To: hare@suse.de
  Cc: linux-scsi@vger.kernel.org, emilne@redhat.com, Elliott@hp.com,
	scameron@beardog.cce.hp.com, hch@infradead.org


On Mon, 2014-05-19 at 10:32 +0200, Hannes Reinecke wrote:
> On 05/16/2014 10:05 PM, Ewan Milne wrote:
> > On Fri, 2014-05-16 at 19:02 +0000, Elliott, Robert (Server Storage)
> > wrote:
> >> There is an issue with a command timeout followed by a failed
> >> abort in the linux SCSI stack.
> >
> > > This might explain some odd crashes I've seen, where it looks like
> > a command might have completed *long* after it should have timed out.
> > I have a few questions, see below:
> >
> >>
> >> After triggering a timeout on a command like:
> >> [ 5454.196861] sd 2:0:0:1: [sds] Done: TIMEOUT
> >> [ 5454.196863] sd 2:0:0:1: [sds] Result: hostbyte=DID_OK driverbyte=DRIVER_OK
> >> [ 5454.196866] sd 2:0:0:1: [sds] CDB: Mode Sense(10): 5a 00 03 00 00 00 00 00 04 00
> >>
> >> scsi_times_out() invokes scsi_abort_command():
> >> [ 5454.196880] sd 2:0:0:1: [sds] scmd ffff880428963a70 abort scheduled
> >>
> >> and scmd_eh_abort_handler() tries to abort the command:
> >> [ 5454.206828] sd 2:0:0:1: [sds] aborting command ffff880428963a70
> >>
> >> If the abort fails (with return value FAILED (0x2003 == 8195)):
> >> [ 5454.206832] sd 2:0:0:1: [sds] scmd ffff880428963a70 abort failed, rtn 8195
> >>
> >> then scmd_eh_abort_handler() just gives up and expects the error
> >> handler thread to deal with the problem.
> >>
> >> When that thread (scsi_error_handler()) wakes up later on, it finds
> >> this command (and others) outstanding:
> >> [ 5454.373581] scsi_eh_2: waking up 0/3/3
> >> [ 5454.375037] sd 2:0:0:1: scsi_eh_prt_fail_stats: cmds failed: 1, cancel: 0
> >> [ 5454.377332] sd 2:0:0:11: scsi_eh_prt_fail_stats: cmds failed: 2, cancel: 0
> >> [ 5454.379779] Total of 3 commands on 2 devices require eh work
> >>
> >> For each command, it starts with this check:
> >>
> >> #define SCSI_SENSE_VALID(scmd) \
> >>          (((scmd)->sense_buffer[0] & 0x70) == 0x70)
> >>
> >>                  if ((scmd->eh_eflags & SCSI_EH_CANCEL_CMD) ||
> >>                      SCSI_SENSE_VALID(scmd))
> >>                          continue;
> >>
> >> In this case, that if statement fails.  The eflags bit is not
> >> set, and the sense data buffer still contains zeros or garbage -
> >> the command is still outstanding, so the buffer might be written
> >> at any time.
> >>
> >> (the sense buffer shouldn't be read unless a valid bit says
> >> it's filled in, and this lacks support for descriptor format
> >> sense data (type 0x72), but those are side issues)
> >
> > Doesn't the check for:   (byte[0] & 0x70) == 0x70   cover 0x70 - 0x73?
> >
> >>
> >> Strangely, the error handler code (scsi_unjam_host()) proceeds
> >> to send a REQUEST SENSE command and sees the resulting sense
> >> key of NO SENSE:
> >>
> >> [ 5454.381659] sd 2:0:0:1: [sds] scsi_eh_2: requesting sense
> >> [ 5454.383597] scsi_eh_done scmd: ffff880428963a70 result: 0
> >> [ 5454.385457] sd 2:0:0:1: [sds] Done: UNKNOWN
> >> [ 5454.387430] sd 2:0:0:1: [sds] Result: hostbyte=DID_OK driverbyte=DRIVER_OK
> >> [ 5454.390450] sd 2:0:0:1: [sds] CDB: Request Sense: 03 00 00 00 60 00
> >> [ 5454.393497] scsi_send_eh_cmnd: scmd: ffff880428963a70, timeleft: 9998
> >> [ 5454.395667] scsi_send_eh_cmnd: scsi_eh_completed_normally 2002
> >> [ 5454.397842] sense requested for ffff880428963a70 result 0
> >> [ 5454.399675] sd 2:0:0:1: [sds] Sense Key : No Sense [current]
> >> [ 5454.402570] sd 2:0:0:1: [sds] Add. Sense: No additional sense information
> >
> > So, a command timed out, the abort didn't succeed, but a
> > REQUEST SENSE completed normally?
> >
> > What kernel was this?  Did it have the change to issue the abort
> > in the timeout handler rather than the EH thread?  It seems like
> > it does, based on your description above.  However, I'm wondering
> > because I have seen crashes on kernels both with and without that
> > change.
> >
> >>
> >> The bogus "UNKNOWN" print is being fixed by Hannes' logging
> >> patch. It just means the REQUEST SENSE command was submitted
> >> successfully.
> >>
> >> This device uses autosense, so REQUEST SENSE is not a valid way
> >> to find out any information for the timed out command. There is
> >> no contingent allegiance condition stalling the queue until
> >> REQUEST SENSE comes along to collect the sense data - that
> >> parallel SCSI concept went obsolete in SAM-3 revision 5 in
> >> January 2003.
> >>
> >> The command is still outstanding; data transfers might still occur,
> >> and a completion using its tag could still appear at any time.
> >> However, the error handler declares that the command is done,
> >> so all the buffers are freed and the tag is reused.
> >>
> >> The SCSI error handler needs to escalate this to a reset that
> >> ensures that the command is no longer outstanding: ABORT
> >> TASK (which already didn't work), ABORT TASK SET, LOGICAL
> >> UNIT RESET, I_T NEXUS RESET, or hard reset.
> >
> > What is supposed to happen is that the EH will escalate and
> > eventually reset the HBA if all else fails.  It definitely
> > should not be returning the scmd if the LLD is still using it.
> >
> Well, problem here is that the 'REQUEST SENSE' command has two problems:
> a) Most modern HBA (ie all non-SPI HBAs) use autosense, ie the sense 
> code is returned with the command. So issuing 'REQUEST SENSE' here 
> is pointless.
> b) The sense code (when retrieved via 'REQUEST SENSE') relates to 
> the most recently processed command (from the target perspective).
> Which is a bit hard to make out, as by the time SCSI EH starts
> several other commands might have been processed already, so any
> sense we'd be retrieving most likely does not relate to the failed 
> command.
> 
> I would propose to disable the 'REQUEST_SENSE' step as soon as the 
> HBA is capable of autosensing. This requires us to add another flag
> to the scsi_host field.
> 
> What about the attached patch? That should roughly do what's 
> required here, right?

This patch shouldn't be necessary at all.  A driver with autosense
returning check condition should already have collected the sense, so we
should succeed in the first if condition

		if ((scmd->eh_eflags & SCSI_EH_CANCEL_CMD) ||
		    SCSI_SENSE_VALID(scmd))
			continue;

If we drop through, the return code shouldn't be CHECK_CONDITION, so it
should get kicked out here:

		if (status_byte(scmd->result) != CHECK_CONDITION)
			/*
			 * don't request sense if there's no check condition
			 * status because the error we're processing isn't one
			 * that has a sense code (and some devices get
			 * confused by sense requests out of the blue)
			 */
			continue;

However, that last bit is a recent introduction:


commit d555a2abf3481f81303d835046a5ec2c4fb3ca8e
Author: James Bottomley <JBottomley@Parallels.com>
Date:   Fri Mar 28 10:50:17 2014 -0700

    [SCSI] Fix spurious request sense in error handling

So if the problem occurred before that patch, it may be fixed by it.

James



* RE: scsi error handling thread and REQUEST SENSE
  2014-05-19 13:41     ` James Bottomley
@ 2014-05-19 15:15       ` Elliott, Robert (Server Storage)
  0 siblings, 0 replies; 9+ messages in thread
From: Elliott, Robert (Server Storage) @ 2014-05-19 15:15 UTC (permalink / raw)
  To: James Bottomley, hare@suse.de
  Cc: linux-scsi@vger.kernel.org, emilne@redhat.com,
	scameron@beardog.cce.hp.com, hch@infradead.org



> -----Original Message-----
> From: James Bottomley [mailto:jbottomley@parallels.com]
> Sent: Monday, 19 May, 2014 8:42 AM
> To: hare@suse.de
> Cc: linux-scsi@vger.kernel.org; emilne@redhat.com; Elliott, Robert (Server
> Storage); scameron@beardog.cce.hp.com; hch@infradead.org
> Subject: Re: scsi error handling thread and REQUEST SENSE
> 
> 
> On Mon, 2014-05-19 at 10:32 +0200, Hannes Reinecke wrote:
> > On 05/16/2014 10:05 PM, Ewan Milne wrote:
> > > On Fri, 2014-05-16 at 19:02 +0000, Elliott, Robert (Server Storage)
> > > wrote:
> > >> There is an issue with a command timeout followed by a failed
> > >> abort in the linux SCSI stack.
> > >
> > > This might explain some  odd crashes I've seen, where it looks like
> > > a command might have completed *long* after it should have timed out.
> > > I have a few questions, see below:
> > >
> > >>
> > >> After triggering a timeout on a command like:
> > >> [ 5454.196861] sd 2:0:0:1: [sds] Done: TIMEOUT
> > >> [ 5454.196863] sd 2:0:0:1: [sds] Result: hostbyte=DID_OK
> driverbyte=DRIVER_OK
> > >> [ 5454.196866] sd 2:0:0:1: [sds] CDB: Mode Sense(10): 5a 00 03 00 00 00
> 00 00 04 00
> > >>
> > >> scsi_times_out() invokes scsi_abort_command():
> > >> [ 5454.196880] sd 2:0:0:1: [sds] scmd ffff880428963a70 abort scheduled
> > >>
> > >> and scmd_eh_abort_handler() tries to abort the command:
> > >> [ 5454.206828] sd 2:0:0:1: [sds] aborting command ffff880428963a70
> > >>
> > >> If the abort fails (with return value FAILED (0x2003 == 8195)):
> > >> [ 5454.206832] sd 2:0:0:1: [sds] scmd ffff880428963a70 abort failed, rtn
> 8195
> > >>
> > >> then scmd_eh_abort_handler() just gives up and expects the error
> > >> handler thread to deal with the problem.
> > >>
> > >> When that thread (scsi_error_handler()) wakes up later on, it finds
> > >> this command (and others) outstanding:
> > >> [ 5454.373581] scsi_eh_2: waking up 0/3/3
> > >> [ 5454.375037] sd 2:0:0:1: scsi_eh_prt_fail_stats: cmds failed: 1,
> cancel: 0
> > >> [ 5454.377332] sd 2:0:0:11: scsi_eh_prt_fail_stats: cmds failed: 2,
> cancel: 0
> > >> [ 5454.379779] Total of 3 commands on 2 devices require eh work
> > >>
> > >> For each command, it starts with this check:
> > >>
> > >> #define SCSI_SENSE_VALID(scmd) \
> > >>          (((scmd)->sense_buffer[0] & 0x70) == 0x70)
> > >>
> > >>                  if ((scmd->eh_eflags & SCSI_EH_CANCEL_CMD) ||
> > >>                      SCSI_SENSE_VALID(scmd))
> > >>                          continue;
> > >>
> > >> In this case, that if statement fails.  The eflags bit is not
> > >> set, and the sense data buffer still contains zeros or garbage -
> > >> the command is still outstanding, so the buffer might be written
> > >> at any time.
> > >>
> > >> (the sense buffer shouldn't be read unless a valid bit says
> > >> it's filled in, and this lacks support for descriptor format
> > >> sense data (type 0x72), but those are side issues)
> > >
> > > Doesn't the check for:   (byte[0] & 0x70) == 0x70   cover 0x70 - 0x73?

Yes, I didn't read that closely enough.  That means it also covers
0x74 to 0x7F, though, which are "reserved" and should not be 
interpreted as having any particular meaning.
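
As a rough illustration of the mask question (a standalone sketch, not
kernel code), the helpers below apply the quoted test to a bare byte.
sense_valid_byte() mirrors the SCSI_SENSE_VALID() comparison;
sense_response_code_known() is a hypothetical stricter check that accepts
only response codes 0x70 through 0x73 after masking off bit 7.  Both
function names are made up for this example.

```c
/* Mirrors the quoted macro: true when bits 4-6 of byte 0 are all set.
 * Bit 7 and the low nibble are ignored, so 0x70-0x7F (and 0xF0-0xFF) pass. */
static int sense_valid_byte(unsigned char b)
{
	return (b & 0x70) == 0x70;
}

/* Hypothetical stricter variant (an assumption, not kernel code):
 * look at the 7-bit response code and accept only 0x70-0x73. */
static int sense_response_code_known(unsigned char b)
{
	unsigned char code = b & 0x7f;

	return code >= 0x70 && code <= 0x73;
}
```

With these, a byte of 0x74 passes sense_valid_byte() but fails
sense_response_code_known(), while a zeroed buffer fails both.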

> > >
> > >>
> > >> Strangely, the error handler code (scsi_unjam_host()) proceeds
> > >> to send a REQUEST SENSE command and sees the resulting sense
> > >> key of NO SENSE:
> > >>
> > >> [ 5454.381659] sd 2:0:0:1: [sds] scsi_eh_2: requesting sense
> > >> [ 5454.383597] scsi_eh_done scmd: ffff880428963a70 result: 0
> > >> [ 5454.385457] sd 2:0:0:1: [sds] Done: UNKNOWN
> > >> [ 5454.387430] sd 2:0:0:1: [sds] Result: hostbyte=DID_OK
> driverbyte=DRIVER_OK
> > >> [ 5454.390450] sd 2:0:0:1: [sds] CDB: Request Sense: 03 00 00 00 60 00
> > >> [ 5454.393497] scsi_send_eh_cmnd: scmd: ffff880428963a70, timeleft: 9998
> > >> [ 5454.395667] scsi_send_eh_cmnd: scsi_eh_completed_normally 2002
> > >> [ 5454.397842] sense requested for ffff880428963a70 result 0
> > >> [ 5454.399675] sd 2:0:0:1: [sds] Sense Key : No Sense [current]
> > >> [ 5454.402570] sd 2:0:0:1: [sds] Add. Sense: No additional sense
> information
> > >
> > > So, a command timed out, the abort didn't succeed, but a
> > > REQUEST SENSE completed normally?
> > >
> > > What kernel was this?  Did it have the change to issue the abort
> > > in the timeout handler rather than the EH thread?  It seems like
> > > it does, based on your description above.  However, I'm wondering
> > > because I have seen crashes on kernels both with and without that
> > > change.

This is with Christoph's scsi-mq-wip.7, which is based on Jens'
block/for-next, which is based on 3.15-rc1.  In that version,
scsi_times_out() does this:

        if (host->transportt->eh_timed_out)
                rtn = host->transportt->eh_timed_out(scmd);
        else if (host->hostt->eh_timed_out)
                rtn = host->hostt->eh_timed_out(scmd);

        if (rtn == BLK_EH_NOT_HANDLED && !host->hostt->no_async_abort)
                if (scsi_abort_command(scmd) == SUCCESS)
                        return BLK_EH_NOT_HANDLED;

In this case, there are no transport or LLD eh_timed_out()
functions; scsi_abort_command() is called and returns FAILED.

> > >>
> > >> The bogus "UNKNOWN" print is being fixed by Hannes' logging
> > >> patch. It just means the REQUEST SENSE command was submitted
> > >> successfully.
> > >>
> > >> This device uses autosense, so REQUEST SENSE is not a valid way
> > >> to find out any information for the timed out command. There is
> > >> no contingent allegiance condition stalling the queue until
> > >> REQUEST SENSE comes along to collect the sense data - that
> > >> parallel SCSI concept went obsolete in SAM-3 revision 5 in
> > >> January 2003.
> > >>
> > >> The command is still outstanding; data transfers might still occur,
> > >> and a completion using its tag could still appear at any time.
> > >> However, the error handler declares that the command is done,
> > >> so all the buffers are freed and the tag is reused.
> > >>
> > >> The SCSI error handler needs to escalate this to a reset that
> > >> ensures that the command is no longer outstanding: ABORT
> > >> TASK (which already didn't work), ABORT TASK SET, LOGICAL
> > >> UNIT RESET, I_T NEXUS RESET, or hard reset.
> > >
> > > What is supposed to happen is that the EH will escalate and
> > > eventually reset the HBA if all else fails.  It definitely
> > > should not be returning the scmd if the LLD is still using it.
> > >
> > Well, problem here is that the 'REQUEST SENSE' command has two problems:
> > a) Most modern HBA (ie all non-SPI HBAs) use autosense, ie the sense
> > code is returned with the command. So issuing 'REQUEST SENSE' here
> > is pointless.
> > b) The sense code (when retrieved via 'REQUEST SENSE') relates to
> > the most recently processed command (from the target perspective).
> > Which is a bit hard to make out, as by the time SCSI EH starts
> > several other commands might have been processed already, so any
> > sense we'd be retrieving most likely does not relate to the failed
> > command.

Another problem pointed out is that scsi_send_eh_cmnd(), which 
is used to send this REQUEST SENSE, hijacks the SCSI command
structure.  That's not safe while the original command is 
still outstanding and might complete at any time; the status
for the original command will be confused with the status for
the REQUEST SENSE, and overlapping tags might confuse the
controller or device behind the controller (depending on whether
the block layer tag is used or the LLD generates its own).
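
A toy model of the tag hazard, purely for illustration: nothing here is
kernel code, and every toy_* name is hypothetical.  It assumes a
completion path that looks commands up by tag alone, and shows that once
a tag is rebound while the first command is still outstanding, a late
completion lands on whichever command owns the tag now.

```c
#include <stddef.h>

#define TOY_NUM_TAGS 4

struct toy_cmd {
	const char *name;	/* label for illustration only */
	int result;		/* last status delivered to this command */
	int completions;	/* how many completions landed here */
};

struct toy_host {
	struct toy_cmd *owner[TOY_NUM_TAGS];	/* tag -> current owner */
};

/* Bind a command to a tag, silently replacing any previous owner. */
static void toy_issue(struct toy_host *h, int tag, struct toy_cmd *cmd)
{
	h->owner[tag] = cmd;
}

/* Deliver a completion by tag alone; returns the command it landed on. */
static struct toy_cmd *toy_complete(struct toy_host *h, int tag, int result)
{
	struct toy_cmd *cmd = h->owner[tag];

	if (cmd) {
		cmd->result = result;
		cmd->completions++;
	}
	return cmd;
}
```

If a MODE SENSE is issued on tag 2 and a REQUEST SENSE is then issued on
the same tag before the first one finishes, a later toy_complete() for
tag 2 is credited to the REQUEST SENSE no matter which command the
device was actually answering.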


> > I would propose to disable the 'REQUEST_SENSE' step as soon as the
> > HBA is capable of autosensing. This requires us to add another flag
> > to the scsi_host field.
> >
> > What about the attached patch? That should roughly do what's
> > required here, right?

In the timeout case, there is no sense data for the command yet - 
the sense buffer is still wide-open to be written by the controller, 
and could contain garbage.  This patch would treat the sense buffer
as valid too soon.

I do think a scsi_host flag indicating the LLD wants to opt out 
of this antiquated error handling code will be part of the solution.

> This patch shouldn't be necessary at all.  A driver with autosense
> returning check condition should already have collected the sense, so we
> should succeed in the first if condition
> 
> 		if ((scmd->eh_eflags & SCSI_EH_CANCEL_CMD) ||
> 		    SCSI_SENSE_VALID(scmd))
> 			continue;

In this case, the command has timed out but the abort has
failed, so there is no sense data yet - the command is still 
pending.  It still might complete at any time.

> If we drop through, the return code shouldn't be CHECK_CONDITION, so it
> should get kicked out here:
> 
> 		if (status_byte(scmd->result) != CHECK_CONDITION)
> 			/*
> 			 * don't request sense if there's no check condition
> 			 * status because the error we're processing isn't one
> 			 * that has a sense code (and some devices get
> 			 * confused by sense requests out of the blue)
> 			 */
> 			continue;
> 
> However, that last bit is a recent introduction:
> 
> 
> commit d555a2abf3481f81303d835046a5ec2c4fb3ca8e
> Author: James Bottomley <JBottomley@Parallels.com>
> Date:   Fri Mar 28 10:50:17 2014 -0700
> 
>     [SCSI] Fix spurious request sense in error handling
> 
> So if the problem occurred before that patch, it may be fixed by it.
> 
> James

This tree does not have that patch.

That should help; the timed-out command should still have its status
byte set to 0x00 (GOOD).  I think that check needs to be qualified by
msg_byte(), host_byte(), or driver_byte(), or whatever indicates
that scmd->result has been filled in, though.  The command could
complete with GOOD status while the error handler is running.
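
To make that concern concrete, here is a minimal standalone sketch of
how the result word is split up.  The TOY_* macros are local copies that
follow the layout of status_byte(), msg_byte(), host_byte(), and
driver_byte() as I understand them from include/scsi/scsi.h;
toy_wants_request_sense() and toy_result_is_ambiguous() are hypothetical
helpers, not kernel functions.

```c
#include <stdint.h>

/* Local re-declarations so this builds outside the kernel tree. */
#define TOY_STATUS_BYTE(r)	(((r) >> 1) & 0x7f)
#define TOY_MSG_BYTE(r)		(((r) >> 8) & 0xff)
#define TOY_HOST_BYTE(r)	(((r) >> 16) & 0xff)
#define TOY_DRIVER_BYTE(r)	(((r) >> 24) & 0xff)

#define TOY_GOOD		0x00
#define TOY_CHECK_CONDITION	0x01	/* SAM status 0x02 shifted right by one */
#define TOY_DID_OK		0x00
#define TOY_DID_TIME_OUT	0x03

/* Same shape as the status check in the quoted patch. */
static int toy_wants_request_sense(uint32_t result)
{
	return TOY_STATUS_BYTE(result) == TOY_CHECK_CONDITION;
}

/* Hypothetical helper: a result of 0 reads as GOOD + DID_OK, which is
 * also what a result word looks like before anything has written it. */
static int toy_result_is_ambiguous(uint32_t result)
{
	return result == 0;
}
```

A result of 0 is skipped by the status check, which is the desired
outcome for a timed-out command, but the same value would also describe
a command that just completed with GOOD status.  A non-zero host byte
such as DID_TIME_OUT is one way the two cases could be told apart.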

With the following change, my test system ran for 31 hours without 
triggering NULL dereferences or hung threads:

@@ -2110,9 +2108,9 @@ static void scsi_unjam_host(struct Scsi_Host *shost)

        SCSI_LOG_ERROR_RECOVERY(1, scsi_eh_prt_fail_stats(shost, &eh_work_q));

-       if (!scsi_eh_get_sense(&eh_work_q, &eh_done_q))
-               if (!scsi_eh_abort_cmds(&eh_work_q, &eh_done_q))
-                       scsi_eh_ready_devs(shost, &eh_work_q, &eh_done_q);
+       if (1) /* (!scsi_eh_get_sense(&eh_work_q, &eh_done_q)) sending REQUEST SENSE is totally bogus */
+               if (1) /* (!scsi_eh_abort_cmds(&eh_work_q, &eh_done_q)) if abort didn't work before, it won't work now */
+                       scsi_eh_ready_devs(shost, &eh_work_q, &eh_done_q);      /* this tries various levels of resets, which are the right thing to do */

        spin_lock_irqsave(shost->host_lock, flags);
        if (shost->eh_deadline != -1)


---
Rob Elliott    HP Server Storage





Thread overview: 9+ messages
2014-05-16 19:02 scsi error handling thread and REQUEST SENSE Elliott, Robert (Server Storage)
2014-05-16 20:05 ` Ewan Milne
2014-05-19  8:32   ` Hannes Reinecke
2014-05-19 10:29     ` Bart Van Assche
2014-05-19 10:37       ` Hannes Reinecke
2014-05-19 11:26         ` Bart Van Assche
2014-05-19 13:41     ` James Bottomley
2014-05-19 15:15       ` Elliott, Robert (Server Storage)
2014-05-19 11:40 ` Bart Van Assche
