public inbox for linux-scsi@vger.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/2] TUR path is down after adapter gets reset in multipath configuration
@ 2014-10-27 18:01 wenxiong
  2014-10-27 18:01 ` [PATCH 1/2] scsi: TUR path is down after adapter gets reset in multipath configuration(scsi_error.c) wenxiong
  2014-10-27 18:01 ` [PATCH 2/2] scsi: TUR path is down after adapter gets reset in multipath configuration(scsi_dh_alus.c) wenxiong
  0 siblings, 2 replies; 7+ messages in thread
From: wenxiong @ 2014-10-27 18:01 UTC (permalink / raw)
  To: James.Bottomley; +Cc: hch, linux-scsi, brking

Based on Christoph's and Brian's patches, these patch fix the TUR path is down after adapter gets reset issue.


Thanks,
Wendy
-- 

^ permalink raw reply	[flat|nested] 7+ messages in thread

* [PATCH 1/2] scsi: TUR path is down after adapter gets reset in multipath configuration(scsi_error.c)
  2014-10-27 18:01 [PATCH 0/2] TUR path is down after adapter gets reset in multipath configuration wenxiong
@ 2014-10-27 18:01 ` wenxiong
  2014-10-27 19:07   ` Elliott, Robert (Server Storage)
  2014-10-28  9:04   ` Christoph Hellwig
  2014-10-27 18:01 ` [PATCH 2/2] scsi: TUR path is down after adapter gets reset in multipath configuration(scsi_dh_alus.c) wenxiong
  1 sibling, 2 replies; 7+ messages in thread
From: wenxiong @ 2014-10-27 18:01 UTC (permalink / raw)
  To: James.Bottomley; +Cc: hch, linux-scsi, brking

[-- Attachment #1: allow_restart1 --]
[-- Type: text/plain, Size: 1928 bytes --]

After an ipr adapter gets reset, all disk array devices require a start
unit command to be issued to them before they will accept commands. So,
with the SCSI EH change, we now end up in a scenario with dual ipr
adapters where the TUR getting issued from the health checker returns
with a Not Ready response and since SCSI EH no longer triggers the Start
Unit in this scenario, the path never recovers.

Signed-off-by: Christoph Hellwig <hch@infradead.org>
Tested-by: Wen Xiong <wenxiong@linux.vnet.ibm.com>
---
 drivers/scsi/scsi_error.c |   16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

Index: b/drivers/scsi/scsi_error.c
===================================================================
--- a/drivers/scsi/scsi_error.c	2014-10-23 12:54:16.000000000 -0500
+++ b/drivers/scsi/scsi_error.c	2014-10-23 12:57:44.642078988 -0500
@@ -459,14 +459,6 @@ static int scsi_check_sense(struct scsi_
 	if (! scsi_command_normalize_sense(scmd, &sshdr))
 		return FAILED;	/* no valid sense data */
 
-	if (scmd->cmnd[0] == TEST_UNIT_READY && scmd->scsi_done != scsi_eh_done)
-		/*
-		 * nasty: for mid-layer issued TURs, we need to return the
-		 * actual sense data without any recovery attempt.  For eh
-		 * issued ones, we need to try to recover and interpret
-		 */
-		return SUCCESS;
-
 	scsi_report_sense(sdev, &sshdr);
 
 	if (scsi_sense_is_deferred(&sshdr))
@@ -482,6 +474,14 @@ static int scsi_check_sense(struct scsi_
 		/* handler does not care. Drop down to default handling */
 	}
 
+	if (scmd->cmnd[0] == TEST_UNIT_READY && scmd->scsi_done != scsi_eh_done)
+		/*
+		 * nasty: for mid-layer issued TURs, we need to return the
+		 * actual sense data without any recovery attempt.  For eh
+		 * issued ones, we need to try to recover and interpret
+		 */
+		return SUCCESS;
+
 	/*
 	 * Previous logic looked for FILEMARK, EOM or ILI which are
 	 * mainly associated with tapes and returned SUCCESS.

-- 

^ permalink raw reply	[flat|nested] 7+ messages in thread

* [PATCH 2/2] scsi: TUR path is down after adapter gets reset in multipath configuration(scsi_dh_alus.c)
  2014-10-27 18:01 [PATCH 0/2] TUR path is down after adapter gets reset in multipath configuration wenxiong
  2014-10-27 18:01 ` [PATCH 1/2] scsi: TUR path is down after adapter gets reset in multipath configuration(scsi_error.c) wenxiong
@ 2014-10-27 18:01 ` wenxiong
  2014-10-27 19:56   ` Elliott, Robert (Server Storage)
  1 sibling, 1 reply; 7+ messages in thread
From: wenxiong @ 2014-10-27 18:01 UTC (permalink / raw)
  To: James.Bottomley; +Cc: hch, linux-scsi, brking

[-- Attachment #1: allow_restart2 --]
[-- Type: text/plain, Size: 1009 bytes --]

This patch also fixes the 02/04/02 K/C/Q check in alua_check_sense
handler.

Signed-off-by: Brian King <brking@linux.vnet.ibm.com>
Teste-by: Wen Xiong <wenxiong@linux.vnet.ibm.com>
---
 drivers/scsi/device_handler/scsi_dh_alua.c |    7 +++++++
 1 file changed, 7 insertions(+)

Index: b/drivers/scsi/device_handler/scsi_dh_alua.c
===================================================================
--- a/drivers/scsi/device_handler/scsi_dh_alua.c	2014-10-23 13:00:45.000000000 -0500
+++ b/drivers/scsi/device_handler/scsi_dh_alua.c	2014-10-23 13:04:16.152079004 -0500
@@ -474,6 +474,13 @@ static int alua_check_sense(struct scsi_
 			 * LUN Not Ready -- Offline
 			 */
 			return SUCCESS;
+		if (sdev->allow_restart &&
+		    (sense_hdr->asc == 0x04) && (sense_hdr->ascq == 0x02))
+			/*
+			 * if the device is not started, we need to wake
+			 * the error handler to start the motor
+			 */
+			return FAILED;
 		break;
 	case UNIT_ATTENTION:
 		if (sense_hdr->asc == 0x29 && sense_hdr->ascq == 0x00)

-- 

^ permalink raw reply	[flat|nested] 7+ messages in thread

* RE: [PATCH 1/2] scsi: TUR path is down after adapter gets reset in multipath configuration(scsi_error.c)
  2014-10-27 18:01 ` [PATCH 1/2] scsi: TUR path is down after adapter gets reset in multipath configuration(scsi_error.c) wenxiong
@ 2014-10-27 19:07   ` Elliott, Robert (Server Storage)
  2014-10-28  9:04   ` Christoph Hellwig
  1 sibling, 0 replies; 7+ messages in thread
From: Elliott, Robert (Server Storage) @ 2014-10-27 19:07 UTC (permalink / raw)
  To: wenxiong@linux.vnet.ibm.com,
	James.Bottomley@HansenPartnership.com
  Cc: hch@infradead.org, linux-scsi@vger.kernel.org,
	brking@linux.vnet.ibm.com

> -----Original Message-----
> From: linux-scsi-owner@vger.kernel.org [mailto:linux-scsi-
> owner@vger.kernel.org] On Behalf Of wenxiong@linux.vnet.ibm.com
> Sent: Monday, 27 October, 2014 1:02 PM
> To: James.Bottomley@HansenPartnership.com
> Cc: hch@infradead.org; linux-scsi@vger.kernel.org;
> brking@linux.vnet.ibm.com
> Subject: [PATCH 1/2] scsi: TUR path is down after adapter gets reset
> in multipath configuration(scsi_error.c)
> 
> After an ipr adapter gets reset, all disk array devices require a
> start unit command to be issued to them before they will accept
> commands. So, with the SCSI EH change, we now end up in a scenario 
> with dual ipr adapters where the TUR getting issued from the health
> checker returns with a Not Ready response and since SCSI EH no
> longer triggers the Start Unit in this scenario, the path never
> recovers.
> 

This description doesn't make sense without reading the thread:
[dm-devel] [PATCH 1/1] multipath-tools: Change path checker for IBM IPR devices

Example: what is "the SCSI EH change"?


^ permalink raw reply	[flat|nested] 7+ messages in thread

* RE: [PATCH 2/2] scsi: TUR path is down after adapter gets reset in multipath configuration(scsi_dh_alus.c)
  2014-10-27 18:01 ` [PATCH 2/2] scsi: TUR path is down after adapter gets reset in multipath configuration(scsi_dh_alus.c) wenxiong
@ 2014-10-27 19:56   ` Elliott, Robert (Server Storage)
  2014-10-27 22:48     ` Brian King
  0 siblings, 1 reply; 7+ messages in thread
From: Elliott, Robert (Server Storage) @ 2014-10-27 19:56 UTC (permalink / raw)
  To: wenxiong@linux.vnet.ibm.com,
	James.Bottomley@HansenPartnership.com
  Cc: hch@infradead.org, linux-scsi@vger.kernel.org,
	brking@linux.vnet.ibm.com

> -----Original Message-----
> From: linux-scsi-owner@vger.kernel.org [mailto:linux-scsi-
> owner@vger.kernel.org] On Behalf Of wenxiong@linux.vnet.ibm.com
> Sent: Monday, 27 October, 2014 1:02 PM
> To: James.Bottomley@HansenPartnership.com
> Cc: hch@infradead.org; linux-scsi@vger.kernel.org;
> brking@linux.vnet.ibm.com
> Subject: [PATCH 2/2] scsi: TUR path is down after adapter gets reset
> in multipath configuration(scsi_dh_alus.c)
> 
> This patch also fixes the 02/04/02 K/C/Q check in alua_check_sense
> handler.
> 
> Signed-off-by: Brian King <brking@linux.vnet.ibm.com>
> Teste-by: Wen Xiong <wenxiong@linux.vnet.ibm.com>

Missing a d

> ---
>  drivers/scsi/device_handler/scsi_dh_alua.c |    7 +++++++
>  1 file changed, 7 insertions(+)
> 
> Index: b/drivers/scsi/device_handler/scsi_dh_alua.c
> ===================================================================
> --- a/drivers/scsi/device_handler/scsi_dh_alua.c	2014-10-23
> 13:00:45.000000000 -0500
> +++ b/drivers/scsi/device_handler/scsi_dh_alua.c	2014-10-23
> 13:04:16.152079004 -0500
> @@ -474,6 +474,13 @@ static int alua_check_sense(struct scsi_
>  			 * LUN Not Ready -- Offline
>  			 */
>  			return SUCCESS;
> +		if (sdev->allow_restart &&
> +		    (sense_hdr->asc == 0x04) && (sense_hdr->ascq ==
> 0x02))

The coding style in that function does not include the extra 
parenthesis, as shown by the next excerpt:

> +			/*
> +			 * if the device is not started, we need to wake
> +			 * the error handler to start the motor
> +			 */
> +			return FAILED;
>  		break;
>  	case UNIT_ATTENTION:
>  		if (sense_hdr->asc == 0x29 && sense_hdr->ascq == 0x00)

Thus function is used several places:
* installed as the scsi_device_handler .check_sense function 
* called to parse the response to REPORT TARGET PORT GROUPS
  in alua_rtpg
* called to parse the response to SET TARGET PORT GROUPS
  in stpg_endio

I'm not sure that adding NOT READY/LOGICAL UNIT NOT READY,
INITIALIZING COMMAND REQUIRED (2h/04h/02h) is a good idea
for the second case. The expected way to handle that 
response is to send START STOP UNIT with START=1.

There are conditions in which REPORT TARGET PORT GROUPS
is allowed and START STOP UNIT with START=1 is not allowed:
* CbCS (capabilities-based command security) only allows 
  START STOP UNIT if physical access (PHY ACC) is enabled,
  while REPORT TARGET PORT GROUPS is always allowed.
* the standby or unavailable asymmetric access states only 
  guarantee that REPORT TARGET PORT GROUPS is allowed, not
  START STOP UNIT.  The device is permitted to support START
  STOP UNIT, but it's not required.

So, it's really not a response that should be returned for
that command.

Any device that does return that response must also support
START STOP UNIT or it's misleading the application client.
In that case, falling through to the EH to send START STOP
UNIT is the right thing to do.

SET TARGET PORT GROUPS is questionable too; it also has 
different CbCS permissions and asymmetric access state
requirements than START STOP UNIT with START=1.

Perhaps that new "return FAILED" should be skipped if the
opcode is REPORT TARGET PORT GROUPS or SET TARGET PORT
GROUPS?

---
Rob Elliott    HP Server Storage



^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH 2/2] scsi: TUR path is down after adapter gets reset in multipath configuration(scsi_dh_alus.c)
  2014-10-27 19:56   ` Elliott, Robert (Server Storage)
@ 2014-10-27 22:48     ` Brian King
  0 siblings, 0 replies; 7+ messages in thread
From: Brian King @ 2014-10-27 22:48 UTC (permalink / raw)
  To: Elliott, Robert (Server Storage), wenxiong@linux.vnet.ibm.com,
	James.Bottomley@HansenPartnership.com
  Cc: hch@infradead.org, linux-scsi@vger.kernel.org

On 10/27/2014 02:56 PM, Elliott, Robert (Server Storage) wrote:
>> -----Original Message-----
>> From: linux-scsi-owner@vger.kernel.org [mailto:linux-scsi-
>> owner@vger.kernel.org] On Behalf Of wenxiong@linux.vnet.ibm.com
>> Sent: Monday, 27 October, 2014 1:02 PM
>> To: James.Bottomley@HansenPartnership.com
>> Cc: hch@infradead.org; linux-scsi@vger.kernel.org;
>> brking@linux.vnet.ibm.com
>> Subject: [PATCH 2/2] scsi: TUR path is down after adapter gets reset
>> in multipath configuration(scsi_dh_alus.c)
>>
>> This patch also fixes the 02/04/02 K/C/Q check in alua_check_sense
>> handler.
>>
>> Signed-off-by: Brian King <brking@linux.vnet.ibm.com>
>> Teste-by: Wen Xiong <wenxiong@linux.vnet.ibm.com>
> 
> Missing a d
> 
>> ---
>>  drivers/scsi/device_handler/scsi_dh_alua.c |    7 +++++++
>>  1 file changed, 7 insertions(+)
>>
>> Index: b/drivers/scsi/device_handler/scsi_dh_alua.c
>> ===================================================================
>> --- a/drivers/scsi/device_handler/scsi_dh_alua.c	2014-10-23
>> 13:00:45.000000000 -0500
>> +++ b/drivers/scsi/device_handler/scsi_dh_alua.c	2014-10-23
>> 13:04:16.152079004 -0500
>> @@ -474,6 +474,13 @@ static int alua_check_sense(struct scsi_
>>  			 * LUN Not Ready -- Offline
>>  			 */
>>  			return SUCCESS;
>> +		if (sdev->allow_restart &&
>> +		    (sense_hdr->asc == 0x04) && (sense_hdr->ascq ==
>> 0x02))
> 
> The coding style in that function does not include the extra 
> parenthesis, as shown by the next excerpt:

I just lifted this bit from scsi_error.c without noticing the
coding style difference. We can certainly change this.

> 
>> +			/*
>> +			 * if the device is not started, we need to wake
>> +			 * the error handler to start the motor
>> +			 */
>> +			return FAILED;
>>  		break;
>>  	case UNIT_ATTENTION:
>>  		if (sense_hdr->asc == 0x29 && sense_hdr->ascq == 0x00)
> 
> Thus function is used several places:
> * installed as the scsi_device_handler .check_sense function 
> * called to parse the response to REPORT TARGET PORT GROUPS
>   in alua_rtpg
> * called to parse the response to SET TARGET PORT GROUPS
>   in stpg_endio
> 
> I'm not sure that adding NOT READY/LOGICAL UNIT NOT READY,
> INITIALIZING COMMAND REQUIRED (2h/04h/02h) is a good idea
> for the second case. The expected way to handle that 
> response is to send START STOP UNIT with START=1.
> 
> There are conditions in which REPORT TARGET PORT GROUPS
> is allowed and START STOP UNIT with START=1 is not allowed:
> * CbCS (capabilities-based command security) only allows 
>   START STOP UNIT if physical access (PHY ACC) is enabled,
>   while REPORT TARGET PORT GROUPS is always allowed.
> * the standby or unavailable asymmetric access states only 
>   guarantee that REPORT TARGET PORT GROUPS is allowed, not
>   START STOP UNIT.  The device is permitted to support START
>   STOP UNIT, but it's not required.
> 
> So, it's really not a response that should be returned for
> that command.
> 
> Any device that does return that response must also support
> START STOP UNIT or it's misleading the application client.
> In that case, falling through to the EH to send START STOP
> UNIT is the right thing to do.
> 
> SET TARGET PORT GROUPS is questionable too; it also has 
> different CbCS permissions and asymmetric access state
> requirements than START STOP UNIT with START=1.
> 
> Perhaps that new "return FAILED" should be skipped if the
> opcode is REPORT TARGET PORT GROUPS or SET TARGET PORT
> GROUPS?

I'm fine with that.

Wendy - do you want to make these changes and resend?

Thanks,

Brian

-- 
Brian King
Power Linux I/O
IBM Linux Technology Center



^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH 1/2] scsi: TUR path is down after adapter gets reset in multipath configuration(scsi_error.c)
  2014-10-27 18:01 ` [PATCH 1/2] scsi: TUR path is down after adapter gets reset in multipath configuration(scsi_error.c) wenxiong
  2014-10-27 19:07   ` Elliott, Robert (Server Storage)
@ 2014-10-28  9:04   ` Christoph Hellwig
  1 sibling, 0 replies; 7+ messages in thread
From: Christoph Hellwig @ 2014-10-28  9:04 UTC (permalink / raw)
  To: wenxiong; +Cc: James.Bottomley, hch, linux-scsi, brking

On Mon, Oct 27, 2014 at 01:01:48PM -0500, wenxiong@linux.vnet.ibm.com wrote:
> After an ipr adapter gets reset, all disk array devices require a start
> unit command to be issued to them before they will accept commands. So,
> with the SCSI EH change, we now end up in a scenario with dual ipr
> adapters where the TUR getting issued from the health checker returns
> with a Not Ready response and since SCSI EH no longer triggers the Start
> Unit in this scenario, the path never recovers.
> 
> Signed-off-by: Christoph Hellwig <hch@infradead.org>
> Tested-by: Wen Xiong <wenxiong@linux.vnet.ibm.com>

Th patch description is entirely wrong, but I guess I'll have to
take the blame for that as it's my patch.  How about something like:

From: Christoph Hellwig <hch@infradead.org>
Subject: scsi: call device handler for failed TUR command

Multipath devices using the TUR path checker need to see the sense
code for a failed TUR command in their device handler.  Since commit
<insert commit id here> we always return success for mid layer issued
TUR commands before calling the device handler, which stopped the
TUR path checker from working.

Move the call to the device handler check sense method before the early
return for TUR commands to give the device handler a chance to intercept
them.

Signed-off-by: Christoph Hellwig <hch@infradead.org>
Tested-by: Wen Xiong <wenxiong@linux.vnet.ibm.com>

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2014-10-28  9:04 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2014-10-27 18:01 [PATCH 0/2] TUR path is down after adapter gets reset in multipath configuration wenxiong
2014-10-27 18:01 ` [PATCH 1/2] scsi: TUR path is down after adapter gets reset in multipath configuration(scsi_error.c) wenxiong
2014-10-27 19:07   ` Elliott, Robert (Server Storage)
2014-10-28  9:04   ` Christoph Hellwig
2014-10-27 18:01 ` [PATCH 2/2] scsi: TUR path is down after adapter gets reset in multipath configuration(scsi_dh_alus.c) wenxiong
2014-10-27 19:56   ` Elliott, Robert (Server Storage)
2014-10-27 22:48     ` Brian King

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox