[PATCH] scsi_error: count medium access timeout only once per EH run

public inbox for linux-scsi@vger.kernel.org
 help / color / mirror / Atom feed

* [PATCH] scsi_error: count medium access timeout only once per EH run
@ 2017-02-23 10:27 Hannes Reinecke
  2017-02-23 14:13 ` Laurence Oberman
  2017-02-27 19:33 ` Ewan D. Milne
  0 siblings, 2 replies; 5+ messages in thread
From: Hannes Reinecke @ 2017-02-23 10:27 UTC (permalink / raw)
  To: Martin K. Petersen
  Cc: Christoph Hellwig, James Bottomley, linux-scsi, Hannes Reinecke,
	Ewan Milne, Lawrence Oberman, Benjamin Block, Steffen Maier,
	Hannes Reinecke

The current medium access timeout counter will be increased for
each command, so if there are enough failed commands we'll hit
the medium access timeout for even a single failure.
Fix this by making the timeout per EH run, ie the counter will
only be increased once per device and EH run.

Cc: Ewan Milne <emilne@redhat.com>
Cc: Lawrence Oberman <loberman@redhat.com>
Cc: Benjamin Block <bblock@linux.vnet.ibm.vom>
Cc: Steffen Maier <maier@de.ibm.com>
Signed-off-by: Hannes Reinecke <hare@suse.com>
---
 drivers/scsi/scsi_error.c |  2 ++
 drivers/scsi/sd.c         | 16 ++++++++++++++--
 drivers/scsi/sd.h         |  1 +
 include/scsi/scsi.h       |  1 +
 4 files changed, 18 insertions(+), 2 deletions(-)

diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
index f2cafae..481ea1b 100644
--- a/drivers/scsi/scsi_error.c
+++ b/drivers/scsi/scsi_error.c
@@ -58,6 +58,7 @@
 static int scsi_eh_try_stu(struct scsi_cmnd *scmd);
 static int scsi_try_to_abort_cmd(struct scsi_host_template *,
 				 struct scsi_cmnd *);
+static int scsi_eh_action(struct scsi_cmnd *scmd, int rtn);
 
 /* called with shost->host_lock held */
 void scsi_eh_wakeup(struct Scsi_Host *shost)
@@ -249,6 +250,7 @@ int scsi_eh_scmd_add(struct scsi_cmnd *scmd, int eh_flag)
 	if (scmd->eh_eflags & SCSI_EH_ABORT_SCHEDULED)
 		eh_flag &= ~SCSI_EH_CANCEL_CMD;
 	scmd->eh_eflags |= eh_flag;
+	scsi_eh_action(scmd, NEEDS_RESET);
 	list_add_tail(&scmd->eh_entry, &shost->eh_cmd_q);
 	shost->host_failed++;
 	scsi_eh_wakeup(shost);
diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index be535d4..cd9f290 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -1696,12 +1696,21 @@ static int sd_pr_clear(struct block_device *bdev, u64 key)
  *	the eh command is passed in eh_disp.  We're looking for devices that
  *	fail medium access commands but are OK with non access commands like
  *	test unit ready (so wrongly see the device as having a successful
- *	recovery)
+ *	recovery).
+ *	We have to be careful to count a medium access failure only once
+ *	per SCSI EH run; there might be several timed out commands which
+ *	will cause the 'max_medium_access_timeouts' counter to trigger
+ *	after the first SCSI EH run already and set the device to offline.
  **/
 static int sd_eh_action(struct scsi_cmnd *scmd, int eh_disp)
 {
 	struct scsi_disk *sdkp = scsi_disk(scmd->request->rq_disk);
 
+	if (eh_disp == NEEDS_RESET) {
+		/* New SCSI EH run, reset gate variable */
+		sdkp->medium_access_reset = 0;
+		return eh_disp;
+	}
 	if (!scsi_device_online(scmd->device) ||
 	    !scsi_medium_access_command(scmd) ||
 	    host_byte(scmd->result) != DID_TIME_OUT ||
@@ -1715,7 +1724,10 @@ static int sd_eh_action(struct scsi_cmnd *scmd, int eh_disp)
 	 * process of recovering or has it suffered an internal failure
 	 * that prevents access to the storage medium.
 	 */
-	sdkp->medium_access_timed_out++;
+	if (!sdkp->medium_access_reset) {
+		sdkp->medium_access_timed_out++;
+		sdkp->medium_access_reset++;
+	}
 
 	/*
 	 * If the device keeps failing read/write commands but TEST UNIT
diff --git a/drivers/scsi/sd.h b/drivers/scsi/sd.h
index 4dac35e..19e0bab 100644
--- a/drivers/scsi/sd.h
+++ b/drivers/scsi/sd.h
@@ -85,6 +85,7 @@ struct scsi_disk {
 	unsigned int	physical_block_size;
 	unsigned int	max_medium_access_timeouts;
 	unsigned int	medium_access_timed_out;
+	unsigned int	medium_access_reset;
 	u8		media_present;
 	u8		write_prot;
 	u8		protection_type;/* Data Integrity Field */
diff --git a/include/scsi/scsi.h b/include/scsi/scsi.h
index a1e1930..b6c750f 100644
--- a/include/scsi/scsi.h
+++ b/include/scsi/scsi.h
@@ -185,6 +185,7 @@ static inline int scsi_is_wlun(u64 lun)
 #define TIMEOUT_ERROR   0x2007
 #define SCSI_RETURN_NOT_HANDLED   0x2008
 #define FAST_IO_FAIL	0x2009
+#define NEEDS_RESET     0x2010
 
 /*
  * Midlevel queue return values.
-- 
1.8.5.6

^ permalink raw reply related	[flat|nested] 5+ messages in thread

* Re: [PATCH] scsi_error: count medium access timeout only once per EH run
  2017-02-23 10:27 [PATCH] scsi_error: count medium access timeout only once per EH run Hannes Reinecke
@ 2017-02-23 14:13 ` Laurence Oberman
  2017-02-27 19:33 ` Ewan D. Milne
  1 sibling, 0 replies; 5+ messages in thread
From: Laurence Oberman @ 2017-02-23 14:13 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Martin K. Petersen, Christoph Hellwig, James Bottomley,
	linux-scsi, Ewan Milne, Benjamin Block, Steffen Maier,
	Hannes Reinecke



----- Original Message -----
> From: "Hannes Reinecke" <hare@suse.de>
> To: "Martin K. Petersen" <martin.petersen@oracle.com>
> Cc: "Christoph Hellwig" <hch@lst.de>, "James Bottomley" <james.bottomley@hansenpartnership.com>,
> linux-scsi@vger.kernel.org, "Hannes Reinecke" <hare@suse.de>, "Ewan Milne" <emilne@redhat.com>, "Lawrence Oberman"
> <loberman@redhat.com>, "Benjamin Block" <bblock@linux.vnet.ibm.vom>, "Steffen Maier" <maier@de.ibm.com>, "Hannes
> Reinecke" <hare@suse.com>
> Sent: Thursday, February 23, 2017 5:27:19 AM
> Subject: [PATCH] scsi_error: count medium access timeout only once per EH run
> 
> The current medium access timeout counter will be increased for
> each command, so if there are enough failed commands we'll hit
> the medium access timeout for even a single failure.
> Fix this by making the timeout per EH run, ie the counter will
> only be increased once per device and EH run.
> 
> Cc: Ewan Milne <emilne@redhat.com>
> Cc: Lawrence Oberman <loberman@redhat.com>
> Cc: Benjamin Block <bblock@linux.vnet.ibm.vom>
> Cc: Steffen Maier <maier@de.ibm.com>
> Signed-off-by: Hannes Reinecke <hare@suse.com>
> ---
>  drivers/scsi/scsi_error.c |  2 ++
>  drivers/scsi/sd.c         | 16 ++++++++++++++--
>  drivers/scsi/sd.h         |  1 +
>  include/scsi/scsi.h       |  1 +
>  4 files changed, 18 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
> index f2cafae..481ea1b 100644
> --- a/drivers/scsi/scsi_error.c
> +++ b/drivers/scsi/scsi_error.c
> @@ -58,6 +58,7 @@
>  static int scsi_eh_try_stu(struct scsi_cmnd *scmd);
>  static int scsi_try_to_abort_cmd(struct scsi_host_template *,
>  				 struct scsi_cmnd *);
> +static int scsi_eh_action(struct scsi_cmnd *scmd, int rtn);
>  
>  /* called with shost->host_lock held */
>  void scsi_eh_wakeup(struct Scsi_Host *shost)
> @@ -249,6 +250,7 @@ int scsi_eh_scmd_add(struct scsi_cmnd *scmd, int eh_flag)
>  	if (scmd->eh_eflags & SCSI_EH_ABORT_SCHEDULED)
>  		eh_flag &= ~SCSI_EH_CANCEL_CMD;
>  	scmd->eh_eflags |= eh_flag;
> +	scsi_eh_action(scmd, NEEDS_RESET);
>  	list_add_tail(&scmd->eh_entry, &shost->eh_cmd_q);
>  	shost->host_failed++;
>  	scsi_eh_wakeup(shost);
> diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
> index be535d4..cd9f290 100644
> --- a/drivers/scsi/sd.c
> +++ b/drivers/scsi/sd.c
> @@ -1696,12 +1696,21 @@ static int sd_pr_clear(struct block_device *bdev, u64
> key)
>   *	the eh command is passed in eh_disp.  We're looking for devices that
>   *	fail medium access commands but are OK with non access commands like
>   *	test unit ready (so wrongly see the device as having a successful
> - *	recovery)
> + *	recovery).
> + *	We have to be careful to count a medium access failure only once
> + *	per SCSI EH run; there might be several timed out commands which
> + *	will cause the 'max_medium_access_timeouts' counter to trigger
> + *	after the first SCSI EH run already and set the device to offline.
>   **/
>  static int sd_eh_action(struct scsi_cmnd *scmd, int eh_disp)
>  {
>  	struct scsi_disk *sdkp = scsi_disk(scmd->request->rq_disk);
>  
> +	if (eh_disp == NEEDS_RESET) {
> +		/* New SCSI EH run, reset gate variable */
> +		sdkp->medium_access_reset = 0;
> +		return eh_disp;
> +	}
>  	if (!scsi_device_online(scmd->device) ||
>  	    !scsi_medium_access_command(scmd) ||
>  	    host_byte(scmd->result) != DID_TIME_OUT ||
> @@ -1715,7 +1724,10 @@ static int sd_eh_action(struct scsi_cmnd *scmd, int
> eh_disp)
>  	 * process of recovering or has it suffered an internal failure
>  	 * that prevents access to the storage medium.
>  	 */
> -	sdkp->medium_access_timed_out++;
> +	if (!sdkp->medium_access_reset) {
> +		sdkp->medium_access_timed_out++;
> +		sdkp->medium_access_reset++;
> +	}
>  
>  	/*
>  	 * If the device keeps failing read/write commands but TEST UNIT
> diff --git a/drivers/scsi/sd.h b/drivers/scsi/sd.h
> index 4dac35e..19e0bab 100644
> --- a/drivers/scsi/sd.h
> +++ b/drivers/scsi/sd.h
> @@ -85,6 +85,7 @@ struct scsi_disk {
>  	unsigned int	physical_block_size;
>  	unsigned int	max_medium_access_timeouts;
>  	unsigned int	medium_access_timed_out;
> +	unsigned int	medium_access_reset;
>  	u8		media_present;
>  	u8		write_prot;
>  	u8		protection_type;/* Data Integrity Field */
> diff --git a/include/scsi/scsi.h b/include/scsi/scsi.h
> index a1e1930..b6c750f 100644
> --- a/include/scsi/scsi.h
> +++ b/include/scsi/scsi.h
> @@ -185,6 +185,7 @@ static inline int scsi_is_wlun(u64 lun)
>  #define TIMEOUT_ERROR   0x2007
>  #define SCSI_RETURN_NOT_HANDLED   0x2008
>  #define FAST_IO_FAIL	0x2009
> +#define NEEDS_RESET     0x2010
>  
>  /*
>   * Midlevel queue return values.
> --
> 1.8.5.6
> 
> 

Hello Hannes 
This makes sense to me what you are doing here.
I will also wait for Ewan to weigh in but I wonder if we should make a simple change.
Maybe good to clarify the RESET here by simply changing the name.

Change 
+#define NEEDS_RESET     0x2010
to
+#define MAX_MEDIUM_ERROR_NEEDS_RESET

Of course then also change
+	if (eh_disp == NEEDS_RESET) {
to
+	if (eh_disp == MAX_MEDIUM_ERROR_NEEDS_RESET) {

Reviewed-by: Laurence Oberman <loberman@redhat.com

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH] scsi_error: count medium access timeout only once per EH run
  2017-02-23 10:27 [PATCH] scsi_error: count medium access timeout only once per EH run Hannes Reinecke
  2017-02-23 14:13 ` Laurence Oberman
@ 2017-02-27 19:33 ` Ewan D. Milne
  2017-02-28  3:04   ` Martin K. Petersen
  2017-02-28 10:03   ` Hannes Reinecke
  1 sibling, 2 replies; 5+ messages in thread
From: Ewan D. Milne @ 2017-02-27 19:33 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Martin K. Petersen, Christoph Hellwig, James Bottomley,
	linux-scsi, Lawrence Oberman, Benjamin Block, Steffen Maier,
	Hannes Reinecke

On Thu, 2017-02-23 at 11:27 +0100, Hannes Reinecke wrote:
> The current medium access timeout counter will be increased for
> each command, so if there are enough failed commands we'll hit
> the medium access timeout for even a single failure.
> Fix this by making the timeout per EH run, ie the counter will
> only be increased once per device and EH run.

So, this is good, the current implementation has a flaw in that
under certain conditions, a device will get offlined immediately,
(i.e. if there are a few medium access commands pending, and
they all timeout), which isn't what was intended.

It means, of course, that we will no longer detect cases like:

<timeout>, <timeout>, SUCCESS, SUCCESS, SUCCESS, <timeout>

as separate medium access timeouts, but I think the original
intent of Martin's change wasn't to operate on such a short
time-scale, am I right, Martin?

I made a few notes on the coding/implementation (below), but that
doesn't affect the functional change.  We should definitely change
what we have now, it is causing people problems.

> 
> Cc: Ewan Milne <emilne@redhat.com>
> Cc: Lawrence Oberman <loberman@redhat.com>
> Cc: Benjamin Block <bblock@linux.vnet.ibm.vom>
> Cc: Steffen Maier <maier@de.ibm.com>
> Signed-off-by: Hannes Reinecke <hare@suse.com>
> ---
>  drivers/scsi/scsi_error.c |  2 ++
>  drivers/scsi/sd.c         | 16 ++++++++++++++--
>  drivers/scsi/sd.h         |  1 +
>  include/scsi/scsi.h       |  1 +
>  4 files changed, 18 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
> index f2cafae..481ea1b 100644
> --- a/drivers/scsi/scsi_error.c
> +++ b/drivers/scsi/scsi_error.c
> @@ -58,6 +58,7 @@
>  static int scsi_eh_try_stu(struct scsi_cmnd *scmd);
>  static int scsi_try_to_abort_cmd(struct scsi_host_template *,
>  				 struct scsi_cmnd *);
> +static int scsi_eh_action(struct scsi_cmnd *scmd, int rtn);
>  
>  /* called with shost->host_lock held */
>  void scsi_eh_wakeup(struct Scsi_Host *shost)
> @@ -249,6 +250,7 @@ int scsi_eh_scmd_add(struct scsi_cmnd *scmd, int eh_flag)
>  	if (scmd->eh_eflags & SCSI_EH_ABORT_SCHEDULED)
>  		eh_flag &= ~SCSI_EH_CANCEL_CMD;
>  	scmd->eh_eflags |= eh_flag;
> +	scsi_eh_action(scmd, NEEDS_RESET);

So here we are overloading the eh_disp argument with a flag to reset the
medium_access_reset variable.  James changed the calling sequence of
this function already to remove arguments, we could just add another
boolean parameter "reset".  scsi_driver.eh_action() would need it too.

>  	list_add_tail(&scmd->eh_entry, &shost->eh_cmd_q);
>  	shost->host_failed++;
>  	scsi_eh_wakeup(shost);
> diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
> index be535d4..cd9f290 100644
> --- a/drivers/scsi/sd.c
> +++ b/drivers/scsi/sd.c
> @@ -1696,12 +1696,21 @@ static int sd_pr_clear(struct block_device *bdev, u64 key)
>   *	the eh command is passed in eh_disp.  We're looking for devices that
>   *	fail medium access commands but are OK with non access commands like
>   *	test unit ready (so wrongly see the device as having a successful
> - *	recovery)
> + *	recovery).
> + *	We have to be careful to count a medium access failure only once
> + *	per SCSI EH run; there might be several timed out commands which
> + *	will cause the 'max_medium_access_timeouts' counter to trigger
> + *	after the first SCSI EH run already and set the device to offline.
>   **/
>  static int sd_eh_action(struct scsi_cmnd *scmd, int eh_disp)
>  {
>  	struct scsi_disk *sdkp = scsi_disk(scmd->request->rq_disk);
>  
> +	if (eh_disp == NEEDS_RESET) {
> +		/* New SCSI EH run, reset gate variable */
> +		sdkp->medium_access_reset = 0;
> +		return eh_disp;
> +	}
>  	if (!scsi_device_online(scmd->device) ||
>  	    !scsi_medium_access_command(scmd) ||
>  	    host_byte(scmd->result) != DID_TIME_OUT ||
> @@ -1715,7 +1724,10 @@ static int sd_eh_action(struct scsi_cmnd *scmd, int eh_disp)
>  	 * process of recovering or has it suffered an internal failure
>  	 * that prevents access to the storage medium.
>  	 */
> -	sdkp->medium_access_timed_out++;
> +	if (!sdkp->medium_access_reset) {
> +		sdkp->medium_access_timed_out++;
> +		sdkp->medium_access_reset++;
> +	}

If we only increment sdkp->medium_access_reset when it was 0, then it
will only have the values 0 and 1, and does not need to have the full
unsigned int precision.  A single bit field is sufficient, in which
case the code would be:  sdkp->medium_access_reset = 1;

>  
>  	/*
>  	 * If the device keeps failing read/write commands but TEST UNIT
> diff --git a/drivers/scsi/sd.h b/drivers/scsi/sd.h
> index 4dac35e..19e0bab 100644
> --- a/drivers/scsi/sd.h
> +++ b/drivers/scsi/sd.h
> @@ -85,6 +85,7 @@ struct scsi_disk {
>  	unsigned int	physical_block_size;
>  	unsigned int	max_medium_access_timeouts;
>  	unsigned int	medium_access_timed_out;
> +	unsigned int	medium_access_reset;

This could be an unsigned int : 1 with the other single bit fields at
the end of the structure, with the change above.

>  	u8		media_present;
>  	u8		write_prot;
>  	u8		protection_type;/* Data Integrity Field */
> diff --git a/include/scsi/scsi.h b/include/scsi/scsi.h
> index a1e1930..b6c750f 100644
> --- a/include/scsi/scsi.h
> +++ b/include/scsi/scsi.h
> @@ -185,6 +185,7 @@ static inline int scsi_is_wlun(u64 lun)
>  #define TIMEOUT_ERROR   0x2007
>  #define SCSI_RETURN_NOT_HANDLED   0x2008
>  #define FAST_IO_FAIL	0x2009
> +#define NEEDS_RESET     0x2010

See above, this is overloading the use of the parameter.

>  
>  /*
>   * Midlevel queue return values.

Reviewed-by: Ewan D. Milne <emilne@redhat.com>

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH] scsi_error: count medium access timeout only once per EH run
  2017-02-27 19:33 ` Ewan D. Milne
@ 2017-02-28  3:04   ` Martin K. Petersen
  2017-02-28 10:03   ` Hannes Reinecke
  1 sibling, 0 replies; 5+ messages in thread
From: Martin K. Petersen @ 2017-02-28  3:04 UTC (permalink / raw)
  To: Ewan D. Milne
  Cc: Hannes Reinecke, Martin K. Petersen, Christoph Hellwig,
	James Bottomley, linux-scsi, Lawrence Oberman, Benjamin Block,
	Steffen Maier, Hannes Reinecke

>>>>> "Ewan" == Ewan D Milne <emilne@redhat.com> writes:

Ewan,

Ewan> So, this is good, the current implementation has a flaw in that
Ewan> under certain conditions, a device will get offlined immediately,
Ewan> (i.e. if there are a few medium access commands pending, and they
Ewan> all timeout), which isn't what was intended.

Yeah. That was OK for my use case. I was trying to prevent the server
from going into a tail spin. There was no chance of recovering the disk.

But ideally we'd be offlining based on how many times we retry the same
medium access command.

Ewan> as separate medium access timeouts, but I think the original
Ewan> intent of Martin's change wasn't to operate on such a short
Ewan> time-scale, am I right, Martin?

On the device that begat my original patch, SPC command responses were
handled by the SAS controller firmware on behalf of all discovered
devices. Regardless of whether said drives were still alive or not.

Medium Access commands, however, would always get passed on to the
physical drive for processing. So when a drive went pining for the
fjords, TUR would always succeed whereas reads and writes would time
out.

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH] scsi_error: count medium access timeout only once per EH run
  2017-02-27 19:33 ` Ewan D. Milne
  2017-02-28  3:04   ` Martin K. Petersen
@ 2017-02-28 10:03   ` Hannes Reinecke
  1 sibling, 0 replies; 5+ messages in thread
From: Hannes Reinecke @ 2017-02-28 10:03 UTC (permalink / raw)
  To: emilne
  Cc: Martin K. Petersen, Christoph Hellwig, James Bottomley,
	linux-scsi, Lawrence Oberman, Benjamin Block, Steffen Maier,
	Hannes Reinecke

On 02/27/2017 08:33 PM, Ewan D. Milne wrote:
> On Thu, 2017-02-23 at 11:27 +0100, Hannes Reinecke wrote:
>> The current medium access timeout counter will be increased for
>> each command, so if there are enough failed commands we'll hit
>> the medium access timeout for even a single failure.
>> Fix this by making the timeout per EH run, ie the counter will
>> only be increased once per device and EH run.
> 
> So, this is good, the current implementation has a flaw in that
> under certain conditions, a device will get offlined immediately,
> (i.e. if there are a few medium access commands pending, and
> they all timeout), which isn't what was intended.
> 
> It means, of course, that we will no longer detect cases like:
> 
> <timeout>, <timeout>, SUCCESS, SUCCESS, SUCCESS, <timeout>
> 
> as separate medium access timeouts, but I think the original
> intent of Martin's change wasn't to operate on such a short
> time-scale, am I right, Martin?
> 
> I made a few notes on the coding/implementation (below), but that
> doesn't affect the functional change.  We should definitely change
> what we have now, it is causing people problems.
> 
>>
>> Cc: Ewan Milne <emilne@redhat.com>
>> Cc: Lawrence Oberman <loberman@redhat.com>
>> Cc: Benjamin Block <bblock@linux.vnet.ibm.vom>
>> Cc: Steffen Maier <maier@de.ibm.com>
>> Signed-off-by: Hannes Reinecke <hare@suse.com>
>> ---
>>  drivers/scsi/scsi_error.c |  2 ++
>>  drivers/scsi/sd.c         | 16 ++++++++++++++--
>>  drivers/scsi/sd.h         |  1 +
>>  include/scsi/scsi.h       |  1 +
>>  4 files changed, 18 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
>> index f2cafae..481ea1b 100644
>> --- a/drivers/scsi/scsi_error.c
>> +++ b/drivers/scsi/scsi_error.c
>> @@ -58,6 +58,7 @@
>>  static int scsi_eh_try_stu(struct scsi_cmnd *scmd);
>>  static int scsi_try_to_abort_cmd(struct scsi_host_template *,
>>  				 struct scsi_cmnd *);
>> +static int scsi_eh_action(struct scsi_cmnd *scmd, int rtn);
>>  
>>  /* called with shost->host_lock held */
>>  void scsi_eh_wakeup(struct Scsi_Host *shost)
>> @@ -249,6 +250,7 @@ int scsi_eh_scmd_add(struct scsi_cmnd *scmd, int eh_flag)
>>  	if (scmd->eh_eflags & SCSI_EH_ABORT_SCHEDULED)
>>  		eh_flag &= ~SCSI_EH_CANCEL_CMD;
>>  	scmd->eh_eflags |= eh_flag;
>> +	scsi_eh_action(scmd, NEEDS_RESET);
> 
> So here we are overloading the eh_disp argument with a flag to reset the
> medium_access_reset variable.  James changed the calling sequence of
> this function already to remove arguments, we could just add another
> boolean parameter "reset".  scsi_driver.eh_action() would need it too.
> 
Sure, I could be doing it.
Using a separate 'eh_disp' variable has the added benefit that it
doesn't break the kABI :-)
But yeah, I modify that.

>>  	list_add_tail(&scmd->eh_entry, &shost->eh_cmd_q);
>>  	shost->host_failed++;
>>  	scsi_eh_wakeup(shost);
>> diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
>> index be535d4..cd9f290 100644
>> --- a/drivers/scsi/sd.c
>> +++ b/drivers/scsi/sd.c
>> @@ -1696,12 +1696,21 @@ static int sd_pr_clear(struct block_device *bdev, u64 key)
>>   *	the eh command is passed in eh_disp.  We're looking for devices that
>>   *	fail medium access commands but are OK with non access commands like
>>   *	test unit ready (so wrongly see the device as having a successful
>> - *	recovery)
>> + *	recovery).
>> + *	We have to be careful to count a medium access failure only once
>> + *	per SCSI EH run; there might be several timed out commands which
>> + *	will cause the 'max_medium_access_timeouts' counter to trigger
>> + *	after the first SCSI EH run already and set the device to offline.
>>   **/
>>  static int sd_eh_action(struct scsi_cmnd *scmd, int eh_disp)
>>  {
>>  	struct scsi_disk *sdkp = scsi_disk(scmd->request->rq_disk);
>>  
>> +	if (eh_disp == NEEDS_RESET) {
>> +		/* New SCSI EH run, reset gate variable */
>> +		sdkp->medium_access_reset = 0;
>> +		return eh_disp;
>> +	}
>>  	if (!scsi_device_online(scmd->device) ||
>>  	    !scsi_medium_access_command(scmd) ||
>>  	    host_byte(scmd->result) != DID_TIME_OUT ||
>> @@ -1715,7 +1724,10 @@ static int sd_eh_action(struct scsi_cmnd *scmd, int eh_disp)
>>  	 * process of recovering or has it suffered an internal failure
>>  	 * that prevents access to the storage medium.
>>  	 */
>> -	sdkp->medium_access_timed_out++;
>> +	if (!sdkp->medium_access_reset) {
>> +		sdkp->medium_access_timed_out++;
>> +		sdkp->medium_access_reset++;
>> +	}
> 
> If we only increment sdkp->medium_access_reset when it was 0, then it
> will only have the values 0 and 1, and does not need to have the full
> unsigned int precision.  A single bit field is sufficient, in which
> case the code would be:  sdkp->medium_access_reset = 1;
> 
Okay, can do that.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		      zSeries & Storage
hare@suse.de			      +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2017-02-28 10:23 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2017-02-23 10:27 [PATCH] scsi_error: count medium access timeout only once per EH run Hannes Reinecke
2017-02-23 14:13 ` Laurence Oberman
2017-02-27 19:33 ` Ewan D. Milne
2017-02-28  3:04   ` Martin K. Petersen
2017-02-28 10:03   ` Hannes Reinecke

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox