Re: [PATCH] scsi_error: count medium access timeout only once per EH run

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Laurence Oberman <loberman@redhat.com>
To: Hannes Reinecke <hare@suse.de>
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>,
	Christoph Hellwig <hch@lst.de>,
	James Bottomley <james.bottomley@hansenpartnership.com>,
	linux-scsi@vger.kernel.org, Ewan Milne <emilne@redhat.com>,
	Benjamin Block <bblock@linux.vnet.ibm.vom>,
	Steffen Maier <maier@de.ibm.com>, Hannes Reinecke <hare@suse.com>
Subject: Re: [PATCH] scsi_error: count medium access timeout only once per EH run
Date: Thu, 23 Feb 2017 09:13:02 -0500 (EST)	[thread overview]
Message-ID: <398739338.37615684.1487859182429.JavaMail.zimbra@redhat.com> (raw)
In-Reply-To: <1487845639-73322-1-git-send-email-hare@suse.de>



----- Original Message -----
> From: "Hannes Reinecke" <hare@suse.de>
> To: "Martin K. Petersen" <martin.petersen@oracle.com>
> Cc: "Christoph Hellwig" <hch@lst.de>, "James Bottomley" <james.bottomley@hansenpartnership.com>,
> linux-scsi@vger.kernel.org, "Hannes Reinecke" <hare@suse.de>, "Ewan Milne" <emilne@redhat.com>, "Lawrence Oberman"
> <loberman@redhat.com>, "Benjamin Block" <bblock@linux.vnet.ibm.vom>, "Steffen Maier" <maier@de.ibm.com>, "Hannes
> Reinecke" <hare@suse.com>
> Sent: Thursday, February 23, 2017 5:27:19 AM
> Subject: [PATCH] scsi_error: count medium access timeout only once per EH run
> 
> The current medium access timeout counter will be increased for
> each command, so if there are enough failed commands we'll hit
> the medium access timeout for even a single failure.
> Fix this by making the timeout per EH run, ie the counter will
> only be increased once per device and EH run.
> 
> Cc: Ewan Milne <emilne@redhat.com>
> Cc: Lawrence Oberman <loberman@redhat.com>
> Cc: Benjamin Block <bblock@linux.vnet.ibm.vom>
> Cc: Steffen Maier <maier@de.ibm.com>
> Signed-off-by: Hannes Reinecke <hare@suse.com>
> ---
>  drivers/scsi/scsi_error.c |  2 ++
>  drivers/scsi/sd.c         | 16 ++++++++++++++--
>  drivers/scsi/sd.h         |  1 +
>  include/scsi/scsi.h       |  1 +
>  4 files changed, 18 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
> index f2cafae..481ea1b 100644
> --- a/drivers/scsi/scsi_error.c
> +++ b/drivers/scsi/scsi_error.c
> @@ -58,6 +58,7 @@
>  static int scsi_eh_try_stu(struct scsi_cmnd *scmd);
>  static int scsi_try_to_abort_cmd(struct scsi_host_template *,
>  				 struct scsi_cmnd *);
> +static int scsi_eh_action(struct scsi_cmnd *scmd, int rtn);
>  
>  /* called with shost->host_lock held */
>  void scsi_eh_wakeup(struct Scsi_Host *shost)
> @@ -249,6 +250,7 @@ int scsi_eh_scmd_add(struct scsi_cmnd *scmd, int eh_flag)
>  	if (scmd->eh_eflags & SCSI_EH_ABORT_SCHEDULED)
>  		eh_flag &= ~SCSI_EH_CANCEL_CMD;
>  	scmd->eh_eflags |= eh_flag;
> +	scsi_eh_action(scmd, NEEDS_RESET);
>  	list_add_tail(&scmd->eh_entry, &shost->eh_cmd_q);
>  	shost->host_failed++;
>  	scsi_eh_wakeup(shost);
> diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
> index be535d4..cd9f290 100644
> --- a/drivers/scsi/sd.c
> +++ b/drivers/scsi/sd.c
> @@ -1696,12 +1696,21 @@ static int sd_pr_clear(struct block_device *bdev, u64
> key)
>   *	the eh command is passed in eh_disp.  We're looking for devices that
>   *	fail medium access commands but are OK with non access commands like
>   *	test unit ready (so wrongly see the device as having a successful
> - *	recovery)
> + *	recovery).
> + *	We have to be careful to count a medium access failure only once
> + *	per SCSI EH run; there might be several timed out commands which
> + *	will cause the 'max_medium_access_timeouts' counter to trigger
> + *	after the first SCSI EH run already and set the device to offline.
>   **/
>  static int sd_eh_action(struct scsi_cmnd *scmd, int eh_disp)
>  {
>  	struct scsi_disk *sdkp = scsi_disk(scmd->request->rq_disk);
>  
> +	if (eh_disp == NEEDS_RESET) {
> +		/* New SCSI EH run, reset gate variable */
> +		sdkp->medium_access_reset = 0;
> +		return eh_disp;
> +	}
>  	if (!scsi_device_online(scmd->device) ||
>  	    !scsi_medium_access_command(scmd) ||
>  	    host_byte(scmd->result) != DID_TIME_OUT ||
> @@ -1715,7 +1724,10 @@ static int sd_eh_action(struct scsi_cmnd *scmd, int
> eh_disp)
>  	 * process of recovering or has it suffered an internal failure
>  	 * that prevents access to the storage medium.
>  	 */
> -	sdkp->medium_access_timed_out++;
> +	if (!sdkp->medium_access_reset) {
> +		sdkp->medium_access_timed_out++;
> +		sdkp->medium_access_reset++;
> +	}
>  
>  	/*
>  	 * If the device keeps failing read/write commands but TEST UNIT
> diff --git a/drivers/scsi/sd.h b/drivers/scsi/sd.h
> index 4dac35e..19e0bab 100644
> --- a/drivers/scsi/sd.h
> +++ b/drivers/scsi/sd.h
> @@ -85,6 +85,7 @@ struct scsi_disk {
>  	unsigned int	physical_block_size;
>  	unsigned int	max_medium_access_timeouts;
>  	unsigned int	medium_access_timed_out;
> +	unsigned int	medium_access_reset;
>  	u8		media_present;
>  	u8		write_prot;
>  	u8		protection_type;/* Data Integrity Field */
> diff --git a/include/scsi/scsi.h b/include/scsi/scsi.h
> index a1e1930..b6c750f 100644
> --- a/include/scsi/scsi.h
> +++ b/include/scsi/scsi.h
> @@ -185,6 +185,7 @@ static inline int scsi_is_wlun(u64 lun)
>  #define TIMEOUT_ERROR   0x2007
>  #define SCSI_RETURN_NOT_HANDLED   0x2008
>  #define FAST_IO_FAIL	0x2009
> +#define NEEDS_RESET     0x2010
>  
>  /*
>   * Midlevel queue return values.
> --
> 1.8.5.6
> 
> 

Hello Hannes 
This makes sense to me what you are doing here.
I will also wait for Ewan to weigh in but I wonder if we should make a simple change.
Maybe good to clarify the RESET here by simply changing the name.

Change 
+#define NEEDS_RESET     0x2010
to
+#define MAX_MEDIUM_ERROR_NEEDS_RESET

Of course then also change
+	if (eh_disp == NEEDS_RESET) {
to
+	if (eh_disp == MAX_MEDIUM_ERROR_NEEDS_RESET) {

Reviewed-by: Laurence Oberman <loberman@redhat.com

next prev parent reply	other threads:[~2017-02-23 14:13 UTC|newest]

Thread overview: 5+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-02-23 10:27 [PATCH] scsi_error: count medium access timeout only once per EH run Hannes Reinecke
2017-02-23 14:13 ` Laurence Oberman [this message]
2017-02-27 19:33 ` Ewan D. Milne
2017-02-28  3:04   ` Martin K. Petersen
2017-02-28 10:03   ` Hannes Reinecke

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=398739338.37615684.1487859182429.JavaMail.zimbra@redhat.com \
    --to=loberman@redhat.com \
    --cc=bblock@linux.vnet.ibm.vom \
    --cc=emilne@redhat.com \
    --cc=hare@suse.com \
    --cc=hare@suse.de \
    --cc=hch@lst.de \
    --cc=james.bottomley@hansenpartnership.com \
    --cc=linux-scsi@vger.kernel.org \
    --cc=maier@de.ibm.com \
    --cc=martin.petersen@oracle.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.