Re: [PATCH 2/2] scsi: Do not complete timed-out command

public inbox for linux-scsi@vger.kernel.org
 help / color / mirror / Atom feed

From: Ewan Milne <emilne@redhat.com>
To: Hannes Reinecke <hare@suse.de>
Cc: James Bottomley <jbottomley@parallels.com>,
	Christoph Hellwig <hch@infradead.org>,
	Bart vn Assche <bvanassche@acm.org>,
	linux-scsi@vger.kernel.org
Subject: Re: [PATCH 2/2] scsi: Do not complete timed-out command
Date: Mon, 16 Jun 2014 14:23:30 -0400	[thread overview]
Message-ID: <1402943010.6423.67.camel@localhost.localdomain> (raw)
In-Reply-To: <1402660906-74661-3-git-send-email-hare@suse.de>

On Fri, 2014-06-13 at 14:01 +0200, Hannes Reinecke wrote:
> There is a possible race between scsi_times_out() and scsi_done();
> the LLDD is only notified about a timed out command by calling
> scsi_try_to_abort_command(). Anytime before that the LLDD is
> free to complete the command via scsi_done().
> By this time, however, the command might've been timed-out already,
> leaving us with a stale command in the error handler.
> Fix this by separating out the two meanings of the
> SCSI_EH_ABORT_SCHEDULED flag; it's used to signal that
> a) the asynchronous abort has been schedule
> and
> b) that an asynchronous abourt had been running, the command
>    has been retried, and has timed out again.
> 
> This patch implements a new flag SCSI_EH_CMD_TIMEOUT to signal
> that a command has run into a timeout (and thus can be used
> to detect case b) above). The existing flag
> SCSI_EH_CMD_ABORT_SCHEDULED is now restricted to signal
> 'the asynchronous abort has been scheduled', and thus should be
> checked in scsi_done() to avoid the mentioned race.
> 
> Signed-off-by: Hannes Reinecke <hare@suse.de>
> ---
>  drivers/scsi/scsi.c       | 2 ++
>  drivers/scsi/scsi_error.c | 6 ++++--
>  drivers/scsi/scsi_priv.h  | 1 +
>  3 files changed, 7 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/scsi/scsi.c b/drivers/scsi/scsi.c
> index 88d46fe..7a58fbe 100644
> --- a/drivers/scsi/scsi.c
> +++ b/drivers/scsi/scsi.c
> @@ -739,6 +739,8 @@ int scsi_dispatch_cmd(struct scsi_cmnd *cmd)
>   */
>  static void scsi_done(struct scsi_cmnd *cmd)
>  {
> +	if (WARN_ON_ONCE(cmd->eh_eflags & SCSI_EH_ABORT_SCHEDULED))
> +		return;
>  	trace_scsi_dispatch_cmd_done(cmd);
>  	blk_complete_request(cmd->request);
>  }
> diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
> index ff176a6..cde9c11 100644
> --- a/drivers/scsi/scsi_error.c
> +++ b/drivers/scsi/scsi_error.c
> @@ -125,11 +125,13 @@ scmd_eh_abort_handler(struct work_struct *work)
>  			scmd_printk(KERN_INFO, scmd,
>  				    "scmd %p eh timeout, not aborting\n",
>  				    scmd));
> +		scmd->eh_eflags &= ~SCSI_EH_ABORT_SCHEDULED;
>  	} else {
>  		SCSI_LOG_ERROR_RECOVERY(3,
>  			scmd_printk(KERN_INFO, scmd,
>  				    "aborting command %p\n", scmd));
>  		rtn = scsi_try_to_abort_cmd(sdev->host->hostt, scmd);
> +		scmd->eh_eflags &= ~SCSI_EH_ABORT_SCHEDULED;
>  		if (rtn == SUCCESS) {
>  			scmd->result |= DID_TIME_OUT << 16;
>  			if (scsi_host_eh_past_deadline(sdev->host)) {
> @@ -185,17 +187,17 @@ scsi_abort_command(struct scsi_cmnd *scmd)
>  	struct Scsi_Host *shost = sdev->host;
>  	unsigned long flags;
>  
> -	if (scmd->eh_eflags & SCSI_EH_ABORT_SCHEDULED) {
> +	if (scmd->eh_eflags & SCSI_EH_CMD_TIMEOUT) {
>  		/*
>  		 * Retry after abort failed, escalate to next level.
>  		 */
> -		scmd->eh_eflags &= ~SCSI_EH_ABORT_SCHEDULED;
>  		SCSI_LOG_ERROR_RECOVERY(3,
>  			scmd_printk(KERN_INFO, scmd,
>  				    "scmd %p previous abort failed\n", scmd));
>  		cancel_delayed_work(&scmd->abort_work);
>  		return FAILED;
>  	}
> +	scmd->eh_eflags |= SCSI_EH_CMD_TIMEOUT;
>  
>  	/*
>  	 * Do not try a command abort if
> diff --git a/drivers/scsi/scsi_priv.h b/drivers/scsi/scsi_priv.h
> index 48e5b65..66b387d 100644
> --- a/drivers/scsi/scsi_priv.h
> +++ b/drivers/scsi/scsi_priv.h
> @@ -20,6 +20,7 @@ struct scsi_nl_hdr;
>   */
>  #define SCSI_EH_CANCEL_CMD	0x0001	/* Cancel this cmd */
>  #define SCSI_EH_ABORT_SCHEDULED	0x0002	/* Abort has been scheduled */
> +#define SCSI_EH_CMD_TIMEOUT	0x0004	/* Command has timed out */
>  
>  #define SCSI_SENSE_VALID(scmd) \
>  	(((scmd)->sense_buffer[0] & 0x70) == 0x70)

So, let me see if I follow the logic here...

This patch changes the meaning of the SCSI_EH_ABORT_SCHEDULED flag
so that this flag is set during the time the queue_delayed_work() item
is pending, until the LLD returns from the ->eh_abort_handler() call.
It also adds a new flag SCSI_EH_CMD_TIMEOUT which is set (essentially)
when the scmd has timed out, and prevent scsi_abort_command() from
doing anything more than once (until scmd->eh_eflags is reset).

I don't quite get why scsi_eh_scmd_add() tests SCSI_EH_ABORT_SCHEDULED:

>        if (shost->eh_deadline != -1 && !shost->last_reset)
>                shost->last_reset = jiffies;
>
>        ret = 1;
>        if (scmd->eh_eflags & SCSI_EH_ABORT_SCHEDULED)
>                eh_flag &= ~SCSI_EH_CANCEL_CMD;
>        scmd->eh_eflags |= eh_flag;
>        list_add_tail(&scmd->eh_entry, &shost->eh_cmd_q);
>        shost->host_failed++;
>        scsi_eh_wakeup(shost);

...since it seems like that flag wouldn't still be set by the time we
get to the point where we are adding the scmd to the eh_cmd_q list.

I'm also not sure why this case in scsi_decide_disposition():

>        case DID_ABORT:
>                if (scmd->eh_eflags & SCSI_EH_ABORT_SCHEDULED) {
>                        scmd->result |= DID_TIME_OUT << 16;
>                        return SUCCESS;
>                }

...is looking at SCSI_EH_ABORT_SCHEDULED, for similar reasons.

The WARN_ON_ONCE() case in scsi_done():

>static void scsi_done(struct scsi_cmnd *cmd)
>{
>        if (WARN_ON_ONCE(cmd->eh_eflags & SCSI_EH_ABORT_SCHEDULED))
>                return;
>        trace_scsi_dispatch_cmd_done(cmd);
>        blk_complete_request(cmd->request);
>}

...seems like it should never go off because of the REQ_ATOM_COMPLETE
logic in the block layer, assuming that logic is working.

Can you verify this is the intended behavior, or perhaps explain
what hole this patch is attempting to plug?  It seems like it is
dealing with the case of scsi_done() being called after a timeout
but before the delayed_work abort mechanism runs.  Is this actually
what is happening?

Reviewed-by: Ewan D. Milne <emilne@redhat.com>

next prev parent reply	other threads:[~2014-06-16 18:24 UTC|newest]

Thread overview: 11+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2014-06-13 12:01 [PATCH 0/2] scsi command timeout fixes Hannes Reinecke
2014-06-13 12:01 ` [PATCH 1/2] scsi_error: set DID_TIME_OUT correctly Hannes Reinecke
2014-06-13 14:32   ` Ewan Milne
2014-06-13 16:25     ` Christoph Hellwig
2014-06-13 12:01 ` [PATCH 2/2] scsi: Do not complete timed-out command Hannes Reinecke
2014-06-13 16:29   ` Christoph Hellwig
2014-06-16 18:23   ` Ewan Milne [this message]
2014-06-17  5:53     ` Hannes Reinecke
2014-06-19 14:43       ` Ewan Milne
2014-06-13 16:52 ` [PATCH 0/2] scsi command timeout fixes James Bottomley
2014-06-13 17:16   ` hch

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1402943010.6423.67.camel@localhost.localdomain \
    --to=emilne@redhat.com \
    --cc=bvanassche@acm.org \
    --cc=hare@suse.de \
    --cc=hch@infradead.org \
    --cc=jbottomley@parallels.com \
    --cc=linux-scsi@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox