Re: Make SCSI error handler code easier to understand

public inbox for linux-scsi@vger.kernel.org
 help / color / mirror / Atom feed

From: James Bottomley <jbottomley@parallels.com>
To: "joe.lawrence@stratus.com" <joe.lawrence@stratus.com>
Cc: "hch@infradead.org" <hch@infradead.org>,
	"bvanassche@acm.org" <bvanassche@acm.org>,
	"hare@suse.de" <hare@suse.de>,
	"axboe@kernel.dk" <axboe@kernel.dk>,
	"linux-scsi@vger.kernel.org" <linux-scsi@vger.kernel.org>,
	"william.kuzeja@stratus.com" <william.kuzeja@stratus.com>,
	"jdl1291@gmail.com" <jdl1291@gmail.com>,
	"pbonzini@redhat.com" <pbonzini@redhat.com>
Subject: Re: Make SCSI error handler code easier to understand
Date: Thu, 29 May 2014 11:33:59 +0000	[thread overview]
Message-ID: <1401363238.2163.5.camel@dabdike> (raw)
In-Reply-To: <20140528161528.55443865@jlaw-desktop.mno.stratus.com>

On Wed, 2014-05-28 at 16:15 -0400, Joe Lawrence wrote:
> On Mon, 26 May 2014 17:12:27 +0200
> Bart Van Assche <bvanassche@acm.org> wrote:
> 
> > Every now and then someone asks how it is avoided that the SCSI error
> > handler and the SCSI completion handler are invoked concurrently for
> > the same SCSI command. Hence this patch series that should make the SCSI
> > error handler code a little easier to understand.
> 
> Hi Bart,
> 
> We talked about REQ_ATOM_COMPLETE a while back (so I may be one of
> those people who periodically ask about SCSI error handling /
> completion), and I just thought I'd chime in here to clarify the
> scenario we were worried about.
> 
> Before d555a2ab "[SCSI] Fix spurious request sense in error handling"
> we saw the following sequence of events:
> 
>   [ 1394.345991] sd 2:0:1:0: [sdw] Done:
>   [ 1394.361790] TIMEOUT
>   [ 1394.374148] sd 2:0:1:0: [sdw]
>   [ 1394.388775] Result: hostbyte=DID_ERROR driverbyte=DRIVER_OK
>   [ 1394.409068] sd 2:0:1:0: [sdw] CDB:
>   [ 1394.424782] Read(10): 28 00 00 00 00 18 00 00 08 00
>   [ 1394.443772] sd 2:0:1:0: [sdw] scmd ffff88104dce02c0 abort scheduled
>   [ 1394.474026] sd 2:0:1:0: [sdw] aborting command ffff88104dce02c0
>   [ 1394.573338] qla2xxx [0000:46:00.0]-801c:2: Abort command issued nexus=2:1:0 --  1 2002.
>   [ 1394.609313] sd 2:0:1:0: [sdw] scmd ffff88104dce02c0 retry aborted command
>   
>   [ 1425.397434] sd 2:0:1:0: [sdw] Done:
>   [ 1425.417802] TIMEOUT
>   [ 1425.431914] sd 2:0:1:0: [sdw]
>   [ 1425.448247] Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK
>   [ 1425.470736] sd 2:0:1:0: [sdw] CDB:
>   [ 1425.488005] Read(10): 28 00 00 00 00 18 00 00 08 00
>   [ 1425.508472] sd 2:0:1:0: [sdw] scmd ffff88104dce02c0 previous abort failed
>   [ 1425.533299] scsi_eh_2: waking up 0/1/1
>   [ 1425.551189] sd 2:0:1:0: scsi_eh_prt_fail_stats: cmds failed: 1, cancel: 0
>   [ 1425.575897] Total of 1 commands on 1 devices require eh work
>   [ 1425.597970] sd 2:0:1:0: [sdw] scsi_eh_2: requesting sense
> 
>   ie, something like:
> 
>   CMD -> .queuecommand ... timeout
>   ABORT -> .eh_abort_handler
>   CMD (retry) -> .queuecommand ... timeout
>   ABORT -> scsi_eh_scmd_add(scmd, SCSI_EH_CANCEL_CMD)
>   ... into scsi_error_handler ...
>   REQUEST SENSE -> .queuecommand
> 
> We eventually found our way into scsi_unjam_host, which "*knows*
> that all commands on the bus have either completed, failed or timed
> out.", and then scsi_send_eh_cmnd, which "hijacks" a SCSI command
> structure.  Presumably the second comment follows from the first.
> 
> But in the case of timeout, when was the LLDD informed to abort or
> forget about that scsi command structure?

When we get a successful abort, the command is back with the mid layer,
or when we complete a reset (all commands should now be taken from the
LLD).  We don't start trying to probe the device with a TUR until after
the command should be back with the mid layer.

> In our testing, it appeared that the LLDD's .queuecommand was called
> back-to-back with the same scsi_cmnd pointer.  The driver in question
> was qla2xxx, who keeps a driver-private structure to scsi_cmnd
> mapping (assumed to be one-to-one) that got confused by this sequence
> of events.

This was the bug in ACA emulation which the fix corrected.  After that
fix, we should only reuse the command for ACA emulation *if* we get a
successful return code indicating sense should be collected (meaning the
LLD relinquished the command).

> After d555a2ab, in this scenario scsi_unjam_host skips the request
> sense and escalates to the next level (scsi_eh_abort_cmds), skipping
> straight to the abort as we would have expected...  This avoids the
> scsi_cmnd "hijack"... but was it ever safe to do so in the first place?

The hijack is safe provided the LLD has relinquished the command, which
it does after status return, abort or reset.

> Or should LLDD expect to gracefully handle the same scsi_cmnd pointer
> coming into .queuecommand w/o an intermediate abort or other
> scsi_host_template callback?

No.

James

     prev parent reply	other threads:[~2014-05-29 11:34 UTC|newest]

Thread overview: 31+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2014-05-26 15:12 Make SCSI error handler code easier to understand Bart Van Assche
2014-05-26 15:14 ` [PATCH 1/3] Remove two cancel_delayed_work() calls from the error handler Bart Van Assche
2014-05-26 15:15   ` [PATCH 2/3] block: Introduce blk_rq_completed() Bart Van Assche
2014-05-26 15:27     ` James Bottomley
2014-05-27  7:49       ` Bart Van Assche
2014-05-27  7:52         ` hch
2014-05-27  8:00           ` James Bottomley
2014-05-27  8:23         ` James Bottomley
2014-05-27  9:00           ` Bart Van Assche
2014-05-27 10:21             ` James Bottomley
2014-05-27 10:47               ` Paolo Bonzini
2014-05-27 10:59                 ` James Bottomley
2014-05-27 11:13                   ` Paolo Bonzini
2014-05-27 11:26                     ` James Bottomley
2014-05-27 11:52                       ` Paolo Bonzini
2014-05-27 11:57                         ` James Bottomley
2014-05-27  5:40     ` Hannes Reinecke
2014-05-26 15:23   ` [PATCH 1/3] Remove two cancel_delayed_work() calls from the error handler Paolo Bonzini
2014-05-26 15:25     ` James Bottomley
2014-05-27  8:06     ` Bart Van Assche
2014-05-27  8:09       ` James Bottomley
2014-05-27  8:36         ` Bart Van Assche
2014-05-27  8:56           ` James Bottomley
2014-05-27  9:06             ` Paolo Bonzini
2014-05-27  5:40   ` Hannes Reinecke
2014-05-27  6:08     ` Bart Van Assche
2014-05-27  6:22       ` Hannes Reinecke
2014-05-26 15:15 ` [PATCH 3/3] Make SCSI error handler code easier to understand Bart Van Assche
2014-05-27  5:42   ` Hannes Reinecke
2014-05-28 20:15 ` Joe Lawrence
2014-05-29 11:33   ` James Bottomley [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1401363238.2163.5.camel@dabdike \
    --to=jbottomley@parallels.com \
    --cc=axboe@kernel.dk \
    --cc=bvanassche@acm.org \
    --cc=hare@suse.de \
    --cc=hch@infradead.org \
    --cc=jdl1291@gmail.com \
    --cc=joe.lawrence@stratus.com \
    --cc=linux-scsi@vger.kernel.org \
    --cc=pbonzini@redhat.com \
    --cc=william.kuzeja@stratus.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox