dangling pointers and/or reentrancy in scmd_eh_abort

linux-scsi.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* dangling pointers and/or reentrancy in scmd_eh_abort_handler?
@ 2014-05-19 14:08 Paolo Bonzini
  2014-05-19 15:08 ` Bart Van Assche
  2014-05-20  8:46 ` Bart Van Assche
  0 siblings, 2 replies; 13+ messages in thread
From: Paolo Bonzini @ 2014-05-19 14:08 UTC (permalink / raw)
  To: linux-scsi, Bart Van Assche, Ulrich Obergfell

Hi all,

I'm trying to understand asynchronous abort in the current upstream 
code, and the code seems to have some dubious locking.

Here are some examples of the issue:

1) dangling pointers: scsi_put_command calls cancel_delayed_work(), but 
that doesn't mean that the scmd_eh_abort_handler couldn't be already 
running.  If the scmd_eh_abort_handler starts while the softirq handler 
is calling scsi_put_command (e.g. scsi_finish_command -> 
scsi_io_completion -> scsi_end_request -> scsi_next_command), the 
pointer to the Scsi_Cmnd* becomes invalid in the middle of the abort 
handler.

2) reentrancy: the softirq handler and scmd_eh_abort_handler can run 
concurrently, and call scsi_finish_command without any lock protecting 
the calls.  You can then get memory corruption.

I don't have any reproducer for this; we're seeing related crashes in 
virtio-scsi EH but those are due to a bug in the driver.  But it means 
that I have no sensible way to write the eh_abort_handler.

Example (1) means that the eh_abort_handler cannot use the passed 
Scsi_Cmnd, because it might not even be valid when entering the 
eh_abort_handler.  Example (2) means that the eh_abort_handler cannot 
return SUCCESS if it detects that the command has been completed in the 
meanwhile.

Paolo

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: dangling pointers and/or reentrancy in scmd_eh_abort_handler?
  2014-05-19 14:08 dangling pointers and/or reentrancy in scmd_eh_abort_handler? Paolo Bonzini
@ 2014-05-19 15:08 ` Bart Van Assche
  2014-05-19 15:25   ` Christoph Hellwig
  2014-05-19 16:09   ` Paolo Bonzini
  2014-05-20  8:46 ` Bart Van Assche
  1 sibling, 2 replies; 13+ messages in thread
From: Bart Van Assche @ 2014-05-19 15:08 UTC (permalink / raw)
  To: Paolo Bonzini, linux-scsi, Ulrich Obergfell

On 05/19/14 16:08, Paolo Bonzini wrote:
> 2) reentrancy: the softirq handler and scmd_eh_abort_handler can run
> concurrently, and call scsi_finish_command without any lock protecting
> the calls.  You can then get memory corruption.

I'm not sure what the recommended approach is to address this race. But
it is possible to address this in the LLD. See e.g. the srp_claim_req()
function in the SRP LLD and how it is invoked from the reply handler,
the abort handler and the reset handlers in that LLD.

Bart.


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: dangling pointers and/or reentrancy in scmd_eh_abort_handler?
  2014-05-19 15:08 ` Bart Van Assche
@ 2014-05-19 15:25   ` Christoph Hellwig
  2014-05-19 16:09   ` Paolo Bonzini
  1 sibling, 0 replies; 13+ messages in thread
From: Christoph Hellwig @ 2014-05-19 15:25 UTC (permalink / raw)
  To: Bart Van Assche; +Cc: Paolo Bonzini, linux-scsi, Ulrich Obergfell

On Mon, May 19, 2014 at 05:08:56PM +0200, Bart Van Assche wrote:
> On 05/19/14 16:08, Paolo Bonzini wrote:
> > 2) reentrancy: the softirq handler and scmd_eh_abort_handler can run
> > concurrently, and call scsi_finish_command without any lock protecting
> > the calls.  You can then get memory corruption.
> 
> I'm not sure what the recommended approach is to address this race. But
> it is possible to address this in the LLD. See e.g. the srp_claim_req()
> function in the SRP LLD and how it is invoked from the reply handler,
> the abort handler and the reset handlers in that LLD.

blk-mq triest to solve this a test_and_set_bit for a completion flag
at the block layer for completions vs timeouts.  I think doing this in
the SCSI layer as well would be very useful as we can't expect Joe
Random Driver Developer to get it right in every driver.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: dangling pointers and/or reentrancy in scmd_eh_abort_handler?
  2014-05-19 15:08 ` Bart Van Assche
  2014-05-19 15:25   ` Christoph Hellwig
@ 2014-05-19 16:09   ` Paolo Bonzini
  2014-05-19 16:43     ` Bart Van Assche
  1 sibling, 1 reply; 13+ messages in thread
From: Paolo Bonzini @ 2014-05-19 16:09 UTC (permalink / raw)
  To: Bart Van Assche, linux-scsi, Ulrich Obergfell

Il 19/05/2014 17:08, Bart Van Assche ha scritto:
> On 05/19/14 16:08, Paolo Bonzini wrote:
>> 2) reentrancy: the softirq handler and scmd_eh_abort_handler can run
>> concurrently, and call scsi_finish_command without any lock protecting
>> the calls.  You can then get memory corruption.
>
> I'm not sure what the recommended approach is to address this race. But
> it is possible to address this in the LLD. See e.g. the srp_claim_req()
> function in the SRP LLD and how it is invoked from the reply handler,
> the abort handler and the reset handlers in that LLD.

That's not enough, unless I'm missing something.  Say the request 
handler claims the request and the abort handler doesn't:

- the request handler calls scsi_done and ends up in scsi_finish_command.

- the abort handler will return SUCCESS, and scmd_eh_abort_handler then 
calls scsi_finish_command.

Paolo

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: dangling pointers and/or reentrancy in scmd_eh_abort_handler?
  2014-05-19 16:09   ` Paolo Bonzini
@ 2014-05-19 16:43     ` Bart Van Assche
  2014-05-20  7:32       ` Bart Van Assche
  2014-05-21 14:16       ` Mark Wu
  0 siblings, 2 replies; 13+ messages in thread
From: Bart Van Assche @ 2014-05-19 16:43 UTC (permalink / raw)
  To: Paolo Bonzini, linux-scsi, Ulrich Obergfell

On 05/19/14 18:09, Paolo Bonzini wrote:
> Il 19/05/2014 17:08, Bart Van Assche ha scritto:
>> On 05/19/14 16:08, Paolo Bonzini wrote:
>>> 2) reentrancy: the softirq handler and scmd_eh_abort_handler can run
>>> concurrently, and call scsi_finish_command without any lock protecting
>>> the calls.  You can then get memory corruption.
>>
>> I'm not sure what the recommended approach is to address this race. But
>> it is possible to address this in the LLD. See e.g. the srp_claim_req()
>> function in the SRP LLD and how it is invoked from the reply handler,
>> the abort handler and the reset handlers in that LLD.
> 
> That's not enough, unless I'm missing something.  Say the request
> handler claims the request and the abort handler doesn't:
> 
> - the request handler calls scsi_done and ends up in scsi_finish_command.
> 
> - the abort handler will return SUCCESS, and scmd_eh_abort_handler then
> calls scsi_finish_command.

It depends on how the SCSI abort handler gets invoked. If the SCSI abort
handler gets invoked because a SCSI command timed out that means that
the block layer has already detected a timeout and also that the
REQ_ATOM_COMPLETE bit has already been set. In this scenario if a SCSI
LLD invokes scsi_done() that causes blk_complete_request() to return
without invoking __blk_complete_request() and hence without invoking
scsi_softirq_done().

Bart.


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: dangling pointers and/or reentrancy in scmd_eh_abort_handler?
  2014-05-19 16:43     ` Bart Van Assche
@ 2014-05-20  7:32       ` Bart Van Assche
  2014-05-20  8:10         ` Bart Van Assche
  2014-05-21 14:16       ` Mark Wu
  1 sibling, 1 reply; 13+ messages in thread
From: Bart Van Assche @ 2014-05-20  7:32 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: linux-scsi, Ulrich Obergfell

On 05/19/14 18:43, Bart Van Assche wrote:
> On 05/19/14 18:09, Paolo Bonzini wrote:
>> Il 19/05/2014 17:08, Bart Van Assche ha scritto:
>>> On 05/19/14 16:08, Paolo Bonzini wrote:
>>>> 2) reentrancy: the softirq handler and scmd_eh_abort_handler can run
>>>> concurrently, and call scsi_finish_command without any lock protecting
>>>> the calls.  You can then get memory corruption.
>>>
>>> I'm not sure what the recommended approach is to address this race. But
>>> it is possible to address this in the LLD. See e.g. the srp_claim_req()
>>> function in the SRP LLD and how it is invoked from the reply handler,
>>> the abort handler and the reset handlers in that LLD.
>>
>> That's not enough, unless I'm missing something.  Say the request
>> handler claims the request and the abort handler doesn't:
>>
>> - the request handler calls scsi_done and ends up in scsi_finish_command.
>>
>> - the abort handler will return SUCCESS, and scmd_eh_abort_handler then
>> calls scsi_finish_command.
> 
> It depends on how the SCSI abort handler gets invoked. If the SCSI abort
> handler gets invoked because a SCSI command timed out that means that
> the block layer has already detected a timeout and also that the
> REQ_ATOM_COMPLETE bit has already been set. In this scenario if a SCSI
> LLD invokes scsi_done() that causes blk_complete_request() to return
> without invoking __blk_complete_request() and hence without invoking
> scsi_softirq_done().

(replying to my own e-mail)

Please note that scsi_eh_abort_cmds() neither checks nor sets the
REQ_ATOM_COMPLETE bit before it invokes hostt->eh_abort_handler(). Would
it make sense to modify that function such that it invokes
blk_abort_request() instead ? That last function atomically
test-and-sets the REQ_ATOM_COMPLETE bit before invoking the timeout handler.

Bart.


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: dangling pointers and/or reentrancy in scmd_eh_abort_handler?
  2014-05-20  7:32       ` Bart Van Assche
@ 2014-05-20  8:10         ` Bart Van Assche
  2014-05-20  8:40           ` Paolo Bonzini
  0 siblings, 1 reply; 13+ messages in thread
From: Bart Van Assche @ 2014-05-20  8:10 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: linux-scsi, Ulrich Obergfell

On 05/20/14 09:32, Bart Van Assche wrote:
> On 05/19/14 18:43, Bart Van Assche wrote:
>> On 05/19/14 18:09, Paolo Bonzini wrote:
>>> Il 19/05/2014 17:08, Bart Van Assche ha scritto:
>>>> On 05/19/14 16:08, Paolo Bonzini wrote:
>>>>> 2) reentrancy: the softirq handler and scmd_eh_abort_handler can run
>>>>> concurrently, and call scsi_finish_command without any lock protecting
>>>>> the calls.  You can then get memory corruption.
>>>>
>>>> I'm not sure what the recommended approach is to address this race. But
>>>> it is possible to address this in the LLD. See e.g. the srp_claim_req()
>>>> function in the SRP LLD and how it is invoked from the reply handler,
>>>> the abort handler and the reset handlers in that LLD.
>>>
>>> That's not enough, unless I'm missing something.  Say the request
>>> handler claims the request and the abort handler doesn't:
>>>
>>> - the request handler calls scsi_done and ends up in scsi_finish_command.
>>>
>>> - the abort handler will return SUCCESS, and scmd_eh_abort_handler then
>>> calls scsi_finish_command.
>>
>> It depends on how the SCSI abort handler gets invoked. If the SCSI abort
>> handler gets invoked because a SCSI command timed out that means that
>> the block layer has already detected a timeout and also that the
>> REQ_ATOM_COMPLETE bit has already been set. In this scenario if a SCSI
>> LLD invokes scsi_done() that causes blk_complete_request() to return
>> without invoking __blk_complete_request() and hence without invoking
>> scsi_softirq_done().
> 
> (replying to my own e-mail)
> 
> Please note that scsi_eh_abort_cmds() neither checks nor sets the
> REQ_ATOM_COMPLETE bit before it invokes hostt->eh_abort_handler(). Would
> it make sense to modify that function such that it invokes
> blk_abort_request() instead ? That last function atomically
> test-and-sets the REQ_ATOM_COMPLETE bit before invoking the timeout handler.

(answering my own question)

REQ_ATOM_COMPLETE is already set before scsi_eh_scmd_add() is called
since that function is only invoked after the block layer has marked a
request as "complete". The only callers of scsi_eh_scmd_add() are
scsi_softirq_done(), scsi_times_out() and scmd_eh_abort_handler(). That
last function is invoked (indirectly) by scsi_times_out().

Bart.


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: dangling pointers and/or reentrancy in scmd_eh_abort_handler?
  2014-05-20  8:10         ` Bart Van Assche
@ 2014-05-20  8:40           ` Paolo Bonzini
  0 siblings, 0 replies; 13+ messages in thread
From: Paolo Bonzini @ 2014-05-20  8:40 UTC (permalink / raw)
  To: Bart Van Assche; +Cc: linux-scsi, Ulrich Obergfell

Il 20/05/2014 10:10, Bart Van Assche ha scritto:
> REQ_ATOM_COMPLETE is already set before scsi_eh_scmd_add() is called
> since that function is only invoked after the block layer has marked a
> request as "complete". The only callers of scsi_eh_scmd_add() are
> scsi_softirq_done(), scsi_times_out() and scmd_eh_abort_handler(). That
> last function is invoked (indirectly) by scsi_times_out().

Yes, and answering my own question, you cannot have a dangling pointer 
in eh_abort_handler (unless you have a driver bug).  This is because 
once eh_abort_handler is called, you know the interrupt handler will not 
trigger the softirq and scsi_finish_command won't be called.

Paolo

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: dangling pointers and/or reentrancy in scmd_eh_abort_handler?
  2014-05-19 14:08 dangling pointers and/or reentrancy in scmd_eh_abort_handler? Paolo Bonzini
  2014-05-19 15:08 ` Bart Van Assche
@ 2014-05-20  8:46 ` Bart Van Assche
  1 sibling, 0 replies; 13+ messages in thread
From: Bart Van Assche @ 2014-05-20  8:46 UTC (permalink / raw)
  To: Hannes Reinecke; +Cc: Paolo Bonzini, linux-scsi, Ulrich Obergfell

On 05/19/14 16:08, Paolo Bonzini wrote:
> 1) dangling pointers: scsi_put_command calls cancel_delayed_work(), but
> that doesn't mean that the scmd_eh_abort_handler couldn't be already
> running.  If the scmd_eh_abort_handler starts while the softirq handler
> is calling scsi_put_command (e.g. scsi_finish_command ->
> scsi_io_completion -> scsi_end_request -> scsi_next_command), the
> pointer to the Scsi_Cmnd* becomes invalid in the middle of the abort
> handler.

Hannes, can you clarify why a cancel_delayed_work() statement was added
in scsi_put_command() ? How can scsi_put_command() get invoked after the
SCSI timeout handler queued &scmd->abort_work and before the function
associated with that work struct (scmd_eh_abort_handler()) is called ?

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: dangling pointers and/or reentrancy in scmd_eh_abort_handler?
  2014-05-19 16:43     ` Bart Van Assche
  2014-05-20  7:32       ` Bart Van Assche
@ 2014-05-21 14:16       ` Mark Wu
  2014-05-21 20:34         ` Paolo Bonzini
  1 sibling, 1 reply; 13+ messages in thread
From: Mark Wu @ 2014-05-21 14:16 UTC (permalink / raw)
  To: linux-scsi

Is it possible that when scsi_done is invoked, the scsi command or the 
request has been freed and then reallocated for a new I/O request?  Because 
of this the bit flag REQ_ATOM_COMPLETE becomes unreliable.  Thanks!

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: dangling pointers and/or reentrancy in scmd_eh_abort_handler?
  2014-05-21 14:16       ` Mark Wu
@ 2014-05-21 20:34         ` Paolo Bonzini
  2014-05-23  1:28           ` Elliott, Robert (Server Storage)
  0 siblings, 1 reply; 13+ messages in thread
From: Paolo Bonzini @ 2014-05-21 20:34 UTC (permalink / raw)
  To: Mark Wu, linux-scsi

Il 21/05/2014 16:16, Mark Wu ha scritto:
> Is it possible that when scsi_done is invoked, the scsi command or the
> request has been freed and then reallocated for a new I/O request?  Because
> of this the bit flag REQ_ATOM_COMPLETE becomes unreliable.  Thanks!

It is up to the driver to ensure that the interrupt handler will not 
process the Scsi_Cmnd* after returning from the eh_abort_handler.  If 
you do this, what you say cannot happen.  Otherwise you'll get a variety 
of failures, the most common of which for me are OOPSes and a BUG in 
blk_finish_request.

Paolo

^ permalink raw reply	[flat|nested] 13+ messages in thread

* RE: dangling pointers and/or reentrancy in scmd_eh_abort_handler?
  2014-05-21 20:34         ` Paolo Bonzini
@ 2014-05-23  1:28           ` Elliott, Robert (Server Storage)
  2014-05-23  9:22             ` Paolo Bonzini
  0 siblings, 1 reply; 13+ messages in thread
From: Elliott, Robert (Server Storage) @ 2014-05-23  1:28 UTC (permalink / raw)
  To: Paolo Bonzini, Mark Wu, linux-scsi@vger.kernel.org

> -----Original Message-----
> From: linux-scsi-owner@vger.kernel.org [mailto:linux-scsi-
> owner@vger.kernel.org] On Behalf Of Paolo Bonzini
> Sent: Wednesday, 21 May, 2014 3:34 PM
> To: Mark Wu; linux-scsi@vger.kernel.org
> Subject: Re: dangling pointers and/or reentrancy in
> scmd_eh_abort_handler?
> 
> Il 21/05/2014 16:16, Mark Wu ha scritto:
> > Is it possible that when scsi_done is invoked, the scsi command or the
> > request has been freed and then reallocated for a new I/O request?
> Because
> > of this the bit flag REQ_ATOM_COMPLETE becomes unreliable.  Thanks!
> 
> It is up to the driver to ensure that the interrupt handler will not
> process the Scsi_Cmnd* after returning from the eh_abort_handler.  If
> you do this, what you say cannot happen.  Otherwise you'll get a variety
> of failures, the most common of which for me are OOPSes and a BUG in
> blk_finish_request.
> 
> Paolo

Aborts don't always work; scsi_eh_abort_handler cannot promise that 
it aborted the command (unless it escalated to sending resets.)
When the abort fails, scmd_eh_abort_handler just leaves it
outstanding, and the error handler thread (scsi_error_handler
function) must deal with it.

If the abort succeeds, then scmd_eh_abort_handler calls
scsi_finish_command.

In SCSI terms, FUNCTION COMPLETE for ABORT TASK doesn't mean the 
command was present and aborted and won't be sending back status; 
if the device server already sent status for the command, the 
task manager also sends FUNCTION COMPLETE.

Depending on the transport, there may be a race condition between 
the command status and the ABORT TASK response; the ABORT TASK 
response might get back first.  I think that means 
scsi_eh_abort_handler's call to scsi_finish_command could
happen during or after scsi_softirq_done has called
scsi_finish_command.

---
Rob Elliott    HP Server Storage

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: dangling pointers and/or reentrancy in scmd_eh_abort_handler?
  2014-05-23  1:28           ` Elliott, Robert (Server Storage)
@ 2014-05-23  9:22             ` Paolo Bonzini
  0 siblings, 0 replies; 13+ messages in thread
From: Paolo Bonzini @ 2014-05-23  9:22 UTC (permalink / raw)
  To: Elliott, Robert (Server Storage), Mark Wu,
	linux-scsi@vger.kernel.org

Il 23/05/2014 03:28, Elliott, Robert (Server Storage) ha scritto:
> Depending on the transport, there may be a race condition between
> the command status and the ABORT TASK response; the ABORT TASK
> response might get back first.  I think that means
> scsi_eh_abort_handler's call to scsi_finish_command could
> happen during or after scsi_softirq_done has called
> scsi_finish_command.

That was my doubt, in fact.  But actually Bart explained that this is 
not the case.  Either scsi_eh_abort_handler will be called via the work 
item, or the command will never be sent to the softirq.

Paolo


^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2014-05-23  9:22 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2014-05-19 14:08 dangling pointers and/or reentrancy in scmd_eh_abort_handler? Paolo Bonzini
2014-05-19 15:08 ` Bart Van Assche
2014-05-19 15:25   ` Christoph Hellwig
2014-05-19 16:09   ` Paolo Bonzini
2014-05-19 16:43     ` Bart Van Assche
2014-05-20  7:32       ` Bart Van Assche
2014-05-20  8:10         ` Bart Van Assche
2014-05-20  8:40           ` Paolo Bonzini
2014-05-21 14:16       ` Mark Wu
2014-05-21 20:34         ` Paolo Bonzini
2014-05-23  1:28           ` Elliott, Robert (Server Storage)
2014-05-23  9:22             ` Paolo Bonzini
2014-05-20  8:46 ` Bart Van Assche

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).