* Re: [PATCH v7 3/5] blk-mq: Fix races between iterating over requests and freeing requests
[not found] ` <32a121b7-2444-ac19-420d-4961f2a18129@acm.org>
@ 2021-04-23 3:52 ` Ming Lei
2021-04-23 17:52 ` Bart Van Assche
0 siblings, 1 reply; 7+ messages in thread
From: Ming Lei @ 2021-04-23 3:52 UTC (permalink / raw)
To: Bart Van Assche
Cc: Jens Axboe, linux-block, Christoph Hellwig, Daniel Wagner,
Khazhismel Kumykov, Shin'ichiro Kawasaki, Martin K . Petersen,
Hannes Reinecke, Johannes Thumshirn, John Garry, linux-scsi
On Thu, Apr 22, 2021 at 08:51:06AM -0700, Bart Van Assche wrote:
> On 4/22/21 12:13 AM, Ming Lei wrote:
> > On Wed, Apr 21, 2021 at 08:54:30PM -0700, Bart Van Assche wrote:
> >> On 4/21/21 8:15 PM, Ming Lei wrote:
> >>> On Tue, Apr 20, 2021 at 05:02:33PM -0700, Bart Van Assche wrote:
> >>>> +static bool bt_tags_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
> >>>> +{
> >>>> + struct bt_tags_iter_data *iter_data = data;
> >>>> + struct blk_mq_tags *tags = iter_data->tags;
> >>>> + bool res;
> >>>> +
> >>>> + if (iter_data->flags & BT_TAG_ITER_MAY_SLEEP) {
> >>>> + down_read(&tags->iter_rwsem);
> >>>> + res = __bt_tags_iter(bitmap, bitnr, data);
> >>>> + up_read(&tags->iter_rwsem);
> >>>> + } else {
> >>>> + rcu_read_lock();
> >>>> + res = __bt_tags_iter(bitmap, bitnr, data);
> >>>> + rcu_read_unlock();
> >>>> + }
> >>>> +
> >>>> + return res;
> >>>> +}
> >>>
> >>> Holding the rwsem or the RCU read lock won't avoid the issue completely,
> >>> because the request may be completed remotely in iter_data->fn(), e.g. in
> >>> nbd_clear_req(), nvme_cancel_request(), complete_all_cmds_iter() or
> >>> mtip_no_dev_cleanup(): blk_mq_complete_request() may complete the request
> >>> in softirq context, via remote IPI, or even in a workqueue, and the request
> >>> is still referenced in these contexts after bt_tags_iter() returns.
> >>
> >> The rwsem and RCU read lock are used to serialize iterating over
> >> requests against blk_mq_sched_free_requests() calls. I don't think it
> >> matters for this patch from which context requests are freed.
> >
> > Requests can still be referenced in other contexts after blk_mq_wait_for_tag_iter()
> > returns, and freeing the request pool follows that. So a use-after-free still exists, doesn't it?
>
> The request pool should only be freed after it has been guaranteed that
> all pending requests have finished and also that no new requests will be
> started. This patch series adds two blk_mq_wait_for_tag_iter() calls.
> Both calls happen while the queue is frozen so I don't think that the
> issue mentioned in your email can happen.
For example, consider aacraid's normal completion racing with a reset together
with an elevator switch. aacraid is a single-queue SCSI HBA, and its requests
are completed asynchronously via IPI or softirq; that is, a request isn't
really completed when blk_mq_complete_request() returns.

1) An interrupt comes in, and request A is completed via blk_mq_complete_request()
from aacraid's interrupt handler via ->scsi_done().

2) _aac_reset_adapter() runs because of a reset event, which can be triggered
by a sysfs store or whatever. The IRQ is drained in _aac_reset_adapter(), so
the blk_mq_complete_request(request A) call from aacraid's IRQ context has
returned, but request A has only been scheduled to be completed via IPI or
softirq asynchronously; it isn't really done yet.

3) scsi_host_complete_all_commands() is called from _aac_reset_adapter() to
fail all pending requests. Request A is still visible in
scsi_host_complete_all_commands() because its tag hasn't been freed yet. But the
tag and request A can be completed and freed right after scsi_host_complete_all_commands()
reads ->rqs[bitnr] in bt_tags_iter(), which calls complete_all_cmds_iter()
-> .scsi_done() -> blk_mq_complete_request(), so the same request A is
scheduled again via IPI or softirq and added to the IPI or softirq list.

4) Meanwhile request A is freed by the normal completion triggered by the
interrupt. A pending elevator switch can move on since request A drops the last
reference, and bt_tags_iter() returns on the reset path, so
blk_mq_wait_for_tag_iter() can return too; the whole scheduler request pool is
then freed.

5) Request A on the IPI/softirq list scheduled from _aac_reset_adapter() is
read, and a UAF is triggered.

Drivers are supposed to cover normal completion vs. error handling, but with
respect to remote completion I am not sure a driver is capable of covering that.
Thanks,
Ming
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH v7 3/5] blk-mq: Fix races between iterating over requests and freeing requests
2021-04-23 3:52 ` [PATCH v7 3/5] blk-mq: Fix races between iterating over requests and freeing requests Ming Lei
@ 2021-04-23 17:52 ` Bart Van Assche
2021-04-25 0:09 ` Ming Lei
0 siblings, 1 reply; 7+ messages in thread
From: Bart Van Assche @ 2021-04-23 17:52 UTC (permalink / raw)
To: Ming Lei
Cc: Jens Axboe, linux-block, Christoph Hellwig, Daniel Wagner,
Khazhismel Kumykov, Shin'ichiro Kawasaki, Martin K . Petersen,
Hannes Reinecke, Johannes Thumshirn, John Garry, linux-scsi
On 4/22/21 8:52 PM, Ming Lei wrote:
> For example, consider aacraid's normal completion racing with a reset together
> with an elevator switch. aacraid is a single-queue SCSI HBA, and its requests
> are completed asynchronously via IPI or softirq; that is, a request isn't
> really completed when blk_mq_complete_request() returns.
>
> 1) An interrupt comes in, and request A is completed via blk_mq_complete_request()
> from aacraid's interrupt handler via ->scsi_done().
>
> 2) _aac_reset_adapter() runs because of a reset event, which can be triggered
> by a sysfs store or whatever. The IRQ is drained in _aac_reset_adapter(), so
> the blk_mq_complete_request(request A) call from aacraid's IRQ context has
> returned, but request A has only been scheduled to be completed via IPI or
> softirq asynchronously; it isn't really done yet.
>
> 3) scsi_host_complete_all_commands() is called from _aac_reset_adapter() to
> fail all pending requests. Request A is still visible in
> scsi_host_complete_all_commands() because its tag hasn't been freed yet. But the
> tag and request A can be completed and freed right after scsi_host_complete_all_commands()
> reads ->rqs[bitnr] in bt_tags_iter(), which calls complete_all_cmds_iter()
> -> .scsi_done() -> blk_mq_complete_request(), so the same request A is
> scheduled again via IPI or softirq and added to the IPI or softirq list.
>
> 4) Meanwhile request A is freed by the normal completion triggered by the
> interrupt. A pending elevator switch can move on since request A drops the last
> reference, and bt_tags_iter() returns on the reset path, so
> blk_mq_wait_for_tag_iter() can return too; the whole scheduler request pool is
> then freed.
>
> 5) Request A on the IPI/softirq list scheduled from _aac_reset_adapter() is
> read, and a UAF is triggered.
>
> Drivers are supposed to cover normal completion vs. error handling, but with
> respect to remote completion I am not sure a driver is capable of covering that.
Hi Ming,
I agree that the scenario above can happen and also that a fix is
needed. However, I do not agree that this should be fixed by modifying
the tag iteration functions. I see scsi_host_complete_all_commands() as
a workaround for broken storage controllers - storage controllers that
do not have a facility for terminating all pending commands. NVMe
controllers can be told to terminate all pending I/O commands by e.g.
deleting all I/O submission queues. Many SCSI controllers also provide a
facility for aborting all pending commands. For SCSI controllers that do
not provide such a facility I propose to fix races like the race
described above in the SCSI LLD instead of modifying the block layer tag
iteration functions.
Thanks,
Bart.
* Re: [PATCH v7 3/5] blk-mq: Fix races between iterating over requests and freeing requests
2021-04-23 17:52 ` Bart Van Assche
@ 2021-04-25 0:09 ` Ming Lei
2021-04-25 21:01 ` Bart Van Assche
2021-04-26 16:29 ` Bart Van Assche
0 siblings, 2 replies; 7+ messages in thread
From: Ming Lei @ 2021-04-25 0:09 UTC (permalink / raw)
To: Bart Van Assche
Cc: Jens Axboe, linux-block, Christoph Hellwig, Daniel Wagner,
Khazhismel Kumykov, Shin'ichiro Kawasaki, Martin K . Petersen,
Hannes Reinecke, Johannes Thumshirn, John Garry, linux-scsi
On Fri, Apr 23, 2021 at 10:52:43AM -0700, Bart Van Assche wrote:
> On 4/22/21 8:52 PM, Ming Lei wrote:
> > For example, consider aacraid's normal completion racing with a reset together
> > with an elevator switch. aacraid is a single-queue SCSI HBA, and its requests
> > are completed asynchronously via IPI or softirq; that is, a request isn't
> > really completed when blk_mq_complete_request() returns.
> >
> > 1) An interrupt comes in, and request A is completed via blk_mq_complete_request()
> > from aacraid's interrupt handler via ->scsi_done().
> >
> > 2) _aac_reset_adapter() runs because of a reset event, which can be triggered
> > by a sysfs store or whatever. The IRQ is drained in _aac_reset_adapter(), so
> > the blk_mq_complete_request(request A) call from aacraid's IRQ context has
> > returned, but request A has only been scheduled to be completed via IPI or
> > softirq asynchronously; it isn't really done yet.
> >
> > 3) scsi_host_complete_all_commands() is called from _aac_reset_adapter() to
> > fail all pending requests. Request A is still visible in
> > scsi_host_complete_all_commands() because its tag hasn't been freed yet. But the
> > tag and request A can be completed and freed right after scsi_host_complete_all_commands()
> > reads ->rqs[bitnr] in bt_tags_iter(), which calls complete_all_cmds_iter()
> > -> .scsi_done() -> blk_mq_complete_request(), so the same request A is
> > scheduled again via IPI or softirq and added to the IPI or softirq list.
> >
> > 4) Meanwhile request A is freed by the normal completion triggered by the
> > interrupt. A pending elevator switch can move on since request A drops the last
> > reference, and bt_tags_iter() returns on the reset path, so
> > blk_mq_wait_for_tag_iter() can return too; the whole scheduler request pool is
> > then freed.
> >
> > 5) Request A on the IPI/softirq list scheduled from _aac_reset_adapter() is
> > read, and a UAF is triggered.
> >
> > Drivers are supposed to cover normal completion vs. error handling, but with
> > respect to remote completion I am not sure a driver is capable of covering that.
>
> Hi Ming,
>
> I agree that the scenario above can happen and also that a fix is
> needed. However, I do not agree that this should be fixed by modifying
> the tag iteration functions. I see scsi_host_complete_all_commands() as
> a workaround for broken storage controllers - storage controllers that
> do not have a facility for terminating all pending commands. NVMe
> controllers can be told to terminate all pending I/O commands by e.g.
> deleting all I/O submission queues. Many SCSI controllers also provide a
> facility for aborting all pending commands. For SCSI controllers that do
> not provide such a facility I propose to fix races like the race
> described above in the SCSI LLD instead of modifying the block layer tag
> iteration functions.
Hi Bart,
Terminating all pending commands can't avoid the request UAF issue; so far
blk_mq_tagset_wait_completed_request() is used to make sure that all pending
requests are really aborted.

However, blk_mq_wait_for_tag_iter() may still return before all uses of a
request are done, because blk_mq_wait_for_tag_iter() assumes that all request
references are dropped inside bt_tags_iter(); in particular, .iter_rwsem and
the RCU read lock are only held inside bt_tags_iter().

What I really meant is that neither the added .iter_rwsem/RCU read lock nor
blk_mq_wait_for_tag_iter() can do what is expected.
Thanks,
Ming
* Re: [PATCH v7 3/5] blk-mq: Fix races between iterating over requests and freeing requests
2021-04-25 0:09 ` Ming Lei
@ 2021-04-25 21:01 ` Bart Van Assche
2021-04-26 0:55 ` Ming Lei
2021-04-26 16:29 ` Bart Van Assche
1 sibling, 1 reply; 7+ messages in thread
From: Bart Van Assche @ 2021-04-25 21:01 UTC (permalink / raw)
To: Ming Lei
Cc: Jens Axboe, linux-block, Christoph Hellwig, Daniel Wagner,
Khazhismel Kumykov, Shin'ichiro Kawasaki, Martin K . Petersen,
Hannes Reinecke, Johannes Thumshirn, John Garry, linux-scsi
On 4/24/21 5:09 PM, Ming Lei wrote:
> However, blk_mq_wait_for_tag_iter() may still return before all uses of a
> request are done, because blk_mq_wait_for_tag_iter() assumes that all request
> references are dropped inside bt_tags_iter(); in particular, .iter_rwsem and
> the RCU read lock are only held inside bt_tags_iter().
The comment above blk_mq_wait_for_tag_iter() needs to be updated but I
believe that the code is fine. Waiting for bt_tags_iter() to finish
should be sufficient to fix the UAF. What matters is that the pointer
read by rcu_dereference(tags->rqs[bitnr]) remains valid until the
callback function has finished. I think that is guaranteed by the
current implementation.
Bart.
* Re: [PATCH v7 3/5] blk-mq: Fix races between iterating over requests and freeing requests
2021-04-25 21:01 ` Bart Van Assche
@ 2021-04-26 0:55 ` Ming Lei
0 siblings, 0 replies; 7+ messages in thread
From: Ming Lei @ 2021-04-26 0:55 UTC (permalink / raw)
To: Bart Van Assche
Cc: Jens Axboe, linux-block, Christoph Hellwig, Daniel Wagner,
Khazhismel Kumykov, Shin'ichiro Kawasaki, Martin K . Petersen,
Hannes Reinecke, Johannes Thumshirn, John Garry, linux-scsi
On Sun, Apr 25, 2021 at 02:01:11PM -0700, Bart Van Assche wrote:
> On 4/24/21 5:09 PM, Ming Lei wrote:
> > However, blk_mq_wait_for_tag_iter() may still return before all uses of a
> > request are done, because blk_mq_wait_for_tag_iter() assumes that all request
> > references are dropped inside bt_tags_iter(); in particular, .iter_rwsem and
> > the RCU read lock are only held inside bt_tags_iter().
>
> The comment above blk_mq_wait_for_tag_iter() needs to be updated but I
> believe that the code is fine. Waiting for bt_tags_iter() to finish
> should be sufficient to fix the UAF. What matters is that the pointer
> read by rcu_dereference(tags->rqs[bitnr]) remains valid until the
> callback function has finished. I think that is guaranteed by the
> current implementation.
It depends on whether 'rq' is passed on to some new context from ->fn(), since
'rq' can still be used in that new context after ->fn() returns.
thanks,
Ming
* Re: [PATCH v7 3/5] blk-mq: Fix races between iterating over requests and freeing requests
2021-04-25 0:09 ` Ming Lei
2021-04-25 21:01 ` Bart Van Assche
@ 2021-04-26 16:29 ` Bart Van Assche
2021-04-27 0:11 ` Ming Lei
1 sibling, 1 reply; 7+ messages in thread
From: Bart Van Assche @ 2021-04-26 16:29 UTC (permalink / raw)
To: Ming Lei
Cc: Jens Axboe, linux-block, Christoph Hellwig, Daniel Wagner,
Khazhismel Kumykov, Shin'ichiro Kawasaki, Martin K . Petersen,
Hannes Reinecke, Johannes Thumshirn, John Garry, linux-scsi
On 4/24/21 5:09 PM, Ming Lei wrote:
> Terminating all pending commands can't avoid the request UAF issue; so far
> blk_mq_tagset_wait_completed_request() is used to make sure that all pending
> requests are really aborted.
>
> However, blk_mq_wait_for_tag_iter() may still return before all uses of a
> request are done, because blk_mq_wait_for_tag_iter() assumes that all request
> references are dropped inside bt_tags_iter(); in particular, .iter_rwsem and
> the RCU read lock are only held inside bt_tags_iter().
Hi Ming,
I think that we agree that completing a request from inside a tag
iteration callback function may cause the request completion to happen
after tag iteration has finished. This can happen because
blk_mq_complete_request() may redirect completion processing to another
CPU via an IPI.
But can this mechanism trigger a use-after-free by itself? If request
completion is redirected to another CPU, the request is still considered
pending and request queue freezing won't complete. Request queue
freezing will only succeed after __blk_mq_free_request() has been called
because it is __blk_mq_free_request() that calls blk_queue_exit().
In other words, do we really need the new
blk_mq_complete_request_locally() function?
Did I perhaps miss something?
Thanks,
Bart.
* Re: [PATCH v7 3/5] blk-mq: Fix races between iterating over requests and freeing requests
2021-04-26 16:29 ` Bart Van Assche
@ 2021-04-27 0:11 ` Ming Lei
0 siblings, 0 replies; 7+ messages in thread
From: Ming Lei @ 2021-04-27 0:11 UTC (permalink / raw)
To: Bart Van Assche
Cc: Jens Axboe, linux-block, Christoph Hellwig, Daniel Wagner,
Khazhismel Kumykov, Shin'ichiro Kawasaki, Martin K . Petersen,
Hannes Reinecke, Johannes Thumshirn, John Garry, linux-scsi
On Mon, Apr 26, 2021 at 09:29:54AM -0700, Bart Van Assche wrote:
> On 4/24/21 5:09 PM, Ming Lei wrote:
> > Terminating all pending commands can't avoid the request UAF issue; so far
> > blk_mq_tagset_wait_completed_request() is used to make sure that all pending
> > requests are really aborted.
> >
> > However, blk_mq_wait_for_tag_iter() may still return before all uses of a
> > request are done, because blk_mq_wait_for_tag_iter() assumes that all request
> > references are dropped inside bt_tags_iter(); in particular, .iter_rwsem and
> > the RCU read lock are only held inside bt_tags_iter().
>
> Hi Ming,
>
> I think that we agree that completing a request from inside a tag
> iteration callback function may cause the request completion to happen
> after tag iteration has finished. This can happen because
> blk_mq_complete_request() may redirect completion processing to another
> CPU via an IPI.
>
> But can this mechanism trigger a use-after-free by itself? If request
> completion is redirected to another CPU, the request is still considered
> pending and request queue freezing won't complete. Request queue
> freezing will only succeed after __blk_mq_free_request() has been called
> because it is __blk_mq_free_request() that calls blk_queue_exit().
>
> In other words, do we really need the new
> blk_mq_complete_request_locally() function?
>
> Did I perhaps miss something?
Please see the example in the following link:
https://lore.kernel.org/linux-block/20210421000235.2028-4-bvanassche@acm.org/T/#m4d7bc9aa01108f03d5b4b7ee102eb26eb0c778aa
In short:
1) One async completion from interrupt context is pending, and one request
isn't really freed yet because its driver tag hasn't been released.

2) Meanwhile the iterating code can still visit this request, and ->fn()
schedules a new remote completion and returns.

3) The async completion scheduled in 1) is really done now, but the async
completion scheduled in 2) hasn't started or finished yet.

4) The queue becomes frozen because of a pending elevator switch, so the
sched request pool is freed since there isn't any pending iterator ->fn().

5) A request UAF is triggered when the async completion scheduled in 2)
runs now.
Thanks,
Ming
end of thread, other threads:[~2021-04-27 0:12 UTC | newest]
Thread overview: 7+ messages
-- links below jump to the message on this page --
[not found] <20210421000235.2028-1-bvanassche@acm.org>
[not found] ` <20210421000235.2028-4-bvanassche@acm.org>
[not found] ` <YIDqa6YkNoD5OiKN@T590>
[not found] ` <b717ffc0-a434-738f-9c63-32901bd164b2@acm.org>
[not found] ` <YIEiElb9wxReV/oL@T590>
[not found] ` <32a121b7-2444-ac19-420d-4961f2a18129@acm.org>
2021-04-23 3:52 ` [PATCH v7 3/5] blk-mq: Fix races between iterating over requests and freeing requests Ming Lei
2021-04-23 17:52 ` Bart Van Assche
2021-04-25 0:09 ` Ming Lei
2021-04-25 21:01 ` Bart Van Assche
2021-04-26 0:55 ` Ming Lei
2021-04-26 16:29 ` Bart Van Assche
2021-04-27 0:11 ` Ming Lei