linux-scsi.vger.kernel.org archive mirror
* [PATCH 0/2] dm: Avoid use-after-free of a mapped device
@ 2013-02-22 10:45 Bart Van Assche
  2013-02-22 10:46 ` [PATCH 1/2] block: Avoid invoking blk_run_queue() recursively Bart Van Assche
  2013-02-22 10:47 ` [PATCH 2/2] dm: Avoid use-after-free of a mapped device Bart Van Assche
  0 siblings, 2 replies; 13+ messages in thread
From: Bart Van Assche @ 2013-02-22 10:45 UTC (permalink / raw)
  To: device-mapper development, linux-scsi, Alasdair G Kergon,
	Jens Axboe, Mike Snitzer, Tejun Heo, James Bottomley

This mini-series of two patches prevents the device mapper
implementation from triggering a use-after-free during removal of a
mapped device. The two patches in this series are:
- block: Avoid invoking blk_run_queue() recursively
- dm: Avoid use-after-free of a mapped device

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH 1/2] block: Avoid invoking blk_run_queue() recursively
  2013-02-22 10:45 [PATCH 0/2] dm: Avoid use-after-free of a mapped device Bart Van Assche
@ 2013-02-22 10:46 ` Bart Van Assche
  2013-02-22 18:14   ` Tejun Heo
  2013-02-22 10:47 ` [PATCH 2/2] dm: Avoid use-after-free of a mapped device Bart Van Assche
  1 sibling, 1 reply; 13+ messages in thread
From: Bart Van Assche @ 2013-02-22 10:46 UTC (permalink / raw)
  To: device-mapper development
  Cc: linux-scsi, Alasdair G Kergon, Jens Axboe, Mike Snitzer,
	Tejun Heo, James Bottomley

Some block drivers, e.g. dm and SCSI, need to trigger a queue
run from inside functions that may be invoked by their request_fn()
implementation. Make sure that invoking blk_run_queue() instead
of blk_run_queue_async() from such functions does not trigger
recursion. Making blk_run_queue() skip queue processing when
invoked recursively is safe because the only two affected
request_fn() implementations (dm and SCSI) guarantee that the
request queue will be reexamined sooner or later before returning
from their request_fn() implementation.

Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Tejun Heo <tj@kernel.org>
Cc: James Bottomley <JBottomley@parallels.com>
Cc: Alasdair G Kergon <agk@redhat.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: <stable@vger.kernel.org>
---
 block/blk-core.c |   21 +++++++++++++--------
 1 file changed, 13 insertions(+), 8 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index c973249..cf26e3a 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -304,19 +304,24 @@ EXPORT_SYMBOL(blk_sync_queue);
  *    This variant runs the queue whether or not the queue has been
  *    stopped. Must be called with the queue lock held and interrupts
  *    disabled. See also @blk_run_queue.
+ *
+ * Note:
+ *    Request handling functions are allowed to invoke __blk_run_queue() or
+ *    blk_run_queue() directly or indirectly. This will not result in a
+ *    recursive call of the request handler. However, such request handling
+ *    functions must, before they return, either reexamine the request queue
+ *    or invoke blk_delay_queue() to avoid that queue processing stops.
+ *
+ *    Some request handler implementations, e.g. scsi_request_fn() and
+ *    dm_request_fn(), unlock the queue lock internally. Returning immediately
+ *    if q->request_fn_active > 0 avoids that for the same queue multiple
+ *    threads execute the request handling function concurrently.
  */
 inline void __blk_run_queue_uncond(struct request_queue *q)
 {
-	if (unlikely(blk_queue_dead(q)))
+	if (unlikely(blk_queue_dead(q) || q->request_fn_active))
 		return;
 
-	/*
-	 * Some request_fn implementations, e.g. scsi_request_fn(), unlock
-	 * the queue lock internally. As a result multiple threads may be
-	 * running such a request function concurrently. Keep track of the
-	 * number of active request_fn invocations such that blk_drain_queue()
-	 * can wait until all these request_fn calls have finished.
-	 */
 	q->request_fn_active++;
 	q->request_fn(q);
 	q->request_fn_active--;
-- 
1.7.10.4



* [PATCH 2/2] dm: Avoid use-after-free of a mapped device
  2013-02-22 10:45 [PATCH 0/2] dm: Avoid use-after-free of a mapped device Bart Van Assche
  2013-02-22 10:46 ` [PATCH 1/2] block: Avoid invoking blk_run_queue() recursively Bart Van Assche
@ 2013-02-22 10:47 ` Bart Van Assche
  2013-02-22 11:08   ` Mike Snitzer
  2013-02-25  9:49   ` Jun'ichi Nomura
  1 sibling, 2 replies; 13+ messages in thread
From: Bart Van Assche @ 2013-02-22 10:47 UTC (permalink / raw)
  To: device-mapper development
  Cc: linux-scsi, Alasdair G Kergon, Jens Axboe, Mike Snitzer,
	Tejun Heo, James Bottomley

As the comment above rq_completed() explains, md members must
not be touched after the dm_put() at the end of that function
has been invoked. Prevent md->queue from being run
asynchronously after the last md reference has been dropped by
running that queue synchronously. This patch fixes the
following kernel oops:

general protection fault: 0000 [#1] SMP
RIP: 0010:[<ffffffff810fe754>]  [<ffffffff810fe754>] mempool_free+0x24/0xb0
Call Trace:
  <IRQ>
  [<ffffffff81187417>] bio_put+0x97/0xc0
  [<ffffffffa02247a5>] end_clone_bio+0x35/0x90 [dm_mod]
  [<ffffffff81185efd>] bio_endio+0x1d/0x30
  [<ffffffff811f03a3>] req_bio_endio.isra.51+0xa3/0xe0
  [<ffffffff811f2f68>] blk_update_request+0x118/0x520
  [<ffffffff811f3397>] blk_update_bidi_request+0x27/0xa0
  [<ffffffff811f343c>] blk_end_bidi_request+0x2c/0x80
  [<ffffffff811f34d0>] blk_end_request+0x10/0x20
  [<ffffffffa000b32b>] scsi_io_completion+0xfb/0x6c0 [scsi_mod]
  [<ffffffffa000107d>] scsi_finish_command+0xbd/0x120 [scsi_mod]
  [<ffffffffa000b12f>] scsi_softirq_done+0x13f/0x160 [scsi_mod]
  [<ffffffff811f9fd0>] blk_done_softirq+0x80/0xa0
  [<ffffffff81044551>] __do_softirq+0xf1/0x250
  [<ffffffff8142ee8c>] call_softirq+0x1c/0x30
  [<ffffffff8100420d>] do_softirq+0x8d/0xc0
  [<ffffffff81044885>] irq_exit+0xd5/0xe0
  [<ffffffff8142f3e3>] do_IRQ+0x63/0xe0
  [<ffffffff814257af>] common_interrupt+0x6f/0x6f
  <EOI>
  [<ffffffffa021737c>] srp_queuecommand+0x8c/0xcb0 [ib_srp]
  [<ffffffffa0002f18>] scsi_dispatch_cmd+0x148/0x310 [scsi_mod]
  [<ffffffffa000a38e>] scsi_request_fn+0x31e/0x520 [scsi_mod]
  [<ffffffff811f1e57>] __blk_run_queue+0x37/0x50
  [<ffffffff811f1f69>] blk_delay_work+0x29/0x40
  [<ffffffff81059003>] process_one_work+0x1c3/0x5c0
  [<ffffffff8105b22e>] worker_thread+0x15e/0x440
  [<ffffffff8106164b>] kthread+0xdb/0xe0
  [<ffffffff8142db9c>] ret_from_fork+0x7c/0xb0

Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Cc: Alasdair G Kergon <agk@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: James Bottomley <JBottomley@parallels.com>
Cc: <stable@vger.kernel.org>
---
 drivers/md/dm.c |   10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 314a0e2..0218fc3 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -729,13 +729,13 @@ static void rq_completed(struct mapped_device *md, int rw, int run_queue)
 		wake_up(&md->wait);
 
 	/*
-	 * Run this off this callpath, as drivers could invoke end_io while
-	 * inside their request_fn (and holding the queue lock). Calling
-	 * back into ->request_fn() could deadlock attempting to grab the
-	 * queue lock again.
+	 * Although this function may be invoked indirectly from inside
+	 * blk_run_queue(), invoking blk_run_queue() here is safe because that
+	 * function returns immediately when it detects that it has been
+	 * called recursively.
 	 */
 	if (run_queue)
-		blk_run_queue_async(md->queue);
+		blk_run_queue(md->queue);
 
 	/*
 	 * dm_put() must be at the end of this function. See the comment above
-- 
1.7.10.4



* Re: [PATCH 2/2] dm: Avoid use-after-free of a mapped device
  2013-02-22 10:47 ` [PATCH 2/2] dm: Avoid use-after-free of a mapped device Bart Van Assche
@ 2013-02-22 11:08   ` Mike Snitzer
  2013-02-22 11:22     ` Bart Van Assche
  2013-02-25  9:49   ` Jun'ichi Nomura
  1 sibling, 1 reply; 13+ messages in thread
From: Mike Snitzer @ 2013-02-22 11:08 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: device-mapper development, linux-scsi, Alasdair G Kergon,
	Jens Axboe, Tejun Heo, James Bottomley

On Fri, Feb 22 2013 at  5:47am -0500,
Bart Van Assche <bvanassche@acm.org> wrote:

> As the comment above rq_completed() explains, md members must
> not be touched after the dm_put() at the end of that function
> has been invoked. Prevent md->queue from being run
> asynchronously after the last md reference has been dropped by
> running that queue synchronously. This patch fixes the
> following kernel oops:
> 
> general protection fault: 0000 [#1] SMP
> RIP: 0010:[<ffffffff810fe754>]  [<ffffffff810fe754>] mempool_free+0x24/0xb0
> Call Trace:
>   <IRQ>
>   [<ffffffff81187417>] bio_put+0x97/0xc0
>   [<ffffffffa02247a5>] end_clone_bio+0x35/0x90 [dm_mod]
>   [<ffffffff81185efd>] bio_endio+0x1d/0x30
>   [<ffffffff811f03a3>] req_bio_endio.isra.51+0xa3/0xe0
>   [<ffffffff811f2f68>] blk_update_request+0x118/0x520
>   [<ffffffff811f3397>] blk_update_bidi_request+0x27/0xa0
>   [<ffffffff811f343c>] blk_end_bidi_request+0x2c/0x80
>   [<ffffffff811f34d0>] blk_end_request+0x10/0x20
>   [<ffffffffa000b32b>] scsi_io_completion+0xfb/0x6c0 [scsi_mod]
>   [<ffffffffa000107d>] scsi_finish_command+0xbd/0x120 [scsi_mod]
>   [<ffffffffa000b12f>] scsi_softirq_done+0x13f/0x160 [scsi_mod]
>   [<ffffffff811f9fd0>] blk_done_softirq+0x80/0xa0
>   [<ffffffff81044551>] __do_softirq+0xf1/0x250
>   [<ffffffff8142ee8c>] call_softirq+0x1c/0x30
>   [<ffffffff8100420d>] do_softirq+0x8d/0xc0
>   [<ffffffff81044885>] irq_exit+0xd5/0xe0
>   [<ffffffff8142f3e3>] do_IRQ+0x63/0xe0
>   [<ffffffff814257af>] common_interrupt+0x6f/0x6f
>   <EOI>
>   [<ffffffffa021737c>] srp_queuecommand+0x8c/0xcb0 [ib_srp]
>   [<ffffffffa0002f18>] scsi_dispatch_cmd+0x148/0x310 [scsi_mod]
>   [<ffffffffa000a38e>] scsi_request_fn+0x31e/0x520 [scsi_mod]
>   [<ffffffff811f1e57>] __blk_run_queue+0x37/0x50
>   [<ffffffff811f1f69>] blk_delay_work+0x29/0x40
>   [<ffffffff81059003>] process_one_work+0x1c3/0x5c0
>   [<ffffffff8105b22e>] worker_thread+0x15e/0x440
>   [<ffffffff8106164b>] kthread+0xdb/0xe0
>   [<ffffffff8142db9c>] ret_from_fork+0x7c/0xb0

Your commit header should probably reference commit
a8c32a5c98943d370ea606a2e7dc04717eb92206 ("dm: fix deadlock with request
based dm and queue request_fn recursion") and cc: stable with "v3.7+"
guidance.

Acked-by: Mike Snitzer <snitzer@redhat.com>


* Re: [PATCH 2/2] dm: Avoid use-after-free of a mapped device
  2013-02-22 11:08   ` Mike Snitzer
@ 2013-02-22 11:22     ` Bart Van Assche
  2013-02-22 11:28       ` Mike Snitzer
  0 siblings, 1 reply; 13+ messages in thread
From: Bart Van Assche @ 2013-02-22 11:22 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: device-mapper development, linux-scsi, Alasdair G Kergon,
	Jens Axboe, Tejun Heo, James Bottomley

On 02/22/13 12:08, Mike Snitzer wrote:
> On Fri, Feb 22 2013 at  5:47am -0500,
> Bart Van Assche <bvanassche@acm.org> wrote:
>
>> As the comment above rq_completed() explains, md members must
>> not be touched after the dm_put() at the end of that function
>> has been invoked. Prevent md->queue from being run
>> asynchronously after the last md reference has been dropped by
>> running that queue synchronously.
>
> Your commit header should probably reference commit
> a8c32a5c98943d370ea606a2e7dc04717eb92206 ("dm: fix deadlock with request
> based dm and queue request_fn recursion") and cc: stable with "v3.7+"
> guidance.
>
> Acked-by: Mike Snitzer <snitzer@redhat.com>

Hello Mike,

Thanks for reviewing this patch and for the ack. Regarding the stable 
tag: had you noticed that commit a8c32a5 had a "Cc: stable" tag itself 
and hence has probably already been backported to kernels older than v3.7?

Bart.




* Re: [PATCH 2/2] dm: Avoid use-after-free of a mapped device
  2013-02-22 11:22     ` Bart Van Assche
@ 2013-02-22 11:28       ` Mike Snitzer
  0 siblings, 0 replies; 13+ messages in thread
From: Mike Snitzer @ 2013-02-22 11:28 UTC (permalink / raw)
  To: Bart Van Assche, Jens Axboe
  Cc: device-mapper development, linux-scsi, Alasdair G Kergon,
	Tejun Heo, James Bottomley

On Fri, Feb 22 2013 at  6:22am -0500,
Bart Van Assche <bvanassche@acm.org> wrote:

> On 02/22/13 12:08, Mike Snitzer wrote:
> >On Fri, Feb 22 2013 at  5:47am -0500,
> >Bart Van Assche <bvanassche@acm.org> wrote:
> >
> >>As the comment above rq_completed() explains, md members must
> >>not be touched after the dm_put() at the end of that function
> >>has been invoked. Prevent md->queue from being run
> >>asynchronously after the last md reference has been dropped by
> >>running that queue synchronously.
> >
> >Your commit header should probably reference commit
> >a8c32a5c98943d370ea606a2e7dc04717eb92206 ("dm: fix deadlock with request
> >based dm and queue request_fn recursion") and cc: stable with "v3.7+"
> >guidance.
> >
> >Acked-by: Mike Snitzer <snitzer@redhat.com>
> 
> Hello Mike,
> 
> Thanks for reviewing this patch and for the ack. Regarding the
> stable tag: had you noticed that commit a8c32a5 had a "Cc: stable"
> tag itself and hence has probably already been backported to kernels
> older than v3.7?

Ah yes.  Good point.

Jens, since this DM change is dependent on Bart's 1/2 block patch it'd
be ideal if you could pick up both of these patches for v3.9.

Thanks,
Mike


* Re: [PATCH 1/2] block: Avoid invoking blk_run_queue() recursively
  2013-02-22 10:46 ` [PATCH 1/2] block: Avoid invoking blk_run_queue() recursively Bart Van Assche
@ 2013-02-22 18:14   ` Tejun Heo
  2013-02-22 18:57     ` Bart Van Assche
  0 siblings, 1 reply; 13+ messages in thread
From: Tejun Heo @ 2013-02-22 18:14 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: device-mapper development, linux-scsi, Alasdair G Kergon,
	Jens Axboe, Mike Snitzer, James Bottomley

Hello, Bart.

On Fri, Feb 22, 2013 at 11:46:14AM +0100, Bart Van Assche wrote:
> Some block drivers, e.g. dm and SCSI, need to trigger a queue
> run from inside functions that may be invoked by their request_fn()
> implementation. Make sure that invoking blk_run_queue() instead
> of blk_run_queue_async() from such functions does not trigger
> recursion. Making blk_run_queue() skip queue processing when
> invoked recursively is safe because the only two affected
> request_fn() implementations (dm and SCSI) guarantee that the
> request queue will be reexamined sooner or later before returning
> from their request_fn() implementation.
> 
> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
> Cc: Jens Axboe <axboe@kernel.dk>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: James Bottomley <JBottomley@parallels.com>
> Cc: Alasdair G Kergon <agk@redhat.com>
> Cc: Mike Snitzer <snitzer@redhat.com>
> Cc: <stable@vger.kernel.org>
> ---
>  block/blk-core.c |   21 +++++++++++++--------
>  1 file changed, 13 insertions(+), 8 deletions(-)
> 
> diff --git a/block/blk-core.c b/block/blk-core.c
> index c973249..cf26e3a 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -304,19 +304,24 @@ EXPORT_SYMBOL(blk_sync_queue);
>   *    This variant runs the queue whether or not the queue has been
>   *    stopped. Must be called with the queue lock held and interrupts
>   *    disabled. See also @blk_run_queue.
> + *
> + * Note:
> + *    Request handling functions are allowed to invoke __blk_run_queue() or
> + *    blk_run_queue() directly or indirectly. This will not result in a
> + *    recursive call of the request handler. However, such request handling
> + *    functions must, before they return, either reexamine the request queue
> + *    or invoke blk_delay_queue() to avoid that queue processing stops.
> + *
> + *    Some request handler implementations, e.g. scsi_request_fn() and
> + *    dm_request_fn(), unlock the queue lock internally. Returning immediately
> + *    if q->request_fn_active > 0 avoids that for the same queue multiple
> + *    threads execute the request handling function concurrently.
>   */
>  inline void __blk_run_queue_uncond(struct request_queue *q)
>  {
> -	if (unlikely(blk_queue_dead(q)))
> +	if (unlikely(blk_queue_dead(q) || q->request_fn_active))
>  		return;

Hmmm... I can't say I like this.  Silently ignoring an explicit
blk_run_queue() call like this is likely to introduce nasty queue hang
bugs later on even if it's okay for the current users, and there isn't
even a debugging aid to detect what's going on.  If somebody invokes
blk_run_queue(), the block layer had better guarantee that the queue
will be run at least once afterwards no matter what, so please don't
do it this way.

Thanks.

-- 
tejun


* Re: [PATCH 1/2] block: Avoid invoking blk_run_queue() recursively
  2013-02-22 18:14   ` Tejun Heo
@ 2013-02-22 18:57     ` Bart Van Assche
  2013-02-22 19:01       ` Jens Axboe
  0 siblings, 1 reply; 13+ messages in thread
From: Bart Van Assche @ 2013-02-22 18:57 UTC (permalink / raw)
  To: Tejun Heo
  Cc: device-mapper development, linux-scsi, Alasdair G Kergon,
	Jens Axboe, Mike Snitzer, James Bottomley

On 02/22/13 19:14, Tejun Heo wrote:
> Hello, Bart.
> 
> On Fri, Feb 22, 2013 at 11:46:14AM +0100, Bart Van Assche wrote:
>> Some block drivers, e.g. dm and SCSI, need to trigger a queue
>> run from inside functions that may be invoked by their request_fn()
>> implementation. Make sure that invoking blk_run_queue() instead
>> of blk_run_queue_async() from such functions does not trigger
>> recursion. Making blk_run_queue() skip queue processing when
>> invoked recursively is safe because the only two affected
>> request_fn() implementations (dm and SCSI) guarantee that the
>> request queue will be reexamined sooner or later before returning
>> from their request_fn() implementation.
>>
>> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
>> Cc: Jens Axboe <axboe@kernel.dk>
>> Cc: Tejun Heo <tj@kernel.org>
>> Cc: James Bottomley <JBottomley@parallels.com>
>> Cc: Alasdair G Kergon <agk@redhat.com>
>> Cc: Mike Snitzer <snitzer@redhat.com>
>> Cc: <stable@vger.kernel.org>
>> ---
>>   block/blk-core.c |   21 +++++++++++++--------
>>   1 file changed, 13 insertions(+), 8 deletions(-)
>>
>> diff --git a/block/blk-core.c b/block/blk-core.c
>> index c973249..cf26e3a 100644
>> --- a/block/blk-core.c
>> +++ b/block/blk-core.c
>> @@ -304,19 +304,24 @@ EXPORT_SYMBOL(blk_sync_queue);
>>    *    This variant runs the queue whether or not the queue has been
>>    *    stopped. Must be called with the queue lock held and interrupts
>>    *    disabled. See also @blk_run_queue.
>> + *
>> + * Note:
>> + *    Request handling functions are allowed to invoke __blk_run_queue() or
>> + *    blk_run_queue() directly or indirectly. This will not result in a
>> + *    recursive call of the request handler. However, such request handling
>> + *    functions must, before they return, either reexamine the request queue
>> + *    or invoke blk_delay_queue() to avoid that queue processing stops.
>> + *
>> + *    Some request handler implementations, e.g. scsi_request_fn() and
>> + *    dm_request_fn(), unlock the queue lock internally. Returning immediately
>> + *    if q->request_fn_active > 0 avoids that for the same queue multiple
>> + *    threads execute the request handling function concurrently.
>>    */
>>   inline void __blk_run_queue_uncond(struct request_queue *q)
>>   {
>> -	if (unlikely(blk_queue_dead(q)))
>> +	if (unlikely(blk_queue_dead(q) || q->request_fn_active))
>>   		return;
> 
> Hmmm... I can't say I like this.  Silently ignoring an explicit
> blk_run_queue() call like this is likely to introduce nasty queue hang
> bugs later on even if it's okay for the current users, and there isn't
> even a debugging aid to detect what's going on.  If somebody invokes
> blk_run_queue(), the block layer had better guarantee that the queue
> will be run at least once afterwards no matter what, so please don't
> do it this way.

How about returning to an approach similar to the one introduced in commit
a538cd03 (April 2009)? This is what the last part of __blk_run_queue()
looked like after that commit:

        /*
         * Only recurse once to avoid overrunning the stack, let the unplug
         * handling reinvoke the handler shortly if we already got there.
         */
        if (!queue_flag_test_and_set(QUEUE_FLAG_REENTER, q)) {
                q->request_fn(q);
                queue_flag_clear(QUEUE_FLAG_REENTER, q);
        } else {
                queue_flag_set(QUEUE_FLAG_PLUGGED, q);
                kblockd_schedule_work(q, &q->unplug_work);
        }

Bart.


* Re: [PATCH 1/2] block: Avoid invoking blk_run_queue() recursively
  2013-02-22 18:57     ` Bart Van Assche
@ 2013-02-22 19:01       ` Jens Axboe
  2013-02-23 12:34         ` Bart Van Assche
  0 siblings, 1 reply; 13+ messages in thread
From: Jens Axboe @ 2013-02-22 19:01 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Tejun Heo, device-mapper development, linux-scsi,
	Alasdair G Kergon, Mike Snitzer, James Bottomley

On Fri, Feb 22 2013, Bart Van Assche wrote:
> On 02/22/13 19:14, Tejun Heo wrote:
> > Hello, Bart.
> > 
> > On Fri, Feb 22, 2013 at 11:46:14AM +0100, Bart Van Assche wrote:
> >> Some block drivers, e.g. dm and SCSI, need to trigger a queue
> >> run from inside functions that may be invoked by their request_fn()
> >> implementation. Make sure that invoking blk_run_queue() instead
> >> of blk_run_queue_async() from such functions does not trigger
> >> recursion. Making blk_run_queue() skip queue processing when
> >> invoked recursively is safe because the only two affected
> >> request_fn() implementations (dm and SCSI) guarantee that the
> >> request queue will be reexamined sooner or later before returning
> >> from their request_fn() implementation.
> >>
> >> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
> >> Cc: Jens Axboe <axboe@kernel.dk>
> >> Cc: Tejun Heo <tj@kernel.org>
> >> Cc: James Bottomley <JBottomley@parallels.com>
> >> Cc: Alasdair G Kergon <agk@redhat.com>
> >> Cc: Mike Snitzer <snitzer@redhat.com>
> >> Cc: <stable@vger.kernel.org>
> >> ---
> >>   block/blk-core.c |   21 +++++++++++++--------
> >>   1 file changed, 13 insertions(+), 8 deletions(-)
> >>
> >> diff --git a/block/blk-core.c b/block/blk-core.c
> >> index c973249..cf26e3a 100644
> >> --- a/block/blk-core.c
> >> +++ b/block/blk-core.c
> >> @@ -304,19 +304,24 @@ EXPORT_SYMBOL(blk_sync_queue);
> >>    *    This variant runs the queue whether or not the queue has been
> >>    *    stopped. Must be called with the queue lock held and interrupts
> >>    *    disabled. See also @blk_run_queue.
> >> + *
> >> + * Note:
> >> + *    Request handling functions are allowed to invoke __blk_run_queue() or
> >> + *    blk_run_queue() directly or indirectly. This will not result in a
> >> + *    recursive call of the request handler. However, such request handling
> >> + *    functions must, before they return, either reexamine the request queue
> >> + *    or invoke blk_delay_queue() to avoid that queue processing stops.
> >> + *
> >> + *    Some request handler implementations, e.g. scsi_request_fn() and
> >> + *    dm_request_fn(), unlock the queue lock internally. Returning immediately
> >> + *    if q->request_fn_active > 0 avoids that for the same queue multiple
> >> + *    threads execute the request handling function concurrently.
> >>    */
> >>   inline void __blk_run_queue_uncond(struct request_queue *q)
> >>   {
> >> -	if (unlikely(blk_queue_dead(q)))
> >> +	if (unlikely(blk_queue_dead(q) || q->request_fn_active))
> >>   		return;
> > 
> > Hmmm... I can't say I like this.  Silently ignoring an explicit
> > blk_run_queue() call like this is likely to introduce nasty queue hang
> > bugs later on even if it's okay for the current users, and there isn't
> > even a debugging aid to detect what's going on.  If somebody invokes
> > blk_run_queue(), the block layer had better guarantee that the queue
> > will be run at least once afterwards no matter what, so please don't
> > do it this way.
> 
> How about returning to an approach similar to the one introduced in commit
> a538cd03 (April 2009)? This is what the last part of __blk_run_queue()
> looked like after that commit:
> 
>         /*
>          * Only recurse once to avoid overrunning the stack, let the unplug
>          * handling reinvoke the handler shortly if we already got there.
>          */
>         if (!queue_flag_test_and_set(QUEUE_FLAG_REENTER, q)) {
>                 q->request_fn(q);
>                 queue_flag_clear(QUEUE_FLAG_REENTER, q);
>         } else {
>                 queue_flag_set(QUEUE_FLAG_PLUGGED, q);
>                 kblockd_schedule_work(q, &q->unplug_work);
>         }

That'd be fine. I agree with Tejun that just returning is error prone
and could cause hard-to-debug hangs or stalls. So something à la

        if (blk_queue_dead(q))
                return;
        else if (q->request_fn_active) {
                blk_delay_queue(q, 0);
                return;
        }

would work. Alternatively, you could mark the queue as needing a re-run,
so any existing runner of it would notice and clear this flag and re-run
the queue. But that's somewhat more fragile, and since it isn't a
hot/performance path, I suspect the simple "just re-run me soon" is good
enough.

-- 
Jens Axboe



* Re: [PATCH 1/2] block: Avoid invoking blk_run_queue() recursively
  2013-02-22 19:01       ` Jens Axboe
@ 2013-02-23 12:34         ` Bart Van Assche
  0 siblings, 0 replies; 13+ messages in thread
From: Bart Van Assche @ 2013-02-23 12:34 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Tejun Heo, device-mapper development, linux-scsi,
	Alasdair G Kergon, Mike Snitzer, James Bottomley

On 02/22/13 20:01, Jens Axboe wrote:
> That'd be fine. I agree with Tejun that just returning is error prone
> and could cause hard-to-debug hangs or stalls. So something à la
>
>          if (blk_queue_dead(q))
>                  return;
>          else if (q->request_fn_active) {
>                  blk_delay_queue(q, 0);
>                  return;
>          }
>
> would work. Alternatively, you could mark the queue as needing a re-run,
> so any existing runner of it would notice and clear this flag and re-run
> the queue. But that's somewhat more fragile, and since it isn't a
> hot/performance path, I suspect the simple "just re-run me soon" is good
> enough.

I'm worried that in the device mapper the approach using 
blk_delay_queue() could trigger a queue run after the final dm_put(). 
And I'm also afraid that that approach could trigger several needless 
context switches before q->request_fn_active finally drops to zero. So 
the approach with the "need re-run" flag seems preferable to me. But 
maybe I'm overlooking something?

Bart.




* Re: [PATCH 2/2] dm: Avoid use-after-free of a mapped device
  2013-02-22 10:47 ` [PATCH 2/2] dm: Avoid use-after-free of a mapped device Bart Van Assche
  2013-02-22 11:08   ` Mike Snitzer
@ 2013-02-25  9:49   ` Jun'ichi Nomura
  2013-02-25 15:09     ` Bart Van Assche
  1 sibling, 1 reply; 13+ messages in thread
From: Jun'ichi Nomura @ 2013-02-25  9:49 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: device-mapper development, linux-scsi, Alasdair G Kergon,
	Jens Axboe, Mike Snitzer, Tejun Heo, James Bottomley

Hello Bart,

On 02/22/13 19:47, Bart Van Assche wrote:
> As the comment above rq_completed() explains, md members must
> not be touched after the dm_put() at the end of that function
> has been invoked. Prevent md->queue from being run
> asynchronously after the last md reference has been dropped by
> running that queue synchronously. This patch fixes the
> following kernel oops:

Calling blk_run_queue_async() there should be ok.
After dm_put(), the dm device may be removed. But free_dev() in dm.c
calls blk_cleanup_queue(), and that should resolve the race with the
delayed work.

And I could reproduce a very similar oops without removing the dm device
with the following procedure:
(please replace "mpathX" with your dm-multipath map name)

  # t=`dmsetup table mpathX`
  # while sleep 1; do \
      echo "$t" | dmsetup load mpathX; dmsetup resume mpathX; done

Looking at the following back trace:

> general protection fault: 0000 [#1] SMP
> RIP: 0010:[<ffffffff810fe754>]  [<ffffffff810fe754>] mempool_free+0x24/0xb0
> Call Trace:
>   <IRQ>
>   [<ffffffff81187417>] bio_put+0x97/0xc0
>   [<ffffffffa02247a5>] end_clone_bio+0x35/0x90 [dm_mod]
>   [<ffffffff81185efd>] bio_endio+0x1d/0x30
>   [<ffffffff811f03a3>] req_bio_endio.isra.51+0xa3/0xe0
>   [<ffffffff811f2f68>] blk_update_request+0x118/0x520
>   [<ffffffff811f3397>] blk_update_bidi_request+0x27/0xa0
>   [<ffffffff811f343c>] blk_end_bidi_request+0x2c/0x80
>   [<ffffffff811f34d0>] blk_end_request+0x10/0x20
>   [<ffffffffa000b32b>] scsi_io_completion+0xfb/0x6c0 [scsi_mod]
>   [<ffffffffa000107d>] scsi_finish_command+0xbd/0x120 [scsi_mod]
>   [<ffffffffa000b12f>] scsi_softirq_done+0x13f/0x160 [scsi_mod]
>   [<ffffffff811f9fd0>] blk_done_softirq+0x80/0xa0
>   [<ffffffff81044551>] __do_softirq+0xf1/0x250
>   [<ffffffff8142ee8c>] call_softirq+0x1c/0x30
>   [<ffffffff8100420d>] do_softirq+0x8d/0xc0
>   [<ffffffff81044885>] irq_exit+0xd5/0xe0
>   [<ffffffff8142f3e3>] do_IRQ+0x63/0xe0
>   [<ffffffff814257af>] common_interrupt+0x6f/0x6f
>   <EOI>
>   [<ffffffffa021737c>] srp_queuecommand+0x8c/0xcb0 [ib_srp]
>   [<ffffffffa0002f18>] scsi_dispatch_cmd+0x148/0x310 [scsi_mod]
>   [<ffffffffa000a38e>] scsi_request_fn+0x31e/0x520 [scsi_mod]
>   [<ffffffff811f1e57>] __blk_run_queue+0x37/0x50
>   [<ffffffff811f1f69>] blk_delay_work+0x29/0x40
>   [<ffffffff81059003>] process_one_work+0x1c3/0x5c0
>   [<ffffffff8105b22e>] worker_thread+0x15e/0x440
>   [<ffffffff8106164b>] kthread+0xdb/0xe0
>   [<ffffffff8142db9c>] ret_from_fork+0x7c/0xb0

it seems that the bioset was removed while being referenced.

Commit c0820cf5 ("dm: introduce per_bio_data") started to replace the dm
bioset during table replacement because the size of the bioset front_pad
might change for bio-based dm.
However, for request-based dm, that is not necessary because the size
of front_pad is static. Also, we can't simply replace the bioset because
prep-ed requests in the queue hold references to the old bioset.

The patch below changes it not to replace the bioset for request-based dm.
(This brings back the same behavior as v3.7.)
With this patch, I could not reproduce the problem.
Could you try it?

-- 
Jun'ichi Nomura, NEC Corporation

diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 314a0e2..51fefb5 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1973,15 +1973,27 @@ static void __bind_mempools(struct mapped_device *md, struct dm_table *t)
 {
 	struct dm_md_mempools *p = dm_table_get_md_mempools(t);
 
-	if (md->io_pool && (md->tio_pool || dm_table_get_type(t) == DM_TYPE_BIO_BASED) && md->bs) {
-		/*
-		 * The md already has necessary mempools. Reload just the
-		 * bioset because front_pad may have changed because
-		 * a different table was loaded.
-		 */
-		bioset_free(md->bs);
-		md->bs = p->bs;
-		p->bs = NULL;
+	if (md->io_pool && md->bs) {
+		/* The md already has necessary mempools. */
+		if (dm_table_get_type(t) == DM_TYPE_BIO_BASED) {
+			/*
+			 * Reload bioset because front_pad may have changed
+			 * because a different table was loaded.
+			 */
+			bioset_free(md->bs);
+			md->bs = p->bs;
+			p->bs = NULL;
+		} else if (dm_table_get_type(t) == DM_TYPE_REQUEST_BASED) {
+			BUG_ON(!md->tio_pool);
+			/*
+			 * No need to reload in case of request-based dm
+			 * because of fixed size front_pad.
+			 * Note for future: if you are to reload bioset,
+			 * prep-ed requests in queue may have reference
+			 * to bio from the old bioset.
+			 * So you must walk through the queue to unprep.
+			 */
+		}
 		goto out;
 	}
 

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH 2/2] dm: Avoid use-after-free of a mapped device
  2013-02-25  9:49   ` Jun'ichi Nomura
@ 2013-02-25 15:09     ` Bart Van Assche
  2013-02-26  0:30       ` Jun'ichi Nomura
  0 siblings, 1 reply; 13+ messages in thread
From: Bart Van Assche @ 2013-02-25 15:09 UTC (permalink / raw)
  To: Jun'ichi Nomura
  Cc: device-mapper development, linux-scsi, Alasdair G Kergon,
	Jens Axboe, Mike Snitzer, Tejun Heo, James Bottomley

On 02/25/13 10:49, Jun'ichi Nomura wrote:
> diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> index 314a0e2..51fefb5 100644
> --- a/drivers/md/dm.c
> +++ b/drivers/md/dm.c
> @@ -1973,15 +1973,27 @@ static void __bind_mempools(struct mapped_device *md, struct dm_table *t)
>  {
>  	struct dm_md_mempools *p = dm_table_get_md_mempools(t);
>
> -	if (md->io_pool && (md->tio_pool || dm_table_get_type(t) == DM_TYPE_BIO_BASED) && md->bs) {
> -		/*
> -		 * The md already has necessary mempools. Reload just the
> -		 * bioset because front_pad may have changed because
> -		 * a different table was loaded.
> -		 */
> -		bioset_free(md->bs);
> -		md->bs = p->bs;
> -		p->bs = NULL;
> +	if (md->io_pool && md->bs) {
> +		/* The md already has necessary mempools. */
> +		if (dm_table_get_type(t) == DM_TYPE_BIO_BASED) {
> +			/*
> +			 * Reload bioset because front_pad may have changed
> +			 * because a different table was loaded.
> +			 */
> +			bioset_free(md->bs);
> +			md->bs = p->bs;
> +			p->bs = NULL;
> +		} else if (dm_table_get_type(t) == DM_TYPE_REQUEST_BASED) {
> +			BUG_ON(!md->tio_pool);
> +			/*
> +			 * No need to reload in case of request-based dm
> +			 * because of fixed size front_pad.
> +			 * Note for future: if you are to reload bioset,
> +			 * prep-ed requests in queue may have reference
> +			 * to bio from the old bioset.
> +			 * So you must walk through the queue to unprep.
> +			 */
> +		}
>  		goto out;
>  	}

Without your patch my test failed after two or three iterations. With 
your patch my test is still running after 53 iterations. So if you want 
you can add Tested-by: Bart Van Assche <bvanassche@acm.org>.

Your e-mail and the above patch are also interesting because they 
explain why reverting drivers/md to its v3.7 state made my test succeed.

Note: even if this patch gets accepted I think it's still useful to 
modify blk_run_queue() such that it converts recursion into iteration.

Bart.



^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 2/2] dm: Avoid use-after-free of a mapped device
  2013-02-25 15:09     ` Bart Van Assche
@ 2013-02-26  0:30       ` Jun'ichi Nomura
  0 siblings, 0 replies; 13+ messages in thread
From: Jun'ichi Nomura @ 2013-02-26  0:30 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: device-mapper development, linux-scsi, Alasdair G Kergon,
	Jens Axboe, Mike Snitzer, Tejun Heo, James Bottomley

On 02/26/13 00:09, Bart Van Assche wrote:
> Without your patch my test failed after two or three iterations. With your patch my test is still running after 53 iterations. So if you want you can add Tested-by: Bart Van Assche <bvanassche@acm.org>.

Great. Thanks for testing.
I'll submit a patch with your Reported-by and Tested-by.

> Your e-mail and the above patch are also interesting because they explain why reverting drivers/md to its v3.7 state made my test succeed.
> 
> Note: even if this patch gets accepted I think it's still useful to modify blk_run_queue() such that it converts recursion into iteration.

Yes. That's a separate discussion.
Though I'm not sure whether it's OK in general to implicitly convert
a synchronous run-queue into an asynchronous one.

-- 
Jun'ichi Nomura, NEC Corporation


^ permalink raw reply	[flat|nested] 13+ messages in thread

Thread overview: 13+ messages
2013-02-22 10:45 [PATCH 0/2] dm: Avoid use-after-free of a mapped device Bart Van Assche
2013-02-22 10:46 ` [PATCH 1/2] block: Avoid invoking blk_run_queue() recursively Bart Van Assche
2013-02-22 18:14   ` Tejun Heo
2013-02-22 18:57     ` Bart Van Assche
2013-02-22 19:01       ` Jens Axboe
2013-02-23 12:34         ` Bart Van Assche
2013-02-22 10:47 ` [PATCH 2/2] dm: Avoid use-after-free of a mapped device Bart Van Assche
2013-02-22 11:08   ` Mike Snitzer
2013-02-22 11:22     ` Bart Van Assche
2013-02-22 11:28       ` Mike Snitzer
2013-02-25  9:49   ` Jun'ichi Nomura
2013-02-25 15:09     ` Bart Van Assche
2013-02-26  0:30       ` Jun'ichi Nomura
