linux-kernel.vger.kernel.org archive mirror
* [RFC PATCH 1/2] block: allow initialization of previously allocated request_queue
@ 2010-05-10 22:55 Mike Snitzer
  2010-05-10 22:55 ` [RFC PATCH 2/2] dm: only initialize full request_queue for request-based device Mike Snitzer
  2010-05-11  6:55 ` [RFC PATCH 1/2] block: allow initialization of previously allocated request_queue Jens Axboe
  0 siblings, 2 replies; 28+ messages in thread
From: Mike Snitzer @ 2010-05-10 22:55 UTC (permalink / raw)
  To: dm-devel
  Cc: linux-kernel, Jens Axboe, Kiyoshi Ueda, Jun'ichi Nomura,
	Vivek Goyal

blk_init_queue() allocates the request_queue structure and then
initializes it as needed (request_fn, elevator, etc).

Split initialization out to blk_init_allocated_queue_node.
Introduce blk_init_allocated_queue wrapper function to model existing
blk_init_queue and blk_init_queue_node interfaces.

Export elv_register_queue to allow a newly added elevator to be
registered with sysfs.  Export elv_unregister_queue for symmetry.

These changes allow DM to initialize a device's request_queue with more
precision.  In particular, DM no longer unconditionally initializes a
full request_queue (elevator et al).  It only does so for a
request-based DM device.
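
For illustration only (not part of this patch), a driver deferring full
initialization could use the split interface roughly as follows; the
request_fn name and node id below are placeholders:

	/* allocation-only step, e.g. at device creation time */
	struct request_queue *q = blk_alloc_queue_node(GFP_KERNEL, node_id);
	if (!q)
		return -ENOMEM;

	/* ... later, once it is known that a full queue is needed ... */
	if (!blk_init_allocated_queue(q, my_request_fn, NULL))
		return -ENOMEM;

	/* make the newly added elevator visible in sysfs */
	elv_register_queue(q);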

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
---
 block/blk-core.c       |   18 +++++++++++++++++-
 block/elevator.c       |    2 ++
 include/linux/blkdev.h |    5 +++++
 3 files changed, 24 insertions(+), 1 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 9fe174d..bfd420c 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -567,7 +567,23 @@ struct request_queue *
 blk_init_queue_node(request_fn_proc *rfn, spinlock_t *lock, int node_id)
 {
 	struct request_queue *q = blk_alloc_queue_node(GFP_KERNEL, node_id);
+	
+	return blk_init_allocated_queue_node(q, rfn, lock, node_id);
+}
+EXPORT_SYMBOL(blk_init_queue_node);
 
+struct request_queue *
+blk_init_allocated_queue(struct request_queue *q, request_fn_proc *rfn,
+			 spinlock_t *lock)
+{
+	return blk_init_allocated_queue_node(q, rfn, lock, -1);
+}
+EXPORT_SYMBOL(blk_init_allocated_queue);
+
+struct request_queue *
+blk_init_allocated_queue_node(struct request_queue *q, request_fn_proc *rfn,
+			      spinlock_t *lock, int node_id)
+{
 	if (!q)
 		return NULL;
 
@@ -601,7 +617,7 @@ blk_init_queue_node(request_fn_proc *rfn, spinlock_t *lock, int node_id)
 	blk_put_queue(q);
 	return NULL;
 }
-EXPORT_SYMBOL(blk_init_queue_node);
+EXPORT_SYMBOL(blk_init_allocated_queue_node);
 
 int blk_get_queue(struct request_queue *q)
 {
diff --git a/block/elevator.c b/block/elevator.c
index 76e3702..6ec9137 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -921,6 +921,7 @@ int elv_register_queue(struct request_queue *q)
 	}
 	return error;
 }
+EXPORT_SYMBOL_GPL(elv_register_queue);
 
 static void __elv_unregister_queue(struct elevator_queue *e)
 {
@@ -933,6 +934,7 @@ void elv_unregister_queue(struct request_queue *q)
 	if (q)
 		__elv_unregister_queue(q->elevator);
 }
+EXPORT_SYMBOL_GPL(elv_unregister_queue);
 
 void elv_register(struct elevator_type *e)
 {
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 6690e8b..89ce7a2 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -917,7 +917,12 @@ extern void blk_abort_queue(struct request_queue *);
  */
 extern struct request_queue *blk_init_queue_node(request_fn_proc *rfn,
 					spinlock_t *lock, int node_id);
+extern struct request_queue *blk_init_allocated_queue_node(struct request_queue *,
+							   request_fn_proc *,
+							   spinlock_t *, int node_id);
 extern struct request_queue *blk_init_queue(request_fn_proc *, spinlock_t *);
+extern struct request_queue *blk_init_allocated_queue(struct request_queue *,
+						      request_fn_proc *, spinlock_t *);
 extern void blk_cleanup_queue(struct request_queue *);
 extern void blk_queue_make_request(struct request_queue *, make_request_fn *);
 extern void blk_queue_bounce_limit(struct request_queue *, u64);
-- 
1.6.6.1



* [RFC PATCH 2/2] dm: only initialize full request_queue for request-based device
  2010-05-10 22:55 [RFC PATCH 1/2] block: allow initialization of previously allocated request_queue Mike Snitzer
@ 2010-05-10 22:55 ` Mike Snitzer
  2010-05-11  4:23   ` Kiyoshi Ueda
  2010-05-13  4:31   ` [PATCH 2/2 v2] " Mike Snitzer
  2010-05-11  6:55 ` [RFC PATCH 1/2] block: allow initialization of previously allocated request_queue Jens Axboe
  1 sibling, 2 replies; 28+ messages in thread
From: Mike Snitzer @ 2010-05-10 22:55 UTC (permalink / raw)
  To: dm-devel
  Cc: linux-kernel, Jens Axboe, Kiyoshi Ueda, Jun'ichi Nomura,
	Vivek Goyal

Revert back to only allocating a minimalist request_queue structure
initially (needed for both bio and request-based DM).  Initialization of
a full request_queue (request_fn, elevator, etc) is deferred until it is
known that the DM device is request-based.

dm_init_request_based_queue is now called at the end of a request-based
DM device's table load.

Otherwise bio-based DM devices will have an elevator, request_fn, etc.
As a result, bio-based DM devices had an 'iosched/' tree as well as a
'scheduler' other than "none" in sysfs.

Reported-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
---
 drivers/md/dm-table.c |    5 ++++
 drivers/md/dm.c       |   52 ++++++++++++++++++++++++++++++++-----------------
 drivers/md/dm.h       |    2 +
 3 files changed, 41 insertions(+), 18 deletions(-)

diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
index 9924ea2..64a8578 100644
--- a/drivers/md/dm-table.c
+++ b/drivers/md/dm-table.c
@@ -829,6 +829,11 @@ int dm_table_set_type(struct dm_table *t)
 		return -EINVAL;
 	}
 
+	if (!dm_init_request_based_queue(t->md)) {
+		DMWARN("Cannot initialize queue for Request-based dm");
+		return -EINVAL;
+	}
+
 	t->type = DM_TYPE_REQUEST_BASED;
 
 	return 0;
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index d21e128..6df7f6c 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1886,22 +1886,10 @@ static struct mapped_device *alloc_dev(int minor)
 	INIT_LIST_HEAD(&md->uevent_list);
 	spin_lock_init(&md->uevent_lock);
 
-	md->queue = blk_init_queue(dm_request_fn, NULL);
+	md->queue = blk_alloc_queue(GFP_KERNEL);
 	if (!md->queue)
 		goto bad_queue;
 
-	/*
-	 * Request-based dm devices cannot be stacked on top of bio-based dm
-	 * devices.  The type of this dm device has not been decided yet,
-	 * although we initialized the queue using blk_init_queue().
-	 * The type is decided at the first table loading time.
-	 * To prevent problematic device stacking, clear the queue flag
-	 * for request stacking support until then.
-	 *
-	 * This queue is new, so no concurrency on the queue_flags.
-	 */
-	queue_flag_clear_unlocked(QUEUE_FLAG_STACKABLE, md->queue);
-	md->saved_make_request_fn = md->queue->make_request_fn;
 	md->queue->queuedata = md;
 	md->queue->backing_dev_info.congested_fn = dm_any_congested;
 	md->queue->backing_dev_info.congested_data = md;
@@ -1909,11 +1897,6 @@ static struct mapped_device *alloc_dev(int minor)
 	blk_queue_bounce_limit(md->queue, BLK_BOUNCE_ANY);
 	md->queue->unplug_fn = dm_unplug_all;
 	blk_queue_merge_bvec(md->queue, dm_merge_bvec);
-	blk_queue_softirq_done(md->queue, dm_softirq_done);
-	blk_queue_prep_rq(md->queue, dm_prep_fn);
-	blk_queue_lld_busy(md->queue, dm_lld_busy);
-	blk_queue_ordered(md->queue, QUEUE_ORDERED_DRAIN_FLUSH,
-			  dm_rq_prepare_flush);
 
 	md->disk = alloc_disk(1);
 	if (!md->disk)
@@ -1968,6 +1951,39 @@ bad_module_get:
 	return NULL;
 }
 
+int dm_init_request_based_queue(struct mapped_device *md)
+{
+	struct request_queue *q = NULL;
+
+	q = blk_init_allocated_queue(md->queue, dm_request_fn, NULL);
+	if (!q)
+		return 0;
+	md->queue = q;
+
+	/*
+	 * Request-based dm devices cannot be stacked on top of bio-based dm
+	 * devices.  The type of this dm device has not been decided yet.
+	 * The type is decided at the first table loading time.
+	 * To prevent problematic device stacking, clear the queue flag
+	 * for request stacking support until then.
+	 *
+	 * This queue is new, so no concurrency on the queue_flags.
+	 */
+	queue_flag_clear_unlocked(QUEUE_FLAG_STACKABLE, md->queue);
+	md->saved_make_request_fn = md->queue->make_request_fn;
+
+	blk_queue_softirq_done(md->queue, dm_softirq_done);
+	blk_queue_prep_rq(md->queue, dm_prep_fn);
+	blk_queue_lld_busy(md->queue, dm_lld_busy);
+	blk_queue_ordered(md->queue, QUEUE_ORDERED_DRAIN_FLUSH,
+			  dm_rq_prepare_flush);
+
+	/* Register the request-based queue's elevator with sysfs */
+	elv_register_queue(md->queue);
+
+	return 1;
+}
+
 static void unlock_fs(struct mapped_device *md);
 
 static void free_dev(struct mapped_device *md)
diff --git a/drivers/md/dm.h b/drivers/md/dm.h
index bad1724..489feba 100644
--- a/drivers/md/dm.h
+++ b/drivers/md/dm.h
@@ -113,6 +113,8 @@ void dm_sysfs_exit(struct mapped_device *md);
 struct kobject *dm_kobject(struct mapped_device *md);
 struct mapped_device *dm_get_from_kobject(struct kobject *kobj);
 
+int dm_init_request_based_queue(struct mapped_device *md);
+
 /*
  * Targets for linear and striped mappings
  */
-- 
1.6.6.1



* Re: [RFC PATCH 2/2] dm: only initialize full request_queue for request-based device
  2010-05-10 22:55 ` [RFC PATCH 2/2] dm: only initialize full request_queue for request-based device Mike Snitzer
@ 2010-05-11  4:23   ` Kiyoshi Ueda
  2010-05-11 13:15     ` Mike Snitzer
  2010-05-13  4:31   ` [PATCH 2/2 v2] " Mike Snitzer
  1 sibling, 1 reply; 28+ messages in thread
From: Kiyoshi Ueda @ 2010-05-11  4:23 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: dm-devel, linux-kernel, Jens Axboe, Jun'ichi Nomura,
	Vivek Goyal

Hi Mike,

On 05/11/2010 07:55 AM +0900, Mike Snitzer wrote:
> Revert back to only allocating a minimalist request_queue structure
> initially (needed for both bio and request-based DM).  Initialization of
> a full request_queue (request_fn, elevator, etc) is deferred until it is
> known that the DM device is request-based.

Thank you for working on this.
However, I still disagree with this patch as we discussed on this thread:
http://marc.info/?t=124990138700003&r=1&w=2
(Exporting a part of queue's features may cause some maintenance costs
 in future.)

As I mentioned on the last email of the thread above (see below),
specifying device type at the device creation time by userspace tools
should make dm code very simple.  So that may be a better approach.

> By the way, another approach to optimizing the memory usage would be
> to determine whether the dm device is bio-based or request-based
> at the device creation time, instead of the table binding time.
> We want the delayed allocation, since kernel can't decide the device
> type until the first table is bound because of the auto-detection
> mechanism.  The auto-detection is good for keeping compatibility with
> existing user-space tools.  But once user-space tools are changed to
> specify device type at the device creation time, we can eventually
> remove the auto-detection.
> Then, kernel can decide device type in alloc_dev(), so
> the initialization code in kernel will become very simple.

Thanks,
Kiyoshi Ueda


* Re: [RFC PATCH 1/2] block: allow initialization of previously allocated request_queue
  2010-05-10 22:55 [RFC PATCH 1/2] block: allow initialization of previously allocated request_queue Mike Snitzer
  2010-05-10 22:55 ` [RFC PATCH 2/2] dm: only initialize full request_queue for request-based device Mike Snitzer
@ 2010-05-11  6:55 ` Jens Axboe
  2010-05-11 13:18   ` Mike Snitzer
  1 sibling, 1 reply; 28+ messages in thread
From: Jens Axboe @ 2010-05-11  6:55 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: dm-devel, linux-kernel, Kiyoshi Ueda, Jun'ichi Nomura,
	Vivek Goyal

On Mon, May 10 2010, Mike Snitzer wrote:
> blk_init_queue() allocates the request_queue structure and then
> initializes it as needed (request_fn, elevator, etc).
> 
> Split initialization out to blk_init_allocated_queue_node.
> Introduce blk_init_allocated_queue wrapper function to model existing
> blk_init_queue and blk_init_queue_node interfaces.
> 
> Export elv_register_queue to allow a newly added elevator to be
> registered with sysfs.  Export elv_unregister_queue for symmetry.
> 
> These changes allow DM to initialize a device's request_queue with more
> precision.  In particular, DM no longer unconditionally initializes a
> full request_queue (elevator et al).  It only does so for a
> request-based DM device.

Looks good, I'll apply this. I don't think the exports need to be _GPL
in this case, generally I've only done that with the exports for hooking
in a new IO scheduler.

-- 
Jens Axboe



* Re: [RFC PATCH 2/2] dm: only initialize full request_queue for request-based device
  2010-05-11  4:23   ` Kiyoshi Ueda
@ 2010-05-11 13:15     ` Mike Snitzer
  2010-05-12  8:23       ` Kiyoshi Ueda
  0 siblings, 1 reply; 28+ messages in thread
From: Mike Snitzer @ 2010-05-11 13:15 UTC (permalink / raw)
  To: Kiyoshi Ueda
  Cc: dm-devel, linux-kernel, Jens Axboe, Jun'ichi Nomura,
	Vivek Goyal, Nikanth Karthikesan

On Tue, May 11 2010 at 12:23am -0400,
Kiyoshi Ueda <k-ueda@ct.jp.nec.com> wrote:

> Hi Mike,
> 
> On 05/11/2010 07:55 AM +0900, Mike Snitzer wrote:
> > Revert back to only allocating a minimalist request_queue structure
> > initially (needed for both bio and request-based DM).  Initialization of
> > a full request_queue (request_fn, elevator, etc) is deferred until it is
> > known that the DM device is request-based.
> 
> Thank you for working on this.
> However, I still disagree with this patch as we discussed on this thread:
> http://marc.info/?t=124990138700003&r=1&w=2
> (Exporting a part of queue's features may cause some maintenance costs
>  in future.)

Thanks for the reference.  I completely forgot about that thread (even
though I responded to Nikanth's patches in detail! :)

It is clear we need to resolve the current full request_queue
initialization that occurs even for bio-based DM devices.

I believe the 2 patches I posted accomplish this in a straightforward
way.  We can always improve on it (by looking at what you proposed
below) but we need a minimalist fix that doesn't depend on userspace
LVM2 changes right now.

Interestingly, having to revisit this issue (forgetting that this line
of work was already explored) I came up with roughly the same type of
change to the block layer as Nikanth's 1/2 patch.  The difference being
my blk_init_allocated_queue is more minimalist because the block code
has evolved to allow this change to be more natural.

Similarly, my proposed DM changes are also quite natural.  By using
dm_table_set_type() as the hook to initialize the request-based DM
device's elevator we perform allocations during table load.

Having just looked at Nikanth's proposed DM patch 2/2 again it shows
that blk_init_allocated_queue(), which allocates memory, was being
called during resume (dm_swap_table).  Allocations are not allowed
during resume.

> As I mentioned on the last email of the thread above (see below),
> specifying device type at the device creation time by userspace tools
> should make dm code very simple.  So that may be a better approach.
> 
> > By the way, another approach to optimizing the memory usage would be
> > to determine whether the dm device is bio-based or request-based
> > at the device creation time, instead of the table binding time.
> > We want the delayed allocation, since kernel can't decide the device
> > type until the first table is bound because of the auto-detection
> > mechanism.  The auto-detection is good for keeping compatibility with
> > existing user-space tools.  But once user-space tools are changed to
> > specify device type at the device creation time, we can eventually
> > remove the auto-detection.
> > Then, kernel can decide device type in alloc_dev(), so
> > the initialization code in kernel will become very simple.
> 
> Thanks,
> Kiyoshi Ueda


* Re: [RFC PATCH 1/2] block: allow initialization of previously allocated request_queue
  2010-05-11  6:55 ` [RFC PATCH 1/2] block: allow initialization of previously allocated request_queue Jens Axboe
@ 2010-05-11 13:18   ` Mike Snitzer
  2010-05-11 13:21     ` Jens Axboe
  0 siblings, 1 reply; 28+ messages in thread
From: Mike Snitzer @ 2010-05-11 13:18 UTC (permalink / raw)
  To: Jens Axboe
  Cc: dm-devel, linux-kernel, Kiyoshi Ueda, Jun'ichi Nomura,
	Vivek Goyal, Nikanth Karthikesan

On Tue, May 11 2010 at  2:55am -0400,
Jens Axboe <jens.axboe@oracle.com> wrote:

> On Mon, May 10 2010, Mike Snitzer wrote:
> > blk_init_queue() allocates the request_queue structure and then
> > initializes it as needed (request_fn, elevator, etc).
> > 
> > Split initialization out to blk_init_allocated_queue_node.
> > Introduce blk_init_allocated_queue wrapper function to model existing
> > blk_init_queue and blk_init_queue_node interfaces.
> > 
> > Export elv_register_queue to allow a newly added elevator to be
> > registered with sysfs.  Export elv_unregister_queue for symmetry.
> > 
> > These changes allow DM to initialize a device's request_queue with more
> > precision.  In particular, DM no longer unconditionally initializes a
> > full request_queue (elevator et al).  It only does so for a
> > request-based DM device.
> 
> Looks good, I'll apply this. I don't think the exports need to be _GPL
> in this case, generally I've only done that with the exports for hooking
> in a new IO scheduler.

OK, thanks for picking this up.  Do you want me to send v2 that drops
the _GPL or will you make those changes?

Mike


* Re: [RFC PATCH 1/2] block: allow initialization of previously allocated request_queue
  2010-05-11 13:18   ` Mike Snitzer
@ 2010-05-11 13:21     ` Jens Axboe
  0 siblings, 0 replies; 28+ messages in thread
From: Jens Axboe @ 2010-05-11 13:21 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: dm-devel, linux-kernel, Kiyoshi Ueda, Jun'ichi Nomura,
	Vivek Goyal, Nikanth Karthikesan

On Tue, May 11 2010, Mike Snitzer wrote:
> On Tue, May 11 2010 at  2:55am -0400,
> Jens Axboe <jens.axboe@oracle.com> wrote:
> 
> > On Mon, May 10 2010, Mike Snitzer wrote:
> > > blk_init_queue() allocates the request_queue structure and then
> > > initializes it as needed (request_fn, elevator, etc).
> > > 
> > > Split initialization out to blk_init_allocated_queue_node.
> > > Introduce blk_init_allocated_queue wrapper function to model existing
> > > blk_init_queue and blk_init_queue_node interfaces.
> > > 
> > > Export elv_register_queue to allow a newly added elevator to be
> > > registered with sysfs.  Export elv_unregister_queue for symmetry.
> > > 
> > > These changes allow DM to initialize a device's request_queue with more
> > > precision.  In particular, DM no longer unconditionally initializes a
> > > full request_queue (elevator et al).  It only does so for a
> > > request-based DM device.
> > 
> > Looks good, I'll apply this. I don't think the exports need to be _GPL
> > in this case, generally I've only done that with the exports for hooking
> > in a new IO scheduler.
> 
> OK, thanks for picking this up.  Do you want me to send v2 that drops
> the _GPL or will you make those changes?

I just hand-edited that part. I can add the dm bit as well, when it has
been reviewed and acked.

-- 
Jens Axboe



* Re: [RFC PATCH 2/2] dm: only initialize full request_queue for request-based device
  2010-05-11 13:15     ` Mike Snitzer
@ 2010-05-12  8:23       ` Kiyoshi Ueda
  2010-05-13  3:57         ` Mike Snitzer
  0 siblings, 1 reply; 28+ messages in thread
From: Kiyoshi Ueda @ 2010-05-12  8:23 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: dm-devel, linux-kernel, Jens Axboe, Jun'ichi Nomura,
	Vivek Goyal, Nikanth Karthikesan

Hi Mike,

On 05/11/2010 10:15 PM +0900, Mike Snitzer wrote:
> On Tue, May 11 2010 at 12:23am -0400,
> Kiyoshi Ueda <k-ueda@ct.jp.nec.com> wrote:
> 
>> Hi Mike,
>>
>> On 05/11/2010 07:55 AM +0900, Mike Snitzer wrote:
>>> Revert back to only allocating a minimalist request_queue structure
>>> initially (needed for both bio and request-based DM).  Initialization of
>>> a full request_queue (request_fn, elevator, etc) is deferred until it is
>>> known that the DM device is request-based.
>>
>> Thank you for working on this.
>> However, I still disagree with this patch as we discussed on this thread:
>> http://marc.info/?t=124990138700003&r=1&w=2
>> (Exporting a part of queue's features may cause some maintenance costs
>>  in future.)
> 
> Thanks for the reference.  I completely forgot about that thread (even
> though I responded to Nikanth's patches in detail! :)
> 
> It is clear we need to resolve the current full request_queue
> initialization that occurs even for bio-based DM devices.
> 
> I believe the 2 patches I posted accomplish this in a straightforward
> way.  We can always improve on it (by looking at what you proposed
> below) but we need a minimalist fix that doesn't depend on userspace
> LVM2 changes right now.

Humm, OK.
Indeed, showing iosched directory in bio-based device's sysfs is
confusing users actually, and we need something to resolve that soon.
So I don't strongly object to your approach as the first step, as long
as we can accept the risk of the maintenance cost which I mentioned.

By the way, your current patch has a problem below.
It needs to be fixed at least.


> Similarly, my proposed DM changes are also quite natural. By using
> dm_table_set_type() as the hook to initialize the request-based DM
> device's elevator we perform allocations during table load.

Your patch initializes the queue every time a request-based table is loaded.
I think that could cause a problem (although I haven't tested your
patch yet).


Also, as you know, the table load can be canceled.
So initializing queue at table loading time may cause some weird
behaviors for users.  For example,
    # dmsetup create --notable bio-based
    # echo "0 100 multipath ..." | dmsetup load bio-based
    # echo "0 100 linear ..."    | dmsetup load bio-based
    # dmsetup resume bio-based
    # ls /sys/block/dm-0/queue/
    ... iosched ...
If you (and Alasdair) think this behavior is acceptable, it might
be OK.  I just feel it's weird though...


> Having just looked at Nikanth's proposed DM patch 2/2 again it shows
> that blk_init_allocated_queue(), which allocates memory, was being
> called during resume (dm_swap_table).  Allocations are not allowed
> during resume.

Right, in general.
However, in this special case, I think initializing queue (allocating
memory) during resume should be OK, as I mentioned below in:
    http://marc.info/?l=linux-kernel&m=124999806420663&w=2

    > Generally, dm must not allocate memory during resume because
    > it may cause a deadlock in no memory situation.
    > However, there is no I/O on this device at this point,
    > so the allocation should be ok for this special case.
    > I think some comments are needed here to describe that.

So you should be able to take this approach.

Thanks,
Kiyoshi Ueda


* Re: [RFC PATCH 2/2] dm: only initialize full request_queue for request-based device
  2010-05-12  8:23       ` Kiyoshi Ueda
@ 2010-05-13  3:57         ` Mike Snitzer
  2010-05-14  8:06           ` Kiyoshi Ueda
  0 siblings, 1 reply; 28+ messages in thread
From: Mike Snitzer @ 2010-05-13  3:57 UTC (permalink / raw)
  To: Kiyoshi Ueda
  Cc: dm-devel, linux-kernel, Jens Axboe, Jun'ichi Nomura,
	Vivek Goyal, Nikanth Karthikesan

On Wed, May 12 2010 at  4:23am -0400,
Kiyoshi Ueda <k-ueda@ct.jp.nec.com> wrote:

> Hi Mike,
> 
> On 05/11/2010 10:15 PM +0900, Mike Snitzer wrote:
> > It is clear we need to resolve the current full request_queue
> > initialization that occurs even for bio-based DM devices.
> > 
> > I believe the 2 patches I posted accomplish this in a straightforward
> > way.  We can always improve on it (by looking at what you proposed
> > below) but we need a minimalist fix that doesn't depend on userspace
> > LVM2 changes right now.
> 
> Humm, OK.
> Indeed, showing iosched directory in bio-based device's sysfs is
> confusing users actually, and we need something to resolve that soon.
> So I don't strongly object to your approach as the first step, as long
> as we can accept the risk of the maintenance cost which I mentioned.

OK, I understand your concern (especially after having gone through this
further over the past couple days).  I'm hopeful maintenance will be
minimal.

> By the way, your current patch has a problem below.
> It needs to be fixed at least.

Thanks for the review.  I've addressed both your points with additional
changes (split between 2 patches).

> > Similarly, my proposed DM changes are also quite natural. By using
> > dm_table_set_type() as the hook to initialize the request-based DM
> > device's elevator we perform allocations during table load.
> 
> Your patch initializes the queue every time a request-based table is loaded.
> I think that could cause a problem (although I haven't tested your
> patch yet).

Yes, that was definitely a problem.  I've included an incremental patch
that shows the fixes to dm_init_request_based_queue below.  I'll send
out a revised patch: [PATCH 2/2 v2] ...

If you're comfortable with it please respond with your Acked-by.

> Also, as you know, the table load can be canceled.
> So initializing queue at table loading time may cause some weird
> behaviors for users.  For example,
>     # dmsetup create --notable bio-based
>     # echo "0 100 multipath ..." | dmsetup load bio-based
>     # echo "0 100 linear ..."    | dmsetup load bio-based
>     # dmsetup resume bio-based
>     # ls /sys/block/dm-0/queue/
>     ... iosched ...
> If you (and Alasdair) think this behavior is acceptable, it might
> be OK.  I just feel it's weird though...

Agreed, we don't think this is ideal.  I have a follow-on patch that
I'll mail out separately for your consideration ([RFC PATCH 3/2]).

> > Having just looked at Nikanth's proposed DM patch 2/2 again it shows
> > that blk_init_allocated_queue(), which allocates memory, was being
> > called during resume (dm_swap_table).  Allocations are not allowed
> > during resume.
> 
> Right, in general.
> However, in this special case, I think initializing queue (allocating
> memory) during resume should be OK, as I mentioned below in:
>     http://marc.info/?l=linux-kernel&m=124999806420663&w=2
> 
>     > Generally, dm must not allocate memory during resume because
>     > it may cause a deadlock in no memory situation.
>     > However, there is no I/O on this device at this point,
>     > so the allocation should be ok for this special case.
>     > I think some comments are needed here to describe that.
> 
> So you should be able to take this approach.

I confirmed with Alasdair that we don't want to conditionally relax DM's
allocation constraints for request-based DM.  Best to be consistent
about not allowing allocations during resume.

Here is the incremental patch I mentioned above.
---
Author: Mike Snitzer <snitzer@redhat.com>
Date:   Wed May 12 18:52:01 2010 -0400

    idempotent dm_init_request_based_queue

diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 6df7f6c..b2171be 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1951,14 +1951,30 @@ bad_module_get:
 	return NULL;
 }
 
+/*
+ * Fully initialize the request_queue (elevator, ->request_fn, etc).
+ */
 int dm_init_request_based_queue(struct mapped_device *md)
 {
 	struct request_queue *q = NULL;
 
-	q = blk_init_allocated_queue(md->queue, dm_request_fn, NULL);
-	if (!q)
+	if (!md->queue)
 		return 0;
-	md->queue = q;
+
+	/*
+	 * Avoid re-initializing the queue (and leaking the existing
+	 * elevator) if dm_init_request_based_queue() was already used.
+	 */
+	if (!md->queue->elevator) {
+		/* Fully initialize the queue */
+		q = blk_init_allocated_queue(md->queue, dm_request_fn, NULL);
+		if (!q)
+			return 0;
+		md->queue = q;
+		md->saved_make_request_fn = md->queue->make_request_fn;
+		blk_queue_make_request(md->queue, dm_request);
+		elv_register_queue(md->queue);
+	}
 
 	/*
 	 * Request-based dm devices cannot be stacked on top of bio-based dm
@@ -1970,7 +1986,6 @@ int dm_init_request_based_queue(struct mapped_device *md)
 	 * This queue is new, so no concurrency on the queue_flags.
 	 */
 	queue_flag_clear_unlocked(QUEUE_FLAG_STACKABLE, md->queue);
-	md->saved_make_request_fn = md->queue->make_request_fn;
 
 	blk_queue_softirq_done(md->queue, dm_softirq_done);
 	blk_queue_prep_rq(md->queue, dm_prep_fn);
@@ -1978,9 +1993,6 @@ int dm_init_request_based_queue(struct mapped_device *md)
 	blk_queue_ordered(md->queue, QUEUE_ORDERED_DRAIN_FLUSH,
 			  dm_rq_prepare_flush);
 
-	/* Register the request-based queue's elevator with sysfs */
-	elv_register_queue(md->queue);
-
 	return 1;
 }
 


* [PATCH 2/2 v2] dm: only initialize full request_queue for request-based device
  2010-05-10 22:55 ` [RFC PATCH 2/2] dm: only initialize full request_queue for request-based device Mike Snitzer
  2010-05-11  4:23   ` Kiyoshi Ueda
@ 2010-05-13  4:31   ` Mike Snitzer
  2010-05-13  5:02     ` [RFC PATCH 3/2] dm: bio-based device must not register elevator in sysfs Mike Snitzer
  1 sibling, 1 reply; 28+ messages in thread
From: Mike Snitzer @ 2010-05-13  4:31 UTC (permalink / raw)
  To: dm-devel
  Cc: Jun'ichi Nomura, Kiyoshi Ueda, linux-kernel, Vivek Goyal,
	Jens Axboe, Nikanth Karthikesan

Revert back to only allocating a minimalist request_queue structure
initially (needed for both bio and request-based DM).  Initialization of
a full request_queue (request_fn, elevator, etc) is deferred until it is
known that the DM device is request-based.

dm_init_request_based_queue is now called at the end of a request-based
DM device's table load.

Otherwise bio-based DM devices will have an elevator, request_fn, etc.
As a result bio-based DM devices had an 'iosched/' tree as well as a
'scheduler' other than "none" in sysfs.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
---
 drivers/md/dm-table.c |    5 ++++
 drivers/md/dm.c       |   64 +++++++++++++++++++++++++++++++++++-------------
 drivers/md/dm.h       |    2 +
 3 files changed, 53 insertions(+), 18 deletions(-)

diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
index 9924ea2..64a8578 100644
--- a/drivers/md/dm-table.c
+++ b/drivers/md/dm-table.c
@@ -829,6 +829,11 @@ int dm_table_set_type(struct dm_table *t)
 		return -EINVAL;
 	}
 
+	if (!dm_init_request_based_queue(t->md)) {
+		DMWARN("Cannot initialize queue for Request-based dm");
+		return -EINVAL;
+	}
+
 	t->type = DM_TYPE_REQUEST_BASED;
 
 	return 0;
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index d21e128..b2171be 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1886,22 +1886,10 @@ static struct mapped_device *alloc_dev(int minor)
 	INIT_LIST_HEAD(&md->uevent_list);
 	spin_lock_init(&md->uevent_lock);
 
-	md->queue = blk_init_queue(dm_request_fn, NULL);
+	md->queue = blk_alloc_queue(GFP_KERNEL);
 	if (!md->queue)
 		goto bad_queue;
 
-	/*
-	 * Request-based dm devices cannot be stacked on top of bio-based dm
-	 * devices.  The type of this dm device has not been decided yet,
-	 * although we initialized the queue using blk_init_queue().
-	 * The type is decided at the first table loading time.
-	 * To prevent problematic device stacking, clear the queue flag
-	 * for request stacking support until then.
-	 *
-	 * This queue is new, so no concurrency on the queue_flags.
-	 */
-	queue_flag_clear_unlocked(QUEUE_FLAG_STACKABLE, md->queue);
-	md->saved_make_request_fn = md->queue->make_request_fn;
 	md->queue->queuedata = md;
 	md->queue->backing_dev_info.congested_fn = dm_any_congested;
 	md->queue->backing_dev_info.congested_data = md;
@@ -1909,11 +1897,6 @@ static struct mapped_device *alloc_dev(int minor)
 	blk_queue_bounce_limit(md->queue, BLK_BOUNCE_ANY);
 	md->queue->unplug_fn = dm_unplug_all;
 	blk_queue_merge_bvec(md->queue, dm_merge_bvec);
-	blk_queue_softirq_done(md->queue, dm_softirq_done);
-	blk_queue_prep_rq(md->queue, dm_prep_fn);
-	blk_queue_lld_busy(md->queue, dm_lld_busy);
-	blk_queue_ordered(md->queue, QUEUE_ORDERED_DRAIN_FLUSH,
-			  dm_rq_prepare_flush);
 
 	md->disk = alloc_disk(1);
 	if (!md->disk)
@@ -1968,6 +1951,51 @@ bad_module_get:
 	return NULL;
 }
 
+/*
+ * Fully initialize the request_queue (elevator, ->request_fn, etc).
+ */
+int dm_init_request_based_queue(struct mapped_device *md)
+{
+	struct request_queue *q = NULL;
+
+	if (!md->queue)
+		return 0;
+
+	/*
+	 * Avoid re-initializing the queue (and leaking the existing
+	 * elevator) if dm_init_request_based_queue() was already used.
+	 */
+	if (!md->queue->elevator) {
+		/* Fully initialize the queue */
+		q = blk_init_allocated_queue(md->queue, dm_request_fn, NULL);
+		if (!q)
+			return 0;
+		md->queue = q;
+		md->saved_make_request_fn = md->queue->make_request_fn;
+		blk_queue_make_request(md->queue, dm_request);
+		elv_register_queue(md->queue);
+	}
+
+	/*
+	 * Request-based dm devices cannot be stacked on top of bio-based dm
+	 * devices.  The type of this dm device has not been decided yet.
+	 * The type is decided at the first table loading time.
+	 * To prevent problematic device stacking, clear the queue flag
+	 * for request stacking support until then.
+	 *
+	 * This queue is new, so no concurrency on the queue_flags.
+	 */
+	queue_flag_clear_unlocked(QUEUE_FLAG_STACKABLE, md->queue);
+
+	blk_queue_softirq_done(md->queue, dm_softirq_done);
+	blk_queue_prep_rq(md->queue, dm_prep_fn);
+	blk_queue_lld_busy(md->queue, dm_lld_busy);
+	blk_queue_ordered(md->queue, QUEUE_ORDERED_DRAIN_FLUSH,
+			  dm_rq_prepare_flush);
+
+	return 1;
+}
+
 static void unlock_fs(struct mapped_device *md);
 
 static void free_dev(struct mapped_device *md)
diff --git a/drivers/md/dm.h b/drivers/md/dm.h
index bad1724..489feba 100644
--- a/drivers/md/dm.h
+++ b/drivers/md/dm.h
@@ -113,6 +113,8 @@ void dm_sysfs_exit(struct mapped_device *md);
 struct kobject *dm_kobject(struct mapped_device *md);
 struct mapped_device *dm_get_from_kobject(struct kobject *kobj);
 
+int dm_init_request_based_queue(struct mapped_device *md);
+
 /*
  * Targets for linear and striped mappings
  */


* [RFC PATCH 3/2] dm: bio-based device must not register elevator in sysfs
  2010-05-13  4:31   ` [PATCH 2/2 v2] " Mike Snitzer
@ 2010-05-13  5:02     ` Mike Snitzer
  2010-05-13 22:14       ` [PATCH 3/2 v2] " Mike Snitzer
  0 siblings, 1 reply; 28+ messages in thread
From: Mike Snitzer @ 2010-05-13  5:02 UTC (permalink / raw)
  To: dm-devel
  Cc: Kiyoshi Ueda, Nikanth Karthikesan, linux-kernel, Jens Axboe,
	Jun'ichi Nomura, Vivek Goyal

Properly clear io scheduler state from sysfs if transitioning from a
request-based table back to bio-based.  Unregister the elevator from
sysfs (removes 'iosched/') and have the 'scheduler' sysfs attribute
properly report "none" for bio-based DM.

Adjust elv_iosched_show() to return "none" if !blk_queue_stackable.
This allows the block layer to take bio-based DM into consideration.

These changes address the following (albeit obscure) corner-case:
   # dmsetup create --notable bio-based
   # echo "0 100 multipath ..." | dmsetup load bio-based
   # echo "0 100 linear ..."    | dmsetup load bio-based
   # dmsetup resume bio-based
   # ls /sys/block/dm-0/queue/
   ... iosched ...

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
---
 block/elevator.c      |    2 +-
 drivers/md/dm-table.c |    1 +
 drivers/md/dm.c       |   41 ++++++++++++++++++++++++++++++++++++++---
 drivers/md/dm.h       |    1 +
 4 files changed, 41 insertions(+), 4 deletions(-)

diff --git a/block/elevator.c b/block/elevator.c
index 6ec9137..168112e 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -1088,7 +1088,7 @@ ssize_t elv_iosched_show(struct request_queue *q, char *name)
 	struct elevator_type *__e;
 	int len = 0;
 
-	if (!q->elevator)
+	if (!q->elevator || !blk_queue_stackable(q))
 		return sprintf(name, "none\n");
 
 	elv = e->elevator_type;
diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
index 64a8578..dec8295 100644
--- a/drivers/md/dm-table.c
+++ b/drivers/md/dm-table.c
@@ -803,6 +803,7 @@ int dm_table_set_type(struct dm_table *t)
 	if (bio_based) {
 		/* We must use this table as bio-based */
 		t->type = DM_TYPE_BIO_BASED;
+		dm_clear_request_based_queue(t->md);
 		return 0;
 	}
 
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index b2171be..08a4745 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1957,9 +1957,9 @@ bad_module_get:
 int dm_init_request_based_queue(struct mapped_device *md)
 {
 	struct request_queue *q = NULL;
+	int register_queue = 0;
 
-	if (!md->queue)
-		return 0;
+	BUG_ON(!md->queue);
 
 	/*
 	 * Avoid re-initializing the queue (and leaking the existing
@@ -1973,9 +1973,22 @@ int dm_init_request_based_queue(struct mapped_device *md)
 		md->queue = q;
 		md->saved_make_request_fn = md->queue->make_request_fn;
 		blk_queue_make_request(md->queue, dm_request);
-		elv_register_queue(md->queue);
+		register_queue = 1;
+	} else if (!md->queue->request_fn) {
+		/*
+		 * Handle corner-case where bio-based table was cleared
+		 * and requested-based table was loaded in its place.
+		 * - re-instate the ->request_fn that was cleared
+		 *   by dm_clear_request_based_queue().
+		 * - register the queue's 'iosched/' sysfs attributes
+		 */
+		md->queue->request_fn = dm_request_fn;
+		register_queue = 1;
 	}
 
+	if (register_queue)
+		elv_register_queue(md->queue);
+
 	/*
 	 * Request-based dm devices cannot be stacked on top of bio-based dm
 	 * devices.  The type of this dm device has not been decided yet.
@@ -1996,6 +2009,28 @@ int dm_init_request_based_queue(struct mapped_device *md)
 	return 1;
 }
 
+void dm_clear_request_based_queue(struct mapped_device *md)
+{
+	BUG_ON(!md->queue);
+
+	if (!md->queue->elevator)
+		return; /* already bio-based so return */
+
+	/*
+	 * Clear ->request_fn, treat NULL as a flag to indicate a
+	 * bio-based table was loaded after a request-based table.
+	 */
+	md->queue->request_fn = NULL;
+
+	/*
+	 * Unregister the request-based queue's elevator from sysfs.
+	 * NOTE: md->queue doesn't have the QUEUE_FLAG_STACKABLE flag
+	 * set so elv_iosched_show() will report the allocated queue's
+	 * type as "none" through the 'scheduler' sysfs attribute.
+	 */
+	elv_unregister_queue(md->queue);
+}
+
 static void unlock_fs(struct mapped_device *md);
 
 static void free_dev(struct mapped_device *md)
diff --git a/drivers/md/dm.h b/drivers/md/dm.h
index 489feba..95aec64 100644
--- a/drivers/md/dm.h
+++ b/drivers/md/dm.h
@@ -114,6 +114,7 @@ struct kobject *dm_kobject(struct mapped_device *md);
 struct mapped_device *dm_get_from_kobject(struct kobject *kobj);
 
 int dm_init_request_based_queue(struct mapped_device *md);
+void dm_clear_request_based_queue(struct mapped_device *md);
 
 /*
  * Targets for linear and striped mappings


* [PATCH 3/2 v2] dm: bio-based device must not register elevator in sysfs
  2010-05-13  5:02     ` [RFC PATCH 3/2] dm: bio-based device must not register elevator in sysfs Mike Snitzer
@ 2010-05-13 22:14       ` Mike Snitzer
  0 siblings, 0 replies; 28+ messages in thread
From: Mike Snitzer @ 2010-05-13 22:14 UTC (permalink / raw)
  To: dm-devel
  Cc: Kiyoshi Ueda, Nikanth Karthikesan, linux-kernel, Jens Axboe,
	Jun'ichi Nomura, Vivek Goyal

Properly clear io scheduler state from sysfs if transitioning from a
request-based table back to bio-based.  Unregister the elevator from
sysfs (removes 'iosched/') and have the 'scheduler' sysfs attribute
properly report "none" for bio-based DM.

Adjust elv_iosched_show() to return "none" if !blk_queue_stackable.
This allows the block layer to take bio-based DM into consideration.

These changes address the following (albeit obscure) corner-case:
   # dmsetup create --notable bio-based
   # echo "0 100 multipath ..." | dmsetup load bio-based
   # echo "0 100 linear ..."    | dmsetup load bio-based
   # dmsetup resume bio-based
   # ls /sys/block/dm-0/queue/
   ... iosched ...

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
---
 block/elevator.c      |    2 +-
 drivers/md/dm-table.c |    1 +
 drivers/md/dm.c       |   44 ++++++++++++++++++++++++++++++++++++--------
 drivers/md/dm.h       |    1 +
 4 files changed, 39 insertions(+), 9 deletions(-)

v2 changes:
* switched to using BUG_ON in dm_{init,clear}_request_based_queue
* refined dm_clear_request_based_queue slightly
* introduced dm_bio_based_md
* cleaned up some comments

Index: linux-2.6/drivers/md/dm-table.c
===================================================================
--- linux-2.6.orig/drivers/md/dm-table.c
+++ linux-2.6/drivers/md/dm-table.c
@@ -803,6 +803,7 @@ int dm_table_set_type(struct dm_table *t
 	if (bio_based) {
 		/* We must use this table as bio-based */
 		t->type = DM_TYPE_BIO_BASED;
+		dm_clear_request_based_queue(t->md);
 		return 0;
 	}
 
Index: linux-2.6/drivers/md/dm.c
===================================================================
--- linux-2.6.orig/drivers/md/dm.c
+++ linux-2.6/drivers/md/dm.c
@@ -1445,6 +1445,11 @@ static int dm_request_based(struct mappe
 	return blk_queue_stackable(md->queue);
 }
 
+static int dm_bio_based_md(struct mapped_device *md)
+{
+	return (md->queue->request_fn) ? 0 : 1;
+}
+
 static int dm_request(struct request_queue *q, struct bio *bio)
 {
 	struct mapped_device *md = q->queuedata;
@@ -1952,19 +1957,16 @@ bad_module_get:
 }
 
 /*
- * Fully initialize the request_queue (elevator, ->request_fn, etc).
+ * Fully initialize a request-based queue (->elevator, ->request_fn, etc).
  */
 int dm_init_request_based_queue(struct mapped_device *md)
 {
 	struct request_queue *q = NULL;
+	int register_elevator = 0;
 
-	if (!md->queue)
-		return 0;
+	BUG_ON(!md->queue);
 
-	/*
-	 * Avoid re-initializing the queue (and leaking the existing
-	 * elevator) if dm_init_request_based_queue() was already used.
-	 */
+	/* Avoid re-initializing the queue if already fully initialized */
 	if (!md->queue->elevator) {
 		/* Fully initialize the queue */
 		q = blk_init_allocated_queue(md->queue, dm_request_fn, NULL);
@@ -1973,9 +1975,21 @@ int dm_init_request_based_queue(struct m
 		md->queue = q;
 		md->saved_make_request_fn = md->queue->make_request_fn;
 		blk_queue_make_request(md->queue, dm_request);
-		elv_register_queue(md->queue);
+		register_elevator = 1;
+	} else if (dm_bio_based_md(md)) {
+		/*
+		 * Queue was fully initialized on behalf of a previous
+		 * request-based table load.  Table is now switching from
+		 * bio-based back to request-based, e.g.: rq -> bio -> rq
+		 */
+		md->queue->request_fn = dm_request_fn;
+		register_elevator = 1;
 	}
 
+	/* Avoid re-registering elevator if already registered */
+	if (register_elevator)
+		elv_register_queue(md->queue);
+
 	/*
 	 * Request-based dm devices cannot be stacked on top of bio-based dm
 	 * devices.  The type of this dm device has not been decided yet.
@@ -1996,6 +2010,20 @@ int dm_init_request_based_queue(struct m
 	return 1;
 }
 
+void dm_clear_request_based_queue(struct mapped_device *md)
+{
+	BUG_ON(!md->queue);
+
+	if (dm_bio_based_md(md))
+		return; /* already bio-based so return */
+
+	/* Clear ->request_fn, not used for bio-based */
+	md->queue->request_fn = NULL;
+
+	/* Unregister the request-based queue's elevator from sysfs */
+	elv_unregister_queue(md->queue);
+}
+
 static void unlock_fs(struct mapped_device *md);
 
 static void free_dev(struct mapped_device *md)
Index: linux-2.6/drivers/md/dm.h
===================================================================
--- linux-2.6.orig/drivers/md/dm.h
+++ linux-2.6/drivers/md/dm.h
@@ -114,6 +114,7 @@ struct kobject *dm_kobject(struct mapped
 struct mapped_device *dm_get_from_kobject(struct kobject *kobj);
 
 int dm_init_request_based_queue(struct mapped_device *md);
+void dm_clear_request_based_queue(struct mapped_device *md);
 
 /*
  * Targets for linear and striped mappings
Index: linux-2.6/block/elevator.c
===================================================================
--- linux-2.6.orig/block/elevator.c
+++ linux-2.6/block/elevator.c
@@ -1086,7 +1086,7 @@ ssize_t elv_iosched_show(struct request_
 	struct elevator_type *__e;
 	int len = 0;
 
-	if (!q->elevator)
+	if (!q->elevator || !blk_queue_stackable(q))
 		return sprintf(name, "none\n");
 
 	elv = e->elevator_type;


* Re: [RFC PATCH 2/2] dm: only initialize full request_queue for request-based device
  2010-05-13  3:57         ` Mike Snitzer
@ 2010-05-14  8:06           ` Kiyoshi Ueda
  2010-05-14 14:08             ` Mike Snitzer
  0 siblings, 1 reply; 28+ messages in thread
From: Kiyoshi Ueda @ 2010-05-14  8:06 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: dm-devel, linux-kernel, Jens Axboe, Jun'ichi Nomura,
	Vivek Goyal, Nikanth Karthikesan, Alasdair Kergon

Hi Mike,

On 05/13/2010 12:57 PM +0900, Mike Snitzer wrote:
> On Wed, May 12 2010 at  4:23am -0400,
> Kiyoshi Ueda <k-ueda@ct.jp.nec.com> wrote:
> 
>> Hi Mike,
>>
>> On 05/11/2010 10:15 PM +0900, Mike Snitzer wrote:
>>> It is clear we need to resolve the current full request_queue
>>> initialization that occurs even for bio-based DM devices.
>>>
> >>> I believe the 2 patches I posted accomplish this in a straightforward
> >>> way.  We can always improve on it (by looking at what you proposed
> >>> below) but we need a minimalist fix that doesn't depend on userspace
>>> LVM2 changes right now.
>> Humm, OK.
>> Indeed, showing iosched directory in bio-based device's sysfs is
>> confusing users actually, and we need something to resolve that soon.
>> So I don't strongly object to your approach as the first step, as long
>> as we can accept the risk of the maintenance cost which I mentioned.
> 
> OK, I understand your concern (especially after having gone through this
> further over the past couple days).  I'm hopeful maintenance will be
> minimal.
> 
>> By the way, your current patch has a problem below.
>> It needs to be fixed at least.
> 
> Thanks for the review.  I've addressed both your points with additional
> changes (split between 2 patches).

I found another bug; blk_init_allocated_queue() overwrites q->unplug_fn
with generic one. (Although it's currently harmless for multipath target.)

I feel your patch-set is growing and becoming a complex fix rather than
a minimalist/simple fix.
I think touching mapped_device/queue at table loading time makes
things complex.  It is natural that table's arguments are reflected
to mapped_device/queue at table binding time instead.

> I confirmed with Alasdair that we don't want to conditionally relax DM's
> allocation constraints for request-based DM.  Best to be consistent
> about not allowing allocations during resume.

OK.
Then, how about the patch below?
It still fully initializes queue at device creation time for both
bio-based and request-based.  Then, it drops elevator when the device
type is decided as bio-based at table binding time.
So no memory allocation during resume (freeing instead).

I hope this simple work-around approach is acceptable for you and
Alasdair (and others).
(Then, let's focus on making a right fix rather than hacking
 the block layer.)


[PATCH] dm: drop elevator when the device is bio-based

The I/O scheduler has no effect on a bio-based dm device.
Showing the I/O scheduler's attributes in sysfs for bio-based dm devices
confuses users.
So drop them from sysfs when a dm device is decided as bio-based.

This patch depends on the commit below in the for-2.6.35 of Jens'
linux-2.6-block git:
commit 01effb0dc1451fad55925873ffbfb88fa4eadce0
Author: Mike Snitzer <snitzer@redhat.com>
Date:   Tue May 11 08:57:42 2010 +0200

    block: allow initialization of previously allocated request_queue

Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Alasdair G Kergon <agk@redhat.com>
---
 drivers/md/dm.c |    8 ++++++++
 1 file changed, 8 insertions(+)

Index: 2.6.34-rc7/drivers/md/dm.c
===================================================================
--- 2.6.34-rc7.orig/drivers/md/dm.c
+++ 2.6.34-rc7/drivers/md/dm.c
@@ -2410,6 +2410,14 @@ struct dm_table *dm_swap_table(struct ma
 		goto out;
 	}
 
+	/* drop elevator when the device type is decided as bio-based */
+	if (!md->map && dm_table_get_type(table) == DM_TYPE_BIO_BASED) {
+		elv_unregister_queue(md->queue);
+		elevator_exit(md->queue->elevator);
+		md->queue->request_fn = NULL;
+		md->queue->elevator = NULL;
+	}
+
 	map = __bind(md, table, &limits);
 
 out:


* Re: [RFC PATCH 2/2] dm: only initialize full request_queue for request-based device
  2010-05-14  8:06           ` Kiyoshi Ueda
@ 2010-05-14 14:08             ` Mike Snitzer
  2010-05-17  9:27               ` Kiyoshi Ueda
  0 siblings, 1 reply; 28+ messages in thread
From: Mike Snitzer @ 2010-05-14 14:08 UTC (permalink / raw)
  To: Kiyoshi Ueda
  Cc: dm-devel, linux-kernel, Jens Axboe, Jun'ichi Nomura,
	Vivek Goyal, Nikanth Karthikesan, Alasdair Kergon

On Fri, May 14 2010 at  4:06am -0400,
Kiyoshi Ueda <k-ueda@ct.jp.nec.com> wrote:

> Hi Mike,
> 
> On 05/13/2010 12:57 PM +0900, Mike Snitzer wrote:
> > On Wed, May 12 2010 at  4:23am -0400,
> > Kiyoshi Ueda <k-ueda@ct.jp.nec.com> wrote:
> > 
> >> Hi Mike,
> >>
> >> On 05/11/2010 10:15 PM +0900, Mike Snitzer wrote:
> >>> It is clear we need to resolve the current full request_queue
> >>> initialization that occurs even for bio-based DM devices.
> >>>
> >>> I believe the 2 patches I posted accomplish this in a straightforward
> >>> way.  We can always improve on it (by looking at what you proposed
> >>> below) but we need a minimalist fix that doesn't depend on userspace
> >>> LVM2 changes right now.
> >> Humm, OK.
> >> Indeed, showing iosched directory in bio-based device's sysfs is
> >> confusing users actually, and we need something to resolve that soon.
> >> So I don't strongly object to your approach as the first step, as long
> >> as we can accept the risk of the maintenance cost which I mentioned.
> > 
> > OK, I understand your concern (especially after having gone through this
> > further over the past couple days).  I'm hopeful maintenance will be
> > minimal.
> > 
> >> By the way, your current patch has a problem below.
> >> It needs to be fixed at least.
> > 
> > Thanks for the review.  I've addressed both your points with additional
> > changes (split between 2 patches).
> 
> I found another bug; blk_init_allocated_queue() overwrites q->unplug_fn
> with generic one. (Although it's currently harmless for multipath target.)

Thanks again for your review!

Here is a simple incremental fix:

diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 5390754..985dd9c 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1854,6 +1854,17 @@ static const struct block_device_operations dm_blk_dops;
 static void dm_wq_work(struct work_struct *work);
 static void dm_rq_barrier_work(struct work_struct *work);
 
+static void dm_init_md_queue(struct mapped_device *md)
+{
+	md->queue->queuedata = md;
+	md->queue->backing_dev_info.congested_fn = dm_any_congested;
+	md->queue->backing_dev_info.congested_data = md;
+	blk_queue_make_request(md->queue, dm_request);
+	blk_queue_bounce_limit(md->queue, BLK_BOUNCE_ANY);
+	md->queue->unplug_fn = dm_unplug_all;
+	blk_queue_merge_bvec(md->queue, dm_merge_bvec);
+}
+
 /*
  * Allocate and initialise a blank device with a given minor.
  */
@@ -1895,13 +1906,7 @@ static struct mapped_device *alloc_dev(int minor)
 	if (!md->queue)
 		goto bad_queue;
 
-	md->queue->queuedata = md;
-	md->queue->backing_dev_info.congested_fn = dm_any_congested;
-	md->queue->backing_dev_info.congested_data = md;
-	blk_queue_make_request(md->queue, dm_request);
-	blk_queue_bounce_limit(md->queue, BLK_BOUNCE_ANY);
-	md->queue->unplug_fn = dm_unplug_all;
-	blk_queue_merge_bvec(md->queue, dm_merge_bvec);
+	dm_init_md_queue(md);
 
 	md->disk = alloc_disk(1);
 	if (!md->disk)
@@ -1974,7 +1979,7 @@ int dm_init_request_based_queue(struct mapped_device *md)
 			return 0;
 		md->queue = q;
 		md->saved_make_request_fn = md->queue->make_request_fn;
-		blk_queue_make_request(md->queue, dm_request);
+		dm_init_md_queue(md);
 		register_elevator = 1;
 	} else if (dm_bio_based_md(md)) {
 		/*

> I feel your patch-set is growing and becoming a complex fix rather than
> a minimalist/simple fix.

Seems the numerous patches I've posted over the past couple days have
given a false impression.  But I do understand your concern.

The longer-term direction you want to take DM (known type during
alloc_dev) illustrates that we share a common goal: only do the precise
work that is needed to initialize the request_queue for a DM device
(whether it is bio-based or request-based).

My changes do accomplish that in the near-term without requiring
userspace change.  Many of my proposed changes are just refactoring
existing code to clearly split out what is needed for request-based vs
bio-based.

I'll post a single patch that contains my changes (no need to have 2
patches any more).  With that patch I'm hopeful you'll see my changes
aren't as complex as they might have seemed over the past few days.

> I think touching mapped_device/queue at table loading time makes
> things complex.  It is natural that table's arguments are reflected
> to mapped_device/queue at table binding time instead.
> 
> > I confirmed with Alasdair that we don't want to conditionally relax DM's
> > allocation constraints for request-based DM.  Best to be consistent
> > about not allowing allocations during resume.
> 
> OK.
> Then, how about the patch below?
> It still fully initializes queue at device creation time for both
> bio-based and request-based.  Then, it drops elevator when the device
> type is decided as bio-based at table binding time.
> So no memory allocation during resume (freeing instead).

I'm inclined to believe that the extra work for bio-based DM is not
ideal.  And even freeing during resume _could_ pose a problem
(if it triggers unknown additional housekeeping).
elv_unregister_queue() and elevator_exit() do quite a bit more than
simply freeing memory.  We can't safely assume those methods will never
allocate memory to accomplish their cleanup (now or in the future).

> I hope this simple work-around approach is acceptable for you and
> Alasdair (and others).

While it is certainly a small change, it carries with it the disadvantage
of still requiring more work than is needed for bio-based request_queue
initialization.  It also poses a risk by requiring additional work when
resuming a bio-based device.

> (Then, let's focus on making a right fix rather than hacking
>  the block layer.)

I haven't been hacking the block layer.  Please don't misrepresent what
I proposed.

Regards,
Mike


* Re: [RFC PATCH 2/2] dm: only initialize full request_queue for request-based device
  2010-05-14 14:08             ` Mike Snitzer
@ 2010-05-17  9:27               ` Kiyoshi Ueda
  2010-05-17 17:27                 ` Mike Snitzer
  0 siblings, 1 reply; 28+ messages in thread
From: Kiyoshi Ueda @ 2010-05-17  9:27 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: dm-devel, linux-kernel, Jens Axboe, Jun'ichi Nomura,
	Vivek Goyal, Nikanth Karthikesan, Alasdair Kergon

Hi Mike,

On 05/14/2010 11:08 PM +0900, Mike Snitzer wrote:
> On Fri, May 14 2010 at  4:06am -0400,
> Kiyoshi Ueda <k-ueda@ct.jp.nec.com> wrote:
> 
>> Hi Mike,
>>
>> On 05/13/2010 12:57 PM +0900, Mike Snitzer wrote:
>>> On Wed, May 12 2010 at  4:23am -0400,
>>> Kiyoshi Ueda <k-ueda@ct.jp.nec.com> wrote:
>>>
>>>> Hi Mike,
>>>>
>>>> On 05/11/2010 10:15 PM +0900, Mike Snitzer wrote:
>>>>> It is clear we need to resolve the current full request_queue
>>>>> initialization that occurs even for bio-based DM devices.
>>>>>
> >>>> I believe the 2 patches I posted accomplish this in a straightforward
> >>>> way.  We can always improve on it (by looking at what you proposed
> >>>> below) but we need a minimalist fix that doesn't depend on userspace
>>>>> LVM2 changes right now.
>>>>
>>>> Humm, OK.
>>>> Indeed, showing iosched directory in bio-based device's sysfs is
>>>> confusing users actually, and we need something to resolve that soon.
>>>> So I don't strongly object to your approach as the first step, as long
>>>> as we can accept the risk of the maintenance cost which I mentioned.
>>>
>>> OK, I understand your concern (especially after having gone through this
>>> further over the past couple days).  I'm hopeful maintenance will be
>>> minimal.
snip
>> I feel your patch-set is growing and becoming complex fix rather than
>> minimalist/simple fix.
>> 
>> I think touching mapped_device/queue at table loading time makes
>> things complex.  It is natural that table's arguments are reflected
>> to mapped_device/queue at table binding time instead.
>
> Seems the numerous patches I've posted over the past couple days have
> given a false impression.  But I do understand your concern.
> 
> The longer-term direction you want to take DM (known type during
> alloc_dev) illustrates that we share a common goal: only do the precise
> work that is needed to initialize the request_queue for a DM device
> (whether it is bio-based or request-based).
> 
> My changes do accomplish that in the near-term without requiring
> userspace change.  Many of my proposed changes are just refactoring
> existing code to clearly split out what is needed for request-based vs
> bio-based.

As far as I understand, the current model of device-mapper is:
  - a table (precisely, a target) has various attributes,
    bio-based/request-based is one such attribute
  - a table and its attributes are bound to the block device on resume
If we want to fix a problem, I think we should either work based on
this model or change the model.

Your patch makes loading a table affect the block device, so you
are changing the model.

If you change the model, it should be done carefully.
For example, the current model allows most of the table loading code
to run without an exclusive lock on the device because it doesn't affect
the device itself.  If you change this model, table loading needs to
be serialized with appropriate locking.

My suggestion was (and is) to change the current model by determining
bio-based/request-based at device creation time, rather than from a table.


> I'll post a single patch that contains my changes (no need to have 2
> patches any more).  With that patch I'm hopeful you'll see my changes
> aren't as complex as they might have seemed over the past few days.
 
The number of patches doesn't matter.
I still feel it's becoming complex; for example, it now has to take
care of reverting md/queue changes made by a previous table preload.

Also, you have to take care of the race condition between concurrent
table loads mentioned above, because the table which was used to
initialize the queue may not be the one set in hc->new_map.
Although such races may not happen in real usage, they could cause
some critical problems (maybe a kernel panic, maybe a memory leak).

Thanks,
Kiyoshi Ueda

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH 2/2] dm: only initialize full request_queue for request-based device
  2010-05-17  9:27               ` Kiyoshi Ueda
@ 2010-05-17 17:27                 ` Mike Snitzer
  2010-05-18  8:32                   ` Kiyoshi Ueda
  0 siblings, 1 reply; 28+ messages in thread
From: Mike Snitzer @ 2010-05-17 17:27 UTC (permalink / raw)
  To: Kiyoshi Ueda
  Cc: dm-devel, linux-kernel, Jens Axboe, Jun'ichi Nomura,
	Vivek Goyal, Nikanth Karthikesan, Alasdair Kergon

On Mon, May 17 2010 at  5:27am -0400,
Kiyoshi Ueda <k-ueda@ct.jp.nec.com> wrote:

> Hi Mike,
> 
> On 05/14/2010 11:08 PM +0900, Mike Snitzer wrote:
> > On Fri, May 14 2010 at  4:06am -0400,
> > Kiyoshi Ueda <k-ueda@ct.jp.nec.com> wrote:
> > 
> >> Hi Mike,
> >>
> >> On 05/13/2010 12:57 PM +0900, Mike Snitzer wrote:
> >>> On Wed, May 12 2010 at  4:23am -0400,
> >>> Kiyoshi Ueda <k-ueda@ct.jp.nec.com> wrote:
> >>>
> >>>> Hi Mike,
> >>>>
> >>>> On 05/11/2010 10:15 PM +0900, Mike Snitzer wrote:
> >>>>> It is clear we need to resolve the current full request_queue
> >>>>> initialization that occurs even for bio-based DM devices.
> >>>>>
> >>>>> I believe the 2 patches I posted accomplish this in a stright-forward
> >>>>> way.  We can always improve on it (by looking at what you proposed
> >>>>> below) but we need a minimlaist fix that doesn't depend on userspace
> >>>>> LVM2 changes right now.
> >>>>
> >>>> Humm, OK.
> >>>> Indeed, showing iosched directory in bio-based device's sysfs is
> >>>> confusing users actually, and we need something to resolve that soon.
> >>>> So I don't strongly object to your approach as the first step, as long
> >>>> as we can accept the risk of the maintenance cost which I mentioned.
> >>>
> >>> OK, I understand your concern (especially after having gone through this
> >>> further over the past couple days).  I'm hopeful maintenance will be
> >>> minimal.
> snip
> >> I feel your patch-set is growing and becoming complex fix rather than
> >> minimalist/simple fix.
> >> 
> >> I think touching mapped_device/queue at table loading time makes
> >> things complex.  It is natural that table's arguments are reflected
> >> to mapped_device/queue at table binding time instead.
> >
> > Seems the numerous patches I've posted over the past couple days have
> > given a false impression.  But I do understand your concern.
> > 
> > The longer-term direction you want to take DM (known type during
> > alloc_dev) illustrates that we share a common goal: only do the precise
> > work that is needed to initialize the request_queue for a DM device
> > (whether it is bio-based or request-based).
> > 
> > My changes do accomplish that in the near-term without requiring
> > userspace change.  Many of my proposed changes are just refactoring
> > existing code to clearly split out what is needed for request-based vs
> > bio-based.
> 
> As far as I understand, the current model of device-mapper is:
>   - a table (precisely, a target) has various attributes,
>     bio-based/request-based is one of such attributes
>   - a table and its attributes are bound to the block device on resume
> If we want to fix a problem, I think we should either work based on
> this model or change the model.
> 
> Your patch makes that loading table affects the block device, so you
> are changing the model.
> 
> If you change the model, it should be done carefully.
> For example, the current model allows most of the table loading code
> to run without exclusive lock on the device because it doesn't affect
> the device itself.  If you change this model, table loading needs to
> be serialized with appropriate locking.

Nice catch, yes md->queue needs protection (see patch below).
 
> My suggestion was (and is) to change the current model by determining
> bio-based/request-based at device creation time, not by a table.
> 
> 
> > I'll post a single patch that contains my changes (no need to have 2
> > patches any more).  With that patch I'm hopeful you'll see my changes
> > aren't as complex as they might have seemed over the past few days.
>  
> The number of patches doesn't matter.
> I still feel it's becoming complex, for example it now has to take
> care of reverting md/queue changes by previous preload.

AFAIK, no md/queue changes would need to be reverted.  The last table
load wins and the DM device's queue is left in the correct state for
that winning table (assuming the patch at the end of this mail).

As for the overall complexity of my proposed changes, I'm inclined to
address this issue without userspace changes.  But I do understand we're
at a crossroads (and likely need Alasdair, and/or others, to weigh in):

1) Are my proposed kernel-only changes somehow detrimental to DM right
   now?
   - We cannot let known bugs go unfixed because of unknown future
     requirements.

2) Is requiring the DM create ioctl to know that a device will be
   request-based vs bio-based really a reasonable requirement?
   - To me a DM device's queue type is a DM (kernel) implementation
     detail that userspace shouldn't _ever_ need to know about.  If
     userspace knew, it'd make the kernel's life easier, but that doesn't
     make it the correct solution on its own.

> Also, you have to take care of race condition between concurrent table
> loadings which was mentioned above, because the table which was used to
> initialized the queue may not be set in hc->new_map.
> Although such races may not happen in real usages, they could cause
> some critical problems (maybe kernel panic, maybe memory leak).

In the worst case, hypothetical concurrent table loads (of conflicting
types: request-based vs bio-based) will just cause the request_queue to
be fully initialized (elevator allocated) for a bio-based device.  There
aren't any other concerns about panics or memory leaks.  Granted,
something like the following patch is required:

diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 143082c..02966ef 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -123,6 +123,11 @@ struct mapped_device {
 	unsigned long flags;
 
 	struct request_queue *queue;
+	/*
+	 * Protect queue from concurrent initialization during
+	 * table load
+	 */
+	struct mutex queue_lock;
 	struct gendisk *disk;
 	char name[16];
 
@@ -1892,6 +1897,7 @@ static struct mapped_device *alloc_dev(int minor)
 
 	init_rwsem(&md->io_lock);
 	mutex_init(&md->suspend_lock);
+	mutex_init(&md->queue_lock);
 	spin_lock_init(&md->deferred_lock);
 	spin_lock_init(&md->barrier_error_lock);
 	rwlock_init(&md->map_lock);
@@ -1966,14 +1972,16 @@ bad_module_get:
  */
 int dm_init_request_based_queue(struct mapped_device *md)
 {
+	int r = 0;
 	struct request_queue *q = NULL;
 
+	mutex_lock(&md->queue_lock);
 	/* Avoid re-initializing the queue if already fully initialized */
 	if (!md->queue->elevator) {
 		/* Fully initialize the queue */
 		q = blk_init_allocated_queue(md->queue, dm_request_fn, NULL);
 		if (!q)
-			return 0;
+			goto out;
 		md->queue = q;
 		md->saved_make_request_fn = md->queue->make_request_fn;
 		dm_init_md_queue(md);
@@ -1984,8 +1992,11 @@ int dm_init_request_based_queue(struct mapped_device *md)
 		 * bio-based back to request-based, e.g.: rq -> bio -> rq
 		 */
 		md->queue->request_fn = dm_request_fn;
-	} else
-		return 1; /* already request-based */
+	} else {
+		/* already request-based */
+		r = 1;
+		goto out;
+	}
 
 	elv_register_queue(md->queue);
 
@@ -2006,7 +2017,10 @@ int dm_init_request_based_queue(struct mapped_device *md)
 	blk_queue_ordered(md->queue, QUEUE_ORDERED_DRAIN_FLUSH,
 			  dm_rq_prepare_flush);
 
-	return 1;
+	r = 1;
+out:
+	mutex_unlock(&md->queue_lock);
+	return r;
 }
 
 void dm_clear_request_based_queue(struct mapped_device *md)
@@ -2015,8 +2029,10 @@ void dm_clear_request_based_queue(struct mapped_device *md)
 		return; /* already bio-based */
 
 	/* Unregister elevator from sysfs and clear ->request_fn */
+	mutex_lock(&md->queue_lock);
 	elv_unregister_queue(md->queue);
 	md->queue->request_fn = NULL;
+	mutex_unlock(&md->queue_lock);
 }
 
 static void unlock_fs(struct mapped_device *md);

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH 2/2] dm: only initialize full request_queue for request-based device
  2010-05-17 17:27                 ` Mike Snitzer
@ 2010-05-18  8:32                   ` Kiyoshi Ueda
  2010-05-18 13:46                     ` Mike Snitzer
  0 siblings, 1 reply; 28+ messages in thread
From: Kiyoshi Ueda @ 2010-05-18  8:32 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: dm-devel, linux-kernel, Jens Axboe, Jun'ichi Nomura,
	Vivek Goyal, Nikanth Karthikesan, Alasdair Kergon

Hi Mike,

On 05/18/2010 02:27 AM +0900, Mike Snitzer wrote:
> Kiyoshi Ueda <k-ueda@ct.jp.nec.com> wrote:
>> On 05/14/2010 11:08 PM +0900, Mike Snitzer wrote:
>>> Kiyoshi Ueda <k-ueda@ct.jp.nec.com> wrote:
>>>> On 05/13/2010 12:57 PM +0900, Mike Snitzer wrote:
>>>>> Kiyoshi Ueda <k-ueda@ct.jp.nec.com> wrote:
>>>>>> On 05/11/2010 10:15 PM +0900, Mike Snitzer wrote:
>>>>>>> It is clear we need to resolve the current full request_queue
>>>>>>> initialization that occurs even for bio-based DM devices.
>>>>>>>
>>>>>>> I believe the 2 patches I posted accomplish this in a stright-forward
>>>>>>> way.  We can always improve on it (by looking at what you proposed
>>>>>>> below) but we need a minimlaist fix that doesn't depend on userspace
>>>>>>> LVM2 changes right now.
>>>>>>
>>>>>> Humm, OK.
>>>>>> Indeed, showing iosched directory in bio-based device's sysfs is
>>>>>> confusing users actually, and we need something to resolve that soon.
>>>>>> So I don't strongly object to your approach as the first step, as long
>>>>>> as we can accept the risk of the maintenance cost which I mentioned.
>>>>>
>>>>> OK, I understand your concern (especially after having gone through this
>>>>> further over the past couple days).  I'm hopeful maintenance will be
>>>>> minimal.
>> snip
>>>> I feel your patch-set is growing and becoming complex fix rather than
>>>> minimalist/simple fix.
>>>>
>>>> I think touching mapped_device/queue at table loading time makes
>>>> things complex.  It is natural that table's arguments are reflected
>>>> to mapped_device/queue at table binding time instead.
>>>
>>> Seems the numerous patches I've posted over the past couple days have
>>> given a false impression.  But I do understand your concern.
>>>
>>> The longer-term direction you want to take DM (known type during
>>> alloc_dev) illustrates that we share a common goal: only do the precise
>>> work that is needed to initialize the request_queue for a DM device
>>> (whether it is bio-based or request-based).
>>>
>>> My changes do accomplish that in the near-term without requiring
>>> userspace change.  Many of my proposed changes are just refactoring
>>> existing code to clearly split out what is needed for request-based vs
>>> bio-based.
>>
>> As far as I understand, the current model of device-mapper is:
>>   - a table (precisely, a target) has various attributes,
>>     bio-based/request-based is one of such attributes
>>   - a table and its attributes are bound to the block device on resume
>> If we want to fix a problem, I think we should either work based on
>> this model or change the model.
>>
>> Your patch makes that loading table affects the block device, so you
>> are changing the model.
>>
>> If you change the model, it should be done carefully.
>> For example, the current model allows most of the table loading code
>> to run without exclusive lock on the device because it doesn't affect
>> the device itself.  If you change this model, table loading needs to
>> be serialized with appropriate locking.
> 
> Nice catch, yes md->queue needs protection (see patch below).

Not enough. (See drivers/md/dm-ioctl.c:table_load().)
The table load sequence is:
  1. populate the table
  2. set the table as ->new_map of the hash_cell for the mapped_device,
     under protection of _hash_lock.

Since your fix only serializes step 1, concurrent table loads could end
up in an inconsistent state; e.g. a request-based table is bound to the
mapped_device while the queue is initialized as bio-based.
With your new model, those 2 steps above must be atomic.
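
To make the two steps concrete, table_load() has roughly the following
shape (a simplified paraphrase of drivers/md/dm-ioctl.c; helper names and
error handling are approximate, not the exact code):

/* Simplified paraphrase of table_load(), for illustration only */
static int table_load(struct dm_ioctl *param, size_t param_size)
{
	int r;
	struct dm_table *t;
	struct hash_cell *hc;
	struct mapped_device *md = find_device(param);

	/* Step 1: build and populate the table.  _hash_lock is NOT held,
	 * so this part can run concurrently with other table loads (and,
	 * with your patch, it is where the queue gets set up). */
	r = dm_table_create(&t, get_mode(param), param->target_count, md);
	if (!r)
		r = populate_table(t, param, param_size);
	if (r)
		goto out;

	/* Step 2: publish the table as the inactive map, under _hash_lock.
	 * Only this step is serialized, so the table that set up the queue
	 * in step 1 is not necessarily the one that ends up in hc->new_map. */
	down_write(&_hash_lock);
	hc = dm_get_mdptr(md);
	if (hc->new_map)
		dm_table_destroy(hc->new_map);
	hc->new_map = t;
	up_write(&_hash_lock);
out:
	dm_put(md);
	return r;
}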

Thanks,
Kiyoshi Ueda

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH 2/2] dm: only initialize full request_queue for request-based device
  2010-05-18  8:32                   ` Kiyoshi Ueda
@ 2010-05-18 13:46                     ` Mike Snitzer
  2010-05-19  5:57                       ` Kiyoshi Ueda
  0 siblings, 1 reply; 28+ messages in thread
From: Mike Snitzer @ 2010-05-18 13:46 UTC (permalink / raw)
  To: Kiyoshi Ueda
  Cc: dm-devel, linux-kernel, Jens Axboe, Jun'ichi Nomura,
	Vivek Goyal, Nikanth Karthikesan, Alasdair Kergon

On Tue, May 18 2010 at  4:32am -0400,
Kiyoshi Ueda <k-ueda@ct.jp.nec.com> wrote:

> Hi Mike,
> 
> On 05/18/2010 02:27 AM +0900, Mike Snitzer wrote:
> > Kiyoshi Ueda <k-ueda@ct.jp.nec.com> wrote:
> >> As far as I understand, the current model of device-mapper is:
> >>   - a table (precisely, a target) has various attributes,
> >>     bio-based/request-based is one of such attributes
> >>   - a table and its attributes are bound to the block device on resume
> >> If we want to fix a problem, I think we should either work based on
> >> this model or change the model.
> >>
> >> Your patch makes that loading table affects the block device, so you
> >> are changing the model.
> >>
> >> If you change the model, it should be done carefully.
> >> For example, the current model allows most of the table loading code
> >> to run without exclusive lock on the device because it doesn't affect
> >> the device itself.  If you change this model, table loading needs to
> >> be serialized with appropriate locking.
> > 
> > Nice catch, yes md->queue needs protection (see patch below).
> 
> Not enough. (See drivers/md/dm-ioctl.c:table_load().)
> Table load sequence is:
>   1. populate table
>   2. set the table to ->new_map of the hash_cell for the mapped_device
>      in protection by _hash_lock.
> 
> Since your fix only serializes the step 1, concurrent table loading
> could end up with inconsistent status; e.g. request-based table is
> bound to the mapped_device while the queue is initialized as bio-based.
> With your new model, those 2 steps above must be atomic.

Ah, yes.  I looked at the possibility of serializing the entirety of
table_load() but determined that would be excessive (it would reduce
the parallelism of table_load).  But I clearly missed the fact that
there could be a race to the _hash_lock-protected critical section in
table_load() -- leading to queue inconsistency.

I'll post v5 of the overall patch, which will revert the mapped_device
'queue_lock' serialization that I proposed in v4.  v5 will contain
the following patch to localize all table-load-related queue
manipulation to the _hash_lock-protected critical section in
table_load().  So it sets the queue up _after_ the table's type is
established with dm_table_set_type().

Thanks again for your review; I'll work to be more meticulous in the
future!

diff --git a/drivers/md/dm-ioctl.c b/drivers/md/dm-ioctl.c
index d7500e1..bfb283c 100644
--- a/drivers/md/dm-ioctl.c
+++ b/drivers/md/dm-ioctl.c
@@ -1180,6 +1180,14 @@ static int table_load(struct dm_ioctl *param, size_t param_size)
 		goto out;
 	}
 
+	r = dm_table_setup_md_queue(t);
+	if (r) {
+		DMWARN("unable to setup device queue for this table");
+		dm_table_destroy(t);
+		up_write(&_hash_lock);
+		goto out;
+	}
+
 	if (hc->new_map)
 		dm_table_destroy(hc->new_map);
 	hc->new_map = t;
diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
index dec8295..990cf17 100644
--- a/drivers/md/dm-table.c
+++ b/drivers/md/dm-table.c
@@ -803,7 +803,6 @@ int dm_table_set_type(struct dm_table *t)
 	if (bio_based) {
 		/* We must use this table as bio-based */
 		t->type = DM_TYPE_BIO_BASED;
-		dm_clear_request_based_queue(t->md);
 		return 0;
 	}
 
@@ -830,11 +829,6 @@ int dm_table_set_type(struct dm_table *t)
 		return -EINVAL;
 	}
 
-	if (!dm_init_request_based_queue(t->md)) {
-		DMWARN("Cannot initialize queue for Request-based dm");
-		return -EINVAL;
-	}
-
 	t->type = DM_TYPE_REQUEST_BASED;
 
 	return 0;
@@ -850,6 +844,23 @@ bool dm_table_request_based(struct dm_table *t)
 	return dm_table_get_type(t) == DM_TYPE_REQUEST_BASED;
 }
 
+/*
+ * Setup the DM device's queue based on table's type.
+ * Caller must serialize access to table's md.
+ */
+int dm_table_setup_md_queue(struct dm_table *t)
+{
+	if (dm_table_request_based(t)) {
+		if (!dm_init_request_based_queue(t->md)) {
+			DMWARN("Cannot initialize queue for Request-based dm");
+			return -EINVAL;
+		}
+	} else
+		dm_clear_request_based_queue(t->md);
+
+	return 0;
+}
+
 int dm_table_alloc_md_mempools(struct dm_table *t)
 {
 	unsigned type = dm_table_get_type(t);
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 143082c..76b61ca 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1856,6 +1856,17 @@ static void dm_rq_barrier_work(struct work_struct *work);
 
 static void dm_init_md_queue(struct mapped_device *md)
 {
+	/*
+	 * Request-based dm devices cannot be stacked on top of bio-based dm
+	 * devices.  The type of this dm device has not been decided yet.
+	 * The type is decided at the first table loading time.
+	 * To prevent problematic device stacking, clear the queue flag
+	 * for request stacking support until then.
+	 *
+	 * This queue is new, so no concurrency on the queue_flags.
+	 */
+	queue_flag_clear_unlocked(QUEUE_FLAG_STACKABLE, md->queue);
+
 	md->queue->queuedata = md;
 	md->queue->backing_dev_info.congested_fn = dm_any_congested;
 	md->queue->backing_dev_info.congested_data = md;
@@ -1989,17 +2000,6 @@ int dm_init_request_based_queue(struct mapped_device *md)
 
 	elv_register_queue(md->queue);
 
-	/*
-	 * Request-based dm devices cannot be stacked on top of bio-based dm
-	 * devices.  The type of this dm device has not been decided yet.
-	 * The type is decided at the first table loading time.
-	 * To prevent problematic device stacking, clear the queue flag
-	 * for request stacking support until then.
-	 *
-	 * This queue is new, so no concurrency on the queue_flags.
-	 */
-	queue_flag_clear_unlocked(QUEUE_FLAG_STACKABLE, md->queue);
-
 	blk_queue_softirq_done(md->queue, dm_softirq_done);
 	blk_queue_prep_rq(md->queue, dm_prep_fn);
 	blk_queue_lld_busy(md->queue, dm_lld_busy);
diff --git a/drivers/md/dm.h b/drivers/md/dm.h
index 95aec64..b4ebb11 100644
--- a/drivers/md/dm.h
+++ b/drivers/md/dm.h
@@ -65,6 +65,7 @@ bool dm_table_request_based(struct dm_table *t);
 int dm_table_alloc_md_mempools(struct dm_table *t);
 void dm_table_free_md_mempools(struct dm_table *t);
 struct dm_md_mempools *dm_table_get_md_mempools(struct dm_table *t);
+int dm_table_setup_md_queue(struct dm_table *t);
 
 /*
  * To check the return value from dm_table_find_target().

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH 2/2] dm: only initialize full request_queue for request-based device
  2010-05-18 13:46                     ` Mike Snitzer
@ 2010-05-19  5:57                       ` Kiyoshi Ueda
  2010-05-19 12:01                         ` Mike Snitzer
  0 siblings, 1 reply; 28+ messages in thread
From: Kiyoshi Ueda @ 2010-05-19  5:57 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: dm-devel, linux-kernel, Jens Axboe, Jun'ichi Nomura,
	Vivek Goyal, Nikanth Karthikesan, Alasdair Kergon

Hi Mike,

On 05/18/2010 10:46 PM +0900, Mike Snitzer wrote:
> Kiyoshi Ueda <k-ueda@ct.jp.nec.com> wrote:
>> On 05/18/2010 02:27 AM +0900, Mike Snitzer wrote:
>>> Kiyoshi Ueda <k-ueda@ct.jp.nec.com> wrote:
>>>> As far as I understand, the current model of device-mapper is:
>>>>   - a table (precisely, a target) has various attributes,
>>>>     bio-based/request-based is one of such attributes
>>>>   - a table and its attributes are bound to the block device on resume
>>>> If we want to fix a problem, I think we should either work based on
>>>> this model or change the model.
>>>>
>>>> Your patch makes that loading table affects the block device, so you
>>>> are changing the model.
>>>>
>>>> If you change the model, it should be done carefully.
>>>> For example, the current model allows most of the table loading code
>>>> to run without exclusive lock on the device because it doesn't affect
>>>> the device itself.  If you change this model, table loading needs to
>>>> be serialized with appropriate locking.
>>>
>>> Nice catch, yes md->queue needs protection (see patch below).
>>
>> Not enough. (See drivers/md/dm-ioctl.c:table_load().)
>> Table load sequence is:
>>   1. populate table
>>   2. set the table to ->new_map of the hash_cell for the mapped_device
>>      in protection by _hash_lock.
>>
>> Since your fix only serializes the step 1, concurrent table loading
>> could end up with inconsistent status; e.g. request-based table is
>> bound to the mapped_device while the queue is initialized as bio-based.
>> With your new model, those 2 steps above must be atomic.
> 
> Ah, yes.. I looked at the possibility of serializing the entirety of
> table_load but determined that would be too excessive (would reduce
> parallelism of table_load).  But I clearly missed the fact that there
> could be a race to the _hash_lock protected critical section in
> table_load() -- leading to queue inconsistency.
> 
> I'll post v5 of the overall patch which will revert the mapped_device
> 'queue_lock' serialization that I proposed in v4.  v5 will contain
> the following patch to localize all table load related queue
> manipulation to the _hash_lock protected critical section in
> table_load().  So it sets the queue up _after_ the table's type is
> established with dm_table_set_type().

dm_table_setup_md_queue() may allocate memory in blocking mode.
A blocking allocation while holding _hash_lock exclusively can cause a
deadlock; e.g. when it has to wait for other dm devices to resume in
order to free some memory.

Also, your patch changes the queue configuration even when a table is
already active and in use.  (e.g. Loading a bio-based table into a
mapped_device which is already active/used as request-based sets
q->request_fn to NULL.)  That could cause some critical problems.

Thanks,
Kiyoshi Ueda

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH 2/2] dm: only initialize full request_queue for  request-based device
  2010-05-19  5:57                       ` Kiyoshi Ueda
@ 2010-05-19 12:01                         ` Mike Snitzer
  2010-05-19 14:39                           ` Mike Snitzer
  2010-05-19 21:51                           ` Mike Snitzer
  0 siblings, 2 replies; 28+ messages in thread
From: Mike Snitzer @ 2010-05-19 12:01 UTC (permalink / raw)
  To: Kiyoshi Ueda
  Cc: dm-devel, linux-kernel, Jens Axboe, Jun'ichi Nomura,
	Vivek Goyal, Nikanth Karthikesan, Alasdair Kergon

On Wed, May 19, 2010 at 1:57 AM, Kiyoshi Ueda <k-ueda@ct.jp.nec.com> wrote:
> Hi Mike,
>
> On 05/18/2010 10:46 PM +0900, Mike Snitzer wrote:
>> Kiyoshi Ueda <k-ueda@ct.jp.nec.com> wrote:
>>> On 05/18/2010 02:27 AM +0900, Mike Snitzer wrote:
>>>> Kiyoshi Ueda <k-ueda@ct.jp.nec.com> wrote:
>>>>> As far as I understand, the current model of device-mapper is:
>>>>>   - a table (precisely, a target) has various attributes,
>>>>>     bio-based/request-based is one of such attributes
>>>>>   - a table and its attributes are bound to the block device on resume
>>>>> If we want to fix a problem, I think we should either work based on
>>>>> this model or change the model.
>>>>>
>>>>> Your patch makes that loading table affects the block device, so you
>>>>> are changing the model.
>>>>>
>>>>> If you change the model, it should be done carefully.
>>>>> For example, the current model allows most of the table loading code
>>>>> to run without exclusive lock on the device because it doesn't affect
>>>>> the device itself.  If you change this model, table loading needs to
>>>>> be serialized with appropriate locking.
>>>>
>>>> Nice catch, yes md->queue needs protection (see patch below).
>>>
>>> Not enough. (See drivers/md/dm-ioctl.c:table_load().)
>>> Table load sequence is:
>>>   1. populate table
>>>   2. set the table to ->new_map of the hash_cell for the mapped_device
>>>      in protection by _hash_lock.
>>>
>>> Since your fix only serializes the step 1, concurrent table loading
>>> could end up with inconsistent status; e.g. request-based table is
>>> bound to the mapped_device while the queue is initialized as bio-based.
>>> With your new model, those 2 steps above must be atomic.
>>
>> Ah, yes.. I looked at the possibility of serializing the entirety of
>> table_load but determined that would be too excessive (would reduce
>> parallelism of table_load).  But I clearly missed the fact that there
>> could be a race to the _hash_lock protected critical section in
>> table_load() -- leading to queue inconsistency.
>>
>> I'll post v5 of the overall patch which will revert the mapped_device
>> 'queue_lock' serialization that I proposed in v4.  v5 will contain
>> the following patch to localize all table load related queue
>> manipulation to the _hash_lock protected critical section in
>> table_load().  So it sets the queue up _after_ the table's type is
>> established with dm_table_set_type().
>
> dm_table_setup_md_queue() may allocate memory with blocking mode.
> Blocking allocation inside exclusive _hash_lock can cause deadlock;
> e.g. when it has to wait for other dm devices to resume to free some
> memory.

We make no guarantees that other DM devices will resume before a table
load -- so calling dm_table_setup_md_queue() while holding the exclusive
_hash_lock is no different from other DM devices being suspended while
a request-based DM device performs its first table_load().

My thinking was this should not be a problem as it is only valid to
call dm_table_setup_md_queue() before the newly created request-based
DM device has been resumed.

AFAIK we don't have any explicit constraints on memory allocations
during table load (e.g. table loads shouldn't depend on other devices'
writeback) -- but any GFP_KERNEL allocation could recurse into reclaim
(elevator_alloc() currently uses GFP_KERNEL with kmalloc_node)...
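
For reference, that allocation looks roughly like this (paraphrased from
block/elevator.c of this era, approximate rather than an exact quote):

/* Paraphrase of elevator_alloc(): a plain GFP_KERNEL allocation, so it
 * can enter direct reclaim and wait on writeback. */
static struct elevator_queue *elevator_alloc(struct request_queue *q,
					     struct elevator_type *e)
{
	struct elevator_queue *eq;

	eq = kmalloc_node(sizeof(*eq), GFP_KERNEL | __GFP_ZERO, q->node);
	if (unlikely(!eq))
		return NULL;

	/* ... ops/type setup, kobject init, request hash allocation ... */

	return eq;
}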

I'll have to review the DM code further to see if all memory
allocations during table_load() are done via mempools.  I'll also
bring this up on this week's LVM call.

> Also, your patch changes the queue configuration even when a table is
> already active and used.  (e.g. Loading bio-based table to a mapped_device
> which is already active/used as request-based sets q->requst_fn in NULL.)
> That could cause some critical problems.

Yes, that is possible and I can add additional checks to prevent this.
But this speaks to a more general problem with the existing DM code.

dm_swap_table() has the negative check to prevent such table loads, e.g.:
/* cannot change the device type, once a table is bound */

This check should come during table_load, as part of
dm_table_set_type(), rather than during table resume.

Thanks,
Mike

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH 2/2] dm: only initialize full request_queue for request-based device
  2010-05-19 12:01                         ` Mike Snitzer
@ 2010-05-19 14:39                           ` Mike Snitzer
  2010-05-19 14:45                             ` Mike Snitzer
  2010-05-20 11:21                             ` Kiyoshi Ueda
  2010-05-19 21:51                           ` Mike Snitzer
  1 sibling, 2 replies; 28+ messages in thread
From: Mike Snitzer @ 2010-05-19 14:39 UTC (permalink / raw)
  To: Kiyoshi Ueda
  Cc: Nikanth Karthikesan, linux-kernel, dm-devel, Alasdair Kergon,
	Jens Axboe, Jun'ichi Nomura, Vivek Goyal

On Wed, May 19 2010 at  8:01am -0400,
Mike Snitzer <snitzer@redhat.com> wrote:

> On Wed, May 19, 2010 at 1:57 AM, Kiyoshi Ueda <k-ueda@ct.jp.nec.com> wrote:
> > Hi Mike,
> >
> > On 05/18/2010 10:46 PM +0900, Mike Snitzer wrote:
> >> I'll post v5 of the overall patch which will revert the mapped_device
> >> 'queue_lock' serialization that I proposed in v4.  v5 will contain
> >> the following patch to localize all table load related queue
> >> manipulation to the _hash_lock protected critical section in
> >> table_load().  So it sets the queue up _after_ the table's type is
> >> established with dm_table_set_type().
> >
> > dm_table_setup_md_queue() may allocate memory with blocking mode.
> > Blocking allocation inside exclusive _hash_lock can cause deadlock;
> > e.g. when it has to wait for other dm devices to resume to free some
> > memory.
> 
> We make no guarantees that other DM devices will resume before a table
> load -- so calling dm_table_setup_md_queue() within the exclusive
> _hash_lock is no different than other DM devices being suspended while
> a request-based DM device performs its first table_load().
> 
> My thinking was this should not be a problem as it is only valid to
> call dm_table_setup_md_queue() before the newly created request-based
> DM device has been resumed.
> 
> AFAIK we don't have any explicit constraints on memory allocations
> during table load (e.g. table loads shouldn't depend on other devices'
> writeback) -- but any GFP_KERNEL allocation could recurse
> (elevator_alloc() currently uses GFP_KERNEL with kmalloc_node)...
> 
> I'll have to review the DM code further to see if all memory
> allocations during table_load() are done via mempools.  I'll also
> bring this up on this week's LVM call.

We discussed this and I understand the scope of the problem now.

Just reiterating what you covered when you first pointed this issue out:

It could be that a table load gets blocked (waiting on a memory
allocation).  The table load can take as long as it needs.  But we can't
have it hold the exclusive _hash_lock while it blocks: holding
_hash_lock prevents further DM ioctls.  The table load's allocation may
be blocked waiting for writeback to a DM device that will be resumed by
another thread.

Thanks again for pointing this out; I'll work to arrive at an
alternative locking scheme.  Likely introduce a lock local to the
multiple_device (effectively the 'queue_lock' I had before).  But
the difference is I'd take that lock before taking _hash_lock.
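
Something along these lines, purely as a sketch (dm_lock_md_queue() and
dm_unlock_md_queue() are hypothetical per-mapped_device helpers, named
here only for illustration):

/* Sketch of the proposed ordering: per-device lock first, _hash_lock
 * second, so any blocking allocation happens outside _hash_lock. */
static int table_load_sketch(struct mapped_device *md, struct dm_table *t,
			     struct hash_cell *hc)
{
	int r;

	dm_lock_md_queue(md);			/* hypothetical per-md mutex */

	r = dm_table_setup_md_queue(t);		/* may block on allocation */
	if (r)
		goto out;

	down_write(&_hash_lock);		/* short, non-blocking section */
	if (hc->new_map)
		dm_table_destroy(hc->new_map);
	hc->new_map = t;
	up_write(&_hash_lock);
out:
	dm_unlock_md_queue(md);
	return r;
}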

I hope to have v6 available at some point today but it may be delayed by
a day.

> > Also, your patch changes the queue configuration even when a table is
> > already active and used.  (e.g. Loading bio-based table to a mapped_device
> > which is already active/used as request-based sets q->requst_fn in NULL.)
> > That could cause some critical problems.
> 
> Yes, that is possible and I can add additional checks to prevent this.
> But this speaks to a more general problem with the existing DM code.
> 
> dm_swap_table() has the negative check to prevent such table loads, e.g.:
> /* cannot change the device type, once a table is bound */
> 
> This check should come during table_load, as part of
> dm_table_set_type(), rather than during table resume.

I'll look to address this issue in v6 too.

Regards,
Mike

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH 2/2] dm: only initialize full request_queue for request-based device
  2010-05-19 14:39                           ` Mike Snitzer
@ 2010-05-19 14:45                             ` Mike Snitzer
  2010-05-20 11:21                             ` Kiyoshi Ueda
  1 sibling, 0 replies; 28+ messages in thread
From: Mike Snitzer @ 2010-05-19 14:45 UTC (permalink / raw)
  To: Kiyoshi Ueda
  Cc: Nikanth Karthikesan, linux-kernel, dm-devel, Vivek Goyal,
	Jens Axboe, Jun'ichi Nomura, Alasdair Kergon

On Wed, May 19 2010 at 10:39am -0400,
Mike Snitzer <snitzer@redhat.com> wrote:

> > On Wed, May 19, 2010 at 1:57 AM, Kiyoshi Ueda <k-ueda@ct.jp.nec.com> wrote:
> > > Hi Mike,
> > >
> > > dm_table_setup_md_queue() may allocate memory with blocking mode.
> > > Blocking allocation inside exclusive _hash_lock can cause deadlock;
> > > e.g. when it has to wait for other dm devices to resume to free some
> > > memory.

<snip>

> We discussed this and I understand the scope of the problem now.
> 
> Just reiterating what you covered when you first pointed this issue out:
> 
> It could be that a table load gets blocked (waiting on a memory
> allocation).  The table load can take as long as it needs.  But we can't
> have it block holding the exclusive _hash_lock while blocking.  Having
> _hash_lock prevents further DM ioctls.  The table load's allocation may
> be blocking waiting for writeback to a DM device that will be resumed by
> another thread.
> 
> Thanks again for pointing this out; I'll work to arrive at an
> alternative locking scheme.  Likely introduce a lock local to the
> multiple_device (effectively the 'queue_lock' I had before).  But

s/multiple_device/mapped_device/

Mike

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH 2/2] dm: only initialize full request_queue for request-based device
  2010-05-19 12:01                         ` Mike Snitzer
  2010-05-19 14:39                           ` Mike Snitzer
@ 2010-05-19 21:51                           ` Mike Snitzer
  1 sibling, 0 replies; 28+ messages in thread
From: Mike Snitzer @ 2010-05-19 21:51 UTC (permalink / raw)
  To: Kiyoshi Ueda
  Cc: Nikanth Karthikesan, linux-kernel, dm-devel, Alasdair Kergon,
	Jens Axboe, Jun'ichi Nomura, Vivek Goyal

On Wed, May 19 2010 at  8:01am -0400,
Mike Snitzer <snitzer@redhat.com> wrote:

> On Wed, May 19, 2010 at 1:57 AM, Kiyoshi Ueda <k-ueda@ct.jp.nec.com> wrote:
> > Also, your patch changes the queue configuration even when a table is
> > already active and used.  (e.g. Loading bio-based table to a mapped_device
> > which is already active/used as request-based sets q->requst_fn in NULL.)
> > That could cause some critical problems.
> 
> Yes, that is possible and I can add additional checks to prevent this.
> But this speaks to a more general problem with the existing DM code.
> 
> dm_swap_table() has the negative check to prevent such table loads, e.g.:
> /* cannot change the device type, once a table is bound */
> 
> This check should come during table_load, as part of
> dm_table_set_type(), rather than during table resume.

FYI, the following patch is the relevant portion of the v6 patch that
addresses the issue discussed above.

diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
index 990cf17..62ebd94 100644
--- a/drivers/md/dm-table.c
+++ b/drivers/md/dm-table.c
@@ -782,6 +782,7 @@ int dm_table_set_type(struct dm_table *t)
 {
 	unsigned i;
 	unsigned bio_based = 0, request_based = 0;
+	struct dm_table *live_table = NULL;
 	struct dm_target *tgt;
 	struct dm_dev_internal *dd;
 	struct list_head *devices;
@@ -803,7 +804,7 @@ int dm_table_set_type(struct dm_table *t)
 	if (bio_based) {
 		/* We must use this table as bio-based */
 		t->type = DM_TYPE_BIO_BASED;
-		return 0;
+		goto finish;
 	}
 
 	BUG_ON(!request_based); /* No targets in this table */
@@ -831,6 +832,18 @@ int dm_table_set_type(struct dm_table *t)
 
 	t->type = DM_TYPE_REQUEST_BASED;
 
+finish:
+	/* cannot change the device type, once a table is bound */
+	live_table = dm_get_live_table(t->md);
+	if (live_table) {
+		if (dm_table_get_type(live_table) != dm_table_get_type(t)) {
+			DMWARN("can't change the device type after a table is bound");
+			dm_table_put(live_table);
+			return -EINVAL;
+		}
+		dm_table_put(live_table);
+	}
+
 	return 0;
 }
 
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index a47f0b8..923b662 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -2470,13 +2470,6 @@ struct dm_table *dm_swap_table(struct mapped_device *md, struct dm_table *table)
 		goto out;
 	}
 
-	/* cannot change the device type, once a table is bound */
-	if (md->map &&
-	    (dm_table_get_type(md->map) != dm_table_get_type(table))) {
-		DMWARN("can't change the device type after a table is bound");
-		goto out;
-	}
-
 	map = __bind(md, table, &limits);
 
 out:

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH 2/2] dm: only initialize full request_queue for request-based device
  2010-05-19 14:39                           ` Mike Snitzer
  2010-05-19 14:45                             ` Mike Snitzer
@ 2010-05-20 11:21                             ` Kiyoshi Ueda
  2010-05-20 17:07                               ` Mike Snitzer
  1 sibling, 1 reply; 28+ messages in thread
From: Kiyoshi Ueda @ 2010-05-20 11:21 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Nikanth Karthikesan, linux-kernel, dm-devel, Alasdair Kergon,
	Jens Axboe, Jun'ichi Nomura, Vivek Goyal

Hi Mike,

On 05/19/2010 11:39 PM +0900, Mike Snitzer wrote:
> Mike Snitzer <snitzer@redhat.com> wrote:
>> On Wed, May 19, 2010 at 1:57 AM, Kiyoshi Ueda <k-ueda@ct.jp.nec.com> wrote:
>>> On 05/18/2010 10:46 PM +0900, Mike Snitzer wrote:
>>>> I'll post v5 of the overall patch which will revert the mapped_device
>>>> 'queue_lock' serialization that I proposed in v4.  v5 will contain
>>>> the following patch to localize all table load related queue
>>>> manipulation to the _hash_lock protected critical section in
>>>> table_load().  So it sets the queue up _after_ the table's type is
>>>> established with dm_table_set_type().
>>>
>>> dm_table_setup_md_queue() may allocate memory with blocking mode.
>>> Blocking allocation inside exclusive _hash_lock can cause deadlock;
>>> e.g. when it has to wait for other dm devices to resume to free some
>>> memory.
>>
>> We make no guarantees that other DM devices will resume before a table
>> load -- so calling dm_table_setup_md_queue() within the exclusive
>> _hash_lock is no different than other DM devices being suspended while
>> a request-based DM device performs its first table_load().
>>
>> My thinking was this should not be a problem as it is only valid to
>> call dm_table_setup_md_queue() before the newly created request-based
>> DM device has been resumed.
>>
>> AFAIK we don't have any explicit constraints on memory allocations
>> during table load (e.g. table loads shouldn't depend on other devices'
>> writeback) -- but any GFP_KERNEL allocation could recurse
>> (elevator_alloc() currently uses GFP_KERNEL with kmalloc_node)...
>>
>> I'll have to review the DM code further to see if all memory
>> allocations during table_load() are done via mempools.  I'll also
>> bring this up on this week's LVM call.
> 
> We discussed this and I understand the scope of the problem now.
> 
> Just reiterating what you covered when you first pointed this issue out:
> 
> It could be that a table load gets blocked (waiting on a memory
> allocation).  The table load can take as long as it needs.  But we can't
> have it block holding the exclusive _hash_lock while blocking.  Having
> _hash_lock prevents further DM ioctls.  The table load's allocation may
> be blocking waiting for writeback to a DM device that will be resumed by
> another thread.

That's right.
 
> Thanks again for pointing this out; I'll work to arrive at an
> alternative locking scheme.  Likely introduce a lock local to the
> multiple_device (effectively the 'queue_lock' I had before).  But
> difference is I'd take that lock before taking _hash_lock.

You have to look at do_resume(), too.
do_resume() gets hc->new_map with _hash_lock held but without your
'queue_lock' held.  This could cause an inconsistent state;
  table_load() for bio-based table    do_resume() for rq-based table
  --------------------------------------------------------------------
  dm_lock_md_queue(md)
  dm_table_setup_md_queue(t) (<= bio-based)
    dm_clear_request_based_queue(md)
      md->queue->request_fn = NULL
                                       down_write(_hash_lock)
                                       new_map = hc->new_map (<= rq-based)
                                       up_write(_hash_lock)
                                       dm_swap_table(md, new_map)
                                       __bind(md, new_map) (<= rq-based)
  down_write(_hash_lock)
  hc->new_map = t
  up_write(_hash_lock)
  dm_unlock_md_queue(md)
 
Other ioctls might be affected as well, since I have only checked the
interaction between table load and resume.


> I hope to have v6 available at some point today but it may be delayed by
> a day.

Sorry for repeating myself, but since you are trying to change the
current dm model, please design/consider it carefully: what objects/ioctls
are involved, how you change their relationships, where the critical
sections are, and how you protect them.
Otherwise, I'm afraid we will just spend your time transforming one bug
into another (and my time reviewing and verifying problems :)


>>> Also, your patch changes the queue configuration even when a table is
>>> already active and used.  (e.g. Loading bio-based table to a mapped_device
>>> which is already active/used as request-based sets q->requst_fn in NULL.)
>>> That could cause some critical problems.
>>
>> Yes, that is possible and I can add additional checks to prevent this.
>> But this speaks to a more general problem with the existing DM code.
>>
>> dm_swap_table() has the negative check to prevent such table loads, e.g.:
>> /* cannot change the device type, once a table is bound */
>>
>> This check should come during table_load, as part of
>> dm_table_set_type(), rather than during table resume.
> 
> I'll look to address this issue in v6 too.

It can race between table_load() and do_resume() as well;
  table_load() for bio-based            do_resume() for rq-based
  ---------------------------------------------------------------------
  dm_table_set_type(t) (<= bio-based)
    live_table = dm_get_live_table(md)
    (assume no live_table yet)
                                        new_map = hc->new_map (<= rq-based)
                                        dm_swap_table(md, new_map)
                                        __bind(md, new_map)
  dm_table_setup_md_queue(t)
    dm_clear_request_based_queue(md)
      md->queue->request_fn = NULL

Thanks,
Kiyoshi Ueda

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH 2/2] dm: only initialize full request_queue for request-based device
  2010-05-20 11:21                             ` Kiyoshi Ueda
@ 2010-05-20 17:07                               ` Mike Snitzer
  2010-05-21  8:32                                 ` Kiyoshi Ueda
  0 siblings, 1 reply; 28+ messages in thread
From: Mike Snitzer @ 2010-05-20 17:07 UTC (permalink / raw)
  To: Kiyoshi Ueda
  Cc: Nikanth Karthikesan, linux-kernel, dm-devel, Alasdair Kergon,
	Jens Axboe, Jun'ichi Nomura, Vivek Goyal

On Thu, May 20 2010 at  7:21am -0400,
Kiyoshi Ueda <k-ueda@ct.jp.nec.com> wrote:

> Hi Mike,
> 
> On 05/19/2010 11:39 PM +0900, Mike Snitzer wrote:
> > Thanks again for pointing this out; I'll work to arrive at an
> > alternative locking scheme.  Likely introduce a lock local to the
> > multiple_device (effectively the 'queue_lock' I had before).  But
> > difference is I'd take that lock before taking _hash_lock.
> 
> You have to see do_resume(), too.
> do_resume() gets hc->new_map with _hash_lock held but without your
> 'queue_lock' held.  This could cause inconsistent status;
>   table_load() for bio-based table    do_resume() for rq-based table
>   --------------------------------------------------------------------
>   dm_lock_md_queue(md)
>   dm_table_setup_md_queue(t) (<= bio-based)
>     dm_clear_request_based_queue(md)
>       md->queue->request_fn = NULL
>                                        down_write(_hash_lock)
>                                        new_map = hc->new_map (<= rq-based)
>                                        up_write(_hash_lock)
>                                        dm_swap_table(md, new_map)
>                                        __bind(md, new_map) (<= rq-based)
>   down_write(_hash_lock)
>   hc->new_map = t
>   up_write(_hash_lock)
>   dm_unlock_md_queue(md)

Understood.

> Other ioctls might be as well, since I have checked only interaction
> between table load and resume.

Table load and resume should be the only ioctls that are relevant.

> > I hope to have v6 available at some point today but it may be delayed by
> > a day.
> 
> Sorry for repeating but since you are trying to change the current dm
> model, please design/consider it carefully; what objects/ioctls are
> involved and how you change their relationships, where are the critical
> sections and how you protect them.
> Otherwise, I'm afraid we will just spend your time in transforming
> one bug to another. (and my time for reviewing and verifying problems :)

Noted.  I do appreciate your thorough reviews but will hopefully reduce
your burden by improving my awareness and code.

> >>> Also, your patch changes the queue configuration even when a table is
> >>> already active and used.  (e.g. Loading bio-based table to a mapped_device
> >>> which is already active/used as request-based sets q->requst_fn in NULL.)
> >>> That could cause some critical problems.
> >>
> >> Yes, that is possible and I can add additional checks to prevent this.
> >> But this speaks to a more general problem with the existing DM code.
> >>
> >> dm_swap_table() has the negative check to prevent such table loads, e.g.:
> >> /* cannot change the device type, once a table is bound */
> >>
> >> This check should come during table_load, as part of
> >> dm_table_set_type(), rather than during table resume.
> > 
> > I'll look to address this issue in v6 too.
> 
> It can race between table_load() and do_resume() as well;
>   table_load() for bio-based            do_resume() for rq-based
>   ---------------------------------------------------------------------
>   dm_table_set_type(t) (<= bio-based)
>     live_table = dm_get_live_table(md)
>     (assume no live_table yet)
>                                         new_map = hc->new_map (<= rq-based)
>                                         dm_swap_table(md, new_map)
>                                         __bind(md, new_map)
>   dm_table_setup_md_queue(t)
>     dm_clear_request_based_queue(md)
>       md->queue->request_fn = NULL

Understood.

I've addressed the above races by:
1) properly protecting md->queue during do_resume().  Both do_resume()
   and table_load() first take md->queue_lock and then _hash_lock.
2) adding a negative check for an existing live table to
   dm_table_setup_md_queue().

I will mail out v7 after I've given it additional review.
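
Roughly, the do_resume() side of that ordering looks like this (a sketch
only, not the actual v7 diff; dm_lock_md_queue()/dm_unlock_md_queue() are
the same hypothetical per-mapped_device helpers, and suspend handling,
error paths and destruction of the old map are omitted):

/* Sketch: do_resume() takes the per-device queue lock before _hash_lock,
 * mirroring table_load(), so the queue state and the bound table's type
 * cannot diverge. */
static int do_resume_sketch(struct dm_ioctl *param)
{
	struct hash_cell *hc;
	struct mapped_device *md;
	struct dm_table *new_map;

	md = find_device(param);
	if (!md)
		return -ENXIO;

	dm_lock_md_queue(md);		/* taken before _hash_lock */

	down_write(&_hash_lock);
	hc = dm_get_mdptr(md);
	new_map = hc->new_map;		/* grab the preloaded inactive table */
	hc->new_map = NULL;
	up_write(&_hash_lock);

	if (new_map)
		dm_swap_table(md, new_map);	/* bind it to the device */

	dm_unlock_md_queue(md);
	dm_put(md);
	return 0;
}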

FYI, I've revised and split out the conflicting table type checking
patch as that patch should not be controversial and is completely
independent of this DM queue initialization work.  I'll email it to
dm-devel soon.

Regards,
Mike

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH 2/2] dm: only initialize full request_queue for request-based device
  2010-05-20 17:07                               ` Mike Snitzer
@ 2010-05-21  8:32                                 ` Kiyoshi Ueda
  2010-05-21 13:34                                   ` Mike Snitzer
  0 siblings, 1 reply; 28+ messages in thread
From: Kiyoshi Ueda @ 2010-05-21  8:32 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Nikanth Karthikesan, linux-kernel, dm-devel, Alasdair Kergon,
	Jens Axboe, Jun'ichi Nomura, Vivek Goyal

Hi Mike,

On 05/21/2010 02:07 AM +0900, Mike Snitzer wrote:
> Kiyoshi Ueda <k-ueda@ct.jp.nec.com> wrote:
>> On 05/19/2010 11:39 PM +0900, Mike Snitzer wrote:
>>> Thanks again for pointing this out; I'll work to arrive at an
>>> alternative locking scheme.  Likely introduce a lock local to the
>>> multiple_device (effectively the 'queue_lock' I had before).  But
>>> difference is I'd take that lock before taking _hash_lock.
>>
>> You have to see do_resume(), too.
>> do_resume() gets hc->new_map with _hash_lock held but without your
>> 'queue_lock' held.  This could cause inconsistent status;
>>   table_load() for bio-based table    do_resume() for rq-based table
>>   --------------------------------------------------------------------
>>   dm_lock_md_queue(md)
>>   dm_table_setup_md_queue(t) (<= bio-based)
>>     dm_clear_request_based_queue(md)
>>       md->queue->request_fn = NULL
>>                                        down_write(_hash_lock)
>>                                        new_map = hc->new_map (<= rq-based)
>>                                        up_write(_hash_lock)
>>                                        dm_swap_table(md, new_map)
>>                                        __bind(md, new_map) (<= rq-based)
>>   down_write(_hash_lock)
>>   hc->new_map = t
>>   up_write(_hash_lock)
>>   dm_unlock_md_queue(md)
snip
>>>>> Also, your patch changes the queue configuration even when a table is
>>>>> already active and used.  (e.g. Loading bio-based table to a mapped_device
>>>>> which is already active/used as request-based sets q->requst_fn in NULL.)
>>>>> That could cause some critical problems.
>>>>
>>>> Yes, that is possible and I can add additional checks to prevent this.
>>>> But this speaks to a more general problem with the existing DM code.
>>>>
>>>> dm_swap_table() has the negative check to prevent such table loads, e.g.:
>>>> /* cannot change the device type, once a table is bound */
>>>>
>>>> This check should come during table_load, as part of
>>>> dm_table_set_type(), rather than during table resume.
>>>
>>> I'll look to address this issue in v6 too.
>>
>> It can race between table_load() and do_resume() as well;
>>   table_load() for bio-based            do_resume() for rq-based
>>   ---------------------------------------------------------------------
>>   dm_table_set_type(t) (<= bio-based)
>>     live_table = dm_get_live_table(md)
>>     (assume no live_table yet)
>>                                         new_map = hc->new_map (<= rq-based)
>>                                         dm_swap_table(md, new_map)
>>                                         __bind(md, new_map)
>>   dm_table_setup_md_queue(t)
>>     dm_clear_request_based_queue(md)
>>       md->queue->request_fn = NULL
> 
> Understood.
> 
> I've addressed the above races by:
> 1) properly protecting md->queue during do_resume().  Both do_resume()
>    and table_load() first take md->queue_lock and then _hash_lock.
> 2) adding a negative check for an existing live table to
>    dm_table_setup_md_queue().
> 
> I will mail out v7 after I've given it additional review.

dm_table_setup_md_queue() may allocate memory in blocking mode while
holding md->queue_lock, which is needed to resume the md.
That could cause a deadlock which doesn't happen in the current model;
e.g.:
  - Assume a bio-based table has already been loaded into hc->new_map
    and the mapped_device has no live table yet.
  - A request-based table is then loaded and an out-of-memory situation
    occurs in dm_table_setup_md_queue().
  - Then, the mapped_device can't be resumed, even though resuming it
    with the preloaded bio-based table could free some memory via
    writeback.
  => deadlock

Thanks,
Kiyoshi Ueda

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH 2/2] dm: only initialize full request_queue for request-based device
  2010-05-21  8:32                                 ` Kiyoshi Ueda
@ 2010-05-21 13:34                                   ` Mike Snitzer
  2010-05-24  9:58                                     ` Kiyoshi Ueda
  0 siblings, 1 reply; 28+ messages in thread
From: Mike Snitzer @ 2010-05-21 13:34 UTC (permalink / raw)
  To: Kiyoshi Ueda
  Cc: Nikanth Karthikesan, linux-kernel, dm-devel, Alasdair Kergon,
	Jens Axboe, Jun'ichi Nomura, Vivek Goyal

On Fri, May 21 2010 at  4:32am -0400,
Kiyoshi Ueda <k-ueda@ct.jp.nec.com> wrote:

> Hi Mike,
> 
> On 05/21/2010 02:07 AM +0900, Mike Snitzer wrote:
> > Kiyoshi Ueda <k-ueda@ct.jp.nec.com> wrote:
> >> On 05/19/2010 11:39 PM +0900, Mike Snitzer wrote:
> >>> Thanks again for pointing this out; I'll work to arrive at an
> >>> alternative locking scheme.  Likely introduce a lock local to the
> >>> multiple_device (effectively the 'queue_lock' I had before).  But
> >>> difference is I'd take that lock before taking _hash_lock.
> >>
> >> You have to see do_resume(), too.
> >> do_resume() gets hc->new_map with _hash_lock held but without your
> >> 'queue_lock' held.  This could cause inconsistent status;
> >>   table_load() for bio-based table    do_resume() for rq-based table
> >>   --------------------------------------------------------------------
> >>   dm_lock_md_queue(md)
> >>   dm_table_setup_md_queue(t) (<= bio-based)
> >>     dm_clear_request_based_queue(md)
> >>       md->queue->request_fn = NULL
> >>                                        down_write(_hash_lock)
> >>                                        new_map = hc->new_map (<= rq-based)
> >>                                        up_write(_hash_lock)
> >>                                        dm_swap_table(md, new_map)
> >>                                        __bind(md, new_map) (<= rq-based)
> >>   down_write(_hash_lock)
> >>   hc->new_map = t
> >>   up_write(_hash_lock)
> >>   dm_unlock_md_queue(md)
> snip
> >>>>> Also, your patch changes the queue configuration even when a table is
> >>>>> already active and used.  (e.g. Loading bio-based table to a mapped_device
> >>>>> which is already active/used as request-based sets q->requst_fn in NULL.)
> >>>>> That could cause some critical problems.
> >>>>
> >>>> Yes, that is possible and I can add additional checks to prevent this.
> >>>> But this speaks to a more general problem with the existing DM code.
> >>>>
> >>>> dm_swap_table() has the negative check to prevent such table loads, e.g.:
> >>>> /* cannot change the device type, once a table is bound */
> >>>>
> >>>> This check should come during table_load, as part of
> >>>> dm_table_set_type(), rather than during table resume.
> >>>
> >>> I'll look to address this issue in v6 too.
> >>
> >> It can race between table_load() and do_resume() as well;
> >>   table_load() for bio-based            do_resume() for rq-based
> >>   ---------------------------------------------------------------------
> >>   dm_table_set_type(t) (<= bio-based)
> >>     live_table = dm_get_live_table(md)
> >>     (assume no live_table yet)
> >>                                         new_map = hc->new_map (<= rq-based)
> >>                                         dm_swap_table(md, new_map)
> >>                                         __bind(md, new_map)
> >>   dm_table_setup_md_queue(t)
> >>     dm_clear_request_based_queue(md)
> >>       md->queue->request_fn = NULL
> > 
> > Understood.
> > 
> > I've addressed the above races by:
> > 1) properly protecting md->queue during do_resume().  Both do_resume()
> >    and table_load() first take md->queue_lock and then _hash_lock.
> > 2) adding a negative check for an existing live table to
> >    dm_table_setup_md_queue().
> > 
> > I will mail out v7 after I've given it additional review.
> 
> dm_table_setup_md_queue() may allocate memory with blocking mode
> inside md->queue_lock which is needed to resume the md.
> That could cause a deadlock which doesn't happen in the current model;
> e.g.:
>   - Assume a bio-based table has already been loaded into hc->new_map
>     and there is no live table yet in the mapped_device.
>   - Load a request-based table, and an out-of-memory situation occurs
>     in dm_table_setup_md_queue().
>   - Then, the mapped_device can't be resumed, even though resuming it
>     with the preloaded bio-based table could free some memory via
>     writeback.
>   => deadlock

My patch introduces a new mutex that is local to the mapped_device being
loaded, and that same mutex is taken during do_resume() (just as you
implicitly said was needed to avoid races).
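
To make the ordering concrete, a rough sketch (sketch only, not the
actual v7 diff; 'queue_lock' is the new per-device mutex, and the other
names follow what we have already been using in this thread):

  static int table_load_sketch(struct mapped_device *md, struct dm_table *t,
                               struct hash_cell *hc)
  {
          mutex_lock(&md->queue_lock);            /* 1: per-device lock */
          dm_table_setup_md_queue(t);             /* may touch md->queue */
          down_write(&_hash_lock);                /* 2: global hash lock */
          hc->new_map = t;
          up_write(&_hash_lock);
          mutex_unlock(&md->queue_lock);
          return 0;
  }

  static int do_resume_sketch(struct mapped_device *md, struct hash_cell *hc)
  {
          struct dm_table *new_map;

          mutex_lock(&md->queue_lock);            /* same order as table_load */
          down_write(&_hash_lock);
          new_map = hc->new_map;
          up_write(&_hash_lock);
          dm_swap_table(md, new_map);
          mutex_unlock(&md->queue_lock);
          return 0;
  }

Because both paths take md->queue_lock before _hash_lock, the lock order
cannot invert between table_load() and do_resume().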

But I'm _very_ skeptical of the validity of this deadlock scenario.
You're leaving out some important context for why we might expect a DM
device to service (queue) IO during the table load _before_ the device
has _ever_ been resumed.  When I issue IO to a DM device that only has
an inactive table, it returns immediately:

# dmsetup table --inactive
bio-based: 0 8388608 linear 254:16 384
# dd if=/dev/mapper/bio-based of=/dev/null bs=1024k count=10
0+0 records in
0+0 records out
0 bytes (0 B) copied, 0.00016418 s, 0.0 kB/s

As we discussed earlier in this thread, a table load can block
indefinitely (waiting for memory) as long as _hash_lock is not held (so
other DM ioctls are still possible).

It is a serious reach to suggest that your new deadlock scenario is a
legitimate concern.  This would mean that a user would:

1) mistakenly load a bio-based table when they meant rq-based
2) load rq-based -- but block waiting for elevator_alloc()
3) attempt to resume the device with the bio-based table anyway -- to
   make forward progress with writeback destined to the device being
   resumed for the first time? -- but again, why would dirty pages be
   destined for this inactive device already?
4) even if the user could resume, they'd be left with a bio-based
   device, which isn't what they wanted.

Do you see the obscure nature of the scenario you've presented?

Please help me understand what I am missing.

Mike

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH 2/2] dm: only initialize full request_queue for request-based device
  2010-05-21 13:34                                   ` Mike Snitzer
@ 2010-05-24  9:58                                     ` Kiyoshi Ueda
  0 siblings, 0 replies; 28+ messages in thread
From: Kiyoshi Ueda @ 2010-05-24  9:58 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Nikanth Karthikesan, linux-kernel, dm-devel, Alasdair Kergon,
	Jens Axboe, Jun'ichi Nomura, Vivek Goyal

Hi Mike,

On 05/21/2010 10:34 PM +0900, Mike Snitzer wrote:
> Kiyoshi Ueda <k-ueda@ct.jp.nec.com> wrote:
>> On 05/21/2010 02:07 AM +0900, Mike Snitzer wrote:
>>> Kiyoshi Ueda <k-ueda@ct.jp.nec.com> wrote:
>>>> On 05/19/2010 11:39 PM +0900, Mike Snitzer wrote:
>>>>> Thanks again for pointing this out; I'll work to arrive at an
>>>>> alternative locking scheme.  Likely introduce a lock local to the
>>>>> mapped_device (effectively the 'queue_lock' I had before).  But the
>>>>> difference is I'd take that lock before taking _hash_lock.
>>>> You have to see do_resume(), too.
>>>> do_resume() gets hc->new_map with _hash_lock held but without your
>>>> 'queue_lock' held.  This could cause inconsistent status;
>>>>   table_load() for bio-based table    do_resume() for rq-based table
>>>>   --------------------------------------------------------------------
>>>>   dm_lock_md_queue(md)
>>>>   dm_table_setup_md_queue(t) (<= bio-based)
>>>>     dm_clear_request_based_queue(md)
>>>>       md->queue->request_fn = NULL
>>>>                                        down_write(_hash_lock)
>>>>                                        new_map = hc->new_map (<= rq-based)
>>>>                                        up_write(_hash_lock)
>>>>                                        dm_swap_table(md, new_map)
>>>>                                        __bind(md, new_map) (<= rq-based)
>>>>   down_write(_hash_lock)
>>>>   hc->new_map = t
>>>>   up_write(_hash_lock)
>>>>   dm_unlock_md_queue(md)
>> snip
>>>>>>> Also, your patch changes the queue configuration even when a table is
>>>>>>> already active and used.  (e.g. Loading a bio-based table into a mapped_device
>>>>>>> which is already active/used as request-based sets q->request_fn to NULL.)
>>>>>>> That could cause some critical problems.
>>>>>> Yes, that is possible and I can add additional checks to prevent this.
>>>>>> But this speaks to a more general problem with the existing DM code.
>>>>>>
>>>>>> dm_swap_table() has the negative check to prevent such table loads, e.g.:
>>>>>> /* cannot change the device type, once a table is bound */
>>>>>>
>>>>>> This check should come during table_load, as part of
>>>>>> dm_table_set_type(), rather than during table resume.
>>>>> I'll look to address this issue in v6 too.
>>>> It can race between table_load() and do_resume() as well;
>>>>   table_load() for bio-based            do_resume() for rq-based
>>>>   ---------------------------------------------------------------------
>>>>   dm_table_set_type(t) (<= bio-based)
>>>>     live_table = dm_get_live_table(md)
>>>>     (assume no live_table yet)
>>>>                                         new_map = hc->new_map (<= rq-based)
>>>>                                         dm_swap_table(md, new_map)
>>>>                                         __bind(md, new_map)
>>>>   dm_table_setup_md_queue(t)
>>>>     dm_clear_request_based_queue(md)
>>>>       md->queue->request_fn = NULL
>>> Understood.
>>>
>>> I've addressed the above races by:
>>> 1) properly protecting md->queue during do_resume().  Both do_resume()
>>>    and table_load() first take md->queue_lock and then _hash_lock.
>>> 2) adding a negative check for an existing live table to
>>>    dm_table_setup_md_queue().
>>>
>>> I will mail out v7 after I've given it additional review.
>>
>> dm_table_setup_md_queue() may allocate memory with blocking mode
>> inside md->queue_lock which is needed to resume the md.
>> That could cause a deadlock which doesn't happen in the current model;
>> e.g.:
>>   - Assume a bio-based table has already been loaded into hc->new_map
>>     and there is no live table yet in the mapped_device.
>>   - Load a request-based table, and an out-of-memory situation occurs
>>     in dm_table_setup_md_queue().
>>   - Then, the mapped_device can't be resumed, even though resuming it
>>     with the preloaded bio-based table could free some memory via
>>     writeback.
>>   => deadlock
> 
> My patch introduces a new mutex that is local to the mapped_device being
> loaded.  And that same mutex is taken during do_resume (just like you
> implicitly said was needed to avoid races).
> 
> But I'm _very_ skeptical of the validity of this deadlock scenario.
> You're leaving out some important context for why we might expect a DM
> device to service (queue) IO during the table load _before_ the device
> has _ever_ been resumed.

Because an upper device might be waiting for the DM device to be
resumed in order to serve pending I/O.

In a stacked configuration, load/suspend/resume happens in this order:
  1. load tables to upper and lower(s)
  2. suspend upper
  3. suspend lower(s)
  4. resume lower(s)
  5. resume upper
And "lower(s)" may include a new DM device which has never been resumed
before step 4.

So, the following example could cause the deadlock scenario:
  # echo "0 10 linear /dev/sda 0" | dmsetup create lower0
  # echo "0 10 linear /dev/mapper/lower0 0" | dmsetup create upper
  # dmsetup create --notable lower1
  # echo "0 10 linear /dev/mapper/lower1 0" | dmsetup load upper
  # echo "0 10 linear /dev/sdb 0" | dmsetup load lower1
  <Any I/O to /dev/mapper/upper is possible here>
  # dmsetup suspend upper
  <I/Os are queued on /dev/mapper/upper>
  # echo "0 10 multipath ../dev/sda../dev/sdb.." | dmsetup load lower1

  If an out-of-memory situation happens here, the system may hang,
  because /dev/mapper/lower1 can't be resumed even though it will be
  needed for writeback from /dev/mapper/upper once /dev/mapper/upper
  is resumed.
  (Without your patch, /dev/mapper/upper and /dev/mapper/lower1 could
   be resumed using the pre-loaded bio-based tables, and the system
   could reclaim some memory and stay alive.)


> As we discussed earlier in this thread, a table load can block
> indefinitely (waiting for memory) as long as _hash_lock is not held (so
> other DM ioctls are still possible).
> 
> It is a serious reach to suggest that your new deadlock scenario is a
> legitimate concern.  This would mean that a user would:
> 
> 1) mistakenly load a bio-based table when they meant rq-based
> 2) load rq-based -- but block waiting for elevator_alloc()
> 3) attempt to resume device with bio-based anyway -- to make forward
>    progress with  writeback destined to the device being resumed for the
>    first time? -- but again, why would dirty pages be destined for this
>    inactive device already?
> 4) even if the user could resume they'd be left with a bio-based device;
>    which isn't what they wanted..

Although the scenario/example I mentioned above may seem contrived, it
is still possible.
We should either design so that such critical problems cannot happen,
or prevent/reject such problematic operations.

Your new model, which determines the md type at the first table load
and prevents type switching, is close to my initial idea of determining
the md type at device creation time, and it requires no userspace
change.
I think it's a reasonable model.
But I have some comments on it, so I'll reply in the new thread for
your new-model patch-set.
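
For illustration of what "prevents type switching" means in practice,
a rough sketch (the accessor names are hypothetical, not taken from
your actual patch):

  static int check_md_type_sketch(struct mapped_device *md,
                                  unsigned new_type)
  {
          unsigned cur_type = dm_get_md_type(md); /* hypothetical accessor */

          /*
           * The type is fixed by the first table load; reject a later
           * load of a different type here, at table_load() time,
           * rather than at resume time.
           */
          if (cur_type != DM_TYPE_NONE && cur_type != new_type)
                  return -EINVAL;

          dm_set_md_type(md, new_type);           /* hypothetical setter */
          return 0;
  }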

Thanks,
Kiyoshi Ueda

^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2010-05-24 10:03 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-05-10 22:55 [RFC PATCH 1/2] block: allow initialization of previously allocated request_queue Mike Snitzer
2010-05-10 22:55 ` [RFC PATCH 2/2] dm: only initialize full request_queue for request-based device Mike Snitzer
2010-05-11  4:23   ` Kiyoshi Ueda
2010-05-11 13:15     ` Mike Snitzer
2010-05-12  8:23       ` Kiyoshi Ueda
2010-05-13  3:57         ` Mike Snitzer
2010-05-14  8:06           ` Kiyoshi Ueda
2010-05-14 14:08             ` Mike Snitzer
2010-05-17  9:27               ` Kiyoshi Ueda
2010-05-17 17:27                 ` Mike Snitzer
2010-05-18  8:32                   ` Kiyoshi Ueda
2010-05-18 13:46                     ` Mike Snitzer
2010-05-19  5:57                       ` Kiyoshi Ueda
2010-05-19 12:01                         ` Mike Snitzer
2010-05-19 14:39                           ` Mike Snitzer
2010-05-19 14:45                             ` Mike Snitzer
2010-05-20 11:21                             ` Kiyoshi Ueda
2010-05-20 17:07                               ` Mike Snitzer
2010-05-21  8:32                                 ` Kiyoshi Ueda
2010-05-21 13:34                                   ` Mike Snitzer
2010-05-24  9:58                                     ` Kiyoshi Ueda
2010-05-19 21:51                           ` Mike Snitzer
2010-05-13  4:31   ` [PATCH 2/2 v2] " Mike Snitzer
2010-05-13  5:02     ` [RFC PATCH 3/2] dm: bio-based device must not register elevator in sysfs Mike Snitzer
2010-05-13 22:14       ` [PATCH 3/2 v2] " Mike Snitzer
2010-05-11  6:55 ` [RFC PATCH 1/2] block: allow initialization of previously allocated request_queue Jens Axboe
2010-05-11 13:18   ` Mike Snitzer
2010-05-11 13:21     ` Jens Axboe
