* Re: [PATCH 05/10] block: remove per-queue plugging
  [not found] ` <20110405130541.6c2b5f86@notabene.brown>
@ 2011-04-11 4:50 ` NeilBrown
  2011-04-11 9:19 ` Jens Axboe
  2011-04-11 16:59 ` hch
  0 siblings, 2 replies; 61+ messages in thread
From: NeilBrown @ 2011-04-11 4:50 UTC (permalink / raw)
  To: Mike Snitzer, Jens Axboe
  Cc: linux-kernel@vger.kernel.org, hch@infradead.org, dm-devel, linux-raid

On Tue, 5 Apr 2011 13:05:41 +1000 NeilBrown <neilb@suse.de> wrote:

> On Wed, 9 Mar 2011 19:58:10 -0500 Mike Snitzer <snitzer@redhat.com> wrote:
>
> > Also, in your MD changes, you removed all calls to md_unplug() but
> > didn't remove md_unplug(). Seems it should be removed along with the
> > 'plug' member of 'struct mddev_t'? Neil?
>
> I've been distracted by other things and only just managed to have a look at
> this.
>
> The new plugging code seems to completely ignore the needs of stacked devices
> - or at least my needs in md.
>
> For RAID1 with a write-intent-bitmap, I queue all write requests and then on
> an unplug I update the write-intent-bitmap to mark all the relevant blocks
> and then release the writes.
>
> With the new code there is no way for an unplug event to wake up the raid1d
> thread to start the writeout - I haven't tested it but I suspect it will just
> hang.
>
> Similarly for RAID5 I gather write bios (long before they become 'struct
> request' which is what the plugging code understands) and on an unplug event
> I release the writes - hopefully with enough bios per stripe so that we don't
> need to pre-read.
>
> Possibly the simplest fix would be to have a second list_head in 'struct
> blk_plug' which contained callbacks (a function pointer a list_head in a
> struct which is passed as an arg to the function!).
> blk_finish_plug could then walk the list and call the call-backs.
> It would be quite easy to hook into that.

I've implemented this and it seems to work.
Jens: could you please review and hopefully ack the patch below, and let
me know if you will submit it or should I?

My testing of this combined with some other patches which cause various md
personalities to use it shows up a bug somewhere.

The symptoms are crashes in various places in blk-core and sometimes
elevator.c; list_sort occurs fairly often in the stack traces, but not always.

This patch

diff --git a/block/blk-core.c b/block/blk-core.c
index 273d60b..903ce8d 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2674,19 +2674,23 @@ static void flush_plug_list(struct blk_plug *plug)
 	struct request_queue *q;
 	unsigned long flags;
 	struct request *rq;
+	struct list_head head;
 
 	BUG_ON(plug->magic != PLUG_MAGIC);
 
 	if (list_empty(&plug->list))
 		return;
+	list_add(&head, &plug->list);
+	list_del_init(&plug->list);
 
 	if (plug->should_sort)
-		list_sort(NULL, &plug->list, plug_rq_cmp);
+		list_sort(NULL, &head, plug_rq_cmp);
+	plug->should_sort = 0;
 
 	q = NULL;
 	local_irq_save(flags);
-	while (!list_empty(&plug->list)) {
-		rq = list_entry_rq(plug->list.next);
+	while (!list_empty(&head)) {
+		rq = list_entry_rq(head.next);
 		list_del_init(&rq->queuelist);
 		BUG_ON(!(rq->cmd_flags & REQ_ON_PLUG));
 		BUG_ON(!rq->q);

makes the symptom go away. It simply moves the plug list onto a separate
list head before sorting and processing it.

My test was simply writing to a RAID1 with dd:

  while true; do dd if=/dev/zero of=/dev/md0 bs=4k; done

Obviously all writes go to two devices so the plug list will always need
sorting.
The only explanation I can come up with is that very occasionally schedule on
2 separate cpus calls blk_flush_plug for the same task. I don't understand
the scheduler nearly well enough to know if or how that can happen.
However with this patch in place I can write to a RAID1 constantly for half
an hour, and without it, the write rarely lasts for 3 minutes.

If you want to reproduce my experiment, you can pull from
git://neil.brown.name/md plug-test to get my patches for plugging in md
(which are not quite ready for submission but seem to work), create a RAID1
using e.g.

  mdadm -C /dev/md0 --level=1 --raid-disks=2 /dev/device1 /dev/device2

and then run

  while true; do dd if=/dev/zero of=/dev/md0 bs=4K ; done

Thanks,
NeilBrown

From 687b189c02276887dd7d5b87a817da9f67ed3c2c Mon Sep 17 00:00:00 2001
From: NeilBrown <neilb@suse.de>
Date: Thu, 7 Apr 2011 13:16:59 +1000
Subject: [PATCH] Enhance new plugging support to support general callbacks.

md/raid requires an unplug callback, but as it does not use
requests the current code cannot provide one.

So allow arbitrary callbacks to be attached to the blk_plug.

Cc: Jens Axboe <jaxboe@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
---
 block/blk-core.c       |   13 +++++++++++++
 include/linux/blkdev.h |    7 ++++++-
 2 files changed, 19 insertions(+), 1 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 725091d..273d60b 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2644,6 +2644,7 @@ void blk_start_plug(struct blk_plug *plug)
 
 	plug->magic = PLUG_MAGIC;
 	INIT_LIST_HEAD(&plug->list);
+	INIT_LIST_HEAD(&plug->cb_list);
 	plug->should_sort = 0;
 
 	/*
@@ -2717,9 +2718,21 @@ static void flush_plug_list(struct blk_plug *plug)
 	local_irq_restore(flags);
 }
 
+static void flush_plug_callbacks(struct blk_plug *plug)
+{
+	while (!list_empty(&plug->cb_list)) {
+		struct blk_plug_cb *cb = list_first_entry(&plug->cb_list,
+							  struct blk_plug_cb,
+							  list);
+		list_del(&cb->list);
+		cb->callback(cb);
+	}
+}
+
 static void __blk_finish_plug(struct task_struct *tsk, struct blk_plug *plug)
 {
 	flush_plug_list(plug);
+	flush_plug_callbacks(plug);
 
 	if (plug == tsk->plug)
 		tsk->plug = NULL;
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 32176cc..3e5e604 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -857,8 +857,13 @@ extern void blk_put_queue(struct request_queue *);
 struct blk_plug {
 	unsigned long magic;
 	struct list_head list;
+	struct list_head cb_list;
 	unsigned int should_sort;
 };
+struct blk_plug_cb {
+	struct list_head list;
+	void (*callback)(struct blk_plug_cb *);
+};
 
 extern void blk_start_plug(struct blk_plug *);
 extern void blk_finish_plug(struct blk_plug *);
@@ -876,7 +881,7 @@ static inline bool blk_needs_flush_plug(struct task_struct *tsk)
 {
 	struct blk_plug *plug = tsk->plug;
 
-	return plug && !list_empty(&plug->list);
+	return plug && (!list_empty(&plug->list) || !list_empty(&plug->cb_list));
 }
 
 /*
-- 
1.7.3.4

^ permalink raw reply related	[flat|nested] 61+ messages in thread
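For context, a minimal sketch of how a stacked driver such as md might consume
the callback API this patch adds: embed a blk_plug_cb in a driver structure,
hook it onto the current task's plug, and do the deferred work when the
callback fires from blk_finish_plug(). The md_plug_cb/md_check_plugged names,
the GFP_ATOMIC allocation and the thread wakeup are illustrative assumptions,
not part of the patch:

	/* Illustrative consumer of the blk_plug_cb API (a sketch, not md code). */
	struct md_plug_cb {
		struct blk_plug_cb cb;
		struct mddev *mddev;
	};

	static void md_unplug_cb(struct blk_plug_cb *cb)
	{
		struct md_plug_cb *mdcb = container_of(cb, struct md_plug_cb, cb);

		/* release whatever was put aside while the plug was active */
		md_wakeup_thread(mdcb->mddev->thread);
		kfree(mdcb);
	}

	/*
	 * Returns true if a plug is active and the callback was registered;
	 * the caller may then stash bios instead of submitting them.  A real
	 * implementation would also avoid registering the same cb twice.
	 */
	static bool md_check_plugged(struct mddev *mddev)
	{
		struct blk_plug *plug = current->plug;
		struct md_plug_cb *mdcb;

		if (!plug)
			return false;

		mdcb = kmalloc(sizeof(*mdcb), GFP_ATOMIC);
		if (!mdcb)
			return false;

		mdcb->mddev = mddev;
		mdcb->cb.callback = md_unplug_cb;
		list_add(&mdcb->cb.list, &plug->cb_list);
		return true;
	}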
* Re: [PATCH 05/10] block: remove per-queue plugging 2011-04-11 4:50 ` [PATCH 05/10] block: remove per-queue plugging NeilBrown @ 2011-04-11 9:19 ` Jens Axboe 2011-04-11 10:59 ` NeilBrown 2011-04-11 16:59 ` hch 1 sibling, 1 reply; 61+ messages in thread From: Jens Axboe @ 2011-04-11 9:19 UTC (permalink / raw) To: NeilBrown Cc: Mike Snitzer, linux-kernel@vger.kernel.org, hch@infradead.org, dm-devel@redhat.com, linux-raid@vger.kernel.org On 2011-04-11 06:50, NeilBrown wrote: > On Tue, 5 Apr 2011 13:05:41 +1000 NeilBrown <neilb@suse.de> wrote: > >> On Wed, 9 Mar 2011 19:58:10 -0500 Mike Snitzer <snitzer@redhat.com> wrote: >> >>> Also, in your MD changes, you removed all calls to md_unplug() but >>> didn't remove md_unplug(). Seems it should be removed along with the >>> 'plug' member of 'struct mddev_t'? Neil? >> >> I've been distracted by other things and only just managed to have a look at >> this. >> >> The new plugging code seems to completely ignore the needs of stacked devices >> - or at least my needs in md. >> >> For RAID1 with a write-intent-bitmap, I queue all write requests and then on >> an unplug I update the write-intent-bitmap to mark all the relevant blocks >> and then release the writes. >> >> With the new code there is no way for an unplug event to wake up the raid1d >> thread to start the writeout - I haven't tested it but I suspect it will just >> hang. >> >> Similarly for RAID5 I gather write bios (long before they become 'struct >> request' which is what the plugging code understands) and on an unplug event >> I release the writes - hopefully with enough bios per stripe so that we don't >> need to pre-read. >> >> Possibly the simplest fix would be to have a second list_head in 'struct >> blk_plug' which contained callbacks (a function pointer a list_head in a >> struct which is passed as an arg to the function!). >> blk_finish_plug could then walk the list and call the call-backs. >> It would be quite easy to hook into that. > > I've implemented this and it seems to work. > Jens: could you please review and hopefully ack the patch below, and let > me know if you will submit it or should I? > > My testing of this combined with some other patches which cause various md > personalities to use it shows up a bug somewhere. > > The symptoms are crashes in various places in blk-core and sometimes > elevator.c > list_sort occurs fairly often included in the stack but not always. > > This patch > > diff --git a/block/blk-core.c b/block/blk-core.c > index 273d60b..903ce8d 100644 > --- a/block/blk-core.c > +++ b/block/blk-core.c > @@ -2674,19 +2674,23 @@ static void flush_plug_list(struct blk_plug *plug) > struct request_queue *q; > unsigned long flags; > struct request *rq; > + struct list_head head; > > BUG_ON(plug->magic != PLUG_MAGIC); > > if (list_empty(&plug->list)) > return; > + list_add(&head, &plug->list); > + list_del_init(&plug->list); > > if (plug->should_sort) > - list_sort(NULL, &plug->list, plug_rq_cmp); > + list_sort(NULL, &head, plug_rq_cmp); > + plug->should_sort = 0; > > q = NULL; > local_irq_save(flags); > - while (!list_empty(&plug->list)) { > - rq = list_entry_rq(plug->list.next); > + while (!list_empty(&head)) { > + rq = list_entry_rq(head.next); > list_del_init(&rq->queuelist); > BUG_ON(!(rq->cmd_flags & REQ_ON_PLUG)); > BUG_ON(!rq->q); > > > makes the symptom go away. It simply moves the plug list onto a separate > list head before sorting and processing it. 
> My test was simply writing to a RAID1 with dd:
>   while true; do dd if=/dev/zero of=/dev/md0 size=4k; done
>
> Obviously all writes go to two devices so the plug list will always need
> sorting.
>
> The only explanation I can come up with is that very occasionally schedule on
> 2 separate cpus calls blk_flush_plug for the same task. I don't understand
> the scheduler nearly well enough to know if or how that can happen.
> However with this patch in place I can write to a RAID1 constantly for half
> an hour, and without it, the write rarely lasts for 3 minutes.

Or perhaps if the request_fn blocks, that would be problematic. So the
patch is likely a good idea even for that case.

I'll merge it, changing it to list_splice_init() as I think that would
be more clear.

> From 687b189c02276887dd7d5b87a817da9f67ed3c2c Mon Sep 17 00:00:00 2001
> From: NeilBrown <neilb@suse.de>
> Date: Thu, 7 Apr 2011 13:16:59 +1000
> Subject: [PATCH] Enhance new plugging support to support general callbacks.
>
> md/raid requires an unplug callback, but as it does not uses
> requests the current code cannot provide one.
>
> So allow arbitrary callbacks to be attached to the blk_plug.
>
> Cc: Jens Axboe <jaxboe@fusionio.com>
> Signed-off-by: NeilBrown <neilb@suse.de>
> ---
>  block/blk-core.c       |   13 +++++++++++++
>  include/linux/blkdev.h |    7 ++++++-
>  2 files changed, 19 insertions(+), 1 deletions(-)
>
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 725091d..273d60b 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -2644,6 +2644,7 @@ void blk_start_plug(struct blk_plug *plug)
>
>  	plug->magic = PLUG_MAGIC;
>  	INIT_LIST_HEAD(&plug->list);
> +	INIT_LIST_HEAD(&plug->cb_list);
>  	plug->should_sort = 0;
>
>  	/*
> @@ -2717,9 +2718,21 @@ static void flush_plug_list(struct blk_plug *plug)
>  	local_irq_restore(flags);
>  }
>
> +static void flush_plug_callbacks(struct blk_plug *plug)
> +{
> +	while (!list_empty(&plug->cb_list)) {
> +		struct blk_plug_cb *cb = list_first_entry(&plug->cb_list,
> +							  struct blk_plug_cb,
> +							  list);
> +		list_del(&cb->list);
> +		cb->callback(cb);
> +	}
> +}
> +
>  static void __blk_finish_plug(struct task_struct *tsk, struct blk_plug *plug)
>  {
>  	flush_plug_list(plug);
> +	flush_plug_callbacks(plug);
>
>  	if (plug == tsk->plug)
>  		tsk->plug = NULL;
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index 32176cc..3e5e604 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -857,8 +857,13 @@ extern void blk_put_queue(struct request_queue *);
>  struct blk_plug {
>  	unsigned long magic;
>  	struct list_head list;
> +	struct list_head cb_list;
>  	unsigned int should_sort;
>  };
> +struct blk_plug_cb {
> +	struct list_head list;
> +	void (*callback)(struct blk_plug_cb *);
> +};
>
>  extern void blk_start_plug(struct blk_plug *);
>  extern void blk_finish_plug(struct blk_plug *);
> @@ -876,7 +881,7 @@ static inline bool blk_needs_flush_plug(struct task_struct *tsk)
>  {
>  	struct blk_plug *plug = tsk->plug;
>
> -	return plug && !list_empty(&plug->list);
> +	return plug && (!list_empty(&plug->list) || !list_empty(&plug->cb_list));
>  }
>
>  /*

Maybe I'm missing something, but why do you need those callbacks? If
it's to use plugging yourself, perhaps we can just ensure that those
don't get assigned in the task - so it would have to be used with care.

It's not that I disagree with these callbacks, I just want to ensure I
understand why you need them.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 61+ messages in thread
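For reference, the merged flush with Jens's suggested list_splice_init() would
look roughly like this (a sketch of Neil's fix in splice form, not the literal
committed code):

	static void flush_plug_list(struct blk_plug *plug)
	{
		struct request_queue *q;
		unsigned long flags;
		struct request *rq;
		LIST_HEAD(list);

		BUG_ON(plug->magic != PLUG_MAGIC);

		if (list_empty(&plug->list))
			return;

		/*
		 * Move everything off plug->list in one go, so a reentrant
		 * flush (e.g. from schedule()) finds an empty plug and
		 * cannot walk the same requests a second time.
		 */
		list_splice_init(&plug->list, &list);

		if (plug->should_sort)
			list_sort(NULL, &list, plug_rq_cmp);
		plug->should_sort = 0;

		q = NULL;
		local_irq_save(flags);
		while (!list_empty(&list)) {
			rq = list_entry_rq(list.next);
			list_del_init(&rq->queuelist);
			/* ... dispatch rq to its queue as before ... */
		}
		local_irq_restore(flags);
	}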
* Re: [PATCH 05/10] block: remove per-queue plugging 2011-04-11 9:19 ` Jens Axboe @ 2011-04-11 10:59 ` NeilBrown 2011-04-11 11:04 ` Jens Axboe 2011-04-11 11:55 ` NeilBrown 0 siblings, 2 replies; 61+ messages in thread From: NeilBrown @ 2011-04-11 10:59 UTC (permalink / raw) To: Jens Axboe Cc: Mike Snitzer, linux-kernel@vger.kernel.org, hch@infradead.org, dm-devel@redhat.com, linux-raid@vger.kernel.org On Mon, 11 Apr 2011 11:19:58 +0200 Jens Axboe <jaxboe@fusionio.com> wrote: > On 2011-04-11 06:50, NeilBrown wrote: > > The only explanation I can come up with is that very occasionally schedule on > > 2 separate cpus calls blk_flush_plug for the same task. I don't understand > > the scheduler nearly well enough to know if or how that can happen. > > However with this patch in place I can write to a RAID1 constantly for half > > an hour, and without it, the write rarely lasts for 3 minutes. > > Or perhaps if the request_fn blocks, that would be problematic. So the > patch is likely a good idea even for that case. > > I'll merge it, changing it to list_splice_init() as I think that would > be more clear. OK - though I'm not 100% the patch fixes the problem - just that it hides the symptom for me. I might try instrumenting the code a bit more and see if I can find exactly where it is re-entering flush_plug_list - as that seems to be what is happening. And yeah - list_split_init is probably better. I just never remember exactly what list_split means and have to look it up every time, where as list_add/list_del are very clear to me. > > > From 687b189c02276887dd7d5b87a817da9f67ed3c2c Mon Sep 17 00:00:00 2001 > > From: NeilBrown <neilb@suse.de> > > Date: Thu, 7 Apr 2011 13:16:59 +1000 > > Subject: [PATCH] Enhance new plugging support to support general callbacks. > > > > md/raid requires an unplug callback, but as it does not uses > > requests the current code cannot provide one. > > > > So allow arbitrary callbacks to be attached to the blk_plug. 
> > > > Cc: Jens Axboe <jaxboe@fusionio.com> > > Signed-off-by: NeilBrown <neilb@suse.de> > > --- > > block/blk-core.c | 13 +++++++++++++ > > include/linux/blkdev.h | 7 ++++++- > > 2 files changed, 19 insertions(+), 1 deletions(-) > > > > diff --git a/block/blk-core.c b/block/blk-core.c > > index 725091d..273d60b 100644 > > --- a/block/blk-core.c > > +++ b/block/blk-core.c > > @@ -2644,6 +2644,7 @@ void blk_start_plug(struct blk_plug *plug) > > > > plug->magic = PLUG_MAGIC; > > INIT_LIST_HEAD(&plug->list); > > + INIT_LIST_HEAD(&plug->cb_list); > > plug->should_sort = 0; > > > > /* > > @@ -2717,9 +2718,21 @@ static void flush_plug_list(struct blk_plug *plug) > > local_irq_restore(flags); > > } > > > > +static void flush_plug_callbacks(struct blk_plug *plug) > > +{ > > + while (!list_empty(&plug->cb_list)) { > > + struct blk_plug_cb *cb = list_first_entry(&plug->cb_list, > > + struct blk_plug_cb, > > + list); > > + list_del(&cb->list); > > + cb->callback(cb); > > + } > > +} > > + > > static void __blk_finish_plug(struct task_struct *tsk, struct blk_plug *plug) > > { > > flush_plug_list(plug); > > + flush_plug_callbacks(plug); > > > > if (plug == tsk->plug) > > tsk->plug = NULL; > > diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h > > index 32176cc..3e5e604 100644 > > --- a/include/linux/blkdev.h > > +++ b/include/linux/blkdev.h > > @@ -857,8 +857,13 @@ extern void blk_put_queue(struct request_queue *); > > struct blk_plug { > > unsigned long magic; > > struct list_head list; > > + struct list_head cb_list; > > unsigned int should_sort; > > }; > > +struct blk_plug_cb { > > + struct list_head list; > > + void (*callback)(struct blk_plug_cb *); > > +}; > > > > extern void blk_start_plug(struct blk_plug *); > > extern void blk_finish_plug(struct blk_plug *); > > @@ -876,7 +881,7 @@ static inline bool blk_needs_flush_plug(struct task_struct *tsk) > > { > > struct blk_plug *plug = tsk->plug; > > > > - return plug && !list_empty(&plug->list); > > + return plug && (!list_empty(&plug->list) || !list_empty(&plug->cb_list)); > > } > > > > /* > > Maybe I'm missing something, but why do you need those callbacks? If > it's to use plugging yourself, perhaps we can just ensure that those > don't get assigned in the task - so it would be have to used with care. > > It's not that I disagree to these callbacks, I just want to ensure I > understand why you need them. > I'm sure one of us is missing something (probably both) but I'm not sure what. The callback is central. It is simply to use plugging in md. Just like blk-core, md will notice that a blk_plug is active and will put requests aside. I then need something to call in to md when blk_finish_plug is called so that put-aside requests can be released. As md can be built as a module, that call must be a call-back of some sort. blk-core doesn't need to register blk_plug_flush because that is never in a module, so it can be called directly. But the md equivalent could be in a module, so I need to be able to register a call back. Does that help? Thanks, NeilBrown ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [PATCH 05/10] block: remove per-queue plugging 2011-04-11 10:59 ` NeilBrown @ 2011-04-11 11:04 ` Jens Axboe 2011-04-11 11:26 ` NeilBrown 2011-04-11 11:55 ` NeilBrown 1 sibling, 1 reply; 61+ messages in thread From: Jens Axboe @ 2011-04-11 11:04 UTC (permalink / raw) To: NeilBrown Cc: Mike Snitzer, linux-kernel@vger.kernel.org, hch@infradead.org, dm-devel@redhat.com, linux-raid@vger.kernel.org On 2011-04-11 12:59, NeilBrown wrote: > On Mon, 11 Apr 2011 11:19:58 +0200 Jens Axboe <jaxboe@fusionio.com> wrote: > >> On 2011-04-11 06:50, NeilBrown wrote: > >>> The only explanation I can come up with is that very occasionally schedule on >>> 2 separate cpus calls blk_flush_plug for the same task. I don't understand >>> the scheduler nearly well enough to know if or how that can happen. >>> However with this patch in place I can write to a RAID1 constantly for half >>> an hour, and without it, the write rarely lasts for 3 minutes. >> >> Or perhaps if the request_fn blocks, that would be problematic. So the >> patch is likely a good idea even for that case. >> >> I'll merge it, changing it to list_splice_init() as I think that would >> be more clear. > > OK - though I'm not 100% the patch fixes the problem - just that it hides the > symptom for me. > I might try instrumenting the code a bit more and see if I can find exactly > where it is re-entering flush_plug_list - as that seems to be what is > happening. It's definitely a good thing to add, to avoid the list fudging on schedule. Whether it's your exact problem, I can't tell. > And yeah - list_split_init is probably better. I just never remember exactly > what list_split means and have to look it up every time, where as > list_add/list_del are very clear to me. splice, no split :-) >>> From 687b189c02276887dd7d5b87a817da9f67ed3c2c Mon Sep 17 00:00:00 2001 >>> From: NeilBrown <neilb@suse.de> >>> Date: Thu, 7 Apr 2011 13:16:59 +1000 >>> Subject: [PATCH] Enhance new plugging support to support general callbacks. >>> >>> md/raid requires an unplug callback, but as it does not uses >>> requests the current code cannot provide one. >>> >>> So allow arbitrary callbacks to be attached to the blk_plug. 
>>> >>> Cc: Jens Axboe <jaxboe@fusionio.com> >>> Signed-off-by: NeilBrown <neilb@suse.de> >>> --- >>> block/blk-core.c | 13 +++++++++++++ >>> include/linux/blkdev.h | 7 ++++++- >>> 2 files changed, 19 insertions(+), 1 deletions(-) >>> >>> diff --git a/block/blk-core.c b/block/blk-core.c >>> index 725091d..273d60b 100644 >>> --- a/block/blk-core.c >>> +++ b/block/blk-core.c >>> @@ -2644,6 +2644,7 @@ void blk_start_plug(struct blk_plug *plug) >>> >>> plug->magic = PLUG_MAGIC; >>> INIT_LIST_HEAD(&plug->list); >>> + INIT_LIST_HEAD(&plug->cb_list); >>> plug->should_sort = 0; >>> >>> /* >>> @@ -2717,9 +2718,21 @@ static void flush_plug_list(struct blk_plug *plug) >>> local_irq_restore(flags); >>> } >>> >>> +static void flush_plug_callbacks(struct blk_plug *plug) >>> +{ >>> + while (!list_empty(&plug->cb_list)) { >>> + struct blk_plug_cb *cb = list_first_entry(&plug->cb_list, >>> + struct blk_plug_cb, >>> + list); >>> + list_del(&cb->list); >>> + cb->callback(cb); >>> + } >>> +} >>> + >>> static void __blk_finish_plug(struct task_struct *tsk, struct blk_plug *plug) >>> { >>> flush_plug_list(plug); >>> + flush_plug_callbacks(plug); >>> >>> if (plug == tsk->plug) >>> tsk->plug = NULL; >>> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h >>> index 32176cc..3e5e604 100644 >>> --- a/include/linux/blkdev.h >>> +++ b/include/linux/blkdev.h >>> @@ -857,8 +857,13 @@ extern void blk_put_queue(struct request_queue *); >>> struct blk_plug { >>> unsigned long magic; >>> struct list_head list; >>> + struct list_head cb_list; >>> unsigned int should_sort; >>> }; >>> +struct blk_plug_cb { >>> + struct list_head list; >>> + void (*callback)(struct blk_plug_cb *); >>> +}; >>> >>> extern void blk_start_plug(struct blk_plug *); >>> extern void blk_finish_plug(struct blk_plug *); >>> @@ -876,7 +881,7 @@ static inline bool blk_needs_flush_plug(struct task_struct *tsk) >>> { >>> struct blk_plug *plug = tsk->plug; >>> >>> - return plug && !list_empty(&plug->list); >>> + return plug && (!list_empty(&plug->list) || !list_empty(&plug->cb_list)); >>> } >>> >>> /* >> >> Maybe I'm missing something, but why do you need those callbacks? If >> it's to use plugging yourself, perhaps we can just ensure that those >> don't get assigned in the task - so it would be have to used with care. >> >> It's not that I disagree to these callbacks, I just want to ensure I >> understand why you need them. >> > > I'm sure one of us is missing something (probably both) but I'm not > sure what. > > The callback is central. > > It is simply to use plugging in md. > Just like blk-core, md will notice that a blk_plug is active and will put > requests aside. I then need something to call in to md when blk_finish_plug But this is done in __make_request(), so md devices should not be affected at all. This is the part of your explanation that I do not connect with the code. If md itself is putting things on the plug list, why is it doing that? > is called so that put-aside requests can be released. > As md can be built as a module, that call must be a call-back of some sort. > blk-core doesn't need to register blk_plug_flush because that is never in a > module, so it can be called directly. But the md equivalent could be in a > module, so I need to be able to register a call back. > > Does that help? Not really. Is the problem that _you_ would like to stash things aside, not the fact that __make_request() puts things on a task plug list? -- Jens Axboe ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [PATCH 05/10] block: remove per-queue plugging
  2011-04-11 11:04 ` Jens Axboe
@ 2011-04-11 11:26 ` NeilBrown
  2011-04-11 11:37 ` Jens Axboe
  0 siblings, 1 reply; 61+ messages in thread
From: NeilBrown @ 2011-04-11 11:26 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Mike Snitzer, linux-kernel@vger.kernel.org, hch@infradead.org,
      dm-devel@redhat.com, linux-raid@vger.kernel.org

On Mon, 11 Apr 2011 13:04:26 +0200 Jens Axboe <jaxboe@fusionio.com> wrote:

> >
> > I'm sure one of us is missing something (probably both) but I'm not
> > sure what.
> >
> > The callback is central.
> >
> > It is simply to use plugging in md.
> > Just like blk-core, md will notice that a blk_plug is active and will put
> > requests aside. I then need something to call in to md when blk_finish_plug
>
> But this is done in __make_request(), so md devices should not be
> affected at all. This is the part of your explanation that I do not
> connect with the code.
>
> If md itself is putting things on the plug list, why is it doing that?

Yes. Exactly. md itself wants to put things aside on some list.
e.g. in RAID1 when using a write-intent bitmap I want to gather as many write
requests as possible so I can update the bits for all of them at once.
So when a plug is in effect I just queue the bios somewhere and record the
bits that need to be set.
Then when the unplug happens I write out the bitmap updates in a single write
and when that completes, I write out the data (to all devices).

Also in RAID5 it is good if I can wait for lots of write requests to arrive
before committing any of them, to increase the possibility of getting a
full-stripe write.

Previously I used ->unplug_fn to release the queued requests. Now that has
gone I need a different way to register a callback when an unplug happens.

>
> > is called so that put-aside requests can be released.
> > As md can be built as a module, that call must be a call-back of some sort.
> > blk-core doesn't need to register blk_plug_flush because that is never in a
> > module, so it can be called directly. But the md equivalent could be in a
> > module, so I need to be able to register a call back.
> >
> > Does that help?
>
> Not really. Is the problem that _you_ would like to stash things aside,
> not the fact that __make_request() puts things on a task plug list?
>

Yes, exactly. I (in md) want to stash things aside.

(I don't actually put the stashed things on the blk_plug, though it might
make sense to do that later in some cases - I'm not sure. Currently I stash
things in my own internal lists and just need a call back to say "ok, flush
those lists now").

Thanks,
NeilBrown

^ permalink raw reply	[flat|nested] 61+ messages in thread
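To make the RAID1 case concrete, the unplug-time release Neil describes looks
roughly like this (a simplified sketch in the style of raid1's pending-write
handling; the struct and field names follow md's conventions but this is not
the exact md code):

	static void flush_pending_writes(struct r1conf *conf)
	{
		struct bio *bio;

		/* take the whole stash atomically */
		spin_lock_irq(&conf->device_lock);
		bio = bio_list_get(&conf->pending_bio_list);
		spin_unlock_irq(&conf->device_lock);

		if (!bio)
			return;

		/* commit all accumulated write-intent bits in one update... */
		bitmap_unplug(conf->mddev->bitmap);

		/* ...then release the stashed data writes to the members */
		while (bio) {
			struct bio *next = bio->bi_next;

			bio->bi_next = NULL;
			generic_make_request(bio);
			bio = next;
		}
	}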
* Re: [PATCH 05/10] block: remove per-queue plugging 2011-04-11 11:26 ` NeilBrown @ 2011-04-11 11:37 ` Jens Axboe 2011-04-11 12:05 ` NeilBrown 0 siblings, 1 reply; 61+ messages in thread From: Jens Axboe @ 2011-04-11 11:37 UTC (permalink / raw) To: NeilBrown Cc: Mike Snitzer, linux-kernel@vger.kernel.org, hch@infradead.org, dm-devel@redhat.com, linux-raid@vger.kernel.org On 2011-04-11 13:26, NeilBrown wrote: > On Mon, 11 Apr 2011 13:04:26 +0200 Jens Axboe <jaxboe@fusionio.com> wrote: > >>> >>> I'm sure one of us is missing something (probably both) but I'm not >>> sure what. >>> >>> The callback is central. >>> >>> It is simply to use plugging in md. >>> Just like blk-core, md will notice that a blk_plug is active and will put >>> requests aside. I then need something to call in to md when blk_finish_plug >> >> But this is done in __make_request(), so md devices should not be >> affected at all. This is the part of your explanation that I do not >> connect with the code. >> >> If md itself is putting things on the plug list, why is it doing that? > > Yes. Exactly. md itself want to put things aside on some list. > e.g. in RAID1 when using a write-intent bitmap I want to gather as many write > requests as possible so I can update the bits for all of them at once. > So when a plug is in effect I just queue the bios somewhere and record the > bits that need to be set. > Then when the unplug happens I write out the bitmap updates in a single write > and when that completes, I write out the data (to all devices). > > Also in RAID5 it is good if I can wait for lots of write request to arrive > before committing any of them to increase the possibility of getting a > full-stripe write. > > Previously I used ->unplug_fn to release the queued requests. Now that has > gone I need a different way to register a callback when an unplug happens. Ah, so this is what I was hinting at. But why use the task->plug for that? Seems a bit counter intuitive. Why can't you just store these internally? > >> >>> is called so that put-aside requests can be released. >>> As md can be built as a module, that call must be a call-back of some sort. >>> blk-core doesn't need to register blk_plug_flush because that is never in a >>> module, so it can be called directly. But the md equivalent could be in a >>> module, so I need to be able to register a call back. >>> >>> Does that help? >> >> Not really. Is the problem that _you_ would like to stash things aside, >> not the fact that __make_request() puts things on a task plug list? >> > > Yes, exactly. I (in md) want to stash things aside. > > (I don't actually put the stashed things on the blk_plug, though it might > make sense to do that later in some cases - I'm not sure. Currently I stash > things in my own internal lists and just need a call back to say "ok, flush > those lists now"). So we are making some progress... The thing I then don't understand is why you want to make it associated with the plug? Seems you don't have any scheduling restrictions, and in which case just storing them in md seems like a much better option. -- Jens Axboe ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [PATCH 05/10] block: remove per-queue plugging 2011-04-11 11:37 ` Jens Axboe @ 2011-04-11 12:05 ` NeilBrown 2011-04-11 12:11 ` Jens Axboe 0 siblings, 1 reply; 61+ messages in thread From: NeilBrown @ 2011-04-11 12:05 UTC (permalink / raw) To: Jens Axboe Cc: Mike Snitzer, linux-kernel@vger.kernel.org, hch@infradead.org, dm-devel@redhat.com, linux-raid@vger.kernel.org On Mon, 11 Apr 2011 13:37:20 +0200 Jens Axboe <jaxboe@fusionio.com> wrote: > On 2011-04-11 13:26, NeilBrown wrote: > > On Mon, 11 Apr 2011 13:04:26 +0200 Jens Axboe <jaxboe@fusionio.com> wrote: > > > >>> > >>> I'm sure one of us is missing something (probably both) but I'm not > >>> sure what. > >>> > >>> The callback is central. > >>> > >>> It is simply to use plugging in md. > >>> Just like blk-core, md will notice that a blk_plug is active and will put > >>> requests aside. I then need something to call in to md when blk_finish_plug > >> > >> But this is done in __make_request(), so md devices should not be > >> affected at all. This is the part of your explanation that I do not > >> connect with the code. > >> > >> If md itself is putting things on the plug list, why is it doing that? > > > > Yes. Exactly. md itself want to put things aside on some list. > > e.g. in RAID1 when using a write-intent bitmap I want to gather as many write > > requests as possible so I can update the bits for all of them at once. > > So when a plug is in effect I just queue the bios somewhere and record the > > bits that need to be set. > > Then when the unplug happens I write out the bitmap updates in a single write > > and when that completes, I write out the data (to all devices). > > > > Also in RAID5 it is good if I can wait for lots of write request to arrive > > before committing any of them to increase the possibility of getting a > > full-stripe write. > > > > Previously I used ->unplug_fn to release the queued requests. Now that has > > gone I need a different way to register a callback when an unplug happens. > > Ah, so this is what I was hinting at. But why use the task->plug for > that? Seems a bit counter intuitive. Why can't you just store these > internally? > > > > >> > >>> is called so that put-aside requests can be released. > >>> As md can be built as a module, that call must be a call-back of some sort. > >>> blk-core doesn't need to register blk_plug_flush because that is never in a > >>> module, so it can be called directly. But the md equivalent could be in a > >>> module, so I need to be able to register a call back. > >>> > >>> Does that help? > >> > >> Not really. Is the problem that _you_ would like to stash things aside, > >> not the fact that __make_request() puts things on a task plug list? > >> > > > > Yes, exactly. I (in md) want to stash things aside. > > > > (I don't actually put the stashed things on the blk_plug, though it might > > make sense to do that later in some cases - I'm not sure. Currently I stash > > things in my own internal lists and just need a call back to say "ok, flush > > those lists now"). > > So we are making some progress... The thing I then don't understand is > why you want to make it associated with the plug? Seems you don't have > any scheduling restrictions, and in which case just storing them in md > seems like a much better option. > Yes. But I need to know when to release the requests that I have stored. I need to know when ->write_pages or ->read_pages or whatever has finished submitting a pile of pages so that I can start processing the request that I have put aside. 
So I need a callback from blk_finish_plug. (and I also need to know if a thread that was plugging schedules for the same reason that you do). NeilBrown ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [PATCH 05/10] block: remove per-queue plugging 2011-04-11 12:05 ` NeilBrown @ 2011-04-11 12:11 ` Jens Axboe 2011-04-11 12:36 ` NeilBrown ` (2 more replies) 0 siblings, 3 replies; 61+ messages in thread From: Jens Axboe @ 2011-04-11 12:11 UTC (permalink / raw) To: NeilBrown Cc: Mike Snitzer, linux-kernel@vger.kernel.org, hch@infradead.org, dm-devel@redhat.com, linux-raid@vger.kernel.org On 2011-04-11 14:05, NeilBrown wrote: > On Mon, 11 Apr 2011 13:37:20 +0200 Jens Axboe <jaxboe@fusionio.com> wrote: > >> On 2011-04-11 13:26, NeilBrown wrote: >>> On Mon, 11 Apr 2011 13:04:26 +0200 Jens Axboe <jaxboe@fusionio.com> wrote: >>> >>>>> >>>>> I'm sure one of us is missing something (probably both) but I'm not >>>>> sure what. >>>>> >>>>> The callback is central. >>>>> >>>>> It is simply to use plugging in md. >>>>> Just like blk-core, md will notice that a blk_plug is active and will put >>>>> requests aside. I then need something to call in to md when blk_finish_plug >>>> >>>> But this is done in __make_request(), so md devices should not be >>>> affected at all. This is the part of your explanation that I do not >>>> connect with the code. >>>> >>>> If md itself is putting things on the plug list, why is it doing that? >>> >>> Yes. Exactly. md itself want to put things aside on some list. >>> e.g. in RAID1 when using a write-intent bitmap I want to gather as many write >>> requests as possible so I can update the bits for all of them at once. >>> So when a plug is in effect I just queue the bios somewhere and record the >>> bits that need to be set. >>> Then when the unplug happens I write out the bitmap updates in a single write >>> and when that completes, I write out the data (to all devices). >>> >>> Also in RAID5 it is good if I can wait for lots of write request to arrive >>> before committing any of them to increase the possibility of getting a >>> full-stripe write. >>> >>> Previously I used ->unplug_fn to release the queued requests. Now that has >>> gone I need a different way to register a callback when an unplug happens. >> >> Ah, so this is what I was hinting at. But why use the task->plug for >> that? Seems a bit counter intuitive. Why can't you just store these >> internally? >> >>> >>>> >>>>> is called so that put-aside requests can be released. >>>>> As md can be built as a module, that call must be a call-back of some sort. >>>>> blk-core doesn't need to register blk_plug_flush because that is never in a >>>>> module, so it can be called directly. But the md equivalent could be in a >>>>> module, so I need to be able to register a call back. >>>>> >>>>> Does that help? >>>> >>>> Not really. Is the problem that _you_ would like to stash things aside, >>>> not the fact that __make_request() puts things on a task plug list? >>>> >>> >>> Yes, exactly. I (in md) want to stash things aside. >>> >>> (I don't actually put the stashed things on the blk_plug, though it might >>> make sense to do that later in some cases - I'm not sure. Currently I stash >>> things in my own internal lists and just need a call back to say "ok, flush >>> those lists now"). >> >> So we are making some progress... The thing I then don't understand is >> why you want to make it associated with the plug? Seems you don't have >> any scheduling restrictions, and in which case just storing them in md >> seems like a much better option. >> > > Yes. But I need to know when to release the requests that I have stored. 
> I need to know when ->write_pages or ->read_pages or whatever has finished > submitting a pile of pages so that I can start processing the request that I > have put aside. So I need a callback from blk_finish_plug. OK fair enough, I'll add your callback patch. -- Jens Axboe ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [PATCH 05/10] block: remove per-queue plugging 2011-04-11 12:11 ` Jens Axboe @ 2011-04-11 12:36 ` NeilBrown 2011-04-11 12:48 ` Jens Axboe 2011-04-15 4:26 ` hch 2011-04-17 22:19 ` NeilBrown 2 siblings, 1 reply; 61+ messages in thread From: NeilBrown @ 2011-04-11 12:36 UTC (permalink / raw) To: Jens Axboe Cc: Mike Snitzer, linux-kernel@vger.kernel.org, hch@infradead.org, dm-devel@redhat.com, linux-raid@vger.kernel.org On Mon, 11 Apr 2011 14:11:58 +0200 Jens Axboe <jaxboe@fusionio.com> wrote: > > Yes. But I need to know when to release the requests that I have stored. > > I need to know when ->write_pages or ->read_pages or whatever has finished > > submitting a pile of pages so that I can start processing the request that I > > have put aside. So I need a callback from blk_finish_plug. > > OK fair enough, I'll add your callback patch. > Thanks. I'll queue up my md fixes to follow it once it gets to -linus. NeilBrown ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [PATCH 05/10] block: remove per-queue plugging 2011-04-11 12:36 ` NeilBrown @ 2011-04-11 12:48 ` Jens Axboe 2011-04-12 1:12 ` hch 0 siblings, 1 reply; 61+ messages in thread From: Jens Axboe @ 2011-04-11 12:48 UTC (permalink / raw) To: NeilBrown Cc: Mike Snitzer, linux-kernel@vger.kernel.org, hch@infradead.org, dm-devel@redhat.com, linux-raid@vger.kernel.org On 2011-04-11 14:36, NeilBrown wrote: > On Mon, 11 Apr 2011 14:11:58 +0200 Jens Axboe <jaxboe@fusionio.com> wrote: > >>> Yes. But I need to know when to release the requests that I have stored. >>> I need to know when ->write_pages or ->read_pages or whatever has finished >>> submitting a pile of pages so that I can start processing the request that I >>> have put aside. So I need a callback from blk_finish_plug. >> >> OK fair enough, I'll add your callback patch. >> > > Thanks. I'll queue up my md fixes to follow it once it gets to -linus. Great, once you do that and XFS kills the blk_flush_plug() calls too, then we can remove that export and make it internal only. -- Jens Axboe ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [PATCH 05/10] block: remove per-queue plugging
  2011-04-11 12:48 ` Jens Axboe
@ 2011-04-12 1:12 ` hch
  2011-04-12 8:36 ` Jens Axboe
  0 siblings, 1 reply; 61+ messages in thread
From: hch @ 2011-04-12 1:12 UTC (permalink / raw)
  To: Jens Axboe
  Cc: NeilBrown, Mike Snitzer, linux-kernel@vger.kernel.org,
      hch@infradead.org, dm-devel@redhat.com, linux-raid@vger.kernel.org

On Mon, Apr 11, 2011 at 02:48:45PM +0200, Jens Axboe wrote:
> Great, once you do that and XFS kills the blk_flush_plug() calls too,
> then we can remove that export and make it internal only.

Linus pulled the tree, so they are gone now. Btw, there's still some
bits in the area that confuse me:

 - what's the point of the queue_sync_plugs? It has a lot of comments
   that seem to pre-date the on-stack plugging, but except for that
   it's a trivial wrapper around blk_flush_plug, with an argument
   that is not used.
 - is there a good reason for the existence of __blk_flush_plug? You'd
   get one additional instruction in the inlined version of
   blk_flush_plug when opencoding, but avoid the need for chained
   function calls.
 - Why is having a plug in blk_flush_plug marked unlikely? Note that
   unlikely is the static branch prediction hint to mark the case
   extremely unlikely and is even used for hot/cold partitioning. But
   when we call it we usually check beforehand if we actually have
   plugs, so it's actually likely to happen.
 - what is the point of blk_finish_plug? All callers have
   the plug on stack, and there's no good reason for adding the NULL
   check. Note that blk_start_plug doesn't have the NULL check either.
 - Why does __blk_flush_plug call __blk_finish_plug which might clear
   tsk->plug, just to set it back after the call? When manually inlining
   __blk_finish_plug into __blk_flush_plug it looks like:

	void __blk_flush_plug(struct task_struct *tsk, struct blk_plug *plug)
	{
		flush_plug_list(plug);
		if (plug == tsk->plug)
			tsk->plug = NULL;
		tsk->plug = plug;
	}

   it would seem much smarter to just call flush_plug_list directly.
   In fact it seems like the tsk->plug is not necessary at all and
   all remaining __blk_flush_plug callers could be replaced with
   flush_plug_list.
 - and of course the remaining issue of why io_schedule needs an
   explicit blk_flush_plug when schedule() already does one in
   case it actually needs to schedule.

^ permalink raw reply	[flat|nested] 61+ messages in thread
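The wrapper being questioned is, roughly, the following inline (a
reconstruction of the include/linux/blkdev.h helper of this period, with a
comment added to mark the hint hch is asking about):

	static inline void blk_flush_plug(struct task_struct *tsk)
	{
		struct blk_plug *plug = tsk->plug;

		/*
		 * The unlikely() at issue: callers such as schedule()
		 * typically check blk_needs_flush_plug() first, so when
		 * we get here a plug is in fact likely to be present.
		 */
		if (unlikely(plug))
			__blk_flush_plug(tsk, plug);
	}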
* Re: [PATCH 05/10] block: remove per-queue plugging
  2011-04-12 1:12 ` hch
@ 2011-04-12 8:36 ` Jens Axboe
  2011-04-12 12:22 ` Dave Chinner
  2011-04-12 16:50 ` hch
  0 siblings, 2 replies; 61+ messages in thread
From: Jens Axboe @ 2011-04-12 8:36 UTC (permalink / raw)
  To: hch@infradead.org
  Cc: NeilBrown, Mike Snitzer, linux-kernel@vger.kernel.org,
      dm-devel@redhat.com, linux-raid@vger.kernel.org

On 2011-04-12 03:12, hch@infradead.org wrote:
> On Mon, Apr 11, 2011 at 02:48:45PM +0200, Jens Axboe wrote:
>> Great, once you do that and XFS kills the blk_flush_plug() calls too,
>> then we can remove that export and make it internal only.
>
> Linus pulled the tree, so they are gone now. Btw, there's still some
> bits in the area that confuse me:

Great!

> - what's the point of the queue_sync_plugs? It has a lot of comment
>   that seem to pre-data the onstack plugging, but except for that
>   it's trivial wrapper around blk_flush_plug, with an argument
>   that is not used.

There's really no point to it anymore. Its existence was due to the
older revision that had to track write requests for serializing around
a barrier. I'll kill it, since we don't do that anymore.

> - is there a good reason for the existance of __blk_flush_plug? You'd
>   get one additional instruction in the inlined version of
>   blk_flush_plug when opencoding, but avoid the need for chained
>   function calls.
> - Why is having a plug in blk_flush_plug marked unlikely? Note that
>   unlikely is the static branch prediction hint to mark the case
>   extremly unlikely and is even used for hot/cold partitioning. But
>   when we call it we usually check beforehand if we actually have
>   plugs, so it's actually likely to happen.

The existence and out-of-line is for the scheduler() hook. It should be
an unlikely event to schedule with a plug held, normally the plug should
have been explicitly unplugged before that happens.

> - what is the point of blk_finish_plug? All callers have
>   the plug on stack, and there's no good reason for adding the NULL
>   check. Note that blk_start_plug doesn't have the NULL check either.

That one can probably go, I need to double check that part since some
things changed.

> - Why does __blk_flush_plug call __blk_finish_plug which might clear
>   tsk->plug, just to set it back after the call? When manually inlining
>   __blk_finish_plug ino __blk_flush_plug it looks like:
>
> 	void __blk_flush_plug(struct task_struct *tsk, struct blk_plug *plug)
> 	{
> 		flush_plug_list(plug);
> 		if (plug == tsk->plug)
> 			tsk->plug = NULL;
> 		tsk->plug = plug;
> 	}
>
>   it would seem much smarted to just call flush_plug_list directly.
>   In fact it seems like the tsk->plug is not nessecary at all and
>   all remaining __blk_flush_plug callers could be replaced with
>   flush_plug_list.

It depends on whether this was an explicit unplug (eg blk_finish_plug()),
or whether it was an implicit event (eg on schedule()). If we do it on
schedule(), then we retain the plug after the flush. Otherwise we clear
it.

> - and of course the remaining issue of why io_schedule needs an
>   expliciy blk_flush_plug when schedule() already does one in
>   case it actually needs to schedule.

Already answered in the other email.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 61+ messages in thread
* Re: [PATCH 05/10] block: remove per-queue plugging 2011-04-12 8:36 ` Jens Axboe @ 2011-04-12 12:22 ` Dave Chinner 2011-04-12 12:28 ` Jens Axboe 2011-04-12 16:50 ` hch 1 sibling, 1 reply; 61+ messages in thread From: Dave Chinner @ 2011-04-12 12:22 UTC (permalink / raw) To: Jens Axboe Cc: hch@infradead.org, NeilBrown, Mike Snitzer, linux-kernel@vger.kernel.org, dm-devel@redhat.com, linux-raid@vger.kernel.org On Tue, Apr 12, 2011 at 10:36:30AM +0200, Jens Axboe wrote: > On 2011-04-12 03:12, hch@infradead.org wrote: > > On Mon, Apr 11, 2011 at 02:48:45PM +0200, Jens Axboe wrote: > >> Great, once you do that and XFS kills the blk_flush_plug() calls too, > >> then we can remove that export and make it internal only. > > > > Linus pulled the tree, so they are gone now. Btw, there's still some > > bits in the area that confuse me: > > Great! > > > - what's the point of the queue_sync_plugs? It has a lot of comment > > that seem to pre-data the onstack plugging, but except for that > > it's trivial wrapper around blk_flush_plug, with an argument > > that is not used. > > There's really no point to it anymore. It's existance was due to the > older revision that had to track write requests for serializaing around > a barrier. I'll kill it, since we don't do that anymore. > > > - is there a good reason for the existance of __blk_flush_plug? You'd > > get one additional instruction in the inlined version of > > blk_flush_plug when opencoding, but avoid the need for chained > > function calls. > > - Why is having a plug in blk_flush_plug marked unlikely? Note that > > unlikely is the static branch prediction hint to mark the case > > extremly unlikely and is even used for hot/cold partitioning. But > > when we call it we usually check beforehand if we actually have > > plugs, so it's actually likely to happen. > > The existance and out-of-line is for the scheduler() hook. It should be > an unlikely event to schedule with a plug held, normally the plug should > have been explicitly unplugged before that happens. Though if it does, haven't you just added a significant amount of depth to the worst case stack usage? I'm seeing this sort of thing from io_schedule(): Depth Size Location (40 entries) ----- ---- -------- 0) 4256 16 mempool_alloc_slab+0x15/0x20 1) 4240 144 mempool_alloc+0x63/0x160 2) 4096 16 scsi_sg_alloc+0x4c/0x60 3) 4080 112 __sg_alloc_table+0x66/0x140 4) 3968 32 scsi_init_sgtable+0x33/0x90 5) 3936 48 scsi_init_io+0x31/0xc0 6) 3888 32 scsi_setup_fs_cmnd+0x79/0xe0 7) 3856 112 sd_prep_fn+0x150/0xa90 8) 3744 48 blk_peek_request+0x6a/0x1f0 9) 3696 96 scsi_request_fn+0x60/0x510 10) 3600 32 __blk_run_queue+0x57/0x100 11) 3568 80 flush_plug_list+0x133/0x1d0 12) 3488 32 __blk_flush_plug+0x24/0x50 13) 3456 32 io_schedule+0x79/0x80 (This is from a page fault on ext3 that is doing page cache readahead and blocking on a locked buffer.) I've seen traces where mempool_alloc_slab enters direct reclaim which adds another 1.5k of stack usage to this path. So I'm extremely concerned that you've just reduced the stack available to every thread by at least 2.5k of space... Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [PATCH 05/10] block: remove per-queue plugging 2011-04-12 12:22 ` Dave Chinner @ 2011-04-12 12:28 ` Jens Axboe 2011-04-12 12:41 ` Dave Chinner 2011-04-12 13:40 ` Dave Chinner 0 siblings, 2 replies; 61+ messages in thread From: Jens Axboe @ 2011-04-12 12:28 UTC (permalink / raw) To: Dave Chinner Cc: hch@infradead.org, NeilBrown, Mike Snitzer, linux-kernel@vger.kernel.org, dm-devel@redhat.com, linux-raid@vger.kernel.org On 2011-04-12 14:22, Dave Chinner wrote: > On Tue, Apr 12, 2011 at 10:36:30AM +0200, Jens Axboe wrote: >> On 2011-04-12 03:12, hch@infradead.org wrote: >>> On Mon, Apr 11, 2011 at 02:48:45PM +0200, Jens Axboe wrote: >>>> Great, once you do that and XFS kills the blk_flush_plug() calls too, >>>> then we can remove that export and make it internal only. >>> >>> Linus pulled the tree, so they are gone now. Btw, there's still some >>> bits in the area that confuse me: >> >> Great! >> >>> - what's the point of the queue_sync_plugs? It has a lot of comment >>> that seem to pre-data the onstack plugging, but except for that >>> it's trivial wrapper around blk_flush_plug, with an argument >>> that is not used. >> >> There's really no point to it anymore. It's existance was due to the >> older revision that had to track write requests for serializaing around >> a barrier. I'll kill it, since we don't do that anymore. >> >>> - is there a good reason for the existance of __blk_flush_plug? You'd >>> get one additional instruction in the inlined version of >>> blk_flush_plug when opencoding, but avoid the need for chained >>> function calls. >>> - Why is having a plug in blk_flush_plug marked unlikely? Note that >>> unlikely is the static branch prediction hint to mark the case >>> extremly unlikely and is even used for hot/cold partitioning. But >>> when we call it we usually check beforehand if we actually have >>> plugs, so it's actually likely to happen. >> >> The existance and out-of-line is for the scheduler() hook. It should be >> an unlikely event to schedule with a plug held, normally the plug should >> have been explicitly unplugged before that happens. > > Though if it does, haven't you just added a significant amount of > depth to the worst case stack usage? I'm seeing this sort of thing > from io_schedule(): > > Depth Size Location (40 entries) > ----- ---- -------- > 0) 4256 16 mempool_alloc_slab+0x15/0x20 > 1) 4240 144 mempool_alloc+0x63/0x160 > 2) 4096 16 scsi_sg_alloc+0x4c/0x60 > 3) 4080 112 __sg_alloc_table+0x66/0x140 > 4) 3968 32 scsi_init_sgtable+0x33/0x90 > 5) 3936 48 scsi_init_io+0x31/0xc0 > 6) 3888 32 scsi_setup_fs_cmnd+0x79/0xe0 > 7) 3856 112 sd_prep_fn+0x150/0xa90 > 8) 3744 48 blk_peek_request+0x6a/0x1f0 > 9) 3696 96 scsi_request_fn+0x60/0x510 > 10) 3600 32 __blk_run_queue+0x57/0x100 > 11) 3568 80 flush_plug_list+0x133/0x1d0 > 12) 3488 32 __blk_flush_plug+0x24/0x50 > 13) 3456 32 io_schedule+0x79/0x80 > > (This is from a page fault on ext3 that is doing page cache > readahead and blocking on a locked buffer.) > > I've seen traces where mempool_alloc_slab enters direct reclaim > which adds another 1.5k of stack usage to this path. So I'm > extremely concerned that you've just reduced the stack available to > every thread by at least 2.5k of space... Yeah, that does not look great. If this turns out to be problematic, we can turn the queue runs from the unlikely case into out-of-line from kblockd. But this really isn't that new, you could enter the IO dispatch path when doing IO already (when submitting it). So we better be able to handle that. 
If it's a problem from the schedule()/io_schedule() path, then let's
ensure that those are truly unlikely events so we can punt them to
kblockd.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 61+ messages in thread
* Re: [PATCH 05/10] block: remove per-queue plugging 2011-04-12 12:28 ` Jens Axboe @ 2011-04-12 12:41 ` Dave Chinner 2011-04-12 12:58 ` Jens Axboe 2011-04-12 13:40 ` Dave Chinner 1 sibling, 1 reply; 61+ messages in thread From: Dave Chinner @ 2011-04-12 12:41 UTC (permalink / raw) To: Jens Axboe Cc: hch@infradead.org, NeilBrown, Mike Snitzer, linux-kernel@vger.kernel.org, dm-devel@redhat.com, linux-raid@vger.kernel.org On Tue, Apr 12, 2011 at 02:28:31PM +0200, Jens Axboe wrote: > On 2011-04-12 14:22, Dave Chinner wrote: > > On Tue, Apr 12, 2011 at 10:36:30AM +0200, Jens Axboe wrote: > >> On 2011-04-12 03:12, hch@infradead.org wrote: > >>> On Mon, Apr 11, 2011 at 02:48:45PM +0200, Jens Axboe wrote: > >>>> Great, once you do that and XFS kills the blk_flush_plug() calls too, > >>>> then we can remove that export and make it internal only. > >>> > >>> Linus pulled the tree, so they are gone now. Btw, there's still some > >>> bits in the area that confuse me: > >> > >> Great! > >> > >>> - what's the point of the queue_sync_plugs? It has a lot of comment > >>> that seem to pre-data the onstack plugging, but except for that > >>> it's trivial wrapper around blk_flush_plug, with an argument > >>> that is not used. > >> > >> There's really no point to it anymore. It's existance was due to the > >> older revision that had to track write requests for serializaing around > >> a barrier. I'll kill it, since we don't do that anymore. > >> > >>> - is there a good reason for the existance of __blk_flush_plug? You'd > >>> get one additional instruction in the inlined version of > >>> blk_flush_plug when opencoding, but avoid the need for chained > >>> function calls. > >>> - Why is having a plug in blk_flush_plug marked unlikely? Note that > >>> unlikely is the static branch prediction hint to mark the case > >>> extremly unlikely and is even used for hot/cold partitioning. But > >>> when we call it we usually check beforehand if we actually have > >>> plugs, so it's actually likely to happen. > >> > >> The existance and out-of-line is for the scheduler() hook. It should be > >> an unlikely event to schedule with a plug held, normally the plug should > >> have been explicitly unplugged before that happens. > > > > Though if it does, haven't you just added a significant amount of > > depth to the worst case stack usage? I'm seeing this sort of thing > > from io_schedule(): > > > > Depth Size Location (40 entries) > > ----- ---- -------- > > 0) 4256 16 mempool_alloc_slab+0x15/0x20 > > 1) 4240 144 mempool_alloc+0x63/0x160 > > 2) 4096 16 scsi_sg_alloc+0x4c/0x60 > > 3) 4080 112 __sg_alloc_table+0x66/0x140 > > 4) 3968 32 scsi_init_sgtable+0x33/0x90 > > 5) 3936 48 scsi_init_io+0x31/0xc0 > > 6) 3888 32 scsi_setup_fs_cmnd+0x79/0xe0 > > 7) 3856 112 sd_prep_fn+0x150/0xa90 > > 8) 3744 48 blk_peek_request+0x6a/0x1f0 > > 9) 3696 96 scsi_request_fn+0x60/0x510 > > 10) 3600 32 __blk_run_queue+0x57/0x100 > > 11) 3568 80 flush_plug_list+0x133/0x1d0 > > 12) 3488 32 __blk_flush_plug+0x24/0x50 > > 13) 3456 32 io_schedule+0x79/0x80 > > > > (This is from a page fault on ext3 that is doing page cache > > readahead and blocking on a locked buffer.) > > > > I've seen traces where mempool_alloc_slab enters direct reclaim > > which adds another 1.5k of stack usage to this path. So I'm > > extremely concerned that you've just reduced the stack available to > > every thread by at least 2.5k of space... > > Yeah, that does not look great. 
> If this turns out to be problematic, we
> can turn the queue runs from the unlikely case into out-of-line from
> kblockd.
>
> But this really isn't that new, you could enter the IO dispatch path
> when doing IO already (when submitting it). So we better be able to
> handle that.

The problem I see is that IO is submitted when there's plenty of
stack available, which would previously have been fine. However now it
hits the plug, and then later on after the thread consumes a lot
more stack it, say, waits for a completion. We then schedule, it
unplugs the queue and we add the IO stack to a place where there
isn't much space available.

So effectively we are moving the places where stack is consumed
about, and it's completely unpredictable where that stack is going to
land now.

> If it's a problem from the schedule()/io_schedule() path, then
> lets ensure that those are truly unlikely events so we can punt
> them to kblockd.

Rather than wait for an explosion to be reported before doing this,
why not just punt unplugs to kblockd unconditionally?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 61+ messages in thread
* Re: [PATCH 05/10] block: remove per-queue plugging 2011-04-12 12:41 ` Dave Chinner @ 2011-04-12 12:58 ` Jens Axboe 2011-04-12 13:31 ` Dave Chinner 2011-04-12 16:44 ` hch 0 siblings, 2 replies; 61+ messages in thread From: Jens Axboe @ 2011-04-12 12:58 UTC (permalink / raw) To: Dave Chinner Cc: hch@infradead.org, NeilBrown, Mike Snitzer, linux-kernel@vger.kernel.org, dm-devel@redhat.com, linux-raid@vger.kernel.org On 2011-04-12 14:41, Dave Chinner wrote: > On Tue, Apr 12, 2011 at 02:28:31PM +0200, Jens Axboe wrote: >> On 2011-04-12 14:22, Dave Chinner wrote: >>> On Tue, Apr 12, 2011 at 10:36:30AM +0200, Jens Axboe wrote: >>>> On 2011-04-12 03:12, hch@infradead.org wrote: >>>>> On Mon, Apr 11, 2011 at 02:48:45PM +0200, Jens Axboe wrote: >>>>>> Great, once you do that and XFS kills the blk_flush_plug() calls too, >>>>>> then we can remove that export and make it internal only. >>>>> >>>>> Linus pulled the tree, so they are gone now. Btw, there's still some >>>>> bits in the area that confuse me: >>>> >>>> Great! >>>> >>>>> - what's the point of the queue_sync_plugs? It has a lot of comment >>>>> that seem to pre-data the onstack plugging, but except for that >>>>> it's trivial wrapper around blk_flush_plug, with an argument >>>>> that is not used. >>>> >>>> There's really no point to it anymore. It's existance was due to the >>>> older revision that had to track write requests for serializaing around >>>> a barrier. I'll kill it, since we don't do that anymore. >>>> >>>>> - is there a good reason for the existance of __blk_flush_plug? You'd >>>>> get one additional instruction in the inlined version of >>>>> blk_flush_plug when opencoding, but avoid the need for chained >>>>> function calls. >>>>> - Why is having a plug in blk_flush_plug marked unlikely? Note that >>>>> unlikely is the static branch prediction hint to mark the case >>>>> extremly unlikely and is even used for hot/cold partitioning. But >>>>> when we call it we usually check beforehand if we actually have >>>>> plugs, so it's actually likely to happen. >>>> >>>> The existance and out-of-line is for the scheduler() hook. It should be >>>> an unlikely event to schedule with a plug held, normally the plug should >>>> have been explicitly unplugged before that happens. >>> >>> Though if it does, haven't you just added a significant amount of >>> depth to the worst case stack usage? I'm seeing this sort of thing >>> from io_schedule(): >>> >>> Depth Size Location (40 entries) >>> ----- ---- -------- >>> 0) 4256 16 mempool_alloc_slab+0x15/0x20 >>> 1) 4240 144 mempool_alloc+0x63/0x160 >>> 2) 4096 16 scsi_sg_alloc+0x4c/0x60 >>> 3) 4080 112 __sg_alloc_table+0x66/0x140 >>> 4) 3968 32 scsi_init_sgtable+0x33/0x90 >>> 5) 3936 48 scsi_init_io+0x31/0xc0 >>> 6) 3888 32 scsi_setup_fs_cmnd+0x79/0xe0 >>> 7) 3856 112 sd_prep_fn+0x150/0xa90 >>> 8) 3744 48 blk_peek_request+0x6a/0x1f0 >>> 9) 3696 96 scsi_request_fn+0x60/0x510 >>> 10) 3600 32 __blk_run_queue+0x57/0x100 >>> 11) 3568 80 flush_plug_list+0x133/0x1d0 >>> 12) 3488 32 __blk_flush_plug+0x24/0x50 >>> 13) 3456 32 io_schedule+0x79/0x80 >>> >>> (This is from a page fault on ext3 that is doing page cache >>> readahead and blocking on a locked buffer.) >>> >>> I've seen traces where mempool_alloc_slab enters direct reclaim >>> which adds another 1.5k of stack usage to this path. So I'm >>> extremely concerned that you've just reduced the stack available to >>> every thread by at least 2.5k of space... >> >> Yeah, that does not look great. 
If this turns out to be problematic, we
>> can turn the queue runs from the unlikely case into out-of-line from
>> kblockd.
>>
>> But this really isn't that new, you could enter the IO dispatch path
>> when doing IO already (when submitting it). So we better be able to
>> handle that.
> 
> The problem I see is that IO is submitted when there's plenty of
> stack available and would have previously been fine. However now it
> hits the plug, and then later on after the thread consumes a lot
> more stack it, say, waits for a completion. We then schedule, it
> unplugs the queue and we add the IO stack to a place where there
> isn't much space available.
> 
> So effectively we are moving the places where stack is consumed
> about, and it's completely unpredictable where that stack is going
> to land now.

Isn't that example fairly contrived? If we ended up doing the IO
dispatch before, then the only difference now is the stack usage of
schedule() itself. Apart from that, as far as I can tell, there should
not be much difference.

>> If it's a problem from the schedule()/io_schedule() path, then
>> let's ensure that those are truly unlikely events so we can punt
>> them to kblockd.
> 
> Rather than wait for an explosion to be reported before doing this,
> why not just punt unplugs to kblockd unconditionally?

Supposedly it's faster to do it inline rather than punt the dispatch.
But that may actually not be true, if you have multiple plugs going
(and thus multiple contenders for the queue lock on dispatch). So
let's play it safe and punt to kblockd, we can always revisit this
later.

diff --git a/block/blk-core.c b/block/blk-core.c
index c6eaa1f..36b1a75 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2665,7 +2665,7 @@ static int plug_rq_cmp(void *priv, struct list_head *a, struct list_head *b)
 static void queue_unplugged(struct request_queue *q, unsigned int depth)
 {
 	trace_block_unplug_io(q, depth);
-	__blk_run_queue(q, false);
+	__blk_run_queue(q, true);
 
 	if (q->unplugged_fn)
 		q->unplugged_fn(q);

-- 
Jens Axboe

^ permalink raw reply related	[flat|nested] 61+ messages in thread
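The one-character change above is where the punting happens: with the
second argument true, the queue is not run on the caller's stack at all
but handed to the kblockd workqueue. A simplified sketch of the idea
(__blk_run_queue_sketch() is an illustrative name, not the verbatim
2.6.39 implementation):

static void __blk_run_queue_sketch(struct request_queue *q, bool force_kblockd)
{
	if (unlikely(blk_queue_stopped(q)))
		return;

	if (!force_kblockd) {
		/* inline: the driver's request_fn runs on the current stack */
		q->request_fn(q);
	} else {
		/* punt: kblockd runs the queue later, from a fresh stack */
		queue_delayed_work(kblockd_workqueue, &q->delay_work, 0);
	}
}

The trade-off being debated is exactly this branch: the inline path saves
a context switch, the kblockd path bounds stack usage at the cost of a
small dispatch latency.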
* Re: [PATCH 05/10] block: remove per-queue plugging 2011-04-12 12:58 ` Jens Axboe @ 2011-04-12 13:31 ` Dave Chinner 2011-04-12 13:45 ` Jens Axboe 2011-04-12 16:58 ` hch 2011-04-12 16:44 ` hch 1 sibling, 2 replies; 61+ messages in thread From: Dave Chinner @ 2011-04-12 13:31 UTC (permalink / raw) To: Jens Axboe Cc: hch@infradead.org, NeilBrown, Mike Snitzer, linux-kernel@vger.kernel.org, dm-devel@redhat.com, linux-raid@vger.kernel.org On Tue, Apr 12, 2011 at 02:58:46PM +0200, Jens Axboe wrote: > On 2011-04-12 14:41, Dave Chinner wrote: > > On Tue, Apr 12, 2011 at 02:28:31PM +0200, Jens Axboe wrote: > >> On 2011-04-12 14:22, Dave Chinner wrote: > >>> On Tue, Apr 12, 2011 at 10:36:30AM +0200, Jens Axboe wrote: > >>>> On 2011-04-12 03:12, hch@infradead.org wrote: > >>>>> On Mon, Apr 11, 2011 at 02:48:45PM +0200, Jens Axboe wrote: > >>>>>> Great, once you do that and XFS kills the blk_flush_plug() calls too, > >>>>>> then we can remove that export and make it internal only. > >>>>> > >>>>> Linus pulled the tree, so they are gone now. Btw, there's still some > >>>>> bits in the area that confuse me: > >>>> > >>>> Great! > >>>> > >>>>> - what's the point of the queue_sync_plugs? It has a lot of comment > >>>>> that seem to pre-data the onstack plugging, but except for that > >>>>> it's trivial wrapper around blk_flush_plug, with an argument > >>>>> that is not used. > >>>> > >>>> There's really no point to it anymore. It's existance was due to the > >>>> older revision that had to track write requests for serializaing around > >>>> a barrier. I'll kill it, since we don't do that anymore. > >>>> > >>>>> - is there a good reason for the existance of __blk_flush_plug? You'd > >>>>> get one additional instruction in the inlined version of > >>>>> blk_flush_plug when opencoding, but avoid the need for chained > >>>>> function calls. > >>>>> - Why is having a plug in blk_flush_plug marked unlikely? Note that > >>>>> unlikely is the static branch prediction hint to mark the case > >>>>> extremly unlikely and is even used for hot/cold partitioning. But > >>>>> when we call it we usually check beforehand if we actually have > >>>>> plugs, so it's actually likely to happen. > >>>> > >>>> The existance and out-of-line is for the scheduler() hook. It should be > >>>> an unlikely event to schedule with a plug held, normally the plug should > >>>> have been explicitly unplugged before that happens. > >>> > >>> Though if it does, haven't you just added a significant amount of > >>> depth to the worst case stack usage? I'm seeing this sort of thing > >>> from io_schedule(): > >>> > >>> Depth Size Location (40 entries) > >>> ----- ---- -------- > >>> 0) 4256 16 mempool_alloc_slab+0x15/0x20 > >>> 1) 4240 144 mempool_alloc+0x63/0x160 > >>> 2) 4096 16 scsi_sg_alloc+0x4c/0x60 > >>> 3) 4080 112 __sg_alloc_table+0x66/0x140 > >>> 4) 3968 32 scsi_init_sgtable+0x33/0x90 > >>> 5) 3936 48 scsi_init_io+0x31/0xc0 > >>> 6) 3888 32 scsi_setup_fs_cmnd+0x79/0xe0 > >>> 7) 3856 112 sd_prep_fn+0x150/0xa90 > >>> 8) 3744 48 blk_peek_request+0x6a/0x1f0 > >>> 9) 3696 96 scsi_request_fn+0x60/0x510 > >>> 10) 3600 32 __blk_run_queue+0x57/0x100 > >>> 11) 3568 80 flush_plug_list+0x133/0x1d0 > >>> 12) 3488 32 __blk_flush_plug+0x24/0x50 > >>> 13) 3456 32 io_schedule+0x79/0x80 > >>> > >>> (This is from a page fault on ext3 that is doing page cache > >>> readahead and blocking on a locked buffer.) > >>> > >>> I've seen traces where mempool_alloc_slab enters direct reclaim > >>> which adds another 1.5k of stack usage to this path. 
So I'm
> >>> extremely concerned that you've just reduced the stack available to
> >>> every thread by at least 2.5k of space...
> >>
> >> Yeah, that does not look great. If this turns out to be problematic, we
> >> can turn the queue runs from the unlikely case into out-of-line from
> >> kblockd.
> >>
> >> But this really isn't that new, you could enter the IO dispatch path
> >> when doing IO already (when submitting it). So we better be able to
> >> handle that.
> >
> > The problem I see is that IO is submitted when there's plenty of
> > stack available and would have previously been fine. However now it
> > hits the plug, and then later on after the thread consumes a lot
> > more stack it, say, waits for a completion. We then schedule, it
> > unplugs the queue and we add the IO stack to a place where there
> > isn't much space available.
> >
> > So effectively we are moving the places where stack is consumed
> > about, and it's completely unpredictable where that stack is going
> > to land now.
> 
> Isn't that example fairly contrived?

I don't think so. e.g. in the XFS allocation path we do btree block
readahead, then go do the real work. The real work can end up with a
deeper stack before blocking on locks or completions unrelated to
the readahead, leading to schedule() being called and an unplug
being issued at that point. You might think it contrived, but if
you can't provide a guarantee that it can't happen then I have to
assume it will happen.

My concern is that we're already under stack space stress in the
writeback path, so anything that has the potential to increase it
significantly is a major worry from my point of view...

> If we ended up doing the IO
> dispatch before, then the only difference now is the stack usage of
> schedule() itself. Apart from that, as far as I can tell, there should
> not be much difference.

There's a difference between IO submission and IO dispatch. IO
submission is submit_bio thru to the plug; IO dispatch is from the
plug down to the disk. If they happen at the same place, there's no
problem. If IO dispatch is moved to schedule() via a plug....

> >> If it's a problem from the schedule()/io_schedule() path, then
> >> let's ensure that those are truly unlikely events so we can punt
> >> them to kblockd.
> >
> > Rather than wait for an explosion to be reported before doing this,
> > why not just punt unplugs to kblockd unconditionally?
> 
> Supposedly it's faster to do it inline rather than punt the dispatch.
> But that may actually not be true, if you have multiple plugs going (and
> thus multiple contenders for the queue lock on dispatch). So let's play
> it safe and punt to kblockd, we can always revisit this later.

It's always best to play it safe when it comes to other people's
data....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 61+ messages in thread
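To pin the terminology down: with a plug held, submission only touches the
task-local plug list, and dispatch is everything from the flush through
the elevator into the driver. Roughly, simplified from the 2.6.39-era
__make_request() behaviour (submit_or_dispatch() is a made-up name, and
merge/sort handling is omitted):

static void submit_or_dispatch(struct request_queue *q, struct request *rq)
{
	struct blk_plug *plug = current->plug;

	if (plug) {
		/* IO submission: park the request on the task's plug list */
		list_add_tail(&rq->queuelist, &plug->list);
	} else {
		/* IO dispatch: elevator plus driver, all on this stack */
		spin_lock_irq(q->queue_lock);
		__elv_add_request(q, rq, ELEVATOR_INSERT_SORT);
		__blk_run_queue(q, false);
		spin_unlock_irq(q->queue_lock);
	}
}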
* Re: [PATCH 05/10] block: remove per-queue plugging 2011-04-12 13:31 ` Dave Chinner @ 2011-04-12 13:45 ` Jens Axboe 2011-04-12 14:34 ` Dave Chinner 2011-04-12 16:58 ` hch 1 sibling, 1 reply; 61+ messages in thread From: Jens Axboe @ 2011-04-12 13:45 UTC (permalink / raw) To: Dave Chinner Cc: hch@infradead.org, NeilBrown, Mike Snitzer, linux-kernel@vger.kernel.org, dm-devel@redhat.com, linux-raid@vger.kernel.org On 2011-04-12 15:31, Dave Chinner wrote: > On Tue, Apr 12, 2011 at 02:58:46PM +0200, Jens Axboe wrote: >> On 2011-04-12 14:41, Dave Chinner wrote: >>> On Tue, Apr 12, 2011 at 02:28:31PM +0200, Jens Axboe wrote: >>>> On 2011-04-12 14:22, Dave Chinner wrote: >>>>> On Tue, Apr 12, 2011 at 10:36:30AM +0200, Jens Axboe wrote: >>>>>> On 2011-04-12 03:12, hch@infradead.org wrote: >>>>>>> On Mon, Apr 11, 2011 at 02:48:45PM +0200, Jens Axboe wrote: >>>>>>>> Great, once you do that and XFS kills the blk_flush_plug() calls too, >>>>>>>> then we can remove that export and make it internal only. >>>>>>> >>>>>>> Linus pulled the tree, so they are gone now. Btw, there's still some >>>>>>> bits in the area that confuse me: >>>>>> >>>>>> Great! >>>>>> >>>>>>> - what's the point of the queue_sync_plugs? It has a lot of comment >>>>>>> that seem to pre-data the onstack plugging, but except for that >>>>>>> it's trivial wrapper around blk_flush_plug, with an argument >>>>>>> that is not used. >>>>>> >>>>>> There's really no point to it anymore. It's existance was due to the >>>>>> older revision that had to track write requests for serializaing around >>>>>> a barrier. I'll kill it, since we don't do that anymore. >>>>>> >>>>>>> - is there a good reason for the existance of __blk_flush_plug? You'd >>>>>>> get one additional instruction in the inlined version of >>>>>>> blk_flush_plug when opencoding, but avoid the need for chained >>>>>>> function calls. >>>>>>> - Why is having a plug in blk_flush_plug marked unlikely? Note that >>>>>>> unlikely is the static branch prediction hint to mark the case >>>>>>> extremly unlikely and is even used for hot/cold partitioning. But >>>>>>> when we call it we usually check beforehand if we actually have >>>>>>> plugs, so it's actually likely to happen. >>>>>> >>>>>> The existance and out-of-line is for the scheduler() hook. It should be >>>>>> an unlikely event to schedule with a plug held, normally the plug should >>>>>> have been explicitly unplugged before that happens. >>>>> >>>>> Though if it does, haven't you just added a significant amount of >>>>> depth to the worst case stack usage? I'm seeing this sort of thing >>>>> from io_schedule(): >>>>> >>>>> Depth Size Location (40 entries) >>>>> ----- ---- -------- >>>>> 0) 4256 16 mempool_alloc_slab+0x15/0x20 >>>>> 1) 4240 144 mempool_alloc+0x63/0x160 >>>>> 2) 4096 16 scsi_sg_alloc+0x4c/0x60 >>>>> 3) 4080 112 __sg_alloc_table+0x66/0x140 >>>>> 4) 3968 32 scsi_init_sgtable+0x33/0x90 >>>>> 5) 3936 48 scsi_init_io+0x31/0xc0 >>>>> 6) 3888 32 scsi_setup_fs_cmnd+0x79/0xe0 >>>>> 7) 3856 112 sd_prep_fn+0x150/0xa90 >>>>> 8) 3744 48 blk_peek_request+0x6a/0x1f0 >>>>> 9) 3696 96 scsi_request_fn+0x60/0x510 >>>>> 10) 3600 32 __blk_run_queue+0x57/0x100 >>>>> 11) 3568 80 flush_plug_list+0x133/0x1d0 >>>>> 12) 3488 32 __blk_flush_plug+0x24/0x50 >>>>> 13) 3456 32 io_schedule+0x79/0x80 >>>>> >>>>> (This is from a page fault on ext3 that is doing page cache >>>>> readahead and blocking on a locked buffer.) 
>>>>>
>>>>> I've seen traces where mempool_alloc_slab enters direct reclaim
>>>>> which adds another 1.5k of stack usage to this path. So I'm
>>>>> extremely concerned that you've just reduced the stack available to
>>>>> every thread by at least 2.5k of space...
>>>>
>>>> Yeah, that does not look great. If this turns out to be problematic, we
>>>> can turn the queue runs from the unlikely case into out-of-line from
>>>> kblockd.
>>>>
>>>> But this really isn't that new, you could enter the IO dispatch path
>>>> when doing IO already (when submitting it). So we better be able to
>>>> handle that.
>>>
>>> The problem I see is that IO is submitted when there's plenty of
>>> stack available and would have previously been fine. However now it
>>> hits the plug, and then later on after the thread consumes a lot
>>> more stack it, say, waits for a completion. We then schedule, it
>>> unplugs the queue and we add the IO stack to a place where there
>>> isn't much space available.
>>>
>>> So effectively we are moving the places where stack is consumed
>>> about, and it's completely unpredictable where that stack is going
>>> to land now.
>>
>> Isn't that example fairly contrived?
>
> I don't think so. e.g. in the XFS allocation path we do btree block
> readahead, then go do the real work. The real work can end up with a
> deeper stack before blocking on locks or completions unrelated to
> the readahead, leading to schedule() being called and an unplug
> being issued at that point. You might think it contrived, but if
> you can't provide a guarantee that it can't happen then I have to
> assume it will happen.

If you ended up in lock_page() somewhere along the way, the path would
have been pretty much the same as it is now:

lock_page()
__lock_page()
__wait_on_bit_lock()
sync_page()
aops->sync_page();
block_sync_page()
__blk_run_backing_dev()

and the dispatch follows after that. If your schedules are only due to,
say, blocking on a mutex, then yes it'll be different. But is that
really the case? I bet that worst case stack usage is exactly the same
as before, and that's the only metric we really care about.

> My concern is that we're already under stack space stress in the
> writeback path, so anything that has the potential to increase it
> significantly is a major worry from my point of view...

I agree on writeback being a worry, and that's why I made the change
(since it makes sense for other reasons, too). I just don't think we
are worse off than before.

>> If we ended up doing the IO
>> dispatch before, then the only difference now is the stack usage of
>> schedule() itself. Apart from that, as far as I can tell, there should
>> not be much difference.
>
> There's a difference between IO submission and IO dispatch. IO
> submission is submit_bio thru to the plug; IO dispatch is from the
> plug down to the disk. If they happen at the same place, there's no
> problem. If IO dispatch is moved to schedule() via a plug....

The IO submission can easily and non-deterministically turn into an IO
dispatch, so there's no real difference for the submitter. That was the
case before. With the explicit plug now, you _know_ that the IO
submission is only that and doesn't include IO dispatch. Not until you
schedule() or call blk_finish_plug(), both of which are events that you
can control.

>>>> If it's a problem from the schedule()/io_schedule() path, then
>>>> let's ensure that those are truly unlikely events so we can punt
>>>> them to kblockd.
>>>
>>> Rather than wait for an explosion to be reported before doing this,
>>> why not just punt unplugs to kblockd unconditionally?
>>
>> Supposedly it's faster to do it inline rather than punt the dispatch.
>> But that may actually not be true, if you have multiple plugs going (and
>> thus multiple contenders for the queue lock on dispatch). So let's play
>> it safe and punt to kblockd, we can always revisit this later.
>
> It's always best to play it safe when it comes to other people's
> data....

Certainly, but so far I see no real evidence that this is in fact any
safer.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 61+ messages in thread
* Re: [PATCH 05/10] block: remove per-queue plugging
  2011-04-12 13:45             ` Jens Axboe
@ 2011-04-12 14:34               ` Dave Chinner
  2011-04-12 21:08                 ` NeilBrown
  0 siblings, 1 reply; 61+ messages in thread
From: Dave Chinner @ 2011-04-12 14:34 UTC (permalink / raw)
  To: Jens Axboe
  Cc: hch@infradead.org, NeilBrown, Mike Snitzer,
	linux-kernel@vger.kernel.org, dm-devel@redhat.com,
	linux-raid@vger.kernel.org

On Tue, Apr 12, 2011 at 03:45:52PM +0200, Jens Axboe wrote:
> On 2011-04-12 15:31, Dave Chinner wrote:
> > On Tue, Apr 12, 2011 at 02:58:46PM +0200, Jens Axboe wrote:
> >> On 2011-04-12 14:41, Dave Chinner wrote:
> >> Isn't that example fairly contrived?
> >
> > I don't think so. e.g. in the XFS allocation path we do btree block
> > readahead, then go do the real work. The real work can end up with a
> > deeper stack before blocking on locks or completions unrelated to
> > the readahead, leading to schedule() being called and an unplug
> > being issued at that point. You might think it contrived, but if
> > you can't provide a guarantee that it can't happen then I have to
> > assume it will happen.
>
> If you ended up in lock_page() somewhere along the way, the path would
> have been pretty much the same as it is now:
>
> lock_page()
> __lock_page()
> __wait_on_bit_lock()
> sync_page()
> aops->sync_page();
> block_sync_page()
> __blk_run_backing_dev()
>
> and the dispatch follows after that. If your schedules are only due to,
> say, blocking on a mutex, then yes it'll be different. But is that
> really the case?

XFS metadata IO does not use the page cache anymore, so won't take
that path - no page locks are taken during read or write. Even before
that change contending on page locks was extremely rare as XFS uses
the buffer container for synchronisation.

AFAICT, we have nothing that will cause plugs to be flushed until
scheduling occurs. In many cases it will be at the same points as
before (the explicit flushes XFS had), but there are going to be new
ones.... Like this:

 0)     5360      40   zone_statistics+0xad/0xc0
 1)     5320     288   get_page_from_freelist+0x2cf/0x840
 2)     5032     304   __alloc_pages_nodemask+0x121/0x930
 3)     4728      48   kmem_getpages+0x62/0x160
 4)     4680      96   cache_grow+0x308/0x330
 5)     4584      80   cache_alloc_refill+0x21c/0x260
 6)     4504      16   __kmalloc+0x230/0x240
 7)     4488     176   virtqueue_add_buf_gfp+0x1f9/0x3e0
 8)     4312     144   do_virtblk_request+0x1f3/0x400
 9)     4168      32   __blk_run_queue+0x57/0x100
10)     4136      80   flush_plug_list+0x133/0x1d0
11)     4056      32   __blk_flush_plug+0x24/0x50
12)     4024     160   schedule+0x867/0x9f0
13)     3864     208   schedule_timeout+0x1f5/0x2c0
14)     3656     144   wait_for_common+0xe7/0x190
15)     3512      16   wait_for_completion+0x1d/0x20
16)     3496      48   xfs_buf_iowait+0x36/0xb0
17)     3448      32   _xfs_buf_read+0x98/0xa0
18)     3416      48   xfs_buf_read+0xa2/0x100
19)     3368      80   xfs_trans_read_buf+0x1db/0x680
......

This path adds roughly 500 bytes to the previous case of immediate
dispatch of the IO down through _xfs_buf_read()...

> I bet that worst case stack usage is exactly the same as before, and
> that's the only metric we really care about.

I've already demonstrated much worse stack usage with ext3 through
the page fault path via io_schedule(). io_schedule() never used to
dispatch IO and now it does. Similarly there are changes and
increases in XFS stack usage like above. IMO, worst case stack usage
is definitely increased by these changes.

> > My concern is that we're already under stack space stress in the
> > writeback path, so anything that has the potential to increase it
> > significantly is a major worry from my point of view...
> > I agree on writeback being a worry, and that's why I made the change > (since it makes sense for other reasons, too). I just don't think we are > worse of than before. We certainly are. Hmmm, I just noticed a new cumulative stack usage path through direct reclaim - via congestion_wait() -> io_schedule().... > >> If we ended up doing the IO > >> dispatch before, then the only difference now is the stack usage of > >> schedule() itself. Apart from that, as far as I can tell, there should > >> not be much difference. > > > > There's a difference between IO submission and IO dispatch. IO > > submission is submit_bio thru to the plug; IO dispatch is from the > > plug down to the disk. If they happen at the same place, there's no > > problem. If IO dispatch is moved to schedule() via a plug.... > > The IO submission can easily and non-deterministically turn into an IO > dispatch, so there's no real difference for the submitter. That was the > case before. With the explicit plug now, you _know_ that the IO > submission is only that and doesn't include IO dispatch. You're violently agreeing with me that you've changed where the IO dispatch path is run from. ;) > Not until you > schedule() or call blk_finish_plug(), both of which are events that you > can control. Well, not really - now taking any sleeping lock or waiting on anything can trigger a plug flush where previously you had to explicitly issue them. I'm not saying what we had is better, just that there are implicit flushes with your changes that are inherently uncontrollable... Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [PATCH 05/10] block: remove per-queue plugging 2011-04-12 14:34 ` Dave Chinner @ 2011-04-12 21:08 ` NeilBrown 2011-04-13 2:23 ` Linus Torvalds 0 siblings, 1 reply; 61+ messages in thread From: NeilBrown @ 2011-04-12 21:08 UTC (permalink / raw) To: Dave Chinner Cc: Jens Axboe, hch@infradead.org, Mike Snitzer, linux-kernel@vger.kernel.org, dm-devel@redhat.com, linux-raid@vger.kernel.org On Wed, 13 Apr 2011 00:34:52 +1000 Dave Chinner <david@fromorbit.com> wrote: > On Tue, Apr 12, 2011 at 03:45:52PM +0200, Jens Axboe wrote: > Not until you > > schedule() or call blk_finish_plug(), both of which are events that you > > can control. > > Well, not really - now taking any sleeping lock or waiting on > anything can trigger a plug flush where previously you had to > explicitly issue them. I'm not saying what we had is better, just > that there are implicit flushes with your changes that are > inherently uncontrollable... It's not just sleeping locks - if preempt is enabled a schedule can happen at any time - at any depth. I've seen a spin_unlock do it. NeilBrown ^ permalink raw reply [flat|nested] 61+ messages in thread
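The chain Neil is referring to looks roughly like this on a
CONFIG_PREEMPT kernel (an abbreviated, illustrative call chain, not
exact source):

spin_unlock(&lock)
  preempt_enable()
    preempt_schedule()        <- if need_resched was set in the meantime
      schedule()              <- entered with PREEMPT_ACTIVE set
        blk_flush_plug()      <- before the fix below, plugged IO could be flushed here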
* Re: [PATCH 05/10] block: remove per-queue plugging 2011-04-12 21:08 ` NeilBrown @ 2011-04-13 2:23 ` Linus Torvalds 2011-04-13 11:12 ` Peter Zijlstra 0 siblings, 1 reply; 61+ messages in thread From: Linus Torvalds @ 2011-04-13 2:23 UTC (permalink / raw) To: NeilBrown Cc: Dave Chinner, Jens Axboe, hch@infradead.org, Mike Snitzer, linux-kernel@vger.kernel.org, dm-devel@redhat.com, linux-raid@vger.kernel.org [-- Attachment #1: Type: text/plain, Size: 1433 bytes --] On Tue, Apr 12, 2011 at 2:08 PM, NeilBrown <neilb@suse.de> wrote: > On Wed, 13 Apr 2011 00:34:52 +1000 Dave Chinner <david@fromorbit.com> wrote: >> >> Well, not really - now taking any sleeping lock or waiting on >> anything can trigger a plug flush where previously you had to >> explicitly issue them. I'm not saying what we had is better, just >> that there are implicit flushes with your changes that are >> inherently uncontrollable... > > It's not just sleeping locks - if preempt is enabled a schedule can happen at > any time - at any depth. I've seen a spin_unlock do it. Hmm. I don't think we should flush IO in the preemption path. That smells wrong on many levels, just one of them being the "any time, any depth". It also sounds really wrong from an IO pattern standpoint. The process is actually still running, and the IO flushing _already_ does the "only if it's going to sleep" test, but it actually does it _wrong_. The "current->state" check doesn't make sense for a preemption event, because it's not actually going to sleep there. So a patch like the attached (UNTESTED!) sounds like the right thing to do. Whether it makes any difference for any MD issues, who knows.. But considering that the unplugging already used to test for "prev->state != TASK_RUNNING", this is absolutely the right thing to do - that old test was just broken. Linus [-- Attachment #2: patch.diff --] [-- Type: text/x-patch, Size: 1000 bytes --] kernel/sched.c | 20 ++++++++++---------- 1 files changed, 10 insertions(+), 10 deletions(-) diff --git a/kernel/sched.c b/kernel/sched.c index 48013633d792..a187c3fe027b 100644 --- a/kernel/sched.c +++ b/kernel/sched.c @@ -4111,20 +4111,20 @@ need_resched: try_to_wake_up_local(to_wakeup); } deactivate_task(rq, prev, DEQUEUE_SLEEP); + + /* + * If we are going to sleep and we have plugged IO queued, make + * sure to submit it to avoid deadlocks. + */ + if (blk_needs_flush_plug(prev)) { + raw_spin_unlock(&rq->lock); + blk_flush_plug(prev); + raw_spin_lock(&rq->lock); + } } switch_count = &prev->nvcsw; } - /* - * If we are going to sleep and we have plugged IO queued, make - * sure to submit it to avoid deadlocks. - */ - if (prev->state != TASK_RUNNING && blk_needs_flush_plug(prev)) { - raw_spin_unlock(&rq->lock); - blk_flush_plug(prev); - raw_spin_lock(&rq->lock); - } - pre_schedule(rq, prev); if (unlikely(!rq->nr_running)) ^ permalink raw reply related [flat|nested] 61+ messages in thread
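The reason moving the hunk cures the preemption case is the guard around
that branch of schedule(): the deactivate path is skipped entirely when
PREEMPT_ACTIVE is set. In heavily abbreviated form (a sketch of the
surrounding 2.6.39-era logic, not the full function):

	if (prev->state && !(preempt_count() & PREEMPT_ACTIVE)) {
		/* only reached when the task is genuinely going to sleep */
		deactivate_task(rq, prev, DEQUEUE_SLEEP);

		/* new home of the flush: never runs on a preemption event */
		if (blk_needs_flush_plug(prev)) {
			raw_spin_unlock(&rq->lock);
			blk_flush_plug(prev);
			raw_spin_lock(&rq->lock);
		}
	}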
* Re: [PATCH 05/10] block: remove per-queue plugging 2011-04-13 2:23 ` Linus Torvalds @ 2011-04-13 11:12 ` Peter Zijlstra 2011-04-13 11:23 ` Jens Axboe 0 siblings, 1 reply; 61+ messages in thread From: Peter Zijlstra @ 2011-04-13 11:12 UTC (permalink / raw) To: Linus Torvalds Cc: NeilBrown, Dave Chinner, Jens Axboe, hch@infradead.org, Mike Snitzer, linux-kernel@vger.kernel.org, dm-devel@redhat.com, linux-raid@vger.kernel.org On Tue, 2011-04-12 at 19:23 -0700, Linus Torvalds wrote: > kernel/sched.c | 20 ++++++++++---------- > 1 files changed, 10 insertions(+), 10 deletions(-) > > diff --git a/kernel/sched.c b/kernel/sched.c > index 48013633d792..a187c3fe027b 100644 > --- a/kernel/sched.c > +++ b/kernel/sched.c > @@ -4111,20 +4111,20 @@ need_resched: > try_to_wake_up_local(to_wakeup); > } > deactivate_task(rq, prev, DEQUEUE_SLEEP); > + > + /* > + * If we are going to sleep and we have plugged IO queued, make > + * sure to submit it to avoid deadlocks. > + */ > + if (blk_needs_flush_plug(prev)) { > + raw_spin_unlock(&rq->lock); > + blk_flush_plug(prev); > + raw_spin_lock(&rq->lock); > + } > } > switch_count = &prev->nvcsw; > } > > - /* > - * If we are going to sleep and we have plugged IO queued, make > - * sure to submit it to avoid deadlocks. > - */ > - if (prev->state != TASK_RUNNING && blk_needs_flush_plug(prev)) { > - raw_spin_unlock(&rq->lock); > - blk_flush_plug(prev); > - raw_spin_lock(&rq->lock); > - } > - > pre_schedule(rq, prev); > > if (unlikely(!rq->nr_running)) Right, that cures the preemption problem. The reason I suggested placing it where it was is that I'd like to keep all things that release rq->lock in the middle of schedule() in one place, but I guess we can cure that with some extra comments. ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [PATCH 05/10] block: remove per-queue plugging 2011-04-13 11:12 ` Peter Zijlstra @ 2011-04-13 11:23 ` Jens Axboe 2011-04-13 11:41 ` Peter Zijlstra 2011-04-13 15:13 ` Linus Torvalds 0 siblings, 2 replies; 61+ messages in thread From: Jens Axboe @ 2011-04-13 11:23 UTC (permalink / raw) To: Peter Zijlstra Cc: Linus Torvalds, NeilBrown, Dave Chinner, hch@infradead.org, Mike Snitzer, linux-kernel@vger.kernel.org, dm-devel@redhat.com, linux-raid@vger.kernel.org On 2011-04-13 13:12, Peter Zijlstra wrote: > On Tue, 2011-04-12 at 19:23 -0700, Linus Torvalds wrote: >> kernel/sched.c | 20 ++++++++++---------- >> 1 files changed, 10 insertions(+), 10 deletions(-) >> >> diff --git a/kernel/sched.c b/kernel/sched.c >> index 48013633d792..a187c3fe027b 100644 >> --- a/kernel/sched.c >> +++ b/kernel/sched.c >> @@ -4111,20 +4111,20 @@ need_resched: >> try_to_wake_up_local(to_wakeup); >> } >> deactivate_task(rq, prev, DEQUEUE_SLEEP); >> + >> + /* >> + * If we are going to sleep and we have plugged IO queued, make >> + * sure to submit it to avoid deadlocks. >> + */ >> + if (blk_needs_flush_plug(prev)) { >> + raw_spin_unlock(&rq->lock); >> + blk_flush_plug(prev); >> + raw_spin_lock(&rq->lock); >> + } >> } >> switch_count = &prev->nvcsw; >> } >> >> - /* >> - * If we are going to sleep and we have plugged IO queued, make >> - * sure to submit it to avoid deadlocks. >> - */ >> - if (prev->state != TASK_RUNNING && blk_needs_flush_plug(prev)) { >> - raw_spin_unlock(&rq->lock); >> - blk_flush_plug(prev); >> - raw_spin_lock(&rq->lock); >> - } >> - >> pre_schedule(rq, prev); >> >> if (unlikely(!rq->nr_running)) > > Right, that cures the preemption problem. The reason I suggested placing > it where it was is that I'd like to keep all things that release > rq->lock in the middle of schedule() in one place, but I guess we can > cure that with some extra comments. We definitely only want to do it on going to sleep, not preempt events. So if you are fine with this change, then lets please do that. Linus, I've got a few other things queued up in the area, I'll add this and send them off soon. Or feel free to add this one yourself, since you already did it. -- Jens Axboe ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [PATCH 05/10] block: remove per-queue plugging 2011-04-13 11:23 ` Jens Axboe @ 2011-04-13 11:41 ` Peter Zijlstra 2011-04-13 15:13 ` Linus Torvalds 1 sibling, 0 replies; 61+ messages in thread From: Peter Zijlstra @ 2011-04-13 11:41 UTC (permalink / raw) To: Jens Axboe Cc: Linus Torvalds, NeilBrown, Dave Chinner, hch@infradead.org, Mike Snitzer, linux-kernel@vger.kernel.org, dm-devel@redhat.com, linux-raid@vger.kernel.org On Wed, 2011-04-13 at 13:23 +0200, Jens Axboe wrote: > We definitely only want to do it on going to sleep, not preempt events. > So if you are fine with this change, then lets please do that. Here's the Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>, that goes with it ;-) > Linus, I've got a few other things queued up in the area, I'll add this > and send them off soon. Or feel free to add this one yourself, since you > already did it. Right, please send it onwards or have Linus commit it himself and I'll cook up a patch clarifying the rq->lock'ing mess around there. ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [PATCH 05/10] block: remove per-queue plugging 2011-04-13 11:23 ` Jens Axboe 2011-04-13 11:41 ` Peter Zijlstra @ 2011-04-13 15:13 ` Linus Torvalds 2011-04-13 17:35 ` Jens Axboe 1 sibling, 1 reply; 61+ messages in thread From: Linus Torvalds @ 2011-04-13 15:13 UTC (permalink / raw) To: Jens Axboe Cc: Peter Zijlstra, NeilBrown, Dave Chinner, hch@infradead.org, Mike Snitzer, linux-kernel@vger.kernel.org, dm-devel@redhat.com, linux-raid@vger.kernel.org On Wed, Apr 13, 2011 at 4:23 AM, Jens Axboe <jaxboe@fusionio.com> wrote: > > Linus, I've got a few other things queued up in the area, I'll add this > and send them off soon. Or feel free to add this one yourself, since you > already did it. Ok, I committed it with Peter's and your acks. And if you already put it in your git tree too, git will merge it. Linus ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [PATCH 05/10] block: remove per-queue plugging 2011-04-13 15:13 ` Linus Torvalds @ 2011-04-13 17:35 ` Jens Axboe 0 siblings, 0 replies; 61+ messages in thread From: Jens Axboe @ 2011-04-13 17:35 UTC (permalink / raw) To: Linus Torvalds Cc: Peter Zijlstra, NeilBrown, Dave Chinner, hch@infradead.org, Mike Snitzer, linux-kernel@vger.kernel.org, dm-devel@redhat.com, linux-raid@vger.kernel.org On 2011-04-13 17:13, Linus Torvalds wrote: > On Wed, Apr 13, 2011 at 4:23 AM, Jens Axboe <jaxboe@fusionio.com> wrote: >> >> Linus, I've got a few other things queued up in the area, I'll add this >> and send them off soon. Or feel free to add this one yourself, since you >> already did it. > > Ok, I committed it with Peter's and your acks. Great, thanks. > And if you already put it in your git tree too, git will merge it. I did not, I had a feeling you'd merge this one. -- Jens Axboe ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [PATCH 05/10] block: remove per-queue plugging
  2011-04-12 13:31             ` Dave Chinner
  2011-04-12 13:45               ` Jens Axboe
@ 2011-04-12 16:58               ` hch
  2011-04-12 17:29                 ` Jens Axboe
  1 sibling, 1 reply; 61+ messages in thread
From: hch @ 2011-04-12 16:58 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jens Axboe, hch@infradead.org, NeilBrown, Mike Snitzer,
	linux-kernel@vger.kernel.org, dm-devel@redhat.com,
	linux-raid@vger.kernel.org

On Tue, Apr 12, 2011 at 11:31:17PM +1000, Dave Chinner wrote:
> I don't think so. e.g. in the XFS allocation path we do btree block
> readahead, then go do the real work. The real work can end up with a
> deeper stack before blocking on locks or completions unrelated to
> the readahead, leading to schedule() being called and an unplug
> being issued at that point. You might think it contrived, but if
> you can't provide a guarantee that it can't happen then I have to
> assume it will happen.

In addition to the stack issue, which is a killer, this also has
latency implications. Before we could submit a synchronous metadata
read request inside readpage or writepage and kick it off to the disk
immediately, while now it won't get submitted until we block the next
time, i.e. have done some more work that could have been used for
doing I/O in the background. With the kblockd offload not only have
we spent more time, but at the point where we finally kick it we also
need another context switch. It seems like we really need to go
through the filesystems and explicitly flush the plugging queue for
such cases. In fact a bio flag marking things as synchronous metadata
reads would help, but then again we need to clean up our existing bio
flags first...

^ permalink raw reply	[flat|nested] 61+ messages in thread
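The explicit flush hch is asking for could look something like this at a
call site doing a synchronous metadata read (an illustrative sketch only,
not a patch from this thread; sync_meta_read() is a made-up name, and the
dedicated "synchronous metadata" bio flag he mentions does not exist yet):

static void sync_meta_read(struct bio *bio)
{
	submit_bio(READ_SYNC, bio);
	/*
	 * With a plug held the bio is merely queued; flush the task's
	 * plug explicitly so the read is dispatched now rather than at
	 * the next schedule().
	 */
	blk_flush_plug(current);
}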
* Re: [PATCH 05/10] block: remove per-queue plugging 2011-04-12 16:58 ` hch @ 2011-04-12 17:29 ` Jens Axboe 0 siblings, 0 replies; 61+ messages in thread From: Jens Axboe @ 2011-04-12 17:29 UTC (permalink / raw) To: hch@infradead.org Cc: Dave Chinner, NeilBrown, Mike Snitzer, linux-kernel@vger.kernel.org, dm-devel@redhat.com, linux-raid@vger.kernel.org On 2011-04-12 18:58, hch@infradead.org wrote: > On Tue, Apr 12, 2011 at 11:31:17PM +1000, Dave Chinner wrote: >> I don't think so. e.g. in the XFS allocation path we do btree block >> readahead, then go do the real work. The real work can end up with a >> deeper stack before blocking on locks or completions unrelated to >> the readahead, leading to schedule() being called and an unplug >> being issued at that point. You might think it contrived, but if >> you can't provide a guarantee that it can't happen then I have to >> assume it will happen. > > In addition to the stack issue, which is a killer to this also has > latency implications. Before we could submit a synchronous metadata > read request inside readpage or writepage and kick it off to the disk > immediately, while now it won't get submitted until we block the next > time, i.e. have done some more work that could have been used for > doing I/O in the background. With the kblockd offload not only have > we spent more time but at the point where we finally kick it we > also need another context switch. It seem like we really need to > go through the filesystems and explicitly flush the plugging queue > for such cases. In fact a bio flag marking things as synchronous > metadata reads would help, but then again we need to clean up our > existing bio flags first.. I think it would be a good idea to audit the SYNC cases, and if feasible let that retain the 'immediate kick off' logic. If not, have some way to signal that at least. Essentially allow some fine grained control of what goes into the plug and what does not. -- Jens Axboe ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [PATCH 05/10] block: remove per-queue plugging
  2011-04-12 12:58           ` Jens Axboe
  2011-04-12 13:31             ` Dave Chinner
@ 2011-04-12 16:44             ` hch
  2011-04-12 16:49               ` Jens Axboe
  1 sibling, 1 reply; 61+ messages in thread
From: hch @ 2011-04-12 16:44 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Dave Chinner, hch@infradead.org, NeilBrown, Mike Snitzer,
	linux-kernel@vger.kernel.org, dm-devel@redhat.com,
	linux-raid@vger.kernel.org

On Tue, Apr 12, 2011 at 02:58:46PM +0200, Jens Axboe wrote:
> Supposedly it's faster to do it inline rather than punt the dispatch.
> But that may actually not be true, if you have multiple plugs going (and
> thus multiple contenders for the queue lock on dispatch). So let's play
> it safe and punt to kblockd, we can always revisit this later.

Note that this can be optimized further by adding a new helper that just
queues up work on kblockd without taking the queue lock, e.g. adding a
new

void blk_run_queue_async(struct request_queue *q)
{
	if (likely(!blk_queue_stopped(q)))
		queue_delayed_work(kblockd_workqueue, &q->delay_work, 0);
}

And replacing all

	__blk_run_queue(q, true);

callers with that, at which point they won't need the queue lock any
more.

^ permalink raw reply	[flat|nested] 61+ messages in thread
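With such a helper - what eventually went upstream as blk_run_queue_async()
in 2.6.39 - the call-site conversion would look like:

	/* before: queue lock taken just to kick the queue via kblockd */
	spin_lock_irqsave(q->queue_lock, flags);
	__blk_run_queue(q, true);
	spin_unlock_irqrestore(q->queue_lock, flags);

	/* after: no queue lock needed for the asynchronous kick */
	blk_run_queue_async(q);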
* Re: [PATCH 05/10] block: remove per-queue plugging
  2011-04-12 16:44             ` hch
@ 2011-04-12 16:49               ` Jens Axboe
  2011-04-12 16:54                 ` hch
  0 siblings, 1 reply; 61+ messages in thread
From: Jens Axboe @ 2011-04-12 16:49 UTC (permalink / raw)
  To: hch@infradead.org
  Cc: Dave Chinner, NeilBrown, Mike Snitzer,
	linux-kernel@vger.kernel.org, dm-devel@redhat.com,
	linux-raid@vger.kernel.org

On 2011-04-12 18:44, hch@infradead.org wrote:
> On Tue, Apr 12, 2011 at 02:58:46PM +0200, Jens Axboe wrote:
>> Supposedly it's faster to do it inline rather than punt the dispatch.
>> But that may actually not be true, if you have multiple plugs going (and
>> thus multiple contenders for the queue lock on dispatch). So let's play
>> it safe and punt to kblockd, we can always revisit this later.
> 
> Note that this can be optimized further by adding a new helper that just
> queues up work on kblockd without taking the queue lock, e.g. adding a
> new
> 
> void blk_run_queue_async(struct request_queue *q)
> {
> 	if (likely(!blk_queue_stopped(q)))
> 		queue_delayed_work(kblockd_workqueue, &q->delay_work, 0);
> }
> 
> And replacing all
> 
> 	__blk_run_queue(q, true);
> 
> callers with that, at which point they won't need the queue lock any
> more.

I realize that, in fact it's already safe as long as you pass in 'true'
for __blk_run_queue(). Before I had rewritten it to move the running
out, so that makes the trick a little difficult. This afternoon I also
tested it and saw no noticeable difference, but I'll probably just do
it anyway as it makes sense.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 61+ messages in thread
* Re: [PATCH 05/10] block: remove per-queue plugging
  2011-04-12 16:49               ` Jens Axboe
@ 2011-04-12 16:54                 ` hch
  2011-04-12 17:24                   ` Jens Axboe
  0 siblings, 1 reply; 61+ messages in thread
From: hch @ 2011-04-12 16:54 UTC (permalink / raw)
  To: Jens Axboe
  Cc: hch@infradead.org, Dave Chinner, NeilBrown, Mike Snitzer,
	linux-kernel@vger.kernel.org, dm-devel@redhat.com,
	linux-raid@vger.kernel.org

On Tue, Apr 12, 2011 at 06:49:53PM +0200, Jens Axboe wrote:
> I realize that, in fact it's already safe as long as you pass in 'true'
> for __blk_run_queue(). Before I had rewritten it to move the running
> out, so that makes the trick a little difficult. This afternoon I also
> tested it and saw no noticeable difference, but I'll probably just do it
> anyway as it makes sense.

We still need the lock for __elv_add_request, so we'll need to keep the
logic anyway. But splitting out the just-queue-to-kblockd case from
__blk_run_queue and giving the latter a sane prototype still sounds
like a good idea to me.

Btw, now that we don't call the request_fn directly any more and thus
can't block, can the unplugging be moved into the preempt notifiers?

^ permalink raw reply	[flat|nested] 61+ messages in thread
* Re: [PATCH 05/10] block: remove per-queue plugging
  2011-04-12 16:54                 ` hch
@ 2011-04-12 17:24                   ` Jens Axboe
  0 siblings, 0 replies; 61+ messages in thread
From: Jens Axboe @ 2011-04-12 17:24 UTC (permalink / raw)
  To: hch@infradead.org
  Cc: Dave Chinner, NeilBrown, Mike Snitzer,
	linux-kernel@vger.kernel.org, dm-devel@redhat.com,
	linux-raid@vger.kernel.org

On 2011-04-12 18:54, hch@infradead.org wrote:
> On Tue, Apr 12, 2011 at 06:49:53PM +0200, Jens Axboe wrote:
>> I realize that, in fact it's already safe as long as you pass in 'true'
>> for __blk_run_queue(). Before I had rewritten it to move the running
>> out, so that makes the trick a little difficult. This afternoon I also
>> tested it and saw no noticeable difference, but I'll probably just do it
>> anyway as it makes sense.
> 
> We still need the lock for __elv_add_request, so we'll need to keep the
> logic anyway. But splitting out the just-queue-to-kblockd case from
> __blk_run_queue and giving the latter a sane prototype still sounds
> like a good idea to me.
> 
> Btw, now that we don't call the request_fn directly any more and thus
> can't block, can the unplugging be moved into the preempt notifiers?

It was only partly the reason; there's still the problem that the
notification happens on preempt (instead of only on a blocking
schedule), and the runqueue lock problem. And if we allow preempt,
then we need to disable preempt around all the plug logic.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 61+ messages in thread
* Re: [PATCH 05/10] block: remove per-queue plugging 2011-04-12 12:28 ` Jens Axboe 2011-04-12 12:41 ` Dave Chinner @ 2011-04-12 13:40 ` Dave Chinner 2011-04-12 13:48 ` Jens Axboe 1 sibling, 1 reply; 61+ messages in thread From: Dave Chinner @ 2011-04-12 13:40 UTC (permalink / raw) To: Jens Axboe Cc: hch@infradead.org, NeilBrown, Mike Snitzer, linux-kernel@vger.kernel.org, dm-devel@redhat.com, linux-raid@vger.kernel.org On Tue, Apr 12, 2011 at 02:28:31PM +0200, Jens Axboe wrote: > On 2011-04-12 14:22, Dave Chinner wrote: > > On Tue, Apr 12, 2011 at 10:36:30AM +0200, Jens Axboe wrote: > >> On 2011-04-12 03:12, hch@infradead.org wrote: > >>> On Mon, Apr 11, 2011 at 02:48:45PM +0200, Jens Axboe wrote: > >>> function calls. > >>> - Why is having a plug in blk_flush_plug marked unlikely? Note that > >>> unlikely is the static branch prediction hint to mark the case > >>> extremly unlikely and is even used for hot/cold partitioning. But > >>> when we call it we usually check beforehand if we actually have > >>> plugs, so it's actually likely to happen. > >> > >> The existance and out-of-line is for the scheduler() hook. It should be > >> an unlikely event to schedule with a plug held, normally the plug should > >> have been explicitly unplugged before that happens. > > > > Though if it does, haven't you just added a significant amount of > > depth to the worst case stack usage? I'm seeing this sort of thing > > from io_schedule(): > > > > Depth Size Location (40 entries) > > ----- ---- -------- > > 0) 4256 16 mempool_alloc_slab+0x15/0x20 > > 1) 4240 144 mempool_alloc+0x63/0x160 > > 2) 4096 16 scsi_sg_alloc+0x4c/0x60 > > 3) 4080 112 __sg_alloc_table+0x66/0x140 > > 4) 3968 32 scsi_init_sgtable+0x33/0x90 > > 5) 3936 48 scsi_init_io+0x31/0xc0 > > 6) 3888 32 scsi_setup_fs_cmnd+0x79/0xe0 > > 7) 3856 112 sd_prep_fn+0x150/0xa90 > > 8) 3744 48 blk_peek_request+0x6a/0x1f0 > > 9) 3696 96 scsi_request_fn+0x60/0x510 > > 10) 3600 32 __blk_run_queue+0x57/0x100 > > 11) 3568 80 flush_plug_list+0x133/0x1d0 > > 12) 3488 32 __blk_flush_plug+0x24/0x50 > > 13) 3456 32 io_schedule+0x79/0x80 > > > > (This is from a page fault on ext3 that is doing page cache > > readahead and blocking on a locked buffer.) FYI, the next step in the allocation chain adds >900 bytes to that stack: $ cat /sys/kernel/debug/tracing/stack_trace Depth Size Location (47 entries) ----- ---- -------- 0) 5176 40 zone_statistics+0xad/0xc0 1) 5136 288 get_page_from_freelist+0x2cf/0x840 2) 4848 304 __alloc_pages_nodemask+0x121/0x930 3) 4544 48 kmem_getpages+0x62/0x160 4) 4496 96 cache_grow+0x308/0x330 5) 4400 80 cache_alloc_refill+0x21c/0x260 6) 4320 64 kmem_cache_alloc+0x1b7/0x1e0 7) 4256 16 mempool_alloc_slab+0x15/0x20 8) 4240 144 mempool_alloc+0x63/0x160 9) 4096 16 scsi_sg_alloc+0x4c/0x60 10) 4080 112 __sg_alloc_table+0x66/0x140 11) 3968 32 scsi_init_sgtable+0x33/0x90 12) 3936 48 scsi_init_io+0x31/0xc0 13) 3888 32 scsi_setup_fs_cmnd+0x79/0xe0 14) 3856 112 sd_prep_fn+0x150/0xa90 15) 3744 48 blk_peek_request+0x6a/0x1f0 16) 3696 96 scsi_request_fn+0x60/0x510 17) 3600 32 __blk_run_queue+0x57/0x100 18) 3568 80 flush_plug_list+0x133/0x1d0 19) 3488 32 __blk_flush_plug+0x24/0x50 20) 3456 32 io_schedule+0x79/0x80 That's close to 1800 bytes now, and that's not entering the reclaim path. If i get one deeper than that, I'll be sure to post it. :) Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [PATCH 05/10] block: remove per-queue plugging 2011-04-12 13:40 ` Dave Chinner @ 2011-04-12 13:48 ` Jens Axboe 2011-04-12 23:35 ` Dave Chinner 0 siblings, 1 reply; 61+ messages in thread From: Jens Axboe @ 2011-04-12 13:48 UTC (permalink / raw) To: Dave Chinner Cc: hch@infradead.org, NeilBrown, Mike Snitzer, linux-kernel@vger.kernel.org, dm-devel@redhat.com, linux-raid@vger.kernel.org On 2011-04-12 15:40, Dave Chinner wrote: > On Tue, Apr 12, 2011 at 02:28:31PM +0200, Jens Axboe wrote: >> On 2011-04-12 14:22, Dave Chinner wrote: >>> On Tue, Apr 12, 2011 at 10:36:30AM +0200, Jens Axboe wrote: >>>> On 2011-04-12 03:12, hch@infradead.org wrote: >>>>> On Mon, Apr 11, 2011 at 02:48:45PM +0200, Jens Axboe wrote: >>>>> function calls. >>>>> - Why is having a plug in blk_flush_plug marked unlikely? Note that >>>>> unlikely is the static branch prediction hint to mark the case >>>>> extremly unlikely and is even used for hot/cold partitioning. But >>>>> when we call it we usually check beforehand if we actually have >>>>> plugs, so it's actually likely to happen. >>>> >>>> The existance and out-of-line is for the scheduler() hook. It should be >>>> an unlikely event to schedule with a plug held, normally the plug should >>>> have been explicitly unplugged before that happens. >>> >>> Though if it does, haven't you just added a significant amount of >>> depth to the worst case stack usage? I'm seeing this sort of thing >>> from io_schedule(): >>> >>> Depth Size Location (40 entries) >>> ----- ---- -------- >>> 0) 4256 16 mempool_alloc_slab+0x15/0x20 >>> 1) 4240 144 mempool_alloc+0x63/0x160 >>> 2) 4096 16 scsi_sg_alloc+0x4c/0x60 >>> 3) 4080 112 __sg_alloc_table+0x66/0x140 >>> 4) 3968 32 scsi_init_sgtable+0x33/0x90 >>> 5) 3936 48 scsi_init_io+0x31/0xc0 >>> 6) 3888 32 scsi_setup_fs_cmnd+0x79/0xe0 >>> 7) 3856 112 sd_prep_fn+0x150/0xa90 >>> 8) 3744 48 blk_peek_request+0x6a/0x1f0 >>> 9) 3696 96 scsi_request_fn+0x60/0x510 >>> 10) 3600 32 __blk_run_queue+0x57/0x100 >>> 11) 3568 80 flush_plug_list+0x133/0x1d0 >>> 12) 3488 32 __blk_flush_plug+0x24/0x50 >>> 13) 3456 32 io_schedule+0x79/0x80 >>> >>> (This is from a page fault on ext3 that is doing page cache >>> readahead and blocking on a locked buffer.) > > FYI, the next step in the allocation chain adds >900 bytes to that > stack: > > $ cat /sys/kernel/debug/tracing/stack_trace > Depth Size Location (47 entries) > ----- ---- -------- > 0) 5176 40 zone_statistics+0xad/0xc0 > 1) 5136 288 get_page_from_freelist+0x2cf/0x840 > 2) 4848 304 __alloc_pages_nodemask+0x121/0x930 > 3) 4544 48 kmem_getpages+0x62/0x160 > 4) 4496 96 cache_grow+0x308/0x330 > 5) 4400 80 cache_alloc_refill+0x21c/0x260 > 6) 4320 64 kmem_cache_alloc+0x1b7/0x1e0 > 7) 4256 16 mempool_alloc_slab+0x15/0x20 > 8) 4240 144 mempool_alloc+0x63/0x160 > 9) 4096 16 scsi_sg_alloc+0x4c/0x60 > 10) 4080 112 __sg_alloc_table+0x66/0x140 > 11) 3968 32 scsi_init_sgtable+0x33/0x90 > 12) 3936 48 scsi_init_io+0x31/0xc0 > 13) 3888 32 scsi_setup_fs_cmnd+0x79/0xe0 > 14) 3856 112 sd_prep_fn+0x150/0xa90 > 15) 3744 48 blk_peek_request+0x6a/0x1f0 > 16) 3696 96 scsi_request_fn+0x60/0x510 > 17) 3600 32 __blk_run_queue+0x57/0x100 > 18) 3568 80 flush_plug_list+0x133/0x1d0 > 19) 3488 32 __blk_flush_plug+0x24/0x50 > 20) 3456 32 io_schedule+0x79/0x80 > > That's close to 1800 bytes now, and that's not entering the reclaim > path. If i get one deeper than that, I'll be sure to post it. :) Do you have traces from 2.6.38, or are you just doing them now? 
The path you quote above should not go into reclaim, it's a GFP_ATOMIC allocation. -- Jens Axboe ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [PATCH 05/10] block: remove per-queue plugging 2011-04-12 13:48 ` Jens Axboe @ 2011-04-12 23:35 ` Dave Chinner 0 siblings, 0 replies; 61+ messages in thread From: Dave Chinner @ 2011-04-12 23:35 UTC (permalink / raw) To: Jens Axboe Cc: hch@infradead.org, NeilBrown, Mike Snitzer, linux-kernel@vger.kernel.org, dm-devel@redhat.com, linux-raid@vger.kernel.org On Tue, Apr 12, 2011 at 03:48:10PM +0200, Jens Axboe wrote: > On 2011-04-12 15:40, Dave Chinner wrote: > > On Tue, Apr 12, 2011 at 02:28:31PM +0200, Jens Axboe wrote: > >> On 2011-04-12 14:22, Dave Chinner wrote: > >>> On Tue, Apr 12, 2011 at 10:36:30AM +0200, Jens Axboe wrote: > >>>> On 2011-04-12 03:12, hch@infradead.org wrote: > >>>>> On Mon, Apr 11, 2011 at 02:48:45PM +0200, Jens Axboe wrote: > >>>>> function calls. > >>>>> - Why is having a plug in blk_flush_plug marked unlikely? Note that > >>>>> unlikely is the static branch prediction hint to mark the case > >>>>> extremly unlikely and is even used for hot/cold partitioning. But > >>>>> when we call it we usually check beforehand if we actually have > >>>>> plugs, so it's actually likely to happen. > >>>> > >>>> The existance and out-of-line is for the scheduler() hook. It should be > >>>> an unlikely event to schedule with a plug held, normally the plug should > >>>> have been explicitly unplugged before that happens. > >>> > >>> Though if it does, haven't you just added a significant amount of > >>> depth to the worst case stack usage? I'm seeing this sort of thing > >>> from io_schedule(): > >>> > >>> Depth Size Location (40 entries) > >>> ----- ---- -------- > >>> 0) 4256 16 mempool_alloc_slab+0x15/0x20 > >>> 1) 4240 144 mempool_alloc+0x63/0x160 > >>> 2) 4096 16 scsi_sg_alloc+0x4c/0x60 > >>> 3) 4080 112 __sg_alloc_table+0x66/0x140 > >>> 4) 3968 32 scsi_init_sgtable+0x33/0x90 > >>> 5) 3936 48 scsi_init_io+0x31/0xc0 > >>> 6) 3888 32 scsi_setup_fs_cmnd+0x79/0xe0 > >>> 7) 3856 112 sd_prep_fn+0x150/0xa90 > >>> 8) 3744 48 blk_peek_request+0x6a/0x1f0 > >>> 9) 3696 96 scsi_request_fn+0x60/0x510 > >>> 10) 3600 32 __blk_run_queue+0x57/0x100 > >>> 11) 3568 80 flush_plug_list+0x133/0x1d0 > >>> 12) 3488 32 __blk_flush_plug+0x24/0x50 > >>> 13) 3456 32 io_schedule+0x79/0x80 > >>> > >>> (This is from a page fault on ext3 that is doing page cache > >>> readahead and blocking on a locked buffer.) > > > > FYI, the next step in the allocation chain adds >900 bytes to that > > stack: > > > > $ cat /sys/kernel/debug/tracing/stack_trace > > Depth Size Location (47 entries) > > ----- ---- -------- > > 0) 5176 40 zone_statistics+0xad/0xc0 > > 1) 5136 288 get_page_from_freelist+0x2cf/0x840 > > 2) 4848 304 __alloc_pages_nodemask+0x121/0x930 > > 3) 4544 48 kmem_getpages+0x62/0x160 > > 4) 4496 96 cache_grow+0x308/0x330 > > 5) 4400 80 cache_alloc_refill+0x21c/0x260 > > 6) 4320 64 kmem_cache_alloc+0x1b7/0x1e0 > > 7) 4256 16 mempool_alloc_slab+0x15/0x20 > > 8) 4240 144 mempool_alloc+0x63/0x160 > > 9) 4096 16 scsi_sg_alloc+0x4c/0x60 > > 10) 4080 112 __sg_alloc_table+0x66/0x140 > > 11) 3968 32 scsi_init_sgtable+0x33/0x90 > > 12) 3936 48 scsi_init_io+0x31/0xc0 > > 13) 3888 32 scsi_setup_fs_cmnd+0x79/0xe0 > > 14) 3856 112 sd_prep_fn+0x150/0xa90 > > 15) 3744 48 blk_peek_request+0x6a/0x1f0 > > 16) 3696 96 scsi_request_fn+0x60/0x510 > > 17) 3600 32 __blk_run_queue+0x57/0x100 > > 18) 3568 80 flush_plug_list+0x133/0x1d0 > > 19) 3488 32 __blk_flush_plug+0x24/0x50 > > 20) 3456 32 io_schedule+0x79/0x80 > > > > That's close to 1800 bytes now, and that's not entering the reclaim > > path. 
If I get one deeper than that, I'll be sure to post it. :)
> 
> Do you have traces from 2.6.38, or are you just doing them now?

I do stack checks like this all the time. I generally don't keep
them around, just pay attention to the path and depth. ext3 is used
for / on my test VMs, and has never shown up as the worst case stack
usage when running xfstests. As of the block plugging code, this
trace is the top stack user for the first ~130 tests, and often for
the entire test run on XFS....

> The path you quote above should not go into reclaim, it's a GFP_ATOMIC
> allocation.

Right. I'm still trying to produce a trace that shows more stack
usage in the block layer. It's random chance as to what pops up most
of the time. However, some of the stacks that are showing up in
2.6.39 are quite different from any I've ever seen before...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 61+ messages in thread
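For anyone wanting to repeat these measurements: the traces quoted in
this thread come from the ftrace stack tracer, which records the deepest
kernel stack seen while it is enabled. Roughly (assuming
CONFIG_STACK_TRACER=y and debugfs mounted in the usual place):

echo 1 > /proc/sys/kernel/stack_tracer_enabled

# run the workload, e.g. xfstests or the dd loop from earlier in the thread

cat /sys/kernel/debug/tracing/stack_max_size   # deepest stack, in bytes
cat /sys/kernel/debug/tracing/stack_trace      # the path that produced it

echo 0 > /proc/sys/kernel/stack_tracer_enabled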
* Re: [PATCH 05/10] block: remove per-queue plugging
  2011-04-12  8:36         ` Jens Axboe
  2011-04-12 12:22           ` Dave Chinner
@ 2011-04-12 16:50           ` hch
  1 sibling, 0 replies; 61+ messages in thread
From: hch @ 2011-04-12 16:50 UTC (permalink / raw)
  To: Jens Axboe
  Cc: NeilBrown, Mike Snitzer, linux-kernel@vger.kernel.org,
	dm-devel@redhat.com, linux-raid@vger.kernel.org

On Tue, Apr 12, 2011 at 10:36:30AM +0200, Jens Axboe wrote:
> The existence and out-of-line is for the scheduler() hook. It should be
> an unlikely event to schedule with a plug held, normally the plug should
> have been explicitly unplugged before that happens.

I still don't think unlikely() is the right thing to do. The static
branch prediction hints cause a really massive slowdown if taken. For
things like this that happen during normal operation you're much
better off leaving the dynamic branch prediction in the CPU to predict
what's going on.

And I don't think it's all that unlikely - e.g. for all the metadata
during readpages/writepages schedule/io_schedule will be the unplugging
point right now. I'll see if I can run an I/O workload with Steve's
likely/unlikely profiling turned on.

> > void __blk_flush_plug(struct task_struct *tsk, struct blk_plug *plug)
> > {
> > 	flush_plug_list(plug);
> > 
> > 	if (plug == tsk->plug)
> > 		tsk->plug = NULL;
> > 	tsk->plug = plug;
> > }
> > 
> > it would seem much smarter to just call flush_plug_list directly.
> > In fact it seems like the tsk->plug is not necessary at all and
> > all remaining __blk_flush_plug callers could be replaced with
> > flush_plug_list.
> 
> It depends on whether this was an explicit unplug (eg
> blk_finish_plug()), or whether it was an implicit event (eg on
> schedule()). If we do it on schedule(), then we retain the plug after
> the flush. Otherwise we clear it.

blk_finish_plug doesn't go through this codepath.
This is an untested patch of how the area should look to me:

diff --git a/block/blk-core.c b/block/blk-core.c
index 90f22cc..6fa5ba1 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2668,7 +2668,7 @@ static int plug_rq_cmp(void *priv, struct list_head *a, struct list_head *b)
 	return !(rqa->q <= rqb->q);
 }
 
-static void flush_plug_list(struct blk_plug *plug)
+void blk_flush_plug_list(struct blk_plug *plug)
 {
 	struct request_queue *q;
 	unsigned long flags;
@@ -2716,29 +2716,16 @@ static void flush_plug_list(struct blk_plug *plug)
 	BUG_ON(!list_empty(&plug->list));
 	local_irq_restore(flags);
 }
-
-static void __blk_finish_plug(struct task_struct *tsk, struct blk_plug *plug)
-{
-	flush_plug_list(plug);
-
-	if (plug == tsk->plug)
-		tsk->plug = NULL;
-}
+EXPORT_SYMBOL_GPL(blk_flush_plug_list);
 
 void blk_finish_plug(struct blk_plug *plug)
 {
-	if (plug)
-		__blk_finish_plug(current, plug);
+	blk_flush_plug_list(plug);
+	if (plug == current->plug)
+		current->plug = NULL;
 }
 EXPORT_SYMBOL(blk_finish_plug);
 
-void __blk_flush_plug(struct task_struct *tsk, struct blk_plug *plug)
-{
-	__blk_finish_plug(tsk, plug);
-	tsk->plug = plug;
-}
-EXPORT_SYMBOL(__blk_flush_plug);
-
 int __init blk_dev_init(void)
 {
 	BUILD_BUG_ON(__REQ_NR_BITS > 8 *
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 32176cc..fa6a4e1 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -862,14 +862,14 @@ struct blk_plug {
 
 extern void blk_start_plug(struct blk_plug *);
 extern void blk_finish_plug(struct blk_plug *);
-extern void __blk_flush_plug(struct task_struct *, struct blk_plug *);
+extern void blk_flush_plug_list(struct blk_plug *);
 
 static inline void blk_flush_plug(struct task_struct *tsk)
 {
 	struct blk_plug *plug = tsk->plug;
 
-	if (unlikely(plug))
-		__blk_flush_plug(tsk, plug);
+	if (plug)
+		blk_flush_plug_list(plug);
 }
 
 static inline bool blk_needs_flush_plug(struct task_struct *tsk)

^ permalink raw reply related	[flat|nested] 61+ messages in thread
* Re: [PATCH 05/10] block: remove per-queue plugging 2011-04-11 12:11 ` Jens Axboe 2011-04-11 12:36 ` NeilBrown @ 2011-04-15 4:26 ` hch 2011-04-15 6:34 ` Jens Axboe 2011-04-17 22:19 ` NeilBrown 2 siblings, 1 reply; 61+ messages in thread From: hch @ 2011-04-15 4:26 UTC (permalink / raw) To: Jens Axboe Cc: NeilBrown, Mike Snitzer, linux-kernel@vger.kernel.org, hch@infradead.org, dm-devel@redhat.com, linux-raid@vger.kernel.org Btw, "block: move queue run on unplug to kblockd" currently moves the __blk_run_queue call to kblockd unconditionally. But I'm not sure that's correct - if we do an explicit blk_finish_plug there's no point in forcing the context switch. ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [PATCH 05/10] block: remove per-queue plugging 2011-04-15 4:26 ` hch @ 2011-04-15 6:34 ` Jens Axboe 0 siblings, 0 replies; 61+ messages in thread From: Jens Axboe @ 2011-04-15 6:34 UTC (permalink / raw) To: hch@infradead.org Cc: NeilBrown, Mike Snitzer, linux-kernel@vger.kernel.org, dm-devel@redhat.com, linux-raid@vger.kernel.org On 2011-04-15 06:26, hch@infradead.org wrote: > Btw, "block: move queue run on unplug to kblockd" currently moves > the __blk_run_queue call to kblockd unconditionally currently. But > I'm not sure that's correct - if we do an explicit blk_finish_plug > there's no point in forcing the context switch. It's correct, but yes it's not optimal for the explicit unplug. Well I think it really depends - for the single sync case, it's not ideal to punt to kblockd. But if you have a bunch of threads doing IO, you probably DO want to punt it to kblockd to avoid too many threads hammering on the queue lock at the same time. Would need testing to be sure, the below would be a way to accomplish that. diff --git a/block/blk-core.c b/block/blk-core.c index b598fa7..995e995 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -2662,16 +2662,16 @@ static int plug_rq_cmp(void *priv, struct list_head *a, struct list_head *b) return !(rqa->q <= rqb->q); } -static void queue_unplugged(struct request_queue *q, unsigned int depth) +static void queue_unplugged(struct request_queue *q, unsigned int depth, bool run_from_wq) { trace_block_unplug_io(q, depth); - __blk_run_queue(q, true); + __blk_run_queue(q, run_from_wq); if (q->unplugged_fn) q->unplugged_fn(q); } -void blk_flush_plug_list(struct blk_plug *plug) +void blk_flush_plug_list(struct blk_plug *plug, bool run_from_wq) { struct request_queue *q; unsigned long flags; @@ -2706,7 +2706,7 @@ void blk_flush_plug_list(struct blk_plug *plug) BUG_ON(!rq->q); if (rq->q != q) { if (q) { - queue_unplugged(q, depth); + queue_unplugged(q, depth, run_from_wq); spin_unlock(q->queue_lock); } q = rq->q; @@ -2727,7 +2727,7 @@ void blk_flush_plug_list(struct blk_plug *plug) } if (q) { - queue_unplugged(q, depth); + queue_unplugged(q, depth, run_from_wq); spin_unlock(q->queue_lock); } @@ -2737,7 +2737,7 @@ EXPORT_SYMBOL(blk_flush_plug_list); void blk_finish_plug(struct blk_plug *plug) { - blk_flush_plug_list(plug); + blk_flush_plug_list(plug, false); if (plug == current->plug) current->plug = NULL; diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h index ffe48ff..1c76506 100644 --- a/include/linux/blkdev.h +++ b/include/linux/blkdev.h @@ -865,14 +865,14 @@ struct blk_plug { extern void blk_start_plug(struct blk_plug *); extern void blk_finish_plug(struct blk_plug *); -extern void blk_flush_plug_list(struct blk_plug *); +extern void blk_flush_plug_list(struct blk_plug *, bool); static inline void blk_flush_plug(struct task_struct *tsk) { struct blk_plug *plug = tsk->plug; if (plug) - blk_flush_plug_list(plug); + blk_flush_plug_list(plug, true); } static inline bool blk_needs_flush_plug(struct task_struct *tsk) -- Jens Axboe ^ permalink raw reply related [flat|nested] 61+ messages in thread
* Re: [PATCH 05/10] block: remove per-queue plugging 2011-04-11 12:11 ` Jens Axboe 2011-04-11 12:36 ` NeilBrown 2011-04-15 4:26 ` hch @ 2011-04-17 22:19 ` NeilBrown 2011-04-18 4:19 ` NeilBrown 2011-04-18 6:38 ` Jens Axboe 2 siblings, 2 replies; 61+ messages in thread From: NeilBrown @ 2011-04-17 22:19 UTC (permalink / raw) To: Jens Axboe Cc: Mike Snitzer, linux-kernel@vger.kernel.org, hch@infradead.org, dm-devel@redhat.com, linux-raid@vger.kernel.org On Mon, 11 Apr 2011 14:11:58 +0200 Jens Axboe <jaxboe@fusionio.com> wrote: > > Yes. But I need to know when to release the requests that I have stored. > > I need to know when ->write_pages or ->read_pages or whatever has finished > > submitting a pile of pages so that I can start processing the request that I > > have put aside. So I need a callback from blk_finish_plug. > > OK fair enough, I'll add your callback patch. > But you didn't, did you? You added a completely different patch which is completely pointless. If you don't like my patch I would really prefer you said so rather than silently replace it with something completely different (and broken). I'll try to explain again. md does not use __make_request. At all. md does not use 'struct request'. At all. The 'list' in 'struct blk_plug' is a list of 'struct request'. Therefore md cannot put anything useful on the list in 'struct blk_plug'. So when blk_flush_plug_list calls queue_unplugged() on a queue that belonged to a request found on the blk_plug list, that queue cannot possibly ever be for an 'md' device (because no 'struct request' ever belongs to an md device, because md does not use 'struct request'). So your patch (commit f75664570d8b) doesn't help MD at all. For md, I need to attach something to blk_plug which somehow identifies an md device, so that blk_finish_plug can get to that device and let it unplug. The most sensible thing to have is a completely generic callback. That way different block devices (which choose not to use __make_request) can attach different sorts of things to blk_plug. So can we please have my original patch applied? (Revised version using list_splice_init included below). Or if not, a clear explanation of why not? Thanks, NeilBrown From 6a2aa888b855fd298c174bcee130cf43db0b3f7b Mon Sep 17 00:00:00 2001 From: NeilBrown <neilb@suse.de> Date: Mon, 18 Apr 2011 08:15:45 +1000 Subject: [PATCH] Enhance new plugging support to support general callbacks. md/raid requires an unplug callback, but as it does not uses requests the current code cannot provide one. So allow arbitrary callbacks to be attached to the blk_plug.
Cc: Jens Axboe <jaxboe@fusionio.com> Signed-off-by: NeilBrown <neilb@suse.de> --- block/blk-core.c | 20 ++++++++++++++++++++ include/linux/blkdev.h | 7 ++++++- 2 files changed, 26 insertions(+), 1 deletions(-) diff --git a/block/blk-core.c b/block/blk-core.c index 78b7b0c..c2b8006 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -2638,6 +2638,7 @@ void blk_start_plug(struct blk_plug *plug) plug->magic = PLUG_MAGIC; INIT_LIST_HEAD(&plug->list); + INIT_LIST_HEAD(&plug->cb_list); plug->should_sort = 0; /* @@ -2742,9 +2743,28 @@ void blk_flush_plug_list(struct blk_plug *plug, bool from_schedule) } EXPORT_SYMBOL(blk_flush_plug_list); +static void flush_plug_callbacks(struct blk_plug *plug) +{ + LIST_HEAD(callbacks); + + if (list_empty(&plug->cb_list)) + return; + + list_splice_init(&plug->cb_list, &callbacks); + + while (!list_empty(&callbacks)) { + struct blk_plug_cb *cb = list_first_entry(&callbacks, + struct blk_plug_cb, + list); + list_del(&cb->list); + cb->callback(cb); + } +} + void blk_finish_plug(struct blk_plug *plug) { blk_flush_plug_list(plug, false); + flush_plug_callbacks(plug); if (plug == current->plug) current->plug = NULL; diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h index ec0357d..f3f7879 100644 --- a/include/linux/blkdev.h +++ b/include/linux/blkdev.h @@ -860,8 +860,13 @@ extern void blk_put_queue(struct request_queue *); struct blk_plug { unsigned long magic; struct list_head list; + struct list_head cb_list; unsigned int should_sort; }; +struct blk_plug_cb { + struct list_head list; + void (*callback)(struct blk_plug_cb *); +}; extern void blk_start_plug(struct blk_plug *); extern void blk_finish_plug(struct blk_plug *); @@ -887,7 +892,7 @@ static inline bool blk_needs_flush_plug(struct task_struct *tsk) { struct blk_plug *plug = tsk->plug; - return plug && !list_empty(&plug->list); + return plug && (!list_empty(&plug->list) || !list_empty(&plug->cb_list)); } /* -- 1.7.3.4 ^ permalink raw reply related [flat|nested] 61+ messages in thread
* Re: [PATCH 05/10] block: remove per-queue plugging 2011-04-17 22:19 ` NeilBrown @ 2011-04-18 4:19 ` NeilBrown 2011-04-18 6:38 ` Jens Axboe 1 sibling, 0 replies; 61+ messages in thread From: NeilBrown @ 2011-04-18 4:19 UTC (permalink / raw) To: Jens Axboe Cc: Mike Snitzer, linux-kernel@vger.kernel.org, hch@infradead.org, dm-devel@redhat.com, linux-raid@vger.kernel.org On Mon, 18 Apr 2011 08:19:22 +1000 NeilBrown <neilb@suse.de> wrote: > So can we please have my original patch applied? (Revised version using > list_splice_init included below). I hadn't adjusted that one properly for the recent code shuffling. This one is actually tested... Thanks, NeilBrown From 325b1c12b6165002022bd7b599f95c0331491cb3 Mon Sep 17 00:00:00 2001 From: NeilBrown <neilb@suse.de> Date: Mon, 18 Apr 2011 14:06:05 +1000 Subject: [PATCH] Enhance new plugging support to support general callbacks. md/raid requires an unplug callback, but as it does not uses requests the current code cannot provide one. So allow arbitrary callbacks to be attached to the blk_plug. Cc: Jens Axboe <jaxboe@fusionio.com> Signed-off-by: NeilBrown <neilb@suse.de> --- block/blk-core.c | 20 ++++++++++++++++++++ include/linux/blkdev.h | 7 ++++++- 2 files changed, 26 insertions(+), 1 deletions(-) diff --git a/block/blk-core.c b/block/blk-core.c index 78b7b0c..77edf05 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -2638,6 +2638,7 @@ void blk_start_plug(struct blk_plug *plug) plug->magic = PLUG_MAGIC; INIT_LIST_HEAD(&plug->list); + INIT_LIST_HEAD(&plug->cb_list); plug->should_sort = 0; /* @@ -2678,6 +2679,24 @@ static void queue_unplugged(struct request_queue *q, unsigned int depth, q->unplugged_fn(q); } +static void flush_plug_callbacks(struct blk_plug *plug) +{ + LIST_HEAD(callbacks); + + if (list_empty(&plug->cb_list)) + return; + + list_splice_init(&plug->cb_list, &callbacks); + + while (!list_empty(&callbacks)) { + struct blk_plug_cb *cb = list_first_entry(&callbacks, + struct blk_plug_cb, + list); + list_del(&cb->list); + cb->callback(cb); + } +} + void blk_flush_plug_list(struct blk_plug *plug, bool from_schedule) { struct request_queue *q; @@ -2688,6 +2707,7 @@ void blk_flush_plug_list(struct blk_plug *plug, bool from_schedule) BUG_ON(plug->magic != PLUG_MAGIC); + flush_plug_callbacks(plug); if (list_empty(&plug->list)) return; diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h index ec0357d..f3f7879 100644 --- a/include/linux/blkdev.h +++ b/include/linux/blkdev.h @@ -860,8 +860,13 @@ extern void blk_put_queue(struct request_queue *); struct blk_plug { unsigned long magic; struct list_head list; + struct list_head cb_list; unsigned int should_sort; }; +struct blk_plug_cb { + struct list_head list; + void (*callback)(struct blk_plug_cb *); +}; extern void blk_start_plug(struct blk_plug *); extern void blk_finish_plug(struct blk_plug *); @@ -887,7 +892,7 @@ static inline bool blk_needs_flush_plug(struct task_struct *tsk) { struct blk_plug *plug = tsk->plug; - return plug && !list_empty(&plug->list); + return plug && (!list_empty(&plug->list) || !list_empty(&plug->cb_list)); } /* -- 1.7.3.4 ^ permalink raw reply related [flat|nested] 61+ messages in thread
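For illustration, a bio-based driver would consume this interface along the following lines. This is a hypothetical sketch, not part of Neil's series: md_plug_cb, md_unplug_cb and md_plug_device are invented names, and a real implementation would also avoid adding duplicate callbacks for the same device to one plug.

struct md_plug_cb {
	struct blk_plug_cb	cb;	/* embeds the list linkage + callback */
	mddev_t			*mddev;
};

static void md_unplug_cb(struct blk_plug_cb *cb)
{
	struct md_plug_cb *mdcb = container_of(cb, struct md_plug_cb, cb);

	/* the actual unplug: release queued writes, kick the raid thread */
	md_wakeup_thread(mdcb->mddev->thread);
	kfree(mdcb);
}

/* called from the driver's make_request path while bios are being queued */
static void md_plug_device(mddev_t *mddev)
{
	struct blk_plug *plug = current->plug;
	struct md_plug_cb *mdcb;

	if (!plug)
		return;		/* caller did not plug; nothing to defer */

	mdcb = kmalloc(sizeof(*mdcb), GFP_ATOMIC);
	if (!mdcb)
		return;		/* allocation failure: fall back to immediate writeout */

	mdcb->mddev = mddev;
	mdcb->cb.callback = md_unplug_cb;
	list_add(&mdcb->cb.list, &plug->cb_list);
}

blk_finish_plug() then walks cb_list and invokes md_unplug_cb once the caller has finished submitting its batch.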
* Re: [PATCH 05/10] block: remove per-queue plugging 2011-04-17 22:19 ` NeilBrown 2011-04-18 4:19 ` NeilBrown @ 2011-04-18 6:38 ` Jens Axboe 2011-04-18 7:25 ` NeilBrown 1 sibling, 1 reply; 61+ messages in thread From: Jens Axboe @ 2011-04-18 6:38 UTC (permalink / raw) To: NeilBrown Cc: Mike Snitzer, linux-kernel@vger.kernel.org, hch@infradead.org, dm-devel@redhat.com, linux-raid@vger.kernel.org On 2011-04-18 00:19, NeilBrown wrote: > On Mon, 11 Apr 2011 14:11:58 +0200 Jens Axboe <jaxboe@fusionio.com> wrote: > >>> Yes. But I need to know when to release the requests that I have stored. >>> I need to know when ->write_pages or ->read_pages or whatever has finished >>> submitting a pile of pages so that I can start processing the request that I >>> have put aside. So I need a callback from blk_finish_plug. >> >> OK fair enough, I'll add your callback patch. >> > > But you didn't did you? You added a completely different patch which is > completely pointless. > If you don't like my patch I would really prefer you said so rather than > silently replace it with something completely different (and broken). First of all, you were CC'ed on all that discussion, yet didn't speak up until now. This was last week. Secondly, please change your tone. > I'll try to explain again. > > md does not use __make_request. At all. > md does not use 'struct request'. At all. > > The 'list' in 'struct blk_plug' is a list of 'struct request'. I'm well aware of these facts, but thanks for bringing it up. > Therefore md cannot put anything useful on the list in 'struct blk_plug'. > > So when blk_flush_plug_list calls queue_unplugged() on a queue that belonged > to a request found on the blk_plug list, that queue cannot possibly ever be > for an 'md' device (because no 'struct request' ever belongs to an md device, > because md doesn't not use 'struct request'). > > So your patch (commit f75664570d8b) doesn't help MD at all. > > For md, I need to attach something to blk_plug which somehow identifies an md > device, so that blk_finish_plug can get to that device and let it unplug. > The most sensible thing to have is a completely generic callback. That way > different block devices (which choose not to use __make_request) can attach > different sorts of things to blk_plug. > > So can we please have my original patch applied? (Revised version using > list_splice_init included below). > > Or if not, a clear explanation of why not? So correct me if I'm wrong here, but the _only_ real difference between this patch and the current code in the tree, is the checking of the callback list indicating a need to flush the callbacks. And that's definitely an oversight.
diff --git a/block/blk-core.c b/block/blk-core.c index 78b7b0c..e1f5635 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -1305,12 +1305,12 @@ get_rq: */ if (list_empty(&plug->list)) trace_block_plug(q); - else if (!plug->should_sort) { + else if (!(plug->flags & BLK_PLUG_F_SORT)) { struct request *__rq; __rq = list_entry_rq(plug->list.prev); if (__rq->q != q) - plug->should_sort = 1; + plug->flags |= BLK_PLUG_F_SORT; } /* * Debug flag, kill later @@ -2638,7 +2638,7 @@ void blk_start_plug(struct blk_plug *plug) plug->magic = PLUG_MAGIC; INIT_LIST_HEAD(&plug->list); - plug->should_sort = 0; + plug->flags = 0; /* * If this is a nested plug, don't actually assign it. It will be @@ -2693,9 +2693,9 @@ void blk_flush_plug_list(struct blk_plug *plug, bool from_schedule) list_splice_init(&plug->list, &list); - if (plug->should_sort) { + if (plug->flags & BLK_PLUG_F_SORT) { list_sort(NULL, &list, plug_rq_cmp); - plug->should_sort = 0; + plug->flags &= ~BLK_PLUG_F_SORT; } q = NULL; diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h index ec0357d..1a0b76b 100644 --- a/include/linux/blkdev.h +++ b/include/linux/blkdev.h @@ -860,7 +860,12 @@ extern void blk_put_queue(struct request_queue *); struct blk_plug { unsigned long magic; struct list_head list; - unsigned int should_sort; + unsigned int flags; +}; + +enum { + BLK_PLUG_F_SORT = 1, + BLK_PLUG_F_NEED_UNPLUG = 2, }; extern void blk_start_plug(struct blk_plug *); @@ -887,7 +892,8 @@ static inline bool blk_needs_flush_plug(struct task_struct *tsk) { struct blk_plug *plug = tsk->plug; - return plug && !list_empty(&plug->list); + return plug && (!list_empty(&plug->list) || + (plug->flags & BLK_PLUG_F_NEED_UNPLUG)); } /* -- Jens Axboe ^ permalink raw reply related [flat|nested] 61+ messages in thread
* Re: [PATCH 05/10] block: remove per-queue plugging 2011-04-18 6:38 ` Jens Axboe @ 2011-04-18 7:25 ` NeilBrown 2011-04-18 8:10 ` Jens Axboe 0 siblings, 1 reply; 61+ messages in thread From: NeilBrown @ 2011-04-18 7:25 UTC (permalink / raw) To: Jens Axboe Cc: Mike Snitzer, linux-kernel@vger.kernel.org, hch@infradead.org, dm-devel@redhat.com, linux-raid@vger.kernel.org On Mon, 18 Apr 2011 08:38:24 +0200 Jens Axboe <jaxboe@fusionio.com> wrote: > On 2011-04-18 00:19, NeilBrown wrote: > > On Mon, 11 Apr 2011 14:11:58 +0200 Jens Axboe <jaxboe@fusionio.com> wrote: > > > >>> Yes. But I need to know when to release the requests that I have stored. > >>> I need to know when ->write_pages or ->read_pages or whatever has finished > >>> submitting a pile of pages so that I can start processing the request that I > >>> have put aside. So I need a callback from blk_finish_plug. > >> > >> OK fair enough, I'll add your callback patch. > >> > > > > But you didn't did you? You added a completely different patch which is > > completely pointless. > > If you don't like my patch I would really prefer you said so rather than > > silently replace it with something completely different (and broken). > > First of all, you were CC'ed on all that discussion, yet didn't speak up > until now. This was last week. Secondly, please change your tone. Yes, I was CC'ed on a discussion. In that discussion it was never mentioned that you had completely changed the patch I sent you, and it never contained the new patch in-line for review. Nothing that was discussed was particularly relevant to md's needs so there was nothing to speak up about. Yes - there were 'git pull' requests and I could have done a pull myself to review the code but there seemed to be no urgency because you had already agreed to apply my patch. When I did finally pull the patches (after all the other issues had settled down and I had time to finish off the RAID side) I found ... what I found. I apologise for my tone, but I was very frustrated. > > > I'll try to explain again. > > > > md does not use __make_request. At all. > > md does not use 'struct request'. At all. > > > > The 'list' in 'struct blk_plug' is a list of 'struct request'. > > I'm well aware of how these facts, but thanks for bringing it up. > > > Therefore md cannot put anything useful on the list in 'struct blk_plug'. > > > > So when blk_flush_plug_list calls queue_unplugged() on a queue that belonged > > to a request found on the blk_plug list, that queue cannot possibly ever be > > for an 'md' device (because no 'struct request' ever belongs to an md device, > > because md doesn't not use 'struct request'). > > > > So your patch (commit f75664570d8b) doesn't help MD at all. > > > > For md, I need to attach something to blk_plug which somehow identifies an md > > device, so that blk_finish_plug can get to that device and let it unplug. > > The most sensible thing to have is a completely generic callback. That way > > different block devices (which choose not to use __make_request) can attach > > different sorts of things to blk_plug. > > > > So can we please have my original patch applied? (Revised version using > > list_splice_init included below). > > > > Or if not, a clear explanation of why not? > > So correct me if I'm wrong here, but the _only_ real difference between > this patch and the current code in the tree, is the checking of the > callback list indicating a need to flush the callbacks. And that's > definitely an oversight.
It should be functionally equivelant if md > would just flag this need to get a callback, eg instead of queueing a > callback on the list, just set plug->need_unplug from md instead of > queuing a callback and have blk_needs_flush_plug() do: > > return plug && (!list_empty(&plug->list) || plug->need_unplug); > > instead. Something like the below, completely untested. > No, that is not the only real difference. The real difference is that in the current code, md has no way to register anything with a blk_plug because you can only register a 'struct request' on a blk_plug, and md doesn't make any use of 'struct request'. As I said in the Email you quote above: > > Therefore md cannot put anything useful on the list in 'struct blk_plug'. That is the heart of the problem. NeilBrown ^ permalink raw reply [flat|nested] 61+ messages in thread
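For reference, md's queue setup makes the point concrete (lightly paraphrased from drivers/md/md.c of this period): bios are handed straight to the personality, and no 'struct request' is ever allocated, so nothing md submits can appear on plug->list.

/* md registers a make_request function rather than a request_fn */
static int md_make_request(struct request_queue *q, struct bio *bio)
{
	mddev_t *mddev = q->queuedata;

	/* suspend/resume bookkeeping elided */
	return mddev->pers->make_request(mddev, bio);
}

	...
	blk_queue_make_request(mddev->queue, md_make_request);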
* Re: [PATCH 05/10] block: remove per-queue plugging 2011-04-18 7:25 ` NeilBrown @ 2011-04-18 8:10 ` Jens Axboe 2011-04-18 8:33 ` NeilBrown 2011-04-18 9:19 ` hch 0 siblings, 2 replies; 61+ messages in thread From: Jens Axboe @ 2011-04-18 8:10 UTC (permalink / raw) To: NeilBrown Cc: Mike Snitzer, linux-kernel@vger.kernel.org, hch@infradead.org, dm-devel@redhat.com, linux-raid@vger.kernel.org On 2011-04-18 09:25, NeilBrown wrote: > On Mon, 18 Apr 2011 08:38:24 +0200 Jens Axboe <jaxboe@fusionio.com> wrote: > >> On 2011-04-18 00:19, NeilBrown wrote: >>> On Mon, 11 Apr 2011 14:11:58 +0200 Jens Axboe <jaxboe@fusionio.com> wrote: >>> >>>>> Yes. But I need to know when to release the requests that I have stored. >>>>> I need to know when ->write_pages or ->read_pages or whatever has finished >>>>> submitting a pile of pages so that I can start processing the request that I >>>>> have put aside. So I need a callback from blk_finish_plug. >>>> >>>> OK fair enough, I'll add your callback patch. >>>> >>> >>> But you didn't did you? You added a completely different patch which is >>> completely pointless. >>> If you don't like my patch I would really prefer you said so rather than >>> silently replace it with something completely different (and broken). >> >> First of all, you were CC'ed on all that discussion, yet didn't speak up >> until now. This was last week. Secondly, please change your tone. > > Yes, I was CC'ed on a discussion. In that discussion it was never mentioned > that you had completely changed the patch I sent you, and it never contained > the new patch in-line for review. Nothing that was discussed was > particularly relevant to md's needs so there was nothing to speak up about. > > Yes- there were 'git pull' requests and I could have done a pull myself to > review the code but there seemed to be no urgency because you had already > agreed to apply my patch. > When I did finally pull the patches (after all the other issues had settle > down and I had time to finish of the RAID side) I found ... what I found. > > I apologise for my tone, but I was very frustrated. > >> >>> I'll try to explain again. >>> >>> md does not use __make_request. At all. >>> md does not use 'struct request'. At all. >>> >>> The 'list' in 'struct blk_plug' is a list of 'struct request'. >> >> I'm well aware of how these facts, but thanks for bringing it up. >> >>> Therefore md cannot put anything useful on the list in 'struct blk_plug'. >>> >>> So when blk_flush_plug_list calls queue_unplugged() on a queue that belonged >>> to a request found on the blk_plug list, that queue cannot possibly ever be >>> for an 'md' device (because no 'struct request' ever belongs to an md device, >>> because md doesn't not use 'struct request'). >>> >>> So your patch (commit f75664570d8b) doesn't help MD at all. >>> >>> For md, I need to attach something to blk_plug which somehow identifies an md >>> device, so that blk_finish_plug can get to that device and let it unplug. >>> The most sensible thing to have is a completely generic callback. That way >>> different block devices (which choose not to use __make_request) can attach >>> different sorts of things to blk_plug. >>> >>> So can we please have my original patch applied? (Revised version using >>> list_splice_init included below). >>> >>> Or if not, a clear explanation of why not? 
>> >> So correct me if I'm wrong here, but the _only_ real difference between >> this patch and the current code in the tree, is the checking of the >> callback list indicating a need to flush the callbacks. And that's >> definitely an oversight. It should be functionally equivelant if md >> would just flag this need to get a callback, eg instead of queueing a >> callback on the list, just set plug->need_unplug from md instead of >> queuing a callback and have blk_needs_flush_plug() do: >> >> return plug && (!list_empty(&plug->list) || plug->need_unplug); >> >> instead. Something like the below, completely untested. >> > > No, that is not the only real difference. > > The real difference is that in the current code, md has no way to register > anything with a blk_plug because you can only register a 'struct request' on a > blk_plug, and md doesn't make any use of 'struct request'. > > As I said in the Email you quote above: > >>> Therefore md cannot put anything useful on the list in 'struct blk_plug'. > > That is the heart of the problem. Hmm, I don't really see a way to avoid the list in that case. You really do need some way to queue items, a single callback or flag or pointer will not suffice. I've added the patch and removed the (now) useless ->unplugged_fn callback. I suggest you base your md changes on top of my for-linus branch and tell me when you are confident it looks good, then I'll pull in your MD changes and submit them later today. OK with you? -- Jens Axboe ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [PATCH 05/10] block: remove per-queue plugging 2011-04-18 8:10 ` Jens Axboe @ 2011-04-18 8:33 ` NeilBrown 2011-04-18 8:42 ` Jens Axboe ` (2 more replies) 2011-04-18 9:19 ` hch 1 sibling, 3 replies; 61+ messages in thread From: NeilBrown @ 2011-04-18 8:33 UTC (permalink / raw) To: Jens Axboe Cc: Mike Snitzer, linux-kernel@vger.kernel.org, hch@infradead.org, dm-devel@redhat.com, linux-raid@vger.kernel.org [[NOTE to dm-devel people - one of the patches here removes some now-unused code from dm-raid.c plus a declaration from device-mapper.h]] On Mon, 18 Apr 2011 10:10:18 +0200 Jens Axboe <jaxboe@fusionio.com> wrote: > On 2011-04-18 09:25, NeilBrown wrote: > >>>>> Therefore md cannot put anything useful on the list in 'struct blk_plug'. >>> >>> That is the heart of the problem. >> >> Hmm, I don't really see a way to avoid the list in that case. You really >> do need some way to queue items, a single callback or flag or pointer >> will not suffice. >> >> I've added the patch and removed the (now) useless ->unplugged_fn >> callback. I suggest you base your md changes on top of my for-linus >> branch and tell me when you are confident it looks good, then I'll pull >> in your MD changes and submit them later today. >> >> OK with you? >> Yes, that's perfect. Thanks. All of my plugging-related patches are now in a 'for-jens' branch: The following changes since commit 99e22598e9a8e0a996d69c8c0f6b7027cb57720a: block: drop queue lock before calling __blk_run_queue() for kblockd punt (2011-04-18 09:59:55 +0200) are available in the git repository at: git://neil.brown.name/md for-jens NeilBrown (6): md: use new plugging interface for RAID IO. md/dm - remove remains of plug_fn callback. md - remove old plugging code. md: provide generic support for handling unplug callbacks. md: incorporate new plugging into raid5. md: fix up raid1/raid10 unplugging. drivers/md/dm-raid.c | 8 ---- drivers/md/md.c | 87 +++++++++++++++++++++------------------- drivers/md/md.h | 26 ++---------- drivers/md/raid1.c | 29 +++++++------- drivers/md/raid10.c | 27 ++++++------- drivers/md/raid5.c | 61 ++++++++++++---------------- drivers/md/raid5.h | 2 - include/linux/device-mapper.h | 1 - 8 files changed, 103 insertions(+), 138 deletions(-) Thanks, NeilBrown ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [PATCH 05/10] block: remove per-queue plugging 2011-04-18 8:33 ` NeilBrown @ 2011-04-18 8:42 ` Jens Axboe 2011-04-18 21:23 ` hch 2011-04-18 21:30 ` hch 2 siblings, 0 replies; 61+ messages in thread From: Jens Axboe @ 2011-04-18 8:42 UTC (permalink / raw) To: NeilBrown Cc: Mike Snitzer, linux-kernel@vger.kernel.org, hch@infradead.org, dm-devel@redhat.com, linux-raid@vger.kernel.org On 2011-04-18 10:33, NeilBrown wrote: > > > [[NOTE to dm-devel people - one of the patches here remove some > now-unused code from dm-raid.c plus a declaration from device-mapper.h ]]] > > > On Mon, 18 Apr 2011 10:10:18 +0200 Jens Axboe <jaxboe@fusionio.com> wrote: > >> On 2011-04-18 09:25, NeilBrown wrote: > >>>>> Therefore md cannot put anything useful on the list in 'struct blk_plug'. >>> >>> That is the heart of the problem. >> >> Hmm, I don't really see a way to avoid the list in that case. You really >> do need some way to queue items, a single callback or flag or pointer >> will not suffice. >> >> I've added the patch and removed the (now) useless ->unplugged_fn >> callback. I suggest you base your md changes on top of my for-linus >> branch and tell me when you are confident it looks good, then I'll pull >> in your MD changes and submit them later today. >> >> OK with you? >> > > Yes, that's perfect. Thanks. > > All of my plugging-related patches are now in a 'for-jens' branch: > > The following changes since commit 99e22598e9a8e0a996d69c8c0f6b7027cb57720a: > > block: drop queue lock before calling __blk_run_queue() for kblockd punt (2011-04-18 09:59:55 +0200) > > are available in the git repository at: > git://neil.brown.name/md for-jens > > NeilBrown (6): > md: use new plugging interface for RAID IO. > md/dm - remove remains of plug_fn callback. > md - remove old plugging code. > md: provide generic support for handling unplug callbacks. > md: incorporate new plugging into raid5. > md: fix up raid1/raid10 unplugging. > > drivers/md/dm-raid.c | 8 ---- > drivers/md/md.c | 87 +++++++++++++++++++++------------------- > drivers/md/md.h | 26 ++---------- > drivers/md/raid1.c | 29 +++++++------- > drivers/md/raid10.c | 27 ++++++------- > drivers/md/raid5.c | 61 ++++++++++++---------------- > drivers/md/raid5.h | 2 - > include/linux/device-mapper.h | 1 - > 8 files changed, 103 insertions(+), 138 deletions(-) Great, thanks a lot Neil! It's pulled in now, will send the request to Linus today. -- Jens Axboe ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [PATCH 05/10] block: remove per-queue plugging 2011-04-18 8:33 ` NeilBrown 2011-04-18 8:42 ` Jens Axboe @ 2011-04-18 21:23 ` hch 2011-04-18 21:30 ` hch 2 siblings, 0 replies; 61+ messages in thread From: hch @ 2011-04-18 21:23 UTC (permalink / raw) To: NeilBrown Cc: Jens Axboe, Mike Snitzer, linux-kernel@vger.kernel.org, hch@infradead.org, dm-devel@redhat.com, linux-raid@vger.kernel.org > NeilBrown (6): > md: use new plugging interface for RAID IO. > md/dm - remove remains of plug_fn callback. > md - remove old plugging code. > md: provide generic support for handling unplug callbacks. > md: incorporate new plugging into raid5. > md: fix up raid1/raid10 unplugging. Looking over more of the unplugging leftovers: is there a reason to keep the unplug_work bits in CFQ? They seem to rather counter the current scheme (and they are the last user of kblockd outside of blk-core.c). ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [PATCH 05/10] block: remove per-queue plugging 2011-04-18 8:33 ` NeilBrown 2011-04-18 8:42 ` Jens Axboe 2011-04-18 21:23 ` hch @ 2011-04-18 21:30 ` hch 2011-04-18 22:38 ` NeilBrown 2 siblings, 1 reply; 61+ messages in thread From: hch @ 2011-04-18 21:30 UTC (permalink / raw) To: NeilBrown Cc: Jens Axboe, Mike Snitzer, linux-kernel@vger.kernel.org, hch@infradead.org, dm-devel@redhat.com, linux-raid@vger.kernel.org > md: provide generic support for handling unplug callbacks. This looks like some horribly ugly code to me. The real fix is to do the plugging in the block layer for bios instead of requests. The effect should be about the same, except that merging will become a little easier as all bios will be on the list now when calling into __make_request or its equivalent, and even better, if we extend the list sort callback to also sort by the start block it will actually simplify the merge algorithm a lot, as it only needs to do front merges and no back merges for the on-stack merging. In addition it should also allow for much more optimal queue_lock roundtrips - we can keep it locked at the end of what's currently __make_request to have it available for the next bio that's been on the list. If it either can be merged now that we have the lock and/or we optimize get_request_wait not to sleep in the fast path we could get down to a single queue_lock roundtrip for each unplug. ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [PATCH 05/10] block: remove per-queue plugging 2011-04-18 21:30 ` hch @ 2011-04-18 22:38 ` NeilBrown 2011-04-20 10:55 ` hch 0 siblings, 1 reply; 61+ messages in thread From: NeilBrown @ 2011-04-18 22:38 UTC (permalink / raw) To: hch@infradead.org Cc: Jens Axboe, Mike Snitzer, linux-kernel@vger.kernel.org, dm-devel@redhat.com, linux-raid@vger.kernel.org On Mon, 18 Apr 2011 17:30:48 -0400 "hch@infradead.org" <hch@infradead.org> wrote: > > md: provide generic support for handling unplug callbacks. > > This looks like some horribly ugly code to me. The real fix is to do > the plugging in the block layers for bios instead of requests. The > effect should be about the same, except that merging will become a > little easier as all bios will be on the list now when calling into > __make_request or it's equivalent, and even better if we extent the > list sort callback to also sort by the start block it will actually > simplify the merge algorithm a lot as it only needs to do front merges > and no back merges for the on-stack merging. > > In addition it should also allow for much more optimal queue_lock > roundtrips - we can keep it locked at the end of what's currently > __make_request to have it available for the next bio that's been > on the list. If it either can be merged now that we have the lock > and/or we optimize get_request_wait not to sleep in the fast path > we could get down to a single queue_lock roundtrip for each unplug. Does the following match your thinking? I'm trying to build a more concrete understanding... - We change the ->make_request_fn interface so that it takes a list of bios rather than a single bio - linked on ->bi_next. These bios must all have the same ->bi_bdev. They *might* be sorted by bi_sector (that needs to be decided). - generic_make_request currently queues bios if there is already an active request (this limits recursion). We enhance this to also queue requests when code calls blk_start_plug. In effect, generic_make_request becomes: if (current->plug) blk_add_to_plug(current->plug, bio); else { struct blk_plug plug; blk_start_plug(&plug); __generic_make_request(bio); blk_finish_plug(&plug); } - __generic_make_request would sort the list of bios by bi_bdev (and maybe bi_sector) and pass them along to the different ->make_request_fn functions. As there are likely to be only a few different bi_bdev values (often 1) but hopefully lots and lots of bios, it might be more efficient to do a linear bucket sort based on bi_bdev, and only sort those buckets on bi_sector if required. Then make_request_fn handlers can expect to get lots of bios at once, can optimise their handling as seems appropriate, and not require any further plugging. Is that at all close to what you are thinking? NeilBrown ^ permalink raw reply [flat|nested] 61+ messages in thread
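As a rough sketch of the sort step in that scheme -- assuming a hypothetical bi_list list_head in struct bio so the generic list_sort() helper can be reused (bios currently chain through bi_next) -- the comparator might look like:

static int plug_bio_cmp(void *priv, struct list_head *a, struct list_head *b)
{
	struct bio *ba = list_entry(a, struct bio, bi_list);	/* bi_list is invented */
	struct bio *bb = list_entry(b, struct bio, bi_list);

	/* bucket by device first, so each ->make_request_fn is called
	 * once per device with its whole batch */
	if (ba->bi_bdev != bb->bi_bdev)
		return ba->bi_bdev < bb->bi_bdev ? -1 : 1;

	/* within one device, order by start sector so on-stack merging
	 * only ever has to consider front merges */
	if (ba->bi_sector != bb->bi_sector)
		return ba->bi_sector < bb->bi_sector ? -1 : 1;
	return 0;
}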
* Re: [PATCH 05/10] block: remove per-queue plugging 2011-04-18 22:38 ` NeilBrown @ 2011-04-20 10:55 ` hch 0 siblings, 0 replies; 61+ messages in thread From: hch @ 2011-04-20 10:55 UTC (permalink / raw) To: NeilBrown Cc: hch@infradead.org, Jens Axboe, Mike Snitzer, linux-kernel@vger.kernel.org, dm-devel@redhat.com, linux-raid@vger.kernel.org On Tue, Apr 19, 2011 at 08:38:13AM +1000, NeilBrown wrote: > Is that at all close to what you are thinking? Yes, pretty much like that. ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [PATCH 05/10] block: remove per-queue plugging 2011-04-18 8:10 ` Jens Axboe 2011-04-18 8:33 ` NeilBrown @ 2011-04-18 9:19 ` hch 2011-04-18 9:40 ` [dm-devel] " Hannes Reinecke 2011-04-18 9:46 ` Jens Axboe 1 sibling, 2 replies; 61+ messages in thread From: hch @ 2011-04-18 9:19 UTC (permalink / raw) To: Jens Axboe Cc: NeilBrown, Mike Snitzer, linux-kernel@vger.kernel.org, hch@infradead.org, dm-devel@redhat.com, linux-raid@vger.kernel.org Btw, I'm really starting to wonder if the request level is the right place to do this on-stack plugging. Wouldn't it be better to just plug bios in the on-stack queue? That way we could also stop doing the special case merging when adding to the plug list, and leave all the merging / I/O schedule logic in the __make_request path. Probably not .39 material, but worth a prototype? Also what this discussion brought up is that the block layer data structures are highly confusing. Using a small subset of the request_queue also for make_request based drivers just doesn't make sense. It seems like we should try to migrate the required state to struct gendisk, and submit I/O through a block_device_ops.submit method, leaving the request_queue as an internal abstraction for the request based drivers. ^ permalink raw reply [flat|nested] 61+ messages in thread
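Nothing like this exists in the tree, but as a sketch of the direction being suggested (names invented here), the ops-based submission path could look like:

/* hypothetical addition to block_device_operations */
struct block_device_operations {
	int	(*open)(struct block_device *bdev, fmode_t mode);
	/* ... existing methods ... */
	void	(*submit)(struct gendisk *disk, struct bio *bio);	/* new */
};

/* a bio-based driver implements ->submit and never sees a
 * request_queue; request-based drivers keep theirs as an internal
 * detail behind a generic submit implementation */
static void md_submit(struct gendisk *disk, struct bio *bio)
{
	mddev_t *mddev = disk->private_data;

	mddev->pers->make_request(mddev, bio);
}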
* Re: [dm-devel] [PATCH 05/10] block: remove per-queue plugging 2011-04-18 9:19 ` hch @ 2011-04-18 9:40 ` Hannes Reinecke 2011-04-18 9:47 ` Jens Axboe 1 sibling, 1 reply; 61+ messages in thread From: Hannes Reinecke @ 2011-04-18 9:40 UTC (permalink / raw) To: device-mapper development Cc: hch@infradead.org, Jens Axboe, linux-raid@vger.kernel.org, Mike Snitzer, linux-kernel@vger.kernel.org, Alasdair G Kergon On 04/18/2011 11:19 AM, hch@infradead.org wrote: > Btw, I really start to wonder if the request level is the right place > to do this on-stack plugging. Wouldn't it be better to just plug > bios in the on-stack queue? That way we could also stop doing the > special case merging when adding to the plug list, and leave all the > merging / I/O schedule logic in the __make_request path. Probably > not .39 material, but worth a prototype? > > Also what this dicussion brought up is that the block layer data > structures are highly confusing. Using a small subset of the > request_queue also for make_request based driver just doesn't make > sense. It seems like we should try to migrate the required state > to struct gendisk, and submit I/O through a block_device_ops.submit > method, leaving the request_queue as an internal abstraction for > the request based drivers. > Good point. It would also help us with the device-mapper redesign agk and myself discussed at LSF. Having a block_device_ops.submit function would allow us to remap the actual request queue generically; and we would even be able to address more than one request queue, which sounds awfully similar to what Jens is trying to do ... Cheers, Hannes -- Dr. Hannes Reinecke zSeries & Storage hare@suse.de +49 911 74053 688 SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg GF: Markus Rex, HRB 16746 (AG Nürnberg) -- To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [dm-devel] [PATCH 05/10] block: remove per-queue plugging 2011-04-18 9:40 ` [dm-devel] " Hannes Reinecke @ 2011-04-18 9:47 ` Jens Axboe 0 siblings, 0 replies; 61+ messages in thread From: Jens Axboe @ 2011-04-18 9:47 UTC (permalink / raw) To: Hannes Reinecke Cc: device-mapper development, hch@infradead.org, linux-raid@vger.kernel.org, Mike Snitzer, linux-kernel@vger.kernel.org, Alasdair G Kergon On 2011-04-18 11:40, Hannes Reinecke wrote: > On 04/18/2011 11:19 AM, hch@infradead.org wrote: >> Btw, I really start to wonder if the request level is the right place >> to do this on-stack plugging. Wouldn't it be better to just plug >> bios in the on-stack queue? That way we could also stop doing the >> special case merging when adding to the plug list, and leave all the >> merging / I/O schedule logic in the __make_request path. Probably >> not .39 material, but worth a prototype? >> >> Also what this dicussion brought up is that the block layer data >> structures are highly confusing. Using a small subset of the >> request_queue also for make_request based driver just doesn't make >> sense. It seems like we should try to migrate the required state >> to struct gendisk, and submit I/O through a block_device_ops.submit >> method, leaving the request_queue as an internal abstraction for >> the request based drivers. >> > Good point. > It would also help us we the device-mapper redesign agk and myself > discussed at LSF. Having a block_device_ops.submit function would > allow us remap the actual request queue generically; and we would > even be able to address more than one request queue, which sounds > awfully similar to what Jens is trying to do ... The multiqueue bits would still have one request_queue, but multiple queueing structures (I called those blk_queue_ctx, iirc). -- Jens Axboe ^ permalink raw reply [flat|nested] 61+ messages in thread
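Purely as a sketch of that multiqueue direction (the names below are guesses; nothing like this is merged), one request_queue fronting several per-cpu queueing contexts might be shaped like:

/* one per-cpu (or per-submitter) software queueing context */
struct blk_queue_ctx {
	spinlock_t		lock;
	struct list_head	rq_list;	/* staged requests */
};

struct request_queue {
	/* ... existing fields ... */
	struct blk_queue_ctx __percpu	*queue_ctx;	/* speculative */
	unsigned int			nr_queues;
};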
* Re: [PATCH 05/10] block: remove per-queue plugging 2011-04-18 9:19 ` hch 2011-04-18 9:40 ` [dm-devel] " Hannes Reinecke @ 2011-04-18 9:46 ` Jens Axboe 1 sibling, 0 replies; 61+ messages in thread From: Jens Axboe @ 2011-04-18 9:46 UTC (permalink / raw) To: hch@infradead.org Cc: NeilBrown, Mike Snitzer, linux-kernel@vger.kernel.org, dm-devel@redhat.com, linux-raid@vger.kernel.org On 2011-04-18 11:19, hch@infradead.org wrote: > Btw, I really start to wonder if the request level is the right place > to do this on-stack plugging. Wouldn't it be better to just plug > bios in the on-stack queue? That way we could also stop doing the > special case merging when adding to the plug list, and leave all the > merging / I/O schedule logic in the __make_request path. Probably > not .39 material, but worth a prototype? > > Also what this dicussion brought up is that the block layer data > structures are highly confusing. Using a small subset of the > request_queue also for make_request based driver just doesn't make > sense. It seems like we should try to migrate the required state > to struct gendisk, and submit I/O through a block_device_ops.submit > method, leaving the request_queue as an internal abstraction for > the request based drivers. Partially agree, I've never really liked the two methods we have where the light version was originally meant for stacked devices but gets used elsewhere now too. It also causes IO scheduling problems, and then you get things like request based dm to work around that. But the idea is really to move more towards private, more localized queueing; the multiqueue setup will really apply well there too. I'm trying to flesh out the design of that, ideally it would be nice to unify the different bits we have now. But agree on pulling the stacked bits into some lower part, like the gendisk. It would clean that up nicely. -- Jens Axboe ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [PATCH 05/10] block: remove per-queue plugging 2011-04-11 10:59 ` NeilBrown 2011-04-11 11:04 ` Jens Axboe @ 2011-04-11 11:55 ` NeilBrown 2011-04-11 12:12 ` Jens Axboe 1 sibling, 1 reply; 61+ messages in thread From: NeilBrown @ 2011-04-11 11:55 UTC (permalink / raw) To: NeilBrown Cc: Jens Axboe, Mike Snitzer, linux-kernel@vger.kernel.org, hch@infradead.org, dm-devel@redhat.com, linux-raid@vger.kernel.org On Mon, 11 Apr 2011 20:59:28 +1000 NeilBrown <neilb@suse.de> wrote: > On Mon, 11 Apr 2011 11:19:58 +0200 Jens Axboe <jaxboe@fusionio.com> wrote: > > > On 2011-04-11 06:50, NeilBrown wrote: > > > > The only explanation I can come up with is that very occasionally schedule on > > > 2 separate cpus calls blk_flush_plug for the same task. I don't understand > > > the scheduler nearly well enough to know if or how that can happen. > > > However with this patch in place I can write to a RAID1 constantly for half > > > an hour, and without it, the write rarely lasts for 3 minutes. > > > > Or perhaps if the request_fn blocks, that would be problematic. So the > > patch is likely a good idea even for that case. > > > > I'll merge it, changing it to list_splice_init() as I think that would > > be more clear. > > OK - though I'm not 100% the patch fixes the problem - just that it hides the > symptom for me. > I might try instrumenting the code a bit more and see if I can find exactly > where it is re-entering flush_plug_list - as that seems to be what is > happening. OK, I found how it re-enters. The request_fn doesn't exactly block, but when scsi_request_fn calls spin_unlock_irq, this calls preempt_enable which can call schedule, which is a recursive call. The patch I provided will stop that from recursing again as the blk_plug.list will be empty. So it is almost what you suggested, however the request_fn doesn't block, it just enables preempt. So the comment I would put at the top of that patch would be something like: From: NeilBrown <neilb@suse.de> As request_fn called by __blk_run_queue is allowed to 'schedule()' (after dropping the queue lock of course), it is possible to get a recursive call: schedule -> blk_flush_plug -> __blk_finish_plug -> flush_plug_list -> __blk_run_queue -> request_fn -> schedule We must make sure that the second schedule does not call into blk_flush_plug again. So instead of leaving the list of requests on blk_plug->list, move them to a separate list leaving blk_plug->list empty. Signed-off-by: NeilBrown <neilb@suse.de> Thanks, NeilBrown ^ permalink raw reply [flat|nested] 61+ messages in thread
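Condensed, the re-entry path traced above looks like this (a sketch, assuming CONFIG_PREEMPT: spin_unlock_irq() re-enables interrupts and preemption, and a pending reschedule then runs schedule() on the spot):

/*
 *   schedule()
 *     blk_flush_plug(current)
 *       __blk_finish_plug() -> flush_plug_list()
 *         __blk_run_queue(q)
 *           scsi_request_fn()
 *             spin_unlock_irq(q->queue_lock)
 *               -> preempt_enable()
 *                  -> preempt_schedule() -> schedule()	[re-entry]
 *
 * With the requests spliced onto a local list head first, the inner
 * blk_flush_plug() finds plug->list empty and returns immediately,
 * so the recursion terminates harmlessly.
 */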
* Re: [PATCH 05/10] block: remove per-queue plugging 2011-04-11 11:55 ` NeilBrown @ 2011-04-11 12:12 ` Jens Axboe 2011-04-11 22:58 ` hch 0 siblings, 1 reply; 61+ messages in thread From: Jens Axboe @ 2011-04-11 12:12 UTC (permalink / raw) To: NeilBrown Cc: Mike Snitzer, linux-kernel@vger.kernel.org, hch@infradead.org, dm-devel@redhat.com, linux-raid@vger.kernel.org On 2011-04-11 13:55, NeilBrown wrote: > On Mon, 11 Apr 2011 20:59:28 +1000 NeilBrown <neilb@suse.de> wrote: > >> On Mon, 11 Apr 2011 11:19:58 +0200 Jens Axboe <jaxboe@fusionio.com> wrote: >> >>> On 2011-04-11 06:50, NeilBrown wrote: >> >>>> The only explanation I can come up with is that very occasionally schedule on >>>> 2 separate cpus calls blk_flush_plug for the same task. I don't understand >>>> the scheduler nearly well enough to know if or how that can happen. >>>> However with this patch in place I can write to a RAID1 constantly for half >>>> an hour, and without it, the write rarely lasts for 3 minutes. >>> >>> Or perhaps if the request_fn blocks, that would be problematic. So the >>> patch is likely a good idea even for that case. >>> >>> I'll merge it, changing it to list_splice_init() as I think that would >>> be more clear. >> >> OK - though I'm not 100% the patch fixes the problem - just that it hides the >> symptom for me. >> I might try instrumenting the code a bit more and see if I can find exactly >> where it is re-entering flush_plug_list - as that seems to be what is >> happening. > > OK, I found how it re-enters. > > The request_fn doesn't exactly block, but when scsi_request_fn calls > spin_unlock_irq, this calls preempt_enable which can call schedule, which is > a recursive call. > > The patch I provided will stop that from recursing again as the blk_plug.list > will be empty. > > So it is almost what you suggested, however the request_fn doesn't block, it > just enabled preempt. > > > So the comment I would put at the top of that patch would be something like: Ah, so it was pretty close. That does explain it. I've already queued up the patch, I'll amend the commit message. -- Jens Axboe ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [PATCH 05/10] block: remove per-queue plugging 2011-04-11 12:12 ` Jens Axboe @ 2011-04-11 22:58 ` hch 2011-04-12 6:20 ` Jens Axboe 0 siblings, 1 reply; 61+ messages in thread From: hch @ 2011-04-11 22:58 UTC (permalink / raw) To: Jens Axboe Cc: NeilBrown, Mike Snitzer, linux-kernel@vger.kernel.org, hch@infradead.org, dm-devel@redhat.com, linux-raid@vger.kernel.org Looking at the patch (http://git.kernel.dk/?p=linux-2.6-block.git;a=commitdiff;h=761e433f3de6fb8e369af9e5c08beb86286d023f) I'm not sure it's an optimal design. The flush callback really is a per-queue thing. Why isn't it a function pointer in the request queue when doing the blk_run_queue call once we're done with a given queue before moving on to the next one? ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [PATCH 05/10] block: remove per-queue plugging 2011-04-11 22:58 ` hch @ 2011-04-12 6:20 ` Jens Axboe 0 siblings, 0 replies; 61+ messages in thread From: Jens Axboe @ 2011-04-12 6:20 UTC (permalink / raw) To: hch@infradead.org Cc: NeilBrown, Mike Snitzer, linux-kernel@vger.kernel.org, dm-devel@redhat.com, linux-raid@vger.kernel.org On 2011-04-12 00:58, hch@infradead.org wrote: > Looking at the patch > (http://git.kernel.dk/?p=linux-2.6-block.git;a=commitdiff;h=761e433f3de6fb8e369af9e5c08beb86286d023f) > > I'm not sure it's an optimal design. The flush callback really > is a per-queue thing. Why isn't it a function pointer in the request > queue when doing the blk_run_queue call once we're done with a given > queue before moving on to the next one? I was thinking about this yesterday as well, the design didn't quite feel just right. Additionally the user now must track this state too, and whether he's plugged on that task or not. I'll rewrite this. -- Jens Axboe ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [PATCH 05/10] block: remove per-queue plugging 2011-04-11 4:50 ` [PATCH 05/10] block: remove per-queue plugging NeilBrown 2011-04-11 9:19 ` Jens Axboe @ 2011-04-11 16:59 ` hch 2011-04-11 21:14 ` NeilBrown 1 sibling, 1 reply; 61+ messages in thread From: hch @ 2011-04-11 16:59 UTC (permalink / raw) To: NeilBrown Cc: Mike Snitzer, Jens Axboe, linux-kernel@vger.kernel.org, hch@infradead.org, dm-devel, linux-raid On Mon, Apr 11, 2011 at 02:50:22PM +1000, NeilBrown wrote: > diff --git a/block/blk-core.c b/block/blk-core.c > index 273d60b..903ce8d 100644 > --- a/block/blk-core.c > +++ b/block/blk-core.c > @@ -2674,19 +2674,23 @@ static void flush_plug_list(struct blk_plug *plug) > struct request_queue *q; > unsigned long flags; > struct request *rq; > + struct list_head head; > > BUG_ON(plug->magic != PLUG_MAGIC); > > if (list_empty(&plug->list)) > return; > + list_add(&head, &plug->list); > + list_del_init(&plug->list); > > if (plug->should_sort) > - list_sort(NULL, &plug->list, plug_rq_cmp); > + list_sort(NULL, &head, plug_rq_cmp); > + plug->should_sort = 0; As Jens mentioned this should be list_splice_init. But looking over flush_plug_list the code there seems strange to me. What does the local_irq_save in flush_plug_list protect? Why don't we need it over the list_sort? And do we still need it when first splicing the list to a local one? It's one of these cases where I'd really like to see more comments explaining why the code is doing what it's doing. ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [PATCH 05/10] block: remove per-queue plugging 2011-04-11 16:59 ` hch @ 2011-04-11 21:14 ` NeilBrown 2011-04-11 22:59 ` hch 2011-04-12 6:18 ` Jens Axboe 0 siblings, 2 replies; 61+ messages in thread From: NeilBrown @ 2011-04-11 21:14 UTC (permalink / raw) To: hch@infradead.org Cc: Mike Snitzer, Jens Axboe, linux-kernel@vger.kernel.org, dm-devel, linux-raid On Mon, 11 Apr 2011 12:59:23 -0400 "hch@infradead.org" <hch@infradead.org> wrote: > On Mon, Apr 11, 2011 at 02:50:22PM +1000, NeilBrown wrote: > > diff --git a/block/blk-core.c b/block/blk-core.c > > index 273d60b..903ce8d 100644 > > --- a/block/blk-core.c > > +++ b/block/blk-core.c > > @@ -2674,19 +2674,23 @@ static void flush_plug_list(struct blk_plug *plug) > > struct request_queue *q; > > unsigned long flags; > > struct request *rq; > > + struct list_head head; > > > > BUG_ON(plug->magic != PLUG_MAGIC); > > > > if (list_empty(&plug->list)) > > return; > > + list_add(&head, &plug->list); > > + list_del_init(&plug->list); > > > > if (plug->should_sort) > > - list_sort(NULL, &plug->list, plug_rq_cmp); > > + list_sort(NULL, &head, plug_rq_cmp); > > + plug->should_sort = 0; > > As Jens mentioned this should be list_splice_init. But looking over > flush_plug_list the code there seems strange to me. > > What does the local_irq_save in flush_plug_list protect? Why don't > we need it over the list_sort? And do we still need it when first > splicing the list to a local one? > > It's one of these cases where I'd really like to see more comments > explaining why the code is doing what it's doing. My understanding of that was that the calling requirement of __elv_add_request is that the queue spinlock is held and that interrupts are disabled. So rather than possibly enabling and disabling interrupts several times as different queues are handled, the code just disables interrupts once, and then just takes the spinlock once for each different queue. The whole point of the change to plugging was to take locks less often. Disabling interrupts less often is presumably an analogous goal. Though I agree that a comment would help. q = NULL; + /* Disable interrupts just once rather than using spin_lock_irq/spin_unlock_irq * variants */ local_irq_save(flags); assuming my analysis is correct. NeilBrown ^ permalink raw reply [flat|nested] 61+ messages in thread
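Condensed to its locking skeleton (sort, depth accounting and flush-request special cases elided), the loop in question does roughly the following, which is why only the single outer irq disable is needed:

	local_irq_save(flags);		/* interrupts off once, up front */
	q = NULL;
	while (!list_empty(&list)) {
		rq = list_entry_rq(list.next);
		list_del_init(&rq->queuelist);
		if (rq->q != q) {	/* queue switch: plain spinlocks, no */
			if (q)		/* per-queue irq flag juggling needed */
				spin_unlock(q->queue_lock);
			q = rq->q;
			spin_lock(q->queue_lock);
		}
		__elv_add_request(q, rq, ELEVATOR_INSERT_SORT_MERGE);
	}
	if (q)
		spin_unlock(q->queue_lock);
	local_irq_restore(flags);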
* Re: [PATCH 05/10] block: remove per-queue plugging 2011-04-11 21:14 ` NeilBrown @ 2011-04-11 22:59 ` hch 0 siblings, 0 replies; 61+ messages in thread From: hch @ 2011-04-11 22:59 UTC (permalink / raw) To: NeilBrown Cc: hch@infradead.org, Mike Snitzer, Jens Axboe, linux-kernel@vger.kernel.org, dm-devel, linux-raid On Tue, Apr 12, 2011 at 07:14:28AM +1000, NeilBrown wrote: > > My understanding of that was that the calling requirement of > __elv_add_request is that the queue spinlock is held and that interrupts are > disabled. > So rather than possible enabling and disabling interrupts several times as > different queue are handled, the code just disabled interrupts once, and > then just take the spinlock once for each different queue. > > The whole point of the change to plugging was to take locks less often. > Disabling interrupts less often is presumably an analogous goal. > > Though I agree that a comment would help. > > q = NULL; > + /* Disable interrupts just once rather than using spin_lock_irq/sin_unlock_irq > * variants > */ > local_irq_save(flags); > > > assuming my analysis is correct. Your explanation does make sense to me now that you explain it. I didn't even think of that variant before. ^ permalink raw reply [flat|nested] 61+ messages in thread
* Re: [PATCH 05/10] block: remove per-queue plugging 2011-04-11 21:14 ` NeilBrown 2011-04-11 22:59 ` hch @ 2011-04-12 6:18 ` Jens Axboe 1 sibling, 0 replies; 61+ messages in thread From: Jens Axboe @ 2011-04-12 6:18 UTC (permalink / raw) To: NeilBrown Cc: hch@infradead.org, Mike Snitzer, linux-kernel@vger.kernel.org, dm-devel@redhat.com, linux-raid@vger.kernel.org On 2011-04-11 23:14, NeilBrown wrote: > On Mon, 11 Apr 2011 12:59:23 -0400 "hch@infradead.org" <hch@infradead.org> > wrote: > >> On Mon, Apr 11, 2011 at 02:50:22PM +1000, NeilBrown wrote: >>> diff --git a/block/blk-core.c b/block/blk-core.c >>> index 273d60b..903ce8d 100644 >>> --- a/block/blk-core.c >>> +++ b/block/blk-core.c >>> @@ -2674,19 +2674,23 @@ static void flush_plug_list(struct blk_plug *plug) >>> struct request_queue *q; >>> unsigned long flags; >>> struct request *rq; >>> + struct list_head head; >>> >>> BUG_ON(plug->magic != PLUG_MAGIC); >>> >>> if (list_empty(&plug->list)) >>> return; >>> + list_add(&head, &plug->list); >>> + list_del_init(&plug->list); >>> >>> if (plug->should_sort) >>> - list_sort(NULL, &plug->list, plug_rq_cmp); >>> + list_sort(NULL, &head, plug_rq_cmp); >>> + plug->should_sort = 0; >> >> As Jens mentioned this should be list_splice_init. But looking over >> flush_plug_list the code there seems strange to me. >> >> What does the local_irq_save in flush_plug_list protect? Why don't >> we need it over the list_sort? And do we still need it when first >> splicing the list to a local one? >> >> It's one of these cases where I'd really like to see more comments >> explaining why the code is doing what it's doing. > > My understanding of that was that the calling requirement of > __elv_add_request is that the queue spinlock is held and that interrupts are > disabled. > So rather than possible enabling and disabling interrupts several times as > different queue are handled, the code just disabled interrupts once, and > then just take the spinlock once for each different queue. > > The whole point of the change to plugging was to take locks less often. > Disabling interrupts less often is presumably an analogous goal. > > Though I agree that a comment would help. > > q = NULL; > + /* Disable interrupts just once rather than using spin_lock_irq/sin_unlock_irq > * variants > */ > local_irq_save(flags); > > > assuming my analysis is correct. Yep that is correct, it's to avoid juggling irq on and off for multiple queues. I will put a comment there. -- Jens Axboe ^ permalink raw reply [flat|nested] 61+ messages in thread