public inbox for linux-kernel@vger.kernel.org
* Re: IO scheduler, queue depth, nr_requests
       [not found]   ` <20040217145716.GE30438@traveler.cistron.net>
@ 2004-02-18 23:52     ` Miquel van Smoorenburg
  2004-02-19  1:24       ` Nick Piggin
  2004-02-19  1:26       ` Andrew Morton
  0 siblings, 2 replies; 29+ messages in thread
From: Miquel van Smoorenburg @ 2004-02-18 23:52 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Miquel van Smoorenburg, linux-lvm, linux-kernel, Joe Thornber

On Tue, 17 Feb 2004 15:57:16, Miquel van Smoorenburg wrote:
> For some reason, when using LVM, write requests get queued out
> of order to the 3ware controller, which results in quite a bit
> of seeking and thus performance loss.
[..]
> Okay I repeated some earlier tests, and I added some debug code in
> several places.
> 
> I added logging to tw_scsi_queue() in the 3ware driver to log the
> start sector and length of each request. It logs something like:
> 3wdbg: id 119, lba = 0x2330bc33, num_sectors = 256
> 
> With a perl script, I can check if the requests are sent to the
> host in order. That outputs something like this:
> 
> Consecutive: start 1180906348, length 7936 sec (3968 KB), requests: 31
> Consecutive: start 1180906340, length 8 sec (4 KB), requests: 1
> Consecutive: start 1180914292, length 7936 sec (3968 KB), requests: 31
> Consecutive: start 1180914284, length 8 sec (4 KB), requests: 1
> Consecutive: start 1180922236, length 7936 sec (3968 KB), requests: 31
> Consecutive: start 1180922228, length 8 sec (4 KB), requests: 1
> Consecutive: start 1180930180, length 7936 sec (3968 KB), requests: 31
> 
> See, 31 requests in order, then one request "backwards", then 31 in order, etc.

I found out what causes this. It's get_request_wait().

When the request queue is full, and a new request needs to be created,
__make_request() blocks in get_request_wait().

Another process wakes up first (pdflush / process submitting I/O itself /
xfsdatad / etc) and sends the next bio's to __make_request().
In the mean time some free requests have become available, and the bios
are merged into a new request. Those requests are submitted to the device.

Then, get_request_wait() returns but the bio is not mergeable anymore -
and that results in a backwards seek, severely limiting the I/O rate.

Wouldn't it be better to allow the request allocation and queue the
request, and /then/ put the process to sleep ? The queue will grow larger
than nr_requests, but it does that anyway.

Mike.


* Re: IO scheduler, queue depth, nr_requests
  2004-02-18 23:52     ` IO scheduler, queue depth, nr_requests Miquel van Smoorenburg
@ 2004-02-19  1:24       ` Nick Piggin
  2004-02-19  1:52         ` Miquel van Smoorenburg
  2004-02-19  1:26       ` Andrew Morton
  1 sibling, 1 reply; 29+ messages in thread
From: Nick Piggin @ 2004-02-19  1:24 UTC (permalink / raw)
  To: Miquel van Smoorenburg; +Cc: Jens Axboe, linux-lvm, linux-kernel, Joe Thornber



Miquel van Smoorenburg wrote:

>On Tue, 17 Feb 2004 15:57:16, Miquel van Smoorenburg wrote:
>
>>For some reason, when using LVM, write requests get queued out
>>of order to the 3ware controller, which results in quite a bit
>>of seeking and thus performance loss.
>>
>[..]
>
>>Okay I repeated some earlier tests, and I added some debug code in
>>several places.
>>
>>I added logging to tw_scsi_queue() in the 3ware driver to log the
>>start sector and length of each request. It logs something like:
>>3wdbg: id 119, lba = 0x2330bc33, num_sectors = 256
>>
>>With a perl script, I can check if the requests are sent to the
>>host in order. That outputs something like this:
>>
>>Consecutive: start 1180906348, length 7936 sec (3968 KB), requests: 31
>>Consecutive: start 1180906340, length 8 sec (4 KB), requests: 1
>>Consecutive: start 1180914292, length 7936 sec (3968 KB), requests: 31
>>Consecutive: start 1180914284, length 8 sec (4 KB), requests: 1
>>Consecutive: start 1180922236, length 7936 sec (3968 KB), requests: 31
>>Consecutive: start 1180922228, length 8 sec (4 KB), requests: 1
>>Consecutive: start 1180930180, length 7936 sec (3968 KB), requests: 31
>>
>>See, 31 requests in order, then one request "backwards", then 31 in order, etc.
>>
>
>I found out what causes this. It's get_request_wait().
>
>When the request queue is full, and a new request needs to be created,
>__make_request() blocks in get_request_wait().
>
>Another process wakes up first (pdflush / process submitting I/O itself /
>xfsdatad / etc) and sends the next bio's to __make_request().
>In the mean time some free requests have become available, and the bios
>are merged into a new request. Those requests are submitted to the device.
>
>Then, get_request_wait() returns but the bio is not mergeable anymore -
>and that results in a backwards seek, severely limiting the I/O rate.
>
>Wouldn't it be better to allow the request allocation and queue the
>request, and /then/ put the process to sleep ? The queue will grow larger
>than nr_requests, but it does that anyway.
>
>

The "batching" logic there should allow a process to submit
a number of requests even above the nr_requests limit to
prevent this interleave and context switching.

Are you using tagged command queueing? What depth?



* Re: IO scheduler, queue depth, nr_requests
  2004-02-18 23:52     ` IO scheduler, queue depth, nr_requests Miquel van Smoorenburg
  2004-02-19  1:24       ` Nick Piggin
@ 2004-02-19  1:26       ` Andrew Morton
  2004-02-19  2:11         ` Miquel van Smoorenburg
  1 sibling, 1 reply; 29+ messages in thread
From: Andrew Morton @ 2004-02-19  1:26 UTC (permalink / raw)
  To: Miquel van Smoorenburg; +Cc: axboe, miquels, linux-lvm, linux-kernel, thornber

Miquel van Smoorenburg <miquels@cistron.nl> wrote:
>
> On Tue, 17 Feb 2004 15:57:16, Miquel van Smoorenburg wrote:
> > For some reason, when using LVM, write requests get queued out
> > of order to the 3ware controller, which results in quite a bit
> > of seeking and thus performance loss.
> [..]
> > Okay I repeated some earlier tests, and I added some debug code in
> > several places.
> > 
> > I added logging to tw_scsi_queue() in the 3ware driver to log the
> > start sector and length of each request. It logs something like:
> > 3wdbg: id 119, lba = 0x2330bc33, num_sectors = 256
> > 
> > With a perl script, I can check if the requests are sent to the
> > host in order. That outputs something like this:
> > 
> > Consecutive: start 1180906348, length 7936 sec (3968 KB), requests: 31
> > Consecutive: start 1180906340, length 8 sec (4 KB), requests: 1
> > Consecutive: start 1180914292, length 7936 sec (3968 KB), requests: 31
> > Consecutive: start 1180914284, length 8 sec (4 KB), requests: 1
> > Consecutive: start 1180922236, length 7936 sec (3968 KB), requests: 31
> > Consecutive: start 1180922228, length 8 sec (4 KB), requests: 1
> > Consecutive: start 1180930180, length 7936 sec (3968 KB), requests: 31

3968 / 31 = 128 exactly.  I think when you say "requests: 31" you are
actually referring to a single request which has 31 128k BIOs in it, yes?

> > 
> > See, 31 requests in order, then one request "backwards", then 31 in order, etc.
> 
> I found out what causes this. It's get_request_wait().
> 
> When the request queue is full, and a new request needs to be created,
> __make_request() blocks in get_request_wait().
> 
> Another process wakes up first (pdflush / process submitting I/O itself /
> xfsdatad / etc) and sends the next bio's to __make_request().
> In the mean time some free requests have become available, and the bios
> are merged into a new request. Those requests are submitted to the device.
> 
> Then, get_request_wait() returns but the bio is not mergeable anymore -
> and that results in a backwards seek, severely limiting the I/O rate.
> 
> Wouldn't it be better to allow the request allocation and queue the
> request, and /then/ put the process to sleep ? The queue will grow larger
> than nr_requests, but it does that anyway.

That would help, but is a bit kludgy.

What _should_ have happened was:

a) The queue gets plugged

b) The maximal-sized 31-bio request is queued

c) The 4k request gets inserted *before* the 3968k request

d) The queue gets unplugged, and both requests stream smoothly.

(And this assumes that the queue was initially empty.  Almost certainly
that was not the case).

So the question is, why did the little, later 4k request not get itself
inserted in front of the earlier, large 3968k request?

Are you using md as well?  If so, try removing the blk_run_queues()
statements from md_thread() and md_do_sync().  And generally hunt around
and try nuking blk_run_queues() statements from the device driver/md/dm
layer - they may well be wrong.

Is this problem specific to md?  Can it be reproduced on a boring-old-disk?

Does elevator=deadline change anything?




* Re: IO scheduler, queue depth, nr_requests
  2004-02-19  1:24       ` Nick Piggin
@ 2004-02-19  1:52         ` Miquel van Smoorenburg
  2004-02-19  2:01           ` Nick Piggin
  0 siblings, 1 reply; 29+ messages in thread
From: Miquel van Smoorenburg @ 2004-02-19  1:52 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Miquel van Smoorenburg, Jens Axboe, linux-lvm, linux-kernel,
	Joe Thornber

On Thu, 19 Feb 2004 02:24:31, Nick Piggin wrote:
> Miquel van Smoorenburg wrote:
>
> >I found out what causes this. It's get_request_wait().
> >
> >When the request queue is full, and a new request needs to be created,
> >__make_request() blocks in get_request_wait().
> >
> >Another process wakes up first (pdflush / process submitting I/O itself /
> >xfsdatad / etc) and sends the next bio's to __make_request().
> >In the mean time some free requests have become available, and the bios
> >are merged into a new request. Those requests are submitted to the device.
> >
> >Then, get_request_wait() returns but the bio is not mergeable anymore -
> >and that results in a backwards seek, severely limiting the I/O rate.
> >
> 
> The "batching" logic there should allow a process to submit
> a number of requests even above the nr_requests limit to
> prevent this interleave and context switching.
> 
> Are you using tagged command queueing? What depth?

No, I'm not using tagged command queueing. The 3ware controller is not a
real scsi controller, the driver just emulates one. It's a raid5 controller
that drives SATA disks. It has an internal request queue ("can_queue")
of 254 outstanding commands. Because that is way bigger than nr_requests
this happens - if I set nr_requests to 512, the problem goes away. But
that shouldn't happen ;)

I'm preparing a proof-of-concept patch now, if it works and I don't wedge
the remote machine I'm testing this on I'll post it in a few minutes.

Mike.


* Re: IO scheduler, queue depth, nr_requests
  2004-02-19  1:52         ` Miquel van Smoorenburg
@ 2004-02-19  2:01           ` Nick Piggin
  0 siblings, 0 replies; 29+ messages in thread
From: Nick Piggin @ 2004-02-19  2:01 UTC (permalink / raw)
  To: Miquel van Smoorenburg; +Cc: Jens Axboe, linux-lvm, linux-kernel, Joe Thornber



Miquel van Smoorenburg wrote:

>On Thu, 19 Feb 2004 02:24:31, Nick Piggin wrote:
>
>>Miquel van Smoorenburg wrote:
>>
>>
>>>I found out what causes this. It's get_request_wait().
>>>
>>>When the request queue is full, and a new request needs to be created,
>>>__make_request() blocks in get_request_wait().
>>>
>>>Another process wakes up first (pdflush / process submitting I/O itself /
>>>xfsdatad / etc) and sends the next bio's to __make_request().
>>>In the mean time some free requests have become available, and the bios
>>>are merged into a new request. Those requests are submitted to the device.
>>>
>>>Then, get_request_wait() returns but the bio is not mergeable anymore -
>>>and that results in a backwards seek, severely limiting the I/O rate.
>>>
>>>
>>The "batching" logic there should allow a process to submit
>>a number of requests even above the nr_requests limit to
>>prevent this interleave and context switching.
>>
>>Are you using tagged command queueing? What depth?
>>
>
>No, I'm not using tagged command queueing. The 3ware controller is not a
>real scsi controller, the driver just emulates one. It's a raid5 controller
>that drives SATA disks. It has an internal request queue ("can_queue")
>of 254 outstanding commands.
>

This is what I mean by tagged command queueing.

> Because that is way bigger than nr_requests
>this happens - if I set nr_requests to 512, the problem goes away. But
>that shouldn't happen ;)
>
>

What shouldn't happen?

>I'm preparing a proof-of-concept patch now, if it works and I don't wedge
>the remote machine I'm testing this on I'll post it in a few minutes.
>
>

I'm not very happy with forcing a process to sleep _after_ it
has submitted a request... but I'd be interested to see exactly
what your patch does.

By far the best option is to use appropriately sized queues.
The below patch is a start, but it unfortunately doesn't help
drivers which use private queueing implementations.

http://www.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.3/2.6.3-mm1/broken-out/scale-nr_requests.patch



* Re: IO scheduler, queue depth, nr_requests
  2004-02-19  1:26       ` Andrew Morton
@ 2004-02-19  2:11         ` Miquel van Smoorenburg
  2004-02-19  2:26           ` Andrew Morton
  2004-02-19  2:51           ` IO scheduler, queue depth, nr_requests Nick Piggin
  0 siblings, 2 replies; 29+ messages in thread
From: Miquel van Smoorenburg @ 2004-02-19  2:11 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Miquel van Smoorenburg, axboe, miquels, linux-lvm, linux-kernel,
	thornber

On Thu, 19 Feb 2004 02:26:22, Andrew Morton wrote:
> Miquel van Smoorenburg <miquels@cistron.nl> wrote:
> >
> > On Tue, 17 Feb 2004 15:57:16, Miquel van Smoorenburg wrote:
> > > For some reason, when using LVM, write requests get queued out
> > > of order to the 3ware controller, which results in quite a bit
> > > of seeking and thus performance loss.
> > [..]
> > > Okay I repeated some earlier tests, and I added some debug code in
> > > several places.
> > > 
> > > I added logging to tw_scsi_queue() in the 3ware driver to log the
> > > start sector and length of each request. It logs something like:
> > > 3wdbg: id 119, lba = 0x2330bc33, num_sectors = 256
> > > 
> > > With a perl script, I can check if the requests are sent to the
> > > host in order. That outputs something like this:
> > > 
> > > Consecutive: start 1180906348, length 7936 sec (3968 KB), requests: 31
> > > Consecutive: start 1180906340, length 8 sec (4 KB), requests: 1
> > > Consecutive: start 1180914292, length 7936 sec (3968 KB), requests: 31
> > > Consecutive: start 1180914284, length 8 sec (4 KB), requests: 1
> > > Consecutive: start 1180922236, length 7936 sec (3968 KB), requests: 31
> > > Consecutive: start 1180922228, length 8 sec (4 KB), requests: 1
> > > Consecutive: start 1180930180, length 7936 sec (3968 KB), requests: 31
> 
> 3968 / 31 = 128 exactly.  I think when you say "requests: 31" you are
> actually referring to a single request which has 31 128k BIOs in it, yes?

No, I'm actually referring to a struct request. I'm logging this in the
SCSI layer, in scsi_request_fn(), just after elv_next_request(). I have
in fact logged all the bio's submitted to __make_request, and the output
of the elevator from elv_next_request(). The bio's are submitted sequentially,
the resulting requests aren't. But this is because nr_requests is 128, while
the 3ware device has a queue of 254 entries (no tagging though). Upping
nr_requests to 512 makes this go away ..

That shouldn't be necessary though. I only see this with LVM over 3ware-raid5,
not on the 3ware-raid5 array directly (/dev/sda1). And it gets less troublesome
with a lot of debugging (unless I set nr_requests lower again), which points
to a timing issue.


> > > See, 31 requests in order, then one request "backwards", then 31 in order, etc.
> > 
> > I found out what causes this. It's get_request_wait().
> > 
> > When the request queue is full, and a new request needs to be created,
> > __make_request() blocks in get_request_wait().
> > 
> > Another process wakes up first (pdflush / process submitting I/O itself /
> > xfsdatad / etc) and sends the next bio's to __make_request().
> > In the mean time some free requests have become available, and the bios
> > are merged into a new request. Those requests are submitted to the device.
> > 
> > Then, get_request_wait() returns but the bio is not mergeable anymore -
> > and that results in a backwards seek, severely limiting the I/O rate.
> > 
> > Wouldn't it be better to allow the request allocation and queue the
> > request, and /then/ put the process to sleep ? The queue will grow larger
> > than nr_requests, but it does that anyway.
> 
> That would help, but is a bit kludgy.

Patch that does exactly this (it's probably butt-ugly, but it proves the
point) below.

> What _should_ have happened was:
> 
> a) The queue gets plugged
> 
> b) The maximal-sized 31-bio request is queued
> 
> c) The 4k request gets inserted *before* the 3968k request
> 
> d) The queue gets unplugged, and both requests stream smoothly.
> 
> (And this assumes that the queue was initially empty.  Almost certainly
> that was not the case).

The thing is, the bio's are submitted perfectly sequentially. It's just that
every so often a request (with just its initial bio) gets stuck for a while.
Because the bio's after it are merged and sent to the device, it's not
possible to merge that stuck request later on when it gets unstuck, because
the other bio's have already left the building so to speak.

I'm not using MD at all, and I tried all elevators. All the same results,
so I stayed with elevator=noop because it's so much easier to understand ;)

This proof-of-concept patch fixes things:

ll_rw_blk.c.patch

(Hmm, 03:11, time to go to bed. I'm happy I finally found this though!)

--- linux-2.6.3/drivers/block/ll_rw_blk.c.ORIG	2004-02-04 04:43:10.000000000 +0100
+++ linux-2.6.3/drivers/block/ll_rw_blk.c	2004-02-19 02:51:06.000000000 +0100
@@ -1545,7 +1545,8 @@
 /*
  * Get a free request, queue_lock must not be held
  */
-static struct request *get_request(request_queue_t *q, int rw, int gfp_mask)
+static struct request *get_request(request_queue_t *q, int rw,
+					int gfp_mask, int *qfull)
 {
 	struct request *rq = NULL;
 	struct request_list *rl = &q->rq;
@@ -1569,10 +1570,15 @@
 			&& !ioc_batching(ioc) && !elv_may_queue(q, rw)) {
 		/*
 		 * The queue is full and the allocating process is not a
-		 * "batcher", and not exempted by the IO scheduler
+		 * "batcher", and not exempted by the IO scheduler.
+		 * If "qfull" is a valid pointer, set it to 1 to return
+		 * this info to the caller so that it can sleep.
 		 */
-		spin_unlock_irq(q->queue_lock);
-		goto out;
+		if (qfull == NULL) {
+			spin_unlock_irq(q->queue_lock);
+			goto out;
+		} else
+			*qfull = 1;
 	}
 
 	rl->count[rw]++;
@@ -1627,7 +1633,7 @@
  * No available requests for this queue, unplug the device and wait for some
  * requests to become available.
  */
-static struct request *get_request_wait(request_queue_t *q, int rw)
+static struct request *get_request_wait(request_queue_t *q, int rw, int wantreq)
 {
 	DEFINE_WAIT(wait);
 	struct request *rq;
@@ -1639,7 +1645,7 @@
 		prepare_to_wait_exclusive(&rl->wait[rw], &wait,
 				TASK_UNINTERRUPTIBLE);
 
-		rq = get_request(q, rw, GFP_NOIO);
+		rq = wantreq ? get_request(q, rw, GFP_NOIO, NULL) : NULL;
 
 		if (!rq) {
 			struct io_context *ioc;
@@ -1657,7 +1663,7 @@
 			put_io_context(ioc);
 		}
 		finish_wait(&rl->wait[rw], &wait);
-	} while (!rq);
+	} while (!rq && wantreq);
 
 	return rq;
 }
@@ -1669,9 +1675,9 @@
 	BUG_ON(rw != READ && rw != WRITE);
 
 	if (gfp_mask & __GFP_WAIT)
-		rq = get_request_wait(q, rw);
+		rq = get_request_wait(q, rw, 1);
 	else
-		rq = get_request(q, rw, gfp_mask);
+		rq = get_request(q, rw, gfp_mask, NULL);
 
 	return rq;
 }
@@ -2002,6 +2008,7 @@
 	struct request *req, *freereq = NULL;
 	int el_ret, rw, nr_sectors, cur_nr_sectors, barrier, ra;
 	sector_t sector;
+	int qfull;
 
 	sector = bio->bi_sector;
 	nr_sectors = bio_sectors(bio);
@@ -2023,6 +2030,7 @@
 	ra = bio->bi_rw & (1 << BIO_RW_AHEAD);
 
 again:
+	qfull = 0;
 	spin_lock_irq(q->queue_lock);
 
 	if (elv_queue_empty(q)) {
@@ -2096,14 +2104,15 @@
 		freereq = NULL;
 	} else {
 		spin_unlock_irq(q->queue_lock);
-		if ((freereq = get_request(q, rw, GFP_ATOMIC)) == NULL) {
+		freereq = get_request(q, rw, GFP_ATOMIC, ra ? NULL : &qfull);
+		if (freereq == NULL) {
 			/*
 			 * READA bit set
 			 */
 			if (ra)
 				goto end_io;
 	
-			freereq = get_request_wait(q, rw);
+			freereq = get_request_wait(q, rw, 1);
 		}
 		goto again;
 	}
@@ -2141,6 +2150,13 @@
 	req->start_time = jiffies;
 
 	add_request(q, req);
+
+	if (qfull) {
+		spin_unlock_irq(q->queue_lock);
+		get_request_wait(q, rw, 0);
+		spin_lock_irq(q->queue_lock);
+	}
+
 out:
 	if (freereq)
 		__blk_put_request(q, freereq);


* Re: IO scheduler, queue depth, nr_requests
  2004-02-19  2:11         ` Miquel van Smoorenburg
@ 2004-02-19  2:26           ` Andrew Morton
  2004-02-19 10:15             ` Miquel van Smoorenburg
  2004-02-19  2:51           ` IO scheduler, queue depth, nr_requests Nick Piggin
  1 sibling, 1 reply; 29+ messages in thread
From: Andrew Morton @ 2004-02-19  2:26 UTC (permalink / raw)
  To: Miquel van Smoorenburg; +Cc: miquels, axboe, linux-lvm, linux-kernel, thornber

Miquel van Smoorenburg <miquels@cistron.nl> wrote:
>
> The thing is, the bio's are submitted perfectly sequentially. It's just that
>  every so often a request (with just its initial bio) gets stuck for a while.
>  Because the bio's after it are merged and sent to the device, it's not
>  possible to merge that stuck request later on when it gets unstuck, because
>  the other bio's have already left the building so to speak.

Oh.  So the raid controller's queue depth is larger than the kernel's.  So
everything gets immediately shovelled into the device and the kernel is
left with nothing to merge the little request against.

Shouldn't the controller itself be performing the insertion?


* Re: IO scheduler, queue depth, nr_requests
  2004-02-19  2:11         ` Miquel van Smoorenburg
  2004-02-19  2:26           ` Andrew Morton
@ 2004-02-19  2:51           ` Nick Piggin
  2004-02-19 10:21             ` Jens Axboe
  1 sibling, 1 reply; 29+ messages in thread
From: Nick Piggin @ 2004-02-19  2:51 UTC (permalink / raw)
  To: Miquel van Smoorenburg
  Cc: Andrew Morton, axboe, linux-lvm, linux-kernel, thornber



Miquel van Smoorenburg wrote:

>
>No, I'm actually referring to a struct request. I'm logging this in the
>SCSI layer, in scsi_request_fn(), just after elv_next_request(). I have
>in fact logged all the bio's submitted to __make_request, and the output
>of the elevator from elv_next_request(). The bio's are submitted sequentially,
>the resulting requests aren't. But this is because nr_requests is 128, while
>the 3ware device has a queue of 254 entries (no tagging though). Upping
>nr_requests to 512 makes this go away ..
>
>That shouldn't be necessary though. I only see this with LVM over 3ware-raid5,
>not on the 3ware-raid5 array directly (/dev/sda1). And it gets less troublesome
>with a lot of debugging (unless I set nr_requests lower again), which points
>to a timing issue.
>
>

So the problem you are seeing is due to "unlucky" timing between
two processes submitting IO. And the very efficient mechanisms
(merging, sorting) we have to improve situations exactly like this
is effectively disabled. And to make it worse, it appears that your
controller shits itself on this trivially simple pattern.

Your hack makes a baby step in the direction of per *process*
request limits, which I happen to be an advocate of. As it stands
though, I don't like it.

Jens has the final say when it comes to the block layer though.



* Re: IO scheduler, queue depth, nr_requests
  2004-02-19  2:26           ` Andrew Morton
@ 2004-02-19 10:15             ` Miquel van Smoorenburg
  2004-02-19 10:19               ` Jens Axboe
  0 siblings, 1 reply; 29+ messages in thread
From: Miquel van Smoorenburg @ 2004-02-19 10:15 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Miquel van Smoorenburg, miquels, axboe, linux-lvm, linux-kernel,
	thornber

On Thu, 19 Feb 2004 03:26:28, Andrew Morton wrote:
> Miquel van Smoorenburg <miquels@cistron.nl> wrote:
> >
> > The thing is, the bio's are submitted perfectly sequentially. It's just that
> >  every so often a request (with just its initial bio) gets stuck for a while.
> >  Because the bio's after it are merged and sent to the device, it's not
> >  possible to merge that stuck request later on when it gets unstuck, because
> >  the other bio's have already left the building so to speak.
> 
> Oh.  So the raid controller's queue depth is larger than the kernel's.  So
> everything gets immediately shovelled into the device and the kernel is
> left with nothing to merge the little request against.

Well, the request queue of the kernel is max'ed out too, otherwise
get_request_wait() wouldn't be called. It's just an unfortunate timing issue.

> Shouldn't the controller itself be performing the insertion?

Well, you would indeed expect the 3ware hardware to be smarter than that,
but in its defence, the driver doesn't set sdev->simple_tags or sdev->ordered_tags
at all. It just has a large queue on the host, in hardware.

Perhaps this info should be exported into the request queue of the device,
so that ll_rw_blk knows about this and can do something similar to the
hack I posted ?

Note that AFAICS nothing in drivers/scsi uses the tagging stuff in ll_rw_blk.c.
blk_queue_init_tags() is only called by scsi_activate_tcq(), and nothing
ever calls that (except the 53c700.c driver). So you can't just check for
QUEUE_FLAG_QUEUED. Hmm, nothing in drivers/block calls it either. It's
not being used at all yet ? Or am I being dense ?

Mike.


* Re: IO scheduler, queue depth, nr_requests
  2004-02-19 10:15             ` Miquel van Smoorenburg
@ 2004-02-19 10:19               ` Jens Axboe
  2004-02-19 20:59                 ` Miquel van Smoorenburg
  0 siblings, 1 reply; 29+ messages in thread
From: Jens Axboe @ 2004-02-19 10:19 UTC (permalink / raw)
  To: Miquel van Smoorenburg; +Cc: Andrew Morton, linux-lvm, linux-kernel, thornber

On Thu, Feb 19 2004, Miquel van Smoorenburg wrote:
> On Thu, 19 Feb 2004 03:26:28, Andrew Morton wrote:
> > Miquel van Smoorenburg <miquels@cistron.nl> wrote:
> > >
> > > The thing is, the bio's are submitted perfectly sequentially. It's just that
> > >  every so often a request (with just its initial bio) gets stuck for a while.
> > >  Because the bio's after it are merged and sent to the device, it's not
> > >  possible to merge that stuck request later on when it gets unstuck, because
> > >  the other bio's have already left the building so to speak.
> > 
> > Oh.  So the raid controller's queue depth is larger than the kernel's.  So
> > everything gets immediately shovelled into the device and the kernel is
> > left with nothing to merge the little request against.
> 
> Well, the request queue of the kernel is max'ed out too, otherwise
> get_request_wait() wouldn't be called. It's just an unfortunate timing
> issue.

Indeed

> > Shouldn't the controller itself be performing the insertion?
> 
> Well, you would indeed expect the 3ware hardware to be smarter than
> that, but in its defence, the driver doesn't set sdev->simple_tags or
> sdev->ordered_tags at all. It just has a large queue on the host, in
> hardware.

A too large queue. IMHO the simple and correct solution to your problem
is to diminish the host queue (sane solution), or bump the block layer
queue size (dumb solution).

> Perhaps this info should be exported into the request queue of the device,
> so that ll_rw_blk knows about this and can do something similar to the
> hack I posted ?
> 
> Note that AFAICS nothing in drivers/scsi uses the tagging stuff in
> ll_rw_blk.c.  blk_queue_init_tags() is only called by
> scsi_activate_tcq(), and nothing ever calls that (except the 53c700.c
> driver). So you can't just check for QUEUE_FLAG_QUEUED. Hmm, nothing
> in drivers/block calls it either. It's not being used at all yet ? Or
> am I being dense ?

No you are correct, I already outlined that to you explicitly in the
very first mail in this thread. Hopefully this will change with 2.7 so
we have some block layer control over tagging in general.

-- 
Jens Axboe



* Re: IO scheduler, queue depth, nr_requests
  2004-02-19  2:51           ` IO scheduler, queue depth, nr_requests Nick Piggin
@ 2004-02-19 10:21             ` Jens Axboe
  0 siblings, 0 replies; 29+ messages in thread
From: Jens Axboe @ 2004-02-19 10:21 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Miquel van Smoorenburg, Andrew Morton, linux-lvm, linux-kernel,
	thornber

On Thu, Feb 19 2004, Nick Piggin wrote:
> 
> 
> Miquel van Smoorenburg wrote:
> 
> >
> >No, I'm actually referring to a struct request. I'm logging this in the
> >SCSI layer, in scsi_request_fn(), just after elv_next_request(). I have
> >in fact logged all the bio's submitted to __make_request, and the output
> >of the elevator from elv_next_request(). The bio's are submitted 
> >sequentially,
> >the resulting requests aren't. But this is because nr_requests is 128, 
> >while
> >the 3ware device has a queue of 254 entries (no tagging though). Upping
> >nr_requests to 512 makes this go away ..
> >
> >That shouldn't be necessary though. I only see this with LVM over 
> >3ware-raid5,
> >not on the 3ware-raid5 array directly (/dev/sda1). And it gets less 
> >troublesome
> >with a lot of debugging (unless I set nr_requests lower again), which 
> >points
> >to a timing issue.
> >
> >
> 
> So the problem you are seeing is due to "unlucky" timing between
> two processes submitting IO. And the very efficient mechanisms
> (merging, sorting) we have to improve situations exactly like this
> is effectively disabled. And to make it worse, it appears that your
> controller shits itself on this trivially simple pattern.
> 
> Your hack makes a baby step in the direction of per *process*
> request limits, which I happen to be an advocate of. As it stands
> though, I don't like it.

I'm very much an advocate for per process request limits as well.  Would
be trivial to add... Miquel's patch is horrible, I appreciate it being
posted as a cry for help.

-- 
Jens Axboe



* Re: IO scheduler, queue depth, nr_requests
  2004-02-19 10:19               ` Jens Axboe
@ 2004-02-19 20:59                 ` Miquel van Smoorenburg
  2004-02-19 22:52                   ` Nick Piggin
  0 siblings, 1 reply; 29+ messages in thread
From: Miquel van Smoorenburg @ 2004-02-19 20:59 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Miquel van Smoorenburg, Andrew Morton, linux-lvm, linux-kernel,
	thornber

On Thu, 19 Feb 2004 11:19:15, Jens Axboe wrote:
> On Thu, Feb 19 2004, Miquel van Smoorenburg wrote:
>
> > > Shouldn't the controller itself be performing the insertion?
> > 
> > Well, you would indeed expect the 3ware hardware to be smarter than
> > that, but in its defence, the driver doesn't set sdev->simple_tags or
> > sdev->ordered_tags at all. It just has a large queue on the host, in
> > hardware.
> 
> A too large queue. IMHO the simple and correct solution to your problem
> is to diminish the host queue (sane solution), or bump the block layer
> queue size (dumb solution).

Well, I did that. Lowering the queue size of the 3ware controller to 64
does help a bit, but performance is still not optimal - leaving it at 254
and increasing the nr_requests of the queue to 512 helps the most.

But the patch I posted does just as well, without any tuning. I changed
it a little though - it only has the "new" behaviour (instead of blocking
on allocating a request, allocate it, queue it, _then_ block) for WRITEs.
That results in the best performance I've seen, by far.

Now the style of my patch might be ugly, but what is conceptually wrong
with allocating the request and queueing it, then blocking if the queue is
full, versus blocking on allocating the request and keeping a bio
"stuck" for quite some time, resulting in out-of-order requests to the
hardware ?

Note that this is not an issue of '2 processes writing to 1 file', really.
It's one process and pdflush writing the same dirty pages of the same file.

Mike.


* Re: IO scheduler, queue depth, nr_requests
  2004-02-19 20:59                 ` Miquel van Smoorenburg
@ 2004-02-19 22:52                   ` Nick Piggin
  2004-02-19 23:53                     ` Miquel van Smoorenburg
  0 siblings, 1 reply; 29+ messages in thread
From: Nick Piggin @ 2004-02-19 22:52 UTC (permalink / raw)
  To: Miquel van Smoorenburg
  Cc: Jens Axboe, Andrew Morton, linux-lvm, linux-kernel, thornber



Miquel van Smoorenburg wrote:

>On Thu, 19 Feb 2004 11:19:15, Jens Axboe wrote:
>
>>On Thu, Feb 19 2004, Miquel van Smoorenburg wrote:
>>
>>
>>>>Shouldn't the controller itself be performing the insertion?
>>>>
>>>Well, you would indeed expect the 3ware hardware to be smarter than
>>>that, but in its defence, the driver doesn't set sdev->simple_tags or
>>>sdev->ordered_tags at all. It just has a large queue on the host, in
>>>hardware.
>>>
>>A too large queue. IMHO the simple and correct solution to your problem
>>is to diminish the host queue (sane solution), or bump the block layer
>>queue size (dumb solution).
>>
>
>Well, I did that. Lowering the queue size of the 3ware controller to 64
>does help a bit, but performance is still not optimal - leaving it at 254
>and increasing the nr_requests of the queue to 512 helps the most.
>
>But the patch I posted does just as well, without any tuning. I changed
>it a little though - it only has the "new" behaviour (instead of blocking
>on allocating a request, allocate it, queue it, _then_ block) for WRITEs.
>That results in the best performance I've seen, by far.
>
>

That's because you are half introducing per-process limits.

>Now the style of my patch might be ugly, but what is conceptually wrong
>with allocating the request and queueing it, then block if the queue is
>full, versus blocking on allocating the request and keeping a bio
>"stuck" for quite some time, resulting in out-of-order requests to the
>hardware ?
>
>

Conceptually? The concept that you have everything you need to
continue and yet you block anyway is wrong.

>Note that this is not an issue of '2 processes writing to 1 file', really.
>It's one process and pdflush writing the same dirty pages of the same file.
>
>

pdflush is a process though, that's all that matters.



* Re: IO scheduler, queue depth, nr_requests
  2004-02-19 22:52                   ` Nick Piggin
@ 2004-02-19 23:53                     ` Miquel van Smoorenburg
  2004-02-20  0:15                       ` Nick Piggin
  2004-02-20  1:12                       ` [PATCH] per process request limits (was Re: IO scheduler, queue depth, nr_requests) Nick Piggin
  0 siblings, 2 replies; 29+ messages in thread
From: Miquel van Smoorenburg @ 2004-02-19 23:53 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Miquel van Smoorenburg, Jens Axboe, Andrew Morton, linux-lvm,
	linux-kernel, thornber

On Thu, 19 Feb 2004 23:52:32, Nick Piggin wrote:
> 
> 
> Miquel van Smoorenburg wrote:
> 
> >On Thu, 19 Feb 2004 11:19:15, Jens Axboe wrote:
> >
> >>On Thu, Feb 19 2004, Miquel van Smoorenburg wrote:
> >>
> >>
> >>>>Shouldn't the controller itself be performing the insertion?
> >>>>
> >>>Well, you would indeed expect the 3ware hardware to be smarter than
> >>>that, but in its defence, the driver doesn't set sdev->simple_tags or
> >>>sdev->ordered_tags at all. It just has a large queue on the host, in
> >>>hardware.
> >>>
> >>A too large queue. IMHO the simple and correct solution to your problem
> >>is to diminish the host queue (sane solution), or bump the block layer
> >>queue size (dumb solution).
> >>
> >
> >Well, I did that. Lowering the queue size of the 3ware controller to 64
> >does help a bit, but performance is still not optimal - leaving it at 254
> >and increasing the nr_requests of the queue to 512 helps the most.
> >
> >But the patch I posted does just as well, without any tuning. I changed
> >it a little though - it only has the "new" behaviour (instead of blocking
> >on allocating a request, allocate it, queue it, _then_ block) for WRITEs.
> >That results in the best performance I've seen, by far.
> >
> >
> 
> That's because you are half introducing per-process limits.
> 
> >Now the style of my patch might be ugly, but what is conceptually wrong
> >with allocating the request and queueing it, then block if the queue is
> >full, versus blocking on allocating the request and keeping a bio
> >"stuck" for quite some time, resulting in out-of-order requests to the
> >hardware ?
> >
> >
> 
> Conceptually? The concept that you have everything you need to
> continue and yet you block anyway is wrong.

For reading, I agree. For writing .. ah well, English is not my first
language, let's not argue about language semantics.

> >Note that this is not an issue of '2 processes writing to 1 file', really.
> >It's one process and pdflush writing the same dirty pages of the same file.
> 
> pdflush is a process though, that's all that matters.

I understand that when the two processes are unrelated, the patch as I
sent it will do the wrong thing.

But the thing is, you get this:

- "dd" process writes requests
- pdflush triggers to write dirty pages
- too many pages are dirty so "dd" blocks as well to write synchronously
- "dd" process triggers "queue full" but gets marked as "batching" so
  can continue (get_request)
- pdflush tries to submit one bio and gets blocked (get_request_wait)
- "dd" continues, but that one bio from pdflush remains stuck for a while

That's stupid: that one bio from pdflush should really be allowed on
the queue, since "dd" is adding requests from the same source to it
anyway.

Perhaps writes from pdflush should be handled differently to prevent
this specific case ?

Say, if pdflush adds request #128, don't mark it as batching, but
let it block. The next process will be the one marked as batching
and can continue. If pdflush tries to add a request > 128, allow it,
but _then_ block it.

Would something like that work ? Would it be a good idea to never mark
a pdflush process as batching, or would that have a negative impact
for some things ?

Mike.


* Re: IO scheduler, queue depth, nr_requests
  2004-02-19 23:53                     ` Miquel van Smoorenburg
@ 2004-02-20  0:15                       ` Nick Piggin
  2004-02-20  1:12                       ` [PATCH] per process request limits (was Re: IO scheduler, queue depth, nr_requests) Nick Piggin
  1 sibling, 0 replies; 29+ messages in thread
From: Nick Piggin @ 2004-02-20  0:15 UTC (permalink / raw)
  To: Miquel van Smoorenburg
  Cc: Jens Axboe, Andrew Morton, linux-lvm, linux-kernel, thornber



Miquel van Smoorenburg wrote:

>On Thu, 19 Feb 2004 23:52:32, Nick Piggin wrote:
>
>>
>>Miquel van Smoorenburg wrote:
>>
>
...

>>>Note that this is not an issue of '2 processes writing to 1 file', really.
>>>It's one process and pdflush writing the same dirty pages of the same file.
>>>
>>pdflush is a process though, that's all that matters.
>>
>
>I understand that when the two processes are unrelated, the patch as I
>sent it will do the wrong thing.
>
>But the thing is, you get this:
>
>- "dd" process writes requests
>- pdflush triggers to write dirty pages
>- too many pages are dirty so "dd" blocks as well to write synchronously
>- "dd" process triggers "queue full" but gets marked as "batching" so
>  can continue (get_request)
>- pdflush tries to submit one bio and gets blocked (get_request_wait)
>- "dd" continues, but that one bio from pdflush remains stuck for a while
>
>

The batching logic can probably all be ripped out with per
process limits. It's too complex anyway really.

>That's stupid, that one bio from pdflush should really be allowed on
>the queue, since "dd" is adding requests from the same source to it
>anyway.
>
>

But the whole reason it is getting blocked in the first place
is because your controller is sucking up all your requests.
The whole problem is not a problem if you use properly sized
queues.

I'm a bit surprised that it wasn't working well with a controller
queue depth of 64 and 128 nr_requests. I'll give you a per process
request limit patch to try in a minute.

>Perhaps writes from pdflush should be handled differently to prevent
>this specific case ?
>
>Say, if pdflush adds request #128, don't mark it as batching, but
>let it block. The next process will be the one marked as batching
>and can continue. If pdflush tries to add a request > 128, allow it,
>but _then_ block it.
>
>Would something like that work ? Would it be a good idea to never mark
>a pdflush process as batching, or would that have a negative impact
>for some things ?
>
>

It's hard to know. Maybe a better solution would be to allow pdflush
to be exempt from the limits entirely as long as it tries not to write
to congested queues (which is what it does)...



* [PATCH] per process request limits (was Re: IO scheduler, queue depth, nr_requests)
  2004-02-19 23:53                     ` Miquel van Smoorenburg
  2004-02-20  0:15                       ` Nick Piggin
@ 2004-02-20  1:12                       ` Nick Piggin
  2004-02-20  1:26                         ` Andrew Morton
  2004-02-20  1:45                         ` [PATCH] per process request limits (was Re: IO scheduler, queue depth, nr_requests) Nick Piggin
  1 sibling, 2 replies; 29+ messages in thread
From: Nick Piggin @ 2004-02-20  1:12 UTC (permalink / raw)
  To: Miquel van Smoorenburg
  Cc: Jens Axboe, Andrew Morton, linux-lvm, linux-kernel, thornber

[-- Attachment #1: Type: text/plain, Size: 152 bytes --]

Hi Miquel,
Can you see if this patch helps you?

Even with this patch, it might still be a good idea to allow
pdflush to disregard the limits...

Nick


[-- Attachment #2: blk-per-process-rqlim.patch --]
[-- Type: text/plain, Size: 7564 bytes --]

 linux-2.6-npiggin/drivers/block/ll_rw_blk.c |  113 ++++++++--------------------
 linux-2.6-npiggin/include/linux/blkdev.h    |    8 -
 2 files changed, 36 insertions(+), 85 deletions(-)

diff -puN include/linux/blkdev.h~blk-per-process-rqlim include/linux/blkdev.h
--- linux-2.6/include/linux/blkdev.h~blk-per-process-rqlim	2004-02-20 09:53:22.000000000 +1100
+++ linux-2.6-npiggin/include/linux/blkdev.h	2004-02-20 10:12:04.000000000 +1100
@@ -25,6 +25,7 @@ struct request_pm_state;
 
 #define BLKDEV_MIN_RQ	4
 #define BLKDEV_MAX_RQ	128	/* Default maximum */
+#define BLKDEV_MIN_PROC_RQ	64
 
 /*
  * This is the per-process anticipatory I/O scheduler state.
@@ -61,11 +62,7 @@ struct io_context {
 	atomic_t refcount;
 	pid_t pid;
 
-	/*
-	 * For request batching
-	 */
-	unsigned long last_waited; /* Time last woken after wait for request */
-	int nr_batch_requests;     /* Number of requests left in the batch */
+	unsigned int nr_requests;	/* Outstanding requests */
 
 	struct as_io_context *aic;
 };
@@ -141,6 +138,7 @@ struct request {
 	int ref_count;
 	request_queue_t *q;
 	struct request_list *rl;
+	struct io_context *ioc;
 
 	struct completion *waiting;
 	void *special;
diff -puN drivers/block/ll_rw_blk.c~blk-per-process-rqlim drivers/block/ll_rw_blk.c
--- linux-2.6/drivers/block/ll_rw_blk.c~blk-per-process-rqlim	2004-02-20 09:53:25.000000000 +1100
+++ linux-2.6-npiggin/drivers/block/ll_rw_blk.c	2004-02-20 10:40:43.000000000 +1100
@@ -55,12 +55,6 @@ unsigned long blk_max_low_pfn, blk_max_p
 EXPORT_SYMBOL(blk_max_low_pfn);
 EXPORT_SYMBOL(blk_max_pfn);
 
-/* Amount of time in which a process may batch requests */
-#define BLK_BATCH_TIME	(HZ/50UL)
-
-/* Number of requests a "batching" process may submit */
-#define BLK_BATCH_REQ	32
-
 /*
  * Return the threshold (number of used requests) at which the queue is
  * considered to be congested.  It include a little hysteresis to keep the
@@ -1495,57 +1489,27 @@ static inline struct request *blk_alloc_
 }
 
 /*
- * ioc_batching returns true if the ioc is a valid batching request and
- * should be given priority access to a request.
- */
-static inline int ioc_batching(struct io_context *ioc)
-{
-	if (!ioc)
-		return 0;
-
-	/*
-	 * Make sure the process is able to allocate at least 1 request
-	 * even if the batch times out, otherwise we could theoretically
-	 * lose wakeups.
-	 */
-	return ioc->nr_batch_requests == BLK_BATCH_REQ ||
-		(ioc->nr_batch_requests > 0
-		&& time_before(jiffies, ioc->last_waited + BLK_BATCH_TIME));
-}
-
-/*
- * ioc_set_batching sets ioc to be a new "batcher" if it is not one. This
- * will cause the process to be a "batcher" on all queues in the system. This
- * is the behaviour we want though - once it gets a wakeup it should be given
- * a nice run.
- */
-void ioc_set_batching(struct io_context *ioc)
-{
-	if (!ioc || ioc_batching(ioc))
-		return;
-
-	ioc->nr_batch_requests = BLK_BATCH_REQ;
-	ioc->last_waited = jiffies;
-}
-
-/*
  * A request has just been released.  Account for it, update the full and
  * congestion status, wake up any waiters.   Called under q->queue_lock.
  */
-static void freed_request(request_queue_t *q, int rw)
+static void freed_request(request_queue_t *q, struct io_context *ioc, int rw)
 {
 	struct request_list *rl = &q->rq;
 
+	ioc->nr_requests--;
 	rl->count[rw]--;
 	if (rl->count[rw] < queue_congestion_off_threshold(q))
 		clear_queue_congested(q, rw);
 	if (rl->count[rw]+1 <= q->nr_requests) {
-		smp_mb();
-		if (waitqueue_active(&rl->wait[rw]))
-			wake_up(&rl->wait[rw]);
 		if (!waitqueue_active(&rl->wait[rw]))
 			blk_clear_queue_full(q, rw);
 	}
+
+	if (rl->count[rw]-31 <= q->nr_requests
+			|| ioc->nr_requests <= BLKDEV_MIN_PROC_RQ - 32) {
+		if (waitqueue_active(&rl->wait[rw]))
+			wake_up(&rl->wait[rw]);
+	}
 }
 
 #define blkdev_free_rq(list) list_entry((list)->next, struct request, queuelist)
@@ -1556,32 +1520,34 @@ static struct request *get_request(reque
 {
 	struct request *rq = NULL;
 	struct request_list *rl = &q->rq;
-	struct io_context *ioc = get_io_context(gfp_mask);
+	struct io_context *ioc;
 
 	spin_lock_irq(q->queue_lock);
 	if (rl->count[rw]+1 >= q->nr_requests) {
 		/*
 		 * The queue will fill after this allocation, so set it as
-		 * full, and mark this process as "batching". This process
-		 * will be allowed to complete a batch of requests, others
-		 * will be blocked.
+		 * full.
 		 */
-		if (!blk_queue_full(q, rw)) {
-			ioc_set_batching(ioc);
+		if (!blk_queue_full(q, rw))
 			blk_set_queue_full(q, rw);
-		}
 	}
 
+	ioc = get_io_context(gfp_mask);
+	if (unlikely(ioc == NULL))
+		goto out;
+
 	if (blk_queue_full(q, rw)
-			&& !ioc_batching(ioc) && !elv_may_queue(q, rw)) {
+			&& ioc->nr_requests >= BLKDEV_MIN_PROC_RQ
+			&& !elv_may_queue(q, rw)) {
 		/*
-		 * The queue is full and the allocating process is not a
-		 * "batcher", and not exempted by the IO scheduler
+		 * The queue is full and the allocating process is over
+		 * it's limit, and not exempted by the IO scheduler
 		 */
 		spin_unlock_irq(q->queue_lock);
-		goto out;
+		goto out_ioc;
 	}
 
+	ioc->nr_requests++;
 	rl->count[rw]++;
 	if (rl->count[rw] >= queue_congestion_on_threshold(q))
 		set_queue_congested(q, rw);
@@ -1597,14 +1563,11 @@ static struct request *get_request(reque
 		 * wait queue, but this is pretty rare.
 		 */
 		spin_lock_irq(q->queue_lock);
-		freed_request(q, rw);
+		freed_request(q, ioc, rw);
 		spin_unlock_irq(q->queue_lock);
-		goto out;
+		goto out_ioc;
 	}
 
-	if (ioc_batching(ioc))
-		ioc->nr_batch_requests--;
-	
 	INIT_LIST_HEAD(&rq->queuelist);
 
 	/*
@@ -1620,13 +1583,15 @@ static struct request *get_request(reque
 	rq->ref_count = 1;
 	rq->q = q;
 	rq->rl = rl;
+	rq->ioc = ioc;
 	rq->waiting = NULL;
 	rq->special = NULL;
 	rq->data = NULL;
 	rq->sense = NULL;
 
-out:
+out_ioc:
 	put_io_context(ioc);
+out:
 	return rq;
 }
 
@@ -1644,25 +1609,12 @@ static struct request *get_request_wait(
 		struct request_list *rl = &q->rq;
 
 		prepare_to_wait_exclusive(&rl->wait[rw], &wait,
-				TASK_UNINTERRUPTIBLE);
+						TASK_UNINTERRUPTIBLE);
 
 		rq = get_request(q, rw, GFP_NOIO);
-
-		if (!rq) {
-			struct io_context *ioc;
-
+		if (!rq)
 			io_schedule();
 
-			/*
-			 * After sleeping, we become a "batching" process and
-			 * will be able to allocate at least one request, and
-			 * up to a big batch of them for a small period time.
-			 * See ioc_batching, ioc_set_batching
-			 */
-			ioc = get_io_context(GFP_NOIO);
-			ioc_set_batching(ioc);
-			put_io_context(ioc);
-		}
 		finish_wait(&rl->wait[rw], &wait);
 	} while (!rq);
 
@@ -1854,6 +1806,7 @@ void __blk_put_request(request_queue_t *
 	 * it didn't come out of our reserved rq pools
 	 */
 	if (rl) {
+		struct io_context *ioc = req->ioc;
 		int rw = rq_data_dir(req);
 
 		elv_completed_request(q, req);
@@ -1861,7 +1814,8 @@ void __blk_put_request(request_queue_t *
 		BUG_ON(!list_empty(&req->queuelist));
 
 		blk_free_request(q, req);
-		freed_request(q, rw);
+		freed_request(q, ioc, rw);
+		put_io_context(ioc);
 	}
 }
 
@@ -2721,7 +2675,7 @@ int __init blk_dev_init(void)
  */
 void put_io_context(struct io_context *ioc)
 {
-	if (ioc == NULL)
+	if (unlikely(ioc == NULL))
 		return;
 
 	BUG_ON(atomic_read(&ioc->refcount) == 0);
@@ -2772,8 +2726,7 @@ struct io_context *get_io_context(int gf
 		if (ret) {
 			atomic_set(&ret->refcount, 1);
 			ret->pid = tsk->pid;
-			ret->last_waited = jiffies; /* doesn't matter... */
-			ret->nr_batch_requests = 0; /* because this is 0 */
+			ret->nr_requests = 0;
 			ret->aic = NULL;
 			tsk->io_context = ret;
 		}

_


* Re: [PATCH] per process request limits (was Re: IO scheduler, queue depth, nr_requests)
  2004-02-20  1:12                       ` [PATCH] per process request limits (was Re: IO scheduler, queue depth, nr_requests) Nick Piggin
@ 2004-02-20  1:26                         ` Andrew Morton
  2004-02-20  1:40                           ` Nick Piggin
  2004-02-20  1:45                         ` [PATCH] per process request limits (was Re: IO scheduler, queue depth, nr_requests) Nick Piggin
  1 sibling, 1 reply; 29+ messages in thread
From: Andrew Morton @ 2004-02-20  1:26 UTC (permalink / raw)
  To: Nick Piggin; +Cc: miquels, axboe, linux-lvm, linux-kernel, thornber

Nick Piggin <piggin@cyberone.com.au> wrote:
>
> Even with this patch, it might still be a good idea to allow
> pdflush to disregard the limits...

Has it been confirmed that pdflush is blocking in get_request_wait()?  I
guess that can happen very occasionally because we don't bother with any
locking around there but if it's happening a lot then something is bust.



* Re: [PATCH] per process request limits (was Re: IO scheduler, queue depth, nr_requests)
  2004-02-20  1:26                         ` Andrew Morton
@ 2004-02-20  1:40                           ` Nick Piggin
  2004-02-20  2:32                             ` Andrew Morton
  0 siblings, 1 reply; 29+ messages in thread
From: Nick Piggin @ 2004-02-20  1:40 UTC (permalink / raw)
  To: Andrew Morton; +Cc: miquels, axboe, linux-lvm, linux-kernel, thornber



Andrew Morton wrote:

>Nick Piggin <piggin@cyberone.com.au> wrote:
>
>>Even with this patch, it might still be a good idea to allow
>>pdflush to disregard the limits...
>>
>
>Has it been confirmed that pdflush is blocking in get_request_wait()?  I
>guess that can happen very occasionally because we don't bother with any
>locking around there but if it's happening a lot then something is bust.
>
>

Miquel's analysis is pretty plausible, but I'm not sure if
he's confirmed it or not, Miquel? Even if it isn't happening
a lot and something isn't bust, it might be a good idea to
do this.




* Re: [PATCH] per process request limits (was Re: IO scheduler, queue depth, nr_requests)
  2004-02-20  1:12                       ` [PATCH] per process request limits (was Re: IO scheduler, queue depth, nr_requests) Nick Piggin
  2004-02-20  1:26                         ` Andrew Morton
@ 2004-02-20  1:45                         ` Nick Piggin
  1 sibling, 0 replies; 29+ messages in thread
From: Nick Piggin @ 2004-02-20  1:45 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Miquel van Smoorenburg, Jens Axboe, Andrew Morton, linux-lvm,
	linux-kernel, thornber



Nick Piggin wrote:

> Hi Miquel,
> Can you see if this patch helps you?
>

This patch isn't actually so good because it doesn't wake
a specific process when it gets below its limit...



* Re: [PATCH] per process request limits (was Re: IO scheduler, queue depth, nr_requests)
  2004-02-20  1:40                           ` Nick Piggin
@ 2004-02-20  2:32                             ` Andrew Morton
  2004-02-20 14:40                               ` [PATCH] bdi_congestion_funp (was: Re: [PATCH] per process request limits (was Re: IO scheduler, queue depth, nr_requests)) Miquel van Smoorenburg
  0 siblings, 1 reply; 29+ messages in thread
From: Andrew Morton @ 2004-02-20  2:32 UTC (permalink / raw)
  To: Nick Piggin; +Cc: miquels, axboe, linux-lvm, linux-kernel, thornber

Nick Piggin <piggin@cyberone.com.au> wrote:
>
> 
> 
> Andrew Morton wrote:
> 
> >Nick Piggin <piggin@cyberone.com.au> wrote:
> >
> >>Even with this patch, it might still be a good idea to allow
> >>pdflush to disregard the limits...
> >>
> >
> >Has it been confirmed that pdflush is blocking in get_request_wait()?  I
> >guess that can happen very occasionally because we don't bother with any
> >locking around there but if it's happening a lot then something is bust.
> >
> >
> 
> Miquel's analysis is pretty plausible, but I'm not sure if
> he's confirmed it or not, Miquel? Even if it isn't happening
> a lot, and something isn't bust it might be a good idea to
> do this.

Seems OK from a quick check.  pdflush will block in get_request_wait()
occasionally, but not at all often.  Perhaps we could move the
write_congested test into the mpage_writepages() inner loop but it hardly
seems worth the risk.

Maybe things are different on Miquel's clockwork controller.



 drivers/block/ll_rw_blk.c |    2 ++
 fs/fs-writeback.c         |    2 ++
 2 files changed, 4 insertions(+)

diff -puN drivers/block/ll_rw_blk.c~pdflush-blockage-check drivers/block/ll_rw_blk.c
--- 25/drivers/block/ll_rw_blk.c~pdflush-blockage-check	2004-02-19 18:16:33.000000000 -0800
+++ 25-akpm/drivers/block/ll_rw_blk.c	2004-02-19 18:16:33.000000000 -0800
@@ -1651,6 +1651,8 @@ static struct request *get_request_wait(
 		if (!rq) {
 			struct io_context *ioc;
 
+			WARN_ON(current_is_pdflush());
+
 			io_schedule();
 
 			/*
diff -puN fs/fs-writeback.c~pdflush-blockage-check fs/fs-writeback.c
--- 25/fs/fs-writeback.c~pdflush-blockage-check	2004-02-19 18:22:25.000000000 -0800
+++ 25-akpm/fs/fs-writeback.c	2004-02-19 18:22:43.000000000 -0800
@@ -279,6 +279,8 @@ sync_sb_inodes(struct super_block *sb, s
 		}
 
 		if (wbc->nonblocking && bdi_write_congested(bdi)) {
+			if (current_is_pdflush())
+				printk("saved pdflush\n");
 			wbc->encountered_congestion = 1;
 			if (sb != blockdev_superblock)
 				break;		/* Skip a congested fs */

_



* [PATCH] bdi_congestion_funp (was: Re: [PATCH] per process request limits (was Re: IO scheduler, queue depth, nr_requests))
  2004-02-20  2:32                             ` Andrew Morton
@ 2004-02-20 14:40                               ` Miquel van Smoorenburg
  2004-02-20 14:57                                 ` Jens Axboe
  2004-02-20 14:59                                 ` Joe Thornber
  0 siblings, 2 replies; 29+ messages in thread
From: Miquel van Smoorenburg @ 2004-02-20 14:40 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Nick Piggin, miquels, axboe, linux-lvm, linux-kernel, thornber

On 2004.02.20 03:32, Andrew Morton wrote:
> Nick Piggin <piggin@cyberone.com.au> wrote:
> >
> > 
> > 
> > Andrew Morton wrote:
> > 
> > >Nick Piggin <piggin@cyberone.com.au> wrote:
> > >
> > >>Even with this patch, it might still be a good idea to allow
> > >>pdflush to disregard the limits...
> > >>
> > >
> > >Has it been confirmed that pdflush is blocking in get_request_wait()?  I
> > >guess that can happen very occasionally because we don't bother with any
> > >locking around there but if it's happening a lot then something is bust.
> > >
> > >
> > 
> > Miquel's analysis is pretty plausible, but I'm not sure if
> > he's confirmed it or not, Miquel?

Yes, I've added current->comm to my debug printk's, and it's dd/pdflush
that both block in get_request_wait (alternating between the two).

> > Even if it isn't happening
> > a lot, and something isn't bust it might be a good idea to
> > do this.
> 
> Seems OK from a quick check.  pdflush will block in get_request_wait()
> occasionally, but not at all often.  Perhaps we could move the
> write_congested test into the mpage_writepages() inner loop but it hardly
> seems worth the risk.
> 
> Maybe things are different on Miquel's clockwork controller.

I haven't tested it yet because of the "This patch isn't actually so good"
comment, but I found another explanation.

>  drivers/block/ll_rw_blk.c |    2 ++
>  fs/fs-writeback.c         |    2 ++
>  2 files changed, 4 insertions(+)

*Lightbulb on* .. I just read fs-writeback.c. As I said, this happens with an
LVM device. Could it be that because LVM and the actual device have different struct
request_queue's things go awry ?

In fs-writeback.c, you're looking at the LVM device (and its request_queue,
and its backing_dev_info). In __make_request, you're looking at the SCSI device.

So that's why I saw "dd" and pdflush fighting over get_request. pdflush isn't
supposed to run when the blockdev is congested, but of course the LVM device
is never marked as congested.

The same can happen with MD devices, I think. Here's another proof-of-concept
patch that fixes things for LVM. With this patch applied, I never see pdflush
running when the queue is congested. You'll probably mark it as "horrible",
but hey, it shows the problem clearly enough..

This patch adds a function pointer to the backing_dev_info struct that can
be called to check congestion when the device is anything other than a simple
device. dm.c now sets this pointer to a function defined in dm-table.c that
checks the congestion bits on all devices that are part of the logical volume.

I'm not sure if it should be the other way around (ll_rw_blk setting the
congested bit in the LVM device instead). Also, if any queue of any device
under LVM is congested, pdflush won't run on the whole logical device - but
I think that harms less than running on a device with a congested queue.

bdi-congestion-funp.patch

--- linux-2.6.3/include/linux/backing-dev.h.ORIG	2004-02-04 04:43:38.000000000 +0100
+++ linux-2.6.3/include/linux/backing-dev.h	2004-02-20 15:17:52.000000000 +0100
@@ -24,6 +24,8 @@
 	unsigned long ra_pages;	/* max readahead in PAGE_CACHE_SIZE units */
 	unsigned long state;	/* Always use atomic bitops on this */
 	int memory_backed;	/* Cannot clean pages with writepage */
+	int (*congested)(int, void *); /* Function pointer if device is md/dm */
+	void *aux;		/* Pointer to aux data for congested func */
 };
 
 extern struct backing_dev_info default_backing_dev_info;
@@ -34,11 +36,15 @@
 
 static inline int bdi_read_congested(struct backing_dev_info *bdi)
 {
+	if (bdi->congested)
+		return bdi->congested(BDI_read_congested, bdi->aux);
 	return test_bit(BDI_read_congested, &bdi->state);
 }
 
 static inline int bdi_write_congested(struct backing_dev_info *bdi)
 {
+	if (bdi->congested)
+		return bdi->congested(BDI_write_congested, bdi->aux);
 	return test_bit(BDI_write_congested, &bdi->state);
 }
 
--- linux-2.6.3/drivers/md/dm.h.ORIG	2004-02-20 15:15:16.000000000 +0100
+++ linux-2.6.3/drivers/md/dm.h	2004-02-20 15:12:30.000000000 +0100
@@ -115,6 +115,7 @@
 int dm_table_get_mode(struct dm_table *t);
 void dm_table_suspend_targets(struct dm_table *t);
 void dm_table_resume_targets(struct dm_table *t);
+int dm_table_any_congested(int bdi_state, void *aux);
 
 /*-----------------------------------------------------------------
  * A registry of target types.
--- linux-2.6.3/drivers/md/dm-table.c.ORIG	2004-02-04 04:44:59.000000000 +0100
+++ linux-2.6.3/drivers/md/dm-table.c	2004-02-20 15:14:35.000000000 +0100
@@ -857,6 +857,30 @@
 	}
 }
 
+/*
+ *	See if any device in the logical volume has a queue
+ *	that is congested - in that case, pdflush skips it.
+ */
+int dm_table_any_congested(int bdi_state, void *aux)
+{
+	struct mapped_device *md = aux;
+	struct dm_table *t;
+	struct list_head *d, *devices;
+	int r = 0;
+
+	if ((t = dm_get_table(md)) == NULL)
+		return 0;
+
+	devices = dm_table_get_devices(t);
+	for (d = devices->next; d != devices; d = d->next) {
+		struct dm_dev *dd = list_entry(d, struct dm_dev, list);
+		request_queue_t *q = bdev_get_queue(dd->bdev);
+		r |= test_bit(bdi_state, &(q->backing_dev_info.state));
+	}
+	dm_table_put(t);
+
+	return r;
+}
 
 EXPORT_SYMBOL(dm_get_device);
 EXPORT_SYMBOL(dm_put_device);
--- linux-2.6.3/drivers/md/dm.c.ORIG	2004-02-20 15:15:02.000000000 +0100
+++ linux-2.6.3/drivers/md/dm.c	2004-02-20 15:14:51.000000000 +0100
@@ -608,6 +608,8 @@
 	}
 
 	md->queue->queuedata = md;
+	md->queue->backing_dev_info.congested = dm_table_any_congested;
+	md->queue->backing_dev_info.aux = md;
 	blk_queue_make_request(md->queue, dm_request);
 
 	md->io_pool = mempool_create(MIN_IOS, mempool_alloc_slab,


* Re: [PATCH] bdi_congestion_funp (was: Re: [PATCH] per process request limits (was Re: IO scheduler, queue depth, nr_requests))
  2004-02-20 14:40                               ` [PATCH] bdi_congestion_funp (was: Re: [PATCH] per process request limits (was Re: IO scheduler, queue depth, nr_requests)) Miquel van Smoorenburg
@ 2004-02-20 14:57                                 ` Jens Axboe
  2004-02-20 14:59                                 ` Joe Thornber
  1 sibling, 0 replies; 29+ messages in thread
From: Jens Axboe @ 2004-02-20 14:57 UTC (permalink / raw)
  To: Miquel van Smoorenburg
  Cc: Andrew Morton, Nick Piggin, miquels, linux-lvm, linux-kernel,
	thornber

On Fri, Feb 20 2004, Miquel van Smoorenburg wrote:
> > > Even if it isn't happening
> > > a lot, and something isn't bust it might be a good idea to
> > > do this.
> > 
> > Seems OK from a quick check.  pdflush will block in get_request_wait()
> > occasionally, but not at all often.  Perhaps we could move the
> > write_congested test into the mpage_writepages() inner loop but it hardly
> > seems worth the risk.
> > 
> > Maybe things are different on Miquel's clockwork controller.
> 
> I haven't tested it yet because of the "This patch isn't actually so good"
> comment, but I found another explanation.
> 
> >  drivers/block/ll_rw_blk.c |    2 ++
> >  fs/fs-writeback.c         |    2 ++
> >  2 files changed, 4 insertions(+)
> 
> *Lightbulb on* .. I just read fs-writeback.c. As I said, this happens
> with an LVM device. Could it be that because LVM and the actual device
> have different struct request_queue's things go awry ?
> 
> In fs-writeback.c, you're looking at the LVM device (and its
> request_queue, and its backing_dev_info). In __make_request, you're
> looking at the SCSI device.

In principle, the lvm/md queues themselves will never be congested. But
the underlying queues can be, of course.

Now this approach is _much_ better, imo. I don't particularly care very
much for how you solved it, though, I'd much rather just see both
setting and testing passed down (and kill the ->aux as well).

Regardless of the initial hw depth vs block depth (which is also a
generic device problem, not just dm related), this would be a good
addition to the congestion logic.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH] bdi_congestion_funp (was: Re: [PATCH] per process request limits (was Re: IO scheduler, queue depth, nr_requests))
  2004-02-20 14:40                               ` [PATCH] bdi_congestion_funp (was: Re: [PATCH] per process request limits (was Re: IO scheduler, queue depth, nr_requests)) Miquel van Smoorenburg
  2004-02-20 14:57                                 ` Jens Axboe
@ 2004-02-20 14:59                                 ` Joe Thornber
  2004-02-20 15:00                                   ` Jens Axboe
  1 sibling, 1 reply; 29+ messages in thread
From: Joe Thornber @ 2004-02-20 14:59 UTC (permalink / raw)
  To: Miquel van Smoorenburg
  Cc: Andrew Morton, Nick Piggin, miquels, axboe, linux-lvm,
	linux-kernel, thornber

On Fri, Feb 20, 2004 at 03:40:42PM +0100, Miquel van Smoorenburg wrote:
> --- linux-2.6.3/drivers/md/dm-table.c.ORIG	2004-02-04 04:44:59.000000000 +0100
> +++ linux-2.6.3/drivers/md/dm-table.c	2004-02-20 15:14:35.000000000 +0100

<snip>

> +	if ((t = dm_get_table(md)) == NULL)
> +		return 0;

struct mapped_device has no business in this file.  You should move
this function to dm.c, and provide accessor fns in dm-table.c.

> +	devices = dm_table_get_devices(t);
> +	for (d = devices->next; d != devices; d = d->next) {
> +		struct dm_dev *dd = list_entry(d, struct dm_dev, list);
> +		request_queue_t *q = bdev_get_queue(dd->bdev);
> +		r |= test_bit(bdi_state, &(q->backing_dev_info.state));

Shouldn't this be calling your bdi_*_congested function rather than
assuming it is a real device under dm ? (often not true).

I'm also very slightly worried that or'ing together the congestion
results for all the separate devices isn't always the right thing.
These devices include anything that the targets are using, exception
stores for snapshots, logs for mirror, all paths for multipath (or'ing
is most likely to be wrong for multipath).

- Joe

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH] bdi_congestion_funp (was: Re: [PATCH] per process request limits (was Re: IO scheduler, queue depth, nr_requests))
  2004-02-20 14:59                                 ` Joe Thornber
@ 2004-02-20 15:00                                   ` Jens Axboe
  2004-02-22 14:02                                     ` Miquel van Smoorenburg
  0 siblings, 1 reply; 29+ messages in thread
From: Jens Axboe @ 2004-02-20 15:00 UTC (permalink / raw)
  To: Joe Thornber
  Cc: Miquel van Smoorenburg, Andrew Morton, Nick Piggin, miquels,
	linux-lvm, linux-kernel

On Fri, Feb 20 2004, Joe Thornber wrote:
> > +	devices = dm_table_get_devices(t);
> > +	for (d = devices->next; d != devices; d = d->next) {
> > +		struct dm_dev *dd = list_entry(d, struct dm_dev, list);
> > +		request_queue_t *q = bdev_get_queue(dd->bdev);
> > +		r |= test_bit(bdi_state, &(q->backing_dev_info.state));
> 
> Shouldn't this be calling your bdi_*_congested function rather than
> assuming it is a real device under dm ? (often not true).
> 
> I'm also very slightly worried that or'ing together the congestion
> results for all the separate devices isn't always the right thing.
> These devices include anything that the targets are using, exception
> stores for snapshots, logs for mirror, all paths for multipath (or'ing
> is most likely to be wrong for multipath).

Yeah the patch is pretty much crap in that area, I don't think Miquel
was aiming for inclusion :)

I'd suggest making queue functions for congestion state as well so it
stacks properly.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH] bdi_congestion_funp (was: Re: [PATCH] per process request limits (was Re: IO scheduler, queue depth, nr_requests))
  2004-02-20 15:00                                   ` Jens Axboe
@ 2004-02-22 14:02                                     ` Miquel van Smoorenburg
  2004-02-22 19:55                                       ` Andrew Morton
  0 siblings, 1 reply; 29+ messages in thread
From: Miquel van Smoorenburg @ 2004-02-22 14:02 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Joe Thornber, Miquel van Smoorenburg, Andrew Morton, Nick Piggin,
	miquels, linux-lvm, linux-kernel

On Fri, 20 Feb 2004 16:00:13, Jens Axboe wrote:
> On Fri, Feb 20 2004, Joe Thornber wrote:
> > > +	devices = dm_table_get_devices(t);
> > > +	for (d = devices->next; d != devices; d = d->next) {
> > > +		struct dm_dev *dd = list_entry(d, struct dm_dev, list);
> > > +		request_queue_t *q = bdev_get_queue(dd->bdev);
> > > +		r |= test_bit(bdi_state, &(q->backing_dev_info.state));
> > 
> > Shouldn't this be calling your bdi_*_congested function rather than
> > assuming it is a real device under dm ? (often not true).
> > 
> > I'm also very slightly worried that or'ing together the congestion
> > results for all the separate devices isn't always the right thing.
> > These devices include anything that the targets are using, exception
> > stores for snapshots, logs for mirror, all paths for multipath (or'ing
> > is most likely to be wrong for multipath).
> 
> Yeah the patch is pretty much crap in that area, I don't think Miquel
> was aiming for inclusion :)
> 
> I'd suggest making queue functions for congestion state as well so it
> stacks properly.

I've been looking at this. If you want to keep the struct backing_dev_info
reasonably stand-alone (no function pointer and aux data like I did) you
probably need to add a list of parent_queues to every request_list. If
the queue get congested, you mark the parent queue(s) congested as well.
Congested state would need to be changed into a counter instead of a
bitfield, I think.

The pdflush-is-running-on-this-queue bit can probably remain as-is. It's
mostly meant to prevent 2 pdflush daemons from running the same queue.
I don't see much harm in pdflush #1 running on /dev/md0 which consists
of /dev/sda1 and /dev/sdb1 and pdflush #2 running on /dev/sdb2, right ?

Would that be the way to go? If so, I'll take a stab at it.

Mike.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH] bdi_congestion_funp (was: Re: [PATCH] per process request limits (was Re: IO scheduler, queue depth, nr_requests))
  2004-02-22 14:02                                     ` Miquel van Smoorenburg
@ 2004-02-22 19:55                                       ` Andrew Morton
  0 siblings, 0 replies; 29+ messages in thread
From: Andrew Morton @ 2004-02-22 19:55 UTC (permalink / raw)
  To: Miquel van Smoorenburg
  Cc: axboe, thornber, miquels, piggin, miquels, linux-lvm,
	linux-kernel

Miquel van Smoorenburg <miquels@cistron.net> wrote:
>
> The pdflush-is-running-on-this-queue bit can probably remain as-is. It's
>  mostly meant to prevent 2 pdflush daemons from running the same queue.
>  I don't see much harm in pdflush #1 running on /dev/md0 which consists
>  of /dev/sda1 and /dev/sdb1 and pdflush #2 running on /dev/sdb2, right ?

Yes, having two pdflushes working the same spindle is a bit pointless but
is presumably fairly harmless.

The most important thing here is to prevent pdflush from blocking on
request exhaustion.  Because while pdflush sleeps on a particular disk, all
the others remain unserviced.


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: IO scheduler, queue depth, nr_requests
       [not found]                     ` <1r6S4-6cv-1@gated-at.bofh.it>
@ 2004-02-25 20:17                       ` Bill Davidsen
  2004-02-25 21:39                         ` Miquel van Smoorenburg
  2004-02-26  0:39                         ` Nick Piggin
  0 siblings, 2 replies; 29+ messages in thread
From: Bill Davidsen @ 2004-02-25 20:17 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Linux Kernel Mailing List

Nick Piggin wrote:

> But the whole reason it is getting blocked in the first place
> is because your controller is sucking up all your requests.
> The whole problem is not a problem if you use properly sized
> queues.
> 
> I'm a bit surprised that it wasn't working well with a controller
> queue depth of 64 and 128 nr_requests. I'll give you a per process
> request limit patch to try in a minute.

And there's the rub... he did try what you are calling correctly sized 
queues, and his patch works better. I'm all in favor of having the 
theory and then writing the code, but when something works I would 
rather understand why and modify the theory.

In other words, given a patch which does help performance in this case, 
it would be good to understand why, instead of favoring a solution which 
is better in theory, but which has been tried and found inferior in 
performance.

I am NOT saying we should just block, effective as Miquel's patch seems, 
just that we should understand why it works well instead of saying it is 
in theory bad. I agree, but it works! Hopefully per-process limits solve 
this, but they "in theory" could result in blocking a process in an 
otherwise idle system. Unless I misread what you mean of course. 
Processes which calculate for a while and write results are not uncommon, 
and letting such a process write a whole bunch of data and then go 
calculate while it is written is certainly the way it should work. I'm 
unconvinced that per-process limits are the whole answer without 
considering the entire io load on the system.

Feel free to tell me I'm misreading your intent (and why).
-- 
bill on the road <davidsen@tmr.com>

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: IO scheduler, queue depth, nr_requests
  2004-02-25 20:17                       ` Bill Davidsen
@ 2004-02-25 21:39                         ` Miquel van Smoorenburg
  2004-02-26  0:39                         ` Nick Piggin
  1 sibling, 0 replies; 29+ messages in thread
From: Miquel van Smoorenburg @ 2004-02-25 21:39 UTC (permalink / raw)
  To: linux-kernel

In article <403D02E3.4070208@tmr.com>,
Bill Davidsen  <davidsen@tmr.com> wrote:
>Nick Piggin wrote:
>
>> But the whole reason it is getting blocked in the first place
>> is because your controller is sucking up all your requests.
>> The whole problem is not a problem if you use properly sized
>> queues.
>> 
>> I'm a bit surprised that it wasn't working well with a controller
>> queue depth of 64 and 128 nr_requests. I'll give you a per process
>> request limit patch to try in a minute.
>
>And there's the rub... he did try what you are calling correctly sized 
>queues, and his patch works better. I'm all in favor of having the 
>theory and then writing the code, but when something works I would 
>rather understand why and modify the theory.

Ah, but we now do understand why - it was pdflush and a process
fighting over get_request on basically the same device, before and
after LVM remapping.

See:

http://www.ussg.iu.edu/hypermail/linux/kernel/0402.2/1522.html

More complete solutions:

https://www.redhat.com/archives/linux-lvm/2004-February/msg00203.html
https://www.redhat.com/archives/linux-lvm/2004-February/msg00215.html

Mike.


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: IO scheduler, queue depth, nr_requests
  2004-02-25 20:17                       ` Bill Davidsen
  2004-02-25 21:39                         ` Miquel van Smoorenburg
@ 2004-02-26  0:39                         ` Nick Piggin
  1 sibling, 0 replies; 29+ messages in thread
From: Nick Piggin @ 2004-02-26  0:39 UTC (permalink / raw)
  To: bill davidsen; +Cc: Linux Kernel Mailing List



Bill Davidsen wrote:

> Nick Piggin wrote:
>
>> But the whole reason it is getting blocked in the first place
>> is because your controller is sucking up all your requests.
>> The whole problem is not a problem if you use properly sized
>> queues.
>>
>> I'm a bit surprised that it wasn't working well with a controller
>> queue depth of 64 and 128 nr_requests. I'll give you a per process
>> request limit patch to try in a minute.
>
>
> And there's the rub... he did try what you are calling correctly sized 
> queues, and his patch works better. I'm all in favor of having the 
> theory and then writing the code, but when something works I would 
> rather understand why and modify the theory.
>
> In other words, given a patch which does help performance in this 
> case, it would be good to understand why, instead of favoring a 
> solution which is better in theory, but which has been tried and found 
> inferior in performance.
>
> I am NOT saying we should just block, effective as Miquel's patch 
> seems, just that we should understand why it works well instead of 
> saying it is in theory bad. I agree, but it works! Hopefully 
> per-process limits solve this, but they "in theory" could result in 
> blocking a process in an otherwise idle system. Unless I misread what 
> you mean of course. Processes which calculate for a while and write 
> results are not uncommon, and letting such a process write a whole 
> bunch of data and then go calculate while it is written is certainly 
> the way it should work. I'm unconvinced that per-process limits are 
> the whole answer without considering the entire io load on the system.
>
> Feel free to tell me I'm misreading your intent (and why).


No, I know Miquel's patch works and his analysis of what is happening
is correct (he proved it). The patch is a bit of a hack to get around
the specific problem he is seeing which should never happen with an
appropriately sized queue, and it might actually hurt in some cases.


^ permalink raw reply	[flat|nested] 29+ messages in thread

end of thread, other threads:[~2004-02-26  0:39 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <20040216131609.GA21974@cistron.nl>
     [not found] ` <20040216133047.GA9330@suse.de>
     [not found]   ` <20040217145716.GE30438@traveler.cistron.net>
2004-02-18 23:52     ` IO scheduler, queue depth, nr_requests Miquel van Smoorenburg
2004-02-19  1:24       ` Nick Piggin
2004-02-19  1:52         ` Miquel van Smoorenburg
2004-02-19  2:01           ` Nick Piggin
2004-02-19  1:26       ` Andrew Morton
2004-02-19  2:11         ` Miquel van Smoorenburg
2004-02-19  2:26           ` Andrew Morton
2004-02-19 10:15             ` Miquel van Smoorenburg
2004-02-19 10:19               ` Jens Axboe
2004-02-19 20:59                 ` Miquel van Smoorenburg
2004-02-19 22:52                   ` Nick Piggin
2004-02-19 23:53                     ` Miquel van Smoorenburg
2004-02-20  0:15                       ` Nick Piggin
2004-02-20  1:12                       ` [PATCH] per process request limits (was Re: IO scheduler, queue depth, nr_requests) Nick Piggin
2004-02-20  1:26                         ` Andrew Morton
2004-02-20  1:40                           ` Nick Piggin
2004-02-20  2:32                             ` Andrew Morton
2004-02-20 14:40                               ` [PATCH] bdi_congestion_funp (was: Re: [PATCH] per process request limits (was Re: IO scheduler, queue depth, nr_requests)) Miquel van Smoorenburg
2004-02-20 14:57                                 ` Jens Axboe
2004-02-20 14:59                                 ` Joe Thornber
2004-02-20 15:00                                   ` Jens Axboe
2004-02-22 14:02                                     ` Miquel van Smoorenburg
2004-02-22 19:55                                       ` Andrew Morton
2004-02-20  1:45                         ` [PATCH] per process request limits (was Re: IO scheduler, queue depth, nr_requests) Nick Piggin
2004-02-19  2:51           ` IO scheduler, queue depth, nr_requests Nick Piggin
2004-02-19 10:21             ` Jens Axboe
     [not found] <1qJVx-75K-15@gated-at.bofh.it>
     [not found] ` <1qJVx-75K-17@gated-at.bofh.it>
     [not found]   ` <1qJVw-75K-11@gated-at.bofh.it>
     [not found]     ` <1qLb8-6m-27@gated-at.bofh.it>
     [not found]       ` <1qLXl-XV-17@gated-at.bofh.it>
     [not found]         ` <1qMgF-1dA-5@gated-at.bofh.it>
     [not found]           ` <1qTs3-7A2-51@gated-at.bofh.it>
     [not found]             ` <1qTBB-7Hh-7@gated-at.bofh.it>
     [not found]               ` <1r3AS-1hW-5@gated-at.bofh.it>
     [not found]                 ` <1r5jD-2RQ-31@gated-at.bofh.it>
     [not found]                   ` <1r6fH-3L8-11@gated-at.bofh.it>
     [not found]                     ` <1r6S4-6cv-1@gated-at.bofh.it>
2004-02-25 20:17                       ` Bill Davidsen
2004-02-25 21:39                         ` Miquel van Smoorenburg
2004-02-26  0:39                         ` Nick Piggin

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox