public inbox for linux-kernel@vger.kernel.org
* [patch] get_request starvation fix
@ 2002-02-08  8:46 Andrew Morton
  2002-02-08  8:57 ` Jens Axboe
                   ` (3 more replies)
  0 siblings, 4 replies; 23+ messages in thread
From: Andrew Morton @ 2002-02-08  8:46 UTC (permalink / raw)
  To: Rik van Riel, William Lee Irwin III, lkml

Here's a patch which addresses the get_request starvation problem.

The problem we're experiencing is that when the request queue is
exhausted, processes go to sleep until the request queue has refilled
to the batch_nr_requests level.  But already-running tasks can simply
whizz in and steal free requests, thus preventing the sleeping tasks
from getting woken.

Some instrumentation illustrates this clearly.  All testing was on
ext2.  In a `dbench 64' run on stock 2.4.18-pre9, the maximum sleep
duration in __get_request_wait() was 37 seconds, and it's being very
unfair.  Look at this section of the trace:

Feb  7 21:59:06 mnm kernel: dbench[883] wakes after 442ms
Feb  7 21:59:06 mnm kernel: dbench[877] wakes after 442ms
Feb  7 21:59:06 mnm kernel: dbench[868] wakes after 441ms
Feb  7 21:59:06 mnm kernel: kupdated[7] wakes after 11587ms
Feb  7 21:59:06 mnm kernel: dbench[914] wakes after 574ms
Feb  7 21:59:06 mnm kernel: dbench[873] wakes after 2618ms
Feb  7 21:59:06 mnm kernel: dbench[927] wakes after 755ms
Feb  7 21:59:06 mnm kernel: dbench[873] wakes after 263ms
Feb  7 21:59:06 mnm kernel: dbench[924] wakes after 838ms
Feb  7 21:59:06 mnm kernel: dbench[924] wakes after 134ms
Feb  7 21:59:06 mnm kernel: dbench[871] wakes after 18688ms
Feb  7 21:59:06 mnm kernel: dbench[908] wakes after 3245ms
Feb  7 21:59:06 mnm kernel: dbench[895] wakes after 974ms
Feb  7 21:59:06 mnm kernel: dbench[878] wakes after 27401ms
Feb  7 21:59:06 mnm kernel: dbench[895] wakes after 278ms
Feb  7 21:59:06 mnm kernel: dbench[904] wakes after 278ms

With the patch applied, the maximum sleep duration was eight seconds,
but when this was happening, *all* tasks were sleeping for similar
durations:

Feb  7 23:55:40 mnm kernel: dbench[948] wakes after 3978ms
Feb  7 23:55:40 mnm kernel: dbench[935] wakes after 3439ms
Feb  7 23:55:41 mnm kernel: dbench[955] wakes after 2998ms
Feb  7 23:55:41 mnm kernel: dbench[965] wakes after 2847ms
Feb  7 23:55:41 mnm kernel: dbench[994] wakes after 3032ms
Feb  7 23:55:41 mnm kernel: dbench[969] wakes after 2652ms
Feb  7 23:55:41 mnm kernel: dbench[956] wakes after 2197ms
Feb  7 23:55:41 mnm kernel: dbench[993] wakes after 1589ms
Feb  7 23:55:41 mnm kernel: dbench[990] wakes after 1590ms
Feb  7 23:55:41 mnm kernel: dbench[939] wakes after 1585ms
Feb  7 23:55:41 mnm kernel: kupdated[7] wakes after 1555ms
Feb  7 23:55:41 mnm kernel: dbench[976] wakes after 1787ms
Feb  7 23:55:41 mnm kernel: dbench[997] wakes after 1653ms
Feb  7 23:55:41 mnm kernel: dbench[961] wakes after 1891ms
Feb  7 23:55:42 mnm kernel: dbench[966] wakes after 1967ms
Feb  7 23:55:42 mnm kernel: dbench[936] wakes after 1339ms
Feb  7 23:55:42 mnm kernel: dbench[965] wakes after 1216ms
Feb  7 23:55:42 mnm kernel: kjournald[9] wakes after 1253ms
Feb  7 23:55:42 mnm kernel: dbench[990] wakes after 879ms
Feb  7 23:55:42 mnm kernel: kupdated[7] wakes after 849ms
Feb  7 23:55:42 mnm kernel: dbench[959] wakes after 343ms
Feb  7 23:55:42 mnm kernel: dbench[936] wakes after 343ms
Feb  7 23:55:42 mnm kernel: dbench[964] wakes after 492ms
Feb  7 23:55:42 mnm kernel: dbench[966] wakes after 648ms
Feb  7 23:55:42 mnm kernel: kupdated[7] wakes after 413ms
Feb  7 23:55:42 mnm kernel: dbench[988] wakes after 344ms



The actual algorithm I chose is described in the patch.  It is
noteworthy that the wakeup strategy has changed from wake-all to
wake-one, which should save a few cycles.  The number of context
switches during the entire dbench 64 run fell from 57,000 to
29,000.

Interestingly, dbench runtimes are still very inconsistent, and
many clients still exit quite early in the run.  Beats me.

Rik, this patch will conflict with read-latency2 in the init code. 
Easily fixable.  I haven't tested with the larger request queue, but I
suggest that we want to keep the 1024 requests, a high-water mark of
queue_nr_requests/4 and a low-water mark of queue_nr_requests/8.  I'll
do more testing tomorrow.

Also, you noted the other day that a LILO run *never* terminated when
there was a dbench running.  This is in fact not due to request
starvation.  It's due to livelock in invalidate_bdev(), which is called
via ioctl(BLKFLSBUF) by LILO.  invalidate_bdev() will never terminate
as long as another task is generating locked buffers against the
device.


--- linux-2.4.18-pre9/drivers/block/ll_rw_blk.c	Thu Feb  7 13:04:11 2002
+++ linux-akpm/drivers/block/ll_rw_blk.c	Fri Feb  8 00:33:01 2002
@@ -118,10 +118,15 @@ int * max_readahead[MAX_BLKDEV];
 int * max_sectors[MAX_BLKDEV];
 
 /*
- * How many reqeusts do we allocate per queue,
- * and how many do we "batch" on freeing them?
+ * The total number of requests in each queue.
  */
-static int queue_nr_requests, batch_requests;
+static int queue_nr_requests;
+
+/*
+ * low- and high-water marks for the queues.
+ */
+static int batch_requests_low;
+static int batch_requests_high;
 
 static inline int get_max_sectors(kdev_t dev)
 {
@@ -418,9 +423,9 @@ void blk_init_queue(request_queue_t * q,
 #define blkdev_free_rq(list) list_entry((list)->next, struct request, queue);
 /*
  * Get a free request. io_request_lock must be held and interrupts
- * disabled on the way in.
+ * disabled on the way in.  Returns NULL if there are no free requests.
  */
-static inline struct request *get_request(request_queue_t *q, int rw)
+static struct request *get_request(request_queue_t *q, int rw)
 {
 	struct request *rq = NULL;
 	struct request_list *rl = q->rq + rw;
@@ -438,18 +443,81 @@ static inline struct request *get_reques
 }
 
 /*
- * No available requests for this queue, unplug the device.
+ * Here's the request allocation design:
+ *
+ * 1: Blocking on request exhaustion is a key part of I/O throttling.
+ * 
+ * 2: We want to be `fair' to all requesters.  We must avoid starvation, and
+ *    attempt to ensure that all requesters sleep for a similar duration.  Hence
+ *    no stealing requests when there are other processes waiting.
+ * 
+ * 3: We also wish to support `batching' of requests.  So when a process is
+ *    woken, we want to allow it to allocate a decent number of requests
+ *    before it blocks again, so they can be nicely merged (this only really
+ *    matters if the process happens to be adding requests near the head of
+ *    the queue).
+ * 
+ * 4: We want to avoid scheduling storms.  This isn't really important, because
+ *    the system will be I/O bound anyway.  But it's easy.
+ * 
+ *    There is tension between requirements 2 and 3.  Once a task has woken,
+ *    we don't want to allow it to sleep as soon as it takes its second request.
+ *    But we don't want currently-running tasks to steal all the requests
+ *    from the sleepers.  We handle this with high- and low- water marks and
+ *    with the assumption that request taking is much, much faster than
+ *    request freeing.
+ * 
+ * So here's what we do:
+ * 
+ *    a) A READA requester fails if free_requests < batch_requests_high
+ * 
+ *       We don't want READA requests to prevent sleepers from ever
+ *       waking.
+ * 
+ *  When a process wants a new request:
+ * 
+ *    b) If someone else is waiting on requests and free_requests < low_water,
+ *       the requester sleeps in FIFO manner.
+ * 
+ *    c) If low_water < free_requests < high_water, the caller is immediately
+ *       granted a new request.
+ * 
+ *    d) If nobody is waiting on requests, the caller gets given a request,
+ *       if there are any available.  Otherwise the caller sleeps.
+ * 
+ *  When a request is released:
+ * 
+ *    e) If free_requests < low_water, do nothing.
+ * 
+ *    f) If free_requests > high_water, wake up a single waiter.
+ * 
+ *   The net effect is that when a process is woken at the high-water mark,
+ *   it will be able to take approximately (high-water - low-water) requests
+ *   before blocking again (at the tail of the queue).
+ * 
+ *   This all assumes that the rate of taking requests is much, much higher
+ *   than the rate of releasing them.  Which is very true.
+ *
+ * -akpm, Feb 2002.
  */
+#undef REQTIMING
 static struct request *__get_request_wait(request_queue_t *q, int rw)
 {
 	register struct request *rq;
 	DECLARE_WAITQUEUE(wait, current);
 
+#ifdef REQTIMING
+	struct timeval tv1, tv2;
+	unsigned long long t1, t2;
+	unsigned long t;
+	do_gettimeofday(&tv1);
+#endif
+
 	generic_unplug_device(q);
-	add_wait_queue(&q->wait_for_request, &wait);
+	add_wait_queue_exclusive(&q->wait_for_request, &wait);
 	do {
 		set_current_state(TASK_UNINTERRUPTIBLE);
-		if (q->rq[rw].count < batch_requests)
+		if (q->rq[rw].count < batch_requests_low)
 			schedule();
 		spin_lock_irq(&io_request_lock);
 		rq = get_request(q,rw);
@@ -457,19 +525,20 @@ static struct request *__get_request_wai
 	} while (rq == NULL);
 	remove_wait_queue(&q->wait_for_request, &wait);
 	current->state = TASK_RUNNING;
-	return rq;
-}
 
-static inline struct request *get_request_wait(request_queue_t *q, int rw)
-{
-	register struct request *rq;
-
-	spin_lock_irq(&io_request_lock);
-	rq = get_request(q, rw);
-	spin_unlock_irq(&io_request_lock);
-	if (rq)
-		return rq;
-	return __get_request_wait(q, rw);
+#ifdef REQTIMING
+	do_gettimeofday(&tv2);
+	t1 = tv1.tv_sec;
+	t1 *= 1000000;
+	t1 += tv1.tv_usec;
+	t2 = tv2.tv_sec;
+	t2 *= 1000000;
+	t2 += tv2.tv_usec;
+	t = t2 - t1;
+	printk("%s[%d] wakes after %ldms\n", current->comm,
+			current->pid, t / 1000);
+#endif
+	return rq;
 }
 
 /* RO fail safe mechanism */
@@ -546,7 +615,7 @@ static inline void add_request(request_q
 /*
  * Must be called with io_request_lock held and interrupts disabled
  */
-inline void blkdev_release_request(struct request *req)
+void blkdev_release_request(struct request *req)
 {
 	request_queue_t *q = req->q;
 	int rw = req->cmd;
@@ -560,7 +629,8 @@ inline void blkdev_release_request(struc
 	 */
 	if (q) {
 		list_add(&req->queue, &q->rq[rw].free);
-		if (++q->rq[rw].count >= batch_requests && waitqueue_active(&q->wait_for_request))
+		if (++q->rq[rw].count >= batch_requests_high &&
+				waitqueue_active(&q->wait_for_request))
 			wake_up(&q->wait_for_request);
 	}
 }
@@ -742,22 +812,41 @@ again:
 			BUG();
 	}
 		
-	/*
-	 * Grab a free request from the freelist - if that is empty, check
-	 * if we are doing read ahead and abort instead of blocking for
-	 * a free slot.
-	 */
 get_rq:
 	if (freereq) {
 		req = freereq;
 		freereq = NULL;
-	} else if ((req = get_request(q, rw)) == NULL) {
-		spin_unlock_irq(&io_request_lock);
-		if (rw_ahead)
-			goto end_io;
-
-		freereq = __get_request_wait(q, rw);
-		goto again;
+	} else {
+		/*
+		 * See description above __get_request_wait()
+		 */
+		if (rw_ahead) {
+			if (q->rq[rw].count < batch_requests_high) {
+				spin_unlock_irq(&io_request_lock);
+				goto end_io;
+			}
+			req = get_request(q, rw);
+			if (req == NULL)
+				BUG();
+		} else {
+			if (waitqueue_active(&q->wait_for_request)) {
+				if (q->rq[rw].count < batch_requests_low) {
+					spin_unlock_irq(&io_request_lock);
+					freereq = __get_request_wait(q, rw);
+					goto again;
+				}
+				req = get_request(q, rw);
+				if (req == NULL)
+					BUG();
+			} else {
+				req = get_request(q, rw);
+				if (req == NULL) {
+					spin_unlock_irq(&io_request_lock);
+					freereq = __get_request_wait(q, rw);
+					goto again;
+				}
+			}
+		}
 	}
 
 /* fill up the request-info, and add it to the queue */
@@ -1124,8 +1213,10 @@ int __init blk_dev_init(void)
 	/*
 	 * Batch frees according to queue length
 	 */
-	batch_requests = queue_nr_requests/4;
-	printk("block: %d slots per queue, batch=%d\n", queue_nr_requests, batch_requests);
+	batch_requests_high = queue_nr_requests / 4;
+	batch_requests_low = batch_requests_high / 2;
+	printk("block: %d slots per queue, batch_low=%d, batch_high=%d\n",
+		queue_nr_requests, batch_requests_low, batch_requests_high);
 
 #ifdef CONFIG_AMIGA_Z2RAM
 	z2_init();

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [patch] get_request starvation fix
  2002-02-08  8:46 Andrew Morton
@ 2002-02-08  8:57 ` Jens Axboe
  2002-02-08  9:57   ` Andrew Morton
  2002-02-08  9:10 ` Andrew Morton
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 23+ messages in thread
From: Jens Axboe @ 2002-02-08  8:57 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Rik van Riel, William Lee Irwin III, lkml

On Fri, Feb 08 2002, Andrew Morton wrote:
> Here's a patch which addresses the get_request starvation problem.

[snip]

Agrh, if only I knew you were working on this too :/. Oh well, from a
first-read the patch looks good.

-- 
Jens Axboe



* Re: [patch] get_request starvation fix
  2002-02-08  8:46 Andrew Morton
  2002-02-08  8:57 ` Jens Axboe
@ 2002-02-08  9:10 ` Andrew Morton
  2002-02-08 11:37 ` Rik van Riel
  2002-02-11  9:41 ` Andrew Morton
  3 siblings, 0 replies; 23+ messages in thread
From: Andrew Morton @ 2002-02-08  9:10 UTC (permalink / raw)
  To: Rik van Riel, William Lee Irwin III, lkml

Andrew Morton wrote:
> 
> It is noteworthy that the wakeup strategy has changed from wake-all to
> wake-one, which should save a few cycles.

It is also noteworthy that read latency completely sucks.  Not sure if
I've made it worse.

Anyway, with exclusive wakeup we need separate waitqueues for read and
write requests.  Updated patch at

	http://www.zip.com.au/~akpm/linux/2.4/2.4.18-pre9/make_request.patch

-


* Re: [patch] get_request starvation fix
  2002-02-08  8:57 ` Jens Axboe
@ 2002-02-08  9:57   ` Andrew Morton
  0 siblings, 0 replies; 23+ messages in thread
From: Andrew Morton @ 2002-02-08  9:57 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Rik van Riel, William Lee Irwin III, lkml

Jens Axboe wrote:
> 
> On Fri, Feb 08 2002, Andrew Morton wrote:
> > Here's a patch which addresses the get_request starvation problem.
> 
> [snip]
> 
> Agrh, if only I knew you were working on this too :/. Oh well, from a
> first-read the patch looks good.

Seems that with FIFO fairness, /bin/sync now also livelocks.  And
it's pretty easy to see why.  There's nothing to make
write_unlocked_buffers() terminate.

We'll worry about that tomorrow.  I may just make it give up
after writing (2 * nr_buffers_type[BUF_DIRTY]) buffers.

The patch works well with read-latency2 (and it didn't throw
rejects).  Smooth and fast.  It's going to take some time and
testing to settle everything in.


-


* Re: [patch] get_request starvation fix
  2002-02-08  8:46 Andrew Morton
  2002-02-08  8:57 ` Jens Axboe
  2002-02-08  9:10 ` Andrew Morton
@ 2002-02-08 11:37 ` Rik van Riel
  2002-02-08 18:28   ` Andrew Morton
  2002-02-11  9:41 ` Andrew Morton
  3 siblings, 1 reply; 23+ messages in thread
From: Rik van Riel @ 2002-02-08 11:37 UTC (permalink / raw)
  To: Andrew Morton; +Cc: William Lee Irwin III, lkml

On Fri, 8 Feb 2002, Andrew Morton wrote:

> + *   This all assumes that the rate of taking requests is much, much higher
> + *   than the rate of releasing them.  Which is very true.

This is not necessarily true for read requests.

If each read request is synchronous and the process will
generate the next read request after the current one
has finished, then it's quite possible to clog up the
queue with read requests which are generated at exactly
the same rate as they're processed.

Couldn't this still cause starvation, even with your patch?

regards,

Rik
-- 
"Linux holds advantages over the single-vendor commercial OS"
    -- Microsoft's "Competing with Linux" document

http://www.surriel.com/		http://distro.conectiva.com/



* Re: [patch] get_request starvation fix
  2002-02-08 11:37 ` Rik van Riel
@ 2002-02-08 18:28   ` Andrew Morton
  0 siblings, 0 replies; 23+ messages in thread
From: Andrew Morton @ 2002-02-08 18:28 UTC (permalink / raw)
  To: Rik van Riel; +Cc: William Lee Irwin III, lkml

Rik van Riel wrote:
> 
> On Fri, 8 Feb 2002, Andrew Morton wrote:
> 
> > + *   This all assumes that the rate of taking requests is much, much higher
> > + *   than the rate of releasing them.  Which is very true.
> 
> This is not necessarily true for read requests.
> 
> If each read request is synchronous and the process will
> generate the next read request after the current one
> has finished, then it's quite possible to clog up the
> queue with read requests which are generated at exactly
> the same rate as they're processed.
> 
> Couldn't this still cause starvation, even with your patch?

No, that's fine.

The problem which the comment refers to is: how to provide
per-process request batching without running off and creating
per-process reservation pools or such.

What I'm relying on is that when a sleeper is woken (at low-water),
there are at least (high-water - low-water) requests available before
get_request will again sleep.  And that the woken process will be
able to grab a decent number of those non-blocking requests. I
suspect it's always true, as long as (high-water - low-water) is
"much greater than" the number of CPUs.

The synchronous reader is well-behaved, and should be nicely
FIFO if we're getting low on requests.

-


* Re: [patch] get_request starvation fix
@ 2002-02-08 19:31 Dieter Nützel
  0 siblings, 0 replies; 23+ messages in thread
From: Dieter Nützel @ 2002-02-08 19:31 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Jens Axboe, Ingo Molnar, Robert Love, Linux Kernel List

On Fri, Feb 08 2002, Andrew Morton wrote:
> Here's a patch which addresses the get_request starvation problem.

[snip]

> Also, you noted the other day that a LILO run *never* terminated when
> there was a dbench running.  This is in fact not due to request
> starvation.  It's due to livelock in invalidate_bdev(), which is called
> via ioctl(BLKFLSBUF) by LILO.  invalidate_bdev() will never terminate
> as long as another task is generating locked buffers against the
> device.
[snip]

Could the below be related?
I get system locks through lilo (bzlilo) from time to time with all the
latest kernels + O(1) + -aa + preempt

Thanks,
	Dieter

> Re: [patch] O(1) scheduler, -J9
> Datum: Wed, 30 Jan 2002 17:36:21 +0100
> Von: Dieter Nützel <Dieter.Nuetzel@hamburg.de>
> An: Ingo Molnar <mingo@elte.hu>
> Kopie: Robert Love <rml@tech9.net>, Oleg Drokin <green@namesys.com>
>
> On Wednesday, 30. January 2002 17:26, Dieter Nützel wrote:
> > On Wednesday, 30. January 2002 15:40, Ingo Molnar wrote:
> > > On Wed, 30 Jan 2002, Dieter [iso-8859-15] Nützel wrote:
> > > > As always running "nice +19 make -j40" in my HUGE C++ VTK tree in the
> > > > background ;-)
> > > >
> > > > The mouse feels a little bit sluggish after KMail start so I reniced X
> > > > to -10. [...]
> > >
> > > does this make X smooth? nice -10 is a good choice, it's not too low and
> > > not too high, i'm using that value myself.
> >
> > Yes, mostly.
> > In "normal operation mode" X at 0 is good.
> >
> > I only get some very little but noticeable unsmoothness with several KDE
> > apps running. It could be KDE it self.
> >
> > Latency degradation since -J4 stays. But that could be missing of some
> > lock-breaks.
> >
> > Robert's latest BKL stuff halved the numbers for the two
> > latencytest0.42-png graphic benches on 2.4.18-pre7 + patches.

> Addition:

> I am most worried about the occasional kernel hangs during make
> bzlilo/bzImage. It appears during lilo (???). The new kernel (vmlinux) is
> still being built in /usr/src/linux, but together with the "latest" *.o and
> .*.flags files it is broken due to ReiserFS transaction replay after reboot
> (normal behavior with a journaling FS but without full data journaling).
> 
> Maybe a RACE between lilo and sync?


* Re: [patch] get_request starvation fix
       [not found] <200202081932.GAA05943@mangalore.zipworld.com.au>
@ 2002-02-08 19:44 ` Andrew Morton
  2002-02-08 19:53   ` Dieter Nützel
  2002-02-08 20:43   ` Rik van Riel
  0 siblings, 2 replies; 23+ messages in thread
From: Andrew Morton @ 2002-02-08 19:44 UTC (permalink / raw)
  To: Dieter Nützel
  Cc: Jens Axboe, Ingo Molnar, Robert Love, Linux Kernel List

Dieter Nützel wrote:
> 
> On Fri, Feb 08 2002, Andrew Morton wrote:
> > Here's a patch which addresses the get_request starvation problem.
> 
> [snip]
> 
> > Also, you noted the other day that a LILO run *never* terminated when
> > there was a dbench running.  This is in fact not due to request
> > starvation.  It's due to livelock in invalidate_bdev(), which is called
> > via ioctl(BLKFLSBUF) by LILO.  invalidate_bdev() will never terminate
> > as long as another task is generating locked buffers against the
> > device.
> [snip]
> 
> Could the below be related?
> I get system locks through lilo (bzlilo) from time to time with all the
> latest kernels + O(1) + -aa + preempt
> 

Depends what you mean by "system locks".  The invalidate_bdev() problem
won't lock the machine.  Its symptoms are merely that the ioctl will
not terminate until the process which is writing to disk stops.

In other words: if you run dbench, then lilo, lilo will not complete
until after dbench terminates.

If you're seeing actual have-to-hit-reset lockups then that'll
be due to something quite different.

-


* Re: [patch] get_request starvation fix
  2002-02-08 19:44 ` Andrew Morton
@ 2002-02-08 19:53   ` Dieter Nützel
  2002-02-08 20:43   ` Rik van Riel
  1 sibling, 0 replies; 23+ messages in thread
From: Dieter Nützel @ 2002-02-08 19:53 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Jens Axboe, Ingo Molnar, Robert Love, Linux Kernel List

On Friday, 8. February 2002 20:44, Andrew Morton wrote:
> Dieter Nützel wrote:
> > On Fri, Feb 08 2002, Andrew Morton wrote:
> > > Here's a patch which addresses the get_request starvation problem.
> >
> > [snip]
> >
> > > Also, you noted the other day that a LILO run *never* terminated when
> > > there was a dbench running.  This is in fact not due to request
> > > starvation.  It's due to livelock in invalidate_bdev(), which is called
> > > via ioctl(BLKFLSBUF) by LILO.  invalidate_bdev() will never terminate
> > > as long as another task is generating locked buffers against the
> > > device.
> >
> > [snip]
> >
> > Could the below be related?
> > I get system locks through lilo (bzlilo) from time to time with all
> > the latest kernels + O(1) + -aa + preempt
>
> Depends what you mean by "system locks".

Hard lock up :-(

Nothing in the log files.
No SysRq works; only reset. But ReiserFS recovers: a few corrupted
*.o.flags files and, most of the time, a broken /usr/src/vmlinux or freshly
written /boot/vmlinuz file.

> The invalidate_bdev() problem won't lock the machine.

I see.

> In other words: if you run dbench, then lilo, lilo will not complete
> until after dbench terminates.

dbench wasn't running beforehand.
Only several "sync" commands (by hand) during the kernel build help.

> If you're seeing actual have-to-hit-reset lockups then that'll
> be due to something quite different.

Sadly, yes.

Thanks,
	Dieter


* Re: [patch] get_request starvation fix
  2002-02-08 19:44 ` Andrew Morton
  2002-02-08 19:53   ` Dieter Nützel
@ 2002-02-08 20:43   ` Rik van Riel
  1 sibling, 0 replies; 23+ messages in thread
From: Rik van Riel @ 2002-02-08 20:43 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Dieter Nützel, Jens Axboe, Ingo Molnar, Robert Love,
	Linux Kernel List

On Fri, 8 Feb 2002, Andrew Morton wrote:

> In other words: if you run dbench, then lilo, lilo will not complete
> until after dbench terminates.

Actually, lilo completed after 5 minutes, but still only
about halfway through the dbench.

Rik
-- 
Will hack the VM for food.

http://www.surriel.com/		http://distro.conectiva.com/



* Re: [patch] get_request starvation fix
@ 2002-02-09  1:56 rwhron
  0 siblings, 0 replies; 23+ messages in thread
From: rwhron @ 2002-02-09  1:56 UTC (permalink / raw)
  To: akpm; +Cc: linux-kernel

On Fri, Feb 08, 2002 at 10:29:37AM -0800, Andrew Morton wrote:
> > > Here's a patch which addresses the get_request starvation problem.
> > 
> > Looks like a tremendously important patch.  Thanks!
> 
> hmm.  Dunno.  I don't think it's very common to hit get_request
> starvation.  Only for dbenchy things, I suspect.  Still, it's
> good to have a design which addresses nasty corner cases.

It really works!  

tiobench sequential reads with 32 threads went from max latency of
193 seconds down to 7 seconds with your make_request, read-latency2
and low-latency patches.  (2.4.18-pre9-am1)  The % of high latency
requests also went down.

K6-2 475 mhz with 384 MB ram and reiserfs on IDE disks.
Average of 3 runs.
Total File size = 1024 MB.  (individual file size = 1024 / num-threads)
Read, write, and seek rates in MB/sec. 
Latency in milliseconds.
Percent of requests that took longer than 2 and 10 seconds.

Sequential       Num                   Avg     Maximum  Lat%   Lat%  CPU
Reads            Thr  Rate   (CPU%)  Latency   Latency   >2s   >10s  Eff
                 --- ---------------------------------------------------
2.4.18-pre9-am1    8   8.94  13.33%   10.402   1785.52  0.000  0.000  67
2.4.18-pre9        8   8.85  13.22%   10.537   1954.27  0.000  0.000  67
2.4.18-pre9-am1   16   8.98  13.70%   30.827   4023.09  0.000  0.000  66
2.4.18-pre9       16   8.91  13.60%   31.122   4072.67  0.000  0.000  65
2.4.18-pre9-am1   32   9.00  14.16%   70.834   7032.45  0.000  0.000  64
2.4.18-pre9       32   8.87  13.89%   66.135 193590.10  0.030  0.023  64
                             
                             
Random           Num                   Avg     Maximum  Lat%   Lat%  CPU
Reads            Thr  Rate   (CPU%)  Latency   Latency   >2s   >10s  Eff
                 --- ---------------------------------------------------
2.4.18-pre9-am1    8   0.69  1.828%  132.242    473.73  0.000  0.000  38
2.4.18-pre9        8   0.71  2.156%  128.869    510.38  0.000  0.000  33
2.4.18-pre9-am1   16   0.72  2.154%  371.288   1410.13  0.000  0.000  33
2.4.18-pre9       16   0.74  2.025%  364.871   1397.82  0.000  0.000  36
2.4.18-pre9-am1   32   0.74  2.229%  816.605   2996.07  0.000  0.000  33
2.4.18-pre9       32   0.75  2.163%  734.035   2841.97  0.000  0.000  35
                             
Sequential       Num                   Avg     Maximum  Lat%   Lat%  CPU 
Writes           Thr  Rate   (CPU%)  Latency   Latency   >2s   >10s  Eff 
                 --- --------------------------------------------------- 
2.4.18-pre9-am1    8  11.45  52.33%    7.651  15095.52  0.058  0.000  22 
2.4.18-pre9        8   9.23  40.84%    9.021  12351.12  0.040  0.000  23
2.4.18-pre9-am1   16  11.48  54.57%   22.253  41922.53  0.214  0.000  21 
2.4.18-pre9       16   9.46  41.71%   26.374  28280.18  0.212  0.000  23
2.4.18-pre9-am1   32  11.37  55.91%   52.917  75331.67  0.679  0.004  20 
2.4.18-pre9       32   9.50  42.55%   61.944  60970.74  0.770  0.005  22
                             
                             
Random           Num                   Avg     Maximum  Lat%   Lat%  CPU
Writes           Thr  Rate   (CPU%)  Latency   Latency   >2s   >10s  Eff
                 --- ---------------------------------------------------
2.4.18-pre9-am1    8   0.62  2.393%    1.786    330.56  0.000  0.000  26
2.4.18-pre9        8   0.50  1.216%    1.700    333.32  0.000  0.000  41
2.4.18-pre9-am1   16   0.65  2.527%    4.533    996.54  0.000  0.000  26
2.4.18-pre9       16   0.52  1.267%    4.228   1236.08  0.000  0.000  41
2.4.18-pre9-am1   32   0.65  2.530%    7.980   1907.91  0.000  0.000  26
2.4.18-pre9       32   0.53  1.343%    7.818   2509.73  0.000  0.000  39


> > Do you have a patch for dbench too? :)
> 
> /bin/rm?

Good one.  It would be helpful though, if you've already done the work.
-- 
Randy Hron



* Re: [patch] get_request starvation fix
  2002-02-08  8:46 Andrew Morton
                   ` (2 preceding siblings ...)
  2002-02-08 11:37 ` Rik van Riel
@ 2002-02-11  9:41 ` Andrew Morton
  2002-02-11 17:35   ` Suparna Bhattacharya
  2002-02-13  0:33   ` Jesse Barnes
  3 siblings, 2 replies; 23+ messages in thread
From: Andrew Morton @ 2002-02-11  9:41 UTC (permalink / raw)
  To: Rik van Riel, William Lee Irwin III, lkml, Suparna Bhattacharya

Andrew Morton wrote:
> 
> Here's a patch which addresses the get_request starvation problem.
> 

Sharp-eyed Suparna noticed that the algorithm still works if the
low-water mark is set to zero (ie: rip it out) so I did that and
the code is a little simpler.   Updated patch at
http://www.zip.com.au/~akpm/linux/2.4/2.4.18-pre9/make_request.patch

Also, here's a patch which fixes the /bin/sync livelock in
write_unlocked_buffers().  It simply bales out after writing
all the buffers which were dirty at the time the function
was called, rather than keeping on trying to write buffers
until the list is empty.

Given that /bin/sync calls write_unlocked_buffers() three times,
that's good enough.  sync still takes aaaaaages, but it terminates.

The similar livelock in invalidate_bdev() is a bit trickier.  I'm
not really sure what the semantics of invalidate_bdev() against a
device which has an r/w filesystem mounted on it are supposed to be.
I suspect we should just not call the function if the device is in
use, or call fsync_dev() or something.





--- linux-2.4.18-pre9/fs/buffer.c	Thu Feb  7 13:04:21 2002
+++ linux-akpm/fs/buffer.c	Sun Feb 10 21:50:48 2002
@@ -188,12 +188,13 @@ static void write_locked_buffers(struct 
  * return without it!
  */
 #define NRSYNC (32)
-static int write_some_buffers(kdev_t dev)
+static int write_some_buffers(kdev_t dev, signed long *nr_to_write)
 {
 	struct buffer_head *next;
 	struct buffer_head *array[NRSYNC];
 	unsigned int count;
 	int nr;
+	int ret;
 
 	next = lru_list[BUF_DIRTY];
 	nr = nr_buffers_type[BUF_DIRTY];
@@ -212,29 +213,38 @@ static int write_some_buffers(kdev_t dev
 			array[count++] = bh;
 			if (count < NRSYNC)
 				continue;
-
-			spin_unlock(&lru_list_lock);
-			write_locked_buffers(array, count);
-			return -EAGAIN;
+			ret = -EAGAIN;
+			goto writeout;
 		}
 		unlock_buffer(bh);
 		__refile_buffer(bh);
 	}
+	ret = 0;
+writeout:
 	spin_unlock(&lru_list_lock);
-
-	if (count)
+	if (count) {
 		write_locked_buffers(array, count);
-	return 0;
+		if (nr_to_write)
+			*nr_to_write -= count;
+	}
+	return ret;
 }
 
 /*
- * Write out all buffers on the dirty list.
+ * Because we drop the locking during I/O it's not possible
+ * to write out all the buffers.  So the only guarantee that
+ * we can make here is that we write out all the buffers which
+ * were dirty at the time write_unlocked_buffers() was called.
+ * fsync_dev() calls in here three times, so we end up writing
+ * many more buffers than ever appear on BUF_DIRTY.
  */
 static void write_unlocked_buffers(kdev_t dev)
 {
+	signed long nr_to_write = nr_buffers_type[BUF_DIRTY] * 2;
+
 	do {
 		spin_lock(&lru_list_lock);
-	} while (write_some_buffers(dev));
+	} while (write_some_buffers(dev, &nr_to_write) && (nr_to_write > 0));
 	run_task_queue(&tq_disk);
 }
 
@@ -1079,7 +1089,7 @@ void balance_dirty(void)
 
 	/* If we're getting into imbalance, start write-out */
 	spin_lock(&lru_list_lock);
-	write_some_buffers(NODEV);
+	write_some_buffers(NODEV, NULL);
 
 	/*
 	 * And if we're _really_ out of balance, wait for
@@ -2852,7 +2862,7 @@ static int sync_old_buffers(void)
 		bh = lru_list[BUF_DIRTY];
 		if (!bh || time_before(jiffies, bh->b_flushtime))
 			break;
-		if (write_some_buffers(NODEV))
+		if (write_some_buffers(NODEV, NULL))
 			continue;
 		return 0;
 	}
@@ -2951,7 +2961,7 @@ int bdflush(void *startup)
 		CHECK_EMERGENCY_SYNC
 
 		spin_lock(&lru_list_lock);
-		if (!write_some_buffers(NODEV) || balance_dirty_state() < 0) {
+		if (!write_some_buffers(NODEV, NULL) || balance_dirty_state() < 0) {
 			wait_for_some_buffers(NODEV);
 			interruptible_sleep_on(&bdflush_wait);
 		}


-

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [patch] get_request starvation fix
  2002-02-11  9:41 ` Andrew Morton
@ 2002-02-11 17:35   ` Suparna Bhattacharya
  2002-02-11 19:26     ` Andrew Morton
  2002-02-13  0:33   ` Jesse Barnes
  1 sibling, 1 reply; 23+ messages in thread
From: Suparna Bhattacharya @ 2002-02-11 17:35 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Rik van Riel, William Lee Irwin III, lkml

There's something that's still bothering me - I wonder if I'm missing
something very obvious here. 

> * So here's what we do:
> * 
> *    a) A READA requester fails if free_requests < batch_requests
> * 
> *       We don't want READA requests to prevent sleepers from ever
> *       waking.
> * 
> *  When a process wants a new request:
> * 
> *    b) If free_requests == 0, the requester sleeps in FIFO manner.
> * 
> *    b) If 0 <  free_requests < batch_requests and there are waiters,
> *       we still take a request non-blockingly.  This provides batching.

For a caller who just got an exclusive wakeup on the request queue,
this enables the woken-up task to do batching and not sleep again on the
next request it needs.  However, since we use the same logic for new callers,
don't we still have those starvation issues?
(i.e. new callers end up stealing requests while there are sleepers,
given that wakeups happen only once the queue builds up beyond the high-water
mark)

> *
> *    c) If free_requests >= batch_requests, the caller is immediately
> *       granted a new request.
> * 
> *  When a request is released:
> * 
> *    d) If free_requests < batch_requests, do nothing.

On Mon, Feb 11, 2002 at 01:41:20AM -0800, Andrew Morton wrote:
> Andrew Morton wrote:
> > 
> > Here's a patch which addresses the get_request starvation problem.
> > 
> 
> Sharp-eyed Suparna noticed that the algorithm still works if the
> low-water mark is set to zero (ie: rip it out) so I did that and
> the code is a little simpler.   Updated patch at
> http://www.zip.com.au/~akpm/linux/2.4/2.4.18-pre9/make_request.patch
> 


* Re: [patch] get_request starvation fix
  2002-02-11 17:35   ` Suparna Bhattacharya
@ 2002-02-11 19:26     ` Andrew Morton
  2002-02-14  6:00       ` Suparna Bhattacharya
  0 siblings, 1 reply; 23+ messages in thread
From: Andrew Morton @ 2002-02-11 19:26 UTC (permalink / raw)
  To: suparna; +Cc: Rik van Riel, William Lee Irwin III, lkml

Suparna Bhattacharya wrote:
> 
> ...
> > *  When a process wants a new request:
> > *
> > *    b) If free_requests == 0, the requester sleeps in FIFO manner.
> > *
> > *    b) If 0 <  free_requests < batch_requests and there are waiters,
> > *       we still take a request non-blockingly.  This provides batching.
> 
> For a caller who just got an exclusive wakeup on the request queue,
> this enables the woken up task to do batching and not sleep again on the
> next request it needs. However since we use the same logic for new callers,
> don't we still have those starvation issues ?

Let's think of `incoming' tasks and `outgoing' ones.  The incoming
tasks are running tasks which are making new requests, and the
outgoing ones are tasks which were on the waitqueue, and which are
in the process of being woken and granted a request.

Your concern is that incoming tasks can still steal all the requests
away from the sleepers.   And sure, they can, until all requests are
gone.  At which time the incoming tasks get to wait their turn on
the waitqueue.  And generally, the problem is with writes, where we tend
to have a few threads performing a large number of requests.

Another scenario is where incoming tasks just take a handful of requests,
and their rate of taking equals the rate of request freeing, so the
number of free requests remains greater than zero but less than batch_requests.
This steady state could conceivably happen with the correct number of
threads performing O_SYNC/fsync writes, for example.  I'll see if I can
find a way to make this happen.

The final scenario is where there are many outgoing tasks, so they
compete, and each gets an insufficiently large number of requests.
I suspect this is indeed happening with the current kernel's wake-all
code.  But the wake-one change makes this unlikely.

When an outgoing task is woken, we know there are 32 free requests on the
queue.  There's an assumption (or design) here that the outgoing task
will be able to get a decent number of those requests.  This works.  It
may fail if there are a massive number of CPUs.  But we need to increase
the overall request queue size anyway - 128 is too small.

Under heavy load, the general operating mode for this code is that 
damn near every task is asleep.  So most work is done by outgoing threads.

BTW, I suspect the request batching isn't super-important.  It'll
certainly decrease the context switch rate very much.  But from a
request-merging point of view it's unlikely to make much difference.

-


* [patch] get_request starvation fix
@ 2002-02-12 23:13 Andrew Morton
  2002-02-13  1:28 ` William Lee Irwin III
  2002-02-15 17:23 ` Marcelo Tosatti
  0 siblings, 2 replies; 23+ messages in thread
From: Andrew Morton @ 2002-02-12 23:13 UTC (permalink / raw)
  To: Alan Cox, lkml

Second version of this patch, incorporating Suparna's
suggested simplification (the low-water mark was
unnecessary).

This patch is working well here.  Hopefully it'll pop up
in an rmap kernel soon.

Bill Irwin has been doing some fairly extensive tuning
and testing of this.  Hopefully he'll come out with some
numbers soon.

I include the original description...


==== make_request description ====

Here's a patch which addresses the get_request starvation problem.

The problem we're experiencing is that when the request queue is
exhausted, processes go to sleep until the request queue has refilled
to the batch_nr_requests level.  But already-running tasks can simply
whizz in and steal free requests, thus preventing the sleeping tasks
from getting woken.

Some instrumentation illustrates this clearly.  All testing was on
ext2.  In a `dbench 64' run on stock 2.4.18-pre9, the maximum sleep
duration in __get_request_wait() was 37 seconds, and it's being very
unfair.  Look at this section of the trace:

Feb  7 21:59:06 mnm kernel: dbench[883] wakes after 442ms
Feb  7 21:59:06 mnm kernel: dbench[877] wakes after 442ms
Feb  7 21:59:06 mnm kernel: dbench[868] wakes after 441ms
Feb  7 21:59:06 mnm kernel: kupdated[7] wakes after 11587ms
Feb  7 21:59:06 mnm kernel: dbench[914] wakes after 574ms
Feb  7 21:59:06 mnm kernel: dbench[873] wakes after 2618ms
Feb  7 21:59:06 mnm kernel: dbench[927] wakes after 755ms
Feb  7 21:59:06 mnm kernel: dbench[873] wakes after 263ms
Feb  7 21:59:06 mnm kernel: dbench[924] wakes after 838ms
Feb  7 21:59:06 mnm kernel: dbench[924] wakes after 134ms
Feb  7 21:59:06 mnm kernel: dbench[871] wakes after 18688ms
Feb  7 21:59:06 mnm kernel: dbench[908] wakes after 3245ms
Feb  7 21:59:06 mnm kernel: dbench[895] wakes after 974ms
Feb  7 21:59:06 mnm kernel: dbench[878] wakes after 27401ms
Feb  7 21:59:06 mnm kernel: dbench[895] wakes after 278ms
Feb  7 21:59:06 mnm kernel: dbench[904] wakes after 278ms

With the patch applied, the maximum sleep duration was eight seconds,
but when this was happening, *all* tasks were sleeping for similar
durations:

Feb  7 23:55:40 mnm kernel: dbench[948] wakes after 3978ms
Feb  7 23:55:40 mnm kernel: dbench[935] wakes after 3439ms
Feb  7 23:55:41 mnm kernel: dbench[955] wakes after 2998ms
Feb  7 23:55:41 mnm kernel: dbench[965] wakes after 2847ms
Feb  7 23:55:41 mnm kernel: dbench[994] wakes after 3032ms
Feb  7 23:55:41 mnm kernel: dbench[969] wakes after 2652ms
Feb  7 23:55:41 mnm kernel: dbench[956] wakes after 2197ms
Feb  7 23:55:41 mnm kernel: dbench[993] wakes after 1589ms
Feb  7 23:55:41 mnm kernel: dbench[990] wakes after 1590ms
Feb  7 23:55:41 mnm kernel: dbench[939] wakes after 1585ms
Feb  7 23:55:41 mnm kernel: kupdated[7] wakes after 1555ms
Feb  7 23:55:41 mnm kernel: dbench[976] wakes after 1787ms
Feb  7 23:55:41 mnm kernel: dbench[997] wakes after 1653ms
Feb  7 23:55:41 mnm kernel: dbench[961] wakes after 1891ms
Feb  7 23:55:42 mnm kernel: dbench[966] wakes after 1967ms
Feb  7 23:55:42 mnm kernel: dbench[936] wakes after 1339ms
Feb  7 23:55:42 mnm kernel: dbench[965] wakes after 1216ms
Feb  7 23:55:42 mnm kernel: kjournald[9] wakes after 1253ms
Feb  7 23:55:42 mnm kernel: dbench[990] wakes after 879ms
Feb  7 23:55:42 mnm kernel: kupdated[7] wakes after 849ms
Feb  7 23:55:42 mnm kernel: dbench[959] wakes after 343ms
Feb  7 23:55:42 mnm kernel: dbench[936] wakes after 343ms
Feb  7 23:55:42 mnm kernel: dbench[964] wakes after 492ms
Feb  7 23:55:42 mnm kernel: dbench[966] wakes after 648ms
Feb  7 23:55:42 mnm kernel: kupdated[7] wakes after 413ms
Feb  7 23:55:42 mnm kernel: dbench[988] wakes after 344ms



The actual algorithm I chose is described in the patch.  It is
noteworthy that the wakeup strategy has changed from wake-all to
wake-one, which should save a few cycles.  The number of context
switches during the entire dbench 64 run fell from 57,000 to
29,000.

Interestingly, dbench runtimes are still very inconsistent, and
many clients still exit quite early in the run.  Beats me.

Rik, this patch will conflict with read-latency2 in the init code. 
Easily fixable.  I haven't tested with the larger request queue, but I
suggest that we want to keep the 1024 requests, a high-water mark of
queue_nr_requests/4 and a low-water mark of queue_nr_requests/8.  I'll
do more testing tomorrow.

Also, you noted the other day that a LILO run *never* terminated when
there was a dbench running.  This is in fact not due to request
starvation.  It's due to livelock in invalidate_bdev(), which is called
via ioctl(BLKFLSBUF) by LILO.  invalidate_bdev() will never terminate
as long as another task is generating locked buffers against the
device.


=====================================

--- linux-2.4.18-pre9/include/linux/blkdev.h	Mon Nov 26 11:52:07 2001
+++ linux-akpm/include/linux/blkdev.h	Mon Feb 11 00:59:50 2002
@@ -119,9 +119,9 @@ struct request_queue
 	spinlock_t		queue_lock;
 
 	/*
-	 * Tasks wait here for free request
+	 * Tasks wait here for free read and write requests
 	 */
-	wait_queue_head_t	wait_for_request;
+	wait_queue_head_t	wait_for_requests[2];
 };
 
 struct blk_dev_struct {
--- linux-2.4.18-pre9/drivers/block/ll_rw_blk.c	Thu Feb  7 13:04:11 2002
+++ linux-akpm/drivers/block/ll_rw_blk.c	Mon Feb 11 00:59:50 2002
@@ -118,10 +118,14 @@ int * max_readahead[MAX_BLKDEV];
 int * max_sectors[MAX_BLKDEV];
 
 /*
- * How many reqeusts do we allocate per queue,
- * and how many do we "batch" on freeing them?
+ * The total number of requests in each queue.
  */
-static int queue_nr_requests, batch_requests;
+static int queue_nr_requests;
+
+/*
+ * The threshold around which we make wakeup decisions
+ */
+static int batch_requests;
 
 static inline int get_max_sectors(kdev_t dev)
 {
@@ -352,7 +356,8 @@ static void blk_init_free_list(request_q
 		q->rq[i&1].count++;
 	}
 
-	init_waitqueue_head(&q->wait_for_request);
+	init_waitqueue_head(&q->wait_for_requests[0]);
+	init_waitqueue_head(&q->wait_for_requests[1]);
 	spin_lock_init(&q->queue_lock);
 }
 
@@ -418,9 +423,9 @@ void blk_init_queue(request_queue_t * q,
 #define blkdev_free_rq(list) list_entry((list)->next, struct request, queue);
 /*
  * Get a free request. io_request_lock must be held and interrupts
- * disabled on the way in.
+ * disabled on the way in.  Returns NULL if there are no free requests.
  */
-static inline struct request *get_request(request_queue_t *q, int rw)
+static struct request *get_request(request_queue_t *q, int rw)
 {
 	struct request *rq = NULL;
 	struct request_list *rl = q->rq + rw;
@@ -438,38 +443,102 @@ static inline struct request *get_reques
 }
 
 /*
- * No available requests for this queue, unplug the device.
+ * Here's the request allocation design:
+ *
+ * 1: Blocking on request exhaustion is a key part of I/O throttling.
+ * 
+ * 2: We want to be `fair' to all requesters.  We must avoid starvation, and
+ *    attempt to ensure that all requesters sleep for a similar duration.  Hence
+ *    no stealing requests when there are other processes waiting.
+ * 
+ * 3: We also wish to support `batching' of requests.  So when a process is
+ *    woken, we want to allow it to allocate a decent number of requests
+ *    before it blocks again, so they can be nicely merged (this only really
+ *    matters if the process happens to be adding requests near the head of
+ *    the queue).
+ * 
+ * 4: We want to avoid scheduling storms.  This isn't really important, because
+ *    the system will be I/O bound anyway.  But it's easy.
+ * 
+ *    There is tension between requirements 2 and 3.  Once a task has woken,
+ *    we don't want to allow it to sleep as soon as it takes its second request.
+ *    But we don't want currently-running tasks to steal all the requests
+ *    from the sleepers.  We handle this with wakeup hysteresis around
+ *    0 .. batch_requests and with the assumption that request taking is much,
+ *    much faster than request freeing.
+ * 
+ * So here's what we do:
+ * 
+ *    a) A READA requester fails if free_requests < batch_requests
+ * 
+ *       We don't want READA requests to prevent sleepers from ever
+ *       waking.
+ * 
+ *  When a process wants a new request:
+ * 
+ *    b) If free_requests == 0, the requester sleeps in FIFO manner.
+ * 
+ *    c) If 0 <  free_requests < batch_requests and there are waiters,
+ *       we still take a request non-blockingly.  This provides batching.
+ *
+ *    d) If free_requests >= batch_requests, the caller is immediately
+ *       granted a new request.
+ * 
+ *  When a request is released:
+ * 
+ *    e) If free_requests < batch_requests, do nothing.
+ * 
+ *    f) If free_requests >= batch_requests, wake up a single waiter.
+ * 
+ *   The net effect is that when a process is woken at the batch_requests level,
+ *   it will be able to take approximately (batch_requests) requests before
+ *   blocking again (at the tail of the queue).
+ * 
+ *   This all assumes that the rate of taking requests is much, much higher
+ *   than the rate of releasing them.  Which is very true.
+ *
+ * -akpm, Feb 2002.
  */
+#undef REQTIMING
 static struct request *__get_request_wait(request_queue_t *q, int rw)
 {
 	register struct request *rq;
 	DECLARE_WAITQUEUE(wait, current);
 
+#ifdef REQTIMING
+	struct timeval tv1, tv2;
+	unsigned long long t1, t2;
+	unsigned long t;
+	do_gettimeofday(&tv1);
+#endif
+
 	generic_unplug_device(q);
-	add_wait_queue(&q->wait_for_request, &wait);
+	add_wait_queue_exclusive(&q->wait_for_requests[rw], &wait);
 	do {
 		set_current_state(TASK_UNINTERRUPTIBLE);
-		if (q->rq[rw].count < batch_requests)
+		if (q->rq[rw].count == 0)
 			schedule();
 		spin_lock_irq(&io_request_lock);
 		rq = get_request(q,rw);
 		spin_unlock_irq(&io_request_lock);
 	} while (rq == NULL);
-	remove_wait_queue(&q->wait_for_request, &wait);
+	remove_wait_queue(&q->wait_for_requests[rw], &wait);
 	current->state = TASK_RUNNING;
-	return rq;
-}
-
-static inline struct request *get_request_wait(request_queue_t *q, int rw)
-{
-	register struct request *rq;
 
-	spin_lock_irq(&io_request_lock);
-	rq = get_request(q, rw);
-	spin_unlock_irq(&io_request_lock);
-	if (rq)
-		return rq;
-	return __get_request_wait(q, rw);
+#ifdef REQTIMING
+	do_gettimeofday(&tv2);
+	t1 = tv1.tv_sec;
+	t1 *= 1000000;
+	t1 += tv1.tv_usec;
+	t2 = tv2.tv_sec;
+	t2 *= 1000000;
+	t2 += tv2.tv_usec;
+	t = t2 - t1;
+	printk("%s[%d] wakes after %ldms for %s\n", current->comm,
+			current->pid, t / 1000,
+			(rw == WRITE) ? "WRITE" : "READ");
+#endif
+	return rq;
 }
 
 /* RO fail safe mechanism */
@@ -546,7 +615,7 @@ static inline void add_request(request_q
 /*
  * Must be called with io_request_lock held and interrupts disabled
  */
-inline void blkdev_release_request(struct request *req)
+void blkdev_release_request(struct request *req)
 {
 	request_queue_t *q = req->q;
 	int rw = req->cmd;
@@ -560,8 +629,9 @@ inline void blkdev_release_request(struc
 	 */
 	if (q) {
 		list_add(&req->queue, &q->rq[rw].free);
-		if (++q->rq[rw].count >= batch_requests && waitqueue_active(&q->wait_for_request))
-			wake_up(&q->wait_for_request);
+		if (++q->rq[rw].count >= batch_requests &&
+				waitqueue_active(&q->wait_for_requests[rw]))
+			wake_up(&q->wait_for_requests[rw]);
 	}
 }
 
@@ -742,22 +812,30 @@ again:
 			BUG();
 	}
 		
-	/*
-	 * Grab a free request from the freelist - if that is empty, check
-	 * if we are doing read ahead and abort instead of blocking for
-	 * a free slot.
-	 */
 get_rq:
 	if (freereq) {
 		req = freereq;
 		freereq = NULL;
-	} else if ((req = get_request(q, rw)) == NULL) {
-		spin_unlock_irq(&io_request_lock);
-		if (rw_ahead)
-			goto end_io;
-
-		freereq = __get_request_wait(q, rw);
-		goto again;
+	} else {
+		/*
+		 * See description above __get_request_wait()
+		 */
+		if (rw_ahead) {
+			if (q->rq[rw].count < batch_requests) {
+				spin_unlock_irq(&io_request_lock);
+				goto end_io;
+			}
+			req = get_request(q, rw);
+			if (req == NULL)
+				BUG();
+		} else {
+			req = get_request(q, rw);
+			if (req == NULL) {
+				spin_unlock_irq(&io_request_lock);
+				freereq = __get_request_wait(q, rw);
+				goto again;
+			}
+		}
 	}
 
 /* fill up the request-info, and add it to the queue */


-


* Re: [patch] get_request starvation fix
  2002-02-11  9:41 ` Andrew Morton
  2002-02-11 17:35   ` Suparna Bhattacharya
@ 2002-02-13  0:33   ` Jesse Barnes
  1 sibling, 0 replies; 23+ messages in thread
From: Jesse Barnes @ 2002-02-13  0:33 UTC (permalink / raw)
  To: Andrew Morton; +Cc: lkml

On Mon, Feb 11, 2002 at 01:41:20AM -0800, Andrew Morton wrote:
> Also, here's a patch which fixes the /bin/sync livelock in
> write_unlocked_buffers().  It simply bales out after writing
> all the buffers which were dirty at the time the function
> was called, rather than keeping on trying to write buffers
> until the list is empty.

I've also seen this issue when I start multiple mkfs commands (10 in
parallel basically make the system useless), but the patch helped.
Any chance it will get into 2.4?

Thanks,
Jesse


* Re: [patch] get_request starvation fix
  2002-02-12 23:13 Andrew Morton
@ 2002-02-13  1:28 ` William Lee Irwin III
  2002-02-15 17:23 ` Marcelo Tosatti
  1 sibling, 0 replies; 23+ messages in thread
From: William Lee Irwin III @ 2002-02-13  1:28 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Alan Cox, lkml

On Tue, Feb 12, 2002 at 03:13:26PM -0800, Andrew Morton wrote:
> Second version of this patch, incorporating Suparna's
> suggested simplification (the low-water mark was
> unnecessary).
> 
> This patch is working well here.  Hopefully it'll pop up
> in an rmap kernel soon.
> 
> Bill Irwin has been doing some fairly extensive tuning
> and testing of this.  Hopefully he'll come out with some
> numbers soon.

Much of my testing has involved varying elevator and bdflush parameters
to find what works best with this. From preliminary results it is clear
that the resolution of the fairness issues in this patch and the read
latency patch does not impair performance.

I'll do a bunch more runs tonight with the latest version of this.
My prior runs were with an earlier version.


Cheers,
Bill


* Re: [patch] get_request starvation fix
@ 2002-02-13 13:55 rwhron
  0 siblings, 0 replies; 23+ messages in thread
From: rwhron @ 2002-02-13 13:55 UTC (permalink / raw)
  To: linux-kernel; +Cc: akpm


tiobench measurements using 2048 MB worth of files on ext2 fs.
K6-2 475 mhz with 384 MB RAM and IDE disk.
Sorted by test, number of threads, then kernel.
Read, write, and seek rates in MB/sec. 
Latency in milliseconds.
Percent of requests that took longer than 2 and 10 seconds.

2.4.18-pre9-am1 = make_request, read_latency2, and low-latency patches.
2.4.18-pre9-am2 = make_request, read_latency2, and sync_livelock patches.

On sequential reads, kernels with make_request and read_latency2
reduce max latency from around 500 seconds down to 2-8 seconds.
Fairness to requests is vastly improved.  Throughput looks better too.

Sequential Reads Num                   Avg      Maximum    Lat%     Lat% 
Kernel           Thr  Rate  (CPU%)   Latency    Latency     >2s     >10s 
---------------- ---  ---------------------------------------------------
2.4.18-pre9       32   9.18 12.19%    34.739  545388.95  0.00304  0.00304
2.4.18-pre9-ac1   32  11.48 30.08%    25.736  289601.71  0.00839  0.00839
2.4.18-pre9-am1   32   9.25 11.97%    40.103    2251.30  0.00000  0.00000
2.4.18-pre9-am2   32   9.32 12.11%    39.917    2206.14  0.00000  0.00000

2.4.18-pre9       64   8.83 11.81%    59.045  484270.65  0.02689  0.02651
2.4.18-pre9-ac1   64  10.37 27.12%    49.460  293425.91  0.04273  0.04196
2.4.18-pre9-am1   64   9.53 13.42%    77.737    3840.97  0.00000  0.00000
2.4.18-pre9-am2   64   9.52 13.15%    77.873    3951.99  0.00000  0.00000

2.4.18-pre9      128   8.44 10.92%   109.317  572551.72  0.08507  0.07534
2.4.18-pre9-ac1  128   9.28 24.08%   104.159  333550.33  0.13599  0.13523
2.4.18-pre9-am1  128   9.55 15.38%   153.480    7939.48  0.66070  0.00000
2.4.18-pre9-am2  128   9.59 15.72%   153.244    7648.12  0.80833  0.00000

On random reads, the make_request and read_latency2 patches help a lot at 
128 threads.  Max latency and the % of high-latency requests are greatly reduced.

Random Reads     Num                   Avg      Maximum    Lat%     Lat% 
Kernel           Thr  Rate  (CPU%)   Latency    Latency     >2s     >10s 
---------------- ---  ---------------------------------------------------
2.4.18-pre9       32   0.57 1.580%   549.367    1523.39  0.00000  0.00000
2.4.18-pre9-ac1   32   0.53 2.226%   660.193    2966.26  0.00000  0.00000
2.4.18-pre9-am1   32   0.58 1.479%   623.361    1693.90  0.00000  0.00000
2.4.18-pre9-am2   32   0.58 1.501%   619.679    1727.47  0.00000  0.00000

2.4.18-pre9       64   0.61 2.031%  1016.205    2618.34  0.00000  0.00000
2.4.18-pre9-ac1   64   0.56 2.579%  1160.377   72621.35  0.35283  0.32763
2.4.18-pre9-am1   64   0.61 1.711%  1145.865    2865.21  0.00000  0.00000
2.4.18-pre9-am2   64   0.62 1.978%  1135.097    2792.76  0.00000  0.00000

2.4.18-pre9      128   0.61 1.699%  1569.812   53372.49  4.48589  4.41029
2.4.18-pre9-ac1  128   0.55 2.701%  1862.139   75417.90  5.54435  5.21673
2.4.18-pre9-am1  128   0.63 2.511%  2152.480    4710.51  0.00000  0.00000
2.4.18-pre9-am2  128   0.63 2.525%  2145.578    4709.85  0.00000  0.00000

On the write tests, 2.4.18-pre9-ac1 has the best latency numbers.

Sequential Write Num                   Avg      Maximum    Lat%     Lat% 
Kernel           Thr  Rate  (CPU%)   Latency    Latency     >2s     >10s 
---------------- ---  ---------------------------------------------------
2.4.18-pre9       32  13.18 35.47%    26.155   33677.84  0.48867  0.00324
2.4.18-pre9-ac1   32  13.66 36.82%    24.738    9008.30  0.03567  0.00000
2.4.18-pre9-am1   32  15.27 53.94%    21.236   39604.59  0.37098  0.00400
2.4.18-pre9-am2   32  15.17 53.66%    21.492   41358.77  0.37155  0.00496

2.4.18-pre9       64  13.47 44.74%    46.296   46673.56  0.76332  0.03968
2.4.18-pre9-ac1   64  11.91 32.45%    55.355   12051.19  0.22926  0.00000
2.4.18-pre9-am1   64  14.57 52.16%    43.430   51274.83  0.75874  0.05665
2.4.18-pre9-am2   64  14.78 53.76%    42.685   54281.45  0.63266  0.07457

2.4.18-pre9      128  12.78 42.55%    91.421   60494.19  1.20201  0.26531
2.4.18-pre9-ac1  128  10.15 27.63%   122.921   15371.80  1.29719  0.00000
2.4.18-pre9-am1  128  13.72 47.73%    83.373   69739.68  1.16043  0.22126
2.4.18-pre9-am2  128  14.17 52.81%    80.642   67952.87  1.10722  0.23041


Random Writes    Num                   Avg      Maximum    Lat%     Lat% 
Kernel           Thr  Rate  (CPU%)   Latency    Latency     >2s     >10s 
---------------- ---  ---------------------------------------------------
2.4.18-pre9       32   0.60 1.323%     1.915     576.55  0.00000  0.00000
2.4.18-pre9-ac1   32   0.61 1.296%     0.311     481.93  0.00000  0.00000
2.4.18-pre9-am1   32   0.71 2.411%     1.696     624.95  0.00000  0.00000
2.4.18-pre9-am2   32   0.73 2.804%     1.567     711.59  0.00000  0.00000

2.4.18-pre9       64   0.65 1.547%     1.552     558.61  0.00000  0.00000
2.4.18-pre9-ac1   64   0.64 1.457%     0.214      88.78  0.00000  0.00000
2.4.18-pre9-am1   64   0.74 2.649%     1.014     522.44  0.00000  0.00000
2.4.18-pre9-am2   64   0.75 2.647%     1.035     557.22  0.00000  0.00000

2.4.18-pre9      128   0.70 1.823%     1.664     785.13  0.00000  0.00000
2.4.18-pre9-ac1  128   0.67 1.926%     0.339     398.63  0.00000  0.00000
2.4.18-pre9-am1  128   0.76 2.716%     1.527     845.77  0.00000  0.00000
2.4.18-pre9-am2  128   0.75 2.828%     1.371     886.19  0.00000  0.00000

More tests on more kernels at
http://home.earthlink.net/~rwhron/kernel/k6-2-475.html
-- 
Randy Hron



* Re: [patch] get_request starvation fix
  2002-02-11 19:26     ` Andrew Morton
@ 2002-02-14  6:00       ` Suparna Bhattacharya
  0 siblings, 0 replies; 23+ messages in thread
From: Suparna Bhattacharya @ 2002-02-14  6:00 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Rik van Riel, William Lee Irwin III, lkml

On Mon, Feb 11, 2002 at 11:26:44AM -0800, Andrew Morton wrote:
> Suparna Bhattacharya wrote:
> > 
> > ...
> > > *  When a process wants a new request:
> > > *
> > > *    b) If free_requests == 0, the requester sleeps in FIFO manner.
> > > *
> > > *    b) If 0 <  free_requests < batch_requests and there are waiters,
> > > *       we still take a request non-blockingly.  This provides batching.
> > 
> > For a caller who just got an exclusive wakeup on the request queue,
> > this enables the woken up task to do batching and not sleep again on the
> > next request it needs. However since we use the same logic for new callers,
> > don't we still have those starvation issues ?
> 
> Let's think of `incoming' tasks and `outgoing' ones.  The incoming
> tasks are running tasks which are making new requests, and the
> outgoing ones are tasks which were on the waitqueue, and which are
> in the process of being woken and granted a request.
> 
> Your concern is that incoming tasks can still steal all the requests
> away from the sleepers.   And sure, they can, until all requests are
> gone.  At which time the incoming tasks get to wait their turn on
> the waitqueue.  And generally, the problem is with writes, where we tend
> to have a few threads performing a large number of requests.
> 
> Another scenario is where incoming tasks just take a handful of requests,
> and their rate of taking equals the rate of request freeing, and the
> free_request threshold remains greater than zero, less than batch_nr_requests.
> This steady state could conceivably happen with the correct number of
> threads performing O_SYNC/fsync writes, for example.  I'll see if I can
> find a way to make this happen.

Thanks for the explanation.  (I was off-line for a couple of days, so
didn't get to respond earlier.)  This is closest to the case that I
was concerned about.  Perhaps with merging/write-clustering at higher
levels the number of requests might be less, but that is secondary.
What was worrying me is that even when the rate of request taking
is greater than the rate of request freeing, couldn't we have every
1 out of N incoming tasks picking up a just-freed request, thus
starving the waiters?  (The remaining N-1 incoming tasks might get to
wait their turn, where N:1 is the ratio of request taking to request
freeing.)

> 
> The final scenario is where there are many outgoing tasks, so they
> compete, and each gets an insufficiently large number of requests.
> I suspect this is indeed happening with the current kernel's wake-all
> code.  But the wake-one change makes this unlikely.
> 
> When an outgoing task is woken, we know there are 32 free requests on the
> queue.  There's an assumption (or design) here that the outgoing task
> will be able to get a decent number of those requests.  This works.  It
> may fail if there are a massive number of CPUs.  But we need to increase
> the overall request queue size anyway - 128 is too small.

Yes, especially when the system has sufficient resources to increase
the request queue size, one might as well queue up on the request 
queue, rather than the wait queue, and minimize those context switches.

> 
> Under heavy load, the general operating mode for this code is that 
> damn near every task is asleep.  So most work is done by outgoing threads.
> 
> BTW, I suspect the request batching isn't super-important.  It'll
> certainly decrease the context switch rate very much.  But from a
> request-merging point of view it's unlikely to make much difference.
> 
> -


* Re: [patch] get_request starvation fix
  2002-02-12 23:13 Andrew Morton
  2002-02-13  1:28 ` William Lee Irwin III
@ 2002-02-15 17:23 ` Marcelo Tosatti
  2002-02-16  7:32   ` Andrew Morton
  1 sibling, 1 reply; 23+ messages in thread
From: Marcelo Tosatti @ 2002-02-15 17:23 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Alan Cox, lkml



On Tue, 12 Feb 2002, Andrew Morton wrote:

> Second version of this patch, incorporating Suparna's
> suggested simplification (the low-water mark was
> unnecessary).
> 
> This patch is working well here.  Hopefully it'll pop up
> in an rmap kernel soon.
> 
> Bill Irwin has been doing some fairly extensive tuning
> and testing of this.  Hopefully he'll come out with some
> numbers soon.
> 
> I include the original description...

It seems the real gain (in latency) is caused by the FIFO behaviour.

That is, removing this hunk (against __get_request_wait())

-               if (q->rq[rw].count < batch_requests)
+               if (q->rq[rw].count == 0)
                        schedule();

would not make _much_ difference latency-wise.  Am I right, or am I
missing something?

Anyway, I would like to have the patch cleaned up for 2.4.19-pre (remove
the instrumentation stuff _and_ make it clear in the documentation that
READA requests are not being used in practice).

Thanks a lot for that, Andrew.




* Re: [patch] get_request starvation fix
  2002-02-15 17:23 ` Marcelo Tosatti
@ 2002-02-16  7:32   ` Andrew Morton
  2002-02-16 10:13     ` Daniel Phillips
  0 siblings, 1 reply; 23+ messages in thread
From: Andrew Morton @ 2002-02-16  7:32 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Alan Cox, lkml

Marcelo Tosatti wrote:
> 
> ...
> It seems the real gain (in latency) is caused by the FIFO behaviour.

Well there is no net gain in latency.  What we've gained is *fairness*,
so the worst-case latency is improved.  And yes, it's the FIFO behaviour
which provides that.   And it is the exclusive wait which reduces the context
switch rate by 50%.

> That is, removing this hunk (against __get_request_wait())
> 
> -               if (q->rq[rw].count < batch_requests)
> +               if (q->rq[rw].count == 0)
>                         schedule();
> 
> would not make _much_ difference latency-wise.  Am I right, or am I
> missing something?

Well the code which is presently in -rc1 is quite pointless.  We
perform the above test a couple of microseconds after determining
that there are zero free requests.  So the test you've illustrated
will *always* be true.  The test for zero is the right thing to
do - it's just normal waitqueue handling.  It's just a little
obfuscated by the implicit test-for-zero in get_request().

However, contrary to my earlier guess, the request batching does
make a measurable difference.  Changing the code so that we wake up
a sleeper as soon as any request is freed costs maybe 30%
on `dbench 64'.

> Anyway, I would like to have the patch cleaned up for 2.4.19-pre (remove
> the instrumentation stuff _and_ make it clear in the documentation that
> READA requests are not being used in practice).

READA is used in a few filesystems for directory readahead.  (And since
dir-in-pagecache, ext2 is no longer performing directory readahead.  Bad.)

I came >this< close to killing READA altogether.  I believe it's a
misfeature.  If we're under intense seek pressure, the very, very _last_
thing we want to do is to create more seeks by not performing readahead.
But given that the misfeature only applies to creation of new requests,
and that block-contiguous readahead will still work OK, it's not too
serious.  Plus it's a stable kernel :)





--- linux-2.4.18-rc1/include/linux/blkdev.h	Mon Nov 26 11:52:07 2001
+++ linux-akpm/include/linux/blkdev.h	Fri Feb 15 23:06:04 2002
@@ -119,9 +119,9 @@ struct request_queue
 	spinlock_t		queue_lock;
 
 	/*
-	 * Tasks wait here for free request
+	 * Tasks wait here for free read and write requests
 	 */
-	wait_queue_head_t	wait_for_request;
+	wait_queue_head_t	wait_for_requests[2];
 };
 
 struct blk_dev_struct {
--- linux-2.4.18-rc1/drivers/block/ll_rw_blk.c	Wed Feb 13 12:59:09 2002
+++ linux-akpm/drivers/block/ll_rw_blk.c	Fri Feb 15 23:09:17 2002
@@ -118,10 +118,14 @@ int * max_readahead[MAX_BLKDEV];
 int * max_sectors[MAX_BLKDEV];
 
 /*
- * How many reqeusts do we allocate per queue,
- * and how many do we "batch" on freeing them?
+ * The total number of requests in each queue.
  */
-static int queue_nr_requests, batch_requests;
+static int queue_nr_requests;
+
+/*
+ * The threshold around which we make wakeup decisions
+ */
+static int batch_requests;
 
 static inline int get_max_sectors(kdev_t dev)
 {
@@ -352,7 +356,8 @@ static void blk_init_free_list(request_q
 		q->rq[i&1].count++;
 	}
 
-	init_waitqueue_head(&q->wait_for_request);
+	init_waitqueue_head(&q->wait_for_requests[0]);
+	init_waitqueue_head(&q->wait_for_requests[1]);
 	spin_lock_init(&q->queue_lock);
 }
 
@@ -418,9 +423,9 @@ void blk_init_queue(request_queue_t * q,
 #define blkdev_free_rq(list) list_entry((list)->next, struct request, queue);
 /*
  * Get a free request. io_request_lock must be held and interrupts
- * disabled on the way in.
+ * disabled on the way in.  Returns NULL if there are no free requests.
  */
-static inline struct request *get_request(request_queue_t *q, int rw)
+static struct request *get_request(request_queue_t *q, int rw)
 {
 	struct request *rq = NULL;
 	struct request_list *rl = q->rq + rw;
@@ -438,40 +443,84 @@ static inline struct request *get_reques
 }
 
 /*
- * No available requests for this queue, unplug the device.
+ * Here's the request allocation design:
+ *
+ * 1: Blocking on request exhaustion is a key part of I/O throttling.
+ * 
+ * 2: We want to be `fair' to all requesters.  We must avoid starvation, and
+ *    attempt to ensure that all requesters sleep for a similar duration.  Hence
+ *    no stealing requests when there are other processes waiting.
+ * 
+ * 3: We also wish to support `batching' of requests.  So when a process is
+ *    woken, we want to allow it to allocate a decent number of requests
+ *    before it blocks again, so they can be nicely merged (this only really
+ *    matters if the process happens to be adding requests near the head of
+ *    the queue).
+ * 
+ * 4: We want to avoid scheduling storms.  This isn't really important, because
+ *    the system will be I/O bound anyway.  But it's easy.
+ * 
+ *    There is tension between requirements 2 and 3.  Once a task has woken,
+ *    we don't want to allow it to sleep as soon as it takes its second request.
+ *    But we don't want currently-running tasks to steal all the requests
+ *    from the sleepers.  We handle this with wakeup hysteresis around
+ *    0 .. batch_requests and with the assumption that request taking is much,
+ *    much faster than request freeing.
+ * 
+ * So here's what we do:
+ * 
+ *    a) A READA requester fails if free_requests < batch_requests
+ * 
+ *       We don't want READA requests to prevent sleepers from ever
+ *       waking.  Note that READA is used extremely rarely - a few
+ *       filesystems use it for directory readahead.
+ * 
+ *  When a process wants a new request:
+ * 
+ *    b) If free_requests == 0, the requester sleeps in FIFO manner.
+ * 
+ *    c) If 0 < free_requests < batch_requests and there are waiters,
+ *       we still take a request non-blockingly.  This provides batching.
+ *
+ *    d) If free_requests >= batch_requests, the caller is immediately
+ *       granted a new request.
+ * 
+ *  When a request is released:
+ * 
+ *    e) If free_requests < batch_requests, do nothing.
+ * 
+ *    f) If free_requests >= batch_requests, wake up a single waiter.
+ * 
+ *   The net effect is that when a process is woken at the batch_requests level,
+ *   it will be able to take approximately (batch_requests) requests before
+ *   blocking again (at the tail of the queue).
+ * 
+ *   This all assumes that the rate of taking requests is much, much higher
+ *   than the rate of releasing them.  Which is very true.
+ *
+ * -akpm, Feb 2002.
  */
+
 static struct request *__get_request_wait(request_queue_t *q, int rw)
 {
 	register struct request *rq;
 	DECLARE_WAITQUEUE(wait, current);
 
 	generic_unplug_device(q);
-	add_wait_queue(&q->wait_for_request, &wait);
+	add_wait_queue_exclusive(&q->wait_for_requests[rw], &wait);
 	do {
 		set_current_state(TASK_UNINTERRUPTIBLE);
-		if (q->rq[rw].count < batch_requests)
+		if (q->rq[rw].count == 0)
 			schedule();
 		spin_lock_irq(&io_request_lock);
 		rq = get_request(q,rw);
 		spin_unlock_irq(&io_request_lock);
 	} while (rq == NULL);
-	remove_wait_queue(&q->wait_for_request, &wait);
+	remove_wait_queue(&q->wait_for_requests[rw], &wait);
 	current->state = TASK_RUNNING;
 	return rq;
 }
 
-static inline struct request *get_request_wait(request_queue_t *q, int rw)
-{
-	register struct request *rq;
-
-	spin_lock_irq(&io_request_lock);
-	rq = get_request(q, rw);
-	spin_unlock_irq(&io_request_lock);
-	if (rq)
-		return rq;
-	return __get_request_wait(q, rw);
-}
-
 /* RO fail safe mechanism */
 
 static long ro_bits[MAX_BLKDEV][8];
@@ -546,7 +595,7 @@ static inline void add_request(request_q
 /*
  * Must be called with io_request_lock held and interrupts disabled
  */
-inline void blkdev_release_request(struct request *req)
+void blkdev_release_request(struct request *req)
 {
 	request_queue_t *q = req->q;
 	int rw = req->cmd;
@@ -560,8 +609,9 @@ inline void blkdev_release_request(struc
 	 */
 	if (q) {
 		list_add(&req->queue, &q->rq[rw].free);
-		if (++q->rq[rw].count >= batch_requests && waitqueue_active(&q->wait_for_request))
-			wake_up(&q->wait_for_request);
+		if (++q->rq[rw].count >= batch_requests &&
+				waitqueue_active(&q->wait_for_requests[rw]))
+			wake_up(&q->wait_for_requests[rw]);
 	}
 }
 
@@ -742,22 +792,30 @@ again:
 			BUG();
 	}
 		
-	/*
-	 * Grab a free request from the freelist - if that is empty, check
-	 * if we are doing read ahead and abort instead of blocking for
-	 * a free slot.
-	 */
 get_rq:
 	if (freereq) {
 		req = freereq;
 		freereq = NULL;
-	} else if ((req = get_request(q, rw)) == NULL) {
-		spin_unlock_irq(&io_request_lock);
-		if (rw_ahead)
-			goto end_io;
-
-		freereq = __get_request_wait(q, rw);
-		goto again;
+	} else {
+		/*
+		 * See description above __get_request_wait()
+		 */
+		if (rw_ahead) {
+			if (q->rq[rw].count < batch_requests) {
+				spin_unlock_irq(&io_request_lock);
+				goto end_io;
+			}
+			req = get_request(q, rw);
+			if (req == NULL)
+				BUG();
+		} else {
+			req = get_request(q, rw);
+			if (req == NULL) {
+				spin_unlock_irq(&io_request_lock);
+				freereq = __get_request_wait(q, rw);
+				goto again;
+			}
+		}
 	}
 
 /* fill up the request-info, and add it to the queue */




* Re: [patch] get_request starvation fix
  2002-02-16  7:32   ` Andrew Morton
@ 2002-02-16 10:13     ` Daniel Phillips
  2002-02-16 10:25       ` Andrew Morton
  0 siblings, 1 reply; 23+ messages in thread
From: Daniel Phillips @ 2002-02-16 10:13 UTC (permalink / raw)
  To: Andrew Morton, Marcelo Tosatti; +Cc: Alan Cox, lkml

On February 16, 2002 08:32 am, Andrew Morton wrote:
> However, contrary to my earlier guess, the request batching does
> make a measurable difference.  Changing the code so that we wake up
> a sleeper as soon as any request is freed costs maybe 30%
> on `dbench 64'.

Is this consistent with results on other IO benchmarks?

--
Daniel


* Re: [patch] get_request starvation fix
  2002-02-16 10:13     ` Daniel Phillips
@ 2002-02-16 10:25       ` Andrew Morton
  0 siblings, 0 replies; 23+ messages in thread
From: Andrew Morton @ 2002-02-16 10:25 UTC (permalink / raw)
  To: Daniel Phillips; +Cc: lkml

Daniel Phillips wrote:
> 
> On February 16, 2002 08:32 am, Andrew Morton wrote:
> > However, contrary to my earlier guess, the request batching does
> > make a measurable difference.  Changing the code so that we wake up
> > a sleeper as soon as any request is freed costs maybe 30%
> > on `dbench 64'.
> 
> Is this consistent with results on other IO benchmarks?
> 

I dunno.  I doubt it - few of the other benchmarks are very
seek-intensive.   dbench is somewhat repeatable if you load
it up with enough clients.  And average the results across
enough runs.  And stand on one leg and squint.


end of thread, other threads:[~2002-02-16 10:27 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2002-02-13 13:55 [patch] get_request starvation fix rwhron
  -- strict thread matches above, loose matches on Subject: below --
2002-02-12 23:13 Andrew Morton
2002-02-13  1:28 ` William Lee Irwin III
2002-02-15 17:23 ` Marcelo Tosatti
2002-02-16  7:32   ` Andrew Morton
2002-02-16 10:13     ` Daniel Phillips
2002-02-16 10:25       ` Andrew Morton
2002-02-09  1:56 rwhron
     [not found] <200202081932.GAA05943@mangalore.zipworld.com.au>
2002-02-08 19:44 ` Andrew Morton
2002-02-08 19:53   ` Dieter Nützel
2002-02-08 20:43   ` Rik van Riel
2002-02-08 19:31 Dieter Nützel
2002-02-08  8:46 Andrew Morton
2002-02-08  8:57 ` Jens Axboe
2002-02-08  9:57   ` Andrew Morton
2002-02-08  9:10 ` Andrew Morton
2002-02-08 11:37 ` Rik van Riel
2002-02-08 18:28   ` Andrew Morton
2002-02-11  9:41 ` Andrew Morton
2002-02-11 17:35   ` Suparna Bhattacharya
2002-02-11 19:26     ` Andrew Morton
2002-02-14  6:00       ` Suparna Bhattacharya
2002-02-13  0:33   ` Jesse Barnes
