All of lore.kernel.org
 help / color / mirror / Atom feed
From: Andrew Morton <akpm@zip.com.au>
To: Marcelo Tosatti <marcelo@conectiva.com.br>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>, lkml <linux-kernel@vger.kernel.org>
Subject: Re: [patch] get_request starvation fix
Date: Fri, 15 Feb 2002 23:32:25 -0800	[thread overview]
Message-ID: <3C6E0B09.30983B1A@zip.com.au> (raw)
In-Reply-To: <3C69A196.B7325DC2@zip.com.au> <Pine.LNX.4.21.0202151515020.23069-100000@freak.distro.conectiva>

Marcelo Tosatti wrote:
> 
> ...
> It seems the real gain (in latency) is caused by the FIFO behaviour.

Well there is no net gain in latency.  What we've gained is *fairness*,
so the worst-case latency is improved.  And yes, it's the FIFO behaviour
which provides that.   And it is the exclusive wait which reduces the context
switch rate by 50%.

> That is, removing this hunk (against __get_request_wait())
> 
> -               if (q->rq[rw].count < batch_requests)
> +               if (q->rq[rw].count == 0)
>                         schedule();
> 
> Would not make _much_ difference latency-wise. I'm I right or missing
> something ?

Well the code which is presently in -rc1 is quite pointless.  We
perform the above test a couple of microseconds after determining
that there are zero free requests.  So the test you've illustrated
will *always* be true.  The test for zero is the right thing to
do - it's just normal waitqueue handling.  It's just a little
obfuscated by the implicit test-for-zero in get_request().

However, contrary to my earlier guess, the request batching does
make a measurable difference.  Changing the code so that we wake up
a sleeper as soon as any request is freed costs maybe 30%
on `dbench 64'.

> Anyway, I would like to have the patch cleaned up for 2.4.19-pre (remove
> the instrumentation stuff _and_ make it clear on the documentation that
> READA requests are not being used in practice).

READA is used in a few filesystems for directory readahead.  (And since
dir-in-pagecache, ext2 is no longer performing directory readhead.  Bad.)

I came >this< close to killing READA altogether.  I believe it's a
misfeature.  If we're under intense seek pressure, the very, very _last_
thing we want to do is to create more seeks by not performing readahead.
But given that the misfeature only applies to creation of new requests,
and that block-contiguous readahead will still work OK, it's not too
serious.  Plus it's a stable kernel :)





--- linux-2.4.18-rc1/include/linux/blkdev.h	Mon Nov 26 11:52:07 2001
+++ linux-akpm/include/linux/blkdev.h	Fri Feb 15 23:06:04 2002
@@ -119,9 +119,9 @@ struct request_queue
 	spinlock_t		queue_lock;
 
 	/*
-	 * Tasks wait here for free request
+	 * Tasks wait here for free read and write requests
 	 */
-	wait_queue_head_t	wait_for_request;
+	wait_queue_head_t	wait_for_requests[2];
 };
 
 struct blk_dev_struct {
--- linux-2.4.18-rc1/drivers/block/ll_rw_blk.c	Wed Feb 13 12:59:09 2002
+++ linux-akpm/drivers/block/ll_rw_blk.c	Fri Feb 15 23:09:17 2002
@@ -118,10 +118,14 @@ int * max_readahead[MAX_BLKDEV];
 int * max_sectors[MAX_BLKDEV];
 
 /*
- * How many reqeusts do we allocate per queue,
- * and how many do we "batch" on freeing them?
+ * The total number of requests in each queue.
  */
-static int queue_nr_requests, batch_requests;
+static int queue_nr_requests;
+
+/*
+ * The threshold around which we make wakeup decisions
+ */
+static int batch_requests;
 
 static inline int get_max_sectors(kdev_t dev)
 {
@@ -352,7 +356,8 @@ static void blk_init_free_list(request_q
 		q->rq[i&1].count++;
 	}
 
-	init_waitqueue_head(&q->wait_for_request);
+	init_waitqueue_head(&q->wait_for_requests[0]);
+	init_waitqueue_head(&q->wait_for_requests[1]);
 	spin_lock_init(&q->queue_lock);
 }
 
@@ -418,9 +423,9 @@ void blk_init_queue(request_queue_t * q,
 #define blkdev_free_rq(list) list_entry((list)->next, struct request, queue);
 /*
  * Get a free request. io_request_lock must be held and interrupts
- * disabled on the way in.
+ * disabled on the way in.  Returns NULL if there are no free requests.
  */
-static inline struct request *get_request(request_queue_t *q, int rw)
+static struct request *get_request(request_queue_t *q, int rw)
 {
 	struct request *rq = NULL;
 	struct request_list *rl = q->rq + rw;
@@ -438,40 +443,84 @@ static inline struct request *get_reques
 }
 
 /*
- * No available requests for this queue, unplug the device.
+ * Here's the request allocation design:
+ *
+ * 1: Blocking on request exhaustion is a key part of I/O throttling.
+ * 
+ * 2: We want to be `fair' to all requesters.  We must avoid starvation, and
+ *    attempt to ensure that all requesters sleep for a similar duration.  Hence
+ *    no stealing requests when there are other processes waiting.
+ * 
+ * 3: We also wish to support `batching' of requests.  So when a process is
+ *    woken, we want to allow it to allocate a decent number of requests
+ *    before it blocks again, so they can be nicely merged (this only really
+ *    matters if the process happens to be adding requests near the head of
+ *    the queue).
+ * 
+ * 4: We want to avoid scheduling storms.  This isn't really important, because
+ *    the system will be I/O bound anyway.  But it's easy.
+ * 
+ *    There is tension between requirements 2 and 3.  Once a task has woken,
+ *    we don't want to allow it to sleep as soon as it takes its second request.
+ *    But we don't want currently-running tasks to steal all the requests
+ *    from the sleepers.  We handle this with wakeup hysteresis around
+ *    0 .. batch_requests and with the assumption that request taking is much,
+ *    much faster than request freeing.
+ * 
+ * So here's what we do:
+ * 
+ *    a) A READA requester fails if free_requests < batch_requests
+ * 
+ *       We don't want READA requests to prevent sleepers from ever
+ *       waking.  Note that READA is used extremely rarely - a few
+ *       filesystems use it for directory readahead.
+ * 
+ *  When a process wants a new request:
+ * 
+ *    b) If free_requests == 0, the requester sleeps in FIFO manner.
+ * 
+ *    b) If 0 <  free_requests < batch_requests and there are waiters,
+ *       we still take a request non-blockingly.  This provides batching.
+ *
+ *    c) If free_requests >= batch_requests, the caller is immediately
+ *       granted a new request.
+ * 
+ *  When a request is released:
+ * 
+ *    d) If free_requests < batch_requests, do nothing.
+ * 
+ *    f) If free_requests >= batch_requests, wake up a single waiter.
+ * 
+ *   The net effect is that when a process is woken at the batch_requests level,
+ *   it will be able to take approximately (batch_requests) requests before
+ *   blocking again (at the tail of the queue).
+ * 
+ *   This all assumes that the rate of taking requests is much, much higher
+ *   than the rate of releasing them.  Which is very true.
+ *
+ * -akpm, Feb 2002.
  */
+
 static struct request *__get_request_wait(request_queue_t *q, int rw)
 {
 	register struct request *rq;
 	DECLARE_WAITQUEUE(wait, current);
 
 	generic_unplug_device(q);
-	add_wait_queue(&q->wait_for_request, &wait);
+	add_wait_queue_exclusive(&q->wait_for_requests[rw], &wait);
 	do {
 		set_current_state(TASK_UNINTERRUPTIBLE);
-		if (q->rq[rw].count < batch_requests)
+		if (q->rq[rw].count == 0)
 			schedule();
 		spin_lock_irq(&io_request_lock);
 		rq = get_request(q,rw);
 		spin_unlock_irq(&io_request_lock);
 	} while (rq == NULL);
-	remove_wait_queue(&q->wait_for_request, &wait);
+	remove_wait_queue(&q->wait_for_requests[rw], &wait);
 	current->state = TASK_RUNNING;
 	return rq;
 }
 
-static inline struct request *get_request_wait(request_queue_t *q, int rw)
-{
-	register struct request *rq;
-
-	spin_lock_irq(&io_request_lock);
-	rq = get_request(q, rw);
-	spin_unlock_irq(&io_request_lock);
-	if (rq)
-		return rq;
-	return __get_request_wait(q, rw);
-}
-
 /* RO fail safe mechanism */
 
 static long ro_bits[MAX_BLKDEV][8];
@@ -546,7 +595,7 @@ static inline void add_request(request_q
 /*
  * Must be called with io_request_lock held and interrupts disabled
  */
-inline void blkdev_release_request(struct request *req)
+void blkdev_release_request(struct request *req)
 {
 	request_queue_t *q = req->q;
 	int rw = req->cmd;
@@ -560,8 +609,9 @@ inline void blkdev_release_request(struc
 	 */
 	if (q) {
 		list_add(&req->queue, &q->rq[rw].free);
-		if (++q->rq[rw].count >= batch_requests && waitqueue_active(&q->wait_for_request))
-			wake_up(&q->wait_for_request);
+		if (++q->rq[rw].count >= batch_requests &&
+				waitqueue_active(&q->wait_for_requests[rw]))
+			wake_up(&q->wait_for_requests[rw]);
 	}
 }
 
@@ -742,22 +792,30 @@ again:
 			BUG();
 	}
 		
-	/*
-	 * Grab a free request from the freelist - if that is empty, check
-	 * if we are doing read ahead and abort instead of blocking for
-	 * a free slot.
-	 */
 get_rq:
 	if (freereq) {
 		req = freereq;
 		freereq = NULL;
-	} else if ((req = get_request(q, rw)) == NULL) {
-		spin_unlock_irq(&io_request_lock);
-		if (rw_ahead)
-			goto end_io;
-
-		freereq = __get_request_wait(q, rw);
-		goto again;
+	} else {
+		/*
+		 * See description above __get_request_wait()
+		 */
+		if (rw_ahead) {
+			if (q->rq[rw].count < batch_requests) {
+				spin_unlock_irq(&io_request_lock);
+				goto end_io;
+			}
+			req = get_request(q, rw);
+			if (req == NULL)
+				BUG();
+		} else {
+			req = get_request(q, rw);
+			if (req == NULL) {
+				spin_unlock_irq(&io_request_lock);
+				freereq = __get_request_wait(q, rw);
+				goto again;
+			}
+		}
 	}
 
 /* fill up the request-info, and add it to the queue */


-

  reply	other threads:[~2002-02-16  7:33 UTC|newest]

Thread overview: 23+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2002-02-12 23:13 [patch] get_request starvation fix Andrew Morton
2002-02-13  1:28 ` William Lee Irwin III
2002-02-15 17:23 ` Marcelo Tosatti
2002-02-16  7:32   ` Andrew Morton [this message]
2002-02-16 10:13     ` Daniel Phillips
2002-02-16 10:25       ` Andrew Morton
  -- strict thread matches above, loose matches on Subject: below --
2002-02-13 13:55 rwhron
2002-02-09  1:56 rwhron
     [not found] <200202081932.GAA05943@mangalore.zipworld.com.au>
2002-02-08 19:44 ` Andrew Morton
2002-02-08 19:53   ` Dieter Nützel
2002-02-08 20:43   ` Rik van Riel
2002-02-08 19:31 Dieter Nützel
2002-02-08  8:46 Andrew Morton
2002-02-08  8:57 ` Jens Axboe
2002-02-08  9:57   ` Andrew Morton
2002-02-08  9:10 ` Andrew Morton
2002-02-08 11:37 ` Rik van Riel
2002-02-08 18:28   ` Andrew Morton
2002-02-11  9:41 ` Andrew Morton
2002-02-11 17:35   ` Suparna Bhattacharya
2002-02-11 19:26     ` Andrew Morton
2002-02-14  6:00       ` Suparna Bhattacharya
2002-02-13  0:33   ` Jesse Barnes

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=3C6E0B09.30983B1A@zip.com.au \
    --to=akpm@zip.com.au \
    --cc=alan@lxorguk.ukuu.org.uk \
    --cc=linux-kernel@vger.kernel.org \
    --cc=marcelo@conectiva.com.br \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.