public inbox for linux-kernel@vger.kernel.org
* Re: [patch] get_request starvation fix
@ 2002-02-13 13:55 rwhron
  0 siblings, 0 replies; 23+ messages in thread
From: rwhron @ 2002-02-13 13:55 UTC (permalink / raw)
  To: linux-kernel; +Cc: akpm


tiobench measurements using 2048 MB worth of files on ext2 fs.
K6-2 475 MHz with 384 MB RAM and IDE disk.
Sorted by test, number of threads, then kernel.
Read, write, and seek rates in MB/sec. 
Latency in milliseconds.
Percent of requests that took longer than 2 and 10 seconds.

2.4.18-pre9-am1 = make_request, read_latency2, and low-latency patches.
2.4.18-pre9-am2 = make_request, read_latency2, and sync_livelock patches.

On sequential reads, kernels with make_request and read_latency2
reduce max latency from around 500 seconds down to 2-8 seconds.
Fairness to requests is vastly improved.  Throughput looks better too.

Sequential Reads Num                   Avg      Maximum    Lat%     Lat% 
Kernel           Thr  Rate  (CPU%)   Latency    Latency     >2s     >10s 
---------------- ---  ---------------------------------------------------
2.4.18-pre9       32   9.18 12.19%    34.739  545388.95  0.00304  0.00304
2.4.18-pre9-ac1   32  11.48 30.08%    25.736  289601.71  0.00839  0.00839
2.4.18-pre9-am1   32   9.25 11.97%    40.103    2251.30  0.00000  0.00000
2.4.18-pre9-am2   32   9.32 12.11%    39.917    2206.14  0.00000  0.00000

2.4.18-pre9       64   8.83 11.81%    59.045  484270.65  0.02689  0.02651
2.4.18-pre9-ac1   64  10.37 27.12%    49.460  293425.91  0.04273  0.04196
2.4.18-pre9-am1   64   9.53 13.42%    77.737    3840.97  0.00000  0.00000
2.4.18-pre9-am2   64   9.52 13.15%    77.873    3951.99  0.00000  0.00000

2.4.18-pre9      128   8.44 10.92%   109.317  572551.72  0.08507  0.07534
2.4.18-pre9-ac1  128   9.28 24.08%   104.159  333550.33  0.13599  0.13523
2.4.18-pre9-am1  128   9.55 15.38%   153.480    7939.48  0.66070  0.00000
2.4.18-pre9-am2  128   9.59 15.72%   153.244    7648.12  0.80833  0.00000

On random reads, the make_request and read_latency2 patches help a lot at
128 threads.  Max latency and the percentage of high-latency requests are
greatly reduced.

Random Reads     Num                   Avg      Maximum    Lat%     Lat% 
Kernel           Thr  Rate  (CPU%)   Latency    Latency     >2s     >10s 
---------------- ---  ---------------------------------------------------
2.4.18-pre9       32   0.57 1.580%   549.367    1523.39  0.00000  0.00000
2.4.18-pre9-ac1   32   0.53 2.226%   660.193    2966.26  0.00000  0.00000
2.4.18-pre9-am1   32   0.58 1.479%   623.361    1693.90  0.00000  0.00000
2.4.18-pre9-am2   32   0.58 1.501%   619.679    1727.47  0.00000  0.00000

2.4.18-pre9       64   0.61 2.031%  1016.205    2618.34  0.00000  0.00000
2.4.18-pre9-ac1   64   0.56 2.579%  1160.377   72621.35  0.35283  0.32763
2.4.18-pre9-am1   64   0.61 1.711%  1145.865    2865.21  0.00000  0.00000
2.4.18-pre9-am2   64   0.62 1.978%  1135.097    2792.76  0.00000  0.00000

2.4.18-pre9      128   0.61 1.699%  1569.812   53372.49  4.48589  4.41029
2.4.18-pre9-ac1  128   0.55 2.701%  1862.139   75417.90  5.54435  5.21673
2.4.18-pre9-am1  128   0.63 2.511%  2152.480    4710.51  0.00000  0.00000
2.4.18-pre9-am2  128   0.63 2.525%  2145.578    4709.85  0.00000  0.00000

On the write tests, 2.4.18-pre9-ac1 has the best latency numbers.

Sequential Write Num                   Avg      Maximum    Lat%     Lat% 
Kernel           Thr  Rate  (CPU%)   Latency    Latency     >2s     >10s 
---------------- ---  ---------------------------------------------------
2.4.18-pre9       32  13.18 35.47%    26.155   33677.84  0.48867  0.00324
2.4.18-pre9-ac1   32  13.66 36.82%    24.738    9008.30  0.03567  0.00000
2.4.18-pre9-am1   32  15.27 53.94%    21.236   39604.59  0.37098  0.00400
2.4.18-pre9-am2   32  15.17 53.66%    21.492   41358.77  0.37155  0.00496

2.4.18-pre9       64  13.47 44.74%    46.296   46673.56  0.76332  0.03968
2.4.18-pre9-ac1   64  11.91 32.45%    55.355   12051.19  0.22926  0.00000
2.4.18-pre9-am1   64  14.57 52.16%    43.430   51274.83  0.75874  0.05665
2.4.18-pre9-am2   64  14.78 53.76%    42.685   54281.45  0.63266  0.07457

2.4.18-pre9      128  12.78 42.55%    91.421   60494.19  1.20201  0.26531
2.4.18-pre9-ac1  128  10.15 27.63%   122.921   15371.80  1.29719  0.00000
2.4.18-pre9-am1  128  13.72 47.73%    83.373   69739.68  1.16043  0.22126
2.4.18-pre9-am2  128  14.17 52.81%    80.642   67952.87  1.10722  0.23041


Random Writes    Num                   Avg      Maximum    Lat%     Lat% 
Kernel           Thr  Rate  (CPU%)   Latency    Latency     >2s     >10s 
---------------- ---  ---------------------------------------------------
2.4.18-pre9       32   0.60 1.323%     1.915     576.55  0.00000  0.00000
2.4.18-pre9-ac1   32   0.61 1.296%     0.311     481.93  0.00000  0.00000
2.4.18-pre9-am1   32   0.71 2.411%     1.696     624.95  0.00000  0.00000
2.4.18-pre9-am2   32   0.73 2.804%     1.567     711.59  0.00000  0.00000

2.4.18-pre9       64   0.65 1.547%     1.552     558.61  0.00000  0.00000
2.4.18-pre9-ac1   64   0.64 1.457%     0.214      88.78  0.00000  0.00000
2.4.18-pre9-am1   64   0.74 2.649%     1.014     522.44  0.00000  0.00000
2.4.18-pre9-am2   64   0.75 2.647%     1.035     557.22  0.00000  0.00000

2.4.18-pre9      128   0.70 1.823%     1.664     785.13  0.00000  0.00000
2.4.18-pre9-ac1  128   0.67 1.926%     0.339     398.63  0.00000  0.00000
2.4.18-pre9-am1  128   0.76 2.716%     1.527     845.77  0.00000  0.00000
2.4.18-pre9-am2  128   0.75 2.828%     1.371     886.19  0.00000  0.00000

More tests on more kernels at
http://home.earthlink.net/~rwhron/kernel/k6-2-475.html
-- 
Randy Hron


^ permalink raw reply	[flat|nested] 23+ messages in thread
* [patch] get_request starvation fix
@ 2002-02-12 23:13 Andrew Morton
  2002-02-13  1:28 ` William Lee Irwin III
  2002-02-15 17:23 ` Marcelo Tosatti
  0 siblings, 2 replies; 23+ messages in thread
From: Andrew Morton @ 2002-02-12 23:13 UTC (permalink / raw)
  To: Alan Cox, lkml

Second version of this patch, incorporating Suparna's
suggested simplification (the low-water mark was
unnecessary).

This patch is working well here.  Hopefully it'll pop up
in an rmap kernel soon.

Bill Irwin has been doing some fairly extensive tuning
and testing of this.  Hopefully he'll come out with some
numbers soon.

I include the original description...


==== make_request description ====

Here's a patch which addresses the get_request starvation problem.

The problem we're experiencing is that when the request queue is
exhausted, processes go to sleep until the request queue has refilled
to the batch_nr_requests level.  But already-running tasks can simply
whizz in and steal free requests, thus preventing the sleeping tasks
from getting woken.

Some instrumentation illustrates this clearly.  All testing was on
ext2.  In a `dbench 64' run on stock 2.4.18-pre9, the maximum sleep
duration in __get_request_wait() was 37 seconds, and it's being very
unfair.  Look at this section of the trace:

Feb  7 21:59:06 mnm kernel: dbench[883] wakes after 442ms
Feb  7 21:59:06 mnm kernel: dbench[877] wakes after 442ms
Feb  7 21:59:06 mnm kernel: dbench[868] wakes after 441ms
Feb  7 21:59:06 mnm kernel: kupdated[7] wakes after 11587ms
Feb  7 21:59:06 mnm kernel: dbench[914] wakes after 574ms
Feb  7 21:59:06 mnm kernel: dbench[873] wakes after 2618ms
Feb  7 21:59:06 mnm kernel: dbench[927] wakes after 755ms
Feb  7 21:59:06 mnm kernel: dbench[873] wakes after 263ms
Feb  7 21:59:06 mnm kernel: dbench[924] wakes after 838ms
Feb  7 21:59:06 mnm kernel: dbench[924] wakes after 134ms
Feb  7 21:59:06 mnm kernel: dbench[871] wakes after 18688ms
Feb  7 21:59:06 mnm kernel: dbench[908] wakes after 3245ms
Feb  7 21:59:06 mnm kernel: dbench[895] wakes after 974ms
Feb  7 21:59:06 mnm kernel: dbench[878] wakes after 27401ms
Feb  7 21:59:06 mnm kernel: dbench[895] wakes after 278ms
Feb  7 21:59:06 mnm kernel: dbench[904] wakes after 278ms

With the patch applied, the maximum sleep duration was eight seconds,
but when this was happening, *all* tasks were sleeping for similar
durations:

Feb  7 23:55:40 mnm kernel: dbench[948] wakes after 3978ms
Feb  7 23:55:40 mnm kernel: dbench[935] wakes after 3439ms
Feb  7 23:55:41 mnm kernel: dbench[955] wakes after 2998ms
Feb  7 23:55:41 mnm kernel: dbench[965] wakes after 2847ms
Feb  7 23:55:41 mnm kernel: dbench[994] wakes after 3032ms
Feb  7 23:55:41 mnm kernel: dbench[969] wakes after 2652ms
Feb  7 23:55:41 mnm kernel: dbench[956] wakes after 2197ms
Feb  7 23:55:41 mnm kernel: dbench[993] wakes after 1589ms
Feb  7 23:55:41 mnm kernel: dbench[990] wakes after 1590ms
Feb  7 23:55:41 mnm kernel: dbench[939] wakes after 1585ms
Feb  7 23:55:41 mnm kernel: kupdated[7] wakes after 1555ms
Feb  7 23:55:41 mnm kernel: dbench[976] wakes after 1787ms
Feb  7 23:55:41 mnm kernel: dbench[997] wakes after 1653ms
Feb  7 23:55:41 mnm kernel: dbench[961] wakes after 1891ms
Feb  7 23:55:42 mnm kernel: dbench[966] wakes after 1967ms
Feb  7 23:55:42 mnm kernel: dbench[936] wakes after 1339ms
Feb  7 23:55:42 mnm kernel: dbench[965] wakes after 1216ms
Feb  7 23:55:42 mnm kernel: kjournald[9] wakes after 1253ms
Feb  7 23:55:42 mnm kernel: dbench[990] wakes after 879ms
Feb  7 23:55:42 mnm kernel: kupdated[7] wakes after 849ms
Feb  7 23:55:42 mnm kernel: dbench[959] wakes after 343ms
Feb  7 23:55:42 mnm kernel: dbench[936] wakes after 343ms
Feb  7 23:55:42 mnm kernel: dbench[964] wakes after 492ms
Feb  7 23:55:42 mnm kernel: dbench[966] wakes after 648ms
Feb  7 23:55:42 mnm kernel: kupdated[7] wakes after 413ms
Feb  7 23:55:42 mnm kernel: dbench[988] wakes after 344ms



The actual algorithm I chose is described in the patch.  It is
noteworthy that the wakeup strategy has changed from wake-all to
wake-one, which should save a few cycles.  The number of context
switches during the entire dbench 64 run fell from 57,000 to
29,000.

Interestingly, dbench runtimes are still very inconsistent, and
many clients still exit quite early in the run.  Beats me.

Rik, this patch will conflict with read-latency2 in the init code. 
Easily fixable.  I haven't tested with the larger request queue, but I
suggest that we want to keep the 1024 requests, a high-water mark of
queue_nr_requests/4 and a low-water mark of queue_nr_requests/8.  I'll
do more testing tomorrow.

Also, you noted the other day that a LILO run *never* terminated when
there was a dbench running.  This is in fact not due to request
starvation.  It's due to livelock in invalidate_bdev(), which is called
via ioctl(BLKFLSBUF) by LILO.  invalidate_bdev() will never terminate
as long as another task is generating locked buffers against the
device.


=====================================

--- linux-2.4.18-pre9/include/linux/blkdev.h	Mon Nov 26 11:52:07 2001
+++ linux-akpm/include/linux/blkdev.h	Mon Feb 11 00:59:50 2002
@@ -119,9 +119,9 @@ struct request_queue
 	spinlock_t		queue_lock;
 
 	/*
-	 * Tasks wait here for free request
+	 * Tasks wait here for free read and write requests
 	 */
-	wait_queue_head_t	wait_for_request;
+	wait_queue_head_t	wait_for_requests[2];
 };
 
 struct blk_dev_struct {
--- linux-2.4.18-pre9/drivers/block/ll_rw_blk.c	Thu Feb  7 13:04:11 2002
+++ linux-akpm/drivers/block/ll_rw_blk.c	Mon Feb 11 00:59:50 2002
@@ -118,10 +118,14 @@ int * max_readahead[MAX_BLKDEV];
 int * max_sectors[MAX_BLKDEV];
 
 /*
- * How many reqeusts do we allocate per queue,
- * and how many do we "batch" on freeing them?
+ * The total number of requests in each queue.
  */
-static int queue_nr_requests, batch_requests;
+static int queue_nr_requests;
+
+/*
+ * The threshold around which we make wakeup decisions
+ */
+static int batch_requests;
 
 static inline int get_max_sectors(kdev_t dev)
 {
@@ -352,7 +356,8 @@ static void blk_init_free_list(request_q
 		q->rq[i&1].count++;
 	}
 
-	init_waitqueue_head(&q->wait_for_request);
+	init_waitqueue_head(&q->wait_for_requests[0]);
+	init_waitqueue_head(&q->wait_for_requests[1]);
 	spin_lock_init(&q->queue_lock);
 }
 
@@ -418,9 +423,9 @@ void blk_init_queue(request_queue_t * q,
 #define blkdev_free_rq(list) list_entry((list)->next, struct request, queue);
 /*
  * Get a free request. io_request_lock must be held and interrupts
- * disabled on the way in.
+ * disabled on the way in.  Returns NULL if there are no free requests.
  */
-static inline struct request *get_request(request_queue_t *q, int rw)
+static struct request *get_request(request_queue_t *q, int rw)
 {
 	struct request *rq = NULL;
 	struct request_list *rl = q->rq + rw;
@@ -438,38 +443,102 @@ static inline struct request *get_reques
 }
 
 /*
- * No available requests for this queue, unplug the device.
+ * Here's the request allocation design:
+ *
+ * 1: Blocking on request exhaustion is a key part of I/O throttling.
+ * 
+ * 2: We want to be `fair' to all requesters.  We must avoid starvation, and
+ *    attempt to ensure that all requesters sleep for a similar duration.  Hence
+ *    no stealing requests when there are other processes waiting.
+ * 
+ * 3: We also wish to support `batching' of requests.  So when a process is
+ *    woken, we want to allow it to allocate a decent number of requests
+ *    before it blocks again, so they can be nicely merged (this only really
+ *    matters if the process happens to be adding requests near the head of
+ *    the queue).
+ * 
+ * 4: We want to avoid scheduling storms.  This isn't really important, because
+ *    the system will be I/O bound anyway.  But it's easy.
+ * 
+ *    There is tension between requirements 2 and 3.  Once a task has woken,
+ *    we don't want to allow it to sleep as soon as it takes its second request.
+ *    But we don't want currently-running tasks to steal all the requests
+ *    from the sleepers.  We handle this with wakeup hysteresis around
+ *    0 .. batch_requests and with the assumption that request taking is much,
+ *    much faster than request freeing.
+ * 
+ * So here's what we do:
+ * 
+ *    a) A READA requester fails if free_requests < batch_requests
+ * 
+ *       We don't want READA requests to prevent sleepers from ever
+ *       waking.
+ * 
+ *  When a process wants a new request:
+ * 
+ *    b) If free_requests == 0, the requester sleeps in FIFO manner.
+ * 
+ *    c) If 0 < free_requests < batch_requests and there are waiters,
+ *       we still take a request non-blockingly.  This provides batching.
+ *
+ *    d) If free_requests >= batch_requests, the caller is immediately
+ *       granted a new request.
+ *
+ *  When a request is released:
+ *
+ *    e) If free_requests < batch_requests, do nothing.
+ *
+ *    f) If free_requests >= batch_requests, wake up a single waiter.
+ * 
+ *   The net effect is that when a process is woken at the batch_requests level,
+ *   it will be able to take approximately (batch_requests) requests before
+ *   blocking again (at the tail of the queue).
+ * 
+ *   This all assumes that the rate of taking requests is much, much higher
+ *   than the rate of releasing them.  Which is very true.
+ *
+ * -akpm, Feb 2002.
  */
+#undef REQTIMING
 static struct request *__get_request_wait(request_queue_t *q, int rw)
 {
 	register struct request *rq;
 	DECLARE_WAITQUEUE(wait, current);
 
+#ifdef REQTIMING
+	struct timeval tv1, tv2;
+	unsigned long long t1, t2;
+	unsigned long t;
+	do_gettimeofday(&tv1);
+#endif
+
 	generic_unplug_device(q);
-	add_wait_queue(&q->wait_for_request, &wait);
+	add_wait_queue_exclusive(&q->wait_for_requests[rw], &wait);
 	do {
 		set_current_state(TASK_UNINTERRUPTIBLE);
-		if (q->rq[rw].count < batch_requests)
+		if (q->rq[rw].count == 0)
 			schedule();
 		spin_lock_irq(&io_request_lock);
 		rq = get_request(q,rw);
 		spin_unlock_irq(&io_request_lock);
 	} while (rq == NULL);
-	remove_wait_queue(&q->wait_for_request, &wait);
+	remove_wait_queue(&q->wait_for_requests[rw], &wait);
 	current->state = TASK_RUNNING;
-	return rq;
-}
-
-static inline struct request *get_request_wait(request_queue_t *q, int rw)
-{
-	register struct request *rq;
 
-	spin_lock_irq(&io_request_lock);
-	rq = get_request(q, rw);
-	spin_unlock_irq(&io_request_lock);
-	if (rq)
-		return rq;
-	return __get_request_wait(q, rw);
+#ifdef REQTIMING
+	do_gettimeofday(&tv2);
+	t1 = tv1.tv_sec;
+	t1 *= 1000000;
+	t1 += tv1.tv_usec;
+	t2 = tv2.tv_sec;
+	t2 *= 1000000;
+	t2 += tv2.tv_usec;
+	t = t2 - t1;
+	printk("%s[%d] wakes after %ldms for %s\n", current->comm,
+			current->pid, t / 1000,
+			(rw == WRITE) ? "WRITE" : "READ");
+#endif
+	return rq;
 }
 
 /* RO fail safe mechanism */
@@ -546,7 +615,7 @@ static inline void add_request(request_q
 /*
  * Must be called with io_request_lock held and interrupts disabled
  */
-inline void blkdev_release_request(struct request *req)
+void blkdev_release_request(struct request *req)
 {
 	request_queue_t *q = req->q;
 	int rw = req->cmd;
@@ -560,8 +629,9 @@ inline void blkdev_release_request(struc
 	 */
 	if (q) {
 		list_add(&req->queue, &q->rq[rw].free);
-		if (++q->rq[rw].count >= batch_requests && waitqueue_active(&q->wait_for_request))
-			wake_up(&q->wait_for_request);
+		if (++q->rq[rw].count >= batch_requests &&
+				waitqueue_active(&q->wait_for_requests[rw]))
+			wake_up(&q->wait_for_requests[rw]);
 	}
 }
 
@@ -742,22 +812,30 @@ again:
 			BUG();
 	}
 		
-	/*
-	 * Grab a free request from the freelist - if that is empty, check
-	 * if we are doing read ahead and abort instead of blocking for
-	 * a free slot.
-	 */
 get_rq:
 	if (freereq) {
 		req = freereq;
 		freereq = NULL;
-	} else if ((req = get_request(q, rw)) == NULL) {
-		spin_unlock_irq(&io_request_lock);
-		if (rw_ahead)
-			goto end_io;
-
-		freereq = __get_request_wait(q, rw);
-		goto again;
+	} else {
+		/*
+		 * See description above __get_request_wait()
+		 */
+		if (rw_ahead) {
+			if (q->rq[rw].count < batch_requests) {
+				spin_unlock_irq(&io_request_lock);
+				goto end_io;
+			}
+			req = get_request(q, rw);
+			if (req == NULL)
+				BUG();
+		} else {
+			req = get_request(q, rw);
+			if (req == NULL) {
+				spin_unlock_irq(&io_request_lock);
+				freereq = __get_request_wait(q, rw);
+				goto again;
+			}
+		}
 	}
 
 /* fill up the request-info, and add it to the queue */



* Re: [patch] get_request starvation fix
@ 2002-02-09  1:56 rwhron
  0 siblings, 0 replies; 23+ messages in thread
From: rwhron @ 2002-02-09  1:56 UTC (permalink / raw)
  To: akpm; +Cc: linux-kernel

On Fri, Feb 08, 2002 at 10:29:37AM -0800, Andrew Morton wrote:
> > > Here's a patch which addresses the get_request starvation problem.
> > 
> > Looks like a tremendously important patch.  Thanks!
> 
> hmm.  Dunno.  I don't think it's very common to hit get_request
> starvation.  Only for dbenchy things, I suspect.  Still, it's
> good to have a design which addresses nasty corner cases.

It really works!  

tiobench sequential reads with 32 threads went from max latency of
193 seconds down to 7 seconds with your make_request, read-latency2
and low-latency patches.  (2.4.18-pre9-am1)  The % of high latency
requests also went down.

K6-2 475 MHz with 384 MB RAM and reiserfs on IDE disks.
Average of 3 runs.
Total File size = 1024 MB.  (individual file size = 1024 / num-threads)
Read, write, and seek rates in MB/sec. 
Latency in milliseconds.
Percent of requests that took longer than 2 and 10 seconds.

Sequential       Num                   Avg     Maximum  Lat%   Lat%  CPU
Reads            Thr  Rate   (CPU%)  Latency   Latency   >2s   >10s  Eff
                 --- ---------------------------------------------------
2.4.18-pre9-am1    8   8.94  13.33%   10.402   1785.52  0.000  0.000  67
2.4.18-pre9        8   8.85  13.22%   10.537   1954.27  0.000  0.000  67
2.4.18-pre9-am1   16   8.98  13.70%   30.827   4023.09  0.000  0.000  66
2.4.18-pre9       16   8.91  13.60%   31.122   4072.67  0.000  0.000  65
2.4.18-pre9-am1   32   9.00  14.16%   70.834   7032.45  0.000  0.000  64
2.4.18-pre9       32   8.87  13.89%   66.135 193590.10  0.030  0.023  64
                             
                             
Random           Num                  Avg      Maximum  Lat%   Lat%  CPU
Reads            Thr  Rate  (CPU%)  Latency    Latency   >2s   >10s  Eff
                 --- ---------------------------------------------------
2.4.18-pre9-am1    8  0.69  1.828%  132.242     473.73  0.000  0.000  38
2.4.18-pre9        8  0.71  2.156%  128.869     510.38  0.000  0.000  33
2.4.18-pre9-am1   16  0.72  2.154%  371.288    1410.13  0.000  0.000  33
2.4.18-pre9       16  0.74  2.025%  364.871    1397.82  0.000  0.000  36
2.4.18-pre9-am1   32  0.74  2.229%  816.605    2996.07  0.000  0.000  33
2.4.18-pre9       32  0.75  2.163%  734.035    2841.97  0.000  0.000  35
                             
Sequential       Num                   Avg     Maximum  Lat%   Lat%  CPU 
Writes           Thr  Rate   (CPU%)  Latency   Latency   >2s   >10s  Eff 
                 --- --------------------------------------------------- 
2.4.18-pre9-am1    8  11.45  52.33%    7.651  15095.52  0.058  0.000  22 
2.4.18-pre9        8   9.23  40.84%    9.021  12351.12  0.040  0.000  23
2.4.18-pre9-am1   16  11.48  54.57%   22.253  41922.53  0.214  0.000  21 
2.4.18-pre9       16   9.46  41.71%   26.374  28280.18  0.212  0.000  23
2.4.18-pre9-am1   32  11.37  55.91%   52.917  75331.67  0.679  0.004  20 
2.4.18-pre9       32   9.50  42.55%   61.944  60970.74  0.770  0.005  22
                             
                             
Random           Num                   Avg     Maximum  Lat%   Lat%  CPU
Writes           Thr  Rate   (CPU%)  Latency   Latency   >2s   >10s  Eff
                 --- ---------------------------------------------------
2.4.18-pre9-am1    8   0.62  2.393%    1.786    330.56  0.000  0.000  26
2.4.18-pre9        8   0.50  1.216%    1.700    333.32  0.000  0.000  41
2.4.18-pre9-am1   16   0.65  2.527%    4.533    996.54  0.000  0.000  26
2.4.18-pre9       16   0.52  1.267%    4.228   1236.08  0.000  0.000  41
2.4.18-pre9-am1   32   0.65  2.530%    7.980   1907.91  0.000  0.000  26
2.4.18-pre9       32   0.53  1.343%    7.818   2509.73  0.000  0.000  39


> > Do you have a patch for dbench too? :)
> 
> /bin/rm?

Good one.  It would be helpful though, if you've already done the work.
-- 
Randy Hron


* Re: [patch] get_request starvation fix
@ 2002-02-08 19:31 Dieter Nützel
  0 siblings, 0 replies; 23+ messages in thread
From: Dieter Nützel @ 2002-02-08 19:31 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Jens Axboe, Ingo Molnar, Robert Love, Linux Kernel List

On Fri, Feb 08 2002, Andrew Morton wrote:
> Here's a patch which addresses the get_request starvation problem.

[snip]

> Also, you noted the other day that a LILO run *never* terminated when
> there was a dbench running.  This is in fact not due to request
> starvation.  It's due to livelock in invalidate_bdev(), which is called
> via ioctl(BLKFLSBUF) by LILO.  invalidate_bdev() will never terminate
> as long as another task is generating locked buffers against the
> device.
[snip]

Could this below be related?
I get system lockups through lilo (bzlilo) from time to time with all the
latest kernels + O(1) + -aa + preempt.

Thanks,
	Dieter

> Re: [patch] O(1) scheduler, -J9
> Datum: Wed, 30 Jan 2002 17:36:21 +0100
> Von: Dieter Nützel <Dieter.Nuetzel@hamburg.de>
> An: Ingo Molnar <mingo@elte.hu>
> Kopie: Robert Love <rml@tech9.net>, Oleg Drokin <green@namesys.com>
>
> On Wednesday, 30. January 2002 17:26, Dieter Nützel wrote:
> > On Wednesday, 30. January 2002 15:40, Ingo Molnar wrote:
> > > On Wed, 30 Jan 2002, Dieter [iso-8859-15] Nützel wrote:
> > > > As always running "nice +19 make -j40" in my HUGE C++ VTK tree in the
> > > > background ;-)
> > > >
> > > > The mouse feels a little bit sluggish after KMail start so I reniced X
> > > > to -10. [...]
> > >
> > > does this make X smooth? nice -10 is a good choice, it's not too low and
> > > not too high, i'm using that value myself.
> >
> > Yes, mostly.
> > In "normal operation mode" X at 0 is good.
> >
> > I only get some very little but noticeable unsmoothness with several KDE
> > apps running. It could be KDE it self.
> >
> > Latency degradation since -J4 stays. But that could be missing of some
> > lock-breaks.
> >
> > Robert's latest BKL stuff halved the numbers for the two
> > latencytest0.42-png graphic benches on 2.4.18-pre7 + patches.

> Addition:

> I am most worried about the occasional kernel hangs during make
> bzlilo/bzImage. It appears to happen during lilo (???). The new kernel
> (vmlinux) is still built in /usr/src/linux, but the "latest" *.o and
> .*.flags files are broken due to ReiserFS transaction replay after reboot
> (normal behavior with a journaling FS without full data journaling).
>
> Maybe a RACE between lilo vs sync?

* [patch] get_request starvation fix
@ 2002-02-08  8:46 Andrew Morton
  2002-02-08  8:57 ` Jens Axboe
                   ` (3 more replies)
  0 siblings, 4 replies; 23+ messages in thread
From: Andrew Morton @ 2002-02-08  8:46 UTC (permalink / raw)
  To: Rik van Riel, William Lee Irwin III, lkml

Here's a patch which addresses the get_request starvation problem.

The problem we're experiencing is that when the request queue is
exhausted, processes go to sleep until the request queue has refilled
to the batch_nr_requests level.  But already-running tasks can simply
whizz in and steal free requests, thus preventing the sleeping tasks
from getting woken.

Some instrumentation illustrates this clearly.  All testing was on
ext2.  In a `dbench 64' run on stock 2.4.18-pre9, the maximum sleep
duration in __get_request_wait() was 37 seconds, and it's being very
unfair.  Look at this section of the trace:

Feb  7 21:59:06 mnm kernel: dbench[883] wakes after 442ms
Feb  7 21:59:06 mnm kernel: dbench[877] wakes after 442ms
Feb  7 21:59:06 mnm kernel: dbench[868] wakes after 441ms
Feb  7 21:59:06 mnm kernel: kupdated[7] wakes after 11587ms
Feb  7 21:59:06 mnm kernel: dbench[914] wakes after 574ms
Feb  7 21:59:06 mnm kernel: dbench[873] wakes after 2618ms
Feb  7 21:59:06 mnm kernel: dbench[927] wakes after 755ms
Feb  7 21:59:06 mnm kernel: dbench[873] wakes after 263ms
Feb  7 21:59:06 mnm kernel: dbench[924] wakes after 838ms
Feb  7 21:59:06 mnm kernel: dbench[924] wakes after 134ms
Feb  7 21:59:06 mnm kernel: dbench[871] wakes after 18688ms
Feb  7 21:59:06 mnm kernel: dbench[908] wakes after 3245ms
Feb  7 21:59:06 mnm kernel: dbench[895] wakes after 974ms
Feb  7 21:59:06 mnm kernel: dbench[878] wakes after 27401ms
Feb  7 21:59:06 mnm kernel: dbench[895] wakes after 278ms
Feb  7 21:59:06 mnm kernel: dbench[904] wakes after 278ms

With the patch applied, the maximum sleep duration was eight seconds,
but when this was happening, *all* tasks were sleeping for similar
durations:

Feb  7 23:55:40 mnm kernel: dbench[948] wakes after 3978ms
Feb  7 23:55:40 mnm kernel: dbench[935] wakes after 3439ms
Feb  7 23:55:41 mnm kernel: dbench[955] wakes after 2998ms
Feb  7 23:55:41 mnm kernel: dbench[965] wakes after 2847ms
Feb  7 23:55:41 mnm kernel: dbench[994] wakes after 3032ms
Feb  7 23:55:41 mnm kernel: dbench[969] wakes after 2652ms
Feb  7 23:55:41 mnm kernel: dbench[956] wakes after 2197ms
Feb  7 23:55:41 mnm kernel: dbench[993] wakes after 1589ms
Feb  7 23:55:41 mnm kernel: dbench[990] wakes after 1590ms
Feb  7 23:55:41 mnm kernel: dbench[939] wakes after 1585ms
Feb  7 23:55:41 mnm kernel: kupdated[7] wakes after 1555ms
Feb  7 23:55:41 mnm kernel: dbench[976] wakes after 1787ms
Feb  7 23:55:41 mnm kernel: dbench[997] wakes after 1653ms
Feb  7 23:55:41 mnm kernel: dbench[961] wakes after 1891ms
Feb  7 23:55:42 mnm kernel: dbench[966] wakes after 1967ms
Feb  7 23:55:42 mnm kernel: dbench[936] wakes after 1339ms
Feb  7 23:55:42 mnm kernel: dbench[965] wakes after 1216ms
Feb  7 23:55:42 mnm kernel: kjournald[9] wakes after 1253ms
Feb  7 23:55:42 mnm kernel: dbench[990] wakes after 879ms
Feb  7 23:55:42 mnm kernel: kupdated[7] wakes after 849ms
Feb  7 23:55:42 mnm kernel: dbench[959] wakes after 343ms
Feb  7 23:55:42 mnm kernel: dbench[936] wakes after 343ms
Feb  7 23:55:42 mnm kernel: dbench[964] wakes after 492ms
Feb  7 23:55:42 mnm kernel: dbench[966] wakes after 648ms
Feb  7 23:55:42 mnm kernel: kupdated[7] wakes after 413ms
Feb  7 23:55:42 mnm kernel: dbench[988] wakes after 344ms



The actual algorithm I chose is described in the patch.  It is
noteworthy that the wakeup strategy has changed from wake-all to
wake-one, which should save a few cycles.  The number of context
switches during the entire dbench 64 run fell from 57,000 to
29,000.

Interestingly, dbench runtimes are still very inconsistent, and
many clients still exit quite early in the run.  Beats me.

Rik, this patch will conflict with read-latency2 in the init code. 
Easily fixable.  I haven't tested with the larger request queue, but I
suggest that we want to keep the 1024 requests, a high-water mark of
queue_nr_requests/4 and a low-water mark of queue_nr_requests/8.  I'll
do more testing tomorrow.

Also, you noted the other day that a LILO run *never* terminated when
there was a dbench running.  This is in fact not due to request
starvation.  It's due to livelock in invalidate_bdev(), which is called
via ioctl(BLKFLSBUF) by LILO.  invalidate_bdev() will never terminate
as long as another task is generating locked buffers against the
device.


--- linux-2.4.18-pre9/drivers/block/ll_rw_blk.c	Thu Feb  7 13:04:11 2002
+++ linux-akpm/drivers/block/ll_rw_blk.c	Fri Feb  8 00:33:01 2002
@@ -118,10 +118,15 @@ int * max_readahead[MAX_BLKDEV];
 int * max_sectors[MAX_BLKDEV];
 
 /*
- * How many reqeusts do we allocate per queue,
- * and how many do we "batch" on freeing them?
+ * The total number of requests in each queue.
  */
-static int queue_nr_requests, batch_requests;
+static int queue_nr_requests;
+
+/*
+ * low- and high-water marks for the queues.
+ */
+static int batch_requests_low;
+static int batch_requests_high;
 
 static inline int get_max_sectors(kdev_t dev)
 {
@@ -418,9 +423,9 @@ void blk_init_queue(request_queue_t * q,
 #define blkdev_free_rq(list) list_entry((list)->next, struct request, queue);
 /*
  * Get a free request. io_request_lock must be held and interrupts
- * disabled on the way in.
+ * disabled on the way in.  Returns NULL if there are no free requests.
  */
-static inline struct request *get_request(request_queue_t *q, int rw)
+static struct request *get_request(request_queue_t *q, int rw)
 {
 	struct request *rq = NULL;
 	struct request_list *rl = q->rq + rw;
@@ -438,18 +443,81 @@ static inline struct request *get_reques
 }
 
 /*
- * No available requests for this queue, unplug the device.
+ * Here's the request allocation design:
+ *
+ * 1: Blocking on request exhaustion is a key part of I/O throttling.
+ * 
+ * 2: We want to be `fair' to all requesters.  We must avoid starvation, and
+ *    attempt to ensure that all requesters sleep for a similar duration.  Hence
+ *    no stealing requests when there are other processes waiting.
+ * 
+ * 3: We also wish to support `batching' of requests.  So when a process is
+ *    woken, we want to allow it to allocate a decent number of requests
+ *    before it blocks again, so they can be nicely merged (this only really
+ *    matters if the process happens to be adding requests near the head of
+ *    the queue).
+ * 
+ * 4: We want to avoid scheduling storms.  This isn't really important, because
+ *    the system will be I/O bound anyway.  But it's easy.
+ * 
+ *    There is tension between requirements 2 and 3.  Once a task has woken,
+ *    we don't want to allow it to sleep as soon as it takes its second request.
+ *    But we don't want currently-running tasks to steal all the requests
+ *    from the sleepers.  We handle this with high- and low- water marks and
+ *    with the assumption that request taking is much, much faster than
+ *    request freeing.
+ * 
+ * So here's what we do:
+ * 
+ *    a) A READA requester fails if free_requests < batch_requests_high
+ * 
+ *       We don't want READA requests to prevent sleepers from ever
+ *       waking.
+ * 
+ *  When a process wants a new request:
+ * 
+ *    b) If someone else is waiting on requests and free_requests < low_water,
+ *       the requester sleeps in FIFO manner.
+ * 
+ *    c) If low_water < free_requests < high_water, the caller is immediately
+ *       granted a new request.
+ * 
+ *    d) If nobody is waiting on requests, the caller gets given a request,
+ *       if there are any available.  Otherwise the caller sleeps.
+ * 
+ *  When a request is released:
+ * 
+ *    e) If free_requests < low_water, do nothing.
+ * 
+ *    f) If free_requests > high_water, wake up a single waiter.
+ * 
+ *   The net effect is that when a process is woken at the high-water mark,
+ *   it will be able to take approximately (high-water - low-water) requests
+ *   before blocking again (at the tail of the queue).
+ * 
+ *   This all assumes that the rate of taking requests is much, much higher
+ *   than the rate of releasing them.  Which is very true.
+ *
+ * -akpm, Feb 2002.
  */
+#undef REQTIMING
 static struct request *__get_request_wait(request_queue_t *q, int rw)
 {
 	register struct request *rq;
 	DECLARE_WAITQUEUE(wait, current);
 
+#ifdef REQTIMING
+	struct timeval tv1, tv2;
+	unsigned long long t1, t2;
+	unsigned long t;
+	do_gettimeofday(&tv1);
+#endif
+
 	generic_unplug_device(q);
-	add_wait_queue(&q->wait_for_request, &wait);
+	add_wait_queue_exclusive(&q->wait_for_request, &wait);
 	do {
 		set_current_state(TASK_UNINTERRUPTIBLE);
-		if (q->rq[rw].count < batch_requests)
+		if (q->rq[rw].count < batch_requests_low)
 			schedule();
 		spin_lock_irq(&io_request_lock);
 		rq = get_request(q,rw);
@@ -457,19 +525,20 @@ static struct request *__get_request_wai
 	} while (rq == NULL);
 	remove_wait_queue(&q->wait_for_request, &wait);
 	current->state = TASK_RUNNING;
-	return rq;
-}
 
-static inline struct request *get_request_wait(request_queue_t *q, int rw)
-{
-	register struct request *rq;
-
-	spin_lock_irq(&io_request_lock);
-	rq = get_request(q, rw);
-	spin_unlock_irq(&io_request_lock);
-	if (rq)
-		return rq;
-	return __get_request_wait(q, rw);
+#ifdef REQTIMING
+	do_gettimeofday(&tv2);
+	t1 = tv1.tv_sec;
+	t1 *= 1000000;
+	t1 += tv1.tv_usec;
+	t2 = tv2.tv_sec;
+	t2 *= 1000000;
+	t2 += tv2.tv_usec;
+	t = t2 - t1;
+	printk("%s[%d] wakes after %ldms\n", current->comm,
+			current->pid, t / 1000);
+#endif
+	return rq;
 }
 
 /* RO fail safe mechanism */
@@ -546,7 +615,7 @@ static inline void add_request(request_q
 /*
  * Must be called with io_request_lock held and interrupts disabled
  */
-inline void blkdev_release_request(struct request *req)
+void blkdev_release_request(struct request *req)
 {
 	request_queue_t *q = req->q;
 	int rw = req->cmd;
@@ -560,7 +629,8 @@ inline void blkdev_release_request(struc
 	 */
 	if (q) {
 		list_add(&req->queue, &q->rq[rw].free);
-		if (++q->rq[rw].count >= batch_requests && waitqueue_active(&q->wait_for_request))
+		if (++q->rq[rw].count >= batch_requests_high &&
+				waitqueue_active(&q->wait_for_request))
 			wake_up(&q->wait_for_request);
 	}
 }
@@ -742,22 +812,41 @@ again:
 			BUG();
 	}
 		
-	/*
-	 * Grab a free request from the freelist - if that is empty, check
-	 * if we are doing read ahead and abort instead of blocking for
-	 * a free slot.
-	 */
 get_rq:
 	if (freereq) {
 		req = freereq;
 		freereq = NULL;
-	} else if ((req = get_request(q, rw)) == NULL) {
-		spin_unlock_irq(&io_request_lock);
-		if (rw_ahead)
-			goto end_io;
-
-		freereq = __get_request_wait(q, rw);
-		goto again;
+	} else {
+		/*
+		 * See description above __get_request_wait()
+		 */
+		if (rw_ahead) {
+			if (q->rq[rw].count < batch_requests_high) {
+				spin_unlock_irq(&io_request_lock);
+				goto end_io;
+			}
+			req = get_request(q, rw);
+			if (req == NULL)
+				BUG();
+		} else {
+			if (waitqueue_active(&q->wait_for_request)) {
+				if (q->rq[rw].count < batch_requests_low) {
+					spin_unlock_irq(&io_request_lock);
+					freereq = __get_request_wait(q, rw);
+					goto again;
+				}
+				req = get_request(q, rw);
+				if (req == NULL)
+					BUG();
+			} else {
+				req = get_request(q, rw);
+				if (req == NULL) {
+					spin_unlock_irq(&io_request_lock);
+					freereq = __get_request_wait(q, rw);
+					goto again;
+				}
+			}
+		}
 	}
 
 /* fill up the request-info, and add it to the queue */
@@ -1124,8 +1213,10 @@ int __init blk_dev_init(void)
 	/*
 	 * Batch frees according to queue length
 	 */
-	batch_requests = queue_nr_requests/4;
-	printk("block: %d slots per queue, batch=%d\n", queue_nr_requests, batch_requests);
+	batch_requests_high = queue_nr_requests / 4;
+	batch_requests_low = batch_requests_high / 2;
+	printk("block: %d slots per queue, batch_low=%d, batch_high=%d\n",
+		queue_nr_requests, batch_requests_low, batch_requests_high);
 
 #ifdef CONFIG_AMIGA_Z2RAM
 	z2_init();

Thread overview: 23+ messages
     [not found] <200202081932.GAA05943@mangalore.zipworld.com.au>
2002-02-08 19:44 ` [patch] get_request starvation fix Andrew Morton
2002-02-08 19:53   ` Dieter Nützel
2002-02-08 20:43   ` Rik van Riel
2002-02-13 13:55 rwhron
  -- strict thread matches above, loose matches on Subject: below --
2002-02-12 23:13 Andrew Morton
2002-02-13  1:28 ` William Lee Irwin III
2002-02-15 17:23 ` Marcelo Tosatti
2002-02-16  7:32   ` Andrew Morton
2002-02-16 10:13     ` Daniel Phillips
2002-02-16 10:25       ` Andrew Morton
2002-02-09  1:56 rwhron
2002-02-08 19:31 Dieter Nützel
2002-02-08  8:46 Andrew Morton
2002-02-08  8:57 ` Jens Axboe
2002-02-08  9:57   ` Andrew Morton
2002-02-08  9:10 ` Andrew Morton
2002-02-08 11:37 ` Rik van Riel
2002-02-08 18:28   ` Andrew Morton
2002-02-11  9:41 ` Andrew Morton
2002-02-11 17:35   ` Suparna Bhattacharya
2002-02-11 19:26     ` Andrew Morton
2002-02-14  6:00       ` Suparna Bhattacharya
2002-02-13  0:33   ` Jesse Barnes
