* Re: [patch] get_request starvation fix
[not found] <200202081932.GAA05943@mangalore.zipworld.com.au>
@ 2002-02-08 19:44 ` Andrew Morton
2002-02-08 19:53 ` Dieter Nützel
2002-02-08 20:43 ` Rik van Riel
0 siblings, 2 replies; 23+ messages in thread
From: Andrew Morton @ 2002-02-08 19:44 UTC (permalink / raw)
To: Dieter Nützel
Cc: Jens Axboe, Ingo Molnar, Robert Love, Linux Kernel List
Dieter Nützel wrote:
>
> On Fri, Feb 08 2002, Andrew Morton wrote:
> > Here's a patch which addresses the get_request starvation problem.
>
> [snip]
>
> > Also, you noted the other day that a LILO run *never* terminated when
> > there was a dbench running. This is in fact not due to request
> > starvation. It's due to livelock in invalidate_bdev(), which is called
> > via ioctl(BLKFLSBUF) by LILO. invalidate_bdev() will never terminate
> > as long as another task is generating locked buffers against the
> > device.
> [snip]
>
> Could this below be related?
> I get system locks through lilo (bzlilo) from time to time with all the
> latest kernels + O(1) + -aa + preempt
>
Depends what you mean by "system locks". The invalidate_bdev() problem
won't lock the machine. Its symptoms are merely that the ioctl will
not terminate until the process which is writing to disk stops.
In other words: if you run dbench, then lilo, lilo will not complete
until after dbench terminates.
If you're seeing actual have-to-hit-reset lockups then that'll
be due to something quite different.
-
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [patch] get_request starvation fix
2002-02-08 19:44 ` [patch] get_request starvation fix Andrew Morton
@ 2002-02-08 19:53 ` Dieter Nützel
2002-02-08 20:43 ` Rik van Riel
1 sibling, 0 replies; 23+ messages in thread
From: Dieter Nützel @ 2002-02-08 19:53 UTC (permalink / raw)
To: Andrew Morton; +Cc: Jens Axboe, Ingo Molnar, Robert Love, Linux Kernel List
On Friday, 8. February 2002 20:44, Andrew Morton wrote:
> Dieter Nützel wrote:
> > On Fri, Feb 08 2002, Andrew Morton wrote:
> > > Here's a patch which addresses the get_request starvation problem.
> >
> > [snip]
> >
> > > Also, you noted the other day that a LILO run *never* terminated when
> > > there was a dbench running. This is in fact not due to request
> > > starvation. It's due to livelock in invalidate_bdev(), which is called
> > > via ioctl(BLKFLSBUF) by LILO. invalidate_bdev() will never terminate
> > > as long as another task is generating locked buffers against the
> > > device.
> >
> > [snip]
> >
> > Could this below be related?
> > I get system locks through lilo (bzlilo) from time to time with all
> > the latest kernels + O(1) + -aa + preempt
>
> Depends what you mean by "system locks".
A hard lockup :-(
Nothing in the log files.
No SysRq works, only reset. --- But ReiserFS survives, apart from a few
corrupted *.o.flags files and, most of the time, a broken /usr/src/vmlinux
or freshly written /boot/vmlinuz file.
> The invalidate_bdev() problem won't lock the machine.
I see.
> In other words: if you run dbench, then lilo, lilo will not complete
> until after dbench terminates.
dbench wasn't running beforehand.
Only running several "sync" commands (by hand) during the kernel build helps.
> If you're seeing actual have-to-hit-reset lockups then that'll
> be due to something quite different.
Sadly, yes.
Thanks,
Dieter
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [patch] get_request starvation fix
2002-02-08 19:44 ` [patch] get_request starvation fix Andrew Morton
2002-02-08 19:53 ` Dieter Nützel
@ 2002-02-08 20:43 ` Rik van Riel
1 sibling, 0 replies; 23+ messages in thread
From: Rik van Riel @ 2002-02-08 20:43 UTC (permalink / raw)
To: Andrew Morton
Cc: Dieter Nützel, Jens Axboe, Ingo Molnar, Robert Love,
Linux Kernel List
On Fri, 8 Feb 2002, Andrew Morton wrote:
> In other words: if you run dbench, then lilo, lilo will not complete
> until after dbench terminates.
Actually, lilo completed after 5 minutes, but still only
about halfway through the dbench.
Rik
--
Will hack the VM for food.
http://www.surriel.com/ http://distro.conectiva.com/
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [patch] get_request starvation fix
@ 2002-02-13 13:55 rwhron
0 siblings, 0 replies; 23+ messages in thread
From: rwhron @ 2002-02-13 13:55 UTC (permalink / raw)
To: linux-kernel; +Cc: akpm
tiobench measurements using 2048 MB worth of files on ext2 fs.
K6-2 475 MHz with 384 MB RAM and IDE disk.
Sorted by test, number of threads, then kernel.
Read, write, and seek rates in MB/sec.
Latency in milliseconds.
Percent of requests that took longer than 2 and 10 seconds.
2.4.18-pre9-am1 = make_request, read_latency2, and low-latency patches.
2.4.18-pre9-am2 = make_request, read_latency2, and sync_livelock patches.
On sequential reads, kernels with make_request and read_latency2
reduce max latency from around 500 seconds down to 2-8 seconds.
Fairness to requests is vastly improved. Throughput looks better too.
Sequential Reads Num Avg Maximum Lat% Lat%
Kernel Thr Rate (CPU%) Latency Latency >2s >10s
---------------- --- ---------------------------------------------------
2.4.18-pre9 32 9.18 12.19% 34.739 545388.95 0.00304 0.00304
2.4.18-pre9-ac1 32 11.48 30.08% 25.736 289601.71 0.00839 0.00839
2.4.18-pre9-am1 32 9.25 11.97% 40.103 2251.30 0.00000 0.00000
2.4.18-pre9-am2 32 9.32 12.11% 39.917 2206.14 0.00000 0.00000
2.4.18-pre9 64 8.83 11.81% 59.045 484270.65 0.02689 0.02651
2.4.18-pre9-ac1 64 10.37 27.12% 49.460 293425.91 0.04273 0.04196
2.4.18-pre9-am1 64 9.53 13.42% 77.737 3840.97 0.00000 0.00000
2.4.18-pre9-am2 64 9.52 13.15% 77.873 3951.99 0.00000 0.00000
2.4.18-pre9 128 8.44 10.92% 109.317 572551.72 0.08507 0.07534
2.4.18-pre9-ac1 128 9.28 24.08% 104.159 333550.33 0.13599 0.13523
2.4.18-pre9-am1 128 9.55 15.38% 153.480 7939.48 0.66070 0.00000
2.4.18-pre9-am2 128 9.59 15.72% 153.244 7648.12 0.80833 0.00000
On random reads, the make_request and read_latency2 patches help a lot at
128 threads. Max latency and the % of high-latency requests are greatly reduced.
Random Reads Num Avg Maximum Lat% Lat%
Kernel Thr Rate (CPU%) Latency Latency >2s >10s
---------------- --- ---------------------------------------------------
2.4.18-pre9 32 0.57 1.580% 549.367 1523.39 0.00000 0.00000
2.4.18-pre9-ac1 32 0.53 2.226% 660.193 2966.26 0.00000 0.00000
2.4.18-pre9-am1 32 0.58 1.479% 623.361 1693.90 0.00000 0.00000
2.4.18-pre9-am2 32 0.58 1.501% 619.679 1727.47 0.00000 0.00000
2.4.18-pre9 64 0.61 2.031% 1016.205 2618.34 0.00000 0.00000
2.4.18-pre9-ac1 64 0.56 2.579% 1160.377 72621.35 0.35283 0.32763
2.4.18-pre9-am1 64 0.61 1.711% 1145.865 2865.21 0.00000 0.00000
2.4.18-pre9-am2 64 0.62 1.978% 1135.097 2792.76 0.00000 0.00000
2.4.18-pre9 128 0.61 1.699% 1569.812 53372.49 4.48589 4.41029
2.4.18-pre9-ac1 128 0.55 2.701% 1862.139 75417.90 5.54435 5.21673
2.4.18-pre9-am1 128 0.63 2.511% 2152.480 4710.51 0.00000 0.00000
2.4.18-pre9-am2 128 0.63 2.525% 2145.578 4709.85 0.00000 0.00000
On the write tests, 2.4.18-pre9-ac1 has the best latency numbers.
Sequential Write Num Avg Maximum Lat% Lat%
Kernel Thr Rate (CPU%) Latency Latency >2s >10s
---------------- --- ---------------------------------------------------
2.4.18-pre9 32 13.18 35.47% 26.155 33677.84 0.48867 0.00324
2.4.18-pre9-ac1 32 13.66 36.82% 24.738 9008.30 0.03567 0.00000
2.4.18-pre9-am1 32 15.27 53.94% 21.236 39604.59 0.37098 0.00400
2.4.18-pre9-am2 32 15.17 53.66% 21.492 41358.77 0.37155 0.00496
2.4.18-pre9 64 13.47 44.74% 46.296 46673.56 0.76332 0.03968
2.4.18-pre9-ac1 64 11.91 32.45% 55.355 12051.19 0.22926 0.00000
2.4.18-pre9-am1 64 14.57 52.16% 43.430 51274.83 0.75874 0.05665
2.4.18-pre9-am2 64 14.78 53.76% 42.685 54281.45 0.63266 0.07457
2.4.18-pre9 128 12.78 42.55% 91.421 60494.19 1.20201 0.26531
2.4.18-pre9-ac1 128 10.15 27.63% 122.921 15371.80 1.29719 0.00000
2.4.18-pre9-am1 128 13.72 47.73% 83.373 69739.68 1.16043 0.22126
2.4.18-pre9-am2 128 14.17 52.81% 80.642 67952.87 1.10722 0.23041
Random Writes Num Avg Maximum Lat% Lat%
Kernel Thr Rate (CPU%) Latency Latency >2s >10s
---------------- --- ---------------------------------------------------
2.4.18-pre9 32 0.60 1.323% 1.915 576.55 0.00000 0.00000
2.4.18-pre9-ac1 32 0.61 1.296% 0.311 481.93 0.00000 0.00000
2.4.18-pre9-am1 32 0.71 2.411% 1.696 624.95 0.00000 0.00000
2.4.18-pre9-am2 32 0.73 2.804% 1.567 711.59 0.00000 0.00000
2.4.18-pre9 64 0.65 1.547% 1.552 558.61 0.00000 0.00000
2.4.18-pre9-ac1 64 0.64 1.457% 0.214 88.78 0.00000 0.00000
2.4.18-pre9-am1 64 0.74 2.649% 1.014 522.44 0.00000 0.00000
2.4.18-pre9-am2 64 0.75 2.647% 1.035 557.22 0.00000 0.00000
2.4.18-pre9 128 0.70 1.823% 1.664 785.13 0.00000 0.00000
2.4.18-pre9-ac1 128 0.67 1.926% 0.339 398.63 0.00000 0.00000
2.4.18-pre9-am1 128 0.76 2.716% 1.527 845.77 0.00000 0.00000
2.4.18-pre9-am2 128 0.75 2.828% 1.371 886.19 0.00000 0.00000
More tests on more kernels at
http://home.earthlink.net/~rwhron/kernel/k6-2-475.html
--
Randy Hron
^ permalink raw reply [flat|nested] 23+ messages in thread
* [patch] get_request starvation fix
@ 2002-02-12 23:13 Andrew Morton
2002-02-13 1:28 ` William Lee Irwin III
2002-02-15 17:23 ` Marcelo Tosatti
0 siblings, 2 replies; 23+ messages in thread
From: Andrew Morton @ 2002-02-12 23:13 UTC (permalink / raw)
To: Alan Cox, lkml
Second version of this patch, incorporating Suparna's
suggested simplification (the low-water mark was
unnecessary).
This patch is working well here. Hopefully it'll pop up
in an rmap kernel soon.
Bill Irwin has been doing some fairly extensive tuning
and testing of this. Hopefully he'll come out with some
numbers soon.
I include the original description...
==== make_request description ====
Here's a patch which addresses the get_request starvation problem.
The problem we're experiencing is that when the request queue is
exhausted, processes go to sleep until the request queue has refilled
to the batch_nr_requests level. But already-running tasks can simply
whizz in and steal free requests, thus preventing the sleeping tasks
from getting woken.
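To make the race concrete, here is a minimal sketch pieced together from
the pre-patch code visible as minus-lines in the diff below (locking
elided; the interleaving shown is illustrative):

	/* Sleeper, pre-patch __get_request_wait(): waits for the pool
	 * to refill to the batch level, then retries the allocation. */
	do {
		set_current_state(TASK_UNINTERRUPTIBLE);
		if (q->rq[rw].count < batch_requests)
			schedule();		/* woken by wake-all wake_up() */
		rq = get_request(q, rw);	/* may still find nothing... */
	} while (rq == NULL);

	/* Meanwhile, a task that never slept can call get_request()
	 * directly and drain the freed requests before the sleeper is
	 * scheduled, so the sleeper goes straight back to sleep. */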
Some instrumentation illustrates this clearly. All testing was on
ext2. In a `dbench 64' run on stock 2.4.18-pre9, the maximum sleep
duration in __get_request_wait() was 37 seconds, and it's being very
unfair. Look at this section of the trace:
Feb 7 21:59:06 mnm kernel: dbench[883] wakes after 442ms
Feb 7 21:59:06 mnm kernel: dbench[877] wakes after 442ms
Feb 7 21:59:06 mnm kernel: dbench[868] wakes after 441ms
Feb 7 21:59:06 mnm kernel: kupdated[7] wakes after 11587ms
Feb 7 21:59:06 mnm kernel: dbench[914] wakes after 574ms
Feb 7 21:59:06 mnm kernel: dbench[873] wakes after 2618ms
Feb 7 21:59:06 mnm kernel: dbench[927] wakes after 755ms
Feb 7 21:59:06 mnm kernel: dbench[873] wakes after 263ms
Feb 7 21:59:06 mnm kernel: dbench[924] wakes after 838ms
Feb 7 21:59:06 mnm kernel: dbench[924] wakes after 134ms
Feb 7 21:59:06 mnm kernel: dbench[871] wakes after 18688ms
Feb 7 21:59:06 mnm kernel: dbench[908] wakes after 3245ms
Feb 7 21:59:06 mnm kernel: dbench[895] wakes after 974ms
Feb 7 21:59:06 mnm kernel: dbench[878] wakes after 27401ms
Feb 7 21:59:06 mnm kernel: dbench[895] wakes after 278ms
Feb 7 21:59:06 mnm kernel: dbench[904] wakes after 278ms
With the patch applied, the maximum sleep duration was eight seconds,
but when this was happening, *all* tasks were sleeping for similar
durations:
Feb 7 23:55:40 mnm kernel: dbench[948] wakes after 3978ms
Feb 7 23:55:40 mnm kernel: dbench[935] wakes after 3439ms
Feb 7 23:55:41 mnm kernel: dbench[955] wakes after 2998ms
Feb 7 23:55:41 mnm kernel: dbench[965] wakes after 2847ms
Feb 7 23:55:41 mnm kernel: dbench[994] wakes after 3032ms
Feb 7 23:55:41 mnm kernel: dbench[969] wakes after 2652ms
Feb 7 23:55:41 mnm kernel: dbench[956] wakes after 2197ms
Feb 7 23:55:41 mnm kernel: dbench[993] wakes after 1589ms
Feb 7 23:55:41 mnm kernel: dbench[990] wakes after 1590ms
Feb 7 23:55:41 mnm kernel: dbench[939] wakes after 1585ms
Feb 7 23:55:41 mnm kernel: kupdated[7] wakes after 1555ms
Feb 7 23:55:41 mnm kernel: dbench[976] wakes after 1787ms
Feb 7 23:55:41 mnm kernel: dbench[997] wakes after 1653ms
Feb 7 23:55:41 mnm kernel: dbench[961] wakes after 1891ms
Feb 7 23:55:42 mnm kernel: dbench[966] wakes after 1967ms
Feb 7 23:55:42 mnm kernel: dbench[936] wakes after 1339ms
Feb 7 23:55:42 mnm kernel: dbench[965] wakes after 1216ms
Feb 7 23:55:42 mnm kernel: kjournald[9] wakes after 1253ms
Feb 7 23:55:42 mnm kernel: dbench[990] wakes after 879ms
Feb 7 23:55:42 mnm kernel: kupdated[7] wakes after 849ms
Feb 7 23:55:42 mnm kernel: dbench[959] wakes after 343ms
Feb 7 23:55:42 mnm kernel: dbench[936] wakes after 343ms
Feb 7 23:55:42 mnm kernel: dbench[964] wakes after 492ms
Feb 7 23:55:42 mnm kernel: dbench[966] wakes after 648ms
Feb 7 23:55:42 mnm kernel: kupdated[7] wakes after 413ms
Feb 7 23:55:42 mnm kernel: dbench[988] wakes after 344ms
The actual algorithm I chose is described in the patch. It is
noteworthy that the wakeup strategy has changed from wake-all to
wake-one, which should save a few cycles. The number of context
switches during the entire dbench 64 run fell from 57,000 to
29,000.
Interestingly, dbench runtimes are still very inconsistent, and
many clients still exit quite early in the run. Beats me.
Rik, this patch will conflict with read-latency2 in the init code.
Easily fixable. I haven't tested with the larger request queue, but I
suggest that we want to keep the 1024 requests, a high-water mark of
queue_nr_requests/4 and a low-water mark of queue_nr_requests/8. I'll
do more testing tomorrow.
Also, you noted the other day that a LILO run *never* terminated when
there was a dbench running. This is in fact not due to request
starvation. It's due to livelock in invalidate_bdev(), which is called
via ioctl(BLKFLSBUF) by LILO. invalidate_bdev() will never terminate
as long as another task is generating locked buffers against the
device.
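For the curious, the livelock has roughly this shape (a paraphrase of the
2.4 buffer-invalidation loop, not the exact source; the buffer-walk
helpers are hypothetical):

	int slept;
retry:
	slept = 0;
	for (bh = first_buf_on(dev); bh != NULL; bh = next_buf_on(bh)) {
		if (buffer_locked(bh)) {
			wait_on_buffer(bh);	/* sleeps; lists may change */
			slept = 1;
		} else
			discard_buffer(bh);
	}
	if (slept)
		goto retry;	/* never exits while a writer keeps
				 * producing freshly locked buffers */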
=====================================
--- linux-2.4.18-pre9/include/linux/blkdev.h Mon Nov 26 11:52:07 2001
+++ linux-akpm/include/linux/blkdev.h Mon Feb 11 00:59:50 2002
@@ -119,9 +119,9 @@ struct request_queue
spinlock_t queue_lock;
/*
- * Tasks wait here for free request
+ * Tasks wait here for free read and write requests
*/
- wait_queue_head_t wait_for_request;
+ wait_queue_head_t wait_for_requests[2];
};
struct blk_dev_struct {
--- linux-2.4.18-pre9/drivers/block/ll_rw_blk.c Thu Feb 7 13:04:11 2002
+++ linux-akpm/drivers/block/ll_rw_blk.c Mon Feb 11 00:59:50 2002
@@ -118,10 +118,14 @@ int * max_readahead[MAX_BLKDEV];
int * max_sectors[MAX_BLKDEV];
/*
- * How many reqeusts do we allocate per queue,
- * and how many do we "batch" on freeing them?
+ * The total number of requests in each queue.
*/
-static int queue_nr_requests, batch_requests;
+static int queue_nr_requests;
+
+/*
+ * The threshold around which we make wakeup decisions
+ */
+static int batch_requests;
static inline int get_max_sectors(kdev_t dev)
{
@@ -352,7 +356,8 @@ static void blk_init_free_list(request_q
q->rq[i&1].count++;
}
- init_waitqueue_head(&q->wait_for_request);
+ init_waitqueue_head(&q->wait_for_requests[0]);
+ init_waitqueue_head(&q->wait_for_requests[1]);
spin_lock_init(&q->queue_lock);
}
@@ -418,9 +423,9 @@ void blk_init_queue(request_queue_t * q,
#define blkdev_free_rq(list) list_entry((list)->next, struct request, queue);
/*
* Get a free request. io_request_lock must be held and interrupts
- * disabled on the way in.
+ * disabled on the way in. Returns NULL if there are no free requests.
*/
-static inline struct request *get_request(request_queue_t *q, int rw)
+static struct request *get_request(request_queue_t *q, int rw)
{
struct request *rq = NULL;
struct request_list *rl = q->rq + rw;
@@ -438,38 +443,102 @@ static inline struct request *get_reques
}
/*
- * No available requests for this queue, unplug the device.
+ * Here's the request allocation design:
+ *
+ * 1: Blocking on request exhaustion is a key part of I/O throttling.
+ *
+ * 2: We want to be `fair' to all requesters. We must avoid starvation, and
+ * attempt to ensure that all requesters sleep for a similar duration. Hence
+ * no stealing requests when there are other processes waiting.
+ *
+ * 3: We also wish to support `batching' of requests. So when a process is
+ * woken, we want to allow it to allocate a decent number of requests
+ * before it blocks again, so they can be nicely merged (this only really
+ * matters if the process happens to be adding requests near the head of
+ * the queue).
+ *
+ * 4: We want to avoid scheduling storms. This isn't really important, because
+ * the system will be I/O bound anyway. But it's easy.
+ *
+ * There is tension between requirements 2 and 3. Once a task has woken,
+ * we don't want to allow it to sleep as soon as it takes its second request.
+ * But we don't want currently-running tasks to steal all the requests
+ * from the sleepers. We handle this with wakeup hysteresis around
+ * 0 .. batch_requests and with the assumption that request taking is much,
+ * much faster than request freeing.
+ *
+ * So here's what we do:
+ *
+ * a) A READA requester fails if free_requests < batch_requests
+ *
+ * We don't want READA requests to prevent sleepers from ever
+ * waking.
+ *
+ * When a process wants a new request:
+ *
+ * b) If free_requests == 0, the requester sleeps in FIFO manner.
+ *
+ * c) If 0 < free_requests < batch_requests and there are waiters,
+ * we still take a request non-blockingly. This provides batching.
+ *
+ * d) If free_requests >= batch_requests, the caller is immediately
+ * granted a new request.
+ *
+ * When a request is released:
+ *
+ * e) If free_requests < batch_requests, do nothing.
+ *
+ * f) If free_requests >= batch_requests, wake up a single waiter.
+ *
+ * The net effect is that when a process is woken at the batch_requests level,
+ * it will be able to take approximately (batch_requests) requests before
+ * blocking again (at the tail of the queue).
+ *
+ * This all assumes that the rate of taking requests is much, much higher
+ * than the rate of releasing them. Which is very true.
+ *
+ * -akpm, Feb 2002.
*/
+#undef REQTIMING
static struct request *__get_request_wait(request_queue_t *q, int rw)
{
register struct request *rq;
DECLARE_WAITQUEUE(wait, current);
+#ifdef REQTIMING
+ struct timeval tv1, tv2;
+ unsigned long long t1, t2;
+ unsigned long t;
+ do_gettimeofday(&tv1);
+#endif
+
generic_unplug_device(q);
- add_wait_queue(&q->wait_for_request, &wait);
+ add_wait_queue_exclusive(&q->wait_for_requests[rw], &wait);
do {
set_current_state(TASK_UNINTERRUPTIBLE);
- if (q->rq[rw].count < batch_requests)
+ if (q->rq[rw].count == 0)
schedule();
spin_lock_irq(&io_request_lock);
rq = get_request(q,rw);
spin_unlock_irq(&io_request_lock);
} while (rq == NULL);
- remove_wait_queue(&q->wait_for_request, &wait);
+ remove_wait_queue(&q->wait_for_requests[rw], &wait);
current->state = TASK_RUNNING;
- return rq;
-}
-
-static inline struct request *get_request_wait(request_queue_t *q, int rw)
-{
- register struct request *rq;
- spin_lock_irq(&io_request_lock);
- rq = get_request(q, rw);
- spin_unlock_irq(&io_request_lock);
- if (rq)
- return rq;
- return __get_request_wait(q, rw);
+#ifdef REQTIMING
+ do_gettimeofday(&tv2);
+ t1 = tv1.tv_sec;
+ t1 *= 1000000;
+ t1 += tv1.tv_usec;
+ t2 = tv2.tv_sec;
+ t2 *= 1000000;
+ t2 += tv2.tv_usec;
+ t = t2 - t1;
+ printk("%s[%d] wakes after %ldms for %s\n", current->comm,
+ current->pid, t / 1000,
+ (rw == WRITE) ? "WRITE" : "READ");
+#endif
+ return rq;
}
/* RO fail safe mechanism */
@@ -546,7 +615,7 @@ static inline void add_request(request_q
/*
* Must be called with io_request_lock held and interrupts disabled
*/
-inline void blkdev_release_request(struct request *req)
+void blkdev_release_request(struct request *req)
{
request_queue_t *q = req->q;
int rw = req->cmd;
@@ -560,8 +629,9 @@ inline void blkdev_release_request(struc
*/
if (q) {
list_add(&req->queue, &q->rq[rw].free);
- if (++q->rq[rw].count >= batch_requests && waitqueue_active(&q->wait_for_request))
- wake_up(&q->wait_for_request);
+ if (++q->rq[rw].count >= batch_requests &&
+ waitqueue_active(&q->wait_for_requests[rw]))
+ wake_up(&q->wait_for_requests[rw]);
}
}
@@ -742,22 +812,30 @@ again:
BUG();
}
- /*
- * Grab a free request from the freelist - if that is empty, check
- * if we are doing read ahead and abort instead of blocking for
- * a free slot.
- */
get_rq:
if (freereq) {
req = freereq;
freereq = NULL;
- } else if ((req = get_request(q, rw)) == NULL) {
- spin_unlock_irq(&io_request_lock);
- if (rw_ahead)
- goto end_io;
-
- freereq = __get_request_wait(q, rw);
- goto again;
+ } else {
+ /*
+ * See description above __get_request_wait()
+ */
+ if (rw_ahead) {
+ if (q->rq[rw].count < batch_requests) {
+ spin_unlock_irq(&io_request_lock);
+ goto end_io;
+ }
+ req = get_request(q, rw);
+ if (req == NULL)
+ BUG();
+ } else {
+ req = get_request(q, rw);
+ if (req == NULL) {
+ spin_unlock_irq(&io_request_lock);
+ freereq = __get_request_wait(q, rw);
+ goto again;
+ }
+ }
}
/* fill up the request-info, and add it to the queue */
-
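The allocation rules a)-d) from the design comment above can be restated
as a small decision function; this is an illustrative sketch only, not
code from the patch, and the names are made up:

	enum action { FAIL_READA, SLEEP_FIFO, TAKE };

	static enum action request_policy(int free, int batch, int rw_ahead)
	{
		if (rw_ahead && free < batch)
			return FAIL_READA;	/* rule a */
		if (free == 0)
			return SLEEP_FIFO;	/* rule b: FIFO exclusive wait */
		return TAKE;			/* rules c and d */
	}

	/* Release side (rules e and f): wake exactly one FIFO sleeper,
	 * and only once the pool has refilled to the batch level. */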
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [patch] get_request starvation fix
2002-02-12 23:13 Andrew Morton
@ 2002-02-13 1:28 ` William Lee Irwin III
2002-02-15 17:23 ` Marcelo Tosatti
1 sibling, 0 replies; 23+ messages in thread
From: William Lee Irwin III @ 2002-02-13 1:28 UTC (permalink / raw)
To: Andrew Morton; +Cc: Alan Cox, lkml
On Tue, Feb 12, 2002 at 03:13:26PM -0800, Andrew Morton wrote:
> Second version of this patch, incorporating Suparna's
> suggested simplification (the low-water mark was
> unnecessary).
>
> This patch is working well here. Hopefully it'll pop up
> in an rmap kernel soon.
>
> Bill Irwin has been doing some fairly extensive tuning
> and testing of this. Hopefully he'll come out with some
> numbers soon.
Much of my testing has involved varying elevator and bdflush parameters
to find what works best with this. From preliminary results it is clear
that the resolution of the fairness issues in this patch and the read
latency patch does not impair performance.
I'll do a bunch more runs tonight with the latest version of this.
My prior runs were with an earlier version.
Cheers,
Bill
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [patch] get_request starvation fix
2002-02-12 23:13 Andrew Morton
2002-02-13 1:28 ` William Lee Irwin III
@ 2002-02-15 17:23 ` Marcelo Tosatti
2002-02-16 7:32 ` Andrew Morton
1 sibling, 1 reply; 23+ messages in thread
From: Marcelo Tosatti @ 2002-02-15 17:23 UTC (permalink / raw)
To: Andrew Morton; +Cc: Alan Cox, lkml
On Tue, 12 Feb 2002, Andrew Morton wrote:
> Second version of this patch, incorporating Suparna's
> suggested simplification (the low-water mark was
> unnecessary).
>
> This patch is working well here. Hopefully it'll pop up
> in an rmap kernel soon.
>
> Bill Irwin has been doing some fairly extensive tuning
> and testing of this. Hopefully he'll come out with some
> numbers soon.
>
> I include the original description...
It seems the real gain (in latency) is caused by the FIFO behaviour.
That is, removing this hunk (against __get_request_wait())
- if (q->rq[rw].count < batch_requests)
+ if (q->rq[rw].count == 0)
schedule();
Would not make _much_ difference latency-wise. Am I right, or am I missing
something?
Anyway, I would like to have the patch cleaned up for 2.4.19-pre (remove
the instrumentation stuff _and_ make it clear in the documentation that
READA requests are not being used in practice).
Thanks a lot for that, Andrew.
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [patch] get_request starvation fix
2002-02-15 17:23 ` Marcelo Tosatti
@ 2002-02-16 7:32 ` Andrew Morton
2002-02-16 10:13 ` Daniel Phillips
0 siblings, 1 reply; 23+ messages in thread
From: Andrew Morton @ 2002-02-16 7:32 UTC (permalink / raw)
To: Marcelo Tosatti; +Cc: Alan Cox, lkml
Marcelo Tosatti wrote:
>
> ...
> It seems the real gain (in latency) is caused by the FIFO behaviour.
Well there is no net gain in latency. What we've gained is *fairness*,
so the worst-case latency is improved. And yes, it's the FIFO behaviour
which provides that. And it is the exclusive wait which reduces the context
switch rate by 50%.
> That is, removing this hunk (against __get_request_wait())
>
> - if (q->rq[rw].count < batch_requests)
> + if (q->rq[rw].count == 0)
> schedule();
>
> Would not make _much_ difference latency-wise. Am I right, or am I missing
> something?
Well the code which is presently in -rc1 is quite pointless. We
perform the above test a couple of microseconds after determining
that there are zero free requests. So the test you've illustrated
will *always* be true. The test for zero is the right thing to
do - it's just normal waitqueue handling. It's just a little
obfuscated by the implicit test-for-zero in get_request().
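For reference, here is the loop in its patched form, lifted from the diff
below; the count == 0 test is the usual re-check-then-schedule() idiom,
with get_request() returning NULL providing the re-test after wakeup:

	add_wait_queue_exclusive(&q->wait_for_requests[rw], &wait);
	do {
		set_current_state(TASK_UNINTERRUPTIBLE);
		if (q->rq[rw].count == 0)	/* nothing free: really sleep */
			schedule();
		spin_lock_irq(&io_request_lock);
		rq = get_request(q, rw);	/* NULL if still empty */
		spin_unlock_irq(&io_request_lock);
	} while (rq == NULL);
	remove_wait_queue(&q->wait_for_requests[rw], &wait);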
However, contrary to my earlier guess, the request batching does
make a measurable difference. Changing the code so that we wake up
a sleeper as soon as any request is freed costs maybe 30%
on `dbench 64'.
> Anyway, I would like to have the patch cleaned up for 2.4.19-pre (remove
> the instrumentation stuff _and_ make it clear on the documentation that
> READA requests are not being used in practice).
READA is used in a few filesystems for directory readahead. (And since
dir-in-pagecache, ext2 is no longer performing directory readahead. Bad.)
I came >this< close to killing READA altogether. I believe it's a
misfeature. If we're under intense seek pressure, the very, very _last_
thing we want to do is to create more seeks by not performing readahead.
But given that the misfeature only applies to creation of new requests,
and that block-contiguous readahead will still work OK, it's not too
serious. Plus it's a stable kernel :)
--- linux-2.4.18-rc1/include/linux/blkdev.h Mon Nov 26 11:52:07 2001
+++ linux-akpm/include/linux/blkdev.h Fri Feb 15 23:06:04 2002
@@ -119,9 +119,9 @@ struct request_queue
spinlock_t queue_lock;
/*
- * Tasks wait here for free request
+ * Tasks wait here for free read and write requests
*/
- wait_queue_head_t wait_for_request;
+ wait_queue_head_t wait_for_requests[2];
};
struct blk_dev_struct {
--- linux-2.4.18-rc1/drivers/block/ll_rw_blk.c Wed Feb 13 12:59:09 2002
+++ linux-akpm/drivers/block/ll_rw_blk.c Fri Feb 15 23:09:17 2002
@@ -118,10 +118,14 @@ int * max_readahead[MAX_BLKDEV];
int * max_sectors[MAX_BLKDEV];
/*
- * How many reqeusts do we allocate per queue,
- * and how many do we "batch" on freeing them?
+ * The total number of requests in each queue.
*/
-static int queue_nr_requests, batch_requests;
+static int queue_nr_requests;
+
+/*
+ * The threshold around which we make wakeup decisions
+ */
+static int batch_requests;
static inline int get_max_sectors(kdev_t dev)
{
@@ -352,7 +356,8 @@ static void blk_init_free_list(request_q
q->rq[i&1].count++;
}
- init_waitqueue_head(&q->wait_for_request);
+ init_waitqueue_head(&q->wait_for_requests[0]);
+ init_waitqueue_head(&q->wait_for_requests[1]);
spin_lock_init(&q->queue_lock);
}
@@ -418,9 +423,9 @@ void blk_init_queue(request_queue_t * q,
#define blkdev_free_rq(list) list_entry((list)->next, struct request, queue);
/*
* Get a free request. io_request_lock must be held and interrupts
- * disabled on the way in.
+ * disabled on the way in. Returns NULL if there are no free requests.
*/
-static inline struct request *get_request(request_queue_t *q, int rw)
+static struct request *get_request(request_queue_t *q, int rw)
{
struct request *rq = NULL;
struct request_list *rl = q->rq + rw;
@@ -438,40 +443,84 @@ static inline struct request *get_reques
}
/*
- * No available requests for this queue, unplug the device.
+ * Here's the request allocation design:
+ *
+ * 1: Blocking on request exhaustion is a key part of I/O throttling.
+ *
+ * 2: We want to be `fair' to all requesters. We must avoid starvation, and
+ * attempt to ensure that all requesters sleep for a similar duration. Hence
+ * no stealing requests when there are other processes waiting.
+ *
+ * 3: We also wish to support `batching' of requests. So when a process is
+ * woken, we want to allow it to allocate a decent number of requests
+ * before it blocks again, so they can be nicely merged (this only really
+ * matters if the process happens to be adding requests near the head of
+ * the queue).
+ *
+ * 4: We want to avoid scheduling storms. This isn't really important, because
+ * the system will be I/O bound anyway. But it's easy.
+ *
+ * There is tension between requirements 2 and 3. Once a task has woken,
+ * we don't want to allow it to sleep as soon as it takes its second request.
+ * But we don't want currently-running tasks to steal all the requests
+ * from the sleepers. We handle this with wakeup hysteresis around
+ * 0 .. batch_requests and with the assumption that request taking is much,
+ * much faster than request freeing.
+ *
+ * So here's what we do:
+ *
+ * a) A READA requester fails if free_requests < batch_requests
+ *
+ * We don't want READA requests to prevent sleepers from ever
+ * waking. Note that READA is used extremely rarely - a few
+ * filesystems use it for directory readahead.
+ *
+ * When a process wants a new request:
+ *
+ * b) If free_requests == 0, the requester sleeps in FIFO manner.
+ *
+ * c) If 0 < free_requests < batch_requests and there are waiters,
+ * we still take a request non-blockingly. This provides batching.
+ *
+ * d) If free_requests >= batch_requests, the caller is immediately
+ * granted a new request.
+ *
+ * When a request is released:
+ *
+ * e) If free_requests < batch_requests, do nothing.
+ *
+ * f) If free_requests >= batch_requests, wake up a single waiter.
+ *
+ * The net effect is that when a process is woken at the batch_requests level,
+ * it will be able to take approximately (batch_requests) requests before
+ * blocking again (at the tail of the queue).
+ *
+ * This all assumes that the rate of taking requests is much, much higher
+ * than the rate of releasing them. Which is very true.
+ *
+ * -akpm, Feb 2002.
*/
+
static struct request *__get_request_wait(request_queue_t *q, int rw)
{
register struct request *rq;
DECLARE_WAITQUEUE(wait, current);
generic_unplug_device(q);
- add_wait_queue(&q->wait_for_request, &wait);
+ add_wait_queue_exclusive(&q->wait_for_requests[rw], &wait);
do {
set_current_state(TASK_UNINTERRUPTIBLE);
- if (q->rq[rw].count < batch_requests)
+ if (q->rq[rw].count == 0)
schedule();
spin_lock_irq(&io_request_lock);
rq = get_request(q,rw);
spin_unlock_irq(&io_request_lock);
} while (rq == NULL);
- remove_wait_queue(&q->wait_for_request, &wait);
+ remove_wait_queue(&q->wait_for_requests[rw], &wait);
current->state = TASK_RUNNING;
return rq;
}
-static inline struct request *get_request_wait(request_queue_t *q, int rw)
-{
- register struct request *rq;
-
- spin_lock_irq(&io_request_lock);
- rq = get_request(q, rw);
- spin_unlock_irq(&io_request_lock);
- if (rq)
- return rq;
- return __get_request_wait(q, rw);
-}
-
/* RO fail safe mechanism */
static long ro_bits[MAX_BLKDEV][8];
@@ -546,7 +595,7 @@ static inline void add_request(request_q
/*
* Must be called with io_request_lock held and interrupts disabled
*/
-inline void blkdev_release_request(struct request *req)
+void blkdev_release_request(struct request *req)
{
request_queue_t *q = req->q;
int rw = req->cmd;
@@ -560,8 +609,9 @@ inline void blkdev_release_request(struc
*/
if (q) {
list_add(&req->queue, &q->rq[rw].free);
- if (++q->rq[rw].count >= batch_requests && waitqueue_active(&q->wait_for_request))
- wake_up(&q->wait_for_request);
+ if (++q->rq[rw].count >= batch_requests &&
+ waitqueue_active(&q->wait_for_requests[rw]))
+ wake_up(&q->wait_for_requests[rw]);
}
}
@@ -742,22 +792,30 @@ again:
BUG();
}
- /*
- * Grab a free request from the freelist - if that is empty, check
- * if we are doing read ahead and abort instead of blocking for
- * a free slot.
- */
get_rq:
if (freereq) {
req = freereq;
freereq = NULL;
- } else if ((req = get_request(q, rw)) == NULL) {
- spin_unlock_irq(&io_request_lock);
- if (rw_ahead)
- goto end_io;
-
- freereq = __get_request_wait(q, rw);
- goto again;
+ } else {
+ /*
+ * See description above __get_request_wait()
+ */
+ if (rw_ahead) {
+ if (q->rq[rw].count < batch_requests) {
+ spin_unlock_irq(&io_request_lock);
+ goto end_io;
+ }
+ req = get_request(q, rw);
+ if (req == NULL)
+ BUG();
+ } else {
+ req = get_request(q, rw);
+ if (req == NULL) {
+ spin_unlock_irq(&io_request_lock);
+ freereq = __get_request_wait(q, rw);
+ goto again;
+ }
+ }
}
/* fill up the request-info, and add it to the queue */
-
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [patch] get_request starvation fix
2002-02-16 7:32 ` Andrew Morton
@ 2002-02-16 10:13 ` Daniel Phillips
2002-02-16 10:25 ` Andrew Morton
0 siblings, 1 reply; 23+ messages in thread
From: Daniel Phillips @ 2002-02-16 10:13 UTC (permalink / raw)
To: Andrew Morton, Marcelo Tosatti; +Cc: Alan Cox, lkml
On February 16, 2002 08:32 am, Andrew Morton wrote:
> However, contrary to my earlier guess, the request batching does
> make a measurable difference. Changing the code so that we wake up
> a sleeper as soon as any request is freed costs maybe 30%
> on `dbench 64'.
Is this consistent with results on other IO benchmarks?
--
Daniel
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [patch] get_request starvation fix
2002-02-16 10:13 ` Daniel Phillips
@ 2002-02-16 10:25 ` Andrew Morton
0 siblings, 0 replies; 23+ messages in thread
From: Andrew Morton @ 2002-02-16 10:25 UTC (permalink / raw)
To: Daniel Phillips; +Cc: lkml
Daniel Phillips wrote:
>
> On February 16, 2002 08:32 am, Andrew Morton wrote:
> > However, contrary to my earlier guess, the request batching does
> > make a measurable difference. Changing the code so that we wake up
> > a sleeper as soon as any request is freed costs maybe 30%
> > on `dbench 64'.
>
> Is this consistent with results on other IO benchmarks?
>
I dunno. I doubt it - few of the other benchmarks are very
seek-intensive. dbench is somewhat repeatable if you load
it up with enough clients. And average the results across
enough runs. And stand on one leg and squint.
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [patch] get_request starvation fix
@ 2002-02-09 1:56 rwhron
0 siblings, 0 replies; 23+ messages in thread
From: rwhron @ 2002-02-09 1:56 UTC (permalink / raw)
To: akpm; +Cc: linux-kernel
On Fri, Feb 08, 2002 at 10:29:37AM -0800, Andrew Morton wrote:
> > > Here's a patch which addresses the get_request starvation problem.
> >
> > Looks like a tremendously important patch. Thanks!
>
> hmm. Dunno. I don't think it's very common to hit get_request
> starvation. Only for dbenchy things, I suspect. Still, it's
> good to have a design which addresses nasty corner cases.
It really works!
tiobench sequential reads with 32 threads went from max latency of
193 seconds down to 7 seconds with your make_request, read-latency2
and low-latency patches. (2.4.18-pre9-am1) The % of high latency
requests also went down.
K6-2 475 MHz with 384 MB RAM and ReiserFS on IDE disks.
Average of 3 runs.
Total File size = 1024 MB. (individual file size = 1024 / num-threads)
Read, write, and seek rates in MB/sec.
Latency in milliseconds.
Percent of requests that took longer than 2 and 10 seconds.
Sequential Num Avg Maximum Lat% Lat% CPU
Reads Thr Rate (CPU%) Latency Latency >2s >10s Eff
--- ---------------------------------------------------
2.4.18-pre9-am1 8 8.94 13.33% 10.402 1785.52 0.000 0.000 67
2.4.18-pre9 8 8.85 13.22% 10.537 1954.27 0.000 0.000 67
2.4.18-pre9-am1 16 8.98 13.70% 30.827 4023.09 0.000 0.000 66
2.4.18-pre9 16 8.91 13.60% 31.122 4072.67 0.000 0.000 65
2.4.18-pre9-am1 32 9.00 14.16% 70.834 7032.45 0.000 0.000 64
2.4.18-pre9 32 8.87 13.89% 66.135 193590.10 0.030 0.023 64
Random Num Avg Maximum Lat% Lat% CPU
Reads Thr Rate (CPU%) Latency Latency >2s >10s Eff
--- ---------------------------------------------------
2.4.18-pre9-am1 8 0.69 1.828% 132.242 473.73 0.000 0.000 38
2.4.18-pre9 8 0.71 2.156% 128.869 510.38 0.000 0.000 33
2.4.18-pre9-am1 16 0.72 2.154% 371.288 1410.13 0.000 0.000 33
2.4.18-pre9 16 0.74 2.025% 364.871 1397.82 0.000 0.000 36
2.4.18-pre9-am1 32 0.74 2.229% 816.605 2996.07 0.000 0.000 33
2.4.18-pre9 32 0.75 2.163% 734.035 2841.97 0.000 0.000 35
Sequential Num Avg Maximum Lat% Lat% CPU
Writes Thr Rate (CPU%) Latency Latency >2s >10s Eff
--- ---------------------------------------------------
2.4.18-pre9-am1 8 11.45 52.33% 7.651 15095.52 0.058 0.000 22
2.4.18-pre9 8 9.23 40.84% 9.021 12351.12 0.040 0.000 23
2.4.18-pre9-am1 16 11.48 54.57% 22.253 41922.53 0.214 0.000 21
2.4.18-pre9 16 9.46 41.71% 26.374 28280.18 0.212 0.000 23
2.4.18-pre9-am1 32 11.37 55.91% 52.917 75331.67 0.679 0.004 20
2.4.18-pre9 32 9.50 42.55% 61.944 60970.74 0.770 0.005 22
Random Num Avg Maximum Lat% Lat% CPU
Writes Thr Rate (CPU%) Latency Latency >2s >10s Eff
--- ---------------------------------------------------
2.4.18-pre9-am1 8 0.62 2.393% 1.786 330.56 0.000 0.000 26
2.4.18-pre9 8 0.50 1.216% 1.700 333.32 0.000 0.000 41
2.4.18-pre9-am1 16 0.65 2.527% 4.533 996.54 0.000 0.000 26
2.4.18-pre9 16 0.52 1.267% 4.228 1236.08 0.000 0.000 41
2.4.18-pre9-am1 32 0.65 2.530% 7.980 1907.91 0.000 0.000 26
2.4.18-pre9 32 0.53 1.343% 7.818 2509.73 0.000 0.000 39
> > Do you have a patch for dbench too? :)
>
> /bin/rm?
Good one. It would be helpful though, if you've already done the work.
--
Randy Hron
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [patch] get_request starvation fix
@ 2002-02-08 19:31 Dieter Nützel
0 siblings, 0 replies; 23+ messages in thread
From: Dieter Nützel @ 2002-02-08 19:31 UTC (permalink / raw)
To: Andrew Morton; +Cc: Jens Axboe, Ingo Molnar, Robert Love, Linux Kernel List
On Fri, Feb 08 2002, Andrew Morton wrote:
> Here's a patch which addresses the get_request starvation problem.
[snip]
> Also, you noted the other day that a LILO run *never* terminated when
> there was a dbench running. This is in fact not due to request
> starvation. It's due to livelock in invalidate_bdev(), which is called
> via ioctl(BLKFLSBUF) by LILO. invalidate_bdev() will never terminate
> as long as another task is generating locked buffers against the
> device.
[snip]
Could this below be related?
I get system locks through lilo (bzlilo) from time to time with all the
latest kernels + O(1) + -aa + preempt
Thanks,
Dieter
> Re: [patch] O(1) scheduler, -J9
> Date: Wed, 30 Jan 2002 17:36:21 +0100
> From: Dieter Nützel <Dieter.Nuetzel@hamburg.de>
> To: Ingo Molnar <mingo@elte.hu>
> Cc: Robert Love <rml@tech9.net>, Oleg Drokin <green@namesys.com>
>
> On Wednesday, 30. January 2002 17:26, Dieter Nützel wrote:
> > On Wednesday, 30. January 2002 15:40, Ingo Molnar wrote:
> > > On Wed, 30 Jan 2002, Dieter Nützel wrote:
> > > > As always running "nice +19 make -j40" in my HUGE C++ VTK tree in the
> > > > background ;-)
> > > >
> > The mouse feels a little bit sluggish after KMail starts, so I reniced X
> > to -10. [...]
> > >
> > > does this make X smooth? nice -10 is a good choice, it's not too low and
> > > not too high, i'm using that value myself.
> >
> > Yes, mostly.
> > In "normal operation mode" X at 0 is good.
> >
> > I only get some very slight but noticeable unsmoothness with several KDE
> > apps running. It could be KDE itself.
> >
> > The latency degradation since -J4 remains. But that could be due to some
> > missing lock-breaks.
> >
> > Robert's latest BKL stuff halved the numbers for the two
> > latencytest0.42-png graphic benches on 2.4.18-pre7 + patches.
> Addition:
> I am most worried about the occasional kernel hangs during make
> bzlilo/bzImage. It appears to happen during lilo (???). The new kernel
> (vmlinux) is still built in /usr/src/linux, but together with the "latest"
> *.o and .*.flags files it is broken due to ReiserFS transaction replay
> after reboot (normal behavior with a journaling FS without full data
> journaling).
>
> Maybe a RACE between lilo and sync?
^ permalink raw reply [flat|nested] 23+ messages in thread
* [patch] get_request starvation fix
@ 2002-02-08 8:46 Andrew Morton
2002-02-08 8:57 ` Jens Axboe
` (3 more replies)
0 siblings, 4 replies; 23+ messages in thread
From: Andrew Morton @ 2002-02-08 8:46 UTC (permalink / raw)
To: Rik van Riel, William Lee Irwin III, lkml
Here's a patch which addresses the get_request starvation problem.
The problem we're experiencing is that when the request queue is
exhausted, processes go to sleep until the request queue has refilled
to the batch_nr_requests level. But already-running tasks can simply
whizz in and steal free requests, thus preventing the sleeping tasks
from getting woken.
Some instrumentation illustrates this clearly. All testing was on
ext2. In a `dbench 64' run on stock 2.4.18-pre9, the maximum sleep
duration in __get_request_wait() was 37 seconds, and it's being very
unfair. Look at this section of the trace:
Feb 7 21:59:06 mnm kernel: dbench[883] wakes after 442ms
Feb 7 21:59:06 mnm kernel: dbench[877] wakes after 442ms
Feb 7 21:59:06 mnm kernel: dbench[868] wakes after 441ms
Feb 7 21:59:06 mnm kernel: kupdated[7] wakes after 11587ms
Feb 7 21:59:06 mnm kernel: dbench[914] wakes after 574ms
Feb 7 21:59:06 mnm kernel: dbench[873] wakes after 2618ms
Feb 7 21:59:06 mnm kernel: dbench[927] wakes after 755ms
Feb 7 21:59:06 mnm kernel: dbench[873] wakes after 263ms
Feb 7 21:59:06 mnm kernel: dbench[924] wakes after 838ms
Feb 7 21:59:06 mnm kernel: dbench[924] wakes after 134ms
Feb 7 21:59:06 mnm kernel: dbench[871] wakes after 18688ms
Feb 7 21:59:06 mnm kernel: dbench[908] wakes after 3245ms
Feb 7 21:59:06 mnm kernel: dbench[895] wakes after 974ms
Feb 7 21:59:06 mnm kernel: dbench[878] wakes after 27401ms
Feb 7 21:59:06 mnm kernel: dbench[895] wakes after 278ms
Feb 7 21:59:06 mnm kernel: dbench[904] wakes after 278ms
With the patch applied, the maximum sleep duration was eight seconds,
but when this was happening, *all* tasks were sleeping for similar
durations:
Feb 7 23:55:40 mnm kernel: dbench[948] wakes after 3978ms
Feb 7 23:55:40 mnm kernel: dbench[935] wakes after 3439ms
Feb 7 23:55:41 mnm kernel: dbench[955] wakes after 2998ms
Feb 7 23:55:41 mnm kernel: dbench[965] wakes after 2847ms
Feb 7 23:55:41 mnm kernel: dbench[994] wakes after 3032ms
Feb 7 23:55:41 mnm kernel: dbench[969] wakes after 2652ms
Feb 7 23:55:41 mnm kernel: dbench[956] wakes after 2197ms
Feb 7 23:55:41 mnm kernel: dbench[993] wakes after 1589ms
Feb 7 23:55:41 mnm kernel: dbench[990] wakes after 1590ms
Feb 7 23:55:41 mnm kernel: dbench[939] wakes after 1585ms
Feb 7 23:55:41 mnm kernel: kupdated[7] wakes after 1555ms
Feb 7 23:55:41 mnm kernel: dbench[976] wakes after 1787ms
Feb 7 23:55:41 mnm kernel: dbench[997] wakes after 1653ms
Feb 7 23:55:41 mnm kernel: dbench[961] wakes after 1891ms
Feb 7 23:55:42 mnm kernel: dbench[966] wakes after 1967ms
Feb 7 23:55:42 mnm kernel: dbench[936] wakes after 1339ms
Feb 7 23:55:42 mnm kernel: dbench[965] wakes after 1216ms
Feb 7 23:55:42 mnm kernel: kjournald[9] wakes after 1253ms
Feb 7 23:55:42 mnm kernel: dbench[990] wakes after 879ms
Feb 7 23:55:42 mnm kernel: kupdated[7] wakes after 849ms
Feb 7 23:55:42 mnm kernel: dbench[959] wakes after 343ms
Feb 7 23:55:42 mnm kernel: dbench[936] wakes after 343ms
Feb 7 23:55:42 mnm kernel: dbench[964] wakes after 492ms
Feb 7 23:55:42 mnm kernel: dbench[966] wakes after 648ms
Feb 7 23:55:42 mnm kernel: kupdated[7] wakes after 413ms
Feb 7 23:55:42 mnm kernel: dbench[988] wakes after 344ms
The actual algorithm I chose is described in the patch. It is
noteworthy that the wakeup strategy has changed from wake-all to
wake-one, which should save a few cycles. The number of context
switches during the entire dbench 64 run fell from 57,000 to
29,000.
Interestingly, dbench runtimes are still very inconsistent, and
many clients still exit quite early in the run. Beats me.
Rik, this patch will conflict with read-latency2 in the init code.
Easily fixable. I haven't tested with the larger request queue, but I
suggest that we want to keep the 1024 requests, a high-water mark of
queue_nr_requests/4 and a low-water mark of queue_nr_requests/8. I'll
do more testing tomorrow.
Also, you noted the other day that a LILO run *never* terminated when
there was a dbench running. This is in fact not due to request
starvation. It's due to livelock in invalidate_bdev(), which is called
via ioctl(BLKFLSBUF) by LILO. invalidate_bdev() will never terminate
as long as another task is generating locked buffers against the
device.
--- linux-2.4.18-pre9/drivers/block/ll_rw_blk.c Thu Feb 7 13:04:11 2002
+++ linux-akpm/drivers/block/ll_rw_blk.c Fri Feb 8 00:33:01 2002
@@ -118,10 +118,15 @@ int * max_readahead[MAX_BLKDEV];
int * max_sectors[MAX_BLKDEV];
/*
- * How many reqeusts do we allocate per queue,
- * and how many do we "batch" on freeing them?
+ * The total number of requests in each queue.
*/
-static int queue_nr_requests, batch_requests;
+static int queue_nr_requests;
+
+/*
+ * low- and high-water marks for the queues.
+ */
+static int batch_requests_low;
+static int batch_requests_high;
static inline int get_max_sectors(kdev_t dev)
{
@@ -418,9 +423,9 @@ void blk_init_queue(request_queue_t * q,
#define blkdev_free_rq(list) list_entry((list)->next, struct request, queue);
/*
* Get a free request. io_request_lock must be held and interrupts
- * disabled on the way in.
+ * disabled on the way in. Returns NULL if there are no free requests.
*/
-static inline struct request *get_request(request_queue_t *q, int rw)
+static struct request *get_request(request_queue_t *q, int rw)
{
struct request *rq = NULL;
struct request_list *rl = q->rq + rw;
@@ -438,18 +443,81 @@ static inline struct request *get_reques
}
/*
- * No available requests for this queue, unplug the device.
+ * Here's the request allocation design:
+ *
+ * 1: Blocking on request exhaustion is a key part of I/O throttling.
+ *
+ * 2: We want to be `fair' to all requesters. We must avoid starvation, and
+ * attempt to ensure that all requesters sleep for a similar duration. Hence
+ * no stealing requests when there are other processes waiting.
+ *
+ * 3: We also wish to support `batching' of requests. So when a process is
+ * woken, we want to allow it to allocate a decent number of requests
+ * before it blocks again, so they can be nicely merged (this only really
+ * matters if the process happens to be adding requests near the head of
+ * the queue).
+ *
+ * 4: We want to avoid scheduling storms. This isn't really important, because
+ * the system will be I/O bound anyway. But it's easy.
+ *
+ * There is tension between requirements 2 and 3. Once a task has woken,
+ * we don't want to allow it to sleep as soon as it takes its second request.
+ * But we don't want currently-running tasks to steal all the requests
+ * from the sleepers. We handle this with high- and low- water marks and
+ * with the assumption that request taking is much, much faster than
+ * request freeing.
+ *
+ * So here's what we do:
+ *
+ * a) A READA requester fails if free_requests < batch_requests_high
+ *
+ * We don't want READA requests to prevent sleepers from ever
+ * waking.
+ *
+ * When a process wants a new request:
+ *
+ * b) If someone else is waiting on requests and free_requests < low_water,
+ * the requester sleeps in FIFO manner.
+ *
+ * c) If low_water < free_requests < high_water, the caller is immediately
+ * granted a new request.
+ *
+ * d) If nobody is waiting on requests, the caller gets given a request,
+ * if there are any available. Otherwise the caller sleeps.
+ *
+ * When a request is released:
+ *
+ * e) If free_requests < low_water, do nothing.
+ *
+ * f) If free_requests > high_water, wake up a single waiter.
+ *
+ * The net effect is that when a process is woken at the high-water mark,
+ * it will be able to take approximately (high-water - low-water) requests
+ * before blocking again (at the tail of the queue).
+ *
+ * This all assumes that the rate of taking requests is much, much higher
+ * than the rate of releasing them. Which is very true.
+ *
+ * -akpm, Feb 2002.
*/
+#undef REQTIMING
static struct request *__get_request_wait(request_queue_t *q, int rw)
{
register struct request *rq;
DECLARE_WAITQUEUE(wait, current);
+#ifdef REQTIMING
+ struct timeval tv1, tv2;
+ unsigned long long t1, t2;
+ unsigned long t;
+ do_gettimeofday(&tv1);
+#endif
+
generic_unplug_device(q);
- add_wait_queue(&q->wait_for_request, &wait);
+ add_wait_queue_exclusive(&q->wait_for_request, &wait);
do {
set_current_state(TASK_UNINTERRUPTIBLE);
- if (q->rq[rw].count < batch_requests)
+ if (q->rq[rw].count < batch_requests_low)
schedule();
spin_lock_irq(&io_request_lock);
rq = get_request(q,rw);
@@ -457,19 +525,20 @@ static struct request *__get_request_wai
} while (rq == NULL);
remove_wait_queue(&q->wait_for_request, &wait);
current->state = TASK_RUNNING;
- return rq;
-}
-static inline struct request *get_request_wait(request_queue_t *q, int rw)
-{
- register struct request *rq;
-
- spin_lock_irq(&io_request_lock);
- rq = get_request(q, rw);
- spin_unlock_irq(&io_request_lock);
- if (rq)
- return rq;
- return __get_request_wait(q, rw);
+#ifdef REQTIMING
+ do_gettimeofday(&tv2);
+ t1 = tv1.tv_sec;
+ t1 *= 1000000;
+ t1 += tv1.tv_usec;
+ t2 = tv2.tv_sec;
+ t2 *= 1000000;
+ t2 += tv2.tv_usec;
+ t = t2 - t1;
+ printk("%s[%d] wakes after %ldms\n", current->comm,
+ current->pid, t / 1000);
+#endif
+ return rq;
}
/* RO fail safe mechanism */
@@ -546,7 +615,7 @@ static inline void add_request(request_q
/*
* Must be called with io_request_lock held and interrupts disabled
*/
-inline void blkdev_release_request(struct request *req)
+void blkdev_release_request(struct request *req)
{
request_queue_t *q = req->q;
int rw = req->cmd;
@@ -560,7 +629,8 @@ inline void blkdev_release_request(struc
*/
if (q) {
list_add(&req->queue, &q->rq[rw].free);
- if (++q->rq[rw].count >= batch_requests && waitqueue_active(&q->wait_for_request))
+ if (++q->rq[rw].count >= batch_requests_high &&
+ waitqueue_active(&q->wait_for_request))
wake_up(&q->wait_for_request);
}
}
@@ -742,22 +812,41 @@ again:
BUG();
}
- /*
- * Grab a free request from the freelist - if that is empty, check
- * if we are doing read ahead and abort instead of blocking for
- * a free slot.
- */
get_rq:
if (freereq) {
req = freereq;
freereq = NULL;
- } else if ((req = get_request(q, rw)) == NULL) {
- spin_unlock_irq(&io_request_lock);
- if (rw_ahead)
- goto end_io;
-
- freereq = __get_request_wait(q, rw);
- goto again;
+ } else {
+ /*
+ * See description above __get_request_wait()
+ */
+ if (rw_ahead) {
+ if (q->rq[rw].count < batch_requests_high) {
+ spin_unlock_irq(&io_request_lock);
+ goto end_io;
+ }
+ req = get_request(q, rw);
+ if (req == NULL)
+ BUG();
+ } else {
+ if (waitqueue_active(&q->wait_for_request)) {
+ if (q->rq[rw].count < batch_requests_low) {
+ spin_unlock_irq(&io_request_lock);
+ freereq = __get_request_wait(q, rw);
+ goto again;
+ }
+ req = get_request(q, rw);
+ if (req == NULL)
+ BUG();
+ } else {
+ req = get_request(q, rw);
+ if (req == NULL) {
+ spin_unlock_irq(&io_request_lock);
+ freereq = __get_request_wait(q, rw);
+ goto again;
+ }
+ }
+ }
}
/* fill up the request-info, and add it to the queue */
@@ -1124,8 +1213,10 @@ int __init blk_dev_init(void)
/*
* Batch frees according to queue length
*/
- batch_requests = queue_nr_requests/4;
- printk("block: %d slots per queue, batch=%d\n", queue_nr_requests, batch_requests);
+ batch_requests_high = queue_nr_requests / 4;
+ batch_requests_low = batch_requests_high / 2;
+ printk("block: %d slots per queue, batch_low=%d, batch_high=%d\n",
+ queue_nr_requests, batch_requests_low, batch_requests_high);
#ifdef CONFIG_AMIGA_Z2RAM
z2_init();
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [patch] get_request starvation fix
2002-02-08 8:46 Andrew Morton
@ 2002-02-08 8:57 ` Jens Axboe
2002-02-08 9:57 ` Andrew Morton
2002-02-08 9:10 ` Andrew Morton
` (2 subsequent siblings)
3 siblings, 1 reply; 23+ messages in thread
From: Jens Axboe @ 2002-02-08 8:57 UTC (permalink / raw)
To: Andrew Morton; +Cc: Rik van Riel, William Lee Irwin III, lkml
On Fri, Feb 08 2002, Andrew Morton wrote:
> Here's a patch which addresses the get_request starvation problem.
[snip]
Argh, if only I'd known you were working on this too :/. Oh well, from a
first read the patch looks good.
--
Jens Axboe
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [patch] get_request starvation fix
2002-02-08 8:57 ` Jens Axboe
@ 2002-02-08 9:57 ` Andrew Morton
0 siblings, 0 replies; 23+ messages in thread
From: Andrew Morton @ 2002-02-08 9:57 UTC (permalink / raw)
To: Jens Axboe; +Cc: Rik van Riel, William Lee Irwin III, lkml
Jens Axboe wrote:
>
> On Fri, Feb 08 2002, Andrew Morton wrote:
> > Here's a patch which addresses the get_request starvation problem.
>
> [snip]
>
> Agrh, if only I knew you were working on this too :/. Oh well, from a
> first-read the patch looks good.
Seems that with FIFO fairness, /bin/sync now also livelocks. And
it's pretty easy to see why. There's nothing to make
write_unlocked_buffers() terminate.
We'll worry about that tomorrow. I may just make it give up
after writing (2 * nr_buffers_type[BUF_DIRTY]) buffers.
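The offending loop, as it stands before the fix (reconstructed from the
context lines of the patch posted later in this thread):
write_some_buffers() keeps reporting more work for as long as anyone keeps
dirtying buffers, so nothing bounds the iteration:

	static void write_unlocked_buffers(kdev_t dev)
	{
		do {
			spin_lock(&lru_list_lock);
		} while (write_some_buffers(dev));	/* -EAGAIN while
							 * BUF_DIRTY refills */
		run_task_queue(&tq_disk);
	}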
The patch works well with read-latency2 (and it didn't throw
rejects). Smooth and fast. It's going to take some time and
testing to settle everything in.
-
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [patch] get_request starvation fix
2002-02-08 8:46 Andrew Morton
2002-02-08 8:57 ` Jens Axboe
@ 2002-02-08 9:10 ` Andrew Morton
2002-02-08 11:37 ` Rik van Riel
2002-02-11 9:41 ` Andrew Morton
3 siblings, 0 replies; 23+ messages in thread
From: Andrew Morton @ 2002-02-08 9:10 UTC (permalink / raw)
To: Rik van Riel, William Lee Irwin III, lkml
Andrew Morton wrote:
>
> It is noteworthy that the wakeup strategy has changed from wake-all to
> wake-one, which should save a few cycles.
It is also noteworthy that read latency completely sucks. Not sure if
I've made it worse.
Anyway, with exclusive wakeup we need separate waitqueues for read and
write requests. Updated patch at
http://www.zip.com.au/~akpm/linux/2.4/2.4.18-pre9/make_request.patch
-
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [patch] get_request starvation fix
2002-02-08 8:46 Andrew Morton
2002-02-08 8:57 ` Jens Axboe
2002-02-08 9:10 ` Andrew Morton
@ 2002-02-08 11:37 ` Rik van Riel
2002-02-08 18:28 ` Andrew Morton
2002-02-11 9:41 ` Andrew Morton
3 siblings, 1 reply; 23+ messages in thread
From: Rik van Riel @ 2002-02-08 11:37 UTC (permalink / raw)
To: Andrew Morton; +Cc: William Lee Irwin III, lkml
On Fri, 8 Feb 2002, Andrew Morton wrote:
> + * This all assumes that the rate of taking requests is much, much higher
> + * than the rate of releasing them. Which is very true.
This is not necessarily true for read requests.
If each read request is synchronous and the process will
generate the next read request after the current one
has finished, then it's quite possible to clog up the
queue with read requests which are generated at exactly
the same rate as they're processed.
Couldn't this still cause starvation, even with your patch?
regards,
Rik
--
"Linux holds advantages over the single-vendor commercial OS"
-- Microsoft's "Competing with Linux" document
http://www.surriel.com/ http://distro.conectiva.com/
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [patch] get_request starvation fix
2002-02-08 11:37 ` Rik van Riel
@ 2002-02-08 18:28 ` Andrew Morton
0 siblings, 0 replies; 23+ messages in thread
From: Andrew Morton @ 2002-02-08 18:28 UTC (permalink / raw)
To: Rik van Riel; +Cc: William Lee Irwin III, lkml
Rik van Riel wrote:
>
> On Fri, 8 Feb 2002, Andrew Morton wrote:
>
> > + * This all assumes that the rate of taking requests is much, much higher
> > + * than the rate of releasing them. Which is very true.
>
> This is not necessarily true for read requests.
>
> If each read request is synchronous and the process will
> generate the next read request after the current one
> has finished, then it's quite possible to clog up the
> queue with read requests which are generated at exactly
> the same rate as they're processed.
>
> Couldn't this still cause starvation, even with your patch?
No, that's fine.
The problem which the comment refers to is: how to provide
per-process request batching without running off and creating
per-process reservation pools or such.
What I'm relying on is that when a sleeper is woken (at low-water),
there are at least (high-water - low-water) requests available before
get_request will again sleep. And that the woken process will be
able to grab a decent number of those non-blocking requests. I
suspect it's always true, as long as (high-water - low-water) is
"much greater than" the number of CPUs.
The synchronous reader is well-behaved, and should be nicely
FIFO if we're getting low on requests.
-
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [patch] get_request starvation fix
2002-02-08 8:46 Andrew Morton
` (2 preceding siblings ...)
2002-02-08 11:37 ` Rik van Riel
@ 2002-02-11 9:41 ` Andrew Morton
2002-02-11 17:35 ` Suparna Bhattacharya
2002-02-13 0:33 ` Jesse Barnes
3 siblings, 2 replies; 23+ messages in thread
From: Andrew Morton @ 2002-02-11 9:41 UTC (permalink / raw)
To: Rik van Riel, William Lee Irwin III, lkml, Suparna Bhattacharya
Andrew Morton wrote:
>
> Here's a patch which addresses the get_request starvation problem.
>
Sharp-eyed Suparna noticed that the algorithm still works if the
low-water mark is set to zero (i.e., rip it out), so I did that and
the code is a little simpler. Updated patch at
http://www.zip.com.au/~akpm/linux/2.4/2.4.18-pre9/make_request.patch
Also, here's a patch which fixes the /bin/sync livelock in
write_unlocked_buffers(). It simply bails out after writing
all the buffers which were dirty at the time the function
was called, rather than keeping on trying to write buffers
until the list is empty.
Given that /bin/sync calls write_unlocked_buffers() three times,
that's good enough. sync still takes aaaaaages, but it terminates.
The similar livelock in invalidate_bdev() is a bit trickier. I'm
not really sure what the semantics of invalidate_bdev() against a
device which has an r/w filesystem mounted on it are supposed to be.
I suspect we should just not call the function if the device is in
use, or call fsync_dev() or something.
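For reference, the pattern being described is roughly this (an
illustrative sketch, separate from the patch below;
invalidate_device_buffers() is a hypothetical stand-in and the list
handling is simplified, since the real 2.4 LRU lists are circular):

/*
 * Hypothetical stand-in illustrating the invalidate_bdev()-style
 * livelock.  Whenever we meet a locked buffer we drop the lock,
 * sleep, and restart the scan; if another task keeps generating
 * locked buffers against the device, the rescan never finds a
 * quiescent list and the loop never terminates.
 */
static void invalidate_device_buffers(kdev_t dev)
{
	struct buffer_head *bh;
repeat:
	spin_lock(&lru_list_lock);
	for (bh = lru_list[BUF_DIRTY]; bh != NULL; bh = bh->b_next_free) {
		if (bh->b_dev != dev)
			continue;
		if (buffer_locked(bh)) {
			get_bh(bh);
			spin_unlock(&lru_list_lock);
			wait_on_buffer(bh);	/* sleep while writers make more */
			put_bh(bh);
			goto repeat;		/* rescan from scratch: livelock */
		}
		/* ...otherwise invalidate/discard the buffer here... */
	}
	spin_unlock(&lru_list_lock);
}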
--- linux-2.4.18-pre9/fs/buffer.c Thu Feb 7 13:04:21 2002
+++ linux-akpm/fs/buffer.c Sun Feb 10 21:50:48 2002
@@ -188,12 +188,13 @@ static void write_locked_buffers(struct
* return without it!
*/
#define NRSYNC (32)
-static int write_some_buffers(kdev_t dev)
+static int write_some_buffers(kdev_t dev, signed long *nr_to_write)
{
struct buffer_head *next;
struct buffer_head *array[NRSYNC];
unsigned int count;
int nr;
+ int ret;
next = lru_list[BUF_DIRTY];
nr = nr_buffers_type[BUF_DIRTY];
@@ -212,29 +213,38 @@ static int write_some_buffers(kdev_t dev
array[count++] = bh;
if (count < NRSYNC)
continue;
-
- spin_unlock(&lru_list_lock);
- write_locked_buffers(array, count);
- return -EAGAIN;
+ ret = -EAGAIN;
+ goto writeout;
}
unlock_buffer(bh);
__refile_buffer(bh);
}
+ ret = 0;
+writeout:
spin_unlock(&lru_list_lock);
-
- if (count)
+ if (count) {
write_locked_buffers(array, count);
- return 0;
+ if (nr_to_write)
+ *nr_to_write -= count;
+ }
+ return ret;
}
/*
- * Write out all buffers on the dirty list.
+ * Because we drop the locking during I/O it's not possible
+ * to write out all the buffers. So the only guarantee that
+ * we can make here is that we write out all the buffers which
+ * were dirty at the time write_unlocked_buffers() was called.
+ * fsync_dev() calls in here three times, so we end up writing
+ * many more buffers than ever appear on BUF_DIRTY.
*/
static void write_unlocked_buffers(kdev_t dev)
{
+ signed long nr_to_write = nr_buffers_type[BUF_DIRTY] * 2;
+
do {
spin_lock(&lru_list_lock);
- } while (write_some_buffers(dev));
+ } while (write_some_buffers(dev, &nr_to_write) && (nr_to_write > 0));
run_task_queue(&tq_disk);
}
@@ -1079,7 +1089,7 @@ void balance_dirty(void)
/* If we're getting into imbalance, start write-out */
spin_lock(&lru_list_lock);
- write_some_buffers(NODEV);
+ write_some_buffers(NODEV, NULL);
/*
* And if we're _really_ out of balance, wait for
@@ -2852,7 +2862,7 @@ static int sync_old_buffers(void)
bh = lru_list[BUF_DIRTY];
if (!bh || time_before(jiffies, bh->b_flushtime))
break;
- if (write_some_buffers(NODEV))
+ if (write_some_buffers(NODEV, NULL))
continue;
return 0;
}
@@ -2951,7 +2961,7 @@ int bdflush(void *startup)
CHECK_EMERGENCY_SYNC
spin_lock(&lru_list_lock);
- if (!write_some_buffers(NODEV) || balance_dirty_state() < 0) {
+ if (!write_some_buffers(NODEV, NULL) || balance_dirty_state() < 0) {
wait_for_some_buffers(NODEV);
interruptible_sleep_on(&bdflush_wait);
}
-
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [patch] get_request starvation fix
2002-02-11 9:41 ` Andrew Morton
@ 2002-02-11 17:35 ` Suparna Bhattacharya
2002-02-11 19:26 ` Andrew Morton
2002-02-13 0:33 ` Jesse Barnes
1 sibling, 1 reply; 23+ messages in thread
From: Suparna Bhattacharya @ 2002-02-11 17:35 UTC (permalink / raw)
To: Andrew Morton; +Cc: Rik van Riel, William Lee Irwin III, lkml
There's something that's still bothering me - I wonder if I'm missing
something very obvious here.
> * So here's what we do:
> *
> * a) A READA requester fails if free_requests < batch_requests
> *
> * We don't want READA requests to prevent sleepers from ever
> * waking.
> *
> * When a process wants a new request:
> *
> * b) If free_requests == 0, the requester sleeps in FIFO manner.
> *
> * c) If 0 < free_requests < batch_requests and there are waiters,
> * we still take a request non-blockingly. This provides batching.
For a caller who just got an exclusive wakeup on the request queue,
this enables the woken-up task to do batching and not sleep again on the
next request it needs. However, since we use the same logic for new callers,
don't we still have those starvation issues?
(i.e. new callers end up stealing requests while there are sleepers,
given that wakeups will happen only if the queue builds up beyond the
high-water mark)
> *
> * d) If free_requests >= batch_requests, the caller is immediately
> * granted a new request.
> *
> * When a request is released:
> *
> * e) If free_requests < batch_requests, do nothing.
On Mon, Feb 11, 2002 at 01:41:20AM -0800, Andrew Morton wrote:
> Andrew Morton wrote:
> >
> > Here's a patch which addresses the get_request starvation problem.
> >
>
> Sharp-eyed Suparna noticed that the algorithm still works if the
> low-water mark is set to zero (ie: rip it out) so I did that and
> the code is a little simpler. Updated patch at
> http://www.zip.com.au/~akpm/linux/2.4/2.4.18-pre9/make_request.patch
>
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [patch] get_request starvation fix
2002-02-11 17:35 ` Suparna Bhattacharya
@ 2002-02-11 19:26 ` Andrew Morton
2002-02-14 6:00 ` Suparna Bhattacharya
0 siblings, 1 reply; 23+ messages in thread
From: Andrew Morton @ 2002-02-11 19:26 UTC (permalink / raw)
To: suparna; +Cc: Rik van Riel, William Lee Irwin III, lkml
Suparna Bhattacharya wrote:
>
> ...
> > * When a process wants a new request:
> > *
> > * b) If free_requests == 0, the requester sleeps in FIFO manner.
> > *
> > * c) If 0 < free_requests < batch_requests and there are waiters,
> > * we still take a request non-blockingly. This provides batching.
>
> For a caller who just got an exclusive wakeup on the request queue,
> this enables the woken-up task to do batching and not sleep again on the
> next request it needs. However, since we use the same logic for new callers,
> don't we still have those starvation issues?
Let's think of `incoming' tasks and `outgoing' ones. The incoming
tasks are running tasks which are making new requests, and the
outgoing ones are tasks which were on the waitqueue, and which are
in the process of being woken and granted a request.
Your concern is that incoming tasks can still steal all the requests
away from the sleepers. And sure, they can, until all requests are
gone. At which time the incoming tasks get to wait their turn on
the waitqueue. And generally, the problem is with writes, where we tend
to have a few threads performing a large number of requests.
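In rough code, the admission rule being relied on is this (a simplified
sketch in the spirit of 2.4's ll_rw_blk.c, not the patch verbatim;
locking omitted):

/*
 * Sketch: incoming tasks are served while any requests remain;
 * once the pool is empty this returns NULL and the caller must
 * join the FIFO waitqueue behind the existing sleepers, so it
 * cannot overtake them indefinitely.
 */
static struct request *get_request(request_queue_t *q, int rw)
{
	struct request_list *rl = &q->rq[rw];
	struct request *rq = NULL;

	if (rl->count > 0) {
		rq = blkdev_free_rq(&rl->free);	/* first request on the free list */
		list_del(&rq->queue);
		rl->count--;
		rq->rq_status = RQ_ACTIVE;
	}
	return rq;	/* NULL: pool exhausted, caller sleeps FIFO */
}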
Another scenario is where incoming tasks just take a handful of requests,
their rate of taking equals the rate of request freeing, and the
free_requests count remains greater than zero but less than batch_nr_requests.
This steady state could conceivably happen with the correct number of
threads performing O_SYNC/fsync writes, for example. I'll see if I can
find a way to make this happen.
The final scenario is where there are many outgoing tasks, so they
compete, and each gets an insufficiently large number of requests.
I suspect this is indeed happening with the current kernel's wake-all
code. But the wake-one change makes this unlikely.
When an outgoing task is woken, we know there are 32 free requests on the
queue. There's an assumption (or design) here that the outgoing task
will be able to get a decent number of those requests. This works. It
may fail if there are a massive number of CPUs. But we need to increase
the overall request queue size anyway - 128 is too small.
Under heavy load, the general operating mode for this code is that
damn near every task is asleep. So most work is done by outgoing threads.
BTW, I suspect the request batching isn't super-important. It'll
certainly cut the context-switch rate a lot, but from a
request-merging point of view it's unlikely to make much difference.
-
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [patch] get_request starvation fix
2002-02-11 19:26 ` Andrew Morton
@ 2002-02-14 6:00 ` Suparna Bhattacharya
0 siblings, 0 replies; 23+ messages in thread
From: Suparna Bhattacharya @ 2002-02-14 6:00 UTC (permalink / raw)
To: Andrew Morton; +Cc: Rik van Riel, William Lee Irwin III, lkml
On Mon, Feb 11, 2002 at 11:26:44AM -0800, Andrew Morton wrote:
> Suparna Bhattacharya wrote:
> >
> > ...
> > > * When a process wants a new request:
> > > *
> > > * b) If free_requests == 0, the requester sleeps in FIFO manner.
> > > *
> > > * c) If 0 < free_requests < batch_requests and there are waiters,
> > > * we still take a request non-blockingly. This provides batching.
> >
> > For a caller who just got an exclusive wakeup on the request queue,
> > this enables the woken-up task to do batching and not sleep again on the
> > next request it needs. However, since we use the same logic for new callers,
> > don't we still have those starvation issues?
>
> Let's think of `incoming' tasks and `outgoing' ones. The incoming
> tasks are running tasks which are making new requests, and the
> outgoing ones are tasks which were on the waitqueue, and which are
> in the process of being woken and granted a request.
>
> Your concern is that incoming tasks can still steal all the requests
> away from the sleepers. And sure, they can, until all requests are
> gone. At which time the incoming tasks get to wait their turn on
> the waitqueue. And generally, the problem is with writes, where we tend
> to have a few threads performing a large number of requests.
>
> Another scenario is where incoming tasks just take a handful of requests,
> and their rate of taking equals the rate of request freeing, and the
> free_request threshold remains greater than zero, less than batch_nr_requests.
> This steady state could conceivably happen with the correct number of
> threads performing O_SYNC/fsync writes, for example. I'll see if I can
> find a way to make this happen.
Thanks for the explanation. (I was off-line for a couple of days, so
I didn't get to respond earlier.) This is closest to the case I
was concerned about. Perhaps with merging/write-clustering at higher
levels the number of requests would be smaller, but that is secondary.
What was worrying me is that even when the rate of request taking
is greater than the rate of request freeing, couldn't we have every
1 out of N incoming tasks picking up a just-freed request, thus
starving the waiters? (The remaining N-1 incoming tasks might get to
wait their turn, where N:1 is the ratio of request taking to request
freeing.)
>
> The final scenario is where there are many outgoing tasks, so they
> compete, and each gets an insufficiently large number of requests.
> I suspect this is indeed happening with the current kernel's wake-all
> code. But the wake-one change makes this unlikely.
>
> When an outgoing task is woken, we know there are 32 free requests on the
> queue. There's an assumption (or design) here that the outgoing task
> will be able to get a decent number of those requests. This works. It
> may fail if there are a massive number of CPUs. But we need to increase
> the overall request queue size anyway - 128 is too small.
Yes, especially when the system has sufficient resources to increase
the request queue size, one might as well queue up on the request
queue, rather than the wait queue, and minimize those context switches.
>
> Under heavy load, the general operating mode for this code is that
> damn near every task is asleep. So most work is done by outgoing threads.
>
> BTW, I suspect the request batching isn't super-important. It'll
> certainly decrease the context switch rate very much. But from a
> request-merging point of view it's unlikely to make much difference.
>
> -
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [patch] get_request starvation fix
2002-02-11 9:41 ` Andrew Morton
2002-02-11 17:35 ` Suparna Bhattacharya
@ 2002-02-13 0:33 ` Jesse Barnes
1 sibling, 0 replies; 23+ messages in thread
From: Jesse Barnes @ 2002-02-13 0:33 UTC (permalink / raw)
To: Andrew Morton; +Cc: lkml
On Mon, Feb 11, 2002 at 01:41:20AM -0800, Andrew Morton wrote:
> Also, here's a patch which fixes the /bin/sync livelock in
> write_unlocked_buffers(). It simply bails out after writing
> all the buffers which were dirty at the time the function
> was called, rather than keeping on trying to write buffers
> until the list is empty.
I've also seen this issue when I start multiple mkfs commands (running
10 in parallel basically makes the system useless), but the patch helped.
Any chance it will get into 2.4?
Thanks,
Jesse
^ permalink raw reply [flat|nested] 23+ messages in thread
end of thread, other threads: [~2002-02-16 10:27 UTC | newest]
Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
[not found] <200202081932.GAA05943@mangalore.zipworld.com.au>
2002-02-08 19:44 ` [patch] get_request starvation fix Andrew Morton
2002-02-08 19:53 ` Dieter Nützel
2002-02-08 20:43 ` Rik van Riel
2002-02-13 13:55 rwhron
-- strict thread matches above, loose matches on Subject: below --
2002-02-12 23:13 Andrew Morton
2002-02-13 1:28 ` William Lee Irwin III
2002-02-15 17:23 ` Marcelo Tosatti
2002-02-16 7:32 ` Andrew Morton
2002-02-16 10:13 ` Daniel Phillips
2002-02-16 10:25 ` Andrew Morton
2002-02-09 1:56 rwhron
2002-02-08 19:31 Dieter Nützel
2002-02-08 8:46 Andrew Morton
2002-02-08 8:57 ` Jens Axboe
2002-02-08 9:57 ` Andrew Morton
2002-02-08 9:10 ` Andrew Morton
2002-02-08 11:37 ` Rik van Riel
2002-02-08 18:28 ` Andrew Morton
2002-02-11 9:41 ` Andrew Morton
2002-02-11 17:35 ` Suparna Bhattacharya
2002-02-11 19:26 ` Andrew Morton
2002-02-14 6:00 ` Suparna Bhattacharya
2002-02-13 0:33 ` Jesse Barnes