* RFC - how to balance Dirty+Writeback in the face of slow writeback.
@ 2006-08-14 23:40 Neil Brown
2006-08-15 8:06 ` Andrew Morton
0 siblings, 1 reply; 37+ messages in thread
From: Neil Brown @ 2006-08-14 23:40 UTC (permalink / raw)
To: linux-kernel
I have a question about the write throttling in
balance_dirty_pages in the face of slow writeback.
Suppose we have a filesystem where writeback is relatively slow -
e.g. NFS or EXTx over nbd over a slow link.
Suppose for the sake of simplicity that writeback is very slow and
doesn't progress at all for the first part of our experiment.
We write to a large file.
balance_dirty_pages gets called periodically. Until the number of
Dirty pages reaches 40% of memory it does nothing.
Once we hit 40%, balance_dirty_pages starts calling writeback_inodes
and these Dirty pages get converted to Writeback pages. This happens
at 1.5 times the speed that dirty pages are created (due to
sync_writeback_pages()). So for every 100K that we dirty, 150K gets
converted to writeback. But balance_dirty_pages doesn't wait for anything.
This will result in the number of dirty pages going down steadily, and
the number of writeback pages increasing quickly (3 times the speed of
the drop in Dirty). The total of Dirty+Writeback will keep growing.
When Dirty hits 0 (and Writeback is theoretically 80% of RAM)
balance_dirty_pages will no longer be able to flush the full
'write_chunk' (1.5 times the number of recently dirtied pages) and so
will spin in a loop calling blk_congestion_wait(WRITE, HZ/10), so it
isn't a busy loop, but it won't make progress.
Now our very slow writeback gets its act together and starts making
some progress and the Writeback number steadily drops down to 40%.
At this point balance_dirty_pages will exit, more pages will get
dirtied, and balance_dirty_pages will quickly flush them out again.
The steady state will be with Dirty at or close to 0, and Writeback at
or close to 40%.
Now obviously this is somewhat idealised, and even slow writeback will
make some progress early on, but you can still expect to get a very
large Writeback with a very small Dirty before stabilising.
I don't think we want this, but I'm not sure what we do want, so I'm
asking for opinions.
I don't think that pushing Dirty down to zero is the best thing to
do. If writeback is slow, we should simply be waiting for writeback
to progress rather than putting more work into the writeback queue.
This also allows pages to stay 'dirty' for longer, which is generally
considered to be a good thing.
I think we need to have two numbers. One that is the limit of dirty
pages, and one that is the limit of the combined dirty+writeback.
Alternatively it could simply be a limit on writeback.
Probably the latter, because having a very large writeback number makes
the 'inactive_list' of pages very large and so it takes a long time
to scan.
So suppose dirty were capped at vm_dirty_ratio, and writeback were
capped at that too, though independently.
Then in our experiment, Dirty would grow up to 40%, then
balance_dirty_pages would start flushing and Writeback would grow to
40% while Dirty stayed at 40%. Then balance_dirty_pages would not
flush anything but would just wait for Writeback to drop below 40%.
You would get a very obvious steady state of 40% Dirty and
40% Writeback.
Is this too much memory? 80% tied up in what are essentially dirty
blocks is more than you would expect when setting vm.dirty_ratio to
40.
Maybe 40% should limit Dirty+Writeback and when we cross the
threshold:
  if Dirty > Writeback - flush and wait
  if Dirty < Writeback - just wait
bdflush should get some writeback underway before we hit the 40%, so
balance_dirty_pages shouldn't find itself waiting for the pages it
just flushed.
Suggestions? Opinions?
The following patch demonstrates the last suggestion.
Thanks,
NeilBrown
Signed-off-by: Neil Brown <neilb@suse.de>
### Diffstat output
./mm/page-writeback.c | 4 +---
1 file changed, 1 insertion(+), 3 deletions(-)
diff .prev/mm/page-writeback.c ./mm/page-writeback.c
--- .prev/mm/page-writeback.c 2006-08-15 09:36:23.000000000 +1000
+++ ./mm/page-writeback.c 2006-08-15 09:39:17.000000000 +1000
@@ -207,7 +207,7 @@ static void balance_dirty_pages(struct a
 		 * written to the server's write cache, but has not yet
 		 * been flushed to permanent storage.
 		 */
-		if (nr_reclaimable) {
+		if (nr_reclaimable > global_page_state(NR_WRITEBACK)) {
 			writeback_inodes(&wbc);
 			get_dirty_limits(&background_thresh,
 					&dirty_thresh, mapping);
@@ -218,8 +218,6 @@ static void balance_dirty_pages(struct a
 					<= dirty_thresh)
 				break;
 			pages_written += write_chunk - wbc.nr_to_write;
-			if (pages_written >= write_chunk)
-				break;		/* We've done our duty */
 		}
 		blk_congestion_wait(WRITE, HZ/10);
 	}
^ permalink raw reply [flat|nested] 37+ messages in thread

* Re: RFC - how to balance Dirty+Writeback in the face of slow writeback.
  2006-08-14 23:40 RFC - how to balance Dirty+Writeback in the face of slow writeback Neil Brown
@ 2006-08-15 8:06 ` Andrew Morton
  2006-08-15 23:00 ` David Chinner
  2006-08-17 3:59 ` Neil Brown
  0 siblings, 2 replies; 37+ messages in thread
From: Andrew Morton @ 2006-08-15 8:06 UTC (permalink / raw)
To: Neil Brown; +Cc: linux-kernel

On Tue, 15 Aug 2006 09:40:12 +1000
Neil Brown <neilb@suse.de> wrote:
>
> I have a question about the write throttling in
> balance_dirty_pages in the face of slow writeback.

btw, we have a problem in there at present when you're using a
combination of slow devices and fast devices. That worked OK in
2.5.x, iirc, but seems to have gotten broken since.

> Suppose we have a filesystem where writeback is relatively slow -
> e.g. NFS or EXTx over nbd over a slow link.
>
> Suppose for the sake of simplicity that writeback is very slow and
> doesn't progress at all for the first part of our experiment.
>
> We write to a large file.
> balance_dirty_pages gets called periodically. Until the number of
> Dirty pages reaches 40% of memory it does nothing.
>
> Once we hit 40%, balance_dirty_pages starts calling writeback_inodes
> and these Dirty pages get converted to Writeback pages. This happens
> at 1.5 times the speed that dirty pages are created (due to
> sync_writeback_pages()). So for every 100K that we dirty, 150K gets
> converted to writeback. But balance_dirty_pages doesn't wait for anything.
>
> This will result in the number of dirty pages going down steadily, and
> the number of writeback pages increasing quickly (3 times the speed of
> the drop in Dirty). The total of Dirty+Writeback will keep growing.
>
> When Dirty hits 0 (and Writeback is theoretically 80% of RAM)
> balance_dirty_pages will no longer be able to flush the full
> 'write_chunk' (1.5 times the number of recently dirtied pages) and so will
> spin in a loop calling blk_congestion_wait(WRITE, HZ/10), so it isn't
> a busy loop, but it won't make progress.

This assumes that the queues are unbounded. They're not - they're
limited to 128 requests, which is 60MB or so.

Per queue. The scenario you identify can happen if it's spread across
multiple disks simultaneously.

CFQ used to have 1024 requests and we did have problems with excessive
numbers of writeback pages. I fixed that in 2.6.early, but that seems
to have got lost as well.

> Now our very slow writeback gets its act together and starts making
> some progress and the Writeback number steadily drops down to 40%.
> At this point balance_dirty_pages will exit, more pages will get
> dirtied, and balance_dirty_pages will quickly flush them out again.
>
> The steady state will be with Dirty at or close to 0, and Writeback at
> or close to 40%.
>
> Now obviously this is somewhat idealised, and even slow writeback will
> make some progress early on, but you can still expect to get a very
> large Writeback with a very small Dirty before stabilising.
>
> I don't think we want this, but I'm not sure what we do want, so I'm
> asking for opinions.
>
> I don't think that pushing Dirty down to zero is the best thing to
> do. If writeback is slow, we should simply be waiting for writeback
> to progress rather than putting more work into the writeback queue.
> This also allows pages to stay 'dirty' for longer, which is generally
> considered to be a good thing.
>
> I think we need to have two numbers. One that is the limit of dirty
> pages, and one that is the limit of the combined dirty+writeback.
> Alternatively it could simply be a limit on writeback.
>
> Probably the latter, because having a very large writeback number makes
> the 'inactive_list' of pages very large and so it takes a long time
> to scan.
>
> So suppose dirty were capped at vm_dirty_ratio, and writeback were
> capped at that too, though independently.
>
> Then in our experiment, Dirty would grow up to 40%, then
> balance_dirty_pages would start flushing and Writeback would grow to
> 40% while Dirty stayed at 40%. Then balance_dirty_pages would not
> flush anything but would just wait for Writeback to drop below 40%.
> You would get a very obvious steady state of 40% Dirty and
> 40% Writeback.
>
> Is this too much memory? 80% tied up in what are essentially dirty
> blocks is more than you would expect when setting vm.dirty_ratio to
> 40.
>
> Maybe 40% should limit Dirty+Writeback and when we cross the
> threshold:
>   if Dirty > Writeback - flush and wait
>   if Dirty < Writeback - just wait
>
> bdflush should get some writeback underway before we hit the 40%, so
> balance_dirty_pages shouldn't find itself waiting for the pages it
> just flushed.
>
> Suggestions? Opinions?
>
> The following patch demonstrates the last suggestion.
>
> Thanks,
> NeilBrown
>
> Signed-off-by: Neil Brown <neilb@suse.de>
>
> ### Diffstat output
>  ./mm/page-writeback.c | 4 +---
>  1 file changed, 1 insertion(+), 3 deletions(-)
>
> diff .prev/mm/page-writeback.c ./mm/page-writeback.c
> --- .prev/mm/page-writeback.c 2006-08-15 09:36:23.000000000 +1000
> +++ ./mm/page-writeback.c 2006-08-15 09:39:17.000000000 +1000
> @@ -207,7 +207,7 @@ static void balance_dirty_pages(struct a
>  		 * written to the server's write cache, but has not yet
>  		 * been flushed to permanent storage.
>  		 */
> -		if (nr_reclaimable) {
> +		if (nr_reclaimable > global_page_state(NR_WRITEBACK)) {
>  			writeback_inodes(&wbc);
>  			get_dirty_limits(&background_thresh,
>  					&dirty_thresh, mapping);
> @@ -218,8 +218,6 @@ static void balance_dirty_pages(struct a
>  					<= dirty_thresh)
>  				break;
>  			pages_written += write_chunk - wbc.nr_to_write;
> -			if (pages_written >= write_chunk)
> -				break;		/* We've done our duty */
>  		}
>  		blk_congestion_wait(WRITE, HZ/10);
>  	}

Something like that - it'll be relatively simple.
* Re: RFC - how to balance Dirty+Writeback in the face of slow writeback.
  2006-08-15 8:06 ` Andrew Morton
@ 2006-08-15 23:00 ` David Chinner
  2006-08-17 4:08 ` Neil Brown
  2006-08-17 3:59 ` Neil Brown
  1 sibling, 1 reply; 37+ messages in thread
From: David Chinner @ 2006-08-15 23:00 UTC (permalink / raw)
To: Andrew Morton; +Cc: Neil Brown, linux-kernel

On Tue, Aug 15, 2006 at 01:06:11AM -0700, Andrew Morton wrote:
> On Tue, 15 Aug 2006 09:40:12 +1000
> Neil Brown <neilb@suse.de> wrote:
> > When Dirty hits 0 (and Writeback is theoretically 80% of RAM)
> > balance_dirty_pages will no longer be able to flush the full
> > 'write_chunk' (1.5 times the number of recently dirtied pages) and so will
> > spin in a loop calling blk_congestion_wait(WRITE, HZ/10), so it isn't
> > a busy loop, but it won't make progress.
>
> This assumes that the queues are unbounded. They're not - they're limited
> to 128 requests, which is 60MB or so.
>
> Per queue. The scenario you identify can happen if it's spread across
> multiple disks simultaneously.

Though in this situation, you don't usually have slow writeback
problems. I haven't seen any recent problems with insufficient
throttling on this sort of configuration.

> CFQ used to have 1024 requests and we did have problems with excessive
> numbers of writeback pages. I fixed that in 2.6.early, but that seems to
> have got lost as well.

CFQ still has a queue depth of 128 requests....

> > bdflush should get some writeback underway before we hit the 40%, so
> > balance_dirty_pages shouldn't find itself waiting for the pages it
> > just flushed.

balance_dirty_pages() already kicks the background writeback done by
pdflush when dirty > dirty_background_ratio (10%).

IMO, if you've got slow writeback, you should be reducing the amount
of dirty memory you allow in the machine so that you don't tie up
large amounts of memory that takes a long time to clean. Throttle
earlier and you avoid this problem entirely.

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
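For reference, the "throttle earlier" knobs Dave refers to already exist as sysctls; the fragment below is purely illustrative (the values are examples, not recommendations from the thread):

```
# /etc/sysctl.conf - throttle writers earlier on a slow-writeback box.
# Background writeback by pdflush starts at 2% of RAM dirty:
vm.dirty_background_ratio = 2
# Writers themselves are throttled once 10% of RAM is dirty:
vm.dirty_ratio = 10
```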
* Re: RFC - how to balance Dirty+Writeback in the face of slow writeback.
  2006-08-15 23:00 ` David Chinner
@ 2006-08-17 4:08 ` Neil Brown
  2006-08-17 6:14 ` Andrew Morton
  2006-08-17 22:17 ` David Chinner
  0 siblings, 2 replies; 37+ messages in thread
From: Neil Brown @ 2006-08-17 4:08 UTC (permalink / raw)
To: David Chinner; +Cc: Andrew Morton, linux-kernel

On Wednesday August 16, dgc@sgi.com wrote:
>
> IMO, if you've got slow writeback, you should be reducing the amount
> of dirty memory you allow in the machine so that you don't tie up
> large amounts of memory that takes a long time to clean. Throttle earlier
> and you avoid this problem entirely.

I completely agree that 'throttle earlier' is important. I'm just not
completely sure what should be throttled when.

I think I could argue that pages in 'Writeback' are really still
dirty. The difference is really just an implementation issue. So when
the dirty_ratio is set to 40%, that should apply to all 'dirty' pages,
which means both those flagged as 'Dirty' and those flagged as
'Writeback'.

So I think you need to throttle when Dirty+Writeback hits dirty_ratio
(which we don't quite get right at the moment). But the trick is to
throttle gently and fairly, rather than having a hard wall so that
anyone who hits it just stops.

Thanks,
NeilBrown
* Re: RFC - how to balance Dirty+Writeback in the face of slow writeback.
  2006-08-17 4:08 ` Neil Brown
@ 2006-08-17 6:14 ` Andrew Morton
  2006-08-17 12:36 ` Trond Myklebust
  2006-08-18 0:11 ` David Chinner
  1 sibling, 2 replies; 37+ messages in thread
From: Andrew Morton @ 2006-08-17 6:14 UTC (permalink / raw)
To: Neil Brown; +Cc: David Chinner, linux-kernel

On Thu, 17 Aug 2006 14:08:58 +1000
Neil Brown <neilb@suse.de> wrote:

> So I think you need to throttle when Dirty+Writeback hits dirty_ratio

yup.

> (which we don't quite get right at the moment). But the trick is to
> throttle gently and fairly, rather than having a hard wall so that
> anyone who hits it just stops.

I swear, I had all this working in 2001. Perhaps I dreamed it. But I
specifically remember testing that processes which were performing
small, occasional writes were not getting blocked behind the activity
of other processes which were doing massive write()s. Ho hum, not to
worry.

I guess a robust approach would be to track, on a per-process,
per-threadgroup, per-user, etc basis the time-averaged page-dirtying
rate. If it is "low" then accept the dirtying. If it is "high" then
this process is a heavy writer and needs throttling earlier. Up to a
point - at some level we'll need to throttle everyone as a safety net
if nothing else.

Something like that covers the global dirty+writeback problem. The
other major problem space is the multiple-backing-device problem:

a) One device is being written to heavily, another lightly

b) One device is fast, another is slow.

Thus far, the limited size of the request queues has saved us from
really, really serious problems. But that doesn't work when lots of
disks are being used.

To solve this properly we'd need to account for
dirty+writeback(+unstable?) pages on a per-backing-dev basis.

But as a first step, yes, using dirty+writeback for the throttling
threshold and continuing to rely upon limited request queue size to
save us from disaster would be a good step.

btw, one thing which afaik NFS _still_ doesn't do is to wake up
processes which are stuck in blk_congestion_wait() when NFS has
retired a bunch of writes. It should do so, otherwise NFS
write-intensive workloads might end up sleeping for too long. I guess
the amount of buffering and hysteresis we have in there has thus far
prevented any problems from being observed.
* Re: RFC - how to balance Dirty+Writeback in the face of slow writeback.
  2006-08-17 6:14 ` Andrew Morton
@ 2006-08-17 12:36 ` Trond Myklebust
  2006-08-17 15:14 ` Andrew Morton
  1 sibling, 1 reply; 37+ messages in thread
From: Trond Myklebust @ 2006-08-17 12:36 UTC (permalink / raw)
To: Andrew Morton; +Cc: Neil Brown, David Chinner, linux-kernel

On Wed, 2006-08-16 at 23:14 -0700, Andrew Morton wrote:
> btw, one thing which afaik NFS _still_ doesn't do is to wake up processes
> which are stuck in blk_congestion_wait() when NFS has retired a bunch of
> writes. It should do so, otherwise NFS write-intensive workloads might end
> up sleeping for too long. I guess the amount of buffering and hysteresis
> we have in there has thus far prevented any problems from being observed.

Are we to understand that you consider blk_congestion_wait() to be an
official API, and not just another block layer hack inside the VM?

'cos currently the only tools for waking up processes in
blk_congestion_wait() are the two routines:

  static void clear_queue_congested(request_queue_t *q, int rw)
and
  static void set_queue_congested(request_queue_t *q, int rw)

in block/ll_rw_blk.c. Hardly a model of well thought out code...

Trond
* Re: RFC - how to balance Dirty+Writeback in the face of slow writeback.
  2006-08-17 12:36 ` Trond Myklebust
@ 2006-08-17 15:14 ` Andrew Morton
  2006-08-17 16:22 ` Trond Myklebust
  0 siblings, 1 reply; 37+ messages in thread
From: Andrew Morton @ 2006-08-17 15:14 UTC (permalink / raw)
To: Trond Myklebust; +Cc: Neil Brown, David Chinner, linux-kernel

On Thu, 17 Aug 2006 08:36:19 -0400
Trond Myklebust <trond.myklebust@fys.uio.no> wrote:

> On Wed, 2006-08-16 at 23:14 -0700, Andrew Morton wrote:
> > btw, one thing which afaik NFS _still_ doesn't do is to wake up processes
> > which are stuck in blk_congestion_wait() when NFS has retired a bunch of
> > writes. It should do so, otherwise NFS write-intensive workloads might end
> > up sleeping for too long. I guess the amount of buffering and hysteresis
> > we have in there has thus far prevented any problems from being observed.
>
> Are we to understand that you consider blk_congestion_wait() to be an
> official API, and not just another block layer hack inside the VM?
>
> 'cos currently the only tools for waking up processes in
> blk_congestion_wait() are the two routines:
>
>   static void clear_queue_congested(request_queue_t *q, int rw)
> and
>   static void set_queue_congested(request_queue_t *q, int rw)
>
> in block/ll_rw_blk.c. Hardly a model of well thought out code...

We've been over this before...

Take a look at blk_congestion_wait(). It doesn't know about request
queues. We'd need a new

	void writeback_congestion_end(int rw)
	{
		wake_up(congestion_wqh[rw]);
	}

or similar.
* Re: RFC - how to balance Dirty+Writeback in the face of slow writeback.
  2006-08-17 15:14 ` Andrew Morton
@ 2006-08-17 16:22 ` Trond Myklebust
  2006-08-18 5:49 ` Andrew Morton
  0 siblings, 1 reply; 37+ messages in thread
From: Trond Myklebust @ 2006-08-17 16:22 UTC (permalink / raw)
To: Andrew Morton; +Cc: Neil Brown, David Chinner, linux-kernel

On Thu, 2006-08-17 at 08:14 -0700, Andrew Morton wrote:
> Take a look at blk_congestion_wait(). It doesn't know about request
> queues. We'd need a new
>
> 	void writeback_congestion_end(int rw)
> 	{
> 		wake_up(congestion_wqh[rw]);
> 	}
>
> or similar.

...and how often do you want us to call this? NFS doesn't know much
about request queues either: it writes out pages on a per-RPC call
basis. In the worst case that could mean waking up the VM every time we
write out a single page.

Cheers,

Trond
* Re: RFC - how to balance Dirty+Writeback in the face of slow writeback.
  2006-08-17 16:22 ` Trond Myklebust
@ 2006-08-18 5:49 ` Andrew Morton
  2006-08-18 10:43 ` Nikita Danilov
  0 siblings, 1 reply; 37+ messages in thread
From: Andrew Morton @ 2006-08-18 5:49 UTC (permalink / raw)
To: Trond Myklebust; +Cc: Neil Brown, David Chinner, linux-kernel

On Thu, 17 Aug 2006 12:22:59 -0400
Trond Myklebust <trond.myklebust@fys.uio.no> wrote:

> On Thu, 2006-08-17 at 08:14 -0700, Andrew Morton wrote:
> > Take a look at blk_congestion_wait(). It doesn't know about request
> > queues. We'd need a new
> >
> > 	void writeback_congestion_end(int rw)
> > 	{
> > 		wake_up(congestion_wqh[rw]);
> > 	}
> >
> > or similar.
>
> ...and how often do you want us to call this? NFS doesn't know much
> about request queues either: it writes out pages on a per-RPC call
> basis. In the worst case that could mean waking up the VM every time we
> write out a single page.

Once per page would work OK, but we'd save some CPU by making it less
frequent.

This stuff isn't very precise. We could make it precise, but it would
require a really large amount of extra locking, extra locks, etc.

The way this code all works is pretty crude and simple: a process comes
in to do some writeback and it enters a polling loop:

	while (we need to do writeback) {
		for (each superblock) {
			if (the superblock's backing_dev isn't congested) {
				stuff some more IO down it()
			}
		}
		take_a_nap();
	}

so the process remains captured in that polling loop until the
dirty-memory-exceed condition subsides. The reason why we avoid
congested queues is so that one thread can keep multiple queues busy:
we don't want to allow writing threads to get stuck on a single queue
and we don't want to have to provision one pdflush per spindle (or,
more precisely, per backing_dev_info).

So the question is: how do we "take a nap"? That's
blk_congestion_wait(). The process goes to sleep in there and gets
woken up when someone thinks that a queue might be able to take some
more writeout.

A caller into blk_congestion_wait() is _supposed_ to be woken by
writeback completion. If the timeout actually expires, something isn't
right. If we had all the new locking in place and correct, the timeout
wouldn't actually be needed. In theory, the timeout is only there as a
fallback to handle certain races for which we don't want to implement
all that new locking to fix.

It would be good if NFS were to implement a fixed-size "request
queue", so we can't fill all memory with NFS requests. Then, NFS can
implement a congestion threshold at "75% full" (via its
backing_dev_info) and everything is in place.

As a halfway step it might provide benefit for NFS to poke the
congestion_wq[] every quarter megabyte or so, to kick any processes
out of their sleep so they go back to poll all the superblocks again,
earlier than they otherwise would have. It might not make any
difference - one would need to get in there and understand the dynamic
behaviour.
* Re: RFC - how to balance Dirty+Writeback in the face of slow writeback.
  2006-08-18 5:49 ` Andrew Morton
@ 2006-08-18 10:43 ` Nikita Danilov
  0 siblings, 0 replies; 37+ messages in thread
From: Nikita Danilov @ 2006-08-18 10:43 UTC (permalink / raw)
To: Andrew Morton; +Cc: Neil Brown, David Chinner, linux-kernel

Andrew Morton writes:

[...]

> The way this code all works is pretty crude and simple: a process comes
> in to do some writeback and it enters a polling loop:
>
> 	while (we need to do writeback) {
> 		for (each superblock) {
> 			if (the superblock's backing_dev isn't congested) {
> 				stuff some more IO down it()
> 			}
> 		}
> 		take_a_nap();
> 	}
>
> so the process remains captured in that polling loop until the
> dirty-memory-exceed condition subsides. The reason why we avoid

Hm... wbc->nr_to_write is checked all the way down
(balance_dirty_pages(), writeback_inodes(), sync_sb_inodes(),
mpage_writepages()), so an "occasional writer" cannot be stuck for
more than 32 + 16 pages, it seems.

Nikita.
* Re: RFC - how to balance Dirty+Writeback in the face of slow writeback.
  2006-08-17 6:14 ` Andrew Morton
  2006-08-17 12:36 ` Trond Myklebust
@ 2006-08-18 0:11 ` David Chinner
  2006-08-18 6:29 ` Andrew Morton
  1 sibling, 1 reply; 37+ messages in thread
From: David Chinner @ 2006-08-18 0:11 UTC (permalink / raw)
To: Andrew Morton; +Cc: Neil Brown, linux-kernel

On Wed, Aug 16, 2006 at 11:14:48PM -0700, Andrew Morton wrote:
>
> I guess a robust approach would be to track, on a per-process,
> per-threadgroup, per-user, etc basis the time-averaged page-dirtying rate.
> If it is "low" then accept the dirtying. If it is "high" then this process
> is a heavy writer and needs throttling earlier. Up to a point - at some
> level we'll need to throttle everyone as a safety net if nothing else.

The problem with that approach is that throttling a large writer
forces data to disk earlier and that may be undesirable - the large
file might be a temp file that will soon be unlinked, and in this case
you don't want it throttled. Right now, you set dirty*ratio high
enough that this doesn't happen, and the file remains memory resident
until unlink.

> Something like that covers the global dirty+writeback problem. The other
> major problem space is the multiple-backing-device problem:
>
> a) One device is being written to heavily, another lightly
>
> b) One device is fast, another is slow.

Once we are past the throttling threshold, the only thing that matters
is whether we can write more data to the backing device(s). We should
not really be allowing the input rate to exceed the output rate once
we are past the throttle threshold.

> Thus far, the limited size of the request queues has saved us from really,
> really serious problems. But that doesn't work when lots of disks are
> being used.

Mainly because it increases the number of pages under writeback that
currently aren't accounted as dirty and the throttle doesn't kick in
when it should.

> To solve this properly we'd need to account for
> dirty+writeback(+unstable?) pages on a per-backing-dev basis.

We'd still need to account for them globally because we still need
to be able to globally limit the amount of dirty data in the
machine.

FYI, I implemented a complex two-stage throttle on Irix a couple of
years ago - it uses a per-device soft throttle threshold that is not
enforced until the global dirty state passes a configurable limit.
At that point, the per-device limits are enforced.

This meant that devices with no dirty state attached to them could
continue to dirty pages up to their soft-threshold, whereas heavy
writers would be stopped until their backing devices fell back below
the soft thresholds.

Because the amount of dirty pages could continue to grow past safe
limits if you had enough devices, there is also a global hard limit
that cannot be exceeded and this throttles all incoming write
requests regardless of the state of the device it was being written
to.

The problem with this approach is that the code was complex and
difficult to test properly. Also, working out the default config
values was an exercise in trial, error, workload measurement and
guesswork that took some time to get right.

The current linux code works as well as that two-stage throttle
(better in some cases!) because of one main thing - bound request
queue depth with feedback into the throttling control loop. Irix
has neither of these so the throttle had to provide this accounting
and limiting (soft throttle threshold).

Hence I'm not sure that per-backing-device accounting and making
decisions based on that accounting is really going to buy us much
apart from additional complexity....

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
* Re: RFC - how to balance Dirty+Writeback in the face of slow writeback.
  2006-08-18 0:11 ` David Chinner
@ 2006-08-18 6:29 ` Andrew Morton
  2006-08-18 7:03 ` Jens Axboe
  2006-08-18 7:07 ` Neil Brown
  0 siblings, 2 replies; 37+ messages in thread
From: Andrew Morton @ 2006-08-18 6:29 UTC (permalink / raw)
To: David Chinner; +Cc: Neil Brown, linux-kernel

On Fri, 18 Aug 2006 10:11:02 +1000
David Chinner <dgc@sgi.com> wrote:

> > Something like that covers the global dirty+writeback problem. The other
> > major problem space is the multiple-backing-device problem:
> >
> > a) One device is being written to heavily, another lightly
> >
> > b) One device is fast, another is slow.
>
> Once we are past the throttling threshold, the only thing that
> matters is whether we can write more data to the backing device(s).
> We should not really be allowing the input rate to exceed the output
> rate once we are past the throttle threshold.

True. But it seems really sad to block some process which is doing a
really small dirtying (say, some dopey atime update) just because some
other process is doing a huge write.

Now, things _usually_ work out all right, if only because of
balance_dirty_pages_ratelimited()'s logic. But it's more by
happenstance than by intent, and these sorts of interferences can
happen.

> > To solve this properly we'd need to account for
> > dirty+writeback(+unstable?) pages on a per-backing-dev basis.
>
> We'd still need to account for them globally because we still need
> to be able to globally limit the amount of dirty data in the
> machine.
>
> FYI, I implemented a complex two-stage throttle on Irix a couple of
> years ago - it uses a per-device soft throttle threshold that is not
> enforced until the global dirty state passes a configurable limit.
> At that point, the per-device limits are enforced.
>
> This meant that devices with no dirty state attached to them could
> continue to dirty pages up to their soft-threshold, whereas heavy
> writers would be stopped until their backing devices fell back below
> the soft thresholds.
>
> Because the amount of dirty pages could continue to grow past safe
> limits if you had enough devices, there is also a global hard limit
> that cannot be exceeded and this throttles all incoming write
> requests regardless of the state of the device it was being written
> to.
>
> The problem with this approach is that the code was complex and
> difficult to test properly. Also, working out the default config
> values was an exercise in trial, error, workload measurement and
> guesswork that took some time to get right.
>
> The current linux code works as well as that two-stage throttle
> (better in some cases!) because of one main thing - bound request
> queue depth with feedback into the throttling control loop. Irix
> has neither of these so the throttle had to provide this accounting
> and limiting (soft throttle threshold).
>
> Hence I'm not sure that per-backing-device accounting and making
> decisions based on that accounting is really going to buy us much
> apart from additional complexity....
>

hm, interesting.

It seems that the many-writers-to-different-disks workloads don't
happen very often. We know this because

a) The 2.4 performance is utterly awful, and I never saw anybody
   complain and

b) 2.6 has the risk of filling all memory with under-writeback pages,
   and nobody has complained about that either (iirc).

Relying on that observation and the request-queue limits has got us
this far but yeah, we should plug that PageWriteback windup scenario.

btw, Neil, has the PageWriteback windup actually been demonstrated?
If so, how?
* Re: RFC - how to balance Dirty+Writeback in the face of slow writeback.
  2006-08-18 6:29 ` Andrew Morton
@ 2006-08-18 7:03 ` Jens Axboe
  2006-08-18 7:11 ` Andrew Morton
  2006-08-18 18:57 ` Andi Kleen
  1 sibling, 2 replies; 37+ messages in thread
From: Jens Axboe @ 2006-08-18 7:03 UTC (permalink / raw)
To: Andrew Morton; +Cc: David Chinner, Neil Brown, linux-kernel

On Thu, Aug 17 2006, Andrew Morton wrote:
> It seems that the many-writers-to-different-disks workloads don't happen
> very often. We know this because
>
> a) The 2.4 performance is utterly awful, and I never saw anybody
>    complain and

Talk to some of the people that used DVD-RAM devices (or other
excruciatingly slow writers) on their system, and they would disagree
violently :-)

It's been discussed here on lkml many times in the past, but that's
years behind us now. Thankfully your pdflush work got rid of that
embarrassment. But it definitely does matter, to real ordinary users.

--
Jens Axboe
* Re: RFC - how to balance Dirty+Writeback in the face of slow writeback. 2006-08-18 7:03 ` Jens Axboe @ 2006-08-18 7:11 ` Andrew Morton 2006-08-18 18:57 ` Andi Kleen 1 sibling, 0 replies; 37+ messages in thread From: Andrew Morton @ 2006-08-18 7:11 UTC (permalink / raw) To: Jens Axboe; +Cc: David Chinner, Neil Brown, linux-kernel On Fri, 18 Aug 2006 09:03:15 +0200 Jens Axboe <axboe@suse.de> wrote: > On Thu, Aug 17 2006, Andrew Morton wrote: > > It seems that the many-writers-to-different-disks workloads don't happen > > very often. We know this because > > > > a) The 2.4 performance is utterly awful, and I never saw anybody > > complain and > > Talk to some of the people that used DVD-RAM devices (or other > excruciatingly slow writers) on their system, and they would disagree > violently :-) umm, OK, I guess that has the same cause: buffer_heads from different devices all on the same single queue. In this case the problem is that one device is slow. In the same-speed-devices case the problem is that all writeback threads get stuck on the same device, allowing others to go idle. ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: RFC - how to balance Dirty+Writeback in the face of slow writeback. 2006-08-18 7:03 ` Jens Axboe 2006-08-18 7:11 ` Andrew Morton @ 2006-08-18 18:57 ` Andi Kleen 2006-08-21 0:35 ` Neil Brown 1 sibling, 1 reply; 37+ messages in thread From: Andi Kleen @ 2006-08-18 18:57 UTC (permalink / raw) To: Jens Axboe; +Cc: David Chinner, Neil Brown, linux-kernel, akpm Jens Axboe <axboe@suse.de> writes: > On Thu, Aug 17 2006, Andrew Morton wrote: > > It seems that the many-writers-to-different-disks workloads don't happen > > very often. We know this because > > > > a) The 2.4 performance is utterly awful, and I never saw anybody > > complain and > > Talk to some of the people that used DVD-RAM devices (or other > excruciatingly slow writers) on their system, and they would disagree > violently :-) I hit this recently while doing backups to a slow external USB disk. The system was quite unusable (some commands blocked for over a minute) -Andi ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: RFC - how to balance Dirty+Writeback in the face of slow writeback. 2006-08-18 18:57 ` Andi Kleen @ 2006-08-21 0:35 ` Neil Brown 2006-08-21 3:15 ` David Chinner 0 siblings, 1 reply; 37+ messages in thread From: Neil Brown @ 2006-08-21 0:35 UTC (permalink / raw) To: Andi Kleen; +Cc: Jens Axboe, David Chinner, linux-kernel, akpm On August 18, ak@suse.de wrote: > Jens Axboe <axboe@suse.de> writes: > > > On Thu, Aug 17 2006, Andrew Morton wrote: > > > It seems that the many-writers-to-different-disks workloads don't happen > > > very often. We know this because > > > > > > a) The 2.4 performance is utterly awful, and I never saw anybody > > > complain and > > > > Talk to some of the people that used DVD-RAM devices (or other > > excruciatingly slow writers) on their system, and they would disagree > > violently :-) > > I hit this recently while doing backups to a slow external USB disk. > The system was quite unusable (some commands blocked for over a minute) Ouch. I suspect we are going to see more of this, as USB drive for backups is probably a very attractive option for many. The 'obvious' solution would be to count dirty pages per backing_dev and rate limit writes based on this. But counting pages can be expensive. I wonder if there might be some way to throttle the required writes without doing too much counting. Could we watch when the backing_dev is congested and use that? e.g. When Dirty+Writeback is between max_dirty/2 and max_dirty, balance_dirty_pages waits until mapping->backing_dev_info is not congested. That might slow things down, but it is hard to know if it would slow things down the right amount... Given that large machines are likely to have lots of different backing_devs, maybe counting all the dirty pages per backing_dev wouldn't be too expensive? NeilBrown ^ permalink raw reply [flat|nested] 37+ messages in thread
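Neil's congestion-based idea above can be sketched as a small decision
function.  This is a userspace model only, not kernel code; the function
and parameter names are invented for illustration:

```c
#include <stdbool.h>

/*
 * Model of the suggestion above: below max_dirty/2 nobody waits;
 * between max_dirty/2 and max_dirty a writer waits only while its own
 * backing_dev is congested; at or above max_dirty everybody waits.
 * All names here are illustrative, not the kernel's.
 */
bool writer_should_wait(long dirty_plus_writeback, long max_dirty,
                        bool bdi_congested)
{
    if (dirty_plus_writeback >= max_dirty)
        return true;                /* hard limit: always throttle */
    if (dirty_plus_writeback >= max_dirty / 2)
        return bdi_congested;       /* soft range: per-device wait */
    return false;                   /* plenty of clean memory left */
}
```

The appeal is that the only per-device state consulted is the existing
congestion flag, so no extra page counting is needed; the cost, as Neil
notes, is that how much this slows a writer down is hard to predict.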
* Re: RFC - how to balance Dirty+Writeback in the face of slow writeback.
  2006-08-21  0:35 ` Neil Brown
@ 2006-08-21  3:15   ` David Chinner
  2006-08-21  7:24     ` Neil Brown
  2006-08-21  7:47     ` Andi Kleen
  0 siblings, 2 replies; 37+ messages in thread
From: David Chinner @ 2006-08-21  3:15 UTC (permalink / raw)
  To: Neil Brown; +Cc: Andi Kleen, Jens Axboe, David Chinner, linux-kernel, akpm

On Mon, Aug 21, 2006 at 10:35:31AM +1000, Neil Brown wrote:
> On August 18, ak@suse.de wrote:
> > Jens Axboe <axboe@suse.de> writes:
> > > On Thu, Aug 17 2006, Andrew Morton wrote:
> > > > It seems that the many-writers-to-different-disks workloads don't happen
> > > > very often.  We know this because
> > > >
> > > > a) The 2.4 performance is utterly awful, and I never saw anybody
> > > >    complain and
> > >
> > > Talk to some of the people that used DVD-RAM devices (or other
> > > excruciatingly slow writers) on their system, and they would disagree
> > > violently :-)
> >
> > I hit this recently while doing backups to a slow external USB disk.
> > The system was quite unusable (some commands blocked for over a minute)
>
> Ouch.
> I suspect we are going to see more of this, as USB drive for backups
> is probably a very attractive option for many.

I can't see how this would occur on a 2.6 kernel unless the problem is
that all the reclaimable memory in the machine is dirty page cache
pages and every allocation is blocking waiting for writeback to the
slow device to occur.  That is, we filled memory with dirty pages
before we got to the throttle threshold.

> The 'obvious' solution would be to count dirty pages per backing_dev
> and rate limit writes based on this.
> But counting pages can be expensive.  I wonder if there might be some
> way to throttle the required writes without doing too much counting.

I don't think we want to count pages here.

My "obvious" solution is a per-backing-dev throttle threshold, just
like we have per-backing-dev readahead parameters....
That is, we allow a per-block-dev value to be set that overrides the global setting for that blockdev only. Hence for slower devices we can set the point at which we throttle at a much lower dirty memory threshold when that block device is congested. > Could we watch when the backing_dev is congested and use that? > e.g. > When Dirty+Writeback is between max_dirty/2 and max_dirty, > balance_dirty_pages waits until mapping->backing_dev_info > is not congested. The problem with that approach is that writeback_inodes() operates on "random" block devices, not necessarily the one we are trying to write to We don't care what bdi we start write back on - we just want some dirty pages to come clean. If we can't write the number of pages we wanted to, that means all bdi's are congested, and we then wait for one to become uncongested so we can push more data into it. Hence waiting on a specific bdi to become uncongested is the wrong thing to do because we could be cleaning pages on a different, uncongested bdi instead of waiting. A per-bdi throttle threshold will have the effect of pushing out pages on faster block devs earlier than they would otherwise be pushed out, but that will only occur if we are writing to a slower block device. Also, only the slower bdi will be subject to this throttling, so it won't get as much memory dirty as the faster devices.... > That might slow things down, but it is hard to know if it would slow > things down the right amount... > > Given that large machines are likely to have lots of different > backing_devs, maybe counting all the dirty pages per backing_dev > wouldn't be too expensive? Consider 1024p machines writing in parallel at >10GB/s write speeds to a single filesystem (i.e. single bdi). Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group ^ permalink raw reply [flat|nested] 37+ messages in thread
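Dave's per-backing-dev threshold could be modelled as a simple override
that falls back to the global setting, analogous to per-bdi readahead
tunables.  The struct and field names below are invented for
illustration; this is a sketch, not the kernel's API:

```c
/*
 * Sketch of the per-backing-dev threshold idea: each backing_dev may
 * carry its own dirty-memory threshold that overrides the global one,
 * the way per-bdi readahead settings override the default.
 */
struct bdi_model {
    long dirty_thresh;      /* 0 means "inherit the global threshold" */
};

long effective_dirty_thresh(const struct bdi_model *bdi, long global_thresh)
{
    if (bdi->dirty_thresh > 0)
        return bdi->dirty_thresh;   /* slow device: throttle earlier */
    return global_thresh;
}
```

A slow DVD-RAM or USB 1.x disk would get a low override, so writers to
it throttle early, while fast devices keep the global default.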
* Re: RFC - how to balance Dirty+Writeback in the face of slow writeback. 2006-08-21 3:15 ` David Chinner @ 2006-08-21 7:24 ` Neil Brown 2006-08-21 13:51 ` Jens Axboe 2006-08-21 14:28 ` David Chinner 2006-08-21 7:47 ` Andi Kleen 1 sibling, 2 replies; 37+ messages in thread From: Neil Brown @ 2006-08-21 7:24 UTC (permalink / raw) To: David Chinner; +Cc: Andi Kleen, Jens Axboe, linux-kernel, akpm On Monday August 21, dgc@sgi.com wrote: > On Mon, Aug 21, 2006 at 10:35:31AM +1000, Neil Brown wrote: > > On August 18, ak@suse.de wrote: > > > Jens Axboe <axboe@suse.de> writes: > > > > > > > On Thu, Aug 17 2006, Andrew Morton wrote: > > > > > It seems that the many-writers-to-different-disks workloads don't happen > > > > > very often. We know this because > > > > > > > > > > a) The 2.4 performance is utterly awful, and I never saw anybody > > > > > complain and > > > > > > > > Talk to some of the people that used DVD-RAM devices (or other > > > > excruciatingly slow writers) on their system, and they would disagree > > > > violently :-) > > > > > > I hit this recently while doing backups to a slow external USB disk. > > > The system was quite unusable (some commands blocked for over a minute) > > > > Ouch. > > I suspect we are going to see more of this, as USB drive for backups > > is probably a very attractive option for many. > > I can't see how this would occur on a 2.6 kernel unless the problem is > that all the reclaimable memory in the machine is dirty page cache pages > every allocation is blocking waiting for writeback to the slow device to > occur. That is, we filled memory with dirty pages before we got to the > throttle threshold. I started writing a longish reply to this explaining how maybe that could happen, and then realised I had been missing important aspects of the code. writeback_inodes doesn't just work on any random device as both you and I thought. 
The 'writeback_control' structure identifies the bdi to be flushed and
it will only call __writeback_sync_inode on inodes with the same bdi.

This means that any process writing to a particular bdi should throttle
against the queue limits in the bdi once we pass the dirty threshold.
This means that it shouldn't be able to fill memory with dirty pages
for that device (unless the bdi doesn't have a queue limit like
nfs...).

But now I see another way that Jens' problem could occur... maybe.

Suppose the total Dirty+Writeback exceeds the threshold due entirely to
the slow device, and it is slowly working its way through the writeback
pages.  We write to some other device, make 'ratelimit_pages' pages
dirty and then hit balance_dirty_pages.  We now need to either get the
total dirty pages below the threshold or start writeback on
1.5 * ratelimit_pages pages.  As we only have 'ratelimit_pages' dirty
pages we cannot start writeback on enough, and so must wait until
Dirty+Writeback drops below the threshold.  And as we are waiting on
the slow device, that could take a while (especially as it is possible
that no-one is calling balance_dirty_pages against that bdi).

I think this was the reason for the interesting extra patch that I
mentioned we have in the SuSE kernel.  The effect of that patch is to
break out of balance_dirty_pages as soon as Dirty hits zero.  This
should stop a slow device from blocking other traffic but has
unfortunate side effects when combined with nfs which doesn't limit its
writeback queue.

Jens:  Was it a SuSE kernel or a mainline kernel on which you
experienced this slowdown with an external USB drive?

> > The 'obvious' solution would be to count dirty pages per backing_dev
> > and rate limit writes based on this.
> > But counting pages can be expensive.  I wonder if there might be some
> > way to throttle the required writes without doing too much counting.
>
> I don't think we want to count pages here.
> My "obvious" solution is a per-backing-dev throttle threshold, just
> like we have per-backing-dev readahead parameters....
>
> That is, we allow a per-block-dev value to be set that overrides the
> global setting for that blockdev only. Hence for slower devices
> we can set the point at which we throttle at a much lower dirty
> memory threshold when that block device is congested.

I don't think this would help.  The bdi with the higher threshold could
exclude bdis with lower thresholds from making any forward progress.

Here is a question:  Seeing that
    wbc.nonblocking == 0
    wbc.older_than_this == NULL
    wbc.range_cyclic == 0
in balance_dirty_pages when it calls writeback_inodes, under what
circumstances will writeback_inodes return with wbc.nr_to_write > 0 ??

If a write error occurs it could abort early, but otherwise I think it
will only exit early if it runs out of pages to write, because there
aren't any dirty pages.

If that is true, then after calling writeback_inodes once,
balance_dirty_pages should just exit.  It isn't going to find any more
work to do next time it is called anyway.  Either the queue was never
congested, in which case we don't need to throttle writes, or it
blocked for a while waiting for the queue to clean (in ->writepage) and
so has successfully throttled writes.

So my feeling (at the moment) is that balance_dirty_pages should look
like:

    if below threshold
        return
    writeback_inodes({.bdi = mapping->backing_dev_info})
    while (above threshold + 10%)
        writeback_inodes({.bdi = NULL})
        blk_congestion_wait()

and all bdis should impose a queue limit.

This would limit the extent to which different bdis can interfere with
each other, and make the role of writeback_inodes clear (especially
with a nice big comment).

Then we just need to deal with the case where the sum of the queue
limits of all devices exceeds the dirty threshold....
Maybe writeout queues need to auto-adjust their queue length when some
system-wide situation is detected.... sounds messy.
NeilBrown ^ permalink raw reply [flat|nested] 37+ messages in thread
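The control flow Neil sketches can be walked through with a toy
userspace model.  The decrements below stand in for writeback_inodes()
and the wait counter for blk_congestion_wait(); every name is
illustrative, nothing here is kernel code:

```c
/*
 * Userspace model of the balance_dirty_pages() shape proposed above:
 * write back the caller's own bdi once, then loop on global writeback
 * only while the system is more than 10% over the threshold.
 */
struct vm_model {
    long dirty;          /* total dirty pages, all devices */
    long own_bdi_dirty;  /* dirty pages on the writer's own bdi */
    long thresh;         /* global dirty threshold */
};

int balance_model(struct vm_model *vm, long write_chunk)
{
    int waits = 0;
    long n;

    if (vm->dirty <= vm->thresh)
        return waits;

    /* writeback_inodes({.bdi = mapping->backing_dev_info}) */
    n = vm->own_bdi_dirty < write_chunk ? vm->own_bdi_dirty : write_chunk;
    vm->own_bdi_dirty -= n;
    vm->dirty -= n;

    while (vm->dirty > vm->thresh + vm->thresh / 10) {
        vm->dirty -= write_chunk;   /* writeback_inodes(.bdi = NULL) */
        if (vm->dirty < 0)
            vm->dirty = 0;
        waits++;                    /* blk_congestion_wait(WRITE, HZ/10) */
    }
    return waits;
}
```

The point of the shape is visible in the model: the writer always
flushes its own bdi first, and only pays global congestion waits while
the whole system is well over the threshold.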
* Re: RFC - how to balance Dirty+Writeback in the face of slow writeback.
  2006-08-21  7:24 ` Neil Brown
@ 2006-08-21 13:51   ` Jens Axboe
  2006-08-25  4:36     ` Neil Brown
  2006-08-21 14:28   ` David Chinner
  1 sibling, 1 reply; 37+ messages in thread
From: Jens Axboe @ 2006-08-21 13:51 UTC (permalink / raw)
  To: Neil Brown; +Cc: David Chinner, Andi Kleen, linux-kernel, akpm

On Mon, Aug 21 2006, Neil Brown wrote:
> Jens:  Was it a SuSE kernel or a mainline kernel on which you
> experienced this slowdown with an external USB drive?

Note that this was in the old days, on 2.4 kernels.  It was (and still
is) a generic 2.4 problem, which is quite apparent when you have larger
slow devices.  Larger, because then you can have a lot of dirty memory
in flight for that device.  The case I most often saw reported was on
DVD-RAM atapi or scsi devices, which write at 3-400kb/sec.  An external
usb hard drive over usb 1.x would be almost as bad, I suppose.  I
haven't heard any complaints for 2.6 in this area for a long time.

> Then we just need to deal with the case where the sum of the queue
> limits of all devices exceeds the dirty threshold....
> Maybe writeout queues need to auto-adjust their queue length when some
> system-wide situation is detected.... sounds messy.

Queue length is a little tricky; it's basically controlled by two
parameters - nr_requests and max_sectors_kb.  Most SATA drives can do
32MiB requests, so in theory a system that sets max_sectors_kb to
max_hw_sectors_kb and retains a default nr_requests of 128 can see up
to 32 * (128 * 3) / 2 == 6144MiB per disk in flight.  Ouch.  By default
we only allow 512KiB per request, which brings us to a more reasonable
96MiB per disk.

But these numbers are in no way tied to the hardware.  It may be
totally reasonable to have 3GiB of dirty data on one system, and it may
be totally unreasonable to have 96MiB of dirty data on another.
I've always thought that assuming any kind of reliable throttling at the queue level is broken and that the vm should handle this completely. -- Jens Axboe ^ permalink raw reply [flat|nested] 37+ messages in thread
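Jens' back-of-envelope arithmetic can be captured in one helper.  The
function name is made up; the formula is just his numbers: maximum
request size times nr_requests, plus the extra 50% the block layer may
allocate for batching (the "* 3 / 2"):

```c
/*
 * Worst-case dirty data in flight per disk, in MiB, from the queue
 * tunables Jens cites: max request size (KiB) * nr_requests * 3/2.
 */
long max_inflight_mib(long max_sectors_kb, long nr_requests)
{
    return max_sectors_kb * nr_requests * 3 / 2 / 1024;
}
```

With 32MiB requests and nr_requests == 128 this gives the 6144MiB
figure; with the 512KiB default it gives the 96MiB figure.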
* Re: RFC - how to balance Dirty+Writeback in the face of slow writeback.
  2006-08-21 13:51 ` Jens Axboe
@ 2006-08-25  4:36   ` Neil Brown
  2006-08-25  6:37     ` Jens Axboe
  2006-08-25 13:16     ` Trond Myklebust
  0 siblings, 2 replies; 37+ messages in thread
From: Neil Brown @ 2006-08-25  4:36 UTC (permalink / raw)
  To: Jens Axboe; +Cc: David Chinner, Andi Kleen, linux-kernel, akpm

On Monday August 21, axboe@suse.de wrote:
>
> But these numbers are in no way tied to the hardware. It may be totally
> reasonable to have 3GiB of dirty data on one system, and it may be
> totally unreasonable to have 96MiB of dirty data on another. I've always
> thought that assuming any kind of reliable throttling at the queue level
> is broken and that the vm should handle this completely.

I keep changing my mind about this.  Sometimes I see it that way,
sometimes it seems very sensible for throttling to happen at the device
queue.

Can I ask a question:  Why do we have a 'nr_requests' maximum?  Why not
just allocate request structures whenever a request is made?
Is there some reason relating to making the block layer work more
efficiently, or is it just because the VM requires it?

I'm beginning to think that the current scheme really works very well -
except for a few 'bugs'(*).

The one change that might make sense would be for the VM to be able to
tune the queue size of each backing dev.  Exactly how that would work
I'm not sure, but the goal would be to get the sum of the active queue
sizes to about one half of dirty_threshold.

The 'bugs' I am currently aware of are:
 - nfs doesn't put a limit on the request queue
 - the ext3 journal often writes out dirty data without clearing the
   Dirty flag on the page - so the nr_dirty count ends up wrong.
   ext3 writes the buffers out and marks them clean.
   So when the VM tries to flush a page, it finds all the buffers are
   clean and so marks the page clean, so the nr_dirty count eventually
   gets correct again, but I think this can cause write throttling to
   be very unfair at times.

I think we need a queue limit on NFS requests.....

NeilBrown

^ permalink raw reply	[flat|nested] 37+ messages in thread
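The tuning Neil floats - the VM sizing the active device queues so that
together they cover about half of dirty_threshold - reduces to a tiny
calculation.  This is a toy, not a proposal for real kernel code:

```c
/*
 * Toy version of the queue auto-tuning idea: give each active device
 * queue an equal share of half the dirty threshold, so the queue
 * limits together cannot pin more than ~dirty_thresh/2 pages.
 */
long per_queue_limit(long dirty_thresh, int active_queues)
{
    long budget = dirty_thresh / 2;   /* half the threshold, in pages */

    if (active_queues <= 1)
        return budget;
    return budget / active_queues;
}
```

The hard part, which the model skips entirely, is deciding what counts
as an "active" queue and resizing limits as devices come and go.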
* Re: RFC - how to balance Dirty+Writeback in the face of slow writeback.
  2006-08-25  4:36 ` Neil Brown
@ 2006-08-25  6:37   ` Jens Axboe
  2006-08-28  1:28     ` David Chinner
  1 sibling, 1 reply; 37+ messages in thread
From: Jens Axboe @ 2006-08-25  6:37 UTC (permalink / raw)
  To: Neil Brown; +Cc: David Chinner, Andi Kleen, linux-kernel, akpm

On Fri, Aug 25 2006, Neil Brown wrote:
> On Monday August 21, axboe@suse.de wrote:
> >
> > But these numbers are in no way tied to the hardware. It may be totally
> > reasonable to have 3GiB of dirty data on one system, and it may be
> > totally unreasonable to have 96MiB of dirty data on another. I've always
> > thought that assuming any kind of reliable throttling at the queue level
> > is broken and that the vm should handle this completely.
>
> I keep changing my mind about this.  Sometimes I see it that way,
> sometimes it seems very sensible for throttling to happen at the
> device queue.
>
> Can I ask a question:  Why do we have a 'nr_requests' maximum?  Why
> not just allocate request structures whenever a request is made?
> Is there some reason relating to making the block layer work more
> efficiently, or is it just because the VM requires it?

It's by and large because the vm requires it.  Historically the limit
was there because the requests were statically allocated.  Later the
limit helped bound runtimes for the io scheduler, since the merge and
sort operations were O(N) each.  Right now any of the io schedulers can
handle larger numbers of requests without breaking a sweat, but the vm
gets pretty nasty if you set (eg) 8192 requests as your limit.

The limit is also handy for avoiding filling memory with request
structures.  At some point there's little benefit to doing larger
queues, depending on the workload and hardware.  128 is usually a
pretty fair number, so...

> I'm beginning to think that the current scheme really works very well
> - except for a few 'bugs'(*).
It works ok, but it makes it hard to experiment with larger queue depths when the vm falls apart :-). It's not a big deal, though, even if the design isn't very nice - nr_requests is not a well defined entity. It can be anywhere from 512b to megabyte(s) in size. So throttling on X number of requests tends to be pretty vague and depends hugely on the workload (random vs sequential IO). -- Jens Axboe ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: RFC - how to balance Dirty+Writeback in the face of slow writeback. 2006-08-25 6:37 ` Jens Axboe @ 2006-08-28 1:28 ` David Chinner 0 siblings, 0 replies; 37+ messages in thread From: David Chinner @ 2006-08-28 1:28 UTC (permalink / raw) To: Jens Axboe; +Cc: Neil Brown, David Chinner, Andi Kleen, linux-kernel, akpm On Fri, Aug 25, 2006 at 08:37:24AM +0200, Jens Axboe wrote: > On Fri, Aug 25 2006, Neil Brown wrote: > > > I'm beginning to think that the current scheme really works very well > > - except for a few 'bugs'(*). > > It works ok, but it makes it hard to experiment with larger queue depths > when the vm falls apart :-). It's not a big deal, though, even if the > design isn't very nice - nr_requests is not a well defined entity. It > can be anywhere from 512b to megabyte(s) in size. So throttling on X > number of requests tends to be pretty vague and depends hugely on the > workload (random vs sequential IO). So maybe we need a different control parameter - the amount of memory we allow to be backed up in a queue rather than the number of requests the queue can take... Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group ^ permalink raw reply [flat|nested] 37+ messages in thread
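Dave's alternative control parameter - cap the memory backed up in a
queue rather than the request count - can be sketched as a byte-budget
admission check.  The struct and function names below are invented for
illustration; real request queues would need locking and a wakeup path:

```c
#include <stdbool.h>

/*
 * Sketch of throttling on queued bytes instead of queued requests, so
 * a queue full of 512-byte requests and one full of megabyte requests
 * hold comparable amounts of dirty data.
 */
struct byte_queue {
    long inflight_bytes;    /* data currently queued to the device */
    long limit_bytes;       /* cap on queued data, not on requests */
};

bool byte_queue_admit(struct byte_queue *q, long req_bytes)
{
    if (q->inflight_bytes + req_bytes > q->limit_bytes)
        return false;       /* caller must wait for completions */
    q->inflight_bytes += req_bytes;
    return true;
}
```

This directly addresses Jens' complaint that nr_requests is vague: the
bound becomes a well-defined amount of memory regardless of whether the
workload is random or sequential.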
* Re: RFC - how to balance Dirty+Writeback in the face of slow writeback. 2006-08-25 4:36 ` Neil Brown 2006-08-25 6:37 ` Jens Axboe @ 2006-08-25 13:16 ` Trond Myklebust 2006-08-27 8:21 ` Neil Brown 1 sibling, 1 reply; 37+ messages in thread From: Trond Myklebust @ 2006-08-25 13:16 UTC (permalink / raw) To: Neil Brown; +Cc: Jens Axboe, David Chinner, Andi Kleen, linux-kernel, akpm On Fri, 2006-08-25 at 14:36 +1000, Neil Brown wrote: > The 'bugs' I am currently aware of are: > - nfs doesn't put a limit on the request queue > - the ext3 journal often writes out dirty data without clearing > the Dirty flag on the page - so the nr_dirty count ends up wrong. > ext3 writes the buffers out and marks them clean. So when > the VM tried to flush a page, it finds all the buffers are clean > and so marks the page clean, so the nr_dirty count eventually > gets correct again, but I think this can cause write throttling to > be very unfair at times. > > I think we need a queue limit on NFS requests..... That is simply not happening until someone can give a cogent argument for _why_ it is necessary. Such a cogent argument must, among other things, allow us to determine what would be a sensible queue limit. It should also point out _why_ the filesystem should be doing this instead of the VM. Furthermore, I'd like to point out that NFS has a "third" state for pages: following an UNSTABLE write the data on them is marked as 'uncommitted'. Such pages are tracked using the NR_UNSTABLE_NFS counter. The question is: if we want to set limits on the write queue, what does that imply for the uncommitted writes? If you go back and look at the 2.4 NFS client, we actually had an arbitrary queue limit. That limit covered the sum of writes+uncommitted pages. Performance sucked, 'cos we were not able to use server side caching efficiently. 
The number of COMMIT requests (causes the server to fsync() the client's data to disk) on the wire kept going through the roof as we tried to free up pages in order to satisfy the hard limit. For those reasons and others, the filesystem queue limit was removed for 2.6 in favour of allowing the VM to control the limits based on its extra knowledge of the state of global resources. Trond ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: RFC - how to balance Dirty+Writeback in the face of slow writeback. 2006-08-25 13:16 ` Trond Myklebust @ 2006-08-27 8:21 ` Neil Brown 0 siblings, 0 replies; 37+ messages in thread From: Neil Brown @ 2006-08-27 8:21 UTC (permalink / raw) To: Trond Myklebust; +Cc: Jens Axboe, David Chinner, Andi Kleen, linux-kernel, akpm On Friday August 25, trond.myklebust@fys.uio.no wrote: > On Fri, 2006-08-25 at 14:36 +1000, Neil Brown wrote: > > The 'bugs' I am currently aware of are: > > - nfs doesn't put a limit on the request queue > > - the ext3 journal often writes out dirty data without clearing > > the Dirty flag on the page - so the nr_dirty count ends up wrong. > > ext3 writes the buffers out and marks them clean. So when > > the VM tried to flush a page, it finds all the buffers are clean > > and so marks the page clean, so the nr_dirty count eventually > > gets correct again, but I think this can cause write throttling to > > be very unfair at times. > > > > I think we need a queue limit on NFS requests..... > > That is simply not happening until someone can give a cogent argument > for _why_ it is necessary. Such a cogent argument must, among other > things, allow us to determine what would be a sensible queue limit. It > should also point out _why_ the filesystem should be doing this instead > of the VM. Well, I'm game.... let's see how I do ... (Hmmm. 290 lines. Maybe I did too well :-). Firstly - what would be a sensible queue limit? To a large extent the size isn't very important. It needs to be big enough that you can have a reasonable number of concurrent in-flight requests - this is something that only the NFS client can determine. It needs to be small enough to not be able to use up too much memory. This is something that the VM should impose, but doesn't yet. Something like a few percent of memory would probably be the right ball park. I agree this is something the VM should be responsible for, but isn't yet. 
However I don't think it is a big part of the picture.

So:  why is it necessary?

How about "Because that's how the Linux VM works".  I can see that is
not very satisfactory, and it isn't the whole answer, but I think it is
worth saying.  There are probably several ways to do effective write
throttling.  The Linux VM (currently) uses push-back from the writeout
queue to achieve write throttling, so we really need all writeout
mechanisms to provide some pushback.  Maybe the VM could be changed,
but I think it would be a big change and not something to do lightly -
partly because I think the current system works (for the most part)
very well.

So:  how does it work, why is it so clever, and how does it depend on
push-back?  It is actually quite subtle...

Write throttling needs to slow down the generation of dirty pages.  (It
also could speed up the cleaning of dirty pages, but there is a limit
to how fast cleaning can happen, so the majority of the control imposed
is the slowing of generation of dirt).  It needs to do this in a way
that is 'fair' in some (fairly coarse) sense.  Possibly the best
understanding of fairness is that - for any process - the delay imposed
is proportional to the cost generated.  i.e. generating more dirty
pages imposes more delay.  Generating dirty pages on slower devices
imposes more delay.

When the fraction of dirty pages is below some threshold (40% by
default) no throttling is imposed.  When we cross that threshold we
switch on write throttling.

The simplest approach might be to block all writers and push out pages
until we are below the threshold again.  This isn't fair in the above
sense as all dirtying processes are stopped for the same period of
time, independent of how much dirt they have created.

An 'obvious' response to that problem is to do lots of 'bean counting'.
i.e. count how many pages are generated by each process per unit time,
and count how many pages are dirty for each writeout queue, and impose
delays accordingly.
However bean counting can hurt performance, even when it isn't needed
(note the effort that is taken to minimise the performance impact of
counting 'Dirty' and 'Writeback' pages etc.  We don't really want lots
of fine grained counting).

Another 'obvious' response might be 'if you have dirtied N pages, then
wait until N pages have become clean'.  However that isn't easy to do
because if multiple processes are waiting for N pages to be cleaned,
you would need to credit each page as it becomes clean to some process,
and that is just more bean counting.

So what we actually do is notice that there are two sorts of dirty
pages:  those marked 'Dirty' and those marked 'Writeback'.  Further we
know there is some upper limit to the number of Writeback pages (this
is where the queue limit comes in).  So the penalty imposed on a
process that has dirtied N pages is "You must transition 1.5*N pages
from 'Dirty' to 'Writeback'".  It is only required to do that on the
writeout queue that it is using, so a process writing to a fast device
shouldn't be slowed down by the presence of a slow device.

The first few processes might not find this to be much penalty as they
will just be filling up the writeout queue.  But soon the queues will
be full and write throttling will have its intended effect.  So in the
steady state, for every N pages that are dirtied, a process needs to
wait for 1.5*N pages to become clean on the same device, which sounds
reasonably fair.  Naturally this will push the number of dirty pages
steadily down until we are below the threshold, and then it will be
free rein again.  Thus under heavy write load, the system can expect to
oscillate between just above and just below the threshold.

So you can imagine - if some writeout queue does not put a bound on its
size (or has a size that is a substantial fraction of memory) then the
above will simply not work.
Transitioning pages from Dirty to Writeback will not impose a delay so the number of Writeback pages will continue to grow. Could the VM impose this limit itself? Probably yes. But the writeout queue driver is in a perfect position to keep track of the number of outstanding requests, so it may as well impose the limit. The current (unstated) internal API for Linux says that the writeout queue must impose the limit, so NFS should do so too. Is that cogent enough? I hope so. I haven't forgotten about 'unstable'. I'll get to that a little further down, but while that description is fresh in your mind.... So, if this is such a glaring problem with NFS, why aren't more people having problems? This is a very good question and one I only just found an answer to - quite an interesting answer I think. We have had quite a few customers report problems. They all seem to be using very big machines (8Gig plus). However I couldn't come close to duplicating the problem on the biggest machine I have easy access to which has 6Gig. I now think the memory size is only part of the issue. The other side of the issue is the NFS server. I had always been writing to a Linux NFS server. And, I'm sorry to admit, the Linux NFS server isn't ideal. In particular: In NFSv3 the response to a WRITE request (and other requests) contain post-op attributes and (optionally) pre-op attributes. The pre-op attributes are only allowed if the server can guarantee that the WRITE operation was the only operation to be performed on the file between the moment when 'pre-op' were valid, and the moment when 'post-op' were valid. The idea is that the Client sends a WRITE request, gets a reply and if the pre-op attributes are present and match the client's cache, then no-one else has touched the file, the post-op attributes are now valid, and the clients cache is otherwise certain to be up-to-date. 
The Linux NFS server doesn't send pre-op attributes (because it cannot
hold a lock on the file while doing a write ... probably a fixable
problem with leases or something).  So when the client writes to the
NFS server it always gets a response that says effectively "Someone
else might have written to that file recently".

In nfs_file_write there is (or was until very recently) a call to
nfs_revalidate_mapping which will flush out all changed pages if there
is some doubt about the contents of the cache.  So on a Linux client
(2.6.17 or prior) writing to a Linux server, you will find the writer
process almost always in invalidate_inode_pages due to this call.
I.e. it is flushing out data quite independent of write throttling.
(Write throttling triggers the first write, that causes the caches to
be doubtful, and then the next write will flush the cache).

Writing to a different NFS server (e.g. a NetApp filer I expect - I
don't have complete server details on all my bug reports), the pre-op
attributes could be present, and you get a blow-out in the size of the
Writeback list.

Having noticed the recent change in nfs_file_write, I tried on
2.6.17-rc4-mm2 in a qemu instance and the Writeback count steadily grew
while the 'Dirty' count went down to zero.  Then Writeback dropped back
to around 40% while Dirty stayed at zero (as balance_dirty_pages has a
fall back - if you cannot write out 1.5*N, then wait until the number
of dirty pages is below the threshold).

On a small memory machine this doesn't cause much of a problem.  On a
large machine with millions of Writeback pages all on the inactive
list, things slow to a crawl.

[I'm curious about why the invalidate_inode_pages call was removed from
nfs_file_write... presumably the cache inconsistency is still
potentially there, so clean pages should be purged.  Whether dirty
pages should be purged is a different question....]

Despite having said that the current approach is quite clever, I think
there is room for improvement.
It is mostly tuning around the edges.

- I think that when there is a high write load, the system should stay in 'over-threshold' mode rather than oscillating back and forth. e.g. once we go over the threshold we set 'dirty_exceeded' and leave it set until below (say) 90% of the threshold. While dirty_exceeded is true: if over the threshold, the 'write_chunk' (number of pages to be transitioned from Dirty to Writeback) should be 1.5*N; if under, then just 1.0*N. This would mean a fairly predictable steady-state behaviour which would be more likely to be fair.

- We shouldn't fall back to 'wait until below the threshold' when we fail to write out 1.5*N pages. Rather we should realise this means there are no dirty blocks on this device, and just let the process continue - obviously it has done its work. However that fall-back is the only thing currently saving NFS from using up all available pages in its writeout queue, so that change cannot happen until NFS joins the party.

- When 'dirty_exceeded' is true, balance_dirty_pages gets called for every 8 pages that are written, but 'write_chunk' is still set to 3/64 of memory (or 6Meg, whichever is smaller) (that assumes a 1 CPU system). This seems terribly unfair, but should be trivial to fix.

There is one more issue that I feel I should talk about in order to give the full picture on write-throttling. The above treatment only considers the fact that the throttled processes are causing transitions from Dirty to Writeback, and it tries to require an appropriate number of transitions from each. But in fact other processes might be causing that transition and so be using up spots on the writeout queue, thus imposing more throttling.

bdflush also causes writeback on Dirty pages. This will only write out old pages.
I suspect (no measuring attempted) this will either have no effect at all when write-throttling is happening (as no pages will be old enough) or will mean that the slowest device will see more writeout than faster devices (as the oldest pages should belong to the slowest device). This means that writers to the slowest device will have to wait longer than they might expect (because the slots in the writeout queue that they could have taken were taken by bdflush). This should generally push down the number of dirty pages used by the slowest device, which is possibly a good thing.

Another process that causes writeback on Dirty pages is the ext3 journald. It actually writes out the buffers so that when the VM tries to write the page, it finds it is actually clean already. However the journald is still 'stealing' slots in the request queue that the write-throttled processes cannot get. I'm not sure what the net effect of this will be. In my experimenting I did have a scenario where it had a pronounced effect, but I have changed enough code that I don't think it was in any way representative of what the mainline kernel would do.

Anyway, to your other issue...

> Furthermore, I'd like to point out that NFS has a "third" state for
> pages: following an UNSTABLE write the data on them is marked as
> 'uncommitted'. Such pages are tracked using the NR_UNSTABLE_NFS counter.
> The question is: if we want to set limits on the write queue, what does
> that imply for the uncommitted writes?

Excellent question.

In some ways, Unstable pages are like Dirty pages. i.e. there is some operation that needs to be done on them. Either WRITE or COMMIT. This similarity is supported by the fact that the two numbers are added together into 'nr_reclaimable' in balance_dirty_pages.

A first guess might be that the write throttling should transition 1.5*N "dirty or unstable" pages to Writeback (What state are Unstable pages when the commit is happening?
Let's pretend for the moment they are Writeback again, it isn't important for this discussion). However that wouldn't work. For every page that is dirtied, one dirty page needs to be written and one unstable page needs to be committed. So maybe nfs_writepages (which does the final writing I think) should effectively double nr_to_write, as WRITEing is only half of the required work.

But that won't work either. Because sometimes a WRITE can return DATA_STABLE and no commit is needed. So requiring twice as many WRITEs or COMMITs isn't right.

I think that nfs_writepages needs to try to write nr_to_write pages, and try to commit nr_to_write pages, and should decrease nr_to_write by the maximum of the number of pages written and the number of pages committed. (I note that nfs currently doesn't commit individual pages but instead commits the whole file. Obviously if you end up committing more pages than nr_to_write, you wouldn't decrease nr_to_write so much that it goes negative).

However this thought hasn't been thoroughly considered. It might have problems and there might be a better way. The important thing is to generate some requests and to decrease nr_to_write in proportion to the amount of useful work that has been done.

> If you go back and look at the 2.4 NFS client, we actually had an
> arbitrary queue limit. That limit covered the sum of writes+uncommitted
> pages. Performance sucked, 'cos we were not able to use server side
> caching efficiently. The number of COMMIT requests (causes the server to
> fsync() the client's data to disk) on the wire kept going through the
> roof as we tried to free up pages in order to satisfy the hard limit.

Pages that are UNSTABLE shouldn't use slots in the request queue (which you seem to be implying was the case in 2.4). So while we are below the threshold for throttling, commits would only be forced by sync or bdflush (how does that work? The first bdflush pass triggers a WRITE, the next one triggers the COMMIT?).
But once we get above the threshold, we really want to be calling COMMIT quite a lot to get the number of "dirty or writeback or unstable" pages down. So calling COMMIT on every nfs_write_pages that came from balance_dirty_pages would seem fairly appropriate.

> For those reasons and others, the filesystem queue limit was removed for
> 2.6 in favour of allowing the VM to control the limits based on its
> extra knowledge of the state of global resources.

Unfortunately the VM still needs a little help from the writeout queue.

(I think I'm going to write a Documentation file on how writeback works - but would anyone read it....)

NeilBrown

^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: RFC - how to balance Dirty+Writeback in the face of slow writeback. 2006-08-21 7:24 ` Neil Brown 2006-08-21 13:51 ` Jens Axboe @ 2006-08-21 14:28 ` David Chinner 2006-08-25 5:24 ` Neil Brown 1 sibling, 1 reply; 37+ messages in thread From: David Chinner @ 2006-08-21 14:28 UTC (permalink / raw) To: Neil Brown; +Cc: David Chinner, Andi Kleen, Jens Axboe, linux-kernel, akpm On Mon, Aug 21, 2006 at 05:24:14PM +1000, Neil Brown wrote: > On Monday August 21, dgc@sgi.com wrote: > > On Mon, Aug 21, 2006 at 10:35:31AM +1000, Neil Brown wrote: > > > On August 18, ak@suse.de wrote: > > > > Jens Axboe <axboe@suse.de> writes: > > > > > > > > > On Thu, Aug 17 2006, Andrew Morton wrote: > > > > > > It seems that the many-writers-to-different-disks workloads don't happen > > > > > > very often. We know this because > > > > > > > > > > > > a) The 2.4 performance is utterly awful, and I never saw anybody > > > > > > complain and > > > > > > > > > > Talk to some of the people that used DVD-RAM devices (or other > > > > > excruciatingly slow writers) on their system, and they would disagree > > > > > violently :-) > > > > > > > > I hit this recently while doing backups to a slow external USB disk. > > > > The system was quite unusable (some commands blocked for over a minute) > > > > > > Ouch. > > > I suspect we are going to see more of this, as USB drive for backups > > > is probably a very attractive option for many. > > > > I can't see how this would occur on a 2.6 kernel unless the problem is > > that all the reclaimable memory in the machine is dirty page cache pages > > every allocation is blocking waiting for writeback to the slow device to > > occur. That is, we filled memory with dirty pages before we got to the > > throttle threshold. > > I started writing a longish reply to this explaining how maybe that > could happen, and then realised I had been missing important aspects > of the code. 
> > writeback_inodes doesn't just work on any random device as both you > and I thought. The 'writeback_control' structure identifies the bdi > to be flushed and it will only call __writeback_sync_inode on inodes > with the same bdi. Yes, now that you point it out it is obvious :/ > This means that any process writing to a particular bdi should throttle > against the queue limits in the bdi once we pass the dirty threshold. > This means that it shouldn't be able to fill memory with dirty pages > for that device (unless the bdi doesn't have a queue limit like > nfs...). *nod* > But now I see another way that Jens' problem could occur... maybe. > > Suppose the total Dirty+Writeback exceeds the threshold due entirely > to the slow device, and it is slowly working its way through the > writeback pages. > > We write to some other device, make 'ratelimit_pages' pages dirty and > then hit balance_dirty_pages. We now need to either get the total > dirty pages below the threshold or start writeback on > 1.5 * ratelimit_pages > pages. As we only have 'ratelimit_pages' dirty pages we cannot start > writeback on enough, and so must wait until Dirty+Writeback drops > below the threshold. And as we are waiting on the slow device, that > could take a while (especially as it is possible that no-one is > calling balance_dirty_pages against that bdi). Hmm - one thing I just noticed - when we loop after sleeping, we reset wbc->nr_to_write = write_chunk. Hence once we get into this throttle loop, we can't break out until we drop under the threshold even if we manage to write a few more pages. > > That is, we allow a per-block-dev value to be set that overrides the > > global setting for that blockdev only. Hence for slower devices > > we can set the point at which we throttle at a much lower dirty > > memory threshold when that block device is congested. > > > > I don't think this would help. 
The bdi with the higher threshold > could exclude bdis with lower thresholds from making any forward > progress. True - knowing the writeback is not on all superblocks changes the picture a little :/ > Here is a question: > Seeing that > wbc.nonblocking == 0 > wbc.older_than_this == NULL > wbc.range_cyclic == 0 > in balance_dirty_pages when it calls writeback_inodes, under what > circumstances will writeback_inodes return with wbc.nr_to_write > 0 > ?? The page couldn't be written out for some reason and redirty_page_for_writepage() was called. A few filesystems call this in different situations, generally error conditions. In that case, we end up with wbc->pages_skipped increasing rather than wbc->nr_to_write decreasing.... > If a write error occurs it could abort early, but otherwise I think > it will only exit early if it runs out of pages to write, because > there aren't any dirty pages. Yes, I think you're right, Neil. > If that is true, then after calling writeback_inodes once, > balance_dirty_pages should just exit. It isn't going to find any more > work to do next time it is called anyway. > Either the queue was never congested, in which case we don't need to > throttle writes, or it blocked for a while waiting for the queue to > clean (in ->writepage) and so has successfully throttled writes. *nod* > So my feeling (at the moment) is that balance_dirty_pages should look > like: > > if below threshold > return > writeback_inodes({.bdi = mapping->backing_dev_info)} ) > > while (above threshold + 10%) > writeback_inodes(.bdi = NULL) > blk_congestion_wait > > and all bdis should impose a queue limit. I don't really like the "+ 10%" in there - it's too rubbery given the range of memory sizes Linux supports (think of an Altix with several TBs of RAM in it ;). With bdis imposing a queue limit, the number of writeback pages should be bound and so we shouldn't need headroom like this. 
Hmmm - the above could put the writer to sleep on the request queue of the slow device that holds all dirty+writeback. This could effectively slow all writers down to the rate of the slowest device in the system as they all attempt to do blocking writeback on the only dirty bdi (the really slow one).

> Then we just need to deal with the case where the sum of the queue
> limits of all devices exceeds the dirty threshold....
> Maybe writeout queues need to auto-adjust their queue length when some
> system-wide situation is detected.... sounds messy.

Pretty uncommon case, I think. If someone has a system like that then tuning the system is not unreasonable....

AFAICT, all we need to do is prevent interactions between bdis and the current problem is that we loop on clean bdis waiting for slow dirty ones to drain.

My thoughts are along the lines of a decay in nr_to_write between loop iterations when we don't write out enough pages (i.e. clean bdi) so we break out of the loop sooner rather than later.
Something like:

---
 mm/page-writeback.c |   19 +++++++++++++------
 1 file changed, 13 insertions(+), 6 deletions(-)

Index: 2.6.x-xfs-new/mm/page-writeback.c
===================================================================
--- 2.6.x-xfs-new.orig/mm/page-writeback.c	2006-05-29 15:14:10.000000000 +1000
+++ 2.6.x-xfs-new/mm/page-writeback.c	2006-08-21 23:53:40.849788387 +1000
@@ -195,16 +195,17 @@ static void balance_dirty_pages(struct a
 	long dirty_thresh;
 	unsigned long pages_written = 0;
 	unsigned long write_chunk = sync_writeback_pages();
+	unsigned long write_decay = write_chunk >> 2;
 	struct backing_dev_info *bdi = mapping->backing_dev_info;
+	struct writeback_control wbc = {
+		.bdi = bdi,
+		.sync_mode = WB_SYNC_NONE,
+		.older_than_this = NULL,
+		.nr_to_write = write_chunk,
+	};
 
 	for (;;) {
-		struct writeback_control wbc = {
-			.bdi = bdi,
-			.sync_mode = WB_SYNC_NONE,
-			.older_than_this = NULL,
-			.nr_to_write = write_chunk,
-		};
 		get_dirty_limits(&wbs, &background_thresh,
 					&dirty_thresh, mapping);
@@ -231,8 +232,14 @@ static void balance_dirty_pages(struct a
 			pages_written += write_chunk - wbc.nr_to_write;
 			if (pages_written >= write_chunk)
 				break;		/* We've done our duty */
+
+			/* Decay the remainder so we don't get stuck here
+			 * waiting for some other slow bdi to flush */
+			wbc.nr_to_write -= write_decay;
 		}
 		blk_congestion_wait(WRITE, HZ/10);
+		if (wbc.nr_to_write <= 0)
+			break;
 	}
 	if (nr_reclaimable + wbs.nr_writeback <= dirty_thresh && dirty_exceeded)
---

Thoughts?

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group

^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: RFC - how to balance Dirty+Writeback in the face of slow writeback.
2006-08-21 14:28 ` David Chinner
@ 2006-08-25 5:24 ` Neil Brown
2006-08-28 1:55 ` David Chinner
0 siblings, 1 reply; 37+ messages in thread
From: Neil Brown @ 2006-08-25 5:24 UTC (permalink / raw)
To: David Chinner; +Cc: Andi Kleen, Jens Axboe, linux-kernel, akpm

On Tuesday August 22, dgc@sgi.com wrote:
> On Mon, Aug 21, 2006 at 05:24:14PM +1000, Neil Brown wrote:
> > So my feeling (at the moment) is that balance_dirty_pages should look
> > like:
> >
> >   if below threshold
> >      return
> >   writeback_inodes({.bdi = mapping->backing_dev_info)} )
> >
> >   while (above threshold + 10%)
> >      writeback_inodes(.bdi = NULL)
> >      blk_congestion_wait
> >
> > and all bdis should impose a queue limit.
>
> I don't really like the "+ 10%" in there - it's too rubbery given
> the range of memory sizes Linux supports (think of an Altix with
> several TBs of RAM in it ;). With bdis imposing a queue limit, the
> number of writeback pages should be bound and so we shouldn't need
> headroom like this.

I had that there precisely because some BDIs are not bounded - nfs in particular (which is what started this whole thread). I think I'm now convinced that nfs really needs to limit its writeout queue.

> Hmmm - the above could put the writer to sleep on the request queue
> of the slow device that holds all dirty+writeback. This could
> effectively slow all writers down to the rate of the slowest device
> in the system as they all attempt to do blocking writeback on the
> only dirty bdi (the really slow one).
>
> AFAICT, all we need to do is prevent interactions between bdis and
> the current problem is that we loop on clean bdis waiting for slow
> dirty ones to drain.
>
> My thoughts are along the lines of a decay in nr_to_write between
> loop iterations when we don't write out enough pages (i.e. clean
> bdi) so we break out of the loop sooner rather than later.

I don't understand the purpose of the decay.
Once you are sure the bdi is clean, why not break out of the loop straight away?

Also, your code is a little confusing. The "pages_written += write_chunk - wbc.nr_to_write" in the loop assumes that wbc.nr_to_write equalled write_chunk just before the call to writeback_inodes; however, as you have moved the initialisation of wbc out of the loop, this is no longer true.

So I would like us to break out of the loop as soon as there is good reason to believe the bdi is clean. So maybe something like this.. Note that we *must* have a bounded queue on all bdis or this patch can cause substantial badness.

NeilBrown

Signed-off-by: Neil Brown <neilb@suse.de>

### Diffstat output
 ./mm/page-writeback.c |    9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff .prev/mm/page-writeback.c ./mm/page-writeback.c
--- .prev/mm/page-writeback.c	2006-08-25 15:18:37.000000000 +1000
+++ ./mm/page-writeback.c	2006-08-25 15:22:39.000000000 +1000
@@ -187,7 +187,7 @@ static void balance_dirty_pages(struct a
 			.bdi = bdi,
 			.sync_mode = WB_SYNC_NONE,
 			.older_than_this = NULL,
-			.nr_to_write = write_chunk,
+			.nr_to_write = write_chunk - pages_written,
 			.range_cyclic = 1,
 		};
@@ -217,10 +217,13 @@ static void balance_dirty_pages(struct a
 			    global_page_state(NR_WRITEBACK) <= dirty_thresh)
 				break;
 
-			pages_written += write_chunk - wbc.nr_to_write;
+			if (pages_written == write_chunk - wbc.nr_to_write)
+				break;	/* couldn't write - must be clean */
+			pages_written = write_chunk - wbc.nr_to_write;
 			if (pages_written >= write_chunk)
 				break;	/* We've done our duty */
-		}
+		} else
+			break;
 		blk_congestion_wait(WRITE, HZ/10);
 	}

^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: RFC - how to balance Dirty+Writeback in the face of slow writeback. 2006-08-25 5:24 ` Neil Brown @ 2006-08-28 1:55 ` David Chinner 0 siblings, 0 replies; 37+ messages in thread From: David Chinner @ 2006-08-28 1:55 UTC (permalink / raw) To: Neil Brown; +Cc: David Chinner, Andi Kleen, Jens Axboe, linux-kernel, akpm On Fri, Aug 25, 2006 at 03:24:47PM +1000, Neil Brown wrote: > On Tuesday August 22, dgc@sgi.com wrote: > > AFAICT, all we need to do is prevent interactions between bdis and > > the current problem is that we loop on clean bdis waiting for slow > > dirty ones to drain. > > > > My thoughts are along the lines of a decay in nr_to_write between > > loop iterations when we don't write out enough pages (i.e. clean > > bdi) so we break out of the loop sooner rather than later. > > I don't understand the purpose of the decay. Once you are sure the > bdi is clean, why not break out of the loop straight away? Simply to slow down the rate at which any process is dirtying memory. The decay only becomes active when you're writing to a clean device when there are lots of dirty pages on a slow device, otherwise it's a no-op. To illustrate the problem of breaking straight out of the throttle loop, even though we hit the dirty rate limit we may have dirtied pages on multiple bdis but we are only flushing on one of them. Hence we could potentially trigger increasing numbers of dirty pages if we don't back off in some way when throttling here even though the device we throttled on was clean. e.g. Think of writing data to a slow device, then a log entry to a fast device, and every time the write to the fast device triggers the throttling which gets cleaned and we go and dirty more pages on the slow device immediately without throttling.... > Also, your code is a little confusing. The Sorry, it was a quick hack to illustrate my thinking..... > So I would like us to break out of the loop as soon as there is good > reason to believe the bdi is clean. 
Which was exactly my line of thinking, but tempered by the fact that just breaking out of the loop could introduce a nasty problem.... Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: RFC - how to balance Dirty+Writeback in the face of slow writeback. 2006-08-21 3:15 ` David Chinner 2006-08-21 7:24 ` Neil Brown @ 2006-08-21 7:47 ` Andi Kleen 1 sibling, 0 replies; 37+ messages in thread From: Andi Kleen @ 2006-08-21 7:47 UTC (permalink / raw) To: David Chinner; +Cc: Neil Brown, Jens Axboe, linux-kernel, akpm > > Ouch. > > I suspect we are going to see more of this, as USB drive for backups > > is probably a very attractive option for many. > > I can't see how this would occur on a 2.6 kernel I still got the traces to prove it: http://www.firstfloor.org/~andi/usb-loop-copy-stall-1 e.g. notice the lynx which is stuck in a m/atime update. It was stalling for a quite long time. > -Andi ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: RFC - how to balance Dirty+Writeback in the face of slow writeback.
2006-08-18 6:29 ` Andrew Morton
2006-08-18 7:03 ` Jens Axboe
@ 2006-08-18 7:07 ` Neil Brown
1 sibling, 0 replies; 37+ messages in thread
From: Neil Brown @ 2006-08-18 7:07 UTC (permalink / raw)
To: Andrew Morton; +Cc: David Chinner, linux-kernel

On Thursday August 17, akpm@osdl.org wrote:
>
> btw, Neil, has the Pagewriteback windup actually been demonstrated? If so,
> how?

Yes. On large machines (e.g. 16G) just writing to large files (I think. I don't have precise details of the application, but I think in one case it was just iozone). By "large files" I mean larger than memory. This has happened on both SLES9 (2.6.5 based) and SLES10 (2.6.16 based).

We do have an extra patch in balance_dirty_pages which I haven't tracked down the reason for yet. It has the effect of breaking out of the loop once nr_dirty hits 0, which makes the problem hard to recover from. It may even be making it occur more quickly - I'm not sure.

What we see is Pagewriteback at about 10G out of 16G, and Dirty at 0. The whole machine pretty much slows to a halt. There is little free memory so lots of processes end up in 'reclaim' walking the inactive list looking for pages to free up. Most of what they find are in Writeback and so they just skip over them. Skipping 2.6 million pages seems to take a little while.

And there is a kmalloc call in the NFS writeout path (it is actually a mempool_alloc so it will succeed, but (partly) as mempool uses the reserve last instead of first, it always looks for free memory first).

So Pagewriteback is at 60%, memory is tight, nfs write is progressing very slowly and (because of our SuSE-specific patch) balance_dirty_pages isn't throttling anymore, so as soon as nfs does manage to write out a page another appears to replace it. I suspect it is making forward progress, but not very much.
We have a fairly hackish patch in place to limit the NFS writeback on a per-file basis (sysctl tunable), but I wanted to understand the real problem so that a real solution could be found.

NeilBrown

^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: RFC - how to balance Dirty+Writeback in the face of slow writeback.
2006-08-17 4:08 ` Neil Brown
2006-08-17 6:14 ` Andrew Morton
@ 2006-08-17 22:17 ` David Chinner
1 sibling, 0 replies; 37+ messages in thread
From: David Chinner @ 2006-08-17 22:17 UTC (permalink / raw)
To: Neil Brown; +Cc: Andrew Morton, linux-kernel

On Thu, Aug 17, 2006 at 02:08:58PM +1000, Neil Brown wrote:
> On Wednesday August 16, dgc@sgi.com wrote:
> >
> > IMO, if you've got slow writeback, you should be reducing the amount
> > of dirty memory you allow in the machine so that you don't tie up
> > large amounts of memory that takes a long time to clean. Throttle earlier
> > and you avoid this problem entirely.
>
> I completely agree that 'throttle earlier' is important. I'm just not
> completely sure what should be throttled when.
>
> I think I could argue that pages in 'Writeback' are really still
> dirty. The difference is really just an implementation issue.

No argument here - I think you're right, Neil.

> So when the dirty_ratio is set to 40%, that should apply to all
> 'dirty' pages, which means both those flagged as 'Dirty' and those
> flagged as 'Writeback'.

Don't forget NFS client unstable pages.

FWIW, with writeback not being accounted as dirty, there is a window in the NFS client where a page during writeback is not dirty or unstable and hence not visible to the throttle. Hence if we have lots of outstanding async writes to NFS servers, or their I/O completion is held off, the throttle won't activate where it should and potentially let too many pages get dirtied. This may not be a major problem with the traditional small write sizes, but with 1MB I/Os this could be a fairly large number of pages that are unaccounted for a short period of time.

> So I think you need to throttle when Dirty+Writeback hits dirty_ratio
> (which we don't quite get right at the moment).
> But the trick is to throttle gently and fairly, rather than having a
> hard wall so that anyone who hits it just stops.

I disagree with the "throttle gently" bit there. If a process is writing faster than the underlying storage can write, then you have to stop the process in its tracks while the storage catches up. Especially if other processes are writing to the same device. You may as well just hit it with a big hammer because it's simple and pretty effective.

Besides, it is difficult to be gentle when you can dirty memory at least an order of magnitude faster than you can clean it.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group

^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: RFC - how to balance Dirty+Writeback in the face of slow writeback. 2006-08-15 8:06 ` Andrew Morton 2006-08-15 23:00 ` David Chinner @ 2006-08-17 3:59 ` Neil Brown 2006-08-17 6:22 ` Andrew Morton 2006-08-17 13:21 ` Trond Myklebust 1 sibling, 2 replies; 37+ messages in thread From: Neil Brown @ 2006-08-17 3:59 UTC (permalink / raw) To: Andrew Morton; +Cc: linux-kernel On Tuesday August 15, akpm@osdl.org wrote: > > When Dirty hits 0 (and Writeback is theoretically 80% of RAM) > > balance_dirty_pages will no longer be able to flush the full > > 'write_chunk' (1.5 times number of recent dirtied pages) and so will > > spin in a loop calling blk_congestion_wait(WRITE, HZ/10), so it isn't > > a busy loop, but it won't progress. > > This assumes that the queues are unbounded. They're not - they're limited > to 128 requests, which is 60MB or so. Ahhh... so the limit on the requests-per-queue is an important part of write-throttling behaviour. I didn't know that, thanks. fs/nfs doesn't seem to impose a limit. It will just allocate as many as you ask for until you start running out of memory. I've seen 60% of memory (10 out of 16Gig) in writeback for NFS. Maybe I should look there to address my current issue, though imposing a system-wide writeback limit seems safer. > > Per queue. The scenario you identify can happen if it's spread across > multiple disks simultaneously. > > CFQ used to have 1024 requests and we did have problems with excessive > numbers of writeback pages. I fixed that in 2.6.early, but that seems to > have got lost as well. > What would you say constitutes "excessive"? Is there any sense in which some absolute number is excessive (as it takes too long to scan some list) or is it just a percent-of-memory thing? > > Something like that - it'll be relatively simple. Unfortunately I think it is also relatively simple to get it badly wrong:-) Make one workload fast, and another slower. But thanks, you've been very helpful (as usual). 
I'll ponder it a bit longer and see what turns up. NeilBrown ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: RFC - how to balance Dirty+Writeback in the face of slow writeback. 2006-08-17 3:59 ` Neil Brown @ 2006-08-17 6:22 ` Andrew Morton 2006-08-17 8:36 ` Jens Axboe 2006-08-17 13:21 ` Trond Myklebust 1 sibling, 1 reply; 37+ messages in thread From: Andrew Morton @ 2006-08-17 6:22 UTC (permalink / raw) To: Neil Brown; +Cc: linux-kernel On Thu, 17 Aug 2006 13:59:41 +1000 Neil Brown <neilb@suse.de> wrote: > > CFQ used to have 1024 requests and we did have problems with excessive > > numbers of writeback pages. I fixed that in 2.6.early, but that seems to > > have got lost as well. > > > > What would you say constitutes "excessive"? Is there any sense in > which some absolute number is excessive (as it takes too long to scan > some list) or is it just a percent-of-memory thing? Excessive = 100% of memory dirty or under writeback against a single disk on a 512MB machine. Perhaps that problem just got forgotten about when CFQ went from 1024 requests down to 128. (That 128 was actually 64-available-for-read+64-available-for-write, so it's really 64 requests). > > > > Something like that - it'll be relatively simple. > > Unfortunately I think it is also relatively simple to get it badly > wrong:-) Make one workload fast, and another slower. > I think it's unlikely in this case. As long as we keep the queues reasonably full, the disks will be running flat-out and merging will be as good as we're going to get. One thing one does have to watch out for is the many-disks scenario: do concurrent dd's onto 12 disks and make sure that none of their LEDs go out. This is actually surprisingly hard to do, but it would be very hard to do worse than 2.4.x ;) ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: RFC - how to balance Dirty+Writeback in the face of slow writeback. 2006-08-17 6:22 ` Andrew Morton @ 2006-08-17 8:36 ` Jens Axboe 0 siblings, 0 replies; 37+ messages in thread From: Jens Axboe @ 2006-08-17 8:36 UTC (permalink / raw) To: Andrew Morton; +Cc: Neil Brown, linux-kernel On Wed, Aug 16 2006, Andrew Morton wrote: > On Thu, 17 Aug 2006 13:59:41 +1000 > Neil Brown <neilb@suse.de> wrote: > > > > CFQ used to have 1024 requests and we did have problems with excessive > > > numbers of writeback pages. I fixed that in 2.6.early, but that seems to > > > have got lost as well. > > > > > > > What would you say constitutes "excessive"? Is there any sense in > > which some absolute number is excessive (as it takes too long to scan > > some list) or is it just a percent-of-memory thing? > > Excessive = 100% of memory dirty or under writeback against a single disk > on a 512MB machine. Perhaps that problem just got forgotten about when CFQ > went from 1024 requests down to 128. (That 128 was actually > 64-available-for-read+64-available-for-write, so it's really 64 requests). That's not quite true, if you set nr_requests to 128 that's 128 for reads and 128 for writes. With the batching you will actually typically see 128 * 3 / 2 == 192 requests allocated. Which translates to about 96MiB of dirty data on the queue, if everything works smoothly. The 3/2 limit is quite new, before I introduced that, if you had a lot of writes each of them would be allowed 16 requests over the limit. So you would sometimes see huge queues, as with just eg 16 writes, you could have 128 + 16*16 requests allocated. I've always been of the opinion that the vm should handle all of this, and things should not change or break if I set 10000 as the request limit. 
A rate-of-dirtying throttling per process sounds like a really good idea, we badly need to prevent the occasional write (like a process doing sync reads, and getting stuck in slooow reclaim) from being throttled in the presence of a heavy dirtier. -- Jens Axboe ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: RFC - how to balance Dirty+Writeback in the face of slow writeback.
  2006-08-17  3:59 ` Neil Brown
  2006-08-17  6:22 ` Andrew Morton
@ 2006-08-17 13:21 ` Trond Myklebust
  2006-08-17 15:30 ` Andrew Morton
  1 sibling, 1 reply; 37+ messages in thread

From: Trond Myklebust @ 2006-08-17 13:21 UTC (permalink / raw)
To: Neil Brown; +Cc: Andrew Morton, linux-kernel

On Thu, 2006-08-17 at 13:59 +1000, Neil Brown wrote:
> On Tuesday August 15, akpm@osdl.org wrote:
> > > When Dirty hits 0 (and Writeback is theoretically 80% of RAM)
> > > balance_dirty_pages will no longer be able to flush the full
> > > 'write_chunk' (1.5 times number of recent dirtied pages) and so will
> > > spin in a loop calling blk_congestion_wait(WRITE, HZ/10), so it isn't
> > > a busy loop, but it won't progress.
> >
> > This assumes that the queues are unbounded.  They're not - they're limited
> > to 128 requests, which is 60MB or so.
>
> Ahhh... so the limit on the requests-per-queue is an important part of
> write-throttling behaviour.  I didn't know that, thanks.
>
> fs/nfs doesn't seem to impose a limit.  It will just allocate as many
> as you ask for until you start running out of memory.  I've seen 60%
> of memory (10 out of 16Gig) in writeback for NFS.
>
> Maybe I should look there to address my current issue, though imposing
> a system-wide writeback limit seems safer.

Exactly how would a request limit help? All that boils down to is having
the VM monitor global_page_state(NR_FILE_DIRTY) versus monitoring
global_page_state(NR_FILE_DIRTY)+global_page_state(NR_WRITEBACK).

Cheers,
  Trond
* Re: RFC - how to balance Dirty+Writeback in the face of slow writeback.
  2006-08-17 13:21 ` Trond Myklebust
@ 2006-08-17 15:30 ` Andrew Morton
  2006-08-17 16:18 ` Trond Myklebust
  0 siblings, 1 reply; 37+ messages in thread

From: Andrew Morton @ 2006-08-17 15:30 UTC (permalink / raw)
To: Trond Myklebust; +Cc: Neil Brown, linux-kernel

On Thu, 17 Aug 2006 09:21:51 -0400
Trond Myklebust <trond.myklebust@fys.uio.no> wrote:

> On Thu, 2006-08-17 at 13:59 +1000, Neil Brown wrote:
> > On Tuesday August 15, akpm@osdl.org wrote:
> > > > When Dirty hits 0 (and Writeback is theoretically 80% of RAM)
> > > > balance_dirty_pages will no longer be able to flush the full
> > > > 'write_chunk' (1.5 times number of recent dirtied pages) and so will
> > > > spin in a loop calling blk_congestion_wait(WRITE, HZ/10), so it isn't
> > > > a busy loop, but it won't progress.
> > >
> > > This assumes that the queues are unbounded.  They're not - they're limited
> > > to 128 requests, which is 60MB or so.
> >
> > Ahhh... so the limit on the requests-per-queue is an important part of
> > write-throttling behaviour.  I didn't know that, thanks.
> >
> > fs/nfs doesn't seem to impose a limit.  It will just allocate as many
> > as you ask for until you start running out of memory.  I've seen 60%
> > of memory (10 out of 16Gig) in writeback for NFS.
> >
> > Maybe I should look there to address my current issue, though imposing
> > a system-wide writeback limit seems safer.
>
> Exactly how would a request limit help? All that boils down to is having
> the VM monitor global_page_state(NR_FILE_DIRTY) versus monitoring
> global_page_state(NR_FILE_DIRTY)+global_page_state(NR_WRITEBACK).

I assume that if NFS is not limiting its NR_WRITEBACK consumption and block
devices are doing so, we could get in a situation where NFS hogs all of the
fixed-size NR_DIRTY+NR_WRITEBACK resource at the expense of concurrent
block-device-based writeback.

Perhaps.  The top-level poll-the-superblocks writeback loop might tend to
prevent that from happening.  But if applications were doing a lot of
superblock-specific writeback (fdatasync, sync_file_range(SYNC_FILE_RANGE_WRITE),
etc) then unfairness might occur.
* Re: RFC - how to balance Dirty+Writeback in the face of slow writeback.
  2006-08-17 15:30 ` Andrew Morton
@ 2006-08-17 16:18 ` Trond Myklebust
  2006-08-18  5:34 ` Andrew Morton
  0 siblings, 1 reply; 37+ messages in thread

From: Trond Myklebust @ 2006-08-17 16:18 UTC (permalink / raw)
To: Andrew Morton; +Cc: Neil Brown, linux-kernel

On Thu, 2006-08-17 at 08:30 -0700, Andrew Morton wrote:
> On Thu, 17 Aug 2006 09:21:51 -0400
> Trond Myklebust <trond.myklebust@fys.uio.no> wrote:
> > Exactly how would a request limit help? All that boils down to is having
> > the VM monitor global_page_state(NR_FILE_DIRTY) versus monitoring
> > global_page_state(NR_FILE_DIRTY)+global_page_state(NR_WRITEBACK).
>
> I assume that if NFS is not limiting its NR_WRITEBACK consumption and block
> devices are doing so, we could get in a situation where NFS hogs all of the
> fixed-size NR_DIRTY+NR_WRITEBACK resource at the expense of concurrent
> block-device-based writeback.

Since NFS has no control over NR_DIRTY, how does controlling
NR_WRITEBACK help? The only resource that NFS shares with the block
device writeout queues is memory.

IOW: The resource that needs to be controlled is the dirty pages, not
the write-out queue. Unless you can throttle back on the creation of
dirty NFS pages in the first place, then the potential for unfairness
will exist.

Trond
* Re: RFC - how to balance Dirty+Writeback in the face of slow writeback.
  2006-08-17 16:18 ` Trond Myklebust
@ 2006-08-18  5:34 ` Andrew Morton
  0 siblings, 0 replies; 37+ messages in thread

From: Andrew Morton @ 2006-08-18 5:34 UTC (permalink / raw)
To: Trond Myklebust; +Cc: Neil Brown, linux-kernel

On Thu, 17 Aug 2006 12:18:52 -0400
Trond Myklebust <trond.myklebust@fys.uio.no> wrote:

> On Thu, 2006-08-17 at 08:30 -0700, Andrew Morton wrote:
> > On Thu, 17 Aug 2006 09:21:51 -0400
> > Trond Myklebust <trond.myklebust@fys.uio.no> wrote:
> > > Exactly how would a request limit help? All that boils down to is having
> > > the VM monitor global_page_state(NR_FILE_DIRTY) versus monitoring
> > > global_page_state(NR_FILE_DIRTY)+global_page_state(NR_WRITEBACK).
> >
> > I assume that if NFS is not limiting its NR_WRITEBACK consumption and block
> > devices are doing so, we could get in a situation where NFS hogs all of the
> > fixed-size NR_DIRTY+NR_WRITEBACK resource at the expense of concurrent
> > block-device-based writeback.
>
> Since NFS has no control over NR_DIRTY, how does controlling
> NR_WRITEBACK help? The only resource that NFS shares with the block
> device writeout queues is memory.

Block devices have a limit on the amount of IO which they will queue.
NFS doesn't.

> IOW: The resource that needs to be controlled is the dirty pages, not
> the write-out queue. Unless you can throttle back on the creation of
> dirty NFS pages in the first place, then the potential for unfairness
> will exist.

Please read the whole thread - we're violently agreeing.
end of thread, other threads: [~2006-08-28  1:56 UTC | newest]

Thread overview: 37+ messages

2006-08-14 23:40 RFC - how to balance Dirty+Writeback in the face of slow writeback  Neil Brown
2006-08-15  8:06 ` Andrew Morton
2006-08-15 23:00 ` David Chinner
2006-08-17  4:08 ` Neil Brown
2006-08-17  6:14 ` Andrew Morton
2006-08-17 12:36 ` Trond Myklebust
2006-08-17 15:14 ` Andrew Morton
2006-08-17 16:22 ` Trond Myklebust
2006-08-18  5:49 ` Andrew Morton
2006-08-18 10:43 ` Nikita Danilov
2006-08-18  0:11 ` David Chinner
2006-08-18  6:29 ` Andrew Morton
2006-08-18  7:03 ` Jens Axboe
2006-08-18  7:11 ` Andrew Morton
2006-08-18 18:57 ` Andi Kleen
2006-08-21  0:35 ` Neil Brown
2006-08-21  3:15 ` David Chinner
2006-08-21  7:24 ` Neil Brown
2006-08-21 13:51 ` Jens Axboe
2006-08-25  4:36 ` Neil Brown
2006-08-25  6:37 ` Jens Axboe
2006-08-28  1:28 ` David Chinner
2006-08-25 13:16 ` Trond Myklebust
2006-08-27  8:21 ` Neil Brown
2006-08-21 14:28 ` David Chinner
2006-08-25  5:24 ` Neil Brown
2006-08-28  1:55 ` David Chinner
2006-08-21  7:47 ` Andi Kleen
2006-08-18  7:07 ` Neil Brown
2006-08-17 22:17 ` David Chinner
2006-08-17  3:59 ` Neil Brown
2006-08-17  6:22 ` Andrew Morton
2006-08-17  8:36 ` Jens Axboe
2006-08-17 13:21 ` Trond Myklebust
2006-08-17 15:30 ` Andrew Morton
2006-08-17 16:18 ` Trond Myklebust
2006-08-18  5:34 ` Andrew Morton