* RFC - how to balance Dirty+Writeback in the face of slow writeback.
@ 2006-08-14 23:40 Neil Brown
2006-08-15 8:06 ` Andrew Morton
0 siblings, 1 reply; 37+ messages in thread
From: Neil Brown @ 2006-08-14 23:40 UTC (permalink / raw)
To: linux-kernel
I have a question about the write throttling in
balance_dirty_pages in the face of slow writeback.
Suppose we have a filesystem where writeback is relatively slow -
e.g. NFS or EXTx over nbd over a slow link.
Suppose for the sake of simplicity that writeback is very slow and
doesn't progress at all for the first part of our experiment.
We write to a large file.
balance_dirty_pages gets called periodically. Until the number of
Dirty pages reaches 40% of memory it does nothing.
Once we hit 40%, balance_dirty_pages starts calling writeback_inodes
and these Dirty pages get converted to Writeback pages. This happens
at 1.5 times the speed that dirty pages are created (due to
sync_writeback_pages()). So for every 100K that we dirty, 150K gets
converted to writeback. But balance_dirty_pages doesn't wait for anything.
This will result in the number of dirty pages going down steadily, and
the number of writeback pages increasing quickly (3 times the speed of
the drop in Dirty). The total of Dirty+Writeback will keep growing.
When Dirty hits 0 (and Writeback is theoretically 80% of RAM)
balance_dirty_pages will no longer be able to flush the full
'write_chunk' (1.5 times the number of recently dirtied pages) and so
will spin in a loop calling blk_congestion_wait(WRITE, HZ/10), so it
isn't a busy loop, but it won't make progress.
Now our very slow writeback gets its act together and starts making
some progress and the Writeback number steadily drops down to 40%.
At this point balance_dirty_pages will exit, more pages will get
dirtied, and balance_dirty_pages will quickly flush them out again.
The steady state will be with Dirty at or close to 0, and Writeback at
or close to 40%.
Now obviously this is somewhat idealised, and even slow writeback will
make some progress early on, but you can still expect to get a very
large Writeback with a very small Dirty before stabilising.
I don't think we want this, but I'm not sure what we do want, so I'm
asking for opinions.
I don't think that pushing Dirty down to zero is the best thing to
do. If writeback is slow, we should simply be waiting for writeback
to progress rather than putting more work into the writeback queue.
This also allows pages to stay 'dirty' for longer, which is generally
considered to be a good thing.
I think we need to have two numbers. One that is the limit of dirty
pages, and one that is the limit of the combined dirty+writeback.
Alternatively it could simply be a limit on writeback.
Probably the latter, because having a very large writeback number makes
the 'inactive_list' of pages very large and so it takes a long time
to scan.
So suppose dirty were capped at vm_dirty_ratio, and writeback were
capped at that too, though independently.
Then in our experiment, Dirty would grow up to 40%, then
balance_dirty_pages would start flushing and Writeback would grow to
40% while Dirty stayed at 40%. Then balance_dirty_pages would not
flush anything but would just wait for Writeback to drop below 40%.
You would get a very obvious steady state of 40% Dirty and
40% Writeback.
Is this too much memory? 80% tied up in what are essentially dirty
blocks is more than you would expect when setting vm.dirty_ratio to
40.
Maybe 40% should limit Dirty+Writeback and when we cross the
threshold:
  if Dirty > Writeback - flush and wait
  if Dirty < Writeback - just wait
bdflush should get some writeback underway before we hit the 40%, so
balance_dirty_pages shouldn't find itself waiting for the pages it
just flushed.
Suggestions? Opinions?
The following patch demonstrates the last suggestion.
Thanks,
NeilBrown
Signed-off-by: Neil Brown <neilb@suse.de>
### Diffstat output
./mm/page-writeback.c | 4 +---
1 file changed, 1 insertion(+), 3 deletions(-)
diff .prev/mm/page-writeback.c ./mm/page-writeback.c
--- .prev/mm/page-writeback.c 2006-08-15 09:36:23.000000000 +1000
+++ ./mm/page-writeback.c 2006-08-15 09:39:17.000000000 +1000
@@ -207,7 +207,7 @@ static void balance_dirty_pages(struct a
 		 * written to the server's write cache, but has not yet
 		 * been flushed to permanent storage.
 		 */
-		if (nr_reclaimable) {
+		if (nr_reclaimable > global_page_state(NR_WRITEBACK)) {
 			writeback_inodes(&wbc);
 			get_dirty_limits(&background_thresh,
 					&dirty_thresh, mapping);
@@ -218,8 +218,6 @@ static void balance_dirty_pages(struct a
 					<= dirty_thresh)
 				break;
 			pages_written += write_chunk - wbc.nr_to_write;
-			if (pages_written >= write_chunk)
-				break;		/* We've done our duty */
 		}
 		blk_congestion_wait(WRITE, HZ/10);
 	}
^ permalink raw reply [flat|nested] 37+ messages in thread

* Re: RFC - how to balance Dirty+Writeback in the face of slow writeback.
  2006-08-14 23:40 RFC - how to balance Dirty+Writeback in the face of slow writeback Neil Brown
@ 2006-08-15 8:06 ` Andrew Morton
  2006-08-15 23:00 ` David Chinner
  2006-08-17 3:59 ` Neil Brown
  0 siblings, 2 replies; 37+ messages in thread
From: Andrew Morton @ 2006-08-15 8:06 UTC (permalink / raw)
To: Neil Brown; +Cc: linux-kernel

On Tue, 15 Aug 2006 09:40:12 +1000
Neil Brown <neilb@suse.de> wrote:
>
> I have a question about the write throttling in
> balance_dirty_pages in the face of slow writeback.

btw, we have a problem in there at present when you're using a
combination of slow devices and fast devices. That worked OK in
2.5.x, iirc, but seems to have gotten broken since.

> Suppose we have a filesystem where writeback is relatively slow -
> e.g. NFS or EXTx over nbd over a slow link.
>
> Suppose for the sake of simplicity that writeback is very slow and
> doesn't progress at all for the first part of our experiment.
>
> We write to a large file.
> balance_dirty_pages gets called periodically. Until the number of
> Dirty pages reaches 40% of memory it does nothing.
>
> Once we hit 40%, balance_dirty_pages starts calling writeback_inodes
> and these Dirty pages get converted to Writeback pages. This happens
> at 1.5 times the speed that dirty pages are created (due to
> sync_writeback_pages()). So for every 100K that we dirty, 150K gets
> converted to writeback. But balance_dirty_pages doesn't wait for anything.
>
> This will result in the number of dirty pages going down steadily, and
> the number of writeback pages increasing quickly (3 times the speed of
> the drop in Dirty). The total of Dirty+Writeback will keep growing.
>
> When Dirty hits 0 (and Writeback is theoretically 80% of RAM)
> balance_dirty_pages will no longer be able to flush the full
> 'write_chunk' (1.5 times the number of recently dirtied pages) and so will
> spin in a loop calling blk_congestion_wait(WRITE, HZ/10), so it isn't
> a busy loop, but it won't make progress.

This assumes that the queues are unbounded. They're not - they're
limited to 128 requests, which is 60MB or so.

Per queue. The scenario you identify can happen if it's spread across
multiple disks simultaneously.

CFQ used to have 1024 requests and we did have problems with excessive
numbers of writeback pages. I fixed that in 2.6.early, but that seems
to have got lost as well.

> Now our very slow writeback gets its act together and starts making
> some progress and the Writeback number steadily drops down to 40%.
> At this point balance_dirty_pages will exit, more pages will get
> dirtied, and balance_dirty_pages will quickly flush them out again.
>
> The steady state will be with Dirty at or close to 0, and Writeback at
> or close to 40%.
>
> Now obviously this is somewhat idealised, and even slow writeback will
> make some progress early on, but you can still expect to get a very
> large Writeback with a very small Dirty before stabilising.
>
> I don't think we want this, but I'm not sure what we do want, so I'm
> asking for opinions.
>
> I don't think that pushing Dirty down to zero is the best thing to
> do. If writeback is slow, we should simply be waiting for writeback
> to progress rather than putting more work into the writeback queue.
> This also allows pages to stay 'dirty' for longer, which is generally
> considered to be a good thing.
>
> I think we need to have two numbers. One that is the limit of dirty
> pages, and one that is the limit of the combined dirty+writeback.
> Alternatively it could simply be a limit on writeback.
>
> Probably the latter, because having a very large writeback number makes
> the 'inactive_list' of pages very large and so it takes a long time
> to scan.
>
> So suppose dirty were capped at vm_dirty_ratio, and writeback were
> capped at that too, though independently.
>
> Then in our experiment, Dirty would grow up to 40%, then
> balance_dirty_pages would start flushing and Writeback would grow to
> 40% while Dirty stayed at 40%. Then balance_dirty_pages would not
> flush anything but would just wait for Writeback to drop below 40%.
> You would get a very obvious steady state of 40% Dirty and
> 40% Writeback.
>
> Is this too much memory? 80% tied up in what are essentially dirty
> blocks is more than you would expect when setting vm.dirty_ratio to
> 40.
>
> Maybe 40% should limit Dirty+Writeback and when we cross the
> threshold:
>   if Dirty > Writeback - flush and wait
>   if Dirty < Writeback - just wait
>
> bdflush should get some writeback underway before we hit the 40%, so
> balance_dirty_pages shouldn't find itself waiting for the pages it
> just flushed.
>
> Suggestions? Opinions?
>
> The following patch demonstrates the last suggestion.
>
> Thanks,
> NeilBrown
>
> Signed-off-by: Neil Brown <neilb@suse.de>
>
> ### Diffstat output
>  ./mm/page-writeback.c | 4 +---
>  1 file changed, 1 insertion(+), 3 deletions(-)
>
> diff .prev/mm/page-writeback.c ./mm/page-writeback.c
> --- .prev/mm/page-writeback.c 2006-08-15 09:36:23.000000000 +1000
> +++ ./mm/page-writeback.c 2006-08-15 09:39:17.000000000 +1000
> @@ -207,7 +207,7 @@ static void balance_dirty_pages(struct a
>  		 * written to the server's write cache, but has not yet
>  		 * been flushed to permanent storage.
>  		 */
> -		if (nr_reclaimable) {
> +		if (nr_reclaimable > global_page_state(NR_WRITEBACK)) {
>  			writeback_inodes(&wbc);
>  			get_dirty_limits(&background_thresh,
>  					&dirty_thresh, mapping);
> @@ -218,8 +218,6 @@ static void balance_dirty_pages(struct a
>  					<= dirty_thresh)
>  				break;
>  			pages_written += write_chunk - wbc.nr_to_write;
> -			if (pages_written >= write_chunk)
> -				break;		/* We've done our duty */
>  		}
>  		blk_congestion_wait(WRITE, HZ/10);
>  	}

Something like that - it'll be relatively simple.
* Re: RFC - how to balance Dirty+Writeback in the face of slow writeback.
  2006-08-15 8:06 ` Andrew Morton
@ 2006-08-15 23:00 ` David Chinner
  2006-08-17 4:08 ` Neil Brown
  2006-08-17 3:59 ` Neil Brown
  1 sibling, 1 reply; 37+ messages in thread
From: David Chinner @ 2006-08-15 23:00 UTC (permalink / raw)
To: Andrew Morton; +Cc: Neil Brown, linux-kernel

On Tue, Aug 15, 2006 at 01:06:11AM -0700, Andrew Morton wrote:
> On Tue, 15 Aug 2006 09:40:12 +1000
> Neil Brown <neilb@suse.de> wrote:
> > When Dirty hits 0 (and Writeback is theoretically 80% of RAM)
> > balance_dirty_pages will no longer be able to flush the full
> > 'write_chunk' (1.5 times the number of recently dirtied pages) and so will
> > spin in a loop calling blk_congestion_wait(WRITE, HZ/10), so it isn't
> > a busy loop, but it won't make progress.
>
> This assumes that the queues are unbounded. They're not - they're limited
> to 128 requests, which is 60MB or so.
>
> Per queue. The scenario you identify can happen if it's spread across
> multiple disks simultaneously.

Though in this situation, you don't usually have slow writeback
problems. I haven't seen any recent problems with insufficient
throttling on this sort of configuration.

> CFQ used to have 1024 requests and we did have problems with excessive
> numbers of writeback pages. I fixed that in 2.6.early, but that seems to
> have got lost as well.

CFQ still has a queue depth of 128 requests....

> > bdflush should get some writeback underway before we hit the 40%, so
> > balance_dirty_pages shouldn't find itself waiting for the pages it
> > just flushed.

balance_dirty_pages() already kicks the background writeback done by
pdflush when dirty > dirty_background_ratio (10%).

IMO, if you've got slow writeback, you should be reducing the amount
of dirty memory you allow in the machine so that you don't tie up
large amounts of memory that takes a long time to clean. Throttle
earlier and you avoid this problem entirely.

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
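For reference, the "throttle earlier" knobs Dave refers to already exist as sysctls; the fragment below is purely illustrative (the values are examples, not recommendations from the thread):

```
# /etc/sysctl.conf - throttle writers earlier on a slow-writeback box.
# Background writeback by pdflush starts at 2% of RAM dirty:
vm.dirty_background_ratio = 2
# Writers themselves are throttled once 10% of RAM is dirty:
vm.dirty_ratio = 10
```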
* Re: RFC - how to balance Dirty+Writeback in the face of slow writeback.
  2006-08-15 23:00 ` David Chinner
@ 2006-08-17 4:08 ` Neil Brown
  2006-08-17 6:14 ` Andrew Morton
  2006-08-17 22:17 ` David Chinner
  0 siblings, 2 replies; 37+ messages in thread
From: Neil Brown @ 2006-08-17 4:08 UTC (permalink / raw)
To: David Chinner; +Cc: Andrew Morton, linux-kernel

On Wednesday August 16, dgc@sgi.com wrote:
>
> IMO, if you've got slow writeback, you should be reducing the amount
> of dirty memory you allow in the machine so that you don't tie up
> large amounts of memory that takes a long time to clean. Throttle earlier
> and you avoid this problem entirely.

I completely agree that 'throttle earlier' is important. I'm just not
completely sure what should be throttled when.

I think I could argue that pages in 'Writeback' are really still
dirty. The difference is really just an implementation issue. So when
the dirty_ratio is set to 40%, that should apply to all 'dirty' pages,
which means both those flagged as 'Dirty' and those flagged as
'Writeback'.

So I think you need to throttle when Dirty+Writeback hits dirty_ratio
(which we don't quite get right at the moment). But the trick is to
throttle gently and fairly, rather than having a hard wall so that
anyone who hits it just stops.

Thanks,
NeilBrown
* Re: RFC - how to balance Dirty+Writeback in the face of slow writeback.
  2006-08-17 4:08 ` Neil Brown
@ 2006-08-17 6:14 ` Andrew Morton
  2006-08-17 12:36 ` Trond Myklebust
  2006-08-18 0:11 ` David Chinner
  1 sibling, 2 replies; 37+ messages in thread
From: Andrew Morton @ 2006-08-17 6:14 UTC (permalink / raw)
To: Neil Brown; +Cc: David Chinner, linux-kernel

On Thu, 17 Aug 2006 14:08:58 +1000
Neil Brown <neilb@suse.de> wrote:

> So I think you need to throttle when Dirty+Writeback hits dirty_ratio

yup.

> (which we don't quite get right at the moment). But the trick is to
> throttle gently and fairly, rather than having a hard wall so that
> anyone who hits it just stops.

I swear, I had all this working in 2001. Perhaps I dreamed it. But I
specifically remember testing that processes which were performing
small, occasional writes were not getting blocked behind the activity
of other processes which were doing massive write()s. Ho hum, not to
worry.

I guess a robust approach would be to track, on a per-process,
per-threadgroup, per-user, etc basis the time-averaged page-dirtying
rate. If it is "low" then accept the dirtying. If it is "high" then
this process is a heavy writer and needs throttling earlier. Up to a
point - at some level we'll need to throttle everyone as a safety net
if nothing else.

Something like that covers the global dirty+writeback problem. The
other major problem space is the multiple-backing-device problem:

a) One device is being written to heavily, another lightly

b) One device is fast, another is slow.

Thus far, the limited size of the request queues has saved us from
really, really serious problems. But that doesn't work when lots of
disks are being used.

To solve this properly we'd need to account for
dirty+writeback(+unstable?) pages on a per-backing-dev basis.

But as a first step, yes, using dirty+writeback for the throttling
threshold and continuing to rely upon limited request queue size to
save us from disaster would be a good step.

btw, one thing which afaik NFS _still_ doesn't do is to wake up
processes which are stuck in blk_congestion_wait() when NFS has
retired a bunch of writes. It should do so, otherwise NFS
write-intensive workloads might end up sleeping for too long. I guess
the amount of buffering and hysteresis we have in there has thus far
prevented any problems from being observed.
* Re: RFC - how to balance Dirty+Writeback in the face of slow writeback.
  2006-08-17 6:14 ` Andrew Morton
@ 2006-08-17 12:36 ` Trond Myklebust
  2006-08-17 15:14 ` Andrew Morton
  1 sibling, 1 reply; 37+ messages in thread
From: Trond Myklebust @ 2006-08-17 12:36 UTC (permalink / raw)
To: Andrew Morton; +Cc: Neil Brown, David Chinner, linux-kernel

On Wed, 2006-08-16 at 23:14 -0700, Andrew Morton wrote:
> btw, one thing which afaik NFS _still_ doesn't do is to wake up processes
> which are stuck in blk_congestion_wait() when NFS has retired a bunch of
> writes. It should do so, otherwise NFS write-intensive workloads might end
> up sleeping for too long. I guess the amount of buffering and hysteresis
> we have in there has thus far prevented any problems from being observed.

Are we to understand that you consider blk_congestion_wait() to be an
official API, and not just another block layer hack inside the VM?

'cos currently the only tools for waking up processes in
blk_congestion_wait() are the two routines:

  static void clear_queue_congested(request_queue_t *q, int rw)
and
  static void set_queue_congested(request_queue_t *q, int rw)

in block/ll_rw_blk.c. Hardly a model of well thought out code...

Trond
* Re: RFC - how to balance Dirty+Writeback in the face of slow writeback.
  2006-08-17 12:36 ` Trond Myklebust
@ 2006-08-17 15:14 ` Andrew Morton
  2006-08-17 16:22 ` Trond Myklebust
  0 siblings, 1 reply; 37+ messages in thread
From: Andrew Morton @ 2006-08-17 15:14 UTC (permalink / raw)
To: Trond Myklebust; +Cc: Neil Brown, David Chinner, linux-kernel

On Thu, 17 Aug 2006 08:36:19 -0400
Trond Myklebust <trond.myklebust@fys.uio.no> wrote:

> On Wed, 2006-08-16 at 23:14 -0700, Andrew Morton wrote:
> > btw, one thing which afaik NFS _still_ doesn't do is to wake up processes
> > which are stuck in blk_congestion_wait() when NFS has retired a bunch of
> > writes. It should do so, otherwise NFS write-intensive workloads might end
> > up sleeping for too long. I guess the amount of buffering and hysteresis
> > we have in there has thus far prevented any problems from being observed.
>
> Are we to understand that you consider blk_congestion_wait() to be an
> official API, and not just another block layer hack inside the VM?
>
> 'cos currently the only tools for waking up processes in
> blk_congestion_wait() are the two routines:
>
>   static void clear_queue_congested(request_queue_t *q, int rw)
> and
>   static void set_queue_congested(request_queue_t *q, int rw)
>
> in block/ll_rw_blk.c. Hardly a model of well thought out code...

We've been over this before...

Take a look at blk_congestion_wait(). It doesn't know about request
queues. We'd need a new

	void writeback_congestion_end(int rw)
	{
		wake_up(congestion_wqh[rw]);
	}

or similar.
* Re: RFC - how to balance Dirty+Writeback in the face of slow writeback.
  2006-08-17 15:14 ` Andrew Morton
@ 2006-08-17 16:22 ` Trond Myklebust
  2006-08-18 5:49 ` Andrew Morton
  0 siblings, 1 reply; 37+ messages in thread
From: Trond Myklebust @ 2006-08-17 16:22 UTC (permalink / raw)
To: Andrew Morton; +Cc: Neil Brown, David Chinner, linux-kernel

On Thu, 2006-08-17 at 08:14 -0700, Andrew Morton wrote:
> Take a look at blk_congestion_wait(). It doesn't know about request
> queues. We'd need a new
>
> 	void writeback_congestion_end(int rw)
> 	{
> 		wake_up(congestion_wqh[rw]);
> 	}
>
> or similar.

...and how often do you want us to call this? NFS doesn't know much
about request queues either: it writes out pages on a per-RPC call
basis. In the worst case that could mean waking up the VM every time we
write out a single page.

Cheers,

Trond
* Re: RFC - how to balance Dirty+Writeback in the face of slow writeback.
  2006-08-17 16:22 ` Trond Myklebust
@ 2006-08-18 5:49 ` Andrew Morton
  2006-08-18 10:43 ` Nikita Danilov
  0 siblings, 1 reply; 37+ messages in thread
From: Andrew Morton @ 2006-08-18 5:49 UTC (permalink / raw)
To: Trond Myklebust; +Cc: Neil Brown, David Chinner, linux-kernel

On Thu, 17 Aug 2006 12:22:59 -0400
Trond Myklebust <trond.myklebust@fys.uio.no> wrote:

> On Thu, 2006-08-17 at 08:14 -0700, Andrew Morton wrote:
> > Take a look at blk_congestion_wait(). It doesn't know about request
> > queues. We'd need a new
> >
> > 	void writeback_congestion_end(int rw)
> > 	{
> > 		wake_up(congestion_wqh[rw]);
> > 	}
> >
> > or similar.
>
> ...and how often do you want us to call this? NFS doesn't know much
> about request queues either: it writes out pages on a per-RPC call
> basis. In the worst case that could mean waking up the VM every time we
> write out a single page.

Once per page would work OK, but we'd save some CPU by making it less
frequent.

This stuff isn't very precise. We could make it precise, but it would
require a really large amount of extra locking, extra locks, etc.

The way this code all works is pretty crude and simple: a process comes
in to do some writeback and it enters a polling loop:

	while (we need to do writeback) {
		for (each superblock) {
			if (the superblock's backing_dev isn't congested) {
				stuff some more IO down it()
			}
		}
		take_a_nap();
	}

so the process remains captured in that polling loop until the
dirty-memory-exceed condition subsides. The reason why we avoid
congested queues is so that one thread can keep multiple queues busy:
we don't want to allow writing threads to get stuck on a single queue
and we don't want to have to provision one pdflush per spindle (or,
more precisely, per backing_dev_info).

So the question is: how do we "take a nap"? That's
blk_congestion_wait(). The process goes to sleep in there and gets
woken up when someone thinks that a queue might be able to take some
more writeout.

A caller into blk_congestion_wait() is _supposed_ to be woken by
writeback completion. If the timeout actually expires, something isn't
right. If we had all the new locking in place and correct, the timeout
wouldn't actually be needed. In theory, the timeout is only there as a
fallback to handle certain races for which we don't want to implement
all that new locking to fix.

It would be good if NFS were to implement a fixed-size "request
queue", so we can't fill all memory with NFS requests. Then, NFS can
implement a congestion threshold at "75% full" (via its
backing_dev_info) and everything is in place.

As a halfway step it might provide benefit for NFS to poke the
congestion_wq[] every quarter megabyte or so, to kick any processes
out of their sleep so they go back to poll all the superblocks again,
earlier than they otherwise would have. It might not make any
difference - one would need to get in there and understand the dynamic
behaviour.
* Re: RFC - how to balance Dirty+Writeback in the face of slow writeback.
  2006-08-18 5:49 ` Andrew Morton
@ 2006-08-18 10:43 ` Nikita Danilov
  0 siblings, 0 replies; 37+ messages in thread
From: Nikita Danilov @ 2006-08-18 10:43 UTC (permalink / raw)
To: Andrew Morton; +Cc: Neil Brown, David Chinner, linux-kernel

Andrew Morton writes:

[...]

> The way this code all works is pretty crude and simple: a process comes
> in to do some writeback and it enters a polling loop:
>
> 	while (we need to do writeback) {
> 		for (each superblock) {
> 			if (the superblock's backing_dev isn't congested) {
> 				stuff some more IO down it()
> 			}
> 		}
> 		take_a_nap();
> 	}
>
> so the process remains captured in that polling loop until the
> dirty-memory-exceed condition subsides. The reason why we avoid

Hm... wbc->nr_to_write is checked all the way down
(balance_dirty_pages(), writeback_inodes(), sync_sb_inodes(),
mpage_writepages()), so an "occasional writer" cannot be stuck for
more than 32 + 16 pages, it seems.

Nikita.
* Re: RFC - how to balance Dirty+Writeback in the face of slow writeback.
  2006-08-17 6:14 ` Andrew Morton
  2006-08-17 12:36 ` Trond Myklebust
@ 2006-08-18 0:11 ` David Chinner
  2006-08-18 6:29 ` Andrew Morton
  1 sibling, 1 reply; 37+ messages in thread
From: David Chinner @ 2006-08-18 0:11 UTC (permalink / raw)
To: Andrew Morton; +Cc: Neil Brown, linux-kernel

On Wed, Aug 16, 2006 at 11:14:48PM -0700, Andrew Morton wrote:
>
> I guess a robust approach would be to track, on a per-process,
> per-threadgroup, per-user, etc basis the time-averaged page-dirtying rate.
> If it is "low" then accept the dirtying. If it is "high" then this process
> is a heavy writer and needs throttling earlier. Up to a point - at some
> level we'll need to throttle everyone as a safety net if nothing else.

The problem with that approach is that throttling a large writer
forces data to disk earlier and that may be undesirable - the large
file might be a temp file that will soon be unlinked, and in this case
you don't want it throttled. Right now, you set dirty*ratio high
enough that this doesn't happen, and the file remains memory resident
until unlink.

> Something like that covers the global dirty+writeback problem. The other
> major problem space is the multiple-backing-device problem:
>
> a) One device is being written to heavily, another lightly
>
> b) One device is fast, another is slow.

Once we are past the throttling threshold, the only thing that matters
is whether we can write more data to the backing device(s). We should
not really be allowing the input rate to exceed the output rate once
we are past the throttle threshold.

> Thus far, the limited size of the request queues has saved us from really,
> really serious problems. But that doesn't work when lots of disks are
> being used.

Mainly because it increases the number of pages under writeback that
currently aren't accounted as dirty and the throttle doesn't kick in
when it should.

> To solve this properly we'd need to account for
> dirty+writeback(+unstable?) pages on a per-backing-dev basis.

We'd still need to account for them globally because we still need
to be able to globally limit the amount of dirty data in the
machine.

FYI, I implemented a complex two-stage throttle on Irix a couple of
years ago - it uses a per-device soft throttle threshold that is not
enforced until the global dirty state passes a configurable limit.
At that point, the per-device limits are enforced.

This meant that devices with no dirty state attached to them could
continue to dirty pages up to their soft-threshold, whereas heavy
writers would be stopped until their backing devices fell back below
the soft thresholds.

Because the amount of dirty pages could continue to grow past safe
limits if you had enough devices, there is also a global hard limit
that cannot be exceeded and this throttles all incoming write
requests regardless of the state of the device it was being written
to.

The problem with this approach is that the code was complex and
difficult to test properly. Also, working out the default config
values was an exercise in trial, error, workload measurement and
guesswork that took some time to get right.

The current linux code works as well as that two-stage throttle
(better in some cases!) because of one main thing - bound request
queue depth with feedback into the throttling control loop. Irix
has neither of these so the throttle had to provide this accounting
and limiting (soft throttle threshold).

Hence I'm not sure that per-backing-device accounting and making
decisions based on that accounting is really going to buy us much
apart from additional complexity....

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
* Re: RFC - how to balance Dirty+Writeback in the face of slow writeback.
  2006-08-18 0:11 ` David Chinner
@ 2006-08-18 6:29 ` Andrew Morton
  2006-08-18 7:03 ` Jens Axboe
  2006-08-18 7:07 ` Neil Brown
  0 siblings, 2 replies; 37+ messages in thread
From: Andrew Morton @ 2006-08-18 6:29 UTC (permalink / raw)
To: David Chinner; +Cc: Neil Brown, linux-kernel

On Fri, 18 Aug 2006 10:11:02 +1000
David Chinner <dgc@sgi.com> wrote:

> > Something like that covers the global dirty+writeback problem. The other
> > major problem space is the multiple-backing-device problem:
> >
> > a) One device is being written to heavily, another lightly
> >
> > b) One device is fast, another is slow.
>
> Once we are past the throttling threshold, the only thing that
> matters is whether we can write more data to the backing device(s).
> We should not really be allowing the input rate to exceed the output
> rate once we are past the throttle threshold.

True. But it seems really sad to block some process which is doing a
really small dirtying (say, some dopey atime update) just because some
other process is doing a huge write.

Now, things _usually_ work out all right, if only because of
balance_dirty_pages_ratelimited()'s logic. But it's more by
happenstance than by intent, and these sorts of interferences can
happen.

> > To solve this properly we'd need to account for
> > dirty+writeback(+unstable?) pages on a per-backing-dev basis.
>
> We'd still need to account for them globally because we still need
> to be able to globally limit the amount of dirty data in the
> machine.
>
> FYI, I implemented a complex two-stage throttle on Irix a couple of
> years ago - it uses a per-device soft throttle threshold that is not
> enforced until the global dirty state passes a configurable limit.
> At that point, the per-device limits are enforced.
>
> This meant that devices with no dirty state attached to them could
> continue to dirty pages up to their soft-threshold, whereas heavy
> writers would be stopped until their backing devices fell back below
> the soft thresholds.
>
> Because the amount of dirty pages could continue to grow past safe
> limits if you had enough devices, there is also a global hard limit
> that cannot be exceeded and this throttles all incoming write
> requests regardless of the state of the device it was being written
> to.
>
> The problem with this approach is that the code was complex and
> difficult to test properly. Also, working out the default config
> values was an exercise in trial, error, workload measurement and
> guesswork that took some time to get right.
>
> The current linux code works as well as that two-stage throttle
> (better in some cases!) because of one main thing - bound request
> queue depth with feedback into the throttling control loop. Irix
> has neither of these so the throttle had to provide this accounting
> and limiting (soft throttle threshold).
>
> Hence I'm not sure that per-backing-device accounting and making
> decisions based on that accounting is really going to buy us much
> apart from additional complexity....
>

hm, interesting.

It seems that the many-writers-to-different-disks workloads don't
happen very often. We know this because

a) The 2.4 performance is utterly awful, and I never saw anybody
   complain and

b) 2.6 has the risk of filling all memory with under-writeback pages,
   and nobody has complained about that either (iirc).

Relying on that observation and the request-queue limits has got us
this far but yeah, we should plug that PageWriteback windup scenario.

btw, Neil, has the PageWriteback windup actually been demonstrated?
If so, how?
* Re: RFC - how to balance Dirty+Writeback in the face of slow writeback.
  2006-08-18 6:29 ` Andrew Morton
@ 2006-08-18 7:03 ` Jens Axboe
  2006-08-18 7:11 ` Andrew Morton
  2006-08-18 18:57 ` Andi Kleen
  1 sibling, 2 replies; 37+ messages in thread
From: Jens Axboe @ 2006-08-18 7:03 UTC (permalink / raw)
To: Andrew Morton; +Cc: David Chinner, Neil Brown, linux-kernel

On Thu, Aug 17 2006, Andrew Morton wrote:
> It seems that the many-writers-to-different-disks workloads don't happen
> very often. We know this because
>
> a) The 2.4 performance is utterly awful, and I never saw anybody
>    complain and

Talk to some of the people that used DVD-RAM devices (or other
excruciatingly slow writers) on their system, and they would disagree
violently :-)

It's been discussed here on lkml many times in the past, but that's
years behind us now. Thankfully your pdflush work got rid of that
embarrassment. But it definitely does matter, to real ordinary users.

--
Jens Axboe
* Re: RFC - how to balance Dirty+Writeback in the face of slow writeback. 2006-08-18 7:03 ` Jens Axboe @ 2006-08-18 7:11 ` Andrew Morton 2006-08-18 18:57 ` Andi Kleen 1 sibling, 0 replies; 37+ messages in thread From: Andrew Morton @ 2006-08-18 7:11 UTC (permalink / raw) To: Jens Axboe; +Cc: David Chinner, Neil Brown, linux-kernel On Fri, 18 Aug 2006 09:03:15 +0200 Jens Axboe <axboe@suse.de> wrote: > On Thu, Aug 17 2006, Andrew Morton wrote: > > It seems that the many-writers-to-different-disks workloads don't happen > > very often. We know this because > > > > a) The 2.4 performance is utterly awful, and I never saw anybody > > complain and > > Talk to some of the people that used DVD-RAM devices (or other > excruciatingly slow writers) on their system, and they would disagree > violently :-) umm, OK, I guess that has the same cause: buffer_heads from different devices all on the same single queue. In this case the problem is that one device is slow. In the same-speed-devices case the problem is that all writeback threads get stuck on the same device, allowing others to go idle. ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: RFC - how to balance Dirty+Writeback in the face of slow writeback. 2006-08-18 7:03 ` Jens Axboe 2006-08-18 7:11 ` Andrew Morton @ 2006-08-18 18:57 ` Andi Kleen 2006-08-21 0:35 ` Neil Brown 1 sibling, 1 reply; 37+ messages in thread From: Andi Kleen @ 2006-08-18 18:57 UTC (permalink / raw) To: Jens Axboe; +Cc: David Chinner, Neil Brown, linux-kernel, akpm Jens Axboe <axboe@suse.de> writes: > On Thu, Aug 17 2006, Andrew Morton wrote: > > It seems that the many-writers-to-different-disks workloads don't happen > > very often. We know this because > > > > a) The 2.4 performance is utterly awful, and I never saw anybody > > complain and > > Talk to some of the people that used DVD-RAM devices (or other > excruciatingly slow writers) on their system, and they would disagree > violently :-) I hit this recently while doing backups to a slow external USB disk. The system was quite unusable (some commands blocked for over a minute) -Andi ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: RFC - how to balance Dirty+Writeback in the face of slow writeback. 2006-08-18 18:57 ` Andi Kleen @ 2006-08-21 0:35 ` Neil Brown 2006-08-21 3:15 ` David Chinner 0 siblings, 1 reply; 37+ messages in thread From: Neil Brown @ 2006-08-21 0:35 UTC (permalink / raw) To: Andi Kleen; +Cc: Jens Axboe, David Chinner, linux-kernel, akpm On August 18, ak@suse.de wrote: > Jens Axboe <axboe@suse.de> writes: > > > On Thu, Aug 17 2006, Andrew Morton wrote: > > > It seems that the many-writers-to-different-disks workloads don't happen > > > very often. We know this because > > > > > > a) The 2.4 performance is utterly awful, and I never saw anybody > > > complain and > > > > Talk to some of the people that used DVD-RAM devices (or other > > excruciatingly slow writers) on their system, and they would disagree > > violently :-) > > I hit this recently while doing backups to a slow external USB disk. > The system was quite unusable (some commands blocked for over a minute) Ouch. I suspect we are going to see more of this, as USB drive for backups is probably a very attractive option for many. The 'obvious' solution would be to count dirty pages per backing_dev and rate limit writes based on this. But counting pages can be expensive. I wonder if there might be some way to throttle the required writes without doing too much counting. Could we watch when the backing_dev is congested and use that? e.g. When Dirty+Writeback is between max_dirty/2 and max_dirty, balance_dirty_pages waits until mapping->backing_dev_info is not congested. That might slow things down, but it is hard to know if it would slow things down the right amount... Given that large machines are likely to have lots of different backing_devs, maybe counting all the dirty pages per backing_dev wouldn't be too expensive? NeilBrown ^ permalink raw reply [flat|nested] 37+ messages in thread
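Neil's congestion-based idea above can be sketched as a small decision
function.  This is a userspace model only, not kernel code; the function
and parameter names are invented for illustration:

```c
#include <stdbool.h>

/*
 * Model of the suggestion above: below max_dirty/2 nobody waits;
 * between max_dirty/2 and max_dirty a writer waits only while its own
 * backing_dev is congested; at or above max_dirty everybody waits.
 * All names here are illustrative, not the kernel's.
 */
bool writer_should_wait(long dirty_plus_writeback, long max_dirty,
                        bool bdi_congested)
{
    if (dirty_plus_writeback >= max_dirty)
        return true;                /* hard limit: always throttle */
    if (dirty_plus_writeback >= max_dirty / 2)
        return bdi_congested;       /* soft range: per-device wait */
    return false;                   /* plenty of clean memory left */
}
```

The appeal is that the only per-device state consulted is the existing
congestion flag, so no extra page counting is needed; the cost, as Neil
notes, is that how much this slows a writer down is hard to predict.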
* Re: RFC - how to balance Dirty+Writeback in the face of slow writeback.
  2006-08-21  0:35 ` Neil Brown
@ 2006-08-21  3:15   ` David Chinner
  2006-08-21  7:24     ` Neil Brown
  2006-08-21  7:47     ` Andi Kleen
  0 siblings, 2 replies; 37+ messages in thread
From: David Chinner @ 2006-08-21  3:15 UTC (permalink / raw)
  To: Neil Brown; +Cc: Andi Kleen, Jens Axboe, David Chinner, linux-kernel, akpm

On Mon, Aug 21, 2006 at 10:35:31AM +1000, Neil Brown wrote:
> On August 18, ak@suse.de wrote:
> > Jens Axboe <axboe@suse.de> writes:
> > > On Thu, Aug 17 2006, Andrew Morton wrote:
> > > > It seems that the many-writers-to-different-disks workloads don't happen
> > > > very often.  We know this because
> > > >
> > > > a) The 2.4 performance is utterly awful, and I never saw anybody
> > > >    complain and
> > >
> > > Talk to some of the people that used DVD-RAM devices (or other
> > > excruciatingly slow writers) on their system, and they would disagree
> > > violently :-)
> >
> > I hit this recently while doing backups to a slow external USB disk.
> > The system was quite unusable (some commands blocked for over a minute)
>
> Ouch.
> I suspect we are going to see more of this, as USB drive for backups
> is probably a very attractive option for many.

I can't see how this would occur on a 2.6 kernel unless the problem is
that all the reclaimable memory in the machine is dirty page cache
pages and every allocation is blocking waiting for writeback to the
slow device to occur.  That is, we filled memory with dirty pages
before we got to the throttle threshold.

> The 'obvious' solution would be to count dirty pages per backing_dev
> and rate limit writes based on this.
> But counting pages can be expensive.  I wonder if there might be some
> way to throttle the required writes without doing too much counting.

I don't think we want to count pages here.

My "obvious" solution is a per-backing-dev throttle threshold, just
like we have per-backing-dev readahead parameters....
That is, we allow a per-block-dev value to be set that overrides the global setting for that blockdev only. Hence for slower devices we can set the point at which we throttle at a much lower dirty memory threshold when that block device is congested. > Could we watch when the backing_dev is congested and use that? > e.g. > When Dirty+Writeback is between max_dirty/2 and max_dirty, > balance_dirty_pages waits until mapping->backing_dev_info > is not congested. The problem with that approach is that writeback_inodes() operates on "random" block devices, not necessarily the one we are trying to write to We don't care what bdi we start write back on - we just want some dirty pages to come clean. If we can't write the number of pages we wanted to, that means all bdi's are congested, and we then wait for one to become uncongested so we can push more data into it. Hence waiting on a specific bdi to become uncongested is the wrong thing to do because we could be cleaning pages on a different, uncongested bdi instead of waiting. A per-bdi throttle threshold will have the effect of pushing out pages on faster block devs earlier than they would otherwise be pushed out, but that will only occur if we are writing to a slower block device. Also, only the slower bdi will be subject to this throttling, so it won't get as much memory dirty as the faster devices.... > That might slow things down, but it is hard to know if it would slow > things down the right amount... > > Given that large machines are likely to have lots of different > backing_devs, maybe counting all the dirty pages per backing_dev > wouldn't be too expensive? Consider 1024p machines writing in parallel at >10GB/s write speeds to a single filesystem (i.e. single bdi). Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group ^ permalink raw reply [flat|nested] 37+ messages in thread
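Dave's per-backing-dev threshold could be modelled as a simple override
that falls back to the global setting, analogous to per-bdi readahead
tunables.  The struct and field names below are invented for
illustration; this is a sketch, not the kernel's API:

```c
/*
 * Sketch of the per-backing-dev threshold idea: each backing_dev may
 * carry its own dirty-memory threshold that overrides the global one,
 * the way per-bdi readahead settings override the default.
 */
struct bdi_model {
    long dirty_thresh;      /* 0 means "inherit the global threshold" */
};

long effective_dirty_thresh(const struct bdi_model *bdi, long global_thresh)
{
    if (bdi->dirty_thresh > 0)
        return bdi->dirty_thresh;   /* slow device: throttle earlier */
    return global_thresh;
}
```

A slow DVD-RAM or USB 1.x disk would get a low override, so writers to
it throttle early, while fast devices keep the global default.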
* Re: RFC - how to balance Dirty+Writeback in the face of slow writeback. 2006-08-21 3:15 ` David Chinner @ 2006-08-21 7:24 ` Neil Brown 2006-08-21 13:51 ` Jens Axboe 2006-08-21 14:28 ` David Chinner 2006-08-21 7:47 ` Andi Kleen 1 sibling, 2 replies; 37+ messages in thread From: Neil Brown @ 2006-08-21 7:24 UTC (permalink / raw) To: David Chinner; +Cc: Andi Kleen, Jens Axboe, linux-kernel, akpm On Monday August 21, dgc@sgi.com wrote: > On Mon, Aug 21, 2006 at 10:35:31AM +1000, Neil Brown wrote: > > On August 18, ak@suse.de wrote: > > > Jens Axboe <axboe@suse.de> writes: > > > > > > > On Thu, Aug 17 2006, Andrew Morton wrote: > > > > > It seems that the many-writers-to-different-disks workloads don't happen > > > > > very often. We know this because > > > > > > > > > > a) The 2.4 performance is utterly awful, and I never saw anybody > > > > > complain and > > > > > > > > Talk to some of the people that used DVD-RAM devices (or other > > > > excruciatingly slow writers) on their system, and they would disagree > > > > violently :-) > > > > > > I hit this recently while doing backups to a slow external USB disk. > > > The system was quite unusable (some commands blocked for over a minute) > > > > Ouch. > > I suspect we are going to see more of this, as USB drive for backups > > is probably a very attractive option for many. > > I can't see how this would occur on a 2.6 kernel unless the problem is > that all the reclaimable memory in the machine is dirty page cache pages > every allocation is blocking waiting for writeback to the slow device to > occur. That is, we filled memory with dirty pages before we got to the > throttle threshold. I started writing a longish reply to this explaining how maybe that could happen, and then realised I had been missing important aspects of the code. writeback_inodes doesn't just work on any random device as both you and I thought. 
The 'writeback_control' structure identifies the bdi to be flushed and
it will only call __writeback_sync_inode on inodes with the same bdi.

This means that any process writing to a particular bdi should throttle
against the queue limits in the bdi once we pass the dirty threshold.
This means that it shouldn't be able to fill memory with dirty pages
for that device (unless the bdi doesn't have a queue limit like
nfs...).

But now I see another way that Jens' problem could occur... maybe.

Suppose the total Dirty+Writeback exceeds the threshold due entirely to
the slow device, and it is slowly working its way through the writeback
pages.  We write to some other device, make 'ratelimit_pages' pages
dirty and then hit balance_dirty_pages.  We now need to either get the
total dirty pages below the threshold or start writeback on
1.5 * ratelimit_pages pages.  As we only have 'ratelimit_pages' dirty
pages we cannot start writeback on enough, and so must wait until
Dirty+Writeback drops below the threshold.  And as we are waiting on
the slow device, that could take a while (especially as it is possible
that no-one is calling balance_dirty_pages against that bdi).

I think this was the reason for the interesting extra patch that I
mentioned we have in the SuSE kernel.  The effect of that patch is to
break out of balance_dirty_pages as soon as Dirty hits zero.  This
should stop a slow device from blocking other traffic but has
unfortunate side effects when combined with nfs which doesn't limit its
writeback queue.

Jens:  Was it a SuSE kernel or a mainline kernel on which you
experienced this slowdown with an external USB drive?

> > The 'obvious' solution would be to count dirty pages per backing_dev
> > and rate limit writes based on this.
> > But counting pages can be expensive.  I wonder if there might be some
> > way to throttle the required writes without doing too much counting.
>
> I don't think we want to count pages here.
> My "obvious" solution is a per-backing-dev throttle threshold, just
> like we have per-backing-dev readahead parameters....
>
> That is, we allow a per-block-dev value to be set that overrides the
> global setting for that blockdev only. Hence for slower devices
> we can set the point at which we throttle at a much lower dirty
> memory threshold when that block device is congested.

I don't think this would help.  The bdi with the higher threshold could
exclude bdis with lower thresholds from making any forward progress.

Here is a question:  Seeing that
    wbc.nonblocking == 0
    wbc.older_than_this == NULL
    wbc.range_cyclic == 0
in balance_dirty_pages when it calls writeback_inodes, under what
circumstances will writeback_inodes return with wbc.nr_to_write > 0 ??

If a write error occurs it could abort early, but otherwise I think it
will only exit early if it runs out of pages to write, because there
aren't any dirty pages.

If that is true, then after calling writeback_inodes once,
balance_dirty_pages should just exit.  It isn't going to find any more
work to do next time it is called anyway.  Either the queue was never
congested, in which case we don't need to throttle writes, or it
blocked for a while waiting for the queue to clean (in ->writepage) and
so has successfully throttled writes.

So my feeling (at the moment) is that balance_dirty_pages should look
like:

    if below threshold
        return
    writeback_inodes({.bdi = mapping->backing_dev_info})
    while (above threshold + 10%)
        writeback_inodes({.bdi = NULL})
        blk_congestion_wait()

and all bdis should impose a queue limit.

This would limit the extent to which different bdis can interfere with
each other, and make the role of writeback_inodes clear (especially
with a nice big comment).

Then we just need to deal with the case where the sum of the queue
limits of all devices exceeds the dirty threshold....
Maybe writeout queues need to auto-adjust their queue length when some
system-wide situation is detected.... sounds messy.
NeilBrown ^ permalink raw reply [flat|nested] 37+ messages in thread
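The control flow Neil sketches can be walked through with a toy
userspace model.  The decrements below stand in for writeback_inodes()
and the wait counter for blk_congestion_wait(); every name is
illustrative, nothing here is kernel code:

```c
/*
 * Userspace model of the balance_dirty_pages() shape proposed above:
 * write back the caller's own bdi once, then loop on global writeback
 * only while the system is more than 10% over the threshold.
 */
struct vm_model {
    long dirty;          /* total dirty pages, all devices */
    long own_bdi_dirty;  /* dirty pages on the writer's own bdi */
    long thresh;         /* global dirty threshold */
};

int balance_model(struct vm_model *vm, long write_chunk)
{
    int waits = 0;
    long n;

    if (vm->dirty <= vm->thresh)
        return waits;

    /* writeback_inodes({.bdi = mapping->backing_dev_info}) */
    n = vm->own_bdi_dirty < write_chunk ? vm->own_bdi_dirty : write_chunk;
    vm->own_bdi_dirty -= n;
    vm->dirty -= n;

    while (vm->dirty > vm->thresh + vm->thresh / 10) {
        vm->dirty -= write_chunk;   /* writeback_inodes(.bdi = NULL) */
        if (vm->dirty < 0)
            vm->dirty = 0;
        waits++;                    /* blk_congestion_wait(WRITE, HZ/10) */
    }
    return waits;
}
```

The point of the shape is visible in the model: the writer always
flushes its own bdi first, and only pays global congestion waits while
the whole system is well over the threshold.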
* Re: RFC - how to balance Dirty+Writeback in the face of slow writeback.
  2006-08-21  7:24 ` Neil Brown
@ 2006-08-21 13:51   ` Jens Axboe
  2006-08-25  4:36     ` Neil Brown
  2006-08-21 14:28   ` David Chinner
  1 sibling, 1 reply; 37+ messages in thread
From: Jens Axboe @ 2006-08-21 13:51 UTC (permalink / raw)
  To: Neil Brown; +Cc: David Chinner, Andi Kleen, linux-kernel, akpm

On Mon, Aug 21 2006, Neil Brown wrote:
> Jens:  Was it a SuSE kernel or a mainline kernel on which you
> experienced this slowdown with an external USB drive?

Note that this was in the old days, on 2.4 kernels.  It was (and still
is) a generic 2.4 problem, which is quite apparent when you have larger
slow devices.  Larger, because then you can have a lot of dirty memory
in flight for that device.  The case I most often saw reported was on
DVD-RAM atapi or scsi devices, which write at 3-400kb/sec.  An external
usb hard drive over usb 1.x would be almost as bad, I suppose.  I
haven't heard any complaints for 2.6 in this area for a long time.

> Then we just need to deal with the case where the sum of the queue
> limits of all devices exceeds the dirty threshold....
> Maybe writeout queues need to auto-adjust their queue length when some
> system-wide situation is detected.... sounds messy.

Queue length is a little tricky; it's basically controlled by two
parameters - nr_requests and max_sectors_kb.  Most SATA drives can do
32MiB requests, so in theory a system that sets max_sectors_kb to
max_hw_sectors_kb and retains a default nr_requests of 128 can see up
to 32 * (128 * 3) / 2 == 6144MiB per disk in flight.  Ouch.  By default
we only allow 512KiB per request, which brings us to a more reasonable
96MiB per disk.

But these numbers are in no way tied to the hardware.  It may be
totally reasonable to have 3GiB of dirty data on one system, and it may
be totally unreasonable to have 96MiB of dirty data on another.
I've always thought that assuming any kind of reliable throttling at the queue level is broken and that the vm should handle this completely. -- Jens Axboe ^ permalink raw reply [flat|nested] 37+ messages in thread
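Jens' back-of-envelope arithmetic can be captured in one helper.  The
function name is made up; the formula is just his numbers: maximum
request size times nr_requests, plus the extra 50% the block layer may
allocate for batching (the "* 3 / 2"):

```c
/*
 * Worst-case dirty data in flight per disk, in MiB, from the queue
 * tunables Jens cites: max request size (KiB) * nr_requests * 3/2.
 */
long max_inflight_mib(long max_sectors_kb, long nr_requests)
{
    return max_sectors_kb * nr_requests * 3 / 2 / 1024;
}
```

With 32MiB requests and nr_requests == 128 this gives the 6144MiB
figure; with the 512KiB default it gives the 96MiB figure.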
* Re: RFC - how to balance Dirty+Writeback in the face of slow writeback.
  2006-08-21 13:51 ` Jens Axboe
@ 2006-08-25  4:36   ` Neil Brown
  2006-08-25  6:37     ` Jens Axboe
  2006-08-25 13:16     ` Trond Myklebust
  0 siblings, 2 replies; 37+ messages in thread
From: Neil Brown @ 2006-08-25  4:36 UTC (permalink / raw)
  To: Jens Axboe; +Cc: David Chinner, Andi Kleen, linux-kernel, akpm

On Monday August 21, axboe@suse.de wrote:
>
> But these numbers are in no way tied to the hardware. It may be totally
> reasonable to have 3GiB of dirty data on one system, and it may be
> totally unreasonable to have 96MiB of dirty data on another. I've always
> thought that assuming any kind of reliable throttling at the queue level
> is broken and that the vm should handle this completely.

I keep changing my mind about this.  Sometimes I see it that way,
sometimes it seems very sensible for throttling to happen at the device
queue.

Can I ask a question:  Why do we have a 'nr_requests' maximum?  Why not
just allocate request structures whenever a request is made?
Is there some reason relating to making the block layer work more
efficiently, or is it just because the VM requires it?

I'm beginning to think that the current scheme really works very well -
except for a few 'bugs'(*).

The one change that might make sense would be for the VM to be able to
tune the queue size of each backing dev.  Exactly how that would work
I'm not sure, but the goal would be to get the sum of the active queue
sizes to about one half of dirty_threshold.

The 'bugs' I am currently aware of are:
 - nfs doesn't put a limit on the request queue
 - the ext3 journal often writes out dirty data without clearing the
   Dirty flag on the page - so the nr_dirty count ends up wrong.
   ext3 writes the buffers out and marks them clean.
   So when the VM tries to flush a page, it finds all the buffers are
   clean and so marks the page clean, so the nr_dirty count eventually
   gets correct again, but I think this can cause write throttling to
   be very unfair at times.

I think we need a queue limit on NFS requests.....

NeilBrown

^ permalink raw reply	[flat|nested] 37+ messages in thread
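The tuning Neil floats - the VM sizing the active device queues so that
together they cover about half of dirty_threshold - reduces to a tiny
calculation.  This is a toy, not a proposal for real kernel code:

```c
/*
 * Toy version of the queue auto-tuning idea: give each active device
 * queue an equal share of half the dirty threshold, so the queue
 * limits together cannot pin more than ~dirty_thresh/2 pages.
 */
long per_queue_limit(long dirty_thresh, int active_queues)
{
    long budget = dirty_thresh / 2;   /* half the threshold, in pages */

    if (active_queues <= 1)
        return budget;
    return budget / active_queues;
}
```

The hard part, which the model skips entirely, is deciding what counts
as an "active" queue and resizing limits as devices come and go.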
* Re: RFC - how to balance Dirty+Writeback in the face of slow writeback.
  2006-08-25  4:36 ` Neil Brown
@ 2006-08-25  6:37   ` Jens Axboe
  2006-08-28  1:28     ` David Chinner
  1 sibling, 1 reply; 37+ messages in thread
From: Jens Axboe @ 2006-08-25  6:37 UTC (permalink / raw)
  To: Neil Brown; +Cc: David Chinner, Andi Kleen, linux-kernel, akpm

On Fri, Aug 25 2006, Neil Brown wrote:
> On Monday August 21, axboe@suse.de wrote:
> >
> > But these numbers are in no way tied to the hardware. It may be totally
> > reasonable to have 3GiB of dirty data on one system, and it may be
> > totally unreasonable to have 96MiB of dirty data on another. I've always
> > thought that assuming any kind of reliable throttling at the queue level
> > is broken and that the vm should handle this completely.
>
> I keep changing my mind about this.  Sometimes I see it that way,
> sometimes it seems very sensible for throttling to happen at the
> device queue.
>
> Can I ask a question:  Why do we have a 'nr_requests' maximum?  Why
> not just allocate request structures whenever a request is made?
> Is there some reason relating to making the block layer work more
> efficiently, or is it just because the VM requires it?

It's by and large because the vm requires it.  Historically the limit
was there because the requests were statically allocated.  Later the
limit helped bound runtimes for the io scheduler, since the merge and
sort operations were O(N) each.  Right now any of the io schedulers can
handle larger numbers of requests without breaking a sweat, but the vm
gets pretty nasty if you set (eg) 8192 requests as your limit.

The limit is also handy for avoiding filling memory with request
structures.  At some point there's little benefit to doing larger
queues, depending on the workload and hardware.  128 is usually a
pretty fair number, so...

> I'm beginning to think that the current scheme really works very well
> - except for a few 'bugs'(*).
It works ok, but it makes it hard to experiment with larger queue depths when the vm falls apart :-). It's not a big deal, though, even if the design isn't very nice - nr_requests is not a well defined entity. It can be anywhere from 512b to megabyte(s) in size. So throttling on X number of requests tends to be pretty vague and depends hugely on the workload (random vs sequential IO). -- Jens Axboe ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: RFC - how to balance Dirty+Writeback in the face of slow writeback. 2006-08-25 6:37 ` Jens Axboe @ 2006-08-28 1:28 ` David Chinner 0 siblings, 0 replies; 37+ messages in thread From: David Chinner @ 2006-08-28 1:28 UTC (permalink / raw) To: Jens Axboe; +Cc: Neil Brown, David Chinner, Andi Kleen, linux-kernel, akpm On Fri, Aug 25, 2006 at 08:37:24AM +0200, Jens Axboe wrote: > On Fri, Aug 25 2006, Neil Brown wrote: > > > I'm beginning to think that the current scheme really works very well > > - except for a few 'bugs'(*). > > It works ok, but it makes it hard to experiment with larger queue depths > when the vm falls apart :-). It's not a big deal, though, even if the > design isn't very nice - nr_requests is not a well defined entity. It > can be anywhere from 512b to megabyte(s) in size. So throttling on X > number of requests tends to be pretty vague and depends hugely on the > workload (random vs sequential IO). So maybe we need a different control parameter - the amount of memory we allow to be backed up in a queue rather than the number of requests the queue can take... Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group ^ permalink raw reply [flat|nested] 37+ messages in thread
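Dave's alternative control parameter - cap the memory backed up in a
queue rather than the request count - can be sketched as a byte-budget
admission check.  The struct and function names below are invented for
illustration; real request queues would need locking and a wakeup path:

```c
#include <stdbool.h>

/*
 * Sketch of throttling on queued bytes instead of queued requests, so
 * a queue full of 512-byte requests and one full of megabyte requests
 * hold comparable amounts of dirty data.
 */
struct byte_queue {
    long inflight_bytes;    /* data currently queued to the device */
    long limit_bytes;       /* cap on queued data, not on requests */
};

bool byte_queue_admit(struct byte_queue *q, long req_bytes)
{
    if (q->inflight_bytes + req_bytes > q->limit_bytes)
        return false;       /* caller must wait for completions */
    q->inflight_bytes += req_bytes;
    return true;
}
```

This directly addresses Jens' complaint that nr_requests is vague: the
bound becomes a well-defined amount of memory regardless of whether the
workload is random or sequential.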
* Re: RFC - how to balance Dirty+Writeback in the face of slow writeback. 2006-08-25 4:36 ` Neil Brown 2006-08-25 6:37 ` Jens Axboe @ 2006-08-25 13:16 ` Trond Myklebust 2006-08-27 8:21 ` Neil Brown 1 sibling, 1 reply; 37+ messages in thread From: Trond Myklebust @ 2006-08-25 13:16 UTC (permalink / raw) To: Neil Brown; +Cc: Jens Axboe, David Chinner, Andi Kleen, linux-kernel, akpm On Fri, 2006-08-25 at 14:36 +1000, Neil Brown wrote: > The 'bugs' I am currently aware of are: > - nfs doesn't put a limit on the request queue > - the ext3 journal often writes out dirty data without clearing > the Dirty flag on the page - so the nr_dirty count ends up wrong. > ext3 writes the buffers out and marks them clean. So when > the VM tried to flush a page, it finds all the buffers are clean > and so marks the page clean, so the nr_dirty count eventually > gets correct again, but I think this can cause write throttling to > be very unfair at times. > > I think we need a queue limit on NFS requests..... That is simply not happening until someone can give a cogent argument for _why_ it is necessary. Such a cogent argument must, among other things, allow us to determine what would be a sensible queue limit. It should also point out _why_ the filesystem should be doing this instead of the VM. Furthermore, I'd like to point out that NFS has a "third" state for pages: following an UNSTABLE write the data on them is marked as 'uncommitted'. Such pages are tracked using the NR_UNSTABLE_NFS counter. The question is: if we want to set limits on the write queue, what does that imply for the uncommitted writes? If you go back and look at the 2.4 NFS client, we actually had an arbitrary queue limit. That limit covered the sum of writes+uncommitted pages. Performance sucked, 'cos we were not able to use server side caching efficiently. 
The number of COMMIT requests (causes the server to fsync() the client's data to disk) on the wire kept going through the roof as we tried to free up pages in order to satisfy the hard limit. For those reasons and others, the filesystem queue limit was removed for 2.6 in favour of allowing the VM to control the limits based on its extra knowledge of the state of global resources. Trond ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: RFC - how to balance Dirty+Writeback in the face of slow writeback. 2006-08-25 13:16 ` Trond Myklebust @ 2006-08-27 8:21 ` Neil Brown 0 siblings, 0 replies; 37+ messages in thread From: Neil Brown @ 2006-08-27 8:21 UTC (permalink / raw) To: Trond Myklebust; +Cc: Jens Axboe, David Chinner, Andi Kleen, linux-kernel, akpm On Friday August 25, trond.myklebust@fys.uio.no wrote: > On Fri, 2006-08-25 at 14:36 +1000, Neil Brown wrote: > > The 'bugs' I am currently aware of are: > > - nfs doesn't put a limit on the request queue > > - the ext3 journal often writes out dirty data without clearing > > the Dirty flag on the page - so the nr_dirty count ends up wrong. > > ext3 writes the buffers out and marks them clean. So when > > the VM tried to flush a page, it finds all the buffers are clean > > and so marks the page clean, so the nr_dirty count eventually > > gets correct again, but I think this can cause write throttling to > > be very unfair at times. > > > > I think we need a queue limit on NFS requests..... > > That is simply not happening until someone can give a cogent argument > for _why_ it is necessary. Such a cogent argument must, among other > things, allow us to determine what would be a sensible queue limit. It > should also point out _why_ the filesystem should be doing this instead > of the VM. Well, I'm game.... let's see how I do ... (Hmmm. 290 lines. Maybe I did too well :-). Firstly - what would be a sensible queue limit? To a large extent the size isn't very important. It needs to be big enough that you can have a reasonable number of concurrent in-flight requests - this is something that only the NFS client can determine. It needs to be small enough to not be able to use up too much memory. This is something that the VM should impose, but doesn't yet. Something like a few percent of memory would probably be the right ball park. I agree this is something the VM should be responsible for, but isn't yet. 
However I don't think it is a big part of the picture.

So:  why is it necessary?

How about "Because that's how the Linux VM works".  I can see that is
not very satisfactory, and it isn't the whole answer, but I think it is
worth saying.  There are probably several ways to do effective write
throttling.  The Linux VM (currently) uses push-back from the writeout
queue to achieve write throttling, so we really need all writeout
mechanisms to provide some pushback.  Maybe the VM could be changed,
but I think it would be a big change and not something to do lightly -
partly because I think the current system works (for the most part)
very well.

So:  how does it work, why is it so clever, and how does it depend on
push-back?  It is actually quite subtle...

Write throttling needs to slow down the generation of dirty pages.  (It
also could speed up the cleaning of dirty pages, but there is a limit
to how fast cleaning can happen, so the majority of the control imposed
is the slowing of generation of dirt).  It needs to do this in a way
that is 'fair' in some (fairly coarse) sense.  Possibly the best
understanding of fairness is that - for any process - the delay imposed
is proportional to the cost generated.  i.e. generating more dirty
pages imposes more delay.  Generating dirty pages on slower devices
imposes more delay.

When the fraction of dirty pages is below some threshold (40% by
default) no throttling is imposed.  When we cross that threshold we
switch on write throttling.

The simplest approach might be to block all writers and push out pages
until we are below the threshold again.  This isn't fair in the above
sense as all dirtying processes are stopped for the same period of
time, independent of how much dirt they have created.

An 'obvious' response to that problem is to do lots of 'bean counting'.
i.e. count how many pages are generated by each process per unit time,
and count how many pages are dirty for each writeout queue, and impose
delays accordingly.
However bean counting can hurt performance, even when it isn't needed
(note the effort that is taken to minimise the performance impact of
counting 'Dirty' and 'Writeback' pages etc.  We don't really want lots
of fine grained counting).

Another 'obvious' response might be 'if you have dirtied N pages, then
wait until N pages have become clean'.  However that isn't easy to do
because if multiple processes are waiting for N pages to be cleaned,
you would need to credit each page as it becomes clean to some process,
and that is just more bean counting.

So what we actually do is notice that there are two sorts of dirty
pages:  those marked 'Dirty' and those marked 'Writeback'.  Further we
know there is some upper limit to the number of Writeback pages (this
is where the queue limit comes in).  So the penalty imposed on a
process that has dirtied N pages is "You must transition 1.5*N pages
from 'Dirty' to 'Writeback'".  It is only required to do that on the
writeout queue that it is using, so a process writing to a fast device
shouldn't be slowed down by the presence of a slow device.

The first few processes might not find this to be much penalty as they
will just be filling up the writeout queue.  But soon the queues will
be full and write throttling will have its intended effect.  So in the
steady state, for every N pages that are dirtied, a process needs to
wait for 1.5*N pages to become clean on the same device, which sounds
reasonably fair.  Naturally this will push the number of dirty pages
steadily down until we are below the threshold, and then it will be
free rein again.  Thus under heavy write load, the system can expect to
oscillate between just above and just below the threshold.

So you can imagine - if some writeout queue does not put a bound on its
size (or has a size that is a substantial fraction of memory) then the
above will simply not work.
Transitioning pages from Dirty to Writeback will not impose a delay so the number of Writeback pages will continue to grow. Could the VM impose this limit itself? Probably yes. But the writeout queue driver is in a perfect position to keep track of the number of outstanding requests, so it may as well impose the limit. The current (unstated) internal API for Linux says that the writeout queue must impose the limit, so NFS should do so too. Is that cogent enough? I hope so. I haven't forgotten about 'unstable'. I'll get to that a little further down, but while that description is fresh in your mind.... So, if this is such a glaring problem with NFS, why aren't more people having problems? This is a very good question and one I only just found an answer to - quite an interesting answer I think. We have had quite a few customers report problems. They all seem to be using very big machines (8Gig plus). However I couldn't come close to duplicating the problem on the biggest machine I have easy access to which has 6Gig. I now think the memory size is only part of the issue. The other side of the issue is the NFS server. I had always been writing to a Linux NFS server. And, I'm sorry to admit, the Linux NFS server isn't ideal. In particular: In NFSv3 the response to a WRITE request (and other requests) contain post-op attributes and (optionally) pre-op attributes. The pre-op attributes are only allowed if the server can guarantee that the WRITE operation was the only operation to be performed on the file between the moment when 'pre-op' were valid, and the moment when 'post-op' were valid. The idea is that the Client sends a WRITE request, gets a reply and if the pre-op attributes are present and match the client's cache, then no-one else has touched the file, the post-op attributes are now valid, and the clients cache is otherwise certain to be up-to-date. 
The Linux NFS server doesn't send pre-op attributes (because it cannot
hold a lock on the file while doing a write ... probably a fixable
problem with leases or something).  So when the client writes to the
NFS server it always gets a response that says effectively "Someone
else might have written to that file recently".

In nfs_file_write there is (or was until very recently) a call to
nfs_revalidate_mapping which will flush out all changed pages if there
is some doubt about the contents of the cache.  So on a Linux client
(2.6.17 or prior) writing to a Linux server, you will find the writer
process almost always in invalidate_inode_pages due to this call.
I.e. it is flushing out data quite independent of write throttling.
(Write throttling triggers the first write, that causes the caches to
be doubtful, and then the next write will flush the cache).

Writing to a different NFS server (e.g. a NetApp filer I expect - I
don't have complete server details on all my bug reports), the pre-op
attributes could be present, and you get a blow-out in the size of the
Writeback list.

Having noticed the recent change in nfs_file_write, I tried on
2.6.17-rc4-mm2 in a qemu instance and the Writeback count steadily grew
while the 'Dirty' count went down to zero.  Then Writeback dropped back
to around 40% while Dirty stayed at zero (as balance_dirty_pages has a
fall back - if you cannot write out 1.5*N, then wait until the number
of dirty pages is below the threshold).

On a small memory machine this doesn't cause much of a problem.  On a
large machine with millions of Writeback pages all on the inactive
list, things slow to a crawl.

[I'm curious about why the invalidate_inode_pages call was removed from
nfs_file_write... presumably the cache inconsistency is still
potentially there, so clean pages should be purged.  Whether dirty
pages should be purged is a different question....]

Despite having said that the current approach is quite clever, I think
there is room for improvement.
It is mostly tuning around the edges.

- I think that when there is a high write load, the system should stay in 'over-threshold' mode rather than oscillating back and forth. e.g. once we go over the threshold we set 'dirty_exceeded' and leave it set until below (say) 90% of the threshold. While dirty_exceeded is true: if over the threshold, the 'write_chunk' (number of pages to be transitioned from Dirty to Writeback) should be 1.5*N; if under, then just 1.0*N. This would mean a fairly predictable steady-state behaviour which would be more likely to be fair.

- We shouldn't fall back to 'wait until below the threshold' when we fail to write out 1.5*N pages. Rather we should realise this means there are no dirty blocks on this device, and just let the process continue - obviously it has done its work. However that fall-back is the only thing currently saving NFS from using up all available pages in its writeout queue, so that change cannot happen until NFS joins the party.

- When 'dirty_exceeded' is true, balance_dirty_pages gets called for every 8 pages that are written, but 'write_chunk' is still set to 3/64 of memory (or 6Meg, whichever is smaller) (that assumes a 1 CPU system). This seems terribly unfair, but should be trivial to fix.

There is one more issue that I feel I should talk about in order to give the full picture on write-throttling. The above treatment only considers the fact that the throttled processes are causing transitions from Dirty to Writeback, and it tries to require an appropriate number of transitions from each. But in fact other processes might be causing that transition and so be using up spots on the writeout queue, thus imposing more throttling.

bdflush also causes writeback on Dirty pages. This will only write out old pages.
I suspect (no measuring attempted) this will either have no effect at all when write-throttling is happening (as no pages will be old enough) or will mean that the slowest device will see more writeout than faster devices (as the oldest pages should belong to the slowest device). This means that writers to the slowest device will have to wait longer than they might expect (because the slots in the writeout queue that they could have taken were taken by bdflush). This should generally push down the number of dirty pages used by the slowest device, which is possibly a good thing.

Another process that causes writeback on Dirty pages is the ext3 journald. It actually writes out the buffers so that when the VM tries to write the page, it finds it is actually clean already. However the journald is still 'stealing' slots in the request queue that the write-throttled processes cannot get. I'm not sure what the net effect of this will be. In my experimenting I did have a scenario where it had a pronounced effect, but I have changed enough code that I don't think it was in any way representative of what the mainline kernel would do.

Anyway, to your other issue...

> Furthermore, I'd like to point out that NFS has a "third" state for
> pages: following an UNSTABLE write the data on them is marked as
> 'uncommitted'. Such pages are tracked using the NR_UNSTABLE_NFS counter.
> The question is: if we want to set limits on the write queue, what does
> that imply for the uncommitted writes?

Excellent question.

In some ways, Unstable pages are like Dirty pages. i.e. there is some operation that needs to be done on them. Either WRITE or COMMIT. This similarity is supported by the fact that the two numbers are added together into 'nr_reclaimable' in balance_dirty_pages.

A first guess might be that the write throttling should transition 1.5*N "dirty or unstable" pages to Writeback (What state are Unstable pages when the commit is happening?
Let's pretend for the moment they are Writeback again, it isn't important for this discussion). However that wouldn't work. For every page that is dirtied, one dirty page needs to be written and one unstable page needs to be committed. So maybe nfs_writepages (which does the final writing I think) should effectively double nr_to_write, as WRITEing is only half of the required work.

But that won't work either. Because sometimes a WRITE can return DATA_STABLE and no commit is needed. So requiring twice as many WRITEs or COMMITs isn't right.

I think that nfs_writepages needs to try to write nr_to_write pages, and try to commit nr_to_write pages, and should decrease nr_to_write by the maximum of the number of pages written and the number of pages committed. (I note that nfs currently doesn't commit individual pages but instead commits the whole file. Obviously if you end up committing more pages than nr_to_write, you wouldn't decrease nr_to_write so much that it goes negative).

However this thought hasn't been thoroughly considered. It might have problems and there might be a better way. The important thing is to generate some requests and to decrease nr_to_write in proportion to the amount of useful work that has been done.

> If you go back and look at the 2.4 NFS client, we actually had an
> arbitrary queue limit. That limit covered the sum of writes+uncommitted
> pages. Performance sucked, 'cos we were not able to use server side
> caching efficiently. The number of COMMIT requests (causes the server to
> fsync() the client's data to disk) on the wire kept going through the
> roof as we tried to free up pages in order to satisfy the hard limit.

Pages that are UNSTABLE shouldn't use slots in the request queue (which you seem to be implying was the case in 2.4). So while we are below the threshold for throttling, commits would only be forced by sync or bdflush (how does that work? The first bdflush pass triggers a WRITE, the next one triggers the COMMIT?).
But once we get above the threshold, we really want to be calling COMMIT quite a lot to get the number of "dirty or writeback or unstable" pages down. So calling COMMIT on every nfs_write_pages that came from balance_dirty_pages would seem fairly appropriate.

> For those reasons and others, the filesystem queue limit was removed for
> 2.6 in favour of allowing the VM to control the limits based on its
> extra knowledge of the state of global resources.

Unfortunately the VM still needs a little help from the writeout queue.

(I think I'm going to write a Documentation file on how writeback works - but would anyone read it....)

NeilBrown

^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: RFC - how to balance Dirty+Writeback in the face of slow writeback. 2006-08-21 7:24 ` Neil Brown 2006-08-21 13:51 ` Jens Axboe @ 2006-08-21 14:28 ` David Chinner 2006-08-25 5:24 ` Neil Brown 1 sibling, 1 reply; 37+ messages in thread From: David Chinner @ 2006-08-21 14:28 UTC (permalink / raw) To: Neil Brown; +Cc: David Chinner, Andi Kleen, Jens Axboe, linux-kernel, akpm On Mon, Aug 21, 2006 at 05:24:14PM +1000, Neil Brown wrote: > On Monday August 21, dgc@sgi.com wrote: > > On Mon, Aug 21, 2006 at 10:35:31AM +1000, Neil Brown wrote: > > > On August 18, ak@suse.de wrote: > > > > Jens Axboe <axboe@suse.de> writes: > > > > > > > > > On Thu, Aug 17 2006, Andrew Morton wrote: > > > > > > It seems that the many-writers-to-different-disks workloads don't happen > > > > > > very often. We know this because > > > > > > > > > > > > a) The 2.4 performance is utterly awful, and I never saw anybody > > > > > > complain and > > > > > > > > > > Talk to some of the people that used DVD-RAM devices (or other > > > > > excruciatingly slow writers) on their system, and they would disagree > > > > > violently :-) > > > > > > > > I hit this recently while doing backups to a slow external USB disk. > > > > The system was quite unusable (some commands blocked for over a minute) > > > > > > Ouch. > > > I suspect we are going to see more of this, as USB drive for backups > > > is probably a very attractive option for many. > > > > I can't see how this would occur on a 2.6 kernel unless the problem is > > that all the reclaimable memory in the machine is dirty page cache pages > > every allocation is blocking waiting for writeback to the slow device to > > occur. That is, we filled memory with dirty pages before we got to the > > throttle threshold. > > I started writing a longish reply to this explaining how maybe that > could happen, and then realised I had been missing important aspects > of the code. 
> > writeback_inodes doesn't just work on any random device as both you > and I thought. The 'writeback_control' structure identifies the bdi > to be flushed and it will only call __writeback_sync_inode on inodes > with the same bdi. Yes, now that you point it out it is obvious :/ > This means that any process writing to a particular bdi should throttle > against the queue limits in the bdi once we pass the dirty threshold. > This means that it shouldn't be able to fill memory with dirty pages > for that device (unless the bdi doesn't have a queue limit like > nfs...). *nod* > But now I see another way that Jens' problem could occur... maybe. > > Suppose the total Dirty+Writeback exceeds the threshold due entirely > to the slow device, and it is slowly working its way through the > writeback pages. > > We write to some other device, make 'ratelimit_pages' pages dirty and > then hit balance_dirty_pages. We now need to either get the total > dirty pages below the threshold or start writeback on > 1.5 * ratelimit_pages > pages. As we only have 'ratelimit_pages' dirty pages we cannot start > writeback on enough, and so must wait until Dirty+Writeback drops > below the threshold. And as we are waiting on the slow device, that > could take a while (especially as it is possible that no-one is > calling balance_dirty_pages against that bdi). Hmm - one thing I just noticed - when we loop after sleeping, we reset wbc->nr_to_write = write_chunk. Hence once we get into this throttle loop, we can't break out until we drop under the threshold even if we manage to write a few more pages. > > That is, we allow a per-block-dev value to be set that overrides the > > global setting for that blockdev only. Hence for slower devices > > we can set the point at which we throttle at a much lower dirty > > memory threshold when that block device is congested. > > > > I don't think this would help. 
The bdi with the higher threshold > could exclude bdis with lower thresholds from making any forward > progress. True - knowing the writeback is not on all superblocks changes the picture a little :/ > Here is a question: > Seeing that > wbc.nonblocking == 0 > wbc.older_than_this == NULL > wbc.range_cyclic == 0 > in balance_dirty_pages when it calls writeback_inodes, under what > circumstances will writeback_inodes return with wbc.nr_to_write > 0 > ?? The page couldn't be written out for some reason and redirty_page_for_writepage() was called. A few filesystems call this in different situations, generally error conditions. In that case, we end up with wbc->pages_skipped increasing rather than wbc->nr_to_write decreasing.... > If a write error occurs it could abort early, but otherwise I think > it will only exit early if it runs out of pages to write, because > there aren't any dirty pages. Yes, I think you're right, Neil. > If that is true, then after calling writeback_inodes once, > balance_dirty_pages should just exit. It isn't going to find any more > work to do next time it is called anyway. > Either the queue was never congested, in which case we don't need to > throttle writes, or it blocked for a while waiting for the queue to > clean (in ->writepage) and so has successfully throttled writes. *nod* > So my feeling (at the moment) is that balance_dirty_pages should look > like: > > if below threshold > return > writeback_inodes({.bdi = mapping->backing_dev_info)} ) > > while (above threshold + 10%) > writeback_inodes(.bdi = NULL) > blk_congestion_wait > > and all bdis should impose a queue limit. I don't really like the "+ 10%" in there - it's too rubbery given the range of memory sizes Linux supports (think of an Altix with several TBs of RAM in it ;). With bdis imposing a queue limit, the number of writeback pages should be bound and so we shouldn't need headroom like this. 
Hmmm - the above could put the writer to sleep on the request queue of the slow device that holds all dirty+writeback. This could effectively slow all writers down to the rate of the slowest device in the system as they all attempt to do blocking writeback on the only dirty bdi (the really slow one).

> Then we just need to deal with the case where the sum of the queue
> limits of all devices exceeds the dirty threshold....
> Maybe writeout queues need to auto-adjust their queue length when some
> system-wide situation is detected.... sounds messy.

Pretty uncommon case, I think. If someone has a system like that then tuning the system is not unreasonable....

AFAICT, all we need to do is prevent interactions between bdis and the current problem is that we loop on clean bdis waiting for slow dirty ones to drain.

My thoughts are along the lines of a decay in nr_to_write between loop iterations when we don't write out enough pages (i.e. clean bdi) so we break out of the loop sooner rather than later.
Something like:

---
 mm/page-writeback.c |   19 +++++++++++++------
 1 file changed, 13 insertions(+), 6 deletions(-)

Index: 2.6.x-xfs-new/mm/page-writeback.c
===================================================================
--- 2.6.x-xfs-new.orig/mm/page-writeback.c	2006-05-29 15:14:10.000000000 +1000
+++ 2.6.x-xfs-new/mm/page-writeback.c	2006-08-21 23:53:40.849788387 +1000
@@ -195,16 +195,17 @@ static void balance_dirty_pages(struct a
 	long dirty_thresh;
 	unsigned long pages_written = 0;
 	unsigned long write_chunk = sync_writeback_pages();
+	unsigned long write_decay = write_chunk >> 2;
 	struct backing_dev_info *bdi = mapping->backing_dev_info;
+	struct writeback_control wbc = {
+		.bdi = bdi,
+		.sync_mode = WB_SYNC_NONE,
+		.older_than_this = NULL,
+		.nr_to_write = write_chunk,
+	};
 
 	for (;;) {
-		struct writeback_control wbc = {
-			.bdi = bdi,
-			.sync_mode = WB_SYNC_NONE,
-			.older_than_this = NULL,
-			.nr_to_write = write_chunk,
-		};
 		get_dirty_limits(&wbs, &background_thresh,
 					&dirty_thresh, mapping);
@@ -231,8 +232,14 @@ static void balance_dirty_pages(struct a
 			pages_written += write_chunk - wbc.nr_to_write;
 			if (pages_written >= write_chunk)
 				break;		/* We've done our duty */
+
+			/* Decay the remainder so we don't get stuck here
+			 * waiting for some other slow bdi to flush */
+			wbc.nr_to_write -= write_decay;
 		}
 		blk_congestion_wait(WRITE, HZ/10);
+		if (wbc.nr_to_write <= 0)
+			break;
 	}
 	if (nr_reclaimable + wbs.nr_writeback <= dirty_thresh && dirty_exceeded)
---

Thoughts?

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group

^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: RFC - how to balance Dirty+Writeback in the face of slow writeback.
2006-08-21 14:28 ` David Chinner
@ 2006-08-25 5:24 ` Neil Brown
2006-08-28 1:55 ` David Chinner
0 siblings, 1 reply; 37+ messages in thread
From: Neil Brown @ 2006-08-25 5:24 UTC (permalink / raw)
To: David Chinner; +Cc: Andi Kleen, Jens Axboe, linux-kernel, akpm

On Tuesday August 22, dgc@sgi.com wrote:
> On Mon, Aug 21, 2006 at 05:24:14PM +1000, Neil Brown wrote:
> > So my feeling (at the moment) is that balance_dirty_pages should look
> > like:
> >
> >   if below threshold
> >      return
> >   writeback_inodes({.bdi = mapping->backing_dev_info)} )
> >
> >   while (above threshold + 10%)
> >      writeback_inodes(.bdi = NULL)
> >      blk_congestion_wait
> >
> > and all bdis should impose a queue limit.
>
> I don't really like the "+ 10%" in there - it's too rubbery given
> the range of memory sizes Linux supports (think of an Altix with
> several TBs of RAM in it ;). With bdis imposing a queue limit, the
> number of writeback pages should be bound and so we shouldn't need
> headroom like this.

I had that there precisely because some BDIs are not bounded - nfs in particular (which is what started this whole thread). I think I'm now convinced that nfs really needs to limit its writeout queue.

> Hmmm - the above could put the writer to sleep on the request queue
> of the slow device that holds all dirty+writeback. This could
> effectively slow all writers down to the rate of the slowest device
> in the system as they all attempt to do blocking writeback on the
> only dirty bdi (the really slow one).
>
> AFAICT, all we need to do is prevent interactions between bdis and
> the current problem is that we loop on clean bdis waiting for slow
> dirty ones to drain.
>
> My thoughts are along the lines of a decay in nr_to_write between
> loop iterations when we don't write out enough pages (i.e. clean
> bdi) so we break out of the loop sooner rather than later.

I don't understand the purpose of the decay.
Once you are sure the bdi is clean, why not break out of the loop straight away?

Also, your code is a little confusing. The "pages_written += write_chunk - wbc.nr_to_write" in the loop assumes that wbc.nr_to_write equalled write_chunk just before the call to writeback_inodes; however, as you have moved the initialisation of wbc out of the loop, this is no longer true.

So I would like us to break out of the loop as soon as there is good reason to believe the bdi is clean. So maybe something like this.. Note that we *must* have a bounded queue on all bdis or this patch can cause substantial badness.

NeilBrown

Signed-off-by: Neil Brown <neilb@suse.de>

### Diffstat output
 ./mm/page-writeback.c |    9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff .prev/mm/page-writeback.c ./mm/page-writeback.c
--- .prev/mm/page-writeback.c	2006-08-25 15:18:37.000000000 +1000
+++ ./mm/page-writeback.c	2006-08-25 15:22:39.000000000 +1000
@@ -187,7 +187,7 @@ static void balance_dirty_pages(struct a
 			.bdi = bdi,
 			.sync_mode = WB_SYNC_NONE,
 			.older_than_this = NULL,
-			.nr_to_write = write_chunk,
+			.nr_to_write = write_chunk - pages_written,
 			.range_cyclic = 1,
 		};
@@ -217,10 +217,13 @@ static void balance_dirty_pages(struct a
 			    global_page_state(NR_WRITEBACK) <= dirty_thresh)
 				break;
 
-			pages_written += write_chunk - wbc.nr_to_write;
+			if (pages_written == write_chunk - wbc.nr_to_write)
+				break;	/* couldn't write - must be clean */
+			pages_written = write_chunk - wbc.nr_to_write;
 			if (pages_written >= write_chunk)
 				break;	/* We've done our duty */
-		}
+		} else
+			break;
 		blk_congestion_wait(WRITE, HZ/10);
 	}

^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: RFC - how to balance Dirty+Writeback in the face of slow writeback. 2006-08-25 5:24 ` Neil Brown @ 2006-08-28 1:55 ` David Chinner 0 siblings, 0 replies; 37+ messages in thread From: David Chinner @ 2006-08-28 1:55 UTC (permalink / raw) To: Neil Brown; +Cc: David Chinner, Andi Kleen, Jens Axboe, linux-kernel, akpm On Fri, Aug 25, 2006 at 03:24:47PM +1000, Neil Brown wrote: > On Tuesday August 22, dgc@sgi.com wrote: > > AFAICT, all we need to do is prevent interactions between bdis and > > the current problem is that we loop on clean bdis waiting for slow > > dirty ones to drain. > > > > My thoughts are along the lines of a decay in nr_to_write between > > loop iterations when we don't write out enough pages (i.e. clean > > bdi) so we break out of the loop sooner rather than later. > > I don't understand the purpose of the decay. Once you are sure the > bdi is clean, why not break out of the loop straight away? Simply to slow down the rate at which any process is dirtying memory. The decay only becomes active when you're writing to a clean device when there are lots of dirty pages on a slow device, otherwise it's a no-op. To illustrate the problem of breaking straight out of the throttle loop, even though we hit the dirty rate limit we may have dirtied pages on multiple bdis but we are only flushing on one of them. Hence we could potentially trigger increasing numbers of dirty pages if we don't back off in some way when throttling here even though the device we throttled on was clean. e.g. Think of writing data to a slow device, then a log entry to a fast device, and every time the write to the fast device triggers the throttling which gets cleaned and we go and dirty more pages on the slow device immediately without throttling.... > Also, your code is a little confusing. The Sorry, it was a quick hack to illustrate my thinking..... > So I would like us to break out of the loop as soon as there is good > reason to believe the bdi is clean. 
Which was exactly my line of thinking, but tempered by the fact that just breaking out of the loop could introduce a nasty problem.... Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: RFC - how to balance Dirty+Writeback in the face of slow writeback. 2006-08-21 3:15 ` David Chinner 2006-08-21 7:24 ` Neil Brown @ 2006-08-21 7:47 ` Andi Kleen 1 sibling, 0 replies; 37+ messages in thread From: Andi Kleen @ 2006-08-21 7:47 UTC (permalink / raw) To: David Chinner; +Cc: Neil Brown, Jens Axboe, linux-kernel, akpm > > Ouch. > > I suspect we are going to see more of this, as USB drive for backups > > is probably a very attractive option for many. > > I can't see how this would occur on a 2.6 kernel I still got the traces to prove it: http://www.firstfloor.org/~andi/usb-loop-copy-stall-1 e.g. notice the lynx which is stuck in a m/atime update. It was stalling for a quite long time. > -Andi ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: RFC - how to balance Dirty+Writeback in the face of slow writeback.
2006-08-18 6:29 ` Andrew Morton
2006-08-18 7:03 ` Jens Axboe
@ 2006-08-18 7:07 ` Neil Brown
1 sibling, 0 replies; 37+ messages in thread
From: Neil Brown @ 2006-08-18 7:07 UTC (permalink / raw)
To: Andrew Morton; +Cc: David Chinner, linux-kernel

On Thursday August 17, akpm@osdl.org wrote:
>
> btw, Neil, has the Pagewriteback windup actually been demonstrated? If so,
> how?

Yes. On large machines (e.g. 16G) just writing to large files (I think. I don't have precise details of the application, but I think in one case it was just iozone). By "large files" I mean larger than memory. This has happened on both SLES9 (2.6.5 based) and SLES10 (2.6.16 based).

We do have an extra patch in balance_dirty_pages which I haven't tracked down the reason for yet. It has the effect of breaking out of the loop once nr_dirty hits 0, which makes the problem hard to recover from. It may even be making it occur more quickly - I'm not sure.

What we see is Pagewriteback at about 10G out of 16G, and Dirty at 0. The whole machine pretty much slows to a halt. There is little free memory so lots of processes end up in 'reclaim' walking the inactive list looking for pages to free up. Most of what they find are in Writeback and so they just skip over them. Skipping 2.6 million pages seems to take a little while.

And there is a kmalloc call in the NFS writeout path (it is actually a mempool_alloc so it will succeed, but (partly) as mempool uses the reserve last instead of first, it always looks for free memory first).

So Pagewriteback is at 60%, memory is tight, nfs write is progressing very slowly and (because of our SuSE-specific patch) balance_dirty_pages isn't throttling anymore, so as soon as nfs does manage to write out a page another appears to replace it. I suspect it is making forward progress, but not very much.
We have a fairly hackish patch in place to limit the NFS writeback on a per-file basis (sysctl tunable), but I wanted to understand the real problem so that a real solution could be found.

NeilBrown

^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: RFC - how to balance Dirty+Writeback in the face of slow writeback.
2006-08-17 4:08 ` Neil Brown
2006-08-17 6:14 ` Andrew Morton
@ 2006-08-17 22:17 ` David Chinner
1 sibling, 0 replies; 37+ messages in thread
From: David Chinner @ 2006-08-17 22:17 UTC (permalink / raw)
To: Neil Brown; +Cc: Andrew Morton, linux-kernel

On Thu, Aug 17, 2006 at 02:08:58PM +1000, Neil Brown wrote:
> On Wednesday August 16, dgc@sgi.com wrote:
> >
> > IMO, if you've got slow writeback, you should be reducing the amount
> > of dirty memory you allow in the machine so that you don't tie up
> > large amounts of memory that takes a long time to clean. Throttle earlier
> > and you avoid this problem entirely.
>
> I completely agree that 'throttle earlier' is important. I'm just not
> completely sure what should be throttled when.
>
> I think I could argue that pages in 'Writeback' are really still
> dirty. The difference is really just an implementation issue.

No argument here - I think you're right, Neil.

> So when the dirty_ratio is set to 40%, that should apply to all
> 'dirty' pages, which means both those flagged as 'Dirty' and those
> flagged as 'Writeback'.

Don't forget NFS client unstable pages.

FWIW, with writeback not being accounted as dirty, there is a window in the NFS client where a page during writeback is not dirty or unstable and hence not visible to the throttle. Hence if we have lots of outstanding async writes to NFS servers, or their I/O completion is held off, the throttle won't activate where it should and potentially let too many pages get dirtied. This may not be a major problem with the traditional small write sizes, but with 1MB I/Os this could be a fairly large number of pages that are unaccounted for a short period of time.

> So I think you need to throttle when Dirty+Writeback hits dirty_ratio
> (which we don't quite get right at the moment).
> But the trick is to throttle gently and fairly, rather than having a
> hard wall so that anyone who hits it just stops.

I disagree with the "throttle gently" bit there. If a process is writing faster than the underlying storage can write, then you have to stop the process in its tracks while the storage catches up. Especially if other processes are writing to the same device. You may as well just hit it with a big hammer because it's simple and pretty effective.

Besides, it is difficult to be gentle when you can dirty memory at least an order of magnitude faster than you can clean it.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group

^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: RFC - how to balance Dirty+Writeback in the face of slow writeback. 2006-08-15 8:06 ` Andrew Morton 2006-08-15 23:00 ` David Chinner @ 2006-08-17 3:59 ` Neil Brown 2006-08-17 6:22 ` Andrew Morton 2006-08-17 13:21 ` Trond Myklebust 1 sibling, 2 replies; 37+ messages in thread From: Neil Brown @ 2006-08-17 3:59 UTC (permalink / raw) To: Andrew Morton; +Cc: linux-kernel On Tuesday August 15, akpm@osdl.org wrote: > > When Dirty hits 0 (and Writeback is theoretically 80% of RAM) > > balance_dirty_pages will no longer be able to flush the full > > 'write_chunk' (1.5 times number of recent dirtied pages) and so will > > spin in a loop calling blk_congestion_wait(WRITE, HZ/10), so it isn't > > a busy loop, but it won't progress. > > This assumes that the queues are unbounded. They're not - they're limited > to 128 requests, which is 60MB or so. Ahhh... so the limit on the requests-per-queue is an important part of write-throttling behaviour. I didn't know that, thanks. fs/nfs doesn't seem to impose a limit. It will just allocate as many as you ask for until you start running out of memory. I've seen 60% of memory (10 out of 16Gig) in writeback for NFS. Maybe I should look there to address my current issue, though imposing a system-wide writeback limit seems safer. > > Per queue. The scenario you identify can happen if it's spread across > multiple disks simultaneously. > > CFQ used to have 1024 requests and we did have problems with excessive > numbers of writeback pages. I fixed that in 2.6.early, but that seems to > have got lost as well. > What would you say constitutes "excessive"? Is there any sense in which some absolute number is excessive (as it takes too long to scan some list) or is it just a percent-of-memory thing? > > Something like that - it'll be relatively simple. Unfortunately I think it is also relatively simple to get it badly wrong:-) Make one workload fast, and another slower. But thanks, you've been very helpful (as usual). 
I'll ponder it a bit longer and see what turns up. NeilBrown ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: RFC - how to balance Dirty+Writeback in the face of slow writeback. 2006-08-17 3:59 ` Neil Brown @ 2006-08-17 6:22 ` Andrew Morton 2006-08-17 8:36 ` Jens Axboe 2006-08-17 13:21 ` Trond Myklebust 1 sibling, 1 reply; 37+ messages in thread From: Andrew Morton @ 2006-08-17 6:22 UTC (permalink / raw) To: Neil Brown; +Cc: linux-kernel On Thu, 17 Aug 2006 13:59:41 +1000 Neil Brown <neilb@suse.de> wrote: > > CFQ used to have 1024 requests and we did have problems with excessive > > numbers of writeback pages. I fixed that in 2.6.early, but that seems to > > have got lost as well. > > > > What would you say constitutes "excessive"? Is there any sense in > which some absolute number is excessive (as it takes too long to scan > some list) or is it just a percent-of-memory thing? Excessive = 100% of memory dirty or under writeback against a single disk on a 512MB machine. Perhaps that problem just got forgotten about when CFQ went from 1024 requests down to 128. (That 128 was actually 64-available-for-read+64-available-for-write, so it's really 64 requests). > > > > Something like that - it'll be relatively simple. > > Unfortunately I think it is also relatively simple to get it badly > wrong:-) Make one workload fast, and another slower. > I think it's unlikely in this case. As long as we keep the queues reasonably full, the disks will be running flat-out and merging will be as good as we're going to get. One thing one does have to watch out for is the many-disks scenario: do concurrent dd's onto 12 disks and make sure that none of their LEDs go out. This is actually surprisingly hard to do, but it would be very hard to do worse than 2.4.x ;) ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: RFC - how to balance Dirty+Writeback in the face of slow writeback. 2006-08-17 6:22 ` Andrew Morton @ 2006-08-17 8:36 ` Jens Axboe 0 siblings, 0 replies; 37+ messages in thread From: Jens Axboe @ 2006-08-17 8:36 UTC (permalink / raw) To: Andrew Morton; +Cc: Neil Brown, linux-kernel On Wed, Aug 16 2006, Andrew Morton wrote: > On Thu, 17 Aug 2006 13:59:41 +1000 > Neil Brown <neilb@suse.de> wrote: > > > > CFQ used to have 1024 requests and we did have problems with excessive > > > numbers of writeback pages. I fixed that in 2.6.early, but that seems to > > > have got lost as well. > > > > > > > What would you say constitutes "excessive"? Is there any sense in > > which some absolute number is excessive (as it takes too long to scan > > some list) or is it just a percent-of-memory thing? > > Excessive = 100% of memory dirty or under writeback against a single disk > on a 512MB machine. Perhaps that problem just got forgotten about when CFQ > went from 1024 requests down to 128. (That 128 was actually > 64-available-for-read+64-available-for-write, so it's really 64 requests). That's not quite true, if you set nr_requests to 128 that's 128 for reads and 128 for writes. With the batching you will actually typically see 128 * 3 / 2 == 192 requests allocated. Which translates to about 96MiB of dirty data on the queue, if everything works smoothly. The 3/2 limit is quite new, before I introduced that, if you had a lot of writes each of them would be allowed 16 requests over the limit. So you would sometimes see huge queues, as with just eg 16 writes, you could have 128 + 16*16 requests allocated. I've always been of the opinion that the vm should handle all of this, and things should not change or break if I set 10000 as the request limit. 
A rate-of-dirtying throttling per process sounds like a really good idea, we badly need to prevent the occasional write (like a process doing sync reads, and getting stuck in slooow reclaim) from being throttled in the presence of a heavy dirtier. -- Jens Axboe ^ permalink raw reply [flat|nested] 37+ messages in thread
* Re: RFC - how to balance Dirty+Writeback in the face of slow writeback.
  2006-08-17  3:59 ` Neil Brown
  2006-08-17  6:22 ` Andrew Morton
@ 2006-08-17 13:21 ` Trond Myklebust
  2006-08-17 15:30 ` Andrew Morton
  1 sibling, 1 reply; 37+ messages in thread

From: Trond Myklebust @ 2006-08-17 13:21 UTC (permalink / raw)
To: Neil Brown; +Cc: Andrew Morton, linux-kernel

On Thu, 2006-08-17 at 13:59 +1000, Neil Brown wrote:
> On Tuesday August 15, akpm@osdl.org wrote:
> > > When Dirty hits 0 (and Writeback is theoretically 80% of RAM)
> > > balance_dirty_pages will no longer be able to flush the full
> > > 'write_chunk' (1.5 times number of recent dirtied pages) and so will
> > > spin in a loop calling blk_congestion_wait(WRITE, HZ/10), so it isn't
> > > a busy loop, but it won't progress.
> >
> > This assumes that the queues are unbounded.  They're not - they're limited
> > to 128 requests, which is 60MB or so.
>
> Ahhh... so the limit on the requests-per-queue is an important part of
> write-throttling behaviour.  I didn't know that, thanks.
>
> fs/nfs doesn't seem to impose a limit.  It will just allocate as many
> as you ask for until you start running out of memory.  I've seen 60%
> of memory (10 out of 16Gig) in writeback for NFS.
>
> Maybe I should look there to address my current issue, though imposing
> a system-wide writeback limit seems safer.

Exactly how would a request limit help? All that boils down to is having
the VM monitor global_page_state(NR_FILE_DIRTY) versus monitoring
global_page_state(NR_FILE_DIRTY)+global_page_state(NR_WRITEBACK).

Cheers,
  Trond
* Re: RFC - how to balance Dirty+Writeback in the face of slow writeback.
  2006-08-17 13:21 ` Trond Myklebust
@ 2006-08-17 15:30 ` Andrew Morton
  2006-08-17 16:18 ` Trond Myklebust
  0 siblings, 1 reply; 37+ messages in thread

From: Andrew Morton @ 2006-08-17 15:30 UTC (permalink / raw)
To: Trond Myklebust; +Cc: Neil Brown, linux-kernel

On Thu, 17 Aug 2006 09:21:51 -0400
Trond Myklebust <trond.myklebust@fys.uio.no> wrote:

> On Thu, 2006-08-17 at 13:59 +1000, Neil Brown wrote:
> > On Tuesday August 15, akpm@osdl.org wrote:
> > > > When Dirty hits 0 (and Writeback is theoretically 80% of RAM)
> > > > balance_dirty_pages will no longer be able to flush the full
> > > > 'write_chunk' (1.5 times number of recent dirtied pages) and so will
> > > > spin in a loop calling blk_congestion_wait(WRITE, HZ/10), so it isn't
> > > > a busy loop, but it won't progress.
> > >
> > > This assumes that the queues are unbounded.  They're not - they're limited
> > > to 128 requests, which is 60MB or so.
> >
> > Ahhh... so the limit on the requests-per-queue is an important part of
> > write-throttling behaviour.  I didn't know that, thanks.
> >
> > fs/nfs doesn't seem to impose a limit.  It will just allocate as many
> > as you ask for until you start running out of memory.  I've seen 60%
> > of memory (10 out of 16Gig) in writeback for NFS.
> >
> > Maybe I should look there to address my current issue, though imposing
> > a system-wide writeback limit seems safer.
>
> Exactly how would a request limit help? All that boils down to is having
> the VM monitor global_page_state(NR_FILE_DIRTY) versus monitoring
> global_page_state(NR_FILE_DIRTY)+global_page_state(NR_WRITEBACK).

I assume that if NFS is not limiting its NR_WRITEBACK consumption and block
devices are doing so, we could get in a situation where NFS hogs all of the
fixed-size NR_DIRTY+NR_WRITEBACK resource at the expense of concurrent
block-device-based writeback.

Perhaps.  The top-level poll-the-superblocks writeback loop might tend to
prevent that from happening.  But if applications were doing a lot of
superblock-specific writeback (fdatasync, sync_file_range(SYNC_FILE_RANGE_WRITE),
etc) then unfairness might occur.
* Re: RFC - how to balance Dirty+Writeback in the face of slow writeback.
  2006-08-17 15:30 ` Andrew Morton
@ 2006-08-17 16:18 ` Trond Myklebust
  2006-08-18  5:34 ` Andrew Morton
  0 siblings, 1 reply; 37+ messages in thread

From: Trond Myklebust @ 2006-08-17 16:18 UTC (permalink / raw)
To: Andrew Morton; +Cc: Neil Brown, linux-kernel

On Thu, 2006-08-17 at 08:30 -0700, Andrew Morton wrote:
> On Thu, 17 Aug 2006 09:21:51 -0400
> Trond Myklebust <trond.myklebust@fys.uio.no> wrote:
> > Exactly how would a request limit help? All that boils down to is having
> > the VM monitor global_page_state(NR_FILE_DIRTY) versus monitoring
> > global_page_state(NR_FILE_DIRTY)+global_page_state(NR_WRITEBACK).
>
> I assume that if NFS is not limiting its NR_WRITEBACK consumption and block
> devices are doing so, we could get in a situation where NFS hogs all of the
> fixed-size NR_DIRTY+NR_WRITEBACK resource at the expense of concurrent
> block-device-based writeback.

Since NFS has no control over NR_DIRTY, how does controlling
NR_WRITEBACK help? The only resource that NFS shares with the block
device writeout queues is memory.

IOW: The resource that needs to be controlled is the dirty pages, not
the write-out queue. Unless you can throttle back on the creation of
dirty NFS pages in the first place, then the potential for unfairness
will exist.

Trond
* Re: RFC - how to balance Dirty+Writeback in the face of slow writeback.
  2006-08-17 16:18 ` Trond Myklebust
@ 2006-08-18  5:34 ` Andrew Morton
  0 siblings, 0 replies; 37+ messages in thread

From: Andrew Morton @ 2006-08-18 5:34 UTC (permalink / raw)
To: Trond Myklebust; +Cc: Neil Brown, linux-kernel

On Thu, 17 Aug 2006 12:18:52 -0400
Trond Myklebust <trond.myklebust@fys.uio.no> wrote:

> On Thu, 2006-08-17 at 08:30 -0700, Andrew Morton wrote:
> > On Thu, 17 Aug 2006 09:21:51 -0400
> > Trond Myklebust <trond.myklebust@fys.uio.no> wrote:
> > > Exactly how would a request limit help? All that boils down to is having
> > > the VM monitor global_page_state(NR_FILE_DIRTY) versus monitoring
> > > global_page_state(NR_FILE_DIRTY)+global_page_state(NR_WRITEBACK).
> >
> > I assume that if NFS is not limiting its NR_WRITEBACK consumption and block
> > devices are doing so, we could get in a situation where NFS hogs all of the
> > fixed-size NR_DIRTY+NR_WRITEBACK resource at the expense of concurrent
> > block-device-based writeback.
>
> Since NFS has no control over NR_DIRTY, how does controlling
> NR_WRITEBACK help? The only resource that NFS shares with the block
> device writeout queues is memory.

Block devices have a limit on the amount of IO which they will queue.
NFS doesn't.

> IOW: The resource that needs to be controlled is the dirty pages, not
> the write-out queue. Unless you can throttle back on the creation of
> dirty NFS pages in the first place, then the potential for unfairness
> will exist.

Please read the whole thread - we're violently agreeing.
end of thread, other threads: [~2006-08-28  1:56 UTC | newest]

Thread overview: 37+ messages

2006-08-14 23:40 RFC - how to balance Dirty+Writeback in the face of slow writeback  Neil Brown
2006-08-15  8:06 ` Andrew Morton
2006-08-15 23:00 ` David Chinner
2006-08-17  4:08 ` Neil Brown
2006-08-17  6:14 ` Andrew Morton
2006-08-17 12:36 ` Trond Myklebust
2006-08-17 15:14 ` Andrew Morton
2006-08-17 16:22 ` Trond Myklebust
2006-08-18  5:49 ` Andrew Morton
2006-08-18 10:43 ` Nikita Danilov
2006-08-18  0:11 ` David Chinner
2006-08-18  6:29 ` Andrew Morton
2006-08-18  7:03 ` Jens Axboe
2006-08-18  7:11 ` Andrew Morton
2006-08-18 18:57 ` Andi Kleen
2006-08-21  0:35 ` Neil Brown
2006-08-21  3:15 ` David Chinner
2006-08-21  7:24 ` Neil Brown
2006-08-21 13:51 ` Jens Axboe
2006-08-25  4:36 ` Neil Brown
2006-08-25  6:37 ` Jens Axboe
2006-08-28  1:28 ` David Chinner
2006-08-25 13:16 ` Trond Myklebust
2006-08-27  8:21 ` Neil Brown
2006-08-21 14:28 ` David Chinner
2006-08-25  5:24 ` Neil Brown
2006-08-28  1:55 ` David Chinner
2006-08-21  7:47 ` Andi Kleen
2006-08-18  7:07 ` Neil Brown
2006-08-17 22:17 ` David Chinner
2006-08-17  3:59 ` Neil Brown
2006-08-17  6:22 ` Andrew Morton
2006-08-17  8:36 ` Jens Axboe
2006-08-17 13:21 ` Trond Myklebust
2006-08-17 15:30 ` Andrew Morton
2006-08-17 16:18 ` Trond Myklebust
2006-08-18  5:34 ` Andrew Morton