Re: [patch 1/3] fix illogical behavior in balance_dirty

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

* Re: [patch 1/3] fix illogical behavior in balance_dirty_pages()
@ 2007-04-03 18:40 Kris Corwin
  0 siblings, 0 replies; 18+ messages in thread
From: Kris Corwin @ 2007-04-03 18:40 UTC (permalink / raw)
  To: linux-kernel

I think I'm seeing the same VM behavior with NFS/ext3 that was described 
with fuse.

I have a shared storage unit with 2 servers.  The local filesystem is 
ext3 and I'm failing
the ext3 mounting and NFS serving from one node to another.  I'm running 
I/O on the NFS
mount point to the active server.   If I failover to the server that is 
also executing the I/O,
I can hang the machine.  Looking at the stack traces, the NFS writes are 
stuck in
balance_dirty_pages().  NFS start up is causing rpc.mountd to write to 
the local
ext3 filesystem, but it gets blocked in balance_dirty_pages() too.  
Deadlock.  Without
the NFS server successfully started, the kernel can't write the dirty pages.

I'm running a 2.6.9-42 RH kernel.  Early in my debugging, I had tried 
several kernel
versions.  All 2.6 kernels saw this behavior and the 2.4 kernel I tried 
did not.

Kris Corwin

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [patch 1/3] fix illogical behavior in balance_dirty_pages()
@ 2007-03-24 21:55 Miklos Szeredi
  2007-03-25 10:03 ` Peter Zijlstra
  2007-03-25 23:35 ` Andrew Morton
  0 siblings, 2 replies; 18+ messages in thread
From: Miklos Szeredi @ 2007-03-24 21:55 UTC (permalink / raw)
  To: akpm; +Cc: dgc, linux-kernel

This is a slightly different take on the fix for the deadlock in fuse
with dirty balancing.  David Chinner convinced me, that per-bdi
counters are too expensive, and that it's not worth trying to account
the number of pages under writeback, as they will be limited by the
queue anyway.

----
From: Miklos Szeredi <mszeredi@suse.cz>

Current behavior of balance_dirty_pages() is to try to start writeout
into the specified queue for at least "write_chunk" number of pages.
If "write_chunk" pages have been submitted, then return.

However if there are less than "write_chunk" dirty pages for this
queue, then it doesn't return, waiting for the global dirty+writeback
counters to subside, but without doing any actual work.

This is illogical behavior: it allows more dirtyings while there are
dirty pages, but stops further dirtying completely if there are no
more dirty pages.

It also makes a deadlock possible when one filesystem is writing data
through another, and the balance_dirty_pages() for the lower
filesystem is stalling the writeback for the upper filesystem's
data (*).

So the exit condition should instead be:

  submitted at least "write_chunk" number of pages
OR
    submitted ALL the dirty pages destined for this backing dev
  AND
    backing dev is not congested

To do this, introduce a new counter in writeback_control, which counts
the number of dirty pages encountered during writeback.  This includes
all dirty pages, even those which are already under writeback but have
been dirtied again, and those which have been skipped due to having
locked buffers.

If this counter is zero after trying to submit some pages for
writeback, and the backing dev is uncongested, then don't wait any
more.  After this, newly dirtied pages can quickly be written back to
this backing dev.

If there are globally no more pages to submit for writeback
(nr_reclaimable == 0), then also don't wait for ever, only while this
backing dev is congested.

(*) For more info on this deadlock, see the following discussions:

  http://lkml.org/lkml/2007/3/1/9
  http://lkml.org/lkml/2007/3/12/16

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
---

Index: linux/include/linux/writeback.h
===================================================================
--- linux.orig/include/linux/writeback.h	2007-03-24 22:06:56.000000000 +0100
+++ linux/include/linux/writeback.h	2007-03-24 22:29:02.000000000 +0100
@@ -44,6 +44,7 @@ struct writeback_control {
 	long nr_to_write;		/* Write this many pages, and decrement
 					   this for each page written */
 	long pages_skipped;		/* Pages which were not written */
+	long nr_dirty;			/* Number of dirty pages encountered */

 	/*
 	 * For a_ops->writepages(): is start or end are non-zero then this is
Index: linux/mm/page-writeback.c
===================================================================
--- linux.orig/mm/page-writeback.c	2007-03-24 22:06:56.000000000 +0100
+++ linux/mm/page-writeback.c	2007-03-24 22:29:02.000000000 +0100
@@ -207,7 +207,15 @@ static void balance_dirty_pages(struct a
 		 * written to the server's write cache, but has not yet
 		 * been flushed to permanent storage.
 		 */
-		if (nr_reclaimable) {
+		if (!nr_reclaimable) {
+			/*
+			 * If there's nothing more to write back and this queue
+			 * is uncongested,  then it is possible to quickly
+			 * write out some more data, so let's not wait
+			 */
+			if (!bdi_write_congested(bdi))
+				break;
+		} else {
 			writeback_inodes(&wbc);
 			get_dirty_limits(&background_thresh,
 					 	&dirty_thresh, mapping);
@@ -220,6 +228,14 @@ static void balance_dirty_pages(struct a
 			pages_written += write_chunk - wbc.nr_to_write;
 			if (pages_written >= write_chunk)
 				break;		/* We've done our duty */
+
+			/*
+			 * If there are no more dirty pages for this backing
+			 * backing dev, and the queue is not congested, then
+			 * it is possible to quickly write out some more data
+			 */
+			if (!wbc.nr_dirty && !bdi_write_congested(bdi))
+				break;
 		}
 		congestion_wait(WRITE, HZ/10);
 	}
@@ -619,6 +635,7 @@ retry:
 					      min(end - index, (pgoff_t)PAGEVEC_SIZE-1) + 1))) {
 		unsigned i;

+		wbc->nr_dirty += nr_pages;
 		scanned = 1;
 		for (i = 0; i < nr_pages; i++) {
 			struct page *page = pvec.pages[i];

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [patch 1/3] fix illogical behavior in balance_dirty_pages()
  2007-03-24 21:55 Miklos Szeredi
@ 2007-03-25 10:03 ` Peter Zijlstra
  2007-03-25 11:12   ` Miklos Szeredi
  2007-03-25 23:35 ` Andrew Morton
  1 sibling, 1 reply; 18+ messages in thread
From: Peter Zijlstra @ 2007-03-25 10:03 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: akpm, dgc, linux-kernel

On Sat, 2007-03-24 at 22:55 +0100, Miklos Szeredi wrote:
> This is a slightly different take on the fix for the deadlock in fuse
> with dirty balancing.  David Chinner convinced me, that per-bdi
> counters are too expensive, and that it's not worth trying to account
> the number of pages under writeback, as they will be limited by the
> queue anyway.
> 

Please have a look at this:
  http://lkml.org/lkml/2007/3/19/220



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [patch 1/3] fix illogical behavior in balance_dirty_pages()
  2007-03-25 10:03 ` Peter Zijlstra
@ 2007-03-25 11:12   ` Miklos Szeredi
  2007-03-25 11:34     ` Miklos Szeredi
  0 siblings, 1 reply; 18+ messages in thread
From: Miklos Szeredi @ 2007-03-25 11:12 UTC (permalink / raw)
  To: a.p.zijlstra; +Cc: akpm, dgc, linux-kernel

> On Sat, 2007-03-24 at 22:55 +0100, Miklos Szeredi wrote:
> > This is a slightly different take on the fix for the deadlock in fuse
> > with dirty balancing.  David Chinner convinced me, that per-bdi
> > counters are too expensive, and that it's not worth trying to account
> > the number of pages under writeback, as they will be limited by the
> > queue anyway.
> > 
> 
> Please have a look at this:
>   http://lkml.org/lkml/2007/3/19/220



> +			if (bdi_nr_reclaimable + bdi_stat(bdi, BDI_WRITEBACK) <=
> +			     	bdi_thresh)
> +				break;
> 

Yes, this will resolve the deadlock as well, where balance_dirty_pages()
is currently looping forever with:

  bdi_nr_reclaimable + bdi_stat(bdi, BDI_WRITEBACK) == 0

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [patch 1/3] fix illogical behavior in balance_dirty_pages()
  2007-03-25 11:12   ` Miklos Szeredi
@ 2007-03-25 11:34     ` Miklos Szeredi
  2007-03-25 11:50       ` Peter Zijlstra
  0 siblings, 1 reply; 18+ messages in thread
From: Miklos Szeredi @ 2007-03-25 11:34 UTC (permalink / raw)
  To: a.p.zijlstra; +Cc: akpm, dgc, linux-kernel

> > 
> > Please have a look at this:
> >   http://lkml.org/lkml/2007/3/19/220
> 
> 
> 
> > +			if (bdi_nr_reclaimable + bdi_stat(bdi, BDI_WRITEBACK) <=
> > +			     	bdi_thresh)
> > +				break;
> > 
> 
> Yes, this will resolve the deadlock as well, where balance_dirty_pages()
> is currently looping forever with:

Almost.

This

> -		if (nr_reclaimable) {
> +		if (bdi_nr_reclaimable) {
>  			writeback_inodes(&wbc);

still makes it loop forever if bdi_nr_reclaimable == 0, since the exit
condition is not checked. 

Shouldn't it break out of the loop if bdi_stat(bdi, BDI_WRITEBACK) <=
bdi_thresh in this case?

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [patch 1/3] fix illogical behavior in balance_dirty_pages()
  2007-03-25 11:34     ` Miklos Szeredi
@ 2007-03-25 11:50       ` Peter Zijlstra
  2007-03-25 20:41         ` Miklos Szeredi
  0 siblings, 1 reply; 18+ messages in thread
From: Peter Zijlstra @ 2007-03-25 11:50 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: akpm, dgc, linux-kernel

On Sun, 2007-03-25 at 13:34 +0200, Miklos Szeredi wrote:
> > > 
> > > Please have a look at this:
> > >   http://lkml.org/lkml/2007/3/19/220
> > 
> > 
> > 
> > > +			if (bdi_nr_reclaimable + bdi_stat(bdi, BDI_WRITEBACK) <=
> > > +			     	bdi_thresh)
> > > +				break;
> > > 
> > 
> > Yes, this will resolve the deadlock as well, where balance_dirty_pages()
> > is currently looping forever with:
> 
> Almost.
> 
> This
> 
> > -		if (nr_reclaimable) {
> > +		if (bdi_nr_reclaimable) {
> >  			writeback_inodes(&wbc);
> 
> still makes it loop forever if bdi_nr_reclaimable == 0, since the exit
> condition is not checked. 
> 
> Shouldn't it break out of the loop if bdi_stat(bdi, BDI_WRITEBACK) <=
> bdi_thresh in this case?

        for (;;) {
                struct writeback_control wbc = {
                        .bdi            = bdi,
                        .sync_mode      = WB_SYNC_NONE,
                        .older_than_this = NULL,
                        .nr_to_write    = write_chunk,
                        .range_cyclic   = 1,
                };

                get_dirty_limits(&background_thresh, &dirty_thresh,
                                &bdi_thresh, bdi);
                bdi_nr_reclaimable = bdi_stat(bdi, BDI_DIRTY) +
                                        bdi_stat(bdi, BDI_UNSTABLE);
(A)             if (bdi_nr_reclaimable + bdi_stat(bdi, BDI_WRITEBACK) <=
                        bdi_thresh)
                                break;

                /* Note: nr_reclaimable denotes nr_dirty + nr_unstable.
                 * Unstable writes are a feature of certain networked
                 * filesystems (i.e. NFS) in which data may have been
                 * written to the server's write cache, but has not yet
                 * been flushed to permanent storage.
                 */
(B)              if (bdi_nr_reclaimable) {
                        writeback_inodes(&wbc);

                        get_dirty_limits(&background_thresh, &dirty_thresh,
                                       &bdi_thresh, bdi);
                        bdi_nr_reclaimable = bdi_stat(bdi, BDI_DIRTY) +
                                                bdi_stat(bdi, BDI_UNSTABLE);
(C)                     if (bdi_nr_reclaimable + bdi_stat(bdi, BDI_WRITEBACK) <=
                                bdi_thresh)
                                break;

                        pages_written += write_chunk - wbc.nr_to_write;
                        if (pages_written >= write_chunk)
                                break;          /* We've done our duty */
                }
                congestion_wait(WRITE, HZ/10);
        }

I'm thinking that if bdi_nr_reclaimable == 0, A reduces to
bdi_stat(bdi, BDI_WRITEBACK) <= bdi_thresh and we're still out of the
loop, no?


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [patch 1/3] fix illogical behavior in balance_dirty_pages()
  2007-03-25 11:50       ` Peter Zijlstra
@ 2007-03-25 20:41         ` Miklos Szeredi
  0 siblings, 0 replies; 18+ messages in thread
From: Miklos Szeredi @ 2007-03-25 20:41 UTC (permalink / raw)
  To: a.p.zijlstra; +Cc: akpm, dgc, linux-kernel

>         for (;;) {
>                 struct writeback_control wbc = {
>                         .bdi            = bdi,
>                         .sync_mode      = WB_SYNC_NONE,
>                         .older_than_this = NULL,
>                         .nr_to_write    = write_chunk,
>                         .range_cyclic   = 1,
>                 };
> 
>                 get_dirty_limits(&background_thresh, &dirty_thresh,
>                                 &bdi_thresh, bdi);
>                 bdi_nr_reclaimable = bdi_stat(bdi, BDI_DIRTY) +
>                                         bdi_stat(bdi, BDI_UNSTABLE);
> (A)             if (bdi_nr_reclaimable + bdi_stat(bdi, BDI_WRITEBACK) <=
>                         bdi_thresh)
>                                 break;
> 
>                 /* Note: nr_reclaimable denotes nr_dirty + nr_unstable.
>                  * Unstable writes are a feature of certain networked
>                  * filesystems (i.e. NFS) in which data may have been
>                  * written to the server's write cache, but has not yet
>                  * been flushed to permanent storage.
>                  */
> (B)              if (bdi_nr_reclaimable) {
>                         writeback_inodes(&wbc);
> 
>                         get_dirty_limits(&background_thresh, &dirty_thresh,
>                                        &bdi_thresh, bdi);
>                         bdi_nr_reclaimable = bdi_stat(bdi, BDI_DIRTY) +
>                                                 bdi_stat(bdi, BDI_UNSTABLE);
> (C)                     if (bdi_nr_reclaimable + bdi_stat(bdi, BDI_WRITEBACK) <=
>                                 bdi_thresh)
>                                 break;
> 
>                         pages_written += write_chunk - wbc.nr_to_write;
>                         if (pages_written >= write_chunk)
>                                 break;          /* We've done our duty */
>                 }
>                 congestion_wait(WRITE, HZ/10);
>         }
> 
> I'm thinking that if bdi_nr_reclaimable == 0, A reduces to
> bdi_stat(bdi, BDI_WRITEBACK) <= bdi_thresh and we're still out of the
> loop, no?
> 

Yep, sorry.

I'll review this patch, there seem to be odd things, like calculating
dirty_thresh, then not ever using it...

Btw, would you mind measuring the performance of my patch with the
slow/fast device mix, to see how it compares to your patchset and the
vanila kernel?

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [patch 1/3] fix illogical behavior in balance_dirty_pages()
  2007-03-24 21:55 Miklos Szeredi
  2007-03-25 10:03 ` Peter Zijlstra
@ 2007-03-25 23:35 ` Andrew Morton
  2007-03-26  8:26   ` Miklos Szeredi
  1 sibling, 1 reply; 18+ messages in thread
From: Andrew Morton @ 2007-03-25 23:35 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: dgc, linux-kernel

On Sat, 24 Mar 2007 22:55:29 +0100 Miklos Szeredi <miklos@szeredi.hu> wrote:

> This is a slightly different take on the fix for the deadlock in fuse
> with dirty balancing.  David Chinner convinced me, that per-bdi
> counters are too expensive, and that it's not worth trying to account
> the number of pages under writeback, as they will be limited by the
> queue anyway.
> 
> ----
> From: Miklos Szeredi <mszeredi@suse.cz>
> 
> Current behavior of balance_dirty_pages() is to try to start writeout
> into the specified queue for at least "write_chunk" number of pages.
> If "write_chunk" pages have been submitted, then return.
> 
> However if there are less than "write_chunk" dirty pages for this
> queue, then it doesn't return, waiting for the global dirty+writeback
> counters to subside, but without doing any actual work.
> 
> This is illogical behavior: it allows more dirtyings while there are
> dirty pages, but stops further dirtying completely if there are no
> more dirty pages.

That behaviour is perfectly logical.  It prevents the number of
dirty+writeback pages from exceeding dirty_ratio.

> It also makes a deadlock possible when one filesystem is writing data
> through another, and the balance_dirty_pages() for the lower
> filesystem is stalling the writeback for the upper filesystem's
> data (*).

I still don't understand this one.  I got lost when belatedly told that
i_mutex had something to do with it.

> So the exit condition should instead be:
> 
>   submitted at least "write_chunk" number of pages
> OR
>     submitted ALL the dirty pages destined for this backing dev
>   AND
>     backing dev is not congested
> 
> To do this, introduce a new counter in writeback_control, which counts
> the number of dirty pages encountered during writeback.  This includes
> all dirty pages, even those which are already under writeback but have
> been dirtied again, and those which have been skipped due to having
> locked buffers.
> 
> If this counter is zero after trying to submit some pages for
> writeback, and the backing dev is uncongested, then don't wait any
> more.  After this, newly dirtied pages can quickly be written back to
> this backing dev.
> 
> If there are globally no more pages to submit for writeback
> (nr_reclaimable == 0), then also don't wait for ever, only while this
> backing dev is congested.
> 
> (*) For more info on this deadlock, see the following discussions:
> 
>   http://lkml.org/lkml/2007/3/1/9
>   http://lkml.org/lkml/2007/3/12/16
> 
> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
> ---
> 
> Index: linux/include/linux/writeback.h
> ===================================================================
> --- linux.orig/include/linux/writeback.h	2007-03-24 22:06:56.000000000 +0100
> +++ linux/include/linux/writeback.h	2007-03-24 22:29:02.000000000 +0100
> @@ -44,6 +44,7 @@ struct writeback_control {
>  	long nr_to_write;		/* Write this many pages, and decrement
>  					   this for each page written */
>  	long pages_skipped;		/* Pages which were not written */
> +	long nr_dirty;			/* Number of dirty pages encountered */
>  
>  	/*
>  	 * For a_ops->writepages(): is start or end are non-zero then this is
> Index: linux/mm/page-writeback.c
> ===================================================================
> --- linux.orig/mm/page-writeback.c	2007-03-24 22:06:56.000000000 +0100
> +++ linux/mm/page-writeback.c	2007-03-24 22:29:02.000000000 +0100
> @@ -207,7 +207,15 @@ static void balance_dirty_pages(struct a
>  		 * written to the server's write cache, but has not yet
>  		 * been flushed to permanent storage.
>  		 */
> -		if (nr_reclaimable) {
> +		if (!nr_reclaimable) {
> +			/*
> +			 * If there's nothing more to write back and this queue
> +			 * is uncongested,  then it is possible to quickly
> +			 * write out some more data, so let's not wait
> +			 */
> +			if (!bdi_write_congested(bdi))
> +				break;
> +		} else {

This says "if there are no dirty pages in the machine at all, then go back
and dirty some more pages, regardless of the present number of
under-writeback pages".

>  			writeback_inodes(&wbc);
>  			get_dirty_limits(&background_thresh,
>  					 	&dirty_thresh, mapping);
> @@ -220,6 +228,14 @@ static void balance_dirty_pages(struct a
>  			pages_written += write_chunk - wbc.nr_to_write;
>  			if (pages_written >= write_chunk)
>  				break;		/* We've done our duty */
> +
> +			/*
> +			 * If there are no more dirty pages for this backing
> +			 * backing dev, and the queue is not congested, then
> +			 * it is possible to quickly write out some more data
> +			 */
> +			if (!wbc.nr_dirty && !bdi_write_congested(bdi))
> +				break;

This says "if there are no pages dirty againt this device then go back and
dirty some more pages, regardless of the present number of under-writeback
pages".



IOW: this change will allow us to 100% fill all memory with under-writeback
pages.

The VM _should_ cope with that - it kinda used to.  But it is an untested
region of operation and the chances of bogus oom-killings are excellent.


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [patch 1/3] fix illogical behavior in balance_dirty_pages()
  2007-03-25 23:35 ` Andrew Morton
@ 2007-03-26  8:26   ` Miklos Szeredi
  2007-03-26  9:01     ` Andrew Morton
  0 siblings, 1 reply; 18+ messages in thread
From: Miklos Szeredi @ 2007-03-26  8:26 UTC (permalink / raw)
  To: akpm; +Cc: dgc, linux-kernel

> > This is a slightly different take on the fix for the deadlock in fuse
> > with dirty balancing.  David Chinner convinced me, that per-bdi
> > counters are too expensive, and that it's not worth trying to account
> > the number of pages under writeback, as they will be limited by the
> > queue anyway.
> > 
> > ----
> > From: Miklos Szeredi <mszeredi@suse.cz>
> > 
> > Current behavior of balance_dirty_pages() is to try to start writeout
> > into the specified queue for at least "write_chunk" number of pages.
> > If "write_chunk" pages have been submitted, then return.
> > 
> > However if there are less than "write_chunk" dirty pages for this
> > queue, then it doesn't return, waiting for the global dirty+writeback
> > counters to subside, but without doing any actual work.
> > 
> > This is illogical behavior: it allows more dirtyings while there are
> > dirty pages, but stops further dirtying completely if there are no
> > more dirty pages.
> 
> That behaviour is perfectly logical.  It prevents the number of
> dirty+writeback pages from exceeding dirty_ratio.

But it does it in an illogical way.  While it's cleaning your pages,
it allows more dirtyings, but when it has cleaned ALL the pages
destined for this queue, you hit a wall and stand still waiting for
other queues to get their act together even if this queue is
_completely idle_.

I think this accounts for 90% of the problems experienced with copying
to slow devices.

> > It also makes a deadlock possible when one filesystem is writing data
> > through another, and the balance_dirty_pages() for the lower
> > filesystem is stalling the writeback for the upper filesystem's
> > data (*).
> 
> I still don't understand this one.  I got lost when belatedly told that
> i_mutex had something to do with it.

This deadlock only happens, if there's some bottleneck for writing
data to the lower filesystem.   This bottleneck could be 

  - i_mutex, preventing parallel writes to the same inode
  - limited number of filesystem threads
  - limited request queue length in the upper filesystem

Imagine it this way: balance_dirty_pages() for the lower filesystem is
stalling a write() because dirty pages in the upper filesystem are
over the limit.  Because there's a bottleneck for writing to the lower
filesystem, this is stalling _other_ writes from completing.  So
there's no progress in writing back pages from the upper filesystem.

The problem is not that dirty pages are not being submitted for
writeback, rather that the writeback itself is stalling.

Hmm?

> > Index: linux/mm/page-writeback.c
> > ===================================================================
> > --- linux.orig/mm/page-writeback.c	2007-03-24 22:06:56.000000000 +0100
> > +++ linux/mm/page-writeback.c	2007-03-24 22:29:02.000000000 +0100
> > @@ -207,7 +207,15 @@ static void balance_dirty_pages(struct a
> >  		 * written to the server's write cache, but has not yet
> >  		 * been flushed to permanent storage.
> >  		 */
> > -		if (nr_reclaimable) {
> > +		if (!nr_reclaimable) {
> > +			/*
> > +			 * If there's nothing more to write back and this queue
> > +			 * is uncongested,  then it is possible to quickly
> > +			 * write out some more data, so let's not wait
> > +			 */
> > +			if (!bdi_write_congested(bdi))
> > +				break;
> > +		} else {
> 
> This says "if there are no dirty pages in the machine at all, then go back
> and dirty some more pages, regardless of the present number of
> under-writeback pages".

No, this says, that "there are no dirty pages in the machine at all,
and this queue is uncongested, so you can safely go back and dirty
some more data, because we are sure it can be written back quickly".

Ditto for the case, when there are no more dirty pages destined for
this queue.

I understand, that this can fill up the memory with under writeback
pages, but only if the data sitting in all the device queues is
comparable to the total memory.  I don't know what the realistic
chance of that is but David Chinner convinced me, that it's not
something that happens in real life.  Quoting him from an earlier
mail:

| Right, and most ppl don't have enough devices in their system for
| this to be a problem. Even those of us that do have enough devices
| for this to potentially be a problem usually have enough RAM in
| the machine so that it is not a problem....

Miklos

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [patch 1/3] fix illogical behavior in balance_dirty_pages()
  2007-03-26  8:26   ` Miklos Szeredi
@ 2007-03-26  9:01     ` Andrew Morton
  2007-03-26  9:20       ` Miklos Szeredi
                         ` (2 more replies)
  0 siblings, 3 replies; 18+ messages in thread
From: Andrew Morton @ 2007-03-26  9:01 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: dgc, linux-kernel

On Mon, 26 Mar 2007 10:26:18 +0200 Miklos Szeredi <miklos@szeredi.hu> wrote:

> > > This is a slightly different take on the fix for the deadlock in fuse
> > > with dirty balancing.  David Chinner convinced me, that per-bdi
> > > counters are too expensive, and that it's not worth trying to account
> > > the number of pages under writeback, as they will be limited by the
> > > queue anyway.
> > > 
> > > ----
> > > From: Miklos Szeredi <mszeredi@suse.cz>
> > > 
> > > Current behavior of balance_dirty_pages() is to try to start writeout
> > > into the specified queue for at least "write_chunk" number of pages.
> > > If "write_chunk" pages have been submitted, then return.
> > > 
> > > However if there are less than "write_chunk" dirty pages for this
> > > queue, then it doesn't return, waiting for the global dirty+writeback
> > > counters to subside, but without doing any actual work.
> > > 
> > > This is illogical behavior: it allows more dirtyings while there are
> > > dirty pages, but stops further dirtying completely if there are no
> > > more dirty pages.
> > 
> > That behaviour is perfectly logical.  It prevents the number of
> > dirty+writeback pages from exceeding dirty_ratio.
> 
> But it does it in an illogical way.  While it's cleaning your pages,
> it allows more dirtyings, but when it has cleaned ALL the pages
> destined for this queue, you hit a wall and stand still waiting for
> other queues to get their act together even if this queue is
> _completely idle_.

Memory is a machine-wide resource, not a per-queue resource.

If we have hit the dirty+writeback limit, it is logical to throttle
dirtiers until the levels subside.

> I think this accounts for 90% of the problems experienced with copying
> to slow devices.

It is related, but this observation doesn't actually strengthen your
argument.  If memory is full of dirty-or-writeback pages against a USB
drive, we don't want to go letting writers to other devices go and dirty
even more memory.  We instead need to clamp down harder on the writer to
the USB drive.

> > > It also makes a deadlock possible when one filesystem is writing data
> > > through another, and the balance_dirty_pages() for the lower
> > > filesystem is stalling the writeback for the upper filesystem's
> > > data (*).
> > 
> > I still don't understand this one.  I got lost when belatedly told that
> > i_mutex had something to do with it.
> 
> This deadlock only happens, if there's some bottleneck for writing
> data to the lower filesystem.   This bottleneck could be 
> 
>   - i_mutex, preventing parallel writes to the same inode
>   - limited number of filesystem threads
>   - limited request queue length in the upper filesystem
> 
> Imagine it this way: balance_dirty_pages() for the lower filesystem is
> stalling a write() because dirty pages in the upper filesystem are
> over the limit.  Because there's a bottleneck for writing to the lower
> filesystem, this is stalling _other_ writes from completing.  So
> there's no progress in writing back pages from the upper filesystem.

You mean that someone is stuck in balance_dirty_pages() against the lower
fs while holding locks which prevent writes into the upper fs from
succeeding?

Draw us a picture ;)

> The problem is not that dirty pages are not being submitted for
> writeback, rather that the writeback itself is stalling.
> 
> Hmm?
> 
> > > Index: linux/mm/page-writeback.c
> > > ===================================================================
> > > --- linux.orig/mm/page-writeback.c	2007-03-24 22:06:56.000000000 +0100
> > > +++ linux/mm/page-writeback.c	2007-03-24 22:29:02.000000000 +0100
> > > @@ -207,7 +207,15 @@ static void balance_dirty_pages(struct a
> > >  		 * written to the server's write cache, but has not yet
> > >  		 * been flushed to permanent storage.
> > >  		 */
> > > -		if (nr_reclaimable) {
> > > +		if (!nr_reclaimable) {
> > > +			/*
> > > +			 * If there's nothing more to write back and this queue
> > > +			 * is uncongested,  then it is possible to quickly
> > > +			 * write out some more data, so let's not wait
> > > +			 */
> > > +			if (!bdi_write_congested(bdi))
> > > +				break;
> > > +		} else {
> > 
> > This says "if there are no dirty pages in the machine at all, then go back
> > and dirty some more pages, regardless of the present number of
> > under-writeback pages".
> 
> No, this says, that "there are no dirty pages in the machine at all,
> and this queue is uncongested, so you can safely go back and dirty
> some more data, because we are sure it can be written back quickly".

The queue congestion thing is meaningless, and I ignored it.

> Ditto for the case, when there are no more dirty pages destined for
> this queue.
> 
> I understand, that this can fill up the memory with under writeback
> pages, but only if the data sitting in all the device queues is
> comparable to the total memory.  I don't know what the realistic
> chance of that is but David Chinner convinced me, that it's not
> something that happens in real life.  Quoting him from an earlier
> mail:
> 
> | Right, and most ppl don't have enough devices in their system for
> | this to be a problem. Even those of us that do have enough devices
> | for this to potentially be a problem usually have enough RAM in
> | the machine so that it is not a problem....
> 

David is overoptimistic ;)

Two scsi disks on a 256MB machine will put us over the edge, and that's
even before the user has gone and fiddled with the knobs.

Then there's NFS which for a long time had basically unbounded queue
lengths.  That's allegedly been fixed, but I haven't seen testing results
which convince me of that.

Then there's CIFS, SMBFS, v9fs, DM, MD, whatever.  Plenty of things in
there which can screw things up.

<ponders scsi_tgt_alloc_queue()>


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [patch 1/3] fix illogical behavior in balance_dirty_pages()
  2007-03-26  9:01     ` Andrew Morton
@ 2007-03-26  9:20       ` Miklos Szeredi
  2007-03-26  9:32         ` Andrew Morton
  2007-03-26  9:32       ` Miklos Szeredi
  2007-03-27  0:23       ` David Chinner
  2 siblings, 1 reply; 18+ messages in thread
From: Miklos Szeredi @ 2007-03-26  9:20 UTC (permalink / raw)
  To: akpm; +Cc: dgc, linux-kernel

> > > > It also makes a deadlock possible when one filesystem is writing data
> > > > through another, and the balance_dirty_pages() for the lower
> > > > filesystem is stalling the writeback for the upper filesystem's
> > > > data (*).
> > > 
> > > I still don't understand this one.  I got lost when belatedly told that
> > > i_mutex had something to do with it.
> > 
> > This deadlock only happens, if there's some bottleneck for writing
> > data to the lower filesystem.   This bottleneck could be 
> > 
> >   - i_mutex, preventing parallel writes to the same inode
> >   - limited number of filesystem threads
> >   - limited request queue length in the upper filesystem
> > 
> > Imagine it this way: balance_dirty_pages() for the lower filesystem is
> > stalling a write() because dirty pages in the upper filesystem are
> > over the limit.  Because there's a bottleneck for writing to the lower
> > filesystem, this is stalling _other_ writes from completing.  So
> > there's no progress in writing back pages from the upper filesystem.
> 
> You mean that someone is stuck in balance_dirty_pages() against the lower
> fs while holding locks which prevent writes into the upper fs from
> succeeding?
> 
> Draw us a picture ;)

Well, not a picture, but a sort of indented call trace:

  [some process, which has a fuse file writably mmaped]
  write fault on upper filesystem
    balance_dirty_pages
      loop...
        submit write requests

  ---------------------------------
  [fuse loopback fs thread 1]
  read request from /dev/fuse
  sys_write
    mutex_lock(i_mutex)
    ...
       copy data to page cache
       balance_dirty_pages
          loop ...
                submit write requests
                write requests completed ...
                dirty still over limit ...
           ... loop forever

  [fuse loopback fs thread 2]
  read request from /dev/fuse
  sys_write
    mute_lock(i_mutex) blocks


The lower filesystem (e.g. ext3) has completed the single write
request that was sent to it, and then it's just looping in
balance_dirty_pages.  The upper (fuse) filesystem has all the dirty
data (over the threshold), either still dirty or waiting in the
request queue as writeback.

Does this help?

Miklos

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [patch 1/3] fix illogical behavior in balance_dirty_pages()
  2007-03-26  9:20       ` Miklos Szeredi
@ 2007-03-26  9:32         ` Andrew Morton
  2007-03-26  9:48           ` Miklos Szeredi
  0 siblings, 1 reply; 18+ messages in thread
From: Andrew Morton @ 2007-03-26  9:32 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: dgc, linux-kernel

On Mon, 26 Mar 2007 11:20:11 +0200 Miklos Szeredi <miklos@szeredi.hu> wrote:

> > > > > It also makes a deadlock possible when one filesystem is writing data
> > > > > through another, and the balance_dirty_pages() for the lower
> > > > > filesystem is stalling the writeback for the upper filesystem's
> > > > > data (*).
> > > > 
> > > > I still don't understand this one.  I got lost when belatedly told that
> > > > i_mutex had something to do with it.
> > > 
> > > This deadlock only happens, if there's some bottleneck for writing
> > > data to the lower filesystem.   This bottleneck could be 
> > > 
> > >   - i_mutex, preventing parallel writes to the same inode
> > >   - limited number of filesystem threads
> > >   - limited request queue length in the upper filesystem
> > > 
> > > Imagine it this way: balance_dirty_pages() for the lower filesystem is
> > > stalling a write() because dirty pages in the upper filesystem are
> > > over the limit.  Because there's a bottleneck for writing to the lower
> > > filesystem, this is stalling _other_ writes from completing.  So
> > > there's no progress in writing back pages from the upper filesystem.
> > 
> > You mean that someone is stuck in balance_dirty_pages() against the lower
> > fs while holding locks which prevent writes into the upper fs from
> > succeeding?
> > 
> > Draw us a picture ;)
> 
> Well, not a picture, but a sort of indented call trace:
> 
>   [some process, which has a fuse file writably mmaped]
>   write fault on upper filesystem
>     balance_dirty_pages
>       loop...
>         submit write requests

This, I assume, is the upper fs

>   ---------------------------------
>   [fuse loopback fs thread 1]
>   read request from /dev/fuse
>   sys_write
>     mutex_lock(i_mutex)
>     ...
>        copy data to page cache
>        balance_dirty_pages
>           loop ...
>                 submit write requests
>                 write requests completed ...
>                 dirty still over limit ...
>            ... loop forever
> 
>   [fuse loopback fs thread 2]
>   read request from /dev/fuse
>   sys_write
>     mute_lock(i_mutex) blocks

And these, I assume, are handling what you term the lower fs.

> 
> The lower filesystem (e.g. ext3) has completed the single write
> request that was sent to it, and then it's just looping in
> balance_dirty_pages.  The upper (fuse) filesystem has all the dirty
> data (over the threshold), either still dirty or waiting in the
> request queue as writeback.
> 
> Does this help?

yup.

Interesting problem.  I don't suppose that it'd be appreiated if I were to
commend the use of O_DIRECT for handling the lower fs ;)

Let me think about that a bit, after I've made the latest shitpile people
have inflicted upon me begin to look like it has a chance of compiling.


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [patch 1/3] fix illogical behavior in balance_dirty_pages()
  2007-03-26  9:32         ` Andrew Morton
@ 2007-03-26  9:48           ` Miklos Szeredi
  0 siblings, 0 replies; 18+ messages in thread
From: Miklos Szeredi @ 2007-03-26  9:48 UTC (permalink / raw)
  To: akpm; +Cc: miklos, dgc, linux-kernel

> > Well, not a picture, but a sort of indented call trace:
> > 
> >   [some process, which has a fuse file writably mmaped]
> >   write fault on upper filesystem
> >     balance_dirty_pages
> >       loop...
> >         submit write requests
> 
> This, I assume, is the upper fs
> 
> >   ---------------------------------
> >   [fuse loopback fs thread 1]
> >   read request from /dev/fuse
> >   sys_write
> >     mutex_lock(i_mutex)
> >     ...
> >        copy data to page cache
> >        balance_dirty_pages
> >           loop ...
> >                 submit write requests
> >                 write requests completed ...
> >                 dirty still over limit ...
> >            ... loop forever
> > 
> >   [fuse loopback fs thread 2]
> >   read request from /dev/fuse
> >   sys_write
> >     mute_lock(i_mutex) blocks
> 
> And these, I assume, are handling what you term the lower fs.

All of this is actually the upper, fuse filesystem (maybe except the
filemap/page-writeback parts, which could be argued to belong to the
lower, ext3 filesystem).

Above the line is the kernel part, below the line is the userspace
part.

> > The lower filesystem (e.g. ext3) has completed the single write
> > request that was sent to it, and then it's just looping in
> > balance_dirty_pages.  The upper (fuse) filesystem has all the dirty
> > data (over the threshold), either still dirty or waiting in the
> > request queue as writeback.
> > 
> > Does this help?
> 
> yup.
> 
> Interesting problem.  I don't suppose that it'd be appreiated if I were to
> commend the use of O_DIRECT for handling the lower fs ;)

Not really ;)  One of the goals of fuse is to not restrict the
filesystem daemon in any way.  There are zillions of ways to work
around this deadlock, but I don't even want to make it even
theoretically possible.

> Let me think about that a bit, after I've made the latest shitpile people
> have inflicted upon me begin to look like it has a chance of compiling.

OK.

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [patch 1/3] fix illogical behavior in balance_dirty_pages()
  2007-03-26  9:01     ` Andrew Morton
  2007-03-26  9:20       ` Miklos Szeredi
@ 2007-03-26  9:32       ` Miklos Szeredi
  2007-03-26 10:08         ` Andrew Morton
  2007-03-27  0:23       ` David Chinner
  2 siblings, 1 reply; 18+ messages in thread
From: Miklos Szeredi @ 2007-03-26  9:32 UTC (permalink / raw)
  To: akpm; +Cc: dgc, linux-kernel

> > > > This is a slightly different take on the fix for the deadlock in fuse
> > > > with dirty balancing.  David Chinner convinced me, that per-bdi
> > > > counters are too expensive, and that it's not worth trying to account
> > > > the number of pages under writeback, as they will be limited by the
> > > > queue anyway.
> > > > 
> > > > ----
> > > > From: Miklos Szeredi <mszeredi@suse.cz>
> > > > 
> > > > Current behavior of balance_dirty_pages() is to try to start writeout
> > > > into the specified queue for at least "write_chunk" number of pages.
> > > > If "write_chunk" pages have been submitted, then return.
> > > > 
> > > > However if there are less than "write_chunk" dirty pages for this
> > > > queue, then it doesn't return, waiting for the global dirty+writeback
> > > > counters to subside, but without doing any actual work.
> > > > 
> > > > This is illogical behavior: it allows more dirtyings while there are
> > > > dirty pages, but stops further dirtying completely if there are no
> > > > more dirty pages.
> > > 
> > > That behaviour is perfectly logical.  It prevents the number of
> > > dirty+writeback pages from exceeding dirty_ratio.
> > 
> > But it does it in an illogical way.  While it's cleaning your pages,
> > it allows more dirtyings, but when it has cleaned ALL the pages
> > destined for this queue, you hit a wall and stand still waiting for
> > other queues to get their act together even if this queue is
> > _completely idle_.
> 
> Memory is a machine-wide resource, not a per-queue resource.
> 
> If we have hit the dirty+writeback limit, it is logical to throttle
> dirtiers until the levels subside.

It is logical to throttle, but it's not logical to stall.

> > I think this accounts for 90% of the problems experienced with copying
> > to slow devices.
> 
> It is related, but this observation doesn't actually strengthen your
> argument.  If memory is full of dirty-or-writeback pages against a USB
> drive, we don't want to go letting writers to other devices go and dirty
> even more memory.  We instead need to clamp down harder on the writer to
> the USB drive.

Yes, if you can notice that the USB drive is really slow, _before_ the
writer fills all the allotted memory, then that's good.  I don't know
how well Peter's patch does that, but I'll have a look.

But even if you don't want to bother with limiting per-bdi dirty
numbers based on device speed, it is more logical to allow unrelated
queues to function at a steady state, than to stop them completely.

Stopping writers which have idle queues is completely unproductive,
and that is basically what the current algorithm does.

Miklos

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [patch 1/3] fix illogical behavior in balance_dirty_pages()
  2007-03-26  9:32       ` Miklos Szeredi
@ 2007-03-26 10:08         ` Andrew Morton
  2007-03-26 13:30           ` Peter Zijlstra
  0 siblings, 1 reply; 18+ messages in thread
From: Andrew Morton @ 2007-03-26 10:08 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: dgc, linux-kernel

On Mon, 26 Mar 2007 11:32:47 +0200 Miklos Szeredi <miklos@szeredi.hu> wrote:

> Stopping writers which have idle queues is completely unproductive,
> and that is basically what the current algorithm does.

This is because the kernel permits all of its allotment of dirty+writeback
pages to be dirty+writeback against a single device.

A good way of solving the one-device-starves-another-one problem is to
dynamically adjust the per-device dirty+writeback levels so that (for
example) if two devices are being written to, each gets 50% of the
allotment.

I started working on that but got derailed by the usual blah.  I don't
think either of the proposed fixes took that approach, actually.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [patch 1/3] fix illogical behavior in balance_dirty_pages()
  2007-03-26 10:08         ` Andrew Morton
@ 2007-03-26 13:30           ` Peter Zijlstra
  2007-03-27  0:30             ` David Chinner
  0 siblings, 1 reply; 18+ messages in thread
From: Peter Zijlstra @ 2007-03-26 13:30 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Miklos Szeredi, dgc, linux-kernel

On Mon, 2007-03-26 at 02:08 -0800, Andrew Morton wrote:
> On Mon, 26 Mar 2007 11:32:47 +0200 Miklos Szeredi <miklos@szeredi.hu> wrote:
> 
> > Stopping writers which have idle queues is completely unproductive,
> > and that is basically what the current algorithm does.
> 
> This is because the kernel permits all of its allotment of dirty+writeback
> pages to be dirty+writeback against a single device.
> 
> A good way of solving the one-device-starves-another-one problem is to
> dynamically adjust the per-device dirty+writeback levels so that (for
> example) if two devices are being written to, each gets 50% of the
> allotment.

This is exactly what happens with my patch if both devices write at the
same speed. (Or at least, that is what is supposed to happen ;-)


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [patch 1/3] fix illogical behavior in balance_dirty_pages()
  2007-03-26 13:30           ` Peter Zijlstra
@ 2007-03-27  0:30             ` David Chinner
  0 siblings, 0 replies; 18+ messages in thread
From: David Chinner @ 2007-03-27  0:30 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Andrew Morton, Miklos Szeredi, dgc, linux-kernel

On Mon, Mar 26, 2007 at 03:30:27PM +0200, Peter Zijlstra wrote:
> On Mon, 2007-03-26 at 02:08 -0800, Andrew Morton wrote:
> > On Mon, 26 Mar 2007 11:32:47 +0200 Miklos Szeredi <miklos@szeredi.hu> wrote:
> > 
> > > Stopping writers which have idle queues is completely unproductive,
> > > and that is basically what the current algorithm does.
> > 
> > This is because the kernel permits all of its allotment of dirty+writeback
> > pages to be dirty+writeback against a single device.
> > 
> > A good way of solving the one-device-starves-another-one problem is to
> > dynamically adjust the per-device dirty+writeback levels so that (for
> > example) if two devices are being written to, each gets 50% of the
> > allotment.
> 
> This is exactly what happens with my patch if both devices write at the
> same speed. (Or at least, that is what is supposed to happen ;-)

The testing that I did of Peter's patch showed that this cache
splitting works as advertised for multiple devices writing at
the same speed.

(http://marc.info/?l=linux-kernel&m=117437686328396&w=2)

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [patch 1/3] fix illogical behavior in balance_dirty_pages()
  2007-03-26  9:01     ` Andrew Morton
  2007-03-26  9:20       ` Miklos Szeredi
  2007-03-26  9:32       ` Miklos Szeredi
@ 2007-03-27  0:23       ` David Chinner
  2 siblings, 0 replies; 18+ messages in thread
From: David Chinner @ 2007-03-27  0:23 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Miklos Szeredi, dgc, linux-kernel

On Mon, Mar 26, 2007 at 01:01:24AM -0800, Andrew Morton wrote:
> On Mon, 26 Mar 2007 10:26:18 +0200 Miklos Szeredi <miklos@szeredi.hu> wrote:
> > Ditto for the case, when there are no more dirty pages destined for
> > this queue.
> > 
> > I understand, that this can fill up the memory with under writeback
> > pages, but only if the data sitting in all the device queues is
> > comparable to the total memory.  I don't know what the realistic
> > chance of that is but David Chinner convinced me, that it's not
> > something that happens in real life.  Quoting him from an earlier
> > mail:
> > 
> > | Right, and most ppl don't have enough devices in their system for
> > | this to be a problem. Even those of us that do have enough devices
> > | for this to potentially be a problem usually have enough RAM in
> > | the machine so that it is not a problem....
> > 
> 
> David is overoptimistic ;)

My excuse is that I have trouble thinking in units of memory smaller
than a GB. ;)

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2007-04-03 18:52 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2007-04-03 18:40 [patch 1/3] fix illogical behavior in balance_dirty_pages() Kris Corwin
  -- strict thread matches above, loose matches on Subject: below --
2007-03-24 21:55 Miklos Szeredi
2007-03-25 10:03 ` Peter Zijlstra
2007-03-25 11:12   ` Miklos Szeredi
2007-03-25 11:34     ` Miklos Szeredi
2007-03-25 11:50       ` Peter Zijlstra
2007-03-25 20:41         ` Miklos Szeredi
2007-03-25 23:35 ` Andrew Morton
2007-03-26  8:26   ` Miklos Szeredi
2007-03-26  9:01     ` Andrew Morton
2007-03-26  9:20       ` Miklos Szeredi
2007-03-26  9:32         ` Andrew Morton
2007-03-26  9:48           ` Miklos Szeredi
2007-03-26  9:32       ` Miklos Szeredi
2007-03-26 10:08         ` Andrew Morton
2007-03-26 13:30           ` Peter Zijlstra
2007-03-27  0:30             ` David Chinner
2007-03-27  0:23       ` David Chinner

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox