public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
* Re: [patch 1/3] fix illogical behavior in balance_dirty_pages()
@ 2007-04-03 18:40 Kris Corwin
  0 siblings, 0 replies; 18+ messages in thread
From: Kris Corwin @ 2007-04-03 18:40 UTC (permalink / raw)
  To: linux-kernel

I think I'm seeing the same VM behavior with NFS/ext3 that was described 
with fuse.

I have a shared storage unit with 2 servers.  The local filesystem is 
ext3 and I'm failing
the ext3 mounting and NFS serving from one node to another.  I'm running 
I/O on the NFS
mount point to the active server.   If I failover to the server that is 
also executing the I/O,
I can hang the machine.  Looking at the stack traces, the NFS writes are 
stuck in
balance_dirty_pages().  NFS start up is causing rpc.mountd to write to 
the local
ext3 filesystem, but it gets blocked in balance_dirty_pages() too.  
Deadlock.  Without
the NFS server successfully started, the kernel can't write the dirty pages.

I'm running a 2.6.9-42 RH kernel.  Early in my debugging, I had tried 
several kernel
versions.  All 2.6 kernels saw this behavior and the 2.4 kernel I tried 
did not.

Kris Corwin

^ permalink raw reply	[flat|nested] 18+ messages in thread
* [patch 1/3] fix illogical behavior in balance_dirty_pages()
@ 2007-03-24 21:55 Miklos Szeredi
  2007-03-25 10:03 ` Peter Zijlstra
  2007-03-25 23:35 ` Andrew Morton
  0 siblings, 2 replies; 18+ messages in thread
From: Miklos Szeredi @ 2007-03-24 21:55 UTC (permalink / raw)
  To: akpm; +Cc: dgc, linux-kernel

This is a slightly different take on the fix for the deadlock in fuse
with dirty balancing.  David Chinner convinced me, that per-bdi
counters are too expensive, and that it's not worth trying to account
the number of pages under writeback, as they will be limited by the
queue anyway.

----
From: Miklos Szeredi <mszeredi@suse.cz>

Current behavior of balance_dirty_pages() is to try to start writeout
into the specified queue for at least "write_chunk" number of pages.
If "write_chunk" pages have been submitted, then return.

However if there are less than "write_chunk" dirty pages for this
queue, then it doesn't return, waiting for the global dirty+writeback
counters to subside, but without doing any actual work.

This is illogical behavior: it allows more dirtyings while there are
dirty pages, but stops further dirtying completely if there are no
more dirty pages.

It also makes a deadlock possible when one filesystem is writing data
through another, and the balance_dirty_pages() for the lower
filesystem is stalling the writeback for the upper filesystem's
data (*).

So the exit condition should instead be:

  submitted at least "write_chunk" number of pages
OR
    submitted ALL the dirty pages destined for this backing dev
  AND
    backing dev is not congested

To do this, introduce a new counter in writeback_control, which counts
the number of dirty pages encountered during writeback.  This includes
all dirty pages, even those which are already under writeback but have
been dirtied again, and those which have been skipped due to having
locked buffers.

If this counter is zero after trying to submit some pages for
writeback, and the backing dev is uncongested, then don't wait any
more.  After this, newly dirtied pages can quickly be written back to
this backing dev.

If there are globally no more pages to submit for writeback
(nr_reclaimable == 0), then also don't wait for ever, only while this
backing dev is congested.

(*) For more info on this deadlock, see the following discussions:

  http://lkml.org/lkml/2007/3/1/9
  http://lkml.org/lkml/2007/3/12/16

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
---

Index: linux/include/linux/writeback.h
===================================================================
--- linux.orig/include/linux/writeback.h	2007-03-24 22:06:56.000000000 +0100
+++ linux/include/linux/writeback.h	2007-03-24 22:29:02.000000000 +0100
@@ -44,6 +44,7 @@ struct writeback_control {
 	long nr_to_write;		/* Write this many pages, and decrement
 					   this for each page written */
 	long pages_skipped;		/* Pages which were not written */
+	long nr_dirty;			/* Number of dirty pages encountered */
 
 	/*
 	 * For a_ops->writepages(): is start or end are non-zero then this is
Index: linux/mm/page-writeback.c
===================================================================
--- linux.orig/mm/page-writeback.c	2007-03-24 22:06:56.000000000 +0100
+++ linux/mm/page-writeback.c	2007-03-24 22:29:02.000000000 +0100
@@ -207,7 +207,15 @@ static void balance_dirty_pages(struct a
 		 * written to the server's write cache, but has not yet
 		 * been flushed to permanent storage.
 		 */
-		if (nr_reclaimable) {
+		if (!nr_reclaimable) {
+			/*
+			 * If there's nothing more to write back and this queue
+			 * is uncongested,  then it is possible to quickly
+			 * write out some more data, so let's not wait
+			 */
+			if (!bdi_write_congested(bdi))
+				break;
+		} else {
 			writeback_inodes(&wbc);
 			get_dirty_limits(&background_thresh,
 					 	&dirty_thresh, mapping);
@@ -220,6 +228,14 @@ static void balance_dirty_pages(struct a
 			pages_written += write_chunk - wbc.nr_to_write;
 			if (pages_written >= write_chunk)
 				break;		/* We've done our duty */
+
+			/*
+			 * If there are no more dirty pages for this backing
+			 * backing dev, and the queue is not congested, then
+			 * it is possible to quickly write out some more data
+			 */
+			if (!wbc.nr_dirty && !bdi_write_congested(bdi))
+				break;
 		}
 		congestion_wait(WRITE, HZ/10);
 	}
@@ -619,6 +635,7 @@ retry:
 					      min(end - index, (pgoff_t)PAGEVEC_SIZE-1) + 1))) {
 		unsigned i;
 
+		wbc->nr_dirty += nr_pages;
 		scanned = 1;
 		for (i = 0; i < nr_pages; i++) {
 			struct page *page = pvec.pages[i];

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2007-04-03 18:52 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2007-04-03 18:40 [patch 1/3] fix illogical behavior in balance_dirty_pages() Kris Corwin
  -- strict thread matches above, loose matches on Subject: below --
2007-03-24 21:55 Miklos Szeredi
2007-03-25 10:03 ` Peter Zijlstra
2007-03-25 11:12   ` Miklos Szeredi
2007-03-25 11:34     ` Miklos Szeredi
2007-03-25 11:50       ` Peter Zijlstra
2007-03-25 20:41         ` Miklos Szeredi
2007-03-25 23:35 ` Andrew Morton
2007-03-26  8:26   ` Miklos Szeredi
2007-03-26  9:01     ` Andrew Morton
2007-03-26  9:20       ` Miklos Szeredi
2007-03-26  9:32         ` Andrew Morton
2007-03-26  9:48           ` Miklos Szeredi
2007-03-26  9:32       ` Miklos Szeredi
2007-03-26 10:08         ` Andrew Morton
2007-03-26 13:30           ` Peter Zijlstra
2007-03-27  0:30             ` David Chinner
2007-03-27  0:23       ` David Chinner

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox