[PATCH] mm: Avoid livelocking of WB_SYNC

public inbox for linux-fsdevel@vger.kernel.org
 help / color / mirror / Atom feed

* [PATCH] mm: Avoid livelocking of WB_SYNC_ALL writeback
@ 2010-11-05 21:26 Jan Kara
  2010-11-05 22:30 ` Christoph Hellwig
                   ` (2 more replies)
  0 siblings, 3 replies; 12+ messages in thread
From: Jan Kara @ 2010-11-05 21:26 UTC (permalink / raw)
  To: Wu Fengguang; +Cc: Christoph Hellwig, linux-fsdevel, Andrew Morton, Jan Kara

When wb_writeback() is called in WB_SYNC_ALL mode, work->nr_to_write is usually
set to LONG_MAX. The logic in wb_writeback() then calls __writeback_inodes_sb()
with nr_to_write == MAX_WRITEBACK_PAGES and thus we easily end up with negative
nr_to_write after the function returns. wb_writeback() then decides we need
another round of writeback but this is wrong in some cases! For example when
a single large file is continuously dirtied, we would never finish syncing
it because each pass would be able to write MAX_WRITEBACK_PAGES and inode dirty
timestamp never gets updated (as inode is never completely clean).

Fix the issue by setting nr_to_write to LONG_MAX in WB_SYNC_ALL mode. We do not
need nr_to_write in WB_SYNC_ALL mode anyway since livelock avoidance is done
differently for it.

After this patch, program from http://lkml.org/lkml/2010/10/24/154 is no longer
able to stall sync forever.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/fs-writeback.c |   18 ++++++++++++++----
 1 files changed, 14 insertions(+), 4 deletions(-)

  Fengguang, I've been testing with those writeback fixes you reposted
a few days ago and I've been able to still reproduce livelocks with
Jan Engelhard's test case. Using writeback tracing I've tracked the
problem to the above and with this patch, sync finishes OK (well, it still
takes about 15 minutes but that's about expected time given the throughput
I see to the disk - the test case randomly dirties pages in a huge file).
So could you please add this patch to the previous two send them to Jens
for inclusion?

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 6b4d02a..d5873a6 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -629,6 +629,7 @@ static long wb_writeback(struct bdi_writeback *wb,
 	};
 	unsigned long oldest_jif;
 	long wrote = 0;
+	long write_chunk;
 	struct inode *inode;

 	if (wbc.for_kupdate) {
@@ -640,6 +641,15 @@ static long wb_writeback(struct bdi_writeback *wb,
 		wbc.range_start = 0;
 		wbc.range_end = LLONG_MAX;
 	}
+	/*
+	 * In WB_SYNC_ALL mode, we just want to ignore nr_to_write as
+	 * we need to write everything and livelock avoidance is implemented
+	 * differently.
+	 */
+	if (wbc.sync_mode == WB_SYNC_NONE)
+		write_chunk = MAX_WRITEBACK_PAGES;
+	else
+		write_chunk = LONG_MAX;

 	wbc.wb_start = jiffies; /* livelock avoidance */
 	for (;;) {
@@ -665,7 +675,7 @@ static long wb_writeback(struct bdi_writeback *wb,
 			break;

 		wbc.more_io = 0;
-		wbc.nr_to_write = MAX_WRITEBACK_PAGES;
+		wbc.nr_to_write = write_chunk;
 		wbc.pages_skipped = 0;

 		trace_wbc_writeback_start(&wbc, wb->bdi);
@@ -675,8 +685,8 @@ static long wb_writeback(struct bdi_writeback *wb,
 			writeback_inodes_wb(wb, &wbc);
 		trace_wbc_writeback_written(&wbc, wb->bdi);

-		work->nr_pages -= MAX_WRITEBACK_PAGES - wbc.nr_to_write;
-		wrote += MAX_WRITEBACK_PAGES - wbc.nr_to_write;
+		work->nr_pages -= write_chunk - wbc.nr_to_write;
+		wrote += write_chunk - wbc.nr_to_write;

 		/*
 		 * If we consumed everything, see if we have more
@@ -691,7 +701,7 @@ static long wb_writeback(struct bdi_writeback *wb,
 		/*
 		 * Did we write something? Try for more
 		 */
-		if (wbc.nr_to_write < MAX_WRITEBACK_PAGES)
+		if (wbc.nr_to_write < write_chunk)
 			continue;
 		/*
 		 * Nothing written. Wait for some inode to
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH] mm: Avoid livelocking of WB_SYNC_ALL writeback
  2010-11-05 21:26 [PATCH] mm: Avoid livelocking of WB_SYNC_ALL writeback Jan Kara
@ 2010-11-05 22:30 ` Christoph Hellwig
  2010-11-06  2:54   ` Wu Fengguang
  2010-11-06  2:55   ` Wu Fengguang
  2010-11-06  1:36 ` Johannes Weiner
  2010-11-06  4:12 ` Wu Fengguang
  2 siblings, 2 replies; 12+ messages in thread
From: Christoph Hellwig @ 2010-11-05 22:30 UTC (permalink / raw)
  To: Jan Kara; +Cc: Wu Fengguang, Christoph Hellwig, linux-fsdevel, Andrew Morton

> +	/*
> +	 * In WB_SYNC_ALL mode, we just want to ignore nr_to_write as
> +	 * we need to write everything and livelock avoidance is implemented
> +	 * differently.
> +	 */

I think it would be useful to elaborate here on how livelock avoidance
is supposed to work.


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] mm: Avoid livelocking of WB_SYNC_ALL writeback
  2010-11-05 21:26 [PATCH] mm: Avoid livelocking of WB_SYNC_ALL writeback Jan Kara
  2010-11-05 22:30 ` Christoph Hellwig
@ 2010-11-06  1:36 ` Johannes Weiner
  2010-11-07 13:07   ` Jan Kara
  2010-11-06  4:12 ` Wu Fengguang
  2 siblings, 1 reply; 12+ messages in thread
From: Johannes Weiner @ 2010-11-06  1:36 UTC (permalink / raw)
  To: Jan Kara
  Cc: Wu Fengguang, Christoph Hellwig, linux-fsdevel, Andrew Morton,
	linux-mm

Can you please keep linux-mm@kvack.org in the loop on writeback stuff?

I Cc'd it now, here is the full quote:

On Fri, Nov 05, 2010 at 10:26:23PM +0100, Jan Kara wrote:
> When wb_writeback() is called in WB_SYNC_ALL mode, work->nr_to_write is usually
> set to LONG_MAX. The logic in wb_writeback() then calls __writeback_inodes_sb()
> with nr_to_write == MAX_WRITEBACK_PAGES and thus we easily end up with negative
> nr_to_write after the function returns. wb_writeback() then decides we need
> another round of writeback but this is wrong in some cases! For example when
> a single large file is continuously dirtied, we would never finish syncing
> it because each pass would be able to write MAX_WRITEBACK_PAGES and inode dirty
> timestamp never gets updated (as inode is never completely clean).
> 
> Fix the issue by setting nr_to_write to LONG_MAX in WB_SYNC_ALL mode. We do not
> need nr_to_write in WB_SYNC_ALL mode anyway since livelock avoidance is done
> differently for it.
> 
> After this patch, program from http://lkml.org/lkml/2010/10/24/154 is no longer
> able to stall sync forever.
> 
> Signed-off-by: Jan Kara <jack@suse.cz>
> ---
>  fs/fs-writeback.c |   18 ++++++++++++++----
>  1 files changed, 14 insertions(+), 4 deletions(-)
> 
>   Fengguang, I've been testing with those writeback fixes you reposted
> a few days ago and I've been able to still reproduce livelocks with
> Jan Engelhard's test case. Using writeback tracing I've tracked the
> problem to the above and with this patch, sync finishes OK (well, it still
> takes about 15 minutes but that's about expected time given the throughput
> I see to the disk - the test case randomly dirties pages in a huge file).
> So could you please add this patch to the previous two send them to Jens
> for inclusion?
> 
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index 6b4d02a..d5873a6 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -629,6 +629,7 @@ static long wb_writeback(struct bdi_writeback *wb,
>  	};
>  	unsigned long oldest_jif;
>  	long wrote = 0;
> +	long write_chunk;
>  	struct inode *inode;
>  
>  	if (wbc.for_kupdate) {
> @@ -640,6 +641,15 @@ static long wb_writeback(struct bdi_writeback *wb,
>  		wbc.range_start = 0;
>  		wbc.range_end = LLONG_MAX;
>  	}
> +	/*
> +	 * In WB_SYNC_ALL mode, we just want to ignore nr_to_write as
> +	 * we need to write everything and livelock avoidance is implemented
> +	 * differently.
> +	 */
> +	if (wbc.sync_mode == WB_SYNC_NONE)
> +		write_chunk = MAX_WRITEBACK_PAGES;
> +	else
> +		write_chunk = LONG_MAX;
>  
>  	wbc.wb_start = jiffies; /* livelock avoidance */
>  	for (;;) {
> @@ -665,7 +675,7 @@ static long wb_writeback(struct bdi_writeback *wb,
>  			break;
>  
>  		wbc.more_io = 0;
> -		wbc.nr_to_write = MAX_WRITEBACK_PAGES;
> +		wbc.nr_to_write = write_chunk;
>  		wbc.pages_skipped = 0;
>  
>  		trace_wbc_writeback_start(&wbc, wb->bdi);
> @@ -675,8 +685,8 @@ static long wb_writeback(struct bdi_writeback *wb,
>  			writeback_inodes_wb(wb, &wbc);
>  		trace_wbc_writeback_written(&wbc, wb->bdi);
>  
> -		work->nr_pages -= MAX_WRITEBACK_PAGES - wbc.nr_to_write;
> -		wrote += MAX_WRITEBACK_PAGES - wbc.nr_to_write;
> +		work->nr_pages -= write_chunk - wbc.nr_to_write;
> +		wrote += write_chunk - wbc.nr_to_write;
>  
>  		/*
>  		 * If we consumed everything, see if we have more
> @@ -691,7 +701,7 @@ static long wb_writeback(struct bdi_writeback *wb,
>  		/*
>  		 * Did we write something? Try for more
>  		 */
> -		if (wbc.nr_to_write < MAX_WRITEBACK_PAGES)
> +		if (wbc.nr_to_write < write_chunk)
>  			continue;
>  		/*
>  		 * Nothing written. Wait for some inode to
> -- 
> 1.7.1
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] mm: Avoid livelocking of WB_SYNC_ALL writeback
  2010-11-05 22:30 ` Christoph Hellwig
@ 2010-11-06  2:54   ` Wu Fengguang
  2010-11-06 11:45     ` Christoph Hellwig
  2010-11-06  2:55   ` Wu Fengguang
  1 sibling, 1 reply; 12+ messages in thread
From: Wu Fengguang @ 2010-11-06  2:54 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Jan Kara, linux-fsdevel@vger.kernel.org, Andrew Morton

On Sat, Nov 06, 2010 at 06:30:38AM +0800, Christoph Hellwig wrote:
> > +	/*
> > +	 * In WB_SYNC_ALL mode, we just want to ignore nr_to_write as
> > +	 * we need to write everything and livelock avoidance is implemented
> > +	 * differently.
> > +	 */
> > +       if (wbc.sync_mode == WB_SYNC_NONE)
> > +               write_chunk = MAX_WRITEBACK_PAGES;
> > +       else
> > +               write_chunk = LONG_MAX;

Good catch!

> 
> I think it would be useful to elaborate here on how livelock avoidance
> is supposed to work.

It's supposed to sync files in a big loop

        for each dirty inode
            write_cache_pages()
                (quickly) tag currently dirty pages
                (maybe slowly) sync all tagged pages

Ideally the loop should call write_cache_pages() _once_ for each inode.
At least this is the assumption made by commit f446daaea (mm:
implement writeback livelock avoidance using page tagging).

Setting wbc.nr_to_write to LONG_MAX ensures that writeback_inodes_wb()
will complete the above loop before returning to wb_writeback(), and
to prevent wb_writeback() from looping (thus re-syncing extra data) in
the below range of code.

   643                 wbc.nr_to_write = MAX_WRITEBACK_PAGES;
   644                 wbc.pages_skipped = 0;
   645            
   646                 trace_wbc_writeback_start(&wbc, wb->bdi);
   647                 if (work->sb)
   648                         __writeback_inodes_sb(work->sb, wb, &wbc);
   649                 else
   650                         writeback_inodes_wb(wb, &wbc);
   651                 trace_wbc_writeback_written(&wbc, wb->bdi);
   652            
   653                 work->nr_pages -= MAX_WRITEBACK_PAGES - wbc.nr_to_write;
   654                 wrote += MAX_WRITEBACK_PAGES - wbc.nr_to_write;
   655            
   656                 /*
   657                  * If we consumed everything, see if we have more
   658                  */
   659                 if (wbc.nr_to_write <= 0)
   660                         continue;

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] mm: Avoid livelocking of WB_SYNC_ALL writeback
  2010-11-05 22:30 ` Christoph Hellwig
  2010-11-06  2:54   ` Wu Fengguang
@ 2010-11-06  2:55   ` Wu Fengguang
  2010-11-06 16:39     ` Wu Fengguang
  1 sibling, 1 reply; 12+ messages in thread
From: Wu Fengguang @ 2010-11-06  2:55 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jan Kara, linux-fsdevel@vger.kernel.org, Andrew Morton, linux-mm,
	Johannes Weiner

[add CC to linux-mm list]

On Sat, Nov 06, 2010 at 06:30:38AM +0800, Christoph Hellwig wrote:
> > +	/*
> > +	 * In WB_SYNC_ALL mode, we just want to ignore nr_to_write as
> > +	 * we need to write everything and livelock avoidance is implemented
> > +	 * differently.
> > +	 */
> > +       if (wbc.sync_mode == WB_SYNC_NONE)
> > +               write_chunk = MAX_WRITEBACK_PAGES;
> > +       else
> > +               write_chunk = LONG_MAX;

Good catch!

> 
> I think it would be useful to elaborate here on how livelock avoidance
> is supposed to work.

It's supposed to sync files in a big loop

        for each dirty inode
            write_cache_pages()
                (quickly) tag currently dirty pages
                (maybe slowly) sync all tagged pages

Ideally the loop should call write_cache_pages() _once_ for each inode.
At least this is the assumption made by commit f446daaea (mm:
implement writeback livelock avoidance using page tagging).

Setting wbc.nr_to_write to LONG_MAX ensures that writeback_inodes_wb()
will complete the above loop before returning to wb_writeback(), and
to prevent wb_writeback() from looping (thus re-syncing extra data) in
the below range of code.

   643                 wbc.nr_to_write = MAX_WRITEBACK_PAGES;
   644                 wbc.pages_skipped = 0;
   645            
   646                 trace_wbc_writeback_start(&wbc, wb->bdi);
   647                 if (work->sb)
   648                         __writeback_inodes_sb(work->sb, wb, &wbc);
   649                 else
   650                         writeback_inodes_wb(wb, &wbc);
   651                 trace_wbc_writeback_written(&wbc, wb->bdi);
   652            
   653                 work->nr_pages -= MAX_WRITEBACK_PAGES - wbc.nr_to_write;
   654                 wrote += MAX_WRITEBACK_PAGES - wbc.nr_to_write;
   655            
   656                 /*
   657                  * If we consumed everything, see if we have more
   658                  */
   659                 if (wbc.nr_to_write <= 0)
   660                         continue;

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] mm: Avoid livelocking of WB_SYNC_ALL writeback
  2010-11-05 21:26 [PATCH] mm: Avoid livelocking of WB_SYNC_ALL writeback Jan Kara
  2010-11-05 22:30 ` Christoph Hellwig
  2010-11-06  1:36 ` Johannes Weiner
@ 2010-11-06  4:12 ` Wu Fengguang
  2010-11-07 13:22   ` Jan Kara
  2 siblings, 1 reply; 12+ messages in thread
From: Wu Fengguang @ 2010-11-06  4:12 UTC (permalink / raw)
  To: Jan Kara
  Cc: Christoph Hellwig, linux-fsdevel@vger.kernel.org, Andrew Morton,
	Johannes Weiner, Linux Memory Management List

On Sat, Nov 06, 2010 at 05:26:23AM +0800, Jan Kara wrote:

> +	/*
> +	 * In WB_SYNC_ALL mode, we just want to ignore nr_to_write as
> +	 * we need to write everything and livelock avoidance is implemented
> +	 * differently.
> +	 */
> +	if (wbc.sync_mode == WB_SYNC_NONE)
> +		write_chunk = MAX_WRITEBACK_PAGES;
> +	else
> +		write_chunk = LONG_MAX;

This looks like a safe change for .37.  I updated the patch on the
above comment and made no other changes (it seems OK to also remove
the below line, however that's not the necessary change as a bug fix,
so I'd rather leave the extra change to the next merge window).

write_cache_pages():

-->                     /*
-->                      * We stop writing back only if we are not doing
-->                      * integrity sync. In case of integrity sync we have to
-->                      * keep going until we have written all the pages
-->                      * we tagged for writeback prior to entering this loop.
-->                      */
                        if (--wbc->nr_to_write <= 0 &&
==>                         wbc->sync_mode == WB_SYNC_NONE) {
                                done = 1;
                                break;
                        }

Thanks,
Fengguang
---
Subject: writeback: avoid livelocking WB_SYNC_ALL writeback

From: Jan Kara <jack@suse.cz>

When wb_writeback() is called in WB_SYNC_ALL mode, work->nr_to_write is
usually set to LONG_MAX. The logic in wb_writeback() then calls
__writeback_inodes_sb() with nr_to_write == MAX_WRITEBACK_PAGES and thus
we easily end up with negative nr_to_write after the function returns.
wb_writeback() then decides we need another round of writeback but this
is wrong in some cases! For example when a single large file is
continuously dirtied, we would never finish syncing it because each pass
would be able to write MAX_WRITEBACK_PAGES and inode dirty timestamp
never gets updated (as inode is never completely clean).

Fix the issue by setting nr_to_write to LONG_MAX in WB_SYNC_ALL mode. We
do not need nr_to_write in WB_SYNC_ALL mode anyway since livelock
avoidance is done differently for it.

After this patch, program from http://lkml.org/lkml/2010/10/24/154 is no
longer able to stall sync forever.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/fs-writeback.c |   25 +++++++++++++++++++++----
 1 file changed, 21 insertions(+), 4 deletions(-)

  Fengguang, I've been testing with those writeback fixes you reposted
a few days ago and I've been able to still reproduce livelocks with
Jan Engelhard's test case. Using writeback tracing I've tracked the
problem to the above and with this patch, sync finishes OK (well, it still
takes about 15 minutes but that's about expected time given the throughput
I see to the disk - the test case randomly dirties pages in a huge file).
So could you please add this patch to the previous two send them to Jens
for inclusion?

--- linux-next.orig/fs/fs-writeback.c	2010-11-06 11:04:30.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2010-11-06 11:31:20.000000000 +0800
@@ -626,6 +626,7 @@ static long wb_writeback(struct bdi_writ
 	};
 	unsigned long oldest_jif;
 	long wrote = 0;
+	long write_chunk;
 	struct inode *inode;

 	if (wbc.for_kupdate) {
@@ -638,6 +639,22 @@ static long wb_writeback(struct bdi_writ
 		wbc.range_end = LLONG_MAX;
 	}

+	/*
+	 * WB_SYNC_ALL mode does livelock avoidance by syncing dirty
+	 * inodes/pages in one big loop. Setting wbc.nr_to_write=LONG_MAX
+	 * here avoids calling into writeback_inodes_wb() more than once.
+	 *
+	 *      wb_writeback()
+	 *          writeback_inodes_wb()       <== called only once
+	 *              write_cache_pages()     <== called once for each inode
+	 *                   (quickly) tag currently dirty pages
+	 *                   (maybe slowly) sync all tagged pages
+	 */
+	if (wbc.sync_mode == WB_SYNC_NONE)
+		write_chunk = MAX_WRITEBACK_PAGES;
+	else
+		write_chunk = LONG_MAX;
+
 	wbc.wb_start = jiffies; /* livelock avoidance */
 	for (;;) {
 		/*
@@ -663,7 +680,7 @@ static long wb_writeback(struct bdi_writ
 			break;

 		wbc.more_io = 0;
-		wbc.nr_to_write = MAX_WRITEBACK_PAGES;
+		wbc.nr_to_write = write_chunk;
 		wbc.pages_skipped = 0;

 		trace_wbc_writeback_start(&wbc, wb->bdi);
@@ -673,8 +690,8 @@ static long wb_writeback(struct bdi_writ
 			writeback_inodes_wb(wb, &wbc);
 		trace_wbc_writeback_written(&wbc, wb->bdi);

-		work->nr_pages -= MAX_WRITEBACK_PAGES - wbc.nr_to_write;
-		wrote += MAX_WRITEBACK_PAGES - wbc.nr_to_write;
+		work->nr_pages -= write_chunk - wbc.nr_to_write;
+		wrote += write_chunk - wbc.nr_to_write;

 		/*
 		 * If we consumed everything, see if we have more
@@ -689,7 +706,7 @@ static long wb_writeback(struct bdi_writ
 		/*
 		 * Did we write something? Try for more
 		 */
-		if (wbc.nr_to_write < MAX_WRITEBACK_PAGES)
+		if (wbc.nr_to_write < write_chunk)
 			continue;
 		/*
 		 * Nothing written. Wait for some inode to

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] mm: Avoid livelocking of WB_SYNC_ALL writeback
  2010-11-06  2:54   ` Wu Fengguang
@ 2010-11-06 11:45     ` Christoph Hellwig
  0 siblings, 0 replies; 12+ messages in thread
From: Christoph Hellwig @ 2010-11-06 11:45 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Christoph Hellwig, Jan Kara, linux-fsdevel@vger.kernel.org,
	Andrew Morton

On Sat, Nov 06, 2010 at 10:54:23AM +0800, Wu Fengguang wrote:
> > I think it would be useful to elaborate here on how livelock avoidance
> > is supposed to work.
> 
> It's supposed to sync files in a big loop
> 
>         for each dirty inode
>             write_cache_pages()
>                 (quickly) tag currently dirty pages
>                 (maybe slowly) sync all tagged pages
> 
> Ideally the loop should call write_cache_pages() _once_ for each inode.
> At least this is the assumption made by commit f446daaea (mm:
> implement writeback livelock avoidance using page tagging).
> 
> Setting wbc.nr_to_write to LONG_MAX ensures that writeback_inodes_wb()
> will complete the above loop before returning to wb_writeback(), and
> to prevent wb_writeback() from looping (thus re-syncing extra data) in
> the below range of code.

I know this, but putting a shortened version of this into the comment
would generally be helpful.  It should at least mention that the page
tagging is taking care of it.


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] mm: Avoid livelocking of WB_SYNC_ALL writeback
  2010-11-06  2:55   ` Wu Fengguang
@ 2010-11-06 16:39     ` Wu Fengguang
  2010-11-07 13:34       ` Jan Kara
  0 siblings, 1 reply; 12+ messages in thread
From: Wu Fengguang @ 2010-11-06 16:39 UTC (permalink / raw)
  To: Andrew Morton, Christoph Hellwig
  Cc: Jan Kara, linux-fsdevel@vger.kernel.org, linux-mm,
	Johannes Weiner

On Sat, Nov 06, 2010 at 10:55:48AM +0800, Wu Fengguang wrote:
> [add CC to linux-mm list]
> 
> On Sat, Nov 06, 2010 at 06:30:38AM +0800, Christoph Hellwig wrote:
> > > +	/*
> > > +	 * In WB_SYNC_ALL mode, we just want to ignore nr_to_write as
> > > +	 * we need to write everything and livelock avoidance is implemented
> > > +	 * differently.
> > > +	 */
> > > +       if (wbc.sync_mode == WB_SYNC_NONE)
> > > +               write_chunk = MAX_WRITEBACK_PAGES;
> > > +       else
> > > +               write_chunk = LONG_MAX;
> 
> Good catch!
> 
> > 
> > I think it would be useful to elaborate here on how livelock avoidance
> > is supposed to work.
> 
> It's supposed to sync files in a big loop
> 
>         for each dirty inode
>             write_cache_pages()
>                 (quickly) tag currently dirty pages
>                 (maybe slowly) sync all tagged pages
> 
> Ideally the loop should call write_cache_pages() _once_ for each inode.
> At least this is the assumption made by commit f446daaea (mm:
> implement writeback livelock avoidance using page tagging).

The above scheme relies on the filesystems to not skip pages in
WB_SYNC_ALL mode. It seems necessary to add an explicit check at
least in the -mm tree.

Thanks,
Fengguang
---
writeback: check skipped pages on WB_SYNC_ALL 

In WB_SYNC_ALL mode, filesystems are not expected to skip dirty pages on
temporal lock contentions or non fatal errors, otherwise sync() will
return without actually syncing the skipped pages. Add a check to
catch possible redirty_page_for_writepage() callers that violate this
expectation.

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/fs-writeback.c |    1 +
 1 file changed, 1 insertion(+)

--- linux-next.orig/fs/fs-writeback.c	2010-11-07 00:20:43.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2010-11-07 00:29:29.000000000 +0800
@@ -527,6 +527,7 @@ static int writeback_sb_inodes(struct su
 			 * buffers.  Skip this inode for now.
 			 */
 			redirty_tail(inode);
+			WARN_ON_ONCE(wbc->sync_mode == WB_SYNC_ALL);
 		}
 		spin_unlock(&inode_lock);
 		iput(inode);

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom policy in Canada: sign http://dissolvethecrtc.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] mm: Avoid livelocking of WB_SYNC_ALL writeback
  2010-11-06  1:36 ` Johannes Weiner
@ 2010-11-07 13:07   ` Jan Kara
  0 siblings, 0 replies; 12+ messages in thread
From: Jan Kara @ 2010-11-07 13:07 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Jan Kara, Wu Fengguang, Christoph Hellwig, linux-fsdevel,
	Andrew Morton, linux-mm

On Sat 06-11-10 02:36:14, Johannes Weiner wrote:
> Can you please keep linux-mm@kvack.org in the loop on writeback stuff?
  Will do. Sorry for that.

								Honza

> I Cc'd it now, here is the full quote:
> 
> On Fri, Nov 05, 2010 at 10:26:23PM +0100, Jan Kara wrote:
> > When wb_writeback() is called in WB_SYNC_ALL mode, work->nr_to_write is usually
> > set to LONG_MAX. The logic in wb_writeback() then calls __writeback_inodes_sb()
> > with nr_to_write == MAX_WRITEBACK_PAGES and thus we easily end up with negative
> > nr_to_write after the function returns. wb_writeback() then decides we need
> > another round of writeback but this is wrong in some cases! For example when
> > a single large file is continuously dirtied, we would never finish syncing
> > it because each pass would be able to write MAX_WRITEBACK_PAGES and inode dirty
> > timestamp never gets updated (as inode is never completely clean).
> > 
> > Fix the issue by setting nr_to_write to LONG_MAX in WB_SYNC_ALL mode. We do not
> > need nr_to_write in WB_SYNC_ALL mode anyway since livelock avoidance is done
> > differently for it.
> > 
> > After this patch, program from http://lkml.org/lkml/2010/10/24/154 is no longer
> > able to stall sync forever.
> > 
> > Signed-off-by: Jan Kara <jack@suse.cz>
> > ---
> >  fs/fs-writeback.c |   18 ++++++++++++++----
> >  1 files changed, 14 insertions(+), 4 deletions(-)
> > 
> >   Fengguang, I've been testing with those writeback fixes you reposted
> > a few days ago and I've been able to still reproduce livelocks with
> > Jan Engelhard's test case. Using writeback tracing I've tracked the
> > problem to the above and with this patch, sync finishes OK (well, it still
> > takes about 15 minutes but that's about expected time given the throughput
> > I see to the disk - the test case randomly dirties pages in a huge file).
> > So could you please add this patch to the previous two send them to Jens
> > for inclusion?
> > 
> > diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> > index 6b4d02a..d5873a6 100644
> > --- a/fs/fs-writeback.c
> > +++ b/fs/fs-writeback.c
> > @@ -629,6 +629,7 @@ static long wb_writeback(struct bdi_writeback *wb,
> >  	};
> >  	unsigned long oldest_jif;
> >  	long wrote = 0;
> > +	long write_chunk;
> >  	struct inode *inode;
> >  
> >  	if (wbc.for_kupdate) {
> > @@ -640,6 +641,15 @@ static long wb_writeback(struct bdi_writeback *wb,
> >  		wbc.range_start = 0;
> >  		wbc.range_end = LLONG_MAX;
> >  	}
> > +	/*
> > +	 * In WB_SYNC_ALL mode, we just want to ignore nr_to_write as
> > +	 * we need to write everything and livelock avoidance is implemented
> > +	 * differently.
> > +	 */
> > +	if (wbc.sync_mode == WB_SYNC_NONE)
> > +		write_chunk = MAX_WRITEBACK_PAGES;
> > +	else
> > +		write_chunk = LONG_MAX;
> >  
> >  	wbc.wb_start = jiffies; /* livelock avoidance */
> >  	for (;;) {
> > @@ -665,7 +675,7 @@ static long wb_writeback(struct bdi_writeback *wb,
> >  			break;
> >  
> >  		wbc.more_io = 0;
> > -		wbc.nr_to_write = MAX_WRITEBACK_PAGES;
> > +		wbc.nr_to_write = write_chunk;
> >  		wbc.pages_skipped = 0;
> >  
> >  		trace_wbc_writeback_start(&wbc, wb->bdi);
> > @@ -675,8 +685,8 @@ static long wb_writeback(struct bdi_writeback *wb,
> >  			writeback_inodes_wb(wb, &wbc);
> >  		trace_wbc_writeback_written(&wbc, wb->bdi);
> >  
> > -		work->nr_pages -= MAX_WRITEBACK_PAGES - wbc.nr_to_write;
> > -		wrote += MAX_WRITEBACK_PAGES - wbc.nr_to_write;
> > +		work->nr_pages -= write_chunk - wbc.nr_to_write;
> > +		wrote += write_chunk - wbc.nr_to_write;
> >  
> >  		/*
> >  		 * If we consumed everything, see if we have more
> > @@ -691,7 +701,7 @@ static long wb_writeback(struct bdi_writeback *wb,
> >  		/*
> >  		 * Did we write something? Try for more
> >  		 */
> > -		if (wbc.nr_to_write < MAX_WRITEBACK_PAGES)
> > +		if (wbc.nr_to_write < write_chunk)
> >  			continue;
> >  		/*
> >  		 * Nothing written. Wait for some inode to
> > -- 
> > 1.7.1
> > 
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom policy in Canada: sign http://dissolvethecrtc.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] mm: Avoid livelocking of WB_SYNC_ALL writeback
  2010-11-06  4:12 ` Wu Fengguang
@ 2010-11-07 13:22   ` Jan Kara
  2010-11-07 13:37     ` Wu Fengguang
  0 siblings, 1 reply; 12+ messages in thread
From: Jan Kara @ 2010-11-07 13:22 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Jan Kara, Christoph Hellwig, linux-fsdevel@vger.kernel.org,
	Andrew Morton, Johannes Weiner, Linux Memory Management List

On Sat 06-11-10 12:12:02, Wu Fengguang wrote:
> On Sat, Nov 06, 2010 at 05:26:23AM +0800, Jan Kara wrote:
> 
> > +	/*
> > +	 * In WB_SYNC_ALL mode, we just want to ignore nr_to_write as
> > +	 * we need to write everything and livelock avoidance is implemented
> > +	 * differently.
> > +	 */
> > +	if (wbc.sync_mode == WB_SYNC_NONE)
> > +		write_chunk = MAX_WRITEBACK_PAGES;
> > +	else
> > +		write_chunk = LONG_MAX;
> 
> This looks like a safe change for .37.  I updated the patch on the
> above comment and made no other changes (it seems OK to also remove
> the below line, however that's not the necessary change as a bug fix,
> so I'd rather leave the extra change to the next merge window).
> write_cache_pages():
> 
> -->                     /*
> -->                      * We stop writing back only if we are not doing
> -->                      * integrity sync. In case of integrity sync we have to
> -->                      * keep going until we have written all the pages
> -->                      * we tagged for writeback prior to entering this loop.
> -->                      */
>                         if (--wbc->nr_to_write <= 0 &&
> ==>                         wbc->sync_mode == WB_SYNC_NONE) {
>                                 done = 1;
>                                 break;
  Well, I'd rather leave the test as is. In fact, in my mind-model the
target rather is to completely ignore nr_to_write when we do WB_SYNC_ALL
writeback since obeying it is never what a caller wants to happen...

> +	/*
> +	 * WB_SYNC_ALL mode does livelock avoidance by syncing dirty
> +	 * inodes/pages in one big loop. Setting wbc.nr_to_write=LONG_MAX
> +	 * here avoids calling into writeback_inodes_wb() more than once.
  Maybe I'd add here:
The intended call sequence for WB_SYNC_ALL writeback is:
> +	 *
> +	 *      wb_writeback()
> +	 *          writeback_inodes_wb()       <== called only once
> +	 *              write_cache_pages()     <== called once for each inode
> +	 *                   (quickly) tag currently dirty pages
> +	 *                   (maybe slowly) sync all tagged pages
> +	 */
> +	if (wbc.sync_mode == WB_SYNC_NONE)
> +		write_chunk = MAX_WRITEBACK_PAGES;
> +	else
> +		write_chunk = LONG_MAX;
> +

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] mm: Avoid livelocking of WB_SYNC_ALL writeback
  2010-11-06 16:39     ` Wu Fengguang
@ 2010-11-07 13:34       ` Jan Kara
  0 siblings, 0 replies; 12+ messages in thread
From: Jan Kara @ 2010-11-07 13:34 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, Christoph Hellwig, Jan Kara,
	linux-fsdevel@vger.kernel.org, linux-mm, Johannes Weiner

On Sun 07-11-10 00:39:55, Wu Fengguang wrote:
> > It's supposed to sync files in a big loop
> > 
> >         for each dirty inode
> >             write_cache_pages()
> >                 (quickly) tag currently dirty pages
> >                 (maybe slowly) sync all tagged pages
> > 
> > Ideally the loop should call write_cache_pages() _once_ for each inode.
> > At least this is the assumption made by commit f446daaea (mm:
> > implement writeback livelock avoidance using page tagging).
> 
> The above scheme relies on the filesystems to not skip pages in
> WB_SYNC_ALL mode. It seems necessary to add an explicit check at
> least in the -mm tree.
> 
> Thanks,
> Fengguang
> ---
> writeback: check skipped pages on WB_SYNC_ALL 
> 
> In WB_SYNC_ALL mode, filesystems are not expected to skip dirty pages on
> temporal lock contentions or non fatal errors, otherwise sync() will
> return without actually syncing the skipped pages. Add a check to
> catch possible redirty_page_for_writepage() callers that violate this
> expectation.
  Yes, looks like a good debugging patch.

								Honza
> 
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> ---
>  fs/fs-writeback.c |    1 +
>  1 file changed, 1 insertion(+)
> 
> --- linux-next.orig/fs/fs-writeback.c	2010-11-07 00:20:43.000000000 +0800
> +++ linux-next/fs/fs-writeback.c	2010-11-07 00:29:29.000000000 +0800
> @@ -527,6 +527,7 @@ static int writeback_sb_inodes(struct su
>  			 * buffers.  Skip this inode for now.
>  			 */
>  			redirty_tail(inode);
> +			WARN_ON_ONCE(wbc->sync_mode == WB_SYNC_ALL);
>  		}
>  		spin_unlock(&inode_lock);
>  		iput(inode);
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom policy in Canada: sign http://dissolvethecrtc.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] mm: Avoid livelocking of WB_SYNC_ALL writeback
  2010-11-07 13:22   ` Jan Kara
@ 2010-11-07 13:37     ` Wu Fengguang
  0 siblings, 0 replies; 12+ messages in thread
From: Wu Fengguang @ 2010-11-07 13:37 UTC (permalink / raw)
  To: Jan Kara
  Cc: Christoph Hellwig, linux-fsdevel@vger.kernel.org, Andrew Morton,
	Johannes Weiner, Linux Memory Management List

On Sun, Nov 07, 2010 at 09:22:04PM +0800, Jan Kara wrote:
> On Sat 06-11-10 12:12:02, Wu Fengguang wrote:
> > On Sat, Nov 06, 2010 at 05:26:23AM +0800, Jan Kara wrote:
> > 
> > > +	/*
> > > +	 * In WB_SYNC_ALL mode, we just want to ignore nr_to_write as
> > > +	 * we need to write everything and livelock avoidance is implemented
> > > +	 * differently.
> > > +	 */
> > > +	if (wbc.sync_mode == WB_SYNC_NONE)
> > > +		write_chunk = MAX_WRITEBACK_PAGES;
> > > +	else
> > > +		write_chunk = LONG_MAX;
> > 
> > This looks like a safe change for .37.  I updated the patch on the
> > above comment and made no other changes (it seems OK to also remove
> > the below line, however that's not the necessary change as a bug fix,
> > so I'd rather leave the extra change to the next merge window).
> > write_cache_pages():
> > 
> > -->                     /*
> > -->                      * We stop writing back only if we are not doing
> > -->                      * integrity sync. In case of integrity sync we have to
> > -->                      * keep going until we have written all the pages
> > -->                      * we tagged for writeback prior to entering this loop.
> > -->                      */
> >                         if (--wbc->nr_to_write <= 0 &&
> > ==>                         wbc->sync_mode == WB_SYNC_NONE) {
> >                                 done = 1;
> >                                 break;
>   Well, I'd rather leave the test as is. In fact, in my mind-model the
> target rather is to completely ignore nr_to_write when we do WB_SYNC_ALL
> writeback since obeying it is never what a caller wants to happen...

I thought (nr_to_write = LONG_MAX) effectively means to complete
ignore it. But whatever, it's not a big issue.

> > +	/*
> > +	 * WB_SYNC_ALL mode does livelock avoidance by syncing dirty
> > +	 * inodes/pages in one big loop. Setting wbc.nr_to_write=LONG_MAX
> > +	 * here avoids calling into writeback_inodes_wb() more than once.
>   Maybe I'd add here:
> The intended call sequence for WB_SYNC_ALL writeback is:

Good addition.  Here is the updated patch.

Thanks,
Fengguang

---
writeback: avoid livelocking WB_SYNC_ALL writeback

From: Jan Kara <jack@suse.cz>

When wb_writeback() is called in WB_SYNC_ALL mode, work->nr_to_write is
usually set to LONG_MAX. The logic in wb_writeback() then calls
__writeback_inodes_sb() with nr_to_write == MAX_WRITEBACK_PAGES and thus
we easily end up with negative nr_to_write after the function returns.
wb_writeback() then decides we need another round of writeback but this
is wrong in some cases! For example when a single large file is
continuously dirtied, we would never finish syncing it because each pass
would be able to write MAX_WRITEBACK_PAGES and inode dirty timestamp
never gets updated (as inode is never completely clean).

Fix the issue by setting nr_to_write to LONG_MAX in WB_SYNC_ALL mode. We
do not need nr_to_write in WB_SYNC_ALL mode anyway since livelock
avoidance is done differently for it.

After this patch, program from http://lkml.org/lkml/2010/10/24/154 is no
longer able to stall sync forever.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/fs-writeback.c |   27 +++++++++++++++++++++++----
 1 file changed, 23 insertions(+), 4 deletions(-)

  Fengguang, I've been testing with those writeback fixes you reposted
a few days ago and I've been able to still reproduce livelocks with
Jan Engelhard's test case. Using writeback tracing I've tracked the
problem to the above and with this patch, sync finishes OK (well, it still
takes about 15 minutes but that's about expected time given the throughput
I see to the disk - the test case randomly dirties pages in a huge file).
So could you please add this patch to the previous two send them to Jens
for inclusion?

--- linux-next.orig/fs/fs-writeback.c	2010-11-06 23:55:35.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2010-11-07 21:34:35.000000000 +0800
@@ -630,6 +630,7 @@ static long wb_writeback(struct bdi_writ
 	};
 	unsigned long oldest_jif;
 	long wrote = 0;
+	long write_chunk;
 	struct inode *inode;
 
 	if (wbc.for_kupdate) {
@@ -642,6 +643,24 @@ static long wb_writeback(struct bdi_writ
 		wbc.range_end = LLONG_MAX;
 	}
 
+	/*
+	 * WB_SYNC_ALL mode does livelock avoidance by syncing dirty
+	 * inodes/pages in one big loop. Setting wbc.nr_to_write=LONG_MAX
+	 * here avoids calling into writeback_inodes_wb() more than once.
+	 *
+	 * The intended call sequence for WB_SYNC_ALL writeback is:
+	 *
+	 *      wb_writeback()
+	 *          writeback_inodes_wb()       <== called only once
+	 *              write_cache_pages()     <== called once for each inode
+	 *                   (quickly) tag currently dirty pages
+	 *                   (maybe slowly) sync all tagged pages
+	 */
+	if (wbc.sync_mode == WB_SYNC_NONE)
+		write_chunk = MAX_WRITEBACK_PAGES;
+	else
+		write_chunk = LONG_MAX;
+
 	wbc.wb_start = jiffies; /* livelock avoidance */
 	for (;;) {
 		/*
@@ -667,7 +686,7 @@ static long wb_writeback(struct bdi_writ
 			break;
 
 		wbc.more_io = 0;
-		wbc.nr_to_write = MAX_WRITEBACK_PAGES;
+		wbc.nr_to_write = write_chunk;
 		wbc.pages_skipped = 0;
 
 		trace_wbc_writeback_start(&wbc, wb->bdi);
@@ -677,8 +696,8 @@ static long wb_writeback(struct bdi_writ
 			writeback_inodes_wb(wb, &wbc);
 		trace_wbc_writeback_written(&wbc, wb->bdi);
 
-		work->nr_pages -= MAX_WRITEBACK_PAGES - wbc.nr_to_write;
-		wrote += MAX_WRITEBACK_PAGES - wbc.nr_to_write;
+		work->nr_pages -= write_chunk - wbc.nr_to_write;
+		wrote += write_chunk - wbc.nr_to_write;
 
 		/*
 		 * If we consumed everything, see if we have more
@@ -693,7 +712,7 @@ static long wb_writeback(struct bdi_writ
 		/*
 		 * Did we write something? Try for more
 		 */
-		if (wbc.nr_to_write < MAX_WRITEBACK_PAGES)
+		if (wbc.nr_to_write < write_chunk)
 			continue;
 		/*
 		 * Nothing written. Wait for some inode to

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom policy in Canada: sign http://dissolvethecrtc.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2010-11-07 13:37 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2010-11-05 21:26 [PATCH] mm: Avoid livelocking of WB_SYNC_ALL writeback Jan Kara
2010-11-05 22:30 ` Christoph Hellwig
2010-11-06  2:54   ` Wu Fengguang
2010-11-06 11:45     ` Christoph Hellwig
2010-11-06  2:55   ` Wu Fengguang
2010-11-06 16:39     ` Wu Fengguang
2010-11-07 13:34       ` Jan Kara
2010-11-06  1:36 ` Johannes Weiner
2010-11-07 13:07   ` Jan Kara
2010-11-06  4:12 ` Wu Fengguang
2010-11-07 13:22   ` Jan Kara
2010-11-07 13:37     ` Wu Fengguang

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox