All of lore.kernel.org
 help / color / mirror / Atom feed
From: Fengguang Wu <wfg@mail.ustc.edu.cn>
To: David Chinner <dgc@sgi.com>, Andrew Morton <akpm@osdl.org>,
	Ken Chen <kenchen@google.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org
Subject: Re: [PATCH 3/6] writeback: remove pages_skipped accounting in __block_write_full_page()
Date: Fri, 17 Aug 2007 15:13:17 +0800	[thread overview]
Message-ID: <387334797.23195@ustc.edu.cn> (raw)
Message-ID: <20070817071317.GA8965@mail.ustc.edu.cn> (raw)
In-Reply-To: <20070813103000.GA8520@mail.ustc.edu.cn>

On Mon, Aug 13, 2007 at 06:30:00PM +0800, Fengguang Wu wrote:
> > On Sun, Aug 12, 2007 at 05:11:23PM +0800, Fengguang Wu wrote:
> > > Miklos Szeredi <miklos@szeredi.hu> and me identified a writeback bug:
> > > Basicly they are
> > > - during the dd: ~16M 
> > > - after 30s:      ~4M
> > > - after 5s:       ~4M
> > > - after 5s:     ~176M
> > > 
> > > The box has 2G memory.
> > > 
> > > Question 1:
> > > How come the 5s delays? I run 4 tests in total, 2 of which have such 5s delays.
> > 
> > pdflush runs every five seconds, so that is indicative of the inode being
> > written once for 1024 pages, and then delayed to the next pdflush run 5s later.
> > perhaps the inodes aren't moving between the lists exactly the way you
> > think they are...
> 
> Now I figured out the exact situation. When the scan of s_io finishes
> with some small inodes, nr_to_write will be positive, fooling kupdate
> to quit prematurely. But in fact the big dirty file is on s_more_io
> waiting for more io... The attached patch fixes it.
> 
> Fengguang
> ===
> 
> Subject: writeback: introduce writeback_control.more_io to indicate more io
> 
> After making dirty a 100M file, the normal behavior is to
> start the writeback for all data after 30s delays. But
> sometimes the following happens instead:
> 
> 	- after 30s:    ~4M
> 	- after 5s:     ~4M
> 	- after 5s:     all remaining 92M
> 
> Some analyze shows that the internal io dispatch queues goes like this:
> 
> 		s_io            s_more_io
> 		-------------------------
> 	1)	100M,1K         0
> 	2)	1K              96M
> 	3)	0               96M
> 
> 1) initial state with a 100M file and a 1K file
> 2) 4M written, nr_to_write <= 0, so write more
> 3) 1K written, nr_to_write > 0, no more writes(BUG)
> 
> nr_to_write > 0 in 3) fools the upper layer to think that data have all been
> written out. Bug the big dirty file is still sitting in s_more_io. We cannot
> simply splice s_more_io back to s_io as soon as s_io becomes empty, and let the
> loop in generic_sync_sb_inodes() continue: this may starve newly expired inodes
> in s_dirty.  It is also not an option to draw inodes from both s_more_io and
> s_dirty, an let the loop go on: this might lead to live locks, and might also
> starve other superblocks in sync time(well kupdate may still starve some
> superblocks, that's another bug).
> 
> So we have to return when a full scan of s_io completes. So nr_to_write > 0 does
> not necessarily mean that "all data are written". This patch introduces a flag
> writeback_control.more_io to indicate this situation. With it the big dirty file
> no longer has to wait for the next kupdate invocation 5s later.

Sorry, this patch is found to be dangerous. It locks up my desktop
on heavy I/O: kupdate *immediately* returns to push the file in
s_more_io for writeback, but it *could* still not able to make
progress(locks etc.). Now kupdate ends up *busy looping*.  That could
be fixed by wait for somehow 100ms and retry the io. Should we do
it?(or: Is 5s interval considered too long a wait?)

> Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
> ---
>  fs/fs-writeback.c         |    2 ++
>  include/linux/writeback.h |    1 +
>  mm/page-writeback.c       |    9 ++++++---
>  3 files changed, 9 insertions(+), 3 deletions(-)
> 
> --- linux-2.6.23-rc2-mm2.orig/fs/fs-writeback.c
> +++ linux-2.6.23-rc2-mm2/fs/fs-writeback.c
> @@ -560,6 +560,8 @@ int generic_sync_sb_inodes(struct super_
>  		if (wbc->nr_to_write <= 0)
>  			break;
>  	}
> +	if (!list_empty(&sb->s_more_io))
> +		wbc->more_io = 1;
>  	spin_unlock(&inode_lock);
>  	return ret;		/* Leave any unwritten inodes on s_io */
>  }
> --- linux-2.6.23-rc2-mm2.orig/include/linux/writeback.h
> +++ linux-2.6.23-rc2-mm2/include/linux/writeback.h
> @@ -61,6 +61,7 @@ struct writeback_control {
>  	unsigned for_reclaim:1;		/* Invoked from the page allocator */
>  	unsigned for_writepages:1;	/* This is a writepages() call */
>  	unsigned range_cyclic:1;	/* range_start is cyclic */
> +	unsigned more_io:1;		/* more io to be dispatched */
>  
>  	void *fs_private;		/* For use by ->writepages() */
>  };
> --- linux-2.6.23-rc2-mm2.orig/mm/page-writeback.c
> +++ linux-2.6.23-rc2-mm2/mm/page-writeback.c
> @@ -382,6 +382,7 @@ static void background_writeout(unsigned
>  			global_page_state(NR_UNSTABLE_NFS) < background_thresh
>  				&& min_pages <= 0)
>  			break;
> +		wbc.more_io = 0;
>  		wbc.encountered_congestion = 0;
>  		wbc.nr_to_write = MAX_WRITEBACK_PAGES;
>  		wbc.pages_skipped = 0;
> @@ -389,8 +390,9 @@ static void background_writeout(unsigned
>  		min_pages -= MAX_WRITEBACK_PAGES - wbc.nr_to_write;
>  		if (wbc.nr_to_write > 0 || wbc.pages_skipped > 0) {
>  			/* Wrote less than expected */
> -			congestion_wait(WRITE, HZ/10);
> -			if (!wbc.encountered_congestion)
> +			if (wbc.encountered_congestion)
> +				congestion_wait(WRITE, HZ/10);
> +			else if (!wbc.more_io)
>  				break;
>  		}
>  	}
> @@ -455,13 +457,14 @@ static void wb_kupdate(unsigned long arg
>  			global_page_state(NR_UNSTABLE_NFS) +
>  			(inodes_stat.nr_inodes - inodes_stat.nr_unused);
>  	while (nr_to_write > 0) {
> +		wbc.more_io = 0;
>  		wbc.encountered_congestion = 0;
>  		wbc.nr_to_write = MAX_WRITEBACK_PAGES;
>  		writeback_inodes(&wbc);
>  		if (wbc.nr_to_write > 0) {
>  			if (wbc.encountered_congestion)
>  				congestion_wait(WRITE, HZ/10);
> -			else
> +			else if (!wbc.more_io)
>  				break;	/* All the old data is written */
>  		}
>  		nr_to_write -= MAX_WRITEBACK_PAGES - wbc.nr_to_write;
> 
> -
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


WARNING: multiple messages have this Message-ID (diff)
From: Fengguang Wu <wfg@mail.ustc.edu.cn>
To: David Chinner <dgc@sgi.com>, Andrew Morton <akpm@osdl.org>,
	Ken Chen <kenchen@google.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	linux-kernel@vger.kernel.org, linux-fsdevel@vger.
Subject: Re: [PATCH 3/6] writeback: remove pages_skipped accounting in __block_write_full_page()
Date: Fri, 17 Aug 2007 15:13:17 +0800	[thread overview]
Message-ID: <20070817071317.GA8965__16991.1062614576$1187334852$gmane$org@mail.ustc.edu.cn> (raw)
Message-ID: <20070817071317.GA8965@mail.ustc.edu.cn> (raw)
In-Reply-To: <20070813103000.GA8520@mail.ustc.edu.cn>

On Mon, Aug 13, 2007 at 06:30:00PM +0800, Fengguang Wu wrote:
> > On Sun, Aug 12, 2007 at 05:11:23PM +0800, Fengguang Wu wrote:
> > > Miklos Szeredi <miklos@szeredi.hu> and me identified a writeback bug:
> > > Basicly they are
> > > - during the dd: ~16M 
> > > - after 30s:      ~4M
> > > - after 5s:       ~4M
> > > - after 5s:     ~176M
> > > 
> > > The box has 2G memory.
> > > 
> > > Question 1:
> > > How come the 5s delays? I run 4 tests in total, 2 of which have such 5s delays.
> > 
> > pdflush runs every five seconds, so that is indicative of the inode being
> > written once for 1024 pages, and then delayed to the next pdflush run 5s later.
> > perhaps the inodes aren't moving between the lists exactly the way you
> > think they are...
> 
> Now I figured out the exact situation. When the scan of s_io finishes
> with some small inodes, nr_to_write will be positive, fooling kupdate
> to quit prematurely. But in fact the big dirty file is on s_more_io
> waiting for more io... The attached patch fixes it.
> 
> Fengguang
> ===
> 
> Subject: writeback: introduce writeback_control.more_io to indicate more io
> 
> After making dirty a 100M file, the normal behavior is to
> start the writeback for all data after 30s delays. But
> sometimes the following happens instead:
> 
> 	- after 30s:    ~4M
> 	- after 5s:     ~4M
> 	- after 5s:     all remaining 92M
> 
> Some analyze shows that the internal io dispatch queues goes like this:
> 
> 		s_io            s_more_io
> 		-------------------------
> 	1)	100M,1K         0
> 	2)	1K              96M
> 	3)	0               96M
> 
> 1) initial state with a 100M file and a 1K file
> 2) 4M written, nr_to_write <= 0, so write more
> 3) 1K written, nr_to_write > 0, no more writes(BUG)
> 
> nr_to_write > 0 in 3) fools the upper layer to think that data have all been
> written out. Bug the big dirty file is still sitting in s_more_io. We cannot
> simply splice s_more_io back to s_io as soon as s_io becomes empty, and let the
> loop in generic_sync_sb_inodes() continue: this may starve newly expired inodes
> in s_dirty.  It is also not an option to draw inodes from both s_more_io and
> s_dirty, an let the loop go on: this might lead to live locks, and might also
> starve other superblocks in sync time(well kupdate may still starve some
> superblocks, that's another bug).
> 
> So we have to return when a full scan of s_io completes. So nr_to_write > 0 does
> not necessarily mean that "all data are written". This patch introduces a flag
> writeback_control.more_io to indicate this situation. With it the big dirty file
> no longer has to wait for the next kupdate invocation 5s later.

Sorry, this patch is found to be dangerous. It locks up my desktop
on heavy I/O: kupdate *immediately* returns to push the file in
s_more_io for writeback, but it *could* still not able to make
progress(locks etc.). Now kupdate ends up *busy looping*.  That could
be fixed by wait for somehow 100ms and retry the io. Should we do
it?(or: Is 5s interval considered too long a wait?)

> Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
> ---
>  fs/fs-writeback.c         |    2 ++
>  include/linux/writeback.h |    1 +
>  mm/page-writeback.c       |    9 ++++++---
>  3 files changed, 9 insertions(+), 3 deletions(-)
> 
> --- linux-2.6.23-rc2-mm2.orig/fs/fs-writeback.c
> +++ linux-2.6.23-rc2-mm2/fs/fs-writeback.c
> @@ -560,6 +560,8 @@ int generic_sync_sb_inodes(struct super_
>  		if (wbc->nr_to_write <= 0)
>  			break;
>  	}
> +	if (!list_empty(&sb->s_more_io))
> +		wbc->more_io = 1;
>  	spin_unlock(&inode_lock);
>  	return ret;		/* Leave any unwritten inodes on s_io */
>  }
> --- linux-2.6.23-rc2-mm2.orig/include/linux/writeback.h
> +++ linux-2.6.23-rc2-mm2/include/linux/writeback.h
> @@ -61,6 +61,7 @@ struct writeback_control {
>  	unsigned for_reclaim:1;		/* Invoked from the page allocator */
>  	unsigned for_writepages:1;	/* This is a writepages() call */
>  	unsigned range_cyclic:1;	/* range_start is cyclic */
> +	unsigned more_io:1;		/* more io to be dispatched */
>  
>  	void *fs_private;		/* For use by ->writepages() */
>  };
> --- linux-2.6.23-rc2-mm2.orig/mm/page-writeback.c
> +++ linux-2.6.23-rc2-mm2/mm/page-writeback.c
> @@ -382,6 +382,7 @@ static void background_writeout(unsigned
>  			global_page_state(NR_UNSTABLE_NFS) < background_thresh
>  				&& min_pages <= 0)
>  			break;
> +		wbc.more_io = 0;
>  		wbc.encountered_congestion = 0;
>  		wbc.nr_to_write = MAX_WRITEBACK_PAGES;
>  		wbc.pages_skipped = 0;
> @@ -389,8 +390,9 @@ static void background_writeout(unsigned
>  		min_pages -= MAX_WRITEBACK_PAGES - wbc.nr_to_write;
>  		if (wbc.nr_to_write > 0 || wbc.pages_skipped > 0) {
>  			/* Wrote less than expected */
> -			congestion_wait(WRITE, HZ/10);
> -			if (!wbc.encountered_congestion)
> +			if (wbc.encountered_congestion)
> +				congestion_wait(WRITE, HZ/10);
> +			else if (!wbc.more_io)
>  				break;
>  		}
>  	}
> @@ -455,13 +457,14 @@ static void wb_kupdate(unsigned long arg
>  			global_page_state(NR_UNSTABLE_NFS) +
>  			(inodes_stat.nr_inodes - inodes_stat.nr_unused);
>  	while (nr_to_write > 0) {
> +		wbc.more_io = 0;
>  		wbc.encountered_congestion = 0;
>  		wbc.nr_to_write = MAX_WRITEBACK_PAGES;
>  		writeback_inodes(&wbc);
>  		if (wbc.nr_to_write > 0) {
>  			if (wbc.encountered_congestion)
>  				congestion_wait(WRITE, HZ/10);
> -			else
> +			else if (!wbc.more_io)
>  				break;	/* All the old data is written */
>  		}
>  		nr_to_write -= MAX_WRITEBACK_PAGES - wbc.nr_to_write;
> 
> -
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


  parent reply	other threads:[~2007-08-17  7:13 UTC|newest]

Thread overview: 27+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2007-08-12  9:11 [PATCH 0/6] writeback time order/delay fixes take 3 Fengguang Wu
2007-08-12  9:11 ` Fengguang Wu
2007-08-22  0:23   ` Chris Mason
2007-08-22  1:18     ` Fengguang Wu
2007-08-22  1:18       ` Fengguang Wu
2007-08-22 12:42         ` Chris Mason
2007-08-23  2:47           ` David Chinner
2007-08-23 12:13             ` Chris Mason
2007-08-24 12:56               ` Fengguang Wu
2007-08-24 12:56                 ` Fengguang Wu
2007-08-24 13:24           ` Fengguang Wu
2007-08-24 13:24             ` Fengguang Wu
2007-08-24 14:36               ` Chris Mason
2007-08-23  2:33       ` David Chinner
2007-08-24 13:55         ` Fengguang Wu
2007-08-24 13:55           ` Fengguang Wu
2007-08-28 14:55           ` David Chinner
2007-08-28 15:08             ` Chris Mason
2007-08-28 16:33               ` David Chinner
2007-08-28 16:57                 ` Chris Mason
2007-08-29  7:53                 ` Fengguang Wu
2007-08-29  7:53                   ` Fengguang Wu
2007-08-12  9:11 ` [PATCH 1/6] writeback: fix time ordering of the per superblock inode lists 8 Fengguang Wu
2007-08-12  9:11   ` Fengguang Wu
2007-08-12  9:11 ` [PATCH 2/6] writeback: fix ntfs with sb_has_dirty_inodes() Fengguang Wu
2007-08-12  9:11   ` Fengguang Wu
2007-08-12  9:11 ` [PATCH 3/6] writeback: remove pages_skipped accounting in __block_write_full_page() Fengguang Wu
2007-08-12  9:11   ` Fengguang Wu
2007-08-13  1:03   ` David Chinner
2007-08-13 10:30     ` Fengguang Wu
2007-08-13 10:30       ` Fengguang Wu
2007-08-17  7:13       ` Fengguang Wu [this message]
2007-08-17  7:13         ` Fengguang Wu
2007-08-17  7:13         ` Fengguang Wu
2007-08-12  9:11 ` [PATCH 4/6] check dirty inode list Fengguang Wu
2007-08-12  9:11   ` Fengguang Wu
2007-08-12  9:11 ` [PATCH 5/6] prevent time-ordering warnings Fengguang Wu
2007-08-12  9:11   ` Fengguang Wu
2007-08-12  9:11 ` [PATCH 6/6] track redirty_tail() calls Fengguang Wu
2007-08-12  9:11   ` Fengguang Wu
  -- strict thread matches above, loose matches on Subject: below --
2007-08-19  6:53 [PATCH 0/6] dirty inode lists time delay/ordering fixes Fengguang Wu
2007-08-19  6:53 ` [PATCH 3/6] writeback: remove pages_skipped accounting in __block_write_full_page() Fengguang Wu
2007-08-19  6:53   ` Fengguang Wu

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=387334797.23195@ustc.edu.cn \
    --to=wfg@mail.ustc.edu.cn \
    --cc=akpm@linux-foundation.org \
    --cc=akpm@osdl.org \
    --cc=dgc@sgi.com \
    --cc=kenchen@google.com \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.