* Re: [PATCH 1/6] writeback: balance_dirty_pages() shall write more than dirtied pages

From: Christoph Hellwig @ 2009-09-23 12:45 UTC
To: Wu Fengguang
Cc: Andrew Morton, Jens Axboe, Jan Kara, Theodore Ts'o, Dave Chinner,
    Chris Mason, Christoph Hellwig, Peter Zijlstra, linux-fsdevel, LKML

On Wed, Sep 23, 2009 at 08:33:39PM +0800, Wu Fengguang wrote:
> Some filesystems may choose to write much more than ratelimit_pages
> before calling balance_dirty_pages_ratelimited_nr(). So it is safer to
> determine the number of pages to write based on the real number of
> dirtied pages.
>
> The increased write_chunk may make the dirtier more bumpy. It is the
> filesystem writers' duty not to dirty too much at a time without
> checking the ratelimit.

For some reason the filesystems tricking around
balance_dirty_pages_ratelimited's semantics seem to get much better
buffered write performance. Are you sure this is not going to destroy
that benefit? (But yes, we should make sure it behaves the same for
all filesystems.)

> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> ---
>  mm/page-writeback.c |   13 +++++++------
>  1 file changed, 7 insertions(+), 6 deletions(-)
>
> --- linux.orig/mm/page-writeback.c	2009-09-23 16:31:46.000000000 +0800
> +++ linux/mm/page-writeback.c	2009-09-23 16:33:19.000000000 +0800
> @@ -44,12 +44,12 @@ static long ratelimit_pages = 32;
>  /*
>   * When balance_dirty_pages decides that the caller needs to perform some
>   * non-background writeback, this is how many pages it will attempt to write.
> - * It should be somewhat larger than RATELIMIT_PAGES to ensure that reasonably
> + * It should be somewhat larger than dirtied pages to ensure that reasonably
>   * large amounts of I/O are submitted.
>   */
> -static inline long sync_writeback_pages(void)
> +static inline long sync_writeback_pages(unsigned long dirtied)
>  {
> -	return ratelimit_pages + ratelimit_pages / 2;
> +	return dirtied + dirtied / 2;
>  }
>
>  /* The following parameters are exported via /proc/sys/vm */
> @@ -476,7 +476,8 @@ get_dirty_limits(unsigned long *pbackgro
>   * If we're over `background_thresh' then pdflush is woken to perform some
>   * writeout.
>   */
> -static void balance_dirty_pages(struct address_space *mapping)
> +static void balance_dirty_pages(struct address_space *mapping,
> +				unsigned long write_chunk)
>  {
>  	long nr_reclaimable, bdi_nr_reclaimable;
>  	long nr_writeback, bdi_nr_writeback;
> @@ -484,7 +485,6 @@ static void balance_dirty_pages(struct a
>  	unsigned long dirty_thresh;
>  	unsigned long bdi_thresh;
>  	unsigned long pages_written = 0;
> -	unsigned long write_chunk = sync_writeback_pages();
>  	unsigned long pause = 1;
>
>  	struct backing_dev_info *bdi = mapping->backing_dev_info;
> @@ -640,9 +640,10 @@ void balance_dirty_pages_ratelimited_nr(
>  		p = &__get_cpu_var(bdp_ratelimits);
>  		*p += nr_pages_dirtied;
>  		if (unlikely(*p >= ratelimit)) {
> +			ratelimit = sync_writeback_pages(*p);
>  			*p = 0;
>  			preempt_enable();
> -			balance_dirty_pages(mapping);
> +			balance_dirty_pages(mapping, ratelimit);
>  			return;
>  		}
>  		preempt_enable();
>
> ---end quoted text---
* Re: [PATCH 1/6] writeback: balance_dirty_pages() shall write more than dirtied pages

From: Wu Fengguang @ 2009-09-23 12:53 UTC
To: Christoph Hellwig
Cc: Andrew Morton, Jens Axboe, Jan Kara, Theodore Ts'o, Dave Chinner,
    Chris Mason, Peter Zijlstra, linux-fsdevel@vger.kernel.org, LKML

On Wed, Sep 23, 2009 at 08:45:29PM +0800, Christoph Hellwig wrote:
> On Wed, Sep 23, 2009 at 08:33:39PM +0800, Wu Fengguang wrote:
> > Some filesystems may choose to write much more than ratelimit_pages
> > before calling balance_dirty_pages_ratelimited_nr(). So it is safer to
> > determine the number of pages to write based on the real number of
> > dirtied pages.
> >
> > The increased write_chunk may make the dirtier more bumpy. It is the
> > filesystem writers' duty not to dirty too much at a time without
> > checking the ratelimit.
>
> For some reason the filesystems tricking around
> balance_dirty_pages_ratelimited's semantics seem to get much better
> buffered write performance.

Yes, I would expect enlarged ratelimit numbers to benefit performance.

> Are you sure this is not going to destroy that benefit? (But yes,
> we should make sure it behaves the same for all filesystems.)

If a filesystem chooses to use a larger ratelimit number, then this
patch will also enlarge the write chunk size (which is now computed
based on the enlarged ratelimit number), which I believe will be
beneficial, too.

Thanks,
Fengguang
* [PATCH 1/6 -v2] writeback: balance_dirty_pages() shall write more than dirtied pages

From: Wu Fengguang @ 2009-09-23 13:56 UTC
To: Andrew Morton, Jens Axboe
Cc: Jan Kara, Theodore Ts'o, Dave Chinner, Chris Mason, Christoph Hellwig,
    Peter Zijlstra, linux-fsdevel@vger.kernel.org, LKML

Some filesystems may choose to write much more than ratelimit_pages
before calling balance_dirty_pages_ratelimited_nr(). So it is safer to
determine the number of pages to write based on the real number of
dirtied pages. Otherwise it is possible that

	loop {
		btrfs_file_write():     dirty 1024 pages
		balance_dirty_pages():  write up to 48 pages
					(= ratelimit_pages * 1.5)
	}

in which the writeback rate cannot keep up with the dirty rate, and
the dirty pages go all the way beyond dirty_thresh.

The increased write_chunk may make the dirtier more bumpy. So
filesystems shall take care not to dirty too much at a time
(e.g. > 4MB) without checking the ratelimit.

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/page-writeback.c |   16 ++++++++++------
 1 file changed, 10 insertions(+), 6 deletions(-)

--- linux.orig/mm/page-writeback.c	2009-09-23 21:41:21.000000000 +0800
+++ linux/mm/page-writeback.c	2009-09-23 21:42:17.000000000 +0800
@@ -44,12 +44,15 @@ static long ratelimit_pages = 32;
 /*
  * When balance_dirty_pages decides that the caller needs to perform some
  * non-background writeback, this is how many pages it will attempt to write.
- * It should be somewhat larger than RATELIMIT_PAGES to ensure that reasonably
+ * It should be somewhat larger than dirtied pages to ensure that reasonably
  * large amounts of I/O are submitted.
  */
-static inline long sync_writeback_pages(void)
+static inline long sync_writeback_pages(unsigned long dirtied)
 {
-	return ratelimit_pages + ratelimit_pages / 2;
+	if (dirtied < ratelimit_pages)
+		dirtied = ratelimit_pages;
+
+	return dirtied + dirtied / 2;
 }

 /* The following parameters are exported via /proc/sys/vm */
@@ -476,7 +479,8 @@ get_dirty_limits(unsigned long *pbackgro
  * If we're over `background_thresh' then pdflush is woken to perform some
  * writeout.
  */
-static void balance_dirty_pages(struct address_space *mapping)
+static void balance_dirty_pages(struct address_space *mapping,
+				unsigned long write_chunk)
 {
 	long nr_reclaimable, bdi_nr_reclaimable;
 	long nr_writeback, bdi_nr_writeback;
@@ -484,7 +488,6 @@ static void balance_dirty_pages(struct a
 	unsigned long dirty_thresh;
 	unsigned long bdi_thresh;
 	unsigned long pages_written = 0;
-	unsigned long write_chunk = sync_writeback_pages();
 	unsigned long pause = 1;

 	struct backing_dev_info *bdi = mapping->backing_dev_info;
@@ -639,9 +642,10 @@ void balance_dirty_pages_ratelimited_nr(
 		p = &__get_cpu_var(bdp_ratelimits);
 		*p += nr_pages_dirtied;
 		if (unlikely(*p >= ratelimit)) {
+			ratelimit = sync_writeback_pages(*p);
 			*p = 0;
 			preempt_enable();
-			balance_dirty_pages(mapping);
+			balance_dirty_pages(mapping, ratelimit);
 			return;
 		}
 		preempt_enable();
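To make the v2 arithmetic concrete, here is a minimal stand-alone
sketch (user-space C mirroring the patched helper; ratelimit_pages = 32
matches the default in the hunk above, and the sample values are
illustrative). A caller that dirtied only 12 pages is still asked to
write a 48-page chunk, while a btrfs-style caller that dirtied 1024
pages gets a 1536-page chunk:

#include <stdio.h>

/* User-space mirror of the patched sync_writeback_pages(); the value
 * 32 matches the ratelimit_pages default in mm/page-writeback.c. */
static long ratelimit_pages = 32;

static long sync_writeback_pages(unsigned long dirtied)
{
	if (dirtied < ratelimit_pages)
		dirtied = ratelimit_pages;

	return dirtied + dirtied / 2;
}

int main(void)
{
	unsigned long samples[] = { 12, 32, 1024, 4096 };

	for (int i = 0; i < 4; i++)
		printf("dirtied=%4lu -> write_chunk=%ld\n",
		       samples[i], sync_writeback_pages(samples[i]));
	return 0;
}

The 1536-page chunk for 1024 dirtied pages is exactly the n=1536 seen
in the XFS trace in the follow-up below.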
* Re: [PATCH 1/6 -v2] writeback: balance_dirty_pages() shall write more than dirtied pages

From: Wu Fengguang @ 2009-09-23 13:58 UTC
To: Andrew Morton, Jens Axboe
Cc: Jan Kara, Theodore Ts'o, Dave Chinner, Chris Mason, Christoph Hellwig,
    Peter Zijlstra, linux-fsdevel@vger.kernel.org, LKML

On Wed, Sep 23, 2009 at 09:56:00PM +0800, Wu Fengguang wrote:
> -static inline long sync_writeback_pages(void)
> +static inline long sync_writeback_pages(unsigned long dirtied)
>  {
> -	return ratelimit_pages + ratelimit_pages / 2;
> +	if (dirtied < ratelimit_pages)
> +		dirtied = ratelimit_pages;

Just added the above check. Now balance_dirty_pages() for XFS works
with a much larger chunk size of 1536 pages :)

[   40.374081] redirty_tail() +516: inode=131
[   40.374951] mm/page-writeback.c +543 balance_dirty_pages(): comm=cp pid=3296 n=1536
[   40.377001] global dirty=49818 writeback=17027 nfs=0 flags=CM towrite=0 skipped=0

> +
> +	return dirtied + dirtied / 2;
>  }

Thanks,
Fengguang
* Re: [PATCH 2/6] writeback: stop background writeback when below background threshold

From: Jens Axboe @ 2009-09-23 15:05 UTC
To: Wu Fengguang
Cc: Andrew Morton, Jan Kara, Peter Zijlstra, Theodore Ts'o, Dave Chinner,
    Chris Mason, Christoph Hellwig, linux-fsdevel, LKML

On Wed, Sep 23 2009, Wu Fengguang wrote:
> Treat bdi_start_writeback(0) as a special request to do background
> writeback, and stop such work when we drop below the background dirty
> threshold.
>
> Also simplify the (nr_pages <= 0) checks. Since we already pass in
> nr_pages=LONG_MAX for WB_SYNC_ALL and background writes, we don't
> need to worry about it being decreased to zero.

Good cleanup! Do you want me to collect these fixes, or how shall we
handle this?

--
Jens Axboe
* Re: [PATCH 2/6] writeback: stop background writeback when below background threshold

From: Wu Fengguang @ 2009-09-24 1:24 UTC
To: Jens Axboe
Cc: Andrew Morton, Jan Kara, Peter Zijlstra, Theodore Ts'o, Dave Chinner,
    Chris Mason, Christoph Hellwig, linux-fsdevel@vger.kernel.org, LKML

On Wed, Sep 23, 2009 at 11:05:10PM +0800, Jens Axboe wrote:
> On Wed, Sep 23 2009, Wu Fengguang wrote:
> > Treat bdi_start_writeback(0) as a special request to do background
> > writeback, and stop such work when we drop below the background dirty
> > threshold.
> >
> > Also simplify the (nr_pages <= 0) checks. Since we already pass in
> > nr_pages=LONG_MAX for WB_SYNC_ALL and background writes, we don't
> > need to worry about it being decreased to zero.
>
> Good cleanup! Do you want me to collect these fixes, or how shall we
> handle this?

Thanks. I'm feeling OK with these patches (except the last one), as
well as Jan's busy-loop fixing patch. Can you pull them into your
tree?

Thanks,
Fengguang
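Patch 2/6 itself is not quoted in this thread, so as a rough,
non-authoritative illustration of the rule its description states --
keep doing background writeback only while the system is above the
background dirty threshold, then stop -- here is a hedged user-space
sketch. The page counts, chunk size, and helper name are all invented
for illustration:

#include <stdio.h>

static unsigned long nr_dirty = 60000;          /* invented page count */
static unsigned long background_thresh = 50000; /* invented threshold */

/* Invented helper: are we still above the background threshold? */
static int over_bground_thresh(void)
{
	return nr_dirty >= background_thresh;
}

int main(void)
{
	unsigned long chunk = 1024;	/* pages written per iteration */

	/* Background writeback proceeds in chunks and stops as soon as
	 * we drop below background_thresh, per the patch description. */
	while (over_bground_thresh()) {
		nr_dirty -= chunk;
		printf("wrote %lu pages, %lu still dirty\n", chunk, nr_dirty);
	}
	printf("below background_thresh, stopping background writeback\n");
	return 0;
}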
* Re: [PATCH 5/6] writeback: don't delay inodes redirtied by a fast dirtier

From: Wu Fengguang @ 2009-09-23 13:20 UTC
To: Andrew Morton, Jens Axboe
Cc: Jan Kara, Theodore Tso, Dave Chinner, Chris Mason, Christoph Hellwig,
    Peter Zijlstra, linux-fsdevel@vger.kernel.org, LKML

On Wed, Sep 23, 2009 at 08:33:43PM +0800, Wu Fengguang wrote:
> Debug traces show that in per-bdi writeback, the inode under writeback
> almost always gets redirtied by a busy dirtier. We used to call
> redirty_tail() in this case, which could delay the inode for up to 30s.
>
> This is unacceptable because it now happens so frequently for plain
> cp/dd that the accumulated delays could make writeback of big files
> very slow.
>
> So let's distinguish between data redirty and metadata-only redirty.
> The first one is caused by a busy dirtier, while the latter one could
> happen in XFS, NFS, etc. when they are doing delalloc or updating
> isize.
>
> The inode being busily dirtied will now be requeued for the next io,
> while the inode being redirtied by the fs will continue to be delayed,
> to avoid repeated IO.

Here are some test results on XFS. The workload is a simple

	cp /dev/zero /mnt/xfs/

I saw many repeated

[  344.043711] mm/page-writeback.c +540 balance_dirty_pages(): comm=cp pid=3520 n=12
[  344.043711] global dirty=57051 writeback=12920 nfs=0 flags=CM towrite=0 skipped=0
[  344.043711] redirty_tail() +516: inode=128    => the directory inode
[  344.043711] redirty_tail() +516: inode=131    => the zero file being written to

and then repeated

[  347.408629] fs/fs-writeback.c +813 wb_writeback(): comm=flush-8:0 pid=3298 n=4096
[  347.411045] global dirty=50397 writeback=18065 nfs=0 flags=CM towrite=0 skipped=0
[  347.422077] requeue_io() +468: inode=131
[  347.423809] redirty_tail() +516: inode=128

traces during the copy, and repeated

[  373.326496] redirty_tail() +516: inode=131
[  373.328209] fs/fs-writeback.c +813 wb_writeback(): comm=flush-8:0 pid=3298 n=4096
[  373.330988] global dirty=25213 writeback=20470 nfs=0 flags=CM towrite=0 skipped=0

after the copy is interrupted.

I noticed that

- the write chunk size of balance_dirty_pages() is 12, which is pretty
  small and inefficient;
- during the copy, the inode sometimes goes through redirty_tail() (old
  behavior) and sometimes through requeue_io() (new behavior);
- during the copy, the directory inode will always be synced and then
  put through redirty_tail();
- after the copy, the inode will be redirtied after sync.

It shall not be a problem to use requeue_io for XFS, because whether
it be requeue_io or redirty_tail, write_inode() will be called once
for every 4MB.

It would be inefficient if XFS really tries to write the inode's and
the directory inode's metadata every time it syncs 4MB of pages. If
that write attempt is turned into _real_ IO, that would be bad and
kill performance. Increasing MAX_WRITEBACK_PAGES may help reduce the
frequency of write_inode() though.

Thanks,
Fengguang

> CC: Jan Kara <jack@suse.cz>
> CC: Theodore Ts'o <tytso@mit.edu>
> CC: Dave Chinner <david@fromorbit.com>
> CC: Jens Axboe <jens.axboe@oracle.com>
> CC: Chris Mason <chris.mason@oracle.com>
> CC: Christoph Hellwig <hch@infradead.org>
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> ---
>  fs/fs-writeback.c |   17 +++++++++++++----
>  1 file changed, 13 insertions(+), 4 deletions(-)
>
> --- linux.orig/fs/fs-writeback.c	2009-09-23 16:13:41.000000000 +0800
> +++ linux/fs/fs-writeback.c	2009-09-23 16:21:24.000000000 +0800
> @@ -491,10 +493,15 @@ writeback_single_inode(struct inode *ino
>  	spin_lock(&inode_lock);
>  	inode->i_state &= ~I_SYNC;
>  	if (!(inode->i_state & (I_FREEING | I_CLEAR))) {
> -		if (inode->i_state & I_DIRTY) {
> +		if (inode->i_state & I_DIRTY_PAGES) {
>  			/*
> -			 * Someone redirtied the inode while were writing back
> -			 * the pages.
> +			 * More pages get dirtied by a fast dirtier.
> +			 */
> +			goto select_queue;
> +		} else if (inode->i_state & I_DIRTY) {
> +			/*
> +			 * At least XFS will redirty the inode during the
> +			 * writeback (delalloc) and on io completion (isize).
>  			 */
>  			redirty_tail(inode);
>  		} else if (mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
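The branching this patch introduces can be summarized with a small
user-space sketch. The flag bits and the helper below are illustrative
stand-ins, not the kernel's actual definitions:

#include <stdio.h>

/* Illustrative stand-ins -- not the kernel's real i_state bits. */
#define I_DIRTY_PAGES	(1 << 0)	/* data pages were redirtied */
#define I_DIRTY_SYNC	(1 << 1)	/* inode metadata is dirty */
#define I_DIRTY		(I_DIRTY_PAGES | I_DIRTY_SYNC)

static const char *queue_for(unsigned int i_state)
{
	if (i_state & I_DIRTY_PAGES)	/* busy dirtier: retry soon */
		return "requeue_io";
	if (i_state & I_DIRTY)		/* metadata-only (delalloc/isize) */
		return "redirty_tail";
	return "clean";
}

int main(void)
{
	printf("data redirty     -> %s\n", queue_for(I_DIRTY_PAGES));
	printf("metadata redirty -> %s\n", queue_for(I_DIRTY_SYNC));
	printf("no redirty       -> %s\n", queue_for(0));
	return 0;
}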
* Re: [PATCH 5/6] writeback: don't delay inodes redirtied by a fast dirtier

From: Christoph Hellwig @ 2009-09-23 13:23 UTC
To: Wu Fengguang
Cc: Andrew Morton, Jens Axboe, Jan Kara, Theodore Tso, Dave Chinner,
    Chris Mason, Christoph Hellwig, Peter Zijlstra,
    linux-fsdevel@vger.kernel.org, LKML

On Wed, Sep 23, 2009 at 09:20:08PM +0800, Wu Fengguang wrote:
> I noticed that
>
> - the write chunk size of balance_dirty_pages() is 12, which is pretty
>   small and inefficient;
> - during the copy, the inode sometimes goes through redirty_tail() (old
>   behavior) and sometimes through requeue_io() (new behavior);
> - during the copy, the directory inode will always be synced and then
>   put through redirty_tail();
> - after the copy, the inode will be redirtied after sync.

Yeah, XFS uses generic_file_buffered_write, and the heuristics in
there for balance_dirty_pages turned out to be really bad. So far we
didn't manage to successfully get that fixed, though.

> It shall not be a problem to use requeue_io for XFS, because whether
> it be requeue_io or redirty_tail, write_inode() will be called once
> for every 4MB.
>
> It would be inefficient if XFS really tries to write the inode's and
> the directory inode's metadata every time it syncs 4MB of pages. If
> that write attempt is turned into _real_ IO, that would be bad and
> kill performance. Increasing MAX_WRITEBACK_PAGES may help reduce the
> frequency of write_inode() though.

The way we call write_inode is extremely inefficient for XFS. As you
noticed, XFS tends to redirty the inode on I/O completion, and we also
cluster inode writeouts. For XFS we'd really prefer not to intermix
data and inode writeout, but first do the data writeout and then later
push out the inodes, preferably with as many inodes as possible swept
out in one go.
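As a rough sketch of the ordering described above -- push out all the
data first, then sweep the dirty inodes in one go instead of calling
write_inode() once per 4MB data chunk -- consider the following
user-space simulation. Every name in it is invented for illustration
and nothing corresponds to an actual kernel interface:

#include <stdio.h>

struct sim_inode {
	int ino;
	int dirty_pages;
	int meta_dirty;
};

/* Phase 1: write data for one inode; the metadata gets redirtied on
 * "I/O completion", modelling the XFS isize update mentioned above. */
static void write_data(struct sim_inode *i)
{
	printf("inode %d: wrote %d data pages\n", i->ino, i->dirty_pages);
	i->dirty_pages = 0;
	i->meta_dirty = 1;
}

/* Phase 2: sweep out all dirty inode metadata in a single pass. */
static void write_inodes(struct sim_inode *inodes, int n)
{
	for (int k = 0; k < n; k++) {
		if (inodes[k].meta_dirty) {
			printf("inode %d: metadata written\n", inodes[k].ino);
			inodes[k].meta_dirty = 0;
		}
	}
}

int main(void)
{
	struct sim_inode inodes[] = { { 128, 2048, 0 }, { 131, 4096, 0 } };

	for (int k = 0; k < 2; k++)
		write_data(&inodes[k]);
	write_inodes(inodes, 2);
	return 0;
}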
* Re: [PATCH 5/6] writeback: don't delay inodes redirtied by a fast dirtier

From: Wu Fengguang @ 2009-09-23 13:40 UTC
To: Christoph Hellwig
Cc: Andrew Morton, Jens Axboe, Jan Kara, Theodore Tso, Dave Chinner,
    Chris Mason, Peter Zijlstra, linux-fsdevel@vger.kernel.org, LKML

On Wed, Sep 23, 2009 at 09:23:51PM +0800, Christoph Hellwig wrote:
> Yeah, XFS uses generic_file_buffered_write, and the heuristics in
> there for balance_dirty_pages turned out to be really bad. So far we
> didn't manage to successfully get that fixed, though.

Ah, sorry. It's because of the first patch: it does not always "bump
up" the write chunk. In your case it is obviously decreased (the
original ratelimit_pages=4096 is a much larger value). I'll fix it.

> The way we call write_inode is extremely inefficient for XFS. As you
> noticed, XFS tends to redirty the inode on I/O completion, and we also
> cluster inode writeouts. For XFS we'd really prefer not to intermix
> data and inode writeout, but first do the data writeout and then later
> push out the inodes, preferably with as many inodes as possible swept
> out in one go.

I guess the difficult part would be the possible policy requirements
on the max batch size (the max number of inodes or pages to write
before switching to writing metadata) and on the delay time (between
the sync of data and metadata). It may take a long time to make a
full scan of the dirty list.

Thanks,
Fengguang
* Re: [PATCH 5/6] writeback: don't delay inodes redirtied by a fast dirtier

From: Christoph Hellwig @ 2009-09-26 19:47 UTC
To: Wu Fengguang
Cc: Andrew Morton, Jens Axboe, Jan Kara, Theodore Tso, Dave Chinner,
    Chris Mason, Christoph Hellwig, Peter Zijlstra, linux-fsdevel, LKML

On Wed, Sep 23, 2009 at 08:33:43PM +0800, Wu Fengguang wrote:
> So let's distinguish between data redirty and metadata-only redirty.
> The first one is caused by a busy dirtier, while the latter one could
> happen in XFS, NFS, etc. when they are doing delalloc or updating
> isize.

Btw, I'm not sure the existing and preserved behaviour for that case
is good. In the worst case the inode writeout gets delayed by another
30 seconds, doubling the window until a file is on disk if it was
extended.
* Re: [PATCH 5/6] writeback: don't delay inodes redirtied by a fast dirtier

From: Wu Fengguang @ 2009-09-27 2:02 UTC
To: Christoph Hellwig
Cc: Andrew Morton, Jens Axboe, Jan Kara, Theodore Tso, Dave Chinner,
    Chris Mason, Peter Zijlstra, linux-fsdevel@vger.kernel.org, LKML

On Sun, Sep 27, 2009 at 03:47:47AM +0800, Christoph Hellwig wrote:
> On Wed, Sep 23, 2009 at 08:33:43PM +0800, Wu Fengguang wrote:
> > So let's distinguish between data redirty and metadata-only redirty.
> > The first one is caused by a busy dirtier, while the latter one could
> > happen in XFS, NFS, etc. when they are doing delalloc or updating
> > isize.
>
> Btw, I'm not sure the existing and preserved behaviour for that case
> is good. In the worst case the inode writeout gets delayed by another
> 30 seconds, doubling the window until a file is on disk if it was
> extended.

Yes, the preserved behaviour is not optimal for XFS, but safe.

We could try redirty_tail only when there are no remaining dirty
pages and there is metadata-only dirtiness. Like this:

---
 fs/fs-writeback.c |   20 +++++++-------------
 1 file changed, 7 insertions(+), 13 deletions(-)

--- linux.orig/fs/fs-writeback.c	2009-09-27 09:52:15.000000000 +0800
+++ linux/fs/fs-writeback.c	2009-09-27 09:54:23.000000000 +0800
@@ -477,18 +477,7 @@ writeback_single_inode(struct inode *ino
 	spin_lock(&inode_lock);
 	inode->i_state &= ~I_SYNC;
 	if (!(inode->i_state & (I_FREEING | I_CLEAR))) {
-		if ((inode->i_state & I_DIRTY_PAGES) && wbc->for_kupdate) {
-			/*
-			 * More pages get dirtied by a fast dirtier.
-			 */
-			goto select_queue;
-		} else if (inode->i_state & I_DIRTY) {
-			/*
-			 * At least XFS will redirty the inode during the
-			 * writeback (delalloc) and on io completion (isize).
-			 */
-			redirty_tail(inode);
-		} else if (mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
+		if (mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
 			/*
 			 * We didn't write back all the pages.  nfs_writepages()
 			 * sometimes bales out without doing anything. Redirty
@@ -510,7 +499,6 @@ writeback_single_inode(struct inode *ino
 			 * soon as the queue becomes uncongested.
 			 */
 			inode->i_state |= I_DIRTY_PAGES;
-select_queue:
 			if (wbc->nr_to_write <= 0) {
 				/*
 				 * slice used up: queue for next turn
@@ -533,6 +521,12 @@ select_queue:
 				inode->i_state |= I_DIRTY_PAGES;
 				redirty_tail(inode);
 			}
+		} else if (inode->i_state & I_DIRTY) {
+			/*
+			 * At least XFS will redirty the inode during the
+			 * writeback (delalloc) and on io completion (isize).
+			 */
+			redirty_tail(inode);
 		} else if (atomic_read(&inode->i_count)) {
 			/*
 			 * The inode is clean, inuse