* Re: + mm-balance_dirty_pages-reduce-calls-to-global_page_state-to-reduce-cache-references.patch added to -mm tree
       [not found] <200908212250.n7LMox3g029154@imap1.linux-foundation.org>
@ 2009-08-22  2:51 ` Wu Fengguang
  2009-08-22 18:11   ` Peter Zijlstra
  2009-08-23  9:33   ` Richard Kennedy
  0 siblings, 2 replies; 20+ messages in thread
From: Wu Fengguang @ 2009-08-22  2:51 UTC (permalink / raw)
  To: akpm@linux-foundation.org
  Cc: mm-commits@vger.kernel.org, richard@rsk.demon.co.uk,
	a.p.zijlstra@chello.nl, chris.mason@oracle.com,
	jens.axboe@oracle.com, mbligh@mbligh.org, miklos@szeredi.hu,
	linux-mm, linux-fsdevel

On Sat, Aug 22, 2009 at 06:50:59AM +0800, Andrew Morton wrote:
> 
> The patch titled
>      mm: balance_dirty_pages(): reduce calls to global_page_state to reduce cache references
> has been added to the -mm tree.  Its filename is
>      mm-balance_dirty_pages-reduce-calls-to-global_page_state-to-reduce-cache-references.patch
> 
> Before you just go and hit "reply", please:
>    a) Consider who else should be cc'ed
>    b) Prefer to cc a suitable mailing list as well
>    c) Ideally: find the original patch on the mailing list and do a
>       reply-to-all to that, adding suitable additional cc's
> 
> *** Remember to use Documentation/SubmitChecklist when testing your code ***
> 
> See http://userweb.kernel.org/~akpm/stuff/added-to-mm.txt to find
> out what to do about this
> 
> The current -mm tree may be found at http://userweb.kernel.org/~akpm/mmotm/
> 
> ------------------------------------------------------
> Subject: mm: balance_dirty_pages(): reduce calls to global_page_state to reduce cache references
> From: Richard Kennedy <richard@rsk.demon.co.uk>
> 
> Reducing the number of times balance_dirty_pages calls global_page_state
> reduces the cache references and so improves write performance on a
> variety of workloads.
> 
> 'perf stat' of simple fio write tests shows the reduction in cache
> accesses.  The test is fio 'write,mmap,600Mb,pre_read' on an AMD AthlonX2
> with 3Gb memory (dirty_threshold approx 600 Mb), running each test 10
> times and taking the average & standard deviation:
> 
> 		average (s.d.) in millions (10^6)
> 2.6.31-rc6	661 (9.88)
> +patch		604 (4.19)
> 
> This reduction is achieved by dropping clip_bdi_dirty_limit, as it rereads
> the counters to apply the dirty_threshold, and moving this check up into
> balance_dirty_pages, where the counters have already been read.
> 
> Also, rearranging the for loop to contain only one copy of the limit tests
> allows the pdflush test after the loop to use the local copies of the
> counters rather than rereading them.
> 
> In the common case with no throttling it now calls global_page_state 5
> fewer times and bdi_stat 2 fewer times.
> 
> I have tried to retain the existing behavior as much as possible, but have
> added NR_WRITEBACK_TEMP to nr_writeback.  This counter was used in
> clip_bdi_dirty_limit but not in balance_dirty_pages; grep suggests it is
> only used by FUSE, but I haven't done any testing on that.  It does seem
> logical to count all the WRITEBACK pages when making the throttling
> decisions, so this change should be more correct ;)
> 
> I have been running this patch for over a week and have had no problems
> with it, and I generally see improved disk write performance on a variety
> of tests & workloads; even in the worst cases performance is the same as
> the unpatched kernel.  I also tried this on an Intel Atom 330 dual-core
> system and saw similar improvements.
> 
> Signed-off-by: Richard Kennedy <richard@rsk.demon.co.uk>
> Cc: Chris Mason <chris.mason@oracle.com>
> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> Cc: Jens Axboe <jens.axboe@oracle.com>
> Cc: Wu Fengguang <fengguang.wu@intel.com>
> Cc: Martin Bligh <mbligh@mbligh.org>
> Cc: Miklos Szeredi <miklos@szeredi.hu>
> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
> ---
> 
>  mm/page-writeback.c |  116 +++++++++++++++---------------------------
>  1 file changed, 43 insertions(+), 73 deletions(-)
> 
> diff -puN mm/page-writeback.c~mm-balance_dirty_pages-reduce-calls-to-global_page_state-to-reduce-cache-references mm/page-writeback.c
> --- a/mm/page-writeback.c~mm-balance_dirty_pages-reduce-calls-to-global_page_state-to-reduce-cache-references
> +++ a/mm/page-writeback.c
> @@ -249,32 +249,6 @@ static void bdi_writeout_fraction(struct
>  	}
>  }
>  
> -/*
> - * Clip the earned share of dirty pages to that which is actually available.
> - * This avoids exceeding the total dirty_limit when the floating averages
> - * fluctuate too quickly.
> - */
> -static void clip_bdi_dirty_limit(struct backing_dev_info *bdi,
> -		unsigned long dirty, unsigned long *pbdi_dirty)
> -{
> -	unsigned long avail_dirty;
> -
> -	avail_dirty = global_page_state(NR_FILE_DIRTY) +
> -		 global_page_state(NR_WRITEBACK) +
> -		 global_page_state(NR_UNSTABLE_NFS) +
> -		 global_page_state(NR_WRITEBACK_TEMP);
> -
> -	if (avail_dirty < dirty)
> -		avail_dirty = dirty - avail_dirty;
> -	else
> -		avail_dirty = 0;
> -
> -	avail_dirty += bdi_stat(bdi, BDI_RECLAIMABLE) +
> -		bdi_stat(bdi, BDI_WRITEBACK);
> -
> -	*pbdi_dirty = min(*pbdi_dirty, avail_dirty);
> -}
> -
>  static inline void task_dirties_fraction(struct task_struct *tsk,
>  		long *numerator, long *denominator)
>  {
> @@ -465,7 +439,6 @@ get_dirty_limits(unsigned long *pbackgro
>  			bdi_dirty = dirty * bdi->max_ratio / 100;
>  
>  		*pbdi_dirty = bdi_dirty;
> -		clip_bdi_dirty_limit(bdi, dirty, pbdi_dirty);
>  		task_dirty_limit(current, pbdi_dirty);
>  	}
>  }
> @@ -499,45 +472,12 @@ static void balance_dirty_pages(struct a
>  		};
>  
>  		get_dirty_limits(&background_thresh, &dirty_thresh,
> -				&bdi_thresh, bdi);
> +				 &bdi_thresh, bdi);
>  
>  		nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
> -					global_page_state(NR_UNSTABLE_NFS);
> -		nr_writeback = global_page_state(NR_WRITEBACK);
> -
> -		bdi_nr_reclaimable = bdi_stat(bdi, BDI_RECLAIMABLE);
> -		bdi_nr_writeback = bdi_stat(bdi, BDI_WRITEBACK);
> -
> -		if (bdi_nr_reclaimable + bdi_nr_writeback <= bdi_thresh)
> -			break;
> -
> -		/*
> -		 * Throttle it only when the background writeback cannot
> -		 * catch-up. This avoids (excessively) small writeouts
> -		 * when the bdi limits are ramping up.
> -		 */
> -		if (nr_reclaimable + nr_writeback <
> -				(background_thresh + dirty_thresh) / 2)
> -			break;
> -
> -		if (!bdi->dirty_exceeded)
> -			bdi->dirty_exceeded = 1;
> -
> -		/* Note: nr_reclaimable denotes nr_dirty + nr_unstable.
> -		 * Unstable writes are a feature of certain networked
> -		 * filesystems (i.e. NFS) in which data may have been
> -		 * written to the server's write cache, but has not yet
> -		 * been flushed to permanent storage.
> -		 * Only move pages to writeback if this bdi is over its
> -		 * threshold otherwise wait until the disk writes catch
> -		 * up.
> -		 */
> -		if (bdi_nr_reclaimable > bdi_thresh) {
> -			generic_sync_bdi_inodes(NULL, &wbc);
> -			pages_written += write_chunk - wbc.nr_to_write;
> -			get_dirty_limits(&background_thresh, &dirty_thresh,
> -				       &bdi_thresh, bdi);
> -		}
> +			global_page_state(NR_UNSTABLE_NFS);
> +		nr_writeback = global_page_state(NR_WRITEBACK) +
> +			global_page_state(NR_WRITEBACK_TEMP);
>  
>  		/*
>  		 * In order to avoid the stacked BDI deadlock we need
> @@ -557,16 +497,48 @@ static void balance_dirty_pages(struct a
>  			bdi_nr_writeback = bdi_stat(bdi, BDI_WRITEBACK);
>  		}
>  
> -		if (bdi_nr_reclaimable + bdi_nr_writeback <= bdi_thresh)
> -			break;
> -		if (pages_written >= write_chunk)
> -			break;		/* We've done our duty */

> +		/* always throttle if over threshold */
> +		if (nr_reclaimable + nr_writeback < dirty_thresh) {

That 'if' is a big behavior change. It effectively blocks everyone
and cancels Peter's proportional throttling work: the less a process
dirtied, the less it should be throttled.

I'd propose to remove the above 'if' and liberate the following three 'if's.
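
That is, something like this (a sketch only, untested; all the names
and comments below are lifted from the patch above):

		if (bdi_nr_reclaimable + bdi_nr_writeback <= bdi_thresh)
			break;

		/*
		 * Throttle it only when the background writeback cannot
		 * catch-up. This avoids (excessively) small writeouts
		 * when the bdi limits are ramping up.
		 */
		if (nr_reclaimable + nr_writeback <
		    (background_thresh + dirty_thresh) / 2)
			break;

		/* done enough? */
		if (pages_written >= write_chunk)
			break;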

> +
> +			if (bdi_nr_reclaimable + bdi_nr_writeback <= bdi_thresh)
> +				break;
> +
> +			/*
> +			 * Throttle it only when the background writeback cannot
> +			 * catch-up. This avoids (excessively) small writeouts
> +			 * when the bdi limits are ramping up.
> +			 */
> +			if (nr_reclaimable + nr_writeback <
> +			    (background_thresh + dirty_thresh) / 2)
> +				break;
> +
> +			/* done enough? */
> +			if (pages_written >= write_chunk)
> +				break;
> +		}
> +		if (!bdi->dirty_exceeded)
> +			bdi->dirty_exceeded = 1;
>  
> +		/* Note: nr_reclaimable denotes nr_dirty + nr_unstable.
> +		 * Unstable writes are a feature of certain networked
> +		 * filesystems (i.e. NFS) in which data may have been
> +		 * written to the server's write cache, but has not yet
> +		 * been flushed to permanent storage.
> +		 * Only move pages to writeback if this bdi is over its
> +		 * threshold otherwise wait until the disk writes catch
> +		 * up.
> +		 */
> +		if (bdi_nr_reclaimable > bdi_thresh) {
> +			writeback_inodes(&wbc);
> +			pages_written += write_chunk - wbc.nr_to_write;

> +			if (wbc.nr_to_write == 0)
> +				continue;

What's the purpose of the above 2 lines?

Thanks,
Fengguang

> +		}
>  		congestion_wait(BLK_RW_ASYNC, HZ/10);
>  	}
>  
>  	if (bdi_nr_reclaimable + bdi_nr_writeback < bdi_thresh &&
> -			bdi->dirty_exceeded)
> +	    bdi->dirty_exceeded)
>  		bdi->dirty_exceeded = 0;
>  
>  	if (writeback_in_progress(bdi))
> @@ -580,10 +552,8 @@ static void balance_dirty_pages(struct a
>  	 * In normal mode, we start background writeout at the lower
>  	 * background_thresh, to keep the amount of dirty memory low.
>  	 */
> -	if ((laptop_mode && pages_written) ||
> -			(!laptop_mode && (global_page_state(NR_FILE_DIRTY)
> -					  + global_page_state(NR_UNSTABLE_NFS)
> -					  > background_thresh)))
> +	if ((laptop_mode && pages_written) || (!laptop_mode &&
> +	     (nr_reclaimable > background_thresh)))
>  		bdi_start_writeback(bdi, NULL, 0, WB_SYNC_NONE);
>  }
>  
> _
> 
> Patches currently in -mm which might be from richard@rsk.demon.co.uk are
> 
> mm-balance_dirty_pages-reduce-calls-to-global_page_state-to-reduce-cache-references.patch


* Re: + mm-balance_dirty_pages-reduce-calls-to-global_page_state-to-reduce-cache-references.patch added to -mm tree
  2009-08-22  2:51 ` + mm-balance_dirty_pages-reduce-calls-to-global_page_state-to-reduce-cache-references.patch added to -mm tree Wu Fengguang
@ 2009-08-22 18:11   ` Peter Zijlstra
  2009-08-23  1:32     ` Wu Fengguang
  2009-09-02  8:31     ` Peter Zijlstra
  2009-08-23  9:33   ` Richard Kennedy
  1 sibling, 2 replies; 20+ messages in thread
From: Peter Zijlstra @ 2009-08-22 18:11 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: akpm@linux-foundation.org, mm-commits@vger.kernel.org,
	richard@rsk.demon.co.uk, chris.mason@oracle.com,
	jens.axboe@oracle.com, mbligh@mbligh.org, miklos@szeredi.hu,
	linux-fsdevel

On Sat, 2009-08-22 at 10:51 +0800, Wu Fengguang wrote:

> > +++ a/mm/page-writeback.c

> > @@ -465,7 +439,6 @@ get_dirty_limits(unsigned long *pbackgro
> >  			bdi_dirty = dirty * bdi->max_ratio / 100;
> >  
> >  		*pbdi_dirty = bdi_dirty;
> >  		task_dirty_limit(current, pbdi_dirty);
> >  	}
> >  }
> > @@ -499,45 +472,12 @@ static void balance_dirty_pages(struct a
> >  		};
> >  
> >  		get_dirty_limits(&background_thresh, &dirty_thresh,
> > +				 &bdi_thresh, bdi);
> >  
> >  		nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
> > +			global_page_state(NR_UNSTABLE_NFS);
> > +		nr_writeback = global_page_state(NR_WRITEBACK) +
> > +			global_page_state(NR_WRITEBACK_TEMP);
> >  
> >  		/*
> >  		 * In order to avoid the stacked BDI deadlock we need
> > @@ -557,16 +497,48 @@ static void balance_dirty_pages(struct a
> >  			bdi_nr_writeback = bdi_stat(bdi, BDI_WRITEBACK);
> >  		}
> >  
> 
> > +		/* always throttle if over threshold */
> > +		if (nr_reclaimable + nr_writeback < dirty_thresh) {
> 
> That 'if' is a big behavior change. It effectively blocks everyone
> and cancels Peter's proportional throttling work: the less a process
> dirtied, the less it should be throttled.

Hmm, I think you're right, I had not considered that, thanks for
catching that.

> I'd propose to remove the above 'if' and liberate the following three 'if's.

That might work, but it loses the total dirty_thresh constraint. The
sum of per-bdi dirties _should_ not be larger than that, but I'm not
sure it won't ever be.

The clip code Richard removed ensured that, and I think I wrote that out
of more than sheer paranoia, but I'm not sure anymore :/

I'll go over the numeric stuff again to see where it could go wrong.

If we can bound the error (I'm suspecting it was some numerical error
bound) we should be good and can indeed do this.

> > +
> > +			if (bdi_nr_reclaimable + bdi_nr_writeback <= bdi_thresh)
> > +				break;
> > +
> > +			/*
> > +			 * Throttle it only when the background writeback cannot
> > +			 * catch-up. This avoids (excessively) small writeouts
> > +			 * when the bdi limits are ramping up.
> > +			 */
> > +			if (nr_reclaimable + nr_writeback <
> > +			    (background_thresh + dirty_thresh) / 2)
> > +				break;
> > +
> > +			/* done enough? */
> > +			if (pages_written >= write_chunk)
> > +				break;
> > +		}
> > +		if (!bdi->dirty_exceeded)
> > +			bdi->dirty_exceeded = 1;
> >  
> > +		/* Note: nr_reclaimable denotes nr_dirty + nr_unstable.
> > +		 * Unstable writes are a feature of certain networked
> > +		 * filesystems (i.e. NFS) in which data may have been
> > +		 * written to the server's write cache, but has not yet
> > +		 * been flushed to permanent storage.
> > +		 * Only move pages to writeback if this bdi is over its
> > +		 * threshold otherwise wait until the disk writes catch
> > +		 * up.
> > +		 */
> > +		if (bdi_nr_reclaimable > bdi_thresh) {
> > +			writeback_inodes(&wbc);
> > +			pages_written += write_chunk - wbc.nr_to_write;
> 
> > +			if (wbc.nr_to_write == 0)
> > +				continue;
> 
> What's the purpose of the above 2 lines?

I think I should slow down, I seem to have totally missed these two
lines when I read the patch :/

> > +		}
> >  		congestion_wait(BLK_RW_ASYNC, HZ/10);
> >  	}
> >  
> >  	if (bdi_nr_reclaimable + bdi_nr_writeback < bdi_thresh &&
> > +	    bdi->dirty_exceeded)
> >  		bdi->dirty_exceeded = 0;
> >  
> >  	if (writeback_in_progress(bdi))
> > @@ -580,10 +552,8 @@ static void balance_dirty_pages(struct a
> >  	 * In normal mode, we start background writeout at the lower
> >  	 * background_thresh, to keep the amount of dirty memory low.
> >  	 */
> > +	if ((laptop_mode && pages_written) || (!laptop_mode &&
> > +	     (nr_reclaimable > background_thresh)))
> >  		bdi_start_writeback(bdi, NULL, 0, WB_SYNC_NONE);
> >  }




* Re: + mm-balance_dirty_pages-reduce-calls-to-global_page_state-to-reduce-cache-references.patch added to -mm tree
  2009-08-22 18:11   ` Peter Zijlstra
@ 2009-08-23  1:32     ` Wu Fengguang
  2009-08-23  5:31       ` Peter Zijlstra
  2009-09-02  8:31     ` Peter Zijlstra
  1 sibling, 1 reply; 20+ messages in thread
From: Wu Fengguang @ 2009-08-23  1:32 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: akpm@linux-foundation.org, mm-commits@vger.kernel.org,
	richard@rsk.demon.co.uk, chris.mason@oracle.com,
	jens.axboe@oracle.com, mbligh@mbligh.org, miklos@szeredi.hu,
	linux-fsdevel@vger.kernel.org

On Sun, Aug 23, 2009 at 02:11:41AM +0800, Peter Zijlstra wrote:
> On Sat, 2009-08-22 at 10:51 +0800, Wu Fengguang wrote:
> 
> > > +++ a/mm/page-writeback.c
> 
> > > @@ -465,7 +439,6 @@ get_dirty_limits(unsigned long *pbackgro
> > >  			bdi_dirty = dirty * bdi->max_ratio / 100;
> > >  
> > >  		*pbdi_dirty = bdi_dirty;
> > >  		task_dirty_limit(current, pbdi_dirty);
> > >  	}
> > >  }
> > > @@ -499,45 +472,12 @@ static void balance_dirty_pages(struct a
> > >  		};
> > >  
> > >  		get_dirty_limits(&background_thresh, &dirty_thresh,
> > > +				 &bdi_thresh, bdi);
> > >  
> > >  		nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
> > > +			global_page_state(NR_UNSTABLE_NFS);
> > > +		nr_writeback = global_page_state(NR_WRITEBACK) +
> > > +			global_page_state(NR_WRITEBACK_TEMP);
> > >  
> > >  		/*
> > >  		 * In order to avoid the stacked BDI deadlock we need
> > > @@ -557,16 +497,48 @@ static void balance_dirty_pages(struct a
> > >  			bdi_nr_writeback = bdi_stat(bdi, BDI_WRITEBACK);
> > >  		}
> > >  
> > 
> > > +		/* always throttle if over threshold */
> > > +		if (nr_reclaimable + nr_writeback < dirty_thresh) {
> > 
> > That 'if' is a big behavior change. It effectively blocks everyone
> > and cancels Peter's proportional throttling work: the less a process
> > dirtied, the less it should be throttled.
> 
> Hmm, I think you're right, I had not considered that, thanks for
> catching that.

Thank you :)

> > I'd propose to remove the above 'if' and liberate the following three 'if's.
> 
> That might work, but it loses the total dirty_thresh constraint. The
> sum of per-bdi dirties _should_ not be larger than that, but I'm not
> sure it won't ever be.
> 
> The clip code Richard removed ensured that, and I think I wrote that out
> of more than sheer paranoia, but I'm not sure anymore :/

Oh I assumed that your per-bdi throttling is not too permissive to
exceed the global dirty_thresh. In theory the per-bdi throttling
should be able to quickly stop the growth of (nr_reclaimable +
nr_writeback).  Once dirty_thresh is reached we already lose. 

> I'll go over the numeric stuff again to see where it could go wrong.
> 
> If we can bound the error (I'm suspecting it was some numerical error
> bound) we should be good and can indeed do this.

Yes, that error bound should be smaller than (dirty_thresh -
background_thresh) / 2, unless the user sets the two thresholds
insanely close (for that we may add some sanity checks in
dirty_bytes_handler(), dirty_background_bytes_handler() etc.).
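
For example (a purely illustrative sketch; the clamp value is just an
example, not what the real sysctl handlers do today):

	if (dirty_background_ratio > vm_dirty_ratio / 2)
		dirty_background_ratio = vm_dirty_ratio / 2;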

Anyway we may do something like this for now?

                if (nr_reclaimable + nr_writeback > dirty_thresh) {
                        WARN_ON_ONCE(1);
                        /* keep blocking the dirtier, i.e. fall
                         * through to congestion_wait() and loop */
                }

Thanks,
Fengguang

> > > +
> > > +			if (bdi_nr_reclaimable + bdi_nr_writeback <= bdi_thresh)
> > > +				break;
> > > +
> > > +			/*
> > > +			 * Throttle it only when the background writeback cannot
> > > +			 * catch-up. This avoids (excessively) small writeouts
> > > +			 * when the bdi limits are ramping up.
> > > +			 */
> > > +			if (nr_reclaimable + nr_writeback <
> > > +			    (background_thresh + dirty_thresh) / 2)
> > > +				break;
> > > +
> > > +			/* done enough? */
> > > +			if (pages_written >= write_chunk)
> > > +				break;
> > > +		}
> > > +		if (!bdi->dirty_exceeded)
> > > +			bdi->dirty_exceeded = 1;
> > >  
> > > +		/* Note: nr_reclaimable denotes nr_dirty + nr_unstable.
> > > +		 * Unstable writes are a feature of certain networked
> > > +		 * filesystems (i.e. NFS) in which data may have been
> > > +		 * written to the server's write cache, but has not yet
> > > +		 * been flushed to permanent storage.
> > > +		 * Only move pages to writeback if this bdi is over its
> > > +		 * threshold otherwise wait until the disk writes catch
> > > +		 * up.
> > > +		 */
> > > +		if (bdi_nr_reclaimable > bdi_thresh) {
> > > +			writeback_inodes(&wbc);
> > > +			pages_written += write_chunk - wbc.nr_to_write;
> > 
> > > +			if (wbc.nr_to_write == 0)
> > > +				continue;
> > 
> > What's the purpose of the above 2 lines?
> 
> I think I should slow down, I seem to have totally missed these two
> lines when I read the patch :/
> 
> > > +		}
> > >  		congestion_wait(BLK_RW_ASYNC, HZ/10);
> > >  	}
> > >  
> > >  	if (bdi_nr_reclaimable + bdi_nr_writeback < bdi_thresh &&
> > > +	    bdi->dirty_exceeded)
> > >  		bdi->dirty_exceeded = 0;
> > >  
> > >  	if (writeback_in_progress(bdi))
> > > @@ -580,10 +552,8 @@ static void balance_dirty_pages(struct a
> > >  	 * In normal mode, we start background writeout at the lower
> > >  	 * background_thresh, to keep the amount of dirty memory low.
> > >  	 */
> > > +	if ((laptop_mode && pages_written) || (!laptop_mode &&
> > > +	     (nr_reclaimable > background_thresh)))
> > >  		bdi_start_writeback(bdi, NULL, 0, WB_SYNC_NONE);
> > >  }
> 


* Re: + mm-balance_dirty_pages-reduce-calls-to-global_page_state-to-reduce-cache-references.patch added to -mm tree
  2009-08-23  1:32     ` Wu Fengguang
@ 2009-08-23  5:31       ` Peter Zijlstra
  2009-08-23  7:27         ` Wu Fengguang
  0 siblings, 1 reply; 20+ messages in thread
From: Peter Zijlstra @ 2009-08-23  5:31 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: akpm@linux-foundation.org, mm-commits@vger.kernel.org,
	richard@rsk.demon.co.uk, chris.mason@oracle.com,
	jens.axboe@oracle.com, mbligh@mbligh.org, miklos@szeredi.hu,
	linux-fsdevel@vger.kernel.org

On Sun, 2009-08-23 at 09:32 +0800, Wu Fengguang wrote:
> > > I'd propose to remove the above 'if' and liberate the following three 'if's.
> > 
> > That might work, but it loses the total dirty_thresh constraint. The
> > sum of per-bdi dirties _should_ not be larger than that, but I'm not
> > sure it won't ever be.
> > 
> > The clip code Richard removed ensured that, and I think I wrote that out
> > of more than sheer paranoia, but I'm not sure anymore :/
> 
> Oh I assumed that your per-bdi throttling is not too permissive to
> exceed the global dirty_thresh. In theory the per-bdi throttling
> should be able to quickly stop the growth of (nr_reclaimable +
> nr_writeback).  Once dirty_thresh is reached we already lose.

Right, so:

 bdi_thresh_n = dirty_thresh * p(n) + eps.

and

 \Sum_n p(n) = 1

So:

 \Sum_n bdi_thresh_n = dirty_thresh + n*eps

Which yields an O(n) error bound.

I'm just not sure how large the thing is in reality, and paranoia won
out.
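
(For reference, p(n) here is the per-bdi writeout fraction that
get_dirty_limits() applies when computing bdi_thresh; roughly, abridged
from the 2.6.31 code:

	bdi_writeout_fraction(bdi, &numerator, &denominator);

	bdi_dirty = (dirty * (100 - bdi_min_ratio)) / 100;
	bdi_dirty *= numerator;
	do_div(bdi_dirty, denominator);
	bdi_dirty += (dirty * bdi->min_ratio) / 100;
	if (bdi_dirty > (dirty * bdi->max_ratio) / 100)
		bdi_dirty = dirty * bdi->max_ratio / 100;

so eps presumably collects the estimation error of the floating
numerator/denominator plus the integer truncation above.)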



* Re: + mm-balance_dirty_pages-reduce-calls-to-global_page_state-to-reduce-cache-references.patch added to -mm tree
  2009-08-23  5:31       ` Peter Zijlstra
@ 2009-08-23  7:27         ` Wu Fengguang
  2009-08-23  7:45           ` Peter Zijlstra
  0 siblings, 1 reply; 20+ messages in thread
From: Wu Fengguang @ 2009-08-23  7:27 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: akpm@linux-foundation.org, mm-commits@vger.kernel.org,
	richard@rsk.demon.co.uk, chris.mason@oracle.com,
	jens.axboe@oracle.com, mbligh@mbligh.org, miklos@szeredi.hu,
	linux-fsdevel@vger.kernel.org

On Sun, Aug 23, 2009 at 01:31:17PM +0800, Peter Zijlstra wrote:
> On Sun, 2009-08-23 at 09:32 +0800, Wu Fengguang wrote:
> > > > I'd propose to remove the above 'if' and liberate the following three 'if's.
> > > 
> > > That might work, but it loses the total dirty_thresh constraint. The
> > > sum of per-bdi dirties _should_ not be larger than that, but I'm not
> > > sure it won't ever be.
> > > 
> > > The clip code Richard removed ensured that, and I think I wrote that out
> > > of more than sheer paranoia, but I'm not sure anymore :/
> > 
> > Oh I assumed that your per-bdi throttling is not too permissive to
> > exceed the global dirty_thresh. In theory the per-bdi throttling
> > should be able to quickly stop the growth of (nr_reclaimable +
> > nr_writeback).  Once dirty_thresh is reached we already lose.
> 
> Right, so:
> 
>  bdi_thresh_n = dirty_thresh * p(n) + eps.
> 
> and
> 
>  \Sum_n p(n) = 1
> 
> So:
> 
>  \Sum_n bdi_thresh_n = dirty_thresh + n*eps
> 
> Which yields an O(n) error bound.
> 
> I'm just not sure how large the thing is in reality, and paranoia won
> out.

A wild question: is it possible to make the equation:

        bdi_thresh_n = dirty_thresh * p(n) - eps.

?


* Re: + mm-balance_dirty_pages-reduce-calls-to-global_page_state-to-reduce-cache-references.patch added to -mm tree
  2009-08-23  7:27         ` Wu Fengguang
@ 2009-08-23  7:45           ` Peter Zijlstra
  0 siblings, 0 replies; 20+ messages in thread
From: Peter Zijlstra @ 2009-08-23  7:45 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: akpm@linux-foundation.org, mm-commits@vger.kernel.org,
	richard@rsk.demon.co.uk, chris.mason@oracle.com,
	jens.axboe@oracle.com, mbligh@mbligh.org, miklos@szeredi.hu,
	linux-fsdevel@vger.kernel.org

On Sun, 2009-08-23 at 15:27 +0800, Wu Fengguang wrote:
> On Sun, Aug 23, 2009 at 01:31:17PM +0800, Peter Zijlstra wrote:
> > On Sun, 2009-08-23 at 09:32 +0800, Wu Fengguang wrote:
> > > > > I'd propose to remove the above 'if' and liberate the following three 'if's.
> > > > 
> > > > That might work, but it loses the total dirty_thresh constraint. The
> > > > sum of per-bdi dirties _should_ not be larger than that, but I'm not
> > > > sure it won't ever be.
> > > > 
> > > > The clip code Richard removed ensured that, and I think I wrote that out
> > > > of more than sheer paranoia, but I'm not sure anymore :/
> > > 
> > > Oh I assumed that your per-bdi throttling is not too permissive to
> > > exceed the global dirty_thresh. In theory the per-bdi throttling
> > > should be able to quickly stop the growth of (nr_reclaimable +
> > > nr_writeback).  Once dirty_thresh is reached we already lose.
> > 
> > Right, so:
> > 
> >  bdi_thresh_n = dirty_thresh * p(n) + eps.
> > 
> > and
> > 
> >  \Sum_n p(n) = 1
> > 
> > So:
> > 
> >  \Sum_n bdi_thresh_n = dirty_thresh + n*eps
> > 
> > Which yields an O(n) error bound.
> > 
> > I'm just not sure how large the thing is in reality, and paranoia won
> > out.
> 
> A wild question: is it possible to make the equation:
> 
>         bdi_thresh_n = dirty_thresh * p(n) - eps.

Possibly, I'd have to go over the math to get a proper equation for eps,
then see what needs to be done to get avg(eps) <= 0.





* Re: + mm-balance_dirty_pages-reduce-calls-to-global_page_state-to-reduce-cache-references.patch added to -mm tree
  2009-08-22  2:51 ` + mm-balance_dirty_pages-reduce-calls-to-global_page_state-to-reduce-cache-references.patch added to -mm tree Wu Fengguang
  2009-08-22 18:11   ` Peter Zijlstra
@ 2009-08-23  9:33   ` Richard Kennedy
  2009-08-23 13:00     ` Wu Fengguang
  1 sibling, 1 reply; 20+ messages in thread
From: Richard Kennedy @ 2009-08-23  9:33 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: akpm@linux-foundation.org, mm-commits@vger.kernel.org,
	a.p.zijlstra@chello.nl, chris.mason@oracle.com,
	jens.axboe@oracle.com, mbligh@mbligh.org, miklos@szeredi.hu,
	linux-mm, linux-fsdevel

On Sat, 2009-08-22 at 10:51 +0800, Wu Fengguang wrote:

> > 
> >  mm/page-writeback.c |  116 +++++++++++++++---------------------------
> >  1 file changed, 43 insertions(+), 73 deletions(-)
> > 
> > diff -puN mm/page-writeback.c~mm-balance_dirty_pages-reduce-calls-to-global_page_state-to-reduce-cache-references mm/page-writeback.c
> > --- a/mm/page-writeback.c~mm-balance_dirty_pages-reduce-calls-to-global_page_state-to-reduce-cache-references
> > +++ a/mm/page-writeback.c
> > @@ -249,32 +249,6 @@ static void bdi_writeout_fraction(struct
> >  	}
> >  }
> >  
> > -/*
> > - * Clip the earned share of dirty pages to that which is actually available.
> > - * This avoids exceeding the total dirty_limit when the floating averages
> > - * fluctuate too quickly.
> > - */
> > -static void clip_bdi_dirty_limit(struct backing_dev_info *bdi,
> > -		unsigned long dirty, unsigned long *pbdi_dirty)
> > -{
> > -	unsigned long avail_dirty;
> > -
> > -	avail_dirty = global_page_state(NR_FILE_DIRTY) +
> > -		 global_page_state(NR_WRITEBACK) +
> > -		 global_page_state(NR_UNSTABLE_NFS) +
> > -		 global_page_state(NR_WRITEBACK_TEMP);
> > -
> > -	if (avail_dirty < dirty)
> > -		avail_dirty = dirty - avail_dirty;
> > -	else
> > -		avail_dirty = 0;
> > -
> > -	avail_dirty += bdi_stat(bdi, BDI_RECLAIMABLE) +
> > -		bdi_stat(bdi, BDI_WRITEBACK);
> > -
> > -	*pbdi_dirty = min(*pbdi_dirty, avail_dirty);
> > -}
> > -
> >  static inline void task_dirties_fraction(struct task_struct *tsk,
> >  		long *numerator, long *denominator)
> >  {
> > @@ -465,7 +439,6 @@ get_dirty_limits(unsigned long *pbackgro
> >  			bdi_dirty = dirty * bdi->max_ratio / 100;
> >  
> >  		*pbdi_dirty = bdi_dirty;
> > -		clip_bdi_dirty_limit(bdi, dirty, pbdi_dirty);
> >  		task_dirty_limit(current, pbdi_dirty);
> >  	}
> >  }
> > @@ -499,45 +472,12 @@ static void balance_dirty_pages(struct a
> >  		};
> >  
> >  		get_dirty_limits(&background_thresh, &dirty_thresh,
> > -				&bdi_thresh, bdi);
> > +				 &bdi_thresh, bdi);
> >  
> >  		nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
> > -					global_page_state(NR_UNSTABLE_NFS);
> > -		nr_writeback = global_page_state(NR_WRITEBACK);
> > -
> > -		bdi_nr_reclaimable = bdi_stat(bdi, BDI_RECLAIMABLE);
> > -		bdi_nr_writeback = bdi_stat(bdi, BDI_WRITEBACK);
> > -
> > -		if (bdi_nr_reclaimable + bdi_nr_writeback <= bdi_thresh)
> > -			break;
> > -
> > -		/*
> > -		 * Throttle it only when the background writeback cannot
> > -		 * catch-up. This avoids (excessively) small writeouts
> > -		 * when the bdi limits are ramping up.
> > -		 */
> > -		if (nr_reclaimable + nr_writeback <
> > -				(background_thresh + dirty_thresh) / 2)
> > -			break;
> > -
> > -		if (!bdi->dirty_exceeded)
> > -			bdi->dirty_exceeded = 1;
> > -
> > -		/* Note: nr_reclaimable denotes nr_dirty + nr_unstable.
> > -		 * Unstable writes are a feature of certain networked
> > -		 * filesystems (i.e. NFS) in which data may have been
> > -		 * written to the server's write cache, but has not yet
> > -		 * been flushed to permanent storage.
> > -		 * Only move pages to writeback if this bdi is over its
> > -		 * threshold otherwise wait until the disk writes catch
> > -		 * up.
> > -		 */
> > -		if (bdi_nr_reclaimable > bdi_thresh) {
> > -			generic_sync_bdi_inodes(NULL, &wbc);
> > -			pages_written += write_chunk - wbc.nr_to_write;
> > -			get_dirty_limits(&background_thresh, &dirty_thresh,
> > -				       &bdi_thresh, bdi);
> > -		}
> > +			global_page_state(NR_UNSTABLE_NFS);
> > +		nr_writeback = global_page_state(NR_WRITEBACK) +
> > +			global_page_state(NR_WRITEBACK_TEMP);
> >  
> >  		/*
> >  		 * In order to avoid the stacked BDI deadlock we need
> > @@ -557,16 +497,48 @@ static void balance_dirty_pages(struct a
> >  			bdi_nr_writeback = bdi_stat(bdi, BDI_WRITEBACK);
> >  		}
> >  
> > -		if (bdi_nr_reclaimable + bdi_nr_writeback <= bdi_thresh)
> > -			break;
> > -		if (pages_written >= write_chunk)
> > -			break;		/* We've done our duty */
> 
> > +		/* always throttle if over threshold */
> > +		if (nr_reclaimable + nr_writeback < dirty_thresh) {
> 
> That 'if' is a big behavior change. It effectively blocks everyone
> and cancels Peter's proportional throttling work: the less a process
> dirtied, the less it should be throttled.
> 
I don't think it does. The code ends up looking like:

FOR
	IF less than dirty_thresh THEN
		check bdi limits etc
	ENDIF

	throttle
ENDFOR

Therefore we always throttle when over the threshold; otherwise we apply
the per-bdi limits to decide whether to throttle.

In the existing code clip_bdi_dirty_limit modified the bdi_thresh so
that it would not let a bdi dirty enough pages to go over the
dirty_threshold. All I've done is to bring the check of dirty_thresh up
into balance_dirty_pages.

So isn't this effectively the same ?


> I'd propose to remove the above 'if' and liberate the following three 'if's.
> 
> > +
> > +			if (bdi_nr_reclaimable + bdi_nr_writeback <= bdi_thresh)
> > +				break;
> > +
> > +			/*
> > +			 * Throttle it only when the background writeback cannot
> > +			 * catch-up. This avoids (excessively) small writeouts
> > +			 * when the bdi limits are ramping up.
> > +			 */
> > +			if (nr_reclaimable + nr_writeback <
> > +			    (background_thresh + dirty_thresh) / 2)
> > +				break;
> > +
> > +			/* done enough? */
> > +			if (pages_written >= write_chunk)
> > +				break;
> > +		}
> > +		if (!bdi->dirty_exceeded)
> > +			bdi->dirty_exceeded = 1;
> >  
> > +		/* Note: nr_reclaimable denotes nr_dirty + nr_unstable.
> > +		 * Unstable writes are a feature of certain networked
> > +		 * filesystems (i.e. NFS) in which data may have been
> > +		 * written to the server's write cache, but has not yet
> > +		 * been flushed to permanent storage.
> > +		 * Only move pages to writeback if this bdi is over its
> > +		 * threshold otherwise wait until the disk writes catch
> > +		 * up.
> > +		 */
> > +		if (bdi_nr_reclaimable > bdi_thresh) {
> > +			writeback_inodes(&wbc);
> > +			pages_written += write_chunk - wbc.nr_to_write;
> 
> > +			if (wbc.nr_to_write == 0)
> > +				continue;
> 
> What's the purpose of the above 2 lines?

This is to try to replicate the existing code as closely as possible.

If writeback_inodes wrote write_chunk pages in one pass, then skip to the
top of the loop to recheck the limits and decide whether we can let the
application continue. Otherwise it's not making enough forward progress
due to congestion, so do the congestion_wait & loop.


> Thanks,
> Fengguang
> 
regards 
Richard




* Re: + mm-balance_dirty_pages-reduce-calls-to-global_page_state-to-reduce-cache-references.patch added to -mm tree
  2009-08-23  9:33   ` Richard Kennedy
@ 2009-08-23 13:00     ` Wu Fengguang
  2009-08-23 13:46       ` Richard Kennedy
  0 siblings, 1 reply; 20+ messages in thread
From: Wu Fengguang @ 2009-08-23 13:00 UTC (permalink / raw)
  To: Richard Kennedy
  Cc: akpm@linux-foundation.org, mm-commits@vger.kernel.org,
	a.p.zijlstra@chello.nl, chris.mason@oracle.com,
	jens.axboe@oracle.com, mbligh@mbligh.org, miklos@szeredi.hu,
	linux-mm@kvack.org, linux-fsdevel@vger.kernel.org

On Sun, Aug 23, 2009 at 05:33:33PM +0800, Richard Kennedy wrote:
> On Sat, 2009-08-22 at 10:51 +0800, Wu Fengguang wrote:
> 
> > > 
> > >  mm/page-writeback.c |  116 +++++++++++++++---------------------------
> > >  1 file changed, 43 insertions(+), 73 deletions(-)
> > > 
> > > diff -puN mm/page-writeback.c~mm-balance_dirty_pages-reduce-calls-to-global_page_state-to-reduce-cache-references mm/page-writeback.c
> > > --- a/mm/page-writeback.c~mm-balance_dirty_pages-reduce-calls-to-global_page_state-to-reduce-cache-references
> > > +++ a/mm/page-writeback.c
> > > @@ -249,32 +249,6 @@ static void bdi_writeout_fraction(struct
> > >  	}
> > >  }
> > >  
> > > -/*
> > > - * Clip the earned share of dirty pages to that which is actually available.
> > > - * This avoids exceeding the total dirty_limit when the floating averages
> > > - * fluctuate too quickly.
> > > - */
> > > -static void clip_bdi_dirty_limit(struct backing_dev_info *bdi,
> > > -		unsigned long dirty, unsigned long *pbdi_dirty)
> > > -{
> > > -	unsigned long avail_dirty;
> > > -
> > > -	avail_dirty = global_page_state(NR_FILE_DIRTY) +
> > > -		 global_page_state(NR_WRITEBACK) +
> > > -		 global_page_state(NR_UNSTABLE_NFS) +
> > > -		 global_page_state(NR_WRITEBACK_TEMP);
> > > -
> > > -	if (avail_dirty < dirty)
> > > -		avail_dirty = dirty - avail_dirty;
> > > -	else
> > > -		avail_dirty = 0;
> > > -
> > > -	avail_dirty += bdi_stat(bdi, BDI_RECLAIMABLE) +
> > > -		bdi_stat(bdi, BDI_WRITEBACK);
> > > -
> > > -	*pbdi_dirty = min(*pbdi_dirty, avail_dirty);
> > > -}
> > > -
> > >  static inline void task_dirties_fraction(struct task_struct *tsk,
> > >  		long *numerator, long *denominator)
> > >  {
> > > @@ -465,7 +439,6 @@ get_dirty_limits(unsigned long *pbackgro
> > >  			bdi_dirty = dirty * bdi->max_ratio / 100;
> > >  
> > >  		*pbdi_dirty = bdi_dirty;
> > > -		clip_bdi_dirty_limit(bdi, dirty, pbdi_dirty);
> > >  		task_dirty_limit(current, pbdi_dirty);
> > >  	}
> > >  }
> > > @@ -499,45 +472,12 @@ static void balance_dirty_pages(struct a
> > >  		};
> > >  
> > >  		get_dirty_limits(&background_thresh, &dirty_thresh,
> > > -				&bdi_thresh, bdi);
> > > +				 &bdi_thresh, bdi);
> > >  
> > >  		nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
> > > -					global_page_state(NR_UNSTABLE_NFS);
> > > -		nr_writeback = global_page_state(NR_WRITEBACK);
> > > -
> > > -		bdi_nr_reclaimable = bdi_stat(bdi, BDI_RECLAIMABLE);
> > > -		bdi_nr_writeback = bdi_stat(bdi, BDI_WRITEBACK);
> > > -
> > > -		if (bdi_nr_reclaimable + bdi_nr_writeback <= bdi_thresh)
> > > -			break;
> > > -
> > > -		/*
> > > -		 * Throttle it only when the background writeback cannot
> > > -		 * catch-up. This avoids (excessively) small writeouts
> > > -		 * when the bdi limits are ramping up.
> > > -		 */
> > > -		if (nr_reclaimable + nr_writeback <
> > > -				(background_thresh + dirty_thresh) / 2)
> > > -			break;
> > > -
> > > -		if (!bdi->dirty_exceeded)
> > > -			bdi->dirty_exceeded = 1;
> > > -
> > > -		/* Note: nr_reclaimable denotes nr_dirty + nr_unstable.
> > > -		 * Unstable writes are a feature of certain networked
> > > -		 * filesystems (i.e. NFS) in which data may have been
> > > -		 * written to the server's write cache, but has not yet
> > > -		 * been flushed to permanent storage.
> > > -		 * Only move pages to writeback if this bdi is over its
> > > -		 * threshold otherwise wait until the disk writes catch
> > > -		 * up.
> > > -		 */
> > > -		if (bdi_nr_reclaimable > bdi_thresh) {
> > > -			generic_sync_bdi_inodes(NULL, &wbc);
> > > -			pages_written += write_chunk - wbc.nr_to_write;
> > > -			get_dirty_limits(&background_thresh, &dirty_thresh,
> > > -				       &bdi_thresh, bdi);
> > > -		}
> > > +			global_page_state(NR_UNSTABLE_NFS);
> > > +		nr_writeback = global_page_state(NR_WRITEBACK) +
> > > +			global_page_state(NR_WRITEBACK_TEMP);
> > >  
> > >  		/*
> > >  		 * In order to avoid the stacked BDI deadlock we need
> > > @@ -557,16 +497,48 @@ static void balance_dirty_pages(struct a
> > >  			bdi_nr_writeback = bdi_stat(bdi, BDI_WRITEBACK);
> > >  		}
> > >  
> > > -		if (bdi_nr_reclaimable + bdi_nr_writeback <= bdi_thresh)
> > > -			break;
> > > -		if (pages_written >= write_chunk)
> > > -			break;		/* We've done our duty */
> > 
> > > +		/* always throttle if over threshold */
> > > +		if (nr_reclaimable + nr_writeback < dirty_thresh) {
> > 
> > That 'if' is a big behavior change. It effectively blocks everyone
> > and cancels Peter's proportional throttling work: the less a process
> > dirtied, the less it should be throttled.
> > 
> I don't think it does. The code ends up looking like:
> 
> FOR
> 	IF less than dirty_thresh THEN
> 		check bdi limits etc
> 	ENDIF
> 
> 	throttle
> ENDFOR
> 
> Therefore we always throttle when over the threshold; otherwise we apply
> the per-bdi limits to decide whether to throttle.
> 
> In the existing code clip_bdi_dirty_limit modified the bdi_thresh so
> that it would not let a bdi dirty enough pages to go over the
> dirty_threshold. All I've done is to bring the check of dirty_thresh up
> into balance_dirty_pages.
> 
> So isn't this effectively the same ?

Yes and no. For the bdi_thresh part it somehow makes the
clip_bdi_dirty_limit() logic simpler and more obvious, and I tend to
agree with you and Peter on doing something like this:

        if (nr_reclaimable + nr_writeback < dirty_thresh) {
                /* compute bdi_* */
                if (bdi_nr_reclaimable + bdi_nr_writeback <= bdi_thresh)
                        break;
        }

For the other two 'if's...
 
> > I'd propose to remove the above 'if' and liberate the following three 'if's.
> > 
> > > +
> > > +			if (bdi_nr_reclaimable + bdi_nr_writeback <= bdi_thresh)
> > > +				break;
> > > +
> > > +			/*
> > > +			 * Throttle it only when the background writeback cannot
> > > +			 * catch-up. This avoids (excessively) small writeouts
> > > +			 * when the bdi limits are ramping up.
> > > +			 */
> > > +			if (nr_reclaimable + nr_writeback <
> > > +			    (background_thresh + dirty_thresh) / 2)
> > > +				break;

That 'if' can be trivially moved out.

> > > +
> > > +			/* done enough? */
> > > +			if (pages_written >= write_chunk)
> > > +				break;

That 'if' must be moved out; otherwise it can block a light writer
forever, as long as another heavy dirtier keeps the dirty
numbers high.

> > > +		}
> > > +		if (!bdi->dirty_exceeded)
> > > +			bdi->dirty_exceeded = 1;
> > >  
> > > +		/* Note: nr_reclaimable denotes nr_dirty + nr_unstable.
> > > +		 * Unstable writes are a feature of certain networked
> > > +		 * filesystems (i.e. NFS) in which data may have been
> > > +		 * written to the server's write cache, but has not yet
> > > +		 * been flushed to permanent storage.

> > > +		 * Only move pages to writeback if this bdi is over its
> > > +		 * threshold otherwise wait until the disk writes catch
> > > +		 * up.
> > > +		 */
> > > +		if (bdi_nr_reclaimable > bdi_thresh) {

I'd much prefer its original form

                if (bdi_nr_reclaimable) {

Let's push dirty pages to disk ASAP :)

> > > +			writeback_inodes(&wbc);
> > > +			pages_written += write_chunk - wbc.nr_to_write;
> > 
> > > +			if (wbc.nr_to_write == 0)
> > > +				continue;
> > 
> > What's the purpose of the above 2 lines?
> 
> This is to try to replicate the existing code as closely as possible.
> 
> If writeback_inodes wrote write_chunk pages in one pass, then skip to the
> top of the loop to recheck the limits and decide whether we can let the
> application continue. Otherwise it's not making enough forward progress
> due to congestion, so do the congestion_wait & loop.

It makes sense. We have wbc.encountered_congestion for that purpose.
However it may not be able to write enough pages for other reasons, like
lock contention. So I'd suggest testing (wbc.nr_to_write <= 0).
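
I.e. something like this (sketch, untested; it's your patch's code with
only the test changed):

		if (bdi_nr_reclaimable > bdi_thresh) {
			writeback_inodes(&wbc);
			pages_written += write_chunk - wbc.nr_to_write;
			if (wbc.nr_to_write <= 0)
				continue;	/* wrote a full chunk */
		}
		congestion_wait(BLK_RW_ASYNC, HZ/10);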

Thanks,
Fengguang


* Re: + mm-balance_dirty_pages-reduce-calls-to-global_page_state-to-reduce-cache-references.patch added to -mm tree
  2009-08-23 13:00     ` Wu Fengguang
@ 2009-08-23 13:46       ` Richard Kennedy
  2009-08-24  1:41         ` Wu Fengguang
  0 siblings, 1 reply; 20+ messages in thread
From: Richard Kennedy @ 2009-08-23 13:46 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: akpm@linux-foundation.org, mm-commits@vger.kernel.org,
	a.p.zijlstra@chello.nl, chris.mason@oracle.com,
	jens.axboe@oracle.com, mbligh@mbligh.org, miklos@szeredi.hu,
	linux-mm@kvack.org, linux-fsdevel@vger.kernel.org

On Sun, 2009-08-23 at 21:00 +0800, Wu Fengguang wrote:
> On Sun, Aug 23, 2009 at 05:33:33PM +0800, Richard Kennedy wrote:
> > On Sat, 2009-08-22 at 10:51 +0800, Wu Fengguang wrote:
> > 
> > > > 
> > > >  mm/page-writeback.c |  116 +++++++++++++++---------------------------
> > > >  1 file changed, 43 insertions(+), 73 deletions(-)
> > > > 
> > > > diff -puN mm/page-writeback.c~mm-balance_dirty_pages-reduce-calls-to-global_page_state-to-reduce-cache-references mm/page-writeback.c
> > > > --- a/mm/page-writeback.c~mm-balance_dirty_pages-reduce-calls-to-global_page_state-to-reduce-cache-references
> > > > +++ a/mm/page-writeback.c
> > > > @@ -249,32 +249,6 @@ static void bdi_writeout_fraction(struct
> > > >  	}
> > > >  }
> > > >  
> > > > -/*
> > > > - * Clip the earned share of dirty pages to that which is actually available.
> > > > - * This avoids exceeding the total dirty_limit when the floating averages
> > > > - * fluctuate too quickly.
> > > > - */
> > > > -static void clip_bdi_dirty_limit(struct backing_dev_info *bdi,
> > > > -		unsigned long dirty, unsigned long *pbdi_dirty)
> > > > -{
> > > > -	unsigned long avail_dirty;
> > > > -
> > > > -	avail_dirty = global_page_state(NR_FILE_DIRTY) +
> > > > -		 global_page_state(NR_WRITEBACK) +
> > > > -		 global_page_state(NR_UNSTABLE_NFS) +
> > > > -		 global_page_state(NR_WRITEBACK_TEMP);
> > > > -
> > > > -	if (avail_dirty < dirty)
> > > > -		avail_dirty = dirty - avail_dirty;
> > > > -	else
> > > > -		avail_dirty = 0;
> > > > -
> > > > -	avail_dirty += bdi_stat(bdi, BDI_RECLAIMABLE) +
> > > > -		bdi_stat(bdi, BDI_WRITEBACK);
> > > > -
> > > > -	*pbdi_dirty = min(*pbdi_dirty, avail_dirty);
> > > > -}
> > > > -
> > > >  static inline void task_dirties_fraction(struct task_struct *tsk,
> > > >  		long *numerator, long *denominator)
> > > >  {
> > > > @@ -465,7 +439,6 @@ get_dirty_limits(unsigned long *pbackgro
> > > >  			bdi_dirty = dirty * bdi->max_ratio / 100;
> > > >  
> > > >  		*pbdi_dirty = bdi_dirty;
> > > > -		clip_bdi_dirty_limit(bdi, dirty, pbdi_dirty);
> > > >  		task_dirty_limit(current, pbdi_dirty);
> > > >  	}
> > > >  }
> > > > @@ -499,45 +472,12 @@ static void balance_dirty_pages(struct a
> > > >  		};
> > > >  
> > > >  		get_dirty_limits(&background_thresh, &dirty_thresh,
> > > > -				&bdi_thresh, bdi);
> > > > +				 &bdi_thresh, bdi);
> > > >  
> > > >  		nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
> > > > -					global_page_state(NR_UNSTABLE_NFS);
> > > > -		nr_writeback = global_page_state(NR_WRITEBACK);
> > > > -
> > > > -		bdi_nr_reclaimable = bdi_stat(bdi, BDI_RECLAIMABLE);
> > > > -		bdi_nr_writeback = bdi_stat(bdi, BDI_WRITEBACK);
> > > > -
> > > > -		if (bdi_nr_reclaimable + bdi_nr_writeback <= bdi_thresh)
> > > > -			break;
> > > > -
> > > > -		/*
> > > > -		 * Throttle it only when the background writeback cannot
> > > > -		 * catch-up. This avoids (excessively) small writeouts
> > > > -		 * when the bdi limits are ramping up.
> > > > -		 */
> > > > -		if (nr_reclaimable + nr_writeback <
> > > > -				(background_thresh + dirty_thresh) / 2)
> > > > -			break;
> > > > -
> > > > -		if (!bdi->dirty_exceeded)
> > > > -			bdi->dirty_exceeded = 1;
> > > > -
> > > > -		/* Note: nr_reclaimable denotes nr_dirty + nr_unstable.
> > > > -		 * Unstable writes are a feature of certain networked
> > > > -		 * filesystems (i.e. NFS) in which data may have been
> > > > -		 * written to the server's write cache, but has not yet
> > > > -		 * been flushed to permanent storage.
> > > > -		 * Only move pages to writeback if this bdi is over its
> > > > -		 * threshold otherwise wait until the disk writes catch
> > > > -		 * up.
> > > > -		 */
> > > > -		if (bdi_nr_reclaimable > bdi_thresh) {
> > > > -			generic_sync_bdi_inodes(NULL, &wbc);
> > > > -			pages_written += write_chunk - wbc.nr_to_write;
> > > > -			get_dirty_limits(&background_thresh, &dirty_thresh,
> > > > -				       &bdi_thresh, bdi);
> > > > -		}
> > > > +			global_page_state(NR_UNSTABLE_NFS);
> > > > +		nr_writeback = global_page_state(NR_WRITEBACK) +
> > > > +			global_page_state(NR_WRITEBACK_TEMP);
> > > >  
> > > >  		/*
> > > >  		 * In order to avoid the stacked BDI deadlock we need
> > > > @@ -557,16 +497,48 @@ static void balance_dirty_pages(struct a
> > > >  			bdi_nr_writeback = bdi_stat(bdi, BDI_WRITEBACK);
> > > >  		}
> > > >  
> > > > -		if (bdi_nr_reclaimable + bdi_nr_writeback <= bdi_thresh)
> > > > -			break;
> > > > -		if (pages_written >= write_chunk)
> > > > -			break;		/* We've done our duty */
> > > 
> > > > +		/* always throttle if over threshold */
> > > > +		if (nr_reclaimable + nr_writeback < dirty_thresh) {
> > > 
> > > That 'if' is a big behavior change. It effectively blocks everyone
> > > and cancels Peter's proportional throttling work: the less a process
> > > dirtied, the less it should be throttled.
> > > 
> > I don't think it does. The code ends up looking like:
> > 
> > FOR
> > 	IF less than dirty_thresh THEN
> > 		check bdi limits etc
> > 	ENDIF
> > 
> > 	throttle
> > ENDFOR
> > 
> > Therefore we always throttle when over the threshold; otherwise we apply
> > the per-bdi limits to decide whether to throttle.
> > 
> > In the existing code clip_bdi_dirty_limit modified the bdi_thresh so
> > that it would not let a bdi dirty enough pages to go over the
> > dirty_threshold. All I've done is to bring the check of dirty_thresh up
> > into balance_dirty_pages.
> > 
> > So isn't this effectively the same ?
> 
> Yes and no. For the bdi_thresh part it somehow makes the
> clip_bdi_dirty_limit() logic simpler and more obvious, and I tend to
> agree with you and Peter on doing something like this:
> 
>         if (nr_reclaimable + nr_writeback < dirty_thresh) {
>                 /* compute bdi_* */
>                 if (bdi_nr_reclaimable + bdi_nr_writeback <= bdi_thresh)
>                         break;
>         }
> 
> For the other two 'if's...
>  
> > > I'd propose to remove the above 'if' and liberate the following three 'if's.
> > > 
> > > > +
> > > > +			if (bdi_nr_reclaimable + bdi_nr_writeback <= bdi_thresh)
> > > > +				break;
> > > > +
> > > > +			/*
> > > > +			 * Throttle it only when the background writeback cannot
> > > > +			 * catch-up. This avoids (excessively) small writeouts
> > > > +			 * when the bdi limits are ramping up.
> > > > +			 */
> > > > +			if (nr_reclaimable + nr_writeback <
> > > > +			    (background_thresh + dirty_thresh) / 2)
> > > > +				break;
> 
> That 'if' can be trivially moved out.

OK, 
> > > > +
> > > > +			/* done enough? */
> > > > +			if (pages_written >= write_chunk)
> > > > +				break;
> 
> > That 'if' must be moved out; otherwise it can block a light writer
> > forever, as long as another heavy dirtier keeps the dirty
> > numbers high.

Yes, I see. But I was worried about a failing device that gets stuck.
Doesn't this let the application keep dirtying pages forever if the
pages never get written to the device?

Maybe something like this?

if (nr_writeback < background_thresh && pages_written >= write_chunk)
	break;

or bdi_nr_writeback < bdi_thresh/2 ?


> > > > +		}
> > > > +		if (!bdi->dirty_exceeded)
> > > > +			bdi->dirty_exceeded = 1;
> > > >  
> > > > +		/* Note: nr_reclaimable denotes nr_dirty + nr_unstable.
> > > > +		 * Unstable writes are a feature of certain networked
> > > > +		 * filesystems (i.e. NFS) in which data may have been
> > > > +		 * written to the server's write cache, but has not yet
> > > > +		 * been flushed to permanent storage.
> 
> > > > +		 * Only move pages to writeback if this bdi is over its
> > > > +		 * threshold otherwise wait until the disk writes catch
> > > > +		 * up.
> > > > +		 */
> > > > +		if (bdi_nr_reclaimable > bdi_thresh) {
> 
> I'd much prefer its original form
> 
>                 if (bdi_nr_reclaimable) {
> 
> Let's push dirty pages to disk ASAP :)

That change comes from my previous patch, and it's to stop this code
overreacting and pushing all the available dirty pages to the writeback
list.

> > > > +			writeback_inodes(&wbc);
> > > > +			pages_written += write_chunk - wbc.nr_to_write;
> > > 
> > > > +			if (wbc.nr_to_write == 0)
> > > > +				continue;
> > > 
> > > What's the purpose of the above 2 lines?
> > 
> > This is to try to replicate the existing code as closely as possible.
> > 
> > If writeback_inodes wrote write_chunk pages in one pass, then skip to the
> > top of the loop to recheck the limits and decide whether we can let the
> > application continue. Otherwise it's not making enough forward progress
> > due to congestion, so do the congestion_wait & loop.
> 
> It makes sense. We have wbc.encountered_congestion for that purpose.
> However it may not be able to write enough pages for other reasons, like
> lock contention. So I'd suggest testing (wbc.nr_to_write <= 0).
> Thanks,
> Fengguang


I didn't test the congestion flag directly because we don't care about
it if writeback_inodes did enough. If write_chunk pages get moved to
writeback then we don't need to do the congestion_wait.

Can writeback_inodes do more work than it was asked to do?

But OK, I can make that change if you think it worthwhile.

regards
Richard




* Re: + mm-balance_dirty_pages-reduce-calls-to-global_page_state-to-reduce-cache-references.patch added to -mm tree
  2009-08-23 13:46       ` Richard Kennedy
@ 2009-08-24  1:41         ` Wu Fengguang
  0 siblings, 0 replies; 20+ messages in thread
From: Wu Fengguang @ 2009-08-24  1:41 UTC (permalink / raw)
  To: Richard Kennedy
  Cc: akpm@linux-foundation.org, mm-commits@vger.kernel.org,
	a.p.zijlstra@chello.nl, chris.mason@oracle.com,
	jens.axboe@oracle.com, mbligh@mbligh.org, miklos@szeredi.hu,
	linux-mm@kvack.org, linux-fsdevel@vger.kernel.org

On Sun, Aug 23, 2009 at 09:46:36PM +0800, Richard Kennedy wrote:
> On Sun, 2009-08-23 at 21:00 +0800, Wu Fengguang wrote:
> > On Sun, Aug 23, 2009 at 05:33:33PM +0800, Richard Kennedy wrote:
> > > On Sat, 2009-08-22 at 10:51 +0800, Wu Fengguang wrote:
> > > 
> > > > > 
> > > > >  mm/page-writeback.c |  116 +++++++++++++++---------------------------
> > > > >  1 file changed, 43 insertions(+), 73 deletions(-)
> > > > > 
> > > > > diff -puN mm/page-writeback.c~mm-balance_dirty_pages-reduce-calls-to-global_page_state-to-reduce-cache-references mm/page-writeback.c
> > > > > --- a/mm/page-writeback.c~mm-balance_dirty_pages-reduce-calls-to-global_page_state-to-reduce-cache-references
> > > > > +++ a/mm/page-writeback.c
> > > > > @@ -249,32 +249,6 @@ static void bdi_writeout_fraction(struct
> > > > >  	}
> > > > >  }
> > > > >  
> > > > > -/*
> > > > > - * Clip the earned share of dirty pages to that which is actually available.
> > > > > - * This avoids exceeding the total dirty_limit when the floating averages
> > > > > - * fluctuate too quickly.
> > > > > - */
> > > > > -static void clip_bdi_dirty_limit(struct backing_dev_info *bdi,
> > > > > -		unsigned long dirty, unsigned long *pbdi_dirty)
> > > > > -{
> > > > > -	unsigned long avail_dirty;
> > > > > -
> > > > > -	avail_dirty = global_page_state(NR_FILE_DIRTY) +
> > > > > -		 global_page_state(NR_WRITEBACK) +
> > > > > -		 global_page_state(NR_UNSTABLE_NFS) +
> > > > > -		 global_page_state(NR_WRITEBACK_TEMP);
> > > > > -
> > > > > -	if (avail_dirty < dirty)
> > > > > -		avail_dirty = dirty - avail_dirty;
> > > > > -	else
> > > > > -		avail_dirty = 0;
> > > > > -
> > > > > -	avail_dirty += bdi_stat(bdi, BDI_RECLAIMABLE) +
> > > > > -		bdi_stat(bdi, BDI_WRITEBACK);
> > > > > -
> > > > > -	*pbdi_dirty = min(*pbdi_dirty, avail_dirty);
> > > > > -}
> > > > > -
> > > > >  static inline void task_dirties_fraction(struct task_struct *tsk,
> > > > >  		long *numerator, long *denominator)
> > > > >  {
> > > > > @@ -465,7 +439,6 @@ get_dirty_limits(unsigned long *pbackgro
> > > > >  			bdi_dirty = dirty * bdi->max_ratio / 100;
> > > > >  
> > > > >  		*pbdi_dirty = bdi_dirty;
> > > > > -		clip_bdi_dirty_limit(bdi, dirty, pbdi_dirty);
> > > > >  		task_dirty_limit(current, pbdi_dirty);
> > > > >  	}
> > > > >  }
> > > > > @@ -499,45 +472,12 @@ static void balance_dirty_pages(struct a
> > > > >  		};
> > > > >  
> > > > >  		get_dirty_limits(&background_thresh, &dirty_thresh,
> > > > > -				&bdi_thresh, bdi);
> > > > > +				 &bdi_thresh, bdi);
> > > > >  
> > > > >  		nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
> > > > > -					global_page_state(NR_UNSTABLE_NFS);
> > > > > -		nr_writeback = global_page_state(NR_WRITEBACK);
> > > > > -
> > > > > -		bdi_nr_reclaimable = bdi_stat(bdi, BDI_RECLAIMABLE);
> > > > > -		bdi_nr_writeback = bdi_stat(bdi, BDI_WRITEBACK);
> > > > > -
> > > > > -		if (bdi_nr_reclaimable + bdi_nr_writeback <= bdi_thresh)
> > > > > -			break;
> > > > > -
> > > > > -		/*
> > > > > -		 * Throttle it only when the background writeback cannot
> > > > > -		 * catch-up. This avoids (excessively) small writeouts
> > > > > -		 * when the bdi limits are ramping up.
> > > > > -		 */
> > > > > -		if (nr_reclaimable + nr_writeback <
> > > > > -				(background_thresh + dirty_thresh) / 2)
> > > > > -			break;
> > > > > -
> > > > > -		if (!bdi->dirty_exceeded)
> > > > > -			bdi->dirty_exceeded = 1;
> > > > > -
> > > > > -		/* Note: nr_reclaimable denotes nr_dirty + nr_unstable.
> > > > > -		 * Unstable writes are a feature of certain networked
> > > > > -		 * filesystems (i.e. NFS) in which data may have been
> > > > > -		 * written to the server's write cache, but has not yet
> > > > > -		 * been flushed to permanent storage.
> > > > > -		 * Only move pages to writeback if this bdi is over its
> > > > > -		 * threshold otherwise wait until the disk writes catch
> > > > > -		 * up.
> > > > > -		 */
> > > > > -		if (bdi_nr_reclaimable > bdi_thresh) {
> > > > > -			generic_sync_bdi_inodes(NULL, &wbc);
> > > > > -			pages_written += write_chunk - wbc.nr_to_write;
> > > > > -			get_dirty_limits(&background_thresh, &dirty_thresh,
> > > > > -				       &bdi_thresh, bdi);
> > > > > -		}
> > > > > +			global_page_state(NR_UNSTABLE_NFS);
> > > > > +		nr_writeback = global_page_state(NR_WRITEBACK) +
> > > > > +			global_page_state(NR_WRITEBACK_TEMP);
> > > > >  
> > > > >  		/*
> > > > >  		 * In order to avoid the stacked BDI deadlock we need
> > > > > @@ -557,16 +497,48 @@ static void balance_dirty_pages(struct a
> > > > >  			bdi_nr_writeback = bdi_stat(bdi, BDI_WRITEBACK);
> > > > >  		}
> > > > >  
> > > > > -		if (bdi_nr_reclaimable + bdi_nr_writeback <= bdi_thresh)
> > > > > -			break;
> > > > > -		if (pages_written >= write_chunk)
> > > > > -			break;		/* We've done our duty */
> > > > 
> > > > > +		/* always throttle if over threshold */
> > > > > +		if (nr_reclaimable + nr_writeback < dirty_thresh) {
> > > > 
> > > > That 'if' is a big behavior change. It effectively blocks everyone
> > > > and cancels Peter's proportional throttling work: the less a process
> > > > dirties, the less it should be throttled.
> > > > 
> > > I don't think it does. The code ends up looking like
> > > 
> > > FOR
> > > 	IF less than dirty_thresh THEN
> > > 		check bdi limits etc	
> > > 	ENDIF
> > > 
> > > 	throttle
> > > ENDFOR
> > > 
> > > Therefore we always throttle when over the threshold; otherwise we
> > > apply the per-bdi limits to decide whether to throttle.
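> > > 
> > > In C terms the new shape is roughly this (a sketch of the structure
> > > only, not the actual patch text):
> > > 
> > > 	for (;;) {
> > > 		get_dirty_limits(&background_thresh, &dirty_thresh,
> > > 				 &bdi_thresh, bdi);
> > > 		/* read the global and per-bdi counters once, here */
> > > 		if (nr_reclaimable + nr_writeback < dirty_thresh) {
> > > 			/* under the global limit: the bdi tests decide */
> > > 			if (bdi_nr_reclaimable + bdi_nr_writeback <= bdi_thresh)
> > > 				break;
> > > 		}
> > > 		/* over a limit: write back some pages and throttle */
> > > 	}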
> > > 
> > > In the existing code clip_bdi_dirty_limit modified the bdi_thresh so
> > > that it would not let a bdi dirty enough pages to go over the
> > > dirty_threshold. All I've done is to bring the check of dirty_thresh up
> > > into balance_dirty_pages.
> > > 
> > > So isn't this effectively the same?
> > 
> > Yes and no. For the bdi_thresh part it makes the
> > clip_bdi_dirty_limit() logic simpler and more obvious. I tend to
> > agree with you and Peter on doing something like this:
> > 
> >         if (nr_reclaimable + nr_writeback < dirty_thresh) {
> >                 /* compute bdi_* */
> >                 if (bdi_nr_reclaimable + bdi_nr_writeback <= bdi_thresh)
> >          		break;
> >         }
> > 
> > For the other two 'if's...
> >  
> > > > I'd propose to remove the above 'if' and liberate the following three 'if's.
> > > > 
> > > > > +
> > > > > +			if (bdi_nr_reclaimable + bdi_nr_writeback <= bdi_thresh)
> > > > > +				break;
> > > > > +
> > > > > +			/*
> > > > > +			 * Throttle it only when the background writeback cannot
> > > > > +			 * catch-up. This avoids (excessively) small writeouts
> > > > > +			 * when the bdi limits are ramping up.
> > > > > +			 */
> > > > > +			if (nr_reclaimable + nr_writeback <
> > > > > +			    (background_thresh + dirty_thresh) / 2)
> > > > > +				break;
> > 
> > That 'if' can be trivially moved out.
> 
> OK, 
> > > > > +
> > > > > +			/* done enough? */
> > > > > +			if (pages_written >= write_chunk)
> > > > > +				break;
> > 
> > That 'if' must be moved out, otherwise it can block a light writer
> > forever, as long as another heavy dirtier keeps the dirty
> > numbers high.
> 
> Yes, I see. But I was worried about a failing device that gets stuck.
> Doesn't this let the application keep dirtying pages forever if the
> pages never get written to the device?

In that case every task will be granted up to 8 dirty pages and then
get blocked here, because pages_written will never grow big enough.

That is not perfect, but should be acceptable for a relatively rare case.
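
For reference, that grant comes from the ratelimiting in
balance_dirty_pages_ratelimited(), which is shaped roughly like this
(a from-memory sketch; the per-cpu variable name is approximate):

	unsigned long ratelimit = ratelimit_pages;
	unsigned long *p;

	if (mapping->backing_dev_info->dirty_exceeded)
		ratelimit = 8;

	/* count pages dirtied since the last check, per cpu */
	preempt_disable();
	p = &__get_cpu_var(ratelimits);
	*p += nr_pages_dirtied;
	if (*p >= ratelimit) {
		*p = 0;
		preempt_enable();
		balance_dirty_pages(mapping);
		return;
	}
	preempt_enable();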

> Maybe something like this?
> 
> if (nr_writeback < background_thresh && pages_written >= write_chunk)
> 	break;
> 
> or bdi_nr_writeback < bdi_thresh/2?

Does that improve _anything_ on a failing device?
Those 8 pages per task will still be granted...

> > > > > +		}
> > > > > +		if (!bdi->dirty_exceeded)
> > > > > +			bdi->dirty_exceeded = 1;
> > > > >  
> > > > > +		/* Note: nr_reclaimable denotes nr_dirty + nr_unstable.
> > > > > +		 * Unstable writes are a feature of certain networked
> > > > > +		 * filesystems (i.e. NFS) in which data may have been
> > > > > +		 * written to the server's write cache, but has not yet
> > > > > +		 * been flushed to permanent storage.
> > 
> > > > > +		 * Only move pages to writeback if this bdi is over its
> > > > > +		 * threshold otherwise wait until the disk writes catch
> > > > > +		 * up.
> > > > > +		 */
> > > > > +		if (bdi_nr_reclaimable > bdi_thresh) {
> > 
> > I'd much prefer its original form
> > 
> >                 if (bdi_nr_reclaimable) {
> > 
> > Let's push dirty pages to disk ASAP :)
> 
> That change comes from my previous patch, and it's to stop this code
> overreacting and pushing all the available dirty pages to the writeback
> list.

This is the fs guys' expectation. The background sync logic will
also try to push all available dirty pages all the way down to 0.
There may be fat IO pipes and we want to fill them as much as we can
once IO is started, to achieve better IO efficiency.

> > > > > +			writeback_inodes(&wbc);
> > > > > +			pages_written += write_chunk - wbc.nr_to_write;
> > > > 
> > > > > +			if (wbc.nr_to_write == 0)
> > > > > +				continue;
> > > > 
> > > > What's the purpose of the above 2 lines?
> > > 
> > > This is to try to replicate the existing code as closely as possible.
> > > 
> > > If writeback_inodes wrote write_chunk pages in one pass then we skip to
> > > the top of the loop to recheck the limits and decide if we can let the
> > > application continue. Otherwise it's not making enough forward progress
> > > due to congestion, so we do the congestion_wait and loop.
> > 
> > It makes sense. We have wbc.encountered_congestion for that purpose.
> > However it may not be able to write enough pages for other reasons, like
> > lock contention. So I'd suggest testing (wbc.nr_to_write <= 0).
> > Thanks,
> > Fengguang
> 
> 
> I didn't test the congestion flag directly because we don't care about
> it if writeback_inodes did enough. If write_chunk pages get moved to
> writeback then we don't need to do the congestion_wait.

Right. (wbc.nr_to_write <= 0) indicates no congestion encountered.

> Can writeback_inodes do more work than it was asked to do?

Maybe not. But all existing code checks for inequality instead of '== 0' ;)
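
I.e. the tail of the loop body would then read something like (sketch):

	writeback_inodes(&wbc);
	pages_written += write_chunk - wbc.nr_to_write;
	if (wbc.nr_to_write <= 0)
		continue;	/* wrote a full chunk: recheck the limits */
	/* otherwise fall through to the congestion_wait() and loop */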

> But OK, I can make that change if you think it worthwhile.

OK, thanks!

Fengguang

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: + mm-balance_dirty_pages-reduce-calls-to-global_page_state-to-reduce-c ache-references.patch added to -mm tree
  2009-08-22 18:11   ` Peter Zijlstra
  2009-08-23  1:32     ` Wu Fengguang
@ 2009-09-02  8:31     ` Peter Zijlstra
  2009-09-02  9:57       ` Wu Fengguang
  1 sibling, 1 reply; 20+ messages in thread
From: Peter Zijlstra @ 2009-09-02  8:31 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: akpm@linux-foundation.org, mm-commits@vger.kernel.org,
	richard@rsk.demon.co.uk, chris.mason@oracle.com,
	jens.axboe@oracle.com, mbligh@mbligh.org, miklos@szeredi.hu,
	linux-fsdevel

On Sat, 2009-08-22 at 20:11 +0200, Peter Zijlstra wrote:
> > > +           /* always throttle if over threshold */
> > > +           if (nr_reclaimable + nr_writeback < dirty_thresh) {
> > 
> > That 'if' is a big behavior change. It effectively blocks everyone
> > and cancels Peter's proportional throttling work: the less a process
> > dirties, the less it should be throttled.
> 
> Hmm, I think you're right, I had not considered that, thanks for
> catching that.

So in retrospect I think I might have been wrong here.

The per task thing causes the bdi limit to be lower than the bdi limit
based on writeback speed alone. That is, the more a task dirties, the
lower the bdi limit is as seen for that task.

So if we get a task that generates tons of dirty pages (dd) then it
won't ever actually hit the full dirty limit, even if it's the only task
on the system, and this outer if() will always be true.

Only when we actually saturate the full dirty limit will we fall through
and throttle, but that is ok -- we want to enforce the full limit.

In short, a very aggressive dirtier will have a bdi limit lower than the
total limit (at all times) leaving a little room at the top for the
occasional dirtier to make quick progress.
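
For reference, the per-task lowering is done in task_dirty_limit(),
which is shaped more or less like this (a sketch; details of the
current code may differ):

	static void task_dirty_limit(struct task_struct *tsk,
				     unsigned long *pdirty)
	{
		long numerator, denominator;
		unsigned long dirty = *pdirty;
		u64 inv = dirty >> 3;	/* at most 1/8 of the limit */

		task_dirties_fraction(tsk, &numerator, &denominator);
		inv *= numerator;
		do_div(inv, denominator);

		/* the heavier the dirtier, the lower its effective limit */
		*pdirty = dirty - inv;
	}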

Wu, does that cover the scenario you had in mind?

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: + mm-balance_dirty_pages-reduce-calls-to-global_page_state-to-reduce-c ache-references.patch added to -mm tree
  2009-09-02  8:31     ` Peter Zijlstra
@ 2009-09-02  9:57       ` Wu Fengguang
  2009-09-02 10:45         ` Peter Zijlstra
  0 siblings, 1 reply; 20+ messages in thread
From: Wu Fengguang @ 2009-09-02  9:57 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: akpm@linux-foundation.org, mm-commits@vger.kernel.org,
	richard@rsk.demon.co.uk, chris.mason@oracle.com,
	jens.axboe@oracle.com, mbligh@mbligh.org, miklos@szeredi.hu,
	linux-fsdevel@vger.kernel.org

On Wed, Sep 02, 2009 at 04:31:40PM +0800, Peter Zijlstra wrote:
> On Sat, 2009-08-22 at 20:11 +0200, Peter Zijlstra wrote:
> > > > +           /* always throttle if over threshold */
> > > > +           if (nr_reclaimable + nr_writeback < dirty_thresh) {
> > > 
> > > That 'if' is a big behavior change. It effectively blocks everyone
> > > and cancels Peter's proportional throttling work: the less a process
> > > dirties, the less it should be throttled.
> > 
> > Hmm, I think you're right, I had not considered that, thanks for
> > catching that.
> 
> So in retrospect I think I might have been wrong here.
> 
> The per task thing causes the bdi limit to be lower than the bdi limit
> based on writeback speed alone. That is, the more a task dirties, the
> lower the bdi limit is as seen for that task.

Right. If I understand it right, there will be a safety margin of about
(1/8) * dirty_limit for 1 heavy dirtier case, and that gap scales down
when there are more concurrent heavy dirtiers.

In principle, the ceiling will be a bit higher for a light dirtier to
make it easy to pass in the presence of more heavy dirtiers.

> So if we get a task that generates tons of dirty pages (dd) then it
> won't ever actually hit the full dirty limit, even if it's the only task
> on the system, and this outer if() will always be true.

Right, we have the safety margin :)

> Only when we actually saturate the full dirty limit will we fall through
> and throttle, but that is ok -- we want to enforce the full limit.
> 
> In short, a very aggressive dirtier will have a bdi limit lower than the
> total limit (at all times) leaving a little room at the top for the
> occasional dirtier to make quick progress.
> 
> Wu, does that cover the scenario you had in mind?

Yes thanks! Please correct me if wrong:
- the lower-ceiling-for-heavier-dirtier algorithm in task_dirty_limit()
  is elegant enough to prevent a heavy dirtier from blocking light ones
- the test (nr_reclaimable + nr_writeback < dirty_thresh) is not
  relevant in the normal case, but can be kept for safety in the form of

          if (bdi_nr_reclaimable + bdi_nr_writeback < bdi_thresh &&
              nr_reclaimable + nr_writeback < dirty_thresh)
                  break;

- clip_bdi_dirty_limit() could be removed: we have been secured by the
  above test

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: + mm-balance_dirty_pages-reduce-calls-to-global_page_state-to-reduce-c ache-references.patch added to -mm tree
  2009-09-02  9:57       ` Wu Fengguang
@ 2009-09-02 10:45         ` Peter Zijlstra
  2009-09-02 13:53           ` Richard Kennedy
  2009-09-03  4:53           ` Wu Fengguang
  0 siblings, 2 replies; 20+ messages in thread
From: Peter Zijlstra @ 2009-09-02 10:45 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: akpm@linux-foundation.org, mm-commits@vger.kernel.org,
	richard@rsk.demon.co.uk, chris.mason@oracle.com,
	jens.axboe@oracle.com, mbligh@mbligh.org, miklos@szeredi.hu,
	linux-fsdevel@vger.kernel.org

On Wed, 2009-09-02 at 17:57 +0800, Wu Fengguang wrote:
> On Wed, Sep 02, 2009 at 04:31:40PM +0800, Peter Zijlstra wrote:
> > On Sat, 2009-08-22 at 20:11 +0200, Peter Zijlstra wrote:
> > > > > +           /* always throttle if over threshold */
> > > > > +           if (nr_reclaimable + nr_writeback < dirty_thresh) {
> > > > 
> > > > That 'if' is a big behavior change. It effectively blocks everyone
> > > > and cancels Peter's proportional throttling work: the less a process
> > > > dirties, the less it should be throttled.
> > > 
> > > Hmm, I think you're right, I had not considered that, thanks for
> > > catching that.
> > 
> > So in retrospect I think I might have been wrong here.
> > 
> > The per task thing causes the bdi limit to be lower than the bdi limit
> > based on writeback speed alone. That is, the more a task dirties, the
> > lower the bdi limit is as seen for that task.
> 
> Right. If I understand it right, there will be a safety margin of about
> (1/8) * dirty_limit for 1 heavy dirtier case, and that gap scales down
> when there are more concurrent heavy dirtiers.

Right, with say 4 heavy writers the gap will be 1/4-th of 1/8-th, which
is 1/32-nd.
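
(Illustrative numbers only: with a hypothetical dirty_thresh of 400MB,
a lone dd has a dirty fraction of ~1 and sees a limit of
400MB - 400MB/8 = 350MB; with 4 equal writers each has a fraction of
~1/4 and sees 400MB - 400MB/32 = 387.5MB.)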

With the side note that I think 1/8 is too much on large memory systems,
and I have posted a sqrt patch numerous times, but I don't think we've
ever found out if that helps or not...

> In principle, the ceiling will be a bit higher for a light dirtier to
> make it easy to pass in the presence of more heavy dirtiers.

Right.

> > So if we get a task that generates tons of dirty pages (dd) then it
> > won't ever actually hit the full dirty limit, even if it's the only task
> > on the system, and this outer if() will always be true.
> 
> Right, we have the safety margin :)
> 
> > Only when we actually saturate the full dirty limit will we fall through
> > and throttle, but that is ok -- we want to enforce the full limit.
> > 
> > In short, a very aggressive dirtier will have a bdi limit lower than the
> > total limit (at all times) leaving a little room at the top for the
> > occasional dirtier to make quick progress.
> > 
> > Wu, does that cover the scenario you had in mind?
> 
> Yes thanks! Please correct me if wrong:
> - the lower-ceiling-for-heavier-dirtier algorithm in task_dirty_limit()
>   is elegant enough to prevent a heavy dirtier from blocking light ones

ack

> - the test (nr_reclaimable + nr_writeback < dirty_thresh) is not
>   relevant in the normal case, but can be kept for safety in the form of
> 
>           if (bdi_nr_reclaimable + bdi_nr_writeback < bdi_thresh &&
>               nr_reclaimable + nr_writeback < dirty_thresh)
>                   break;

ack

> - clip_bdi_dirty_limit() could be removed: we have been secured by the
>   above test

ack.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: + mm-balance_dirty_pages-reduce-calls-to-global_page_state-to-reduce-c ache-references.patch added to -mm tree
  2009-09-02 10:45         ` Peter Zijlstra
@ 2009-09-02 13:53           ` Richard Kennedy
  2009-09-03  2:22             ` Wu Fengguang
  2009-09-03  4:53           ` Wu Fengguang
  1 sibling, 1 reply; 20+ messages in thread
From: Richard Kennedy @ 2009-09-02 13:53 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Wu Fengguang, akpm@linux-foundation.org,
	mm-commits@vger.kernel.org, chris.mason@oracle.com,
	jens.axboe@oracle.com, mbligh@mbligh.org, miklos@szeredi.hu,
	linux-fsdevel@vger.kernel.org

On Wed, 2009-09-02 at 12:45 +0200, Peter Zijlstra wrote:
> On Wed, 2009-09-02 at 17:57 +0800, Wu Fengguang wrote:
> > On Wed, Sep 02, 2009 at 04:31:40PM +0800, Peter Zijlstra wrote:
> > > On Sat, 2009-08-22 at 20:11 +0200, Peter Zijlstra wrote:
> > > > > > +           /* always throttle if over threshold */
> > > > > > +           if (nr_reclaimable + nr_writeback < dirty_thresh) {
> > > > > 
> > > > > That 'if' is a big behavior change. It effectively blocks everyone
> > > > > and cancels Peter's proportional throttling work: the less a process
> > > > > dirties, the less it should be throttled.
> > > > 
> > > > Hmm, I think you're right, I had not considered that, thanks for
> > > > catching that.
> > > 
> > > So in retrospect I think I might have been wrong here.
> > > 
> > > The per task thing causes the bdi limit to be lower than the bdi limit
> > > based on writeback speed alone. That is, the more a task dirties, the
> > > lower the bdi limit is as seen for that task.
> > 
> > Right. If I understand it right, there will be a safety margin of about
> > (1/8) * dirty_limit for 1 heavy dirtier case, and that gap scales down
> > when there are more concurrent heavy dirtiers.
> 
> Right, with say 4 heavy writers the gap will be 1/4-th of 1/8-th, which
> is 1/32-nd.
> 
> With the side note that I think 1/8 is too much on large memory systems,
> and I have posted a sqrt patch numerous times, but I don't think we've
> ever found out if that helps or not...
> 
> > In principle, the ceiling will be a bit higher for a light dirtier to
> > make it easy to pass in the presence of more heavy dirtiers.
> 
> Right.
> 
> > > So if we get a task that generates tons of dirty pages (dd) then it
> > > won't ever actually hit the full dirty limit, even if it's the only task
> > > on the system, and this outer if() will always be true.
> > 
> > Right, we have the safety margin :)
> > 
> > > Only when we actually saturate the full dirty limit will we fall through
> > > and throttle, but that is ok -- we want to enforce the full limit.
> > > 
> > > In short, a very aggressive dirtier will have a bdi limit lower than the
> > > total limit (at all times) leaving a little room at the top for the
> > > occasional dirtier to make quick progress.
> > > 
> > > Wu, does that cover the scenario you had in mind?
> > 
> > Yes thanks! Please correct me if wrong:
> > - the lower-ceiling-for-heavier-dirtier algorithm in task_dirty_limit()
> >   is elegant enough to prevent a heavy dirtier from blocking light ones
> 
> ack
> 
> > - the test (nr_reclaimable + nr_writeback < dirty_thresh) is not
> >   relevant in the normal case, but can be kept for safety in the form of
> > 
> >           if (bdi_nr_reclaimable + bdi_nr_writeback < bdi_thresh &&
> >               nr_reclaimable + nr_writeback < dirty_thresh)
> >                   break;
> 
> ack
> 
> > - clip_bdi_dirty_limit() could be removed: we have been secured by the
> >   above test
> 
> ack.


I've noticed that there's a difference in the handling of the
dirty_exceeded flag: because this change no longer clips the bdi_thresh,
the flag may get cleared more quickly here:

	if (bdi_nr_reclaimable + bdi_nr_writeback < bdi_thresh &&
	    bdi->dirty_exceeded)
		bdi->dirty_exceeded = 0;

So it then could call balance_dirty_pages a lot less often.

I've got an updated version of this patch that moves the clip_bdi logic
up into balance_dirty_pages; it should be closer to the existing
behavior, and tests so far look good. I can post it for comments if
you're interested?

regards
Richard

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: + mm-balance_dirty_pages-reduce-calls-to-global_page_state-to-reduce-c ache-references.patch added to -mm tree
  2009-09-02 13:53           ` Richard Kennedy
@ 2009-09-03  2:22             ` Wu Fengguang
  2009-09-03  3:09               ` Wu Fengguang
  2009-09-03  9:48               ` Richard Kennedy
  0 siblings, 2 replies; 20+ messages in thread
From: Wu Fengguang @ 2009-09-03  2:22 UTC (permalink / raw)
  To: Richard Kennedy
  Cc: Peter Zijlstra, akpm@linux-foundation.org,
	mm-commits@vger.kernel.org, chris.mason@oracle.com,
	jens.axboe@oracle.com, mbligh@mbligh.org, miklos@szeredi.hu,
	linux-fsdevel@vger.kernel.org

On Wed, Sep 02, 2009 at 09:53:31PM +0800, Richard Kennedy wrote:
> On Wed, 2009-09-02 at 12:45 +0200, Peter Zijlstra wrote:
> > On Wed, 2009-09-02 at 17:57 +0800, Wu Fengguang wrote:
> > > On Wed, Sep 02, 2009 at 04:31:40PM +0800, Peter Zijlstra wrote:
> > > > On Sat, 2009-08-22 at 20:11 +0200, Peter Zijlstra wrote:
> > > > > > > +           /* always throttle if over threshold */
> > > > > > > +           if (nr_reclaimable + nr_writeback < dirty_thresh) {
> > > > > > 
> > > > > > That 'if' is a big behavior change. It effectively blocks everyone
> > > > > > and cancels Peter's proportional throttling work: the less a process
> > > > > > dirties, the less it should be throttled.
> > > > > 
> > > > > Hmm, I think you're right, I had not considered that, thanks for
> > > > > catching that.
> > > > 
> > > > So in retrospect I think I might have been wrong here.
> > > > 
> > > > The per task thing causes the bdi limit to be lower than the bdi limit
> > > > based on writeback speed alone. That is, the more a task dirties, the
> > > > lower the bdi limit is as seen for that task.
> > > 
> > > Right. If I understand it right, there will be a safety margin of about
> > > (1/8) * dirty_limit for 1 heavy dirtier case, and that gap scales down
> > > when there are more concurrent heavy dirtiers.
> > 
> > Right, with say 4 heavy writers the gap will be 1/4-th of 1/8-th, which
> > is 1/32-nd.
> > 
> > With the side note that I think 1/8 is too much on large memory systems,
> > and I have posted a sqrt patch numerous times, but I don't think we've
> > ever found out if that helps or not...
> > 
> > > In principle, the ceiling will be a bit higher for a light dirtier to
> > > make it easy to pass in the presence of more heavy dirtiers.
> > 
> > Right.
> > 
> > > > So if we get a task that generates tons of dirty pages (dd) then it
> > > > won't ever actually hit the full dirty limit, even if it's the only task
> > > > on the system, and this outer if() will always be true.
> > > 
> > > Right, we have the safety margin :)
> > > 
> > > > Only when we actually saturate the full dirty limit will we fall through
> > > > and throttle, but that is ok -- we want to enforce the full limit.
> > > > 
> > > > In short, a very aggressive dirtier will have a bdi limit lower than the
> > > > total limit (at all times) leaving a little room at the top for the
> > > > occasional dirtier to make quick progress.
> > > > 
> > > > Wu, does that cover the scenario you had in mind?
> > > 
> > > Yes thanks! Please correct me if wrong:
> > > - the lower-ceiling-for-heavier-dirtier algorithm in task_dirty_limit()
> > >   is elegant enough to prevent a heavy dirtier from blocking light ones
> > 
> > ack
> > 
> > > - the test (nr_reclaimable + nr_writeback < dirty_thresh) is not
> > >   relevant in the normal case, but can be kept for safety in the form of
> > > 
> > >           if (bdi_nr_reclaimable + bdi_nr_writeback < bdi_thresh &&
> > >               nr_reclaimable + nr_writeback < dirty_thresh)
> > >                   break;
> > 
> > ack
> > 
> > > - clip_bdi_dirty_limit() could be removed: we have been secured by the
> > >   above test
> > 
> > ack.
> 
> 
> I've noticed that there's a difference in the handling of the
> dirty_exceeded flag: because this change no longer clips the bdi_thresh,
> the flag may get cleared more quickly here:
> 
> 	if (bdi_nr_reclaimable + bdi_nr_writeback < bdi_thresh &&
> 	    bdi->dirty_exceeded)
> 		bdi->dirty_exceeded = 0;
> 
> So it then could call balance_dirty_pages a lot less often.

I guess in normal situations, clip_bdi_dirty_limit() is simply a
no-op, or just lowers bdi_thresh slightly (otherwise it could be a bug).
So it could be removed without causing many side effects, including
the influence on dirty_exceeded.

> I've got an updated version of this patch that moves the clip_bdi logic
> up into balance_dirty_pages; it should be closer to the existing
> behavior, and tests so far look good. I can post it for comments if
> you're interested?

So I suggested just removing clip_bdi_dirty_limit(). To be sure, you could
run with the following patch and check whether big numbers show up.

Thanks,
Fengguang
---
 mm/page-writeback.c |    6 ++++++
 1 file changed, 6 insertions(+)

--- linux-mm.orig/mm/page-writeback.c	2009-09-02 17:16:51.000000000 +0800
+++ linux-mm/mm/page-writeback.c	2009-09-03 10:19:00.000000000 +0800
@@ -258,6 +258,7 @@ static void clip_bdi_dirty_limit(struct 
 		unsigned long dirty, unsigned long *pbdi_dirty)
 {
 	unsigned long avail_dirty;
+	unsigned long delta;
 
 	avail_dirty = global_page_state(NR_FILE_DIRTY) +
 		 global_page_state(NR_WRITEBACK) +
@@ -272,6 +273,11 @@ static void clip_bdi_dirty_limit(struct 
 	avail_dirty += bdi_stat(bdi, BDI_RECLAIMABLE) +
 		bdi_stat(bdi, BDI_WRITEBACK);
 
+	delta = *pbdi_dirty - min(*pbdi_dirty, avail_dirty);
+	delta *= 1024;
+	delta /= *pbdi_dirty + 1;
+	printk("clip_bdi_dirty_limit %lu\n", delta);
+
 	*pbdi_dirty = min(*pbdi_dirty, avail_dirty);
 }
 

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: + mm-balance_dirty_pages-reduce-calls-to-global_page_state-to-reduce-c ache-references.patch added to -mm tree
  2009-09-03  2:22             ` Wu Fengguang
@ 2009-09-03  3:09               ` Wu Fengguang
  2009-09-03  9:48               ` Richard Kennedy
  1 sibling, 0 replies; 20+ messages in thread
From: Wu Fengguang @ 2009-09-03  3:09 UTC (permalink / raw)
  To: Richard Kennedy
  Cc: Peter Zijlstra, akpm@linux-foundation.org,
	mm-commits@vger.kernel.org, chris.mason@oracle.com,
	jens.axboe@oracle.com, mbligh@mbligh.org, miklos@szeredi.hu,
	linux-fsdevel@vger.kernel.org

On Thu, Sep 03, 2009 at 10:22:23AM +0800, Wu Fengguang wrote:
> So I suggested just removing clip_bdi_dirty_limit(). To be sure, you could
> run with the following patch and check whether big numbers show up.
> 

Simple tests show that clip_bdi_dirty_limit() is a no-op with a single
disk and 1-2 dirtiers:

[...]
[  720.630917] clip_bdi_dirty_limit 0
[  720.753865] clip_bdi_dirty_limit 0
[  720.915198] clip_bdi_dirty_limit 0
[  721.149511] clip_bdi_dirty_limit 0
[...]
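
(The printed value is the clipped fraction scaled to 0..1024, so 0
means the clipping was a complete no-op. With hypothetical numbers,
*pbdi_dirty = 100000 and avail_dirty = 90000 would print
10000 * 1024 / 100001 ~= 102, i.e. ~10% clipped.)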

Thanks,
Fengguang

> --- linux-mm.orig/mm/page-writeback.c	2009-09-02 17:16:51.000000000 +0800
> +++ linux-mm/mm/page-writeback.c	2009-09-03 10:19:00.000000000 +0800
> @@ -258,6 +258,7 @@ static void clip_bdi_dirty_limit(struct 
>  		unsigned long dirty, unsigned long *pbdi_dirty)
>  {
>  	unsigned long avail_dirty;
> +	unsigned long delta;
>  
>  	avail_dirty = global_page_state(NR_FILE_DIRTY) +
>  		 global_page_state(NR_WRITEBACK) +
> @@ -272,6 +273,11 @@ static void clip_bdi_dirty_limit(struct 
>  	avail_dirty += bdi_stat(bdi, BDI_RECLAIMABLE) +
>  		bdi_stat(bdi, BDI_WRITEBACK);
>  
> +	delta = *pbdi_dirty - min(*pbdi_dirty, avail_dirty);
> +	delta *= 1024;
> +	delta /= *pbdi_dirty + 1;
> +	printk("clip_bdi_dirty_limit %lu\n", delta);
> +
>  	*pbdi_dirty = min(*pbdi_dirty, avail_dirty);
>  }
>  

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: + mm-balance_dirty_pages-reduce-calls-to-global_page_state-to-reduce-c ache-references.patch added to -mm tree
  2009-09-02 10:45         ` Peter Zijlstra
  2009-09-02 13:53           ` Richard Kennedy
@ 2009-09-03  4:53           ` Wu Fengguang
  1 sibling, 0 replies; 20+ messages in thread
From: Wu Fengguang @ 2009-09-03  4:53 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: akpm@linux-foundation.org, mm-commits@vger.kernel.org,
	richard@rsk.demon.co.uk, chris.mason@oracle.com,
	jens.axboe@oracle.com, mbligh@mbligh.org, miklos@szeredi.hu,
	linux-fsdevel@vger.kernel.org

On Wed, Sep 02, 2009 at 06:45:24PM +0800, Peter Zijlstra wrote:
> On Wed, 2009-09-02 at 17:57 +0800, Wu Fengguang wrote:
> > On Wed, Sep 02, 2009 at 04:31:40PM +0800, Peter Zijlstra wrote:
> > > On Sat, 2009-08-22 at 20:11 +0200, Peter Zijlstra wrote:
> > > > > > +           /* always throttle if over threshold */
> > > > > > +           if (nr_reclaimable + nr_writeback < dirty_thresh) {
> > > > > 
> > > > > That 'if' is a big behavior change. It effectively blocks everyone
> > > > > and cancels Peter's proportional throttling work: the less a process
> > > > > dirties, the less it should be throttled.
> > > > 
> > > > Hmm, I think you're right, I had not considered that, thanks for
> > > > catching that.
> > > 
> > > So in retrospect I think I might have been wrong here.
> > > 
> > > The per task thing causes the bdi limit to be lower than the bdi limit
> > > based on writeback speed alone. That is, the more a task dirties, the
> > > lower the bdi limit is as seen for that task.
> > 
> > Right. If I understand it right, there will be a safety margin of about
> > (1/8) * dirty_limit for 1 heavy dirtier case, and that gap scales down
> > when there are more concurrent heavy dirtiers.
> 
> Right, with say 4 heavy writers the gap will be 1/4-th of 1/8-th, which
> is 1/32-nd.
> 
> With the side note that I think 1/8 is too much on large memory systems,
> and I have posted a sqrt patch numerous times, but I don't think we've
> ever found out if that helps or not...

Can you repost it?  I think a sqrt-like curve is good for making the
real threshold more aligned with user expectations (vm_dirty_ratio and
especially vm_dirty_bytes). That should not create noticeable side
effects.
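
(Purely as an illustration of why the curve shape matters: a fixed
dirty/8 margin is 50MB at a 400MB limit but 500MB at a 4GB limit,
while a sqrt-shaped margin would grow much more slowly with memory
size. I don't know the exact form of your patch, though.)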

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: + mm-balance_dirty_pages-reduce-calls-to-global_page_state-to-reduce-c ache-references.patch added to -mm tree
  2009-09-03  2:22             ` Wu Fengguang
  2009-09-03  3:09               ` Wu Fengguang
@ 2009-09-03  9:48               ` Richard Kennedy
  2009-09-03 11:05                 ` Wu Fengguang
  1 sibling, 1 reply; 20+ messages in thread
From: Richard Kennedy @ 2009-09-03  9:48 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Peter Zijlstra, akpm@linux-foundation.org,
	mm-commits@vger.kernel.org, chris.mason@oracle.com,
	jens.axboe@oracle.com, mbligh@mbligh.org, miklos@szeredi.hu,
	linux-fsdevel@vger.kernel.org

On Thu, 2009-09-03 at 10:22 +0800, Wu Fengguang wrote:
> On Wed, Sep 02, 2009 at 09:53:31PM +0800, Richard Kennedy wrote:
> > On Wed, 2009-09-02 at 12:45 +0200, Peter Zijlstra wrote:
> > > On Wed, 2009-09-02 at 17:57 +0800, Wu Fengguang wrote:
> > > > On Wed, Sep 02, 2009 at 04:31:40PM +0800, Peter Zijlstra wrote:
> > > > > On Sat, 2009-08-22 at 20:11 +0200, Peter Zijlstra wrote:
> > > > > > > > +           /* always throttle if over threshold */
> > > > > > > > +           if (nr_reclaimable + nr_writeback < dirty_thresh) {
> > > > > > > 
> > > > > > > That 'if' is a big behavior change. It effectively blocks everyone
> > > > > > > and cancels Peter's proportional throttling work: the less a process
> > > > > > > dirties, the less it should be throttled.
> > > > > > 
> > > > > > Hmm, I think you're right, I had not considered that, thanks for
> > > > > > catching that.
> > > > > 
> > > > > So in retrospect I think I might have been wrong here.
> > > > > 
> > > > > The per task thing causes the bdi limit to be lower than the bdi limit
> > > > > based on writeback speed alone. That is, the more a task dirties, the
> > > > > lower the bdi limit is as seen for that task.
> > > > 
> > > > Right. If I understand it right, there will be a safety margin of about
> > > > (1/8) * dirty_limit for 1 heavy dirtier case, and that gap scales down
> > > > when there are more concurrent heavy dirtiers.
> > > 
> > > Right, with say 4 heavy writers the gap will be 1/4-th of 1/8-th, which
> > > is 1/32-nd.
> > > 
> > > With the side note that I think 1/8 is too much on large memory systems,
> > > and I have posted a sqrt patch numerous times, but I don't think we've
> > > ever found out if that helps or not...
> > > 
> > > > In principle, the ceiling will be a bit higher for a light dirtier to
> > > > make it easy to pass in the presence of more heavy dirtiers.
> > > 
> > > Right.
> > > 
> > > > > So if we get a task that generates tons of dirty pages (dd) then it
> > > > > won't ever actually hit the full dirty limit, even if it's the only task
> > > > > on the system, and this outer if() will always be true.
> > > > 
> > > > Right, we have the safety margin :)
> > > > 
> > > > > Only when we actually saturate the full dirty limit will we fall through
> > > > > and throttle, but that is ok -- we want to enforce the full limit.
> > > > > 
> > > > > In short, a very aggressive dirtier will have a bdi limit lower than the
> > > > > total limit (at all times) leaving a little room at the top for the
> > > > > occasional dirtier to make quick progress.
> > > > > 
> > > > > Wu, does that cover the scenario you had in mind?
> > > > 
> > > > Yes thanks! Please correct me if wrong:
> > > > - the lower-ceiling-for-heavier-dirtier algorithm in task_dirty_limit()
> > > >   is elegant enough to prevent a heavy dirtier from blocking light ones
> > > 
> > > ack
> > > 
> > > > - the test (nr_reclaimable + nr_writeback < dirty_thresh) is not
> > > >   relevant in the normal case, but can be kept for safety in the form of
> > > > 
> > > >           if (bdi_nr_reclaimable + bdi_nr_writeback < bdi_thresh &&
> > > >               nr_reclaimable + nr_writeback < dirty_thresh)
> > > >                   break;
> > > 
> > > ack
> > > 
> > > > - clip_bdi_dirty_limit() could be removed: we have been secured by the
> > > >   above test
> > > 
> > > ack.
> > 
> > 
> > I've noticed that there's a difference in the handling of the
> > dirty_exceeded flag: because this change no longer clips the bdi_thresh,
> > the flag may get cleared more quickly here:
> > 
> > 	if (bdi_nr_reclaimable + bdi_nr_writeback < bdi_thresh &&
> > 	    bdi->dirty_exceeded)
> > 		bdi->dirty_exceeded = 0;
> > 
> > So it then could call balance_dirty_pages a lot less often.
> 
> I guess in normal situations, clip_bdi_dirty_limit() is simply a
> no-op, or just lowers bdi_thresh slightly (otherwise it could be a bug).
> So it could be removed without causing many side effects, including
> the influence on dirty_exceeded.
> 
> > I've got an updated version of this patch that moves the clip_bdi logic
> > up into balance_dirty_pages; it should be closer to the existing
> > behavior, and tests so far look good. I can post it for comments if
> > you're interested?
> 
> > So I suggested just removing clip_bdi_dirty_limit(). To be sure, you could
> > run with the following patch and check whether big numbers show up.
> 
> Thanks,
> Fengguang
> ---
Yes, writing to one disk there's no difference, but what about writing
to multiple disks?

Can't we get into the situation where
	nr_reclaimable + nr_writeback >= dirty_threshold
and a bdi is
	bdi_nr_reclaimable + bdi_nr_writeback < bdi_thresh

With the old code that clips the bdi, the dirty_exceeded flag will stay
set, so balance_dirty_pages_ratelimited will check every 8 new dirty
pages.

And with the new code dirty_exceeded gets cleared and the ratelimiter
drops back to checking every 32 new pages.

I'm not sure what difference this will have in practice, but I don't
have a RAID array to test it so I'm just guessing.
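
(With 4KB pages that's the difference between re-entering
balance_dirty_pages() every 32KB dirtied and every 128KB dirtied,
assuming the ratelimit really is 8 vs 32 pages here.)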

regards
Richard

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: + mm-balance_dirty_pages-reduce-calls-to-global_page_state-to-reduce-c ache-references.patch added to -mm tree
  2009-09-03  9:48               ` Richard Kennedy
@ 2009-09-03 11:05                 ` Wu Fengguang
  2009-09-03 12:26                   ` Richard Kennedy
  0 siblings, 1 reply; 20+ messages in thread
From: Wu Fengguang @ 2009-09-03 11:05 UTC (permalink / raw)
  To: Richard Kennedy
  Cc: Peter Zijlstra, akpm@linux-foundation.org,
	mm-commits@vger.kernel.org, chris.mason@oracle.com,
	jens.axboe@oracle.com, mbligh@mbligh.org, miklos@szeredi.hu,
	linux-fsdevel@vger.kernel.org

On Thu, Sep 03, 2009 at 05:48:35PM +0800, Richard Kennedy wrote:
> On Thu, 2009-09-03 at 10:22 +0800, Wu Fengguang wrote:
> > On Wed, Sep 02, 2009 at 09:53:31PM +0800, Richard Kennedy wrote:
> > > On Wed, 2009-09-02 at 12:45 +0200, Peter Zijlstra wrote:
> > > > On Wed, 2009-09-02 at 17:57 +0800, Wu Fengguang wrote:
> > > > > On Wed, Sep 02, 2009 at 04:31:40PM +0800, Peter Zijlstra wrote:
> > > > > > On Sat, 2009-08-22 at 20:11 +0200, Peter Zijlstra wrote:
> > > > > > > > > +           /* always throttle if over threshold */
> > > > > > > > > +           if (nr_reclaimable + nr_writeback < dirty_thresh) {
> > > > > > > > 
> > > > > > > > That 'if' is a big behavior change. It effectively blocks everyone
> > > > > > > > and cancels Peter's proportional throttling work: the less a process
> > > > > > > > dirties, the less it should be throttled.
> > > > > > > 
> > > > > > > Hmm, I think you're right, I had not considered that, thanks for
> > > > > > > catching that.
> > > > > > 
> > > > > > So in retrospect I think I might have been wrong here.
> > > > > > 
> > > > > > The per task thing causes the bdi limit to be lower than the bdi limit
> > > > > > based on writeback speed alone. That is, the more a task dirties, the
> > > > > > lower the bdi limit is as seen for that task.
> > > > > 
> > > > > Right. If I understand it right, there will be a safety margin of about
> > > > > (1/8) * dirty_limit for 1 heavy dirtier case, and that gap scales down
> > > > > when there are more concurrent heavy dirtiers.
> > > > 
> > > > Right, with say 4 heavy writers the gap will be 1/4-th of 1/8-th, which
> > > > is 1/32-nd.
> > > > 
> > > > With the side note that I think 1/8 is too much on large memory systems,
> > > > and I have posted a sqrt patch numerous times, but I don't think we've
> > > > ever found out if that helps or not...
> > > > 
> > > > > In principle, the ceiling will be a bit higher for a light dirtier to
> > > > > make it easy to pass in the presence of more heavy dirtiers.
> > > > 
> > > > Right.
> > > > 
> > > > > > So if we get a task that generates tons of dirty pages (dd) then it
> > > > > > won't ever actually hit the full dirty limit, even if it's the only task
> > > > > > on the system, and this outer if() will always be true.
> > > > > 
> > > > > Right, we have the safety margin :)
> > > > > 
> > > > > > Only when we actually saturate the full dirty limit will we fall through
> > > > > > and throttle, but that is ok -- we want to enforce the full limit.
> > > > > > 
> > > > > > In short, a very aggressive dirtier will have a bdi limit lower than the
> > > > > > total limit (at all times) leaving a little room at the top for the
> > > > > > occasional dirtier to make quick progress.
> > > > > > 
> > > > > > Wu, does that cover the scenario you had in mind?
> > > > > 
> > > > > Yes thanks! Please correct me if wrong:
> > > > > - the lower-ceiling-for-heavier-dirtier algorithm in task_dirty_limit()
> > > > >   is elegant enough to prevent a heavy dirtier from blocking light ones
> > > > 
> > > > ack
> > > > 
> > > > > - the test (nr_reclaimable + nr_writeback < dirty_thresh) is not
> > > > >   relevant in the normal case, but can be kept for safety in the form of
> > > > > 
> > > > >           if (bdi_nr_reclaimable + bdi_nr_writeback < bdi_thresh &&
> > > > >               nr_reclaimable + nr_writeback < dirty_thresh)
> > > > >                   break;
> > > > 
> > > > ack
> > > > 
> > > > > - clip_bdi_dirty_limit() could be removed: we have been secured by the
> > > > >   above test
> > > > 
> > > > ack.
> > > 
> > > 
> > > I've noticed that there's a difference in the handling of the
> > > dirty_exceeded flag: because this change no longer clips the bdi_thresh,
> > > the flag may get cleared more quickly here:
> > > 
> > > 	if (bdi_nr_reclaimable + bdi_nr_writeback < bdi_thresh &&
> > > 	    bdi->dirty_exceeded)
> > > 		bdi->dirty_exceeded = 0;
> > > 
> > > So it then could call balance_dirty_pages a lot less often.
> > 
> > I guess in normal situations, clip_bdi_dirty_limit() is simply a
> > no-op, or just lowers bdi_thresh slightly (otherwise it could be a bug).
> > So it could be removed without causing many side effects, including
> > the influence on dirty_exceeded.
> > 
> > > I've got an updated version of this patch that moves the clip_bdi logic
> > > up into balance_dirty_pages; it should be closer to the existing
> > > behavior, and tests so far look good. I can post it for comments if
> > > you're interested?
> > 
> > > So I suggested just removing clip_bdi_dirty_limit(). To be sure, you could
> > > run with the following patch and check whether big numbers show up.
> > 
> > Thanks,
> > Fengguang
> > ---
> Yes, writing to one disk there's no difference, but what about writing
> to multiple disks?
> 
> Can't we get into the situation where
> 	nr_reclaimable + nr_writeback >= dirty_threshold
> and a bdi is
> 	bdi_nr_reclaimable + bdi_nr_writeback < bdi_thresh

This should also happen very infrequently. For the sake of safety we
could create a local variable

        dirty_exceeded = (bdi_nr_reclaimable + bdi_nr_writeback >= bdi_thresh) ||
                         (nr_reclaimable + nr_writeback >= dirty_threshold);

and use it throughout the function, i.e.

        if (!dirty_exceeded)
                break;
and
	if (!dirty_exceeded && bdi->dirty_exceeded)
		bdi->dirty_exceeded = 0;
?
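
Putting the pieces together, the function would end up shaped like
this (an untested sketch):

	for (;;) {
		/* ... read the global and per-bdi counters ... */
		dirty_exceeded =
			(bdi_nr_reclaimable + bdi_nr_writeback >= bdi_thresh)
			|| (nr_reclaimable + nr_writeback >= dirty_threshold);

		if (!dirty_exceeded)
			break;

		if (!bdi->dirty_exceeded)
			bdi->dirty_exceeded = 1;

		/* ... writeback and throttling ... */
	}

	if (!dirty_exceeded && bdi->dirty_exceeded)
		bdi->dirty_exceeded = 0;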

Thanks,
Fengguang

> With the old code that clips the bdi, the dirty_exceeded flag will stay
> set, so balance_dirty_pages_ratelimited will check every 8 new dirty
> pages.
> 
> And with the new code dirty_exceeded gets cleared and the ratelimiter
> drops back to checking every 32 new pages.
> 
> I'm not sure what difference this will have in practice, but I don't
> have a RAID array to test it so I'm just guessing.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: + mm-balance_dirty_pages-reduce-calls-to-global_page_state-to-reduce-c ache-references.patch added to -mm tree
  2009-09-03 11:05                 ` Wu Fengguang
@ 2009-09-03 12:26                   ` Richard Kennedy
  0 siblings, 0 replies; 20+ messages in thread
From: Richard Kennedy @ 2009-09-03 12:26 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Peter Zijlstra, akpm@linux-foundation.org,
	mm-commits@vger.kernel.org, chris.mason@oracle.com,
	jens.axboe@oracle.com, mbligh@mbligh.org, miklos@szeredi.hu,
	linux-fsdevel@vger.kernel.org

On Thu, 2009-09-03 at 19:05 +0800, Wu Fengguang wrote:
> On Thu, Sep 03, 2009 at 05:48:35PM +0800, Richard Kennedy wrote:
> > On Thu, 2009-09-03 at 10:22 +0800, Wu Fengguang wrote:
> > > On Wed, Sep 02, 2009 at 09:53:31PM +0800, Richard Kennedy wrote:
> > > > On Wed, 2009-09-02 at 12:45 +0200, Peter Zijlstra wrote:
> > > > > On Wed, 2009-09-02 at 17:57 +0800, Wu Fengguang wrote:
> > > > > > On Wed, Sep 02, 2009 at 04:31:40PM +0800, Peter Zijlstra wrote:
> > > > > > > On Sat, 2009-08-22 at 20:11 +0200, Peter Zijlstra wrote:
> > > > > > > > > > +           /* always throttle if over threshold */
> > > > > > > > > > +           if (nr_reclaimable + nr_writeback < dirty_thresh) {
> > > > > > > > > 
> > > > > > > > > That 'if' is a big behavior change. It effectively blocks everyone
> > > > > > > > > and cancels Peter's proportional throttling work: the less a process
> > > > > > > > > dirties, the less it should be throttled.
> > > > > > > > 
> > > > > > > > Hmm, I think you're right, I had not considered that, thanks for
> > > > > > > > catching that.
> > > > > > > 
> > > > > > > So in retrospect I think I might have been wrong here.
> > > > > > > 
> > > > > > > The per task thing causes the bdi limit to be lower than the bdi limit
> > > > > > > based on writeback speed alone. That is, the more a task dirties, the
> > > > > > > lower the bdi limit is as seen for that task.
> > > > > > 
> > > > > > Right. If I understand it right, there will be a safety margin of about
> > > > > > (1/8) * dirty_limit for 1 heavy dirtier case, and that gap scales down
> > > > > > when there are more concurrent heavy dirtiers.
> > > > > 
> > > > > Right, with say 4 heavy writers the gap will be 1/4-th of 1/8-th, which
> > > > > is 1/32-nd.
> > > > > 
> > > > > With the side note that I think 1/8 is too much on large memory systems,
> > > > > and I have posted a sqrt patch numerous times, but I don't think we've
> > > > > ever found out if that helps or not...
> > > > > 
> > > > > > In principle, the ceiling will be a bit higher for a light dirtier to
> > > > > > make it easy to pass in the presence of more heavy dirtiers.
> > > > > 
> > > > > Right.
> > > > > 
> > > > > > > So if we get a task that generates tons of dirty pages (dd) then it
> > > > > > > won't ever actually hit the full dirty limit, even if it's the only task
> > > > > > > on the system, and this outer if() will always be true.
> > > > > > 
> > > > > > Right, we have the safety margin :)
> > > > > > 
> > > > > > > Only when we actually saturate the full dirty limit will we fall through
> > > > > > > and throttle, but that is ok -- we want to enforce the full limit.
> > > > > > > 
> > > > > > > In short, a very aggressive dirtier will have a bdi limit lower than the
> > > > > > > total limit (at all times) leaving a little room at the top for the
> > > > > > > occasional dirtier to make quick progress.
> > > > > > > 
> > > > > > > Wu, does that cover the scenario you had in mind?
> > > > > > 
> > > > > > Yes thanks! Please correct me if wrong:
> > > > > > - the lower-ceiling-for-heavier-dirtier algorithm in task_dirty_limit()
> > > > > >   is elegant enough to prevent a heavy dirtier from blocking light ones
> > > > > 
> > > > > ack
> > > > > 
> > > > > > - the test (nr_reclaimable + nr_writeback < dirty_thresh) is not
> > > > > >   relevant in the normal case, but can be kept for safety in the form of
> > > > > > 
> > > > > >           if (bdi_nr_reclaimable + bdi_nr_writeback < bdi_thresh &&
> > > > > >               nr_reclaimable + nr_writeback < dirty_thresh)
> > > > > >                   break;
> > > > > 
> > > > > ack
> > > > > 
> > > > > > - clip_bdi_dirty_limit() could be removed: we have been secured by the
> > > > > >   above test
> > > > > 
> > > > > ack.
> > > > 
> > > > 
> > > > I've noticed that there's a difference in the handling of the
> > > > dirty_exceeded flag: because this change no longer clips the bdi_thresh,
> > > > the flag may get cleared more quickly here:
> > > > 
> > > > 	if (bdi_nr_reclaimable + bdi_nr_writeback < bdi_thresh &&
> > > > 	    bdi->dirty_exceeded)
> > > > 		bdi->dirty_exceeded = 0;
> > > > 
> > > > So it then could call balance_dirty_pages a lot less often.
> > > 
> > > I guess in normal situations, clip_bdi_dirty_limit() is simply a
> > > no-op, or just lowers bdi_thresh slightly (otherwise it could be a bug).
> > > So it could be removed without causing many side effects, including
> > > the influence on dirty_exceeded.
> > > 
> > > > I've got an updated version of this patch that moves the clip_bdi logic
> > > > up into balance_dirty_pages; it should be closer to the existing
> > > > behavior, and tests so far look good. I can post it for comments if
> > > > you're interested?
> > > 
> > > > So I suggested just removing clip_bdi_dirty_limit(). To be sure, you could
> > > > run with the following patch and check whether big numbers show up.
> > > 
> > > Thanks,
> > > Fengguang
> > > ---
> > Yes, writing to one disk there's no difference, but what about writing
> > to multiple disks?
> > 
> > Can't we get into the situation where
> > 	nr_reclaimable + nr_writeback >= dirty_threshold
> > and a bdi is
> > 	bdi_nr_reclaimable + bdi_nr_writeback < bdi_thresh
> 
> This should also happen very infrequently. For the sake of safety we
> could create a local variable
> 
>         dirty_exceeded = (bdi_nr_reclaimable + bdi_nr_writeback >= bdi_thresh) ||
>                          (nr_reclaimable + nr_writeback >= dirty_threshold);
> 
> and use it throughout the function, i.e.
> 
>         if (!dirty_exceeded)
>                 break;
> and
> 	if (!dirty_exceeded && bdi->dirty_exceeded)
> 		bdi->dirty_exceeded = 0;
> ?
> 
> Thanks,
> Fengguang
Yep, that sounds good, I'll give it a try.
regards
Richard


^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2009-09-03 12:26 UTC | newest]

Thread overview: 20+ messages
     [not found] <200908212250.n7LMox3g029154@imap1.linux-foundation.org>
2009-08-22  2:51 ` + mm-balance_dirty_pages-reduce-calls-to-global_page_state-to-reduce-c ache-references.patch added to -mm tree Wu Fengguang
2009-08-22 18:11   ` Peter Zijlstra
2009-08-23  1:32     ` Wu Fengguang
2009-08-23  5:31       ` Peter Zijlstra
2009-08-23  7:27         ` Wu Fengguang
2009-08-23  7:45           ` Peter Zijlstra
2009-09-02  8:31     ` Peter Zijlstra
2009-09-02  9:57       ` Wu Fengguang
2009-09-02 10:45         ` Peter Zijlstra
2009-09-02 13:53           ` Richard Kennedy
2009-09-03  2:22             ` Wu Fengguang
2009-09-03  3:09               ` Wu Fengguang
2009-09-03  9:48               ` Richard Kennedy
2009-09-03 11:05                 ` Wu Fengguang
2009-09-03 12:26                   ` Richard Kennedy
2009-09-03  4:53           ` Wu Fengguang
2009-08-23  9:33   ` Richard Kennedy
2009-08-23 13:00     ` Wu Fengguang
2009-08-23 13:46       ` Richard Kennedy
2009-08-24  1:41         ` Wu Fengguang
