[PATCH RFC 0/5] IO-less balance_dirty_pages() v2 (simple approach)

linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH RFC 0/5] IO-less balance_dirty_pages() v2 (simple approach)
@ 2011-03-08 22:31 Jan Kara
  2011-03-08 22:31 ` [PATCH 1/5] writeback: account per-bdi accumulated written pages Jan Kara
                   ` (6 more replies)
  0 siblings, 7 replies; 49+ messages in thread
From: Jan Kara @ 2011-03-08 22:31 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-mm, Wu Fengguang, Peter Zijlstra, Andrew Morton

  Hello,

  I'm posting second version of my IO-less balance_dirty_pages() patches. This
is alternative approach to Fengguang's patches - much simpler I believe (only
300 lines added) - but obviously I does not provide so sophisticated control.
Fengguang is currently running some tests on my patches so that we can compare
the approaches.

The basic idea (implemented in the third patch) is that processes throttled
in balance_dirty_pages() wait for enough IO to complete. The waiting is
implemented as follows: Whenever we decide to throttle a task in
balance_dirty_pages(), task adds itself to a list of tasks that are throttled
against that bdi and goes to sleep waiting to receive specified amount of page
IO completions. Once in a while (currently HZ/10, in patch 5 the interval is
autotuned based on observed IO rate), accumulated page IO completions are
distributed equally among waiting tasks.

This waiting scheme has been chosen so that waiting time in
balance_dirty_pages() is proportional to
  number_waited_pages * number_of_waiters.
In particular it does not depend on the total number of pages being waited for,
thus providing possibly a fairer results.

Since last version I've implemented cleanups as suggested by Peter Zilstra.
The patches undergone more throughout testing. So far I've tested different
filesystems (ext2, ext3, ext4, xfs, nfs), also a combination of a local
filesystem and nfs. The load was either various number of dd threads or
fio with several threads each dirtying pages at different speed.

Results and test scripts can be found at
  http://beta.suse.com/private/jack/balance_dirty_pages-v2/
See README file for some explanation of test framework, tests, and graphs.
Except for ext3 in data=ordered mode, where kjournald creates high
fluctuations in waiting time of throttled processes (and also high latencies),
the results look OK. Parallel dd threads are being throttled in the same way
(in a 2s window threads spend the same time waiting) and also latencies of
individual waits seem OK - except for ext3 they fit in 100 ms for local
filesystems. They are in 200-500 ms range for NFS, which isn't that nice but
to fix that we'd have to modify current ratelimiting scheme to take into
account on which bdi a page is dirtied. Then we could ratelimit slower BDIs
more often thus reducing latencies in individual waits...

The results for different bandwidths fio load is interesting. There are 8
threads dirtying pages at 1,2,4,..,128 MB/s rate. Due to different task
bdi dirty limits, what happens is that three most aggresive tasks get
throttled so they end up at bandwidths 24, 26, and 30 MB/s and the lighter
dirtiers run unthrottled.

I'm planning to run some tests with multiple SATA drives to verify whether
there aren't some unexpected fluctuations. But currently I have some trouble
with the HW...

As usual comments are welcome :).

								Honza

^ permalink raw reply	[flat|nested] 49+ messages in thread

* [PATCH 1/5] writeback: account per-bdi accumulated written pages
  2011-03-08 22:31 [PATCH RFC 0/5] IO-less balance_dirty_pages() v2 (simple approach) Jan Kara
@ 2011-03-08 22:31 ` Jan Kara
  2011-03-08 22:31 ` [PATCH 2/5] mm: Properly reflect task dirty limits in dirty_exceeded logic Jan Kara
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 49+ messages in thread
From: Jan Kara @ 2011-03-08 22:31 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: linux-mm, Wu Fengguang, Peter Zijlstra, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner

Introduce the BDI_WRITTEN counter. It will be used for waking up
waiters in balance_dirty_pages().

Peter Zijlstra <a.p.zijlstra@chello.nl>:
Move BDI_WRITTEN accounting into __bdi_writeout_inc().
This will cover and fix fuse, which only calls bdi_writeout_inc().

CC: Andrew Morton <akpm@linux-foundation.org>
CC: Christoph Hellwig <hch@infradead.org>
CC: Dave Chinner <david@fromorbit.com>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 include/linux/backing-dev.h |    1 +
 mm/backing-dev.c            |    6 ++++--
 mm/page-writeback.c         |    1 +
 3 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index 4ce34fa..63ab4a5 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -40,6 +40,7 @@ typedef int (congested_fn)(void *, int);
 enum bdi_stat_item {
 	BDI_RECLAIMABLE,
 	BDI_WRITEBACK,
+	BDI_WRITTEN,
 	NR_BDI_STAT_ITEMS
 };
 
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 027100d..4d14072 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -92,6 +92,7 @@ static int bdi_debug_stats_show(struct seq_file *m, void *v)
 		   "BdiDirtyThresh:   %8lu kB\n"
 		   "DirtyThresh:      %8lu kB\n"
 		   "BackgroundThresh: %8lu kB\n"
+		   "BdiWritten:       %8lu kB\n"
 		   "b_dirty:          %8lu\n"
 		   "b_io:             %8lu\n"
 		   "b_more_io:        %8lu\n"
@@ -99,8 +100,9 @@ static int bdi_debug_stats_show(struct seq_file *m, void *v)
 		   "state:            %8lx\n",
 		   (unsigned long) K(bdi_stat(bdi, BDI_WRITEBACK)),
 		   (unsigned long) K(bdi_stat(bdi, BDI_RECLAIMABLE)),
-		   K(bdi_thresh), K(dirty_thresh),
-		   K(background_thresh), nr_dirty, nr_io, nr_more_io,
+		   K(bdi_thresh), K(dirty_thresh), K(background_thresh),
+		   (unsigned long) K(bdi_stat(bdi, BDI_WRITTEN)),
+		   nr_dirty, nr_io, nr_more_io,
 		   !list_empty(&bdi->bdi_list), bdi->state);
 #undef K
 
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 2cb01f6..c472c1c 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -219,6 +219,7 @@ int dirty_bytes_handler(struct ctl_table *table, int write,
  */
 static inline void __bdi_writeout_inc(struct backing_dev_info *bdi)
 {
+	__inc_bdi_stat(bdi, BDI_WRITTEN);
 	__prop_inc_percpu_max(&vm_completions, &bdi->completions,
 			      bdi->max_prop_frac);
 }
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH 2/5] mm: Properly reflect task dirty limits in dirty_exceeded logic
  2011-03-08 22:31 [PATCH RFC 0/5] IO-less balance_dirty_pages() v2 (simple approach) Jan Kara
  2011-03-08 22:31 ` [PATCH 1/5] writeback: account per-bdi accumulated written pages Jan Kara
@ 2011-03-08 22:31 ` Jan Kara
  2011-03-09 21:02   ` Vivek Goyal
  2011-03-08 22:31 ` [PATCH 3/5] mm: Implement IO-less balance_dirty_pages() Jan Kara
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 49+ messages in thread
From: Jan Kara @ 2011-03-08 22:31 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: linux-mm, Wu Fengguang, Peter Zijlstra, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner

We set bdi->dirty_exceeded (and thus ratelimiting code starts to
call balance_dirty_pages() every 8 pages) when a per-bdi limit is
exceeded or global limit is exceeded. But per-bdi limit also depends
on the task. Thus different tasks reach the limit on that bdi at
different levels of dirty pages. The result is that with current code
bdi->dirty_exceeded ping-ponged between 1 and 0 depending on which task
just got into balance_dirty_pages().

We fix the issue by clearing bdi->dirty_exceeded only when per-bdi amount
of dirty pages drops below the threshold (7/8 * bdi_dirty_limit) where task
limits already do not have any influence.

CC: Andrew Morton <akpm@linux-foundation.org>
CC: Christoph Hellwig <hch@infradead.org>
CC: Dave Chinner <david@fromorbit.com>
CC: Wu Fengguang <fengguang.wu@intel.com>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 mm/page-writeback.c |   18 ++++++++++++++++--
 1 files changed, 16 insertions(+), 2 deletions(-)

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index c472c1c..f388f70 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -275,12 +275,13 @@ static inline void task_dirties_fraction(struct task_struct *tsk,
  * effectively curb the growth of dirty pages. Light dirtiers with high enough
  * dirty threshold may never get throttled.
  */
+#define TASK_LIMIT_FRACTION 8
 static unsigned long task_dirty_limit(struct task_struct *tsk,
 				       unsigned long bdi_dirty)
 {
 	long numerator, denominator;
 	unsigned long dirty = bdi_dirty;
-	u64 inv = dirty >> 3;
+	u64 inv = dirty / TASK_LIMIT_FRACTION;
 
 	task_dirties_fraction(tsk, &numerator, &denominator);
 	inv *= numerator;
@@ -291,6 +292,12 @@ static unsigned long task_dirty_limit(struct task_struct *tsk,
 	return max(dirty, bdi_dirty/2);
 }
 
+/* Minimum limit for any task */
+static unsigned long task_min_dirty_limit(unsigned long bdi_dirty)
+{
+	return bdi_dirty - bdi_dirty / TASK_LIMIT_FRACTION;
+}
+
 /*
  *
  */
@@ -484,9 +491,11 @@ static void balance_dirty_pages(struct address_space *mapping,
 	unsigned long background_thresh;
 	unsigned long dirty_thresh;
 	unsigned long bdi_thresh;
+	unsigned long min_bdi_thresh = ULONG_MAX;
 	unsigned long pages_written = 0;
 	unsigned long pause = 1;
 	bool dirty_exceeded = false;
+	bool min_dirty_exceeded = false;
 	struct backing_dev_info *bdi = mapping->backing_dev_info;
 
 	for (;;) {
@@ -513,6 +522,7 @@ static void balance_dirty_pages(struct address_space *mapping,
 			break;
 
 		bdi_thresh = bdi_dirty_limit(bdi, dirty_thresh);
+		min_bdi_thresh = task_min_dirty_limit(bdi_thresh);
 		bdi_thresh = task_dirty_limit(current, bdi_thresh);
 
 		/*
@@ -542,6 +552,9 @@ static void balance_dirty_pages(struct address_space *mapping,
 		dirty_exceeded =
 			(bdi_nr_reclaimable + bdi_nr_writeback > bdi_thresh)
 			|| (nr_reclaimable + nr_writeback > dirty_thresh);
+		min_dirty_exceeded =
+			(bdi_nr_reclaimable + bdi_nr_writeback > min_bdi_thresh)
+			|| (nr_reclaimable + nr_writeback > dirty_thresh);
 
 		if (!dirty_exceeded)
 			break;
@@ -579,7 +592,8 @@ static void balance_dirty_pages(struct address_space *mapping,
 			pause = HZ / 10;
 	}
 
-	if (!dirty_exceeded && bdi->dirty_exceeded)
+	/* Clear dirty_exceeded flag only when no task can exceed the limit */
+	if (!min_dirty_exceeded && bdi->dirty_exceeded)
 		bdi->dirty_exceeded = 0;
 
 	if (writeback_in_progress(bdi))
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* Re: [PATCH 2/5] mm: Properly reflect task dirty limits in dirty_exceeded logic
  2011-03-08 22:31 ` [PATCH 2/5] mm: Properly reflect task dirty limits in dirty_exceeded logic Jan Kara
@ 2011-03-09 21:02   ` Vivek Goyal
  2011-03-14 20:44     ` Jan Kara
  0 siblings, 1 reply; 49+ messages in thread
From: Vivek Goyal @ 2011-03-09 21:02 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-fsdevel, linux-mm, Wu Fengguang, Peter Zijlstra,
	Andrew Morton, Christoph Hellwig, Dave Chinner

On Tue, Mar 08, 2011 at 11:31:12PM +0100, Jan Kara wrote:
> We set bdi->dirty_exceeded (and thus ratelimiting code starts to
> call balance_dirty_pages() every 8 pages) when a per-bdi limit is
> exceeded or global limit is exceeded. But per-bdi limit also depends
> on the task. Thus different tasks reach the limit on that bdi at
> different levels of dirty pages. The result is that with current code
> bdi->dirty_exceeded ping-ponged between 1 and 0 depending on which task
> just got into balance_dirty_pages().
> 
> We fix the issue by clearing bdi->dirty_exceeded only when per-bdi amount
> of dirty pages drops below the threshold (7/8 * bdi_dirty_limit) where task
> limits already do not have any influence.
> 
> CC: Andrew Morton <akpm@linux-foundation.org>
> CC: Christoph Hellwig <hch@infradead.org>
> CC: Dave Chinner <david@fromorbit.com>
> CC: Wu Fengguang <fengguang.wu@intel.com>
> CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
> Signed-off-by: Jan Kara <jack@suse.cz>
> ---
>  mm/page-writeback.c |   18 ++++++++++++++++--
>  1 files changed, 16 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index c472c1c..f388f70 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -275,12 +275,13 @@ static inline void task_dirties_fraction(struct task_struct *tsk,
>   * effectively curb the growth of dirty pages. Light dirtiers with high enough
>   * dirty threshold may never get throttled.
>   */
> +#define TASK_LIMIT_FRACTION 8
>  static unsigned long task_dirty_limit(struct task_struct *tsk,
>  				       unsigned long bdi_dirty)
>  {
>  	long numerator, denominator;
>  	unsigned long dirty = bdi_dirty;
> -	u64 inv = dirty >> 3;
> +	u64 inv = dirty / TASK_LIMIT_FRACTION;
>  
>  	task_dirties_fraction(tsk, &numerator, &denominator);
>  	inv *= numerator;
> @@ -291,6 +292,12 @@ static unsigned long task_dirty_limit(struct task_struct *tsk,
>  	return max(dirty, bdi_dirty/2);
>  }
>  
> +/* Minimum limit for any task */
> +static unsigned long task_min_dirty_limit(unsigned long bdi_dirty)
> +{
> +	return bdi_dirty - bdi_dirty / TASK_LIMIT_FRACTION;
> +}
> +

Hi Jan,

Should the above be called bdi_min_dirty_limit()? In essense we seem to
be setting bdi->bdi_exceeded when dirty pages on bdi cross bdi_thresh and
clear it when dirty pages on bdi are below 7/8*bdi_thresh. So there does
not seem to be any dependency on task dirty limit here hence string
"task" sounds confusing to me. In fact, would bdi_dirty_exceeded_clear_thresh()
be a better name?
 
>  /*
>   *
>   */
> @@ -484,9 +491,11 @@ static void balance_dirty_pages(struct address_space *mapping,
>  	unsigned long background_thresh;
>  	unsigned long dirty_thresh;
>  	unsigned long bdi_thresh;
> +	unsigned long min_bdi_thresh = ULONG_MAX;
>  	unsigned long pages_written = 0;
>  	unsigned long pause = 1;
>  	bool dirty_exceeded = false;
> +	bool min_dirty_exceeded = false;
>  	struct backing_dev_info *bdi = mapping->backing_dev_info;
>  
>  	for (;;) {
> @@ -513,6 +522,7 @@ static void balance_dirty_pages(struct address_space *mapping,
>  			break;
>  
>  		bdi_thresh = bdi_dirty_limit(bdi, dirty_thresh);
> +		min_bdi_thresh = task_min_dirty_limit(bdi_thresh);
>  		bdi_thresh = task_dirty_limit(current, bdi_thresh);
                ^^^^^
This patch aside, we use bdi_thresh name both for bdi threshold as well
as per task per bdi threshold. will task_bdi_thresh be a better name
here.

>  
>  		/*
> @@ -542,6 +552,9 @@ static void balance_dirty_pages(struct address_space *mapping,
>  		dirty_exceeded =
>  			(bdi_nr_reclaimable + bdi_nr_writeback > bdi_thresh)
>  			|| (nr_reclaimable + nr_writeback > dirty_thresh);
> +		min_dirty_exceeded =
> +			(bdi_nr_reclaimable + bdi_nr_writeback > min_bdi_thresh)
> +			|| (nr_reclaimable + nr_writeback > dirty_thresh);

Would following be easier to understand.

		clear_dirty_exceeded =
			(bdi_nr_reclaimable + bdi_nr_writeback <
				dirty_exceeded_reset_thresh)
			&& (nr_reclaimable + nr_writeback < dirty_thresh);

>  
>  		if (!dirty_exceeded)
>  			break;
> @@ -579,7 +592,8 @@ static void balance_dirty_pages(struct address_space *mapping,
>  			pause = HZ / 10;
>  	}
>  
> -	if (!dirty_exceeded && bdi->dirty_exceeded)
> +	/* Clear dirty_exceeded flag only when no task can exceed the limit */
> +	if (!min_dirty_exceeded && bdi->dirty_exceeded)
>  		bdi->dirty_exceeded = 0;

similiarly...

	if (bdi->dirty_exceeded && clear_dirty_exceeded)
		bdi->dirty_exceeded = 0;

I was confused with the term min_dirty_limit and task_min_dirty_limit()
for sometime as patch said that we are trying to move away from dependence
on task specific bdi_thres for clearing bdi->bdi_thresh. May be it is
just me...
 
Thanks
Vivek 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 2/5] mm: Properly reflect task dirty limits in dirty_exceeded logic
  2011-03-09 21:02   ` Vivek Goyal
@ 2011-03-14 20:44     ` Jan Kara
  2011-03-15 15:21       ` Vivek Goyal
  0 siblings, 1 reply; 49+ messages in thread
From: Jan Kara @ 2011-03-14 20:44 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Jan Kara, linux-fsdevel, linux-mm, Wu Fengguang, Peter Zijlstra,
	Andrew Morton, Christoph Hellwig, Dave Chinner

On Wed 09-03-11 16:02:53, Vivek Goyal wrote:
> On Tue, Mar 08, 2011 at 11:31:12PM +0100, Jan Kara wrote:
> > @@ -291,6 +292,12 @@ static unsigned long task_dirty_limit(struct task_struct *tsk,
> >  	return max(dirty, bdi_dirty/2);
> >  }
> >  
> > +/* Minimum limit for any task */
> > +static unsigned long task_min_dirty_limit(unsigned long bdi_dirty)
> > +{
> > +	return bdi_dirty - bdi_dirty / TASK_LIMIT_FRACTION;
> > +}
> > +
> Should the above be called bdi_min_dirty_limit()? In essense we seem to
> be setting bdi->bdi_exceeded when dirty pages on bdi cross bdi_thresh and
> clear it when dirty pages on bdi are below 7/8*bdi_thresh. So there does
> not seem to be any dependency on task dirty limit here hence string
> "task" sounds confusing to me. In fact, would
> bdi_dirty_exceeded_clear_thresh() be a better name?
  See below...
  
> >  /*
> >   *
> >   */
> > @@ -484,9 +491,11 @@ static void balance_dirty_pages(struct address_space *mapping,
> >  	unsigned long background_thresh;
> >  	unsigned long dirty_thresh;
> >  	unsigned long bdi_thresh;
> > +	unsigned long min_bdi_thresh = ULONG_MAX;
> >  	unsigned long pages_written = 0;
> >  	unsigned long pause = 1;
> >  	bool dirty_exceeded = false;
> > +	bool min_dirty_exceeded = false;
> >  	struct backing_dev_info *bdi = mapping->backing_dev_info;
> >  
> >  	for (;;) {
> > @@ -513,6 +522,7 @@ static void balance_dirty_pages(struct address_space *mapping,
> >  			break;
> >  
> >  		bdi_thresh = bdi_dirty_limit(bdi, dirty_thresh);
> > +		min_bdi_thresh = task_min_dirty_limit(bdi_thresh);
> >  		bdi_thresh = task_dirty_limit(current, bdi_thresh);
>                 ^^^^^
> This patch aside, we use bdi_thresh name both for bdi threshold as well
> as per task per bdi threshold. will task_bdi_thresh be a better name
> here.
  I agree that the naming is a bit confusing altough it is traditional :).
The renaming to task_bdi_thresh makes sense to me. Then we could name the
limit when we clear dirty_exceeded as: min_task_bdi_thresh(). The task in
the name tries to say that this is a limit for "any task" so I'd like to
keep it there. What do you think?

> > @@ -542,6 +552,9 @@ static void balance_dirty_pages(struct address_space *mapping,
> >  		dirty_exceeded =
> >  			(bdi_nr_reclaimable + bdi_nr_writeback > bdi_thresh)
> >  			|| (nr_reclaimable + nr_writeback > dirty_thresh);
> > +		min_dirty_exceeded =
> > +			(bdi_nr_reclaimable + bdi_nr_writeback > min_bdi_thresh)
> > +			|| (nr_reclaimable + nr_writeback > dirty_thresh);
> 
> Would following be easier to understand.
> 
> 		clear_dirty_exceeded =
> 			(bdi_nr_reclaimable + bdi_nr_writeback <
> 				dirty_exceeded_reset_thresh)
> 			&& (nr_reclaimable + nr_writeback < dirty_thresh);
  Yes, this looks better. I'll change it.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 2/5] mm: Properly reflect task dirty limits in dirty_exceeded logic
  2011-03-14 20:44     ` Jan Kara
@ 2011-03-15 15:21       ` Vivek Goyal
  0 siblings, 0 replies; 49+ messages in thread
From: Vivek Goyal @ 2011-03-15 15:21 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-fsdevel, linux-mm, Wu Fengguang, Peter Zijlstra,
	Andrew Morton, Christoph Hellwig, Dave Chinner

On Mon, Mar 14, 2011 at 09:44:18PM +0100, Jan Kara wrote:
> On Wed 09-03-11 16:02:53, Vivek Goyal wrote:
> > On Tue, Mar 08, 2011 at 11:31:12PM +0100, Jan Kara wrote:
> > > @@ -291,6 +292,12 @@ static unsigned long task_dirty_limit(struct task_struct *tsk,
> > >  	return max(dirty, bdi_dirty/2);
> > >  }
> > >  
> > > +/* Minimum limit for any task */
> > > +static unsigned long task_min_dirty_limit(unsigned long bdi_dirty)
> > > +{
> > > +	return bdi_dirty - bdi_dirty / TASK_LIMIT_FRACTION;
> > > +}
> > > +
> > Should the above be called bdi_min_dirty_limit()? In essense we seem to
> > be setting bdi->bdi_exceeded when dirty pages on bdi cross bdi_thresh and
> > clear it when dirty pages on bdi are below 7/8*bdi_thresh. So there does
> > not seem to be any dependency on task dirty limit here hence string
> > "task" sounds confusing to me. In fact, would
> > bdi_dirty_exceeded_clear_thresh() be a better name?
>   See below...
>   
> > >  /*
> > >   *
> > >   */
> > > @@ -484,9 +491,11 @@ static void balance_dirty_pages(struct address_space *mapping,
> > >  	unsigned long background_thresh;
> > >  	unsigned long dirty_thresh;
> > >  	unsigned long bdi_thresh;
> > > +	unsigned long min_bdi_thresh = ULONG_MAX;
> > >  	unsigned long pages_written = 0;
> > >  	unsigned long pause = 1;
> > >  	bool dirty_exceeded = false;
> > > +	bool min_dirty_exceeded = false;
> > >  	struct backing_dev_info *bdi = mapping->backing_dev_info;
> > >  
> > >  	for (;;) {
> > > @@ -513,6 +522,7 @@ static void balance_dirty_pages(struct address_space *mapping,
> > >  			break;
> > >  
> > >  		bdi_thresh = bdi_dirty_limit(bdi, dirty_thresh);
> > > +		min_bdi_thresh = task_min_dirty_limit(bdi_thresh);
> > >  		bdi_thresh = task_dirty_limit(current, bdi_thresh);
> >                 ^^^^^
> > This patch aside, we use bdi_thresh name both for bdi threshold as well
> > as per task per bdi threshold. will task_bdi_thresh be a better name
> > here.
>   I agree that the naming is a bit confusing altough it is traditional :).
> The renaming to task_bdi_thresh makes sense to me. Then we could name the
> limit when we clear dirty_exceeded as: min_task_bdi_thresh(). The task in
> the name tries to say that this is a limit for "any task" so I'd like to
> keep it there. What do you think?

Ok, so for a task, minimum task_bdi_thresh can be
		(bdi_dirty - bdi_dirty / TASK_LIMIT_FRACTION).

So min_task_dirty_limit() makes sense. Or if you happen to rename above
"bdi_thresh" to "task_bdi_thresh" then "min_task_bdi_thresh()" might
be even better. It is up to you depending on context of your later patches.

Thanks
Vivek

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* [PATCH 3/5] mm: Implement IO-less balance_dirty_pages()
  2011-03-08 22:31 [PATCH RFC 0/5] IO-less balance_dirty_pages() v2 (simple approach) Jan Kara
  2011-03-08 22:31 ` [PATCH 1/5] writeback: account per-bdi accumulated written pages Jan Kara
  2011-03-08 22:31 ` [PATCH 2/5] mm: Properly reflect task dirty limits in dirty_exceeded logic Jan Kara
@ 2011-03-08 22:31 ` Jan Kara
  2011-03-10  0:07   ` Vivek Goyal
  2011-03-16 16:53   ` Vivek Goyal
  2011-03-08 22:31 ` [PATCH 4/5] mm: Remove low limit from sync_writeback_pages() Jan Kara
                   ` (3 subsequent siblings)
  6 siblings, 2 replies; 49+ messages in thread
From: Jan Kara @ 2011-03-08 22:31 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: linux-mm, Wu Fengguang, Peter Zijlstra, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner

This patch changes balance_dirty_pages() throttling so that the function does
not submit writes on its own but rather waits for flusher thread to do enough
writes. This has an advantage that we have a single source of IO allowing for
better writeback locality. Also we do not have to reenter filesystems from a
non-trivial context.

The waiting is implemented as follows: Whenever we decide to throttle a task in
balance_dirty_pages(), task adds itself to a list of tasks that are throttled
against that bdi and goes to sleep waiting to receive specified amount of page
IO completions. Once in a while (currently HZ/10, later the interval should be
autotuned based on observed IO completion rate), accumulated page IO
completions are distributed equally among waiting tasks.

This waiting scheme has been chosen so that waiting time in
balance_dirty_pages() is proportional to
  number_waited_pages * number_of_waiters.
In particular it does not depend on the total number of pages being waited for,
thus providing possibly a fairer results. Note that the dependency on the
number of waiters is inevitable, since all the waiters compete for a common
resource so their number has to be somehow reflected in waiting time.

CC: Andrew Morton <akpm@linux-foundation.org>
CC: Christoph Hellwig <hch@infradead.org>
CC: Dave Chinner <david@fromorbit.com>
CC: Wu Fengguang <fengguang.wu@intel.com>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 include/linux/backing-dev.h      |    7 +
 include/linux/writeback.h        |    1 +
 include/trace/events/writeback.h |   65 +++++++-
 mm/backing-dev.c                 |    8 +
 mm/page-writeback.c              |  345 +++++++++++++++++++++++++-------------
 5 files changed, 310 insertions(+), 116 deletions(-)

diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index 63ab4a5..65b6e61 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -89,6 +89,13 @@ struct backing_dev_info {
 
 	struct timer_list laptop_mode_wb_timer;
 
+	spinlock_t balance_lock;	/* lock protecting four entries below */
+	unsigned long written_start;	/* BDI_WRITTEN last time we scanned balance_list*/
+	struct list_head balance_list;	/* waiters in balance_dirty_pages */
+	unsigned int balance_waiters;	/* number of waiters in the list */
+	struct delayed_work balance_work;	/* work distributing page
+						   completions among waiters */
+
 #ifdef CONFIG_DEBUG_FS
 	struct dentry *debug_dir;
 	struct dentry *debug_stats;
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index 0ead399..901c33f 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -129,6 +129,7 @@ unsigned long bdi_dirty_limit(struct backing_dev_info *bdi,
 			       unsigned long dirty);
 
 void page_writeback_init(void);
+void distribute_page_completions(struct work_struct *work);
 void balance_dirty_pages_ratelimited_nr(struct address_space *mapping,
 					unsigned long nr_pages_dirtied);
 
diff --git a/include/trace/events/writeback.h b/include/trace/events/writeback.h
index 4e249b9..74815fa 100644
--- a/include/trace/events/writeback.h
+++ b/include/trace/events/writeback.h
@@ -147,11 +147,70 @@ DEFINE_EVENT(wbc_class, name, \
 DEFINE_WBC_EVENT(wbc_writeback_start);
 DEFINE_WBC_EVENT(wbc_writeback_written);
 DEFINE_WBC_EVENT(wbc_writeback_wait);
-DEFINE_WBC_EVENT(wbc_balance_dirty_start);
-DEFINE_WBC_EVENT(wbc_balance_dirty_written);
-DEFINE_WBC_EVENT(wbc_balance_dirty_wait);
 DEFINE_WBC_EVENT(wbc_writepage);
 
+TRACE_EVENT(writeback_balance_dirty_pages_waiting,
+	TP_PROTO(struct backing_dev_info *bdi, unsigned long pages),
+	TP_ARGS(bdi, pages),
+	TP_STRUCT__entry(
+		__array(char, name, 32)
+		__field(unsigned long, pages)
+	),
+	TP_fast_assign(
+		strncpy(__entry->name, dev_name(bdi->dev), 32);
+		__entry->pages = pages;
+	),
+	TP_printk("bdi=%s pages=%lu",
+		  __entry->name, __entry->pages
+	)
+);
+
+TRACE_EVENT(writeback_balance_dirty_pages_woken,
+	TP_PROTO(struct backing_dev_info *bdi),
+	TP_ARGS(bdi),
+	TP_STRUCT__entry(
+		__array(char, name, 32)
+	),
+	TP_fast_assign(
+		strncpy(__entry->name, dev_name(bdi->dev), 32);
+	),
+	TP_printk("bdi=%s",
+		  __entry->name
+	)
+);
+
+TRACE_EVENT(writeback_distribute_page_completions,
+	TP_PROTO(struct backing_dev_info *bdi, unsigned long written),
+	TP_ARGS(bdi, written),
+	TP_STRUCT__entry(
+		__array(char, name, 32)
+		__field(unsigned long, start)
+		__field(unsigned long, written)
+	),
+	TP_fast_assign(
+		strncpy(__entry->name, dev_name(bdi->dev), 32);
+		__entry->start = bdi->written_start;
+		__entry->written = written - bdi->written_start;
+	),
+	TP_printk("bdi=%s written_start=%lu to_distribute=%lu",
+		  __entry->name, __entry->start, __entry->written
+	)
+);
+
+TRACE_EVENT(writeback_distribute_page_completions_wakeall,
+	TP_PROTO(struct backing_dev_info *bdi),
+	TP_ARGS(bdi),
+	TP_STRUCT__entry(
+		__array(char, name, 32)
+	),
+	TP_fast_assign(
+		strncpy(__entry->name, dev_name(bdi->dev), 32);
+	),
+	TP_printk("bdi=%s",
+		  __entry->name
+	)
+);
+
 DECLARE_EVENT_CLASS(writeback_congest_waited_template,
 
 	TP_PROTO(unsigned int usec_timeout, unsigned int usec_delayed),
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 4d14072..2ecc3fe 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -652,6 +652,12 @@ int bdi_init(struct backing_dev_info *bdi)
 	INIT_LIST_HEAD(&bdi->bdi_list);
 	INIT_LIST_HEAD(&bdi->work_list);
 
+	spin_lock_init(&bdi->balance_lock);
+	INIT_LIST_HEAD(&bdi->balance_list);
+	bdi->written_start = 0;
+	bdi->balance_waiters = 0;
+	INIT_DELAYED_WORK(&bdi->balance_work, distribute_page_completions);
+
 	bdi_wb_init(&bdi->wb, bdi);
 
 	for (i = 0; i < NR_BDI_STAT_ITEMS; i++) {
@@ -691,6 +697,8 @@ void bdi_destroy(struct backing_dev_info *bdi)
 		spin_unlock(&inode_lock);
 	}
 
+	cancel_delayed_work_sync(&bdi->balance_work);
+	WARN_ON(!list_empty(&bdi->balance_list));
 	bdi_unregister(bdi);
 
 	for (i = 0; i < NR_BDI_STAT_ITEMS; i++)
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index f388f70..697dd8e 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -132,6 +132,17 @@ static struct prop_descriptor vm_completions;
 static struct prop_descriptor vm_dirties;
 
 /*
+ * Item a process queues to bdi list in balance_dirty_pages() when it gets
+ * throttled
+ */
+struct balance_waiter {
+	struct list_head bw_list;
+	unsigned long bw_wait_pages;	/* Number of pages to wait for to
+					   get written */
+	struct task_struct *bw_task;	/* Task waiting for IO */
+};
+
+/*
  * couple the period to the dirty_ratio:
  *
  *   period/2 ~ roundup_pow_of_two(dirty limit)
@@ -476,140 +487,248 @@ unsigned long bdi_dirty_limit(struct backing_dev_info *bdi, unsigned long dirty)
 	return bdi_dirty;
 }
 
-/*
- * balance_dirty_pages() must be called by processes which are generating dirty
- * data.  It looks at the number of dirty pages in the machine and will force
- * the caller to perform writeback if the system is over `vm_dirty_ratio'.
- * If we're over `background_thresh' then the writeback threads are woken to
- * perform some writeout.
- */
-static void balance_dirty_pages(struct address_space *mapping,
-				unsigned long write_chunk)
-{
-	long nr_reclaimable, bdi_nr_reclaimable;
-	long nr_writeback, bdi_nr_writeback;
+struct dirty_limit_state {
+	long nr_reclaimable;
+	long nr_writeback;
+	long bdi_nr_reclaimable;
+	long bdi_nr_writeback;
 	unsigned long background_thresh;
 	unsigned long dirty_thresh;
 	unsigned long bdi_thresh;
-	unsigned long min_bdi_thresh = ULONG_MAX;
-	unsigned long pages_written = 0;
-	unsigned long pause = 1;
-	bool dirty_exceeded = false;
-	bool min_dirty_exceeded = false;
-	struct backing_dev_info *bdi = mapping->backing_dev_info;
+};
 
-	for (;;) {
-		struct writeback_control wbc = {
-			.sync_mode	= WB_SYNC_NONE,
-			.older_than_this = NULL,
-			.nr_to_write	= write_chunk,
-			.range_cyclic	= 1,
-		};
+static void get_global_dirty_limit_state(struct dirty_limit_state *st)
+{
+	/*
+	 * Note: nr_reclaimable denotes nr_dirty + nr_unstable.  Unstable
+	 * writes are a feature of certain networked filesystems (i.e. NFS) in
+	 * which data may have been written to the server's write cache, but
+	 * has not yet been flushed to permanent storage.
+	 */
+	st->nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
+				global_page_state(NR_UNSTABLE_NFS);
+	st->nr_writeback = global_page_state(NR_WRITEBACK);
 
-		nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
-					global_page_state(NR_UNSTABLE_NFS);
-		nr_writeback = global_page_state(NR_WRITEBACK);
+	global_dirty_limits(&st->background_thresh, &st->dirty_thresh);
+}
 
-		global_dirty_limits(&background_thresh, &dirty_thresh);
+/* This function expects global state to be already filled in! */
+static void get_bdi_dirty_limit_state(struct backing_dev_info *bdi,
+				      struct dirty_limit_state *st)
+{
+	unsigned long min_bdi_thresh;
 
-		/*
-		 * Throttle it only when the background writeback cannot
-		 * catch-up. This avoids (excessively) small writeouts
-		 * when the bdi limits are ramping up.
-		 */
-		if (nr_reclaimable + nr_writeback <=
-				(background_thresh + dirty_thresh) / 2)
-			break;
+	st->bdi_thresh = bdi_dirty_limit(bdi, st->dirty_thresh);
+	min_bdi_thresh = task_min_dirty_limit(st->bdi_thresh);
+	/*
+	 * In order to avoid the stacked BDI deadlock we need to ensure we
+	 * accurately count the 'dirty' pages when the threshold is low.
+	 *
+	 * Otherwise it would be possible to get thresh+n pages reported dirty,
+	 * even though there are thresh-m pages actually dirty; with m+n
+	 * sitting in the percpu deltas.
+	 */
+	if (min_bdi_thresh < 2*bdi_stat_error(bdi)) {
+		st->bdi_nr_reclaimable = bdi_stat_sum(bdi, BDI_RECLAIMABLE);
+		st->bdi_nr_writeback = bdi_stat_sum(bdi, BDI_WRITEBACK);
+	} else {
+		st->bdi_nr_reclaimable = bdi_stat(bdi, BDI_RECLAIMABLE);
+		st->bdi_nr_writeback = bdi_stat(bdi, BDI_WRITEBACK);
+	}
+}
 
-		bdi_thresh = bdi_dirty_limit(bdi, dirty_thresh);
-		min_bdi_thresh = task_min_dirty_limit(bdi_thresh);
-		bdi_thresh = task_dirty_limit(current, bdi_thresh);
+/* Possibly states of dirty memory for BDI */
+enum {
+	DIRTY_OK,			/* Everything below limit */
+	DIRTY_EXCEED_BACKGROUND,	/* Backround writeback limit exceeded */
+	DIRTY_MAY_EXCEED_LIMIT,		/* Some task may exceed its dirty limit */
+	DIRTY_EXCEED_LIMIT,		/* Global dirty limit exceeded */
+};
 
-		/*
-		 * In order to avoid the stacked BDI deadlock we need
-		 * to ensure we accurately count the 'dirty' pages when
-		 * the threshold is low.
-		 *
-		 * Otherwise it would be possible to get thresh+n pages
-		 * reported dirty, even though there are thresh-m pages
-		 * actually dirty; with m+n sitting in the percpu
-		 * deltas.
-		 */
-		if (bdi_thresh < 2*bdi_stat_error(bdi)) {
-			bdi_nr_reclaimable = bdi_stat_sum(bdi, BDI_RECLAIMABLE);
-			bdi_nr_writeback = bdi_stat_sum(bdi, BDI_WRITEBACK);
-		} else {
-			bdi_nr_reclaimable = bdi_stat(bdi, BDI_RECLAIMABLE);
-			bdi_nr_writeback = bdi_stat(bdi, BDI_WRITEBACK);
-		}
+static int check_dirty_limits(struct backing_dev_info *bdi,
+			      struct dirty_limit_state *st)
+{
+	unsigned long min_bdi_thresh;
+	int ret = DIRTY_OK;
 
-		/*
-		 * The bdi thresh is somehow "soft" limit derived from the
-		 * global "hard" limit. The former helps to prevent heavy IO
-		 * bdi or process from holding back light ones; The latter is
-		 * the last resort safeguard.
-		 */
-		dirty_exceeded =
-			(bdi_nr_reclaimable + bdi_nr_writeback > bdi_thresh)
-			|| (nr_reclaimable + nr_writeback > dirty_thresh);
-		min_dirty_exceeded =
-			(bdi_nr_reclaimable + bdi_nr_writeback > min_bdi_thresh)
-			|| (nr_reclaimable + nr_writeback > dirty_thresh);
-
-		if (!dirty_exceeded)
-			break;
+	get_global_dirty_limit_state(st);
+	/*
+	 * Throttle it only when the background writeback cannot catch-up. This
+	 * avoids (excessively) small writeouts when the bdi limits are ramping
+	 * up.
+	 */
+	if (st->nr_reclaimable + st->nr_writeback <=
+			(st->background_thresh + st->dirty_thresh) / 2)
+		goto out;
 
-		if (!bdi->dirty_exceeded)
-			bdi->dirty_exceeded = 1;
-
-		/* Note: nr_reclaimable denotes nr_dirty + nr_unstable.
-		 * Unstable writes are a feature of certain networked
-		 * filesystems (i.e. NFS) in which data may have been
-		 * written to the server's write cache, but has not yet
-		 * been flushed to permanent storage.
-		 * Only move pages to writeback if this bdi is over its
-		 * threshold otherwise wait until the disk writes catch
-		 * up.
-		 */
-		trace_wbc_balance_dirty_start(&wbc, bdi);
-		if (bdi_nr_reclaimable > bdi_thresh) {
-			writeback_inodes_wb(&bdi->wb, &wbc);
-			pages_written += write_chunk - wbc.nr_to_write;
-			trace_wbc_balance_dirty_written(&wbc, bdi);
-			if (pages_written >= write_chunk)
-				break;		/* We've done our duty */
+	get_bdi_dirty_limit_state(bdi, st);
+	min_bdi_thresh = task_min_dirty_limit(st->bdi_thresh);
+
+	/*
+	 * The bdi thresh is somehow "soft" limit derived from the global
+	 * "hard" limit. The former helps to prevent heavy IO bdi or process
+	 * from holding back light ones; The latter is the last resort
+	 * safeguard.
+	 */
+	if (st->nr_reclaimable + st->nr_writeback > st->dirty_thresh) {
+		ret = DIRTY_EXCEED_LIMIT;
+		goto out;
+	}
+	if (st->bdi_nr_reclaimable + st->bdi_nr_writeback > min_bdi_thresh) {
+		ret = DIRTY_MAY_EXCEED_LIMIT;
+		goto out;
+	}
+	if (st->nr_reclaimable > st->background_thresh)
+		ret = DIRTY_EXCEED_BACKGROUND;
+out:
+	return ret;
+}
+
+static bool bdi_task_limit_exceeded(struct dirty_limit_state *st,
+				    struct task_struct *p)
+{
+	unsigned long bdi_thresh;
+
+	bdi_thresh = task_dirty_limit(p, st->bdi_thresh);
+
+	return st->bdi_nr_reclaimable + st->bdi_nr_writeback > bdi_thresh;
+}
+
+static void balance_waiter_done(struct backing_dev_info *bdi,
+				struct balance_waiter *bw)
+{
+	list_del_init(&bw->bw_list);
+	bdi->balance_waiters--;
+	wake_up_process(bw->bw_task);
+}
+
+void distribute_page_completions(struct work_struct *work)
+{
+	struct backing_dev_info *bdi =
+		container_of(work, struct backing_dev_info, balance_work.work);
+	unsigned long written = bdi_stat_sum(bdi, BDI_WRITTEN);
+	unsigned long pages_per_waiter;
+	struct balance_waiter *waiter, *tmpw;
+	struct dirty_limit_state st;
+	int dirty_exceeded;
+
+	trace_writeback_distribute_page_completions(bdi, written);
+	dirty_exceeded = check_dirty_limits(bdi, &st);
+	if (dirty_exceeded < DIRTY_MAY_EXCEED_LIMIT) {
+		/* Wakeup everybody */
+		trace_writeback_distribute_page_completions_wakeall(bdi);
+		spin_lock(&bdi->balance_lock);
+		list_for_each_entry_safe(
+				waiter, tmpw, &bdi->balance_list, bw_list)
+			balance_waiter_done(bdi, waiter);
+		spin_unlock(&bdi->balance_lock);
+		return;
+	}
+
+	spin_lock(&bdi->balance_lock);
+	/* Distribute pages equally among waiters */
+	while (!list_empty(&bdi->balance_list)) {
+		pages_per_waiter = (written - bdi->written_start) /
+							bdi->balance_waiters;
+		if (!pages_per_waiter)
+			break;
+		list_for_each_entry_safe(
+				waiter, tmpw, &bdi->balance_list, bw_list) {
+			unsigned long delta = min(pages_per_waiter,
+						  waiter->bw_wait_pages);
+
+			waiter->bw_wait_pages -= delta;
+			bdi->written_start += delta;
+			if (waiter->bw_wait_pages == 0)
+				balance_waiter_done(bdi, waiter);
 		}
-		trace_wbc_balance_dirty_wait(&wbc, bdi);
-		__set_current_state(TASK_UNINTERRUPTIBLE);
-		io_schedule_timeout(pause);
+	}
+	/* Wake tasks that might have gotten below their limits */
+	list_for_each_entry_safe(waiter, tmpw, &bdi->balance_list, bw_list) {
+		if (dirty_exceeded == DIRTY_MAY_EXCEED_LIMIT &&
+		     !bdi_task_limit_exceeded(&st, waiter->bw_task))
+			balance_waiter_done(bdi, waiter);
+	}
+	/* More page completions needed? */
+	if (!list_empty(&bdi->balance_list))
+		schedule_delayed_work(&bdi->balance_work, HZ/10);
+	spin_unlock(&bdi->balance_lock);
+}
 
+/*
+ * balance_dirty_pages() must be called by processes which are generating dirty
+ * data.  It looks at the number of dirty pages in the machine and will force
+ * the caller to perform writeback if the system is over `vm_dirty_ratio'.
+ * If we're over `background_thresh' then the writeback threads are woken to
+ * perform some writeout.
+ */
+static void balance_dirty_pages(struct address_space *mapping,
+				unsigned long write_chunk)
+{
+	struct backing_dev_info *bdi = mapping->backing_dev_info;
+	struct balance_waiter bw;
+	struct dirty_limit_state st;
+	int dirty_exceeded = check_dirty_limits(bdi, &st);
+
+	if (dirty_exceeded < DIRTY_MAY_EXCEED_LIMIT ||
+	    (dirty_exceeded == DIRTY_MAY_EXCEED_LIMIT &&
+	     !bdi_task_limit_exceeded(&st, current))) {
+		if (bdi->dirty_exceeded &&
+		    dirty_exceeded < DIRTY_MAY_EXCEED_LIMIT)
+			bdi->dirty_exceeded = 0;
 		/*
-		 * Increase the delay for each loop, up to our previous
-		 * default of taking a 100ms nap.
+		 * In laptop mode, we wait until hitting the higher threshold
+		 * before starting background writeout, and then write out all
+		 * the way down to the lower threshold.  So slow writers cause
+		 * minimal disk activity.
+		 *
+		 * In normal mode, we start background writeout at the lower
+		 * background_thresh, to keep the amount of dirty memory low.
 		 */
-		pause <<= 1;
-		if (pause > HZ / 10)
-			pause = HZ / 10;
+		if (!laptop_mode && dirty_exceeded == DIRTY_EXCEED_BACKGROUND)
+			bdi_start_background_writeback(bdi);
+		return;
 	}
 
-	/* Clear dirty_exceeded flag only when no task can exceed the limit */
-	if (!min_dirty_exceeded && bdi->dirty_exceeded)
-		bdi->dirty_exceeded = 0;
+	if (!bdi->dirty_exceeded)
+		bdi->dirty_exceeded = 1;
 
-	if (writeback_in_progress(bdi))
-		return;
+	trace_writeback_balance_dirty_pages_waiting(bdi, write_chunk);
+	/* Kick flusher thread to start doing work if it isn't already */
+	bdi_start_background_writeback(bdi);
 
+	bw.bw_wait_pages = write_chunk;
+	bw.bw_task = current;
+	spin_lock(&bdi->balance_lock);
 	/*
-	 * In laptop mode, we wait until hitting the higher threshold before
-	 * starting background writeout, and then write out all the way down
-	 * to the lower threshold.  So slow writers cause minimal disk activity.
-	 *
-	 * In normal mode, we start background writeout at the lower
-	 * background_thresh, to keep the amount of dirty memory low.
+	 * First item? Need to schedule distribution of IO completions among
+	 * items on balance_list
+	 */
+	if (list_empty(&bdi->balance_list)) {
+		bdi->written_start = bdi_stat_sum(bdi, BDI_WRITTEN);
+		/* FIXME: Delay should be autotuned based on dev throughput */
+		schedule_delayed_work(&bdi->balance_work, HZ/10);
+	}
+	/*
+	 * Add work to the balance list, from now on the structure is handled
+	 * by distribute_page_completions()
+	 */
+	list_add_tail(&bw.bw_list, &bdi->balance_list);
+	bdi->balance_waiters++;
+	/*
+	 * Setting task state must happen inside balance_lock to avoid races
+	 * with distribution function waking us.
+	 */
+	__set_current_state(TASK_UNINTERRUPTIBLE);
+	spin_unlock(&bdi->balance_lock);
+	/* Wait for pages to get written */
+	schedule();
+	/*
+	 * Enough page completions should have happened by now and we should
+	 * have been removed from the list
 	 */
-	if ((laptop_mode && pages_written) ||
-	    (!laptop_mode && (nr_reclaimable > background_thresh)))
-		bdi_start_background_writeback(bdi);
+	WARN_ON(!list_empty(&bw.bw_list));
+	trace_writeback_balance_dirty_pages_woken(bdi);
 }
 
 void set_page_dirty_balance(struct page *page, int page_mkwrite)
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* Re: [PATCH 3/5] mm: Implement IO-less balance_dirty_pages()
  2011-03-08 22:31 ` [PATCH 3/5] mm: Implement IO-less balance_dirty_pages() Jan Kara
@ 2011-03-10  0:07   ` Vivek Goyal
  2011-03-14 20:48     ` Jan Kara
  2011-03-16 16:53   ` Vivek Goyal
  1 sibling, 1 reply; 49+ messages in thread
From: Vivek Goyal @ 2011-03-10  0:07 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-fsdevel, linux-mm, Wu Fengguang, Peter Zijlstra,
	Andrew Morton, Christoph Hellwig, Dave Chinner

On Tue, Mar 08, 2011 at 11:31:13PM +0100, Jan Kara wrote:
> This patch changes balance_dirty_pages() throttling so that the function does
> not submit writes on its own but rather waits for flusher thread to do enough
> writes. This has an advantage that we have a single source of IO allowing for
> better writeback locality. Also we do not have to reenter filesystems from a
> non-trivial context.
> 
> The waiting is implemented as follows: Whenever we decide to throttle a task in
> balance_dirty_pages(), task adds itself to a list of tasks that are throttled
> against that bdi and goes to sleep waiting to receive specified amount of page
> IO completions. Once in a while (currently HZ/10, later the interval should be
> autotuned based on observed IO completion rate), accumulated page IO
> completions are distributed equally among waiting tasks.
> 
> This waiting scheme has been chosen so that waiting time in
> balance_dirty_pages() is proportional to
>   number_waited_pages * number_of_waiters.
> In particular it does not depend on the total number of pages being waited for,
> thus providing possibly a fairer results. Note that the dependency on the
> number of waiters is inevitable, since all the waiters compete for a common
> resource so their number has to be somehow reflected in waiting time.
> 
> CC: Andrew Morton <akpm@linux-foundation.org>
> CC: Christoph Hellwig <hch@infradead.org>
> CC: Dave Chinner <david@fromorbit.com>
> CC: Wu Fengguang <fengguang.wu@intel.com>
> CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
> Signed-off-by: Jan Kara <jack@suse.cz>
> ---
>  include/linux/backing-dev.h      |    7 +
>  include/linux/writeback.h        |    1 +
>  include/trace/events/writeback.h |   65 +++++++-
>  mm/backing-dev.c                 |    8 +
>  mm/page-writeback.c              |  345 +++++++++++++++++++++++++-------------
>  5 files changed, 310 insertions(+), 116 deletions(-)
> 

[..]
> +/*
> + * balance_dirty_pages() must be called by processes which are generating dirty
> + * data.  It looks at the number of dirty pages in the machine and will force
> + * the caller to perform writeback if the system is over `vm_dirty_ratio'.
> + * If we're over `background_thresh' then the writeback threads are woken to
> + * perform some writeout.
> + */
> +static void balance_dirty_pages(struct address_space *mapping,
> +				unsigned long write_chunk)
> +{
> +	struct backing_dev_info *bdi = mapping->backing_dev_info;
> +	struct balance_waiter bw;
> +	struct dirty_limit_state st;
> +	int dirty_exceeded = check_dirty_limits(bdi, &st);
> +
> +	if (dirty_exceeded < DIRTY_MAY_EXCEED_LIMIT ||
> +	    (dirty_exceeded == DIRTY_MAY_EXCEED_LIMIT &&
> +	     !bdi_task_limit_exceeded(&st, current))) {
> +		if (bdi->dirty_exceeded &&
> +		    dirty_exceeded < DIRTY_MAY_EXCEED_LIMIT)
> +			bdi->dirty_exceeded = 0;
>  		/*
> -		 * Increase the delay for each loop, up to our previous
> -		 * default of taking a 100ms nap.
> +		 * In laptop mode, we wait until hitting the higher threshold
> +		 * before starting background writeout, and then write out all
> +		 * the way down to the lower threshold.  So slow writers cause
> +		 * minimal disk activity.
> +		 *
> +		 * In normal mode, we start background writeout at the lower
> +		 * background_thresh, to keep the amount of dirty memory low.
>  		 */
> -		pause <<= 1;
> -		if (pause > HZ / 10)
> -			pause = HZ / 10;
> +		if (!laptop_mode && dirty_exceeded == DIRTY_EXCEED_BACKGROUND)
> +			bdi_start_background_writeback(bdi);
> +		return;
>  	}
>  
> -	/* Clear dirty_exceeded flag only when no task can exceed the limit */
> -	if (!min_dirty_exceeded && bdi->dirty_exceeded)
> -		bdi->dirty_exceeded = 0;
> +	if (!bdi->dirty_exceeded)
> +		bdi->dirty_exceeded = 1;

Will it make sense to move out bdi_task_limit_exceeded() check in a
separate if condition statement as follows. May be this is little
easier to read.

	if (dirty_exceeded < DIRTY_MAY_EXCEED_LIMIT) {
		if (bdi->dirty_exceeded)
			bdi->dirty_exceeded = 0;

		if (!laptop_mode && dirty_exceeded == DIRTY_EXCEED_BACKGROUND)
			bdi_start_background_writeback(bdi);

		return;
	}

	if (dirty_exceeded == DIRTY_MAY_EXCEED_LIMIT &&
	    !bdi_task_limit_exceeded(&st, current))
		return;   

	/* Either task is throttled or we crossed global dirty ratio */
	if (!bdi->dirty_exceeded)
		bdi->dirty_exceeded = 1;

Thanks
Vivek

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 3/5] mm: Implement IO-less balance_dirty_pages()
  2011-03-10  0:07   ` Vivek Goyal
@ 2011-03-14 20:48     ` Jan Kara
  2011-03-15 15:23       ` Vivek Goyal
  0 siblings, 1 reply; 49+ messages in thread
From: Jan Kara @ 2011-03-14 20:48 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Jan Kara, linux-fsdevel, linux-mm, Wu Fengguang, Peter Zijlstra,
	Andrew Morton, Christoph Hellwig, Dave Chinner

On Wed 09-03-11 19:07:31, Vivek Goyal wrote:
> > +static void balance_dirty_pages(struct address_space *mapping,
> > +				unsigned long write_chunk)
> > +{
> > +	struct backing_dev_info *bdi = mapping->backing_dev_info;
> > +	struct balance_waiter bw;
> > +	struct dirty_limit_state st;
> > +	int dirty_exceeded = check_dirty_limits(bdi, &st);
> > +
> > +	if (dirty_exceeded < DIRTY_MAY_EXCEED_LIMIT ||
> > +	    (dirty_exceeded == DIRTY_MAY_EXCEED_LIMIT &&
> > +	     !bdi_task_limit_exceeded(&st, current))) {
> > +		if (bdi->dirty_exceeded &&
> > +		    dirty_exceeded < DIRTY_MAY_EXCEED_LIMIT)
> > +			bdi->dirty_exceeded = 0;
> >  		/*
> > -		 * Increase the delay for each loop, up to our previous
> > -		 * default of taking a 100ms nap.
> > +		 * In laptop mode, we wait until hitting the higher threshold
> > +		 * before starting background writeout, and then write out all
> > +		 * the way down to the lower threshold.  So slow writers cause
> > +		 * minimal disk activity.
> > +		 *
> > +		 * In normal mode, we start background writeout at the lower
> > +		 * background_thresh, to keep the amount of dirty memory low.
> >  		 */
> > -		pause <<= 1;
> > -		if (pause > HZ / 10)
> > -			pause = HZ / 10;
> > +		if (!laptop_mode && dirty_exceeded == DIRTY_EXCEED_BACKGROUND)
> > +			bdi_start_background_writeback(bdi);
> > +		return;
> >  	}
> >  
> > -	/* Clear dirty_exceeded flag only when no task can exceed the limit */
> > -	if (!min_dirty_exceeded && bdi->dirty_exceeded)
> > -		bdi->dirty_exceeded = 0;
> > +	if (!bdi->dirty_exceeded)
> > +		bdi->dirty_exceeded = 1;
> 
> Will it make sense to move out bdi_task_limit_exceeded() check in a
> separate if condition statement as follows. May be this is little
> easier to read.
> 
> 	if (dirty_exceeded < DIRTY_MAY_EXCEED_LIMIT) {
> 		if (bdi->dirty_exceeded)
> 			bdi->dirty_exceeded = 0;
> 
> 		if (!laptop_mode && dirty_exceeded == DIRTY_EXCEED_BACKGROUND)
> 			bdi_start_background_writeback(bdi);
> 
> 		return;
> 	}
> 
> 	if (dirty_exceeded == DIRTY_MAY_EXCEED_LIMIT &&
> 	    !bdi_task_limit_exceeded(&st, current))
> 		return;   
  But then we have to start background writeback here as well. Which is
actually a bug in the original patch as well! So clearly your way is more
readable :) I'll change it. Thanks.

									Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 3/5] mm: Implement IO-less balance_dirty_pages()
  2011-03-14 20:48     ` Jan Kara
@ 2011-03-15 15:23       ` Vivek Goyal
  2011-03-16 21:26         ` Curt Wohlgemuth
  0 siblings, 1 reply; 49+ messages in thread
From: Vivek Goyal @ 2011-03-15 15:23 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-fsdevel, linux-mm, Wu Fengguang, Peter Zijlstra,
	Andrew Morton, Christoph Hellwig, Dave Chinner

On Mon, Mar 14, 2011 at 09:48:21PM +0100, Jan Kara wrote:
> On Wed 09-03-11 19:07:31, Vivek Goyal wrote:
> > > +static void balance_dirty_pages(struct address_space *mapping,
> > > +				unsigned long write_chunk)
> > > +{
> > > +	struct backing_dev_info *bdi = mapping->backing_dev_info;
> > > +	struct balance_waiter bw;
> > > +	struct dirty_limit_state st;
> > > +	int dirty_exceeded = check_dirty_limits(bdi, &st);
> > > +
> > > +	if (dirty_exceeded < DIRTY_MAY_EXCEED_LIMIT ||
> > > +	    (dirty_exceeded == DIRTY_MAY_EXCEED_LIMIT &&
> > > +	     !bdi_task_limit_exceeded(&st, current))) {
> > > +		if (bdi->dirty_exceeded &&
> > > +		    dirty_exceeded < DIRTY_MAY_EXCEED_LIMIT)
> > > +			bdi->dirty_exceeded = 0;
> > >  		/*
> > > -		 * Increase the delay for each loop, up to our previous
> > > -		 * default of taking a 100ms nap.
> > > +		 * In laptop mode, we wait until hitting the higher threshold
> > > +		 * before starting background writeout, and then write out all
> > > +		 * the way down to the lower threshold.  So slow writers cause
> > > +		 * minimal disk activity.
> > > +		 *
> > > +		 * In normal mode, we start background writeout at the lower
> > > +		 * background_thresh, to keep the amount of dirty memory low.
> > >  		 */
> > > -		pause <<= 1;
> > > -		if (pause > HZ / 10)
> > > -			pause = HZ / 10;
> > > +		if (!laptop_mode && dirty_exceeded == DIRTY_EXCEED_BACKGROUND)
> > > +			bdi_start_background_writeback(bdi);
> > > +		return;
> > >  	}
> > >  
> > > -	/* Clear dirty_exceeded flag only when no task can exceed the limit */
> > > -	if (!min_dirty_exceeded && bdi->dirty_exceeded)
> > > -		bdi->dirty_exceeded = 0;
> > > +	if (!bdi->dirty_exceeded)
> > > +		bdi->dirty_exceeded = 1;
> > 
> > Will it make sense to move out bdi_task_limit_exceeded() check in a
> > separate if condition statement as follows. May be this is little
> > easier to read.
> > 
> > 	if (dirty_exceeded < DIRTY_MAY_EXCEED_LIMIT) {
> > 		if (bdi->dirty_exceeded)
> > 			bdi->dirty_exceeded = 0;
> > 
> > 		if (!laptop_mode && dirty_exceeded == DIRTY_EXCEED_BACKGROUND)
> > 			bdi_start_background_writeback(bdi);
> > 
> > 		return;
> > 	}
> > 
> > 	if (dirty_exceeded == DIRTY_MAY_EXCEED_LIMIT &&
> > 	    !bdi_task_limit_exceeded(&st, current))
> > 		return;   
>   But then we have to start background writeback here as well. Which is
> actually a bug in the original patch as well! So clearly your way is more
> readable :) I'll change it. Thanks.

I was thinking about that starting of bdi writeback here. But I was
assuming that if we are here then we most likely have visited above
loop of < DIRTY_MAY_EXCEED_LIMIT and started background writeback.

Thanks
Vivek

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 3/5] mm: Implement IO-less balance_dirty_pages()
  2011-03-15 15:23       ` Vivek Goyal
@ 2011-03-16 21:26         ` Curt Wohlgemuth
  2011-03-16 22:53           ` Curt Wohlgemuth
  0 siblings, 1 reply; 49+ messages in thread
From: Curt Wohlgemuth @ 2011-03-16 21:26 UTC (permalink / raw)
  To: Vivek Goyal, jack
  Cc: linux-fsdevel, linux-mm, Wu Fengguang, Peter Zijlstra,
	Andrew Morton, Christoph Hellwig, Dave Chinner

Hi Jan:

On Tue, Mar 15, 2011 at 8:23 AM, Vivek Goyal <vgoyal@redhat.com> wrote:
> On Mon, Mar 14, 2011 at 09:48:21PM +0100, Jan Kara wrote:
>> On Wed 09-03-11 19:07:31, Vivek Goyal wrote:
>> > > +static void balance_dirty_pages(struct address_space *mapping,
>> > > +                         unsigned long write_chunk)
>> > > +{
>> > > + struct backing_dev_info *bdi = mapping->backing_dev_info;
>> > > + struct balance_waiter bw;
>> > > + struct dirty_limit_state st;
>> > > + int dirty_exceeded = check_dirty_limits(bdi, &st);
>> > > +
>> > > + if (dirty_exceeded < DIRTY_MAY_EXCEED_LIMIT ||
>> > > +     (dirty_exceeded == DIRTY_MAY_EXCEED_LIMIT &&
>> > > +      !bdi_task_limit_exceeded(&st, current))) {
>> > > +         if (bdi->dirty_exceeded &&
>> > > +             dirty_exceeded < DIRTY_MAY_EXCEED_LIMIT)
>> > > +                 bdi->dirty_exceeded = 0;
>> > >           /*
>> > > -          * Increase the delay for each loop, up to our previous
>> > > -          * default of taking a 100ms nap.
>> > > +          * In laptop mode, we wait until hitting the higher threshold
>> > > +          * before starting background writeout, and then write out all
>> > > +          * the way down to the lower threshold.  So slow writers cause
>> > > +          * minimal disk activity.
>> > > +          *
>> > > +          * In normal mode, we start background writeout at the lower
>> > > +          * background_thresh, to keep the amount of dirty memory low.
>> > >            */
>> > > -         pause <<= 1;
>> > > -         if (pause > HZ / 10)
>> > > -                 pause = HZ / 10;
>> > > +         if (!laptop_mode && dirty_exceeded == DIRTY_EXCEED_BACKGROUND)
>> > > +                 bdi_start_background_writeback(bdi);
>> > > +         return;
>> > >   }
>> > >
>> > > - /* Clear dirty_exceeded flag only when no task can exceed the limit */
>> > > - if (!min_dirty_exceeded && bdi->dirty_exceeded)
>> > > -         bdi->dirty_exceeded = 0;
>> > > + if (!bdi->dirty_exceeded)
>> > > +         bdi->dirty_exceeded = 1;
>> >
>> > Will it make sense to move out bdi_task_limit_exceeded() check in a
>> > separate if condition statement as follows. May be this is little
>> > easier to read.
>> >
>> >     if (dirty_exceeded < DIRTY_MAY_EXCEED_LIMIT) {
>> >             if (bdi->dirty_exceeded)
>> >                     bdi->dirty_exceeded = 0;
>> >
>> >             if (!laptop_mode && dirty_exceeded == DIRTY_EXCEED_BACKGROUND)
>> >                     bdi_start_background_writeback(bdi);
>> >
>> >             return;
>> >     }
>> >
>> >     if (dirty_exceeded == DIRTY_MAY_EXCEED_LIMIT &&
>> >         !bdi_task_limit_exceeded(&st, current))
>> >             return;
>>   But then we have to start background writeback here as well. Which is
>> actually a bug in the original patch as well! So clearly your way is more
>> readable :) I'll change it. Thanks.
>
> I was thinking about that starting of bdi writeback here. But I was
> assuming that if we are here then we most likely have visited above
> loop of < DIRTY_MAY_EXCEED_LIMIT and started background writeback.

Maybe I'm missing something, but at the point in balance_dirty_pages()
where we kick the flusher thread , before we put the current task to
sleep, how do you know that background writeback is taking place?  Are
you simply assuming that in previous calls to balance_dirty_pages(),
that background writeback has been started, and is still taking place
at the time we need to do throttling?

Thanks,
Curt

>
> Thanks
> Vivek
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 3/5] mm: Implement IO-less balance_dirty_pages()
  2011-03-16 21:26         ` Curt Wohlgemuth
@ 2011-03-16 22:53           ` Curt Wohlgemuth
  0 siblings, 0 replies; 49+ messages in thread
From: Curt Wohlgemuth @ 2011-03-16 22:53 UTC (permalink / raw)
  To: Vivek Goyal, jack
  Cc: linux-fsdevel, linux-mm, Wu Fengguang, Peter Zijlstra,
	Andrew Morton, Christoph Hellwig, Dave Chinner

On Wed, Mar 16, 2011 at 2:26 PM, Curt Wohlgemuth <curtw@google.com> wrote:
> Hi Jan:
>
> On Tue, Mar 15, 2011 at 8:23 AM, Vivek Goyal <vgoyal@redhat.com> wrote:
>> On Mon, Mar 14, 2011 at 09:48:21PM +0100, Jan Kara wrote:
>>> On Wed 09-03-11 19:07:31, Vivek Goyal wrote:
>>> > > +static void balance_dirty_pages(struct address_space *mapping,
>>> > > +                         unsigned long write_chunk)
>>> > > +{
>>> > > + struct backing_dev_info *bdi = mapping->backing_dev_info;
>>> > > + struct balance_waiter bw;
>>> > > + struct dirty_limit_state st;
>>> > > + int dirty_exceeded = check_dirty_limits(bdi, &st);
>>> > > +
>>> > > + if (dirty_exceeded < DIRTY_MAY_EXCEED_LIMIT ||
>>> > > +     (dirty_exceeded == DIRTY_MAY_EXCEED_LIMIT &&
>>> > > +      !bdi_task_limit_exceeded(&st, current))) {
>>> > > +         if (bdi->dirty_exceeded &&
>>> > > +             dirty_exceeded < DIRTY_MAY_EXCEED_LIMIT)
>>> > > +                 bdi->dirty_exceeded = 0;
>>> > >           /*
>>> > > -          * Increase the delay for each loop, up to our previous
>>> > > -          * default of taking a 100ms nap.
>>> > > +          * In laptop mode, we wait until hitting the higher threshold
>>> > > +          * before starting background writeout, and then write out all
>>> > > +          * the way down to the lower threshold.  So slow writers cause
>>> > > +          * minimal disk activity.
>>> > > +          *
>>> > > +          * In normal mode, we start background writeout at the lower
>>> > > +          * background_thresh, to keep the amount of dirty memory low.
>>> > >            */
>>> > > -         pause <<= 1;
>>> > > -         if (pause > HZ / 10)
>>> > > -                 pause = HZ / 10;
>>> > > +         if (!laptop_mode && dirty_exceeded == DIRTY_EXCEED_BACKGROUND)
>>> > > +                 bdi_start_background_writeback(bdi);
>>> > > +         return;
>>> > >   }
>>> > >
>>> > > - /* Clear dirty_exceeded flag only when no task can exceed the limit */
>>> > > - if (!min_dirty_exceeded && bdi->dirty_exceeded)
>>> > > -         bdi->dirty_exceeded = 0;
>>> > > + if (!bdi->dirty_exceeded)
>>> > > +         bdi->dirty_exceeded = 1;
>>> >
>>> > Will it make sense to move out bdi_task_limit_exceeded() check in a
>>> > separate if condition statement as follows. May be this is little
>>> > easier to read.
>>> >
>>> >     if (dirty_exceeded < DIRTY_MAY_EXCEED_LIMIT) {
>>> >             if (bdi->dirty_exceeded)
>>> >                     bdi->dirty_exceeded = 0;
>>> >
>>> >             if (!laptop_mode && dirty_exceeded == DIRTY_EXCEED_BACKGROUND)
>>> >                     bdi_start_background_writeback(bdi);
>>> >
>>> >             return;
>>> >     }
>>> >
>>> >     if (dirty_exceeded == DIRTY_MAY_EXCEED_LIMIT &&
>>> >         !bdi_task_limit_exceeded(&st, current))
>>> >             return;
>>>   But then we have to start background writeback here as well. Which is
>>> actually a bug in the original patch as well! So clearly your way is more
>>> readable :) I'll change it. Thanks.
>>
>> I was thinking about that starting of bdi writeback here. But I was
>> assuming that if we are here then we most likely have visited above
>> loop of < DIRTY_MAY_EXCEED_LIMIT and started background writeback.
>
> Maybe I'm missing something, but at the point in balance_dirty_pages()
> where we kick the flusher thread , before we put the current task to
> sleep, how do you know that background writeback is taking place?  Are
> you simply assuming that in previous calls to balance_dirty_pages(),
> that background writeback has been started, and is still taking place
> at the time we need to do throttling?

Never mind, I see that I'm not completely familiar with the writeback
changes for 2.6.38.  I see where we'll kick of BG writeback in
wb_check_background_flush() once we kick the flusher thread.

Curt

>
> Thanks,
> Curt
>
>>
>> Thanks
>> Vivek
>>
>> --
>> To unsubscribe, send a message with 'unsubscribe linux-mm' in
>> the body to majordomo@kvack.org.  For more info on Linux MM,
>> see: http://www.linux-mm.org/ .
>> Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
>> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
>>
>
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 3/5] mm: Implement IO-less balance_dirty_pages()
  2011-03-08 22:31 ` [PATCH 3/5] mm: Implement IO-less balance_dirty_pages() Jan Kara
  2011-03-10  0:07   ` Vivek Goyal
@ 2011-03-16 16:53   ` Vivek Goyal
  2011-03-16 19:10     ` Jan Kara
  1 sibling, 1 reply; 49+ messages in thread
From: Vivek Goyal @ 2011-03-16 16:53 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-fsdevel, linux-mm, Wu Fengguang, Peter Zijlstra,
	Andrew Morton, Christoph Hellwig, Dave Chinner

On Tue, Mar 08, 2011 at 11:31:13PM +0100, Jan Kara wrote:

[..]
> +/*
> + * balance_dirty_pages() must be called by processes which are generating dirty
> + * data.  It looks at the number of dirty pages in the machine and will force
> + * the caller to perform writeback if the system is over `vm_dirty_ratio'.
> + * If we're over `background_thresh' then the writeback threads are woken to
> + * perform some writeout.
> + */
> +static void balance_dirty_pages(struct address_space *mapping,
> +				unsigned long write_chunk)
> +{
> +	struct backing_dev_info *bdi = mapping->backing_dev_info;
> +	struct balance_waiter bw;
> +	struct dirty_limit_state st;
> +	int dirty_exceeded = check_dirty_limits(bdi, &st);
> +
> +	if (dirty_exceeded < DIRTY_MAY_EXCEED_LIMIT ||
> +	    (dirty_exceeded == DIRTY_MAY_EXCEED_LIMIT &&
> +	     !bdi_task_limit_exceeded(&st, current))) {
> +		if (bdi->dirty_exceeded &&
> +		    dirty_exceeded < DIRTY_MAY_EXCEED_LIMIT)
> +			bdi->dirty_exceeded = 0;
>  		/*
> -		 * Increase the delay for each loop, up to our previous
> -		 * default of taking a 100ms nap.
> +		 * In laptop mode, we wait until hitting the higher threshold
> +		 * before starting background writeout, and then write out all
> +		 * the way down to the lower threshold.  So slow writers cause
> +		 * minimal disk activity.
> +		 *
> +		 * In normal mode, we start background writeout at the lower
> +		 * background_thresh, to keep the amount of dirty memory low.
>  		 */
> -		pause <<= 1;
> -		if (pause > HZ / 10)
> -			pause = HZ / 10;
> +		if (!laptop_mode && dirty_exceeded == DIRTY_EXCEED_BACKGROUND)
> +			bdi_start_background_writeback(bdi);
> +		return;
>  	}
>  
> -	/* Clear dirty_exceeded flag only when no task can exceed the limit */
> -	if (!min_dirty_exceeded && bdi->dirty_exceeded)
> -		bdi->dirty_exceeded = 0;
> +	if (!bdi->dirty_exceeded)
> +		bdi->dirty_exceeded = 1;
>  
> -	if (writeback_in_progress(bdi))
> -		return;
> +	trace_writeback_balance_dirty_pages_waiting(bdi, write_chunk);
> +	/* Kick flusher thread to start doing work if it isn't already */
> +	bdi_start_background_writeback(bdi);
>  
> +	bw.bw_wait_pages = write_chunk;
> +	bw.bw_task = current;
> +	spin_lock(&bdi->balance_lock);
>  	/*
> -	 * In laptop mode, we wait until hitting the higher threshold before
> -	 * starting background writeout, and then write out all the way down
> -	 * to the lower threshold.  So slow writers cause minimal disk activity.
> -	 *
> -	 * In normal mode, we start background writeout at the lower
> -	 * background_thresh, to keep the amount of dirty memory low.
> +	 * First item? Need to schedule distribution of IO completions among
> +	 * items on balance_list
> +	 */
> +	if (list_empty(&bdi->balance_list)) {
> +		bdi->written_start = bdi_stat_sum(bdi, BDI_WRITTEN);
> +		/* FIXME: Delay should be autotuned based on dev throughput */
> +		schedule_delayed_work(&bdi->balance_work, HZ/10);
> +	}
> +	/*
> +	 * Add work to the balance list, from now on the structure is handled
> +	 * by distribute_page_completions()
> +	 */
> +	list_add_tail(&bw.bw_list, &bdi->balance_list);
> +	bdi->balance_waiters++;

Hi Jan,

Had a query.

- What makes sure that flusher thread will not stop writing back till all
  the waiters on the bdi have been woken up. IIUC, flusher thread will 
  stop once global background ratio is with-in limit. Is it possible that
  there are still some waiter on some bdi waiting for more pages to finish
  writeback and that might not happen for sometime. 

Thanks
Vivek

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 3/5] mm: Implement IO-less balance_dirty_pages()
  2011-03-16 16:53   ` Vivek Goyal
@ 2011-03-16 19:10     ` Jan Kara
  2011-03-16 19:31       ` Vivek Goyal
  0 siblings, 1 reply; 49+ messages in thread
From: Jan Kara @ 2011-03-16 19:10 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Jan Kara, linux-fsdevel, linux-mm, Wu Fengguang, Peter Zijlstra,
	Andrew Morton, Christoph Hellwig, Dave Chinner

On Wed 16-03-11 12:53:31, Vivek Goyal wrote:
> On Tue, Mar 08, 2011 at 11:31:13PM +0100, Jan Kara wrote:
> [..]
> > +/*
> > + * balance_dirty_pages() must be called by processes which are generating dirty
> > + * data.  It looks at the number of dirty pages in the machine and will force
> > + * the caller to perform writeback if the system is over `vm_dirty_ratio'.
> > + * If we're over `background_thresh' then the writeback threads are woken to
> > + * perform some writeout.
> > + */
> > +static void balance_dirty_pages(struct address_space *mapping,
> > +				unsigned long write_chunk)
> > +{
> > +	struct backing_dev_info *bdi = mapping->backing_dev_info;
> > +	struct balance_waiter bw;
> > +	struct dirty_limit_state st;
> > +	int dirty_exceeded = check_dirty_limits(bdi, &st);
> > +
> > +	if (dirty_exceeded < DIRTY_MAY_EXCEED_LIMIT ||
> > +	    (dirty_exceeded == DIRTY_MAY_EXCEED_LIMIT &&
> > +	     !bdi_task_limit_exceeded(&st, current))) {
> > +		if (bdi->dirty_exceeded &&
> > +		    dirty_exceeded < DIRTY_MAY_EXCEED_LIMIT)
> > +			bdi->dirty_exceeded = 0;
> >  		/*
> > -		 * Increase the delay for each loop, up to our previous
> > -		 * default of taking a 100ms nap.
> > +		 * In laptop mode, we wait until hitting the higher threshold
> > +		 * before starting background writeout, and then write out all
> > +		 * the way down to the lower threshold.  So slow writers cause
> > +		 * minimal disk activity.
> > +		 *
> > +		 * In normal mode, we start background writeout at the lower
> > +		 * background_thresh, to keep the amount of dirty memory low.
> >  		 */
> > -		pause <<= 1;
> > -		if (pause > HZ / 10)
> > -			pause = HZ / 10;
> > +		if (!laptop_mode && dirty_exceeded == DIRTY_EXCEED_BACKGROUND)
> > +			bdi_start_background_writeback(bdi);
> > +		return;
> >  	}
> >  
> > -	/* Clear dirty_exceeded flag only when no task can exceed the limit */
> > -	if (!min_dirty_exceeded && bdi->dirty_exceeded)
> > -		bdi->dirty_exceeded = 0;
> > +	if (!bdi->dirty_exceeded)
> > +		bdi->dirty_exceeded = 1;
> >  
> > -	if (writeback_in_progress(bdi))
> > -		return;
> > +	trace_writeback_balance_dirty_pages_waiting(bdi, write_chunk);
> > +	/* Kick flusher thread to start doing work if it isn't already */
> > +	bdi_start_background_writeback(bdi);
> >  
> > +	bw.bw_wait_pages = write_chunk;
> > +	bw.bw_task = current;
> > +	spin_lock(&bdi->balance_lock);
> >  	/*
> > -	 * In laptop mode, we wait until hitting the higher threshold before
> > -	 * starting background writeout, and then write out all the way down
> > -	 * to the lower threshold.  So slow writers cause minimal disk activity.
> > -	 *
> > -	 * In normal mode, we start background writeout at the lower
> > -	 * background_thresh, to keep the amount of dirty memory low.
> > +	 * First item? Need to schedule distribution of IO completions among
> > +	 * items on balance_list
> > +	 */
> > +	if (list_empty(&bdi->balance_list)) {
> > +		bdi->written_start = bdi_stat_sum(bdi, BDI_WRITTEN);
> > +		/* FIXME: Delay should be autotuned based on dev throughput */
> > +		schedule_delayed_work(&bdi->balance_work, HZ/10);
> > +	}
> > +	/*
> > +	 * Add work to the balance list, from now on the structure is handled
> > +	 * by distribute_page_completions()
> > +	 */
> > +	list_add_tail(&bw.bw_list, &bdi->balance_list);
> > +	bdi->balance_waiters++;
> Had a query.
> 
> - What makes sure that flusher thread will not stop writing back till all
>   the waiters on the bdi have been woken up. IIUC, flusher thread will 
>   stop once global background ratio is with-in limit. Is it possible that
>   there are still some waiter on some bdi waiting for more pages to finish
>   writeback and that might not happen for sometime. 
  Yes, this can possibly happen but once distribute_page_completions()
gets called (after a given time), it will notice that we are below limits
and wake all waiters. Under normal circumstances, we should have a decent
estimate when distribute_page_completions() needs to be called and that
should be long before flusher thread finishes it's work. But in cases when
a bdi has only a small share of global dirty limit, what you describe can
possibly happen.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 3/5] mm: Implement IO-less balance_dirty_pages()
  2011-03-16 19:10     ` Jan Kara
@ 2011-03-16 19:31       ` Vivek Goyal
  2011-03-16 19:58         ` Jan Kara
  0 siblings, 1 reply; 49+ messages in thread
From: Vivek Goyal @ 2011-03-16 19:31 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-fsdevel, linux-mm, Wu Fengguang, Peter Zijlstra,
	Andrew Morton, Christoph Hellwig, Dave Chinner

On Wed, Mar 16, 2011 at 08:10:21PM +0100, Jan Kara wrote:
> On Wed 16-03-11 12:53:31, Vivek Goyal wrote:
> > On Tue, Mar 08, 2011 at 11:31:13PM +0100, Jan Kara wrote:
> > [..]
> > > +/*
> > > + * balance_dirty_pages() must be called by processes which are generating dirty
> > > + * data.  It looks at the number of dirty pages in the machine and will force
> > > + * the caller to perform writeback if the system is over `vm_dirty_ratio'.
> > > + * If we're over `background_thresh' then the writeback threads are woken to
> > > + * perform some writeout.
> > > + */
> > > +static void balance_dirty_pages(struct address_space *mapping,
> > > +				unsigned long write_chunk)
> > > +{
> > > +	struct backing_dev_info *bdi = mapping->backing_dev_info;
> > > +	struct balance_waiter bw;
> > > +	struct dirty_limit_state st;
> > > +	int dirty_exceeded = check_dirty_limits(bdi, &st);
> > > +
> > > +	if (dirty_exceeded < DIRTY_MAY_EXCEED_LIMIT ||
> > > +	    (dirty_exceeded == DIRTY_MAY_EXCEED_LIMIT &&
> > > +	     !bdi_task_limit_exceeded(&st, current))) {
> > > +		if (bdi->dirty_exceeded &&
> > > +		    dirty_exceeded < DIRTY_MAY_EXCEED_LIMIT)
> > > +			bdi->dirty_exceeded = 0;
> > >  		/*
> > > -		 * Increase the delay for each loop, up to our previous
> > > -		 * default of taking a 100ms nap.
> > > +		 * In laptop mode, we wait until hitting the higher threshold
> > > +		 * before starting background writeout, and then write out all
> > > +		 * the way down to the lower threshold.  So slow writers cause
> > > +		 * minimal disk activity.
> > > +		 *
> > > +		 * In normal mode, we start background writeout at the lower
> > > +		 * background_thresh, to keep the amount of dirty memory low.
> > >  		 */
> > > -		pause <<= 1;
> > > -		if (pause > HZ / 10)
> > > -			pause = HZ / 10;
> > > +		if (!laptop_mode && dirty_exceeded == DIRTY_EXCEED_BACKGROUND)
> > > +			bdi_start_background_writeback(bdi);
> > > +		return;
> > >  	}
> > >  
> > > -	/* Clear dirty_exceeded flag only when no task can exceed the limit */
> > > -	if (!min_dirty_exceeded && bdi->dirty_exceeded)
> > > -		bdi->dirty_exceeded = 0;
> > > +	if (!bdi->dirty_exceeded)
> > > +		bdi->dirty_exceeded = 1;
> > >  
> > > -	if (writeback_in_progress(bdi))
> > > -		return;
> > > +	trace_writeback_balance_dirty_pages_waiting(bdi, write_chunk);
> > > +	/* Kick flusher thread to start doing work if it isn't already */
> > > +	bdi_start_background_writeback(bdi);
> > >  
> > > +	bw.bw_wait_pages = write_chunk;
> > > +	bw.bw_task = current;
> > > +	spin_lock(&bdi->balance_lock);
> > >  	/*
> > > -	 * In laptop mode, we wait until hitting the higher threshold before
> > > -	 * starting background writeout, and then write out all the way down
> > > -	 * to the lower threshold.  So slow writers cause minimal disk activity.
> > > -	 *
> > > -	 * In normal mode, we start background writeout at the lower
> > > -	 * background_thresh, to keep the amount of dirty memory low.
> > > +	 * First item? Need to schedule distribution of IO completions among
> > > +	 * items on balance_list
> > > +	 */
> > > +	if (list_empty(&bdi->balance_list)) {
> > > +		bdi->written_start = bdi_stat_sum(bdi, BDI_WRITTEN);
> > > +		/* FIXME: Delay should be autotuned based on dev throughput */
> > > +		schedule_delayed_work(&bdi->balance_work, HZ/10);
> > > +	}
> > > +	/*
> > > +	 * Add work to the balance list, from now on the structure is handled
> > > +	 * by distribute_page_completions()
> > > +	 */
> > > +	list_add_tail(&bw.bw_list, &bdi->balance_list);
> > > +	bdi->balance_waiters++;
> > Had a query.
> > 
> > - What makes sure that flusher thread will not stop writing back till all
> >   the waiters on the bdi have been woken up. IIUC, flusher thread will 
> >   stop once global background ratio is with-in limit. Is it possible that
> >   there are still some waiter on some bdi waiting for more pages to finish
> >   writeback and that might not happen for sometime. 
>   Yes, this can possibly happen but once distribute_page_completions()
> gets called (after a given time), it will notice that we are below limits
> and wake all waiters.
> Under normal circumstances, we should have a decent
> estimate when distribute_page_completions() needs to be called and that
> should be long before flusher thread finishes it's work. But in cases when
> a bdi has only a small share of global dirty limit, what you describe can
> possibly happen.

So if a bdi share is small then it can happen that global background
threshold is fine but per bdi threshold is not. That means
task_bdi_threshold is also above limit and IIUC, distribute_page_completion()
will not wake up the waiter until bdi_task_limit_exceeded() is in control.

        /*
         * Wake tasks that might have gotten below their limits and
         * compute
         * the number of pages we wait for
         */
        list_for_each_entry_safe(waiter, tmpw, &bdi->balance_list,
bw_list) {
                if (dirty_exceeded == DIRTY_MAY_EXCEED_LIMIT &&
                    !bdi_task_limit_exceeded(&st, waiter->bw_task))
                        balance_waiter_done(bdi, waiter);
                else if (waiter->bw_wait_pages < min_pages)
                        min_pages = waiter->bw_wait_pages;
        }


So it can happen that global background threshold is under limit but a
specific task bdi threshold is not and I am failing to see who will
wake up the task in that situation.

Thanks
Vivek 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 3/5] mm: Implement IO-less balance_dirty_pages()
  2011-03-16 19:31       ` Vivek Goyal
@ 2011-03-16 19:58         ` Jan Kara
  2011-03-16 20:22           ` Vivek Goyal
  0 siblings, 1 reply; 49+ messages in thread
From: Jan Kara @ 2011-03-16 19:58 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Jan Kara, linux-fsdevel, linux-mm, Wu Fengguang, Peter Zijlstra,
	Andrew Morton, Christoph Hellwig, Dave Chinner

On Wed 16-03-11 15:31:44, Vivek Goyal wrote:
> On Wed, Mar 16, 2011 at 08:10:21PM +0100, Jan Kara wrote:
> > On Wed 16-03-11 12:53:31, Vivek Goyal wrote:
> > > On Tue, Mar 08, 2011 at 11:31:13PM +0100, Jan Kara wrote:
> > > [..]
> > > > +/*
> > > > + * balance_dirty_pages() must be called by processes which are generating dirty
> > > > + * data.  It looks at the number of dirty pages in the machine and will force
> > > > + * the caller to perform writeback if the system is over `vm_dirty_ratio'.
> > > > + * If we're over `background_thresh' then the writeback threads are woken to
> > > > + * perform some writeout.
> > > > + */
> > > > +static void balance_dirty_pages(struct address_space *mapping,
> > > > +				unsigned long write_chunk)
> > > > +{
> > > > +	struct backing_dev_info *bdi = mapping->backing_dev_info;
> > > > +	struct balance_waiter bw;
> > > > +	struct dirty_limit_state st;
> > > > +	int dirty_exceeded = check_dirty_limits(bdi, &st);
> > > > +
> > > > +	if (dirty_exceeded < DIRTY_MAY_EXCEED_LIMIT ||
> > > > +	    (dirty_exceeded == DIRTY_MAY_EXCEED_LIMIT &&
> > > > +	     !bdi_task_limit_exceeded(&st, current))) {
> > > > +		if (bdi->dirty_exceeded &&
> > > > +		    dirty_exceeded < DIRTY_MAY_EXCEED_LIMIT)
> > > > +			bdi->dirty_exceeded = 0;
> > > >  		/*
> > > > -		 * Increase the delay for each loop, up to our previous
> > > > -		 * default of taking a 100ms nap.
> > > > +		 * In laptop mode, we wait until hitting the higher threshold
> > > > +		 * before starting background writeout, and then write out all
> > > > +		 * the way down to the lower threshold.  So slow writers cause
> > > > +		 * minimal disk activity.
> > > > +		 *
> > > > +		 * In normal mode, we start background writeout at the lower
> > > > +		 * background_thresh, to keep the amount of dirty memory low.
> > > >  		 */
> > > > -		pause <<= 1;
> > > > -		if (pause > HZ / 10)
> > > > -			pause = HZ / 10;
> > > > +		if (!laptop_mode && dirty_exceeded == DIRTY_EXCEED_BACKGROUND)
> > > > +			bdi_start_background_writeback(bdi);
> > > > +		return;
> > > >  	}
> > > >  
> > > > -	/* Clear dirty_exceeded flag only when no task can exceed the limit */
> > > > -	if (!min_dirty_exceeded && bdi->dirty_exceeded)
> > > > -		bdi->dirty_exceeded = 0;
> > > > +	if (!bdi->dirty_exceeded)
> > > > +		bdi->dirty_exceeded = 1;
> > > >  
> > > > -	if (writeback_in_progress(bdi))
> > > > -		return;
> > > > +	trace_writeback_balance_dirty_pages_waiting(bdi, write_chunk);
> > > > +	/* Kick flusher thread to start doing work if it isn't already */
> > > > +	bdi_start_background_writeback(bdi);
> > > >  
> > > > +	bw.bw_wait_pages = write_chunk;
> > > > +	bw.bw_task = current;
> > > > +	spin_lock(&bdi->balance_lock);
> > > >  	/*
> > > > -	 * In laptop mode, we wait until hitting the higher threshold before
> > > > -	 * starting background writeout, and then write out all the way down
> > > > -	 * to the lower threshold.  So slow writers cause minimal disk activity.
> > > > -	 *
> > > > -	 * In normal mode, we start background writeout at the lower
> > > > -	 * background_thresh, to keep the amount of dirty memory low.
> > > > +	 * First item? Need to schedule distribution of IO completions among
> > > > +	 * items on balance_list
> > > > +	 */
> > > > +	if (list_empty(&bdi->balance_list)) {
> > > > +		bdi->written_start = bdi_stat_sum(bdi, BDI_WRITTEN);
> > > > +		/* FIXME: Delay should be autotuned based on dev throughput */
> > > > +		schedule_delayed_work(&bdi->balance_work, HZ/10);
> > > > +	}
> > > > +	/*
> > > > +	 * Add work to the balance list, from now on the structure is handled
> > > > +	 * by distribute_page_completions()
> > > > +	 */
> > > > +	list_add_tail(&bw.bw_list, &bdi->balance_list);
> > > > +	bdi->balance_waiters++;
> > > Had a query.
> > > 
> > > - What makes sure that flusher thread will not stop writing back till all
> > >   the waiters on the bdi have been woken up. IIUC, flusher thread will 
> > >   stop once global background ratio is with-in limit. Is it possible that
> > >   there are still some waiter on some bdi waiting for more pages to finish
> > >   writeback and that might not happen for sometime. 
> >   Yes, this can possibly happen but once distribute_page_completions()
> > gets called (after a given time), it will notice that we are below limits
> > and wake all waiters.
> > Under normal circumstances, we should have a decent
> > estimate when distribute_page_completions() needs to be called and that
> > should be long before flusher thread finishes it's work. But in cases when
> > a bdi has only a small share of global dirty limit, what you describe can
> > possibly happen.
> 
> So if a bdi share is small then it can happen that global background
> threshold is fine but per bdi threshold is not. That means
> task_bdi_threshold is also above limit and IIUC, distribute_page_completion()
> will not wake up the waiter until bdi_task_limit_exceeded() is in control.
  It will wake them. What you miss is the check right at the beginning of
distribute_page_completions():
      dirty_exceeded = check_dirty_limits(bdi, &st);
      if (dirty_exceeded < DIRTY_MAY_EXCEED_LIMIT) {
               /* Wakeup everybody */
...

  When we are globally below (background+limit)/2, dirty_exceeded is set to
DIRTY_OK or DIRTY_BACKGROUND and thus we just wake all the waiters.

								Honza

-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 3/5] mm: Implement IO-less balance_dirty_pages()
  2011-03-16 19:58         ` Jan Kara
@ 2011-03-16 20:22           ` Vivek Goyal
  0 siblings, 0 replies; 49+ messages in thread
From: Vivek Goyal @ 2011-03-16 20:22 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-fsdevel, linux-mm, Wu Fengguang, Peter Zijlstra,
	Andrew Morton, Christoph Hellwig, Dave Chinner

On Wed, Mar 16, 2011 at 08:58:44PM +0100, Jan Kara wrote:

[..]
> > > > Had a query.
> > > > 
> > > > - What makes sure that flusher thread will not stop writing back till all
> > > >   the waiters on the bdi have been woken up. IIUC, flusher thread will 
> > > >   stop once global background ratio is with-in limit. Is it possible that
> > > >   there are still some waiter on some bdi waiting for more pages to finish
> > > >   writeback and that might not happen for sometime. 
> > >   Yes, this can possibly happen but once distribute_page_completions()
> > > gets called (after a given time), it will notice that we are below limits
> > > and wake all waiters.
> > > Under normal circumstances, we should have a decent
> > > estimate when distribute_page_completions() needs to be called and that
> > > should be long before flusher thread finishes it's work. But in cases when
> > > a bdi has only a small share of global dirty limit, what you describe can
> > > possibly happen.
> > 
> > So if a bdi share is small then it can happen that global background
> > threshold is fine but per bdi threshold is not. That means
> > task_bdi_threshold is also above limit and IIUC, distribute_page_completion()
> > will not wake up the waiter until bdi_task_limit_exceeded() is in control.
>   It will wake them. What you miss is the check right at the beginning of
> distribute_page_completions():
>       dirty_exceeded = check_dirty_limits(bdi, &st);
>       if (dirty_exceeded < DIRTY_MAY_EXCEED_LIMIT) {
>                /* Wakeup everybody */
> ...
> 
>   When we are globally below (background+limit)/2, dirty_exceeded is set to
> DIRTY_OK or DIRTY_BACKGROUND and thus we just wake all the waiters.

Ok, thanks. Now I see it. 

Thanks
Vivek

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* [PATCH 4/5] mm: Remove low limit from sync_writeback_pages()
  2011-03-08 22:31 [PATCH RFC 0/5] IO-less balance_dirty_pages() v2 (simple approach) Jan Kara
                   ` (2 preceding siblings ...)
  2011-03-08 22:31 ` [PATCH 3/5] mm: Implement IO-less balance_dirty_pages() Jan Kara
@ 2011-03-08 22:31 ` Jan Kara
  2011-03-08 22:31 ` [PATCH 5/5] mm: Autotune interval between distribution of page completions Jan Kara
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 49+ messages in thread
From: Jan Kara @ 2011-03-08 22:31 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: linux-mm, Wu Fengguang, Peter Zijlstra, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner

sync_writeback_pages() limited minimal amount of pages to write
in balance_dirty_pages() to 3/2*ratelimit_pages (6 MB) to submit
reasonably sized IO. Since we do not submit any IO anymore, be more
fair and let the task wait only for 3/2*(the amount dirtied).

CC: Andrew Morton <akpm@linux-foundation.org>
CC: Christoph Hellwig <hch@infradead.org>
CC: Dave Chinner <david@fromorbit.com>
CC: Wu Fengguang <fengguang.wu@intel.com>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 mm/page-writeback.c |    9 ++-------
 1 files changed, 2 insertions(+), 7 deletions(-)

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 697dd8e..ff07280 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -43,16 +43,11 @@
 static long ratelimit_pages = 32;
 
 /*
- * When balance_dirty_pages decides that the caller needs to perform some
- * non-background writeback, this is how many pages it will attempt to write.
- * It should be somewhat larger than dirtied pages to ensure that reasonably
- * large amounts of I/O are submitted.
+ * When balance_dirty_pages decides that the caller needs to wait for some
+ * writeback to happen, this is how many pages it will attempt to write.
  */
 static inline long sync_writeback_pages(unsigned long dirtied)
 {
-	if (dirtied < ratelimit_pages)
-		dirtied = ratelimit_pages;
-
 	return dirtied + dirtied / 2;
 }
 
-- 
1.7.1

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH 5/5] mm: Autotune interval between distribution of page completions
  2011-03-08 22:31 [PATCH RFC 0/5] IO-less balance_dirty_pages() v2 (simple approach) Jan Kara
                   ` (3 preceding siblings ...)
  2011-03-08 22:31 ` [PATCH 4/5] mm: Remove low limit from sync_writeback_pages() Jan Kara
@ 2011-03-08 22:31 ` Jan Kara
  2011-03-17 15:46 ` [PATCH RFC 0/5] IO-less balance_dirty_pages() v2 (simple approach) Curt Wohlgemuth
  2011-03-18 14:30 ` Wu Fengguang
  6 siblings, 0 replies; 49+ messages in thread
From: Jan Kara @ 2011-03-08 22:31 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: linux-mm, Wu Fengguang, Peter Zijlstra, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner

To avoid throttling processes in balance_dirty_pages() for too long, it
is desirable to distribute page completions often enough. On the other
hand we do not want to distribute them too often so that we do not burn
too much CPU. Obviously, the proper interval depends on the amount of
pages we wait for and on the speed the underlying device can write them.
So we estimate the throughput of the device, compute the number of
pages we need to be completed, and from that compute desired time of
next distribution of page completions. To avoid extremities, we force
the computed sleep time to be in [HZ/50..HZ/4] interval.

CC: Andrew Morton <akpm@linux-foundation.org>
CC: Christoph Hellwig <hch@infradead.org>
CC: Dave Chinner <david@fromorbit.com>
CC: Wu Fengguang <fengguang.wu@intel.com>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 include/linux/backing-dev.h      |    6 ++-
 include/trace/events/writeback.h |   25 ++++++++++
 mm/backing-dev.c                 |    2 +
 mm/page-writeback.c              |   90 ++++++++++++++++++++++++++++++++-----
 4 files changed, 108 insertions(+), 15 deletions(-)

diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index 65b6e61..a4f9133 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -89,12 +89,14 @@ struct backing_dev_info {
 
 	struct timer_list laptop_mode_wb_timer;
 
-	spinlock_t balance_lock;	/* lock protecting four entries below */
-	unsigned long written_start;	/* BDI_WRITTEN last time we scanned balance_list*/
+	spinlock_t balance_lock;	/* lock protecting entries below */
 	struct list_head balance_list;	/* waiters in balance_dirty_pages */
 	unsigned int balance_waiters;	/* number of waiters in the list */
 	struct delayed_work balance_work;	/* work distributing page
 						   completions among waiters */
+	unsigned long written_start;	/* BDI_WRITTEN last time we scanned balance_list*/
+	unsigned long start_jiffies;	/* time when we last scanned list */
+	unsigned long pages_per_s;	/* estimated throughput of bdi */
 
 #ifdef CONFIG_DEBUG_FS
 	struct dentry *debug_dir;
diff --git a/include/trace/events/writeback.h b/include/trace/events/writeback.h
index 74815fa..00b06a2 100644
--- a/include/trace/events/writeback.h
+++ b/include/trace/events/writeback.h
@@ -211,6 +211,31 @@ TRACE_EVENT(writeback_distribute_page_completions_wakeall,
 	)
 );
 
+TRACE_EVENT(writeback_distribute_page_completions_scheduled,
+	TP_PROTO(struct backing_dev_info *bdi, unsigned long nap,
+		 unsigned long pages),
+	TP_ARGS(bdi, nap, pages),
+	TP_STRUCT__entry(
+		__array(char, name, 32)
+		__field(unsigned long, nap)
+		__field(unsigned long, pages)
+		__field(unsigned long, waiters)
+		__field(unsigned long, pages_per_s)
+	),
+	TP_fast_assign(
+		strncpy(__entry->name, dev_name(bdi->dev), 32);
+		__entry->nap = nap;
+		__entry->pages = pages;
+		__entry->waiters = bdi->balance_waiters;
+		__entry->pages_per_s = bdi->pages_per_s;
+	),
+	TP_printk("bdi=%s sleep=%u ms want_pages=%lu waiters=%lu"
+		  " pages_per_s=%lu",
+		  __entry->name, jiffies_to_msecs(__entry->nap),
+		  __entry->pages, __entry->waiters, __entry->pages_per_s
+	)
+);
+
 DECLARE_EVENT_CLASS(writeback_congest_waited_template,
 
 	TP_PROTO(unsigned int usec_timeout, unsigned int usec_delayed),
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 2ecc3fe..e2cbe5c 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -655,8 +655,10 @@ int bdi_init(struct backing_dev_info *bdi)
 	spin_lock_init(&bdi->balance_lock);
 	INIT_LIST_HEAD(&bdi->balance_list);
 	bdi->written_start = 0;
+	bdi->start_jiffies = 0;
 	bdi->balance_waiters = 0;
 	INIT_DELAYED_WORK(&bdi->balance_work, distribute_page_completions);
+	bdi->pages_per_s = 1;
 
 	bdi_wb_init(&bdi->wb, bdi);
 
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index ff07280..09f1adf 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -597,12 +597,65 @@ static void balance_waiter_done(struct backing_dev_info *bdi,
 	wake_up_process(bw->bw_task);
 }
 
+static unsigned long compute_distribute_time(struct backing_dev_info *bdi,
+					     unsigned long min_pages)
+{
+	unsigned long nap;
+
+	/*
+	 * Because of round robin distribution, every waiter has to get at
+	 * least min_pages pages.
+	 */
+	min_pages *= bdi->balance_waiters;
+	nap = msecs_to_jiffies(
+			((u64)min_pages) * MSEC_PER_SEC / bdi->pages_per_s);
+	/*
+	 * Force computed sleep time to be in interval (HZ/50..HZ/5)
+	 * so that we
+	 * a) don't wake too often and burn too much CPU
+	 * b) check dirty limits at least once in a while
+	 */
+	nap = max_t(unsigned long, HZ/50, nap);
+	nap = min_t(unsigned long, HZ/4, nap);
+	trace_writeback_distribute_page_completions_scheduled(bdi, nap,
+		min_pages);
+	return nap;
+}
+
+/*
+ * When the throughput is computed, we consider an imaginary WINDOW_MS
+ * miliseconds long window. In this window, we know that it took 'deltams'
+ * miliseconds to write 'written' pages and for the rest of the window we
+ * assume number of pages corresponding to the throughput we previously
+ * computed to have been written. Thus we obtain total number of pages
+ * written in the imaginary window and from it new throughput.
+ */
+#define WINDOW_MS 10000
+
+static void update_bdi_throughput(struct backing_dev_info *bdi,
+				 unsigned long written, unsigned long time)
+{
+	unsigned int deltams = jiffies_to_msecs(time - bdi->start_jiffies);
+
+	written -= bdi->written_start;
+	if (deltams > WINDOW_MS) {
+		/* Add 1 to avoid 0 result */
+		bdi->pages_per_s = 1 + ((u64)written) * MSEC_PER_SEC / deltams;
+		return;
+	}
+	bdi->pages_per_s = 1 +
+		(((u64)bdi->pages_per_s) * (WINDOW_MS - deltams) +
+		 ((u64)written) * MSEC_PER_SEC) / WINDOW_MS;
+}
+
 void distribute_page_completions(struct work_struct *work)
 {
 	struct backing_dev_info *bdi =
 		container_of(work, struct backing_dev_info, balance_work.work);
 	unsigned long written = bdi_stat_sum(bdi, BDI_WRITTEN);
 	unsigned long pages_per_waiter;
+	unsigned long cur_time = jiffies;
+	unsigned long min_pages = ULONG_MAX;
 	struct balance_waiter *waiter, *tmpw;
 	struct dirty_limit_state st;
 	int dirty_exceeded;
@@ -616,11 +669,14 @@ void distribute_page_completions(struct work_struct *work)
 		list_for_each_entry_safe(
 				waiter, tmpw, &bdi->balance_list, bw_list)
 			balance_waiter_done(bdi, waiter);
+		update_bdi_throughput(bdi, written, cur_time);
 		spin_unlock(&bdi->balance_lock);
 		return;
 	}
 
 	spin_lock(&bdi->balance_lock);
+	update_bdi_throughput(bdi, written, cur_time);
+	bdi->start_jiffies = cur_time;
 	/* Distribute pages equally among waiters */
 	while (!list_empty(&bdi->balance_list)) {
 		pages_per_waiter = (written - bdi->written_start) /
@@ -638,15 +694,22 @@ void distribute_page_completions(struct work_struct *work)
 				balance_waiter_done(bdi, waiter);
 		}
 	}
-	/* Wake tasks that might have gotten below their limits */
+	/*
+	 * Wake tasks that might have gotten below their limits and compute
+	 * the number of pages we wait for
+	 */
 	list_for_each_entry_safe(waiter, tmpw, &bdi->balance_list, bw_list) {
 		if (dirty_exceeded == DIRTY_MAY_EXCEED_LIMIT &&
-		     !bdi_task_limit_exceeded(&st, waiter->bw_task))
+		    !bdi_task_limit_exceeded(&st, waiter->bw_task))
 			balance_waiter_done(bdi, waiter);
+		else if (waiter->bw_wait_pages < min_pages)
+			min_pages = waiter->bw_wait_pages;
 	}
 	/* More page completions needed? */
-	if (!list_empty(&bdi->balance_list))
-		schedule_delayed_work(&bdi->balance_work, HZ/10);
+	if (!list_empty(&bdi->balance_list)) {
+		schedule_delayed_work(&bdi->balance_work,
+			      compute_distribute_time(bdi, min_pages));
+	}
 	spin_unlock(&bdi->balance_lock);
 }
 
@@ -696,21 +759,22 @@ static void balance_dirty_pages(struct address_space *mapping,
 	bw.bw_task = current;
 	spin_lock(&bdi->balance_lock);
 	/*
-	 * First item? Need to schedule distribution of IO completions among
-	 * items on balance_list
-	 */
-	if (list_empty(&bdi->balance_list)) {
-		bdi->written_start = bdi_stat_sum(bdi, BDI_WRITTEN);
-		/* FIXME: Delay should be autotuned based on dev throughput */
-		schedule_delayed_work(&bdi->balance_work, HZ/10);
-	}
-	/*
 	 * Add work to the balance list, from now on the structure is handled
 	 * by distribute_page_completions()
 	 */
 	list_add_tail(&bw.bw_list, &bdi->balance_list);
 	bdi->balance_waiters++;
 	/*
+	 * First item? Need to schedule distribution of IO completions among
+	 * items on balance_list
+	 */
+	if (bdi->balance_waiters == 1) {
+		bdi->written_start = bdi_stat_sum(bdi, BDI_WRITTEN);
+		bdi->start_jiffies = jiffies;
+		schedule_delayed_work(&bdi->balance_work,
+			compute_distribute_time(bdi, write_chunk));
+	}
+	/*
 	 * Setting task state must happen inside balance_lock to avoid races
 	 * with distribution function waking us.
 	 */
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* Re: [PATCH RFC 0/5] IO-less balance_dirty_pages() v2 (simple approach)
  2011-03-08 22:31 [PATCH RFC 0/5] IO-less balance_dirty_pages() v2 (simple approach) Jan Kara
                   ` (4 preceding siblings ...)
  2011-03-08 22:31 ` [PATCH 5/5] mm: Autotune interval between distribution of page completions Jan Kara
@ 2011-03-17 15:46 ` Curt Wohlgemuth
  2011-03-17 15:51   ` Christoph Hellwig
  2011-03-17 17:32   ` Jan Kara
  2011-03-18 14:30 ` Wu Fengguang
  6 siblings, 2 replies; 49+ messages in thread
From: Curt Wohlgemuth @ 2011-03-17 15:46 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-fsdevel, linux-mm, Wu Fengguang, Peter Zijlstra,
	Andrew Morton

Hi Jan:

On Tue, Mar 8, 2011 at 2:31 PM, Jan Kara <jack@suse.cz> wrote:
>
>  Hello,
>
>  I'm posting second version of my IO-less balance_dirty_pages() patches. This
> is alternative approach to Fengguang's patches - much simpler I believe (only
> 300 lines added) - but obviously I does not provide so sophisticated control.
> Fengguang is currently running some tests on my patches so that we can compare
> the approaches.
>
> The basic idea (implemented in the third patch) is that processes throttled
> in balance_dirty_pages() wait for enough IO to complete. The waiting is
> implemented as follows: Whenever we decide to throttle a task in
> balance_dirty_pages(), task adds itself to a list of tasks that are throttled
> against that bdi and goes to sleep waiting to receive specified amount of page
> IO completions. Once in a while (currently HZ/10, in patch 5 the interval is
> autotuned based on observed IO rate), accumulated page IO completions are
> distributed equally among waiting tasks.
>
> This waiting scheme has been chosen so that waiting time in
> balance_dirty_pages() is proportional to
>  number_waited_pages * number_of_waiters.
> In particular it does not depend on the total number of pages being waited for,
> thus providing possibly a fairer results.
>
> Since last version I've implemented cleanups as suggested by Peter Zilstra.
> The patches undergone more throughout testing. So far I've tested different
> filesystems (ext2, ext3, ext4, xfs, nfs), also a combination of a local
> filesystem and nfs. The load was either various number of dd threads or
> fio with several threads each dirtying pages at different speed.
>
> Results and test scripts can be found at
>  http://beta.suse.com/private/jack/balance_dirty_pages-v2/
> See README file for some explanation of test framework, tests, and graphs.
> Except for ext3 in data=ordered mode, where kjournald creates high
> fluctuations in waiting time of throttled processes (and also high latencies),
> the results look OK. Parallel dd threads are being throttled in the same way
> (in a 2s window threads spend the same time waiting) and also latencies of
> individual waits seem OK - except for ext3 they fit in 100 ms for local
> filesystems. They are in 200-500 ms range for NFS, which isn't that nice but
> to fix that we'd have to modify current ratelimiting scheme to take into
> account on which bdi a page is dirtied. Then we could ratelimit slower BDIs
> more often thus reducing latencies in individual waits...
>
> The results for different bandwidths fio load is interesting. There are 8
> threads dirtying pages at 1,2,4,..,128 MB/s rate. Due to different task
> bdi dirty limits, what happens is that three most aggresive tasks get
> throttled so they end up at bandwidths 24, 26, and 30 MB/s and the lighter
> dirtiers run unthrottled.
>
> I'm planning to run some tests with multiple SATA drives to verify whether
> there aren't some unexpected fluctuations. But currently I have some trouble
> with the HW...
>
> As usual comments are welcome :).

The design of IO-less foreground throttling of writeback in the context of
memory cgroups is being discussed in the memcg patch threads (e.g.,
"[PATCH v6 0/9] memcg: per cgroup dirty page accounting"), but I've got
another concern as well.  And that's how restricting per-BDI writeback to a
single task will affect proposed changes for tracking and accounting of
buffered writes to the IO scheduler ("[RFC] [PATCH 0/6] Provide cgroup
isolation for buffered writes", https://lkml.org/lkml/2011/3/8/332 ).

It seems totally reasonable that reducing competition for write requests to
a BDI -- by using the flusher thread to "handle" foreground writeout --
would increase throughput to that device.  At Google, we experiemented with
this in a hacked-up fashion several months ago (FG task would enqueue a work
item and sleep for some period of time, wake up and see if it was below the
dirty limit), and found that we were indeed getting better throughput.

But if one of one's goals is to provide some sort of disk isolation based on
cgroup parameters, than having at most one stream of write requests
effectively neuters the IO scheduler.  We saw that in practice, which led to
abandoning our attempt at "IO-less throttling."

One possible solution would be to put some of the disk isolation smarts into
the writeback path, so the flusher thread could choose inodes with this as a
criteria, but this seems ugly on its face, and makes my head hurt.

Otherwise, I'm having trouble thinking of a way to do effective isolation in
the IO scheduler without having competing threads -- for different cgroups --
making write requests for buffered data.  Perhaps the best we could do would
be to enable IO-less throttling in writeback as a config option?

Thoughts?

Thanks,
Curt

>
>                                                                Honza
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH RFC 0/5] IO-less balance_dirty_pages() v2 (simple approach)
  2011-03-17 15:46 ` [PATCH RFC 0/5] IO-less balance_dirty_pages() v2 (simple approach) Curt Wohlgemuth
@ 2011-03-17 15:51   ` Christoph Hellwig
  2011-03-17 16:24     ` Curt Wohlgemuth
  2011-03-17 17:32   ` Jan Kara
  1 sibling, 1 reply; 49+ messages in thread
From: Christoph Hellwig @ 2011-03-17 15:51 UTC (permalink / raw)
  To: Curt Wohlgemuth
  Cc: Jan Kara, linux-fsdevel, linux-mm, Wu Fengguang, Peter Zijlstra,
	Andrew Morton

On Thu, Mar 17, 2011 at 08:46:23AM -0700, Curt Wohlgemuth wrote:
> But if one of one's goals is to provide some sort of disk isolation based on
> cgroup parameters, than having at most one stream of write requests
> effectively neuters the IO scheduler.

If you use any kind of buffered I/O you already fail in that respect.
Writeback from balance_dirty_page really is just the wort case right now
with more I/O supposed to be handled by the background threads.  So if
you want to implement isolation properly you need to track the
originator of the I/O between the copy to the pagecache and actual
writeback.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH RFC 0/5] IO-less balance_dirty_pages() v2 (simple approach)
  2011-03-17 15:51   ` Christoph Hellwig
@ 2011-03-17 16:24     ` Curt Wohlgemuth
  2011-03-17 16:43       ` Christoph Hellwig
  0 siblings, 1 reply; 49+ messages in thread
From: Curt Wohlgemuth @ 2011-03-17 16:24 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jan Kara, linux-fsdevel, linux-mm, Wu Fengguang, Peter Zijlstra,
	Andrew Morton

Hi Christoph:

On Thu, Mar 17, 2011 at 8:51 AM, Christoph Hellwig <hch@infradead.org> wrote:
> On Thu, Mar 17, 2011 at 08:46:23AM -0700, Curt Wohlgemuth wrote:
>> But if one of one's goals is to provide some sort of disk isolation based on
>> cgroup parameters, than having at most one stream of write requests
>> effectively neuters the IO scheduler.
>
> If you use any kind of buffered I/O you already fail in that respect.
> Writeback from balance_dirty_page really is just the wort case right now
> with more I/O supposed to be handled by the background threads.  So if
> you want to implement isolation properly you need to track the
> originator of the I/O between the copy to the pagecache and actual
> writeback.

Which is indeed part of the patchset I referred to above ("[RFC]
[PATCH 0/6] Provide cgroup isolation for buffered writes",
https://lkml.org/lkml/2011/3/8/332 ).

Thanks,
Curt
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH RFC 0/5] IO-less balance_dirty_pages() v2 (simple approach)
  2011-03-17 16:24     ` Curt Wohlgemuth
@ 2011-03-17 16:43       ` Christoph Hellwig
  0 siblings, 0 replies; 49+ messages in thread
From: Christoph Hellwig @ 2011-03-17 16:43 UTC (permalink / raw)
  To: Curt Wohlgemuth
  Cc: Christoph Hellwig, Jan Kara, linux-fsdevel, linux-mm,
	Wu Fengguang, Peter Zijlstra, Andrew Morton

On Thu, Mar 17, 2011 at 09:24:28AM -0700, Curt Wohlgemuth wrote:
> Which is indeed part of the patchset I referred to above ("[RFC]
> [PATCH 0/6] Provide cgroup isolation for buffered writes",
> https://lkml.org/lkml/2011/3/8/332 ).

So what about letting us fix normal writeback first and then later look
into cgroups properly.  And to do it properly we'll need to implement
something similar to the I/O less balance dirty pages - be that targeted
writeback from the flusher thread including proper tagging of pages,
or be that writeback from balance_dirty_pages in a why that we keep
multiple processes from writing at the same time.  Although I'd prefer
something that keeps the CG case as close as possible to the normal
code, right now we already have a huge mess with memcg and it's own
handrolled version of direct reclaim which is an even worse stack
hog than the already overly painfull "normal" direct reclaim.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH RFC 0/5] IO-less balance_dirty_pages() v2 (simple approach)
  2011-03-17 15:46 ` [PATCH RFC 0/5] IO-less balance_dirty_pages() v2 (simple approach) Curt Wohlgemuth
  2011-03-17 15:51   ` Christoph Hellwig
@ 2011-03-17 17:32   ` Jan Kara
  2011-03-17 18:55     ` Curt Wohlgemuth
  1 sibling, 1 reply; 49+ messages in thread
From: Jan Kara @ 2011-03-17 17:32 UTC (permalink / raw)
  To: Curt Wohlgemuth
  Cc: Jan Kara, linux-fsdevel, linux-mm, Wu Fengguang, Peter Zijlstra,
	Andrew Morton

On Thu 17-03-11 08:46:23, Curt Wohlgemuth wrote:
> On Tue, Mar 8, 2011 at 2:31 PM, Jan Kara <jack@suse.cz> wrote:
> The design of IO-less foreground throttling of writeback in the context of
> memory cgroups is being discussed in the memcg patch threads (e.g.,
> "[PATCH v6 0/9] memcg: per cgroup dirty page accounting"), but I've got
> another concern as well.  And that's how restricting per-BDI writeback to a
> single task will affect proposed changes for tracking and accounting of
> buffered writes to the IO scheduler ("[RFC] [PATCH 0/6] Provide cgroup
> isolation for buffered writes", https://lkml.org/lkml/2011/3/8/332 ).
> 
> It seems totally reasonable that reducing competition for write requests to
> a BDI -- by using the flusher thread to "handle" foreground writeout --
> would increase throughput to that device.  At Google, we experiemented with
> this in a hacked-up fashion several months ago (FG task would enqueue a work
> item and sleep for some period of time, wake up and see if it was below the
> dirty limit), and found that we were indeed getting better throughput.
> 
> But if one of one's goals is to provide some sort of disk isolation based on
> cgroup parameters, than having at most one stream of write requests
> effectively neuters the IO scheduler.  We saw that in practice, which led to
> abandoning our attempt at "IO-less throttling."
  Let me check if I understand: The problem you have with one flusher
thread is that when written pages all belong to a single memcg, there is
nothing IO scheduler can prioritize, right?

> One possible solution would be to put some of the disk isolation smarts into
> the writeback path, so the flusher thread could choose inodes with this as a
> criteria, but this seems ugly on its face, and makes my head hurt.
  Well, I think it could be implemented in a reasonable way but then you
still miss reads and direct IO from the mix so it will be a poor isolation.
But maybe we could propagate the information from IO scheduler to flusher
thread? If IO scheduler sees memcg has run out of its limit, it could hint
to a flusher thread that it should switch to an inode from a different memcg.
But still the details get nasty as I think about them (how to pick next
memcg, how to pick inodes,...). Essentially, we'd have to do with flusher
threads what old pdflush did when handling congested devices. Ugh.

> Otherwise, I'm having trouble thinking of a way to do effective isolation in
> the IO scheduler without having competing threads -- for different cgroups --
> making write requests for buffered data.  Perhaps the best we could do would
> be to enable IO-less throttling in writeback as a config option?
  Well, nothing prevents us to choose to do foreground writeback throttling
for memcgs and IO-less one without them but as Christoph writes, this
doesn't seem very compeling either... I'll let this brew in my head for
some time and maybe something comes.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH RFC 0/5] IO-less balance_dirty_pages() v2 (simple approach)
  2011-03-17 17:32   ` Jan Kara
@ 2011-03-17 18:55     ` Curt Wohlgemuth
  2011-03-17 22:56       ` Vivek Goyal
  0 siblings, 1 reply; 49+ messages in thread
From: Curt Wohlgemuth @ 2011-03-17 18:55 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-fsdevel, linux-mm, Wu Fengguang, Peter Zijlstra,
	Andrew Morton

On Thu, Mar 17, 2011 at 10:32 AM, Jan Kara <jack@suse.cz> wrote:
> On Thu 17-03-11 08:46:23, Curt Wohlgemuth wrote:
>> On Tue, Mar 8, 2011 at 2:31 PM, Jan Kara <jack@suse.cz> wrote:
>> The design of IO-less foreground throttling of writeback in the context of
>> memory cgroups is being discussed in the memcg patch threads (e.g.,
>> "[PATCH v6 0/9] memcg: per cgroup dirty page accounting"), but I've got
>> another concern as well.  And that's how restricting per-BDI writeback to a
>> single task will affect proposed changes for tracking and accounting of
>> buffered writes to the IO scheduler ("[RFC] [PATCH 0/6] Provide cgroup
>> isolation for buffered writes", https://lkml.org/lkml/2011/3/8/332 ).
>>
>> It seems totally reasonable that reducing competition for write requests to
>> a BDI -- by using the flusher thread to "handle" foreground writeout --
>> would increase throughput to that device.  At Google, we experiemented with
>> this in a hacked-up fashion several months ago (FG task would enqueue a work
>> item and sleep for some period of time, wake up and see if it was below the
>> dirty limit), and found that we were indeed getting better throughput.
>>
>> But if one of one's goals is to provide some sort of disk isolation based on
>> cgroup parameters, than having at most one stream of write requests
>> effectively neuters the IO scheduler.  We saw that in practice, which led to
>> abandoning our attempt at "IO-less throttling."

>  Let me check if I understand: The problem you have with one flusher
> thread is that when written pages all belong to a single memcg, there is
> nothing IO scheduler can prioritize, right?

Correct.  Well, perhaps.  Given that the memory cgroups and the IO
cgroups may not overlap, it's possible that write requests from a
single memcg might be targeted to multiple IO cgroups, and scheduling
priorities can be maintained.  Of course, the other way round might be
the case as well.

The point is just that from however many memcgs the flusher thread is
working on behalf of, there's only a single stream of requests, which
are *likely* for a single IO cgroup, and hence there's nothing to
prioritize.

>> One possible solution would be to put some of the disk isolation smarts into
>> the writeback path, so the flusher thread could choose inodes with this as a
>> criteria, but this seems ugly on its face, and makes my head hurt.
>  Well, I think it could be implemented in a reasonable way but then you
> still miss reads and direct IO from the mix so it will be a poor isolation.

Um, not really, would it?  Presumably there are separate tasks
(directly) issuing simultaneous requests for reads, and DIO writes;
these should interact just fine with writes from the single flusher
thread.

> But maybe we could propagate the information from IO scheduler to flusher
> thread? If IO scheduler sees memcg has run out of its limit, it could hint
> to a flusher thread that it should switch to an inode from a different memcg.
> But still the details get nasty as I think about them (how to pick next
> memcg, how to pick inodes,...). Essentially, we'd have to do with flusher
> threads what old pdflush did when handling congested devices. Ugh.

Yeah, plus what I said above, that memcgs and IO cgroups aren't
necessarily the same cgroups.

>> Otherwise, I'm having trouble thinking of a way to do effective isolation in
>> the IO scheduler without having competing threads -- for different cgroups --
>> making write requests for buffered data.  Perhaps the best we could do would
>> be to enable IO-less throttling in writeback as a config option?

>  Well, nothing prevents us to choose to do foreground writeback throttling
> for memcgs and IO-less one without them but as Christoph writes, this
> doesn't seem very compeling either... I'll let this brew in my head for
> some time and maybe something comes.

I agree with Christoph too; I mainly wanted to get the issue out
there, and will be thinking on it more as well.

Thanks,
Curt
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH RFC 0/5] IO-less balance_dirty_pages() v2 (simple approach)
  2011-03-17 18:55     ` Curt Wohlgemuth
@ 2011-03-17 22:56       ` Vivek Goyal
  0 siblings, 0 replies; 49+ messages in thread
From: Vivek Goyal @ 2011-03-17 22:56 UTC (permalink / raw)
  To: Curt Wohlgemuth
  Cc: Jan Kara, linux-fsdevel, linux-mm, Wu Fengguang, Peter Zijlstra,
	Andrew Morton, Greg Thelen, Christoph Hellwig, Johannes Weiner,
	Justin TerAvest

On Thu, Mar 17, 2011 at 11:55:34AM -0700, Curt Wohlgemuth wrote:
> On Thu, Mar 17, 2011 at 10:32 AM, Jan Kara <jack@suse.cz> wrote:
> > On Thu 17-03-11 08:46:23, Curt Wohlgemuth wrote:
> >> On Tue, Mar 8, 2011 at 2:31 PM, Jan Kara <jack@suse.cz> wrote:
> >> The design of IO-less foreground throttling of writeback in the context of
> >> memory cgroups is being discussed in the memcg patch threads (e.g.,
> >> "[PATCH v6 0/9] memcg: per cgroup dirty page accounting"), but I've got
> >> another concern as well.  And that's how restricting per-BDI writeback to a
> >> single task will affect proposed changes for tracking and accounting of
> >> buffered writes to the IO scheduler ("[RFC] [PATCH 0/6] Provide cgroup
> >> isolation for buffered writes", https://lkml.org/lkml/2011/3/8/332 ).
> >>
> >> It seems totally reasonable that reducing competition for write requests to
> >> a BDI -- by using the flusher thread to "handle" foreground writeout --
> >> would increase throughput to that device.  At Google, we experiemented with
> >> this in a hacked-up fashion several months ago (FG task would enqueue a work
> >> item and sleep for some period of time, wake up and see if it was below the
> >> dirty limit), and found that we were indeed getting better throughput.
> >>
> >> But if one of one's goals is to provide some sort of disk isolation based on
> >> cgroup parameters, than having at most one stream of write requests
> >> effectively neuters the IO scheduler.  We saw that in practice, which led to
> >> abandoning our attempt at "IO-less throttling."
> 
> >  Let me check if I understand: The problem you have with one flusher
> > thread is that when written pages all belong to a single memcg, there is
> > nothing IO scheduler can prioritize, right?
> 
> Correct.  Well, perhaps.  Given that the memory cgroups and the IO
> cgroups may not overlap, it's possible that write requests from a
> single memcg might be targeted to multiple IO cgroups, and scheduling
> priorities can be maintained.  Of course, the other way round might be
> the case as well.

[CCing some folks who were involved in other mail thread]

I think that for buffered write case it would make most sense when memory
controller and IO controller are co-mounted and working with each other.
The reason being that for async writes we need to control the dirty share of
a cgroup as well as try to prioritize the IO at device level from cgroup.

It would not make any sense that a low prio async group is choked at device
level and its footprint in page cache is increasing resulting in choking
other fast writers. 

So we need to make sure that slow writers don't have huge page cache
footprint and hence I think using memory and IO controller together
makes sense. Do you have other use cases where it does not make sense?

> 
> The point is just that from however many memcgs the flusher thread is
> working on behalf of, there's only a single stream of requests, which
> are *likely* for a single IO cgroup, and hence there's nothing to
> prioritize.

I think even single submitter stream can also make sense if underlying
device/bdi is slow and submitter is fast and switches frequently between
memory cgroups for selection of inodes.

So we have IO control at device level and we have IO queues for each
cgroup and if flusher thread can move quickly (say submit 512 pages
from one cgroup and then move to next), from one cgroup to other,
then we should automatically get the IO difference.

In other mail I suggested that if we can keep per memory cgroup per BDI stats
for number of writes in progress, then flusher thread can skip submitting
IO from cgroups which are slow and there are many pending writebacks. That is
a hint to flusher thread that IO scheduler is giving this cgroup a lower
priority hence high number of writes in flight which are simply queued up at
IO schduler. For high priority cgroups which are making progress, pending
writebacks will be small or zero and flusher can submit more inodes/pages
from that memory cgroup. That way a higher weight group should get more IO
done as compared to a slower group.

I am assuming that prioritizing async request is primarily is for slow
media like single SATA disk. If yes, then flusher thread should be
able to submit pages much faster then device can complete those and
can cgroup IO queues busy at end device hence IO scheduler should be
able to prioritize.

Thoughts?

Thanks
Vivek

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH RFC 0/5] IO-less balance_dirty_pages() v2 (simple approach)
  2011-03-08 22:31 [PATCH RFC 0/5] IO-less balance_dirty_pages() v2 (simple approach) Jan Kara
                   ` (5 preceding siblings ...)
  2011-03-17 15:46 ` [PATCH RFC 0/5] IO-less balance_dirty_pages() v2 (simple approach) Curt Wohlgemuth
@ 2011-03-18 14:30 ` Wu Fengguang
  2011-03-22 21:43   ` Jan Kara
  6 siblings, 1 reply; 49+ messages in thread
From: Wu Fengguang @ 2011-03-18 14:30 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, Peter Zijlstra,
	Andrew Morton

Hi Jan,

On Wed, Mar 09, 2011 at 06:31:10AM +0800, Jan Kara wrote:
> 
>   Hello,
> 
>   I'm posting second version of my IO-less balance_dirty_pages() patches. This
> is alternative approach to Fengguang's patches - much simpler I believe (only
> 300 lines added) - but obviously I does not provide so sophisticated control.

Well, it may be too early to claim "simplicity" as an advantage, until
you achieve the following performance/feature comparability (most of
them are not optional ones). AFAICS this work is kind of heavy lifting
that will consume a lot of time and attention. You'd better find some
more fundamental needs before go on the reworking.

(1)  latency
(2)  fairness
(3)  smoothness
(4)  scalability
(5)  per-task IO controller
(6)  per-cgroup IO controller (TBD)
(7)  free combinations of per-task/per-cgroup and bandwidth/priority controllers
(8)  think time compensation
(9)  backed by both theory and tests
(10) adapt pause time up on 100+ dirtiers
(11) adapt pause time down on low dirty pages 
(12) adapt to new dirty threshold/goal
(13) safeguard against dirty exceeding
(14) safeguard against device queue underflow

(brief listing first: I've just returned from travel)

> Fengguang is currently running some tests on my patches so that we can compare
> the approaches.
 
Yup, here are the tracing patches and graphs:

http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/jan-bdp-v2b/

> The basic idea (implemented in the third patch) is that processes throttled
> in balance_dirty_pages() wait for enough IO to complete. The waiting is
> implemented as follows: Whenever we decide to throttle a task in
> balance_dirty_pages(), task adds itself to a list of tasks that are throttled
> against that bdi and goes to sleep waiting to receive specified amount of page
> IO completions. Once in a while (currently HZ/10, in patch 5 the interval is
> autotuned based on observed IO rate), accumulated page IO completions are
> distributed equally among waiting tasks.
> 
> This waiting scheme has been chosen so that waiting time in
> balance_dirty_pages() is proportional to
>   number_waited_pages * number_of_waiters.
> In particular it does not depend on the total number of pages being waited for,
> thus providing possibly a fairer results.

When there comes no IO completion in 1 second (normal in NFS), the
tasks will all get stuck. It is fixable based on your v2 code base
(detailed below), however will likely bring the same level of
complexity as the base bandwidth solution.

As for v2, there are still big gap to fill. NFS dirtiers are
constantly doing 20-25 seconds long delays

http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/jan-bdp-v2b/3G/nfs-1dd-1M-8p-2945M-20%25-2.6.38-rc8-jan-bdp+-2011-03-10-16-31/balance_dirty_pages-pause.png
http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/jan-bdp-v2b/3G/nfs-10dd-1M-8p-2945M-20%25-2.6.38-rc8-jan-bdp+-2011-03-10-16-38/balance_dirty_pages-pause.png

and the tasks are bumping forwards

http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/jan-bdp-v2b/3G/nfs-10dd-1M-8p-2945M-20%25-2.6.38-rc8-jan-bdp+-2011-03-10-16-38/balance_dirty_pages-task-bw.png

> Since last version I've implemented cleanups as suggested by Peter Zilstra.
> The patches undergone more throughout testing. So far I've tested different
> filesystems (ext2, ext3, ext4, xfs, nfs), also a combination of a local
> filesystem and nfs. The load was either various number of dd threads or
> fio with several threads each dirtying pages at different speed.
> 
> Results and test scripts can be found at
>   http://beta.suse.com/private/jack/balance_dirty_pages-v2/
> See README file for some explanation of test framework, tests, and graphs.
> Except for ext3 in data=ordered mode, where kjournald creates high
> fluctuations in waiting time of throttled processes (and also high latencies),
> the results look OK. Parallel dd threads are being throttled in the same way
> (in a 2s window threads spend the same time waiting) and also latencies of
> individual waits seem OK - except for ext3 they fit in 100 ms for local
> filesystems. They are in 200-500 ms range for NFS, which isn't that nice but
> to fix that we'd have to modify current ratelimiting scheme to take into
> account on which bdi a page is dirtied. Then we could ratelimit slower BDIs
> more often thus reducing latencies in individual waits...

Yes the per-cpu rate limit is a problem, so I'm switching to per-task
rate limit.

The direct input from IO completion is another issue. It leaves the
dirty tasks at the mercy of low layer (VFS/FS/bdev) fluctuations and
latencies. So I'm introducing the base bandwidth as a buffer layer.
You may employ the similar technique: to simulate a more smooth flow
of IO completion events based on the average write bandwidth. Then it
naturally introduce the problem of rate mismatch between
simulated/real IO completions, and the need to do more elaborated
position control.

> The results for different bandwidths fio load is interesting. There are 8
> threads dirtying pages at 1,2,4,..,128 MB/s rate. Due to different task
> bdi dirty limits, what happens is that three most aggresive tasks get
> throttled so they end up at bandwidths 24, 26, and 30 MB/s and the lighter
> dirtiers run unthrottled.

The base bandwidth based throttling can do better and provide almost
perfect fairness, because all tasks writing to one bdi derive their
own throttle bandwidth based on the same per-bdi base bandwidth. So
the heavier dirtiers will converge to equal dirty rate and weight.

> I'm planning to run some tests with multiple SATA drives to verify whether
> there aren't some unexpected fluctuations. But currently I have some trouble
> with the HW...
> 
> As usual comments are welcome :).

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH RFC 0/5] IO-less balance_dirty_pages() v2 (simple approach)
  2011-03-18 14:30 ` Wu Fengguang
@ 2011-03-22 21:43   ` Jan Kara
  2011-03-23  4:41     ` Dave Chinner
  2011-03-25 13:44     ` Wu Fengguang
  0 siblings, 2 replies; 49+ messages in thread
From: Jan Kara @ 2011-03-22 21:43 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Jan Kara, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org,
	Peter Zijlstra, Andrew Morton

  Hello Fengguang,

On Fri 18-03-11 22:30:01, Wu Fengguang wrote:
> On Wed, Mar 09, 2011 at 06:31:10AM +0800, Jan Kara wrote:
> > 
> >   Hello,
> > 
> >   I'm posting second version of my IO-less balance_dirty_pages() patches. This
> > is alternative approach to Fengguang's patches - much simpler I believe (only
> > 300 lines added) - but obviously I does not provide so sophisticated control.
> 
> Well, it may be too early to claim "simplicity" as an advantage, until
> you achieve the following performance/feature comparability (most of
> them are not optional ones). AFAICS this work is kind of heavy lifting
> that will consume a lot of time and attention. You'd better find some
> more fundamental needs before go on the reworking.
> 
> (1)  latency
> (2)  fairness
> (3)  smoothness
> (4)  scalability
> (5)  per-task IO controller
> (6)  per-cgroup IO controller (TBD)
> (7)  free combinations of per-task/per-cgroup and bandwidth/priority controllers
> (8)  think time compensation
> (9)  backed by both theory and tests
> (10) adapt pause time up on 100+ dirtiers
> (11) adapt pause time down on low dirty pages 
> (12) adapt to new dirty threshold/goal
> (13) safeguard against dirty exceeding
> (14) safeguard against device queue underflow
  I think this is a misunderstanding of my goals ;). My main goal is to
explore, how far we can get with a relatively simple approach to IO-less
balance_dirty_pages(). I guess what I have is better than the current
balance_dirty_pages() but it sure does not even try to provide all the
features you try to provide.

I'm thinking about tweaking ratelimiting logic to reduce latencies in some
tests, possibly add compensation when we waited for too long in
balance_dirty_pages() (e.g. because of bumpy IO completion) but that's
about it...

Basically I do this so that we can compare and decide whether what my
simple approach offers is OK or whether we want some more complex solution
like your patches...

> > The basic idea (implemented in the third patch) is that processes throttled
> > in balance_dirty_pages() wait for enough IO to complete. The waiting is
> > implemented as follows: Whenever we decide to throttle a task in
> > balance_dirty_pages(), task adds itself to a list of tasks that are throttled
> > against that bdi and goes to sleep waiting to receive specified amount of page
> > IO completions. Once in a while (currently HZ/10, in patch 5 the interval is
> > autotuned based on observed IO rate), accumulated page IO completions are
> > distributed equally among waiting tasks.
> > 
> > This waiting scheme has been chosen so that waiting time in
> > balance_dirty_pages() is proportional to
> >   number_waited_pages * number_of_waiters.
> > In particular it does not depend on the total number of pages being waited for,
> > thus providing possibly a fairer results.
> 
> When there comes no IO completion in 1 second (normal in NFS), the
> tasks will all get stuck. It is fixable based on your v2 code base
> (detailed below), however will likely bring the same level of
> complexity as the base bandwidth solution.
  I have some plans how to account for bumpy IO completion when we wait for
a long time and then get completion of much more IO than we actually need.
But in case where processes use all the bandwidth and the latency of the
device is high, sure they take the penalty and have to wait for a long time
in balance_dirty_pages().

> As for v2, there are still big gap to fill. NFS dirtiers are
> constantly doing 20-25 seconds long delays
> 
> http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/jan-bdp-v2b/3G/nfs-1dd-1M-8p-2945M-20%25-2.6.38-rc8-jan-bdp+-2011-03-10-16-31/balance_dirty_pages-pause.png
> http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/jan-bdp-v2b/3G/nfs-10dd-1M-8p-2945M-20%25-2.6.38-rc8-jan-bdp+-2011-03-10-16-38/balance_dirty_pages-pause.png
  Yeah, this is because they want lots of pages each
(3/2*MAX_WRITEBACK_PAGES). I'll try to change ratelimiting to make several
shorter sleeps. But ultimately you have to wait this much. Just you can
split those big sleeps in more of smaller ones.
 
> and the tasks are bumping forwards
> 
> http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/jan-bdp-v2b/3G/nfs-10dd-1M-8p-2945M-20%25-2.6.38-rc8-jan-bdp+-2011-03-10-16-38/balance_dirty_pages-task-bw.png
  Yeah, that's a result of bumpy NFS writeout and basically the consequence
of the above. Maybe it can be helped but I don't find this to be a problem on
its own...

> > Since last version I've implemented cleanups as suggested by Peter Zilstra.
> > The patches undergone more throughout testing. So far I've tested different
> > filesystems (ext2, ext3, ext4, xfs, nfs), also a combination of a local
> > filesystem and nfs. The load was either various number of dd threads or
> > fio with several threads each dirtying pages at different speed.
> > 
> > Results and test scripts can be found at
> >   http://beta.suse.com/private/jack/balance_dirty_pages-v2/
> > See README file for some explanation of test framework, tests, and graphs.
> > Except for ext3 in data=ordered mode, where kjournald creates high
> > fluctuations in waiting time of throttled processes (and also high latencies),
> > the results look OK. Parallel dd threads are being throttled in the same way
> > (in a 2s window threads spend the same time waiting) and also latencies of
> > individual waits seem OK - except for ext3 they fit in 100 ms for local
> > filesystems. They are in 200-500 ms range for NFS, which isn't that nice but
> > to fix that we'd have to modify current ratelimiting scheme to take into
> > account on which bdi a page is dirtied. Then we could ratelimit slower BDIs
> > more often thus reducing latencies in individual waits...
> 
> Yes the per-cpu rate limit is a problem, so I'm switching to per-task
> rate limit.
  BTW: Have you considered per-bdi ratelimiting? Both per-task and per-bdi
make sense just they are going to have slightly different properties...
Current per-cpu ratelimit counters tend to behave like per-task
ratelimiting at least for fast dirtiers because once a task is blocked in
balance_dirty_pages() another task runs on that cpu and uses the counter
for itself. So I wouldn't expect big differences from per-task
ratelimiting...

> The direct input from IO completion is another issue. It leaves the
> dirty tasks at the mercy of low layer (VFS/FS/bdev) fluctuations and
> latencies. So I'm introducing the base bandwidth as a buffer layer.
> You may employ the similar technique: to simulate a more smooth flow
> of IO completion events based on the average write bandwidth. Then it
> naturally introduce the problem of rate mismatch between
> simulated/real IO completions, and the need to do more elaborated
> position control.
  Exacttly, that's why I don't want to base throttling on some computed
value (well, I also somehow estimate necessary sleep time but that's more a
performance optimization) but rather leave tasks "at the mercy of lower
layers" as you write ;) I don't think it's necessarily a bad thing. 
 
> > The results for different bandwidths fio load is interesting. There are 8
> > threads dirtying pages at 1,2,4,..,128 MB/s rate. Due to different task
> > bdi dirty limits, what happens is that three most aggresive tasks get
> > throttled so they end up at bandwidths 24, 26, and 30 MB/s and the lighter
> > dirtiers run unthrottled.
> 
> The base bandwidth based throttling can do better and provide almost
> perfect fairness, because all tasks writing to one bdi derive their
> own throttle bandwidth based on the same per-bdi base bandwidth. So
> the heavier dirtiers will converge to equal dirty rate and weight.
  So what do you consider a perfect fairness in this case and are you sure
it is desirable? I was thinking about this and I'm not sure...

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH RFC 0/5] IO-less balance_dirty_pages() v2 (simple approach)
  2011-03-22 21:43   ` Jan Kara
@ 2011-03-23  4:41     ` Dave Chinner
  2011-03-25 12:59       ` Wu Fengguang
  2011-03-25 13:44     ` Wu Fengguang
  1 sibling, 1 reply; 49+ messages in thread
From: Dave Chinner @ 2011-03-23  4:41 UTC (permalink / raw)
  To: Jan Kara
  Cc: Wu Fengguang, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org,
	Peter Zijlstra, Andrew Morton

On Tue, Mar 22, 2011 at 10:43:14PM +0100, Jan Kara wrote:
>   Hello Fengguang,
> 
> On Fri 18-03-11 22:30:01, Wu Fengguang wrote:
> > On Wed, Mar 09, 2011 at 06:31:10AM +0800, Jan Kara wrote:
> > > 
> > >   Hello,
> > > 
> > >   I'm posting second version of my IO-less balance_dirty_pages() patches. This
> > > is alternative approach to Fengguang's patches - much simpler I believe (only
> > > 300 lines added) - but obviously I does not provide so sophisticated control.
> > 
> > Well, it may be too early to claim "simplicity" as an advantage, until
> > you achieve the following performance/feature comparability (most of
> > them are not optional ones). AFAICS this work is kind of heavy lifting
> > that will consume a lot of time and attention. You'd better find some
> > more fundamental needs before go on the reworking.
> > 
> > (1)  latency
> > (2)  fairness
> > (3)  smoothness
> > (4)  scalability
> > (5)  per-task IO controller
> > (6)  per-cgroup IO controller (TBD)
> > (7)  free combinations of per-task/per-cgroup and bandwidth/priority controllers
> > (8)  think time compensation
> > (9)  backed by both theory and tests
> > (10) adapt pause time up on 100+ dirtiers
> > (11) adapt pause time down on low dirty pages 
> > (12) adapt to new dirty threshold/goal
> > (13) safeguard against dirty exceeding
> > (14) safeguard against device queue underflow
>   I think this is a misunderstanding of my goals ;). My main goal is to
> explore, how far we can get with a relatively simple approach to IO-less
> balance_dirty_pages(). I guess what I have is better than the current
> balance_dirty_pages() but it sure does not even try to provide all the
> features you try to provide.

This is my major concern - maintainability of the code. It's all
well and good to evaluate the code based on it's current
performance, but what about 2 or 3 years down the track when for
some reason it's not working like it was intended - just like what
happened with slow degradation in writeback performance between
~2.6.15 and ~2.6.30.

Fundamentally, the _only_ thing I want balance_dirty_pages() to do
is _not issue IO_. Issuing IO in balance_dirty_pages() simply does
not scale, especially for devices that have no inherent concurrency.
I don't care if the solution is not perfectly fair or that there is
some latency jitter between threads, I just want to avoid having the
IO issue patterns change drastically when the system runs out of
clean pages.

IMO, that's all we should be trying to acheive with IO-less write
throttling right now. Get that algorithm and infrastructure right
first, then we can work out how to build on that to do more fancy
stuff.

> I'm thinking about tweaking ratelimiting logic to reduce latencies in some
> tests, possibly add compensation when we waited for too long in
> balance_dirty_pages() (e.g. because of bumpy IO completion) but that's
> about it...
> 
> Basically I do this so that we can compare and decide whether what my
> simple approach offers is OK or whether we want some more complex solution
> like your patches...

I agree completely.

FWIW (and that may not be much), the IO-less write throttling that I
wrote for Irix back in 2004 was very simple and very effective -
input and output bandwidth estimation updated once per second, with
a variable write syscall delay applied on each syscall also
calculated once per second. The change to the delay was based on the
difference between input and output rates and the number of write
syscalls per second.

I tried all sorts of fancy stuff to improve it, but the corner cases
in anything fancy led to substantial complexity of algorithms and
code and workloads that just didn't work well.  In the end, simple
worked better than fancy and complex and was easier to understand,
predict and tune....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH RFC 0/5] IO-less balance_dirty_pages() v2 (simple approach)
  2011-03-23  4:41     ` Dave Chinner
@ 2011-03-25 12:59       ` Wu Fengguang
  0 siblings, 0 replies; 49+ messages in thread
From: Wu Fengguang @ 2011-03-25 12:59 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jan Kara, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org,
	Peter Zijlstra, Andrew Morton

Hi Dave,

On Wed, Mar 23, 2011 at 12:41:52PM +0800, Dave Chinner wrote:
> On Tue, Mar 22, 2011 at 10:43:14PM +0100, Jan Kara wrote:
> >   Hello Fengguang,
> > 
> > On Fri 18-03-11 22:30:01, Wu Fengguang wrote:
> > > On Wed, Mar 09, 2011 at 06:31:10AM +0800, Jan Kara wrote:
> > > > 
> > > >   Hello,
> > > > 
> > > >   I'm posting second version of my IO-less balance_dirty_pages() patches. This
> > > > is alternative approach to Fengguang's patches - much simpler I believe (only
> > > > 300 lines added) - but obviously I does not provide so sophisticated control.
> > > 
> > > Well, it may be too early to claim "simplicity" as an advantage, until
> > > you achieve the following performance/feature comparability (most of
> > > them are not optional ones). AFAICS this work is kind of heavy lifting
> > > that will consume a lot of time and attention. You'd better find some
> > > more fundamental needs before go on the reworking.

To start with, let me explain the below items and you judge whether
they are desirable ones now and future.

> > > (1)  latency

The potential impact a write(2) syscall get blocked for over 200ms
could be

- user perceptible unresponsiveness

- interrupted IO pipeline in such kind of applications

        loop {
                read(buf)
                write(buf)
        }

  In particular the read() IO stream will be interrupted. If it's a
  read stream from another disk, then the disk may go idle for the
  period. If it's a read stream from network, the client side
  uploading files via an ADSL link will find the upload bandwidth
  under-utilized.

> > > (2)  fairness

This is raised by Jan Kara. It should be application dependent.
Some workloads may not care at all.  Some may be more sensitive.

> > > (3)  smoothness

It reflects the variation of a task's throttled dirty bandwidth.
The fluctuation of dirty rate can degrade the possible read-write IO
pipeline in the similar way as (1).

> > > (4)  scalability

Being able to scale well to 1000+ dirtiers, and to large number of
CPUs and disks.

> > > (5)  per-task IO controller
> > > (6)  per-cgroup IO controller (TBD)
> > > (7)  free combinations of per-task/per-cgroup and bandwidth/priority controllers
> > > (8)  think time compensation

The IO controller stuff. It seems there are only disparities on when
and how to provide the features. (5) and priority based IO controller
are the side product of the base bandwidth solution.

> > > (9)  backed by both theory and tests

The prerequisites for quality code.

> > > (10) adapt pause time up on 100+ dirtiers

To reduce CPU overheads, requested by you :)

> > > (11) adapt pause time down on low dirty pages 

This is necessary to avoid < 100% disk utilization on small memory
systems.

> > > (12) adapt to new dirty threshold/goal

This is to support dynamically lowering the number of dirty pages 
in order to avoid excessive page_out() during page reclaim.

> > > (13) safeguard against dirty exceeding

This is a must have to prevent DoS situations. There have to be a hard
dirty limit that force the dirtiers to loop inside balance_dirty_pages().
Because the regular throttling scheme in both Jan and my patches can
at most throttle the dirtier at eg. 200ms per page. This is not enough
to stop the dirty pages from growing high when there are many dirtiers
writing to a slow USB stick.

> > > (14) safeguard against device queue underflow
(in JBOD case, maintain reasonable number of dirty pages in each disk queue)

This is necessary to avoid < 100% disk utilization on small memory
systems. Note that (14) and (11) share the same goal, however need
independent safeguards at two different places.

> >   I think this is a misunderstanding of my goals ;). My main goal is to
> > explore, how far we can get with a relatively simple approach to IO-less
> > balance_dirty_pages(). I guess what I have is better than the current
> > balance_dirty_pages() but it sure does not even try to provide all the
> > features you try to provide.
> 
> This is my major concern - maintainability of the code. It's all
> well and good to evaluate the code based on it's current
> performance, but what about 2 or 3 years down the track when for
> some reason it's not working like it was intended - just like what
> happened with slow degradation in writeback performance between
> ~2.6.15 and ~2.6.30.
>
> Fundamentally, the _only_ thing I want balance_dirty_pages() to do
> is _not issue IO_. Issuing IO in balance_dirty_pages() simply does
> not scale, especially for devices that have no inherent concurrency.
> I don't care if the solution is not perfectly fair or that there is
> some latency jitter between threads, I just want to avoid having the
> IO issue patterns change drastically when the system runs out of
> clean pages.
> 
> IMO, that's all we should be trying to acheive with IO-less write
> throttling right now. Get that algorithm and infrastructure right
> first, then we can work out how to build on that to do more fancy
> stuff.

You already got what you want in the base bandwidth patches v6 :)
It has been tested extensively and is ready for more wide -mm
exercises. Plus more for others that actually care.

As long as the implemented features/performance are

- desirable in long term
- cannot be done in fundamentally more simple/robust way

Then I see no point to drop the ready-to-work solution and take so
much efforts to breed another one, perhaps taking 1-2 more release
cycles only to reach the same level of performance _and_ complexity.

> > I'm thinking about tweaking ratelimiting logic to reduce latencies in some
> > tests, possibly add compensation when we waited for too long in
> > balance_dirty_pages() (e.g. because of bumpy IO completion) but that's
> > about it...
> > 
> > Basically I do this so that we can compare and decide whether what my
> > simple approach offers is OK or whether we want some more complex solution
> > like your patches...
> 
> I agree completely.
> 
> FWIW (and that may not be much), the IO-less write throttling that I
> wrote for Irix back in 2004 was very simple and very effective -
> input and output bandwidth estimation updated once per second, with
> a variable write syscall delay applied on each syscall also
> calculated once per second. The change to the delay was based on the
> difference between input and output rates and the number of write
> syscalls per second.
> 
> I tried all sorts of fancy stuff to improve it, but the corner cases
> in anything fancy led to substantial complexity of algorithms and
> code and workloads that just didn't work well.  In the end, simple
> worked better than fancy and complex and was easier to understand,
> predict and tune....

I may be one of the biggest sufferers of the inherent writeback
complexities, and would be more than pleased to have some simple
scheme that deals with only simple requirements. However.. I feel
obliged to treat smoothness as one requirement when everybody are
complaining writeback responsiveness issues in bugzilla, mailing list
as well as in LSF and kernel summit.

To help debug and evaluate the work, a bunch of test suites are
created and visualized. I can tell from the graphs that the v6 patches
are performing very good in all the test cases. It's achieved by 880
lines of code in page-writeback.c. It's good that Jan's v2 does it
merely in 300 lines of code, however it's still missing the code for
(10)-(14) and the performance gaps are simply too large in many of the
cases. IMHO it would need some non-trivial revisions to become a real
candidate.

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH RFC 0/5] IO-less balance_dirty_pages() v2 (simple approach)
  2011-03-22 21:43   ` Jan Kara
  2011-03-23  4:41     ` Dave Chinner
@ 2011-03-25 13:44     ` Wu Fengguang
  2011-03-25 23:05       ` Jan Kara
  1 sibling, 1 reply; 49+ messages in thread
From: Wu Fengguang @ 2011-03-25 13:44 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, Peter Zijlstra,
	Andrew Morton

Hi Jan,

On Wed, Mar 23, 2011 at 05:43:14AM +0800, Jan Kara wrote:
>   Hello Fengguang,
> 
> On Fri 18-03-11 22:30:01, Wu Fengguang wrote:
> > On Wed, Mar 09, 2011 at 06:31:10AM +0800, Jan Kara wrote:
> > > 
> > >   Hello,
> > > 
> > >   I'm posting second version of my IO-less balance_dirty_pages() patches. This
> > > is alternative approach to Fengguang's patches - much simpler I believe (only
> > > 300 lines added) - but obviously I does not provide so sophisticated control.
> > 
> > Well, it may be too early to claim "simplicity" as an advantage, until
> > you achieve the following performance/feature comparability (most of
> > them are not optional ones). AFAICS this work is kind of heavy lifting
> > that will consume a lot of time and attention. You'd better find some
> > more fundamental needs before go on the reworking.
> > 
> > (1)  latency
> > (2)  fairness
> > (3)  smoothness
> > (4)  scalability
> > (5)  per-task IO controller
> > (6)  per-cgroup IO controller (TBD)
> > (7)  free combinations of per-task/per-cgroup and bandwidth/priority controllers
> > (8)  think time compensation
> > (9)  backed by both theory and tests
> > (10) adapt pause time up on 100+ dirtiers
> > (11) adapt pause time down on low dirty pages 
> > (12) adapt to new dirty threshold/goal
> > (13) safeguard against dirty exceeding
> > (14) safeguard against device queue underflow
>   I think this is a misunderstanding of my goals ;). My main goal is to
> explore, how far we can get with a relatively simple approach to IO-less
> balance_dirty_pages(). I guess what I have is better than the current
> balance_dirty_pages() but it sure does not even try to provide all the
> features you try to provide.

OK.

> I'm thinking about tweaking ratelimiting logic to reduce latencies in some
> tests, possibly add compensation when we waited for too long in
> balance_dirty_pages() (e.g. because of bumpy IO completion) but that's
> about it...
> 
> Basically I do this so that we can compare and decide whether what my
> simple approach offers is OK or whether we want some more complex solution
> like your patches...

Yeah, now both results are on the website. Let's see whether they are
acceptable for others.

> > > The basic idea (implemented in the third patch) is that processes throttled
> > > in balance_dirty_pages() wait for enough IO to complete. The waiting is
> > > implemented as follows: Whenever we decide to throttle a task in
> > > balance_dirty_pages(), task adds itself to a list of tasks that are throttled
> > > against that bdi and goes to sleep waiting to receive specified amount of page
> > > IO completions. Once in a while (currently HZ/10, in patch 5 the interval is
> > > autotuned based on observed IO rate), accumulated page IO completions are
> > > distributed equally among waiting tasks.
> > > 
> > > This waiting scheme has been chosen so that waiting time in
> > > balance_dirty_pages() is proportional to
> > >   number_waited_pages * number_of_waiters.
> > > In particular it does not depend on the total number of pages being waited for,
> > > thus providing possibly a fairer results.
> > 
> > When there comes no IO completion in 1 second (normal in NFS), the
> > tasks will all get stuck. It is fixable based on your v2 code base
> > (detailed below), however will likely bring the same level of
> > complexity as the base bandwidth solution.
>   I have some plans how to account for bumpy IO completion when we wait for
> a long time and then get completion of much more IO than we actually need.
> But in case where processes use all the bandwidth and the latency of the
> device is high, sure they take the penalty and have to wait for a long time
> in balance_dirty_pages().

No, I don't think it's good to block for long time in
balance_dirty_pages(). This seems to be our biggest branch point.

> > As for v2, there are still big gap to fill. NFS dirtiers are
> > constantly doing 20-25 seconds long delays
> > 
> > http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/jan-bdp-v2b/3G/nfs-1dd-1M-8p-2945M-20%25-2.6.38-rc8-jan-bdp+-2011-03-10-16-31/balance_dirty_pages-pause.png
> > http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/jan-bdp-v2b/3G/nfs-10dd-1M-8p-2945M-20%25-2.6.38-rc8-jan-bdp+-2011-03-10-16-38/balance_dirty_pages-pause.png
>   Yeah, this is because they want lots of pages each
> (3/2*MAX_WRITEBACK_PAGES). I'll try to change ratelimiting to make several
> shorter sleeps. But ultimately you have to wait this much. Just you can
> split those big sleeps in more of smaller ones.

Ideally I prefer less than 100ms sleep time for achieving smooth
responsiveness and IO pipeline.

> > and the tasks are bumping forwards
> > 
> > http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/jan-bdp-v2b/3G/nfs-10dd-1M-8p-2945M-20%25-2.6.38-rc8-jan-bdp+-2011-03-10-16-38/balance_dirty_pages-task-bw.png
>   Yeah, that's a result of bumpy NFS writeout and basically the consequence
> of the above. Maybe it can be helped but I don't find this to be a problem on
> its own...

ditto.

> > > Since last version I've implemented cleanups as suggested by Peter Zilstra.
> > > The patches undergone more throughout testing. So far I've tested different
> > > filesystems (ext2, ext3, ext4, xfs, nfs), also a combination of a local
> > > filesystem and nfs. The load was either various number of dd threads or
> > > fio with several threads each dirtying pages at different speed.
> > > 
> > > Results and test scripts can be found at
> > >   http://beta.suse.com/private/jack/balance_dirty_pages-v2/
> > > See README file for some explanation of test framework, tests, and graphs.
> > > Except for ext3 in data=ordered mode, where kjournald creates high
> > > fluctuations in waiting time of throttled processes (and also high latencies),
> > > the results look OK. Parallel dd threads are being throttled in the same way
> > > (in a 2s window threads spend the same time waiting) and also latencies of
> > > individual waits seem OK - except for ext3 they fit in 100 ms for local
> > > filesystems. They are in 200-500 ms range for NFS, which isn't that nice but
> > > to fix that we'd have to modify current ratelimiting scheme to take into
> > > account on which bdi a page is dirtied. Then we could ratelimit slower BDIs
> > > more often thus reducing latencies in individual waits...
> > 
> > Yes the per-cpu rate limit is a problem, so I'm switching to per-task
> > rate limit.
>   BTW: Have you considered per-bdi ratelimiting? Both per-task and per-bdi
> make sense just they are going to have slightly different properties...
> Current per-cpu ratelimit counters tend to behave like per-task
> ratelimiting at least for fast dirtiers because once a task is blocked in
> balance_dirty_pages() another task runs on that cpu and uses the counter
> for itself. So I wouldn't expect big differences from per-task
> ratelimiting...

Good point. It should be enough to have per-bdi rate limiting threshold.
However I still need to keep per-task nr_dirtied for doing per-task
throttling bandwidth. The per-cpu bdp_ratelimits mix dirtied pages
from different tasks and bdis, which is unacceptable for me.

> > The direct input from IO completion is another issue. It leaves the
> > dirty tasks at the mercy of low layer (VFS/FS/bdev) fluctuations and
> > latencies. So I'm introducing the base bandwidth as a buffer layer.
> > You may employ the similar technique: to simulate a more smooth flow
> > of IO completion events based on the average write bandwidth. Then it
> > naturally introduce the problem of rate mismatch between
> > simulated/real IO completions, and the need to do more elaborated
> > position control.
>   Exacttly, that's why I don't want to base throttling on some computed
> value (well, I also somehow estimate necessary sleep time but that's more a
> performance optimization) but rather leave tasks "at the mercy of lower
> layers" as you write ;) I don't think it's necessarily a bad thing. 

Again, the same branch point :)

> > > The results for different bandwidths fio load is interesting. There are 8
> > > threads dirtying pages at 1,2,4,..,128 MB/s rate. Due to different task
> > > bdi dirty limits, what happens is that three most aggresive tasks get
> > > throttled so they end up at bandwidths 24, 26, and 30 MB/s and the lighter
> > > dirtiers run unthrottled.
> > 
> > The base bandwidth based throttling can do better and provide almost
> > perfect fairness, because all tasks writing to one bdi derive their
> > own throttle bandwidth based on the same per-bdi base bandwidth. So
> > the heavier dirtiers will converge to equal dirty rate and weight.
>   So what do you consider a perfect fairness in this case and are you sure
> it is desirable? I was thinking about this and I'm not sure...

Perfect fairness could be 1, 2, 4, 8, N, N, N MB/s, where

        N = (write_bandwidth - 1 - 2 - 4 - 8) / 3.

I guess its usefulness is largely depending on the user space
applications.  Most of them should not be sensible to it.

Thanks,
Fengguang

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH RFC 0/5] IO-less balance_dirty_pages() v2 (simple approach)
  2011-03-25 13:44     ` Wu Fengguang
@ 2011-03-25 23:05       ` Jan Kara
  2011-03-28  2:44         ` Wu Fengguang
  0 siblings, 1 reply; 49+ messages in thread
From: Jan Kara @ 2011-03-25 23:05 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Jan Kara, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org,
	Peter Zijlstra, Andrew Morton

  Hello Fengguang,

On Fri 25-03-11 21:44:11, Wu Fengguang wrote:
> On Wed, Mar 23, 2011 at 05:43:14AM +0800, Jan Kara wrote:
> >   Hello Fengguang,
> > 
> > On Fri 18-03-11 22:30:01, Wu Fengguang wrote:
> > > On Wed, Mar 09, 2011 at 06:31:10AM +0800, Jan Kara wrote:
> > > > 
> > > >   Hello,
> > > > 
> > > >   I'm posting second version of my IO-less balance_dirty_pages() patches. This
> > > > is alternative approach to Fengguang's patches - much simpler I believe (only
> > > > 300 lines added) - but obviously I does not provide so sophisticated control.
> > > 
> > > Well, it may be too early to claim "simplicity" as an advantage, until
> > > you achieve the following performance/feature comparability (most of
> > > them are not optional ones). AFAICS this work is kind of heavy lifting
> > > that will consume a lot of time and attention. You'd better find some
> > > more fundamental needs before go on the reworking.
> > > 
> > > (1)  latency
> > > (2)  fairness
> > > (3)  smoothness
> > > (4)  scalability
> > > (5)  per-task IO controller
> > > (6)  per-cgroup IO controller (TBD)
> > > (7)  free combinations of per-task/per-cgroup and bandwidth/priority controllers
> > > (8)  think time compensation
> > > (9)  backed by both theory and tests
> > > (10) adapt pause time up on 100+ dirtiers
> > > (11) adapt pause time down on low dirty pages 
> > > (12) adapt to new dirty threshold/goal
> > > (13) safeguard against dirty exceeding
> > > (14) safeguard against device queue underflow
> >   I think this is a misunderstanding of my goals ;). My main goal is to
> > explore, how far we can get with a relatively simple approach to IO-less
> > balance_dirty_pages(). I guess what I have is better than the current
> > balance_dirty_pages() but it sure does not even try to provide all the
> > features you try to provide.
> 
> OK.
> 
> > I'm thinking about tweaking ratelimiting logic to reduce latencies in some
> > tests, possibly add compensation when we waited for too long in
> > balance_dirty_pages() (e.g. because of bumpy IO completion) but that's
> > about it...
> > 
> > Basically I do this so that we can compare and decide whether what my
> > simple approach offers is OK or whether we want some more complex solution
> > like your patches...
> 
> Yeah, now both results are on the website. Let's see whether they are
> acceptable for others.
  Yes. BTW, I think we'll discuss this at LSF so it would be beneficial if
we both prepared a fairly short explanation of our algorithm and some
summary of the measured results. I think it would be good to keep each of
us below 5 minutes so that we don't bore the audience - people will ask for
details where they are interested... What do you think?

I'll try to run also your patches on my setup to see how they work :) V6
from your website is the latest version, isn't it?

> > > > The basic idea (implemented in the third patch) is that processes throttled
> > > > in balance_dirty_pages() wait for enough IO to complete. The waiting is
> > > > implemented as follows: Whenever we decide to throttle a task in
> > > > balance_dirty_pages(), task adds itself to a list of tasks that are throttled
> > > > against that bdi and goes to sleep waiting to receive specified amount of page
> > > > IO completions. Once in a while (currently HZ/10, in patch 5 the interval is
> > > > autotuned based on observed IO rate), accumulated page IO completions are
> > > > distributed equally among waiting tasks.
> > > > 
> > > > This waiting scheme has been chosen so that waiting time in
> > > > balance_dirty_pages() is proportional to
> > > >   number_waited_pages * number_of_waiters.
> > > > In particular it does not depend on the total number of pages being waited for,
> > > > thus providing possibly a fairer results.
> > > 
> > > When there comes no IO completion in 1 second (normal in NFS), the
> > > tasks will all get stuck. It is fixable based on your v2 code base
> > > (detailed below), however will likely bring the same level of
> > > complexity as the base bandwidth solution.
> >   I have some plans how to account for bumpy IO completion when we wait for
> > a long time and then get completion of much more IO than we actually need.
> > But in case where processes use all the bandwidth and the latency of the
> > device is high, sure they take the penalty and have to wait for a long time
> > in balance_dirty_pages().
> 
> No, I don't think it's good to block for long time in
> balance_dirty_pages(). This seems to be our biggest branch point.
  I agree we should not block for several seconds under normal load but
when something insane like 1000 dds is running, I don't think it's a big
problem :)

And actually the NFS traces you pointed to originally seem to be different
problem, in fact not directly related to what balance_dirty_pages() does...
And with local filesystem the results seem to be reasonable (although there
are some longer sleeps in your JBOD measurements I don't understand yet).

> > > > The results for different bandwidths fio load is interesting. There are 8
> > > > threads dirtying pages at 1,2,4,..,128 MB/s rate. Due to different task
> > > > bdi dirty limits, what happens is that three most aggresive tasks get
> > > > throttled so they end up at bandwidths 24, 26, and 30 MB/s and the lighter
> > > > dirtiers run unthrottled.
> > > 
> > > The base bandwidth based throttling can do better and provide almost
> > > perfect fairness, because all tasks writing to one bdi derive their
> > > own throttle bandwidth based on the same per-bdi base bandwidth. So
> > > the heavier dirtiers will converge to equal dirty rate and weight.
> >   So what do you consider a perfect fairness in this case and are you sure
> > it is desirable? I was thinking about this and I'm not sure...
> 
> Perfect fairness could be 1, 2, 4, 8, N, N, N MB/s, where
> 
>         N = (write_bandwidth - 1 - 2 - 4 - 8) / 3.
> 
> I guess its usefulness is largely depending on the user space
> applications.  Most of them should not be sensible to it.
  I see, that makes some sense although it makes it advantageous to split
heavy dirtier task into two less heavy dirtiers which is a bit strange. But
as you say, precise results here probably do not matter much.

						Have a nice weekend

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH RFC 0/5] IO-less balance_dirty_pages() v2 (simple approach)
  2011-03-25 23:05       ` Jan Kara
@ 2011-03-28  2:44         ` Wu Fengguang
  2011-03-28 15:08           ` Jan Kara
  2011-03-29  2:14           ` Dave Chinner
  0 siblings, 2 replies; 49+ messages in thread
From: Wu Fengguang @ 2011-03-28  2:44 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, Peter Zijlstra,
	Andrew Morton

Hi Jan,

On Sat, Mar 26, 2011 at 07:05:44AM +0800, Jan Kara wrote:
>   Hello Fengguang,
> 
> On Fri 25-03-11 21:44:11, Wu Fengguang wrote:
> > On Wed, Mar 23, 2011 at 05:43:14AM +0800, Jan Kara wrote:
> > >   Hello Fengguang,
> > > 
> > > On Fri 18-03-11 22:30:01, Wu Fengguang wrote:
> > > > On Wed, Mar 09, 2011 at 06:31:10AM +0800, Jan Kara wrote:
> > > > > 
> > > > >   Hello,
> > > > > 
> > > > >   I'm posting second version of my IO-less balance_dirty_pages() patches. This
> > > > > is alternative approach to Fengguang's patches - much simpler I believe (only
> > > > > 300 lines added) - but obviously I does not provide so sophisticated control.
> > > > 
> > > > Well, it may be too early to claim "simplicity" as an advantage, until
> > > > you achieve the following performance/feature comparability (most of
> > > > them are not optional ones). AFAICS this work is kind of heavy lifting
> > > > that will consume a lot of time and attention. You'd better find some
> > > > more fundamental needs before go on the reworking.
> > > > 
> > > > (1)  latency
> > > > (2)  fairness
> > > > (3)  smoothness
> > > > (4)  scalability
> > > > (5)  per-task IO controller
> > > > (6)  per-cgroup IO controller (TBD)
> > > > (7)  free combinations of per-task/per-cgroup and bandwidth/priority controllers
> > > > (8)  think time compensation
> > > > (9)  backed by both theory and tests
> > > > (10) adapt pause time up on 100+ dirtiers
> > > > (11) adapt pause time down on low dirty pages 
> > > > (12) adapt to new dirty threshold/goal
> > > > (13) safeguard against dirty exceeding
> > > > (14) safeguard against device queue underflow
> > >   I think this is a misunderstanding of my goals ;). My main goal is to
> > > explore, how far we can get with a relatively simple approach to IO-less
> > > balance_dirty_pages(). I guess what I have is better than the current
> > > balance_dirty_pages() but it sure does not even try to provide all the
> > > features you try to provide.
> > 
> > OK.
> > 
> > > I'm thinking about tweaking ratelimiting logic to reduce latencies in some
> > > tests, possibly add compensation when we waited for too long in
> > > balance_dirty_pages() (e.g. because of bumpy IO completion) but that's
> > > about it...
> > > 
> > > Basically I do this so that we can compare and decide whether what my
> > > simple approach offers is OK or whether we want some more complex solution
> > > like your patches...
> > 
> > Yeah, now both results are on the website. Let's see whether they are
> > acceptable for others.
>   Yes. BTW, I think we'll discuss this at LSF so it would be beneficial if
> we both prepared a fairly short explanation of our algorithm and some
> summary of the measured results. I think it would be good to keep each of
> us below 5 minutes so that we don't bore the audience - people will ask for
> details where they are interested... What do you think?

That looks good, however I'm not able to attend LSF this year, would
you help show my slides?

> I'll try to run also your patches on my setup to see how they work :) V6
> from your website is the latest version, isn't it?

Thank you. You can run 
http://git.kernel.org/?p=linux/kernel/git/wfg/writeback.git;a=shortlog;h=refs/heads/dirty-throttling-v6
or 
http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v6/dirty-throttling-v6-2.6.38-rc6.patch
whatever convenient for you.

If you are ready with v3, I can also help test it out and do some
comparison on the results.

> > > > > The basic idea (implemented in the third patch) is that processes throttled
> > > > > in balance_dirty_pages() wait for enough IO to complete. The waiting is
> > > > > implemented as follows: Whenever we decide to throttle a task in
> > > > > balance_dirty_pages(), task adds itself to a list of tasks that are throttled
> > > > > against that bdi and goes to sleep waiting to receive specified amount of page
> > > > > IO completions. Once in a while (currently HZ/10, in patch 5 the interval is
> > > > > autotuned based on observed IO rate), accumulated page IO completions are
> > > > > distributed equally among waiting tasks.
> > > > > 
> > > > > This waiting scheme has been chosen so that waiting time in
> > > > > balance_dirty_pages() is proportional to
> > > > >   number_waited_pages * number_of_waiters.
> > > > > In particular it does not depend on the total number of pages being waited for,
> > > > > thus providing possibly a fairer results.
> > > > 
> > > > When there comes no IO completion in 1 second (normal in NFS), the
> > > > tasks will all get stuck. It is fixable based on your v2 code base
> > > > (detailed below), however will likely bring the same level of
> > > > complexity as the base bandwidth solution.
> > >   I have some plans how to account for bumpy IO completion when we wait for
> > > a long time and then get completion of much more IO than we actually need.
> > > But in case where processes use all the bandwidth and the latency of the
> > > device is high, sure they take the penalty and have to wait for a long time
> > > in balance_dirty_pages().
> > 
> > No, I don't think it's good to block for long time in
> > balance_dirty_pages(). This seems to be our biggest branch point.
>   I agree we should not block for several seconds under normal load but
> when something insane like 1000 dds is running, I don't think it's a big
> problem :)

I don't think so. The client does not care whether the server is
accepting files from 1000+ clients and its writeback algorithm does
not scale well beyond that point; it will simply notice its upload
goes slow and paused from time to time, which is rather annoying.

> And actually the NFS traces you pointed to originally seem to be different
> problem, in fact not directly related to what balance_dirty_pages() does...
> And with local filesystem the results seem to be reasonable (although there
> are some longer sleeps in your JBOD measurements I don't understand yet).

Yeah the NFS case can be improved on the FS side (for now you may just
reuse my NFS patches and focus on other generic improvements).

The JBOD issue is also beyond my understanding.

Note that XFS will also see one big IO completion per 0.5-1 seconds,
when we are to increase the write chunk size from the current 4MB to
near the bdi's write bandwidth. As illustrated by this graph:

http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v6/4G/xfs-1dd-1M-8p-3927M-20%25-2.6.38-rc6-dt6+-2011-02-27-22-58/global_dirtied_written-500.png

> > > > > The results for different bandwidths fio load is interesting. There are 8
> > > > > threads dirtying pages at 1,2,4,..,128 MB/s rate. Due to different task
> > > > > bdi dirty limits, what happens is that three most aggresive tasks get
> > > > > throttled so they end up at bandwidths 24, 26, and 30 MB/s and the lighter
> > > > > dirtiers run unthrottled.
> > > > 
> > > > The base bandwidth based throttling can do better and provide almost
> > > > perfect fairness, because all tasks writing to one bdi derive their
> > > > own throttle bandwidth based on the same per-bdi base bandwidth. So
> > > > the heavier dirtiers will converge to equal dirty rate and weight.
> > >   So what do you consider a perfect fairness in this case and are you sure
> > > it is desirable? I was thinking about this and I'm not sure...
> > 
> > Perfect fairness could be 1, 2, 4, 8, N, N, N MB/s, where
> > 
> >         N = (write_bandwidth - 1 - 2 - 4 - 8) / 3.
> > 
> > I guess its usefulness is largely depending on the user space
> > applications.  Most of them should not be sensible to it.
>   I see, that makes some sense although it makes it advantageous to split
> heavy dirtier task into two less heavy dirtiers which is a bit strange. But
> as you say, precise results here probably do not matter much.

Right.

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH RFC 0/5] IO-less balance_dirty_pages() v2 (simple approach)
  2011-03-28  2:44         ` Wu Fengguang
@ 2011-03-28 15:08           ` Jan Kara
  2011-03-29  1:44             ` Wu Fengguang
  2011-03-29  2:14           ` Dave Chinner
  1 sibling, 1 reply; 49+ messages in thread
From: Jan Kara @ 2011-03-28 15:08 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Jan Kara, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org,
	Peter Zijlstra, Andrew Morton

On Mon 28-03-11 10:44:45, Wu Fengguang wrote:
> Hi Jan,
> 
> On Sat, Mar 26, 2011 at 07:05:44AM +0800, Jan Kara wrote:
> >   Hello Fengguang,
> > 
> > On Fri 25-03-11 21:44:11, Wu Fengguang wrote:
> > > On Wed, Mar 23, 2011 at 05:43:14AM +0800, Jan Kara wrote:
> > > >   Hello Fengguang,
> > > > 
> > > > On Fri 18-03-11 22:30:01, Wu Fengguang wrote:
> > > > > On Wed, Mar 09, 2011 at 06:31:10AM +0800, Jan Kara wrote:
> > > > > > 
> > > > > >   Hello,
> > > > > > 
> > > > > >   I'm posting second version of my IO-less balance_dirty_pages() patches. This
> > > > > > is alternative approach to Fengguang's patches - much simpler I believe (only
> > > > > > 300 lines added) - but obviously I does not provide so sophisticated control.
> > > > > 
> > > > > Well, it may be too early to claim "simplicity" as an advantage, until
> > > > > you achieve the following performance/feature comparability (most of
> > > > > them are not optional ones). AFAICS this work is kind of heavy lifting
> > > > > that will consume a lot of time and attention. You'd better find some
> > > > > more fundamental needs before go on the reworking.
> > > > > 
> > > > > (1)  latency
> > > > > (2)  fairness
> > > > > (3)  smoothness
> > > > > (4)  scalability
> > > > > (5)  per-task IO controller
> > > > > (6)  per-cgroup IO controller (TBD)
> > > > > (7)  free combinations of per-task/per-cgroup and bandwidth/priority controllers
> > > > > (8)  think time compensation
> > > > > (9)  backed by both theory and tests
> > > > > (10) adapt pause time up on 100+ dirtiers
> > > > > (11) adapt pause time down on low dirty pages 
> > > > > (12) adapt to new dirty threshold/goal
> > > > > (13) safeguard against dirty exceeding
> > > > > (14) safeguard against device queue underflow
> > > >   I think this is a misunderstanding of my goals ;). My main goal is to
> > > > explore, how far we can get with a relatively simple approach to IO-less
> > > > balance_dirty_pages(). I guess what I have is better than the current
> > > > balance_dirty_pages() but it sure does not even try to provide all the
> > > > features you try to provide.
> > > 
> > > OK.
> > > 
> > > > I'm thinking about tweaking ratelimiting logic to reduce latencies in some
> > > > tests, possibly add compensation when we waited for too long in
> > > > balance_dirty_pages() (e.g. because of bumpy IO completion) but that's
> > > > about it...
> > > > 
> > > > Basically I do this so that we can compare and decide whether what my
> > > > simple approach offers is OK or whether we want some more complex solution
> > > > like your patches...
> > > 
> > > Yeah, now both results are on the website. Let's see whether they are
> > > acceptable for others.
> >   Yes. BTW, I think we'll discuss this at LSF so it would be beneficial if
> > we both prepared a fairly short explanation of our algorithm and some
> > summary of the measured results. I think it would be good to keep each of
> > us below 5 minutes so that we don't bore the audience - people will ask for
> > details where they are interested... What do you think?
> That looks good, however I'm not able to attend LSF this year, would
> you help show my slides?
  Ah, that's a pity :(. If you send me a few slides I can show them, that's
no problem. I'll also try to understand your patches in enough detail so
that I can answer possible questinons but author is always the best to
present his work :).
 
> > I'll try to run also your patches on my setup to see how they work :) V6
> > from your website is the latest version, isn't it?
> 
> Thank you. You can run 
> http://git.kernel.org/?p=linux/kernel/git/wfg/writeback.git;a=shortlog;h=refs/heads/dirty-throttling-v6
> or 
> http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v6/dirty-throttling-v6-2.6.38-rc6.patch
> whatever convenient for you.
> 
> If you are ready with v3, I can also help test it out and do some
> comparison on the results.
  I have done a couple of smaller fixes but I don't expect them to affect
performance in the loads we use. But I'll send you the patches when I
implement some significant change (but for that I need to reproduce the
latencies you sometimes see first...).

									Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH RFC 0/5] IO-less balance_dirty_pages() v2 (simple approach)
  2011-03-28 15:08           ` Jan Kara
@ 2011-03-29  1:44             ` Wu Fengguang
  0 siblings, 0 replies; 49+ messages in thread
From: Wu Fengguang @ 2011-03-29  1:44 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, Peter Zijlstra,
	Andrew Morton

On Mon, Mar 28, 2011 at 11:08:15PM +0800, Jan Kara wrote:
> On Mon 28-03-11 10:44:45, Wu Fengguang wrote:
> > Hi Jan,
> > 
> > On Sat, Mar 26, 2011 at 07:05:44AM +0800, Jan Kara wrote:
> > >   Hello Fengguang,
> > > 
> > > On Fri 25-03-11 21:44:11, Wu Fengguang wrote:
> > > > On Wed, Mar 23, 2011 at 05:43:14AM +0800, Jan Kara wrote:
> > > > >   Hello Fengguang,
> > > > > 
> > > > > On Fri 18-03-11 22:30:01, Wu Fengguang wrote:
> > > > > > On Wed, Mar 09, 2011 at 06:31:10AM +0800, Jan Kara wrote:
> > > > > > > 
> > > > > > >   Hello,
> > > > > > > 
> > > > > > >   I'm posting second version of my IO-less balance_dirty_pages() patches. This
> > > > > > > is alternative approach to Fengguang's patches - much simpler I believe (only
> > > > > > > 300 lines added) - but obviously I does not provide so sophisticated control.
> > > > > > 
> > > > > > Well, it may be too early to claim "simplicity" as an advantage, until
> > > > > > you achieve the following performance/feature comparability (most of
> > > > > > them are not optional ones). AFAICS this work is kind of heavy lifting
> > > > > > that will consume a lot of time and attention. You'd better find some
> > > > > > more fundamental needs before go on the reworking.
> > > > > > 
> > > > > > (1)  latency
> > > > > > (2)  fairness
> > > > > > (3)  smoothness
> > > > > > (4)  scalability
> > > > > > (5)  per-task IO controller
> > > > > > (6)  per-cgroup IO controller (TBD)
> > > > > > (7)  free combinations of per-task/per-cgroup and bandwidth/priority controllers
> > > > > > (8)  think time compensation
> > > > > > (9)  backed by both theory and tests
> > > > > > (10) adapt pause time up on 100+ dirtiers
> > > > > > (11) adapt pause time down on low dirty pages 
> > > > > > (12) adapt to new dirty threshold/goal
> > > > > > (13) safeguard against dirty exceeding
> > > > > > (14) safeguard against device queue underflow
> > > > >   I think this is a misunderstanding of my goals ;). My main goal is to
> > > > > explore, how far we can get with a relatively simple approach to IO-less
> > > > > balance_dirty_pages(). I guess what I have is better than the current
> > > > > balance_dirty_pages() but it sure does not even try to provide all the
> > > > > features you try to provide.
> > > > 
> > > > OK.
> > > > 
> > > > > I'm thinking about tweaking ratelimiting logic to reduce latencies in some
> > > > > tests, possibly add compensation when we waited for too long in
> > > > > balance_dirty_pages() (e.g. because of bumpy IO completion) but that's
> > > > > about it...
> > > > > 
> > > > > Basically I do this so that we can compare and decide whether what my
> > > > > simple approach offers is OK or whether we want some more complex solution
> > > > > like your patches...
> > > > 
> > > > Yeah, now both results are on the website. Let's see whether they are
> > > > acceptable for others.
> > >   Yes. BTW, I think we'll discuss this at LSF so it would be beneficial if
> > > we both prepared a fairly short explanation of our algorithm and some
> > > summary of the measured results. I think it would be good to keep each of
> > > us below 5 minutes so that we don't bore the audience - people will ask for
> > > details where they are interested... What do you think?
> > That looks good, however I'm not able to attend LSF this year, would
> > you help show my slides?
>   Ah, that's a pity :(. If you send me a few slides I can show them, that's
> no problem. I'll also try to understand your patches in enough detail so
> that I can answer possible questinons but author is always the best to
> present his work :).

Thank you very much :)

> > > I'll try to run also your patches on my setup to see how they work :) V6
> > > from your website is the latest version, isn't it?
> > 
> > Thank you. You can run 
> > http://git.kernel.org/?p=linux/kernel/git/wfg/writeback.git;a=shortlog;h=refs/heads/dirty-throttling-v6
> > or 
> > http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v6/dirty-throttling-v6-2.6.38-rc6.patch
> > whatever convenient for you.
> > 
> > If you are ready with v3, I can also help test it out and do some
> > comparison on the results.
>   I have done a couple of smaller fixes but I don't expect them to affect
> performance in the loads we use. But I'll send you the patches when I
> implement some significant change (but for that I need to reproduce the
> latencies you sometimes see first...).

OK. I can conveniently test the single disk cases. For JBOD and RAID
cases, I don't own the servers so normally have to wait for some time
before being able to carry out the tests..

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH RFC 0/5] IO-less balance_dirty_pages() v2 (simple approach)
  2011-03-28  2:44         ` Wu Fengguang
  2011-03-28 15:08           ` Jan Kara
@ 2011-03-29  2:14           ` Dave Chinner
  2011-03-29  2:41             ` Wu Fengguang
  1 sibling, 1 reply; 49+ messages in thread
From: Dave Chinner @ 2011-03-29  2:14 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Jan Kara, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org,
	Peter Zijlstra, Andrew Morton

On Mon, Mar 28, 2011 at 10:44:45AM +0800, Wu Fengguang wrote:
> On Sat, Mar 26, 2011 at 07:05:44AM +0800, Jan Kara wrote:
> > And actually the NFS traces you pointed to originally seem to be different
> > problem, in fact not directly related to what balance_dirty_pages() does...
> > And with local filesystem the results seem to be reasonable (although there
> > are some longer sleeps in your JBOD measurements I don't understand yet).
> 
> Yeah the NFS case can be improved on the FS side (for now you may just
> reuse my NFS patches and focus on other generic improvements).
> 
> The JBOD issue is also beyond my understanding.
> 
> Note that XFS will also see one big IO completion per 0.5-1 seconds,
> when we are to increase the write chunk size from the current 4MB to
> near the bdi's write bandwidth. As illustrated by this graph:
> 
> http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v6/4G/xfs-1dd-1M-8p-3927M-20%25-2.6.38-rc6-dt6+-2011-02-27-22-58/global_dirtied_written-500.png

Which is _bad_.

Increasing the writeback chunk size simply causes dirty queue
starvation issues when there are lots of dirty files and lots more
memory than there is writeback bandwidth. Think of a machine with
1TB of RAM (that's a 200GB dirty limit) and 1GB/s of disk
throughput. Thats 3 minutes worth of writeback and increasing the
chunk size to ~1s worth of throughput means that the 200th dirty
file won't get serviced for 3 minutes....

We used to have behaviour similar to this this (prior to 2.6.16, IIRC),
and it caused all sorts of problems where people were losing 10-15
minute old data when the system crashed because writeback didn't
process the dirty inode list fast enough in the presence of lots of
large files....

A small writeback chunk size has no adverse impact on XFS as long as
the elevator does it's job of merging IOs (which in 99.9% of cases
it does) so I'm wondering what the reason for making this change
is.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH RFC 0/5] IO-less balance_dirty_pages() v2 (simple approach)
  2011-03-29  2:14           ` Dave Chinner
@ 2011-03-29  2:41             ` Wu Fengguang
  2011-03-29  5:59               ` Dave Chinner
  0 siblings, 1 reply; 49+ messages in thread
From: Wu Fengguang @ 2011-03-29  2:41 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jan Kara, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org,
	Peter Zijlstra, Andrew Morton, Theodore Ts'o, Chris Mason

On Tue, Mar 29, 2011 at 10:14:58AM +0800, Dave Chinner wrote:
> -printable
> Content-Length: 2034
> Lines: 51
> 
> On Mon, Mar 28, 2011 at 10:44:45AM +0800, Wu Fengguang wrote:
> > On Sat, Mar 26, 2011 at 07:05:44AM +0800, Jan Kara wrote:
> > > And actually the NFS traces you pointed to originally seem to be different
> > > problem, in fact not directly related to what balance_dirty_pages() does...
> > > And with local filesystem the results seem to be reasonable (although there
> > > are some longer sleeps in your JBOD measurements I don't understand yet).
> > 
> > Yeah the NFS case can be improved on the FS side (for now you may just
> > reuse my NFS patches and focus on other generic improvements).
> > 
> > The JBOD issue is also beyond my understanding.
> > 
> > Note that XFS will also see one big IO completion per 0.5-1 seconds,
> > when we are to increase the write chunk size from the current 4MB to
> > near the bdi's write bandwidth. As illustrated by this graph:
> > 
> > http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v6/4G/xfs-1dd-1M-8p-3927M-20%25-2.6.38-rc6-dt6+-2011-02-27-22-58/global_dirtied_written-500.png
> 
> Which is _bad_.
> 
> Increasing the writeback chunk size simply causes dirty queue
> starvation issues when there are lots of dirty files and lots more
> memory than there is writeback bandwidth. Think of a machine with
> 1TB of RAM (that's a 200GB dirty limit) and 1GB/s of disk
> throughput. Thats 3 minutes worth of writeback and increasing the
> chunk size to ~1s worth of throughput means that the 200th dirty
> file won't get serviced for 3 minutes....
> 
> We used to have behaviour similar to this this (prior to 2.6.16, IIRC),
> and it caused all sorts of problems where people were losing 10-15
> minute old data when the system crashed because writeback didn't
> process the dirty inode list fast enough in the presence of lots of
> large files....
 
Yes it is a problem, and can be best solved by automatically lowering
bdi dirty limit to (bdi->write_bandwidth * dirty_expire_interval/100).
Then we reliably control the lost data size to < 30s by default.

> A small writeback chunk size has no adverse impact on XFS as long as
> the elevator does it's job of merging IOs (which in 99.9% of cases
> it does) so I'm wondering what the reason for making this change
> is.

It's explained in this changelog (is the XFS paragraph still valid?)

        https://patchwork.kernel.org/patch/605151/

The larger write chunk size generally helps ext4 and RAID setups.

Thanks,
Fengguang

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH RFC 0/5] IO-less balance_dirty_pages() v2 (simple approach)
  2011-03-29  2:41             ` Wu Fengguang
@ 2011-03-29  5:59               ` Dave Chinner
  2011-03-29  7:31                 ` Wu Fengguang
  0 siblings, 1 reply; 49+ messages in thread
From: Dave Chinner @ 2011-03-29  5:59 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Jan Kara, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org,
	Peter Zijlstra, Andrew Morton, Theodore Ts'o, Chris Mason

On Tue, Mar 29, 2011 at 10:41:20AM +0800, Wu Fengguang wrote:
> On Tue, Mar 29, 2011 at 10:14:58AM +0800, Dave Chinner wrote:
> > -printable
> > Content-Length: 2034
> > Lines: 51
> > 
> > On Mon, Mar 28, 2011 at 10:44:45AM +0800, Wu Fengguang wrote:
> > > On Sat, Mar 26, 2011 at 07:05:44AM +0800, Jan Kara wrote:
> > > > And actually the NFS traces you pointed to originally seem to be different
> > > > problem, in fact not directly related to what balance_dirty_pages() does...
> > > > And with local filesystem the results seem to be reasonable (although there
> > > > are some longer sleeps in your JBOD measurements I don't understand yet).
> > > 
> > > Yeah the NFS case can be improved on the FS side (for now you may just
> > > reuse my NFS patches and focus on other generic improvements).
> > > 
> > > The JBOD issue is also beyond my understanding.
> > > 
> > > Note that XFS will also see one big IO completion per 0.5-1 seconds,
> > > when we are to increase the write chunk size from the current 4MB to
> > > near the bdi's write bandwidth. As illustrated by this graph:
> > > 
> > > http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v6/4G/xfs-1dd-1M-8p-3927M-20%25-2.6.38-rc6-dt6+-2011-02-27-22-58/global_dirtied_written-500.png
> > 
> > Which is _bad_.
> > 
> > Increasing the writeback chunk size simply causes dirty queue
> > starvation issues when there are lots of dirty files and lots more
> > memory than there is writeback bandwidth. Think of a machine with
> > 1TB of RAM (that's a 200GB dirty limit) and 1GB/s of disk
> > throughput. Thats 3 minutes worth of writeback and increasing the
> > chunk size to ~1s worth of throughput means that the 200th dirty
> > file won't get serviced for 3 minutes....
> > 
> > We used to have behaviour similar to this this (prior to 2.6.16, IIRC),
> > and it caused all sorts of problems where people were losing 10-15
> > minute old data when the system crashed because writeback didn't
> > process the dirty inode list fast enough in the presence of lots of
> > large files....
>  
> Yes it is a problem, and can be best solved by automatically lowering
> bdi dirty limit to (bdi->write_bandwidth * dirty_expire_interval/100).
> Then we reliably control the lost data size to < 30s by default.

Perhaps, though I see problems with that also. e.g. write bandwidth
is 100MB/s (dirty limit ~= 3GB), then someone runs a find on the
same disk and write bandwidth drops to 10MB/s (dirty limit changes
to ~300MB). Then we are 10x over the new dirty limit and the
writing application will be completely throttled for the next 270s
until the dirty pages drop below the new dirty limit or the find
stops.

IOWs, it changing IO workloads will cause interesting corner cases
to be discovered and hence further complexity to handle effectively.
And trying to diagnose problems because of such changes in IO load
will be nigh on impossible - how would you gather sufficient
information to determine that application A stalled for a minute
because application B read a bunch of stuff from disk at the wrong
time? Then how would you prove that you'd fixed the problem without
introducing some other regression triggered by different workload
changes?

> > A small writeback chunk size has no adverse impact on XFS as long as
> > the elevator does it's job of merging IOs (which in 99.9% of cases
> > it does) so I'm wondering what the reason for making this change
> > is.
> 
> It's explained in this changelog (is the XFS paragraph still valid?)
> 
>         https://patchwork.kernel.org/patch/605151/

You mean this paragraph?

"According to Christoph, the current writeback size is way too
small, and XFS had a hack that bumped out nr_to_write to four times
the value sent by the VM to be able to saturate medium-sized RAID
arrays.  This value was also problematic for ext4 as well, as it
caused large files to be come interleaved on disk by in 8 megabyte
chunks (we bumped up the nr_to_write by a factor of two)."

We _used_ to have such a hack. It was there from 2.6.30 through to
2.6.35 - from when we realised writeback had bitrotted into badness
to when we fixed the last set of bugs that the nr_to_write windup
was papering over. between 2.6.30 and 2.6.35 we changed to dedicated
flusher threads, got rid of congestion backoff, fixed up a bunch
of queueing issues and finally stopped nr_to_write from going and
staying negative and getting stuck on single inodes until they had
no more dirty pages left. That was when this was committed:

commit 254c8c2dbf0e06a560a5814eb90cb628adb2de66
Author: Dave Chinner <dchinner@redhat.com>
Date:   Wed Jun 9 10:37:19 2010 +1000

    xfs: remove nr_to_write writeback windup.

    Now that the background flush code has been fixed, we shouldn't need to
    silently multiply the wbc->nr_to_write to get good writeback. Remove
    that code.

    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

And writeback throughput is now as good as it ever was....

> The larger write chunk size generally helps ext4 and RAID setups.

Is this still true with ext4's new submit_bio()-based writeback IO
submission path that was copied from the XFS? It's a lot more
efficient so should be much better on RAID setups.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH RFC 0/5] IO-less balance_dirty_pages() v2 (simple approach)
  2011-03-29  5:59               ` Dave Chinner
@ 2011-03-29  7:31                 ` Wu Fengguang
  2011-03-29  7:52                   ` Wu Fengguang
  0 siblings, 1 reply; 49+ messages in thread
From: Wu Fengguang @ 2011-03-29  7:31 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jan Kara, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org,
	Peter Zijlstra, Andrew Morton, Theodore Ts'o, Chris Mason,
	Andreas Dilger

On Tue, Mar 29, 2011 at 01:59:47PM +0800, Dave Chinner wrote:
> On Tue, Mar 29, 2011 at 10:41:20AM +0800, Wu Fengguang wrote:
> > On Tue, Mar 29, 2011 at 10:14:58AM +0800, Dave Chinner wrote:
> > > -printable
> > > Content-Length: 2034
> > > Lines: 51
> > > 
> > > On Mon, Mar 28, 2011 at 10:44:45AM +0800, Wu Fengguang wrote:
> > > > On Sat, Mar 26, 2011 at 07:05:44AM +0800, Jan Kara wrote:
> > > > > And actually the NFS traces you pointed to originally seem to be different
> > > > > problem, in fact not directly related to what balance_dirty_pages() does...
> > > > > And with local filesystem the results seem to be reasonable (although there
> > > > > are some longer sleeps in your JBOD measurements I don't understand yet).
> > > > 
> > > > Yeah the NFS case can be improved on the FS side (for now you may just
> > > > reuse my NFS patches and focus on other generic improvements).
> > > > 
> > > > The JBOD issue is also beyond my understanding.
> > > > 
> > > > Note that XFS will also see one big IO completion per 0.5-1 seconds,
> > > > when we are to increase the write chunk size from the current 4MB to
> > > > near the bdi's write bandwidth. As illustrated by this graph:
> > > > 
> > > > http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v6/4G/xfs-1dd-1M-8p-3927M-20%25-2.6.38-rc6-dt6+-2011-02-27-22-58/global_dirtied_written-500.png
> > > 
> > > Which is _bad_.
> > > 
> > > Increasing the writeback chunk size simply causes dirty queue
> > > starvation issues when there are lots of dirty files and lots more
> > > memory than there is writeback bandwidth. Think of a machine with
> > > 1TB of RAM (that's a 200GB dirty limit) and 1GB/s of disk
> > > throughput. Thats 3 minutes worth of writeback and increasing the
> > > chunk size to ~1s worth of throughput means that the 200th dirty
> > > file won't get serviced for 3 minutes....
> > > 
> > > We used to have behaviour similar to this this (prior to 2.6.16, IIRC),
> > > and it caused all sorts of problems where people were losing 10-15
> > > minute old data when the system crashed because writeback didn't
> > > process the dirty inode list fast enough in the presence of lots of
> > > large files....
> >  
> > Yes it is a problem, and can be best solved by automatically lowering
> > bdi dirty limit to (bdi->write_bandwidth * dirty_expire_interval/100).
> > Then we reliably control the lost data size to < 30s by default.
> 
> Perhaps, though I see problems with that also. e.g. write bandwidth
> is 100MB/s (dirty limit ~= 3GB), then someone runs a find on the
> same disk and write bandwidth drops to 10MB/s (dirty limit changes
> to ~300MB). Then we are 10x over the new dirty limit and the
> writing application will be completely throttled for the next 270s
> until the dirty pages drop below the new dirty limit or the find
> stops.
> 
> IOWs, it changing IO workloads will cause interesting corner cases
> to be discovered and hence further complexity to handle effectively.
> And trying to diagnose problems because of such changes in IO load
> will be nigh on impossible - how would you gather sufficient
> information to determine that application A stalled for a minute
> because application B read a bunch of stuff from disk at the wrong
> time? Then how would you prove that you'd fixed the problem without
> introducing some other regression triggered by different workload
> changes?

Good point. The v6 dirty throttle patchset has taken this into
account, by separating the concept of dirty goal and hard dirty
limit. Sorry I should have use bdi dirty goal in the previous email.

When for whatever reason the bdi dirty goal becomes a lot more smaller
than bdi dirty pages, the bdi dirtiers will be throttled at typically
lower than balanced bandwidth, so that the bdi dirty pages can smoothly
drop to the dirty goal.

In the below graph, the bdi dirty pages start from much higher from
bdi dirty goal because we start dd's on a USB stick and a hard disk
at the same time, and the USB stick manage to accumulate lots of dirty
pages before the dirty throttling logic starts to work. So you can see
two dropping red lines in the (40s, 120s) time range. The green
"position bandwidth" line shows that in that range, the tasks are 
throttled a most 1/8 lower than the balanced throttle bandwidth and
restored to normal after 140s.

http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v6/1UKEY+1HDD-3G/xfs-1dd-1M-8p-2945M-20%25-2.6.38-rc5-dt6+-2011-02-22-09-21/balance_dirty_pages-pages.png

This is the corresponding pause times. They are perfectly under
control (less than the configurable 200ms max pause time).

http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v6/1UKEY+1HDD-3G/xfs-1dd-1M-8p-2945M-20%25-2.6.38-rc5-dt6+-2011-02-22-09-21/balance_dirty_pages-pause.png

Actually the patchset does not set hard limit for bdi dirty pages at
all. It only maintains one hard limit for the global dirty pages.
That global hard limit is introduced exactly to handle your mentioned
case.

https://patchwork.kernel.org/patch/605201/

+ * The global dirty threshold is normally equal to global dirty limit, except
+ * when the system suddenly allocates a lot of anonymous memory and knocks down
+ * the global dirty threshold quickly, in which case the global dirty limit
+ * will follow down slowly to prevent livelocking all dirtier tasks.

Here "global dirty threshold" is the one returned by current
global_dirty_limits(), the "global dirty limit" is
default_backing_dev_info.dirty_threshold in the above patch.
Sorry the names are a bit confusing..

> > > A small writeback chunk size has no adverse impact on XFS as long as
> > > the elevator does it's job of merging IOs (which in 99.9% of cases
> > > it does) so I'm wondering what the reason for making this change
> > > is.
> > 
> > It's explained in this changelog (is the XFS paragraph still valid?)
> > 
> >         https://patchwork.kernel.org/patch/605151/
> 
> You mean this paragraph?
> 
> "According to Christoph, the current writeback size is way too
> small, and XFS had a hack that bumped out nr_to_write to four times
> the value sent by the VM to be able to saturate medium-sized RAID
> arrays.  This value was also problematic for ext4 as well, as it
> caused large files to be come interleaved on disk by in 8 megabyte
> chunks (we bumped up the nr_to_write by a factor of two)."

Yes.
 
> We _used_ to have such a hack. It was there from 2.6.30 through to
> 2.6.35 - from when we realised writeback had bitrotted into badness
> to when we fixed the last set of bugs that the nr_to_write windup
> was papering over. between 2.6.30 and 2.6.35 we changed to dedicated
> flusher threads, got rid of congestion backoff, fixed up a bunch
> of queueing issues and finally stopped nr_to_write from going and
> staying negative and getting stuck on single inodes until they had
> no more dirty pages left. That was when this was committed:
> 
> commit 254c8c2dbf0e06a560a5814eb90cb628adb2de66
> Author: Dave Chinner <dchinner@redhat.com>
> Date:   Wed Jun 9 10:37:19 2010 +1000
> 
>     xfs: remove nr_to_write writeback windup.
>     
>     Now that the background flush code has been fixed, we shouldn't need to
>     silently multiply the wbc->nr_to_write to get good writeback. Remove
>     that code.
>     
>     Signed-off-by: Dave Chinner <dchinner@redhat.com>
>     Reviewed-by: Christoph Hellwig <hch@lst.de>
>     Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
> 
> And writeback throughput is now as good as it ever was....

Glad to know about the improvements :)
 
> > The larger write chunk size generally helps ext4 and RAID setups.
> 
> Is this still true with ext4's new submit_bio()-based writeback IO
> submission path that was copied from the XFS? It's a lot more
> efficient so should be much better on RAID setups.

I believe >4MB write chunk size will still help.. Although 
I have no real data to back it up for now.

On the other hand, are there chances for XFS to be hurt by the
larger write chunk size?

Thanks,
Fengguang

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH RFC 0/5] IO-less balance_dirty_pages() v2 (simple approach)
  2011-03-29  7:31                 ` Wu Fengguang
@ 2011-03-29  7:52                   ` Wu Fengguang
  0 siblings, 0 replies; 49+ messages in thread
From: Wu Fengguang @ 2011-03-29  7:52 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jan Kara, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org,
	Peter Zijlstra, Andrew Morton, Theodore Ts'o, Chris Mason,
	Andreas Dilger

On Tue, Mar 29, 2011 at 03:31:02PM +0800, Wu Fengguang wrote:
> On Tue, Mar 29, 2011 at 01:59:47PM +0800, Dave Chinner wrote:
> > On Tue, Mar 29, 2011 at 10:41:20AM +0800, Wu Fengguang wrote:
> > > On Tue, Mar 29, 2011 at 10:14:58AM +0800, Dave Chinner wrote:
> > > > -printable
> > > > Content-Length: 2034
> > > > Lines: 51
> > > > 
> > > > On Mon, Mar 28, 2011 at 10:44:45AM +0800, Wu Fengguang wrote:
> > > > > On Sat, Mar 26, 2011 at 07:05:44AM +0800, Jan Kara wrote:
> > > > > > And actually the NFS traces you pointed to originally seem to be different
> > > > > > problem, in fact not directly related to what balance_dirty_pages() does...
> > > > > > And with local filesystem the results seem to be reasonable (although there
> > > > > > are some longer sleeps in your JBOD measurements I don't understand yet).
> > > > > 
> > > > > Yeah the NFS case can be improved on the FS side (for now you may just
> > > > > reuse my NFS patches and focus on other generic improvements).
> > > > > 
> > > > > The JBOD issue is also beyond my understanding.
> > > > > 
> > > > > Note that XFS will also see one big IO completion per 0.5-1 seconds,
> > > > > when we are to increase the write chunk size from the current 4MB to
> > > > > near the bdi's write bandwidth. As illustrated by this graph:
> > > > > 
> > > > > http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v6/4G/xfs-1dd-1M-8p-3927M-20%25-2.6.38-rc6-dt6+-2011-02-27-22-58/global_dirtied_written-500.png
> > > > 
> > > > Which is _bad_.
> > > > 
> > > > Increasing the writeback chunk size simply causes dirty queue
> > > > starvation issues when there are lots of dirty files and lots more
> > > > memory than there is writeback bandwidth. Think of a machine with
> > > > 1TB of RAM (that's a 200GB dirty limit) and 1GB/s of disk
> > > > throughput. Thats 3 minutes worth of writeback and increasing the
> > > > chunk size to ~1s worth of throughput means that the 200th dirty
> > > > file won't get serviced for 3 minutes....
> > > > 
> > > > We used to have behaviour similar to this this (prior to 2.6.16, IIRC),
> > > > and it caused all sorts of problems where people were losing 10-15
> > > > minute old data when the system crashed because writeback didn't
> > > > process the dirty inode list fast enough in the presence of lots of
> > > > large files....
> > >  
> > > Yes it is a problem, and can be best solved by automatically lowering
> > > bdi dirty limit to (bdi->write_bandwidth * dirty_expire_interval/100).
> > > Then we reliably control the lost data size to < 30s by default.
> > 
> > Perhaps, though I see problems with that also. e.g. write bandwidth
> > is 100MB/s (dirty limit ~= 3GB), then someone runs a find on the
> > same disk and write bandwidth drops to 10MB/s (dirty limit changes
> > to ~300MB). Then we are 10x over the new dirty limit and the
> > writing application will be completely throttled for the next 270s
> > until the dirty pages drop below the new dirty limit or the find
> > stops.
> > 
> > IOWs, it changing IO workloads will cause interesting corner cases
> > to be discovered and hence further complexity to handle effectively.
> > And trying to diagnose problems because of such changes in IO load
> > will be nigh on impossible - how would you gather sufficient
> > information to determine that application A stalled for a minute
> > because application B read a bunch of stuff from disk at the wrong
> > time? Then how would you prove that you'd fixed the problem without
> > introducing some other regression triggered by different workload
> > changes?
> 
> Good point. The v6 dirty throttle patchset has taken this into
> account, by separating the concept of dirty goal and hard dirty
> limit. Sorry I should have use bdi dirty goal in the previous email.
> 
> When for whatever reason the bdi dirty goal becomes a lot more smaller
> than bdi dirty pages, the bdi dirtiers will be throttled at typically
> lower than balanced bandwidth, so that the bdi dirty pages can smoothly
> drop to the dirty goal.
> 
> In the below graph, the bdi dirty pages start from much higher from
> bdi dirty goal because we start dd's on a USB stick and a hard disk
> at the same time, and the USB stick manage to accumulate lots of dirty
> pages before the dirty throttling logic starts to work. So you can see
> two dropping red lines in the (40s, 120s) time range. The green
> "position bandwidth" line shows that in that range, the tasks are 
> throttled a most 1/8 lower than the balanced throttle bandwidth and
> restored to normal after 140s.
> 
> http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v6/1UKEY+1HDD-3G/xfs-1dd-1M-8p-2945M-20%25-2.6.38-rc5-dt6+-2011-02-22-09-21/balance_dirty_pages-pages.png

This is the comment that explains how the above behavior is achieved in code.

https://patchwork.kernel.org/patch/605161/

 * (4) global/bdi control lines
 *
 * dirty_throttle_bandwidth() applies 2 main and 3 regional control lines for
 * scaling up/down the base bandwidth based on the position of dirty pages.
 *
 * The two main control lines for the global/bdi control scopes do not end at
 * thresh/bdi_thresh.  They are centered at setpoint/bdi_setpoint and cover the
 * whole [0, limit].  If the control line drops below 0 before reaching @limit,
 * an auxiliary line will be setup to connect them. The below figure illustrates
 * the main bdi control line with an auxiliary line extending it to @limit.
 *
 * This allows smoothly throttling down bdi_dirty back to normal if it starts
 * high in situations like
 * - start writing to a slow SD card and a fast disk at the same time. The SD
 *   card's bdi_dirty may rush to 5 times higher than bdi_setpoint.
 * - the global/bdi dirty thresh/goal may be knocked down suddenly either on
 *   user request or on increased memory consumption.
 *
 *   o
 *     o
 *       o                                      [o] main control line
 *         o                                    [*] auxiliary control line
 *           o
 *             o
 *               o
 *                 o
 *                   o
 *                     o
 *                       o--------------------- balance point, bw scale = 1
 *                       | o
 *                       |   o
 *                       |     o
 *                       |       o
 *                       |         o
 *                       |           o
 *                       |             o------- connect point, bw scale = 1/2
 *                       |               .*
 *                       |                 .   *
 *                       |                   .      *
 *                       |                     .         *
 *                       |                       .           *
 *                       |                         .              *
 *                       |                           .                 *
 *  [--------------------*-----------------------------.--------------------*]
 *  0                 bdi_setpoint                  bdi_origin           limit

Suppose two dirty page numbers

                         A                                 B

At point B, the dirtiers will be throttled at roughly 1/4 balanced
bandwidth (dirty rate == disk write rate). So under the control of the
above control line, the bdi dirty pages will slowly decrease to A
while the task's throttle bandwidth slowly increase to the balanced
bandwidth.

Thanks,
Fengguang

> This is the corresponding pause times. They are perfectly under
> control (less than the configurable 200ms max pause time).
> 
> http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v6/1UKEY+1HDD-3G/xfs-1dd-1M-8p-2945M-20%25-2.6.38-rc5-dt6+-2011-02-22-09-21/balance_dirty_pages-pause.png
> 
> Actually the patchset does not set hard limit for bdi dirty pages at
> all. It only maintains one hard limit for the global dirty pages.
> That global hard limit is introduced exactly to handle your mentioned
> case.
> 
> https://patchwork.kernel.org/patch/605201/
> 
> + * The global dirty threshold is normally equal to global dirty limit, except
> + * when the system suddenly allocates a lot of anonymous memory and knocks down
> + * the global dirty threshold quickly, in which case the global dirty limit
> + * will follow down slowly to prevent livelocking all dirtier tasks.
> 
> Here "global dirty threshold" is the one returned by current
> global_dirty_limits(), the "global dirty limit" is
> default_backing_dev_info.dirty_threshold in the above patch.
> Sorry the names are a bit confusing..

^ permalink raw reply	[flat|nested] 49+ messages in thread

* [RFC PATCH 0/5] IO-less balance dirty pages
@ 2011-02-04  1:38 Jan Kara
  2011-02-04  1:38 ` [PATCH 3/5] mm: Implement IO-less balance_dirty_pages() Jan Kara
  0 siblings, 1 reply; 49+ messages in thread
From: Jan Kara @ 2011-02-04  1:38 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-mm

  Hi,

  I've decided to take my stab at trying to make balance_dirty_pages() not
submit IO :). I hoped to have something simpler than Fengguang and we'll see
whether it is good enough.

The basic idea (implemented in the third patch) is that processes throttled
in balance_dirty_pages() wait for enough IO to complete. The waiting is
implemented as follows: Whenever we decide to throttle a task in
balance_dirty_pages(), task adds itself to a list of tasks that are throttled
against that bdi and goes to sleep waiting to receive specified amount of page
IO completions. Once in a while (currently HZ/10, in patch 5 the interval is
autotuned based on observed IO rate), accumulated page IO completions are
distributed equally among waiting tasks.

This waiting scheme has been chosen so that waiting time in
balance_dirty_pages() is proportional to
  number_waited_pages * number_of_waiters.
In particular it does not depend on the total number of pages being waited for,
thus providing possibly a fairer results.

I gave the patches some basic testing (multiple parallel dd's to a single
drive) and they seem to work OK. The dd's get equal share of the disk
throughput (about 10.5 MB/s, which is nice result given the disk can do
about 87 MB/s when writing single-threaded), and dirty limit does not get
exceeded. Of course much more testing needs to be done but I hope it's fine
for the first posting :).

Comments welcome.

								Honza

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* [PATCH 3/5] mm: Implement IO-less balance_dirty_pages()
  2011-02-04  1:38 [RFC PATCH 0/5] IO-less balance dirty pages Jan Kara
@ 2011-02-04  1:38 ` Jan Kara
  2011-02-04 13:09   ` Peter Zijlstra
                     ` (2 more replies)
  0 siblings, 3 replies; 49+ messages in thread
From: Jan Kara @ 2011-02-04  1:38 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: linux-mm, Jan Kara, Andrew Morton, Christoph Hellwig,
	Dave Chinner, Wu Fengguang, Peter Zijlstra

This patch changes balance_dirty_pages() throttling so that the function does
not submit writes on its own but rather waits for flusher thread to do enough
writes. This has an advantage that we have a single source of IO allowing for
better writeback locality. Also we do not have to reenter filesystems from a
non-trivial context.

The waiting is implemented as follows: Whenever we decide to throttle a task in
balance_dirty_pages(), task adds itself to a list of tasks that are throttled
against that bdi and goes to sleep waiting to receive specified amount of page
IO completions. Once in a while (currently HZ/10, later the interval should be
autotuned based on observed IO completion rate), accumulated page IO
completions are distributed equally among waiting tasks.

This waiting scheme has been chosen so that waiting time in
balance_dirty_pages() is proportional to
  number_waited_pages * number_of_waiters.
In particular it does not depend on the total number of pages being waited for,
thus providing possibly a fairer results. Note that the dependency on the
number of waiters is inevitable, since all the waiters compete for a common
resource so their number has to be somehow reflected in waiting time.

CC: Andrew Morton <akpm@linux-foundation.org>
CC: Christoph Hellwig <hch@infradead.org>
CC: Dave Chinner <david@fromorbit.com>
CC: Wu Fengguang <fengguang.wu@intel.com>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 include/linux/backing-dev.h      |    7 +
 include/linux/writeback.h        |    1 +
 include/trace/events/writeback.h |   66 +++++++-
 mm/backing-dev.c                 |    8 +
 mm/page-writeback.c              |  361 ++++++++++++++++++++++++++------------
 5 files changed, 327 insertions(+), 116 deletions(-)

diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index 63ab4a5..65b6e61 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -89,6 +89,13 @@ struct backing_dev_info {
 
 	struct timer_list laptop_mode_wb_timer;
 
+	spinlock_t balance_lock;	/* lock protecting four entries below */
+	unsigned long written_start;	/* BDI_WRITTEN last time we scanned balance_list*/
+	struct list_head balance_list;	/* waiters in balance_dirty_pages */
+	unsigned int balance_waiters;	/* number of waiters in the list */
+	struct delayed_work balance_work;	/* work distributing page
+						   completions among waiters */
+
 #ifdef CONFIG_DEBUG_FS
 	struct dentry *debug_dir;
 	struct dentry *debug_stats;
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index 0ead399..901c33f 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -129,6 +129,7 @@ unsigned long bdi_dirty_limit(struct backing_dev_info *bdi,
 			       unsigned long dirty);
 
 void page_writeback_init(void);
+void distribute_page_completions(struct work_struct *work);
 void balance_dirty_pages_ratelimited_nr(struct address_space *mapping,
 					unsigned long nr_pages_dirtied);
 
diff --git a/include/trace/events/writeback.h b/include/trace/events/writeback.h
index 4e249b9..cf9f458 100644
--- a/include/trace/events/writeback.h
+++ b/include/trace/events/writeback.h
@@ -147,11 +147,71 @@ DEFINE_EVENT(wbc_class, name, \
 DEFINE_WBC_EVENT(wbc_writeback_start);
 DEFINE_WBC_EVENT(wbc_writeback_written);
 DEFINE_WBC_EVENT(wbc_writeback_wait);
-DEFINE_WBC_EVENT(wbc_balance_dirty_start);
-DEFINE_WBC_EVENT(wbc_balance_dirty_written);
-DEFINE_WBC_EVENT(wbc_balance_dirty_wait);
 DEFINE_WBC_EVENT(wbc_writepage);
 
+TRACE_EVENT(writeback_balance_dirty_pages_waiting,
+	TP_PROTO(struct backing_dev_info *bdi, unsigned long pages),
+	TP_ARGS(bdi, pages),
+	TP_STRUCT__entry(
+		__array(char, name, 32)
+		__field(unsigned long, pages)
+	),
+	TP_fast_assign(
+		strncpy(__entry->name, dev_name(bdi->dev), 32);
+		__entry->pages = pages;
+	),
+	TP_printk("bdi=%s, pages=%lu",
+		  __entry->name, __entry->pages
+	)
+);
+
+TRACE_EVENT(writeback_balance_dirty_pages_woken,
+	TP_PROTO(struct backing_dev_info *bdi),
+	TP_ARGS(bdi),
+	TP_STRUCT__entry(
+		__array(char, name, 32)
+	),
+	TP_fast_assign(
+		strncpy(__entry->name, dev_name(bdi->dev), 32);
+	),
+	TP_printk("bdi=%s",
+		  __entry->name
+	)
+);
+
+TRACE_EVENT(writeback_distribute_page_completions,
+	TP_PROTO(struct backing_dev_info *bdi, unsigned long start,
+		 unsigned long written),
+	TP_ARGS(bdi, start, written),
+	TP_STRUCT__entry(
+		__array(char, name, 32)
+		__field(unsigned long, start)
+		__field(unsigned long, written)
+	),
+	TP_fast_assign(
+		strncpy(__entry->name, dev_name(bdi->dev), 32);
+		__entry->start = start;
+		__entry->written = written;
+	),
+	TP_printk("bdi=%s, written_start=%lu, to_distribute=%lu",
+		  __entry->name, __entry->start, __entry->written
+	)
+);
+
+TRACE_EVENT(writeback_distribute_page_completions_wakeall,
+	TP_PROTO(struct backing_dev_info *bdi),
+	TP_ARGS(bdi),
+	TP_STRUCT__entry(
+		__array(char, name, 32)
+	),
+	TP_fast_assign(
+		strncpy(__entry->name, dev_name(bdi->dev), 32);
+	),
+	TP_printk("bdi=%s",
+		  __entry->name
+	)
+);
+
 DECLARE_EVENT_CLASS(writeback_congest_waited_template,
 
 	TP_PROTO(unsigned int usec_timeout, unsigned int usec_delayed),
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 4d14072..2ecc3fe 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -652,6 +652,12 @@ int bdi_init(struct backing_dev_info *bdi)
 	INIT_LIST_HEAD(&bdi->bdi_list);
 	INIT_LIST_HEAD(&bdi->work_list);
 
+	spin_lock_init(&bdi->balance_lock);
+	INIT_LIST_HEAD(&bdi->balance_list);
+	bdi->written_start = 0;
+	bdi->balance_waiters = 0;
+	INIT_DELAYED_WORK(&bdi->balance_work, distribute_page_completions);
+
 	bdi_wb_init(&bdi->wb, bdi);
 
 	for (i = 0; i < NR_BDI_STAT_ITEMS; i++) {
@@ -691,6 +697,8 @@ void bdi_destroy(struct backing_dev_info *bdi)
 		spin_unlock(&inode_lock);
 	}
 
+	cancel_delayed_work_sync(&bdi->balance_work);
+	WARN_ON(!list_empty(&bdi->balance_list));
 	bdi_unregister(bdi);
 
 	for (i = 0; i < NR_BDI_STAT_ITEMS; i++)
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index f388f70..8533032 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -132,6 +132,17 @@ static struct prop_descriptor vm_completions;
 static struct prop_descriptor vm_dirties;
 
 /*
+ * Item a process queues to bdi list in balance_dirty_pages() when it gets
+ * throttled
+ */
+struct balance_waiter {
+	struct list_head bw_list;
+	unsigned long bw_to_write;	/* Number of pages to wait for to
+					   get written */
+	struct task_struct *bw_task;	/* Task waiting for IO */
+};
+
+/*
  * couple the period to the dirty_ratio:
  *
  *   period/2 ~ roundup_pow_of_two(dirty limit)
@@ -476,140 +487,264 @@ unsigned long bdi_dirty_limit(struct backing_dev_info *bdi, unsigned long dirty)
 	return bdi_dirty;
 }
 
-/*
- * balance_dirty_pages() must be called by processes which are generating dirty
- * data.  It looks at the number of dirty pages in the machine and will force
- * the caller to perform writeback if the system is over `vm_dirty_ratio'.
- * If we're over `background_thresh' then the writeback threads are woken to
- * perform some writeout.
- */
-static void balance_dirty_pages(struct address_space *mapping,
-				unsigned long write_chunk)
-{
-	long nr_reclaimable, bdi_nr_reclaimable;
-	long nr_writeback, bdi_nr_writeback;
+struct dirty_limit_state {
+	long nr_reclaimable;
+	long nr_writeback;
+	long bdi_nr_reclaimable;
+	long bdi_nr_writeback;
 	unsigned long background_thresh;
 	unsigned long dirty_thresh;
 	unsigned long bdi_thresh;
-	unsigned long min_bdi_thresh = ULONG_MAX;
-	unsigned long pages_written = 0;
-	unsigned long pause = 1;
-	bool dirty_exceeded = false;
-	bool min_dirty_exceeded = false;
-	struct backing_dev_info *bdi = mapping->backing_dev_info;
+};
 
-	for (;;) {
-		struct writeback_control wbc = {
-			.sync_mode	= WB_SYNC_NONE,
-			.older_than_this = NULL,
-			.nr_to_write	= write_chunk,
-			.range_cyclic	= 1,
-		};
+static void get_global_dirty_limit_state(struct dirty_limit_state *st)
+{
+	/*
+	 * Note: nr_reclaimable denotes nr_dirty + nr_unstable.  Unstable
+	 * writes are a feature of certain networked filesystems (i.e. NFS) in
+	 * which data may have been written to the server's write cache, but
+	 * has not yet been flushed to permanent storage.
+	 */
+	st->nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
+				global_page_state(NR_UNSTABLE_NFS);
+	st->nr_writeback = global_page_state(NR_WRITEBACK);
 
-		nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
-					global_page_state(NR_UNSTABLE_NFS);
-		nr_writeback = global_page_state(NR_WRITEBACK);
+	global_dirty_limits(&st->background_thresh, &st->dirty_thresh);
+}
 
-		global_dirty_limits(&background_thresh, &dirty_thresh);
+/* This function expects global state to be already filled in! */
+static void get_bdi_dirty_limit_state(struct backing_dev_info *bdi,
+				      struct dirty_limit_state *st)
+{
+	unsigned long min_bdi_thresh;
 
-		/*
-		 * Throttle it only when the background writeback cannot
-		 * catch-up. This avoids (excessively) small writeouts
-		 * when the bdi limits are ramping up.
-		 */
-		if (nr_reclaimable + nr_writeback <=
-				(background_thresh + dirty_thresh) / 2)
-			break;
+	st->bdi_thresh = bdi_dirty_limit(bdi, st->dirty_thresh);
+	min_bdi_thresh = task_min_dirty_limit(st->bdi_thresh);
+	/*
+	 * In order to avoid the stacked BDI deadlock we need to ensure we
+	 * accurately count the 'dirty' pages when the threshold is low.
+	 *
+	 * Otherwise it would be possible to get thresh+n pages reported dirty,
+	 * even though there are thresh-m pages actually dirty; with m+n
+	 * sitting in the percpu deltas.
+	 */
+	if (min_bdi_thresh < 2*bdi_stat_error(bdi)) {
+		st->bdi_nr_reclaimable = bdi_stat_sum(bdi, BDI_RECLAIMABLE);
+		st->bdi_nr_writeback = bdi_stat_sum(bdi, BDI_WRITEBACK);
+	} else {
+		st->bdi_nr_reclaimable = bdi_stat(bdi, BDI_RECLAIMABLE);
+		st->bdi_nr_writeback = bdi_stat(bdi, BDI_WRITEBACK);
+	}
+}
 
-		bdi_thresh = bdi_dirty_limit(bdi, dirty_thresh);
-		min_bdi_thresh = task_min_dirty_limit(bdi_thresh);
-		bdi_thresh = task_dirty_limit(current, bdi_thresh);
+/* Possibly states of dirty memory for BDI */
+enum {
+	DIRTY_OK,			/* Everything below limit */
+	DIRTY_EXCEED_BACKGROUND,	/* Backround writeback limit exceeded */
+	DIRTY_MAY_EXCEED_LIMIT,		/* Some task may exceed dirty limit */
+	DIRTY_EXCEED_LIMIT,		/* Current task exceeds dirty limit */
+};
 
-		/*
-		 * In order to avoid the stacked BDI deadlock we need
-		 * to ensure we accurately count the 'dirty' pages when
-		 * the threshold is low.
-		 *
-		 * Otherwise it would be possible to get thresh+n pages
-		 * reported dirty, even though there are thresh-m pages
-		 * actually dirty; with m+n sitting in the percpu
-		 * deltas.
-		 */
-		if (bdi_thresh < 2*bdi_stat_error(bdi)) {
-			bdi_nr_reclaimable = bdi_stat_sum(bdi, BDI_RECLAIMABLE);
-			bdi_nr_writeback = bdi_stat_sum(bdi, BDI_WRITEBACK);
-		} else {
-			bdi_nr_reclaimable = bdi_stat(bdi, BDI_RECLAIMABLE);
-			bdi_nr_writeback = bdi_stat(bdi, BDI_WRITEBACK);
-		}
+static int check_dirty_limits(struct backing_dev_info *bdi,
+			      struct dirty_limit_state *pst)
+{
+	struct dirty_limit_state st;
+	unsigned long bdi_thresh;
+	unsigned long min_bdi_thresh;
+	int ret = DIRTY_OK;
 
-		/*
-		 * The bdi thresh is somehow "soft" limit derived from the
-		 * global "hard" limit. The former helps to prevent heavy IO
-		 * bdi or process from holding back light ones; The latter is
-		 * the last resort safeguard.
-		 */
-		dirty_exceeded =
-			(bdi_nr_reclaimable + bdi_nr_writeback > bdi_thresh)
-			|| (nr_reclaimable + nr_writeback > dirty_thresh);
-		min_dirty_exceeded =
-			(bdi_nr_reclaimable + bdi_nr_writeback > min_bdi_thresh)
-			|| (nr_reclaimable + nr_writeback > dirty_thresh);
-
-		if (!dirty_exceeded)
-			break;
+	get_global_dirty_limit_state(&st);
+	/*
+	 * Throttle it only when the background writeback cannot catch-up. This
+	 * avoids (excessively) small writeouts when the bdi limits are ramping
+	 * up.
+	 */
+	if (st.nr_reclaimable + st.nr_writeback <=
+			(st.background_thresh + st.dirty_thresh) / 2)
+		goto out;
 
-		if (!bdi->dirty_exceeded)
-			bdi->dirty_exceeded = 1;
-
-		/* Note: nr_reclaimable denotes nr_dirty + nr_unstable.
-		 * Unstable writes are a feature of certain networked
-		 * filesystems (i.e. NFS) in which data may have been
-		 * written to the server's write cache, but has not yet
-		 * been flushed to permanent storage.
-		 * Only move pages to writeback if this bdi is over its
-		 * threshold otherwise wait until the disk writes catch
-		 * up.
-		 */
-		trace_wbc_balance_dirty_start(&wbc, bdi);
-		if (bdi_nr_reclaimable > bdi_thresh) {
-			writeback_inodes_wb(&bdi->wb, &wbc);
-			pages_written += write_chunk - wbc.nr_to_write;
-			trace_wbc_balance_dirty_written(&wbc, bdi);
-			if (pages_written >= write_chunk)
-				break;		/* We've done our duty */
+	get_bdi_dirty_limit_state(bdi, &st);
+	min_bdi_thresh = task_min_dirty_limit(st.bdi_thresh);
+	bdi_thresh = task_dirty_limit(current, st.bdi_thresh);
+
+	/*
+	 * The bdi thresh is somehow "soft" limit derived from the global
+	 * "hard" limit. The former helps to prevent heavy IO bdi or process
+	 * from holding back light ones; The latter is the last resort
+	 * safeguard.
+	 */
+	if ((st.bdi_nr_reclaimable + st.bdi_nr_writeback > bdi_thresh)
+	    || (st.nr_reclaimable + st.nr_writeback > st.dirty_thresh)) {
+		ret = DIRTY_EXCEED_LIMIT;
+		goto out;
+	}
+	if (st.bdi_nr_reclaimable + st.bdi_nr_writeback > min_bdi_thresh) {
+		ret = DIRTY_MAY_EXCEED_LIMIT;
+		goto out;
+	}
+	if (st.nr_reclaimable > st.background_thresh)
+		ret = DIRTY_EXCEED_BACKGROUND;
+out:
+	if (pst)
+		*pst = st;
+	return ret;
+}
+
+static bool bdi_task_limit_exceeded(struct dirty_limit_state *st,
+				    struct task_struct *p)
+{
+	unsigned long bdi_thresh;
+
+	bdi_thresh = task_dirty_limit(p, st->bdi_thresh);
+
+	return st->bdi_nr_reclaimable + st->bdi_nr_writeback > bdi_thresh;
+}
+
+static void balance_waiter_done(struct backing_dev_info *bdi,
+				struct balance_waiter *bw)
+{
+	list_del_init(&bw->bw_list);
+	bdi->balance_waiters--;
+	wake_up_process(bw->bw_task);
+}
+
+void distribute_page_completions(struct work_struct *work)
+{
+	struct backing_dev_info *bdi =
+		container_of(work, struct backing_dev_info, balance_work.work);
+	unsigned long written = bdi_stat_sum(bdi, BDI_WRITTEN);
+	unsigned long pages_per_waiter, remainder_pages;
+	struct balance_waiter *waiter, *tmpw;
+	struct dirty_limit_state st;
+	int dirty_exceeded;
+
+	trace_writeback_distribute_page_completions(bdi, bdi->written_start,
+					  written - bdi->written_start);
+	dirty_exceeded = check_dirty_limits(bdi, &st);
+	if (dirty_exceeded < DIRTY_MAY_EXCEED_LIMIT) {
+		/* Wakeup everybody */
+		trace_writeback_distribute_page_completions_wakeall(bdi);
+		spin_lock(&bdi->balance_lock);
+		list_for_each_entry_safe(
+				waiter, tmpw, &bdi->balance_list, bw_list)
+			balance_waiter_done(bdi, waiter);
+		spin_unlock(&bdi->balance_lock);
+		return;
+	}
+
+	spin_lock(&bdi->balance_lock);
+	/*
+	 * Note: This loop can have quadratic complexity in the number of
+	 * waiters. It can be changed to a linear one if we also maintained a
+	 * list sorted by number of pages. But for now that does not seem to be
+	 * worth the effort.
+	 */
+	remainder_pages = written - bdi->written_start;
+	bdi->written_start = written;
+	while (!list_empty(&bdi->balance_list)) {
+		pages_per_waiter = remainder_pages / bdi->balance_waiters;
+		if (!pages_per_waiter)
+			break;
+		remainder_pages %= bdi->balance_waiters;
+		list_for_each_entry_safe(
+				waiter, tmpw, &bdi->balance_list, bw_list) {
+			if (waiter->bw_to_write <= pages_per_waiter) {
+				remainder_pages += pages_per_waiter -
+						   waiter->bw_to_write;
+				balance_waiter_done(bdi, waiter);
+				continue;
+			}
+			waiter->bw_to_write -= pages_per_waiter;
 		}
-		trace_wbc_balance_dirty_wait(&wbc, bdi);
-		__set_current_state(TASK_UNINTERRUPTIBLE);
-		io_schedule_timeout(pause);
+	}
+	/* Distribute remaining pages */
+	list_for_each_entry_safe(waiter, tmpw, &bdi->balance_list, bw_list) {
+		if (remainder_pages > 0) {
+			waiter->bw_to_write--;
+			remainder_pages--;
+		}
+		if (waiter->bw_to_write == 0 ||
+		    (dirty_exceeded == DIRTY_MAY_EXCEED_LIMIT &&
+		     !bdi_task_limit_exceeded(&st, waiter->bw_task)))
+			balance_waiter_done(bdi, waiter);
+	}
+	/* More page completions needed? */
+	if (!list_empty(&bdi->balance_list))
+		schedule_delayed_work(&bdi->balance_work, HZ/10);
+	spin_unlock(&bdi->balance_lock);
+}
+
+/*
+ * balance_dirty_pages() must be called by processes which are generating dirty
+ * data.  It looks at the number of dirty pages in the machine and will force
+ * the caller to perform writeback if the system is over `vm_dirty_ratio'.
+ * If we're over `background_thresh' then the writeback threads are woken to
+ * perform some writeout.
+ */
+static void balance_dirty_pages(struct address_space *mapping,
+				unsigned long write_chunk)
+{
+	struct backing_dev_info *bdi = mapping->backing_dev_info;
+	struct balance_waiter bw;
+	int dirty_exceeded = check_dirty_limits(bdi, NULL);
 
+	if (dirty_exceeded != DIRTY_EXCEED_LIMIT) {
+		if (bdi->dirty_exceeded &&
+		    dirty_exceeded < DIRTY_MAY_EXCEED_LIMIT)
+			bdi->dirty_exceeded = 0;
 		/*
-		 * Increase the delay for each loop, up to our previous
-		 * default of taking a 100ms nap.
+		 * In laptop mode, we wait until hitting the higher threshold
+		 * before starting background writeout, and then write out all
+		 * the way down to the lower threshold.  So slow writers cause
+		 * minimal disk activity.
+		 *
+		 * In normal mode, we start background writeout at the lower
+		 * background_thresh, to keep the amount of dirty memory low.
 		 */
-		pause <<= 1;
-		if (pause > HZ / 10)
-			pause = HZ / 10;
+		if (!laptop_mode && dirty_exceeded == DIRTY_EXCEED_BACKGROUND)
+			bdi_start_background_writeback(bdi);
+		return;
 	}
 
-	/* Clear dirty_exceeded flag only when no task can exceed the limit */
-	if (!min_dirty_exceeded && bdi->dirty_exceeded)
-		bdi->dirty_exceeded = 0;
+	if (!bdi->dirty_exceeded)
+		bdi->dirty_exceeded = 1;
 
-	if (writeback_in_progress(bdi))
-		return;
+	trace_writeback_balance_dirty_pages_waiting(bdi, write_chunk);
+	/* Kick flusher thread to start doing work if it isn't already */
+	bdi_start_background_writeback(bdi);
 
+	bw.bw_to_write = write_chunk;
+	bw.bw_task = current;
+	spin_lock(&bdi->balance_lock);
 	/*
-	 * In laptop mode, we wait until hitting the higher threshold before
-	 * starting background writeout, and then write out all the way down
-	 * to the lower threshold.  So slow writers cause minimal disk activity.
-	 *
-	 * In normal mode, we start background writeout at the lower
-	 * background_thresh, to keep the amount of dirty memory low.
+	 * First item? Need to schedule distribution of IO completions among
+	 * items on balance_list
+	 */
+	if (list_empty(&bdi->balance_list)) {
+		bdi->written_start = bdi_stat_sum(bdi, BDI_WRITTEN);
+		/* FIXME: Delay should be autotuned based on dev throughput */
+		schedule_delayed_work(&bdi->balance_work, HZ/10);
+	}
+	/*
+	 * Add work to the balance list, from now on the structure is handled
+	 * by distribute_page_completions()
+	 */
+	list_add_tail(&bw.bw_list, &bdi->balance_list);
+	bdi->balance_waiters++;
+	/*
+	 * Setting task state must happen inside balance_lock to avoid races
+	 * with distribution function waking us.
+	 */
+	__set_current_state(TASK_UNINTERRUPTIBLE);
+	spin_unlock(&bdi->balance_lock);
+	/* Wait for pages to get written */
+	schedule();
+	/*
+	 * Enough page completions should have happened by now and we should
+	 * have been removed from the list
 	 */
-	if ((laptop_mode && pages_written) ||
-	    (!laptop_mode && (nr_reclaimable > background_thresh)))
-		bdi_start_background_writeback(bdi);
+	WARN_ON(!list_empty(&bw.bw_list));
+	trace_writeback_balance_dirty_pages_woken(bdi);
 }
 
 void set_page_dirty_balance(struct page *page, int page_mkwrite)
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* Re: [PATCH 3/5] mm: Implement IO-less balance_dirty_pages()
  2011-02-04  1:38 ` [PATCH 3/5] mm: Implement IO-less balance_dirty_pages() Jan Kara
@ 2011-02-04 13:09   ` Peter Zijlstra
  2011-02-11 14:56     ` Jan Kara
  2011-02-04 13:09   ` Peter Zijlstra
  2011-02-04 13:09   ` Peter Zijlstra
  2 siblings, 1 reply; 49+ messages in thread
From: Peter Zijlstra @ 2011-02-04 13:09 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-fsdevel, linux-mm, Andrew Morton, Christoph Hellwig,
	Dave Chinner, Wu Fengguang

On Fri, 2011-02-04 at 02:38 +0100, Jan Kara wrote:
> +struct balance_waiter {
> +       struct list_head bw_list;
> +       unsigned long bw_to_write;      /* Number of pages to wait for to
> +                                          get written */

That names somehow rubs me the wrong way.. the name suggests we need to
do the writing, whereas we only wait for them to be written.

> +       struct task_struct *bw_task;    /* Task waiting for IO */
> +}; 




^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 3/5] mm: Implement IO-less balance_dirty_pages()
  2011-02-04 13:09   ` Peter Zijlstra
@ 2011-02-11 14:56     ` Jan Kara
  0 siblings, 0 replies; 49+ messages in thread
From: Jan Kara @ 2011-02-11 14:56 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Jan Kara, linux-fsdevel, linux-mm, Andrew Morton,
	Christoph Hellwig, Dave Chinner, Wu Fengguang

On Fri 04-02-11 14:09:15, Peter Zijlstra wrote:
> On Fri, 2011-02-04 at 02:38 +0100, Jan Kara wrote:
> > +struct balance_waiter {
> > +       struct list_head bw_list;
> > +       unsigned long bw_to_write;      /* Number of pages to wait for to
> > +                                          get written */
> 
> That names somehow rubs me the wrong way.. the name suggests we need to
> do the writing, whereas we only wait for them to be written.
  Good point. Will change that.

									Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 3/5] mm: Implement IO-less balance_dirty_pages()
  2011-02-04  1:38 ` [PATCH 3/5] mm: Implement IO-less balance_dirty_pages() Jan Kara
  2011-02-04 13:09   ` Peter Zijlstra
@ 2011-02-04 13:09   ` Peter Zijlstra
  2011-02-11 14:56     ` Jan Kara
  2011-02-04 13:09   ` Peter Zijlstra
  2 siblings, 1 reply; 49+ messages in thread
From: Peter Zijlstra @ 2011-02-04 13:09 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-fsdevel, linux-mm, Andrew Morton, Christoph Hellwig,
	Dave Chinner, Wu Fengguang

On Fri, 2011-02-04 at 02:38 +0100, Jan Kara wrote:
> +static int check_dirty_limits(struct backing_dev_info *bdi,
> +                             struct dirty_limit_state *pst)
> +{
> +       struct dirty_limit_state st;
> +       unsigned long bdi_thresh;
> +       unsigned long min_bdi_thresh;
> +       int ret = DIRTY_OK;
>  
> +       get_global_dirty_limit_state(&st);
> +       /*
> +        * Throttle it only when the background writeback cannot catch-up. This
> +        * avoids (excessively) small writeouts when the bdi limits are ramping
> +        * up.
> +        */
> +       if (st.nr_reclaimable + st.nr_writeback <=
> +                       (st.background_thresh + st.dirty_thresh) / 2)
> +               goto out;
>  
> +       get_bdi_dirty_limit_state(bdi, &st);
> +       min_bdi_thresh = task_min_dirty_limit(st.bdi_thresh);
> +       bdi_thresh = task_dirty_limit(current, st.bdi_thresh);
> +
> +       /*
> +        * The bdi thresh is somehow "soft" limit derived from the global
> +        * "hard" limit. The former helps to prevent heavy IO bdi or process
> +        * from holding back light ones; The latter is the last resort
> +        * safeguard.
> +        */
> +       if ((st.bdi_nr_reclaimable + st.bdi_nr_writeback > bdi_thresh)
> +           || (st.nr_reclaimable + st.nr_writeback > st.dirty_thresh)) {
> +               ret = DIRTY_EXCEED_LIMIT;
> +               goto out;
> +       }
> +       if (st.bdi_nr_reclaimable + st.bdi_nr_writeback > min_bdi_thresh) {
> +               ret = DIRTY_MAY_EXCEED_LIMIT;
> +               goto out;
> +       }
> +       if (st.nr_reclaimable > st.background_thresh)
> +               ret = DIRTY_EXCEED_BACKGROUND;
> +out:
> +       if (pst)
> +               *pst = st;

By mandating pst is always provided you can reduce the total stack
footprint, avoid the memcopy and clean up the control flow ;-)

> +       return ret;
> +} 


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 3/5] mm: Implement IO-less balance_dirty_pages()
  2011-02-04 13:09   ` Peter Zijlstra
@ 2011-02-11 14:56     ` Jan Kara
  0 siblings, 0 replies; 49+ messages in thread
From: Jan Kara @ 2011-02-11 14:56 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Jan Kara, linux-fsdevel, linux-mm, Andrew Morton,
	Christoph Hellwig, Dave Chinner, Wu Fengguang

On Fri 04-02-11 14:09:16, Peter Zijlstra wrote:
> On Fri, 2011-02-04 at 02:38 +0100, Jan Kara wrote:
> > +static int check_dirty_limits(struct backing_dev_info *bdi,
> > +                             struct dirty_limit_state *pst)
> > +{
> > +       struct dirty_limit_state st;
> > +       unsigned long bdi_thresh;
> > +       unsigned long min_bdi_thresh;
> > +       int ret = DIRTY_OK;
> >  
> > +       get_global_dirty_limit_state(&st);
> > +       /*
> > +        * Throttle it only when the background writeback cannot catch-up. This
> > +        * avoids (excessively) small writeouts when the bdi limits are ramping
> > +        * up.
> > +        */
> > +       if (st.nr_reclaimable + st.nr_writeback <=
> > +                       (st.background_thresh + st.dirty_thresh) / 2)
> > +               goto out;
> >  
> > +       get_bdi_dirty_limit_state(bdi, &st);
> > +       min_bdi_thresh = task_min_dirty_limit(st.bdi_thresh);
> > +       bdi_thresh = task_dirty_limit(current, st.bdi_thresh);
> > +
> > +       /*
> > +        * The bdi thresh is somehow "soft" limit derived from the global
> > +        * "hard" limit. The former helps to prevent heavy IO bdi or process
> > +        * from holding back light ones; The latter is the last resort
> > +        * safeguard.
> > +        */
> > +       if ((st.bdi_nr_reclaimable + st.bdi_nr_writeback > bdi_thresh)
> > +           || (st.nr_reclaimable + st.nr_writeback > st.dirty_thresh)) {
> > +               ret = DIRTY_EXCEED_LIMIT;
> > +               goto out;
> > +       }
> > +       if (st.bdi_nr_reclaimable + st.bdi_nr_writeback > min_bdi_thresh) {
> > +               ret = DIRTY_MAY_EXCEED_LIMIT;
> > +               goto out;
> > +       }
> > +       if (st.nr_reclaimable > st.background_thresh)
> > +               ret = DIRTY_EXCEED_BACKGROUND;
> > +out:
> > +       if (pst)
> > +               *pst = st;
> 
> By mandating pst is always provided you can reduce the total stack
> footprint, avoid the memcopy and clean up the control flow ;-)
  OK, will do.

									Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 3/5] mm: Implement IO-less balance_dirty_pages()
  2011-02-04  1:38 ` [PATCH 3/5] mm: Implement IO-less balance_dirty_pages() Jan Kara
  2011-02-04 13:09   ` Peter Zijlstra
  2011-02-04 13:09   ` Peter Zijlstra
@ 2011-02-04 13:09   ` Peter Zijlstra
  2011-02-04 13:19     ` Peter Zijlstra
  2011-02-11 15:46     ` Jan Kara
  2 siblings, 2 replies; 49+ messages in thread
From: Peter Zijlstra @ 2011-02-04 13:09 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-fsdevel, linux-mm, Andrew Morton, Christoph Hellwig,
	Dave Chinner, Wu Fengguang

On Fri, 2011-02-04 at 02:38 +0100, Jan Kara wrote:
> +void distribute_page_completions(struct work_struct *work)
> +{
> +       struct backing_dev_info *bdi =
> +               container_of(work, struct backing_dev_info, balance_work.work);
> +       unsigned long written = bdi_stat_sum(bdi, BDI_WRITTEN);
> +       unsigned long pages_per_waiter, remainder_pages;
> +       struct balance_waiter *waiter, *tmpw;
> +       struct dirty_limit_state st;
> +       int dirty_exceeded;
> +
> +       trace_writeback_distribute_page_completions(bdi, bdi->written_start,
> +                                         written - bdi->written_start);

So in fact you only need to pass bdi and written :-)

> +       dirty_exceeded = check_dirty_limits(bdi, &st);
> +       if (dirty_exceeded < DIRTY_MAY_EXCEED_LIMIT) {
> +               /* Wakeup everybody */
> +               trace_writeback_distribute_page_completions_wakeall(bdi);
> +               spin_lock(&bdi->balance_lock);
> +               list_for_each_entry_safe(
> +                               waiter, tmpw, &bdi->balance_list, bw_list)
> +                       balance_waiter_done(bdi, waiter);
> +               spin_unlock(&bdi->balance_lock);
> +               return;
> +       }
> +
> +       spin_lock(&bdi->balance_lock);

is there any reason this is a spinlock and not a mutex?

> +       /*
> +        * Note: This loop can have quadratic complexity in the number of
> +        * waiters. It can be changed to a linear one if we also maintained a
> +        * list sorted by number of pages. But for now that does not seem to be
> +        * worth the effort.
> +        */

That doesn't seem to explain much :/

> +       remainder_pages = written - bdi->written_start;
> +       bdi->written_start = written;
> +       while (!list_empty(&bdi->balance_list)) {
> +               pages_per_waiter = remainder_pages / bdi->balance_waiters;
> +               if (!pages_per_waiter)
> +                       break;

if remainder_pages < balance_waiters you just lost you delta, its best
to not set bdi->written_start until the end and leave everything not
processed for the next round.

> +               remainder_pages %= bdi->balance_waiters;
> +               list_for_each_entry_safe(
> +                               waiter, tmpw, &bdi->balance_list, bw_list) {
> +                       if (waiter->bw_to_write <= pages_per_waiter) {
> +                               remainder_pages += pages_per_waiter -
> +                                                  waiter->bw_to_write;
> +                               balance_waiter_done(bdi, waiter);
> +                               continue;
> +                       }
> +                       waiter->bw_to_write -= pages_per_waiter;
>                 }
> +       }
> +       /* Distribute remaining pages */
> +       list_for_each_entry_safe(waiter, tmpw, &bdi->balance_list, bw_list) {
> +               if (remainder_pages > 0) {
> +                       waiter->bw_to_write--;
> +                       remainder_pages--;
> +               }
> +               if (waiter->bw_to_write == 0 ||
> +                   (dirty_exceeded == DIRTY_MAY_EXCEED_LIMIT &&
> +                    !bdi_task_limit_exceeded(&st, waiter->bw_task)))
> +                       balance_waiter_done(bdi, waiter);
> +       }

OK, I see what you're doing, but I'm not quite sure it makes complete
sense yet.

  mutex_lock(&bdi->balance_mutex);
  for (;;) {
    unsigned long pages = written - bdi->written_start;
    unsigned long pages_per_waiter = pages / bdi->balance_waiters;
    if (!pages_per_waiter)
      break;
    list_for_each_entry_safe(waiter, tmpw, &bdi->balance_list, bw_list){
      unsigned long delta = min(pages_per_waiter, waiter->bw_to_write);

      bdi->written_start += delta;
      waiter->bw_to_write -= delta;
      if (!waiter->bw_to_write)
        balance_waiter_done(bdi, waiter);
    }
  }
  mutex_unlock(&bdi->balance_mutex);

Comes close to what you wrote I think.

One of the problems I have with it is that min(), it means that that
waiter waited too long, but will not be compensated for this by reducing
its next wait. Instead you give it away to other waiters which preserves
fairness on the bdi level, but not for tasks.

You can do that by keeping ->bw_to_write in task_struct and normalize it
by the estimated bdi bandwidth (patch 5), that way, when you next
increment it it will turn out to be lower and the wait will be shorter.

That also removes the need to loop over the waiters.



^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 3/5] mm: Implement IO-less balance_dirty_pages()
  2011-02-04 13:09   ` Peter Zijlstra
@ 2011-02-04 13:19     ` Peter Zijlstra
  2011-02-11 15:46     ` Jan Kara
  1 sibling, 0 replies; 49+ messages in thread
From: Peter Zijlstra @ 2011-02-04 13:19 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-fsdevel, linux-mm, Andrew Morton, Christoph Hellwig,
	Dave Chinner, Wu Fengguang

On Fri, 2011-02-04 at 14:09 +0100, Peter Zijlstra wrote:
> 
> You can do that by keeping ->bw_to_write in task_struct and normalize it
> by the estimated bdi bandwidth (patch 5), that way, when you next
> increment it it will turn out to be lower and the wait will be shorter.

The down-side is of course that time will leak from the system on exit,
since its impossible to map back to a bdi.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 3/5] mm: Implement IO-less balance_dirty_pages()
  2011-02-04 13:09   ` Peter Zijlstra
  2011-02-04 13:19     ` Peter Zijlstra
@ 2011-02-11 15:46     ` Jan Kara
  2011-02-22 15:40       ` Peter Zijlstra
  1 sibling, 1 reply; 49+ messages in thread
From: Jan Kara @ 2011-02-11 15:46 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Jan Kara, linux-fsdevel, linux-mm, Andrew Morton,
	Christoph Hellwig, Dave Chinner, Wu Fengguang

On Fri 04-02-11 14:09:16, Peter Zijlstra wrote:
> On Fri, 2011-02-04 at 02:38 +0100, Jan Kara wrote:
> > +       dirty_exceeded = check_dirty_limits(bdi, &st);
> > +       if (dirty_exceeded < DIRTY_MAY_EXCEED_LIMIT) {
> > +               /* Wakeup everybody */
> > +               trace_writeback_distribute_page_completions_wakeall(bdi);
> > +               spin_lock(&bdi->balance_lock);
> > +               list_for_each_entry_safe(
> > +                               waiter, tmpw, &bdi->balance_list, bw_list)
> > +                       balance_waiter_done(bdi, waiter);
> > +               spin_unlock(&bdi->balance_lock);
> > +               return;
> > +       }
> > +
> > +       spin_lock(&bdi->balance_lock);
> is there any reason this is a spinlock and not a mutex?
  No. Is mutex preferable?

> > +       /*
> > +        * Note: This loop can have quadratic complexity in the number of
> > +        * waiters. It can be changed to a linear one if we also maintained a
> > +        * list sorted by number of pages. But for now that does not seem to be
> > +        * worth the effort.
> > +        */
> 
> That doesn't seem to explain much :/
> 
> > +       remainder_pages = written - bdi->written_start;
> > +       bdi->written_start = written;
> > +       while (!list_empty(&bdi->balance_list)) {
> > +               pages_per_waiter = remainder_pages / bdi->balance_waiters;
> > +               if (!pages_per_waiter)
> > +                       break;
> 
> if remainder_pages < balance_waiters you just lost you delta, its best
> to not set bdi->written_start until the end and leave everything not
> processed for the next round.
  I haven't lost it, it will be distributed in the second loop.

> > +               remainder_pages %= bdi->balance_waiters;
> > +               list_for_each_entry_safe(
> > +                               waiter, tmpw, &bdi->balance_list, bw_list) {
> > +                       if (waiter->bw_to_write <= pages_per_waiter) {
> > +                               remainder_pages += pages_per_waiter -
> > +                                                  waiter->bw_to_write;
> > +                               balance_waiter_done(bdi, waiter);
> > +                               continue;
> > +                       }
> > +                       waiter->bw_to_write -= pages_per_waiter;
> >                 }
> > +       }
> > +       /* Distribute remaining pages */
> > +       list_for_each_entry_safe(waiter, tmpw, &bdi->balance_list, bw_list) {
> > +               if (remainder_pages > 0) {
> > +                       waiter->bw_to_write--;
> > +                       remainder_pages--;
> > +               }
> > +               if (waiter->bw_to_write == 0 ||
> > +                   (dirty_exceeded == DIRTY_MAY_EXCEED_LIMIT &&
> > +                    !bdi_task_limit_exceeded(&st, waiter->bw_task)))
> > +                       balance_waiter_done(bdi, waiter);
> > +       }
> 
> OK, I see what you're doing, but I'm not quite sure it makes complete
> sense yet.
> 
>   mutex_lock(&bdi->balance_mutex);
>   for (;;) {
>     unsigned long pages = written - bdi->written_start;
>     unsigned long pages_per_waiter = pages / bdi->balance_waiters;
>     if (!pages_per_waiter)
>       break;
>     list_for_each_entry_safe(waiter, tmpw, &bdi->balance_list, bw_list){
>       unsigned long delta = min(pages_per_waiter, waiter->bw_to_write);
> 
>       bdi->written_start += delta;
>       waiter->bw_to_write -= delta;
>       if (!waiter->bw_to_write)
>         balance_waiter_done(bdi, waiter);
>     }
>   }
>   mutex_unlock(&bdi->balance_mutex);
> 
> Comes close to what you wrote I think.
  Yes, quite close. Only if we wake up and there are not enough pages for
all waiters. We at least "help" waiters in the beginning of the queue. That
could have some impact when the queue grows quickly on a slow device
(something like write fork bomb) but given we need only one page written per
waiter it would be really horrible situation when this triggers anyway...

> One of the problems I have with it is that min(), it means that that
> waiter waited too long, but will not be compensated for this by reducing
> its next wait. Instead you give it away to other waiters which preserves
> fairness on the bdi level, but not for tasks.
> 
> You can do that by keeping ->bw_to_write in task_struct and normalize it
> by the estimated bdi bandwidth (patch 5), that way, when you next
> increment it it will turn out to be lower and the wait will be shorter.
> 
> That also removes the need to loop over the waiters.
  Umm, interesting idea! Just the implication "pages_per_waiter >
->bw_to_write => we waited for too long" isn't completely right. In fact,
each waiter entered the queue sometime between the first waiter entered it
and the time timer triggered. It might have been just shortly before the
timer triggered and thus it will be in fact delayed for too short time. So
this problem is somehow the other side of the problem you describe and so
far I just ignored this problem in a hope that it just levels out in the
long term.

The trouble I see with storing remaining written pages with each task is
that we can accumulate significant amount of pages in that - from what I
see e.g. with my SATA drive the writeback completion seems to be rather
bumpy (and it's even worse over NFS). If we then get below dirty limit,
process can carry over a lot of written pages to the time when dirty limit
gets exceeded again which reduces the effect of throttling at that time
and we can exceed dirty limits by more than we'd wish. We could solve this
by somehow invalidating the written pages when we stop throttling on that
bdi but that would mean to track something like pairs <bonus pages, bdi>
with each task - not sure we want to do that...

									Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH 3/5] mm: Implement IO-less balance_dirty_pages()
  2011-02-11 15:46     ` Jan Kara
@ 2011-02-22 15:40       ` Peter Zijlstra
  0 siblings, 0 replies; 49+ messages in thread
From: Peter Zijlstra @ 2011-02-22 15:40 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-fsdevel, linux-mm, Andrew Morton, Christoph Hellwig,
	Dave Chinner, Wu Fengguang

On Fri, 2011-02-11 at 16:46 +0100, Jan Kara wrote:
> On Fri 04-02-11 14:09:16, Peter Zijlstra wrote:
> > On Fri, 2011-02-04 at 02:38 +0100, Jan Kara wrote:
> > > +       dirty_exceeded = check_dirty_limits(bdi, &st);
> > > +       if (dirty_exceeded < DIRTY_MAY_EXCEED_LIMIT) {
> > > +               /* Wakeup everybody */
> > > +               trace_writeback_distribute_page_completions_wakeall(bdi);
> > > +               spin_lock(&bdi->balance_lock);
> > > +               list_for_each_entry_safe(
> > > +                               waiter, tmpw, &bdi->balance_list, bw_list)
> > > +                       balance_waiter_done(bdi, waiter);
> > > +               spin_unlock(&bdi->balance_lock);
> > > +               return;
> > > +       }
> > > +
> > > +       spin_lock(&bdi->balance_lock);
> > is there any reason this is a spinlock and not a mutex?
>   No. Is mutex preferable?

I'd say so, the loops you do under it can be quite large and you're
going to make the task sleep anyway, doesn't really matter if it blocks
on a mutex too while its at it.

> > > +       /*
> > > +        * Note: This loop can have quadratic complexity in the number of
> > > +        * waiters. It can be changed to a linear one if we also maintained a
> > > +        * list sorted by number of pages. But for now that does not seem to be
> > > +        * worth the effort.
> > > +        */
> > 
> > That doesn't seem to explain much :/
> > 
> > > +       remainder_pages = written - bdi->written_start;
> > > +       bdi->written_start = written;
> > > +       while (!list_empty(&bdi->balance_list)) {
> > > +               pages_per_waiter = remainder_pages / bdi->balance_waiters;
> > > +               if (!pages_per_waiter)
> > > +                       break;
> > 
> > if remainder_pages < balance_waiters you just lost you delta, its best
> > to not set bdi->written_start until the end and leave everything not
> > processed for the next round.
>   I haven't lost it, it will be distributed in the second loop.

Ah, indeed. weird construct though.

> > > +               remainder_pages %= bdi->balance_waiters;
> > > +               list_for_each_entry_safe(
> > > +                               waiter, tmpw, &bdi->balance_list, bw_list) {
> > > +                       if (waiter->bw_to_write <= pages_per_waiter) {
> > > +                               remainder_pages += pages_per_waiter -
> > > +                                                  waiter->bw_to_write;
> > > +                               balance_waiter_done(bdi, waiter);
> > > +                               continue;
> > > +                       }
> > > +                       waiter->bw_to_write -= pages_per_waiter;
> > >                 }
> > > +       }
> > > +       /* Distribute remaining pages */
> > > +       list_for_each_entry_safe(waiter, tmpw, &bdi->balance_list, bw_list) {
> > > +               if (remainder_pages > 0) {
> > > +                       waiter->bw_to_write--;
> > > +                       remainder_pages--;
> > > +               }
> > > +               if (waiter->bw_to_write == 0 ||
> > > +                   (dirty_exceeded == DIRTY_MAY_EXCEED_LIMIT &&
> > > +                    !bdi_task_limit_exceeded(&st, waiter->bw_task)))
> > > +                       balance_waiter_done(bdi, waiter);
> > > +       }
> > 
> > OK, I see what you're doing, but I'm not quite sure it makes complete
> > sense yet.
> > 
> >   mutex_lock(&bdi->balance_mutex);
> >   for (;;) {
> >     unsigned long pages = written - bdi->written_start;
> >     unsigned long pages_per_waiter = pages / bdi->balance_waiters;
> >     if (!pages_per_waiter)
> >       break;
> >     list_for_each_entry_safe(waiter, tmpw, &bdi->balance_list, bw_list){
> >       unsigned long delta = min(pages_per_waiter, waiter->bw_to_write);
> > 
> >       bdi->written_start += delta;
> >       waiter->bw_to_write -= delta;
> >       if (!waiter->bw_to_write)
> >         balance_waiter_done(bdi, waiter);
> >     }
> >   }
> >   mutex_unlock(&bdi->balance_mutex);
> > 
> > Comes close to what you wrote I think.
>   Yes, quite close. Only if we wake up and there are not enough pages for
> all waiters. We at least "help" waiters in the beginning of the queue. That
> could have some impact when the queue grows quickly on a slow device
> (something like write fork bomb) but given we need only one page written per
> waiter it would be really horrible situation when this triggers anyway...

Right, but its also unfair, suppose your timer/bandwidth ratio is such
that each timer hit is after 1 writeout, in that case you would
distribute pages to the first waiter in a 1 ratio instead of
1/nr_waiters, essentially robbing the other waiters of progress.

> > One of the problems I have with it is that min(), it means that that
> > waiter waited too long, but will not be compensated for this by reducing
> > its next wait. Instead you give it away to other waiters which preserves
> > fairness on the bdi level, but not for tasks.
> > 
> > You can do that by keeping ->bw_to_write in task_struct and normalize it
> > by the estimated bdi bandwidth (patch 5), that way, when you next
> > increment it it will turn out to be lower and the wait will be shorter.
> > 
> > That also removes the need to loop over the waiters.
>   Umm, interesting idea! Just the implication "pages_per_waiter >
> ->bw_to_write => we waited for too long" isn't completely right. In fact,
> each waiter entered the queue sometime between the first waiter entered it
> and the time timer triggered. It might have been just shortly before the
> timer triggered and thus it will be in fact delayed for too short time. So
> this problem is somehow the other side of the problem you describe and so
> far I just ignored this problem in a hope that it just levels out in the
> long term.

Yeah,.. it will balance out over the bdi, just not over tasks afaict.
But as you note below its not without problems of its own. Part of the
issues can be fixed by more accurate accounting, but not at all sure
that's actually worth the effort.

> The trouble I see with storing remaining written pages with each task is
> that we can accumulate significant amount of pages in that - from what I
> see e.g. with my SATA drive the writeback completion seems to be rather
> bumpy (and it's even worse over NFS). If we then get below dirty limit,
> process can carry over a lot of written pages to the time when dirty limit
> gets exceeded again which reduces the effect of throttling at that time
> and we can exceed dirty limits by more than we'd wish. We could solve this
> by somehow invalidating the written pages when we stop throttling on that
> bdi but that would mean to track something like pairs <bonus pages, bdi>
> with each task - not sure we want to do that...

Right, but even for those bursty devices the estimated bandwidth should
average out.. but yeah, its all a bit messy and I can quite see why you
wouldn't want to go there ;-)



^ permalink raw reply	[flat|nested] 49+ messages in thread

end of thread, other threads:[~2011-03-29  7:52 UTC | newest]

Thread overview: 49+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2011-03-08 22:31 [PATCH RFC 0/5] IO-less balance_dirty_pages() v2 (simple approach) Jan Kara
2011-03-08 22:31 ` [PATCH 1/5] writeback: account per-bdi accumulated written pages Jan Kara
2011-03-08 22:31 ` [PATCH 2/5] mm: Properly reflect task dirty limits in dirty_exceeded logic Jan Kara
2011-03-09 21:02   ` Vivek Goyal
2011-03-14 20:44     ` Jan Kara
2011-03-15 15:21       ` Vivek Goyal
2011-03-08 22:31 ` [PATCH 3/5] mm: Implement IO-less balance_dirty_pages() Jan Kara
2011-03-10  0:07   ` Vivek Goyal
2011-03-14 20:48     ` Jan Kara
2011-03-15 15:23       ` Vivek Goyal
2011-03-16 21:26         ` Curt Wohlgemuth
2011-03-16 22:53           ` Curt Wohlgemuth
2011-03-16 16:53   ` Vivek Goyal
2011-03-16 19:10     ` Jan Kara
2011-03-16 19:31       ` Vivek Goyal
2011-03-16 19:58         ` Jan Kara
2011-03-16 20:22           ` Vivek Goyal
2011-03-08 22:31 ` [PATCH 4/5] mm: Remove low limit from sync_writeback_pages() Jan Kara
2011-03-08 22:31 ` [PATCH 5/5] mm: Autotune interval between distribution of page completions Jan Kara
2011-03-17 15:46 ` [PATCH RFC 0/5] IO-less balance_dirty_pages() v2 (simple approach) Curt Wohlgemuth
2011-03-17 15:51   ` Christoph Hellwig
2011-03-17 16:24     ` Curt Wohlgemuth
2011-03-17 16:43       ` Christoph Hellwig
2011-03-17 17:32   ` Jan Kara
2011-03-17 18:55     ` Curt Wohlgemuth
2011-03-17 22:56       ` Vivek Goyal
2011-03-18 14:30 ` Wu Fengguang
2011-03-22 21:43   ` Jan Kara
2011-03-23  4:41     ` Dave Chinner
2011-03-25 12:59       ` Wu Fengguang
2011-03-25 13:44     ` Wu Fengguang
2011-03-25 23:05       ` Jan Kara
2011-03-28  2:44         ` Wu Fengguang
2011-03-28 15:08           ` Jan Kara
2011-03-29  1:44             ` Wu Fengguang
2011-03-29  2:14           ` Dave Chinner
2011-03-29  2:41             ` Wu Fengguang
2011-03-29  5:59               ` Dave Chinner
2011-03-29  7:31                 ` Wu Fengguang
2011-03-29  7:52                   ` Wu Fengguang
  -- strict thread matches above, loose matches on Subject: below --
2011-02-04  1:38 [RFC PATCH 0/5] IO-less balance dirty pages Jan Kara
2011-02-04  1:38 ` [PATCH 3/5] mm: Implement IO-less balance_dirty_pages() Jan Kara
2011-02-04 13:09   ` Peter Zijlstra
2011-02-11 14:56     ` Jan Kara
2011-02-04 13:09   ` Peter Zijlstra
2011-02-11 14:56     ` Jan Kara
2011-02-04 13:09   ` Peter Zijlstra
2011-02-04 13:19     ` Peter Zijlstra
2011-02-11 15:46     ` Jan Kara
2011-02-22 15:40       ` Peter Zijlstra

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).