[PATCH 06/13] writeback: bdi write bandwidth estimation

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* [PATCH 06/13] writeback: bdi write bandwidth estimation
@ 2010-11-17  3:58 Wu Fengguang
  0 siblings, 0 replies; 21+ messages in thread
From: Wu Fengguang @ 2010-11-17  3:58 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Theodore Ts'o, Li Shaohua, Wu Fengguang, Dave Chinner,
	Jan Kara, Peter Zijlstra, Mel Gorman, Rik van Riel,
	KOSAKI Motohiro, Chris Mason, Christoph Hellwig, linux-mm,
	linux-fsdevel, LKML

Andrew,
References: <20101117035821.000579293@intel.com>
Content-Disposition: inline; filename=writeback-bandwidth-estimation-in-flusher.patch

The estimation value will start from 100MB/s and adapt to the real
bandwidth in seconds.  It's pretty accurate for common filesystems.

As the first use case, it replaces the fixed 100MB/s value used for
throttle bandwidth calculation in balance_dirty_pages().

The overheads won't be high because the bdi bandwidth udpate only occurs
in >10ms intervals.

Initially it's only estimated in balance_dirty_pages() because this is
the most reliable place to get reasonable large bandwidth -- the bdi is
normally fully utilized when bdi_thresh is reached.

Then Shaohua recommends to also do it in the flusher thread, to keep the
value updated when there are only periodic/background writeback and no
tasks throttled.

The estimation cannot be done purely in the flusher thread because it's
not sufficient for NFS. NFS writeback won't block at get_request_wait(),
so tend to complete quickly. Another problem is, slow devices may take
dozens of seconds to write the initial 64MB chunk (write_bandwidth
starts with 100MB/s, this translates to 64MB nr_to_write). So it may
take more than 1 minute to adapt to the smallish bandwidth if the
bandwidth is only updated in the flusher thread.

CC: Li Shaohua <shaohua.li@intel.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/fs-writeback.c           |    5 ++++
 include/linux/backing-dev.h |    2 +
 include/linux/writeback.h   |    3 ++
 mm/backing-dev.c            |    1 
 mm/page-writeback.c         |   41 +++++++++++++++++++++++++++++++++-
 5 files changed, 51 insertions(+), 1 deletion(-)

--- linux-next.orig/include/linux/backing-dev.h	2010-11-15 21:51:38.000000000 +0800
+++ linux-next/include/linux/backing-dev.h	2010-11-15 21:51:41.000000000 +0800
@@ -75,6 +75,8 @@ struct backing_dev_info {
 	struct percpu_counter bdi_stat[NR_BDI_STAT_ITEMS];
 
 	struct prop_local_percpu completions;
+	unsigned long write_bandwidth_update_time;
+	int write_bandwidth;
 	int dirty_exceeded;
 
 	unsigned int min_ratio;
--- linux-next.orig/mm/backing-dev.c	2010-11-15 21:51:38.000000000 +0800
+++ linux-next/mm/backing-dev.c	2010-11-15 21:51:41.000000000 +0800
@@ -660,6 +660,7 @@ int bdi_init(struct backing_dev_info *bd
 			goto err;
 	}
 
+	bdi->write_bandwidth = 100 << 20;
 	bdi->dirty_exceeded = 0;
 	err = prop_local_init_percpu(&bdi->completions);
 
--- linux-next.orig/fs/fs-writeback.c	2010-11-15 21:43:51.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2010-11-15 21:51:41.000000000 +0800
@@ -635,6 +635,8 @@ static long wb_writeback(struct bdi_writ
 		.range_cyclic		= work->range_cyclic,
 	};
 	unsigned long oldest_jif;
+	unsigned long bw_time;
+	s64 bw_written = 0;
 	long wrote = 0;
 	long write_chunk;
 	struct inode *inode;
@@ -668,6 +670,8 @@ static long wb_writeback(struct bdi_writ
 		write_chunk = LONG_MAX;
 
 	wbc.wb_start = jiffies; /* livelock avoidance */
+	bdi_update_write_bandwidth(wb->bdi, &bw_time, &bw_written);
+
 	for (;;) {
 		/*
 		 * Stop writeback when nr_pages has been consumed
@@ -702,6 +706,7 @@ static long wb_writeback(struct bdi_writ
 		else
 			writeback_inodes_wb(wb, &wbc);
 		trace_wbc_writeback_written(&wbc, wb->bdi);
+		bdi_update_write_bandwidth(wb->bdi, &bw_time, &bw_written);
 
 		work->nr_pages -= write_chunk - wbc.nr_to_write;
 		wrote += write_chunk - wbc.nr_to_write;
--- linux-next.orig/mm/page-writeback.c	2010-11-15 21:51:38.000000000 +0800
+++ linux-next/mm/page-writeback.c	2010-11-15 21:51:41.000000000 +0800
@@ -479,6 +479,41 @@ out:
 	return 1 + int_sqrt(dirty_thresh - dirty_pages);
 }
 
+void bdi_update_write_bandwidth(struct backing_dev_info *bdi,
+				unsigned long *bw_time,
+				s64 *bw_written)
+{
+	unsigned long written;
+	unsigned long elapsed;
+	unsigned long bw;
+	unsigned long w;
+
+	if (*bw_written == 0)
+		goto snapshot;
+
+	elapsed = jiffies - *bw_time;
+	if (elapsed < HZ/100)
+		return;
+
+	/*
+	 * When there lots of tasks throttled in balance_dirty_pages(), they
+	 * will each try to update the bandwidth for the same period, making
+	 * the bandwidth drift much faster than the desired rate (as in the
+	 * single dirtier case). So do some rate limiting.
+	 */
+	if (jiffies - bdi->write_bandwidth_update_time < elapsed)
+		goto snapshot;
+
+	written = percpu_counter_read(&bdi->bdi_stat[BDI_WRITTEN]) - *bw_written;
+	bw = (HZ * PAGE_CACHE_SIZE * written + elapsed/2) / elapsed;
+	w = min(elapsed / (HZ/100), 128UL);
+	bdi->write_bandwidth = (bdi->write_bandwidth * (1024-w) + bw * w) >> 10;
+	bdi->write_bandwidth_update_time = jiffies;
+snapshot:
+	*bw_written = percpu_counter_read(&bdi->bdi_stat[BDI_WRITTEN]);
+	*bw_time = jiffies;
+}
+
 /*
  * balance_dirty_pages() must be called by processes which are generating dirty
  * data.  It looks at the number of dirty pages in the machine and will force
@@ -498,6 +533,8 @@ static void balance_dirty_pages(struct a
 	unsigned long pause = 0;
 	bool dirty_exceeded = false;
 	struct backing_dev_info *bdi = mapping->backing_dev_info;
+	unsigned long bw_time;
+	s64 bw_written = 0;
 
 	for (;;) {
 		/*
@@ -546,7 +583,7 @@ static void balance_dirty_pages(struct a
 			goto pause;
 		}
 
-		bw = 100 << 20; /* use static 100MB/s for the moment */
+		bw = bdi->write_bandwidth;
 
 		bw = bw * (bdi_thresh - bdi_dirty);
 		bw = bw / (bdi_thresh / TASK_SOFT_DIRTY_LIMIT + 1);
@@ -555,8 +592,10 @@ static void balance_dirty_pages(struct a
 		pause = clamp_val(pause, 1, HZ/10);
 
 pause:
+		bdi_update_write_bandwidth(bdi, &bw_time, &bw_written);
 		__set_current_state(TASK_INTERRUPTIBLE);
 		io_schedule_timeout(pause);
+		bdi_update_write_bandwidth(bdi, &bw_time, &bw_written);
 
 		/*
 		 * The bdi thresh is somehow "soft" limit derived from the
--- linux-next.orig/include/linux/writeback.h	2010-11-15 21:43:51.000000000 +0800
+++ linux-next/include/linux/writeback.h	2010-11-15 21:51:41.000000000 +0800
@@ -137,6 +137,9 @@ int dirty_writeback_centisecs_handler(st
 void global_dirty_limits(unsigned long *pbackground, unsigned long *pdirty);
 unsigned long bdi_dirty_limit(struct backing_dev_info *bdi,
 			       unsigned long dirty);
+void bdi_update_write_bandwidth(struct backing_dev_info *bdi,
+				unsigned long *bw_time,
+				s64 *bw_written);
 
 void page_writeback_init(void);
 void balance_dirty_pages_ratelimited_nr(struct address_space *mapping,


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom policy in Canada: sign http://dissolvethecrtc.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH 06/13] writeback: bdi write bandwidth estimation
  2010-11-17  4:27 [PATCH 00/13] IO-less dirty throttling v2 Wu Fengguang
@ 2010-11-17  4:27 ` Wu Fengguang
  2010-11-17 23:08   ` Andrew Morton
                     ` (2 more replies)
  0 siblings, 3 replies; 21+ messages in thread
From: Wu Fengguang @ 2010-11-17  4:27 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jan Kara, Li Shaohua, Wu Fengguang, Christoph Hellwig,
	Dave Chinner, Theodore Ts'o, Chris Mason, Peter Zijlstra,
	Mel Gorman, Rik van Riel, KOSAKI Motohiro, linux-mm,
	linux-fsdevel, LKML

[-- Attachment #1: writeback-bandwidth-estimation-in-flusher.patch --]
[-- Type: text/plain, Size: 6559 bytes --]

The estimation value will start from 100MB/s and adapt to the real
bandwidth in seconds.  It's pretty accurate for common filesystems.

As the first use case, it replaces the fixed 100MB/s value used for
throttle bandwidth calculation in balance_dirty_pages().

The overheads won't be high because the bdi bandwidth udpate only occurs
in >10ms intervals.

Initially it's only estimated in balance_dirty_pages() because this is
the most reliable place to get reasonable large bandwidth -- the bdi is
normally fully utilized when bdi_thresh is reached.

Then Shaohua recommends to also do it in the flusher thread, to keep the
value updated when there are only periodic/background writeback and no
tasks throttled.

The estimation cannot be done purely in the flusher thread because it's
not sufficient for NFS. NFS writeback won't block at get_request_wait(),
so tend to complete quickly. Another problem is, slow devices may take
dozens of seconds to write the initial 64MB chunk (write_bandwidth
starts with 100MB/s, this translates to 64MB nr_to_write). So it may
take more than 1 minute to adapt to the smallish bandwidth if the
bandwidth is only updated in the flusher thread.

CC: Li Shaohua <shaohua.li@intel.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/fs-writeback.c           |    5 ++++
 include/linux/backing-dev.h |    2 +
 include/linux/writeback.h   |    3 ++
 mm/backing-dev.c            |    1 
 mm/page-writeback.c         |   41 +++++++++++++++++++++++++++++++++-
 5 files changed, 51 insertions(+), 1 deletion(-)

--- linux-next.orig/include/linux/backing-dev.h	2010-11-15 21:51:38.000000000 +0800
+++ linux-next/include/linux/backing-dev.h	2010-11-15 21:51:41.000000000 +0800
@@ -75,6 +75,8 @@ struct backing_dev_info {
 	struct percpu_counter bdi_stat[NR_BDI_STAT_ITEMS];
 
 	struct prop_local_percpu completions;
+	unsigned long write_bandwidth_update_time;
+	int write_bandwidth;
 	int dirty_exceeded;
 
 	unsigned int min_ratio;
--- linux-next.orig/mm/backing-dev.c	2010-11-15 21:51:38.000000000 +0800
+++ linux-next/mm/backing-dev.c	2010-11-15 21:51:41.000000000 +0800
@@ -660,6 +660,7 @@ int bdi_init(struct backing_dev_info *bd
 			goto err;
 	}
 
+	bdi->write_bandwidth = 100 << 20;
 	bdi->dirty_exceeded = 0;
 	err = prop_local_init_percpu(&bdi->completions);
 
--- linux-next.orig/fs/fs-writeback.c	2010-11-15 21:43:51.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2010-11-15 21:51:41.000000000 +0800
@@ -635,6 +635,8 @@ static long wb_writeback(struct bdi_writ
 		.range_cyclic		= work->range_cyclic,
 	};
 	unsigned long oldest_jif;
+	unsigned long bw_time;
+	s64 bw_written = 0;
 	long wrote = 0;
 	long write_chunk;
 	struct inode *inode;
@@ -668,6 +670,8 @@ static long wb_writeback(struct bdi_writ
 		write_chunk = LONG_MAX;
 
 	wbc.wb_start = jiffies; /* livelock avoidance */
+	bdi_update_write_bandwidth(wb->bdi, &bw_time, &bw_written);
+
 	for (;;) {
 		/*
 		 * Stop writeback when nr_pages has been consumed
@@ -702,6 +706,7 @@ static long wb_writeback(struct bdi_writ
 		else
 			writeback_inodes_wb(wb, &wbc);
 		trace_wbc_writeback_written(&wbc, wb->bdi);
+		bdi_update_write_bandwidth(wb->bdi, &bw_time, &bw_written);
 
 		work->nr_pages -= write_chunk - wbc.nr_to_write;
 		wrote += write_chunk - wbc.nr_to_write;
--- linux-next.orig/mm/page-writeback.c	2010-11-15 21:51:38.000000000 +0800
+++ linux-next/mm/page-writeback.c	2010-11-15 21:51:41.000000000 +0800
@@ -479,6 +479,41 @@ out:
 	return 1 + int_sqrt(dirty_thresh - dirty_pages);
 }
 
+void bdi_update_write_bandwidth(struct backing_dev_info *bdi,
+				unsigned long *bw_time,
+				s64 *bw_written)
+{
+	unsigned long written;
+	unsigned long elapsed;
+	unsigned long bw;
+	unsigned long w;
+
+	if (*bw_written == 0)
+		goto snapshot;
+
+	elapsed = jiffies - *bw_time;
+	if (elapsed < HZ/100)
+		return;
+
+	/*
+	 * When there lots of tasks throttled in balance_dirty_pages(), they
+	 * will each try to update the bandwidth for the same period, making
+	 * the bandwidth drift much faster than the desired rate (as in the
+	 * single dirtier case). So do some rate limiting.
+	 */
+	if (jiffies - bdi->write_bandwidth_update_time < elapsed)
+		goto snapshot;
+
+	written = percpu_counter_read(&bdi->bdi_stat[BDI_WRITTEN]) - *bw_written;
+	bw = (HZ * PAGE_CACHE_SIZE * written + elapsed/2) / elapsed;
+	w = min(elapsed / (HZ/100), 128UL);
+	bdi->write_bandwidth = (bdi->write_bandwidth * (1024-w) + bw * w) >> 10;
+	bdi->write_bandwidth_update_time = jiffies;
+snapshot:
+	*bw_written = percpu_counter_read(&bdi->bdi_stat[BDI_WRITTEN]);
+	*bw_time = jiffies;
+}
+
 /*
  * balance_dirty_pages() must be called by processes which are generating dirty
  * data.  It looks at the number of dirty pages in the machine and will force
@@ -498,6 +533,8 @@ static void balance_dirty_pages(struct a
 	unsigned long pause = 0;
 	bool dirty_exceeded = false;
 	struct backing_dev_info *bdi = mapping->backing_dev_info;
+	unsigned long bw_time;
+	s64 bw_written = 0;
 
 	for (;;) {
 		/*
@@ -546,7 +583,7 @@ static void balance_dirty_pages(struct a
 			goto pause;
 		}
 
-		bw = 100 << 20; /* use static 100MB/s for the moment */
+		bw = bdi->write_bandwidth;
 
 		bw = bw * (bdi_thresh - bdi_dirty);
 		bw = bw / (bdi_thresh / TASK_SOFT_DIRTY_LIMIT + 1);
@@ -555,8 +592,10 @@ static void balance_dirty_pages(struct a
 		pause = clamp_val(pause, 1, HZ/10);
 
 pause:
+		bdi_update_write_bandwidth(bdi, &bw_time, &bw_written);
 		__set_current_state(TASK_INTERRUPTIBLE);
 		io_schedule_timeout(pause);
+		bdi_update_write_bandwidth(bdi, &bw_time, &bw_written);
 
 		/*
 		 * The bdi thresh is somehow "soft" limit derived from the
--- linux-next.orig/include/linux/writeback.h	2010-11-15 21:43:51.000000000 +0800
+++ linux-next/include/linux/writeback.h	2010-11-15 21:51:41.000000000 +0800
@@ -137,6 +137,9 @@ int dirty_writeback_centisecs_handler(st
 void global_dirty_limits(unsigned long *pbackground, unsigned long *pdirty);
 unsigned long bdi_dirty_limit(struct backing_dev_info *bdi,
 			       unsigned long dirty);
+void bdi_update_write_bandwidth(struct backing_dev_info *bdi,
+				unsigned long *bw_time,
+				s64 *bw_written);
 
 void page_writeback_init(void);
 void balance_dirty_pages_ratelimited_nr(struct address_space *mapping,


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom policy in Canada: sign http://dissolvethecrtc.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 06/13] writeback: bdi write bandwidth estimation
  2010-11-17  4:27 ` [PATCH 06/13] writeback: bdi write bandwidth estimation Wu Fengguang
@ 2010-11-17 23:08   ` Andrew Morton
  2010-11-17 23:24     ` Peter Zijlstra
  2010-11-18  6:51     ` Wu Fengguang
  2010-11-24 10:58   ` Peter Zijlstra
  2010-11-24 11:05   ` Peter Zijlstra
  2 siblings, 2 replies; 21+ messages in thread
From: Andrew Morton @ 2010-11-17 23:08 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Jan Kara, Li Shaohua, Christoph Hellwig, Dave Chinner,
	Theodore Ts'o, Chris Mason, Peter Zijlstra, Mel Gorman,
	Rik van Riel, KOSAKI Motohiro, linux-mm, linux-fsdevel, LKML

On Wed, 17 Nov 2010 12:27:26 +0800
Wu Fengguang <fengguang.wu@intel.com> wrote:

> +	w = min(elapsed / (HZ/100), 128UL);

I did try setting HZ=10 many years ago, and the kernel blew up.

I do recall hearing of people who set HZ very low, perhaps because
their huge machines were seeing performance prolems when the timer tick
went off.  Probably there's no need to do that any more.

But still, we shouldn't hard-wire the (HZ >= 100) assumption if we
don't absolutely need to, and I don't think it is absolutely needed
here.  

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom policy in Canada: sign http://dissolvethecrtc.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 06/13] writeback: bdi write bandwidth estimation
  2010-11-17 23:08   ` Andrew Morton
@ 2010-11-17 23:24     ` Peter Zijlstra
  2010-11-17 23:38       ` Andrew Morton
  2010-11-18  6:51     ` Wu Fengguang
  1 sibling, 1 reply; 21+ messages in thread
From: Peter Zijlstra @ 2010-11-17 23:24 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Wu Fengguang, Jan Kara, Li Shaohua, Christoph Hellwig,
	Dave Chinner, Theodore Ts'o, Chris Mason, Mel Gorman,
	Rik van Riel, KOSAKI Motohiro, linux-mm, linux-fsdevel, LKML

On Wed, 2010-11-17 at 15:08 -0800, Andrew Morton wrote:
> On Wed, 17 Nov 2010 12:27:26 +0800
> Wu Fengguang <fengguang.wu@intel.com> wrote:
> 
> > +     w = min(elapsed / (HZ/100), 128UL);
> 
> I did try setting HZ=10 many years ago, and the kernel blew up.
> 
> I do recall hearing of people who set HZ very low, perhaps because
> their huge machines were seeing performance prolems when the timer tick
> went off.  Probably there's no need to do that any more.
> 
> But still, we shouldn't hard-wire the (HZ >= 100) assumption if we
> don't absolutely need to, and I don't think it is absolutely needed
> here.  

People who do cpu bring-up on very slow FPGAs also lower HZ as far as
possible.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom policy in Canada: sign http://dissolvethecrtc.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 06/13] writeback: bdi write bandwidth estimation
  2010-11-17 23:24     ` Peter Zijlstra
@ 2010-11-17 23:38       ` Andrew Morton
  2010-11-17 23:43         ` Peter Zijlstra
  0 siblings, 1 reply; 21+ messages in thread
From: Andrew Morton @ 2010-11-17 23:38 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Wu Fengguang, Jan Kara, Li Shaohua, Christoph Hellwig,
	Dave Chinner, Theodore Ts'o, Chris Mason, Mel Gorman,
	Rik van Riel, KOSAKI Motohiro, linux-mm, linux-fsdevel, LKML

On Thu, 18 Nov 2010 00:24:59 +0100
Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:

> On Wed, 2010-11-17 at 15:08 -0800, Andrew Morton wrote:
> > On Wed, 17 Nov 2010 12:27:26 +0800
> > Wu Fengguang <fengguang.wu@intel.com> wrote:
> > 
> > > +     w = min(elapsed / (HZ/100), 128UL);
> > 
> > I did try setting HZ=10 many years ago, and the kernel blew up.
> > 
> > I do recall hearing of people who set HZ very low, perhaps because
> > their huge machines were seeing performance prolems when the timer tick
> > went off.  Probably there's no need to do that any more.
> > 
> > But still, we shouldn't hard-wire the (HZ >= 100) assumption if we
> > don't absolutely need to, and I don't think it is absolutely needed
> > here.  
> 
> People who do cpu bring-up on very slow FPGAs also lower HZ as far as
> possible.

	grep -r "[^a-zA-Z0-9_]HZ[ ]*/[ ]*100[^0-9]" .

:-(

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom policy in Canada: sign http://dissolvethecrtc.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 06/13] writeback: bdi write bandwidth estimation
  2010-11-17 23:38       ` Andrew Morton
@ 2010-11-17 23:43         ` Peter Zijlstra
  0 siblings, 0 replies; 21+ messages in thread
From: Peter Zijlstra @ 2010-11-17 23:43 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Wu Fengguang, Jan Kara, Li Shaohua, Christoph Hellwig,
	Dave Chinner, Theodore Ts'o, Chris Mason, Mel Gorman,
	Rik van Riel, KOSAKI Motohiro, linux-mm, linux-fsdevel, LKML

On Wed, 2010-11-17 at 15:38 -0800, Andrew Morton wrote:
> On Thu, 18 Nov 2010 00:24:59 +0100
> Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> 
> > On Wed, 2010-11-17 at 15:08 -0800, Andrew Morton wrote:
> > > On Wed, 17 Nov 2010 12:27:26 +0800
> > > Wu Fengguang <fengguang.wu@intel.com> wrote:
> > > 
> > > > +     w = min(elapsed / (HZ/100), 128UL);
> > > 
> > > I did try setting HZ=10 many years ago, and the kernel blew up.
> > > 
> > > I do recall hearing of people who set HZ very low, perhaps because
> > > their huge machines were seeing performance prolems when the timer tick
> > > went off.  Probably there's no need to do that any more.
> > > 
> > > But still, we shouldn't hard-wire the (HZ >= 100) assumption if we
> > > don't absolutely need to, and I don't think it is absolutely needed
> > > here.  
> > 
> > People who do cpu bring-up on very slow FPGAs also lower HZ as far as
> > possible.
> 
> 	grep -r "[^a-zA-Z0-9_]HZ[ ]*/[ ]*100[^0-9]" .

Maybe they've got a patch-kit somewhere,. I'll ask around. It would be
nice to have that sorted.


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom policy in Canada: sign http://dissolvethecrtc.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 06/13] writeback: bdi write bandwidth estimation
  2010-11-17 23:08   ` Andrew Morton
  2010-11-17 23:24     ` Peter Zijlstra
@ 2010-11-18  6:51     ` Wu Fengguang
  1 sibling, 0 replies; 21+ messages in thread
From: Wu Fengguang @ 2010-11-18  6:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jan Kara, Li, Shaohua, Christoph Hellwig, Dave Chinner,
	Theodore Ts'o, Chris Mason, Peter Zijlstra, Mel Gorman,
	Rik van Riel, KOSAKI Motohiro, linux-mm,
	linux-fsdevel@vger.kernel.org, LKML

On Thu, Nov 18, 2010 at 07:08:37AM +0800, Andrew Morton wrote:
> On Wed, 17 Nov 2010 12:27:26 +0800
> Wu Fengguang <fengguang.wu@intel.com> wrote:
> 
> > +	w = min(elapsed / (HZ/100), 128UL);
> 
> I did try setting HZ=10 many years ago, and the kernel blew up.
> 
> I do recall hearing of people who set HZ very low, perhaps because
> their huge machines were seeing performance prolems when the timer tick
> went off.  Probably there's no need to do that any more.
> 
> But still, we shouldn't hard-wire the (HZ >= 100) assumption if we
> don't absolutely need to, and I don't think it is absolutely needed
> here.  

Fair enough.  Here is the fix. The other (HZ/10) will be addressed by
another patch that increase it to MAX_PAUSE=max(HZ/5, 1).

Thanks,
Fengguang
---

Subject: writeback: prevent divide error on tiny HZ
Date: Thu Nov 18 12:19:56 CST 2010

As suggested by Andrew and Peter:

I do recall hearing of people who set HZ very low, perhaps because their
huge machines were seeing performance prolems when the timer tick went
off.  Probably there's no need to do that any more.

But still, we shouldn't hard-wire the (HZ >= 100) assumption if we don't
absolutely need to, and I don't think it is absolutely needed here.

People who do cpu bring-up on very slow FPGAs also lower HZ as far as
possible.

CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/page-writeback.c |    5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

--- linux-next.orig/mm/page-writeback.c	2010-11-18 12:35:18.000000000 +0800
+++ linux-next/mm/page-writeback.c	2010-11-18 12:35:38.000000000 +0800
@@ -490,6 +490,7 @@ void bdi_update_write_bandwidth(struct b
 				unsigned long *bw_time,
 				s64 *bw_written)
 {
+	const unsigned long unit_time = max(HZ/100, 1);
 	unsigned long written;
 	unsigned long elapsed;
 	unsigned long bw;
@@ -499,7 +500,7 @@ void bdi_update_write_bandwidth(struct b
 		goto snapshot;
 
 	elapsed = jiffies - *bw_time;
-	if (elapsed < HZ/100)
+	if (elapsed < unit_time)
 		return;
 
 	/*
@@ -513,7 +514,7 @@ void bdi_update_write_bandwidth(struct b
 
 	written = percpu_counter_read(&bdi->bdi_stat[BDI_WRITTEN]) - *bw_written;
 	bw = (HZ * PAGE_CACHE_SIZE * written + elapsed/2) / elapsed;
-	w = min(elapsed / (HZ/100), 128UL);
+	w = min(elapsed / unit_time, 128UL);
 	bdi->write_bandwidth = (bdi->write_bandwidth * (1024-w) + bw * w) >> 10;
 	bdi->write_bandwidth_update_time = jiffies;
 snapshot:

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom policy in Canada: sign http://dissolvethecrtc.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 06/13] writeback: bdi write bandwidth estimation
  2010-11-17  4:27 ` [PATCH 06/13] writeback: bdi write bandwidth estimation Wu Fengguang
  2010-11-17 23:08   ` Andrew Morton
@ 2010-11-24 10:58   ` Peter Zijlstra
  2010-11-24 14:06     ` Wu Fengguang
  2010-11-24 11:05   ` Peter Zijlstra
  2 siblings, 1 reply; 21+ messages in thread
From: Peter Zijlstra @ 2010-11-24 10:58 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, Jan Kara, Li Shaohua, Christoph Hellwig,
	Dave Chinner, Theodore Ts'o, Chris Mason, Mel Gorman,
	Rik van Riel, KOSAKI Motohiro, linux-mm, linux-fsdevel, LKML

On Wed, 2010-11-17 at 12:27 +0800, Wu Fengguang wrote:

> @@ -555,8 +592,10 @@ static void balance_dirty_pages(struct a
>  		pause = clamp_val(pause, 1, HZ/10);
>  
>  pause:
> +		bdi_update_write_bandwidth(bdi, &bw_time, &bw_written);
>  		__set_current_state(TASK_INTERRUPTIBLE);
>  		io_schedule_timeout(pause);
> +		bdi_update_write_bandwidth(bdi, &bw_time, &bw_written);
>  
>  		/*
>  		 * The bdi thresh is somehow "soft" limit derived from the

So its really a two part bandwidth calculation, the first call is:

  bdi_get_bandwidth()

and the second call is:

  bdi_update_bandwidth()

Would it make sense to actually implement it with two functions instead
of overloading the functionality of the one function?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom policy in Canada: sign http://dissolvethecrtc.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 06/13] writeback: bdi write bandwidth estimation
  2010-11-17  4:27 ` [PATCH 06/13] writeback: bdi write bandwidth estimation Wu Fengguang
  2010-11-17 23:08   ` Andrew Morton
  2010-11-24 10:58   ` Peter Zijlstra
@ 2010-11-24 11:05   ` Peter Zijlstra
  2010-11-24 12:10     ` Wu Fengguang
  2 siblings, 1 reply; 21+ messages in thread
From: Peter Zijlstra @ 2010-11-24 11:05 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, Jan Kara, Li Shaohua, Christoph Hellwig,
	Dave Chinner, Theodore Ts'o, Chris Mason, Mel Gorman,
	Rik van Riel, KOSAKI Motohiro, linux-mm, linux-fsdevel, LKML

On Wed, 2010-11-17 at 12:27 +0800, Wu Fengguang wrote:
> +void bdi_update_write_bandwidth(struct backing_dev_info *bdi,
> +                               unsigned long *bw_time,
> +                               s64 *bw_written)
> +{
> +       unsigned long written;
> +       unsigned long elapsed;
> +       unsigned long bw;
> +       unsigned long w;
> +
> +       if (*bw_written == 0)
> +               goto snapshot;
> +
> +       elapsed = jiffies - *bw_time;
> +       if (elapsed < HZ/100)
> +               return;
> +
> +       /*
> +        * When there lots of tasks throttled in balance_dirty_pages(), they
> +        * will each try to update the bandwidth for the same period, making
> +        * the bandwidth drift much faster than the desired rate (as in the
> +        * single dirtier case). So do some rate limiting.
> +        */
> +       if (jiffies - bdi->write_bandwidth_update_time < elapsed)
> +               goto snapshot;

Why this goto snapshot and not simply return? This is the second call
(bdi_update_bandwidth equivalent).

If you were to leave the old bw_written/bw_time in place the next loop
around in wb_writeback() would see a larger delta..

I guess this funny loop in wb_writeback() is also the reason you've got
a single function and not the get/update like separation

> +       written = percpu_counter_read(&bdi->bdi_stat[BDI_WRITTEN]) - *bw_written;
> +       bw = (HZ * PAGE_CACHE_SIZE * written + elapsed/2) / elapsed;
> +       w = min(elapsed / (HZ/100), 128UL);
> +       bdi->write_bandwidth = (bdi->write_bandwidth * (1024-w) + bw * w) >> 10;
> +       bdi->write_bandwidth_update_time = jiffies;
> +snapshot:
> +       *bw_written = percpu_counter_read(&bdi->bdi_stat[BDI_WRITTEN]);
> +       *bw_time = jiffies;
> +} 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom policy in Canada: sign http://dissolvethecrtc.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 06/13] writeback: bdi write bandwidth estimation
  2010-11-24 11:05   ` Peter Zijlstra
@ 2010-11-24 12:10     ` Wu Fengguang
  2010-11-24 12:50       ` Peter Zijlstra
  0 siblings, 1 reply; 21+ messages in thread
From: Wu Fengguang @ 2010-11-24 12:10 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrew Morton, Jan Kara, Li, Shaohua, Christoph Hellwig,
	Dave Chinner, Theodore Ts'o, Chris Mason, Mel Gorman,
	Rik van Riel, KOSAKI Motohiro, linux-mm,
	linux-fsdevel@vger.kernel.org, LKML

On Wed, Nov 24, 2010 at 07:05:32PM +0800, Peter Zijlstra wrote:
> On Wed, 2010-11-17 at 12:27 +0800, Wu Fengguang wrote:
> > +void bdi_update_write_bandwidth(struct backing_dev_info *bdi,
> > +                               unsigned long *bw_time,
> > +                               s64 *bw_written)
> > +{
> > +       unsigned long written;
> > +       unsigned long elapsed;
> > +       unsigned long bw;
> > +       unsigned long w;
> > +
> > +       if (*bw_written == 0)
> > +               goto snapshot;
> > +
> > +       elapsed = jiffies - *bw_time;
> > +       if (elapsed < HZ/100)
> > +               return;
> > +
> > +       /*
> > +        * When there lots of tasks throttled in balance_dirty_pages(), they
> > +        * will each try to update the bandwidth for the same period, making
> > +        * the bandwidth drift much faster than the desired rate (as in the
> > +        * single dirtier case). So do some rate limiting.
> > +        */
> > +       if (jiffies - bdi->write_bandwidth_update_time < elapsed)
> > +               goto snapshot;
>
> Why this goto snapshot and not simply return? This is the second call
> (bdi_update_bandwidth equivalent).

Good question. The loop inside balance_dirty_pages() normally run only
once, however wb_writeback() may loop over and over again. If we just
return here, the condition

        (jiffies - bdi->write_bandwidth_update_time < elapsed)

cannot be reset, then future bdi_update_bandwidth() calls in the same
wb_writeback() loop will never find it OK to update the bandwidth.

It does assume no races between CPUs.. We may need some per-cpu based
estimation.

> If you were to leave the old bw_written/bw_time in place the next loop
> around in wb_writeback() would see a larger delta..
>
> I guess this funny loop in wb_writeback() is also the reason you've got
> a single function and not the get/update like separation

I do the single function mainly for wb_writeback(), where it
continuously updates bandwidth inside the loop. The function can be
called in such way:

loop {
        take snapshot on first call

        no action if recalled within 10ms
        ...
        no action if recalled within 10ms
        ...

        update bandwidth and prepare for next update by taking the snapshot

        no action if recalled within 10ms
        ...
}

Thanks,
Fengguang

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom policy in Canada: sign http://dissolvethecrtc.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 06/13] writeback: bdi write bandwidth estimation
  2010-11-24 12:10     ` Wu Fengguang
@ 2010-11-24 12:50       ` Peter Zijlstra
  2010-11-24 13:14         ` Wu Fengguang
  0 siblings, 1 reply; 21+ messages in thread
From: Peter Zijlstra @ 2010-11-24 12:50 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, Jan Kara, Li, Shaohua, Christoph Hellwig,
	Dave Chinner, Theodore Ts'o, Chris Mason, Mel Gorman,
	Rik van Riel, KOSAKI Motohiro, linux-mm,
	linux-fsdevel@vger.kernel.org, LKML

On Wed, 2010-11-24 at 20:10 +0800, Wu Fengguang wrote:
> > > +       /*
> > > +        * When there lots of tasks throttled in balance_dirty_pages(), they
> > > +        * will each try to update the bandwidth for the same period, making
> > > +        * the bandwidth drift much faster than the desired rate (as in the
> > > +        * single dirtier case). So do some rate limiting.
> > > +        */
> > > +       if (jiffies - bdi->write_bandwidth_update_time < elapsed)
> > > +               goto snapshot;
> >
> > Why this goto snapshot and not simply return? This is the second call
> > (bdi_update_bandwidth equivalent).
> 
> Good question. The loop inside balance_dirty_pages() normally run only
> once, however wb_writeback() may loop over and over again. If we just
> return here, the condition
> 
>         (jiffies - bdi->write_bandwidth_update_time < elapsed)
> 
> cannot be reset, then future bdi_update_bandwidth() calls in the same
> wb_writeback() loop will never find it OK to update the bandwidth.

But the thing is, you don't want to reset that, it might loop so fast
you'll throttle all of them, if you keep the pre-throttle value you'll
eventually pass, no?

> It does assume no races between CPUs.. We may need some per-cpu based
> estimation. 

But that multi-writer race is valid even for the balance_dirty_pages()
call, two or more could interleave on the bw_time and bw_written
variables.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom policy in Canada: sign http://dissolvethecrtc.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 06/13] writeback: bdi write bandwidth estimation
  2010-11-24 12:50       ` Peter Zijlstra
@ 2010-11-24 13:14         ` Wu Fengguang
  2010-11-24 13:20           ` Wu Fengguang
  0 siblings, 1 reply; 21+ messages in thread
From: Wu Fengguang @ 2010-11-24 13:14 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrew Morton, Jan Kara, Li, Shaohua, Christoph Hellwig,
	Dave Chinner, Theodore Ts'o, Chris Mason, Mel Gorman,
	Rik van Riel, KOSAKI Motohiro, linux-mm,
	linux-fsdevel@vger.kernel.org, LKML

On Wed, Nov 24, 2010 at 08:50:47PM +0800, Peter Zijlstra wrote:
> On Wed, 2010-11-24 at 20:10 +0800, Wu Fengguang wrote:
> > > > +       /*
> > > > +        * When there lots of tasks throttled in balance_dirty_pages(), they
> > > > +        * will each try to update the bandwidth for the same period, making
> > > > +        * the bandwidth drift much faster than the desired rate (as in the
> > > > +        * single dirtier case). So do some rate limiting.
> > > > +        */
> > > > +       if (jiffies - bdi->write_bandwidth_update_time < elapsed)
> > > > +               goto snapshot;
> > >
> > > Why this goto snapshot and not simply return? This is the second call
> > > (bdi_update_bandwidth equivalent).
> > 
> > Good question. The loop inside balance_dirty_pages() normally run only
> > once, however wb_writeback() may loop over and over again. If we just
> > return here, the condition
> > 
> >         (jiffies - bdi->write_bandwidth_update_time < elapsed)
> > 
> > cannot be reset, then future bdi_update_bandwidth() calls in the same
> > wb_writeback() loop will never find it OK to update the bandwidth.
> 
> But the thing is, you don't want to reset that, it might loop so fast
> you'll throttle all of them, if you keep the pre-throttle value you'll
> eventually pass, no?

It (let's name it A) only resets the _local_ vars bw_* when it's sure
by the condition

        (jiffies - bdi->write_bandwidth_update_time < elapsed)

that someone else (name B) has updated the _global_ bandwidth in the
time range we planned. So there may be some time in A's range that is
not covered by B, but sure the range is not totally bypassed without
updating the bandwidth.

> > It does assume no races between CPUs.. We may need some per-cpu based
> > estimation. 
> 
> But that multi-writer race is valid even for the balance_dirty_pages()
> call, two or more could interleave on the bw_time and bw_written
> variables.

The race will only exist in each task's local vars (their bw_* will
overlap). But the update bdi->write_bandwidth* will be safeguarded
by the above check. When the task is scheduled back, it may find
updated write_bandwidth_update_time and hence give up his estimation.
This is rather tricky..

Thanks,
Fengguang

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom policy in Canada: sign http://dissolvethecrtc.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 06/13] writeback: bdi write bandwidth estimation
  2010-11-24 13:14         ` Wu Fengguang
@ 2010-11-24 13:20           ` Wu Fengguang
  2010-11-24 13:42             ` Peter Zijlstra
  0 siblings, 1 reply; 21+ messages in thread
From: Wu Fengguang @ 2010-11-24 13:20 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrew Morton, Jan Kara, Li, Shaohua, Christoph Hellwig,
	Dave Chinner, Theodore Ts'o, Chris Mason, Mel Gorman,
	Rik van Riel, KOSAKI Motohiro, linux-mm,
	linux-fsdevel@vger.kernel.org, LKML

On Wed, Nov 24, 2010 at 09:14:37PM +0800, Wu Fengguang wrote:
> On Wed, Nov 24, 2010 at 08:50:47PM +0800, Peter Zijlstra wrote:
> > On Wed, 2010-11-24 at 20:10 +0800, Wu Fengguang wrote:
> > > > > +       /*
> > > > > +        * When there lots of tasks throttled in balance_dirty_pages(), they
> > > > > +        * will each try to update the bandwidth for the same period, making
> > > > > +        * the bandwidth drift much faster than the desired rate (as in the
> > > > > +        * single dirtier case). So do some rate limiting.
> > > > > +        */
> > > > > +       if (jiffies - bdi->write_bandwidth_update_time < elapsed)
> > > > > +               goto snapshot;
> > > >
> > > > Why this goto snapshot and not simply return? This is the second call
> > > > (bdi_update_bandwidth equivalent).
> > > 
> > > Good question. The loop inside balance_dirty_pages() normally run only
> > > once, however wb_writeback() may loop over and over again. If we just
> > > return here, the condition
> > > 
> > >         (jiffies - bdi->write_bandwidth_update_time < elapsed)
> > > 
> > > cannot be reset, then future bdi_update_bandwidth() calls in the same
> > > wb_writeback() loop will never find it OK to update the bandwidth.
> > 
> > But the thing is, you don't want to reset that, it might loop so fast
> > you'll throttle all of them, if you keep the pre-throttle value you'll
> > eventually pass, no?
> 
> It (let's name it A) only resets the _local_ vars bw_* when it's sure
> by the condition
> 
>         (jiffies - bdi->write_bandwidth_update_time < elapsed)

this will be true if someone else has _done_ overlapped estimation,
otherwise it will equal:

        jiffies - bdi->write_bandwidth_update_time == elapsed

Sorry the comment needs updating.

Thanks,
Fengguang

> that someone else (name B) has updated the _global_ bandwidth in the
> time range we planned. So there may be some time in A's range that is
> not covered by B, but sure the range is not totally bypassed without
> updating the bandwidth.
> 
> > > It does assume no races between CPUs.. We may need some per-cpu based
> > > estimation. 
> > 
> > But that multi-writer race is valid even for the balance_dirty_pages()
> > call, two or more could interleave on the bw_time and bw_written
> > variables.
> 
> The race will only exist in each task's local vars (their bw_* will
> overlap). But the update bdi->write_bandwidth* will be safeguarded
> by the above check. When the task is scheduled back, it may find
> updated write_bandwidth_update_time and hence give up his estimation.
> This is rather tricky..
> 
> Thanks,
> Fengguang

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom policy in Canada: sign http://dissolvethecrtc.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 06/13] writeback: bdi write bandwidth estimation
  2010-11-24 13:20           ` Wu Fengguang
@ 2010-11-24 13:42             ` Peter Zijlstra
  2010-11-24 13:46               ` Wu Fengguang
  0 siblings, 1 reply; 21+ messages in thread
From: Peter Zijlstra @ 2010-11-24 13:42 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, Jan Kara, Li, Shaohua, Christoph Hellwig,
	Dave Chinner, Theodore Ts'o, Chris Mason, Mel Gorman,
	Rik van Riel, KOSAKI Motohiro, linux-mm,
	linux-fsdevel@vger.kernel.org, LKML

On Wed, 2010-11-24 at 21:20 +0800, Wu Fengguang wrote:
> >         (jiffies - bdi->write_bandwidth_update_time < elapsed)
> 
> this will be true if someone else has _done_ overlapped estimation,
> otherwise it will equal:
> 
>         jiffies - bdi->write_bandwidth_update_time == elapsed
> 
> Sorry the comment needs updating. 

Right, but its racy as hell..

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom policy in Canada: sign http://dissolvethecrtc.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 06/13] writeback: bdi write bandwidth estimation
  2010-11-24 13:42             ` Peter Zijlstra
@ 2010-11-24 13:46               ` Wu Fengguang
  2010-11-24 14:12                 ` Peter Zijlstra
  0 siblings, 1 reply; 21+ messages in thread
From: Wu Fengguang @ 2010-11-24 13:46 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrew Morton, Jan Kara, Li, Shaohua, Christoph Hellwig,
	Dave Chinner, Theodore Ts'o, Chris Mason, Mel Gorman,
	Rik van Riel, KOSAKI Motohiro, linux-mm,
	linux-fsdevel@vger.kernel.org, LKML

On Wed, Nov 24, 2010 at 09:42:09PM +0800, Peter Zijlstra wrote:
> On Wed, 2010-11-24 at 21:20 +0800, Wu Fengguang wrote:
> > >         (jiffies - bdi->write_bandwidth_update_time < elapsed)
> > 
> > this will be true if someone else has _done_ overlapped estimation,
> > otherwise it will equal:
> > 
> >         jiffies - bdi->write_bandwidth_update_time == elapsed
> > 
> > Sorry the comment needs updating. 
> 
> Right, but its racy as hell..

Yeah, for N concurrent dirtiers, plus the background flusher, only
one is able to update write_bandwidth[_update_time]..

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom policy in Canada: sign http://dissolvethecrtc.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 06/13] writeback: bdi write bandwidth estimation
  2010-11-24 10:58   ` Peter Zijlstra
@ 2010-11-24 14:06     ` Wu Fengguang
  0 siblings, 0 replies; 21+ messages in thread
From: Wu Fengguang @ 2010-11-24 14:06 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrew Morton, Jan Kara, Li, Shaohua, Christoph Hellwig,
	Dave Chinner, Theodore Ts'o, Chris Mason, Mel Gorman,
	Rik van Riel, KOSAKI Motohiro, linux-mm,
	linux-fsdevel@vger.kernel.org, LKML

On Wed, Nov 24, 2010 at 06:58:22PM +0800, Peter Zijlstra wrote:
> On Wed, 2010-11-17 at 12:27 +0800, Wu Fengguang wrote:
> 
> > @@ -555,8 +592,10 @@ static void balance_dirty_pages(struct a
> >  		pause = clamp_val(pause, 1, HZ/10);
> >  
> >  pause:
> > +		bdi_update_write_bandwidth(bdi, &bw_time, &bw_written);
> >  		__set_current_state(TASK_INTERRUPTIBLE);
> >  		io_schedule_timeout(pause);
> > +		bdi_update_write_bandwidth(bdi, &bw_time, &bw_written);
> >  
> >  		/*
> >  		 * The bdi thresh is somehow "soft" limit derived from the
> 
> So its really a two part bandwidth calculation, the first call is:
> 
>   bdi_get_bandwidth()
> 
> and the second call is:
> 
>   bdi_update_bandwidth()
> 
> Would it make sense to actually implement it with two functions instead
> of overloading the functionality of the one function?

Thanks, it's good suggestion indeed. However after looking around, I
find it hard to split it up cleanly.. To make it clear, how about this
comment update?

Thanks,
Fengguang
---

--- linux-next.orig/mm/page-writeback.c	2010-11-24 19:05:01.000000000 +0800
+++ linux-next/mm/page-writeback.c	2010-11-24 22:01:43.000000000 +0800
@@ -554,6 +554,14 @@ out:
 	return a;
 }
 
+/*
+ * This can be repeatedly called inside a long run loop, eg. by wb_writeback().
+ *
+ * On first invocation it will find *bw_written=0 and take the initial snapshot.
+ * On follow up calls it will update the bandwidth if
+ * - at least 10ms data have been collected
+ * - the bandwidth for the time range has not been updated in parallel by others
+ */
 void bdi_update_write_bandwidth(struct backing_dev_info *bdi,
 				unsigned long *bw_time,
 				s64 *bw_written)
@@ -575,9 +583,12 @@ void bdi_update_write_bandwidth(struct b
 	 * When there lots of tasks throttled in balance_dirty_pages(), they
 	 * will each try to update the bandwidth for the same period, making
 	 * the bandwidth drift much faster than the desired rate (as in the
-	 * single dirtier case). So do some rate limiting.
+	 * single dirtier case).
+	 *
+	 * If someone changed bdi->write_bandwidth_update_time, he has done
+	 * overlapped estimation with us. So start the next round of estimation.
 	 */
-	if (jiffies - bdi->write_bandwidth_update_time < elapsed)
+	if (jiffies - bdi->write_bandwidth_update_time != elapsed)
 		goto snapshot;
 
 	written = percpu_counter_read(&bdi->bdi_stat[BDI_WRITTEN]) - *bw_written;

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom policy in Canada: sign http://dissolvethecrtc.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 06/13] writeback: bdi write bandwidth estimation
  2010-11-24 13:46               ` Wu Fengguang
@ 2010-11-24 14:12                 ` Peter Zijlstra
  2010-11-24 14:21                   ` Wu Fengguang
  2010-11-24 14:34                   ` Wu Fengguang
  0 siblings, 2 replies; 21+ messages in thread
From: Peter Zijlstra @ 2010-11-24 14:12 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, Jan Kara, Li, Shaohua, Christoph Hellwig,
	Dave Chinner, Theodore Ts'o, Chris Mason, Mel Gorman,
	Rik van Riel, KOSAKI Motohiro, linux-mm,
	linux-fsdevel@vger.kernel.org, LKML

On Wed, 2010-11-24 at 21:46 +0800, Wu Fengguang wrote:
> On Wed, Nov 24, 2010 at 09:42:09PM +0800, Peter Zijlstra wrote:
> > On Wed, 2010-11-24 at 21:20 +0800, Wu Fengguang wrote:
> > > >         (jiffies - bdi->write_bandwidth_update_time < elapsed)
> > > 
> > > this will be true if someone else has _done_ overlapped estimation,
> > > otherwise it will equal:
> > > 
> > >         jiffies - bdi->write_bandwidth_update_time == elapsed
> > > 
> > > Sorry the comment needs updating. 
> > 
> > Right, but its racy as hell..
> 
> Yeah, for N concurrent dirtiers, plus the background flusher, only
> one is able to update write_bandwidth[_update_time]..

Wrong, nr_cpus are, they could all observe the old value before seeing
the update of the variable.

Why not something like the below, which keeps the stamps per bdi and
serializes on a lock (trylock, you only need a single updater at any one
time anyway):

probably got the math wrong, but the idea should be clear, you can even
add an explicit bdi_update_bandwidth_stamps() function which resets the
stamps to the current situation in order to skip periods of low
throughput (that would need to do spin_lock).

---
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index 4ce34fa..de690c3 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -40,6 +40,7 @@ typedef int (congested_fn)(void *, int);
 enum bdi_stat_item {
 	BDI_RECLAIMABLE,
 	BDI_WRITEBACK,
+	BDI_WRITTEN,
 	NR_BDI_STAT_ITEMS
 };
 
@@ -88,6 +89,11 @@ struct backing_dev_info {
 
 	struct timer_list laptop_mode_wb_timer;
 
+	spinlock_t bw_lock;
+	unsigned long bw_time_stamp;
+	unsigned long bw_write_stamp;
+	int write_bandwidth;
+
 #ifdef CONFIG_DEBUG_FS
 	struct dentry *debug_dir;
 	struct dentry *debug_stats;
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 027100d..a934fe9 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -661,6 +661,11 @@ int bdi_init(struct backing_dev_info *bdi)
 	bdi->dirty_exceeded = 0;
 	err = prop_local_init_percpu(&bdi->completions);
 
+	spin_lock_init(&bdi->bw_lock);
+	bdi->bw_time_stamp = jiffies;
+	bdi->bw_write_stamp = 0;
+	bdi->write_bandwidth = 100 << (20 - PAGE_SHIFT); /* 100 MB/s */
+
 	if (err) {
 err:
 		while (i--)
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index b840afa..f3f5c24 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -219,6 +219,7 @@ int dirty_bytes_handler(struct ctl_table *table, int write,
  */
 static inline void __bdi_writeout_inc(struct backing_dev_info *bdi)
 {
+	__inc_bdi_state(bdi, BDI_WRITTEN);
 	__prop_inc_percpu_max(&vm_completions, &bdi->completions,
 			      bdi->max_prop_frac);
 }
@@ -238,6 +239,35 @@ void task_dirty_inc(struct task_struct *tsk)
 	prop_inc_single(&vm_dirties, &tsk->dirties);
 }
 
+void bdi_update_write_bandwidth(struct backing_dev_info *bdi)
+{
+	unsigned long time_now, write_now;
+	long time_delta, write_delta;
+	long bw;
+
+	if (!spin_try_lock(&bdi->bw_lock))
+		return;
+
+	write_now = bdi_stat(bdi, BDI_WRITTEN);
+	time_now = jiffies;
+
+	write_delta = write_now - bdi->bw_write_stamp;
+	time_delta = time_now - bdi->bw_time_stamp;
+
+	/* rate-limit, only update once every 100ms */
+	if (time_delta < HZ/10 || !write_delta)
+		goto unlock;
+
+	bdi->bw_write_stamp = write_now;
+	bdi->bw_time_stamp = time_now;
+
+	bw = write_delta * HZ / time_delta;
+	bdi->write_bandwidth = (bdi->write_bandwidth + bw + 3) / 4;
+
+unlock:
+	spin_unlock(&bdi->bw_lock);
+}
+
 /*
  * Obtain an accurate fraction of the BDI's portion.
  */

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom policy in Canada: sign http://dissolvethecrtc.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH 06/13] writeback: bdi write bandwidth estimation
  2010-11-24 14:12                 ` Peter Zijlstra
@ 2010-11-24 14:21                   ` Wu Fengguang
  2010-11-24 14:31                     ` Peter Zijlstra
  2010-11-24 14:34                   ` Wu Fengguang
  1 sibling, 1 reply; 21+ messages in thread
From: Wu Fengguang @ 2010-11-24 14:21 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrew Morton, Jan Kara, Li, Shaohua, Christoph Hellwig,
	Dave Chinner, Theodore Ts'o, Chris Mason, Mel Gorman,
	Rik van Riel, KOSAKI Motohiro, linux-mm,
	linux-fsdevel@vger.kernel.org, LKML

On Wed, Nov 24, 2010 at 10:12:33PM +0800, Peter Zijlstra wrote:
> On Wed, 2010-11-24 at 21:46 +0800, Wu Fengguang wrote:
> > On Wed, Nov 24, 2010 at 09:42:09PM +0800, Peter Zijlstra wrote:
> > > On Wed, 2010-11-24 at 21:20 +0800, Wu Fengguang wrote:
> > > > >         (jiffies - bdi->write_bandwidth_update_time < elapsed)
> > > > 
> > > > this will be true if someone else has _done_ overlapped estimation,
> > > > otherwise it will equal:
> > > > 
> > > >         jiffies - bdi->write_bandwidth_update_time == elapsed
> > > > 
> > > > Sorry the comment needs updating. 
> > > 
> > > Right, but its racy as hell..
> > 
> > Yeah, for N concurrent dirtiers, plus the background flusher, only
> > one is able to update write_bandwidth[_update_time]..
> 
> Wrong, nr_cpus are, they could all observe the old value before seeing
> the update of the variable.

Yes, that's what I meant to do it "per-cpu" in the previous email.

> Why not something like the below, which keeps the stamps per bdi and
> serializes on a lock (trylock, you only need a single updater at any one
> time anyway):

Hmm, but why not avoid locking at all?  With per-cpu bandwidth vars,
each CPU will see slightly different bandwidth, but that should be
close enough and not a big problem.

> probably got the math wrong, but the idea should be clear, you can even
> add an explicit bdi_update_bandwidth_stamps() function which resets the
> stamps to the current situation in order to skip periods of low
> throughput (that would need to do spin_lock).
> 
> ---
> diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
> index 4ce34fa..de690c3 100644
> --- a/include/linux/backing-dev.h
> +++ b/include/linux/backing-dev.h
> @@ -40,6 +40,7 @@ typedef int (congested_fn)(void *, int);
>  enum bdi_stat_item {
>  	BDI_RECLAIMABLE,
>  	BDI_WRITEBACK,
> +	BDI_WRITTEN,
>  	NR_BDI_STAT_ITEMS
>  };
>  
> @@ -88,6 +89,11 @@ struct backing_dev_info {
>  
>  	struct timer_list laptop_mode_wb_timer;
>  
> +	spinlock_t bw_lock;
> +	unsigned long bw_time_stamp;
> +	unsigned long bw_write_stamp;
> +	int write_bandwidth;
> +
>  #ifdef CONFIG_DEBUG_FS
>  	struct dentry *debug_dir;
>  	struct dentry *debug_stats;
> diff --git a/mm/backing-dev.c b/mm/backing-dev.c
> index 027100d..a934fe9 100644
> --- a/mm/backing-dev.c
> +++ b/mm/backing-dev.c
> @@ -661,6 +661,11 @@ int bdi_init(struct backing_dev_info *bdi)
>  	bdi->dirty_exceeded = 0;
>  	err = prop_local_init_percpu(&bdi->completions);
>  
> +	spin_lock_init(&bdi->bw_lock);
> +	bdi->bw_time_stamp = jiffies;
> +	bdi->bw_write_stamp = 0;
> +	bdi->write_bandwidth = 100 << (20 - PAGE_SHIFT); /* 100 MB/s */
> +
>  	if (err) {
>  err:
>  		while (i--)
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index b840afa..f3f5c24 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -219,6 +219,7 @@ int dirty_bytes_handler(struct ctl_table *table, int write,
>   */
>  static inline void __bdi_writeout_inc(struct backing_dev_info *bdi)
>  {
> +	__inc_bdi_state(bdi, BDI_WRITTEN);
>  	__prop_inc_percpu_max(&vm_completions, &bdi->completions,
>  			      bdi->max_prop_frac);
>  }
> @@ -238,6 +239,35 @@ void task_dirty_inc(struct task_struct *tsk)
>  	prop_inc_single(&vm_dirties, &tsk->dirties);
>  }
>  
> +void bdi_update_write_bandwidth(struct backing_dev_info *bdi)
> +{
> +	unsigned long time_now, write_now;
> +	long time_delta, write_delta;
> +	long bw;
> +
> +	if (!spin_try_lock(&bdi->bw_lock))
> +		return;

spin_try_lock is good, however is still global state and risks
cacheline bouncing..

> +	write_now = bdi_stat(bdi, BDI_WRITTEN);
> +	time_now = jiffies;
> +
> +	write_delta = write_now - bdi->bw_write_stamp;
> +	time_delta = time_now - bdi->bw_time_stamp;
> +
> +	/* rate-limit, only update once every 100ms */
> +	if (time_delta < HZ/10 || !write_delta)
> +		goto unlock;
> +
> +	bdi->bw_write_stamp = write_now;
> +	bdi->bw_time_stamp = time_now;
> +
> +	bw = write_delta * HZ / time_delta;
> +	bdi->write_bandwidth = (bdi->write_bandwidth + bw + 3) / 4;
> +
> +unlock:
> +	spin_unlock(&bdi->bw_lock);
> +}
> +
>  /*
>   * Obtain an accurate fraction of the BDI's portion.
>   */
> 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom policy in Canada: sign http://dissolvethecrtc.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 06/13] writeback: bdi write bandwidth estimation
  2010-11-24 14:21                   ` Wu Fengguang
@ 2010-11-24 14:31                     ` Peter Zijlstra
  2010-11-24 14:38                       ` Wu Fengguang
  0 siblings, 1 reply; 21+ messages in thread
From: Peter Zijlstra @ 2010-11-24 14:31 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, Jan Kara, Li, Shaohua, Christoph Hellwig,
	Dave Chinner, Theodore Ts'o, Chris Mason, Mel Gorman,
	Rik van Riel, KOSAKI Motohiro, linux-mm,
	linux-fsdevel@vger.kernel.org, LKML

On Wed, 2010-11-24 at 22:21 +0800, Wu Fengguang wrote:
> 
> Hmm, but why not avoid locking at all?  With per-cpu bandwidth vars,
> each CPU will see slightly different bandwidth, but that should be
> close enough and not a big problem.

I don't think so, on a large enough machine some cpus might hardly ever
use a particular BDI and hence get very stale data.

Also, it increases the memory footprint of the whole solution.

> > +void bdi_update_write_bandwidth(struct backing_dev_info *bdi)
> > +{
> > +     unsigned long time_now, write_now;
> > +     long time_delta, write_delta;
> > +     long bw;
> > +
> > +     if (!spin_try_lock(&bdi->bw_lock))
> > +             return;
> 
> spin_try_lock is good, however is still global state and risks
> cacheline bouncing.. 

If there are many concurrent writers to the BDI I don't think this is
going to be the top sore spot, once it is we can think of something
else.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom policy in Canada: sign http://dissolvethecrtc.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 06/13] writeback: bdi write bandwidth estimation
  2010-11-24 14:12                 ` Peter Zijlstra
  2010-11-24 14:21                   ` Wu Fengguang
@ 2010-11-24 14:34                   ` Wu Fengguang
  1 sibling, 0 replies; 21+ messages in thread
From: Wu Fengguang @ 2010-11-24 14:34 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrew Morton, Jan Kara, Li, Shaohua, Christoph Hellwig,
	Dave Chinner, Theodore Ts'o, Chris Mason, Mel Gorman,
	Rik van Riel, KOSAKI Motohiro, linux-mm,
	linux-fsdevel@vger.kernel.org, LKML

btw, I noticed that the same dd is constantly running from different
CPUs.

It's multi-core CPUs on the same socket, so should not be a big
problem. I'll test NUMA setup later.

This trace is for single dd case. 

dd-2893  [005]  3535.182111: balance_dirty_pages: bdi=8:0
dd-2893  [001]  3535.230125: balance_dirty_pages: bdi=8:0
dd-2893  [005]  3535.274805: balance_dirty_pages: bdi=8:0
dd-2893  [005]  3535.289337: balance_dirty_pages: bdi=8:0
dd-2893  [005]  3535.497863: balance_dirty_pages: bdi=8:0
dd-2893  [005]  3535.585510: balance_dirty_pages: bdi=8:0
dd-2893  [005]  3535.622001: balance_dirty_pages: bdi=8:0
dd-2893  [005]  3535.645003: balance_dirty_pages: bdi=8:0
dd-2893  [001]  3535.659603: balance_dirty_pages: bdi=8:0
dd-2893  [001]  3535.669497: balance_dirty_pages: bdi=8:0
dd-2893  [001]  3535.731294: balance_dirty_pages: bdi=8:0
dd-2893  [001]  3535.785879: balance_dirty_pages: bdi=8:0
dd-2893  [001]  3535.827724: balance_dirty_pages: bdi=8:0
dd-2893  [001]  3535.853108: balance_dirty_pages: bdi=8:0
dd-2893  [001]  3535.875667: balance_dirty_pages: bdi=8:0
dd-2893  [005]  3535.895731: balance_dirty_pages: bdi=8:0
dd-2893  [005]  3535.914920: balance_dirty_pages: bdi=8:0
dd-2893  [005]  3535.925765: balance_dirty_pages: bdi=8:0
dd-2893  [005]  3535.935545: balance_dirty_pages: bdi=8:0
dd-2893  [001]  3536.015888: balance_dirty_pages: bdi=8:0
dd-2893  [001]  3536.085571: balance_dirty_pages: bdi=8:0

Here is another run (1-dd):

dd-3014  [001]  1248.388001: balance_dirty_pages: bdi=8:0 bdi_dirty=122640 avg_dirty=122643 
dd-3014  [002]  1248.465999: balance_dirty_pages: bdi=8:0 bdi_dirty=122400 avg_dirty=122625 
dd-3014  [007]  1248.531451: balance_dirty_pages: bdi=8:0 bdi_dirty=122240 avg_dirty=122600 
dd-3014  [007]  1248.599260: balance_dirty_pages: bdi=8:0 bdi_dirty=122000 avg_dirty=122561 
dd-3014  [003]  1248.667152: balance_dirty_pages: bdi=8:0 bdi_dirty=120840 avg_dirty=122447 
dd-3014  [007]  1248.734848: balance_dirty_pages: bdi=8:0 bdi_dirty=120640 avg_dirty=122328 
dd-3014  [007]  1248.798652: balance_dirty_pages: bdi=8:0 bdi_dirty=120440 avg_dirty=122210 
dd-3014  [007]  1248.862456: balance_dirty_pages: bdi=8:0 bdi_dirty=120240 avg_dirty=122087 
dd-3014  [003]  1248.926280: balance_dirty_pages: bdi=8:0 bdi_dirty=120040 avg_dirty=121960 
dd-3014  [005]  1248.986091: balance_dirty_pages: bdi=8:0 bdi_dirty=119760 avg_dirty=121832 
dd-3014  [005]  1249.045841: balance_dirty_pages: bdi=8:0 bdi_dirty=119600 avg_dirty=121702 

And this 100 dd case looks much better (however it's not always the case):

dd-2775  [007]   454.844907: balance_dirty_pages: bdi=8:0 bdi_dirty=104040 avg_dirty=409468930 
dd-2775  [007]   455.048108: balance_dirty_pages: bdi=8:0 bdi_dirty=103800 avg_dirty=409676055 
dd-2775  [007]   455.252347: balance_dirty_pages: bdi=8:0 bdi_dirty=103720 avg_dirty=409676055 
dd-2775  [007]   455.455538: balance_dirty_pages: bdi=8:0 bdi_dirty=103800 avg_dirty=409676055 
dd-2775  [007]   455.658597: balance_dirty_pages: bdi=8:0 bdi_dirty=103880 avg_dirty=409676055 
dd-2775  [007]   455.862631: balance_dirty_pages: bdi=8:0 bdi_dirty=103760 avg_dirty=410276387 
dd-2775  [007]   456.068149: balance_dirty_pages: bdi=8:0 bdi_dirty=103800 avg_dirty=410276387 
dd-2775  [007]   456.266729: balance_dirty_pages: bdi=8:0 bdi_dirty=103760 avg_dirty=410833633 
dd-2775  [007]   456.470184: balance_dirty_pages: bdi=8:0 bdi_dirty=103840 avg_dirty=410833633 
dd-2775  [007]   456.668919: balance_dirty_pages: bdi=8:0 bdi_dirty=103960 avg_dirty=410833633 
dd-2775  [007]   456.872522: balance_dirty_pages: bdi=8:0 bdi_dirty=103880 avg_dirty=411001977 
dd-2775  [007]   457.070753: balance_dirty_pages: bdi=8:0 bdi_dirty=104040 avg_dirty=411001977 
dd-2775  [007]   457.275970: balance_dirty_pages: bdi=8:0 bdi_dirty=103960 avg_dirty=411227747 
dd-2775  [007]   457.473736: balance_dirty_pages: bdi=8:0 bdi_dirty=103920 avg_dirty=411534633 
dd-2775  [007]   457.677204: balance_dirty_pages: bdi=8:0 bdi_dirty=103920 avg_dirty=412195106 
dd-2775  [007]   457.881156: balance_dirty_pages: bdi=8:0 bdi_dirty=104040 avg_dirty=412195106 
dd-2775  [007]   458.085173: balance_dirty_pages: bdi=8:0 bdi_dirty=103920 avg_dirty=412295260 
dd-2775  [007]   458.285146: balance_dirty_pages: bdi=8:0 bdi_dirty=104000 avg_dirty=412295260 
dd-2775  [007]   458.490109: balance_dirty_pages: bdi=8:0 bdi_dirty=103960 avg_dirty=412789184 
dd-2775  [007]   458.692230: balance_dirty_pages: bdi=8:0 bdi_dirty=103960 avg_dirty=412789184 

Thanks,
Fengguang

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom policy in Canada: sign http://dissolvethecrtc.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 06/13] writeback: bdi write bandwidth estimation
  2010-11-24 14:31                     ` Peter Zijlstra
@ 2010-11-24 14:38                       ` Wu Fengguang
  0 siblings, 0 replies; 21+ messages in thread
From: Wu Fengguang @ 2010-11-24 14:38 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrew Morton, Jan Kara, Li, Shaohua, Christoph Hellwig,
	Dave Chinner, Theodore Ts'o, Chris Mason, Mel Gorman,
	Rik van Riel, KOSAKI Motohiro, linux-mm,
	linux-fsdevel@vger.kernel.org, LKML

On Wed, Nov 24, 2010 at 10:31:57PM +0800, Peter Zijlstra wrote:
> On Wed, 2010-11-24 at 22:21 +0800, Wu Fengguang wrote:
> > 
> > Hmm, but why not avoid locking at all?  With per-cpu bandwidth vars,
> > each CPU will see slightly different bandwidth, but that should be
> > close enough and not a big problem.
> 
> I don't think so, on a large enough machine some cpus might hardly ever
> use a particular BDI and hence get very stale data.

Good point!

> Also, it increases the memory footprint of the whole solution.

Yeah, maybe not a good trade off.

> > > +void bdi_update_write_bandwidth(struct backing_dev_info *bdi)
> > > +{
> > > +     unsigned long time_now, write_now;
> > > +     long time_delta, write_delta;
> > > +     long bw;
> > > +
> > > +     if (!spin_try_lock(&bdi->bw_lock))
> > > +             return;
> > 
> > spin_try_lock is good, however is still global state and risks
> > cacheline bouncing.. 
> 
> If there are many concurrent writers to the BDI I don't think this is
> going to be the top sore spot, once it is we can think of something
> else.

When there are lots of concurrent writers, we'll target at ~100ms
pause time, hence the update frequency will be lowered accordingly.

Thanks,
Fengguang

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom policy in Canada: sign http://dissolvethecrtc.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2010-11-24 14:38 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2010-11-17  3:58 [PATCH 06/13] writeback: bdi write bandwidth estimation Wu Fengguang
  -- strict thread matches above, loose matches on Subject: below --
2010-11-17  4:27 [PATCH 00/13] IO-less dirty throttling v2 Wu Fengguang
2010-11-17  4:27 ` [PATCH 06/13] writeback: bdi write bandwidth estimation Wu Fengguang
2010-11-17 23:08   ` Andrew Morton
2010-11-17 23:24     ` Peter Zijlstra
2010-11-17 23:38       ` Andrew Morton
2010-11-17 23:43         ` Peter Zijlstra
2010-11-18  6:51     ` Wu Fengguang
2010-11-24 10:58   ` Peter Zijlstra
2010-11-24 14:06     ` Wu Fengguang
2010-11-24 11:05   ` Peter Zijlstra
2010-11-24 12:10     ` Wu Fengguang
2010-11-24 12:50       ` Peter Zijlstra
2010-11-24 13:14         ` Wu Fengguang
2010-11-24 13:20           ` Wu Fengguang
2010-11-24 13:42             ` Peter Zijlstra
2010-11-24 13:46               ` Wu Fengguang
2010-11-24 14:12                 ` Peter Zijlstra
2010-11-24 14:21                   ` Wu Fengguang
2010-11-24 14:31                     ` Peter Zijlstra
2010-11-24 14:38                       ` Wu Fengguang
2010-11-24 14:34                   ` Wu Fengguang

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).