linux-fsdevel.vger.kernel.org archive mirror
* [PATCH 0/6] writeback cleanups and trivial fixes
@ 2010-07-11  2:06 Wu Fengguang
  2010-07-11  2:06 ` [PATCH 1/6] writeback: take account of NR_WRITEBACK_TEMP in balance_dirty_pages() Wu Fengguang
                   ` (6 more replies)
  0 siblings, 7 replies; 36+ messages in thread
From: Wu Fengguang @ 2010-07-11  2:06 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christoph Hellwig, Dave Chinner, Jan Kara, Peter Zijlstra,
	Wu Fengguang, linux-fsdevel, Linux Memory Management List, LKML

Andrew,

Here are some writeback cleanups to avoid unnecessary calculation overheads,
and relatively simple bug fixes.

The patches apply to the latest linux-next tree. The mmotm tree will need a
rebase to include commit 32422c79 (writeback: Add tracing to balance_dirty_pages)
in order to avoid merge conflicts.

[PATCH 1/6] writeback: take account of NR_WRITEBACK_TEMP in balance_dirty_pages()
[PATCH 2/6] writeback: reduce calls to global_page_state in balance_dirty_pages()
[PATCH 3/6] writeback: avoid unnecessary calculation of bdi dirty thresholds

[PATCH 4/6] writeback: dont redirty tail an inode with dirty pages
[PATCH 5/6] writeback: fix queue_io() ordering
[PATCH 6/6] writeback: merge for_kupdate and !for_kupdate cases

 fs/fs-writeback.c         |   68 ++++-----------
 include/linux/writeback.h |    5 -
 mm/backing-dev.c          |    3 
 mm/page-writeback.c       |  158 ++++++++++++++----------------------
 4 files changed, 89 insertions(+), 145 deletions(-)

Thanks,
Fengguang

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <dont@kvack.org>


* [PATCH 1/6] writeback: take account of NR_WRITEBACK_TEMP in balance_dirty_pages()
  2010-07-11  2:06 [PATCH 0/6] writeback cleanups and trivial fixes Wu Fengguang
@ 2010-07-11  2:06 ` Wu Fengguang
  2010-07-12 21:52   ` Andrew Morton
  2010-07-11  2:06 ` [PATCH 2/6] writeback: reduce calls to global_page_state " Wu Fengguang
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 36+ messages in thread
From: Wu Fengguang @ 2010-07-11  2:06 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christoph Hellwig, Richard Kennedy, Wu Fengguang, Dave Chinner,
	Jan Kara, Peter Zijlstra, linux-fsdevel,
	Linux Memory Management List, LKML

[-- Attachment #1: writeback-temp.patch --]
[-- Type: text/plain, Size: 928 bytes --]


Signed-off-by: Richard Kennedy <richard@rsk.demon.co.uk>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/page-writeback.c |    7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

--- linux-next.orig/mm/page-writeback.c	2010-07-11 08:41:37.000000000 +0800
+++ linux-next/mm/page-writeback.c	2010-07-11 08:42:14.000000000 +0800
@@ -503,11 +503,12 @@ static void balance_dirty_pages(struct a
 		};
 
 		get_dirty_limits(&background_thresh, &dirty_thresh,
-				&bdi_thresh, bdi);
+				 &bdi_thresh, bdi);
 
 		nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
-					global_page_state(NR_UNSTABLE_NFS);
-		nr_writeback = global_page_state(NR_WRITEBACK);
+				 global_page_state(NR_UNSTABLE_NFS);
+		nr_writeback = global_page_state(NR_WRITEBACK) +
+			       global_page_state(NR_WRITEBACK_TEMP);
 
 		bdi_nr_reclaimable = bdi_stat(bdi, BDI_RECLAIMABLE);
 		bdi_nr_writeback = bdi_stat(bdi, BDI_WRITEBACK);




* [PATCH 2/6] writeback: reduce calls to global_page_state in balance_dirty_pages()
  2010-07-11  2:06 [PATCH 0/6] writeback cleanups and trivial fixes Wu Fengguang
  2010-07-11  2:06 ` [PATCH 1/6] writeback: take account of NR_WRITEBACK_TEMP in balance_dirty_pages() Wu Fengguang
@ 2010-07-11  2:06 ` Wu Fengguang
  2010-07-26 15:19   ` Jan Kara
  2010-08-03 14:55   ` Peter Zijlstra
  2010-07-11  2:06 ` [PATCH 3/6] writeback: avoid unnecessary calculation of bdi dirty thresholds Wu Fengguang
                   ` (4 subsequent siblings)
  6 siblings, 2 replies; 36+ messages in thread
From: Wu Fengguang @ 2010-07-11  2:06 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christoph Hellwig, Jan Kara, Peter Zijlstra, Richard Kennedy,
	Wu Fengguang, Dave Chinner, linux-fsdevel,
	Linux Memory Management List, LKML

[-- Attachment #1: writeback-less-balance-calc.patch --]
[-- Type: text/plain, Size: 7525 bytes --]

Reducing the number of times balance_dirty_pages() calls global_page_state()
reduces cache references and so improves write performance on a variety of
workloads.

'perf stat' of simple fio write tests shows the reduction in cache
accesses.  The test is fio 'write,mmap,600Mb,pre_read' on an AMD
AthlonX2 with 3Gb memory (dirty_threshold approx 600Mb), running each
test 10 times, dropping the fastest and slowest values, then taking the
average and standard deviation:

		average (s.d.) in millions (10^6)
2.6.31-rc8	648.6 (14.6)
+patch		620.1 (16.5)

This reduction is achieved by dropping clip_bdi_dirty_limit(), which
rereads the counters to apply the dirty_threshold, and moving its check
up into balance_dirty_pages(), where the counters have already been read.

Rearranging the for loop to contain only one copy of the limit tests also
allows the pdflush test after the loop to use the local copies of the
counters rather than rereading them.

In the common case with no throttling, this calls global_page_state() 5
fewer times and bdi_stat() 2 fewer times.

Fengguang:

This patch slightly changes behavior by replacing clip_bdi_dirty_limit()
with an explicit check (nr_reclaimable + nr_writeback >= dirty_thresh)
to avoid exceeding the dirty limit. Since the bdi dirty limit is mostly
accurate, we don't need to clip routinely; a simple dirty limit check
is enough.

The check is necessary because, as Peter pointed out, in principle we
should throttle everything calling balance_dirty_pages() when we're
over the total limit.

We now set and clear dirty_exceeded based not only on the bdi dirty
limits, but also on the global dirty limits. This is a bit
counterintuitive, but the global limits are the ultimate goal and shall
always be imposed.

We may now start background writeback work based on outdated conditions.
That's safe because the bdi flush thread will (and has to) double-check
the state. It reduces overall overhead because a test based on slightly
stale state still has a good chance of being right.

CC: Jan Kara <jack@suse.cz>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Richard Kennedy <richard@rsk.demon.co.uk>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/page-writeback.c |   95 ++++++++++++++----------------------------
 1 file changed, 33 insertions(+), 62 deletions(-)

--- linux-next.orig/mm/page-writeback.c	2010-07-11 08:42:14.000000000 +0800
+++ linux-next/mm/page-writeback.c	2010-07-11 08:44:49.000000000 +0800
@@ -253,32 +253,6 @@ static void bdi_writeout_fraction(struct
 	}
 }
 
-/*
- * Clip the earned share of dirty pages to that which is actually available.
- * This avoids exceeding the total dirty_limit when the floating averages
- * fluctuate too quickly.
- */
-static void clip_bdi_dirty_limit(struct backing_dev_info *bdi,
-		unsigned long dirty, unsigned long *pbdi_dirty)
-{
-	unsigned long avail_dirty;
-
-	avail_dirty = global_page_state(NR_FILE_DIRTY) +
-		 global_page_state(NR_WRITEBACK) +
-		 global_page_state(NR_UNSTABLE_NFS) +
-		 global_page_state(NR_WRITEBACK_TEMP);
-
-	if (avail_dirty < dirty)
-		avail_dirty = dirty - avail_dirty;
-	else
-		avail_dirty = 0;
-
-	avail_dirty += bdi_stat(bdi, BDI_RECLAIMABLE) +
-		bdi_stat(bdi, BDI_WRITEBACK);
-
-	*pbdi_dirty = min(*pbdi_dirty, avail_dirty);
-}
-
 static inline void task_dirties_fraction(struct task_struct *tsk,
 		long *numerator, long *denominator)
 {
@@ -469,7 +443,6 @@ get_dirty_limits(unsigned long *pbackgro
 			bdi_dirty = dirty * bdi->max_ratio / 100;
 
 		*pbdi_dirty = bdi_dirty;
-		clip_bdi_dirty_limit(bdi, dirty, pbdi_dirty);
 		task_dirty_limit(current, pbdi_dirty);
 	}
 }
@@ -491,7 +464,7 @@ static void balance_dirty_pages(struct a
 	unsigned long bdi_thresh;
 	unsigned long pages_written = 0;
 	unsigned long pause = 1;
-
+	int dirty_exceeded;
 	struct backing_dev_info *bdi = mapping->backing_dev_info;
 
 	for (;;) {
@@ -510,10 +483,35 @@ static void balance_dirty_pages(struct a
 		nr_writeback = global_page_state(NR_WRITEBACK) +
 			       global_page_state(NR_WRITEBACK_TEMP);
 
-		bdi_nr_reclaimable = bdi_stat(bdi, BDI_RECLAIMABLE);
-		bdi_nr_writeback = bdi_stat(bdi, BDI_WRITEBACK);
+		/*
+		 * In order to avoid the stacked BDI deadlock we need
+		 * to ensure we accurately count the 'dirty' pages when
+		 * the threshold is low.
+		 *
+		 * Otherwise it would be possible to get thresh+n pages
+		 * reported dirty, even though there are thresh-m pages
+		 * actually dirty; with m+n sitting in the percpu
+		 * deltas.
+		 */
+		if (bdi_thresh < 2*bdi_stat_error(bdi)) {
+			bdi_nr_reclaimable = bdi_stat_sum(bdi, BDI_RECLAIMABLE);
+			bdi_nr_writeback = bdi_stat_sum(bdi, BDI_WRITEBACK);
+		} else {
+			bdi_nr_reclaimable = bdi_stat(bdi, BDI_RECLAIMABLE);
+			bdi_nr_writeback = bdi_stat(bdi, BDI_WRITEBACK);
+		}
+
+		/*
+		 * The bdi thresh is somehow "soft" limit derived from the
+		 * global "hard" limit. The former helps to prevent heavy IO
+		 * bdi or process from holding back light ones; The latter is
+		 * the last resort safeguard.
+		 */
+		dirty_exceeded =
+			(bdi_nr_reclaimable + bdi_nr_writeback >= bdi_thresh)
+			|| (nr_reclaimable + nr_writeback >= dirty_thresh);
 
-		if (bdi_nr_reclaimable + bdi_nr_writeback <= bdi_thresh)
+		if (!dirty_exceeded)
 			break;
 
 		/*
@@ -541,34 +539,10 @@ static void balance_dirty_pages(struct a
 		if (bdi_nr_reclaimable > bdi_thresh) {
 			writeback_inodes_wb(&bdi->wb, &wbc);
 			pages_written += write_chunk - wbc.nr_to_write;
-			get_dirty_limits(&background_thresh, &dirty_thresh,
-				       &bdi_thresh, bdi);
 			trace_wbc_balance_dirty_written(&wbc, bdi);
+			if (pages_written >= write_chunk)
+				break;		/* We've done our duty */
 		}
-
-		/*
-		 * In order to avoid the stacked BDI deadlock we need
-		 * to ensure we accurately count the 'dirty' pages when
-		 * the threshold is low.
-		 *
-		 * Otherwise it would be possible to get thresh+n pages
-		 * reported dirty, even though there are thresh-m pages
-		 * actually dirty; with m+n sitting in the percpu
-		 * deltas.
-		 */
-		if (bdi_thresh < 2*bdi_stat_error(bdi)) {
-			bdi_nr_reclaimable = bdi_stat_sum(bdi, BDI_RECLAIMABLE);
-			bdi_nr_writeback = bdi_stat_sum(bdi, BDI_WRITEBACK);
-		} else if (bdi_nr_reclaimable) {
-			bdi_nr_reclaimable = bdi_stat(bdi, BDI_RECLAIMABLE);
-			bdi_nr_writeback = bdi_stat(bdi, BDI_WRITEBACK);
-		}
-
-		if (bdi_nr_reclaimable + bdi_nr_writeback <= bdi_thresh)
-			break;
-		if (pages_written >= write_chunk)
-			break;		/* We've done our duty */
-
 		trace_wbc_balance_dirty_wait(&wbc, bdi);
 		__set_current_state(TASK_INTERRUPTIBLE);
 		io_schedule_timeout(pause);
@@ -582,8 +556,7 @@ static void balance_dirty_pages(struct a
 			pause = HZ / 10;
 	}
 
-	if (bdi_nr_reclaimable + bdi_nr_writeback < bdi_thresh &&
-			bdi->dirty_exceeded)
+	if (!dirty_exceeded && bdi->dirty_exceeded)
 		bdi->dirty_exceeded = 0;
 
 	if (writeback_in_progress(bdi))
@@ -598,9 +571,7 @@ static void balance_dirty_pages(struct a
 	 * background_thresh, to keep the amount of dirty memory low.
 	 */
 	if ((laptop_mode && pages_written) ||
-	    (!laptop_mode && ((global_page_state(NR_FILE_DIRTY)
-			       + global_page_state(NR_UNSTABLE_NFS))
-					  > background_thresh)))
+	    (!laptop_mode && (nr_reclaimable > background_thresh)))
 		bdi_start_background_writeback(bdi);
 }
 



* [PATCH 3/6] writeback: avoid unnecessary calculation of bdi dirty thresholds
  2010-07-11  2:06 [PATCH 0/6] writeback cleanups and trivial fixes Wu Fengguang
  2010-07-11  2:06 ` [PATCH 1/6] writeback: take account of NR_WRITEBACK_TEMP in balance_dirty_pages() Wu Fengguang
  2010-07-11  2:06 ` [PATCH 2/6] writeback: reduce calls to global_page_state " Wu Fengguang
@ 2010-07-11  2:06 ` Wu Fengguang
  2010-07-12 21:56   ` Andrew Morton
                     ` (2 more replies)
  2010-07-11  2:07 ` [PATCH 4/6] writeback: dont redirty tail an inode with dirty pages Wu Fengguang
                   ` (3 subsequent siblings)
  6 siblings, 3 replies; 36+ messages in thread
From: Wu Fengguang @ 2010-07-11  2:06 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christoph Hellwig, Peter Zijlstra, Wu Fengguang, Dave Chinner,
	Jan Kara, linux-fsdevel, Linux Memory Management List, LKML

[-- Attachment #1: writeback-less-bdi-calc.patch --]
[-- Type: text/plain, Size: 6556 bytes --]

Split get_dirty_limits() into global_dirty_limits() + bdi_dirty_limit(),
so that the latter can be avoided when under the global dirty background
threshold (which is the normal state for most systems).

CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/fs-writeback.c         |    2 
 include/linux/writeback.h |    5 +-
 mm/backing-dev.c          |    3 -
 mm/page-writeback.c       |   74 ++++++++++++++++++------------------
 4 files changed, 43 insertions(+), 41 deletions(-)

--- linux-next.orig/mm/page-writeback.c	2010-07-11 08:50:00.000000000 +0800
+++ linux-next/mm/page-writeback.c	2010-07-11 08:53:44.000000000 +0800
@@ -267,10 +267,11 @@ static inline void task_dirties_fraction
  *
  *   dirty -= (dirty/8) * p_{t}
  */
-static void task_dirty_limit(struct task_struct *tsk, unsigned long *pdirty)
+static unsigned long task_dirty_limit(struct task_struct *tsk,
+				       unsigned long bdi_dirty)
 {
 	long numerator, denominator;
-	unsigned long dirty = *pdirty;
+	unsigned long dirty = bdi_dirty;
 	u64 inv = dirty >> 3;
 
 	task_dirties_fraction(tsk, &numerator, &denominator);
@@ -278,10 +279,8 @@ static void task_dirty_limit(struct task
 	do_div(inv, denominator);
 
 	dirty -= inv;
-	if (dirty < *pdirty/2)
-		dirty = *pdirty/2;
 
-	*pdirty = dirty;
+	return max(dirty, bdi_dirty/2);
 }
 
 /*
@@ -391,9 +390,7 @@ unsigned long determine_dirtyable_memory
 	return x + 1;	/* Ensure that we never return 0 */
 }
 
-void
-get_dirty_limits(unsigned long *pbackground, unsigned long *pdirty,
-		 unsigned long *pbdi_dirty, struct backing_dev_info *bdi)
+void global_dirty_limits(unsigned long *pbackground, unsigned long *pdirty)
 {
 	unsigned long background;
 	unsigned long dirty;
@@ -425,26 +422,28 @@ get_dirty_limits(unsigned long *pbackgro
 	}
 	*pbackground = background;
 	*pdirty = dirty;
+}
 
-	if (bdi) {
-		u64 bdi_dirty;
-		long numerator, denominator;
+unsigned long bdi_dirty_limit(struct backing_dev_info *bdi,
+			       unsigned long dirty)
+{
+	u64 bdi_dirty;
+	long numerator, denominator;
 
-		/*
-		 * Calculate this BDI's share of the dirty ratio.
-		 */
-		bdi_writeout_fraction(bdi, &numerator, &denominator);
+	/*
+	 * Calculate this BDI's share of the dirty ratio.
+	 */
+	bdi_writeout_fraction(bdi, &numerator, &denominator);
 
-		bdi_dirty = (dirty * (100 - bdi_min_ratio)) / 100;
-		bdi_dirty *= numerator;
-		do_div(bdi_dirty, denominator);
-		bdi_dirty += (dirty * bdi->min_ratio) / 100;
-		if (bdi_dirty > (dirty * bdi->max_ratio) / 100)
-			bdi_dirty = dirty * bdi->max_ratio / 100;
+	bdi_dirty = (dirty * (100 - bdi_min_ratio)) / 100;
+	bdi_dirty *= numerator;
+	do_div(bdi_dirty, denominator);
 
-		*pbdi_dirty = bdi_dirty;
-		task_dirty_limit(current, pbdi_dirty);
-	}
+	bdi_dirty += (dirty * bdi->min_ratio) / 100;
+	if (bdi_dirty > (dirty * bdi->max_ratio) / 100)
+		bdi_dirty = dirty * bdi->max_ratio / 100;
+
+	return task_dirty_limit(current, bdi_dirty);
 }
 
 /*
@@ -475,14 +474,24 @@ static void balance_dirty_pages(struct a
 			.range_cyclic	= 1,
 		};
 
-		get_dirty_limits(&background_thresh, &dirty_thresh,
-				 &bdi_thresh, bdi);
-
 		nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
 				 global_page_state(NR_UNSTABLE_NFS);
 		nr_writeback = global_page_state(NR_WRITEBACK) +
 			       global_page_state(NR_WRITEBACK_TEMP);
 
+		global_dirty_limits(&background_thresh, &dirty_thresh);
+
+		/*
+		 * Throttle it only when the background writeback cannot
+		 * catch-up. This avoids (excessively) small writeouts
+		 * when the bdi limits are ramping up.
+		 */
+		if (nr_reclaimable + nr_writeback <
+				(background_thresh + dirty_thresh) / 2)
+			break;
+
+		bdi_thresh = bdi_dirty_limit(bdi, dirty_thresh);
+
 		/*
 		 * In order to avoid the stacked BDI deadlock we need
 		 * to ensure we accurately count the 'dirty' pages when
@@ -514,15 +523,6 @@ static void balance_dirty_pages(struct a
 		if (!dirty_exceeded)
 			break;
 
-		/*
-		 * Throttle it only when the background writeback cannot
-		 * catch-up. This avoids (excessively) small writeouts
-		 * when the bdi limits are ramping up.
-		 */
-		if (nr_reclaimable + nr_writeback <
-				(background_thresh + dirty_thresh) / 2)
-			break;
-
 		if (!bdi->dirty_exceeded)
 			bdi->dirty_exceeded = 1;
 
@@ -635,7 +635,7 @@ void throttle_vm_writeout(gfp_t gfp_mask
 	unsigned long dirty_thresh;
 
         for ( ; ; ) {
-		get_dirty_limits(&background_thresh, &dirty_thresh, NULL, NULL);
+		global_dirty_limits(&background_thresh, &dirty_thresh);
 
                 /*
                  * Boost the allowable dirty threshold a bit for page
--- linux-next.orig/fs/fs-writeback.c	2010-07-11 08:50:00.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2010-07-11 08:53:44.000000000 +0800
@@ -594,7 +594,7 @@ static inline bool over_bground_thresh(v
 {
 	unsigned long background_thresh, dirty_thresh;
 
-	get_dirty_limits(&background_thresh, &dirty_thresh, NULL, NULL);
+	global_dirty_limits(&background_thresh, &dirty_thresh);
 
 	return (global_page_state(NR_FILE_DIRTY) +
 		global_page_state(NR_UNSTABLE_NFS) >= background_thresh);
--- linux-next.orig/mm/backing-dev.c	2010-07-11 08:50:00.000000000 +0800
+++ linux-next/mm/backing-dev.c	2010-07-11 08:53:44.000000000 +0800
@@ -83,7 +83,8 @@ static int bdi_debug_stats_show(struct s
 		nr_more_io++;
 	spin_unlock(&inode_lock);
 
-	get_dirty_limits(&background_thresh, &dirty_thresh, &bdi_thresh, bdi);
+	global_dirty_limits(&background_thresh, &dirty_thresh);
+	bdi_thresh = bdi_dirty_limit(bdi, dirty_thresh);
 
 #define K(x) ((x) << (PAGE_SHIFT - 10))
 	seq_printf(m,
--- linux-next.orig/include/linux/writeback.h	2010-07-11 08:50:00.000000000 +0800
+++ linux-next/include/linux/writeback.h	2010-07-11 08:53:44.000000000 +0800
@@ -124,8 +124,9 @@ struct ctl_table;
 int dirty_writeback_centisecs_handler(struct ctl_table *, int,
 				      void __user *, size_t *, loff_t *);
 
-void get_dirty_limits(unsigned long *pbackground, unsigned long *pdirty,
-		      unsigned long *pbdi_dirty, struct backing_dev_info *bdi);
+void global_dirty_limits(unsigned long *pbackground, unsigned long *pdirty);
+unsigned long bdi_dirty_limit(struct backing_dev_info *bdi,
+			       unsigned long dirty);
 
 void page_writeback_init(void);
 void balance_dirty_pages_ratelimited_nr(struct address_space *mapping,



* [PATCH 4/6] writeback: dont redirty tail an inode with dirty pages
  2010-07-11  2:06 [PATCH 0/6] writeback cleanups and trivial fixes Wu Fengguang
                   ` (2 preceding siblings ...)
  2010-07-11  2:06 ` [PATCH 3/6] writeback: avoid unnecessary calculation of bdi dirty thresholds Wu Fengguang
@ 2010-07-11  2:07 ` Wu Fengguang
  2010-07-12  2:01   ` Dave Chinner
  2010-07-11  2:07 ` [PATCH 5/6] writeback: fix queue_io() ordering Wu Fengguang
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 36+ messages in thread
From: Wu Fengguang @ 2010-07-11  2:07 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christoph Hellwig, Dave Chinner, Jan Kara, Wu Fengguang,
	Peter Zijlstra, linux-fsdevel, Linux Memory Management List, LKML

[-- Attachment #1: writeback-xfs-fast-redirty.patch --]
[-- Type: text/plain, Size: 1970 bytes --]

This avoids delaying writeback for an expired (XFS) inode with lots of
dirty pages but no active dirtier at the moment. Previously we only did
that for the kupdate case.

CC: Dave Chinner <david@fromorbit.com>
CC: Christoph Hellwig <hch@infradead.org>
Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/fs-writeback.c |   20 +++++++-------------
 1 file changed, 7 insertions(+), 13 deletions(-)

--- linux-next.orig/fs/fs-writeback.c	2010-07-11 08:53:44.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2010-07-11 08:57:35.000000000 +0800
@@ -367,18 +367,7 @@ writeback_single_inode(struct inode *ino
 	spin_lock(&inode_lock);
 	inode->i_state &= ~I_SYNC;
 	if (!(inode->i_state & I_FREEING)) {
-		if ((inode->i_state & I_DIRTY_PAGES) && wbc->for_kupdate) {
-			/*
-			 * More pages get dirtied by a fast dirtier.
-			 */
-			goto select_queue;
-		} else if (inode->i_state & I_DIRTY) {
-			/*
-			 * At least XFS will redirty the inode during the
-			 * writeback (delalloc) and on io completion (isize).
-			 */
-			redirty_tail(inode);
-		} else if (mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
+		if (mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
 			/*
 			 * We didn't write back all the pages.  nfs_writepages()
 			 * sometimes bales out without doing anything. Redirty
@@ -400,7 +389,6 @@ writeback_single_inode(struct inode *ino
 				 * soon as the queue becomes uncongested.
 				 */
 				inode->i_state |= I_DIRTY_PAGES;
-select_queue:
 				if (wbc->nr_to_write <= 0) {
 					/*
 					 * slice used up: queue for next turn
@@ -423,6 +411,12 @@ select_queue:
 				inode->i_state |= I_DIRTY_PAGES;
 				redirty_tail(inode);
 			}
+		} else if (inode->i_state & I_DIRTY) {
+			/*
+			 * At least XFS will redirty the inode during the
+			 * writeback (delalloc) and on io completion (isize).
+			 */
+			redirty_tail(inode);
 		} else if (atomic_read(&inode->i_count)) {
 			/*
 			 * The inode is clean, inuse


* [PATCH 5/6] writeback: fix queue_io() ordering
  2010-07-11  2:06 [PATCH 0/6] writeback cleanups and trivial fixes Wu Fengguang
                   ` (3 preceding siblings ...)
  2010-07-11  2:07 ` [PATCH 4/6] writeback: dont redirty tail an inode with dirty pages Wu Fengguang
@ 2010-07-11  2:07 ` Wu Fengguang
  2010-07-12 22:15   ` Andrew Morton
  2010-07-11  2:07 ` [PATCH 6/6] writeback: merge for_kupdate and !for_kupdate cases Wu Fengguang
  2010-07-11  2:44 ` [PATCH 0/6] writeback cleanups and trivial fixes Christoph Hellwig
  6 siblings, 1 reply; 36+ messages in thread
From: Wu Fengguang @ 2010-07-11  2:07 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christoph Hellwig, Dave Chinner, Martin Bligh, Michael Rubin,
	Peter Zijlstra, Wu Fengguang, Jan Kara, Peter Zijlstra,
	linux-fsdevel, Linux Memory Management List, LKML

[-- Attachment #1: queue_io-fix.patch --]
[-- Type: text/plain, Size: 1452 bytes --]

This is not a bug at the moment, since b_io is empty when queue_io()
runs for kupdate writeback. However, the next patch will do requeue_io()
for non-kupdate writeback as well, so fix the ordering now.

CC: Dave Chinner <david@fromorbit.com>
Cc: Martin Bligh <mbligh@google.com>
Cc: Michael Rubin <mrubin@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/fs-writeback.c |    7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

--- linux-next.orig/fs/fs-writeback.c	2010-07-11 09:13:31.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2010-07-11 09:13:32.000000000 +0800
@@ -252,11 +252,14 @@ static void move_expired_inodes(struct l
 }
 
 /*
- * Queue all expired dirty inodes for io, eldest first.
+ * Queue all expired dirty inodes for io, eldest first:
+ * (newly dirtied) => b_dirty inodes
+ *                 => b_more_io inodes
+ *                 => remaining inodes in b_io => (dequeue for sync)
  */
 static void queue_io(struct bdi_writeback *wb, unsigned long *older_than_this)
 {
-	list_splice_init(&wb->b_more_io, wb->b_io.prev);
+	list_splice_init(&wb->b_more_io, &wb->b_io);
 	move_expired_inodes(&wb->b_dirty, &wb->b_io, older_than_this);
 }
 



* [PATCH 6/6] writeback: merge for_kupdate and !for_kupdate cases
  2010-07-11  2:06 [PATCH 0/6] writeback cleanups and trivial fixes Wu Fengguang
                   ` (4 preceding siblings ...)
  2010-07-11  2:07 ` [PATCH 5/6] writeback: fix queue_io() ordering Wu Fengguang
@ 2010-07-11  2:07 ` Wu Fengguang
  2010-07-12  2:08   ` Dave Chinner
  2010-07-11  2:44 ` [PATCH 0/6] writeback cleanups and trivial fixes Christoph Hellwig
  6 siblings, 1 reply; 36+ messages in thread
From: Wu Fengguang @ 2010-07-11  2:07 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christoph Hellwig, Dave Chinner, Martin Bligh, Michael Rubin,
	Peter Zijlstra, Wu Fengguang, Jan Kara, Peter Zijlstra,
	linux-fsdevel, Linux Memory Management List, LKML

[-- Attachment #1: writeback-merge-kupdate-cases.patch --]
[-- Type: text/plain, Size: 2842 bytes --]

Unify the logic for the kupdate and non-kupdate cases.
There won't be starvation, because the inodes requeued into b_more_io
will later be spliced _after_ the remaining inodes in b_io, and hence
won't stand in the way of other inodes in the next run.

This avoids unnecessary redirty_tail() calls, and hence updates of
dirtied_when. The timestamp update is undesirable because it could
later delay the inode's periodic writeback, or exclude the inode from
a data integrity sync operation (which checks timestamps to avoid
extra work and livelocks).

CC: Dave Chinner <david@fromorbit.com>
Cc: Martin Bligh <mbligh@google.com>
Cc: Michael Rubin <mrubin@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/fs-writeback.c |   39 ++++++---------------------------------
 1 file changed, 6 insertions(+), 33 deletions(-)

--- linux-next.orig/fs/fs-writeback.c	2010-07-11 09:13:32.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2010-07-11 09:13:36.000000000 +0800
@@ -373,45 +373,18 @@ writeback_single_inode(struct inode *ino
 		if (mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
 			/*
 			 * We didn't write back all the pages.  nfs_writepages()
-			 * sometimes bales out without doing anything. Redirty
-			 * the inode; Move it from b_io onto b_more_io/b_dirty.
+			 * sometimes bales out without doing anything.
 			 */
-			/*
-			 * akpm: if the caller was the kupdate function we put
-			 * this inode at the head of b_dirty so it gets first
-			 * consideration.  Otherwise, move it to the tail, for
-			 * the reasons described there.  I'm not really sure
-			 * how much sense this makes.  Presumably I had a good
-			 * reasons for doing it this way, and I'd rather not
-			 * muck with it at present.
-			 */
-			if (wbc->for_kupdate) {
+			inode->i_state |= I_DIRTY_PAGES;
+			if (wbc->nr_to_write <= 0) {
 				/*
-				 * For the kupdate function we move the inode
-				 * to b_more_io so it will get more writeout as
-				 * soon as the queue becomes uncongested.
+				 * slice used up: queue for next turn
 				 */
-				inode->i_state |= I_DIRTY_PAGES;
-				if (wbc->nr_to_write <= 0) {
-					/*
-					 * slice used up: queue for next turn
-					 */
-					requeue_io(inode);
-				} else {
-					/*
-					 * somehow blocked: retry later
-					 */
-					redirty_tail(inode);
-				}
+				requeue_io(inode);
 			} else {
 				/*
-				 * Otherwise fully redirty the inode so that
-				 * other inodes on this superblock will get some
-				 * writeout.  Otherwise heavy writing to one
-				 * file would indefinitely suspend writeout of
-				 * all the other files.
+				 * somehow blocked: retry later
 				 */
-				inode->i_state |= I_DIRTY_PAGES;
 				redirty_tail(inode);
 			}
 		} else if (inode->i_state & I_DIRTY) {




* Re: [PATCH 0/6] writeback cleanups and trivial fixes
  2010-07-11  2:06 [PATCH 0/6] writeback cleanups and trivial fixes Wu Fengguang
                   ` (5 preceding siblings ...)
  2010-07-11  2:07 ` [PATCH 6/6] writeback: merge for_kupdate and !for_kupdate cases Wu Fengguang
@ 2010-07-11  2:44 ` Christoph Hellwig
  2010-07-11  2:50   ` Wu Fengguang
  6 siblings, 1 reply; 36+ messages in thread
From: Christoph Hellwig @ 2010-07-11  2:44 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, Christoph Hellwig, Dave Chinner, Jan Kara,
	Peter Zijlstra, linux-fsdevel, Linux Memory Management List, LKML

On Sun, Jul 11, 2010 at 10:06:56AM +0800, Wu Fengguang wrote:
> Andrew,
> 
> Here are some writeback cleanups to avoid unnecessary calculation overheads,
> and relative simple bug fixes.
> 
> The patch applies to latest linux-next tree. The mmotm tree will need rebase
> to include commit 32422c79 (writeback: Add tracing to balance_dirty_pages)
> in order to avoid merge conflicts.

Maybe it's a better idea to get them in through Jens' tree?  At least he
has handled all my flusher thread patches, and I have another big series
in preparation which might cause some interesting conflicts if not
merged through the same tree.


* Re: [PATCH 0/6] writeback cleanups and trivial fixes
  2010-07-11  2:44 ` [PATCH 0/6] writeback cleanups and trivial fixes Christoph Hellwig
@ 2010-07-11  2:50   ` Wu Fengguang
  0 siblings, 0 replies; 36+ messages in thread
From: Wu Fengguang @ 2010-07-11  2:50 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, Andrew Morton, Dave Chinner, Jan Kara, Peter Zijlstra,
	linux-fsdevel@vger.kernel.org, Linux Memory Management List, LKML

[add cc to Jens]

On Sun, Jul 11, 2010 at 10:44:04AM +0800, Christoph Hellwig wrote:
> On Sun, Jul 11, 2010 at 10:06:56AM +0800, Wu Fengguang wrote:
> > Andrew,
> > 
> > Here are some writeback cleanups to avoid unnecessary calculation overheads,
> > and relative simple bug fixes.
> > 
> > The patch applies to latest linux-next tree. The mmotm tree will need rebase
> > to include commit 32422c79 (writeback: Add tracing to balance_dirty_pages)
> > in order to avoid merge conflicts.
> 
> Maybe it's a better idea to get them in through Jens' tree?  At least he
> has handled all my flusher thread patches, and I have another big series
> in preparation which might cause some interesting conflicts if not
> merged through the same tree.


* Re: [PATCH 4/6] writeback: dont redirty tail an inode with dirty pages
  2010-07-11  2:07 ` [PATCH 4/6] writeback: dont redirty tail an inode with dirty pages Wu Fengguang
@ 2010-07-12  2:01   ` Dave Chinner
  2010-07-12 15:31     ` Wu Fengguang
  0 siblings, 1 reply; 36+ messages in thread
From: Dave Chinner @ 2010-07-12  2:01 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, Christoph Hellwig, Jan Kara, Peter Zijlstra,
	linux-fsdevel, Linux Memory Management List, LKML

On Sun, Jul 11, 2010 at 10:07:00AM +0800, Wu Fengguang wrote:
> This avoids delaying writeback for an expired (XFS) inode with lots of
> dirty pages but no active dirtier at the moment. Previously we only did
> that for the kupdate case.
> 
> CC: Dave Chinner <david@fromorbit.com>
> CC: Christoph Hellwig <hch@infradead.org>
> Acked-by: Jan Kara <jack@suse.cz>
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> ---
>  fs/fs-writeback.c |   20 +++++++-------------
>  1 file changed, 7 insertions(+), 13 deletions(-)
> 
> --- linux-next.orig/fs/fs-writeback.c	2010-07-11 08:53:44.000000000 +0800
> +++ linux-next/fs/fs-writeback.c	2010-07-11 08:57:35.000000000 +0800
> @@ -367,18 +367,7 @@ writeback_single_inode(struct inode *ino
>  	spin_lock(&inode_lock);
>  	inode->i_state &= ~I_SYNC;
>  	if (!(inode->i_state & I_FREEING)) {
> -		if ((inode->i_state & I_DIRTY_PAGES) && wbc->for_kupdate) {
> -			/*
> -			 * More pages get dirtied by a fast dirtier.
> -			 */
> -			goto select_queue;
> -		} else if (inode->i_state & I_DIRTY) {
> -			/*
> -			 * At least XFS will redirty the inode during the
> -			 * writeback (delalloc) and on io completion (isize).
> -			 */
> -			redirty_tail(inode);
> -		} else if (mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
> +		if (mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
>  			/*
>  			 * We didn't write back all the pages.  nfs_writepages()
>  			 * sometimes bales out without doing anything. Redirty
> @@ -400,7 +389,6 @@ writeback_single_inode(struct inode *ino
>  				 * soon as the queue becomes uncongested.
>  				 */
>  				inode->i_state |= I_DIRTY_PAGES;
> -select_queue:
>  				if (wbc->nr_to_write <= 0) {
>  					/*
>  					 * slice used up: queue for next turn
> @@ -423,6 +411,12 @@ select_queue:
>  				inode->i_state |= I_DIRTY_PAGES;
>  				redirty_tail(inode);
>  			}
> +		} else if (inode->i_state & I_DIRTY) {
> +			/*
> +			 * At least XFS will redirty the inode during the
> +			 * writeback (delalloc) and on io completion (isize).
> +			 */
> +			redirty_tail(inode);

I'd drop the mention of XFS here - any filesystem that does delayed
allocation or unwritten extent conversion after IO completion will
cause this. Perhaps make the comment:

	/*
	 * Filesystems can dirty the inode during writeback
	 * operations, such as delayed allocation during submission
	 * or metadata updates after data IO completion.
	 */

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 6/6] writeback: merge for_kupdate and !for_kupdate cases
  2010-07-11  2:07 ` [PATCH 6/6] writeback: merge for_kupdate and !for_kupdate cases Wu Fengguang
@ 2010-07-12  2:08   ` Dave Chinner
  2010-07-12 15:52     ` Wu Fengguang
  0 siblings, 1 reply; 36+ messages in thread
From: Dave Chinner @ 2010-07-12  2:08 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, Christoph Hellwig, Martin Bligh, Michael Rubin,
	Peter Zijlstra, Jan Kara, Peter Zijlstra, linux-fsdevel,
	Linux Memory Management List, LKML

On Sun, Jul 11, 2010 at 10:07:02AM +0800, Wu Fengguang wrote:
> Unify the logic for kupdate and non-kupdate cases.
> There won't be starvation because the inodes requeued into b_more_io
> will later be spliced _after_ the remaining inodes in b_io, hence won't
> stand in the way of other inodes in the next run.
> 
> It avoids unnecessary redirty_tail() calls, and hence unnecessary updates
> of i_dirtied_when. The timestamp update is undesirable because it could
> later delay the inode's periodic writeback, or exclude the inode from
> the data integrity sync operation (which checks timestamps to avoid
> extra work and livelock).
> 
> CC: Dave Chinner <david@fromorbit.com>
> Cc: Martin Bligh <mbligh@google.com>
> Cc: Michael Rubin <mrubin@google.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> ---
>  fs/fs-writeback.c |   39 ++++++---------------------------------
>  1 file changed, 6 insertions(+), 33 deletions(-)
> 
> --- linux-next.orig/fs/fs-writeback.c	2010-07-11 09:13:32.000000000 +0800
> +++ linux-next/fs/fs-writeback.c	2010-07-11 09:13:36.000000000 +0800
> @@ -373,45 +373,18 @@ writeback_single_inode(struct inode *ino
>  		if (mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
>  			/*
>  			 * We didn't write back all the pages.  nfs_writepages()
> -			 * sometimes bales out without doing anything. Redirty
> -			 * the inode; Move it from b_io onto b_more_io/b_dirty.
> +			 * sometimes bales out without doing anything.
>  			 */
> -			/*
> -			 * akpm: if the caller was the kupdate function we put
> -			 * this inode at the head of b_dirty so it gets first
> -			 * consideration.  Otherwise, move it to the tail, for
> -			 * the reasons described there.  I'm not really sure
> -			 * how much sense this makes.  Presumably I had a good
> -			 * reasons for doing it this way, and I'd rather not
> -			 * muck with it at present.
> -			 */
> -			if (wbc->for_kupdate) {
> +			inode->i_state |= I_DIRTY_PAGES;
> +			if (wbc->nr_to_write <= 0) {
>  				/*
> -				 * For the kupdate function we move the inode
> -				 * to b_more_io so it will get more writeout as
> -				 * soon as the queue becomes uncongested.
> +				 * slice used up: queue for next turn
>  				 */
> -				inode->i_state |= I_DIRTY_PAGES;
> -				if (wbc->nr_to_write <= 0) {
> -					/*
> -					 * slice used up: queue for next turn
> -					 */
> -					requeue_io(inode);
> -				} else {
> -					/*
> -					 * somehow blocked: retry later
> -					 */
> -					redirty_tail(inode);
> -				}
> +				requeue_io(inode);
>  			} else {
>  				/*
> -				 * Otherwise fully redirty the inode so that
> -				 * other inodes on this superblock will get some
> -				 * writeout.  Otherwise heavy writing to one
> -				 * file would indefinitely suspend writeout of
> -				 * all the other files.
> +				 * somehow blocked: retry later
>  				 */
> -				inode->i_state |= I_DIRTY_PAGES;
>  				redirty_tail(inode);
>  			}

This means that congestion will always trigger redirty_tail(). Is
that really what we want for that case? Also, I'd prefer that the
comments remain somewhat more descriptive of the circumstances that
we are operating under. Comments like "retry later to avoid blocking
writeback of other inodes" are far, far better than "retry later"
because they have a "why" component that explains the reason for the
logic. You may remember why, but I sure won't in a few months' time....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 4/6] writeback: dont redirty tail an inode with dirty pages
  2010-07-12  2:01   ` Dave Chinner
@ 2010-07-12 15:31     ` Wu Fengguang
  2010-07-12 22:13       ` Andrew Morton
  0 siblings, 1 reply; 36+ messages in thread
From: Wu Fengguang @ 2010-07-12 15:31 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Andrew Morton, Christoph Hellwig, Jan Kara, Peter Zijlstra,
	linux-fsdevel@vger.kernel.org, Linux Memory Management List, LKML

> > +		} else if (inode->i_state & I_DIRTY) {
> > +			/*
> > +			 * At least XFS will redirty the inode during the
> > +			 * writeback (delalloc) and on io completion (isize).
> > +			 */
> > +			redirty_tail(inode);
> 
> I'd drop the mention of XFS here - any filesystem that does delayed
> allocation or unwritten extent conversion after IO completion will
> cause this. Perhaps make the comment:
> 
> 	/*
> 	 * Filesystems can dirty the inode during writeback
> 	 * operations, such as delayed allocation during submission
> 	 * or metadata updates after data IO completion.
> 	 */

Thanks, comments updated accordingly.

---
writeback: don't redirty tail an inode with dirty pages

This avoids delaying writeback for an expired (XFS) inode with lots of
dirty pages but no active dirtier at the moment. Previously we only did
that for the kupdate case.

CC: Dave Chinner <david@fromorbit.com>
CC: Christoph Hellwig <hch@infradead.org>
Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/fs-writeback.c |   22 +++++++++-------------
 1 file changed, 9 insertions(+), 13 deletions(-)

--- linux-next.orig/fs/fs-writeback.c	2010-07-11 09:13:30.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2010-07-12 23:26:06.000000000 +0800
@@ -367,18 +367,7 @@ writeback_single_inode(struct inode *ino
 	spin_lock(&inode_lock);
 	inode->i_state &= ~I_SYNC;
 	if (!(inode->i_state & I_FREEING)) {
-		if ((inode->i_state & I_DIRTY_PAGES) && wbc->for_kupdate) {
-			/*
-			 * More pages get dirtied by a fast dirtier.
-			 */
-			goto select_queue;
-		} else if (inode->i_state & I_DIRTY) {
-			/*
-			 * At least XFS will redirty the inode during the
-			 * writeback (delalloc) and on io completion (isize).
-			 */
-			redirty_tail(inode);
-		} else if (mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
+		if (mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
 			/*
 			 * We didn't write back all the pages.  nfs_writepages()
 			 * sometimes bales out without doing anything. Redirty
@@ -400,7 +389,6 @@ writeback_single_inode(struct inode *ino
 				 * soon as the queue becomes uncongested.
 				 */
 				inode->i_state |= I_DIRTY_PAGES;
-select_queue:
 				if (wbc->nr_to_write <= 0) {
 					/*
 					 * slice used up: queue for next turn
@@ -423,6 +411,14 @@ select_queue:
 				inode->i_state |= I_DIRTY_PAGES;
 				redirty_tail(inode);
 			}
+		} else if (inode->i_state & I_DIRTY) {
+			/*
+			 * Filesystems can dirty the inode during writeback
+			 * operations, such as delayed allocation during
+			 * submission or metadata updates after data IO
+			 * completion.
+			 */
+			redirty_tail(inode);
 		} else if (atomic_read(&inode->i_count)) {
 			/*
 			 * The inode is clean, inuse


* Re: [PATCH 6/6] writeback: merge for_kupdate and !for_kupdate cases
  2010-07-12  2:08   ` Dave Chinner
@ 2010-07-12 15:52     ` Wu Fengguang
  2010-07-12 22:06       ` Dave Chinner
  2010-07-12 22:22       ` Andrew Morton
  0 siblings, 2 replies; 36+ messages in thread
From: Wu Fengguang @ 2010-07-12 15:52 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Andrew Morton, Christoph Hellwig, Martin Bligh, Michael Rubin,
	Peter Zijlstra, Jan Kara, Peter Zijlstra,
	linux-fsdevel@vger.kernel.org, Linux Memory Management List, LKML

On Mon, Jul 12, 2010 at 10:08:42AM +0800, Dave Chinner wrote:
> On Sun, Jul 11, 2010 at 10:07:02AM +0800, Wu Fengguang wrote:
> > Unify the logic for kupdate and non-kupdate cases.
> > There won't be starvation because the inodes requeued into b_more_io
> > will later be spliced _after_ the remaining inodes in b_io, hence won't
> > stand in the way of other inodes in the next run.
> > 
> > It avoids unnecessary redirty_tail() calls, and hence unnecessary updates
> > of i_dirtied_when. The timestamp update is undesirable because it could
> > later delay the inode's periodic writeback, or exclude the inode from
> > the data integrity sync operation (which checks timestamps to avoid
> > extra work and livelock).
> > 
> > CC: Dave Chinner <david@fromorbit.com>
> > Cc: Martin Bligh <mbligh@google.com>
> > Cc: Michael Rubin <mrubin@google.com>
> > Cc: Peter Zijlstra <peterz@infradead.org>
> > Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
> > Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> > ---
> >  fs/fs-writeback.c |   39 ++++++---------------------------------
> >  1 file changed, 6 insertions(+), 33 deletions(-)
> > 
> > --- linux-next.orig/fs/fs-writeback.c	2010-07-11 09:13:32.000000000 +0800
> > +++ linux-next/fs/fs-writeback.c	2010-07-11 09:13:36.000000000 +0800
> > @@ -373,45 +373,18 @@ writeback_single_inode(struct inode *ino
> >  		if (mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
> >  			/*
> >  			 * We didn't write back all the pages.  nfs_writepages()
> > -			 * sometimes bales out without doing anything. Redirty
> > -			 * the inode; Move it from b_io onto b_more_io/b_dirty.
> > +			 * sometimes bales out without doing anything.
> >  			 */
> > -			/*
> > -			 * akpm: if the caller was the kupdate function we put
> > -			 * this inode at the head of b_dirty so it gets first
> > -			 * consideration.  Otherwise, move it to the tail, for
> > -			 * the reasons described there.  I'm not really sure
> > -			 * how much sense this makes.  Presumably I had a good
> > -			 * reasons for doing it this way, and I'd rather not
> > -			 * muck with it at present.
> > -			 */
> > -			if (wbc->for_kupdate) {
> > +			inode->i_state |= I_DIRTY_PAGES;
> > +			if (wbc->nr_to_write <= 0) {
> >  				/*
> > -				 * For the kupdate function we move the inode
> > -				 * to b_more_io so it will get more writeout as
> > -				 * soon as the queue becomes uncongested.
> > +				 * slice used up: queue for next turn
> >  				 */
> > -				inode->i_state |= I_DIRTY_PAGES;
> > -				if (wbc->nr_to_write <= 0) {
> > -					/*
> > -					 * slice used up: queue for next turn
> > -					 */
> > -					requeue_io(inode);
> > -				} else {
> > -					/*
> > -					 * somehow blocked: retry later
> > -					 */
> > -					redirty_tail(inode);
> > -				}
> > +				requeue_io(inode);
> >  			} else {
> >  				/*
> > -				 * Otherwise fully redirty the inode so that
> > -				 * other inodes on this superblock will get some
> > -				 * writeout.  Otherwise heavy writing to one
> > -				 * file would indefinitely suspend writeout of
> > -				 * all the other files.
> > +				 * somehow blocked: retry later
> >  				 */
> > -				inode->i_state |= I_DIRTY_PAGES;
> >  				redirty_tail(inode);
> >  			}
> 
> This means that congestion will always trigger redirty_tail(). Is
> that really what we want for that case?

This patch actually converts some redirty_tail() cases to use
requeue_io(), so it reduces the use of redirty_tail(). Also,
recent kernels block _inside_ get_request() on congestion
instead of returning to writeback_single_inode() on congestion.
So the "somehow blocked" comment for redirty_tail() no longer covers
the congestion case.

> > Also, I'd prefer that the
> > comments remain somewhat more descriptive of the circumstances that
> > we are operating under. Comments like "retry later to avoid blocking
> > writeback of other inodes" are far, far better than "retry later"
> > because they have a "why" component that explains the reason for the
> > logic. You may remember why, but I sure won't in a few months' time....

Ah yes, the comment is too simple. However, redirty_tail() is not to
avoid blocking writeback of other inodes, but to avoid eating 100% CPU
busy-retrying a dirty inode/page that cannot perform writeback for
a while. (In theory redirty_tail() can still busy-retry, when
there is only a single dirty inode.) So how about

        /*
         * somehow blocked: avoid busy retrying
         */

Thanks,
Fengguang


* Re: [PATCH 1/6] writeback: take account of NR_WRITEBACK_TEMP in balance_dirty_pages()
  2010-07-11  2:06 ` [PATCH 1/6] writeback: take account of NR_WRITEBACK_TEMP in balance_dirty_pages() Wu Fengguang
@ 2010-07-12 21:52   ` Andrew Morton
  2010-07-13  8:58     ` Miklos Szeredi
  0 siblings, 1 reply; 36+ messages in thread
From: Andrew Morton @ 2010-07-12 21:52 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Christoph Hellwig, Richard Kennedy, Dave Chinner, Jan Kara,
	Peter Zijlstra, linux-fsdevel, Linux Memory Management List, LKML,
	Miklos Szeredi

On Sun, 11 Jul 2010 10:06:57 +0800
Wu Fengguang <fengguang.wu@intel.com> wrote:

> 
> Signed-off-by: Richard Kennedy <richard@rsk.demon.co.uk>
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> ---
>  mm/page-writeback.c |    7 ++++---
>  1 file changed, 4 insertions(+), 3 deletions(-)
> 
> --- linux-next.orig/mm/page-writeback.c	2010-07-11 08:41:37.000000000 +0800
> +++ linux-next/mm/page-writeback.c	2010-07-11 08:42:14.000000000 +0800
> @@ -503,11 +503,12 @@ static void balance_dirty_pages(struct a
>  		};
>  
>  		get_dirty_limits(&background_thresh, &dirty_thresh,
> -				&bdi_thresh, bdi);
> +				 &bdi_thresh, bdi);
>  
>  		nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
> -					global_page_state(NR_UNSTABLE_NFS);
> -		nr_writeback = global_page_state(NR_WRITEBACK);
> +				 global_page_state(NR_UNSTABLE_NFS);
> +		nr_writeback = global_page_state(NR_WRITEBACK) +
> +			       global_page_state(NR_WRITEBACK_TEMP);
>  
>  		bdi_nr_reclaimable = bdi_stat(bdi, BDI_RECLAIMABLE);
>  		bdi_nr_writeback = bdi_stat(bdi, BDI_WRITEBACK);
> 

hm, OK.

I wonder whether we could/should have unified NR_WRITEBACK_TEMP and
NR_UNSTABLE_NFS.  Their "meanings" aren't quite the same, but perhaps
some "treat page as dirty because the fs is futzing with it" thing.


* Re: [PATCH 3/6] writeback: avoid unnecessary calculation of bdi dirty thresholds
  2010-07-11  2:06 ` [PATCH 3/6] writeback: avoid unnecessary calculation of bdi dirty thresholds Wu Fengguang
@ 2010-07-12 21:56   ` Andrew Morton
  2010-07-15 14:55     ` Wu Fengguang
  2010-07-19 21:35   ` Andrew Morton
  2010-08-03 15:03   ` Peter Zijlstra
  2 siblings, 1 reply; 36+ messages in thread
From: Andrew Morton @ 2010-07-12 21:56 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Christoph Hellwig, Peter Zijlstra, Dave Chinner, Jan Kara,
	linux-fsdevel, Linux Memory Management List, LKML

On Sun, 11 Jul 2010 10:06:59 +0800
Wu Fengguang <fengguang.wu@intel.com> wrote:

> +void global_dirty_limits(unsigned long *pbackground, unsigned long *pdirty)
>
> ...
>
> +unsigned long bdi_dirty_limit(struct backing_dev_info *bdi,
> +			       unsigned long dirty)

It'd be nice to have some documentation for these things.  They're
non-static, non-obvious and are stuffed to the gills with secret magic
numbers.


* Re: [PATCH 6/6] writeback: merge for_kupdate and !for_kupdate cases
  2010-07-12 15:52     ` Wu Fengguang
@ 2010-07-12 22:06       ` Dave Chinner
  2010-07-12 22:22       ` Andrew Morton
  1 sibling, 0 replies; 36+ messages in thread
From: Dave Chinner @ 2010-07-12 22:06 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, Christoph Hellwig, Martin Bligh, Michael Rubin,
	Peter Zijlstra, Jan Kara, Peter Zijlstra,
	linux-fsdevel@vger.kernel.org, Linux Memory Management List, LKML

On Mon, Jul 12, 2010 at 11:52:39PM +0800, Wu Fengguang wrote:
> On Mon, Jul 12, 2010 at 10:08:42AM +0800, Dave Chinner wrote:
> > On Sun, Jul 11, 2010 at 10:07:02AM +0800, Wu Fengguang wrote:
> > > -			/*
> > > -			 * akpm: if the caller was the kupdate function we put
> > > -			 * this inode at the head of b_dirty so it gets first
> > > -			 * consideration.  Otherwise, move it to the tail, for
> > > -			 * the reasons described there.  I'm not really sure
> > > -			 * how much sense this makes.  Presumably I had a good
> > > -			 * reasons for doing it this way, and I'd rather not
> > > -			 * muck with it at present.
> > > -			 */
> > > -			if (wbc->for_kupdate) {
> > > +			inode->i_state |= I_DIRTY_PAGES;
> > > +			if (wbc->nr_to_write <= 0) {
> > >  				/*
> > > -				 * For the kupdate function we move the inode
> > > -				 * to b_more_io so it will get more writeout as
> > > -				 * soon as the queue becomes uncongested.
> > > +				 * slice used up: queue for next turn
> > >  				 */
> > > -				inode->i_state |= I_DIRTY_PAGES;
> > > -				if (wbc->nr_to_write <= 0) {
> > > -					/*
> > > -					 * slice used up: queue for next turn
> > > -					 */
> > > -					requeue_io(inode);
> > > -				} else {
> > > -					/*
> > > -					 * somehow blocked: retry later
> > > -					 */
> > > -					redirty_tail(inode);
> > > -				}
> > > +				requeue_io(inode);
> > >  			} else {
> > >  				/*
> > > -				 * Otherwise fully redirty the inode so that
> > > -				 * other inodes on this superblock will get some
> > > -				 * writeout.  Otherwise heavy writing to one
> > > -				 * file would indefinitely suspend writeout of
> > > -				 * all the other files.
> > > +				 * somehow blocked: retry later
> > >  				 */
> > > -				inode->i_state |= I_DIRTY_PAGES;
> > >  				redirty_tail(inode);
> > >  			}
> > 
> > This means that congestion will always trigger redirty_tail(). Is
> > that really what we want for that case?
> 
> This patch actually converts some redirty_tail() cases to use
> requeue_io(), so it reduces the use of redirty_tail(). Also,
> recent kernels block _inside_ get_request() on congestion
> instead of returning to writeback_single_inode() on congestion.
> So the "somehow blocked" comment for redirty_tail() no longer covers
> the congestion case.

Shouldn't some of this be in the comment, explaining why the tail is
redirtied rather than requeued?

> > Also, I'd prefer that the
> > comments remain somewhat more descriptive of the circumstances that
> > we are operating under. Comments like "retry later to avoid blocking
> > writeback of other inodes" is far, far better than "retry later"
> > because it has "why" component that explains the reason for the
> > logic. You may remember why, but I sure won't in a few months time....
> 
> Ah yes, the comment is too simple. However, redirty_tail() is not to
> avoid blocking writeback of other inodes, but to avoid eating 100% CPU
> busy-retrying a dirty inode/page that cannot perform writeback for
> a while. (In theory redirty_tail() can still busy-retry, when
> there is only a single dirty inode.) So how about
> 
>         /*
>          * somehow blocked: avoid busy retrying
>          */

IMO, no better than "somehow blocked: retry later" because it
doesn't include any of the explanation for the code you just gave
me.  The comment needs to tell us _why_ we are calling
redirty_tail(), not what redirty_tail() does. Perhaps something like:

	/*
	 * Writeback blocked by something other than congestion.
	 * Redirty the inode to avoid spinning on the CPU retrying
	 * writeback of the dirty page/inode that cannot be
	 * performed immediately. This allows writeback of other
	 * inodes until the blocking condition clears.
	 */

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 4/6] writeback: dont redirty tail an inode with dirty pages
  2010-07-12 15:31     ` Wu Fengguang
@ 2010-07-12 22:13       ` Andrew Morton
  2010-07-15 15:35         ` Wu Fengguang
  0 siblings, 1 reply; 36+ messages in thread
From: Andrew Morton @ 2010-07-12 22:13 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Dave Chinner, Christoph Hellwig, Jan Kara, Peter Zijlstra,
	linux-fsdevel@vger.kernel.org, Linux Memory Management List, LKML

On Mon, 12 Jul 2010 23:31:27 +0800
Wu Fengguang <fengguang.wu@intel.com> wrote:

> > > +		} else if (inode->i_state & I_DIRTY) {
> > > +			/*
> > > +			 * At least XFS will redirty the inode during the
> > > +			 * writeback (delalloc) and on io completion (isize).
> > > +			 */
> > > +			redirty_tail(inode);
> > 
> > I'd drop the mention of XFS here - any filesystem that does delayed
> > allocation or unwritten extent conversion after IO completion will
> > cause this. Perhaps make the comment:
> > 
> > 	/*
> > 	 * Filesystems can dirty the inode during writeback
> > 	 * operations, such as delayed allocation during submission
> > 	 * or metadata updates after data IO completion.
> > 	 */
> 
> Thanks, comments updated accordingly.
> 
> ---
> writeback: don't redirty tail an inode with dirty pages
> 
> This avoids delaying writeback for an expired (XFS) inode with lots of
> dirty pages but no active dirtier at the moment. Previously we only did
> that for the kupdate case.
> 

You didn't actually explain the _reason_ for making this change. 
Please always do that.

The patch is...  surprisingly complicated, although the end result
looks OK.  This is not aided by the partial duplication between
mapping_tagged(PAGECACHE_TAG_DIRTY) and I_DIRTY_PAGES.  I don't think
we can easily remove I_DIRTY_PAGES because it's used for the
did-someone-just-dirty-a-page test here.

This code is way too complex and fragile and I fear that anything we do
to it will break something :(


* Re: [PATCH 5/6] writeback: fix queue_io() ordering
  2010-07-11  2:07 ` [PATCH 5/6] writeback: fix queue_io() ordering Wu Fengguang
@ 2010-07-12 22:15   ` Andrew Morton
  0 siblings, 0 replies; 36+ messages in thread
From: Andrew Morton @ 2010-07-12 22:15 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Christoph Hellwig, Dave Chinner, Martin Bligh, Michael Rubin,
	Peter Zijlstra, Jan Kara, Peter Zijlstra, linux-fsdevel,
	Linux Memory Management List, LKML

On Sun, 11 Jul 2010 10:07:01 +0800
Wu Fengguang <fengguang.wu@intel.com> wrote:

> This was not a bug, since b_io is empty for kupdate writeback.
> The next patch will do requeue_io() for non-kupdate writeback,
> so let's fix it.
> 
> CC: Dave Chinner <david@fromorbit.com>
> Cc: Martin Bligh <mbligh@google.com>
> Cc: Michael Rubin <mrubin@google.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>

I assumed you didn't mean to sign this twice so I removed this signoff.

> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> ---
>  fs/fs-writeback.c |    7 +++++--
>  1 file changed, 5 insertions(+), 2 deletions(-)
> 
> --- linux-next.orig/fs/fs-writeback.c	2010-07-11 09:13:31.000000000 +0800
> +++ linux-next/fs/fs-writeback.c	2010-07-11 09:13:32.000000000 +0800
> @@ -252,11 +252,14 @@ static void move_expired_inodes(struct l
>  }
>  
>  /*
> - * Queue all expired dirty inodes for io, eldest first.
> + * Queue all expired dirty inodes for io, eldest first:
> + * (newly dirtied) => b_dirty inodes
> + *                 => b_more_io inodes
> + *                 => remaining inodes in b_io => (dequeue for sync)
>   */
>  static void queue_io(struct bdi_writeback *wb, unsigned long *older_than_this)
>  {
> -	list_splice_init(&wb->b_more_io, wb->b_io.prev);
> +	list_splice_init(&wb->b_more_io, &wb->b_io);
>  	move_expired_inodes(&wb->b_dirty, &wb->b_io, older_than_this);
>  }


* Re: [PATCH 6/6] writeback: merge for_kupdate and !for_kupdate cases
  2010-07-12 15:52     ` Wu Fengguang
  2010-07-12 22:06       ` Dave Chinner
@ 2010-07-12 22:22       ` Andrew Morton
  2010-08-05 16:01         ` Wu Fengguang
  1 sibling, 1 reply; 36+ messages in thread
From: Andrew Morton @ 2010-07-12 22:22 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Dave Chinner, Christoph Hellwig, Martin Bligh, Michael Rubin,
	Peter Zijlstra, Jan Kara, Peter Zijlstra,
	linux-fsdevel@vger.kernel.org, Linux Memory Management List, LKML

On Mon, 12 Jul 2010 23:52:39 +0800
Wu Fengguang <fengguang.wu@intel.com> wrote:

> > Also, I'd prefer that the
> > comments remain somewhat more descriptive of the circumstances that
> > we are operating under. Comments like "retry later to avoid blocking
> > writeback of other inodes" are far, far better than "retry later"
> > because they have a "why" component that explains the reason for the
> > logic. You may remember why, but I sure won't in a few months' time....

me2 (of course).  This code is waaaay too complex to be scrimping on comments.

> Ah yes, the comment is too simple. However, redirty_tail() is not to
> avoid blocking writeback of other inodes, but to avoid eating 100% CPU
> busy-retrying a dirty inode/page that cannot perform writeback for
> a while. (In theory redirty_tail() can still busy-retry, when
> there is only a single dirty inode.) So how about
> 
>         /*
>          * somehow blocked: avoid busy retrying
>          */

That's much too short.  Expand on the "somehow" - provide an example,
describe the common/expected cause.  Fully explain what the "busy"
retry _is_ and how it can come about.


* Re: [PATCH 1/6] writeback: take account of NR_WRITEBACK_TEMP in balance_dirty_pages()
  2010-07-12 21:52   ` Andrew Morton
@ 2010-07-13  8:58     ` Miklos Szeredi
  2010-07-15 14:50       ` Wu Fengguang
  0 siblings, 1 reply; 36+ messages in thread
From: Miklos Szeredi @ 2010-07-13  8:58 UTC (permalink / raw)
  To: Andrew Morton
  Cc: fengguang.wu, hch, richard, david, jack, a.p.zijlstra,
	linux-fsdevel, linux-mm, linux-kernel, miklos

On Mon, 12 Jul 2010, Andrew Morton wrote:
> On Sun, 11 Jul 2010 10:06:57 +0800
> Wu Fengguang <fengguang.wu@intel.com> wrote:
> 
> > 
> > Signed-off-by: Richard Kennedy <richard@rsk.demon.co.uk>
> > Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> > ---
> >  mm/page-writeback.c |    7 ++++---
> >  1 file changed, 4 insertions(+), 3 deletions(-)
> > 
> > --- linux-next.orig/mm/page-writeback.c	2010-07-11 08:41:37.000000000 +0800
> > +++ linux-next/mm/page-writeback.c	2010-07-11 08:42:14.000000000 +0800
> > @@ -503,11 +503,12 @@ static void balance_dirty_pages(struct a
> >  		};
> >  
> >  		get_dirty_limits(&background_thresh, &dirty_thresh,
> > -				&bdi_thresh, bdi);
> > +				 &bdi_thresh, bdi);
> >  
> >  		nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
> > -					global_page_state(NR_UNSTABLE_NFS);
> > -		nr_writeback = global_page_state(NR_WRITEBACK);
> > +				 global_page_state(NR_UNSTABLE_NFS);
> > +		nr_writeback = global_page_state(NR_WRITEBACK) +
> > +			       global_page_state(NR_WRITEBACK_TEMP);
> >  
> >  		bdi_nr_reclaimable = bdi_stat(bdi, BDI_RECLAIMABLE);
> >  		bdi_nr_writeback = bdi_stat(bdi, BDI_WRITEBACK);
> > 
> 
> hm, OK.

Hm, hm.  I'm not sure this is right.  The VM has absolutely no control
over NR_WRITEBACK_TEMP pages, they may clear quickly or may not make
any progress.  So it's usually wrong to make a decision based on
NR_WRITEBACK_TEMP for an unrelated device.

Using it in throttle_vm_writeout() would actually be deadlocky, since
the userspace filesystem will probably depend on memory allocations to
complete the writeout.

The only place where we should be taking NR_WRITEBACK_TEMP into
account is when calculating the remaining memory that can be divided
between dirtiers, and that's clip_bdi_dirty_limit(), where it is
already used.

> I wonder whether we could/should have unified NR_WRITEBACK_TEMP and
> NR_UNSTABLE_NFS.  Their "meanings" aren't quite the same, but perhaps
> some "treat page as dirty because the fs is futzing with it" thing.

AFAICS NR_UNSTABLE_NFS is something akin to NR_DIRTY, only on the
server side.  So nfs can very much do something about making
NR_UNSTABLE_NFS go away, while there's nothing that can be done about
NR_WRITEBACK_TEMP.

Thanks,
Miklos


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 1/6] writeback: take account of NR_WRITEBACK_TEMP in balance_dirty_pages()
  2010-07-13  8:58     ` Miklos Szeredi
@ 2010-07-15 14:50       ` Wu Fengguang
  0 siblings, 0 replies; 36+ messages in thread
From: Wu Fengguang @ 2010-07-15 14:50 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Andrew Morton, hch@infradead.org, richard@rsk.demon.co.uk,
	david@fromorbit.com, jack@suse.cz, a.p.zijlstra@chello.nl,
	linux-fsdevel@vger.kernel.org, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org

On Tue, Jul 13, 2010 at 04:58:47PM +0800, Miklos Szeredi wrote:
> On Mon, 12 Jul 2010, Andrew Morton wrote:
> > On Sun, 11 Jul 2010 10:06:57 +0800
> > Wu Fengguang <fengguang.wu@intel.com> wrote:
> > 
> > > 
> > > Signed-off-by: Richard Kennedy <richard@rsk.demon.co.uk>
> > > Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> > > ---
> > >  mm/page-writeback.c |    7 ++++---
> > >  1 file changed, 4 insertions(+), 3 deletions(-)
> > > 
> > > --- linux-next.orig/mm/page-writeback.c	2010-07-11 08:41:37.000000000 +0800
> > > +++ linux-next/mm/page-writeback.c	2010-07-11 08:42:14.000000000 +0800
> > > @@ -503,11 +503,12 @@ static void balance_dirty_pages(struct a
> > >  		};
> > >  
> > >  		get_dirty_limits(&background_thresh, &dirty_thresh,
> > > -				&bdi_thresh, bdi);
> > > +				 &bdi_thresh, bdi);
> > >  
> > >  		nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
> > > -					global_page_state(NR_UNSTABLE_NFS);
> > > -		nr_writeback = global_page_state(NR_WRITEBACK);
> > > +				 global_page_state(NR_UNSTABLE_NFS);
> > > +		nr_writeback = global_page_state(NR_WRITEBACK) +
> > > +			       global_page_state(NR_WRITEBACK_TEMP);
> > >  
> > >  		bdi_nr_reclaimable = bdi_stat(bdi, BDI_RECLAIMABLE);
> > >  		bdi_nr_writeback = bdi_stat(bdi, BDI_WRITEBACK);
> > > 
> > 
> > hm, OK.
> 
> Hm, hm.  I'm not sure this is right.  The VM has absolutely no control
> over NR_WRITEBACK_TEMP pages, they may clear quickly or may not make
> any progress.  So it's usually wrong to make a decision based on
> NR_WRITEBACK_TEMP for an unrelated device.

Ah OK, let's remove this patch.

> Using it in throttle_vm_writeout() would actually be deadlocky, since
> the userspace filesystem will probably depend on memory allocations to
> complete the writeout.

Right.

> The only place where we should be taking NR_WRITEBACK_TEMP into
> account is when calculating the remaining memory that can be divided
> between dirtiers, and that's clip_bdi_dirty_limit(), where it is
> already used.

clip_bdi_dirty_limit() is removed in the next patch, hopefully it's OK.

> > I wonder whether we could/should have unified NR_WRITEBACK_TEMP and
> > NR_UNSTABLE_NFS.  Their "meanings" aren't quite the same, but perhaps
> > some "treat page as dirty because the fs is futzing with it" thing.
> 
> AFAICS NR_UNSTABLE_NFS is something akin to NR_DIRTY, only on the
> server side.  So nfs can very much do something about making
> NR_UNSTABLE_NFS go away, while there's nothing that can be done about
> NR_WRITEBACK_TEMP.

Right. nfs_write_inode() normally tries to commit unstable pages.

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 3/6] writeback: avoid unnecessary calculation of bdi dirty thresholds
  2010-07-12 21:56   ` Andrew Morton
@ 2010-07-15 14:55     ` Wu Fengguang
  0 siblings, 0 replies; 36+ messages in thread
From: Wu Fengguang @ 2010-07-15 14:55 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christoph Hellwig, Peter Zijlstra, Dave Chinner, Jan Kara,
	linux-fsdevel@vger.kernel.org, Linux Memory Management List, LKML

On Tue, Jul 13, 2010 at 05:56:43AM +0800, Andrew Morton wrote:
> On Sun, 11 Jul 2010 10:06:59 +0800
> Wu Fengguang <fengguang.wu@intel.com> wrote:
> 
> > +void global_dirty_limits(unsigned long *pbackground, unsigned long *pdirty)
> >
> > ...
> >
> > +unsigned long bdi_dirty_limit(struct backing_dev_info *bdi,
> > +			       unsigned long dirty)
> 
> It'd be nice to have some documentation for these things.  They're
> non-static, non-obvious and are stuffed to the gills with secret magic
> numbers.

Good suggestion, here is an attempt to document the functions.

Thanks,
Fengguang
---
Subject: add comment to the dirty limit functions
From: Wu Fengguang <fengguang.wu@intel.com>
Date: Thu Jul 15 09:54:25 CST 2010

Document global_dirty_limits() and bdi_dirty_limit().

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/page-writeback.c |   23 +++++++++++++++++++++--
 1 file changed, 21 insertions(+), 2 deletions(-)

--- linux-next.orig/mm/page-writeback.c	2010-07-15 08:20:32.000000000 +0800
+++ linux-next/mm/page-writeback.c	2010-07-15 10:39:41.000000000 +0800
@@ -390,6 +390,15 @@ unsigned long determine_dirtyable_memory
 	return x + 1;	/* Ensure that we never return 0 */
 }
 
+/**
+ * global_dirty_limits - background writeback and dirty throttling thresholds
+ *
+ * Calculate the dirty thresholds based on sysctl parameters
+ * - vm.dirty_background_ratio  or  vm.dirty_background_bytes
+ * - vm.dirty_ratio             or  vm.dirty_bytes
+ * The dirty limits will be lifted by 1/4 for PF_LESS_THROTTLE (ie. nfsd) and
+ * real-time tasks.
+ */
 void global_dirty_limits(unsigned long *pbackground, unsigned long *pdirty)
 {
 	unsigned long background;
@@ -424,8 +433,18 @@ void global_dirty_limits(unsigned long *
 	*pdirty = dirty;
 }
 
-unsigned long bdi_dirty_limit(struct backing_dev_info *bdi,
-			       unsigned long dirty)
+/**
+ * bdi_dirty_limit - current task's share of dirty throttling threshold on @bdi
+ *
+ * Once the global dirty limit is _exceeded_, all dirtiers will be throttled.
+ * To avoid starving fast devices (which can sync dirty pages in short time) or
+ * throttling light dirtiers, we start throttling individual tasks on a per-bdi
+ * basis when _approaching_ the global dirty limit. Relatively high limits
+ * will be allocated to fast devices and/or light dirtiers. The bdi's dirty
+ * share adapts to its throughput and is bounded by the bdi->min_ratio
+ * and/or bdi->max_ratio parameters, if set.
+ */
+unsigned long bdi_dirty_limit(struct backing_dev_info *bdi, unsigned long dirty)
 {
 	u64 bdi_dirty;
 	long numerator, denominator;

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 4/6] writeback: dont redirty tail an inode with dirty pages
  2010-07-12 22:13       ` Andrew Morton
@ 2010-07-15 15:35         ` Wu Fengguang
  0 siblings, 0 replies; 36+ messages in thread
From: Wu Fengguang @ 2010-07-15 15:35 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Dave Chinner, Christoph Hellwig, Jan Kara, Peter Zijlstra,
	linux-fsdevel@vger.kernel.org, Linux Memory Management List, LKML,
	David Howells

On Tue, Jul 13, 2010 at 06:13:17AM +0800, Andrew Morton wrote:
> On Mon, 12 Jul 2010 23:31:27 +0800
> Wu Fengguang <fengguang.wu@intel.com> wrote:
> 
> > > > +		} else if (inode->i_state & I_DIRTY) {
> > > > +			/*
> > > > +			 * At least XFS will redirty the inode during the
> > > > +			 * writeback (delalloc) and on io completion (isize).
> > > > +			 */
> > > > +			redirty_tail(inode);
> > > 
> > > I'd drop the mention of XFS here - any filesystem that does delayed
> > > allocation or unwritten extent conversion after Io completion will
> > > cause this. Perhaps make the comment:
> > > 
> > > 	/*
> > > 	 * Filesystems can dirty the inode during writeback
> > > 	 * operations, such as delayed allocation during submission
> > > 	 * or metadata updates after data IO completion.
> > > 	 */
> > 
> > Thanks, comments updated accordingly.
> > 
> > ---
> > writeback: don't redirty tail an inode with dirty pages
> > 
This avoids delaying writeback for an expired (XFS) inode with lots of
dirty pages but no active dirtier at the moment. Previously we only did
that for the kupdate case.
> > 
> 
> You didn't actually explain the _reason_ for making this change. 
> Please always do that.

OK. It's actually extending commit b3af9468ae from the kupdate-only case to
both kupdate and !kupdate cases.

The commit documented the reason:

    Debug traces show that in per-bdi writeback, the inode under writeback
    almost always get redirtied by a busy dirtier.  We used to call
    redirty_tail() in this case, which could delay inode for up to 30s.
    
    This is unacceptable because it now happens so frequently for plain cp/dd,
    that the accumulated delays could make writeback of big files very slow.

    So let's distinguish between data redirty and metadata only redirty.
    The first one is caused by a busy dirtier, while the latter one could
    happen in XFS, NFS, etc. when they are doing delalloc or updating isize.

Commit b3af9468ae only did that for the kupdate case, because requeue_io()
was only called in the kupdate case. Now that patch 6/6 merges the kupdate
and !kupdate cases (why not?), this patch does the same.

> The patch is...  surprisingly complicated, although the end result
> looks OK.  This is not aided by the partial duplication between
> mapping_tagged(PAGECACHE_TAG_DIRTY) and I_DIRTY_PAGES.  I don't think
> we can easily remove I_DIRTY_PAGES because it's used for the
> did-someone-just-dirty-a-page test here.

I double checked I_DIRTY_PAGES. The main difference from PAGECACHE_TAG_DIRTY
is: I_DIRTY_PAGES (at the line removed by this patch) means _new_ pages got
dirtied during writeback, while PAGECACHE_TAG_DIRTY means there are dirty
pages. In this sense, if the I_DIRTY_PAGES handling is the same as for
PAGECACHE_TAG_DIRTY, the code can be merged into the PAGECACHE_TAG_DIRTY
path, as this patch does.

The other minor differences are

- in *_set_page_dirty*(), PAGECACHE_TAG_DIRTY is set racelessly, while
  I_DIRTY_PAGES might be set on the inode for a page just truncated.
  The difference has no real impact on this patch (it's actually
  slightly better now).

- afs_fsync() always sets I_DIRTY_PAGES after calling afs_writepages().
  The call has been there since day one (introduced by David Howells).
  What was the intention, hmm..?

> This code is way too complex and fragile and I fear that anything we do
> to it will break something :(

Agreed. Let's try to simplify it :)

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 3/6] writeback: avoid unnecessary calculation of bdi dirty thresholds
  2010-07-11  2:06 ` [PATCH 3/6] writeback: avoid unnecessary calculation of bdi dirty thresholds Wu Fengguang
  2010-07-12 21:56   ` Andrew Morton
@ 2010-07-19 21:35   ` Andrew Morton
  2010-07-20  3:34     ` Wu Fengguang
  2010-08-03 15:03   ` Peter Zijlstra
  2 siblings, 1 reply; 36+ messages in thread
From: Andrew Morton @ 2010-07-19 21:35 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Christoph Hellwig, Peter Zijlstra, Dave Chinner, Jan Kara,
	linux-fsdevel, Linux Memory Management List, LKML

On Sun, 11 Jul 2010 10:06:59 +0800
Wu Fengguang <fengguang.wu@intel.com> wrote:

> Split get_dirty_limits() into global_dirty_limits()+bdi_dirty_limit(),
> so that the latter can be avoided when under global dirty background
> threshold (which is the normal state for most systems).
> 

mm/page-writeback.c: In function 'balance_dirty_pages_ratelimited_nr':
mm/page-writeback.c:466: warning: 'dirty_exceeded' may be used uninitialized in this function

This was a real bug.

--- a/mm/page-writeback.c~writeback-avoid-unnecessary-calculation-of-bdi-dirty-thresholds-fix
+++ a/mm/page-writeback.c
@@ -463,7 +463,7 @@ static void balance_dirty_pages(struct a
 	unsigned long bdi_thresh;
 	unsigned long pages_written = 0;
 	unsigned long pause = 1;
-	int dirty_exceeded;
+	bool dirty_exceeded = false;
 	struct backing_dev_info *bdi = mapping->backing_dev_info;
 
 	for (;;) {
_


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 3/6] writeback: avoid unnecessary calculation of bdi dirty thresholds
  2010-07-19 21:35   ` Andrew Morton
@ 2010-07-20  3:34     ` Wu Fengguang
  2010-07-20  4:14       ` Andrew Morton
  0 siblings, 1 reply; 36+ messages in thread
From: Wu Fengguang @ 2010-07-20  3:34 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christoph Hellwig, Peter Zijlstra, Dave Chinner, Jan Kara,
	linux-fsdevel@vger.kernel.org, Linux Memory Management List, LKML

On Tue, Jul 20, 2010 at 05:35:20AM +0800, Andrew Morton wrote:
> On Sun, 11 Jul 2010 10:06:59 +0800
> Wu Fengguang <fengguang.wu@intel.com> wrote:
> 
> > Split get_dirty_limits() into global_dirty_limits()+bdi_dirty_limit(),
> > so that the latter can be avoided when under global dirty background
> > threshold (which is the normal state for most systems).
> > 
> 
> mm/page-writeback.c: In function 'balance_dirty_pages_ratelimited_nr':
> mm/page-writeback.c:466: warning: 'dirty_exceeded' may be used uninitialized in this function
> 
> This was a real bug.

Thanks! But how did you catch this? There are no warnings in my compile test.

I noticed that there is a gcc option "-Wuninitialized", which will be turned on
with "-Wall" and "-O2" as used in the following command:

  gcc -Wp,-MD,mm/.page-writeback.o.d  -nostdinc -isystem /usr/lib/gcc/x86_64-linux-gnu/4.4.4/include -I/cc/linux-2.6.33/arch/x86/include -Iinclude  -include include/generated/autoconf.h -D__KERNEL__ -Wall -Wundef -Wstrict-prototypes -Wno-trigraphs -fno-strict-aliasing -fno-common -Werror-implicit-function-declaration -Wno-format-security -fno-delete-null-pointer-checks -O2 -m64 -march=core2 -mno-red-zone -mcmodel=kernel -funit-at-a-time -maccumulate-outgoing-args -DCONFIG_AS_CFI=1 -DCONFIG_AS_CFI_SIGNAL_FRAME=1 -pipe -Wno-sign-compare -fno-asynchronous-unwind-tables -mno-sse -mno-mmx -mno-sse2 -mno-3dnow -Wframe-larger-than=2048 -fno-stack-protector -fno-omit-frame-pointer -fno-optimize-sibling-calls -g -pg -Wdeclaration-after-statement -Wno-pointer-sign -fno-strict-overflow -fno-dwarf2-cfi-asm -fconserve-stack   -D"KBUILD_STR(s)=#s" -D"KBUILD_BASENAME=KBUILD_STR(page_writeback)"  -D"KBUILD_MODNAME=KBUILD_STR(page_writeback)"  -c -o mm/page-writeback.o mm/page-writeback.c


My gcc version is 

$ gcc -v
Using built-in specs.
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Debian 4.4.4-6' --with-bugurl=file:///usr/share/doc/gcc-4.4/README.Bugs --enable-languages=c,c++,fortran,objc,obj-c++ --prefix=/usr --enable-shared --enable-multiarch --enable-linker-build-id --with-system-zlib --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --with-gxx-include-dir=/usr/include/c++/4.4 --program-suffix=-4.4 --enable-nls --enable-clocale=gnu --enable-libstdcxx-debug --enable-objc-gc --with-arch-32=i586 --with-tune=generic --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu
Thread model: posix
gcc version 4.4.4 (Debian 4.4.4-6)

Thanks,
Fengguang

> --- a/mm/page-writeback.c~writeback-avoid-unnecessary-calculation-of-bdi-dirty-thresholds-fix
> +++ a/mm/page-writeback.c
> @@ -463,7 +463,7 @@ static void balance_dirty_pages(struct a
>  	unsigned long bdi_thresh;
>  	unsigned long pages_written = 0;
>  	unsigned long pause = 1;
> -	int dirty_exceeded;
> +	bool dirty_exceeded = false;
>  	struct backing_dev_info *bdi = mapping->backing_dev_info;
>  
>  	for (;;) {
> _


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 3/6] writeback: avoid unnecessary calculation of bdi dirty thresholds
  2010-07-20  3:34     ` Wu Fengguang
@ 2010-07-20  4:14       ` Andrew Morton
  0 siblings, 0 replies; 36+ messages in thread
From: Andrew Morton @ 2010-07-20  4:14 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Christoph Hellwig, Peter Zijlstra, Dave Chinner, Jan Kara,
	linux-fsdevel@vger.kernel.org, Linux Memory Management List, LKML

On Tue, 20 Jul 2010 11:34:37 +0800 Wu Fengguang <fengguang.wu@intel.com> wrote:

> On Tue, Jul 20, 2010 at 05:35:20AM +0800, Andrew Morton wrote:
> > On Sun, 11 Jul 2010 10:06:59 +0800
> > Wu Fengguang <fengguang.wu@intel.com> wrote:
> > 
> > > Split get_dirty_limits() into global_dirty_limits()+bdi_dirty_limit(),
> > > so that the latter can be avoided when under global dirty background
> > > threshold (which is the normal state for most systems).
> > > 
> > 
> > mm/page-writeback.c: In function 'balance_dirty_pages_ratelimited_nr':
> > mm/page-writeback.c:466: warning: 'dirty_exceeded' may be used uninitialized in this function
> > 
> > This was a real bug.
> 
> Thanks! But how do you catch this? There are no warnings in my compile test.

Basic `make allmodconfig'.  But I use a range of different compiler
versions.  Different versions of gcc detect different stuff.  This was 4.1.0
or 4.0.2, I forget which.



^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 2/6] writeback: reduce calls to global_page_state in balance_dirty_pages()
  2010-07-11  2:06 ` [PATCH 2/6] writeback: reduce calls to global_page_state " Wu Fengguang
@ 2010-07-26 15:19   ` Jan Kara
  2010-07-27  3:59     ` Wu Fengguang
  2010-08-03 14:55   ` Peter Zijlstra
  1 sibling, 1 reply; 36+ messages in thread
From: Jan Kara @ 2010-07-26 15:19 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, Christoph Hellwig, Jan Kara, Peter Zijlstra,
	Richard Kennedy, Dave Chinner, linux-fsdevel,
	Linux Memory Management List, LKML

On Sun 11-07-10 10:06:58, Wu Fengguang wrote:
> Reducing the number of times balance_dirty_pages calls global_page_state
> reduces the cache references and so improves write performance on a
> variety of workloads.
> 
> 'perf stats' of simple fio write tests shows the reduction in cache
> access.  Where the test is fio 'write,mmap,600Mb,pre_read' on AMD
> AthlonX2 with 3Gb memory (dirty_threshold approx 600 Mb) running each
> test 10 times, dropping the fastest & slowest values, then taking the
> average & standard deviation
> 
> 		average (s.d.) in millions (10^6)
> 2.6.31-rc8	648.6 (14.6)
> +patch		620.1 (16.5)
> 
> This reduction is achieved by dropping clip_bdi_dirty_limit, as it
> rereads the counters to apply the dirty_threshold, and by moving this
> check up into balance_dirty_pages, where it has already read the counters.
> 
> Also, rearranging the for loop to contain only one copy of the limit
> tests allows the pdflush test after the loop to use the local copies of
> the counters rather than rereading them.
> 
> In the common case with no throttling it now calls global_page_state 5
> fewer times and bdi_stat 2 fewer.
> 
> Fengguang:
> 
> This patch slightly changes behavior by replacing clip_bdi_dirty_limit()
> with the explicit check (nr_reclaimable + nr_writeback >= dirty_thresh)
> to avoid exceeding the dirty limit. Since the bdi dirty limit is mostly
> accurate, we don't need to clip routinely; a simple dirty limit check
> is enough.
> 
> The check is necessary because, in principle we should throttle
> everything calling balance_dirty_pages() when we're over the total
> limit, as said by Peter.
> 
> We now set and clear dirty_exceeded not only based on bdi dirty limits,
> but also on the global dirty limits. This is a bit counterintuitive, but
> the global limits are the ultimate goal and shall be always imposed.
  Thinking about this again - what you did is a rather big change for
systems with more active BDIs. For example, if I have two disks sda and
sdb and write for some time to sda, then the dirty limit for sdb gets
scaled down. So when we start writing to sdb we'll heavily throttle the
threads until the dirty limit for sdb ramps up, regardless of how far
we are from the global limit...

> We may now start background writeback work based on outdated conditions.
> That's safe because the bdi flush thread will (and has to) double check
> the states. It reduces overall overhead because the test based on old
> state still has a good chance of being right.

									Honza
> 
> CC: Jan Kara <jack@suse.cz>
> CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
> Signed-off-by: Richard Kennedy <richard@rsk.demon.co.uk>
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> ---
>  mm/page-writeback.c |   95 ++++++++++++++----------------------------
>  1 file changed, 33 insertions(+), 62 deletions(-)
> 
> --- linux-next.orig/mm/page-writeback.c	2010-07-11 08:42:14.000000000 +0800
> +++ linux-next/mm/page-writeback.c	2010-07-11 08:44:49.000000000 +0800
> @@ -253,32 +253,6 @@ static void bdi_writeout_fraction(struct
>  	}
>  }
>  
> -/*
> - * Clip the earned share of dirty pages to that which is actually available.
> - * This avoids exceeding the total dirty_limit when the floating averages
> - * fluctuate too quickly.
> - */
> -static void clip_bdi_dirty_limit(struct backing_dev_info *bdi,
> -		unsigned long dirty, unsigned long *pbdi_dirty)
> -{
> -	unsigned long avail_dirty;
> -
> -	avail_dirty = global_page_state(NR_FILE_DIRTY) +
> -		 global_page_state(NR_WRITEBACK) +
> -		 global_page_state(NR_UNSTABLE_NFS) +
> -		 global_page_state(NR_WRITEBACK_TEMP);
> -
> -	if (avail_dirty < dirty)
> -		avail_dirty = dirty - avail_dirty;
> -	else
> -		avail_dirty = 0;
> -
> -	avail_dirty += bdi_stat(bdi, BDI_RECLAIMABLE) +
> -		bdi_stat(bdi, BDI_WRITEBACK);
> -
> -	*pbdi_dirty = min(*pbdi_dirty, avail_dirty);
> -}
> -
>  static inline void task_dirties_fraction(struct task_struct *tsk,
>  		long *numerator, long *denominator)
>  {
> @@ -469,7 +443,6 @@ get_dirty_limits(unsigned long *pbackgro
>  			bdi_dirty = dirty * bdi->max_ratio / 100;
>  
>  		*pbdi_dirty = bdi_dirty;
> -		clip_bdi_dirty_limit(bdi, dirty, pbdi_dirty);
>  		task_dirty_limit(current, pbdi_dirty);
>  	}
>  }
> @@ -491,7 +464,7 @@ static void balance_dirty_pages(struct a
>  	unsigned long bdi_thresh;
>  	unsigned long pages_written = 0;
>  	unsigned long pause = 1;
> -
> +	int dirty_exceeded;
>  	struct backing_dev_info *bdi = mapping->backing_dev_info;
>  
>  	for (;;) {
> @@ -510,10 +483,35 @@ static void balance_dirty_pages(struct a
>  		nr_writeback = global_page_state(NR_WRITEBACK) +
>  			       global_page_state(NR_WRITEBACK_TEMP);
>  
> -		bdi_nr_reclaimable = bdi_stat(bdi, BDI_RECLAIMABLE);
> -		bdi_nr_writeback = bdi_stat(bdi, BDI_WRITEBACK);
> +		/*
> +		 * In order to avoid the stacked BDI deadlock we need
> +		 * to ensure we accurately count the 'dirty' pages when
> +		 * the threshold is low.
> +		 *
> +		 * Otherwise it would be possible to get thresh+n pages
> +		 * reported dirty, even though there are thresh-m pages
> +		 * actually dirty; with m+n sitting in the percpu
> +		 * deltas.
> +		 */
> +		if (bdi_thresh < 2*bdi_stat_error(bdi)) {
> +			bdi_nr_reclaimable = bdi_stat_sum(bdi, BDI_RECLAIMABLE);
> +			bdi_nr_writeback = bdi_stat_sum(bdi, BDI_WRITEBACK);
> +		} else {
> +			bdi_nr_reclaimable = bdi_stat(bdi, BDI_RECLAIMABLE);
> +			bdi_nr_writeback = bdi_stat(bdi, BDI_WRITEBACK);
> +		}
> +
> +		/*
> +		 * The bdi thresh is somehow "soft" limit derived from the
> +		 * global "hard" limit. The former helps to prevent heavy IO
> +		 * bdi or process from holding back light ones; The latter is
> +		 * the last resort safeguard.
> +		 */
> +		dirty_exceeded =
> +			(bdi_nr_reclaimable + bdi_nr_writeback >= bdi_thresh)
> +			|| (nr_reclaimable + nr_writeback >= dirty_thresh);
>  
> -		if (bdi_nr_reclaimable + bdi_nr_writeback <= bdi_thresh)
> +		if (!dirty_exceeded)
>  			break;
>  
>  		/*
> @@ -541,34 +539,10 @@ static void balance_dirty_pages(struct a
>  		if (bdi_nr_reclaimable > bdi_thresh) {
>  			writeback_inodes_wb(&bdi->wb, &wbc);
>  			pages_written += write_chunk - wbc.nr_to_write;
> -			get_dirty_limits(&background_thresh, &dirty_thresh,
> -				       &bdi_thresh, bdi);
>  			trace_wbc_balance_dirty_written(&wbc, bdi);
> +			if (pages_written >= write_chunk)
> +				break;		/* We've done our duty */
>  		}
> -
> -		/*
> -		 * In order to avoid the stacked BDI deadlock we need
> -		 * to ensure we accurately count the 'dirty' pages when
> -		 * the threshold is low.
> -		 *
> -		 * Otherwise it would be possible to get thresh+n pages
> -		 * reported dirty, even though there are thresh-m pages
> -		 * actually dirty; with m+n sitting in the percpu
> -		 * deltas.
> -		 */
> -		if (bdi_thresh < 2*bdi_stat_error(bdi)) {
> -			bdi_nr_reclaimable = bdi_stat_sum(bdi, BDI_RECLAIMABLE);
> -			bdi_nr_writeback = bdi_stat_sum(bdi, BDI_WRITEBACK);
> -		} else if (bdi_nr_reclaimable) {
> -			bdi_nr_reclaimable = bdi_stat(bdi, BDI_RECLAIMABLE);
> -			bdi_nr_writeback = bdi_stat(bdi, BDI_WRITEBACK);
> -		}
> -
> -		if (bdi_nr_reclaimable + bdi_nr_writeback <= bdi_thresh)
> -			break;
> -		if (pages_written >= write_chunk)
> -			break;		/* We've done our duty */
> -
>  		trace_wbc_balance_dirty_wait(&wbc, bdi);
>  		__set_current_state(TASK_INTERRUPTIBLE);
>  		io_schedule_timeout(pause);
> @@ -582,8 +556,7 @@ static void balance_dirty_pages(struct a
>  			pause = HZ / 10;
>  	}
>  
> -	if (bdi_nr_reclaimable + bdi_nr_writeback < bdi_thresh &&
> -			bdi->dirty_exceeded)
> +	if (!dirty_exceeded && bdi->dirty_exceeded)
>  		bdi->dirty_exceeded = 0;
>  
>  	if (writeback_in_progress(bdi))
> @@ -598,9 +571,7 @@ static void balance_dirty_pages(struct a
>  	 * background_thresh, to keep the amount of dirty memory low.
>  	 */
>  	if ((laptop_mode && pages_written) ||
> -	    (!laptop_mode && ((global_page_state(NR_FILE_DIRTY)
> -			       + global_page_state(NR_UNSTABLE_NFS))
> -					  > background_thresh)))
> +	    (!laptop_mode && (nr_reclaimable > background_thresh)))
>  		bdi_start_background_writeback(bdi);
>  }
>  
> 
> 
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 2/6] writeback: reduce calls to global_page_state in balance_dirty_pages()
  2010-07-26 15:19   ` Jan Kara
@ 2010-07-27  3:59     ` Wu Fengguang
  2010-07-27  9:12       ` Jan Kara
  0 siblings, 1 reply; 36+ messages in thread
From: Wu Fengguang @ 2010-07-27  3:59 UTC (permalink / raw)
  To: Jan Kara
  Cc: Andrew Morton, Christoph Hellwig, Peter Zijlstra, Richard Kennedy,
	Dave Chinner, linux-fsdevel@vger.kernel.org,
	Linux Memory Management List, LKML

> > This patch slightly changes behavior by replacing clip_bdi_dirty_limit()
> > with the explicit check (nr_reclaimable + nr_writeback >= dirty_thresh)
> > to avoid exceeding the dirty limit. Since the bdi dirty limit is mostly
> > accurate, we don't need to clip routinely; a simple dirty limit check
> > is enough.
> > 
> > The check is necessary because, in principle we should throttle
> > everything calling balance_dirty_pages() when we're over the total
> > limit, as said by Peter.
> > 
> > We now set and clear dirty_exceeded not only based on bdi dirty limits,
> > but also on the global dirty limits. This is a bit counterintuitive, but
> > the global limits are the ultimate goal and shall be always imposed.
>   Thinking about this again - what you did is rather big change for systems
> with more active BDIs. For example if I have two disks sda and sdb and
> write for some time to sda, then dirty limit for sdb gets scaled down.
> So when we start writing to sbd we'll heavily throttle the threads until
> the dirty limit for sdb ramps up regardless of how far are we to reach the
> global limit...

The global threshold check is added in place of clip_bdi_dirty_limit()
for safety and is not intended as a behavior change. If it ever leads to
a big behavior change or regression, that would indicate an overly
permissive per-bdi threshold calculation.

Did you see the global dirty threshold get exceeded when writing to 2+
devices? Occasional small excesses should be OK, though. I tried the
following debug patch and saw no warnings when doing two concurrent cp
runs over a local disk and NFS.

Index: linux-next/mm/page-writeback.c
===================================================================
--- linux-next.orig/mm/page-writeback.c	2010-07-27 11:26:18.063817669 +0800
+++ linux-next/mm/page-writeback.c	2010-07-27 11:26:53.335855847 +0800
@@ -513,6 +513,11 @@
 		if (!dirty_exceeded)
 			break;
 
+		if (nr_reclaimable + nr_writeback >= dirty_thresh)
+			printk ("XXX: dirty exceeded: %lu + %lu = %lu ++ %lu\n",
+				nr_reclaimable, nr_writeback, dirty_thresh,
+				nr_reclaimable + nr_writeback - dirty_thresh);
+
 		/*
 		 * Throttle it only when the background writeback cannot
 		 * catch-up. This avoids (excessively) small writeouts

Thanks,
Fengguang


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 2/6] writeback: reduce calls to global_page_state in balance_dirty_pages()
  2010-07-27  3:59     ` Wu Fengguang
@ 2010-07-27  9:12       ` Jan Kara
  2010-07-28  2:04         ` Wu Fengguang
  0 siblings, 1 reply; 36+ messages in thread
From: Jan Kara @ 2010-07-27  9:12 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Jan Kara, Andrew Morton, Christoph Hellwig, Peter Zijlstra,
	Richard Kennedy, Dave Chinner, linux-fsdevel@vger.kernel.org,
	Linux Memory Management List, LKML

On Tue 27-07-10 11:59:41, Wu Fengguang wrote:
> > > This patch slightly changes behavior by replacing clip_bdi_dirty_limit()
> > > with the explicit check (nr_reclaimable + nr_writeback >= dirty_thresh)
> > > to avoid exceeding the dirty limit. Since the bdi dirty limit is mostly
> > > accurate, we don't need to clip routinely; a simple dirty limit check
> > > is enough.
> > > 
> > > The check is necessary because, in principle we should throttle
> > > everything calling balance_dirty_pages() when we're over the total
> > > limit, as said by Peter.
> > > 
> > > We now set and clear dirty_exceeded not only based on bdi dirty limits,
> > > but also on the global dirty limits. This is a bit counterintuitive, but
> > > the global limits are the ultimate goal and shall be always imposed.
> >   Thinking about this again - what you did is a rather big change for
> > systems with more active BDIs. For example, if I have two disks sda and
> > sdb and write for some time to sda, then the dirty limit for sdb gets
> > scaled down. So when we start writing to sdb we'll heavily throttle the
> > threads until the dirty limit for sdb ramps up, regardless of how far
> > we are from the global limit...
> 
> The global threshold check is added in place of clip_bdi_dirty_limit()
> for safety and is not intended as a behavior change. If it ever leads to
> a big behavior change or regression, that would indicate an overly
> permissive per-bdi threshold calculation.
> 
> Did you see the global dirty threshold get exceeded when writing to 2+
> devices? Occasional small excursions should be OK though. I tried the
> following debug patch and saw no warnings when doing two concurrent cp
> runs over a local disk and NFS.
  Oops, sorry. I've misread the code. You're right. There shouldn't be a big
change in the behavior.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR


* Re: [PATCH 2/6] writeback: reduce calls to global_page_state in balance_dirty_pages()
  2010-07-27  9:12       ` Jan Kara
@ 2010-07-28  2:04         ` Wu Fengguang
  0 siblings, 0 replies; 36+ messages in thread
From: Wu Fengguang @ 2010-07-28  2:04 UTC (permalink / raw)
  To: Jan Kara
  Cc: Andrew Morton, Christoph Hellwig, Peter Zijlstra, Richard Kennedy,
	Dave Chinner, linux-fsdevel@vger.kernel.org,
	Linux Memory Management List, LKML

> > The global threshold check is added in place of clip_bdi_dirty_limit()
> > for safety and is not intended as a behavior change. If it ever leads to
> > a big behavior change or regression, that would indicate an overly
> > permissive per-bdi threshold calculation.
> > 
> > Did you see the global dirty threshold get exceeded when writing to 2+
> > devices? Occasional small excursions should be OK though. I tried the
> > following debug patch and saw no warnings when doing two concurrent cp
> > runs over a local disk and NFS.
>   Oops, sorry. I've misread the code. You're right. There shouldn't be a big
> change in the behavior.

It does indicate a missing point in the changelog. The paragraph is
updated to:

        We now set and clear dirty_exceeded not only based on bdi dirty limits,
        but also on the global dirty limit. The global limit check is added in
        place of clip_bdi_dirty_limit() for safety and is not intended as a
        behavior change. The bdi limits should be tight enough to keep all dirty
        pages under the global limit most of the time; occasional small
        excursions should be OK though. The change makes the logic more obvious:
        the global limit is the ultimate goal and shall always be imposed.

Thanks,
Fengguang


* Re: [PATCH 2/6] writeback: reduce calls to global_page_state in balance_dirty_pages()
  2010-07-11  2:06 ` [PATCH 2/6] writeback: reduce calls to global_page_state " Wu Fengguang
  2010-07-26 15:19   ` Jan Kara
@ 2010-08-03 14:55   ` Peter Zijlstra
  1 sibling, 0 replies; 36+ messages in thread
From: Peter Zijlstra @ 2010-08-03 14:55 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, Christoph Hellwig, Jan Kara, Richard Kennedy,
	Dave Chinner, linux-fsdevel, Linux Memory Management List, LKML

On Sun, 2010-07-11 at 10:06 +0800, Wu Fengguang wrote:
> 
> CC: Jan Kara <jack@suse.cz>

I can more or less remember this patch, and the result looks good.

Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>


> Signed-off-by: Richard Kennedy <richard@rsk.demon.co.uk>
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> ---
>  mm/page-writeback.c |   95 ++++++++++++++----------------------------
>  1 file changed, 33 insertions(+), 62 deletions(-)
> 
> --- linux-next.orig/mm/page-writeback.c 2010-07-11 08:42:14.000000000 +0800
> +++ linux-next/mm/page-writeback.c      2010-07-11 08:44:49.000000000 +0800
> @@ -253,32 +253,6 @@ static void bdi_writeout_fraction(struct
>         }
>  }
>  
>  static inline void task_dirties_fraction(struct task_struct *tsk,
>                 long *numerator, long *denominator)
>  {
> @@ -469,7 +443,6 @@ get_dirty_limits(unsigned long *pbackgro
>                         bdi_dirty = dirty * bdi->max_ratio / 100;
>  
>                 *pbdi_dirty = bdi_dirty;
>                 task_dirty_limit(current, pbdi_dirty);
>         }
>  }
> @@ -491,7 +464,7 @@ static void balance_dirty_pages(struct a
>         unsigned long bdi_thresh;
>         unsigned long pages_written = 0;
>         unsigned long pause = 1;
> +       int dirty_exceeded;
>         struct backing_dev_info *bdi = mapping->backing_dev_info;
>  
>         for (;;) {
> @@ -510,10 +483,35 @@ static void balance_dirty_pages(struct a
>                 nr_writeback = global_page_state(NR_WRITEBACK) +
>                                global_page_state(NR_WRITEBACK_TEMP);
>  
> +               /*
> +                * In order to avoid the stacked BDI deadlock we need
> +                * to ensure we accurately count the 'dirty' pages when
> +                * the threshold is low.
> +                *
> +                * Otherwise it would be possible to get thresh+n pages
> +                * reported dirty, even though there are thresh-m pages
> +                * actually dirty; with m+n sitting in the percpu
> +                * deltas.
> +                */
> +               if (bdi_thresh < 2*bdi_stat_error(bdi)) {
> +                       bdi_nr_reclaimable = bdi_stat_sum(bdi, BDI_RECLAIMABLE);
> +                       bdi_nr_writeback = bdi_stat_sum(bdi, BDI_WRITEBACK);
> +               } else {
> +                       bdi_nr_reclaimable = bdi_stat(bdi, BDI_RECLAIMABLE);
> +                       bdi_nr_writeback = bdi_stat(bdi, BDI_WRITEBACK);
> +               }
> +
> +		/*
> +		 * The bdi thresh is a somewhat "soft" limit derived from the
> +		 * global "hard" limit. The former helps prevent heavy-IO bdis
> +		 * or processes from holding back light ones; the latter is
> +		 * the last-resort safeguard.
> +		 */
> +               dirty_exceeded =
> +                       (bdi_nr_reclaimable + bdi_nr_writeback >= bdi_thresh)
> +                       || (nr_reclaimable + nr_writeback >= dirty_thresh);
>  
> +               if (!dirty_exceeded)
>                         break;
>  
>                 /*
> @@ -541,34 +539,10 @@ static void balance_dirty_pages(struct a
>                 if (bdi_nr_reclaimable > bdi_thresh) {
>                         writeback_inodes_wb(&bdi->wb, &wbc);
>                         pages_written += write_chunk - wbc.nr_to_write;
>                         trace_wbc_balance_dirty_written(&wbc, bdi);
> +                       if (pages_written >= write_chunk)
> +                               break;          /* We've done our duty */
>                 }
>                 trace_wbc_balance_dirty_wait(&wbc, bdi);
>                 __set_current_state(TASK_INTERRUPTIBLE);
>                 io_schedule_timeout(pause);
> @@ -582,8 +556,7 @@ static void balance_dirty_pages(struct a
>                         pause = HZ / 10;
>         }
>  
> +       if (!dirty_exceeded && bdi->dirty_exceeded)
>                 bdi->dirty_exceeded = 0;
>  
>         if (writeback_in_progress(bdi))
> @@ -598,9 +571,7 @@ static void balance_dirty_pages(struct a
>          * background_thresh, to keep the amount of dirty memory low.
>          */
>         if ((laptop_mode && pages_written) ||
> +           (!laptop_mode && (nr_reclaimable > background_thresh)))
>                 bdi_start_background_writeback(bdi);
>  }


* Re: [PATCH 3/6] writeback: avoid unnecessary calculation of bdi dirty thresholds
  2010-07-11  2:06 ` [PATCH 3/6] writeback: avoid unnecessary calculation of bdi dirty thresholds Wu Fengguang
  2010-07-12 21:56   ` Andrew Morton
  2010-07-19 21:35   ` Andrew Morton
@ 2010-08-03 15:03   ` Peter Zijlstra
  2010-08-03 15:10     ` Wu Fengguang
  2010-08-04 16:41     ` Wu Fengguang
  2 siblings, 2 replies; 36+ messages in thread
From: Peter Zijlstra @ 2010-08-03 15:03 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, Christoph Hellwig, Dave Chinner, Jan Kara,
	linux-fsdevel, Linux Memory Management List, LKML

On Sun, 2010-07-11 at 10:06 +0800, Wu Fengguang wrote:
> plain text document attachment (writeback-less-bdi-calc.patch)
> Split get_dirty_limits() into global_dirty_limits()+bdi_dirty_limit(),
> so that the latter can be avoided when under global dirty background
> threshold (which is the normal state for most systems).

The patch looks OK, although especially with the proposed comments in the
follow-up email, bdi_dirty_limit() gets a bit confusing with respect to how
and what the limit is.

Maybe it's clearer to not call task_dirty_limit() from bdi_dirty_limit();
that way the comment can focus on the device write request completion
proportion thing.

> +unsigned long bdi_dirty_limit(struct backing_dev_info *bdi,
> +			       unsigned long dirty)
> +{
> +	u64 bdi_dirty;
> +	long numerator, denominator;
>  
> +	/*
> +	 * Calculate this BDI's share of the dirty ratio.
> +	 */
> +	bdi_writeout_fraction(bdi, &numerator, &denominator);
>  
> +	bdi_dirty = (dirty * (100 - bdi_min_ratio)) / 100;
> +	bdi_dirty *= numerator;
> +	do_div(bdi_dirty, denominator);
>  
> +	bdi_dirty += (dirty * bdi->min_ratio) / 100;
> +	if (bdi_dirty > (dirty * bdi->max_ratio) / 100)
> +		bdi_dirty = dirty * bdi->max_ratio / 100;
> +
  +       return bdi_dirty;
>  }

And then add the call to task_dirty_limit() here:

> +++ linux-next/mm/backing-dev.c	2010-07-11 08:53:44.000000000 +0800
> @@ -83,7 +83,8 @@ static int bdi_debug_stats_show(struct s
>  		nr_more_io++;
>  	spin_unlock(&inode_lock);
>  
> -	get_dirty_limits(&background_thresh, &dirty_thresh, &bdi_thresh, bdi);
> +	global_dirty_limits(&background_thresh, &dirty_thresh);
> +	bdi_thresh = bdi_dirty_limit(bdi, dirty_thresh);
  +       bdi_thresh = task_dirty_limit(current, bdi_thresh);

And add a comment to task_dirty_limit() as well, explaining its reason
for existence (protecting light/slow dirtying tasks from heavier/fast
ones).



* Re: [PATCH 3/6] writeback: avoid unnecessary calculation of bdi dirty thresholds
  2010-08-03 15:03   ` Peter Zijlstra
@ 2010-08-03 15:10     ` Wu Fengguang
  2010-08-04 16:41     ` Wu Fengguang
  1 sibling, 0 replies; 36+ messages in thread
From: Wu Fengguang @ 2010-08-03 15:10 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrew Morton, Christoph Hellwig, Dave Chinner, Jan Kara,
	linux-fsdevel@vger.kernel.org, Linux Memory Management List, LKML

On Tue, Aug 03, 2010 at 11:03:42PM +0800, Peter Zijlstra wrote:
> On Sun, 2010-07-11 at 10:06 +0800, Wu Fengguang wrote:
> > plain text document attachment (writeback-less-bdi-calc.patch)
> > Split get_dirty_limits() into global_dirty_limits()+bdi_dirty_limit(),
> > so that the latter can be avoided when under global dirty background
> > threshold (which is the normal state for most systems).
> 
> The patch looks OK, although especially with the proposed comments in the
> follow-up email, bdi_dirty_limit() gets a bit confusing with respect to how
> and what the limit is.
> 
> Maybe it's clearer to not call task_dirty_limit() from bdi_dirty_limit();
> that way the comment can focus on the device write request completion
> proportion thing.
> 
> > +unsigned long bdi_dirty_limit(struct backing_dev_info *bdi,
> > +			       unsigned long dirty)
> > +{
> > +	u64 bdi_dirty;
> > +	long numerator, denominator;
> >  
> > +	/*
> > +	 * Calculate this BDI's share of the dirty ratio.
> > +	 */
> > +	bdi_writeout_fraction(bdi, &numerator, &denominator);
> >  
> > +	bdi_dirty = (dirty * (100 - bdi_min_ratio)) / 100;
> > +	bdi_dirty *= numerator;
> > +	do_div(bdi_dirty, denominator);
> >  
> > +	bdi_dirty += (dirty * bdi->min_ratio) / 100;
> > +	if (bdi_dirty > (dirty * bdi->max_ratio) / 100)
> > +		bdi_dirty = dirty * bdi->max_ratio / 100;
> > +
>   +       return bdi_dirty;
> >  }
> 
> And then add the call to task_dirty_limit() here:
> 
> > +++ linux-next/mm/backing-dev.c	2010-07-11 08:53:44.000000000 +0800
> > @@ -83,7 +83,8 @@ static int bdi_debug_stats_show(struct s
> >  		nr_more_io++;
> >  	spin_unlock(&inode_lock);
> >  
> > -	get_dirty_limits(&background_thresh, &dirty_thresh, &bdi_thresh, bdi);
> > +	global_dirty_limits(&background_thresh, &dirty_thresh);
> > +	bdi_thresh = bdi_dirty_limit(bdi, dirty_thresh);
>   +       bdi_thresh = task_dirty_limit(current, bdi_thresh);
> 
> And add a comment to task_dirty_limit() as well, explaining its reason
> for existence (protecting light/slow dirtying tasks from heavier/fast
> ones).

Good suggestions, that would be much less confusing. Will post updated
patches tomorrow.

Thanks,
Fengguang


* Re: [PATCH 3/6] writeback: avoid unnecessary calculation of bdi dirty thresholds
  2010-08-03 15:03   ` Peter Zijlstra
  2010-08-03 15:10     ` Wu Fengguang
@ 2010-08-04 16:41     ` Wu Fengguang
  2010-08-04 17:10       ` Peter Zijlstra
  1 sibling, 1 reply; 36+ messages in thread
From: Wu Fengguang @ 2010-08-04 16:41 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrew Morton, Christoph Hellwig, Dave Chinner, Jan Kara,
	linux-fsdevel@vger.kernel.org, Linux Memory Management List, LKML

On Tue, Aug 03, 2010 at 11:03:42PM +0800, Peter Zijlstra wrote:
> On Sun, 2010-07-11 at 10:06 +0800, Wu Fengguang wrote:
> > plain text document attachment (writeback-less-bdi-calc.patch)
> > Split get_dirty_limits() into global_dirty_limits()+bdi_dirty_limit(),
> > so that the latter can be avoided when under global dirty background
> > threshold (which is the normal state for most systems).
> 
> The patch looks OK, although especially with the proposed comments in the
> follow-up email, bdi_dirty_limit() gets a bit confusing with respect to how
> and what the limit is.
> 
> Maybe it's clearer to not call task_dirty_limit() from bdi_dirty_limit();
> that way the comment can focus on the device write request completion
> proportion thing.

Done, thanks.

> > +unsigned long bdi_dirty_limit(struct backing_dev_info *bdi,
> > +			       unsigned long dirty)
> > +{
> > +	u64 bdi_dirty;
> > +	long numerator, denominator;
> >  
> > +	/*
> > +	 * Calculate this BDI's share of the dirty ratio.
> > +	 */
> > +	bdi_writeout_fraction(bdi, &numerator, &denominator);
> >  
> > +	bdi_dirty = (dirty * (100 - bdi_min_ratio)) / 100;
> > +	bdi_dirty *= numerator;
> > +	do_div(bdi_dirty, denominator);
> >  
> > +	bdi_dirty += (dirty * bdi->min_ratio) / 100;
> > +	if (bdi_dirty > (dirty * bdi->max_ratio) / 100)
> > +		bdi_dirty = dirty * bdi->max_ratio / 100;
> > +
>   +       return bdi_dirty;
> >  }
> 
> And then add the call to task_dirty_limit() here:

Done. I omitted adding task_dirty_limit() to the bdi_dirty_limit()
inside bdi_debug_stats_show() -- looks unnecessary there.

> > +++ linux-next/mm/backing-dev.c	2010-07-11 08:53:44.000000000 +0800
> > @@ -83,7 +83,8 @@ static int bdi_debug_stats_show(struct s
> >  		nr_more_io++;
> >  	spin_unlock(&inode_lock);
> >  
> > -	get_dirty_limits(&background_thresh, &dirty_thresh, &bdi_thresh, bdi);
> > +	global_dirty_limits(&background_thresh, &dirty_thresh);
> > +	bdi_thresh = bdi_dirty_limit(bdi, dirty_thresh);
>   +       bdi_thresh = task_dirty_limit(current, bdi_thresh);
> 
> And add a comment to task_dirty_limit() as well, explaining its reason
> for existence (protecting light/slow dirtying tasks from heavier/fast
> ones).

Comments updated as below. Any suggestions/corrections?

Thanks,
Fengguang

Subject: writeback: add comment to the dirty limits functions
From: Wu Fengguang <fengguang.wu@intel.com>
Date: Thu Jul 15 09:54:25 CST 2010

Document global_dirty_limits(), bdi_dirty_limit() and task_dirty_limit().

Cc: Christoph Hellwig <hch@infradead.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/page-writeback.c |   31 ++++++++++++++++++++++++++++---
 1 file changed, 28 insertions(+), 3 deletions(-)

--- linux-next.orig/mm/page-writeback.c	2010-08-03 23:14:19.000000000 +0800
+++ linux-next/mm/page-writeback.c	2010-08-05 00:37:17.000000000 +0800
@@ -261,11 +261,18 @@ static inline void task_dirties_fraction
 }
 
 /*
- * scale the dirty limit
+ * task_dirty_limit - scale down dirty throttling threshold for one task
  *
  * task specific dirty limit:
  *
  *   dirty -= (dirty/8) * p_{t}
+ *
+ * To protect light/slow dirtying tasks from heavier/fast ones, we start
+ * throttling individual tasks before reaching the bdi dirty limit.
+ * Relatively low thresholds will be allocated to heavy dirtiers. So when
+ * dirty pages grow large, heavy dirtiers will be throttled first, which will
+ * effectively curb the growth of dirty pages. Light dirtiers with a high
+ * enough dirty threshold may never get throttled.
  */
 static unsigned long task_dirty_limit(struct task_struct *tsk,
 				       unsigned long bdi_dirty)
@@ -390,6 +397,15 @@ unsigned long determine_dirtyable_memory
 	return x + 1;	/* Ensure that we never return 0 */
 }
 
+/**
+ * global_dirty_limits - background-writeback and dirty-throttling thresholds
+ *
+ * Calculate the dirty thresholds based on sysctl parameters
+ * - vm.dirty_background_ratio  or  vm.dirty_background_bytes
+ * - vm.dirty_ratio             or  vm.dirty_bytes
+ * The dirty limits will be lifted by 1/4 for PF_LESS_THROTTLE (i.e. nfsd)
+ * and real-time tasks.
+ */
 void global_dirty_limits(unsigned long *pbackground, unsigned long *pdirty)
 {
 	unsigned long background;
@@ -424,8 +440,17 @@ void global_dirty_limits(unsigned long *
 	*pdirty = dirty;
 }
 
-unsigned long bdi_dirty_limit(struct backing_dev_info *bdi,
-			       unsigned long dirty)
+/**
+ * bdi_dirty_limit - @bdi's share of dirty throttling threshold
+ *
+ * Allocate high/low dirty limits to fast/slow devices, in order to prevent
+ * - starving fast devices
+ * - piling up dirty pages (that will take a long time to sync) on slow devices
+ *
+ * The bdi's share of the dirty limit adapts to its throughput and is
+ * bounded by the bdi->min_ratio and/or bdi->max_ratio parameters, if set.
+ */
+unsigned long bdi_dirty_limit(struct backing_dev_info *bdi, unsigned long dirty)
 {
 	u64 bdi_dirty;
 	long numerator, denominator;


* Re: [PATCH 3/6] writeback: avoid unnecessary calculation of bdi dirty thresholds
  2010-08-04 16:41     ` Wu Fengguang
@ 2010-08-04 17:10       ` Peter Zijlstra
  0 siblings, 0 replies; 36+ messages in thread
From: Peter Zijlstra @ 2010-08-04 17:10 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, Christoph Hellwig, Dave Chinner, Jan Kara,
	linux-fsdevel@vger.kernel.org, Linux Memory Management List, LKML

On Thu, 2010-08-05 at 00:41 +0800, Wu Fengguang wrote:
> 
> Comments updated as below. Any suggestions/corrections?
> 
Looks nice, thanks!


* Re: [PATCH 6/6] writeback: merge for_kupdate and !for_kupdate cases
  2010-07-12 22:22       ` Andrew Morton
@ 2010-08-05 16:01         ` Wu Fengguang
  0 siblings, 0 replies; 36+ messages in thread
From: Wu Fengguang @ 2010-08-05 16:01 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Dave Chinner, Christoph Hellwig, Martin Bligh, Michael Rubin,
	Peter Zijlstra, Jan Kara, Peter Zijlstra,
	linux-fsdevel@vger.kernel.org, Linux Memory Management List, LKML

On Tue, Jul 13, 2010 at 06:22:54AM +0800, Andrew Morton wrote:
> On Mon, 12 Jul 2010 23:52:39 +0800
> Wu Fengguang <fengguang.wu@intel.com> wrote:
> 
> > > Also, I'd prefer that the
> > > comments remain somewhat more descriptive of the circumstances that
> > > we are operating under. Comments like "retry later to avoid blocking
> > > writeback of other inodes" is far, far better than "retry later"
> > > because it has "why" component that explains the reason for the
> > > logic. You may remember why, but I sure won't in a few months time....
> 
> me2 (of course).  This code is waaaay too complex to be scrimping on comments.
> 
> > Ah yes the comment is too simple. However the redirty_tail() is not to
> > avoid blocking writeback of other inodes, but to avoid eating 100% CPU
> > on busy retrying a dirty inode/page that cannot perform writeback for
> > a while. (In theory redirty_tail() can still busy retry though, when
> > there is only one single dirty inode.) So how about
> > 
> >         /*
> >          * somehow blocked: avoid busy retrying
> >          */
> 
> That's much too short.  Expand on the "somehow" - provide an example,
> describe the common/expected cause.  Fully explain what the "busy"
> retry _is_ and how it can come about.

It was a long story. This redirty_tail() was introduced when more_io
was introduced. The initial patch for more_io did not have the
redirty_tail(), and when it was merged, several 100% iowait bug reports
arose:

reiserfs:
        http://lkml.org/lkml/2007/10/23/93

jfs:
        commit 29a424f28390752a4ca2349633aaacc6be494db5
        JFS: clear PAGECACHE_TAG_DIRTY for no-write pages

ext2:
        http://www.spinics.net/linux/lists/linux-ext4/msg04762.html

They are all old bugs hidden in various filesystems that became
"obvious" with the more_io patch. At the time, the ext2 bug was thought
to be "trivial", so you didn't merge that fix. Instead the following
patch with redirty_tail() was merged:

http://www.spinics.net/linux/lists/linux-ext4/msg04507.html

This will in general prevent 100% iowait on ext2 and other possibly unknown FS bugs.

I'll take David's comments and note the above in changelog.

Thanks,
Fengguang


end of thread, other threads:[~2010-08-05 16:01 UTC | newest]

Thread overview: 36+ messages
2010-07-11  2:06 [PATCH 0/6] writeback cleanups and trivial fixes Wu Fengguang
2010-07-11  2:06 ` [PATCH 1/6] writeback: take account of NR_WRITEBACK_TEMP in balance_dirty_pages() Wu Fengguang
2010-07-12 21:52   ` Andrew Morton
2010-07-13  8:58     ` Miklos Szeredi
2010-07-15 14:50       ` Wu Fengguang
2010-07-11  2:06 ` [PATCH 2/6] writeback: reduce calls to global_page_state " Wu Fengguang
2010-07-26 15:19   ` Jan Kara
2010-07-27  3:59     ` Wu Fengguang
2010-07-27  9:12       ` Jan Kara
2010-07-28  2:04         ` Wu Fengguang
2010-08-03 14:55   ` Peter Zijlstra
2010-07-11  2:06 ` [PATCH 3/6] writeback: avoid unnecessary calculation of bdi dirty thresholds Wu Fengguang
2010-07-12 21:56   ` Andrew Morton
2010-07-15 14:55     ` Wu Fengguang
2010-07-19 21:35   ` Andrew Morton
2010-07-20  3:34     ` Wu Fengguang
2010-07-20  4:14       ` Andrew Morton
2010-08-03 15:03   ` Peter Zijlstra
2010-08-03 15:10     ` Wu Fengguang
2010-08-04 16:41     ` Wu Fengguang
2010-08-04 17:10       ` Peter Zijlstra
2010-07-11  2:07 ` [PATCH 4/6] writeback: dont redirty tail an inode with dirty pages Wu Fengguang
2010-07-12  2:01   ` Dave Chinner
2010-07-12 15:31     ` Wu Fengguang
2010-07-12 22:13       ` Andrew Morton
2010-07-15 15:35         ` Wu Fengguang
2010-07-11  2:07 ` [PATCH 5/6] writeback: fix queue_io() ordering Wu Fengguang
2010-07-12 22:15   ` Andrew Morton
2010-07-11  2:07 ` [PATCH 6/6] writeback: merge for_kupdate and !for_kupdate cases Wu Fengguang
2010-07-12  2:08   ` Dave Chinner
2010-07-12 15:52     ` Wu Fengguang
2010-07-12 22:06       ` Dave Chinner
2010-07-12 22:22       ` Andrew Morton
2010-08-05 16:01         ` Wu Fengguang
2010-07-11  2:44 ` [PATCH 0/6] writeback cleanups and trivial fixes Christoph Hellwig
2010-07-11  2:50   ` Wu Fengguang
