* [PATCH 00/45] some writeback experiments
@ 2009-10-07 7:38 Wu Fengguang
2009-10-07 7:38 ` [PATCH 01/45] writeback: reduce calls to global_page_state in balance_dirty_pages() Wu Fengguang
` (47 more replies)
0 siblings, 48 replies; 116+ messages in thread
From: Wu Fengguang @ 2009-10-07 7:38 UTC (permalink / raw)
To: Andrew Morton
Cc: Theodore Tso, Christoph Hellwig, Dave Chinner, Chris Mason,
Peter Zijlstra, Li Shaohua, Myklebust Trond,
jens.axboe@oracle.com, Jan Kara, Nick Piggin, linux-fsdevel,
Wu Fengguang, LKML
Hi all,
Here is a collection of writeback patches on
- larger writeback chunk sizes
- single per-bdi flush thread (killing the foreground throttling writeouts)
- lumpy pageout
- sync livelock prevention
- writeback scheduling
- random fixes
Sorry for posting such a big series - there are many direct or implicit
dependencies, and one patch led to another before I could stop.
The lumpy pageout and nr_segments support are not complete and do not
cover all filesystems for now. It may be better to first convert some of
the ->writepages implementations to the generic routines to avoid
duplicating work.
I managed to address many issues in the past week; however, there are
still known problems. Hints from filesystem developers are highly
appreciated. Thanks!
The estimated writeback bandwidth is about 1/2 the real throughput for
ext2/3/4 and btrfs; noticeably bigger than the real throughput for NFS;
and cannot be estimated at all for XFS. Very interesting.
NFS writeback is very bumpy. The page counters and network throughput "freeze"
together from time to time:
# vmmon -d 1 nr_writeback nr_dirty nr_unstable # (per 1-second samples)
nr_writeback nr_dirty nr_unstable
11227 41463 38044
11227 41463 38044
11227 41463 38044
11227 41463 38044
11045 53987 6490
11033 53120 8145
11195 52143 10886
11211 52144 10913
11211 52144 10913
11211 52144 10913
btrfs seems to maintain a private pool of writeback pages, which can go out of
control:
nr_writeback nr_dirty
261075 132
252891 195
244795 187
236851 187
228830 187
221040 218
212674 237
204981 237
XFS shows very interesting "bumpy writeback" behavior: it tends to wait
until it has collected enough pages and then writes the whole world.
nr_writeback nr_dirty
80781 0
37117 37703
37117 43933
81044 6
81050 0
43943 10199
43930 36355
43930 36355
80293 0
80285 0
80285 0
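(vmmon above is a local sampling tool; for reference, here is a minimal
sketch that takes roughly equivalent per-second samples from /proc/vmstat.
It assumes the 2.6.31 counter names nr_writeback, nr_dirty and nr_unstable,
and is illustrative only - it is not part of this series.)

/* vmstat-sample.c - print nr_writeback/nr_dirty/nr_unstable once per second */
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	char line[128];

	printf("%14s %14s %14s\n", "nr_writeback", "nr_dirty", "nr_unstable");
	for (;;) {
		long wb = 0, dirty = 0, unstable = 0, v;
		FILE *f = fopen("/proc/vmstat", "r");

		if (!f)
			return 1;
		while (fgets(line, sizeof(line), f)) {
			if (sscanf(line, "nr_writeback %ld", &v) == 1)
				wb = v;
			else if (sscanf(line, "nr_dirty %ld", &v) == 1)
				dirty = v;
			else if (sscanf(line, "nr_unstable %ld", &v) == 1)
				unstable = v;
		}
		fclose(f);
		printf("%14ld %14ld %14ld\n", wb, dirty, unstable);
		sleep(1);
	}
}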
Thanks,
Fengguang
* [PATCH 01/45] writeback: reduce calls to global_page_state in balance_dirty_pages()
2009-10-07 7:38 [PATCH 00/45] some writeback experiments Wu Fengguang
@ 2009-10-07 7:38 ` Wu Fengguang
2009-10-09 15:12 ` Jan Kara
2009-10-07 7:38 ` [PATCH 02/45] writeback: reduce calculation of bdi dirty thresholds Wu Fengguang
` (46 subsequent siblings)
47 siblings, 1 reply; 116+ messages in thread
From: Wu Fengguang @ 2009-10-07 7:38 UTC (permalink / raw)
To: Andrew Morton
Cc: Theodore Tso, Christoph Hellwig, Dave Chinner, Chris Mason,
Peter Zijlstra, Li Shaohua, Myklebust Trond,
jens.axboe@oracle.com, Jan Kara, Nick Piggin, linux-fsdevel,
Richard Kennedy, Wu Fengguang, LKML
[-- Attachment #1: writeback-less-balance-calc.patch --]
[-- Type: text/plain, Size: 6645 bytes --]
From: Richard Kennedy <richard@rsk.demon.co.uk>
Reducing the number of times balance_dirty_pages calls global_page_state
reduces the cache references and so improves write performance on a
variety of workloads.
'perf stat' runs of simple fio write tests show the reduction in cache
references.
The test is fio 'write,mmap,600Mb,pre_read' on an AMD AthlonX2 with
3GB memory (dirty_threshold approx 600MB), running each test 10 times,
dropping the fastest & slowest values, then taking the average &
standard deviation.
cache references, average (s.d.) in millions (10^6)
2.6.31-rc8 648.6 (14.6)
+patch 620.1 (16.5)
This reduction is achieved by dropping clip_bdi_dirty_limit, which
rereads the counters to apply the dirty_threshold, and by moving this
check up into balance_dirty_pages, where it has already read the
counters.
Also, rearranging the for loop to contain only one copy of the limit
tests allows the pdflush test after the loop to use the local copies of
the counters rather than rereading them.
In the common case with no throttling, it now calls global_page_state 5
fewer times and bdi_stat 2 fewer times.
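For reference, the combined limit test at the top of the loop (excerpted
and slightly condensed from the hunk below) becomes:

	dirty_exceeded =
		(bdi_nr_reclaimable + bdi_nr_writeback >= bdi_thresh)
		|| (nr_reclaimable + nr_writeback >= dirty_thresh);
	if (!dirty_exceeded)
		break;

so the exit decision reuses the counters that were just read.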
Signed-off-by: Richard Kennedy <richard@rsk.demon.co.uk>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
----
Thanks to everybody for the feedback & suggestions.
This patch is against 2.6.31-rc8
---
mm/page-writeback.c | 99 ++++++++++++++----------------------------
1 file changed, 33 insertions(+), 66 deletions(-)
--- linux.orig/mm/page-writeback.c 2009-10-06 23:31:42.000000000 +0800
+++ linux/mm/page-writeback.c 2009-10-06 23:31:53.000000000 +0800
@@ -252,32 +252,6 @@ static void bdi_writeout_fraction(struct
}
}
-/*
- * Clip the earned share of dirty pages to that which is actually available.
- * This avoids exceeding the total dirty_limit when the floating averages
- * fluctuate too quickly.
- */
-static void clip_bdi_dirty_limit(struct backing_dev_info *bdi,
- unsigned long dirty, unsigned long *pbdi_dirty)
-{
- unsigned long avail_dirty;
-
- avail_dirty = global_page_state(NR_FILE_DIRTY) +
- global_page_state(NR_WRITEBACK) +
- global_page_state(NR_UNSTABLE_NFS) +
- global_page_state(NR_WRITEBACK_TEMP);
-
- if (avail_dirty < dirty)
- avail_dirty = dirty - avail_dirty;
- else
- avail_dirty = 0;
-
- avail_dirty += bdi_stat(bdi, BDI_RECLAIMABLE) +
- bdi_stat(bdi, BDI_WRITEBACK);
-
- *pbdi_dirty = min(*pbdi_dirty, avail_dirty);
-}
-
static inline void task_dirties_fraction(struct task_struct *tsk,
long *numerator, long *denominator)
{
@@ -468,7 +442,6 @@ get_dirty_limits(unsigned long *pbackgro
bdi_dirty = dirty * bdi->max_ratio / 100;
*pbdi_dirty = bdi_dirty;
- clip_bdi_dirty_limit(bdi, dirty, pbdi_dirty);
task_dirty_limit(current, pbdi_dirty);
}
}
@@ -490,7 +463,7 @@ static void balance_dirty_pages(struct a
unsigned long bdi_thresh;
unsigned long pages_written = 0;
unsigned long pause = 1;
-
+ int dirty_exceeded;
struct backing_dev_info *bdi = mapping->backing_dev_info;
for (;;) {
@@ -503,16 +476,36 @@ static void balance_dirty_pages(struct a
};
get_dirty_limits(&background_thresh, &dirty_thresh,
- &bdi_thresh, bdi);
+ &bdi_thresh, bdi);
nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
- global_page_state(NR_UNSTABLE_NFS);
- nr_writeback = global_page_state(NR_WRITEBACK);
+ global_page_state(NR_UNSTABLE_NFS);
+ nr_writeback = global_page_state(NR_WRITEBACK) +
+ global_page_state(NR_WRITEBACK_TEMP);
- bdi_nr_reclaimable = bdi_stat(bdi, BDI_RECLAIMABLE);
- bdi_nr_writeback = bdi_stat(bdi, BDI_WRITEBACK);
+ /*
+ * In order to avoid the stacked BDI deadlock we need
+ * to ensure we accurately count the 'dirty' pages when
+ * the threshold is low.
+ *
+ * Otherwise it would be possible to get thresh+n pages
+ * reported dirty, even though there are thresh-m pages
+ * actually dirty; with m+n sitting in the percpu
+ * deltas.
+ */
+ if (bdi_thresh < 2*bdi_stat_error(bdi)) {
+ bdi_nr_reclaimable = bdi_stat_sum(bdi, BDI_RECLAIMABLE);
+ bdi_nr_writeback = bdi_stat_sum(bdi, BDI_WRITEBACK);
+ } else {
+ bdi_nr_reclaimable = bdi_stat(bdi, BDI_RECLAIMABLE);
+ bdi_nr_writeback = bdi_stat(bdi, BDI_WRITEBACK);
+ }
+
+ dirty_exceeded =
+ (bdi_nr_reclaimable + bdi_nr_writeback >= bdi_thresh)
+ || (nr_reclaimable + nr_writeback >= dirty_thresh);
- if (bdi_nr_reclaimable + bdi_nr_writeback <= bdi_thresh)
+ if (!dirty_exceeded)
break;
/*
@@ -521,7 +514,7 @@ static void balance_dirty_pages(struct a
* when the bdi limits are ramping up.
*/
if (nr_reclaimable + nr_writeback <
- (background_thresh + dirty_thresh) / 2)
+ (background_thresh + dirty_thresh) / 2)
break;
if (!bdi->dirty_exceeded)
@@ -539,33 +532,10 @@ static void balance_dirty_pages(struct a
if (bdi_nr_reclaimable > bdi_thresh) {
writeback_inodes_wbc(&wbc);
pages_written += write_chunk - wbc.nr_to_write;
- get_dirty_limits(&background_thresh, &dirty_thresh,
- &bdi_thresh, bdi);
- }
-
- /*
- * In order to avoid the stacked BDI deadlock we need
- * to ensure we accurately count the 'dirty' pages when
- * the threshold is low.
- *
- * Otherwise it would be possible to get thresh+n pages
- * reported dirty, even though there are thresh-m pages
- * actually dirty; with m+n sitting in the percpu
- * deltas.
- */
- if (bdi_thresh < 2*bdi_stat_error(bdi)) {
- bdi_nr_reclaimable = bdi_stat_sum(bdi, BDI_RECLAIMABLE);
- bdi_nr_writeback = bdi_stat_sum(bdi, BDI_WRITEBACK);
- } else if (bdi_nr_reclaimable) {
- bdi_nr_reclaimable = bdi_stat(bdi, BDI_RECLAIMABLE);
- bdi_nr_writeback = bdi_stat(bdi, BDI_WRITEBACK);
+ /* don't wait if we've done enough */
+ if (pages_written >= write_chunk)
+ break;
}
-
- if (bdi_nr_reclaimable + bdi_nr_writeback <= bdi_thresh)
- break;
- if (pages_written >= write_chunk)
- break; /* We've done our duty */
-
schedule_timeout_interruptible(pause);
/*
@@ -577,8 +547,7 @@ static void balance_dirty_pages(struct a
pause = HZ / 10;
}
- if (bdi_nr_reclaimable + bdi_nr_writeback < bdi_thresh &&
- bdi->dirty_exceeded)
+ if (!dirty_exceeded && bdi->dirty_exceeded)
bdi->dirty_exceeded = 0;
if (writeback_in_progress(bdi))
@@ -593,9 +562,7 @@ static void balance_dirty_pages(struct a
* background_thresh, to keep the amount of dirty memory low.
*/
if ((laptop_mode && pages_written) ||
- (!laptop_mode && ((global_page_state(NR_FILE_DIRTY)
- + global_page_state(NR_UNSTABLE_NFS))
- > background_thresh)))
+ (!laptop_mode && (nr_reclaimable > background_thresh)))
bdi_start_writeback(bdi, NULL, 0);
}
* [PATCH 02/45] writeback: reduce calculation of bdi dirty thresholds
2009-10-07 7:38 [PATCH 00/45] some writeback experiments Wu Fengguang
2009-10-07 7:38 ` [PATCH 01/45] writeback: reduce calls to global_page_state in balance_dirty_pages() Wu Fengguang
@ 2009-10-07 7:38 ` Wu Fengguang
2009-10-07 7:38 ` [PATCH 03/45] ext4: remove unused parameter wbc from __ext4_journalled_writepage() Wu Fengguang
` (45 subsequent siblings)
47 siblings, 0 replies; 116+ messages in thread
From: Wu Fengguang @ 2009-10-07 7:38 UTC (permalink / raw)
To: Andrew Morton
Cc: Theodore Tso, Christoph Hellwig, Dave Chinner, Chris Mason,
Peter Zijlstra, Li Shaohua, Myklebust Trond,
jens.axboe@oracle.com, Jan Kara, Nick Piggin, linux-fsdevel,
Wu Fengguang, LKML
[-- Attachment #1: writeback-less-bdi-calc.patch --]
[-- Type: text/plain, Size: 6263 bytes --]
Split get_dirty_limits() into global_dirty_thresh() and bdi_dirty_thresh(),
so that the latter can be avoided in balance_dirty_pages() when under the
global dirty threshold.
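In balance_dirty_pages() the resulting call pattern is roughly the
following (simplified from the hunks below):

	global_dirty_thresh(&background_thresh, &dirty_thresh);
	/* under (background + dirty)/2: no throttling, no bdi math needed */
	if (nr_reclaimable + nr_writeback <
			(background_thresh + dirty_thresh) / 2)
		break;
	bdi_thresh = bdi_dirty_thresh(bdi, dirty_thresh);

so bdi_writeout_fraction() and the per-task adjustment are only computed
once the global threshold is approached.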
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
fs/fs-writeback.c | 2
include/linux/writeback.h | 5 +-
mm/backing-dev.c | 3 -
mm/page-writeback.c | 74 ++++++++++++++++++------------------
4 files changed, 43 insertions(+), 41 deletions(-)
--- linux.orig/mm/page-writeback.c 2009-10-06 23:31:53.000000000 +0800
+++ linux/mm/page-writeback.c 2009-10-06 23:31:54.000000000 +0800
@@ -266,10 +266,11 @@ static inline void task_dirties_fraction
*
* dirty -= (dirty/8) * p_{t}
*/
-static void task_dirty_limit(struct task_struct *tsk, unsigned long *pdirty)
+static unsigned long task_dirty_thresh(struct task_struct *tsk,
+ unsigned long bdi_dirty)
{
long numerator, denominator;
- unsigned long dirty = *pdirty;
+ unsigned long dirty = bdi_dirty;
u64 inv = dirty >> 3;
task_dirties_fraction(tsk, &numerator, &denominator);
@@ -277,10 +278,8 @@ static void task_dirty_limit(struct task
do_div(inv, denominator);
dirty -= inv;
- if (dirty < *pdirty/2)
- dirty = *pdirty/2;
- *pdirty = dirty;
+ return max(dirty, bdi_dirty/2);
}
/*
@@ -390,9 +389,7 @@ unsigned long determine_dirtyable_memory
return x + 1; /* Ensure that we never return 0 */
}
-void
-get_dirty_limits(unsigned long *pbackground, unsigned long *pdirty,
- unsigned long *pbdi_dirty, struct backing_dev_info *bdi)
+void global_dirty_thresh(unsigned long *pbackground, unsigned long *pdirty)
{
unsigned long background;
unsigned long dirty;
@@ -424,26 +421,28 @@ get_dirty_limits(unsigned long *pbackgro
}
*pbackground = background;
*pdirty = dirty;
+}
- if (bdi) {
- u64 bdi_dirty;
- long numerator, denominator;
+unsigned long bdi_dirty_thresh(struct backing_dev_info *bdi,
+ unsigned long dirty)
+{
+ u64 bdi_dirty;
+ long numerator, denominator;
- /*
- * Calculate this BDI's share of the dirty ratio.
- */
- bdi_writeout_fraction(bdi, &numerator, &denominator);
+ /*
+ * Calculate this BDI's share of the dirty ratio.
+ */
+ bdi_writeout_fraction(bdi, &numerator, &denominator);
- bdi_dirty = (dirty * (100 - bdi_min_ratio)) / 100;
- bdi_dirty *= numerator;
- do_div(bdi_dirty, denominator);
- bdi_dirty += (dirty * bdi->min_ratio) / 100;
- if (bdi_dirty > (dirty * bdi->max_ratio) / 100)
- bdi_dirty = dirty * bdi->max_ratio / 100;
+ bdi_dirty = (dirty * (100 - bdi_min_ratio)) / 100;
+ bdi_dirty *= numerator;
+ do_div(bdi_dirty, denominator);
- *pbdi_dirty = bdi_dirty;
- task_dirty_limit(current, pbdi_dirty);
- }
+ bdi_dirty += (dirty * bdi->min_ratio) / 100;
+ if (bdi_dirty > (dirty * bdi->max_ratio) / 100)
+ bdi_dirty = dirty * bdi->max_ratio / 100;
+
+ return task_dirty_thresh(current, bdi_dirty);
}
/*
@@ -475,14 +474,24 @@ static void balance_dirty_pages(struct a
.range_cyclic = 1,
};
- get_dirty_limits(&background_thresh, &dirty_thresh,
- &bdi_thresh, bdi);
-
nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
global_page_state(NR_UNSTABLE_NFS);
nr_writeback = global_page_state(NR_WRITEBACK) +
global_page_state(NR_WRITEBACK_TEMP);
+ global_dirty_thresh(&background_thresh, &dirty_thresh);
+
+ /*
+ * Throttle it only when the background writeback cannot
+ * catch-up. This avoids (excessively) small writeouts
+ * when the bdi limits are ramping up.
+ */
+ if (nr_reclaimable + nr_writeback <
+ (background_thresh + dirty_thresh) / 2)
+ break;
+
+ bdi_thresh = bdi_dirty_thresh(bdi, dirty_thresh);
+
/*
* In order to avoid the stacked BDI deadlock we need
* to ensure we accurately count the 'dirty' pages when
@@ -508,15 +517,6 @@ static void balance_dirty_pages(struct a
if (!dirty_exceeded)
break;
- /*
- * Throttle it only when the background writeback cannot
- * catch-up. This avoids (excessively) small writeouts
- * when the bdi limits are ramping up.
- */
- if (nr_reclaimable + nr_writeback <
- (background_thresh + dirty_thresh) / 2)
- break;
-
if (!bdi->dirty_exceeded)
bdi->dirty_exceeded = 1;
@@ -626,7 +626,7 @@ void throttle_vm_writeout(gfp_t gfp_mask
unsigned long dirty_thresh;
for ( ; ; ) {
- get_dirty_limits(&background_thresh, &dirty_thresh, NULL, NULL);
+ global_dirty_thresh(&background_thresh, &dirty_thresh);
/*
* Boost the allowable dirty threshold a bit for page
--- linux.orig/fs/fs-writeback.c 2009-10-06 23:31:52.000000000 +0800
+++ linux/fs/fs-writeback.c 2009-10-06 23:31:54.000000000 +0800
@@ -729,7 +729,7 @@ static inline bool over_bground_thresh(v
{
unsigned long background_thresh, dirty_thresh;
- get_dirty_limits(&background_thresh, &dirty_thresh, NULL, NULL);
+ global_dirty_thresh(&background_thresh, &dirty_thresh);
return (global_page_state(NR_FILE_DIRTY) +
global_page_state(NR_UNSTABLE_NFS) >= background_thresh);
--- linux.orig/mm/backing-dev.c 2009-10-06 23:31:42.000000000 +0800
+++ linux/mm/backing-dev.c 2009-10-06 23:31:54.000000000 +0800
@@ -83,7 +83,8 @@ static int bdi_debug_stats_show(struct s
}
spin_unlock(&inode_lock);
- get_dirty_limits(&background_thresh, &dirty_thresh, &bdi_thresh, bdi);
+ global_dirty_thresh(&background_thresh, &dirty_thresh);
+ bdi_thresh = bdi_dirty_thresh(bdi, dirty_thresh);
#define K(x) ((x) << (PAGE_SHIFT - 10))
seq_printf(m,
--- linux.orig/include/linux/writeback.h 2009-10-06 23:31:52.000000000 +0800
+++ linux/include/linux/writeback.h 2009-10-06 23:31:54.000000000 +0800
@@ -126,8 +126,9 @@ struct ctl_table;
int dirty_writeback_centisecs_handler(struct ctl_table *, int,
void __user *, size_t *, loff_t *);
-void get_dirty_limits(unsigned long *pbackground, unsigned long *pdirty,
- unsigned long *pbdi_dirty, struct backing_dev_info *bdi);
+void global_dirty_thresh(unsigned long *pbackground, unsigned long *pdirty);
+unsigned long bdi_dirty_thresh(struct backing_dev_info *bdi,
+ unsigned long dirty);
void page_writeback_init(void);
void balance_dirty_pages_ratelimited_nr(struct address_space *mapping,
* [PATCH 03/45] ext4: remove unused parameter wbc from __ext4_journalled_writepage()
2009-10-07 7:38 [PATCH 00/45] some writeback experiments Wu Fengguang
2009-10-07 7:38 ` [PATCH 01/45] writeback: reduce calls to global_page_state in balance_dirty_pages() Wu Fengguang
2009-10-07 7:38 ` [PATCH 02/45] writeback: reduce calculation of bdi dirty thresholds Wu Fengguang
@ 2009-10-07 7:38 ` Wu Fengguang
2009-10-07 7:38 ` [PATCH 04/45] writeback: remove unused nonblocking and congestion checks Wu Fengguang
` (44 subsequent siblings)
47 siblings, 0 replies; 116+ messages in thread
From: Wu Fengguang @ 2009-10-07 7:38 UTC (permalink / raw)
To: Andrew Morton
Cc: Theodore Tso, Christoph Hellwig, Dave Chinner, Chris Mason,
Peter Zijlstra, Li Shaohua, Myklebust Trond,
jens.axboe@oracle.com, Jan Kara, Nick Piggin, linux-fsdevel,
Wu Fengguang, LKML
[-- Attachment #1: writeback-ext4-remove-wbc.patch --]
[-- Type: text/plain, Size: 920 bytes --]
CC: Theodore Ts'o <tytso@mit.edu>
CC: Jan Kara <jack@suse.cz>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
fs/ext4/inode.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
--- linux.orig/fs/ext4/inode.c 2009-10-06 23:31:41.000000000 +0800
+++ linux/fs/ext4/inode.c 2009-10-06 23:31:54.000000000 +0800
@@ -2599,7 +2599,6 @@ static int bput_one(handle_t *handle, st
}
static int __ext4_journalled_writepage(struct page *page,
- struct writeback_control *wbc,
unsigned int len)
{
struct address_space *mapping = page->mapping;
@@ -2757,7 +2756,7 @@ static int ext4_writepage(struct page *p
* doesn't seem much point in redirtying the page here.
*/
ClearPageChecked(page);
- return __ext4_journalled_writepage(page, wbc, len);
+ return __ext4_journalled_writepage(page, len);
}
if (test_opt(inode->i_sb, NOBH) && ext4_should_writeback_data(inode))
* [PATCH 04/45] writeback: remove unused nonblocking and congestion checks
2009-10-07 7:38 [PATCH 00/45] some writeback experiments Wu Fengguang
` (2 preceding siblings ...)
2009-10-07 7:38 ` [PATCH 03/45] ext4: remove unused parameter wbc from __ext4_journalled_writepage() Wu Fengguang
@ 2009-10-07 7:38 ` Wu Fengguang
2009-10-09 15:26 ` Jan Kara
2009-10-07 7:38 ` [PATCH 05/45] writeback: remove the always false bdi_cap_writeback_dirty() test Wu Fengguang
` (43 subsequent siblings)
47 siblings, 1 reply; 116+ messages in thread
From: Wu Fengguang @ 2009-10-07 7:38 UTC (permalink / raw)
To: Andrew Morton
Cc: Theodore Tso, Christoph Hellwig, Dave Chinner, Chris Mason,
Peter Zijlstra, Li Shaohua, Myklebust Trond,
jens.axboe@oracle.com, Jan Kara, Nick Piggin, linux-fsdevel,
Wu Fengguang, LKML
[-- Attachment #1: writeback-remove-congested-checks.patch --]
[-- Type: text/plain, Size: 6095 bytes --]
- no one is calling wb_writeback and write_cache_pages with
wbc.nonblocking=1 any more
- lumpy pageout will want to do nonblocking writeback without the
congestion wait
So remove the congestion checks as suggested by Chris.
CC: Chris Mason <chris.mason@oracle.com>
CC: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
drivers/staging/pohmelfs/inode.c | 9 ---------
fs/afs/write.c | 16 +---------------
fs/cifs/file.c | 10 ----------
fs/fs-writeback.c | 8 --------
fs/gfs2/aops.c | 10 ----------
fs/xfs/linux-2.6/xfs_aops.c | 6 +-----
mm/page-writeback.c | 12 ------------
7 files changed, 2 insertions(+), 69 deletions(-)
--- linux.orig/fs/fs-writeback.c 2009-10-06 23:31:54.000000000 +0800
+++ linux/fs/fs-writeback.c 2009-10-06 23:31:59.000000000 +0800
@@ -660,14 +660,6 @@ static void writeback_inodes_wb(struct b
continue;
}
- if (wbc->nonblocking && bdi_write_congested(wb->bdi)) {
- wbc->encountered_congestion = 1;
- if (!is_blkdev_sb)
- break; /* Skip a congested fs */
- requeue_io(inode);
- continue; /* Skip a congested blockdev */
- }
-
/*
* Was this inode dirtied after sync_sb_inodes was called?
* This keeps sync from extra jobs and livelock.
--- linux.orig/mm/page-writeback.c 2009-10-06 23:31:54.000000000 +0800
+++ linux/mm/page-writeback.c 2009-10-06 23:31:59.000000000 +0800
@@ -787,7 +787,6 @@ int write_cache_pages(struct address_spa
struct writeback_control *wbc, writepage_t writepage,
void *data)
{
- struct backing_dev_info *bdi = mapping->backing_dev_info;
int ret = 0;
int done = 0;
struct pagevec pvec;
@@ -800,11 +799,6 @@ int write_cache_pages(struct address_spa
int range_whole = 0;
long nr_to_write = wbc->nr_to_write;
- if (wbc->nonblocking && bdi_write_congested(bdi)) {
- wbc->encountered_congestion = 1;
- return 0;
- }
-
pagevec_init(&pvec, 0);
if (wbc->range_cyclic) {
writeback_index = mapping->writeback_index; /* prev offset */
@@ -923,12 +917,6 @@ continue_unlock:
break;
}
}
-
- if (wbc->nonblocking && bdi_write_congested(bdi)) {
- wbc->encountered_congestion = 1;
- done = 1;
- break;
- }
}
pagevec_release(&pvec);
cond_resched();
--- linux.orig/drivers/staging/pohmelfs/inode.c 2009-10-06 23:31:41.000000000 +0800
+++ linux/drivers/staging/pohmelfs/inode.c 2009-10-06 23:31:59.000000000 +0800
@@ -152,11 +152,6 @@ static int pohmelfs_writepages(struct ad
int scanned = 0;
int range_whole = 0;
- if (wbc->nonblocking && bdi_write_congested(bdi)) {
- wbc->encountered_congestion = 1;
- return 0;
- }
-
if (wbc->range_cyclic) {
index = mapping->writeback_index; /* Start from prev offset */
end = -1;
@@ -248,10 +243,6 @@ retry:
if (wbc->nr_to_write <= 0)
done = 1;
- if (wbc->nonblocking && bdi_write_congested(bdi)) {
- wbc->encountered_congestion = 1;
- done = 1;
- }
continue;
out_continue:
--- linux.orig/fs/afs/write.c 2009-10-06 23:31:41.000000000 +0800
+++ linux/fs/afs/write.c 2009-10-06 23:31:59.000000000 +0800
@@ -455,8 +455,6 @@ int afs_writepage(struct page *page, str
}
wbc->nr_to_write -= ret;
- if (wbc->nonblocking && bdi_write_congested(bdi))
- wbc->encountered_congestion = 1;
_leave(" = 0");
return 0;
@@ -529,11 +527,6 @@ static int afs_writepages_region(struct
wbc->nr_to_write -= ret;
- if (wbc->nonblocking && bdi_write_congested(bdi)) {
- wbc->encountered_congestion = 1;
- break;
- }
-
cond_resched();
} while (index < end && wbc->nr_to_write > 0);
@@ -554,18 +547,11 @@ int afs_writepages(struct address_space
_enter("");
- if (wbc->nonblocking && bdi_write_congested(bdi)) {
- wbc->encountered_congestion = 1;
- _leave(" = 0 [congest]");
- return 0;
- }
-
if (wbc->range_cyclic) {
start = mapping->writeback_index;
end = -1;
ret = afs_writepages_region(mapping, wbc, start, end, &next);
- if (start > 0 && wbc->nr_to_write > 0 && ret == 0 &&
- !(wbc->nonblocking && wbc->encountered_congestion))
+ if (start > 0 && wbc->nr_to_write > 0 && ret == 0)
ret = afs_writepages_region(mapping, wbc, 0, start,
&next);
mapping->writeback_index = next;
--- linux.orig/fs/cifs/file.c 2009-10-06 23:31:41.000000000 +0800
+++ linux/fs/cifs/file.c 2009-10-06 23:31:59.000000000 +0800
@@ -1379,16 +1379,6 @@ static int cifs_writepages(struct addres
return generic_writepages(mapping, wbc);
- /*
- * BB: Is this meaningful for a non-block-device file system?
- * If it is, we should test it again after we do I/O
- */
- if (wbc->nonblocking && bdi_write_congested(bdi)) {
- wbc->encountered_congestion = 1;
- kfree(iov);
- return 0;
- }
-
xid = GetXid();
pagevec_init(&pvec, 0);
--- linux.orig/fs/gfs2/aops.c 2009-10-06 23:31:41.000000000 +0800
+++ linux/fs/gfs2/aops.c 2009-10-06 23:31:59.000000000 +0800
@@ -313,11 +313,6 @@ static int gfs2_write_jdata_pagevec(stru
if (ret || (--(wbc->nr_to_write) <= 0))
ret = 1;
- if (wbc->nonblocking && bdi_write_congested(bdi)) {
- wbc->encountered_congestion = 1;
- ret = 1;
- }
-
}
gfs2_trans_end(sdp);
return ret;
@@ -348,11 +343,6 @@ static int gfs2_write_cache_jdata(struct
int scanned = 0;
int range_whole = 0;
- if (wbc->nonblocking && bdi_write_congested(bdi)) {
- wbc->encountered_congestion = 1;
- return 0;
- }
-
pagevec_init(&pvec, 0);
if (wbc->range_cyclic) {
index = mapping->writeback_index; /* Start from prev offset */
--- linux.orig/fs/xfs/linux-2.6/xfs_aops.c 2009-10-06 23:31:41.000000000 +0800
+++ linux/fs/xfs/linux-2.6/xfs_aops.c 2009-10-06 23:31:59.000000000 +0800
@@ -890,12 +890,8 @@ xfs_convert_page(
bdi = inode->i_mapping->backing_dev_info;
wbc->nr_to_write--;
- if (bdi_write_congested(bdi)) {
- wbc->encountered_congestion = 1;
+ if (wbc->nr_to_write <= 0)
done = 1;
- } else if (wbc->nr_to_write <= 0) {
- done = 1;
- }
}
xfs_start_page_writeback(page, !page_dirty, count);
}
* [PATCH 05/45] writeback: remove the always false bdi_cap_writeback_dirty() test
2009-10-07 7:38 [PATCH 00/45] some writeback experiments Wu Fengguang
` (3 preceding siblings ...)
2009-10-07 7:38 ` [PATCH 04/45] writeback: remove unused nonblocking and congestion checks Wu Fengguang
@ 2009-10-07 7:38 ` Wu Fengguang
2009-10-07 7:38 ` [PATCH 06/45] writeback: use larger ratelimit when dirty_exceeded Wu Fengguang
` (42 subsequent siblings)
47 siblings, 0 replies; 116+ messages in thread
From: Wu Fengguang @ 2009-10-07 7:38 UTC (permalink / raw)
To: Andrew Morton
Cc: Theodore Tso, Christoph Hellwig, Dave Chinner, Chris Mason,
Peter Zijlstra, Li Shaohua, Myklebust Trond,
jens.axboe@oracle.com, Jan Kara, Nick Piggin, linux-fsdevel,
Wu Fengguang, LKML
[-- Attachment #1: writeback-remove-memory-bdi.patch --]
[-- Type: text/plain, Size: 1275 bytes --]
This is dead code because no bdi flush thread will be started for
!bdi_cap_writeback_dirty bdis.
CC: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
fs/fs-writeback.c | 18 ------------------
1 file changed, 18 deletions(-)
--- linux.orig/fs/fs-writeback.c 2009-10-06 23:31:59.000000000 +0800
+++ linux/fs/fs-writeback.c 2009-10-06 23:37:57.000000000 +0800
@@ -617,7 +617,6 @@ static void writeback_inodes_wb(struct b
struct writeback_control *wbc)
{
struct super_block *sb = wbc->sb, *pin_sb = NULL;
- const int is_blkdev_sb = sb_is_blkdev_sb(sb);
const unsigned long start = jiffies; /* livelock avoidance */
spin_lock(&inode_lock);
@@ -638,23 +637,6 @@ static void writeback_inodes_wb(struct b
continue;
}
- if (!bdi_cap_writeback_dirty(wb->bdi)) {
- redirty_tail(inode);
- if (is_blkdev_sb) {
- /*
- * Dirty memory-backed blockdev: the ramdisk
- * driver does this. Skip just this inode
- */
- continue;
- }
- /*
- * Dirty memory-backed inode against a filesystem other
- * than the kernel-internal bdev filesystem. Skip the
- * entire superblock.
- */
- break;
- }
-
if (inode->i_state & (I_NEW | I_WILL_FREE)) {
requeue_io(inode);
continue;
* [PATCH 06/45] writeback: use larger ratelimit when dirty_exceeded
2009-10-07 7:38 [PATCH 00/45] some writeback experiments Wu Fengguang
` (4 preceding siblings ...)
2009-10-07 7:38 ` [PATCH 05/45] writeback: remove the always false bdi_cap_writeback_dirty() test Wu Fengguang
@ 2009-10-07 7:38 ` Wu Fengguang
2009-10-07 8:53 ` Peter Zijlstra
2009-10-07 7:38 ` [PATCH 07/45] writeback: dont redirty tail an inode with dirty pages Wu Fengguang
` (41 subsequent siblings)
47 siblings, 1 reply; 116+ messages in thread
From: Wu Fengguang @ 2009-10-07 7:38 UTC (permalink / raw)
To: Andrew Morton
Cc: Theodore Tso, Christoph Hellwig, Dave Chinner, Chris Mason,
Peter Zijlstra, Li Shaohua, Myklebust Trond,
jens.axboe@oracle.com, Jan Kara, Nick Piggin, linux-fsdevel,
Richard Kennedy, Wu Fengguang, LKML
[-- Attachment #1: writeback-ratelimit-on-dirty-exceeded.patch --]
[-- Type: text/plain, Size: 2170 bytes --]
When dirty_exceeded, use ratelimit = ratelimit_pages/8, allowing it to
scale up to 512KB on systems with plenty of memory. This is more
efficient than the original 8 pages, and won't risk exceeding the dirty
limit by too much.
Given the larger ratelimit value, we can safely drop the lower-bound
check in sync_writeback_pages.
dirty_exceeded is more likely to be seen when there are multiple
dirtying processes, in which case the lowered ratelimit helps reduce
their overall wait time (latency) in the throttled queue.
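As a worked example (assuming 4KB pages): MAX_RATELIMIT_PAGES =
(4096 * 1024) / 4096 = 1024 pages, so with ratelimit_pages at its
maximum the dirty_exceeded ratelimit becomes 1024 / 8 = 128 pages =
512KB, versus the old fixed value of 8 pages = 32KB.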
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
CC: Richard Kennedy <richard@rsk.demon.co.uk>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
mm/page-writeback.c | 12 +++++-------
1 file changed, 5 insertions(+), 7 deletions(-)
--- linux.orig/mm/page-writeback.c 2009-10-06 23:37:50.000000000 +0800
+++ linux/mm/page-writeback.c 2009-10-06 23:38:18.000000000 +0800
@@ -39,7 +39,8 @@
* After a CPU has dirtied this many pages, balance_dirty_pages_ratelimited
* will look to see if it needs to force writeback or throttling.
*/
-static long ratelimit_pages = 32;
+#define MAX_RATELIMIT_PAGES ((4096 * 1024) / PAGE_CACHE_SIZE)
+static long ratelimit_pages = MAX_RATELIMIT_PAGES;
/*
* When balance_dirty_pages decides that the caller needs to perform some
@@ -49,9 +50,6 @@ static long ratelimit_pages = 32;
*/
static inline long sync_writeback_pages(unsigned long dirtied)
{
- if (dirtied < ratelimit_pages)
- dirtied = ratelimit_pages;
-
return dirtied + dirtied / 2;
}
@@ -600,7 +598,7 @@ void balance_dirty_pages_ratelimited_nr(
ratelimit = ratelimit_pages;
if (mapping->backing_dev_info->dirty_exceeded)
- ratelimit = 8;
+ ratelimit >>= 3;
/*
* Check the rate limiting. Also, we do not want to throttle real-time
@@ -722,8 +720,8 @@ void writeback_set_ratelimit(void)
ratelimit_pages = vm_total_pages / (num_online_cpus() * 32);
if (ratelimit_pages < 16)
ratelimit_pages = 16;
- if (ratelimit_pages * PAGE_CACHE_SIZE > 4096 * 1024)
- ratelimit_pages = (4096 * 1024) / PAGE_CACHE_SIZE;
+ if (ratelimit_pages > MAX_RATELIMIT_PAGES)
+ ratelimit_pages = MAX_RATELIMIT_PAGES;
}
static int __cpuinit
* [PATCH 07/45] writeback: dont redirty tail an inode with dirty pages
2009-10-07 7:38 [PATCH 00/45] some writeback experiments Wu Fengguang
` (5 preceding siblings ...)
2009-10-07 7:38 ` [PATCH 06/45] writeback: use larger ratelimit when dirty_exceeded Wu Fengguang
@ 2009-10-07 7:38 ` Wu Fengguang
2009-10-09 15:45 ` Jan Kara
2009-10-07 7:38 ` [PATCH 08/45] writeback: quit on wrap for .range_cyclic (write_cache_pages) Wu Fengguang
` (40 subsequent siblings)
47 siblings, 1 reply; 116+ messages in thread
From: Wu Fengguang @ 2009-10-07 7:38 UTC (permalink / raw)
To: Andrew Morton
Cc: Theodore Tso, Christoph Hellwig, Dave Chinner, Chris Mason,
Peter Zijlstra, Li Shaohua, Myklebust Trond,
jens.axboe@oracle.com, Jan Kara, Nick Piggin, linux-fsdevel,
Wu Fengguang, LKML
[-- Attachment #1: writeback-xfs-fast-redirty.patch --]
[-- Type: text/plain, Size: 1891 bytes --]
This avoids delaying writeback for an expired (XFS) inode with lots of
dirty pages, but no active dirtier at the moment.
CC: Dave Chinner <david@fromorbit.com>
CC: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
fs/fs-writeback.c | 20 +++++++-------------
1 file changed, 7 insertions(+), 13 deletions(-)
--- linux.orig/fs/fs-writeback.c 2009-10-06 23:37:57.000000000 +0800
+++ linux/fs/fs-writeback.c 2009-10-06 23:38:28.000000000 +0800
@@ -479,18 +479,7 @@ writeback_single_inode(struct inode *ino
spin_lock(&inode_lock);
inode->i_state &= ~I_SYNC;
if (!(inode->i_state & (I_FREEING | I_CLEAR))) {
- if ((inode->i_state & I_DIRTY_PAGES) && wbc->for_kupdate) {
- /*
- * More pages get dirtied by a fast dirtier.
- */
- goto select_queue;
- } else if (inode->i_state & I_DIRTY) {
- /*
- * At least XFS will redirty the inode during the
- * writeback (delalloc) and on io completion (isize).
- */
- redirty_tail(inode);
- } else if (mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
+ if (mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
/*
* We didn't write back all the pages. nfs_writepages()
* sometimes bales out without doing anything. Redirty
@@ -512,7 +501,6 @@ writeback_single_inode(struct inode *ino
* soon as the queue becomes uncongested.
*/
inode->i_state |= I_DIRTY_PAGES;
-select_queue:
if (wbc->nr_to_write <= 0) {
/*
* slice used up: queue for next turn
@@ -535,6 +523,12 @@ select_queue:
inode->i_state |= I_DIRTY_PAGES;
redirty_tail(inode);
}
+ } else if (inode->i_state & I_DIRTY) {
+ /*
+ * At least XFS will redirty the inode during the
+ * writeback (delalloc) and on io completion (isize).
+ */
+ redirty_tail(inode);
} else if (atomic_read(&inode->i_count)) {
/*
* The inode is clean, inuse
* [PATCH 08/45] writeback: quit on wrap for .range_cyclic (write_cache_pages)
2009-10-07 7:38 [PATCH 00/45] some writeback experiments Wu Fengguang
` (6 preceding siblings ...)
2009-10-07 7:38 ` [PATCH 07/45] writeback: dont redirty tail an inode with dirty pages Wu Fengguang
@ 2009-10-07 7:38 ` Wu Fengguang
2009-10-07 7:38 ` [PATCH 09/45] writeback: quit on wrap for .range_cyclic (pohmelfs) Wu Fengguang
` (39 subsequent siblings)
47 siblings, 0 replies; 116+ messages in thread
From: Wu Fengguang @ 2009-10-07 7:38 UTC (permalink / raw)
To: Andrew Morton
Cc: Theodore Tso, Christoph Hellwig, Dave Chinner, Chris Mason,
Peter Zijlstra, Li Shaohua, Myklebust Trond,
jens.axboe@oracle.com, Jan Kara, Nick Piggin, linux-fsdevel,
Christoph Lameter, Wu Fengguang, LKML
[-- Attachment #1: linux_mm_page-writeback.c --]
[-- Type: text/plain, Size: 3063 bytes --]
Convert wbc.range_cyclic to the new behavior: when past EOF, abort the
writeback of the current inode, which instructs writeback_single_inode()
to delay it for a while if necessary.
This is the right behavior for
- sync writeback (already the case with range_whole):
  we have scanned the inode's address space and no longer care about
  newly dirtied pages, so we shall update its i_dirtied_when and exclude
  it from the todo list.
- periodic writeback:
  any newly dirtied pages may be delayed for a while. This also prevents
  pointless IO for busy overwriters.
- background writeback:
  irrelevant, because it generally does not care about the dirty
  timestamp.
That should get rid of one inefficient .range_cyclic IO pattern when
writeback_index wraps, in which the submitted pages may consist of two
distant ranges: submit [10000-10100], (wrap), submit [0-100].
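In sketch form (taken from the write_cache_pages() hunks below), the new
range_cyclic behavior is:

	if (wbc->range_cyclic) {
		index = mapping->writeback_index;  /* resume at prev offset */
		end = -1;
	}
	...
	nr_pages = pagevec_lookup_tag(&pvec, mapping, &index, ...);
	if (nr_pages == 0) {
		/* reached EOF: stop here instead of wrapping around */
		done_index = 0;  /* next cycle restarts from the file head */
		break;
	}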
CC: Christoph Lameter <cl@linux-foundation.org>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
CC: Jens Axboe <jens.axboe@oracle.com>
CC: Nick Piggin <npiggin@suse.de>
CC: Jan Kara <jack@suse.cz>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
mm/page-writeback.c | 27 +++++----------------------
1 file changed, 5 insertions(+), 22 deletions(-)
--- linux.orig/mm/page-writeback.c 2009-10-06 23:38:18.000000000 +0800
+++ linux/mm/page-writeback.c 2009-10-06 23:38:30.000000000 +0800
@@ -789,31 +789,23 @@ int write_cache_pages(struct address_spa
int done = 0;
struct pagevec pvec;
int nr_pages;
- pgoff_t uninitialized_var(writeback_index);
pgoff_t index;
pgoff_t end; /* Inclusive */
pgoff_t done_index;
- int cycled;
int range_whole = 0;
long nr_to_write = wbc->nr_to_write;
pagevec_init(&pvec, 0);
if (wbc->range_cyclic) {
- writeback_index = mapping->writeback_index; /* prev offset */
- index = writeback_index;
- if (index == 0)
- cycled = 1;
- else
- cycled = 0;
+ index = mapping->writeback_index; /* prev offset */
end = -1;
} else {
index = wbc->range_start >> PAGE_CACHE_SHIFT;
end = wbc->range_end >> PAGE_CACHE_SHIFT;
if (wbc->range_start == 0 && wbc->range_end == LLONG_MAX)
range_whole = 1;
- cycled = 1; /* ignore range_cyclic tests */
}
-retry:
+
done_index = index;
while (!done && (index <= end)) {
int i;
@@ -821,8 +813,10 @@ retry:
nr_pages = pagevec_lookup_tag(&pvec, mapping, &index,
PAGECACHE_TAG_DIRTY,
min(end - index, (pgoff_t)PAGEVEC_SIZE-1) + 1);
- if (nr_pages == 0)
+ if (nr_pages == 0) {
+ done_index = 0;
break;
+ }
for (i = 0; i < nr_pages; i++) {
struct page *page = pvec.pages[i];
@@ -919,17 +913,6 @@ continue_unlock:
pagevec_release(&pvec);
cond_resched();
}
- if (!cycled && !done) {
- /*
- * range_cyclic:
- * We hit the last page and there is more work to be done: wrap
- * back to the start of the file
- */
- cycled = 1;
- index = 0;
- end = writeback_index - 1;
- goto retry;
- }
if (!wbc->no_nrwrite_index_update) {
if (wbc->range_cyclic || (range_whole && nr_to_write > 0))
mapping->writeback_index = done_index;
* [PATCH 09/45] writeback: quit on wrap for .range_cyclic (pohmelfs)
2009-10-07 7:38 [PATCH 00/45] some writeback experiments Wu Fengguang
` (7 preceding siblings ...)
2009-10-07 7:38 ` [PATCH 08/45] writeback: quit on wrap for .range_cyclic (write_cache_pages) Wu Fengguang
@ 2009-10-07 7:38 ` Wu Fengguang
2009-10-07 12:32 ` Evgeniy Polyakov
2009-10-07 7:38 ` [PATCH 10/45] writeback: quit on wrap for .range_cyclic (btrfs) Wu Fengguang
` (38 subsequent siblings)
47 siblings, 1 reply; 116+ messages in thread
From: Wu Fengguang @ 2009-10-07 7:38 UTC (permalink / raw)
To: Andrew Morton
Cc: Theodore Tso, Christoph Hellwig, Dave Chinner, Chris Mason,
Peter Zijlstra, Li Shaohua, Myklebust Trond,
jens.axboe@oracle.com, Jan Kara, Nick Piggin, linux-fsdevel,
Evgeniy Polyakov, Wu Fengguang, LKML
[-- Attachment #1: linux_drivers_staging_pohmelfs_inode.c --]
[-- Type: text/plain, Size: 2571 bytes --]
Convert wbc.range_cyclic to new behavior: when past EOF, abort writeback
of the inode, which instructs writeback_single_inode() to delay it for
a while if necessary.
It removes one inefficient .range_cyclic IO pattern when writeback_index
wraps:
submit [10000-10100], (wrap), submit [0-100]
in which the submitted pages may consist of two distant ranges.
It also prevents submitting pointless IO for busy overwriters.
CC: Evgeniy Polyakov <zbr@ioremap.net>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
drivers/staging/pohmelfs/inode.c | 25 ++++++++-----------------
1 file changed, 8 insertions(+), 17 deletions(-)
--- linux.orig/drivers/staging/pohmelfs/inode.c 2009-10-06 23:37:49.000000000 +0800
+++ linux/drivers/staging/pohmelfs/inode.c 2009-10-06 23:38:31.000000000 +0800
@@ -149,7 +149,6 @@ static int pohmelfs_writepages(struct ad
int nr_pages;
pgoff_t index;
pgoff_t end; /* Inclusive */
- int scanned = 0;
int range_whole = 0;
if (wbc->range_cyclic) {
@@ -160,17 +159,18 @@ static int pohmelfs_writepages(struct ad
end = wbc->range_end >> PAGE_CACHE_SHIFT;
if (wbc->range_start == 0 && wbc->range_end == LLONG_MAX)
range_whole = 1;
- scanned = 1;
}
-retry:
+
while (!done && (index <= end)) {
unsigned int i = min(end - index, (pgoff_t)psb->trans_max_pages);
int path_len;
struct netfs_trans *trans;
err = pohmelfs_inode_has_dirty_pages(mapping, index);
- if (!err)
+ if (!err) {
+ index = 0;
break;
+ }
err = pohmelfs_path_length(pi);
if (err < 0)
@@ -197,15 +197,16 @@ retry:
dprintk("%s: t: %p, nr_pages: %u, end: %lu, index: %lu, max: %u.\n",
__func__, trans, nr_pages, end, index, trans->page_num);
- if (!nr_pages)
+ if (!nr_pages) {
+ index = 0;
goto err_out_reset;
+ }
err = pohmelfs_write_inode_create(inode, trans);
if (err)
goto err_out_reset;
err = 0;
- scanned = 1;
for (i = 0; i < trans->page_num; i++) {
struct page *page = trans->pages[i];
@@ -215,7 +216,7 @@ retry:
if (unlikely(page->mapping != mapping))
goto out_continue;
- if (!wbc->range_cyclic && page->index > end) {
+ if (page->index > end) {
done = 1;
goto out_continue;
}
@@ -263,16 +264,6 @@ err_out_reset:
break;
}
- if (!scanned && !done) {
- /*
- * We hit the last page and there is more work to be done: wrap
- * back to the start of the file
- */
- scanned = 1;
- index = 0;
- goto retry;
- }
-
if (wbc->range_cyclic || (range_whole && wbc->nr_to_write > 0))
mapping->writeback_index = index;
* [PATCH 10/45] writeback: quit on wrap for .range_cyclic (btrfs)
2009-10-07 7:38 [PATCH 00/45] some writeback experiments Wu Fengguang
` (8 preceding siblings ...)
2009-10-07 7:38 ` [PATCH 09/45] writeback: quit on wrap for .range_cyclic (pohmelfs) Wu Fengguang
@ 2009-10-07 7:38 ` Wu Fengguang
2009-10-07 7:38 ` [PATCH 11/45] writeback: quit on wrap for .range_cyclic (cifs) Wu Fengguang
` (37 subsequent siblings)
47 siblings, 0 replies; 116+ messages in thread
From: Wu Fengguang @ 2009-10-07 7:38 UTC (permalink / raw)
To: Andrew Morton
Cc: Theodore Tso, Christoph Hellwig, Dave Chinner, Chris Mason,
Peter Zijlstra, Li Shaohua, Myklebust Trond,
jens.axboe@oracle.com, Jan Kara, Nick Piggin, linux-fsdevel,
Wu Fengguang, LKML
[-- Attachment #1: linux_fs_btrfs_extent_io.c --]
[-- Type: text/plain, Size: 2125 bytes --]
Convert wbc.range_cyclic to new behavior: when past EOF, abort writeback
of the inode, which instructs writeback_single_inode() to delay it for a
while if necessary.
It removes one inefficient .range_cyclic IO pattern when writeback_index
wraps:
submit [10000-10100], (wrap), submit [0-100]
in which the submitted pages may consist of two distant ranges.
It also prevents submitting pointless IO for busy overwriters.
CC: Chris Mason <chris.mason@oracle.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
fs/btrfs/extent_io.c | 21 ++++++---------------
1 file changed, 6 insertions(+), 15 deletions(-)
--- linux.orig/fs/btrfs/extent_io.c 2009-10-06 23:37:49.000000000 +0800
+++ linux/fs/btrfs/extent_io.c 2009-10-06 23:38:32.000000000 +0800
@@ -2402,10 +2402,9 @@ static int extent_write_cache_pages(stru
int done = 0;
int nr_to_write_done = 0;
struct pagevec pvec;
- int nr_pages;
+ int nr_pages = 1;
pgoff_t index;
pgoff_t end; /* Inclusive */
- int scanned = 0;
int range_whole = 0;
pagevec_init(&pvec, 0);
@@ -2417,16 +2416,14 @@ static int extent_write_cache_pages(stru
end = wbc->range_end >> PAGE_CACHE_SHIFT;
if (wbc->range_start == 0 && wbc->range_end == LLONG_MAX)
range_whole = 1;
- scanned = 1;
}
-retry:
+
while (!done && !nr_to_write_done && (index <= end) &&
(nr_pages = pagevec_lookup_tag(&pvec, mapping, &index,
PAGECACHE_TAG_DIRTY, min(end - index,
(pgoff_t)PAGEVEC_SIZE-1) + 1))) {
unsigned i;
- scanned = 1;
for (i = 0; i < nr_pages; i++) {
struct page *page = pvec.pages[i];
@@ -2447,7 +2444,7 @@ retry:
continue;
}
- if (!wbc->range_cyclic && page->index > end) {
+ if (page->index > end) {
done = 1;
unlock_page(page);
continue;
@@ -2484,15 +2481,9 @@ retry:
pagevec_release(&pvec);
cond_resched();
}
- if (!scanned && !done) {
- /*
- * We hit the last page and there is more work to be done: wrap
- * back to the start of the file
- */
- scanned = 1;
- index = 0;
- goto retry;
- }
+ if (!nr_pages)
+ mapping->writeback_index = 0;
+
return ret;
}
* [PATCH 11/45] writeback: quit on wrap for .range_cyclic (cifs)
2009-10-07 7:38 [PATCH 00/45] some writeback experiments Wu Fengguang
` (9 preceding siblings ...)
2009-10-07 7:38 ` [PATCH 10/45] writeback: quit on wrap for .range_cyclic (btrfs) Wu Fengguang
@ 2009-10-07 7:38 ` Wu Fengguang
2009-10-07 7:38 ` [PATCH 12/45] writeback: quit on wrap for .range_cyclic (ext4) Wu Fengguang
` (36 subsequent siblings)
47 siblings, 0 replies; 116+ messages in thread
From: Wu Fengguang @ 2009-10-07 7:38 UTC (permalink / raw)
To: Andrew Morton
Cc: Theodore Tso, Christoph Hellwig, Dave Chinner, Chris Mason,
Peter Zijlstra, Li Shaohua, Myklebust Trond,
jens.axboe@oracle.com, Jan Kara, Nick Piggin, linux-fsdevel,
Steve French, Wu Fengguang, LKML
[-- Attachment #1: linux_fs_cifs_file.c --]
[-- Type: text/plain, Size: 2354 bytes --]
Convert wbc.range_cyclic to new behavior: when past EOF, abort writeback
of the inode, which instructs writeback_single_inode() to delay it for
a while if necessary.
It removes one inefficient .range_cyclic IO pattern when writeback_index
wraps:
submit [10000-10100], (wrap), submit [0-100]
in which the submitted pages may consist of two distant ranges.
It also prevents submitting pointless IO for busy overwriters.
CC: Steve French <sfrench@samba.org>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
fs/cifs/file.c | 18 ++++--------------
1 file changed, 4 insertions(+), 14 deletions(-)
--- linux.orig/fs/cifs/file.c 2009-10-06 23:37:49.000000000 +0800
+++ linux/fs/cifs/file.c 2009-10-06 23:38:33.000000000 +0800
@@ -1337,7 +1337,6 @@ static int cifs_partialpagewrite(struct
static int cifs_writepages(struct address_space *mapping,
struct writeback_control *wbc)
{
- struct backing_dev_info *bdi = mapping->backing_dev_info;
unsigned int bytes_to_write;
unsigned int bytes_written;
struct cifs_sb_info *cifs_sb;
@@ -1349,14 +1348,13 @@ static int cifs_writepages(struct addres
int len;
int n_iov = 0;
pgoff_t next;
- int nr_pages;
+ int nr_pages = 1;
__u64 offset = 0;
struct cifsFileInfo *open_file;
struct cifsInodeInfo *cifsi = CIFS_I(mapping->host);
struct page *page;
struct pagevec pvec;
int rc = 0;
- int scanned = 0;
int xid, long_op;
cifs_sb = CIFS_SB(mapping->host->i_sb);
@@ -1390,9 +1388,8 @@ static int cifs_writepages(struct addres
end = wbc->range_end >> PAGE_CACHE_SHIFT;
if (wbc->range_start == 0 && wbc->range_end == LLONG_MAX)
range_whole = 1;
- scanned = 1;
}
-retry:
+
while (!done && (index <= end) &&
(nr_pages = pagevec_lookup_tag(&pvec, mapping, &index,
PAGECACHE_TAG_DIRTY,
@@ -1425,7 +1422,7 @@ retry:
break;
}
- if (!wbc->range_cyclic && page->index > end) {
+ if (page->index > end) {
done = 1;
unlock_page(page);
break;
@@ -1537,15 +1534,8 @@ retry:
pagevec_release(&pvec);
}
- if (!scanned && !done) {
- /*
- * We hit the last page and there is more work to be done: wrap
- * back to the start of the file
- */
- scanned = 1;
+ if (!nr_pages)
index = 0;
- goto retry;
- }
if (wbc->range_cyclic || (range_whole && wbc->nr_to_write > 0))
mapping->writeback_index = index;
* [PATCH 12/45] writeback: quit on wrap for .range_cyclic (ext4)
2009-10-07 7:38 [PATCH 00/45] some writeback experiments Wu Fengguang
` (10 preceding siblings ...)
2009-10-07 7:38 ` [PATCH 11/45] writeback: quit on wrap for .range_cyclic (cifs) Wu Fengguang
@ 2009-10-07 7:38 ` Wu Fengguang
2009-10-07 7:38 ` [PATCH 13/45] writeback: quit on wrap for .range_cyclic (gfs2) Wu Fengguang
` (35 subsequent siblings)
47 siblings, 0 replies; 116+ messages in thread
From: Wu Fengguang @ 2009-10-07 7:38 UTC (permalink / raw)
To: Andrew Morton
Cc: Theodore Tso, Christoph Hellwig, Dave Chinner, Chris Mason,
Peter Zijlstra, Li Shaohua, Myklebust Trond,
jens.axboe@oracle.com, Jan Kara, Nick Piggin, linux-fsdevel,
Wu Fengguang, LKML
[-- Attachment #1: linux_fs_ext4_inode.c --]
[-- Type: text/plain, Size: 2179 bytes --]
Convert wbc.range_cyclic to new behavior: when past EOF, abort writeback
of the inode, which instructs writeback_single_inode() to delay it for a
while if necessary.
It removes one inefficient .range_cyclic IO pattern when writeback_index
wraps:
submit [10000-10100], (wrap), submit [0-100]
in which the submitted pages may consist of two distant ranges.
It also prevents submitting pointless IO for busy overwriters.
CC: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
fs/ext4/inode.c | 18 ++++--------------
1 file changed, 4 insertions(+), 14 deletions(-)
--- linux.orig/fs/ext4/inode.c 2009-10-06 23:37:48.000000000 +0800
+++ linux/fs/ext4/inode.c 2009-10-06 23:38:35.000000000 +0800
@@ -2805,7 +2805,7 @@ static int ext4_da_writepages(struct add
int pages_written = 0;
long pages_skipped;
unsigned int max_pages;
- int range_cyclic, cycled = 1, io_done = 0;
+ int range_cyclic, io_done = 0;
int needed_blocks, ret = 0;
long desired_nr_to_write, nr_to_writebump = 0;
loff_t range_start = wbc->range_start;
@@ -2840,8 +2840,6 @@ static int ext4_da_writepages(struct add
range_cyclic = wbc->range_cyclic;
if (wbc->range_cyclic) {
index = mapping->writeback_index;
- if (index)
- cycled = 0;
wbc->range_start = index << PAGE_CACHE_SHIFT;
wbc->range_end = LLONG_MAX;
wbc->range_cyclic = 0;
@@ -2889,7 +2887,6 @@ static int ext4_da_writepages(struct add
wbc->no_nrwrite_index_update = 1;
pages_skipped = wbc->pages_skipped;
-retry:
while (!ret && wbc->nr_to_write > 0) {
/*
@@ -2963,20 +2960,13 @@ retry:
wbc->pages_skipped = pages_skipped;
ret = 0;
io_done = 1;
- } else if (wbc->nr_to_write)
+ } else if (wbc->nr_to_write > 0) {
/*
* There is no more writeout needed
- * or we requested for a noblocking writeout
- * and we found the device congested
*/
+ index = 0;
break;
- }
- if (!io_done && !cycled) {
- cycled = 1;
- index = 0;
- wbc->range_start = index << PAGE_CACHE_SHIFT;
- wbc->range_end = mapping->writeback_index - 1;
- goto retry;
+ }
}
if (pages_skipped != wbc->pages_skipped)
ext4_msg(inode->i_sb, KERN_CRIT,
* [PATCH 13/45] writeback: quit on wrap for .range_cyclic (gfs2)
2009-10-07 7:38 [PATCH 00/45] some writeback experiments Wu Fengguang
` (11 preceding siblings ...)
2009-10-07 7:38 ` [PATCH 12/45] writeback: quit on wrap for .range_cyclic (ext4) Wu Fengguang
@ 2009-10-07 7:38 ` Wu Fengguang
2009-10-07 7:38 ` [PATCH 14/45] writeback: quit on wrap for .range_cyclic (afs) Wu Fengguang
` (34 subsequent siblings)
47 siblings, 0 replies; 116+ messages in thread
From: Wu Fengguang @ 2009-10-07 7:38 UTC (permalink / raw)
To: Andrew Morton
Cc: Theodore Tso, Christoph Hellwig, Dave Chinner, Chris Mason,
Peter Zijlstra, Li Shaohua, Myklebust Trond,
jens.axboe@oracle.com, Jan Kara, Nick Piggin, linux-fsdevel,
Steven Whitehouse, Wu Fengguang, LKML
[-- Attachment #1: linux_fs_gfs2_aops.c --]
[-- Type: text/plain, Size: 2042 bytes --]
Convert wbc.range_cyclic to new behavior: when past EOF, abort writeback
of the inode, which instructs writeback_single_inode() to delay it for a
while if necessary.
It removes one inefficient .range_cyclic IO pattern when writeback_index
wraps:
submit [10000-10100], (wrap), submit [0-100]
in which the submitted pages may consist of two distant ranges.
It also prevents submitting pointless IO for busy overwriters.
CC: Steven Whitehouse <swhiteho@redhat.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
fs/gfs2/aops.c | 16 ++--------------
1 file changed, 2 insertions(+), 14 deletions(-)
--- linux.orig/fs/gfs2/aops.c 2009-10-06 23:37:48.000000000 +0800
+++ linux/fs/gfs2/aops.c 2009-10-06 23:38:41.000000000 +0800
@@ -287,7 +287,7 @@ static int gfs2_write_jdata_pagevec(stru
continue;
}
- if (!wbc->range_cyclic && page->index > end) {
+ if (page->index > end) {
ret = 1;
unlock_page(page);
continue;
@@ -340,7 +340,6 @@ static int gfs2_write_cache_jdata(struct
int nr_pages;
pgoff_t index;
pgoff_t end;
- int scanned = 0;
int range_whole = 0;
pagevec_init(&pvec, 0);
@@ -352,15 +351,12 @@ static int gfs2_write_cache_jdata(struct
end = wbc->range_end >> PAGE_CACHE_SHIFT;
if (wbc->range_start == 0 && wbc->range_end == LLONG_MAX)
range_whole = 1;
- scanned = 1;
}
-retry:
while (!done && (index <= end) &&
(nr_pages = pagevec_lookup_tag(&pvec, mapping, &index,
PAGECACHE_TAG_DIRTY,
min(end - index, (pgoff_t)PAGEVEC_SIZE-1) + 1))) {
- scanned = 1;
ret = gfs2_write_jdata_pagevec(mapping, wbc, &pvec, nr_pages, end);
if (ret)
done = 1;
@@ -371,16 +367,8 @@ retry:
cond_resched();
}
- if (!scanned && !done) {
- /*
- * We hit the last page and there is more work to be done: wrap
- * back to the start of the file
- */
- scanned = 1;
+ if (!nr_pages)
index = 0;
- goto retry;
- }
-
if (wbc->range_cyclic || (range_whole && wbc->nr_to_write > 0))
mapping->writeback_index = index;
return ret;
* [PATCH 14/45] writeback: quit on wrap for .range_cyclic (afs)
2009-10-07 7:38 [PATCH 00/45] some writeback experiments Wu Fengguang
` (12 preceding siblings ...)
2009-10-07 7:38 ` [PATCH 13/45] writeback: quit on wrap for .range_cyclic (gfs2) Wu Fengguang
@ 2009-10-07 7:38 ` Wu Fengguang
2009-10-07 7:38 ` [PATCH 15/45] writeback: fix queue_io() ordering Wu Fengguang
` (33 subsequent siblings)
47 siblings, 0 replies; 116+ messages in thread
From: Wu Fengguang @ 2009-10-07 7:38 UTC (permalink / raw)
To: Andrew Morton
Cc: Theodore Tso, Christoph Hellwig, Dave Chinner, Chris Mason,
Peter Zijlstra, Li Shaohua, Myklebust Trond,
jens.axboe@oracle.com, Jan Kara, Nick Piggin, linux-fsdevel,
David Howells, Wu Fengguang, LKML
[-- Attachment #1: linux_fs_afs_write.c --]
[-- Type: text/plain, Size: 1447 bytes --]
Convert wbc.range_cyclic to new behavior: when past EOF, abort writeback
of the inode, which instructs writeback_single_inode() to delay it for
a while if necessary.
It removes one inefficient .range_cyclic IO pattern when writeback_index
wraps:
submit [10000-10100], (wrap), submit [0-100]
in which the submitted pages may consist of two distant ranges.
It also prevents submitting pointless IO for busy overwriters.
CC: David Howells <dhowells@redhat.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
fs/afs/write.c | 7 +++----
1 file changed, 3 insertions(+), 4 deletions(-)
--- linux.orig/fs/afs/write.c 2009-10-06 23:37:48.000000000 +0800
+++ linux/fs/afs/write.c 2009-10-06 23:38:42.000000000 +0800
@@ -477,8 +477,10 @@ static int afs_writepages_region(struct
do {
n = find_get_pages_tag(mapping, &index, PAGECACHE_TAG_DIRTY,
1, &page);
- if (!n)
+ if (!n) {
+ index = 0;
break;
+ }
_debug("wback %lx", page->index);
@@ -551,9 +553,6 @@ int afs_writepages(struct address_space
start = mapping->writeback_index;
end = -1;
ret = afs_writepages_region(mapping, wbc, start, end, &next);
- if (start > 0 && wbc->nr_to_write > 0 && ret == 0)
- ret = afs_writepages_region(mapping, wbc, 0, start,
- &next);
mapping->writeback_index = next;
} else if (wbc->range_start == 0 && wbc->range_end == LLONG_MAX) {
end = (pgoff_t)(LLONG_MAX >> PAGE_CACHE_SHIFT);
* [PATCH 15/45] writeback: fix queue_io() ordering
2009-10-07 7:38 [PATCH 00/45] some writeback experiments Wu Fengguang
` (13 preceding siblings ...)
2009-10-07 7:38 ` [PATCH 14/45] writeback: quit on wrap for .range_cyclic (afs) Wu Fengguang
@ 2009-10-07 7:38 ` Wu Fengguang
2009-10-07 7:38 ` [PATCH 16/45] writeback: merge for_kupdate and !for_kupdate cases Wu Fengguang
` (32 subsequent siblings)
47 siblings, 0 replies; 116+ messages in thread
From: Wu Fengguang @ 2009-10-07 7:38 UTC (permalink / raw)
To: Andrew Morton
Cc: Theodore Tso, Christoph Hellwig, Dave Chinner, Chris Mason,
Peter Zijlstra, Li Shaohua, Myklebust Trond,
jens.axboe@oracle.com, Jan Kara, Nick Piggin, linux-fsdevel,
Martin Bligh, Michael Rubin, Peter Zijlstra, Wu Fengguang, LKML
[-- Attachment #1: queue_io-fix.patch --]
[-- Type: text/plain, Size: 1164 bytes --]
This is not a bug in itself, since b_io is empty for kupdate writeback.
But the next patch will do requeue_io() for non-kupdate writeback as
well, so fix the queueing order now.
CC: Dave Chinner <david@fromorbit.com>
Cc: Martin Bligh <mbligh@google.com>
Cc: Michael Rubin <mrubin@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
---
fs/fs-writeback.c | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)
--- linux.orig/fs/fs-writeback.c 2009-10-06 23:38:28.000000000 +0800
+++ linux/fs/fs-writeback.c 2009-10-06 23:38:42.000000000 +0800
@@ -375,11 +375,14 @@ static void move_expired_inodes(struct l
}
/*
- * Queue all expired dirty inodes for io, eldest first.
+ * Queue all expired dirty inodes for io, eldest first:
+ * (newly dirtied) => b_dirty inodes
+ * => b_more_io inodes
+ * => remaining inodes in b_io => (dequeue for sync)
*/
static void queue_io(struct bdi_writeback *wb, unsigned long *older_than_this)
{
- list_splice_init(&wb->b_more_io, wb->b_io.prev);
+ list_splice_init(&wb->b_more_io, &wb->b_io);
move_expired_inodes(&wb->b_dirty, &wb->b_io, older_than_this);
}
* [PATCH 16/45] writeback: merge for_kupdate and !for_kupdate cases
2009-10-07 7:38 [PATCH 00/45] some writeback experiments Wu Fengguang
` (14 preceding siblings ...)
2009-10-07 7:38 ` [PATCH 15/45] writeback: fix queue_io() ordering Wu Fengguang
@ 2009-10-07 7:38 ` Wu Fengguang
2009-10-07 7:38 ` [PATCH 17/45] writeback: only allow two background writeback works Wu Fengguang
` (31 subsequent siblings)
47 siblings, 0 replies; 116+ messages in thread
From: Wu Fengguang @ 2009-10-07 7:38 UTC (permalink / raw)
To: Andrew Morton
Cc: Theodore Tso, Christoph Hellwig, Dave Chinner, Chris Mason,
Peter Zijlstra, Li Shaohua, Myklebust Trond,
jens.axboe@oracle.com, Jan Kara, Nick Piggin, linux-fsdevel,
Martin Bligh, Michael Rubin, Peter Zijlstra, Wu Fengguang, LKML
[-- Attachment #1: writeback-merge-kupdate-cases.patch --]
[-- Type: text/plain, Size: 2777 bytes --]
Unify the logic for kupdate and non-kupdate cases.
There won't be starvation because the inodes requeued into b_more_io
will later be spliced _after_ the remaining inodes in b_io, hence won't
stand in the way of other inodes in the next run.
It avoids unnecessary redirty_tail() calls and hence updates of
i_dirtied_when. The timestamp update is undesirable because it could
later delay the inode's periodic writeback, or exclude the inode from
a data integrity sync operation (which checks the timestamp to avoid
extra work and livelocks).
CC: Dave Chinner <david@fromorbit.com>
Cc: Martin Bligh <mbligh@google.com>
Cc: Michael Rubin <mrubin@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
---
fs/fs-writeback.c | 39 ++++++---------------------------------
1 file changed, 6 insertions(+), 33 deletions(-)
--- linux.orig/fs/fs-writeback.c 2009-10-06 23:38:42.000000000 +0800
+++ linux/fs/fs-writeback.c 2009-10-06 23:38:42.000000000 +0800
@@ -485,45 +485,18 @@ writeback_single_inode(struct inode *ino
if (mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
/*
* We didn't write back all the pages. nfs_writepages()
- * sometimes bales out without doing anything. Redirty
- * the inode; Move it from b_io onto b_more_io/b_dirty.
+ * sometimes bales out without doing anything.
*/
- /*
- * akpm: if the caller was the kupdate function we put
- * this inode at the head of b_dirty so it gets first
- * consideration. Otherwise, move it to the tail, for
- * the reasons described there. I'm not really sure
- * how much sense this makes. Presumably I had a good
- * reasons for doing it this way, and I'd rather not
- * muck with it at present.
- */
- if (wbc->for_kupdate) {
+ inode->i_state |= I_DIRTY_PAGES;
+ if (wbc->nr_to_write <= 0) {
/*
- * For the kupdate function we move the inode
- * to b_more_io so it will get more writeout as
- * soon as the queue becomes uncongested.
+ * slice used up: queue for next turn
*/
- inode->i_state |= I_DIRTY_PAGES;
- if (wbc->nr_to_write <= 0) {
- /*
- * slice used up: queue for next turn
- */
- requeue_io(inode);
- } else {
- /*
- * somehow blocked: retry later
- */
- redirty_tail(inode);
- }
+ requeue_io(inode);
} else {
/*
- * Otherwise fully redirty the inode so that
- * other inodes on this superblock will get some
- * writeout. Otherwise heavy writing to one
- * file would indefinitely suspend writeout of
- * all the other files.
+ * somehow blocked: retry later
*/
- inode->i_state |= I_DIRTY_PAGES;
redirty_tail(inode);
}
} else if (inode->i_state & I_DIRTY) {
^ permalink raw reply [flat|nested] 116+ messages in thread
* [PATCH 17/45] writeback: only allow two background writeback works
2009-10-07 7:38 [PATCH 00/45] some writeback experiments Wu Fengguang
` (15 preceding siblings ...)
2009-10-07 7:38 ` [PATCH 16/45] writeback: merge for_kupdate and !for_kupdate cases Wu Fengguang
@ 2009-10-07 7:38 ` Wu Fengguang
2009-10-07 7:38 ` [PATCH 18/45] writeback: introduce wait queue for balance_dirty_pages() Wu Fengguang
` (30 subsequent siblings)
47 siblings, 0 replies; 116+ messages in thread
From: Wu Fengguang @ 2009-10-07 7:38 UTC (permalink / raw)
To: Andrew Morton
Cc: Theodore Tso, Christoph Hellwig, Dave Chinner, Chris Mason,
Peter Zijlstra, Li Shaohua, Myklebust Trond,
jens.axboe@oracle.com, Jan Kara, Nick Piggin, linux-fsdevel,
Wu Fengguang, LKML
[-- Attachment #1: writeback-limit-background-work.patch --]
[-- Type: text/plain, Size: 2821 bytes --]
balance_dirty_pages() needs a reliable way to ensure that some
background work is scheduled for running. We cannot simply call
bdi_start_writeback() every time, because that would queue up a huge
number of works (which takes memory and time).
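As a minimal userspace sketch of the gating idea (not the kernel code;
the struct, the C11 atomics and main() are illustration-only stand-ins
for the patch's test_and_set_bit()/clear_bit() on bdi->wb_mask):

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

/* mirrors WB_FLAG_BACKGROUND_WORK (bit 30) from the patch, as a mask here */
#define WB_FLAG_BACKGROUND_WORK	(1UL << 30)

struct bdi_sketch {
	atomic_ulong wb_mask;
};

/* true if the caller may queue a new background work */
static bool can_submit_background_writeback(struct bdi_sketch *bdi)
{
	unsigned long old = atomic_fetch_or(&bdi->wb_mask,
					    WB_FLAG_BACKGROUND_WORK);
	return !(old & WB_FLAG_BACKGROUND_WORK);
}

/* the queued background work has been taken off the work list */
static void background_work_dequeued(struct bdi_sketch *bdi)
{
	atomic_fetch_and(&bdi->wb_mask, ~WB_FLAG_BACKGROUND_WORK);
}

int main(void)
{
	struct bdi_sketch bdi = { .wb_mask = 1 };

	printf("%d\n", can_submit_background_writeback(&bdi)); /* 1: submit */
	printf("%d\n", can_submit_background_writeback(&bdi)); /* 0: one queued */
	background_work_dequeued(&bdi);		/* flusher dequeued it */
	printf("%d\n", can_submit_background_writeback(&bdi)); /* 1: next one */
	return 0;
}

In the patch the bit is cleared in wb_clear_pending() when the work
leaves the work list, so at most one background work is running while
one more is queued.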
CC: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
fs/fs-writeback.c | 14 ++------------
include/linux/backing-dev.h | 26 +++++++++++++++++++++++++-
2 files changed, 27 insertions(+), 13 deletions(-)
--- linux.orig/fs/fs-writeback.c 2009-10-06 23:38:42.000000000 +0800
+++ linux/fs/fs-writeback.c 2009-10-06 23:38:43.000000000 +0800
@@ -85,18 +85,6 @@ static inline void bdi_work_init(struct
int sysctl_dirty_debug __read_mostly;
-/**
- * writeback_in_progress - determine whether there is writeback in progress
- * @bdi: the device's backing_dev_info structure.
- *
- * Determine whether there is writeback waiting to be handled against a
- * backing device.
- */
-int writeback_in_progress(struct backing_dev_info *bdi)
-{
- return !list_empty(&bdi->work_list);
-}
-
static void bdi_work_clear(struct bdi_work *work)
{
clear_bit(WS_USED_B, &work->state);
@@ -147,6 +135,8 @@ static void wb_clear_pending(struct bdi_
spin_lock(&bdi->wb_lock);
list_del_rcu(&work->list);
+ if (work->args.for_background)
+ clear_bit(WB_FLAG_BACKGROUND_WORK, &bdi->wb_mask);
spin_unlock(&bdi->wb_lock);
wb_work_complete(work);
--- linux.orig/include/linux/backing-dev.h 2009-10-06 23:37:47.000000000 +0800
+++ linux/include/linux/backing-dev.h 2009-10-06 23:38:43.000000000 +0800
@@ -94,6 +94,11 @@ struct backing_dev_info {
#endif
};
+/*
+ * background work queued, set to avoid redundant background works
+ */
+#define WB_FLAG_BACKGROUND_WORK 30
+
int bdi_init(struct backing_dev_info *bdi);
void bdi_destroy(struct backing_dev_info *bdi);
@@ -248,7 +253,26 @@ int bdi_set_max_ratio(struct backing_dev
extern struct backing_dev_info default_backing_dev_info;
void default_unplug_io_fn(struct backing_dev_info *bdi, struct page *page);
-int writeback_in_progress(struct backing_dev_info *bdi);
+/**
+ * writeback_in_progress - determine whether there is writeback in progress
+ * @bdi: the device's backing_dev_info structure.
+ *
+ * Determine whether there is writeback waiting to be handled against a
+ * backing device.
+ */
+static inline int writeback_in_progress(struct backing_dev_info *bdi)
+{
+ return !list_empty(&bdi->work_list);
+}
+
+/*
+ * Helper to limit # of background writeback works in circulation to 2.
+ * (one running and another queued)
+ */
+static inline int can_submit_background_writeback(struct backing_dev_info *bdi)
+{
+ return !test_and_set_bit(WB_FLAG_BACKGROUND_WORK, &bdi->wb_mask);
+}
static inline int bdi_congested(struct backing_dev_info *bdi, int bdi_bits)
{
^ permalink raw reply [flat|nested] 116+ messages in thread
* [PATCH 18/45] writeback: introduce wait queue for balance_dirty_pages()
2009-10-07 7:38 [PATCH 00/45] some writeback experiments Wu Fengguang
` (16 preceding siblings ...)
2009-10-07 7:38 ` [PATCH 17/45] writeback: only allow two background writeback works Wu Fengguang
@ 2009-10-07 7:38 ` Wu Fengguang
2009-10-08 1:01 ` KAMEZAWA Hiroyuki
2009-10-07 7:38 ` [PATCH 19/45] writeback: remove the loop in balance_dirty_pages() Wu Fengguang
` (29 subsequent siblings)
47 siblings, 1 reply; 116+ messages in thread
From: Wu Fengguang @ 2009-10-07 7:38 UTC (permalink / raw)
To: Andrew Morton
Cc: Theodore Tso, Christoph Hellwig, Dave Chinner, Chris Mason,
Peter Zijlstra, Li Shaohua, Myklebust Trond,
jens.axboe@oracle.com, Jan Kara, Nick Piggin, linux-fsdevel,
Wu Fengguang, LKML
[-- Attachment #1: writeback-balance-wait-queue.patch --]
[-- Type: text/plain, Size: 9191 bytes --]
As proposed by Chris, Dave and Jan, let balance_dirty_pages() wait for
the per-bdi flusher to writeback enough pages for it, instead of
starting foreground writeback by itself. By doing so we harvest two
benefits:
- avoid concurrent writeback of multiple inodes (Dave Chinner)
If every thread doing writes and being throttled starts foreground
writeback, we get N IO submitters working on at least N different
inodes at the same time, and end up with N different sets of IO being
issued with potentially zero locality to each other, resulting in
much lower elevator sort/merge efficiency; the disk then seeks all
over the place to service the different sets of IO.
OTOH, if there is only one submission thread, it doesn't jump between
inodes in the same way when congestion clears - it keeps writing to
the same inode, resulting in large related chunks of sequential IOs
being issued to the disk. This is more efficient than the above
foreground writeback because the elevator works better and the disk
seeks less.
- avoid one constraint against a huge per-file nr_to_write
The write_chunk used by balance_dirty_pages() should be small enough
to prevent user-noticeable one-shot latency, i.e. each sleep/wait
inside balance_dirty_pages() shall be short enough. When it starts
its own writeback, it must therefore specify a small nr_to_write. The
throttle wait queue removes this dependency.
CC: Chris Mason <chris.mason@oracle.com>
CC: Dave Chinner <david@fromorbit.com>
CC: Jan Kara <jack@suse.cz>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
CC: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
fs/fs-writeback.c | 71 ++++++++++++++++++++++++++++++++++
include/linux/backing-dev.h | 15 +++++++
mm/backing-dev.c | 4 +
mm/page-writeback.c | 53 ++++++-------------------
4 files changed, 103 insertions(+), 40 deletions(-)
--- linux.orig/mm/page-writeback.c 2009-10-06 23:38:30.000000000 +0800
+++ linux/mm/page-writeback.c 2009-10-06 23:38:43.000000000 +0800
@@ -218,6 +218,15 @@ static inline void __bdi_writeout_inc(st
{
__prop_inc_percpu_max(&vm_completions, &bdi->completions,
bdi->max_prop_frac);
+
+ /*
+ * The DIRTY_THROTTLE_PAGES_STOP test is an optional optimization, so
+ * it's OK to be racy. We set DIRTY_THROTTLE_PAGES_STOP*2 in other
+ * places to reduce the race possibility.
+ */
+ if (atomic_read(&bdi->throttle_pages) < DIRTY_THROTTLE_PAGES_STOP &&
+ atomic_dec_and_test(&bdi->throttle_pages))
+ bdi_writeback_wakeup(bdi);
}
void bdi_writeout_inc(struct backing_dev_info *bdi)
@@ -458,20 +467,10 @@ static void balance_dirty_pages(struct a
unsigned long background_thresh;
unsigned long dirty_thresh;
unsigned long bdi_thresh;
- unsigned long pages_written = 0;
- unsigned long pause = 1;
int dirty_exceeded;
struct backing_dev_info *bdi = mapping->backing_dev_info;
for (;;) {
- struct writeback_control wbc = {
- .bdi = bdi,
- .sync_mode = WB_SYNC_NONE,
- .older_than_this = NULL,
- .nr_to_write = write_chunk,
- .range_cyclic = 1,
- };
-
nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
global_page_state(NR_UNSTABLE_NFS);
nr_writeback = global_page_state(NR_WRITEBACK) +
@@ -518,39 +517,13 @@ static void balance_dirty_pages(struct a
if (!bdi->dirty_exceeded)
bdi->dirty_exceeded = 1;
- /* Note: nr_reclaimable denotes nr_dirty + nr_unstable.
- * Unstable writes are a feature of certain networked
- * filesystems (i.e. NFS) in which data may have been
- * written to the server's write cache, but has not yet
- * been flushed to permanent storage.
- * Only move pages to writeback if this bdi is over its
- * threshold otherwise wait until the disk writes catch
- * up.
- */
- if (bdi_nr_reclaimable > bdi_thresh) {
- writeback_inodes_wbc(&wbc);
- pages_written += write_chunk - wbc.nr_to_write;
- /* don't wait if we've done enough */
- if (pages_written >= write_chunk)
- break;
- }
- schedule_timeout_interruptible(pause);
-
- /*
- * Increase the delay for each loop, up to our previous
- * default of taking a 100ms nap.
- */
- pause <<= 1;
- if (pause > HZ / 10)
- pause = HZ / 10;
+ bdi_writeback_wait(bdi, write_chunk);
+ break;
}
if (!dirty_exceeded && bdi->dirty_exceeded)
bdi->dirty_exceeded = 0;
- if (writeback_in_progress(bdi))
- return;
-
/*
* In laptop mode, we wait until hitting the higher threshold before
* starting background writeout, and then write out all the way down
@@ -559,8 +532,8 @@ static void balance_dirty_pages(struct a
* In normal mode, we start background writeout at the lower
* background_thresh, to keep the amount of dirty memory low.
*/
- if ((laptop_mode && pages_written) ||
- (!laptop_mode && (nr_reclaimable > background_thresh)))
+ if (!laptop_mode && (nr_reclaimable > background_thresh) &&
+ can_submit_background_writeback(bdi))
bdi_start_writeback(bdi, NULL, 0);
}
--- linux.orig/include/linux/backing-dev.h 2009-10-06 23:38:43.000000000 +0800
+++ linux/include/linux/backing-dev.h 2009-10-06 23:38:43.000000000 +0800
@@ -86,6 +86,13 @@ struct backing_dev_info {
struct list_head work_list;
+ /*
+ * dirtier process throttling
+ */
+ spinlock_t throttle_lock;
+ struct list_head throttle_list; /* nr to sync for each task */
+ atomic_t throttle_pages; /* nr to sync for head task */
+
struct device *dev;
#ifdef CONFIG_DEBUG_FS
@@ -99,6 +106,12 @@ struct backing_dev_info {
*/
#define WB_FLAG_BACKGROUND_WORK 30
+/*
+ * when no task is throttled, set throttle_pages to larger than this,
+ * to avoid unnecessary atomic decreases.
+ */
+#define DIRTY_THROTTLE_PAGES_STOP (1 << 22)
+
int bdi_init(struct backing_dev_info *bdi);
void bdi_destroy(struct backing_dev_info *bdi);
@@ -110,6 +123,8 @@ void bdi_start_writeback(struct backing_
long nr_pages);
int bdi_writeback_task(struct bdi_writeback *wb);
int bdi_has_dirty_io(struct backing_dev_info *bdi);
+int bdi_writeback_wakeup(struct backing_dev_info *bdi);
+void bdi_writeback_wait(struct backing_dev_info *bdi, long nr_pages);
extern spinlock_t bdi_lock;
extern struct list_head bdi_list;
--- linux.orig/fs/fs-writeback.c 2009-10-06 23:38:43.000000000 +0800
+++ linux/fs/fs-writeback.c 2009-10-06 23:38:43.000000000 +0800
@@ -25,6 +25,7 @@
#include <linux/blkdev.h>
#include <linux/backing-dev.h>
#include <linux/buffer_head.h>
+#include <linux/completion.h>
#include "internal.h"
#define inode_to_bdi(inode) ((inode)->i_mapping->backing_dev_info)
@@ -265,6 +266,72 @@ void bdi_start_writeback(struct backing_
bdi_alloc_queue_work(bdi, &args);
}
+struct dirty_throttle_task {
+ long nr_pages;
+ struct list_head list;
+ struct completion complete;
+};
+
+void bdi_writeback_wait(struct backing_dev_info *bdi, long nr_pages)
+{
+ struct dirty_throttle_task tt = {
+ .nr_pages = nr_pages,
+ .complete = COMPLETION_INITIALIZER_ONSTACK(tt.complete),
+ };
+ unsigned long flags;
+
+ /*
+ * register throttle pages
+ */
+ spin_lock_irqsave(&bdi->throttle_lock, flags);
+ if (list_empty(&bdi->throttle_list))
+ atomic_set(&bdi->throttle_pages, nr_pages);
+ list_add(&tt.list, &bdi->throttle_list);
+ spin_unlock_irqrestore(&bdi->throttle_lock, flags);
+
+ /*
+ * make sure we will be woke up by someone
+ */
+ if (can_submit_background_writeback(bdi))
+ bdi_start_writeback(bdi, NULL, 0);
+
+ wait_for_completion(&tt.complete);
+}
+
+/*
+ * return 1 if there are more waiting tasks.
+ */
+int bdi_writeback_wakeup(struct backing_dev_info *bdi)
+{
+ struct dirty_throttle_task *tt;
+ unsigned long flags;
+
+ spin_lock_irqsave(&bdi->throttle_lock, flags);
+ /*
+ * remove and wakeup head task
+ */
+ if (!list_empty(&bdi->throttle_list)) {
+ tt = list_entry(bdi->throttle_list.prev,
+ struct dirty_throttle_task, list);
+ list_del(&tt->list);
+ complete(&tt->complete);
+ }
+ /*
+ * update throttle pages
+ */
+ if (!list_empty(&bdi->throttle_list)) {
+ tt = list_entry(bdi->throttle_list.prev,
+ struct dirty_throttle_task, list);
+ atomic_set(&bdi->throttle_pages, tt->nr_pages);
+ } else {
+ tt = NULL;
+ atomic_set(&bdi->throttle_pages, DIRTY_THROTTLE_PAGES_STOP * 2);
+ }
+ spin_unlock_irqrestore(&bdi->throttle_lock, flags);
+
+ return tt != NULL;
+}
+
/*
* Redirty an inode: set its when-it-was dirtied timestamp and move it to the
* furthest end of its superblock's dirty-inode list.
@@ -760,6 +827,10 @@ static long wb_writeback(struct bdi_writ
spin_unlock(&inode_lock);
}
+ if (args->for_background)
+ while (bdi_writeback_wakeup(wb->bdi))
+ ; /* unthrottle all tasks */
+
return wrote;
}
--- linux.orig/mm/backing-dev.c 2009-10-06 23:37:47.000000000 +0800
+++ linux/mm/backing-dev.c 2009-10-06 23:38:43.000000000 +0800
@@ -646,6 +646,10 @@ int bdi_init(struct backing_dev_info *bd
bdi->wb_mask = 1;
bdi->wb_cnt = 1;
+ spin_lock_init(&bdi->throttle_lock);
+ INIT_LIST_HEAD(&bdi->throttle_list);
+ atomic_set(&bdi->throttle_pages, DIRTY_THROTTLE_PAGES_STOP * 2);
+
for (i = 0; i < NR_BDI_STAT_ITEMS; i++) {
err = percpu_counter_init(&bdi->bdi_stat[i], 0);
if (err)
^ permalink raw reply [flat|nested] 116+ messages in thread
* [PATCH 19/45] writeback: remove the loop in balance_dirty_pages()
2009-10-07 7:38 [PATCH 00/45] some writeback experiments Wu Fengguang
` (17 preceding siblings ...)
2009-10-07 7:38 ` [PATCH 18/45] writeback: introduce wait queue for balance_dirty_pages() Wu Fengguang
@ 2009-10-07 7:38 ` Wu Fengguang
2009-10-07 7:38 ` [PATCH 20/45] NFS: introduce writeback wait queue Wu Fengguang
` (28 subsequent siblings)
47 siblings, 0 replies; 116+ messages in thread
From: Wu Fengguang @ 2009-10-07 7:38 UTC (permalink / raw)
To: Andrew Morton
Cc: Theodore Tso, Christoph Hellwig, Dave Chinner, Chris Mason,
Peter Zijlstra, Li Shaohua, Myklebust Trond,
jens.axboe@oracle.com, Jan Kara, Nick Piggin, linux-fsdevel,
Wu Fengguang, LKML
[-- Attachment #1: writeback-balance-cleanup.patch --]
[-- Type: text/plain, Size: 3677 bytes --]
The loop is no longer necessary. Remove it without behavior change.
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
mm/page-writeback.c | 81 +++++++++++++++++++-----------------------
1 file changed, 38 insertions(+), 43 deletions(-)
--- linux.orig/mm/page-writeback.c 2009-10-06 23:38:43.000000000 +0800
+++ linux/mm/page-writeback.c 2009-10-06 23:38:43.000000000 +0800
@@ -470,60 +470,55 @@ static void balance_dirty_pages(struct a
int dirty_exceeded;
struct backing_dev_info *bdi = mapping->backing_dev_info;
- for (;;) {
- nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
- global_page_state(NR_UNSTABLE_NFS);
- nr_writeback = global_page_state(NR_WRITEBACK) +
- global_page_state(NR_WRITEBACK_TEMP);
+ nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
+ global_page_state(NR_UNSTABLE_NFS);
+ nr_writeback = global_page_state(NR_WRITEBACK) +
+ global_page_state(NR_WRITEBACK_TEMP);
- global_dirty_thresh(&background_thresh, &dirty_thresh);
-
- /*
- * Throttle it only when the background writeback cannot
- * catch-up. This avoids (excessively) small writeouts
- * when the bdi limits are ramping up.
- */
- if (nr_reclaimable + nr_writeback <
- (background_thresh + dirty_thresh) / 2)
- break;
+ global_dirty_thresh(&background_thresh, &dirty_thresh);
- bdi_thresh = bdi_dirty_thresh(bdi, dirty_thresh);
+ /*
+ * Throttle it only when the background writeback cannot
+ * catch-up. This skips the ramp up phase of bdi limits.
+ */
+ if (nr_reclaimable + nr_writeback <
+ (background_thresh + dirty_thresh) / 2)
+ goto out;
- /*
- * In order to avoid the stacked BDI deadlock we need
- * to ensure we accurately count the 'dirty' pages when
- * the threshold is low.
- *
- * Otherwise it would be possible to get thresh+n pages
- * reported dirty, even though there are thresh-m pages
- * actually dirty; with m+n sitting in the percpu
- * deltas.
- */
- if (bdi_thresh < 2*bdi_stat_error(bdi)) {
- bdi_nr_reclaimable = bdi_stat_sum(bdi, BDI_RECLAIMABLE);
- bdi_nr_writeback = bdi_stat_sum(bdi, BDI_WRITEBACK);
- } else {
- bdi_nr_reclaimable = bdi_stat(bdi, BDI_RECLAIMABLE);
- bdi_nr_writeback = bdi_stat(bdi, BDI_WRITEBACK);
- }
+ bdi_thresh = bdi_dirty_thresh(bdi, dirty_thresh);
+ /*
+ * In order to avoid the stacked BDI deadlock we need
+ * to ensure we accurately count the 'dirty' pages when
+ * the threshold is low.
+ *
+ * Otherwise it would be possible to get thresh+n pages
+ * reported dirty, even though there are thresh-m pages
+ * actually dirty; with m+n sitting in the percpu
+ * deltas.
+ */
+ if (bdi_thresh >= 2 * bdi_stat_error(bdi)) {
+ bdi_nr_reclaimable = bdi_stat(bdi, BDI_RECLAIMABLE);
+ bdi_nr_writeback = bdi_stat(bdi, BDI_WRITEBACK);
+ } else {
+ bdi_nr_reclaimable = bdi_stat_sum(bdi, BDI_RECLAIMABLE);
+ bdi_nr_writeback = bdi_stat_sum(bdi, BDI_WRITEBACK);
+ }
- dirty_exceeded =
- (bdi_nr_reclaimable + bdi_nr_writeback >= bdi_thresh)
+ dirty_exceeded = (bdi_nr_reclaimable + bdi_nr_writeback >= bdi_thresh)
|| (nr_reclaimable + nr_writeback >= dirty_thresh);
- if (!dirty_exceeded)
- break;
+ if (!dirty_exceeded)
+ goto out;
- if (!bdi->dirty_exceeded)
- bdi->dirty_exceeded = 1;
+ if (!bdi->dirty_exceeded)
+ bdi->dirty_exceeded = 1;
- bdi_writeback_wait(bdi, write_chunk);
- break;
- }
+ bdi_writeback_wait(bdi, write_chunk);
- if (!dirty_exceeded && bdi->dirty_exceeded)
+ if (bdi->dirty_exceeded)
bdi->dirty_exceeded = 0;
+out:
/*
* In laptop mode, we wait until hitting the higher threshold before
* starting background writeout, and then write out all the way down
^ permalink raw reply [flat|nested] 116+ messages in thread
* [PATCH 20/45] NFS: introduce writeback wait queue
2009-10-07 7:38 [PATCH 00/45] some writeback experiments Wu Fengguang
` (18 preceding siblings ...)
2009-10-07 7:38 ` [PATCH 19/45] writeback: remove the loop in balance_dirty_pages() Wu Fengguang
@ 2009-10-07 7:38 ` Wu Fengguang
2009-10-07 8:53 ` Peter Zijlstra
2009-10-07 7:38 ` [PATCH 21/45] writeback: estimate bdi write bandwidth Wu Fengguang
` (27 subsequent siblings)
47 siblings, 1 reply; 116+ messages in thread
From: Wu Fengguang @ 2009-10-07 7:38 UTC (permalink / raw)
To: Andrew Morton
Cc: Theodore Tso, Christoph Hellwig, Dave Chinner, Chris Mason,
Peter Zijlstra, Li Shaohua, Myklebust Trond,
jens.axboe@oracle.com, Jan Kara, Nick Piggin, linux-fsdevel,
Wu Fengguang, LKML
[-- Attachment #1: writeback-nfs-request-queue.patch --]
[-- Type: text/plain, Size: 10531 bytes --]
The generic writeback routines are departing from congestion_wait()
in preference of get_request_wait(), i.e. waiting on the block queues.
Introduce the missing writeback wait queue for NFS, otherwise its
writeback pages may grow out of control.
In particular, balance_dirty_pages() will exit after it pushes
write_chunk pages into the PG_writeback page pool, _OR_ when the
background writeback work quits. The latter is new behavior: that work
could not only quit (normally) after dropping below the background
threshold, but also quit when it finds _zero_ dirty pages to write. In
that case the number of PG_writeback pages can grow without bound
unless it is explicitly limited.
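For reference, the block/wakeup thresholds added below can be worked
out with a trivial userspace helper; the nfs_congestion_kb value here
is a made-up example (the real value is sized at runtime), and 4K
pages are assumed:

#include <stdio.h>

#define PAGE_SHIFT		12	/* assume 4K pages */
#define PAGE_CACHE_SHIFT	PAGE_SHIFT
#define NFS_WAIT_PAGES		(1024L >> (PAGE_CACHE_SHIFT - 10))

int main(void)
{
	long nfs_congestion_kb = 16384;		/* example value only */
	long limit = nfs_congestion_kb >> (PAGE_SHIFT - 10);
	long step = limit / 8 < NFS_WAIT_PAGES ? limit / 8 : NFS_WAIT_PAGES;

	/* ASYNC requests block above limit and wake below limit - step */
	printf("async: block > %ld pages, wake < %ld pages\n",
	       limit, limit - step);
	/* SYNC requests block above 2*limit, so ASYNC ones never block them */
	printf("sync:  block > %ld pages, wake < %ld pages\n",
	       2 * limit, 2 * limit - step);
	return 0;
}

With the example value this gives 4096/3840 pages for ASYNC and
8192/7936 pages for SYNC.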
CC: Jens Axboe <jens.axboe@oracle.com>
CC: Chris Mason <chris.mason@oracle.com>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
CC: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
The wait time and network throughput vary a lot! This is a major
problem: nfs_end_page_writeback() is not called smoothly over time,
even when there are plenty of PG_writeback pages on the client side.
[ 397.828509] write_bandwidth: comm=nfsiod pages=192 time=16ms
[ 397.850976] write_bandwidth: comm=nfsiod pages=192 time=20ms
[ 403.065244] write_bandwidth: comm=nfsiod pages=192 time=5212ms
[ 403.549134] write_bandwidth: comm=nfsiod pages=1536 time=144ms
[ 403.570717] write_bandwidth: comm=nfsiod pages=192 time=20ms
[ 403.595749] write_bandwidth: comm=nfsiod pages=192 time=20ms
[ 403.622171] write_bandwidth: comm=nfsiod pages=192 time=24ms
[ 403.651779] write_bandwidth: comm=nfsiod pages=192 time=28ms
[ 403.680543] write_bandwidth: comm=nfsiod pages=192 time=24ms
[ 403.712572] write_bandwidth: comm=nfsiod pages=192 time=28ms
[ 403.751552] write_bandwidth: comm=nfsiod pages=192 time=36ms
[ 403.785979] write_bandwidth: comm=nfsiod pages=192 time=28ms
[ 403.823995] write_bandwidth: comm=nfsiod pages=192 time=36ms
[ 403.858970] write_bandwidth: comm=nfsiod pages=192 time=32ms
[ 403.880786] write_bandwidth: comm=nfsiod pages=192 time=16ms
[ 403.902732] write_bandwidth: comm=nfsiod pages=192 time=20ms
[ 403.925925] write_bandwidth: comm=nfsiod pages=192 time=20ms
[ 403.952044] write_bandwidth: comm=nfsiod pages=258 time=24ms
[ 403.974006] write_bandwidth: comm=nfsiod pages=192 time=16ms
[ 403.995989] write_bandwidth: comm=nfsiod pages=192 time=20ms
[ 405.031049] write_bandwidth: comm=nfsiod pages=192 time=1032ms
[ 405.257635] write_bandwidth: comm=nfsiod pages=1536 time=192ms
[ 405.279069] write_bandwidth: comm=nfsiod pages=192 time=20ms
[ 405.300843] write_bandwidth: comm=nfsiod pages=192 time=16ms
[ 405.326031] write_bandwidth: comm=nfsiod pages=192 time=20ms
[ 405.350843] write_bandwidth: comm=nfsiod pages=192 time=24ms
[ 405.375160] write_bandwidth: comm=nfsiod pages=192 time=20ms
[ 409.331015] write_bandwidth: comm=nfsiod pages=192 time=3952ms
[ 409.587928] write_bandwidth: comm=nfsiod pages=1536 time=152ms
[ 409.610068] write_bandwidth: comm=nfsiod pages=192 time=20ms
[ 409.635736] write_bandwidth: comm=nfsiod pages=192 time=24ms
# vmmon -d 1 nr_writeback nr_dirty nr_unstable
nr_writeback nr_dirty nr_unstable
11227 41463 38044
11227 41463 38044
11227 41463 38044
11227 41463 38044
11045 53987 6490
11033 53120 8145
11195 52143 10886
11211 52144 10913
11211 52144 10913
11211 52144 10913
11056 56887 3876
11062 55298 8155
11214 54485 9838
11225 54461 9852
11225 54461 9852
11225 54461 4582
22342 35535 7823
----total-cpu-usage---- -dsk/total- -net/total- ---paging-- ---system--
usr sys idl wai hiq siq| read writ| recv send| in out | int csw
0 0 9 92 0 0| 0 0 | 66B 306B| 0 0 |1003 377
0 1 39 60 0 1| 0 0 | 90k 1361k| 0 0 |1765 1599
0 15 12 43 0 31| 0 0 |2292k 34M| 0 0 | 12k 16k
0 0 16 84 0 0| 0 0 | 132B 306B| 0 0 |1003 376
0 0 43 57 0 0| 0 0 | 66B 306B| 0 0 |1004 376
0 7 25 55 0 13| 0 0 |1202k 18M| 0 0 |7331 8921
0 8 21 55 0 15| 0 0 |1195k 18M| 0 0 |5382 6579
0 0 38 62 0 0| 0 0 | 66B 306B| 0 0 |1002 371
0 0 33 67 0 0| 0 0 | 66B 306B| 0 0 |1003 376
0 14 20 41 0 24| 0 0 |1621k 24M| 0 0 |8549 10k
0 5 31 55 0 9| 0 0 | 769k 11M| 0 0 |4444 5180
0 0 18 82 0 0| 0 0 | 66B 568B| 0 0 |1004 377
0 1 41 54 0 3| 0 0 | 184k 2777k| 0 0 |2609 2619
1 13 22 43 0 22| 0 0 |1572k 23M| 0 0 |8138 10k
0 11 9 59 0 20| 0 0 |1861k 27M| 0 0 |9576 13k
0 5 23 66 0 5| 0 0 | 540k 8122k| 0 0 |2816 2885
fs/nfs/client.c | 2
fs/nfs/write.c | 81 +++++++++++++++++++++++++++++++-----
include/linux/nfs_fs_sb.h | 1
3 files changed, 73 insertions(+), 11 deletions(-)
--- linux.orig/fs/nfs/write.c 2009-10-06 23:37:46.000000000 +0800
+++ linux/fs/nfs/write.c 2009-10-07 10:44:42.000000000 +0800
@@ -187,11 +187,54 @@ static int wb_priority(struct writeback_
* NFS congestion control
*/
+#define NFS_WAIT_PAGES (1024L >> (PAGE_CACHE_SHIFT - 10))
int nfs_congestion_kb;
-#define NFS_CONGESTION_ON_THRESH (nfs_congestion_kb >> (PAGE_SHIFT-10))
-#define NFS_CONGESTION_OFF_THRESH \
- (NFS_CONGESTION_ON_THRESH - (NFS_CONGESTION_ON_THRESH >> 2))
+/*
+ * SYNC requests will block on (2*limit) and wakeup on (2*limit-NFS_WAIT_PAGES)
+ * ASYNC requests will block on (limit) and wakeup on (limit - NFS_WAIT_PAGES)
+ * In this way SYNC writes will never be blocked by ASYNC ones.
+ */
+
+static void nfs_set_congested(long nr, long limit,
+ struct backing_dev_info *bdi)
+{
+ if (nr > limit && !test_bit(BDI_async_congested, &bdi->state))
+ set_bdi_congested(bdi, BLK_RW_ASYNC);
+ else if (nr > 2 * limit && !test_bit(BDI_sync_congested, &bdi->state))
+ set_bdi_congested(bdi, BLK_RW_SYNC);
+}
+
+static void nfs_wait_contested(int is_sync,
+ struct backing_dev_info *bdi,
+ wait_queue_head_t *wqh)
+{
+ int waitbit = is_sync ? BDI_sync_congested : BDI_async_congested;
+
+ wait_event(wqh[is_sync], !test_bit(waitbit, &bdi->state));
+}
+
+static void nfs_wakeup_congested(long nr, long limit,
+ struct backing_dev_info *bdi,
+ wait_queue_head_t *wqh)
+{
+ if (nr < 2*limit - min(limit/8, NFS_WAIT_PAGES)) {
+ if (test_bit(BDI_sync_congested, &bdi->state))
+ clear_bdi_congested(bdi, BLK_RW_SYNC);
+ if (waitqueue_active(&wqh[BLK_RW_SYNC])) {
+ smp_mb__after_clear_bit();
+ wake_up(&wqh[BLK_RW_SYNC]);
+ }
+ }
+ if (nr < limit - min(limit/8, NFS_WAIT_PAGES)) {
+ if (test_bit(BDI_async_congested, &bdi->state))
+ clear_bdi_congested(bdi, BLK_RW_ASYNC);
+ if (waitqueue_active(&wqh[BLK_RW_ASYNC])) {
+ smp_mb__after_clear_bit();
+ wake_up(&wqh[BLK_RW_ASYNC]);
+ }
+ }
+}
static int nfs_set_page_writeback(struct page *page)
{
@@ -201,11 +244,9 @@ static int nfs_set_page_writeback(struct
struct inode *inode = page->mapping->host;
struct nfs_server *nfss = NFS_SERVER(inode);
- if (atomic_long_inc_return(&nfss->writeback) >
- NFS_CONGESTION_ON_THRESH) {
- set_bdi_congested(&nfss->backing_dev_info,
- BLK_RW_ASYNC);
- }
+ nfs_set_congested(atomic_long_inc_return(&nfss->writeback),
+ nfs_congestion_kb >> (PAGE_SHIFT-10),
+ &nfss->backing_dev_info);
}
return ret;
}
@@ -216,8 +257,11 @@ static void nfs_end_page_writeback(struc
struct nfs_server *nfss = NFS_SERVER(inode);
end_page_writeback(page);
- if (atomic_long_dec_return(&nfss->writeback) < NFS_CONGESTION_OFF_THRESH)
- clear_bdi_congested(&nfss->backing_dev_info, BLK_RW_ASYNC);
+
+ nfs_wakeup_congested(atomic_long_dec_return(&nfss->writeback),
+ nfs_congestion_kb >> (PAGE_SHIFT-10),
+ &nfss->backing_dev_info,
+ nfss->writeback_wait);
}
static struct nfs_page *nfs_find_and_lock_request(struct page *page)
@@ -309,19 +353,34 @@ static int nfs_writepage_locked(struct p
int nfs_writepage(struct page *page, struct writeback_control *wbc)
{
+ struct inode *inode = page->mapping->host;
+ struct nfs_server *nfss = NFS_SERVER(inode);
int ret;
ret = nfs_writepage_locked(page, wbc);
unlock_page(page);
+
+ nfs_wait_contested(wbc->sync_mode == WB_SYNC_ALL,
+ &nfss->backing_dev_info,
+ nfss->writeback_wait);
+
return ret;
}
-static int nfs_writepages_callback(struct page *page, struct writeback_control *wbc, void *data)
+static int nfs_writepages_callback(struct page *page,
+ struct writeback_control *wbc, void *data)
{
+ struct inode *inode = page->mapping->host;
+ struct nfs_server *nfss = NFS_SERVER(inode);
int ret;
ret = nfs_do_writepage(page, wbc, data);
unlock_page(page);
+
+ nfs_wait_contested(wbc->sync_mode == WB_SYNC_ALL,
+ &nfss->backing_dev_info,
+ nfss->writeback_wait);
+
return ret;
}
--- linux.orig/include/linux/nfs_fs_sb.h 2009-10-06 23:37:46.000000000 +0800
+++ linux/include/linux/nfs_fs_sb.h 2009-10-06 23:38:44.000000000 +0800
@@ -108,6 +108,7 @@ struct nfs_server {
struct nfs_iostats * io_stats; /* I/O statistics */
struct backing_dev_info backing_dev_info;
atomic_long_t writeback; /* number of writeback pages */
+ wait_queue_head_t writeback_wait[2];
int flags; /* various flags */
unsigned int caps; /* server capabilities */
unsigned int rsize; /* read size */
--- linux.orig/fs/nfs/client.c 2009-10-06 23:37:46.000000000 +0800
+++ linux/fs/nfs/client.c 2009-10-06 23:38:44.000000000 +0800
@@ -991,6 +991,8 @@ static struct nfs_server *nfs_alloc_serv
INIT_LIST_HEAD(&server->master_link);
atomic_set(&server->active, 0);
+ init_waitqueue_head(&server->writeback_wait[BLK_RW_SYNC]);
+ init_waitqueue_head(&server->writeback_wait[BLK_RW_ASYNC]);
server->io_stats = nfs_alloc_iostats();
if (!server->io_stats) {
^ permalink raw reply [flat|nested] 116+ messages in thread
* [PATCH 21/45] writeback: estimate bdi write bandwidth
2009-10-07 7:38 [PATCH 00/45] some writeback experiments Wu Fengguang
` (19 preceding siblings ...)
2009-10-07 7:38 ` [PATCH 20/45] NFS: introduce writeback wait queue Wu Fengguang
@ 2009-10-07 7:38 ` Wu Fengguang
2009-10-07 8:53 ` Peter Zijlstra
2009-10-07 7:38 ` [PATCH 22/45] writeback: show bdi write bandwidth in debugfs Wu Fengguang
` (26 subsequent siblings)
47 siblings, 1 reply; 116+ messages in thread
From: Wu Fengguang @ 2009-10-07 7:38 UTC (permalink / raw)
To: Andrew Morton
Cc: Theodore Tso, Christoph Hellwig, Dave Chinner, Chris Mason,
Peter Zijlstra, Li Shaohua, Myklebust Trond,
jens.axboe@oracle.com, Jan Kara, Nick Piggin, linux-fsdevel,
Wu Fengguang, LKML
[-- Attachment #1: writeback-throughput-estimation.patch --]
[-- Type: text/plain, Size: 5410 bytes --]
Estimate the bdi write bandwidth in bdi_writeback_wakeup(), at which
time the queue is kept busy by the dirtying process and the associated
background writeback, so it is not starved.
The estimate should therefore reflect the max device capability,
unless there are busy reads, in which case we want a lower nr_to_write
anyway.
CC: Theodore Ts'o <tytso@mit.edu>
CC: Jan Kara <jack@suse.cz>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
TODO: the estimated write bandwidth (~30MB/s) is mysteriously only
half the real throughput (~60MB/s). Here is some debug printk output
from bdi_calc_write_bandwidth():
printk("write_bandwidth: comm=%s pages=%lu time=%lums\n",
current->comm, nr_pages, time * 1000 / HZ);
[ 1093.397700] write_bandwidth: comm=swapper pages=1536 time=204ms
[ 1093.594319] write_bandwidth: comm=swapper pages=1536 time=196ms
[ 1093.796642] write_bandwidth: comm=swapper pages=1536 time=200ms
[ 1093.986128] write_bandwidth: comm=swapper pages=1536 time=192ms
[ 1094.179983] write_bandwidth: comm=swapper pages=1536 time=192ms
[ 1094.374021] write_bandwidth: comm=swapper pages=1536 time=196ms
[ 1094.570611] write_bandwidth: comm=swapper pages=1536 time=196ms
[ 1094.771847] write_bandwidth: comm=swapper pages=1536 time=200ms
[ 1094.961981] write_bandwidth: comm=swapper pages=1536 time=192ms
Workload is several concurrent copies.
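As a sanity check on the numbers above, here is a standalone sketch of
the estimator added below (an instantaneous pages/second rate folded
into a 63/64 exponential moving average); HZ=1000 and 4K pages are
assumptions, and the 1536-pages/196ms sample is taken from the printks
above:

#include <stdio.h>

#define HZ			1000	/* assumption */
#define PAGE_CACHE_SHIFT	12	/* 4K pages */
#define MAX_WRITEBACK_PAGES	(128 << (20 - PAGE_CACHE_SHIFT))

static long write_bandwidth = MAX_WRITEBACK_PAGES;	/* initial estimate */

static void bdi_calc_write_bandwidth(unsigned long nr_pages, unsigned long time)
{
	unsigned long bw = HZ * nr_pages / (time | 1);	/* pages per second */

	write_bandwidth = (write_bandwidth * 63 + bw) / 64;
}

int main(void)
{
	int i;

	/* feed one of the quoted samples repeatedly: 1536 pages in 196ms */
	for (i = 0; i < 500; i++)
		bdi_calc_write_bandwidth(1536, 196);

	/* settles near 1000*1536/196 = 7836 pages/s, ~30MB/s with 4K pages */
	printf("%ld pages/s, ~%ld MB/s\n",
	       write_bandwidth, write_bandwidth >> (20 - PAGE_CACHE_SHIFT));
	return 0;
}

So the ~30MB/s estimate is consistent with the per-sample pages/time
figures themselves; if the real throughput is ~60MB/s, the factor of
two seems to enter where the samples are taken, not in the averaging.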
fs/fs-writeback.c | 30 +++++++++++++++++++++---------
include/linux/backing-dev.h | 2 ++
include/linux/writeback.h | 10 ++++++++++
mm/backing-dev.c | 2 ++
4 files changed, 35 insertions(+), 9 deletions(-)
--- linux.orig/fs/fs-writeback.c 2009-10-06 23:38:43.000000000 +0800
+++ linux/fs/fs-writeback.c 2009-10-06 23:38:44.000000000 +0800
@@ -266,7 +266,23 @@ void bdi_start_writeback(struct backing_
bdi_alloc_queue_work(bdi, &args);
}
+static int bdi_writeback_chunk(struct backing_dev_info *bdi)
+{
+ return max(MIN_WRITEBACK_PAGES, bdi->write_bandwidth);
+}
+
+static void bdi_calc_write_bandwidth(struct backing_dev_info *bdi,
+ unsigned long nr_pages,
+ unsigned long time)
+{
+ unsigned long bw;
+
+ bw = HZ * nr_pages / (time | 1);
+ bdi->write_bandwidth = (bdi->write_bandwidth * 63 + bw) / 64;
+}
+
struct dirty_throttle_task {
+ unsigned long start_time;
long nr_pages;
struct list_head list;
struct completion complete;
@@ -275,6 +291,7 @@ struct dirty_throttle_task {
void bdi_writeback_wait(struct backing_dev_info *bdi, long nr_pages)
{
struct dirty_throttle_task tt = {
+ .start_time = jiffies,
.nr_pages = nr_pages,
.complete = COMPLETION_INITIALIZER_ONSTACK(tt.complete),
};
@@ -314,6 +331,9 @@ int bdi_writeback_wakeup(struct backing_
tt = list_entry(bdi->throttle_list.prev,
struct dirty_throttle_task, list);
list_del(&tt->list);
+ if (atomic_read(&bdi->throttle_pages) == 0)
+ bdi_calc_write_bandwidth(bdi, tt->nr_pages,
+ jiffies - tt->start_time);
complete(&tt->complete);
}
/*
@@ -323,6 +343,7 @@ int bdi_writeback_wakeup(struct backing_
tt = list_entry(bdi->throttle_list.prev,
struct dirty_throttle_task, list);
atomic_set(&bdi->throttle_pages, tt->nr_pages);
+ tt->start_time = jiffies;
} else {
tt = NULL;
atomic_set(&bdi->throttle_pages, DIRTY_THROTTLE_PAGES_STOP * 2);
@@ -717,15 +738,6 @@ void writeback_inodes_wbc(struct writeba
writeback_inodes_wb(&bdi->wb, wbc);
}
-/*
- * The maximum number of pages to writeout in a single bdi flush/kupdate
- * operation. We do this so we don't hold I_SYNC against an inode for
- * enormous amounts of time, which would block a userspace task which has
- * been forced to throttle against that inode. Also, the code reevaluates
- * the dirty each time it has written this many pages.
- */
-#define MAX_WRITEBACK_PAGES 1024
-
static inline bool over_bground_thresh(void)
{
unsigned long background_thresh, dirty_thresh;
--- linux.orig/include/linux/writeback.h 2009-10-06 23:37:46.000000000 +0800
+++ linux/include/linux/writeback.h 2009-10-06 23:38:44.000000000 +0800
@@ -14,6 +14,16 @@ extern struct list_head inode_in_use;
extern struct list_head inode_unused;
/*
+ * The max number of pages to writeout for each inode.
+ *
+ * We honor each inode a nr_to_write that will take about 1 second
+ * to finish, based on dynamic estimation of the bdi's write bandwidth.
+ * MAX_ serves as initial bandwidth value; MIN_ serves as low boundary.
+ */
+#define MAX_WRITEBACK_PAGES (128 << (20 - PAGE_CACHE_SHIFT))
+#define MIN_WRITEBACK_PAGES ( 16 << (20 - PAGE_CACHE_SHIFT))
+
+/*
* fs/fs-writeback.c
*/
enum writeback_sync_modes {
--- linux.orig/include/linux/backing-dev.h 2009-10-06 23:38:43.000000000 +0800
+++ linux/include/linux/backing-dev.h 2009-10-06 23:38:44.000000000 +0800
@@ -86,6 +86,8 @@ struct backing_dev_info {
struct list_head work_list;
+ int write_bandwidth; /* pages per second */
+
/*
* dirtier process throttling
*/
--- linux.orig/mm/backing-dev.c 2009-10-06 23:38:43.000000000 +0800
+++ linux/mm/backing-dev.c 2009-10-06 23:38:44.000000000 +0800
@@ -646,6 +646,8 @@ int bdi_init(struct backing_dev_info *bd
bdi->wb_mask = 1;
bdi->wb_cnt = 1;
+ bdi->write_bandwidth = MAX_WRITEBACK_PAGES;
+
spin_lock_init(&bdi->throttle_lock);
INIT_LIST_HEAD(&bdi->throttle_list);
atomic_set(&bdi->throttle_pages, DIRTY_THROTTLE_PAGES_STOP * 2);
^ permalink raw reply [flat|nested] 116+ messages in thread
* [PATCH 22/45] writeback: show bdi write bandwidth in debugfs
2009-10-07 7:38 [PATCH 00/45] some writeback experiments Wu Fengguang
` (20 preceding siblings ...)
2009-10-07 7:38 ` [PATCH 21/45] writeback: estimate bdi write bandwidth Wu Fengguang
@ 2009-10-07 7:38 ` Wu Fengguang
2009-10-07 7:38 ` [PATCH 23/45] writeback: kill space in debugfs item name Wu Fengguang
` (25 subsequent siblings)
47 siblings, 0 replies; 116+ messages in thread
From: Wu Fengguang @ 2009-10-07 7:38 UTC (permalink / raw)
To: Andrew Morton
Cc: Theodore Tso, Christoph Hellwig, Dave Chinner, Chris Mason,
Peter Zijlstra, Li Shaohua, Myklebust Trond,
jens.axboe@oracle.com, Jan Kara, Nick Piggin, linux-fsdevel,
Wu Fengguang, LKML
[-- Attachment #1: writeback-bandwidth-show.patch --]
[-- Type: text/plain, Size: 1258 bytes --]
CC: Theodore Ts'o <tytso@mit.edu>
CC: Jan Kara <jack@suse.cz>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
mm/backing-dev.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)
--- linux.orig/mm/backing-dev.c 2009-10-06 23:38:44.000000000 +0800
+++ linux/mm/backing-dev.c 2009-10-06 23:38:44.000000000 +0800
@@ -93,6 +93,7 @@ static int bdi_debug_stats_show(struct s
"BdiDirtyThresh: %8lu kB\n"
"DirtyThresh: %8lu kB\n"
"BackgroundThresh: %8lu kB\n"
+ "WriteBandwidth: %8lu kBps\n"
"WriteBack threads:%8lu\n"
"b_dirty: %8lu\n"
"b_io: %8lu\n"
@@ -104,8 +105,9 @@ static int bdi_debug_stats_show(struct s
"wb_cnt: %8u\n",
(unsigned long) K(bdi_stat(bdi, BDI_WRITEBACK)),
(unsigned long) K(bdi_stat(bdi, BDI_RECLAIMABLE)),
- K(bdi_thresh), K(dirty_thresh),
- K(background_thresh), nr_wb, nr_dirty, nr_io, nr_more_io,
+ K(bdi_thresh), K(dirty_thresh), K(background_thresh),
+ (unsigned long) K(bdi->write_bandwidth),
+ nr_wb, nr_dirty, nr_io, nr_more_io,
!list_empty(&bdi->bdi_list), bdi->state, bdi->wb_mask,
!list_empty(&bdi->wb_list), bdi->wb_cnt);
#undef K
^ permalink raw reply [flat|nested] 116+ messages in thread
* [PATCH 23/45] writeback: kill space in debugfs item name
2009-10-07 7:38 [PATCH 00/45] some writeback experiments Wu Fengguang
` (21 preceding siblings ...)
2009-10-07 7:38 ` [PATCH 22/45] writeback: show bdi write bandwidth in debugfs Wu Fengguang
@ 2009-10-07 7:38 ` Wu Fengguang
2009-10-07 7:38 ` [PATCH 24/45] writeback: remove global nr_to_write and use timeout instead Wu Fengguang
` (24 subsequent siblings)
47 siblings, 0 replies; 116+ messages in thread
From: Wu Fengguang @ 2009-10-07 7:38 UTC (permalink / raw)
To: Andrew Morton
Cc: Theodore Tso, Christoph Hellwig, Dave Chinner, Chris Mason,
Peter Zijlstra, Li Shaohua, Myklebust Trond,
jens.axboe@oracle.com, Jan Kara, Nick Piggin, linux-fsdevel,
Wu Fengguang, LKML
[-- Attachment #1: writeback-fix-writeback-threads.patch --]
[-- Type: text/plain, Size: 682 bytes --]
The space is not script-friendly; kill it.
CC: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
mm/backing-dev.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
--- linux.orig/mm/backing-dev.c 2009-10-06 23:38:44.000000000 +0800
+++ linux/mm/backing-dev.c 2009-10-06 23:38:45.000000000 +0800
@@ -94,7 +94,7 @@ static int bdi_debug_stats_show(struct s
"DirtyThresh: %8lu kB\n"
"BackgroundThresh: %8lu kB\n"
"WriteBandwidth: %8lu kBps\n"
- "WriteBack threads:%8lu\n"
+ "WriteBackThreads: %8lu\n"
"b_dirty: %8lu\n"
"b_io: %8lu\n"
"b_more_io: %8lu\n"
^ permalink raw reply [flat|nested] 116+ messages in thread
* [PATCH 24/45] writeback: remove global nr_to_write and use timeout instead
2009-10-07 7:38 [PATCH 00/45] some writeback experiments Wu Fengguang
` (22 preceding siblings ...)
2009-10-07 7:38 ` [PATCH 23/45] writeback: kill space in debugfs item name Wu Fengguang
@ 2009-10-07 7:38 ` Wu Fengguang
2009-10-07 7:38 ` [PATCH 25/45] writeback: convert wbc.nr_to_write to per-file parameter Wu Fengguang
` (23 subsequent siblings)
47 siblings, 0 replies; 116+ messages in thread
From: Wu Fengguang @ 2009-10-07 7:38 UTC (permalink / raw)
To: Andrew Morton
Cc: Theodore Tso, Christoph Hellwig, Dave Chinner, Chris Mason,
Peter Zijlstra, Li Shaohua, Myklebust Trond,
jens.axboe@oracle.com, Jan Kara, Nick Piggin, linux-fsdevel,
Wu Fengguang, LKML
[-- Attachment #1: writeback-global-timeout.patch --]
[-- Type: text/plain, Size: 2204 bytes --]
The global wbc.timeout could be useful when we decide to do two sync
scans, one on dirty pages and one on dirty metadata. XFS could say:
please return to sync dirty metadata after 10s. That would need
another b_io_metadata queue, but it is possible.
CC: Theodore Ts'o <tytso@mit.edu>
CC: Chris Mason <chris.mason@oracle.com>
CC: Dave Chinner <david@fromorbit.com>
CC: Christoph Hellwig <hch@infradead.org>
CC: Jan Kara <jack@suse.cz>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
CC: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
fs/fs-writeback.c | 6 ++++++
include/linux/writeback.h | 1 +
mm/backing-dev.c | 1 +
3 files changed, 8 insertions(+)
--- linux.orig/fs/fs-writeback.c 2009-10-06 23:38:44.000000000 +0800
+++ linux/fs/fs-writeback.c 2009-10-06 23:38:45.000000000 +0800
@@ -666,6 +666,10 @@ static void writeback_inodes_wb(struct b
{
struct super_block *sb = wbc->sb, *pin_sb = NULL;
const unsigned long start = jiffies; /* livelock avoidance */
+ unsigned long stop_time = 0;
+
+ if (wbc->timeout)
+ stop_time = (start + wbc->timeout) | 1;
spin_lock(&inode_lock);
@@ -723,6 +727,8 @@ static void writeback_inodes_wb(struct b
}
if (!list_empty(&wb->b_more_io))
wbc->more_io = 1;
+ if (stop_time && time_after(jiffies, stop_time))
+ break;
}
unpin_sb_for_writeback(&pin_sb);
--- linux.orig/mm/backing-dev.c 2009-10-06 23:38:45.000000000 +0800
+++ linux/mm/backing-dev.c 2009-10-06 23:38:45.000000000 +0800
@@ -339,6 +339,7 @@ static void bdi_flush_io(struct backing_
.older_than_this = NULL,
.range_cyclic = 1,
.nr_to_write = 1024,
+ .timeout = HZ,
};
writeback_inodes_wbc(&wbc);
--- linux.orig/include/linux/writeback.h 2009-10-06 23:38:44.000000000 +0800
+++ linux/include/linux/writeback.h 2009-10-06 23:38:45.000000000 +0800
@@ -42,6 +42,7 @@ struct writeback_control {
struct super_block *sb; /* if !NULL, only write inodes from
this super_block */
enum writeback_sync_modes sync_mode;
+ int timeout;
unsigned long *older_than_this; /* If !NULL, only write back inodes
older than this */
long nr_to_write; /* Write this many pages, and decrement
^ permalink raw reply [flat|nested] 116+ messages in thread
* [PATCH 25/45] writeback: convert wbc.nr_to_write to per-file parameter
2009-10-07 7:38 [PATCH 00/45] some writeback experiments Wu Fengguang
` (23 preceding siblings ...)
2009-10-07 7:38 ` [PATCH 24/45] writeback: remove global nr_to_write and use timeout instead Wu Fengguang
@ 2009-10-07 7:38 ` Wu Fengguang
2009-10-07 7:38 ` [PATCH 26/45] block: pass the non-rotational queue flag to backing_dev_info Wu Fengguang
` (22 subsequent siblings)
47 siblings, 0 replies; 116+ messages in thread
From: Wu Fengguang @ 2009-10-07 7:38 UTC (permalink / raw)
To: Andrew Morton
Cc: Theodore Tso, Christoph Hellwig, Dave Chinner, Chris Mason,
Peter Zijlstra, Li Shaohua, Myklebust Trond,
jens.axboe@oracle.com, Jan Kara, Nick Piggin, linux-fsdevel,
Wu Fengguang, LKML
[-- Attachment #1: writeback-per-file-nr_to_write.patch --]
[-- Type: text/plain, Size: 3651 bytes --]
CC: Theodore Ts'o <tytso@mit.edu>
CC: Chris Mason <chris.mason@oracle.com>
CC: Dave Chinner <david@fromorbit.com>
CC: Christoph Hellwig <hch@infradead.org>
CC: Jan Kara <jack@suse.cz>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
CC: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
fs/fs-writeback.c | 24 ++++++++++++------------
include/linux/writeback.h | 5 +++--
mm/backing-dev.c | 1 -
3 files changed, 15 insertions(+), 15 deletions(-)
--- linux.orig/fs/fs-writeback.c 2009-10-06 23:38:45.000000000 +0800
+++ linux/fs/fs-writeback.c 2009-10-06 23:38:56.000000000 +0800
@@ -661,12 +661,13 @@ pinned:
return 0;
}
-static void writeback_inodes_wb(struct bdi_writeback *wb,
+static long writeback_inodes_wb(struct bdi_writeback *wb,
struct writeback_control *wbc)
{
struct super_block *sb = wbc->sb, *pin_sb = NULL;
const unsigned long start = jiffies; /* livelock avoidance */
unsigned long stop_time = 0;
+ unsigned long wrote = 0;
if (wbc->timeout)
stop_time = (start + wbc->timeout) | 1;
@@ -709,7 +710,10 @@ static void writeback_inodes_wb(struct b
BUG_ON(inode->i_state & (I_FREEING | I_CLEAR));
__iget(inode);
pages_skipped = wbc->pages_skipped;
+ wbc->nr_to_write = bdi_writeback_chunk(wb->bdi);
+ wrote += wbc->nr_to_write;
writeback_single_inode(inode, wbc);
+ wrote -= wbc->nr_to_write;
if (wbc->pages_skipped != pages_skipped) {
/*
* writeback is not making progress due to locked
@@ -735,6 +739,7 @@ static void writeback_inodes_wb(struct b
spin_unlock(&inode_lock);
/* Leave any unwritten inodes on b_io */
+ return wrote;
}
void writeback_inodes_wbc(struct writeback_control *wbc)
@@ -782,6 +787,7 @@ static long wb_writeback(struct bdi_writ
};
unsigned long oldest_jif;
long wrote = 0;
+ long nr;
struct inode *inode;
if (wbc.for_kupdate) {
@@ -810,26 +816,20 @@ static long wb_writeback(struct bdi_writ
wbc.more_io = 0;
wbc.encountered_congestion = 0;
- wbc.nr_to_write = MAX_WRITEBACK_PAGES;
wbc.pages_skipped = 0;
- writeback_inodes_wb(wb, &wbc);
- args->nr_pages -= MAX_WRITEBACK_PAGES - wbc.nr_to_write;
- wrote += MAX_WRITEBACK_PAGES - wbc.nr_to_write;
+ nr = writeback_inodes_wb(wb, &wbc);
+ args->nr_pages -= nr;
+ wrote += nr;
/*
- * If we consumed everything, see if we have more
- */
- if (wbc.nr_to_write <= 0)
- continue;
- /*
- * Didn't write everything and we don't have more IO, bail
+ * Bail if no more IO
*/
if (!wbc.more_io)
break;
/*
* Did we write something? Try for more
*/
- if (wbc.nr_to_write < MAX_WRITEBACK_PAGES)
+ if (nr)
continue;
/*
* Nothing written. Wait for some inode to
--- linux.orig/include/linux/writeback.h 2009-10-06 23:38:45.000000000 +0800
+++ linux/include/linux/writeback.h 2009-10-06 23:38:52.000000000 +0800
@@ -45,8 +45,9 @@ struct writeback_control {
int timeout;
unsigned long *older_than_this; /* If !NULL, only write back inodes
older than this */
- long nr_to_write; /* Write this many pages, and decrement
- this for each page written */
+ long nr_to_write; /* Max pages to write per file, and
+ decrement this for each page written
+ */
long pages_skipped; /* Pages which were not written */
/*
--- linux.orig/mm/backing-dev.c 2009-10-06 23:38:45.000000000 +0800
+++ linux/mm/backing-dev.c 2009-10-06 23:38:52.000000000 +0800
@@ -338,7 +338,6 @@ static void bdi_flush_io(struct backing_
.sync_mode = WB_SYNC_NONE,
.older_than_this = NULL,
.range_cyclic = 1,
- .nr_to_write = 1024,
.timeout = HZ,
};
^ permalink raw reply [flat|nested] 116+ messages in thread
* [PATCH 26/45] block: pass the non-rotational queue flag to backing_dev_info
2009-10-07 7:38 [PATCH 00/45] some writeback experiments Wu Fengguang
` (24 preceding siblings ...)
2009-10-07 7:38 ` [PATCH 25/45] writeback: convert wbc.nr_to_write to per-file parameter Wu Fengguang
@ 2009-10-07 7:38 ` Wu Fengguang
2009-10-07 7:38 ` [PATCH 27/45] writeback: introduce wbc.for_background Wu Fengguang
` (21 subsequent siblings)
47 siblings, 0 replies; 116+ messages in thread
From: Wu Fengguang @ 2009-10-07 7:38 UTC (permalink / raw)
To: Andrew Morton
Cc: Theodore Tso, Christoph Hellwig, Dave Chinner, Chris Mason,
Peter Zijlstra, Li Shaohua, Myklebust Trond,
jens.axboe@oracle.com, Jan Kara, Nick Piggin, linux-fsdevel,
Wu Fengguang, LKML
[-- Attachment #1: block-bdi-nonrot.patch --]
[-- Type: text/plain, Size: 5313 bytes --]
For now, only pass the non-rotational bit to the queue's bdi.
To support hierarchical dm/md configurations, blk_set_rotational()
would have to be expanded to iterate over the bdi list several times
and call each bdi->rotational_fn to pass the bit up. But that would
imply we don't support mixed SSD/HD arrays.
CC: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
block/blk-core.c | 14 ++++++++++++++
block/blk-sysfs.c | 7 +------
drivers/block/nbd.c | 2 +-
drivers/block/xen-blkfront.c | 1 +
drivers/mmc/card/queue.c | 2 +-
drivers/scsi/sd.c | 2 +-
include/linux/backing-dev.h | 8 ++++++++
include/linux/blkdev.h | 1 +
8 files changed, 28 insertions(+), 9 deletions(-)
--- linux.orig/block/blk-sysfs.c 2009-10-06 23:37:44.000000000 +0800
+++ linux/block/blk-sysfs.c 2009-10-06 23:39:26.000000000 +0800
@@ -162,12 +162,7 @@ static ssize_t queue_nonrot_store(struct
unsigned long nm;
ssize_t ret = queue_var_store(&nm, page, count);
- spin_lock_irq(q->queue_lock);
- if (nm)
- queue_flag_clear(QUEUE_FLAG_NONROT, q);
- else
- queue_flag_set(QUEUE_FLAG_NONROT, q);
- spin_unlock_irq(q->queue_lock);
+ blk_set_rotational(q, nm);
return ret;
}
--- linux.orig/include/linux/backing-dev.h 2009-10-06 23:38:44.000000000 +0800
+++ linux/include/linux/backing-dev.h 2009-10-06 23:39:26.000000000 +0800
@@ -29,6 +29,7 @@ enum bdi_state {
BDI_wb_alloc, /* Default embedded wb allocated */
BDI_async_congested, /* The async (write) queue is getting full */
BDI_sync_congested, /* The sync queue is getting full */
+ BDI_non_rotational, /* Underlying device is SSD or virtual */
BDI_registered, /* bdi_register() was done */
BDI_unused, /* Available bits start here */
};
@@ -254,6 +255,8 @@ int bdi_set_max_ratio(struct backing_dev
#define BDI_CAP_NO_ACCT_WB 0x00000080
#define BDI_CAP_SWAP_BACKED 0x00000100
+#define BDI_CAP_NONROT 0x00000200
+
#define BDI_CAP_VMFLAGS \
(BDI_CAP_READ_MAP | BDI_CAP_WRITE_MAP | BDI_CAP_EXEC_MAP)
@@ -291,6 +294,11 @@ static inline int can_submit_background_
return !test_and_set_bit(WB_FLAG_BACKGROUND_WORK, &bdi->wb_mask);
}
+static inline bool bdi_nonrot(struct backing_dev_info *bdi)
+{
+ return bdi->state & (1 << BDI_non_rotational);
+}
+
static inline int bdi_congested(struct backing_dev_info *bdi, int bdi_bits)
{
if (bdi->congested_fn)
--- linux.orig/block/blk-core.c 2009-10-06 23:37:44.000000000 +0800
+++ linux/block/blk-core.c 2009-10-06 23:39:26.000000000 +0800
@@ -2486,6 +2486,20 @@ free_and_out:
}
EXPORT_SYMBOL_GPL(blk_rq_prep_clone);
+void blk_set_rotational(struct request_queue *q, int rotational)
+{
+ spin_lock_irq(q->queue_lock);
+ if (rotational) {
+ queue_flag_clear(QUEUE_FLAG_NONROT, q);
+ clear_bit(BDI_non_rotational, &q->backing_dev_info.state);
+ } else {
+ queue_flag_set(QUEUE_FLAG_NONROT, q);
+ set_bit(BDI_non_rotational, &q->backing_dev_info.state);
+ }
+ spin_unlock_irq(q->queue_lock);
+}
+EXPORT_SYMBOL(blk_set_rotational);
+
int kblockd_schedule_work(struct request_queue *q, struct work_struct *work)
{
return queue_work(kblockd_workqueue, work);
--- linux.orig/include/linux/blkdev.h 2009-10-06 23:37:44.000000000 +0800
+++ linux/include/linux/blkdev.h 2009-10-06 23:39:26.000000000 +0800
@@ -666,6 +666,7 @@ static inline void blk_clear_queue_full(
queue_flag_clear(QUEUE_FLAG_ASYNCFULL, q);
}
+void blk_set_rotational(struct request_queue *q, int rotational);
/*
* mergeable request must not have _NOMERGE or _BARRIER bit set, nor may
--- linux.orig/drivers/block/nbd.c 2009-10-06 23:37:44.000000000 +0800
+++ linux/drivers/block/nbd.c 2009-10-06 23:39:26.000000000 +0800
@@ -772,7 +772,7 @@ static int __init nbd_init(void)
/*
* Tell the block layer that we are not a rotational device
*/
- queue_flag_set_unlocked(QUEUE_FLAG_NONROT, disk->queue);
+ blk_set_rotational(disk->queue, 0);
}
if (register_blkdev(NBD_MAJOR, "nbd")) {
--- linux.orig/drivers/block/xen-blkfront.c 2009-10-06 23:37:44.000000000 +0800
+++ linux/drivers/block/xen-blkfront.c 2009-10-06 23:39:26.000000000 +0800
@@ -342,6 +342,7 @@ static int xlvbd_init_blk_queue(struct g
return -1;
queue_flag_set_unlocked(QUEUE_FLAG_VIRT, rq);
+ blk_set_rotational(rq, 0);
/* Hard sector size and max sectors impersonate the equiv. hardware. */
blk_queue_logical_block_size(rq, sector_size);
--- linux.orig/drivers/mmc/card/queue.c 2009-10-06 23:37:44.000000000 +0800
+++ linux/drivers/mmc/card/queue.c 2009-10-06 23:39:26.000000000 +0800
@@ -127,7 +127,7 @@ int mmc_init_queue(struct mmc_queue *mq,
blk_queue_prep_rq(mq->queue, mmc_prep_request);
blk_queue_ordered(mq->queue, QUEUE_ORDERED_DRAIN, NULL);
- queue_flag_set_unlocked(QUEUE_FLAG_NONROT, mq->queue);
+ blk_set_rotational(mq->queue, 0);
#ifdef CONFIG_MMC_BLOCK_BOUNCE
if (host->max_hw_segs == 1) {
--- linux.orig/drivers/scsi/sd.c 2009-10-06 23:37:44.000000000 +0800
+++ linux/drivers/scsi/sd.c 2009-10-06 23:39:26.000000000 +0800
@@ -1898,7 +1898,7 @@ static void sd_read_block_characteristic
rot = get_unaligned_be16(&buffer[4]);
if (rot == 1)
- queue_flag_set_unlocked(QUEUE_FLAG_NONROT, sdkp->disk->queue);
+ blk_set_rotational(sdkp->disk->queue, 0);
kfree(buffer);
}
^ permalink raw reply [flat|nested] 116+ messages in thread
* [PATCH 27/45] writeback: introduce wbc.for_background
2009-10-07 7:38 [PATCH 00/45] some writeback experiments Wu Fengguang
` (25 preceding siblings ...)
2009-10-07 7:38 ` [PATCH 26/45] block: pass the non-rotational queue flag to backing_dev_info Wu Fengguang
@ 2009-10-07 7:38 ` Wu Fengguang
2009-10-07 7:38 ` [PATCH 28/45] writeback: introduce wbc.nr_segments Wu Fengguang
` (20 subsequent siblings)
47 siblings, 0 replies; 116+ messages in thread
From: Wu Fengguang @ 2009-10-07 7:38 UTC (permalink / raw)
To: Andrew Morton
Cc: Theodore Tso, Christoph Hellwig, Dave Chinner, Chris Mason,
Peter Zijlstra, Li Shaohua, Myklebust Trond,
jens.axboe@oracle.com, Jan Kara, Nick Piggin, linux-fsdevel,
Wu Fengguang, LKML
[-- Attachment #1: writeback-add-wbc-for_background.patch --]
[-- Type: text/plain, Size: 1716 bytes --]
It will be tested when setting wbc.nr_segments, when lowering the
flush priority for NFS, and maybe for more in the future.
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
fs/fs-writeback.c | 1 +
fs/nfs/write.c | 2 +-
include/linux/writeback.h | 1 +
3 files changed, 3 insertions(+), 1 deletion(-)
--- linux.orig/fs/fs-writeback.c 2009-10-06 23:38:56.000000000 +0800
+++ linux/fs/fs-writeback.c 2009-10-06 23:39:27.000000000 +0800
@@ -783,6 +783,7 @@ static long wb_writeback(struct bdi_writ
.sync_mode = args->sync_mode,
.older_than_this = NULL,
.for_kupdate = args->for_kupdate,
+ .for_background = args->for_background,
.range_cyclic = args->range_cyclic,
};
unsigned long oldest_jif;
--- linux.orig/include/linux/writeback.h 2009-10-06 23:38:52.000000000 +0800
+++ linux/include/linux/writeback.h 2009-10-06 23:39:27.000000000 +0800
@@ -61,6 +61,7 @@ struct writeback_control {
unsigned nonblocking:1; /* Don't get stuck on request queues */
unsigned encountered_congestion:1; /* An output: a queue is full */
unsigned for_kupdate:1; /* A kupdate writeback */
+ unsigned for_background:1; /* A background writeback */
unsigned for_reclaim:1; /* Invoked from the page allocator */
unsigned range_cyclic:1; /* range_start is cyclic */
unsigned more_io:1; /* more io to be dispatched */
--- linux.orig/fs/nfs/write.c 2009-10-06 23:38:44.000000000 +0800
+++ linux/fs/nfs/write.c 2009-10-06 23:39:27.000000000 +0800
@@ -178,7 +178,7 @@ static int wb_priority(struct writeback_
{
if (wbc->for_reclaim)
return FLUSH_HIGHPRI | FLUSH_STABLE;
- if (wbc->for_kupdate)
+ if (wbc->for_kupdate || wbc->for_background)
return FLUSH_LOWPRI;
return 0;
}
^ permalink raw reply [flat|nested] 116+ messages in thread
* [PATCH 28/45] writeback: introduce wbc.nr_segments
2009-10-07 7:38 [PATCH 00/45] some writeback experiments Wu Fengguang
` (26 preceding siblings ...)
2009-10-07 7:38 ` [PATCH 27/45] writeback: introduce wbc.for_background Wu Fengguang
@ 2009-10-07 7:38 ` Wu Fengguang
2009-10-07 7:38 ` [PATCH 29/45] writeback: fix the shmem AOP_WRITEPAGE_ACTIVATE case Wu Fengguang
` (19 subsequent siblings)
47 siblings, 0 replies; 116+ messages in thread
From: Wu Fengguang @ 2009-10-07 7:38 UTC (permalink / raw)
To: Andrew Morton
Cc: Theodore Tso, Christoph Hellwig, Dave Chinner, Chris Mason,
Peter Zijlstra, Li Shaohua, Myklebust Trond,
jens.axboe@oracle.com, Jan Kara, Nick Piggin, linux-fsdevel,
Wu Fengguang, LKML
[-- Attachment #1: writeback-nr_segments.patch --]
[-- Type: text/plain, Size: 4698 bytes --]
wbc.nr_segments serves two major purposes:
- fairness between two large files, one continuously dirtied and the
other sparsely dirtied. Given the same amount of dirty pages, they can
take vastly different times to sync to the _same_ device. The
nr_segments check helps to favor continuous data.
- avoiding seeks/fragmentation. To give each file a fair chance of
writeback, we have to abort a file when some nr_to_write or timeout
is reached. However, neither is a good abort condition.
The best is for the filesystem to abort earlier at seek boundaries,
and treat nr_to_write/timeout as generous enough bottom lines.
However a low nr_segments would be inefficient if all files are sparsely
dirtied. For example, it may be inefficient for block device inodes,
which have lots of sparsely distributed metadata pages.
The wbc.nr_segments here is determined purely by logical page index
distance: if two pages are more than 1MB apart, they start a new segment.
Filesystems could do this better with real extent knowledge.
One possible scheme is to record the previous page index in
wbc.writeback_index, and let ->writepage check whether the current and
previous pages lie in the same extent, decreasing wbc.nr_segments
accordingly. Care should be taken to avoid double decreases in writepage
and write_cache_pages.
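A minimal sketch of that extent-aware scheme, for illustration only
(fs_same_extent() is a hypothetical filesystem helper, not an existing
API; the patch below only implements the page-index-distance heuristic
in write_cache_pages()):
/*
 * Sketch: ->writepage decrements wbc->nr_segments when the page being
 * written does not share an extent with the previously written page;
 * write_cache_pages() (see the patch below) aborts the inode once the
 * count drops to zero.
 */
static int fs_writepage(struct page *page, struct writeback_control *wbc)
{
	struct inode *inode = page->mapping->host;

	/* crossing an extent boundary starts a new segment */
	if (!fs_same_extent(inode, wbc->writeback_index, page->index))
		wbc->nr_segments--;

	/* remember this page for the next comparison */
	wbc->writeback_index = page->index;

	/* ... the filesystem's normal writeout of the page goes here ... */
	return 0;
}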
CC: Theodore Ts'o <tytso@mit.edu>
CC: Chris Mason <chris.mason@oracle.com>
CC: Dave Chinner <david@fromorbit.com>
CC: Christoph Hellwig <hch@infradead.org>
CC: Jan Kara <jack@suse.cz>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
CC: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
fs/fs-writeback.c | 8 +++++++-
fs/jbd2/commit.c | 1 +
include/linux/writeback.h | 10 +++++++++-
mm/filemap.c | 1 +
mm/page-writeback.c | 7 +++++++
5 files changed, 25 insertions(+), 2 deletions(-)
--- linux.orig/fs/fs-writeback.c 2009-10-06 23:39:27.000000000 +0800
+++ linux/fs/fs-writeback.c 2009-10-06 23:39:28.000000000 +0800
@@ -542,6 +542,11 @@ writeback_single_inode(struct inode *ino
spin_unlock(&inode_lock);
+ if (wbc->for_kupdate || wbc->for_background)
+ wbc->nr_segments = bdi_nonrot(wbc->bdi) ? 100 : 10;
+ else
+ wbc->nr_segments = LONG_MAX;
+
ret = do_writepages(mapping, wbc);
/* Don't write the inode if only I_DIRTY_PAGES was set */
@@ -566,7 +571,8 @@ writeback_single_inode(struct inode *ino
* sometimes bales out without doing anything.
*/
inode->i_state |= I_DIRTY_PAGES;
- if (wbc->nr_to_write <= 0) {
+ if (wbc->nr_to_write <= 0 ||
+ wbc->nr_segments <= 0) {
/*
* slice used up: queue for next turn
*/
--- linux.orig/include/linux/writeback.h 2009-10-06 23:39:27.000000000 +0800
+++ linux/include/linux/writeback.h 2009-10-06 23:39:28.000000000 +0800
@@ -48,6 +48,9 @@ struct writeback_control {
long nr_to_write; /* Max pages to write per file, and
decrement this for each page written
*/
+ long nr_segments; /* Max page segments to write per file,
+ this is a count down value, too
+ */
long pages_skipped; /* Pages which were not written */
/*
@@ -77,8 +80,13 @@ struct writeback_control {
};
/*
+ * if two page ranges are more than 1MB apart, they are taken as two segments.
+ */
+#define WB_SEGMENT_DIST (1024 >> (PAGE_CACHE_SHIFT - 10))
+
+/*
* fs/fs-writeback.c
- */
+ */
struct bdi_writeback;
int inode_wait(void *);
void writeback_inodes_sb(struct super_block *);
--- linux.orig/mm/filemap.c 2009-10-06 23:37:43.000000000 +0800
+++ linux/mm/filemap.c 2009-10-06 23:39:28.000000000 +0800
@@ -216,6 +216,7 @@ int __filemap_fdatawrite_range(struct ad
struct writeback_control wbc = {
.sync_mode = sync_mode,
.nr_to_write = LONG_MAX,
+ .nr_segments = LONG_MAX,
.range_start = start,
.range_end = end,
};
--- linux.orig/mm/page-writeback.c 2009-10-06 23:38:43.000000000 +0800
+++ linux/mm/page-writeback.c 2009-10-06 23:39:28.000000000 +0800
@@ -805,6 +805,13 @@ int write_cache_pages(struct address_spa
break;
}
+ if (nr_to_write != wbc->nr_to_write &&
+ done_index + WB_SEGMENT_DIST < page->index &&
+ --wbc->nr_segments <= 0) {
+ done = 1;
+ break;
+ }
+
done_index = page->index + 1;
lock_page(page);
--- linux.orig/fs/jbd2/commit.c 2009-10-06 23:37:42.000000000 +0800
+++ linux/fs/jbd2/commit.c 2009-10-06 23:39:28.000000000 +0800
@@ -219,6 +219,7 @@ static int journal_submit_inode_data_buf
struct writeback_control wbc = {
.sync_mode = WB_SYNC_ALL,
.nr_to_write = mapping->nrpages * 2,
+ .nr_segments = LONG_MAX,
.range_start = 0,
.range_end = i_size_read(mapping->host),
};
^ permalink raw reply [flat|nested] 116+ messages in thread
* [PATCH 29/45] writeback: fix the shmem AOP_WRITEPAGE_ACTIVATE case
2009-10-07 7:38 [PATCH 00/45] some writeback experiments Wu Fengguang
` (27 preceding siblings ...)
2009-10-07 7:38 ` [PATCH 28/45] writeback: introduce wbc.nr_segments Wu Fengguang
@ 2009-10-07 7:38 ` Wu Fengguang
2009-10-07 11:57 ` Hugh Dickins
2009-10-07 7:38 ` [PATCH 30/45] vmscan: lumpy pageout Wu Fengguang
` (18 subsequent siblings)
47 siblings, 1 reply; 116+ messages in thread
From: Wu Fengguang @ 2009-10-07 7:38 UTC (permalink / raw)
To: Andrew Morton
Cc: Theodore Tso, Christoph Hellwig, Dave Chinner, Chris Mason,
Peter Zijlstra, Li Shaohua, Myklebust Trond,
jens.axboe@oracle.com, Jan Kara, Nick Piggin, linux-fsdevel,
Hugh Dickins, Wu Fengguang, LKML
[-- Attachment #1: writeback-for-reclaim-pre-fix.patch --]
[-- Type: text/plain, Size: 1752 bytes --]
When shmem returns AOP_WRITEPAGE_ACTIVATE, the inode pages cannot be
synced in the near future. So write_cache_pages shall stop writing this
inode, and shmem shall increase pages_skipped to instruct the VFS not to
busy-retry it.
CC: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
---
mm/page-writeback.c | 23 +++++++++++------------
mm/shmem.c | 1 +
2 files changed, 12 insertions(+), 12 deletions(-)
--- linux.orig/mm/page-writeback.c 2009-10-06 23:39:28.000000000 +0800
+++ linux/mm/page-writeback.c 2009-10-06 23:39:29.000000000 +0800
@@ -851,19 +851,18 @@ continue_unlock:
if (ret == AOP_WRITEPAGE_ACTIVATE) {
unlock_page(page);
ret = 0;
- } else {
- /*
- * done_index is set past this page,
- * so media errors will not choke
- * background writeout for the entire
- * file. This has consequences for
- * range_cyclic semantics (ie. it may
- * not be suitable for data integrity
- * writeout).
- */
- done = 1;
- break;
}
+ /*
+ * done_index is set past this page,
+ * so media errors will not choke
+ * background writeout for the entire
+ * file. This has consequences for
+ * range_cyclic semantics (ie. it may
+ * not be suitable for data integrity
+ * writeout).
+ */
+ done = 1;
+ break;
}
if (nr_to_write > 0) {
--- linux.orig/mm/shmem.c 2009-10-06 23:37:40.000000000 +0800
+++ linux/mm/shmem.c 2009-10-06 23:39:29.000000000 +0800
@@ -1103,6 +1103,7 @@ unlock:
*/
swapcache_free(swap, NULL);
redirty:
+ wbc->pages_skipped++;
set_page_dirty(page);
if (wbc->for_reclaim)
return AOP_WRITEPAGE_ACTIVATE; /* Return with page locked */
^ permalink raw reply [flat|nested] 116+ messages in thread
* [PATCH 30/45] vmscan: lumpy pageout
2009-10-07 7:38 [PATCH 00/45] some writeback experiments Wu Fengguang
` (28 preceding siblings ...)
2009-10-07 7:38 ` [PATCH 29/45] writeback: fix the shmem AOP_WRITEPAGE_ACTIVATE case Wu Fengguang
@ 2009-10-07 7:38 ` Wu Fengguang
2009-10-07 7:38 ` [PATCH 31/45] writeback: sync old inodes first in background writeback Wu Fengguang
` (17 subsequent siblings)
47 siblings, 0 replies; 116+ messages in thread
From: Wu Fengguang @ 2009-10-07 7:38 UTC (permalink / raw)
To: Andrew Morton
Cc: Theodore Tso, Christoph Hellwig, Dave Chinner, Chris Mason,
Peter Zijlstra, Li Shaohua, Myklebust Trond,
jens.axboe@oracle.com, Jan Kara, Nick Piggin, linux-fsdevel,
Wu Fengguang, LKML
[-- Attachment #1: writeback-lumpy-pageout.patch --]
[-- Type: text/plain, Size: 2082 bytes --]
When paging out a dirty page, try to piggyback more consecutive dirty
pages (up to 512KB) to improve IO efficiency.
Only ext3/reiserfs, which don't have their own aops->writepages, are
supported in this initial version.
CC: Dave Chinner <david@fromorbit.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
mm/page-writeback.c | 12 ++++++++++++
mm/vmscan.c | 11 +++++++++++
2 files changed, 23 insertions(+)
--- linux.orig/mm/vmscan.c 2009-10-06 23:37:39.000000000 +0800
+++ linux/mm/vmscan.c 2009-10-06 23:39:30.000000000 +0800
@@ -344,6 +344,8 @@ typedef enum {
PAGE_CLEAN,
} pageout_t;
+#define LUMPY_PAGEOUT_PAGES (512 * 1024 / PAGE_CACHE_SIZE)
+
/*
* pageout is called by shrink_page_list() for each dirty page.
* Calls ->writepage().
@@ -409,6 +411,15 @@ static pageout_t pageout(struct page *pa
}
/*
+ * only write_cache_pages() supports for_reclaim for now
+ */
+ if (!mapping->a_ops->writepages) {
+ wbc.range_start = (page->index + 1) << PAGE_CACHE_SHIFT;
+ wbc.nr_to_write = LUMPY_PAGEOUT_PAGES - 1;
+ generic_writepages(mapping, &wbc);
+ }
+
+ /*
* Wait on writeback if requested to. This happens when
* direct reclaiming a large contiguous area and the
* first attempt to free a range of pages fails.
--- linux.orig/mm/page-writeback.c 2009-10-06 23:39:29.000000000 +0800
+++ linux/mm/page-writeback.c 2009-10-06 23:39:30.000000000 +0800
@@ -805,6 +805,11 @@ int write_cache_pages(struct address_spa
break;
}
+ if (wbc->for_reclaim && done_index != page->index) {
+ done = 1;
+ break;
+ }
+
if (nr_to_write != wbc->nr_to_write &&
done_index + WB_SEGMENT_DIST < page->index &&
--wbc->nr_segments <= 0) {
@@ -846,6 +851,13 @@ continue_unlock:
if (!clear_page_dirty_for_io(page))
goto continue_unlock;
+ /*
+ * active and unevictable pages will be checked at
+ * rotate time
+ */
+ if (wbc->for_reclaim)
+ SetPageReclaim(page);
+
ret = (*writepage)(page, wbc, data);
if (unlikely(ret)) {
if (ret == AOP_WRITEPAGE_ACTIVATE) {
^ permalink raw reply [flat|nested] 116+ messages in thread
* [PATCH 31/45] writeback: sync old inodes first in background writeback
2009-10-07 7:38 [PATCH 00/45] some writeback experiments Wu Fengguang
` (29 preceding siblings ...)
2009-10-07 7:38 ` [PATCH 30/45] vmscan: lumpy pageout Wu Fengguang
@ 2009-10-07 7:38 ` Wu Fengguang
2010-07-12 3:01 ` Christoph Hellwig
2009-10-07 7:38 ` [PATCH 32/45] writeback: update kupdate expire timestamp on each scan of b_io Wu Fengguang
` (16 subsequent siblings)
47 siblings, 1 reply; 116+ messages in thread
From: Wu Fengguang @ 2009-10-07 7:38 UTC (permalink / raw)
To: Andrew Morton
Cc: Theodore Tso, Christoph Hellwig, Dave Chinner, Chris Mason,
Peter Zijlstra, Li Shaohua, Myklebust Trond,
jens.axboe@oracle.com, Jan Kara, Nick Piggin, linux-fsdevel,
Wu Fengguang, LKML
[-- Attachment #1: writeback-expired-for-background.patch --]
[-- Type: text/plain, Size: 1709 bytes --]
A background flush work may run forever. So it's reasonable for it to
mimic the kupdate behavior of syncing old/expired inodes first.
CC: Jan Kara <jack@suse.cz>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
fs/fs-writeback.c | 21 +++++++++++++++++----
1 file changed, 17 insertions(+), 4 deletions(-)
--- linux.orig/fs/fs-writeback.c 2009-10-06 23:39:28.000000000 +0800
+++ linux/fs/fs-writeback.c 2009-10-06 23:39:31.000000000 +0800
@@ -680,7 +680,7 @@ static long writeback_inodes_wb(struct b
spin_lock(&inode_lock);
- if (!wbc->for_kupdate || list_empty(&wb->b_io))
+ if (list_empty(&wb->b_io))
queue_io(wb, wbc->older_than_this);
while (!list_empty(&wb->b_io)) {
@@ -793,14 +793,15 @@ static long wb_writeback(struct bdi_writ
.range_cyclic = args->range_cyclic,
};
unsigned long oldest_jif;
+ int expire_interval = msecs_to_jiffies(dirty_expire_interval * 10);
+ int fg_rounds = 0;
long wrote = 0;
long nr;
struct inode *inode;
- if (wbc.for_kupdate) {
+ if (wbc.for_kupdate || wbc.for_background) {
wbc.older_than_this = &oldest_jif;
- oldest_jif = jiffies -
- msecs_to_jiffies(dirty_expire_interval * 10);
+ oldest_jif = jiffies - expire_interval;
}
if (!wbc.range_cyclic) {
wbc.range_start = 0;
@@ -828,6 +829,18 @@ static long wb_writeback(struct bdi_writ
args->nr_pages -= nr;
wrote += nr;
+ if (args->for_background && expire_interval &&
+ ++fg_rounds && list_empty(&wb->b_io)) {
+ if (fg_rounds < 10)
+ expire_interval >>= 1;
+ if (expire_interval)
+ oldest_jif = jiffies - expire_interval;
+ else
+ wbc.older_than_this = 0;
+ fg_rounds = 0;
+ continue;
+ }
+
/*
* Bail if no more IO
*/
^ permalink raw reply [flat|nested] 116+ messages in thread
* [PATCH 32/45] writeback: update kupdate expire timestamp on each scan of b_io
2009-10-07 7:38 [PATCH 00/45] some writeback experiments Wu Fengguang
` (30 preceding siblings ...)
2009-10-07 7:38 ` [PATCH 31/45] writeback: sync old inodes first in background writeback Wu Fengguang
@ 2009-10-07 7:38 ` Wu Fengguang
2009-10-07 7:38 ` [PATCH 34/45] writeback: sync livelock - kick background writeback Wu Fengguang
` (15 subsequent siblings)
47 siblings, 0 replies; 116+ messages in thread
From: Wu Fengguang @ 2009-10-07 7:38 UTC (permalink / raw)
To: Andrew Morton
Cc: Theodore Tso, Christoph Hellwig, Dave Chinner, Chris Mason,
Peter Zijlstra, Li Shaohua, Myklebust Trond,
jens.axboe@oracle.com, Jan Kara, Nick Piggin, linux-fsdevel,
Wu Fengguang, LKML
[-- Attachment #1: writeback-moving-expired.patch --]
[-- Type: text/plain, Size: 716 bytes --]
This prevents it from getting stuck on some very old but busy inodes,
and gives newer expired inodes a fair chance.
CC: Jan Kara <jack@suse.cz>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
fs/fs-writeback.c | 3 +++
1 file changed, 3 insertions(+)
--- linux.orig/fs/fs-writeback.c 2009-10-06 23:39:31.000000000 +0800
+++ linux/fs/fs-writeback.c 2009-10-06 23:39:32.000000000 +0800
@@ -829,6 +829,9 @@ static long wb_writeback(struct bdi_writ
args->nr_pages -= nr;
wrote += nr;
+ if (args->for_kupdate && list_empty(&wb->b_io))
+ oldest_jif = jiffies - expire_interval;
+
if (args->for_background && expire_interval &&
++fg_rounds && list_empty(&wb->b_io)) {
if (fg_rounds < 10)
^ permalink raw reply [flat|nested] 116+ messages in thread
* [PATCH 34/45] writeback: sync livelock - kick background writeback
2009-10-07 7:38 [PATCH 00/45] some writeback experiments Wu Fengguang
` (31 preceding siblings ...)
2009-10-07 7:38 ` [PATCH 32/45] writeback: update kupdate expire timestamp on each scan of b_io Wu Fengguang
@ 2009-10-07 7:38 ` Wu Fengguang
2009-10-07 7:38 ` [PATCH 35/45] writeback: sync livelock - use single timestamp for whole sync work Wu Fengguang
` (14 subsequent siblings)
47 siblings, 0 replies; 116+ messages in thread
From: Wu Fengguang @ 2009-10-07 7:38 UTC (permalink / raw)
To: Andrew Morton
Cc: Theodore Tso, Christoph Hellwig, Dave Chinner, Chris Mason,
Peter Zijlstra, Li Shaohua, Myklebust Trond,
jens.axboe@oracle.com, Jan Kara, Nick Piggin, linux-fsdevel,
Wu Fengguang, LKML
[-- Attachment #1: writeback-sync-pending.patch --]
[-- Type: text/plain, Size: 2364 bytes --]
The periodic/background writeback can run forever. So when any
sync work is enqueued, increase bdi->sync_works to notify the
active non-sync works to exit. Non-sync works queued after all sync
works won't be affected.
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
fs/fs-writeback.c | 11 +++++++++++
include/linux/backing-dev.h | 5 +++++
mm/backing-dev.c | 1 +
3 files changed, 17 insertions(+)
--- linux.orig/fs/fs-writeback.c 2009-10-06 23:39:33.000000000 +0800
+++ linux/fs/fs-writeback.c 2009-10-06 23:39:33.000000000 +0800
@@ -139,6 +139,8 @@ static void wb_clear_pending(struct bdi_
list_del_rcu(&work->list);
if (work->args.for_background)
clear_bit(WB_FLAG_BACKGROUND_WORK, &bdi->wb_mask);
+ if (work->args.for_sync)
+ bdi->sync_works--;
spin_unlock(&bdi->wb_lock);
wb_work_complete(work);
@@ -159,6 +161,8 @@ static void bdi_queue_work(struct backin
*/
spin_lock(&bdi->wb_lock);
list_add_tail_rcu(&work->list, &bdi->work_list);
+ if (work->args.for_sync)
+ bdi->sync_works++;
spin_unlock(&bdi->wb_lock);
/*
@@ -811,6 +815,13 @@ static long wb_writeback(struct bdi_writ
break;
/*
+ * background/periodic works can run forever, need to abort
+ * on seeing any pending sync work, to prevent livelock it.
+ */
+ if (!args->for_sync && wb->bdi->sync_works > 0)
+ break;
+
+ /*
* For background writeout, stop when we are below the
* background dirty threshold
*/
--- linux.orig/include/linux/backing-dev.h 2009-10-06 23:39:33.000000000 +0800
+++ linux/include/linux/backing-dev.h 2009-10-06 23:39:33.000000000 +0800
@@ -84,6 +84,11 @@ struct backing_dev_info {
struct list_head wb_list; /* the flusher threads hanging off this bdi */
unsigned long wb_mask; /* bitmask of registered tasks */
unsigned int wb_cnt; /* number of registered tasks */
+ /*
+ * sync works queued, background works shall abort on seeing this,
+ * to prevent livelocking the sync works
+ */
+ unsigned int sync_works;
struct list_head work_list;
--- linux.orig/mm/backing-dev.c 2009-10-06 23:38:52.000000000 +0800
+++ linux/mm/backing-dev.c 2009-10-06 23:39:33.000000000 +0800
@@ -647,6 +647,7 @@ int bdi_init(struct backing_dev_info *bd
*/
bdi->wb_mask = 1;
bdi->wb_cnt = 1;
+ bdi->sync_works = 0;
bdi->write_bandwidth = MAX_WRITEBACK_PAGES;
^ permalink raw reply [flat|nested] 116+ messages in thread
* [PATCH 35/45] writeback: sync livelock - use single timestamp for whole sync work
2009-10-07 7:38 [PATCH 00/45] some writeback experiments Wu Fengguang
` (32 preceding siblings ...)
2009-10-07 7:38 ` [PATCH 34/45] writeback: sync livelock - kick background writeback Wu Fengguang
@ 2009-10-07 7:38 ` Wu Fengguang
2009-10-07 7:38 ` [PATCH 36/45] writeback: sync livelock - curb dirty speed for inodes to be synced Wu Fengguang
` (13 subsequent siblings)
47 siblings, 0 replies; 116+ messages in thread
From: Wu Fengguang @ 2009-10-07 7:38 UTC (permalink / raw)
To: Andrew Morton
Cc: Theodore Tso, Christoph Hellwig, Dave Chinner, Chris Mason,
Peter Zijlstra, Li Shaohua, Myklebust Trond,
jens.axboe@oracle.com, Jan Kara, Nick Piggin, linux-fsdevel,
Wu Fengguang, LKML
[-- Attachment #1: writeback-sync-pending-start_time.patch --]
[-- Type: text/plain, Size: 1747 bytes --]
The start time in writeback_inodes_wb() is not very useful because it
slips forward at each invocation. We shall use one _constant_ timestamp
taken at the beginning to cover the whole sync() work.
The timestamp is now grabbed at work start time. It could better be set
at the sync work submission time.
CC: Jan Kara <jack@suse.cz>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
fs/fs-writeback.c | 17 ++++++++---------
1 file changed, 8 insertions(+), 9 deletions(-)
--- linux.orig/fs/fs-writeback.c 2009-10-06 23:39:33.000000000 +0800
+++ linux/fs/fs-writeback.c 2009-10-06 23:39:39.000000000 +0800
@@ -669,12 +669,11 @@ static long writeback_inodes_wb(struct b
struct writeback_control *wbc)
{
struct super_block *sb = wbc->sb, *pin_sb = NULL;
- const unsigned long start = jiffies; /* livelock avoidance */
unsigned long stop_time = 0;
unsigned long wrote = 0;
if (wbc->timeout)
- stop_time = (start + wbc->timeout) | 1;
+ stop_time = (jiffies + wbc->timeout) | 1;
spin_lock(&inode_lock);
@@ -699,13 +698,6 @@ static long writeback_inodes_wb(struct b
continue;
}
- /*
- * Was this inode dirtied after sync_sb_inodes was called?
- * This keeps sync from extra jobs and livelock.
- */
- if (inode_dirtied_after(inode, start))
- break;
-
if (pin_sb_for_writeback(wbc, inode, &pin_sb)) {
requeue_io(inode);
continue;
@@ -798,6 +790,13 @@ static long wb_writeback(struct bdi_writ
long nr;
struct inode *inode;
+ /*
+ * keep sync from extra jobs and livelock
+ */
+ if (wbc.for_sync) {
+ wbc.older_than_this = &oldest_jif;
+ oldest_jif = jiffies;
+ }
if (wbc.for_kupdate || wbc.for_background) {
wbc.older_than_this = &oldest_jif;
oldest_jif = jiffies - expire_interval;
^ permalink raw reply [flat|nested] 116+ messages in thread
* [PATCH 36/45] writeback: sync livelock - curb dirty speed for inodes to be synced
2009-10-07 7:38 [PATCH 00/45] some writeback experiments Wu Fengguang
` (33 preceding siblings ...)
2009-10-07 7:38 ` [PATCH 35/45] writeback: sync livelock - use single timestamp for whole sync work Wu Fengguang
@ 2009-10-07 7:38 ` Wu Fengguang
2009-10-07 7:38 ` [PATCH 37/45] writeback: use timestamp to indicate dirty exceeded Wu Fengguang
` (12 subsequent siblings)
47 siblings, 0 replies; 116+ messages in thread
From: Wu Fengguang @ 2009-10-07 7:38 UTC (permalink / raw)
To: Andrew Morton
Cc: Theodore Tso, Christoph Hellwig, Dave Chinner, Chris Mason,
Peter Zijlstra, Li Shaohua, Myklebust Trond,
jens.axboe@oracle.com, Jan Kara, Nick Piggin, linux-fsdevel,
Wu Fengguang, LKML
[-- Attachment #1: writeback-sync-pending-slower-writes.patch --]
[-- Type: text/plain, Size: 2286 bytes --]
The inodes to be synced will be dirty-throttled regardless of the dirty
threshold. This stops sync() from being livelocked by a fast dirtier.
CC: Jan Kara <jack@suse.cz>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
fs/fs-writeback.c | 4 ++++
include/linux/backing-dev.h | 1 +
mm/page-writeback.c | 13 +++++++++++++
3 files changed, 18 insertions(+)
--- linux.orig/mm/page-writeback.c 2009-10-07 10:47:09.000000000 +0800
+++ linux/mm/page-writeback.c 2009-10-07 14:32:43.000000000 +0800
@@ -478,6 +478,18 @@ static void balance_dirty_pages(struct a
global_dirty_thresh(&background_thresh, &dirty_thresh);
/*
+ * If sync() is in progress, curb the to-be-synced inodes regardless
+ * of dirty limits, so that a fast dirtier won't livelock the sync.
+ */
+ if (unlikely(bdi->sync_time &&
+ S_ISREG(mapping->host->i_mode) &&
+ time_after_eq(bdi->sync_time,
+ mapping->host->dirtied_when))) {
+ write_chunk *= 2;
+ goto throttle;
+ }
+
+ /*
* Throttle it only when the background writeback cannot
* catch-up. This skips the ramp up phase of bdi limits.
*/
@@ -510,6 +522,7 @@ static void balance_dirty_pages(struct a
if (!dirty_exceeded)
goto out;
+throttle:
if (!bdi->dirty_exceeded)
bdi->dirty_exceeded = 1;
--- linux.orig/fs/fs-writeback.c 2009-10-07 10:47:09.000000000 +0800
+++ linux/fs/fs-writeback.c 2009-10-07 14:31:47.000000000 +0800
@@ -796,6 +796,7 @@ static long wb_writeback(struct bdi_writ
if (wbc.for_sync) {
wbc.older_than_this = &oldest_jif;
oldest_jif = jiffies;
+ wbc.bdi->sync_time = oldest_jif | 1;
}
if (wbc.for_kupdate || wbc.for_background) {
wbc.older_than_this = &oldest_jif;
@@ -873,6 +874,9 @@ static long wb_writeback(struct bdi_writ
spin_unlock(&inode_lock);
}
+ if (args->for_sync)
+ wb->bdi->sync_time = 0;
+
if (args->for_background)
while (bdi_writeback_wakeup(wb->bdi))
; /* unthrottle all tasks */
--- linux.orig/include/linux/backing-dev.h 2009-10-07 10:47:09.000000000 +0800
+++ linux/include/linux/backing-dev.h 2009-10-07 14:31:53.000000000 +0800
@@ -89,6 +89,7 @@ struct backing_dev_info {
* to prevent livelocking the sync works
*/
unsigned int sync_works;
+ unsigned long sync_time;
struct list_head work_list;
^ permalink raw reply [flat|nested] 116+ messages in thread
* [PATCH 37/45] writeback: use timestamp to indicate dirty exceeded
2009-10-07 7:38 [PATCH 00/45] some writeback experiments Wu Fengguang
` (34 preceding siblings ...)
2009-10-07 7:38 ` [PATCH 36/45] writeback: sync livelock - curb dirty speed for inodes to be synced Wu Fengguang
@ 2009-10-07 7:38 ` Wu Fengguang
2009-10-07 7:38 ` [PATCH 38/45] writeback: introduce queue b_more_io_wait Wu Fengguang
` (11 subsequent siblings)
47 siblings, 0 replies; 116+ messages in thread
From: Wu Fengguang @ 2009-10-07 7:38 UTC (permalink / raw)
To: Andrew Morton
Cc: Theodore Tso, Christoph Hellwig, Dave Chinner, Chris Mason,
Peter Zijlstra, Li Shaohua, Myklebust Trond,
jens.axboe@oracle.com, Jan Kara, Nick Piggin, linux-fsdevel,
Wu Fengguang, LKML
[-- Attachment #1: writeback-dirty-exceed-time.patch --]
[-- Type: text/plain, Size: 2680 bytes --]
When there is only one dirtier (or several), dirty_exceeded is always
(or mostly) off. Converting it to a timestamp avoids this problem, and
helps to use a smaller write_chunk for smoother throttling.
We'll lower the ratelimit if dirty exceeded was seen within the last second.
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
Before patch, the wait time in balance_dirty_pages() are ~200ms:
[ 1093.397700] write_bandwidth: comm=swapper pages=1536 time=204ms
[ 1093.594319] write_bandwidth: comm=swapper pages=1536 time=196ms
[ 1093.796642] write_bandwidth: comm=swapper pages=1536 time=200ms
After patch, ~25ms:
[ 90.261339] write_bandwidth: comm=swapper pages=192 time=20ms
[ 90.293168] write_bandwidth: comm=swapper pages=192 time=24ms
[ 90.323853] write_bandwidth: comm=swapper pages=192 time=24ms
[ 90.354510] write_bandwidth: comm=swapper pages=192 time=28ms
[ 90.389890] write_bandwidth: comm=swapper pages=192 time=28ms
[ 90.421787] write_bandwidth: comm=swapper pages=192 time=24ms
include/linux/backing-dev.h | 2 +-
mm/backing-dev.c | 1 -
mm/page-writeback.c | 9 +++------
3 files changed, 4 insertions(+), 8 deletions(-)
--- linux.orig/mm/page-writeback.c 2009-10-07 14:32:43.000000000 +0800
+++ linux/mm/page-writeback.c 2009-10-07 14:32:46.000000000 +0800
@@ -523,14 +523,10 @@ static void balance_dirty_pages(struct a
goto out;
throttle:
- if (!bdi->dirty_exceeded)
- bdi->dirty_exceeded = 1;
+ bdi->dirty_exceed_time = jiffies;
bdi_writeback_wait(bdi, write_chunk);
- if (bdi->dirty_exceeded)
- bdi->dirty_exceeded = 0;
-
out:
/*
* In laptop mode, we wait until hitting the higher threshold before
@@ -578,7 +574,8 @@ void balance_dirty_pages_ratelimited_nr(
unsigned long *p;
ratelimit = ratelimit_pages;
- if (mapping->backing_dev_info->dirty_exceeded)
+ if (jiffies - mapping->backing_dev_info->dirty_exceed_time <
+ (unsigned long) HZ)
ratelimit >>= 3;
/*
--- linux.orig/include/linux/backing-dev.h 2009-10-07 14:31:53.000000000 +0800
+++ linux/include/linux/backing-dev.h 2009-10-07 14:32:46.000000000 +0800
@@ -74,7 +74,7 @@ struct backing_dev_info {
struct percpu_counter bdi_stat[NR_BDI_STAT_ITEMS];
struct prop_local_percpu completions;
- int dirty_exceeded;
+ unsigned long dirty_exceed_time;
unsigned int min_ratio;
unsigned int max_ratio, max_prop_frac;
--- linux.orig/mm/backing-dev.c 2009-10-07 14:31:53.000000000 +0800
+++ linux/mm/backing-dev.c 2009-10-07 14:32:46.000000000 +0800
@@ -661,7 +661,6 @@ int bdi_init(struct backing_dev_info *bd
goto err;
}
- bdi->dirty_exceeded = 0;
err = prop_local_init_percpu(&bdi->completions);
if (err) {
^ permalink raw reply [flat|nested] 116+ messages in thread
* [PATCH 38/45] writeback: introduce queue b_more_io_wait
2009-10-07 7:38 [PATCH 00/45] some writeback experiments Wu Fengguang
` (35 preceding siblings ...)
2009-10-07 7:38 ` [PATCH 37/45] writeback: use timestamp to indicate dirty exceeded Wu Fengguang
@ 2009-10-07 7:38 ` Wu Fengguang
2009-10-07 7:38 ` [PATCH 39/45] writeback: remove wbc.more_io Wu Fengguang
` (10 subsequent siblings)
47 siblings, 0 replies; 116+ messages in thread
From: Wu Fengguang @ 2009-10-07 7:38 UTC (permalink / raw)
To: Andrew Morton
Cc: Theodore Tso, Christoph Hellwig, Dave Chinner, Chris Mason,
Peter Zijlstra, Li Shaohua, Myklebust Trond,
jens.axboe@oracle.com, Jan Kara, Nick Piggin, linux-fsdevel,
David Chinner, Michael Rubin, Peter Zijlstra, Wu Fengguang, LKML
[-- Attachment #1: writeback-more_io_wait.patch --]
[-- Type: text/plain, Size: 6145 bytes --]
Introduce the b_more_io_wait queue to park inodes that for some reason
cannot be synced immediately. They will be revisited at the next b_io
scan time, or (for sync) after a 0.1s sleep, or after 5s in the next
periodic writeback.
The new data flow after this patchset:
b_dirty --> b_io --> b_more_io/b_more_io_wait --+
             ^                                  |
             |                                  |
             +----------------------------------+
The rationale is to address two issues:
- the 30s delay of redirty_tail() may be too long
- redirty_tail() may update i_dirtied_when, however we now rely on it
remaining unchanged for all candidate inodes of sync(). (to avoid extra
work and livelock, we now exclude any inode from being synced if its
dirty time is after the sync time)
Cc: Jan Kara <jack@suse.cz>
Cc: David Chinner <dgc@sgi.com>
Cc: Michael Rubin <mrubin@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
---
fs/fs-writeback.c | 27 ++++++++++++++++-----------
include/linux/backing-dev.h | 8 +++++---
mm/backing-dev.c | 14 +++++++++++---
3 files changed, 32 insertions(+), 17 deletions(-)
--- linux.orig/fs/fs-writeback.c 2009-10-07 14:31:47.000000000 +0800
+++ linux/fs/fs-writeback.c 2009-10-07 14:32:50.000000000 +0800
@@ -384,6 +384,16 @@ static void requeue_io(struct inode *ino
list_move(&inode->i_list, &wb->b_more_io);
}
+/*
+ * The inode should be retried in an opportunistic way.
+ */
+static void requeue_io_wait(struct inode *inode)
+{
+ struct bdi_writeback *wb = &inode_to_bdi(inode)->wb;
+
+ list_move(&inode->i_list, &wb->b_more_io_wait);
+}
+
static void inode_sync_complete(struct inode *inode)
{
/*
@@ -453,12 +463,14 @@ static void move_expired_inodes(struct l
/*
* Queue all expired dirty inodes for io, eldest first:
* (newly dirtied) => b_dirty inodes
+ * => b_more_io_wait inodes
* => b_more_io inodes
* => remaining inodes in b_io => (dequeue for sync)
*/
static void queue_io(struct bdi_writeback *wb, unsigned long *older_than_this)
{
list_splice_init(&wb->b_more_io, &wb->b_io);
+ list_splice_init(&wb->b_more_io_wait, &wb->b_io);
move_expired_inodes(&wb->b_dirty, &wb->b_io, older_than_this);
}
@@ -860,18 +872,11 @@ static long wb_writeback(struct bdi_writ
*/
if (nr)
continue;
- /*
- * Nothing written. Wait for some inode to
- * become available for writeback. Otherwise
- * we'll just busyloop.
- */
- spin_lock(&inode_lock);
- if (!list_empty(&wb->b_more_io)) {
- inode = list_entry(wb->b_more_io.prev,
- struct inode, i_list);
- inode_wait_for_writeback(inode);
+ if (wbc.for_sync && !list_empty(&wb->b_more_io_wait)) {
+ schedule_timeout_interruptible(HZ/10);
+ continue;
}
- spin_unlock(&inode_lock);
+ break;
}
if (args->for_sync)
--- linux.orig/include/linux/backing-dev.h 2009-10-07 14:32:46.000000000 +0800
+++ linux/include/linux/backing-dev.h 2009-10-07 14:32:50.000000000 +0800
@@ -56,6 +56,7 @@ struct bdi_writeback {
struct list_head b_dirty; /* dirty inodes */
struct list_head b_io; /* parked for writeback */
struct list_head b_more_io; /* parked for more writeback */
+ struct list_head b_more_io_wait; /* opportunistic retry io */
};
struct backing_dev_info {
@@ -140,9 +141,10 @@ extern struct list_head bdi_list;
static inline int wb_has_dirty_io(struct bdi_writeback *wb)
{
- return !list_empty(&wb->b_dirty) ||
- !list_empty(&wb->b_io) ||
- !list_empty(&wb->b_more_io);
+ return !list_empty(&wb->b_dirty) ||
+ !list_empty(&wb->b_io) ||
+ !list_empty(&wb->b_more_io) ||
+ !list_empty(&wb->b_more_io_wait);
}
static inline void __add_bdi_stat(struct backing_dev_info *bdi,
--- linux.orig/mm/backing-dev.c 2009-10-07 14:32:46.000000000 +0800
+++ linux/mm/backing-dev.c 2009-10-07 14:32:50.000000000 +0800
@@ -63,14 +63,17 @@ static int bdi_debug_stats_show(struct s
unsigned long background_thresh;
unsigned long dirty_thresh;
unsigned long bdi_thresh;
- unsigned long nr_dirty, nr_io, nr_more_io, nr_wb;
+ unsigned long nr_dirty = 0;
+ unsigned long nr_io = 0;
+ unsigned long nr_more_io = 0;
+ unsigned long nr_more_io_wait = 0;
+ unsigned long nr_wb = 0;
struct inode *inode;
/*
* inode lock is enough here, the bdi->wb_list is protected by
* RCU on the reader side
*/
- nr_wb = nr_dirty = nr_io = nr_more_io = 0;
spin_lock(&inode_lock);
list_for_each_entry(wb, &bdi->wb_list, list) {
nr_wb++;
@@ -80,6 +83,8 @@ static int bdi_debug_stats_show(struct s
nr_io++;
list_for_each_entry(inode, &wb->b_more_io, i_list)
nr_more_io++;
+ list_for_each_entry(inode, &wb->b_more_io_wait, i_list)
+ nr_more_io_wait++;
}
spin_unlock(&inode_lock);
@@ -98,6 +103,7 @@ static int bdi_debug_stats_show(struct s
"b_dirty: %8lu\n"
"b_io: %8lu\n"
"b_more_io: %8lu\n"
+ "b_more_io_wait: %8lu\n"
"bdi_list: %8u\n"
"state: %8lx\n"
"wb_mask: %8lx\n"
@@ -107,7 +113,7 @@ static int bdi_debug_stats_show(struct s
(unsigned long) K(bdi_stat(bdi, BDI_RECLAIMABLE)),
K(bdi_thresh), K(dirty_thresh), K(background_thresh),
(unsigned long) K(bdi->write_bandwidth),
- nr_wb, nr_dirty, nr_io, nr_more_io,
+ nr_wb, nr_dirty, nr_io, nr_more_io, nr_more_io_wait,
!list_empty(&bdi->bdi_list), bdi->state, bdi->wb_mask,
!list_empty(&bdi->wb_list), bdi->wb_cnt);
#undef K
@@ -264,6 +270,7 @@ static void bdi_wb_init(struct bdi_write
INIT_LIST_HEAD(&wb->b_dirty);
INIT_LIST_HEAD(&wb->b_io);
INIT_LIST_HEAD(&wb->b_more_io);
+ INIT_LIST_HEAD(&wb->b_more_io_wait);
}
static void bdi_task_init(struct backing_dev_info *bdi,
@@ -688,6 +695,7 @@ void bdi_destroy(struct backing_dev_info
list_splice(&bdi->wb.b_dirty, &dst->b_dirty);
list_splice(&bdi->wb.b_io, &dst->b_io);
list_splice(&bdi->wb.b_more_io, &dst->b_more_io);
+ list_splice(&bdi->wb.b_more_io_wait, &dst->b_more_io_wait);
spin_unlock(&inode_lock);
}
^ permalink raw reply [flat|nested] 116+ messages in thread
* [PATCH 39/45] writeback: remove wbc.more_io
2009-10-07 7:38 [PATCH 00/45] some writeback experiments Wu Fengguang
` (36 preceding siblings ...)
2009-10-07 7:38 ` [PATCH 38/45] writeback: introduce queue b_more_io_wait Wu Fengguang
@ 2009-10-07 7:38 ` Wu Fengguang
2009-10-07 7:38 ` [PATCH 40/45] writeback: requeue_io_wait() on I_SYNC locked inode Wu Fengguang
` (9 subsequent siblings)
47 siblings, 0 replies; 116+ messages in thread
From: Wu Fengguang @ 2009-10-07 7:38 UTC (permalink / raw)
To: Andrew Morton
Cc: Theodore Tso, Christoph Hellwig, Dave Chinner, Chris Mason,
Peter Zijlstra, Li Shaohua, Myklebust Trond,
jens.axboe@oracle.com, Jan Kara, Nick Piggin, linux-fsdevel,
Wu Fengguang, LKML
[-- Attachment #1: writeback-remove-more-io.patch --]
[-- Type: text/plain, Size: 3621 bytes --]
It no longer seems required. It was introduced mainly to deal with the
complexity of _multiple_ superblock queues. Now there is only one queue,
so the information can be queried directly when necessary.
CC: Theodore Ts'o <tytso@mit.edu>
CC: Dave Chinner <david@fromorbit.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
fs/fs-writeback.c | 13 +------------
include/linux/writeback.h | 1 -
include/trace/events/ext4.h | 6 ++----
3 files changed, 3 insertions(+), 17 deletions(-)
--- linux.orig/fs/fs-writeback.c 2009-10-07 14:32:50.000000000 +0800
+++ linux/fs/fs-writeback.c 2009-10-07 14:32:51.000000000 +0800
@@ -733,12 +733,8 @@ static long writeback_inodes_wb(struct b
iput(inode);
cond_resched();
spin_lock(&inode_lock);
- if (wbc->nr_to_write <= 0) {
- wbc->more_io = 1;
+ if (wbc->nr_to_write <= 0)
break;
- }
- if (!list_empty(&wb->b_more_io))
- wbc->more_io = 1;
if (stop_time && time_after(jiffies, stop_time))
break;
}
@@ -800,7 +796,6 @@ static long wb_writeback(struct bdi_writ
int fg_rounds = 0;
long wrote = 0;
long nr;
- struct inode *inode;
/*
* keep sync from extra jobs and livelock
@@ -840,7 +835,6 @@ static long wb_writeback(struct bdi_writ
if (args->for_background && !over_bground_thresh())
break;
- wbc.more_io = 0;
wbc.encountered_congestion = 0;
wbc.pages_skipped = 0;
nr = writeback_inodes_wb(wb, &wbc);
@@ -863,11 +857,6 @@ static long wb_writeback(struct bdi_writ
}
/*
- * Bail if no more IO
- */
- if (!wbc.more_io)
- break;
- /*
* Did we write something? Try for more
*/
if (nr)
--- linux.orig/include/linux/writeback.h 2009-10-07 14:31:46.000000000 +0800
+++ linux/include/linux/writeback.h 2009-10-07 14:32:51.000000000 +0800
@@ -78,7 +78,6 @@ struct writeback_control {
unsigned for_sync:1; /* A writeback for sync */
unsigned for_reclaim:1; /* Invoked from the page allocator */
unsigned range_cyclic:1; /* range_start is cyclic */
- unsigned more_io:1; /* more io to be dispatched */
/*
* write_cache_pages() won't update wbc->nr_to_write and
* mapping->writeback_index if no_nrwrite_index_update
--- linux.orig/include/trace/events/ext4.h 2009-10-07 14:31:46.000000000 +0800
+++ linux/include/trace/events/ext4.h 2009-10-07 14:32:51.000000000 +0800
@@ -311,7 +311,6 @@ TRACE_EVENT(ext4_da_writepages_result,
__field( int, pages_written )
__field( long, pages_skipped )
__field( char, encountered_congestion )
- __field( char, more_io )
__field( char, no_nrwrite_index_update )
__field( pgoff_t, writeback_index )
),
@@ -323,16 +322,15 @@ TRACE_EVENT(ext4_da_writepages_result,
__entry->pages_written = pages_written;
__entry->pages_skipped = wbc->pages_skipped;
__entry->encountered_congestion = wbc->encountered_congestion;
- __entry->more_io = wbc->more_io;
__entry->no_nrwrite_index_update = wbc->no_nrwrite_index_update;
__entry->writeback_index = inode->i_mapping->writeback_index;
),
- TP_printk("dev %s ino %lu ret %d pages_written %d pages_skipped %ld congestion %d more_io %d no_nrwrite_index_update %d writeback_index %lu",
+ TP_printk("dev %s ino %lu ret %d pages_written %d pages_skipped %ld congestion %d no_nrwrite_index_update %d writeback_index %lu",
jbd2_dev_to_name(__entry->dev),
(unsigned long) __entry->ino, __entry->ret,
__entry->pages_written, __entry->pages_skipped,
- __entry->encountered_congestion, __entry->more_io,
+ __entry->encountered_congestion,
__entry->no_nrwrite_index_update,
(unsigned long) __entry->writeback_index)
);
^ permalink raw reply [flat|nested] 116+ messages in thread
* [PATCH 40/45] writeback: requeue_io_wait() on I_SYNC locked inode
2009-10-07 7:38 [PATCH 00/45] some writeback experiments Wu Fengguang
` (37 preceding siblings ...)
2009-10-07 7:38 ` [PATCH 39/45] writeback: remove wbc.more_io Wu Fengguang
@ 2009-10-07 7:38 ` Wu Fengguang
2009-10-07 7:38 ` [PATCH 41/45] writeback: requeue_io_wait() on pages_skipped inode Wu Fengguang
` (8 subsequent siblings)
47 siblings, 0 replies; 116+ messages in thread
From: Wu Fengguang @ 2009-10-07 7:38 UTC (permalink / raw)
To: Andrew Morton
Cc: Theodore Tso, Christoph Hellwig, Dave Chinner, Chris Mason,
Peter Zijlstra, Li Shaohua, Myklebust Trond,
jens.axboe@oracle.com, Jan Kara, Nick Piggin, linux-fsdevel,
Michael Rubin, Peter Zijlstra, Wu Fengguang, LKML
[-- Attachment #1: writeback-more_io_wait-b.patch --]
[-- Type: text/plain, Size: 1116 bytes --]
Use requeue_io_wait() if the inode is being synced by others.
That queue won't be busy-retried, which avoids busy loops.
Cc: Jan Kara <jack@suse.cz>
Cc: Michael Rubin <mrubin@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
---
fs/fs-writeback.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
--- linux.orig/fs/fs-writeback.c 2009-10-07 14:32:51.000000000 +0800
+++ linux/fs/fs-writeback.c 2009-10-07 14:32:52.000000000 +0800
@@ -526,14 +526,14 @@ writeback_single_inode(struct inode *ino
if (inode->i_state & I_SYNC) {
/*
* If this inode is locked for writeback and we are not doing
- * writeback-for-data-integrity, move it to b_more_io so that
- * writeback can proceed with the other inodes on s_io.
+ * writeback-for-data-integrity, move it to b_more_io_wait so
+ * that writeback can proceed with the other inodes on b_io.
*
* We'll have another go at writing back this inode when we
* completed a full scan of b_io.
*/
if (!wait) {
- requeue_io(inode);
+ requeue_io_wait(inode);
return 0;
}
^ permalink raw reply [flat|nested] 116+ messages in thread
* [PATCH 41/45] writeback: requeue_io_wait() on pages_skipped inode
2009-10-07 7:38 [PATCH 00/45] some writeback experiments Wu Fengguang
` (38 preceding siblings ...)
2009-10-07 7:38 ` [PATCH 40/45] writeback: requeue_io_wait() on I_SYNC locked inode Wu Fengguang
@ 2009-10-07 7:38 ` Wu Fengguang
2009-10-07 7:39 ` [PATCH 42/45] writeback: requeue_io_wait() on blocked inode Wu Fengguang
` (7 subsequent siblings)
47 siblings, 0 replies; 116+ messages in thread
From: Wu Fengguang @ 2009-10-07 7:38 UTC (permalink / raw)
To: Andrew Morton
Cc: Theodore Tso, Christoph Hellwig, Dave Chinner, Chris Mason,
Peter Zijlstra, Li Shaohua, Myklebust Trond,
jens.axboe@oracle.com, Jan Kara, Nick Piggin, linux-fsdevel,
Michael Rubin, Peter Zijlstra, Wu Fengguang, LKML
[-- Attachment #1: writeback-more_io_wait-d.patch --]
[-- Type: text/plain, Size: 726 bytes --]
Use requeue_io_wait() if some pages were skipped due to locked buffers.
Cc: Dave Chinner <david@fromorbit.com>
Cc: Michael Rubin <mrubin@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
---
fs/fs-writeback.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
--- linux.orig/fs/fs-writeback.c 2009-10-07 14:32:52.000000000 +0800
+++ linux/fs/fs-writeback.c 2009-10-07 14:32:52.000000000 +0800
@@ -727,7 +727,7 @@ static long writeback_inodes_wb(struct b
* writeback is not making progress due to locked
* buffers. Skip this inode for now.
*/
- redirty_tail(inode);
+ requeue_io_wait(inode);
}
spin_unlock(&inode_lock);
iput(inode);
^ permalink raw reply [flat|nested] 116+ messages in thread
* [PATCH 42/45] writeback: requeue_io_wait() on blocked inode
2009-10-07 7:38 [PATCH 00/45] some writeback experiments Wu Fengguang
` (39 preceding siblings ...)
2009-10-07 7:38 ` [PATCH 41/45] writeback: requeue_io_wait() on pages_skipped inode Wu Fengguang
@ 2009-10-07 7:39 ` Wu Fengguang
2009-10-07 7:39 ` [PATCH 43/45] writeback: requeue_io_wait() on fs redirtied inode Wu Fengguang
` (6 subsequent siblings)
47 siblings, 0 replies; 116+ messages in thread
From: Wu Fengguang @ 2009-10-07 7:39 UTC (permalink / raw)
To: Andrew Morton
Cc: Theodore Tso, Christoph Hellwig, Dave Chinner, Chris Mason,
Peter Zijlstra, Li Shaohua, Myklebust Trond,
jens.axboe@oracle.com, Jan Kara, Nick Piggin, linux-fsdevel,
Wu Fengguang, LKML
[-- Attachment #1: writeback-remove-redirty-blocked.patch --]
[-- Type: text/plain, Size: 667 bytes --]
Use requeue_io_wait() if the inode is somehow blocked. This includes the
wrapped-around range_cyclic case.
CC: Jan Kara <jack@suse.cz>
CC: Dave Chinner <david@fromorbit.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
fs/fs-writeback.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
--- linux.orig/fs/fs-writeback.c 2009-10-07 14:32:52.000000000 +0800
+++ linux/fs/fs-writeback.c 2009-10-07 14:32:53.000000000 +0800
@@ -591,7 +591,7 @@ writeback_single_inode(struct inode *ino
/*
* somehow blocked: retry later
*/
- redirty_tail(inode);
+ requeue_io_wait(inode);
}
} else if (inode->i_state & I_DIRTY) {
/*
^ permalink raw reply [flat|nested] 116+ messages in thread
* [PATCH 43/45] writeback: requeue_io_wait() on fs redirtied inode
2009-10-07 7:38 [PATCH 00/45] some writeback experiments Wu Fengguang
` (40 preceding siblings ...)
2009-10-07 7:39 ` [PATCH 42/45] writeback: requeue_io_wait() on blocked inode Wu Fengguang
@ 2009-10-07 7:39 ` Wu Fengguang
2009-10-07 7:39 ` [PATCH 44/45] NFS: remove NFS_INO_FLUSHING lock Wu Fengguang
` (5 subsequent siblings)
47 siblings, 0 replies; 116+ messages in thread
From: Wu Fengguang @ 2009-10-07 7:39 UTC (permalink / raw)
To: Andrew Morton
Cc: Theodore Tso, Christoph Hellwig, Dave Chinner, Chris Mason,
Peter Zijlstra, Li Shaohua, Myklebust Trond,
jens.axboe@oracle.com, Jan Kara, Nick Piggin, linux-fsdevel,
Michael Rubin, Peter Zijlstra, Wu Fengguang, LKML
[-- Attachment #1: writeback-remove-redirty-b.patch --]
[-- Type: text/plain, Size: 819 bytes --]
When an inode is redirtied by the filesystem, its
dirty time shall not be updated.
Cc: Jan Kara <jack@suse.cz>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Michael Rubin <mrubin@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
---
fs/fs-writeback.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
--- linux.orig/fs/fs-writeback.c 2009-10-07 14:32:53.000000000 +0800
+++ linux/fs/fs-writeback.c 2009-10-07 14:32:54.000000000 +0800
@@ -598,7 +598,7 @@ writeback_single_inode(struct inode *ino
* At least XFS will redirty the inode during the
* writeback (delalloc) and on io completion (isize).
*/
- redirty_tail(inode);
+ requeue_io_wait(inode);
} else if (atomic_read(&inode->i_count)) {
/*
* The inode is clean, inuse
^ permalink raw reply [flat|nested] 116+ messages in thread
* [PATCH 44/45] NFS: remove NFS_INO_FLUSHING lock
2009-10-07 7:38 [PATCH 00/45] some writeback experiments Wu Fengguang
` (41 preceding siblings ...)
2009-10-07 7:39 ` [PATCH 43/45] writeback: requeue_io_wait() on fs redirtied inode Wu Fengguang
@ 2009-10-07 7:39 ` Wu Fengguang
2009-10-07 13:11 ` Peter Staubach
2009-10-07 7:39 ` [PATCH 45/45] btrfs: fix race on syncing the btree inode Wu Fengguang
` (4 subsequent siblings)
47 siblings, 1 reply; 116+ messages in thread
From: Wu Fengguang @ 2009-10-07 7:39 UTC (permalink / raw)
To: Andrew Morton
Cc: Theodore Tso, Christoph Hellwig, Dave Chinner, Chris Mason,
Peter Zijlstra, Li Shaohua, Myklebust Trond,
jens.axboe@oracle.com, Jan Kara, Nick Piggin, linux-fsdevel,
Peter Staubach, Wu Fengguang, LKML
[-- Attachment #1: nfs-remove-flush-lock.patch --]
[-- Type: text/plain, Size: 2797 bytes --]
It was introduced in 72cb77f4a5ac; the issues it was added to handle
have since been addressed in generic writeback:
- out of order writeback (or interleaved concurrent writeback)
addressed by the per-bdi writeback and the wait queue in balance_dirty_pages()
- sync livelocked by a fast dirtier
addressed by throttling all to-be-synced dirty inodes
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
CC: Peter Staubach <staubach@redhat.com>
CC: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
fs/nfs/file.c | 9 ---------
fs/nfs/write.c | 11 -----------
include/linux/nfs_fs.h | 1 -
3 files changed, 21 deletions(-)
--- linux.orig/fs/nfs/file.c 2009-10-07 14:31:45.000000000 +0800
+++ linux/fs/nfs/file.c 2009-10-07 14:32:54.000000000 +0800
@@ -386,15 +386,6 @@ static int nfs_write_begin(struct file *
mapping->host->i_ino, len, (long long) pos);
start:
- /*
- * Prevent starvation issues if someone is doing a consistency
- * sync-to-disk
- */
- ret = wait_on_bit(&NFS_I(mapping->host)->flags, NFS_INO_FLUSHING,
- nfs_wait_bit_killable, TASK_KILLABLE);
- if (ret)
- return ret;
-
page = grab_cache_page_write_begin(mapping, index, flags);
if (!page)
return -ENOMEM;
--- linux.orig/fs/nfs/write.c 2009-10-07 14:31:45.000000000 +0800
+++ linux/fs/nfs/write.c 2009-10-07 14:32:54.000000000 +0800
@@ -387,26 +387,15 @@ static int nfs_writepages_callback(struc
int nfs_writepages(struct address_space *mapping, struct writeback_control *wbc)
{
struct inode *inode = mapping->host;
- unsigned long *bitlock = &NFS_I(inode)->flags;
struct nfs_pageio_descriptor pgio;
int err;
- /* Stop dirtying of new pages while we sync */
- err = wait_on_bit_lock(bitlock, NFS_INO_FLUSHING,
- nfs_wait_bit_killable, TASK_KILLABLE);
- if (err)
- goto out_err;
-
nfs_inc_stats(inode, NFSIOS_VFSWRITEPAGES);
nfs_pageio_init_write(&pgio, inode, wb_priority(wbc));
err = write_cache_pages(mapping, wbc, nfs_writepages_callback, &pgio);
nfs_pageio_complete(&pgio);
- clear_bit_unlock(NFS_INO_FLUSHING, bitlock);
- smp_mb__after_clear_bit();
- wake_up_bit(bitlock, NFS_INO_FLUSHING);
-
if (err < 0)
goto out_err;
err = pgio.pg_error;
--- linux.orig/include/linux/nfs_fs.h 2009-10-07 14:31:45.000000000 +0800
+++ linux/include/linux/nfs_fs.h 2009-10-07 14:32:54.000000000 +0800
@@ -208,7 +208,6 @@ struct nfs_inode {
#define NFS_INO_STALE (1) /* possible stale inode */
#define NFS_INO_ACL_LRU_SET (2) /* Inode is on the LRU list */
#define NFS_INO_MOUNTPOINT (3) /* inode is remote mountpoint */
-#define NFS_INO_FLUSHING (4) /* inode is flushing out data */
#define NFS_INO_FSCACHE (5) /* inode can be cached by FS-Cache */
#define NFS_INO_FSCACHE_LOCK (6) /* FS-Cache cookie management lock */
^ permalink raw reply [flat|nested] 116+ messages in thread
* [PATCH 45/45] btrfs: fix race on syncing the btree inode
2009-10-07 7:38 [PATCH 00/45] some writeback experiments Wu Fengguang
` (42 preceding siblings ...)
2009-10-07 7:39 ` [PATCH 44/45] NFS: remove NFS_INO_FLUSHING lock Wu Fengguang
@ 2009-10-07 7:39 ` Wu Fengguang
2009-10-07 8:53 ` [PATCH 00/45] some writeback experiments Peter Zijlstra
` (3 subsequent siblings)
47 siblings, 0 replies; 116+ messages in thread
From: Wu Fengguang @ 2009-10-07 7:39 UTC (permalink / raw)
To: Andrew Morton
Cc: Theodore Tso, Christoph Hellwig, Dave Chinner, Chris Mason,
Peter Zijlstra, Li Shaohua, Myklebust Trond,
jens.axboe@oracle.com, Jan Kara, Nick Piggin, linux-fsdevel,
Wu Fengguang, LKML
[-- Attachment #1: btrfs-remove-nr_to_write-tricks.patch --]
[-- Type: text/plain, Size: 2734 bytes --]
When doing sync, the btree dirty pages refuse to go away for tens of seconds:
# vmmon -d 1 nr_writeback nr_dirty nr_unstable
nr_writeback nr_dirty nr_unstable
46641 23315 0
46641 23380 0
46641 23381 0
26674 43206 0
18963 51006 0
11252 58721 0
3528 66419 0
0 70024 0
0 70024 0
0 70024 0
0 70024 0
0 70024 0
0 70024 0
0 70024 0
0 70024 0
Note that the 70024 pages are under the btree inode's 32MB
no-write-metadata threshold. This is racy because the sync
work has to sleep and retry forever for data integrity.
The 32MB threshold may also become a problem for background
writeback on a memory-tight box. So it may be better to
replace the threshold with some informed writeback tricks.
CC: Chris Mason <chris.mason@oracle.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
fs/btrfs/disk-io.c | 29 +++++++++++++----------------
1 file changed, 13 insertions(+), 16 deletions(-)
--- linux.orig/fs/btrfs/disk-io.c 2009-10-07 14:31:45.000000000 +0800
+++ linux/fs/btrfs/disk-io.c 2009-10-07 14:32:55.000000000 +0800
@@ -707,22 +707,19 @@ static int btree_writepage(struct page *
static int btree_writepages(struct address_space *mapping,
struct writeback_control *wbc)
{
- struct extent_io_tree *tree;
- tree = &BTRFS_I(mapping->host)->io_tree;
- if (wbc->sync_mode == WB_SYNC_NONE) {
- struct btrfs_root *root = BTRFS_I(mapping->host)->root;
- u64 num_dirty;
- unsigned long thresh = 32 * 1024 * 1024;
-
- if (wbc->for_kupdate)
- return 0;
-
- /* this is a bit racy, but that's ok */
- num_dirty = root->fs_info->dirty_metadata_bytes;
- if (num_dirty < thresh)
- return 0;
- }
- return extent_writepages(tree, mapping, btree_get_extent, wbc);
+ struct extent_io_tree *tree = &BTRFS_I(mapping->host)->io_tree;
+ int ret;
+
+ if (!wbc->for_sync)
+ wbc->nr_segments = 1;
+ ret = extent_writepages(tree, mapping, btree_get_extent, wbc);
+ /*
+ * Fake some some skipped pages, so that VFS won't
+ * try hard on writing this inode.
+ */
+ if (!wbc->for_sync)
+ wbc->pages_skipped++;
+ return ret;
}
static int btree_readpage(struct file *file, struct page *page)
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [PATCH 21/45] writeback: estimate bdi write bandwidth
2009-10-07 7:38 ` [PATCH 21/45] writeback: estimate bdi write bandwidth Wu Fengguang
@ 2009-10-07 8:53 ` Peter Zijlstra
2009-10-07 9:39 ` Wu Fengguang
0 siblings, 1 reply; 116+ messages in thread
From: Peter Zijlstra @ 2009-10-07 8:53 UTC (permalink / raw)
To: Wu Fengguang
Cc: Andrew Morton, Theodore Tso, Christoph Hellwig, Dave Chinner,
Chris Mason, Li Shaohua, Myklebust Trond, jens.axboe@oracle.com,
Jan Kara, Nick Piggin, linux-fsdevel
On Wed, 2009-10-07 at 15:38 +0800, Wu Fengguang wrote:
> +static void bdi_calc_write_bandwidth(struct backing_dev_info *bdi,
> + unsigned long nr_pages,
> + unsigned long time)
> +{
> + unsigned long bw;
> +
> + bw = HZ * nr_pages / (time | 1);
> + bdi->write_bandwidth = (bdi->write_bandwidth * 63 + bw) / 64;
> +}
If you have block times < 1 jiffy this all falls apart quite quickly.
You could perhaps try to use cpu_clock() for ns resolution timestamps if
this is a real issue.
(I could imagine fast arrays with huge throughput causing small sleeps,
resulting in underestimates of their bandwidth)
Also, 63/64 seems rather slow progress.. maybe that's good.
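For illustration: with HZ=1000, 1536 pages completing in 0.4 jiffies
would be measured as 1536 pages/jiffy instead of ~3840, a 2.5x
underestimate. A hedged sketch of the ns-resolution variant suggested
above (the plumbing of cpu_clock() timestamps is an assumption, not part
of the posted patch):
/*
 * Sketch only: same 63/64 moving average as the quoted code, but fed
 * with a nanosecond interval so sub-jiffy completion times are not
 * rounded up to a whole jiffy.
 */
static void bdi_calc_write_bandwidth_ns(struct backing_dev_info *bdi,
					unsigned long nr_pages,
					u64 elapsed_ns)
{
	u64 bw;

	/* pages per second over the measured interval */
	bw = div64_u64((u64)nr_pages * NSEC_PER_SEC, elapsed_ns | 1);

	bdi->write_bandwidth = (bdi->write_bandwidth * 63 + bw) / 64;
}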
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [PATCH 06/45] writeback: use larger ratelimit when dirty_exceeded
2009-10-07 7:38 ` [PATCH 06/45] writeback: use larger ratelimit when dirty_exceeded Wu Fengguang
@ 2009-10-07 8:53 ` Peter Zijlstra
2009-10-07 9:17 ` Wu Fengguang
0 siblings, 1 reply; 116+ messages in thread
From: Peter Zijlstra @ 2009-10-07 8:53 UTC (permalink / raw)
To: Wu Fengguang
Cc: Andrew Morton, Theodore Tso, Christoph Hellwig, Dave Chinner,
Chris Mason, Li Shaohua, Myklebust Trond, jens.axboe@oracle.com,
Jan Kara, Nick Piggin, linux-fsdevel, Richard Kennedy
On Wed, 2009-10-07 at 15:38 +0800, Wu Fengguang wrote:
> plain text document attachment
> (writeback-ratelimit-on-dirty-exceeded.patch)
> When dirty_exceeded, use ratelimit = ratelimit_pages/8, allowing it to
> scale up to 512KB for memory bounty systems. This is more efficient than
> the original 8 pages, and won't risk exceeding the dirty limit too much.
>
> Given the larger ratelimit value, we can safely ignore the low bound
> check in sync_writeback_pages.
>
> dirty_exceeded is more likely to be seen when there are multiple dirty
> processes. In which case the lowered ratelimit will help reduce their
> overall wait time (latency) in the throttled queue.
Don't forget that ratelimit_pages is a per-cpu limit. So the total error
on the dirty limit scales with the number of cpus.
Other than that, I guess this patch needs numbers ;-)
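(A rough worked example of that error: with the patch's per-CPU
ratelimit of up to 512KB, each CPU may dirty up to 512KB before it calls
balance_dirty_pages() again, so a 16-CPU box could overshoot the dirty
limit by roughly 16 * 512KB = 8MB in the worst case.)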
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [PATCH 20/45] NFS: introduce writeback wait queue
2009-10-07 7:38 ` [PATCH 20/45] NFS: introduce writeback wait queue Wu Fengguang
@ 2009-10-07 8:53 ` Peter Zijlstra
2009-10-07 9:07 ` Wu Fengguang
0 siblings, 1 reply; 116+ messages in thread
From: Peter Zijlstra @ 2009-10-07 8:53 UTC (permalink / raw)
To: Wu Fengguang
Cc: Andrew Morton, Theodore Tso, Christoph Hellwig, Dave Chinner,
Chris Mason, Li Shaohua, Myklebust Trond, jens.axboe@oracle.com,
Jan Kara, Nick Piggin, linux-fsdevel
On Wed, 2009-10-07 at 15:38 +0800, Wu Fengguang wrote:
> plain text document attachment (writeback-nfs-request-queue.patch)
> The generic writeback routines are departing from congestion_wait()
> in preferance of get_request_wait(), aka. waiting on the block queues.
>
> Introduce the missing writeback wait queue for NFS, otherwise its
> writeback pages may grow out of control.
>
> In perticular, balance_dirty_pages() will exit after it pushes
> write_chunk pages into the PG_writeback page pool, _OR_ when the
> background writeback work quits. The latter is new behavior, and could
> not only quit (normally) after below background threshold, but also
> quit when it finds _zero_ dirty pages to write. The latter case gives
> rise to the number of PG_writeback pages if it is not explicitly limited.
>
> CC: Jens Axboe <jens.axboe@oracle.com>
> CC: Chris Mason <chris.mason@oracle.com>
> CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
> CC: Trond Myklebust <Trond.Myklebust@netapp.com>
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> ---
>
> The wait time and network throughput varies a lot! this is a major problem.
> This means nfs_end_page_writeback() is not called smoothly over time,
> even when there are plenty of PG_writeback pages on the client side.
Could that be some ack batching from the nfs protocol/server?
> +static void nfs_wakeup_congested(long nr, long limit,
> + struct backing_dev_info *bdi,
> + wait_queue_head_t *wqh)
> +{
> + if (nr < 2*limit - min(limit/8, NFS_WAIT_PAGES)) {
> + if (test_bit(BDI_sync_congested, &bdi->state))
> + clear_bdi_congested(bdi, BLK_RW_SYNC);
> + if (waitqueue_active(&wqh[BLK_RW_SYNC])) {
> + smp_mb__after_clear_bit();
> + wake_up(&wqh[BLK_RW_SYNC]);
> + }
> + }
> + if (nr < limit - min(limit/8, NFS_WAIT_PAGES)) {
> + if (test_bit(BDI_async_congested, &bdi->state))
> + clear_bdi_congested(bdi, BLK_RW_ASYNC);
> + if (waitqueue_active(&wqh[BLK_RW_ASYNC])) {
> + smp_mb__after_clear_bit();
> + wake_up(&wqh[BLK_RW_ASYNC]);
> + }
> + }
> +}
>
wakeup implies a full memory barrier.
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [PATCH 00/45] some writeback experiments
2009-10-07 7:38 [PATCH 00/45] some writeback experiments Wu Fengguang
` (43 preceding siblings ...)
2009-10-07 7:39 ` [PATCH 45/45] btrfs: fix race on syncing the btree inode Wu Fengguang
@ 2009-10-07 8:53 ` Peter Zijlstra
2009-10-07 10:17 ` [PATCH 14/45] writeback: quit on wrap for .range_cyclic (afs) David Howells
` (2 subsequent siblings)
47 siblings, 0 replies; 116+ messages in thread
From: Peter Zijlstra @ 2009-10-07 8:53 UTC (permalink / raw)
To: Wu Fengguang
Cc: Andrew Morton, Theodore Tso, Christoph Hellwig, Dave Chinner,
Chris Mason, Li Shaohua, Myklebust Trond, jens.axboe@oracle.com,
Jan Kara, Nick Piggin, linux-fsdevel
On Wed, 2009-10-07 at 15:38 +0800, Wu Fengguang wrote:
> Hi all,
>
> Here is a collection of writeback patches on
>
> - larger writeback chunk sizes
> - single per-bdi flush thread (killing the foreground throttling writeouts)
> - lumpy pageout
> - sync livelock prevention
> - writeback scheduling
> - random fixes
>
> Sorry for posting a too big series - there are many direct or implicit
> dependencies, and one patch lead to another before I can stop..
Awesome bits..
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [PATCH 20/45] NFS: introduce writeback wait queue
2009-10-07 8:53 ` Peter Zijlstra
@ 2009-10-07 9:07 ` Wu Fengguang
2009-10-07 9:15 ` Peter Zijlstra
2009-10-07 9:17 ` Nick Piggin
0 siblings, 2 replies; 116+ messages in thread
From: Wu Fengguang @ 2009-10-07 9:07 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Andrew Morton, Theodore Tso, Christoph Hellwig, Dave Chinner,
Chris Mason, Li, Shaohua, Myklebust Trond, jens.axboe@oracle.com,
Jan Kara, Nick Piggin, linux-fsdevel@vger.kernel.org
On Wed, Oct 07, 2009 at 04:53:20PM +0800, Peter Zijlstra wrote:
> On Wed, 2009-10-07 at 15:38 +0800, Wu Fengguang wrote:
> > plain text document attachment (writeback-nfs-request-queue.patch)
> > The generic writeback routines are departing from congestion_wait()
> > in favor of get_request_wait(), aka. waiting on the block queues.
> >
> > Introduce the missing writeback wait queue for NFS, otherwise its
> > writeback pages may grow out of control.
> >
> > In particular, balance_dirty_pages() will exit after it pushes
> > write_chunk pages into the PG_writeback page pool, _OR_ when the
> > background writeback work quits. The latter is new behavior: the work
> > can quit not only (normally) after dropping below the background
> > threshold, but also when it finds _zero_ dirty pages to write. The
> > latter case lets the number of PG_writeback pages grow unchecked
> > unless it is explicitly limited.
> >
> > CC: Jens Axboe <jens.axboe@oracle.com>
> > CC: Chris Mason <chris.mason@oracle.com>
> > CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
> > CC: Trond Myklebust <Trond.Myklebust@netapp.com>
> > Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> > ---
> >
> > The wait time and network throughput vary a lot! This is a major problem.
> > This means nfs_end_page_writeback() is not called smoothly over time,
> > even when there are plenty of PG_writeback pages on the client side.
>
> Could that be some ack batching from the nfs protocol/server?
Yes, possibly. Another possibility is that some work items in the nfsiod
workqueue take a long time.
> > +static void nfs_wakeup_congested(long nr, long limit,
> > + struct backing_dev_info *bdi,
> > + wait_queue_head_t *wqh)
> > +{
> > + if (nr < 2*limit - min(limit/8, NFS_WAIT_PAGES)) {
> > + if (test_bit(BDI_sync_congested, &bdi->state))
> > + clear_bdi_congested(bdi, BLK_RW_SYNC);
> > + if (waitqueue_active(&wqh[BLK_RW_SYNC])) {
> > + smp_mb__after_clear_bit();
> > + wake_up(&wqh[BLK_RW_SYNC]);
> > + }
> > + }
> > + if (nr < limit - min(limit/8, NFS_WAIT_PAGES)) {
> > + if (test_bit(BDI_async_congested, &bdi->state))
> > + clear_bdi_congested(bdi, BLK_RW_ASYNC);
> > + if (waitqueue_active(&wqh[BLK_RW_ASYNC])) {
> > + smp_mb__after_clear_bit();
> > + wake_up(&wqh[BLK_RW_ASYNC]);
> > + }
> > + }
> > +}
> >
>
> wakeup implies a full memory barrier.
If so, this smp_mb__after_clear_bit() line is also not necessary?
void clear_bdi_congested(struct backing_dev_info *bdi, int sync)
{
//...
clear_bit(bit, &bdi->state);
smp_mb__after_clear_bit();
if (waitqueue_active(wqh))
wake_up(wqh);
Thanks,
Fengguang
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [PATCH 20/45] NFS: introduce writeback wait queue
2009-10-07 9:07 ` Wu Fengguang
@ 2009-10-07 9:15 ` Peter Zijlstra
2009-10-07 9:19 ` Wu Fengguang
2009-10-07 9:17 ` Nick Piggin
1 sibling, 1 reply; 116+ messages in thread
From: Peter Zijlstra @ 2009-10-07 9:15 UTC (permalink / raw)
To: Wu Fengguang
Cc: Andrew Morton, Theodore Tso, Christoph Hellwig, Dave Chinner,
Chris Mason, Li, Shaohua, Myklebust Trond, jens.axboe@oracle.com,
Jan Kara, Nick Piggin, linux-fsdevel@vger.kernel.org
On Wed, 2009-10-07 at 17:07 +0800, Wu Fengguang wrote:
>
> > wakeup implies a full memory barrier.
>
> If so, this smp_mb__after_clear_bit() line is also not necessary?
>
> void clear_bdi_congested(struct backing_dev_info *bdi, int sync)
> {
> //...
> clear_bit(bit, &bdi->state);
> smp_mb__after_clear_bit();
> if (waitqueue_active(wqh))
> wake_up(wqh);
Depends on if the barrier is needed even when the wakeup doesn't happen
I guess ;-)
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [PATCH 20/45] NFS: introduce writeback wait queue
2009-10-07 9:07 ` Wu Fengguang
2009-10-07 9:15 ` Peter Zijlstra
@ 2009-10-07 9:17 ` Nick Piggin
2009-10-07 9:52 ` Wu Fengguang
1 sibling, 1 reply; 116+ messages in thread
From: Nick Piggin @ 2009-10-07 9:17 UTC (permalink / raw)
To: Wu Fengguang
Cc: Peter Zijlstra, Andrew Morton, Theodore Tso, Christoph Hellwig,
Dave Chinner, Chris Mason, Li, Shaohua, Myklebust Trond,
jens.axboe@oracle.com, Jan Kara, linux-fsdevel@vger.kernel.org
On Wed, Oct 07, 2009 at 05:07:22PM +0800, Wu Fengguang wrote:
> On Wed, Oct 07, 2009 at 04:53:20PM +0800, Peter Zijlstra wrote:
> > wakeup implies a full memory barrier.
>
> If so, this smp_mb__after_clear_bit() line is also not necessary?
>
> void clear_bdi_congested(struct backing_dev_info *bdi, int sync)
> {
> //...
> clear_bit(bit, &bdi->state);
> smp_mb__after_clear_bit();
> if (waitqueue_active(wqh))
> wake_up(wqh);
Typically in these patterns you do need a barrier. You need to
ensure the load to check the waitqueue is performed after the
clear_bit.
The other side does:
add_to_waitqueue
set_current_state /* has mb() in it */
if (test_bit())
schedule()
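Spelled out a bit more (an illustrative sketch only, not the actual NFS
code -- the helpers are the generic kernel wait-queue API):

	/* waker side */
	clear_bit(BDI_sync_congested, &bdi->state);
	smp_mb__after_clear_bit();	/* order clear_bit before the waitqueue_active() load */
	if (waitqueue_active(wqh))
		wake_up(wqh);

	/* waiter side */
	DEFINE_WAIT(wait);

	prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);	/* implies a full barrier */
	if (test_bit(BDI_sync_congested, &bdi->state))
		io_schedule();
	finish_wait(wqh, &wait);

Without the barrier, the waitqueue_active() load may effectively happen
before the clear_bit(), so the waker can decide not to wake a waiter that
is just about to sleep on the still-set bit -- a missed wakeup.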
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [PATCH 06/45] writeback: use larger ratelimit when dirty_exceeded
2009-10-07 8:53 ` Peter Zijlstra
@ 2009-10-07 9:17 ` Wu Fengguang
0 siblings, 0 replies; 116+ messages in thread
From: Wu Fengguang @ 2009-10-07 9:17 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Andrew Morton, Theodore Tso, Christoph Hellwig, Dave Chinner,
Chris Mason, Li, Shaohua, Myklebust Trond, jens.axboe@oracle.com,
Jan Kara, Nick Piggin, linux-fsdevel@vger.kernel.org,
Richard Kennedy
On Wed, Oct 07, 2009 at 04:53:19PM +0800, Peter Zijlstra wrote:
> On Wed, 2009-10-07 at 15:38 +0800, Wu Fengguang wrote:
> > plain text document attachment
> > (writeback-ratelimit-on-dirty-exceeded.patch)
> > When dirty_exceeded, use ratelimit = ratelimit_pages/8, allowing it to
> > scale up to 512KB on memory-rich systems. This is more efficient than
> > the original 8 pages, and won't risk exceeding the dirty limit too much.
> >
> > Given the larger ratelimit value, we can safely ignore the low bound
> > check in sync_writeback_pages.
> >
> > dirty_exceeded is more likely to be seen when there are multiple dirty
> > processes. In which case the lowered ratelimit will help reduce their
> > overall wait time (latency) in the throttled queue.
>
> Don't forget that ratelimit_pages is a per-cpu limit. So the total error
> on the dirty limit scales with the number of cpus.
Ah yes! Given that a typical NUMA configuration would be to equip each CPU
with >1GB memory, 512KB is not a big error :)
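(For scale, and purely illustrative: if the ratelimit works out to 512KB
per CPU, a 16-CPU machine can in the worst case overshoot by about
16 * 512KB = 8MB, compared with a dirty limit of hundreds of MB to a few
GB on a box with >16GB of memory.)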
> Other than that, I guess this patch needs numbers ;-)
Lots of numbers, hehe. Basically I see ~25ms wait time for ext2/3/4
and btrfs (with this patch and the dirty exceed timestamp patch).
One cp process:
[ 3404.554478] write_bandwidth: comm=btrfs-endio-wri pages=192 time=28ms
[ 3404.570407] write_bandwidth: comm=btrfs-endio-wri pages=192 time=12ms
[ 3404.601166] write_bandwidth: comm=btrfs-endio-wri pages=192 time=28ms
[ 3404.632872] write_bandwidth: comm=btrfs-endio-wri pages=192 time=28ms
[ 3404.663633] write_bandwidth: comm=btrfs-endio-wri pages=192 time=24ms
[ 3404.694198] write_bandwidth: comm=btrfs-endio-wri pages=192 time=24ms
[ 3404.724879] write_bandwidth: comm=btrfs-endio-wri pages=192 time=28ms
[ 3404.756655] write_bandwidth: comm=btrfs-endio-wri pages=192 time=28ms
[ 3404.787350] write_bandwidth: comm=btrfs-endio-wri pages=192 time=28ms
[ 3404.817950] write_bandwidth: comm=btrfs-endio-wri pages=192 time=28ms
[ 3404.848640] write_bandwidth: comm=btrfs-endio-wri pages=192 time=28ms
[ 3404.880556] write_bandwidth: comm=btrfs-endio-wri pages=192 time=32ms
[ 3404.911372] write_bandwidth: comm=btrfs-endio-wri pages=192 time=24ms
[ 3404.942006] write_bandwidth: comm=btrfs-endio-wri pages=192 time=24ms
[ 3404.972712] write_bandwidth: comm=btrfs-endio-wri pages=192 time=28ms
[ 3405.002672] write_bandwidth: comm=btrfs-endio-wri pages=192 time=24ms
[ 3405.034480] write_bandwidth: comm=btrfs-endio-wri pages=192 time=24ms
[ 3405.065114] write_bandwidth: comm=btrfs-endio-wri pages=192 time=28ms
[ 3405.095870] write_bandwidth: comm=btrfs-endio-wri pages=192 time=24ms
[ 3405.126453] write_bandwidth: comm=btrfs-endio-wri pages=192 time=24ms
[ 3405.158043] write_bandwidth: comm=btrfs-endio-wri pages=192 time=28ms
[ 3405.188801] write_bandwidth: comm=btrfs-endio-wri pages=192 time=28ms
[ 3405.219327] write_bandwidth: comm=btrfs-endio-wri pages=192 time=28ms
[ 3405.249922] write_bandwidth: comm=btrfs-endio-wri pages=192 time=24ms
[ 3405.281869] write_bandwidth: comm=btrfs-endio-wri pages=192 time=28ms
[ 3405.312533] write_bandwidth: comm=btrfs-endio-wri pages=192 time=24ms
[ 3405.343194] write_bandwidth: comm=btrfs-endio-wri pages=192 time=24ms
[ 3405.373761] write_bandwidth: comm=btrfs-endio-wri pages=192 time=28ms
[ 3405.410828] write_bandwidth: comm=btrfs-endio-wri pages=192 time=32ms
[ 3405.441993] write_bandwidth: comm=btrfs-endio-wri pages=192 time=24ms
[ 3405.473266] write_bandwidth: comm=btrfs-endio-wri pages=192 time=28ms
[ 3405.505746] write_bandwidth: comm=btrfs-endio-wri pages=192 time=28ms
[ 3405.537103] write_bandwidth: comm=btrfs-endio-wri pages=192 time=28ms
[ 3405.568327] write_bandwidth: comm=btrfs-endio-wri pages=192 time=20ms
[ 3405.600730] write_bandwidth: comm=btrfs-endio-wri pages=192 time=28ms
[ 3405.632039] write_bandwidth: comm=btrfs-endio-wri pages=192 time=24ms
[ 3405.663326] write_bandwidth: comm=btrfs-endio-wri pages=192 time=24ms
[ 3405.695674] write_bandwidth: comm=btrfs-endio-wri pages=192 time=28ms
[ 3405.727031] write_bandwidth: comm=btrfs-endio-wri pages=192 time=24ms
[ 3405.758146] write_bandwidth: comm=btrfs-endio-wri pages=192 time=24ms
[ 3405.790743] write_bandwidth: comm=btrfs-endio-wri pages=192 time=28ms
[ 3405.822129] write_bandwidth: comm=btrfs-endio-wri pages=192 time=24ms
[ 3405.853427] write_bandwidth: comm=btrfs-endio-wri pages=192 time=20ms
[ 3405.885889] write_bandwidth: comm=btrfs-endio-wri pages=192 time=24ms
[ 3405.917116] write_bandwidth: comm=btrfs-endio-wri pages=192 time=24ms
[ 3405.948411] write_bandwidth: comm=btrfs-endio-wri pages=192 time=24ms
Two:
[ 9393.138529] write_bandwidth: comm=btrfs-endio-wri pages=192 time=24ms
[ 9393.171042] write_bandwidth: comm=btrfs-endio-wri pages=192 time=24ms
[ 9393.344709] write_bandwidth: comm=btrfs-endio-wri pages=192 time=16ms
[ 9393.377045] write_bandwidth: comm=btrfs-endio-wri pages=192 time=20ms
[ 9393.555611] write_bandwidth: comm=btrfs-endio-wri pages=192 time=32ms
[ 9393.586298] write_bandwidth: comm=btrfs-endio-wri pages=192 time=28ms
[ 9393.747218] write_bandwidth: comm=btrfs-endio-wri pages=192 time=32ms
[ 9393.777825] write_bandwidth: comm=btrfs-endio-wri pages=192 time=28ms
[ 9393.808456] write_bandwidth: comm=btrfs-endio-wri pages=192 time=28ms
[ 9393.964086] write_bandwidth: comm=btrfs-endio-wri pages=192 time=24ms
[ 9394.120223] write_bandwidth: comm=btrfs-endio-wri pages=192 time=24ms
[ 9394.273854] write_bandwidth: comm=btrfs-endio-wri pages=192 time=20ms
[ 9394.304842] write_bandwidth: comm=btrfs-endio-wri pages=192 time=24ms
[ 9394.334474] write_bandwidth: comm=btrfs-endio-wri pages=192 time=24ms
[ 9394.480031] write_bandwidth: comm=btrfs-endio-wri pages=192 time=28ms
[ 9394.527769] write_bandwidth: comm=btrfs-endio-wri pages=192 time=20ms
[ 9394.557919] write_bandwidth: comm=btrfs-endio-wri pages=192 time=24ms
[ 9394.590690] write_bandwidth: comm=btrfs-endio-wri pages=192 time=32ms
[ 9394.621626] write_bandwidth: comm=btrfs-endio-wri pages=192 time=28ms
[ 9394.654424] write_bandwidth: comm=btrfs-endio-wri pages=192 time=32ms
[ 9394.678289] write_bandwidth: comm=btrfs-endio-wri pages=192 time=20ms
[ 9394.709837] write_bandwidth: comm=btrfs-endio-wri pages=192 time=28ms
[ 9394.742912] write_bandwidth: comm=btrfs-endio-wri pages=192 time=28ms
[ 9394.774576] write_bandwidth: comm=btrfs-endio-wri pages=192 time=28ms
[ 9394.806149] write_bandwidth: comm=btrfs-endio-wri pages=192 time=28ms
[ 9394.822272] write_bandwidth: comm=btrfs-endio-wri pages=192 time=12ms
[ 9394.852609] write_bandwidth: comm=btrfs-endio-wri pages=192 time=24ms
[ 9394.886215] write_bandwidth: comm=btrfs-endio-wri pages=192 time=28ms
[ 9394.915804] write_bandwidth: comm=btrfs-endio-wri pages=192 time=24ms
[ 9394.947044] write_bandwidth: comm=btrfs-endio-wri pages=192 time=28ms
[ 9394.979504] write_bandwidth: comm=btrfs-endio-wri pages=192 time=28ms
[ 9395.010944] write_bandwidth: comm=btrfs-endio-wri pages=192 time=28ms
[ 9395.042221] write_bandwidth: comm=btrfs-endio-wri pages=192 time=24ms
[ 9395.074640] write_bandwidth: comm=btrfs-endio-wri pages=192 time=28ms
[ 9395.105938] write_bandwidth: comm=btrfs-endio-wri pages=192 time=24ms
[ 9395.137157] write_bandwidth: comm=btrfs-endio-wri pages=192 time=24ms
[ 9395.169668] write_bandwidth: comm=btrfs-endio-wri pages=192 time=28ms
Four:
[ 9435.829736] write_bandwidth: comm=btrfs-endio-wri pages=192 time=16ms
[ 9436.702953] write_bandwidth: comm=btrfs-endio-wri pages=192 time=20ms
[ 9436.719957] write_bandwidth: comm=btrfs-endio-wri pages=192 time=16ms
[ 9436.753598] write_bandwidth: comm=btrfs-endio-wri pages=192 time=36ms
[ 9436.770935] write_bandwidth: comm=btrfs-endio-wri pages=192 time=16ms
[ 9436.804367] write_bandwidth: comm=btrfs-endio-wri pages=192 time=32ms
[ 9436.818913] write_bandwidth: comm=btrfs-endio-wri pages=192 time=16ms
[ 9436.854302] write_bandwidth: comm=btrfs-endio-wri pages=192 time=36ms
[ 9436.870489] write_bandwidth: comm=btrfs-endio-wri pages=192 time=16ms
[ 9436.903449] write_bandwidth: comm=btrfs-endio-wri pages=192 time=24ms
[ 9443.140282] write_bandwidth: comm=btrfs-endio-wri pages=192 time=20ms
[ 9443.155674] write_bandwidth: comm=btrfs-endio-wri pages=192 time=12ms
[ 9443.180418] write_bandwidth: comm=btrfs-endio-wri pages=192 time=24ms
[ 9443.697286] write_bandwidth: comm=btrfs-endio-wri pages=192 time=20ms
[ 9443.714321] write_bandwidth: comm=btrfs-endio-wri pages=192 time=16ms
[ 9443.748455] write_bandwidth: comm=btrfs-endio-wri pages=192 time=32ms
[ 9443.763764] write_bandwidth: comm=btrfs-endio-wri pages=192 time=16ms
[ 9443.796788] write_bandwidth: comm=btrfs-endio-wri pages=192 time=36ms
[ 9443.814309] write_bandwidth: comm=btrfs-endio-wri pages=192 time=16ms
[ 9446.493680] write_bandwidth: comm=btrfs-endio-wri pages=193 time=16ms
[ 9446.528486] write_bandwidth: comm=btrfs-endio-wri pages=192 time=32ms
[ 9446.888774] write_bandwidth: comm=btrfs-endio-wri pages=192 time=40ms
[ 9446.903518] write_bandwidth: comm=btrfs-endio-wri pages=192 time=12ms
[ 9446.933610] write_bandwidth: comm=btrfs-endio-wri pages=192 time=32ms
Thanks,
Fengguang
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [PATCH 20/45] NFS: introduce writeback wait queue
2009-10-07 9:15 ` Peter Zijlstra
@ 2009-10-07 9:19 ` Wu Fengguang
0 siblings, 0 replies; 116+ messages in thread
From: Wu Fengguang @ 2009-10-07 9:19 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Andrew Morton, Theodore Tso, Christoph Hellwig, Dave Chinner,
Chris Mason, Li, Shaohua, Myklebust Trond, jens.axboe@oracle.com,
Jan Kara, Nick Piggin, linux-fsdevel@vger.kernel.org
On Wed, Oct 07, 2009 at 05:15:36PM +0800, Peter Zijlstra wrote:
> On Wed, 2009-10-07 at 17:07 +0800, Wu Fengguang wrote:
> >
> > > wakeup implies a full memory barrier.
> >
> > If so, this smp_mb__after_clear_bit() line is also not necessary?
> >
> > void clear_bdi_congested(struct backing_dev_info *bdi, int sync)
> > {
> > //...
> > clear_bit(bit, &bdi->state);
> > smp_mb__after_clear_bit();
> > if (waitqueue_active(wqh))
> > wake_up(wqh);
>
> Depends on if the barrier is needed even when the wakeup doesn't happen
> I guess ;-)
I see. Thanks!
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [PATCH 21/45] writeback: estimate bdi write bandwidth
2009-10-07 8:53 ` Peter Zijlstra
@ 2009-10-07 9:39 ` Wu Fengguang
0 siblings, 0 replies; 116+ messages in thread
From: Wu Fengguang @ 2009-10-07 9:39 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Andrew Morton, Theodore Tso, Christoph Hellwig, Dave Chinner,
Chris Mason, Li, Shaohua, Myklebust Trond, jens.axboe@oracle.com,
Jan Kara, Nick Piggin, linux-fsdevel@vger.kernel.org
On Wed, Oct 07, 2009 at 04:53:19PM +0800, Peter Zijlstra wrote:
> On Wed, 2009-10-07 at 15:38 +0800, Wu Fengguang wrote:
> > +static void bdi_calc_write_bandwidth(struct backing_dev_info *bdi,
> > + unsigned long nr_pages,
> > + unsigned long time)
> > +{
> > + unsigned long bw;
> > +
> > + bw = HZ * nr_pages / (time | 1);
> > + bdi->write_bandwidth = (bdi->write_bandwidth * 63 + bw) / 64;
> > +}
>
> If you have block times < 1 jiffy this all falls apart quite quickly.
> You could perhaps try to use cpu_clock() for ns resolution timestamps if
> this is a real issue.
Good point. I'd like to view it from the opposite direction: if a sleep
takes <= 1 jiffy to wake up, that would mean high overhead. We could just
as well lift the ratelimit values :)
> (I could imagine fast arrays with huge throughput causing small sleeps,
> resulting in underestimates of their bandwidth)
>
> Also, 63/64 seems rather slow progress.. maybe that's good.
Whatever value we pick will be fast enough for servers that keep running
for years. So the main concern is how fast it can adapt to slow USB devices.
Here is a quick calculation. Each iteration writes ~800KB:
irb > a=128; 1.upto(100) { |i| a = (a * 63 + 5) / 64; printf "%d\t%d\n", i, a }
[...]
86 12
87 11
88 10
89 9
90 8
91 7
92 6
93 5
94 5
Looks slow. A weight of 8 or 16 seems better; those stabilize after
writing about 16MB or 32MB of data:
a=128; 1.upto(100) { |i| a = (a * 7 + 5) / 8; printf "%d\t%d\n", i, a }
1 112
2 98
3 86
4 75
5 66
6 58
7 51
8 45
9 40
10 35
11 31
12 27
13 24
14 21
15 19
16 17
17 15
18 13
19 12
20 11
21 10
22 9
23 8
24 7
25 6
26 5
27 5
28 5
29 5
With weight 16:
34 13
35 12
36 11
37 10
38 9
39 8
40 7
41 6
42 5
43 5
Thanks,
Fengguang
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [PATCH 20/45] NFS: introduce writeback wait queue
2009-10-07 9:17 ` Nick Piggin
@ 2009-10-07 9:52 ` Wu Fengguang
0 siblings, 0 replies; 116+ messages in thread
From: Wu Fengguang @ 2009-10-07 9:52 UTC (permalink / raw)
To: Nick Piggin
Cc: Peter Zijlstra, Andrew Morton, Theodore Tso, Christoph Hellwig,
Dave Chinner, Chris Mason, Li, Shaohua, Myklebust Trond,
jens.axboe@oracle.com, Jan Kara, linux-fsdevel@vger.kernel.org
On Wed, Oct 07, 2009 at 05:17:05PM +0800, Nick Piggin wrote:
> On Wed, Oct 07, 2009 at 05:07:22PM +0800, Wu Fengguang wrote:
> > On Wed, Oct 07, 2009 at 04:53:20PM +0800, Peter Zijlstra wrote:
> > > wakeup implies a full memory barrier.
> >
> > If so, this smp_mb__after_clear_bit() line is also not necessary?
> >
> > void clear_bdi_congested(struct backing_dev_info *bdi, int sync)
> > {
> > //...
> > clear_bit(bit, &bdi->state);
> > smp_mb__after_clear_bit();
> > if (waitqueue_active(wqh))
> > wake_up(wqh);
>
> Typically in these patterns you do need a barrier. You need to
> ensure the load to check the waitqueue is performed after the
> clear_bit.
>
> The other side does:
> add_to_waitqueue
> set_current_state /* has mb() in it */
> if (test_bit())
> schedule()
Thanks, just moved the mb() up to follow clear_bit():
if (test_bit(BDI_sync_congested, &bdi->state)) {
clear_bdi_congested(bdi, BLK_RW_SYNC);
smp_mb__after_clear_bit();
}
if (waitqueue_active(&wqh[BLK_RW_SYNC]))
wake_up(&wqh[BLK_RW_SYNC]);
Thanks,
Fengguang
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [PATCH 14/45] writeback: quit on wrap for .range_cyclic (afs)
2009-10-07 7:38 [PATCH 00/45] some writeback experiments Wu Fengguang
` (44 preceding siblings ...)
2009-10-07 8:53 ` [PATCH 00/45] some writeback experiments Peter Zijlstra
@ 2009-10-07 10:17 ` David Howells
2009-10-07 10:21 ` Nick Piggin
2009-10-07 13:47 ` [PATCH 00/45] some writeback experiments Peter Staubach
2009-10-07 14:26 ` Theodore Tso
47 siblings, 1 reply; 116+ messages in thread
From: David Howells @ 2009-10-07 10:17 UTC (permalink / raw)
To: Wu Fengguang
Cc: dhowells, Andrew Morton, Theodore Tso, Christoph Hellwig,
Dave Chinner, Chris Mason, Peter Zijlstra, Li Shaohua,
Myklebust Trond, jens.axboe@oracle.com, Jan Kara, Nick Piggin,
linux-fsdevel, LKML
Wu Fengguang <fengguang.wu@intel.com> wrote:
> Convert wbc.range_cyclic to new behavior: when past EOF, abort writeback
> of the inode, which instructs writeback_single_inode() to delay it for
> a while if necessary.
>
> It removes one inefficient .range_cyclic IO pattern when writeback_index
> wraps:
> submit [10000-10100], (wrap), submit [0-100]
> In which the submitted pages may consist of two distant ranges.
>
> It also prevents submitting pointless IO for busy overwriters.
>
> CC: David Howells <dhowells@redhat.com>
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Acked-by: David Howells <dhowells@redhat.com>
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [PATCH 14/45] writeback: quit on wrap for .range_cyclic (afs)
2009-10-07 10:17 ` [PATCH 14/45] writeback: quit on wrap for .range_cyclic (afs) David Howells
@ 2009-10-07 10:21 ` Nick Piggin
2009-10-07 10:47 ` Wu Fengguang
0 siblings, 1 reply; 116+ messages in thread
From: Nick Piggin @ 2009-10-07 10:21 UTC (permalink / raw)
To: David Howells
Cc: Wu Fengguang, Andrew Morton, Theodore Tso, Christoph Hellwig,
Dave Chinner, Chris Mason, Peter Zijlstra, Li Shaohua,
Myklebust Trond, jens.axboe@oracle.com, Jan Kara, linux-fsdevel,
LKML
On Wed, Oct 07, 2009 at 11:17:06AM +0100, David Howells wrote:
> Wu Fengguang <fengguang.wu@intel.com> wrote:
>
> > Convert wbc.range_cyclic to new behavior: when past EOF, abort writeback
> > of the inode, which instructs writeback_single_inode() to delay it for
> > a while if necessary.
> >
> > It removes one inefficient .range_cyclic IO pattern when writeback_index
> > wraps:
> > submit [10000-10100], (wrap), submit [0-100]
> > In which the submitted pages may consist of two distant ranges.
> >
> > It also prevents submitting pointless IO for busy overwriters.
> >
> > CC: David Howells <dhowells@redhat.com>
> > Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
>
> Acked-by: David Howells <dhowells@redhat.com>
I don't see why. Then the inode is given less write bandwidth than
those which don't wrap (or wrap on "nice" boundaries).
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [PATCH 14/45] writeback: quit on wrap for .range_cyclic (afs)
2009-10-07 10:21 ` Nick Piggin
@ 2009-10-07 10:47 ` Wu Fengguang
2009-10-07 11:23 ` Nick Piggin
0 siblings, 1 reply; 116+ messages in thread
From: Wu Fengguang @ 2009-10-07 10:47 UTC (permalink / raw)
To: Nick Piggin
Cc: David Howells, Andrew Morton, Theodore Tso, Christoph Hellwig,
Dave Chinner, Chris Mason, Peter Zijlstra, Li, Shaohua,
Myklebust Trond, jens.axboe@oracle.com, Jan Kara,
linux-fsdevel@vger.kernel.org, LKML
On Wed, Oct 07, 2009 at 06:21:30PM +0800, Nick Piggin wrote:
> On Wed, Oct 07, 2009 at 11:17:06AM +0100, David Howells wrote:
> > Wu Fengguang <fengguang.wu@intel.com> wrote:
> >
> > > Convert wbc.range_cyclic to new behavior: when past EOF, abort writeback
> > > of the inode, which instructs writeback_single_inode() to delay it for
> > > a while if necessary.
> > >
> > > It removes one inefficient .range_cyclic IO pattern when writeback_index
> > > wraps:
> > > submit [10000-10100], (wrap), submit [0-100]
> > > In which the submitted pages may consist of two distant ranges.
> > >
> > > It also prevents submitting pointless IO for busy overwriters.
> > >
> > > CC: David Howells <dhowells@redhat.com>
> > > Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> >
> > Acked-by: David Howells <dhowells@redhat.com>
>
> I don't see why. Then the inode is given less write bandwidth than
> those which don't wrap (or wrap on "nice" boundaries).
The "return on wrapped" behavior itself only offers a natural seek
boundary to the upper layer. It's mainly the "whether to delay"
policy that will affect (overall) bandwidth.
If we choose to not sleep, and to go on with other inodes and then
back to this inode, no bandwidth will be lost.
If we have done work with other inodes (if any), and choose to sleep
for a while before restarting this inode, then we could lose bandwidth.
The plus side is that we possibly avoid submitting extra IO if this inode
is being busily overwritten. So it's a tradeoff.
The behavior after this patchset is to keep busy as long as we can
write any pages (in patch 38/45). So we still opt for bandwidth :)
Thanks,
Fengguang
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [PATCH 14/45] writeback: quit on wrap for .range_cyclic (afs)
2009-10-07 10:47 ` Wu Fengguang
@ 2009-10-07 11:23 ` Nick Piggin
2009-10-07 12:21 ` Wu Fengguang
0 siblings, 1 reply; 116+ messages in thread
From: Nick Piggin @ 2009-10-07 11:23 UTC (permalink / raw)
To: Wu Fengguang
Cc: David Howells, Andrew Morton, Theodore Tso, Christoph Hellwig,
Dave Chinner, Chris Mason, Peter Zijlstra, Li, Shaohua,
Myklebust Trond, jens.axboe@oracle.com, Jan Kara,
linux-fsdevel@vger.kernel.org, LKML
On Wed, Oct 07, 2009 at 06:47:11PM +0800, Wu Fengguang wrote:
> On Wed, Oct 07, 2009 at 06:21:30PM +0800, Nick Piggin wrote:
> > On Wed, Oct 07, 2009 at 11:17:06AM +0100, David Howells wrote:
> > > Wu Fengguang <fengguang.wu@intel.com> wrote:
> > >
> > > > Convert wbc.range_cyclic to new behavior: when past EOF, abort writeback
> > > > of the inode, which instructs writeback_single_inode() to delay it for
> > > > a while if necessary.
> > > >
> > > > It removes one inefficient .range_cyclic IO pattern when writeback_index
> > > > wraps:
> > > > submit [10000-10100], (wrap), submit [0-100]
> > > > In which the submitted pages may consist of two distant ranges.
> > > >
> > > > It also prevents submitting pointless IO for busy overwriters.
> > > >
> > > > CC: David Howells <dhowells@redhat.com>
> > > > Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> > >
> > > Acked-by: David Howells <dhowells@redhat.com>
> >
> > I don't see why. Then the inode is given less write bandwidth than
> > those which don't wrap (or wrap on "nice" boundaries).
>
> The "return on wrapped" behavior itself only offers a natural seek
> boundary to the upper layer. It's mainly the "whether to delay"
> policy that will affect (overall) bandwidth.
>
> If we choose to not sleep, and to go on with other inodes and then
> back to this inode, no bandwidth will be lost.
>
> If we have done work with other inodes (if any), and choose to sleep
> for a while before restarting this inode, then we could lose bandwidth.
> The plus side is, we possibly avoid submitting extra IO if this inode
> is being busy overwritten. So it's a tradeoff.
>
> The behavior after this patchset is, to keep busy as long as we can
> write any pages (in patch 38/45). So we still opt for bandwidth :)
No, I mean bandwidth fairness between inodes.
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [PATCH 29/45] writeback: fix the shmem AOP_WRITEPAGE_ACTIVATE case
2009-10-07 7:38 ` [PATCH 29/45] writeback: fix the shmem AOP_WRITEPAGE_ACTIVATE case Wu Fengguang
@ 2009-10-07 11:57 ` Hugh Dickins
2009-10-07 14:00 ` Wu Fengguang
0 siblings, 1 reply; 116+ messages in thread
From: Hugh Dickins @ 2009-10-07 11:57 UTC (permalink / raw)
To: Wu Fengguang
Cc: Andrew Morton, Theodore Tso, Christoph Hellwig, Dave Chinner,
Chris Mason, Peter Zijlstra, Li Shaohua, Myklebust Trond,
jens.axboe@oracle.com, Jan Kara, Nick Piggin, linux-fsdevel, LKML
On Wed, 7 Oct 2009, Wu Fengguang wrote:
> When shmem returns AOP_WRITEPAGE_ACTIVATE, the inode pages cannot be
> synced in the near future. So write_cache_pages shall stop writing to this
> inode, and shmem shall increase pages_skipped to instruct VFS not to
> busy retry.
>
> CC: Hugh Dickins <hugh.dickins@tiscali.co.uk>
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Okay, it embarrasses me to see AOP_WRITEPAGE_ACTIVATE (and its horrid
"in this one case the page is still locked" semantic) still around -
my patch to remove it vanished from mmotm (probably caused a temporary
conflict) and I've never chased it up (partly out of guilt that I'd not
yet kept my promise to contact the openAFS people about their use of it).
But that's orthogonal to your concern here: for so long as there has
been a wbc->pages_skipped, I guess shmem_writepage() should have been
incrementing it there - thanks. But I don't believe the VFS will ever
have any interest in pages_skipped from shmem_writepage(): do you have
evidence that it does? If so, I need to investigate.
And the accompanying change to write_cache_pages() seems irrelevant
and misguided. Irrelevant because write_cache_pages() should never be
dealing with shmem_writepage() (its bdi should keep it well away), and
should never be dealing with reclaim, which is the only case in which
shmem_writepage() returns AOP_WRITEPAGE_ACTIVATE - or have your other
changes, or the bdi work, changed that?
And misguided because in your change to write_cache_pages() you're
taking AOP_WRITEPAGE_ACTIVATE to say that it should now give up, not
process more pages. We just don't know that. All it means is that
this one page couldn't be written and should be reactivated (if it
were under reclaim): it might be the case that every other page tried
after would get treated in the same way, or it might be the case that
the next page would get written successfully. That info is just not
provided.
Hugh
> ---
> mm/page-writeback.c | 23 +++++++++++------------
> mm/shmem.c | 1 +
> 2 files changed, 12 insertions(+), 12 deletions(-)
>
> --- linux.orig/mm/page-writeback.c 2009-10-06 23:39:28.000000000 +0800
> +++ linux/mm/page-writeback.c 2009-10-06 23:39:29.000000000 +0800
> @@ -851,19 +851,18 @@ continue_unlock:
> if (ret == AOP_WRITEPAGE_ACTIVATE) {
> unlock_page(page);
> ret = 0;
> - } else {
> - /*
> - * done_index is set past this page,
> - * so media errors will not choke
> - * background writeout for the entire
> - * file. This has consequences for
> - * range_cyclic semantics (ie. it may
> - * not be suitable for data integrity
> - * writeout).
> - */
> - done = 1;
> - break;
> }
> + /*
> + * done_index is set past this page,
> + * so media errors will not choke
> + * background writeout for the entire
> + * file. This has consequences for
> + * range_cyclic semantics (ie. it may
> + * not be suitable for data integrity
> + * writeout).
> + */
> + done = 1;
> + break;
> }
>
> if (nr_to_write > 0) {
> --- linux.orig/mm/shmem.c 2009-10-06 23:37:40.000000000 +0800
> +++ linux/mm/shmem.c 2009-10-06 23:39:29.000000000 +0800
> @@ -1103,6 +1103,7 @@ unlock:
> */
> swapcache_free(swap, NULL);
> redirty:
> + wbc->pages_skipped++;
> set_page_dirty(page);
> if (wbc->for_reclaim)
> return AOP_WRITEPAGE_ACTIVATE; /* Return with page locked */
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [PATCH 14/45] writeback: quit on wrap for .range_cyclic (afs)
2009-10-07 11:23 ` Nick Piggin
@ 2009-10-07 12:21 ` Wu Fengguang
0 siblings, 0 replies; 116+ messages in thread
From: Wu Fengguang @ 2009-10-07 12:21 UTC (permalink / raw)
To: Nick Piggin
Cc: David Howells, Andrew Morton, Theodore Tso, Christoph Hellwig,
Dave Chinner, Chris Mason, Peter Zijlstra, Li, Shaohua,
Myklebust Trond, jens.axboe@oracle.com, Jan Kara,
linux-fsdevel@vger.kernel.org, LKML
On Wed, Oct 07, 2009 at 07:23:02PM +0800, Nick Piggin wrote:
> On Wed, Oct 07, 2009 at 06:47:11PM +0800, Wu Fengguang wrote:
> > On Wed, Oct 07, 2009 at 06:21:30PM +0800, Nick Piggin wrote:
> > > On Wed, Oct 07, 2009 at 11:17:06AM +0100, David Howells wrote:
> > > > Wu Fengguang <fengguang.wu@intel.com> wrote:
> > > >
> > > > > Convert wbc.range_cyclic to new behavior: when past EOF, abort writeback
> > > > > of the inode, which instructs writeback_single_inode() to delay it for
> > > > > a while if necessary.
> > > > >
> > > > > It removes one inefficient .range_cyclic IO pattern when writeback_index
> > > > > wraps:
> > > > > submit [10000-10100], (wrap), submit [0-100]
> > > > > In which the submitted pages may consist of two distant ranges.
> > > > >
> > > > > It also prevents submitting pointless IO for busy overwriters.
> > > > >
> > > > > CC: David Howells <dhowells@redhat.com>
> > > > > Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> > > >
> > > > Acked-by: David Howells <dhowells@redhat.com>
> > >
> > > I don't see why. Then the inode is given less write bandwidth than
> > > those which don't wrap (or wrap on "nice" boundaries).
> >
> > The "return on wrapped" behavior itself only offers a natural seek
> > boundary to the upper layer. It's mainly the "whether to delay"
> > policy that will affect (overall) bandwidth.
> >
> > If we choose to not sleep, and to go on with other inodes and then
> > back to this inode, no bandwidth will be lost.
> >
> > If we have done work with other inodes (if any), and choose to sleep
> > for a while before restarting this inode, then we could lose bandwidth.
> > The plus side is, we possibly avoid submitting extra IO if this inode
> > is being busy overwritten. So it's a tradeoff.
> >
> > The behavior after this patchset is, to keep busy as long as we can
> > write any pages (in patch 38/45). So we still opt for bandwidth :)
>
> No I mean bandwidth fairness between inodes.
I guess it's the old semantics that has a bandwidth fairness problem :)
Imagine the write chunk size is 4MB, and inodes A/B with sizes 6MB/8MB.
The old semantics gives the write sequence
4MB for A; 4MB for B; other inodes;
4MB for A; 4MB for B; other inodes;
4MB for A; 4MB for B; other inodes;
while the new sequence would be
4MB for A; 4MB for B; other inodes;
2MB for A; 4MB for B; other inodes;
4MB for A; 4MB for B; other inodes;
2MB for A; 4MB for B; other inodes;
On average, each page in A used to get more write chances than each page in B.
Now with no-wrap, A's and B's pages have the same chance of being written back.
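(Roughly: in the old sequence A gets 12MB of writeout per three rounds for
only 6MB of pages, i.e. every A page is written twice while every B page is
written 12MB/8MB = 1.5 times; in the new sequence A gets 6MB per two rounds
and B gets 8MB, so every page of either inode is written exactly once.)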
Thanks,
Fengguang
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [PATCH 09/45] writeback: quit on wrap for .range_cyclic (pohmelfs)
2009-10-07 7:38 ` [PATCH 09/45] writeback: quit on wrap for .range_cyclic (pohmelfs) Wu Fengguang
@ 2009-10-07 12:32 ` Evgeniy Polyakov
2009-10-07 14:23 ` Wu Fengguang
0 siblings, 1 reply; 116+ messages in thread
From: Evgeniy Polyakov @ 2009-10-07 12:32 UTC (permalink / raw)
To: Wu Fengguang
Cc: Andrew Morton, Theodore Tso, Christoph Hellwig, Dave Chinner,
Chris Mason, Peter Zijlstra, Li Shaohua, Myklebust Trond,
jens.axboe@oracle.com, Jan Kara, Nick Piggin, linux-fsdevel, LKML
Hi.
On Wed, Oct 07, 2009 at 03:38:27PM +0800, Wu Fengguang (fengguang.wu@intel.com) wrote:
> Convert wbc.range_cyclic to new behavior: when past EOF, abort writeback
> of the inode, which instructs writeback_single_inode() to delay it for
> a while if necessary.
>
> It removes one inefficient .range_cyclic IO pattern when writeback_index
> wraps:
> submit [10000-10100], (wrap), submit [0-100]
> In which the submitted pages may consist of two distant ranges.
>
> It also prevents submitting pointless IO for busy overwriters.
I have no objections to this patchset, since I followed the upstream
writeback behaviour and did not personally observe such wraps, which
would otherwise have been handled in a single run.
> CC: Evgeniy Polyakov <zbr@ioremap.net>
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
--
Evgeniy Polyakov
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [PATCH 44/45] NFS: remove NFS_INO_FLUSHING lock
2009-10-07 7:39 ` [PATCH 44/45] NFS: remove NFS_INO_FLUSHING lock Wu Fengguang
@ 2009-10-07 13:11 ` Peter Staubach
2009-10-07 13:32 ` Wu Fengguang
0 siblings, 1 reply; 116+ messages in thread
From: Peter Staubach @ 2009-10-07 13:11 UTC (permalink / raw)
To: Wu Fengguang
Cc: Andrew Morton, Theodore Tso, Christoph Hellwig, Dave Chinner,
Chris Mason, Peter Zijlstra, Li Shaohua, Myklebust Trond,
jens.axboe@oracle.com, Jan Kara, Nick Piggin, linux-fsdevel, LKML
Wu Fengguang wrote:
> It was introduced in 72cb77f4a5ac, and the several issues have been
> addressed in generic writeback:
> - out of order writeback (or interleaved concurrent writeback)
> addressed by the per-bdi writeback and wait queue in balance_dirty_pages()
> - sync livelocked by a fast dirtier
> addressed by throttling all to-be-synced dirty inodes
>
I don't think that we can just remove this support. It is
designed to reduce the effects from doing a stat(2) on a
file which is being actively written to.
If we do remove it, then we will need to replace this patch
with another. Trond and I hadn't quite finished discussing
some aspects of that other patch... :-)
Thanx...
ps
> CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
> CC: Peter Staubach <staubach@redhat.com>
> CC: Trond Myklebust <Trond.Myklebust@netapp.com>
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> ---
> fs/nfs/file.c | 9 ---------
> fs/nfs/write.c | 11 -----------
> include/linux/nfs_fs.h | 1 -
> 3 files changed, 21 deletions(-)
>
> --- linux.orig/fs/nfs/file.c 2009-10-07 14:31:45.000000000 +0800
> +++ linux/fs/nfs/file.c 2009-10-07 14:32:54.000000000 +0800
> @@ -386,15 +386,6 @@ static int nfs_write_begin(struct file *
> mapping->host->i_ino, len, (long long) pos);
>
> start:
> - /*
> - * Prevent starvation issues if someone is doing a consistency
> - * sync-to-disk
> - */
> - ret = wait_on_bit(&NFS_I(mapping->host)->flags, NFS_INO_FLUSHING,
> - nfs_wait_bit_killable, TASK_KILLABLE);
> - if (ret)
> - return ret;
> -
> page = grab_cache_page_write_begin(mapping, index, flags);
> if (!page)
> return -ENOMEM;
> --- linux.orig/fs/nfs/write.c 2009-10-07 14:31:45.000000000 +0800
> +++ linux/fs/nfs/write.c 2009-10-07 14:32:54.000000000 +0800
> @@ -387,26 +387,15 @@ static int nfs_writepages_callback(struc
> int nfs_writepages(struct address_space *mapping, struct writeback_control *wbc)
> {
> struct inode *inode = mapping->host;
> - unsigned long *bitlock = &NFS_I(inode)->flags;
> struct nfs_pageio_descriptor pgio;
> int err;
>
> - /* Stop dirtying of new pages while we sync */
> - err = wait_on_bit_lock(bitlock, NFS_INO_FLUSHING,
> - nfs_wait_bit_killable, TASK_KILLABLE);
> - if (err)
> - goto out_err;
> -
> nfs_inc_stats(inode, NFSIOS_VFSWRITEPAGES);
>
> nfs_pageio_init_write(&pgio, inode, wb_priority(wbc));
> err = write_cache_pages(mapping, wbc, nfs_writepages_callback, &pgio);
> nfs_pageio_complete(&pgio);
>
> - clear_bit_unlock(NFS_INO_FLUSHING, bitlock);
> - smp_mb__after_clear_bit();
> - wake_up_bit(bitlock, NFS_INO_FLUSHING);
> -
> if (err < 0)
> goto out_err;
> err = pgio.pg_error;
> --- linux.orig/include/linux/nfs_fs.h 2009-10-07 14:31:45.000000000 +0800
> +++ linux/include/linux/nfs_fs.h 2009-10-07 14:32:54.000000000 +0800
> @@ -208,7 +208,6 @@ struct nfs_inode {
> #define NFS_INO_STALE (1) /* possible stale inode */
> #define NFS_INO_ACL_LRU_SET (2) /* Inode is on the LRU list */
> #define NFS_INO_MOUNTPOINT (3) /* inode is remote mountpoint */
> -#define NFS_INO_FLUSHING (4) /* inode is flushing out data */
> #define NFS_INO_FSCACHE (5) /* inode can be cached by FS-Cache */
> #define NFS_INO_FSCACHE_LOCK (6) /* FS-Cache cookie management lock */
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [PATCH 44/45] NFS: remove NFS_INO_FLUSHING lock
2009-10-07 13:11 ` Peter Staubach
@ 2009-10-07 13:32 ` Wu Fengguang
2009-10-07 13:59 ` Peter Staubach
0 siblings, 1 reply; 116+ messages in thread
From: Wu Fengguang @ 2009-10-07 13:32 UTC (permalink / raw)
To: Peter Staubach
Cc: Andrew Morton, Theodore Tso, Christoph Hellwig, Dave Chinner,
Chris Mason, Peter Zijlstra, Li, Shaohua, Myklebust Trond,
jens.axboe@oracle.com, Jan Kara, Nick Piggin,
linux-fsdevel@vger.kernel.org, LKML
On Wed, Oct 07, 2009 at 09:11:15PM +0800, Peter Staubach wrote:
> Wu Fengguang wrote:
> > It was introduced in 72cb77f4a5ac, and the several issues have been
> > addressed in generic writeback:
> > - out of order writeback (or interleaved concurrent writeback)
> > addressed by the per-bdi writeback and wait queue in balance_dirty_pages()
> > - sync livelocked by a fast dirtier
> > addressed by throttling all to-be-synced dirty inodes
> >
>
> I don't think that we can just remove this support. It is
> designed to reduce the effects from doing a stat(2) on a
> file which is being actively written to.
Ah OK.
> If we do remove it, then we will need to replace this patch
> with another. Trond and I hadn't quite finished discussing
> some aspects of that other patch... :-)
I noticed the i_mutex lock in nfs_getattr(). Do you mean that?
Thanks,
Fengguang
> > CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
> > CC: Peter Staubach <staubach@redhat.com>
> > CC: Trond Myklebust <Trond.Myklebust@netapp.com>
> > Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> > ---
> > fs/nfs/file.c | 9 ---------
> > fs/nfs/write.c | 11 -----------
> > include/linux/nfs_fs.h | 1 -
> > 3 files changed, 21 deletions(-)
> >
> > --- linux.orig/fs/nfs/file.c 2009-10-07 14:31:45.000000000 +0800
> > +++ linux/fs/nfs/file.c 2009-10-07 14:32:54.000000000 +0800
> > @@ -386,15 +386,6 @@ static int nfs_write_begin(struct file *
> > mapping->host->i_ino, len, (long long) pos);
> >
> > start:
> > - /*
> > - * Prevent starvation issues if someone is doing a consistency
> > - * sync-to-disk
> > - */
> > - ret = wait_on_bit(&NFS_I(mapping->host)->flags, NFS_INO_FLUSHING,
> > - nfs_wait_bit_killable, TASK_KILLABLE);
> > - if (ret)
> > - return ret;
> > -
> > page = grab_cache_page_write_begin(mapping, index, flags);
> > if (!page)
> > return -ENOMEM;
> > --- linux.orig/fs/nfs/write.c 2009-10-07 14:31:45.000000000 +0800
> > +++ linux/fs/nfs/write.c 2009-10-07 14:32:54.000000000 +0800
> > @@ -387,26 +387,15 @@ static int nfs_writepages_callback(struc
> > int nfs_writepages(struct address_space *mapping, struct writeback_control *wbc)
> > {
> > struct inode *inode = mapping->host;
> > - unsigned long *bitlock = &NFS_I(inode)->flags;
> > struct nfs_pageio_descriptor pgio;
> > int err;
> >
> > - /* Stop dirtying of new pages while we sync */
> > - err = wait_on_bit_lock(bitlock, NFS_INO_FLUSHING,
> > - nfs_wait_bit_killable, TASK_KILLABLE);
> > - if (err)
> > - goto out_err;
> > -
> > nfs_inc_stats(inode, NFSIOS_VFSWRITEPAGES);
> >
> > nfs_pageio_init_write(&pgio, inode, wb_priority(wbc));
> > err = write_cache_pages(mapping, wbc, nfs_writepages_callback, &pgio);
> > nfs_pageio_complete(&pgio);
> >
> > - clear_bit_unlock(NFS_INO_FLUSHING, bitlock);
> > - smp_mb__after_clear_bit();
> > - wake_up_bit(bitlock, NFS_INO_FLUSHING);
> > -
> > if (err < 0)
> > goto out_err;
> > err = pgio.pg_error;
> > --- linux.orig/include/linux/nfs_fs.h 2009-10-07 14:31:45.000000000 +0800
> > +++ linux/include/linux/nfs_fs.h 2009-10-07 14:32:54.000000000 +0800
> > @@ -208,7 +208,6 @@ struct nfs_inode {
> > #define NFS_INO_STALE (1) /* possible stale inode */
> > #define NFS_INO_ACL_LRU_SET (2) /* Inode is on the LRU list */
> > #define NFS_INO_MOUNTPOINT (3) /* inode is remote mountpoint */
> > -#define NFS_INO_FLUSHING (4) /* inode is flushing out data */
> > #define NFS_INO_FSCACHE (5) /* inode can be cached by FS-Cache */
> > #define NFS_INO_FSCACHE_LOCK (6) /* FS-Cache cookie management lock */
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [PATCH 00/45] some writeback experiments
2009-10-07 7:38 [PATCH 00/45] some writeback experiments Wu Fengguang
` (45 preceding siblings ...)
2009-10-07 10:17 ` [PATCH 14/45] writeback: quit on wrap for .range_cyclic (afs) David Howells
@ 2009-10-07 13:47 ` Peter Staubach
2009-10-07 15:18 ` Wu Fengguang
2009-10-07 14:26 ` Theodore Tso
47 siblings, 1 reply; 116+ messages in thread
From: Peter Staubach @ 2009-10-07 13:47 UTC (permalink / raw)
To: Wu Fengguang
Cc: Andrew Morton, Theodore Tso, Christoph Hellwig, Dave Chinner,
Chris Mason, Peter Zijlstra, Li Shaohua, Myklebust Trond,
jens.axboe@oracle.com, Jan Kara, Nick Piggin, linux-fsdevel, LKML
Wu Fengguang wrote:
> Hi all,
>
> Here is a collection of writeback patches on
>
> - larger writeback chunk sizes
> - single per-bdi flush thread (killing the foreground throttling writeouts)
> - lumpy pageout
> - sync livelock prevention
> - writeback scheduling
> - random fixes
>
> Sorry for posting a too big series - there are many direct or implicit
> dependencies, and one patch lead to another before I can stop..
>
> The lumpy pageout and nr_segments support is not complete and do not
> cover all filesystems for now. It may be better to first convert some of
> the ->writepages to the generic routines to avoid duplicate work.
>
> I managed to address many issues in past week, however there are still known
> problems. Hints from filesystem developers are highly appreciated. Thanks!
>
> The estimated writeback bandwidth is about 1/2 the real throughput
> for ext2/3/4 and btrfs; noticeable bigger than real throughput for NFS; and
> cannot be estimated at all for XFS. Very interesting..
>
> NFS writeback is very bumpy. The page numbers and network throughput "freeze"
> together from time to time:
>
Yes. It appears that the problem is that too many pages get dirtied
and the network/server get overwhelmed by the NFS client attempting
to write out all of the pages as quickly as it possibly can.
I think that it would be better if we could better match the
number of pages which can be dirty at any given point with the
bandwidth available through the network and the server file
system and storage.
One approach that I have pondered is immediately queuing an
asynchronous request when enough pages are dirtied to be able
to completely fill an over the wire transfer. This sort of
seems like a per-file bdi, which doesn't seem quite like the
right approach to me. What would y'all think about that?
ps
> # vmmon -d 1 nr_writeback nr_dirty nr_unstable # (per 1-second samples)
> nr_writeback nr_dirty nr_unstable
> 11227 41463 38044
> 11227 41463 38044
> 11227 41463 38044
> 11227 41463 38044
> 11045 53987 6490
> 11033 53120 8145
> 11195 52143 10886
> 11211 52144 10913
> 11211 52144 10913
> 11211 52144 10913
>
> btrfs seems to maintain a private pool of writeback pages, which can go out of
> control:
>
> nr_writeback nr_dirty
> 261075 132
> 252891 195
> 244795 187
> 236851 187
> 228830 187
> 221040 218
> 212674 237
> 204981 237
>
> XFS has very interesting "bumpy writeback" behavior: it tends to wait
> collect enough pages and then write the whole world.
>
> nr_writeback nr_dirty
> 80781 0
> 37117 37703
> 37117 43933
> 81044 6
> 81050 0
> 43943 10199
> 43930 36355
> 43930 36355
> 80293 0
> 80285 0
> 80285 0
>
> Thanks,
> Fengguang
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [PATCH 44/45] NFS: remove NFS_INO_FLUSHING lock
2009-10-07 13:32 ` Wu Fengguang
@ 2009-10-07 13:59 ` Peter Staubach
2009-10-08 1:44 ` Wu Fengguang
0 siblings, 1 reply; 116+ messages in thread
From: Peter Staubach @ 2009-10-07 13:59 UTC (permalink / raw)
To: Wu Fengguang
Cc: Andrew Morton, Theodore Tso, Christoph Hellwig, Dave Chinner,
Chris Mason, Peter Zijlstra, Li, Shaohua, Myklebust Trond,
jens.axboe@oracle.com, Jan Kara, Nick Piggin,
linux-fsdevel@vger.kernel.org, LKML
Wu Fengguang wrote:
> On Wed, Oct 07, 2009 at 09:11:15PM +0800, Peter Staubach wrote:
>> Wu Fengguang wrote:
>>> It was introduced in 72cb77f4a5ac, and the several issues have been
>>> addressed in generic writeback:
>>> - out of order writeback (or interleaved concurrent writeback)
>>> addressed by the per-bdi writeback and wait queue in balance_dirty_pages()
>>> - sync livelocked by a fast dirtier
>>> addressed by throttling all to-be-synced dirty inodes
>>>
>> I don't think that we can just remove this support. It is
>> designed to reduce the effects from doing a stat(2) on a
>> file which is being actively written to.
>
> Ah OK.
>
>> If we do remove it, then we will need to replace this patch
>> with another. Trond and I hadn't quite finished discussing
>> some aspects of that other patch... :-)
>
> I noticed the i_mutex lock in nfs_getattr(). Do you mean that?
>
Well, that's part of that support as well. That keeps a writing
application from dirtying more pages while the application doing
the stat is attempting to clean them.
Another approach that I suggested was to keep track of the
number of pages which are dirty on a per-inode basis. When
enough pages are dirty to fill an over the wire transfer,
then schedule an asynchronous write to transmit that data to
the server. This ties in with support to ensure that the
server/network is not completely overwhelmed by the client
by flow controlling the writing application to better match
the bandwidth and latencies of the network and server.
With this support, the NFS client tends not to fill memory
with dirty pages and thus, does not depend upon the other
parts of the system to flush these pages.
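A very rough sketch of that accounting (the dirty_pages counter and
nfs_async_flush() are hypothetical names here; only wsize comes from the
existing mount parameters):

	/* called whenever a page of the inode is dirtied */
	static void nfs_account_dirty_page(struct inode *inode)
	{
		struct nfs_inode *nfsi = NFS_I(inode);
		unsigned int pages_per_rpc = NFS_SERVER(inode)->wsize >> PAGE_SHIFT;

		if (++nfsi->dirty_pages >= pages_per_rpc) {
			nfsi->dirty_pages = 0;
			/* hypothetical: queue an async over-the-wire WRITE */
			nfs_async_flush(inode);
		}
	}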
All of these recent patches make the current flushing happen
in a much more orderly fashion, which is great. However,
this can still lead to the client attempting to flush
potentially gigabytes all at once, which is more than most
networks and servers can handle reasonably.
ps
> Thanks,
> Fengguang
>
>>> CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
>>> CC: Peter Staubach <staubach@redhat.com>
>>> CC: Trond Myklebust <Trond.Myklebust@netapp.com>
>>> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
>>> ---
>>> fs/nfs/file.c | 9 ---------
>>> fs/nfs/write.c | 11 -----------
>>> include/linux/nfs_fs.h | 1 -
>>> 3 files changed, 21 deletions(-)
>>>
>>> --- linux.orig/fs/nfs/file.c 2009-10-07 14:31:45.000000000 +0800
>>> +++ linux/fs/nfs/file.c 2009-10-07 14:32:54.000000000 +0800
>>> @@ -386,15 +386,6 @@ static int nfs_write_begin(struct file *
>>> mapping->host->i_ino, len, (long long) pos);
>>>
>>> start:
>>> - /*
>>> - * Prevent starvation issues if someone is doing a consistency
>>> - * sync-to-disk
>>> - */
>>> - ret = wait_on_bit(&NFS_I(mapping->host)->flags, NFS_INO_FLUSHING,
>>> - nfs_wait_bit_killable, TASK_KILLABLE);
>>> - if (ret)
>>> - return ret;
>>> -
>>> page = grab_cache_page_write_begin(mapping, index, flags);
>>> if (!page)
>>> return -ENOMEM;
>>> --- linux.orig/fs/nfs/write.c 2009-10-07 14:31:45.000000000 +0800
>>> +++ linux/fs/nfs/write.c 2009-10-07 14:32:54.000000000 +0800
>>> @@ -387,26 +387,15 @@ static int nfs_writepages_callback(struc
>>> int nfs_writepages(struct address_space *mapping, struct writeback_control *wbc)
>>> {
>>> struct inode *inode = mapping->host;
>>> - unsigned long *bitlock = &NFS_I(inode)->flags;
>>> struct nfs_pageio_descriptor pgio;
>>> int err;
>>>
>>> - /* Stop dirtying of new pages while we sync */
>>> - err = wait_on_bit_lock(bitlock, NFS_INO_FLUSHING,
>>> - nfs_wait_bit_killable, TASK_KILLABLE);
>>> - if (err)
>>> - goto out_err;
>>> -
>>> nfs_inc_stats(inode, NFSIOS_VFSWRITEPAGES);
>>>
>>> nfs_pageio_init_write(&pgio, inode, wb_priority(wbc));
>>> err = write_cache_pages(mapping, wbc, nfs_writepages_callback, &pgio);
>>> nfs_pageio_complete(&pgio);
>>>
>>> - clear_bit_unlock(NFS_INO_FLUSHING, bitlock);
>>> - smp_mb__after_clear_bit();
>>> - wake_up_bit(bitlock, NFS_INO_FLUSHING);
>>> -
>>> if (err < 0)
>>> goto out_err;
>>> err = pgio.pg_error;
>>> --- linux.orig/include/linux/nfs_fs.h 2009-10-07 14:31:45.000000000 +0800
>>> +++ linux/include/linux/nfs_fs.h 2009-10-07 14:32:54.000000000 +0800
>>> @@ -208,7 +208,6 @@ struct nfs_inode {
>>> #define NFS_INO_STALE (1) /* possible stale inode */
>>> #define NFS_INO_ACL_LRU_SET (2) /* Inode is on the LRU list */
>>> #define NFS_INO_MOUNTPOINT (3) /* inode is remote mountpoint */
>>> -#define NFS_INO_FLUSHING (4) /* inode is flushing out data */
>>> #define NFS_INO_FSCACHE (5) /* inode can be cached by FS-Cache */
>>> #define NFS_INO_FSCACHE_LOCK (6) /* FS-Cache cookie management lock */
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [PATCH 29/45] writeback: fix the shmem AOP_WRITEPAGE_ACTIVATE case
2009-10-07 11:57 ` Hugh Dickins
@ 2009-10-07 14:00 ` Wu Fengguang
0 siblings, 0 replies; 116+ messages in thread
From: Wu Fengguang @ 2009-10-07 14:00 UTC (permalink / raw)
To: Hugh Dickins
Cc: Andrew Morton, Theodore Tso, Christoph Hellwig, Dave Chinner,
Chris Mason, Peter Zijlstra, Li, Shaohua, Myklebust Trond,
jens.axboe@oracle.com, Jan Kara, Nick Piggin,
linux-fsdevel@vger.kernel.org, LKML
On Wed, Oct 07, 2009 at 07:57:00PM +0800, Hugh Dickins wrote:
> On Wed, 7 Oct 2009, Wu Fengguang wrote:
>
> > When shmem returns AOP_WRITEPAGE_ACTIVATE, the inode pages cannot be
> > synced in the near future. So write_cache_pages shall stop writing to this
> > inode, and shmem shall increase pages_skipped to instruct VFS not to
> > busy retry.
> >
> > CC: Hugh Dickins <hugh.dickins@tiscali.co.uk>
> > Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
>
> Okay, it embarrasses me to see AOP_WRITEPAGE_ACTIVATE (and its horrid
> "in this one case the page is still locked" semantic) still around -
> my patch to remove it vanished from mmotm (probably caused a temporary
> conflict) and I've never chased it up (partly out of guilt that I'd not
> yet kept my promise to contact the openAFS people about their use of it).
Googled this one :)
http://markmail.org/thread/fivi4bgylwsy26ws
In fact we could just return from shmem_writepage() with PG_dirty and
!PG_writeback. Then the page will be put back on the LRU with PG_reclaim
cleared. The same could well happen in other filesystems that have trouble
writing back the page at the moment (those already do pages_skipped++).
> But that's orthogonal to your concern here: for so long as there has
> been a wbc->pages_skipped, I guess shmem_writepage() should have been
> incrementing it there - thanks. But I don't believe the VFS will ever
> have any interest in pages_skipped from shmem_writepage(): do you have
> evidence that it does? If so, I need to investigate.
Yes :) Except in this new code.
> And the accompanying change to write_cache_pages() seems irrelevant
> and misguided. Irrelevant because write_cache_pages() should never be
> dealing with shmem_writepage() (its bdi should keep it well away), and
> should never be dealing with reclaim, which is the only case in which
> shmem_writepage() returns AOP_WRITEPAGE_ACTIVATE - or have your other
> changes, or the bdi work, changed that?
That becomes possible with another of my patches (30/45). I've added a
check to avoid doing lumpy reclaim for shmem. So now everything returns
to normal :)
> And misguided because in your change to write_cache_pages() you're
> taking AOP_WRITEPAGE_ACTIVATE to say that it should now give up, not
> process more pages. We just don't know that. All it means is that
> this one page couldn't be written and should be reactivated (if it
> were under reclaim): it might be the case that every other page tried
> after would get treated in the same way, or it might be the case that
> the next page would get written successfully. That info is just not
> provided.
Yes, it was over-smart indeed. I'll revert that chunk (or
AOP_WRITEPAGE_ACTIVATE) totally.
Thanks,
Fengguang
> > ---
> > mm/page-writeback.c | 23 +++++++++++------------
> > mm/shmem.c | 1 +
> > 2 files changed, 12 insertions(+), 12 deletions(-)
> >
> > --- linux.orig/mm/page-writeback.c 2009-10-06 23:39:28.000000000 +0800
> > +++ linux/mm/page-writeback.c 2009-10-06 23:39:29.000000000 +0800
> > @@ -851,19 +851,18 @@ continue_unlock:
> > if (ret == AOP_WRITEPAGE_ACTIVATE) {
> > unlock_page(page);
> > ret = 0;
> > - } else {
> > - /*
> > - * done_index is set past this page,
> > - * so media errors will not choke
> > - * background writeout for the entire
> > - * file. This has consequences for
> > - * range_cyclic semantics (ie. it may
> > - * not be suitable for data integrity
> > - * writeout).
> > - */
> > - done = 1;
> > - break;
> > }
> > + /*
> > + * done_index is set past this page,
> > + * so media errors will not choke
> > + * background writeout for the entire
> > + * file. This has consequences for
> > + * range_cyclic semantics (ie. it may
> > + * not be suitable for data integrity
> > + * writeout).
> > + */
> > + done = 1;
> > + break;
> > }
> >
> > if (nr_to_write > 0) {
> > --- linux.orig/mm/shmem.c 2009-10-06 23:37:40.000000000 +0800
> > +++ linux/mm/shmem.c 2009-10-06 23:39:29.000000000 +0800
> > @@ -1103,6 +1103,7 @@ unlock:
> > */
> > swapcache_free(swap, NULL);
> > redirty:
> > + wbc->pages_skipped++;
> > set_page_dirty(page);
> > if (wbc->for_reclaim)
> > return AOP_WRITEPAGE_ACTIVATE; /* Return with page locked */
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [PATCH 09/45] writeback: quit on wrap for .range_cyclic (pohmelfs)
2009-10-07 12:32 ` Evgeniy Polyakov
@ 2009-10-07 14:23 ` Wu Fengguang
0 siblings, 0 replies; 116+ messages in thread
From: Wu Fengguang @ 2009-10-07 14:23 UTC (permalink / raw)
To: Evgeniy Polyakov
Cc: Andrew Morton, Theodore Tso, Christoph Hellwig, Dave Chinner,
Chris Mason, Peter Zijlstra, Li, Shaohua, Myklebust Trond,
jens.axboe@oracle.com, Jan Kara, Nick Piggin,
linux-fsdevel@vger.kernel.org, LKML
On Wed, Oct 07, 2009 at 08:32:11PM +0800, Evgeniy Polyakov wrote:
> Hi.
>
> On Wed, Oct 07, 2009 at 03:38:27PM +0800, Wu Fengguang (fengguang.wu@intel.com) wrote:
> > Convert wbc.range_cyclic to new behavior: when past EOF, abort writeback
> > of the inode, which instructs writeback_single_inode() to delay it for
> > a while if necessary.
> >
> > It removes one inefficient .range_cyclic IO pattern when writeback_index
> > wraps:
> > submit [10000-10100], (wrap), submit [0-100]
> > In which the submitted pages may consist of two distant ranges.
> >
> > It also prevents submitting pointless IO for busy overwriters.
>
> I have no objections against this patchset, since I followed the
> upstream writeback behaviour and did not personally observe such wraps
> which would be otherwise handled in a single run.
OK, thanks!
Fengguang
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [PATCH 00/45] some writeback experiments
2009-10-07 7:38 [PATCH 00/45] some writeback experiments Wu Fengguang
` (46 preceding siblings ...)
2009-10-07 13:47 ` [PATCH 00/45] some writeback experiments Peter Staubach
@ 2009-10-07 14:26 ` Theodore Tso
2009-10-07 14:45 ` Wu Fengguang
47 siblings, 1 reply; 116+ messages in thread
From: Theodore Tso @ 2009-10-07 14:26 UTC (permalink / raw)
To: Wu Fengguang
Cc: Andrew Morton, Christoph Hellwig, Dave Chinner, Chris Mason,
Peter Zijlstra, Li Shaohua, Myklebust Trond,
jens.axboe@oracle.com, Jan Kara, Nick Piggin, linux-fsdevel, LKML
On Wed, Oct 07, 2009 at 03:38:18PM +0800, Wu Fengguang wrote:
>
> The estimated writeback bandwidth is about 1/2 the real throughput
> for ext2/3/4 and btrfs; noticeable bigger than real throughput for NFS; and
> cannot be estimated at all for XFS. Very interesting..
Can you expand on what you mean here? Estimated write bandwidth of
what? And what are you comparing it against?
I'm having trouble understanding your note (which I'm guessing you
write fairly late at night? :-)
Thanks,
- Ted
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [PATCH 00/45] some writeback experiments
2009-10-07 14:26 ` Theodore Tso
@ 2009-10-07 14:45 ` Wu Fengguang
0 siblings, 0 replies; 116+ messages in thread
From: Wu Fengguang @ 2009-10-07 14:45 UTC (permalink / raw)
To: Theodore Tso, Andrew Morton, Christoph Hellwig, Dave Chinner,
Chris Mason
On Wed, Oct 07, 2009 at 10:26:32PM +0800, Theodore Ts'o wrote:
> On Wed, Oct 07, 2009 at 03:38:18PM +0800, Wu Fengguang wrote:
> >
> > The estimated writeback bandwidth is about 1/2 the real throughput
> > for ext2/3/4 and btrfs; noticeable bigger than real throughput for NFS; and
> > cannot be estimated at all for XFS. Very interesting..
>
> Can you expand on what you mean here? Estimated write bandwidth of
> what? And what are you comparing it against?
Please refer to [PATCH 21/45] writeback: estimate bdi write bandwidth
and patch 22, I have some numbers there :)
> I'm having trouble understanding your note (which I'm guessing you
> write fairly late at night? :-)
Sorry - I wrote that when I got tired from debugging ;)
Thanks,
Fengguang
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [PATCH 00/45] some writeback experiments
2009-10-07 13:47 ` [PATCH 00/45] some writeback experiments Peter Staubach
@ 2009-10-07 15:18 ` Wu Fengguang
2009-10-08 5:33 ` Wu Fengguang
0 siblings, 1 reply; 116+ messages in thread
From: Wu Fengguang @ 2009-10-07 15:18 UTC (permalink / raw)
To: Peter Staubach
Cc: Andrew Morton, Theodore Tso, Christoph Hellwig, Dave Chinner,
Chris Mason, Peter Zijlstra, Li, Shaohua, Myklebust Trond,
jens.axboe@oracle.com, Jan Kara, Nick Piggin,
linux-fsdevel@vger.kernel.org, LKML
On Wed, Oct 07, 2009 at 09:47:14PM +0800, Peter Staubach wrote:
> Wu Fengguang wrote:
> > Hi all,
> >
> > Here is a collection of writeback patches on
> >
> > - larger writeback chunk sizes
> > - single per-bdi flush thread (killing the foreground throttling writeouts)
> > - lumpy pageout
> > - sync livelock prevention
> > - writeback scheduling
> > - random fixes
> >
> > Sorry for posting a too big series - there are many direct or implicit
> > dependencies, and one patch lead to another before I can stop..
> >
> > The lumpy pageout and nr_segments support is not complete and do not
> > cover all filesystems for now. It may be better to first convert some of
> > the ->writepages to the generic routines to avoid duplicate work.
> >
> > I managed to address many issues in past week, however there are still known
> > problems. Hints from filesystem developers are highly appreciated. Thanks!
> >
> > The estimated writeback bandwidth is about 1/2 the real throughput
> > for ext2/3/4 and btrfs; noticeable bigger than real throughput for NFS; and
> > cannot be estimated at all for XFS. Very interesting..
> >
> > NFS writeback is very bumpy. The page numbers and network throughput "freeze"
> > together from time to time:
> >
>
> Yes. It appears that the problem is that too many pages get dirtied
> and the network/server get overwhelmed by the NFS client attempting
> to write out all of the pages as quickly as it possibly can.
In theory it should push pages as quickly as possible at first,
to fill up the server side queue.
> I think that it would be better if we could better match the
> number of pages which can be dirty at any given point with the
> bandwidth available through the network and the server file
> system and storage.
And then settle into a steady state of matched network/disk bandwidth.
> One approach that I have pondered is immediately queuing an
> asynchronous request when enough pages are dirtied to be able
> to completely fill an over the wire transfer. This sort of
> seems like a per-file bdi, which doesn't seem quite like the
> right approach to me. What would y'all think about that?
Hmm, it sounds like unnecessary complexity, because it is not going to
help the busy-dirtier case anyway. And if we can do well on heavy IO,
the pre-flushing policy becomes less interesting.
>
> > # vmmon -d 1 nr_writeback nr_dirty nr_unstable # (per 1-second samples)
> > nr_writeback nr_dirty nr_unstable
> > 11227 41463 38044
> > 11227 41463 38044
> > 11227 41463 38044
> > 11227 41463 38044
I guess in the above 4 seconds, either client or (more likely) server
is blocked. A blocked server cannot send ACKs to knock down both
nr_writeback/nr_unstable. And the stuck nr_writeback will freeze
nr_dirty as well, because the dirtying process is throttled until
it receives enough "PG_writeback cleared" event, however the bdi-flush
thread is also blocked when trying to clear more PG_writeback, because
the client side nr_writeback limit has been reached. In summary,
server blocked => nr_writeback stuck => nr_writeback limit reached
=> bdi-flush blocked => no end_page_writeback() => dirtier blocked
=> nr_dirty stuck
Thanks,
Fengguang
> > 11045 53987 6490
> > 11033 53120 8145
> > 11195 52143 10886
> > 11211 52144 10913
> > 11211 52144 10913
> > 11211 52144 10913
> >
> > btrfs seems to maintain a private pool of writeback pages, which can go out of
> > control:
> >
> > nr_writeback nr_dirty
> > 261075 132
> > 252891 195
> > 244795 187
> > 236851 187
> > 228830 187
> > 221040 218
> > 212674 237
> > 204981 237
> >
> > XFS has very interesting "bumpy writeback" behavior: it tends to wait
> > collect enough pages and then write the whole world.
> >
> > nr_writeback nr_dirty
> > 80781 0
> > 37117 37703
> > 37117 43933
> > 81044 6
> > 81050 0
> > 43943 10199
> > 43930 36355
> > 43930 36355
> > 80293 0
> > 80285 0
> > 80285 0
> >
> > Thanks,
> > Fengguang
> >
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at http://vger.kernel.org/majordomo-info.html
> > Please read the FAQ at http://www.tux.org/lkml/
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [PATCH 18/45] writeback: introduce wait queue for balance_dirty_pages()
2009-10-07 7:38 ` [PATCH 18/45] writeback: introduce wait queue for balance_dirty_pages() Wu Fengguang
@ 2009-10-08 1:01 ` KAMEZAWA Hiroyuki
2009-10-08 1:58 ` Wu Fengguang
2009-10-08 8:05 ` [PATCH 18/45] writeback: introduce wait queue for balance_dirty_pages() Peter Zijlstra
0 siblings, 2 replies; 116+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-10-08 1:01 UTC (permalink / raw)
To: Wu Fengguang
Cc: Andrew Morton, Theodore Tso, Christoph Hellwig, Dave Chinner,
Chris Mason, Peter Zijlstra, Li Shaohua, Myklebust Trond,
jens.axboe@oracle.com, Jan Kara, Nick Piggin, linux-fsdevel, LKML
On Wed, 07 Oct 2009 15:38:36 +0800
Wu Fengguang <fengguang.wu@intel.com> wrote:
> As proposed by Chris, Dave and Jan, let balance_dirty_pages() wait for
> the per-bdi flusher to writeback enough pages for it, instead of
> starting foreground writeback by itself. By doing so we harvest two
> benefits:
> - avoid concurrent writeback of multiple inodes (Dave Chinner)
> If every thread doing writes and being throttled start foreground
> writeback, it leads to N IO submitters from at least N different
> inodes at the same time, end up with N different sets of IO being
> issued with potentially zero locality to each other, resulting in
> much lower elevator sort/merge efficiency and hence we seek the disk
> all over the place to service the different sets of IO.
> OTOH, if there is only one submission thread, it doesn't jump between
> inodes in the same way when congestion clears - it keeps writing to
> the same inode, resulting in large related chunks of sequential IOs
> being issued to the disk. This is more efficient than the above
> foreground writeback because the elevator works better and the disk
> seeks less.
> - avoid one constraint towards huge per-file nr_to_write
> The write_chunk used by balance_dirty_pages() should be small enough to
> prevent user noticeable one-shot latency. Ie. each sleep/wait inside
> balance_dirty_pages() shall be small enough. When it starts its own
> writeback, it must specify a small nr_to_write. The throttle wait queue
> removes this dependency by the way.
>
May I ask a question ? (maybe not directly related to this patch itself, sorry)
Recent works as "writeback: switch to per-bdi threads for flushing data"
removed congestion_wait() from balance_dirty_pages() and added
schedule_timeout_interruptible().
And this one replaces it with wake_up+wait_queue.
IIUC, "iowait" cpustat data was calculated by runqueue->nr_iowait as
== kernel/schec.c
void account_idle_time(cputime_t cputime)
{
struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat;
cputime64_t cputime64 = cputime_to_cputime64(cputime);
struct rq *rq = this_rq();
if (atomic_read(&rq->nr_iowait) > 0)
cpustat->iowait = cputime64_add(cpustat->iowait, cputime64);
else
cpustat->idle = cputime64_add(cpustat->idle, cputime64);
}
==
Then, for showing "cpu is in iowait", runqueue->nr_iowait should be modified
at some places. In old kernels, congestion_wait() et al. did that by calling
io_schedule_timeout().
How this runqueue->nr_iowait is handled now ?
Thanks,
-Kame
> CC: Chris Mason <chris.mason@oracle.com>
> CC: Dave Chinner <david@fromorbit.com>
> CC: Jan Kara <jack@suse.cz>
> CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
> CC: Jens Axboe <jens.axboe@oracle.com>
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> ---
> fs/fs-writeback.c | 71 ++++++++++++++++++++++++++++++++++
> include/linux/backing-dev.h | 15 +++++++
> mm/backing-dev.c | 4 +
> mm/page-writeback.c | 53 ++++++-------------------
> 4 files changed, 103 insertions(+), 40 deletions(-)
>
> --- linux.orig/mm/page-writeback.c 2009-10-06 23:38:30.000000000 +0800
> +++ linux/mm/page-writeback.c 2009-10-06 23:38:43.000000000 +0800
> @@ -218,6 +218,15 @@ static inline void __bdi_writeout_inc(st
> {
> __prop_inc_percpu_max(&vm_completions, &bdi->completions,
> bdi->max_prop_frac);
> +
> + /*
> + * The DIRTY_THROTTLE_PAGES_STOP test is an optional optimization, so
> + * it's OK to be racy. We set DIRTY_THROTTLE_PAGES_STOP*2 in other
> + * places to reduce the race possibility.
> + */
> + if (atomic_read(&bdi->throttle_pages) < DIRTY_THROTTLE_PAGES_STOP &&
> + atomic_dec_and_test(&bdi->throttle_pages))
> + bdi_writeback_wakeup(bdi);
> }
>
> void bdi_writeout_inc(struct backing_dev_info *bdi)
> @@ -458,20 +467,10 @@ static void balance_dirty_pages(struct a
> unsigned long background_thresh;
> unsigned long dirty_thresh;
> unsigned long bdi_thresh;
> - unsigned long pages_written = 0;
> - unsigned long pause = 1;
> int dirty_exceeded;
> struct backing_dev_info *bdi = mapping->backing_dev_info;
>
> for (;;) {
> - struct writeback_control wbc = {
> - .bdi = bdi,
> - .sync_mode = WB_SYNC_NONE,
> - .older_than_this = NULL,
> - .nr_to_write = write_chunk,
> - .range_cyclic = 1,
> - };
> -
> nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
> global_page_state(NR_UNSTABLE_NFS);
> nr_writeback = global_page_state(NR_WRITEBACK) +
> @@ -518,39 +517,13 @@ static void balance_dirty_pages(struct a
> if (!bdi->dirty_exceeded)
> bdi->dirty_exceeded = 1;
>
> - /* Note: nr_reclaimable denotes nr_dirty + nr_unstable.
> - * Unstable writes are a feature of certain networked
> - * filesystems (i.e. NFS) in which data may have been
> - * written to the server's write cache, but has not yet
> - * been flushed to permanent storage.
> - * Only move pages to writeback if this bdi is over its
> - * threshold otherwise wait until the disk writes catch
> - * up.
> - */
> - if (bdi_nr_reclaimable > bdi_thresh) {
> - writeback_inodes_wbc(&wbc);
> - pages_written += write_chunk - wbc.nr_to_write;
> - /* don't wait if we've done enough */
> - if (pages_written >= write_chunk)
> - break;
> - }
> - schedule_timeout_interruptible(pause);
> -
> - /*
> - * Increase the delay for each loop, up to our previous
> - * default of taking a 100ms nap.
> - */
> - pause <<= 1;
> - if (pause > HZ / 10)
> - pause = HZ / 10;
> + bdi_writeback_wait(bdi, write_chunk);
> + break;
> }
>
> if (!dirty_exceeded && bdi->dirty_exceeded)
> bdi->dirty_exceeded = 0;
>
> - if (writeback_in_progress(bdi))
> - return;
> -
> /*
> * In laptop mode, we wait until hitting the higher threshold before
> * starting background writeout, and then write out all the way down
> @@ -559,8 +532,8 @@ static void balance_dirty_pages(struct a
> * In normal mode, we start background writeout at the lower
> * background_thresh, to keep the amount of dirty memory low.
> */
> - if ((laptop_mode && pages_written) ||
> - (!laptop_mode && (nr_reclaimable > background_thresh)))
> + if (!laptop_mode && (nr_reclaimable > background_thresh) &&
> + can_submit_background_writeback(bdi))
> bdi_start_writeback(bdi, NULL, 0);
> }
>
> --- linux.orig/include/linux/backing-dev.h 2009-10-06 23:38:43.000000000 +0800
> +++ linux/include/linux/backing-dev.h 2009-10-06 23:38:43.000000000 +0800
> @@ -86,6 +86,13 @@ struct backing_dev_info {
>
> struct list_head work_list;
>
> + /*
> + * dirtier process throttling
> + */
> + spinlock_t throttle_lock;
> + struct list_head throttle_list; /* nr to sync for each task */
> + atomic_t throttle_pages; /* nr to sync for head task */
> +
> struct device *dev;
>
> #ifdef CONFIG_DEBUG_FS
> @@ -99,6 +106,12 @@ struct backing_dev_info {
> */
> #define WB_FLAG_BACKGROUND_WORK 30
>
> +/*
> + * when no task is throttled, set throttle_pages to larger than this,
> + * to avoid unnecessary atomic decreases.
> + */
> +#define DIRTY_THROTTLE_PAGES_STOP (1 << 22)
> +
> int bdi_init(struct backing_dev_info *bdi);
> void bdi_destroy(struct backing_dev_info *bdi);
>
> @@ -110,6 +123,8 @@ void bdi_start_writeback(struct backing_
> long nr_pages);
> int bdi_writeback_task(struct bdi_writeback *wb);
> int bdi_has_dirty_io(struct backing_dev_info *bdi);
> +int bdi_writeback_wakeup(struct backing_dev_info *bdi);
> +void bdi_writeback_wait(struct backing_dev_info *bdi, long nr_pages);
>
> extern spinlock_t bdi_lock;
> extern struct list_head bdi_list;
> --- linux.orig/fs/fs-writeback.c 2009-10-06 23:38:43.000000000 +0800
> +++ linux/fs/fs-writeback.c 2009-10-06 23:38:43.000000000 +0800
> @@ -25,6 +25,7 @@
> #include <linux/blkdev.h>
> #include <linux/backing-dev.h>
> #include <linux/buffer_head.h>
> +#include <linux/completion.h>
> #include "internal.h"
>
> #define inode_to_bdi(inode) ((inode)->i_mapping->backing_dev_info)
> @@ -265,6 +266,72 @@ void bdi_start_writeback(struct backing_
> bdi_alloc_queue_work(bdi, &args);
> }
>
> +struct dirty_throttle_task {
> + long nr_pages;
> + struct list_head list;
> + struct completion complete;
> +};
> +
> +void bdi_writeback_wait(struct backing_dev_info *bdi, long nr_pages)
> +{
> + struct dirty_throttle_task tt = {
> + .nr_pages = nr_pages,
> + .complete = COMPLETION_INITIALIZER_ONSTACK(tt.complete),
> + };
> + unsigned long flags;
> +
> + /*
> + * register throttle pages
> + */
> + spin_lock_irqsave(&bdi->throttle_lock, flags);
> + if (list_empty(&bdi->throttle_list))
> + atomic_set(&bdi->throttle_pages, nr_pages);
> + list_add(&tt.list, &bdi->throttle_list);
> + spin_unlock_irqrestore(&bdi->throttle_lock, flags);
> +
> + /*
> + * make sure we will be woke up by someone
> + */
> + if (can_submit_background_writeback(bdi))
> + bdi_start_writeback(bdi, NULL, 0);
> +
> + wait_for_completion(&tt.complete);
> +}
> +
> +/*
> + * return 1 if there are more waiting tasks.
> + */
> +int bdi_writeback_wakeup(struct backing_dev_info *bdi)
> +{
> + struct dirty_throttle_task *tt;
> + unsigned long flags;
> +
> + spin_lock_irqsave(&bdi->throttle_lock, flags);
> + /*
> + * remove and wakeup head task
> + */
> + if (!list_empty(&bdi->throttle_list)) {
> + tt = list_entry(bdi->throttle_list.prev,
> + struct dirty_throttle_task, list);
> + list_del(&tt->list);
> + complete(&tt->complete);
> + }
> + /*
> + * update throttle pages
> + */
> + if (!list_empty(&bdi->throttle_list)) {
> + tt = list_entry(bdi->throttle_list.prev,
> + struct dirty_throttle_task, list);
> + atomic_set(&bdi->throttle_pages, tt->nr_pages);
> + } else {
> + tt = NULL;
> + atomic_set(&bdi->throttle_pages, DIRTY_THROTTLE_PAGES_STOP * 2);
> + }
> + spin_unlock_irqrestore(&bdi->throttle_lock, flags);
> +
> + return tt != NULL;
> +}
> +
> /*
> * Redirty an inode: set its when-it-was dirtied timestamp and move it to the
> * furthest end of its superblock's dirty-inode list.
> @@ -760,6 +827,10 @@ static long wb_writeback(struct bdi_writ
> spin_unlock(&inode_lock);
> }
>
> + if (args->for_background)
> + while (bdi_writeback_wakeup(wb->bdi))
> + ; /* unthrottle all tasks */
> +
> return wrote;
> }
>
> --- linux.orig/mm/backing-dev.c 2009-10-06 23:37:47.000000000 +0800
> +++ linux/mm/backing-dev.c 2009-10-06 23:38:43.000000000 +0800
> @@ -646,6 +646,10 @@ int bdi_init(struct backing_dev_info *bd
> bdi->wb_mask = 1;
> bdi->wb_cnt = 1;
>
> + spin_lock_init(&bdi->throttle_lock);
> + INIT_LIST_HEAD(&bdi->throttle_list);
> + atomic_set(&bdi->throttle_pages, DIRTY_THROTTLE_PAGES_STOP * 2);
> +
> for (i = 0; i < NR_BDI_STAT_ITEMS; i++) {
> err = percpu_counter_init(&bdi->bdi_stat[i], 0);
> if (err)
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [PATCH 44/45] NFS: remove NFS_INO_FLUSHING lock
2009-10-07 13:59 ` Peter Staubach
@ 2009-10-08 1:44 ` Wu Fengguang
0 siblings, 0 replies; 116+ messages in thread
From: Wu Fengguang @ 2009-10-08 1:44 UTC (permalink / raw)
To: Peter Staubach
Cc: Andrew Morton, Theodore Tso, Christoph Hellwig, Dave Chinner,
Chris Mason, Peter Zijlstra, Li, Shaohua, Myklebust Trond,
jens.axboe@oracle.com, Jan Kara, Nick Piggin,
linux-fsdevel@vger.kernel.org, LKML
[-- Attachment #1: Type: text/plain, Size: 2950 bytes --]
On Wed, Oct 07, 2009 at 09:59:10PM +0800, Peter Staubach wrote:
> Wu Fengguang wrote:
> > On Wed, Oct 07, 2009 at 09:11:15PM +0800, Peter Staubach wrote:
> >> Wu Fengguang wrote:
> >>> It was introduced in 72cb77f4a5ac, and the several issues have been
> >>> addressed in generic writeback:
> >>> - out of order writeback (or interleaved concurrent writeback)
> >>> addressed by the per-bdi writeback and wait queue in balance_dirty_pages()
> >>> - sync livelocked by a fast dirtier
> >>> addressed by throttling all to-be-synced dirty inodes
> >>>
> >> I don't think that we can just remove this support. It is
> >> designed to reduce the effects from doing a stat(2) on a
> >> file which is being actively written to.
> >
> > Ah OK.
> >
> >> If we do remove it, then we will need to replace this patch
> >> with another. Trond and I hadn't quite finished discussing
> >> some aspects of that other patch... :-)
> >
> > I noticed the i_mutex lock in nfs_getattr(). Do you mean that?
> >
>
> Well, that's part of that support as well. That keeps a writing
> application from dirtying more pages while the application doing
> the stat is attempting to clean them.
Instead of blocking totally, we could throttle application writes.
The following two patches are a gentler way of doing this; however,
they do not guarantee killing the livelock, since a busy bdi-flush
thread could write back many pages and unfreeze the application
prematurely. Anyway, I attach them as a demo of the idea, whether it
be good or bad.
> Another approach that I suggested was to keep track of the
> number of pages which are dirty on a per-inode basis. When
Yes, a per-inode dirty count should be trivial and may be useful for others.
> enough pages are dirty to fill an over the wire transfer,
> then schedule an asynchronous write to transmit that data to
This should also be trivial to support if the location ordered
writeback infrastructure is ready.
> the server. This ties in with support to ensure that the
> server/network is not completely overwhelmed by the client
> by flow controlling the writing application to better match
> the bandwidth and latencies of the network and server.
I like that feature :)
> With this support, the NFS client tends not to fill memory
> with dirty pages and thus, does not depend upon the other
> parts of the system to flush these pages.
>
> All of these recent pages make this current flushing happen
> in a much more orderly fashion, which is great. However,
Thanks.
> this can still lead to the client attempting to flush
> potentially gigabytes all at once, which is more than most
> networks and servers can handle reasonably.
OK, I now see the need to keep mapping->nr_dirty under control: it
could make many NFS operations respond in a more bounded fashion.
The good thing is, it can share infrastructure with the location-based
writeback (http://lkml.org/lkml/2007/8/27/45 :)
Thanks,
Fengguang
[-- Attachment #2: writeback-throttle-sync-mapping.patch --]
[-- Type: text/x-diff, Size: 2488 bytes --]
writeback: sync livelock - throttle mapping dirties if it's being synced
The AS_SYNC_WAITER flag will be set to indicate an active sync waiter.
It's not perfect with respect to file ranges and concurrent syncs, and it
does not guarantee that the application writes on this mapping can be
throttled enough to avoid livelock, especially when there are many
background writebacks.
Just my 2 cents.
CC: Jan Kara <jack@suse.cz>
CC: Peter Staubach <staubach@redhat.com>
CC: Myklebust Trond <Trond.Myklebust@netapp.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
include/linux/pagemap.h | 3 +++
mm/filemap.c | 2 ++
mm/page-writeback.c | 5 +++++
3 files changed, 10 insertions(+)
--- linux.orig/mm/page-writeback.c 2009-10-08 08:11:19.000000000 +0800
+++ linux/mm/page-writeback.c 2009-10-08 08:48:13.000000000 +0800
@@ -480,7 +480,12 @@ static void balance_dirty_pages(struct a
/*
* If sync() is in progress, curb the to-be-synced inodes regardless
* of dirty limits, so that a fast dirtier won't livelock the sync.
+ * In particular, NFS syncs the mapping before many file operations.
*/
+ if (unlikely(test_bit(AS_SYNC_WAITER, &mapping->flags))) {
+ write_chunk *= 8;
+ goto throttle;
+ }
if (unlikely(bdi->sync_time &&
S_ISREG(mapping->host->i_mode) &&
time_after_eq(bdi->sync_time,
--- linux.orig/include/linux/pagemap.h 2009-10-08 08:11:19.000000000 +0800
+++ linux/include/linux/pagemap.h 2009-10-08 08:11:20.000000000 +0800
@@ -23,6 +23,9 @@ enum mapping_flags {
AS_ENOSPC = __GFP_BITS_SHIFT + 1, /* ENOSPC on async write */
AS_MM_ALL_LOCKS = __GFP_BITS_SHIFT + 2, /* under mm_take_all_locks() */
AS_UNEVICTABLE = __GFP_BITS_SHIFT + 3, /* e.g., ramdisk, SHM_LOCK */
+ AS_SYNC_WAITER = __GFP_BITS_SHIFT + 4, /* sync&wait under way, page
+ * dirtying will be throttled
+ */
};
static inline void mapping_set_error(struct address_space *mapping, int error)
--- linux.orig/mm/filemap.c 2009-10-08 08:11:19.000000000 +0800
+++ linux/mm/filemap.c 2009-10-08 08:11:20.000000000 +0800
@@ -356,6 +356,7 @@ int filemap_write_and_wait(struct addres
int err = 0;
if (mapping->nrpages) {
+ set_bit(AS_SYNC_WAITER, &mapping->flags);
err = filemap_fdatawrite(mapping);
/*
* Even if the above returned error, the pages may be
@@ -368,6 +369,7 @@ int filemap_write_and_wait(struct addres
if (!err)
err = err2;
}
+ clear_bit(AS_SYNC_WAITER, &mapping->flags);
}
return err;
}
[-- Attachment #3: nfs-no-i_mutex-for-livelock-prevention.patch --]
[-- Type: text/x-diff, Size: 1496 bytes --]
nfs: don't take i_mutex on nfs_wb_nocommit()
Set the AS_SYNC_WAITER flag and let VFS throttle application writes.
CC: Peter Staubach <staubach@redhat.com>
CC: Myklebust Trond <Trond.Myklebust@netapp.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
fs/nfs/inode.c | 9 +--------
fs/nfs/write.c | 6 +++++-
2 files changed, 6 insertions(+), 9 deletions(-)
--- linux.orig/fs/nfs/inode.c 2009-10-07 10:03:26.000000000 +0800
+++ linux/fs/nfs/inode.c 2009-10-08 08:18:37.000000000 +0800
@@ -513,16 +513,9 @@ int nfs_getattr(struct vfsmount *mnt, st
/*
* Flush out writes to the server in order to update c/mtime.
- *
- * Hold the i_mutex to suspend application writes temporarily;
- * this prevents long-running writing applications from blocking
- * nfs_wb_nocommit.
*/
- if (S_ISREG(inode->i_mode)) {
- mutex_lock(&inode->i_mutex);
+ if (S_ISREG(inode->i_mode))
nfs_wb_nocommit(inode);
- mutex_unlock(&inode->i_mutex);
- }
/*
* We may force a getattr if the user cares about atime.
--- linux.orig/fs/nfs/write.c 2009-10-08 08:21:02.000000000 +0800
+++ linux/fs/nfs/write.c 2009-10-08 08:31:01.000000000 +0800
@@ -1539,8 +1539,12 @@ static int nfs_write_mapping(struct addr
.range_start = 0,
.range_end = LLONG_MAX,
};
+ int ret;
- return __nfs_write_mapping(mapping, &wbc, how);
+ set_bit(AS_SYNC_WAITER, &mapping->flags);
+ ret = __nfs_write_mapping(mapping, &wbc, how);
+ clear_bit(AS_SYNC_WAITER, &mapping->flags);
+ return ret;
}
/*
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [PATCH 18/45] writeback: introduce wait queue for balance_dirty_pages()
2009-10-08 1:01 ` KAMEZAWA Hiroyuki
@ 2009-10-08 1:58 ` Wu Fengguang
2009-10-08 2:40 ` KAMEZAWA Hiroyuki
2009-10-08 8:08 ` Peter Zijlstra
2009-10-08 8:05 ` [PATCH 18/45] writeback: introduce wait queue for balance_dirty_pages() Peter Zijlstra
1 sibling, 2 replies; 116+ messages in thread
From: Wu Fengguang @ 2009-10-08 1:58 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: Andrew Morton, Theodore Tso, Christoph Hellwig, Dave Chinner,
Chris Mason, Peter Zijlstra, Li, Shaohua, Myklebust Trond,
jens.axboe@oracle.com, Jan Kara, Nick Piggin,
linux-fsdevel@vger.kernel.org, LKML
On Thu, Oct 08, 2009 at 09:01:59AM +0800, KAMEZAWA Hiroyuki wrote:
> On Wed, 07 Oct 2009 15:38:36 +0800
> Wu Fengguang <fengguang.wu@intel.com> wrote:
>
> > As proposed by Chris, Dave and Jan, let balance_dirty_pages() wait for
> > the per-bdi flusher to writeback enough pages for it, instead of
> > starting foreground writeback by itself. By doing so we harvest two
> > benefits:
> > - avoid concurrent writeback of multiple inodes (Dave Chinner)
> > If every thread doing writes and being throttled start foreground
> > writeback, it leads to N IO submitters from at least N different
> > inodes at the same time, end up with N different sets of IO being
> > issued with potentially zero locality to each other, resulting in
> > much lower elevator sort/merge efficiency and hence we seek the disk
> > all over the place to service the different sets of IO.
> > OTOH, if there is only one submission thread, it doesn't jump between
> > inodes in the same way when congestion clears - it keeps writing to
> > the same inode, resulting in large related chunks of sequential IOs
> > being issued to the disk. This is more efficient than the above
> > foreground writeback because the elevator works better and the disk
> > seeks less.
> > - avoid one constraint towards huge per-file nr_to_write
> > The write_chunk used by balance_dirty_pages() should be small enough to
> > prevent user noticeable one-shot latency. Ie. each sleep/wait inside
> > balance_dirty_pages() shall be small enough. When it starts its own
> > writeback, it must specify a small nr_to_write. The throttle wait queue
> > removes this dependency by the way.
> >
>
> May I ask a question ? (maybe not directly related to this patch itself, sorry)
Sure :)
> Recent works as "writeback: switch to per-bdi threads for flushing data"
> removed congestion_wait() from balance_dirty_pages() and added
> schedule_timeout_interruptible().
>
> And this one replaces it with wake_up+wait_queue.
Right.
> IIUC, "iowait" cpustat data was calculated by runqueue->nr_iowait as
> == kernel/schec.c
> void account_idle_time(cputime_t cputime)
> {
> struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat;
> cputime64_t cputime64 = cputime_to_cputime64(cputime);
> struct rq *rq = this_rq();
>
> if (atomic_read(&rq->nr_iowait) > 0)
> cpustat->iowait = cputime64_add(cpustat->iowait, cputime64);
> else
> cpustat->idle = cputime64_add(cpustat->idle, cputime64);
> }
> ==
> Then, for showing "cpu is in iowait", runqueue->nr_iowait should be modified
> at some places. In old kernels, congestion_wait() et al. did that by calling
> io_schedule_timeout().
>
> How this runqueue->nr_iowait is handled now ?
Good question. io_schedule() has an old comment for throttling IO wait:
* But don't do that if it is a deliberate, throttling IO wait (this task
* has set its backing_dev_info: the queue against which it should throttle)
*/
void __sched io_schedule(void)
So it looks like both Jens' patch and this one behave right in ignoring the
iowait accounting for balance_dirty_pages() :)
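(For reference, the body of io_schedule() around this time is roughly the
sketch below - reconstructed from memory rather than pasted from the tree,
so treat it as illustrative only:

== kernel/sched.c (sketch)
void __sched io_schedule(void)
{
	struct rq *rq = this_rq();	/* this CPU's runqueue; exact accessor varies by version */

	delayacct_blkio_start();
	atomic_inc(&rq->nr_iowait);	/* idle time on this CPU now counts as iowait */
	schedule();
	atomic_dec(&rq->nr_iowait);
	delayacct_blkio_end();
}
==

A plain schedule_timeout_interruptible() or wait_for_completion() never
touches rq->nr_iowait, which is why time spent throttled shows up as idle
rather than %iowait.)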
Thanks,
Fengguang
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [PATCH 18/45] writeback: introduce wait queue for balance_dirty_pages()
2009-10-08 1:58 ` Wu Fengguang
@ 2009-10-08 2:40 ` KAMEZAWA Hiroyuki
2009-10-08 4:01 ` Wu Fengguang
2009-10-08 8:08 ` Peter Zijlstra
1 sibling, 1 reply; 116+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-10-08 2:40 UTC (permalink / raw)
To: Wu Fengguang
Cc: Andrew Morton, Theodore Tso, Christoph Hellwig, Dave Chinner,
Chris Mason, Peter Zijlstra, Li, Shaohua, Myklebust Trond,
jens.axboe@oracle.com, Jan Kara, Nick Piggin,
linux-fsdevel@vger.kernel.org, LKML
On Thu, 8 Oct 2009 09:58:22 +0800
Wu Fengguang <fengguang.wu@intel.com> wrote:
> On Thu, Oct 08, 2009 at 09:01:59AM +0800, KAMEZAWA Hiroyuki wrote:
> > IIUC, "iowait" cpustat data was calculated by runqueue->nr_iowait as
> > == kernel/schec.c
> > void account_idle_time(cputime_t cputime)
> > {
> > struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat;
> > cputime64_t cputime64 = cputime_to_cputime64(cputime);
> > struct rq *rq = this_rq();
> >
> > if (atomic_read(&rq->nr_iowait) > 0)
> > cpustat->iowait = cputime64_add(cpustat->iowait, cputime64);
> > else
> > cpustat->idle = cputime64_add(cpustat->idle, cputime64);
> > }
> > ==
> > Then, for showing "cpu is in iowait", runqueue->nr_iowait should be modified
> > at some places. In old kernels, congestion_wait() et al. did that by calling
> > io_schedule_timeout().
> >
> > How this runqueue->nr_iowait is handled now ?
>
> Good question. io_schedule() has an old comment for throttling IO wait:
>
> * But don't do that if it is a deliberate, throttling IO wait (this task
> * has set its backing_dev_info: the queue against which it should throttle)
> */
> void __sched io_schedule(void)
>
> So it looks like both Jens' patch and this one behave right in ignoring the
> iowait accounting for balance_dirty_pages() :)
>
Thank you for clarification.
Then, hmm, %iowait (which 'top' shows) didn't work as designed and we need
to update throttle_vm_writeout() and some in vmscan.c. Thanks for input.
BTW, I'm glad if I can know "how many threads/ios are throttoled now" per bdi.
Regards,
-Kame
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [PATCH 18/45] writeback: introduce wait queue for balance_dirty_pages()
2009-10-08 2:40 ` KAMEZAWA Hiroyuki
@ 2009-10-08 4:01 ` Wu Fengguang
2009-10-08 5:59 ` KAMEZAWA Hiroyuki
0 siblings, 1 reply; 116+ messages in thread
From: Wu Fengguang @ 2009-10-08 4:01 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: Andrew Morton, Theodore Tso, Christoph Hellwig, Dave Chinner,
Chris Mason, Peter Zijlstra, Li, Shaohua, Myklebust Trond,
jens.axboe@oracle.com, Jan Kara, Nick Piggin,
linux-fsdevel@vger.kernel.org, LKML
On Thu, Oct 08, 2009 at 10:40:37AM +0800, KAMEZAWA Hiroyuki wrote:
> On Thu, 8 Oct 2009 09:58:22 +0800
> Wu Fengguang <fengguang.wu@intel.com> wrote:
> > On Thu, Oct 08, 2009 at 09:01:59AM +0800, KAMEZAWA Hiroyuki wrote:
> > > IIUC, "iowait" cpustat data was calculated by runqueue->nr_iowait as
> > > == kernel/schec.c
> > > void account_idle_time(cputime_t cputime)
> > > {
> > > struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat;
> > > cputime64_t cputime64 = cputime_to_cputime64(cputime);
> > > struct rq *rq = this_rq();
> > >
> > > if (atomic_read(&rq->nr_iowait) > 0)
> > > cpustat->iowait = cputime64_add(cpustat->iowait, cputime64);
> > > else
> > > cpustat->idle = cputime64_add(cpustat->idle, cputime64);
> > > }
> > > ==
> > > Then, for showing "cpu is in iowait", runqueue->nr_iowait should be modified
> > > at some places. In old kernels, congestion_wait() et al. did that by calling
> > > io_schedule_timeout().
> > >
> > > How this runqueue->nr_iowait is handled now ?
> >
> > Good question. io_schedule() has an old comment for throttling IO wait:
> >
> > * But don't do that if it is a deliberate, throttling IO wait (this task
> > * has set its backing_dev_info: the queue against which it should throttle)
> > */
> > void __sched io_schedule(void)
> >
> > So it looks like both Jens' patch and this one behave right in ignoring the
> > iowait accounting for balance_dirty_pages() :)
> >
> Thank you for clarification.
> Then, hmm, %iowait (which 'top' shows) didn't work as designed and we need
> to update throttle_vm_writeout() and some in vmscan.c. Thanks for input.
Thanks, you also remind me to do io_schedule() in the nfs writeback
wait queue :)
> BTW, I'm glad if I can know "how many threads/ios are throttoled now" per bdi.
Good suggestion. How about this patch?
---
writeback: show per-bdi throttled tasks
All currently throttled tasks will be listed, showing the pages to
writeback for them, and total wait time since blocked.
# cat /debug/bdi/8:0/throttle_list
goal=6144kb waited=32ms
goal=6144kb waited=48ms
CC: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
fs/fs-writeback.c | 1 +
include/linux/backing-dev.h | 2 ++
mm/backing-dev.c | 33 +++++++++++++++++++++++++++++++++
3 files changed, 36 insertions(+)
--- linux.orig/include/linux/backing-dev.h 2009-10-08 11:46:28.000000000 +0800
+++ linux/include/linux/backing-dev.h 2009-10-08 11:47:39.000000000 +0800
@@ -101,6 +101,7 @@ struct backing_dev_info {
#ifdef CONFIG_DEBUG_FS
struct dentry *debug_dir;
struct dentry *debug_stats;
+ struct dentry *debug_throttle;
#endif
};
@@ -116,6 +117,7 @@ struct backing_dev_info {
#define DIRTY_THROTTLE_PAGES_STOP (1 << 22)
struct dirty_throttle_task {
+ unsigned long start_time;
long nr_pages;
struct list_head list;
struct completion complete;
--- linux.orig/mm/backing-dev.c 2009-10-08 11:47:37.000000000 +0800
+++ linux/mm/backing-dev.c 2009-10-08 11:59:06.000000000 +0800
@@ -115,6 +115,23 @@ static int bdi_debug_stats_show(struct s
return 0;
}
+static int bdi_debug_throttle_show(struct seq_file *m, void *v)
+{
+ struct backing_dev_info *bdi = m->private;
+ struct dirty_throttle_task *tt;
+ unsigned long flags;
+
+ spin_lock_irqsave(&bdi->throttle_lock, flags);
+ list_for_each_entry(tt, &bdi->throttle_list, list) {
+ seq_printf(m, "goal=%lukb\twaited=%lums\n",
+ tt->nr_pages << (PAGE_SHIFT - 10),
+ (jiffies - tt->start_time) * 1000 / HZ);
+ }
+ spin_unlock_irqrestore(&bdi->throttle_lock, flags);
+
+ return 0;
+}
+
static int bdi_debug_stats_open(struct inode *inode, struct file *file)
{
return single_open(file, bdi_debug_stats_show, inode->i_private);
@@ -127,15 +144,31 @@ static const struct file_operations bdi_
.release = single_release,
};
+static int bdi_debug_throttle_open(struct inode *inode, struct file *file)
+{
+ return single_open(file, bdi_debug_throttle_show, inode->i_private);
+}
+
+static const struct file_operations bdi_debug_throttle_fops = {
+ .open = bdi_debug_throttle_open,
+ .read = seq_read,
+ .llseek = seq_lseek,
+ .release = single_release,
+};
+
static void bdi_debug_register(struct backing_dev_info *bdi, const char *name)
{
bdi->debug_dir = debugfs_create_dir(name, bdi_debug_root);
bdi->debug_stats = debugfs_create_file("stats", 0444, bdi->debug_dir,
bdi, &bdi_debug_stats_fops);
+ bdi->debug_throttle = debugfs_create_file("throttle_list", 0444,
+ bdi->debug_dir, bdi,
+ &bdi_debug_throttle_fops);
}
static void bdi_debug_unregister(struct backing_dev_info *bdi)
{
+ debugfs_remove(bdi->debug_throttle);
debugfs_remove(bdi->debug_stats);
debugfs_remove(bdi->debug_dir);
}
--- linux.orig/fs/fs-writeback.c 2009-10-08 11:44:27.000000000 +0800
+++ linux/fs/fs-writeback.c 2009-10-08 11:47:39.000000000 +0800
@@ -284,6 +284,7 @@ static void bdi_calc_write_bandwidth(str
void bdi_writeback_wait(struct backing_dev_info *bdi, long nr_pages)
{
struct dirty_throttle_task tt = {
+ .start_time = jiffies,
.nr_pages = nr_pages,
.complete = COMPLETION_INITIALIZER_ONSTACK(tt.complete),
};
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [PATCH 00/45] some writeback experiments
2009-10-07 15:18 ` Wu Fengguang
@ 2009-10-08 5:33 ` Wu Fengguang
2009-10-08 5:44 ` Wu Fengguang
0 siblings, 1 reply; 116+ messages in thread
From: Wu Fengguang @ 2009-10-08 5:33 UTC (permalink / raw)
To: Peter Staubach
Cc: Andrew Morton, Theodore Tso, Christoph Hellwig, Dave Chinner,
Chris Mason, Peter Zijlstra, Li, Shaohua, Myklebust Trond,
jens.axboe@oracle.com, Jan Kara, Nick Piggin,
linux-fsdevel@vger.kernel.org, LKML
On Wed, Oct 07, 2009 at 11:18:22PM +0800, Wu Fengguang wrote:
> On Wed, Oct 07, 2009 at 09:47:14PM +0800, Peter Staubach wrote:
> >
> > > # vmmon -d 1 nr_writeback nr_dirty nr_unstable # (per 1-second samples)
> > > nr_writeback nr_dirty nr_unstable
> > > 11227 41463 38044
> > > 11227 41463 38044
> > > 11227 41463 38044
> > > 11227 41463 38044
>
> I guess in the above 4 seconds, either client or (more likely) server
> is blocked. A blocked server cannot send ACKs to knock down both
Yeah, the server side is blocked. The nfsd threads are mostly blocked in
generic_file_aio_write(), in particular on the i_mutex lock! I'm copying
one or two big files over NFS, so the i_mutex lock is heavily contended.
I'm using the default wsize=4096 for NFS-root..
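(As a side note, the write size for an NFS-root setup is normally taken from
the nfsroot= option on the kernel command line, roughly along these lines,
with <server-ip> and the export path being placeholders:

    nfsroot=<server-ip>:/export/root,wsize=524288

An ordinary NFS mount would pass "-o wsize=524288" to mount instead; see
Documentation/filesystems/nfsroot.txt for the exact option list.)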
wfg ~% ps -o pid,tid,class,rtprio,ni,pri,psr,pcpu,stat,wchan:24,comm ax|g nfs
329 329 TS - -5 24 1 0.0 S< worker_thread nfsiod
4690 4690 TS - -5 24 0 0.1 D< generic_file_aio_write nfsd
4691 4691 TS - -5 24 0 0.0 D< generic_file_aio_write nfsd
4692 4692 TS - -5 24 0 0.0 D< generic_file_aio_write nfsd
4693 4693 TS - -5 24 0 0.0 D< generic_file_aio_write nfsd
4694 4694 TS - -5 24 0 0.1 D< generic_file_aio_write nfsd
4695 4695 TS - -5 24 1 0.1 D< generic_file_aio_write nfsd
4696 4696 TS - -5 24 1 0.0 D< log_wait_commit nfsd
4697 4697 TS - -5 24 0 0.0 D< generic_file_aio_write nfsd
wfg ~% ps -o pid,tid,class,rtprio,ni,pri,psr,pcpu,stat,wchan:24,comm ax|g nfs
329 329 TS - -5 24 1 0.0 S< worker_thread nfsiod
4690 4690 TS - -5 24 0 0.1 D< generic_file_aio_write nfsd
4691 4691 TS - -5 24 0 0.0 D< generic_file_aio_write nfsd
4692 4692 TS - -5 24 0 0.0 D< generic_file_aio_write nfsd
4693 4693 TS - -5 24 0 0.0 D< sync_buffer nfsd
4694 4694 TS - -5 24 0 0.1 D< generic_file_aio_write nfsd
4695 4695 TS - -5 24 1 0.1 D< generic_file_aio_write nfsd
4696 4696 TS - -5 24 1 0.0 D< generic_file_aio_write nfsd
4697 4697 TS - -5 24 0 0.0 D< generic_file_aio_write nfsd
wfg ~% ps -o pid,tid,class,rtprio,ni,pri,psr,pcpu,stat,wchan:24,comm ax|g nfs
329 329 TS - -5 24 1 0.0 S< worker_thread nfsiod
4690 4690 TS - -5 24 0 0.1 D< generic_file_aio_write nfsd
4691 4691 TS - -5 24 0 0.1 D< get_request_wait nfsd
4692 4692 TS - -5 24 0 0.1 D< generic_file_aio_write nfsd
4693 4693 TS - -5 24 0 0.1 S< svc_recv nfsd
4694 4694 TS - -5 24 0 0.1 D< generic_file_aio_write nfsd
4695 4695 TS - -5 24 0 0.1 D< generic_file_aio_write nfsd
4696 4696 TS - -5 24 0 0.1 S< svc_recv nfsd
4697 4697 TS - -5 24 1 0.1 D< generic_file_aio_write nfsd
wfg ~% ps -o pid,tid,class,rtprio,ni,pri,psr,pcpu,stat,wchan:24,comm ax|g nfs
329 329 TS - -5 24 1 0.0 S< worker_thread nfsiod
4690 4690 TS - -5 24 1 0.1 D< get_write_access nfsd
4691 4691 TS - -5 24 0 0.1 D< generic_file_aio_write nfsd
4692 4692 TS - -5 24 0 0.1 D< generic_file_aio_write nfsd
4693 4693 TS - -5 24 1 0.1 D< generic_file_aio_write nfsd
4694 4694 TS - -5 24 1 0.1 D< get_write_access nfsd
4695 4695 TS - -5 24 0 0.1 D< generic_file_aio_write nfsd
4696 4696 TS - -5 24 0 0.1 D< generic_file_aio_write nfsd
4697 4697 TS - -5 24 0 0.1 D< generic_file_aio_write nfsd
Thanks,
Fengguang
> nr_writeback/nr_unstable. And the stuck nr_writeback will freeze
> nr_dirty as well, because the dirtying process is throttled until
> it receives enough "PG_writeback cleared" event, however the bdi-flush
> thread is also blocked when trying to clear more PG_writeback, because
> the client side nr_writeback limit has been reached. In summary,
>
> server blocked => nr_writeback stuck => nr_writeback limit reached
> => bdi-flush blocked => no end_page_writeback() => dirtier blocked
> => nr_dirty stuck
>
> Thanks,
> Fengguang
>
> > > 11045 53987 6490
> > > 11033 53120 8145
> > > 11195 52143 10886
> > > 11211 52144 10913
> > > 11211 52144 10913
> > > 11211 52144 10913
> > >
> > > btrfs seems to maintain a private pool of writeback pages, which can go out of
> > > control:
> > >
> > > nr_writeback nr_dirty
> > > 261075 132
> > > 252891 195
> > > 244795 187
> > > 236851 187
> > > 228830 187
> > > 221040 218
> > > 212674 237
> > > 204981 237
> > >
> > > XFS has very interesting "bumpy writeback" behavior: it tends to wait
> > > collect enough pages and then write the whole world.
> > >
> > > nr_writeback nr_dirty
> > > 80781 0
> > > 37117 37703
> > > 37117 43933
> > > 81044 6
> > > 81050 0
> > > 43943 10199
> > > 43930 36355
> > > 43930 36355
> > > 80293 0
> > > 80285 0
> > > 80285 0
> > >
> > > Thanks,
> > > Fengguang
> > >
> > > --
> > > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> > > the body of a message to majordomo@vger.kernel.org
> > > More majordomo info at http://vger.kernel.org/majordomo-info.html
> > > Please read the FAQ at http://www.tux.org/lkml/
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [PATCH 00/45] some writeback experiments
2009-10-08 5:33 ` Wu Fengguang
@ 2009-10-08 5:44 ` Wu Fengguang
0 siblings, 0 replies; 116+ messages in thread
From: Wu Fengguang @ 2009-10-08 5:44 UTC (permalink / raw)
To: Peter Staubach
Cc: Andrew Morton, Theodore Tso, Christoph Hellwig, Dave Chinner,
Chris Mason, Peter Zijlstra, Li, Shaohua, Myklebust Trond,
jens.axboe@oracle.com, Jan Kara, Nick Piggin,
linux-fsdevel@vger.kernel.org, LKML
On Thu, Oct 08, 2009 at 01:33:35PM +0800, Wu Fengguang wrote:
> On Wed, Oct 07, 2009 at 11:18:22PM +0800, Wu Fengguang wrote:
> > On Wed, Oct 07, 2009 at 09:47:14PM +0800, Peter Staubach wrote:
> > >
> > > > # vmmon -d 1 nr_writeback nr_dirty nr_unstable # (per 1-second samples)
> > > > nr_writeback nr_dirty nr_unstable
> > > > 11227 41463 38044
> > > > 11227 41463 38044
> > > > 11227 41463 38044
> > > > 11227 41463 38044
> >
> > I guess in the above 4 seconds, either client or (more likely) server
> > is blocked. A blocked server cannot send ACKs to knock down both
>
> Yeah, the server side is blocked. The nfsd threads are mostly blocked in
> generic_file_aio_write(), in particular on the i_mutex lock! I'm copying
> one or two big files over NFS, so the i_mutex lock is heavily contended.
>
> I'm using the default wsize=4096 for NFS-root..
Just switched to 512k wsize, and things improved: most of the time the 8
nfsd threads are not all blocked. However, the bumpiness still remains:
nr_writeback nr_dirty nr_unstable
11105 58080 15042
11105 58080 15042
11233 54583 18626
11101 51964 22036
11105 51978 22065
11233 52362 22577
10985 58538 13500
11233 53748 19721
11047 51999 21778
11105 50262 23572
11105 50262 20441
10985 52772 20721
10977 52109 21516
11105 48296 26629
11105 48296 26629
10985 52191 21042
11166 51456 22296
10980 50681 24466
11233 45352 30488
11233 45352 30488
11105 45475 30616
11131 45313 20355
11233 51126 22637
11233 51126 22637
wfg ~% ps -o pid,tid,class,rtprio,ni,pri,psr,pcpu,stat,wchan:24,comm ax|g nfs
329 329 TS - -5 24 1 0.0 S< worker_thread nfsiod
4690 4690 TS - -5 24 1 0.1 S< svc_recv nfsd
4691 4691 TS - -5 24 0 0.1 S< svc_recv nfsd
4692 4692 TS - -5 24 0 0.1 R< ? nfsd
4693 4693 TS - -5 24 0 0.1 S< svc_recv nfsd
4694 4694 TS - -5 24 0 0.1 S< svc_recv nfsd
4695 4695 TS - -5 24 0 0.1 S< svc_recv nfsd
4696 4696 TS - -5 24 0 0.1 S< svc_recv nfsd
4697 4697 TS - -5 24 0 0.1 R< ? nfsd
wfg ~% ps -o pid,tid,class,rtprio,ni,pri,psr,pcpu,stat,wchan:24,comm ax|g nfs
329 329 TS - -5 24 1 0.0 S< worker_thread nfsiod
4690 4690 TS - -5 24 0 0.1 D< generic_file_aio_write nfsd
4691 4691 TS - -5 24 0 0.1 S< svc_recv nfsd
4692 4692 TS - -5 24 1 0.1 D< log_wait_commit nfsd
4693 4693 TS - -5 24 0 0.1 D< generic_file_aio_write nfsd
4694 4694 TS - -5 24 0 0.1 S< svc_recv nfsd
4695 4695 TS - -5 24 0 0.1 D< generic_file_aio_write nfsd
4696 4696 TS - -5 24 0 0.1 D< generic_file_aio_write nfsd
4697 4697 TS - -5 24 1 0.1 D< generic_file_aio_write nfsd
wfg ~% ps -o pid,tid,class,rtprio,ni,pri,psr,pcpu,stat,wchan:24,comm ax|g nfs
329 329 TS - -5 24 1 0.0 S< worker_thread nfsiod
4690 4690 TS - -5 24 1 0.1 S< svc_recv nfsd
4691 4691 TS - -5 24 0 0.1 S< svc_recv nfsd
4692 4692 TS - -5 24 1 0.1 R< ? nfsd
4693 4693 TS - -5 24 1 0.1 R< ? nfsd
4694 4694 TS - -5 24 1 0.1 R< ? nfsd
4695 4695 TS - -5 24 1 0.1 S< svc_recv nfsd
4696 4696 TS - -5 24 0 0.1 S< svc_recv nfsd
4697 4697 TS - -5 24 1 0.1 S< svc_recv nfsd
wfg ~% ps -o pid,tid,class,rtprio,ni,pri,psr,pcpu,stat,wchan:24,comm ax|g nfs
329 329 TS - -5 24 1 0.0 S< worker_thread nfsiod
4690 4690 TS - -5 24 1 0.1 D< generic_file_aio_write nfsd
4691 4691 TS - -5 24 0 0.1 S< svc_recv nfsd
4692 4692 TS - -5 24 1 0.1 D< nfsd_sync nfsd
4693 4693 TS - -5 24 1 0.1 D< sync_buffer nfsd
4694 4694 TS - -5 24 1 0.1 D< generic_file_aio_write nfsd
4695 4695 TS - -5 24 1 0.1 S< svc_recv nfsd
4696 4696 TS - -5 24 0 0.1 S< svc_recv nfsd
4697 4697 TS - -5 24 1 0.1 D< generic_file_aio_write nfsd
Thanks,
Fengguang
> wfg ~% ps -o pid,tid,class,rtprio,ni,pri,psr,pcpu,stat,wchan:24,comm ax|g nfs
> 329 329 TS - -5 24 1 0.0 S< worker_thread nfsiod
> 4690 4690 TS - -5 24 0 0.1 D< generic_file_aio_write nfsd
> 4691 4691 TS - -5 24 0 0.0 D< generic_file_aio_write nfsd
> 4692 4692 TS - -5 24 0 0.0 D< generic_file_aio_write nfsd
> 4693 4693 TS - -5 24 0 0.0 D< generic_file_aio_write nfsd
> 4694 4694 TS - -5 24 0 0.1 D< generic_file_aio_write nfsd
> 4695 4695 TS - -5 24 1 0.1 D< generic_file_aio_write nfsd
> 4696 4696 TS - -5 24 1 0.0 D< log_wait_commit nfsd
> 4697 4697 TS - -5 24 0 0.0 D< generic_file_aio_write nfsd
> wfg ~% ps -o pid,tid,class,rtprio,ni,pri,psr,pcpu,stat,wchan:24,comm ax|g nfs
> 329 329 TS - -5 24 1 0.0 S< worker_thread nfsiod
> 4690 4690 TS - -5 24 0 0.1 D< generic_file_aio_write nfsd
> 4691 4691 TS - -5 24 0 0.0 D< generic_file_aio_write nfsd
> 4692 4692 TS - -5 24 0 0.0 D< generic_file_aio_write nfsd
> 4693 4693 TS - -5 24 0 0.0 D< sync_buffer nfsd
> 4694 4694 TS - -5 24 0 0.1 D< generic_file_aio_write nfsd
> 4695 4695 TS - -5 24 1 0.1 D< generic_file_aio_write nfsd
> 4696 4696 TS - -5 24 1 0.0 D< generic_file_aio_write nfsd
> 4697 4697 TS - -5 24 0 0.0 D< generic_file_aio_write nfsd
>
> wfg ~% ps -o pid,tid,class,rtprio,ni,pri,psr,pcpu,stat,wchan:24,comm ax|g nfs
> 329 329 TS - -5 24 1 0.0 S< worker_thread nfsiod
> 4690 4690 TS - -5 24 0 0.1 D< generic_file_aio_write nfsd
> 4691 4691 TS - -5 24 0 0.1 D< get_request_wait nfsd
> 4692 4692 TS - -5 24 0 0.1 D< generic_file_aio_write nfsd
> 4693 4693 TS - -5 24 0 0.1 S< svc_recv nfsd
> 4694 4694 TS - -5 24 0 0.1 D< generic_file_aio_write nfsd
> 4695 4695 TS - -5 24 0 0.1 D< generic_file_aio_write nfsd
> 4696 4696 TS - -5 24 0 0.1 S< svc_recv nfsd
> 4697 4697 TS - -5 24 1 0.1 D< generic_file_aio_write nfsd
>
> wfg ~% ps -o pid,tid,class,rtprio,ni,pri,psr,pcpu,stat,wchan:24,comm ax|g nfs
> 329 329 TS - -5 24 1 0.0 S< worker_thread nfsiod
> 4690 4690 TS - -5 24 1 0.1 D< get_write_access nfsd
> 4691 4691 TS - -5 24 0 0.1 D< generic_file_aio_write nfsd
> 4692 4692 TS - -5 24 0 0.1 D< generic_file_aio_write nfsd
> 4693 4693 TS - -5 24 1 0.1 D< generic_file_aio_write nfsd
> 4694 4694 TS - -5 24 1 0.1 D< get_write_access nfsd
> 4695 4695 TS - -5 24 0 0.1 D< generic_file_aio_write nfsd
> 4696 4696 TS - -5 24 0 0.1 D< generic_file_aio_write nfsd
> 4697 4697 TS - -5 24 0 0.1 D< generic_file_aio_write nfsd
>
> Thanks,
> Fengguang
>
> > nr_writeback/nr_unstable. And the stuck nr_writeback will freeze
> > nr_dirty as well, because the dirtying process is throttled until
> > it receives enough "PG_writeback cleared" event, however the bdi-flush
> > thread is also blocked when trying to clear more PG_writeback, because
> > the client side nr_writeback limit has been reached. In summary,
> >
> > server blocked => nr_writeback stuck => nr_writeback limit reached
> > => bdi-flush blocked => no end_page_writeback() => dirtier blocked
> > => nr_dirty stuck
> >
> > Thanks,
> > Fengguang
> >
> > > > 11045 53987 6490
> > > > 11033 53120 8145
> > > > 11195 52143 10886
> > > > 11211 52144 10913
> > > > 11211 52144 10913
> > > > 11211 52144 10913
> > > >
> > > > btrfs seems to maintain a private pool of writeback pages, which can go out of
> > > > control:
> > > >
> > > > nr_writeback nr_dirty
> > > > 261075 132
> > > > 252891 195
> > > > 244795 187
> > > > 236851 187
> > > > 228830 187
> > > > 221040 218
> > > > 212674 237
> > > > 204981 237
> > > >
> > > > XFS has very interesting "bumpy writeback" behavior: it tends to wait
> > > > collect enough pages and then write the whole world.
> > > >
> > > > nr_writeback nr_dirty
> > > > 80781 0
> > > > 37117 37703
> > > > 37117 43933
> > > > 81044 6
> > > > 81050 0
> > > > 43943 10199
> > > > 43930 36355
> > > > 43930 36355
> > > > 80293 0
> > > > 80285 0
> > > > 80285 0
> > > >
> > > > Thanks,
> > > > Fengguang
> > > >
> > > > --
> > > > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> > > > the body of a message to majordomo@vger.kernel.org
> > > > More majordomo info at http://vger.kernel.org/majordomo-info.html
> > > > Please read the FAQ at http://www.tux.org/lkml/
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [PATCH 18/45] writeback: introduce wait queue for balance_dirty_pages()
2009-10-08 4:01 ` Wu Fengguang
@ 2009-10-08 5:59 ` KAMEZAWA Hiroyuki
2009-10-08 6:07 ` Wu Fengguang
2009-10-08 6:28 ` Wu Fengguang
0 siblings, 2 replies; 116+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-10-08 5:59 UTC (permalink / raw)
To: Wu Fengguang
Cc: Andrew Morton, Theodore Tso, Christoph Hellwig, Dave Chinner,
Chris Mason, Peter Zijlstra, Li, Shaohua, Myklebust Trond,
jens.axboe@oracle.com, Jan Kara, Nick Piggin,
linux-fsdevel@vger.kernel.org, LKML
On Thu, 8 Oct 2009 12:01:36 +0800
Wu Fengguang <fengguang.wu@intel.com> wrote:
> On Thu, Oct 08, 2009 at 10:40:37AM +0800, KAMEZAWA Hiroyuki wrote:
> > Thank you for clarification.
> > Then, hmm, %iowait (which 'top' shows) didn't work as designed and we need
> > to update throttle_vm_writeout() and some in vmscan.c. Thanks for input.
>
> Thanks, you also remind me to do io_schedule() in the nfs writeback
> wait queue :)
>
good side effect ;)
> > BTW, I'm glad if I can know "how many threads/ios are throttoled now" per bdi.
>
> Good suggestion. How about this patch?
>
Seems attractive.
Hmm..
==
struct dirty_throttle_task {
+ pid_t owner_pid;
==
and show it? (too verbose?)
Regards,
-Kame
> CC: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> ---
> fs/fs-writeback.c | 1 +
> include/linux/backing-dev.h | 2 ++
> mm/backing-dev.c | 33 +++++++++++++++++++++++++++++++++
> 3 files changed, 36 insertions(+)
>
> --- linux.orig/include/linux/backing-dev.h 2009-10-08 11:46:28.000000000 +0800
> +++ linux/include/linux/backing-dev.h 2009-10-08 11:47:39.000000000 +0800
> @@ -101,6 +101,7 @@ struct backing_dev_info {
> #ifdef CONFIG_DEBUG_FS
> struct dentry *debug_dir;
> struct dentry *debug_stats;
> + struct dentry *debug_throttle;
> #endif
> };
>
> @@ -116,6 +117,7 @@ struct backing_dev_info {
> #define DIRTY_THROTTLE_PAGES_STOP (1 << 22)
>
> struct dirty_throttle_task {
> + unsigned long start_time;
> long nr_pages;
> struct list_head list;
> struct completion complete;
> --- linux.orig/mm/backing-dev.c 2009-10-08 11:47:37.000000000 +0800
> +++ linux/mm/backing-dev.c 2009-10-08 11:59:06.000000000 +0800
> @@ -115,6 +115,23 @@ static int bdi_debug_stats_show(struct s
> return 0;
> }
>
> +static int bdi_debug_throttle_show(struct seq_file *m, void *v)
> +{
> + struct backing_dev_info *bdi = m->private;
> + struct dirty_throttle_task *tt;
> + unsigned long flags;
> +
> + spin_lock_irqsave(&bdi->throttle_lock, flags);
> + list_for_each_entry(tt, &bdi->throttle_list, list) {
> + seq_printf(m, "goal=%lukb\twaited=%lums\n",
> + tt->nr_pages << (PAGE_SHIFT - 10),
> + (jiffies - tt->start_time) * 1000 / HZ);
> + }
> + spin_unlock_irqrestore(&bdi->throttle_lock, flags);
> +
> + return 0;
> +}
> +
> static int bdi_debug_stats_open(struct inode *inode, struct file *file)
> {
> return single_open(file, bdi_debug_stats_show, inode->i_private);
> @@ -127,15 +144,31 @@ static const struct file_operations bdi_
> .release = single_release,
> };
>
> +static int bdi_debug_throttle_open(struct inode *inode, struct file *file)
> +{
> + return single_open(file, bdi_debug_throttle_show, inode->i_private);
> +}
> +
> +static const struct file_operations bdi_debug_throttle_fops = {
> + .open = bdi_debug_throttle_open,
> + .read = seq_read,
> + .llseek = seq_lseek,
> + .release = single_release,
> +};
> +
> static void bdi_debug_register(struct backing_dev_info *bdi, const char *name)
> {
> bdi->debug_dir = debugfs_create_dir(name, bdi_debug_root);
> bdi->debug_stats = debugfs_create_file("stats", 0444, bdi->debug_dir,
> bdi, &bdi_debug_stats_fops);
> + bdi->debug_throttle = debugfs_create_file("throttle_list", 0444,
> + bdi->debug_dir, bdi,
> + &bdi_debug_throttle_fops);
> }
>
> static void bdi_debug_unregister(struct backing_dev_info *bdi)
> {
> + debugfs_remove(bdi->debug_throttle);
> debugfs_remove(bdi->debug_stats);
> debugfs_remove(bdi->debug_dir);
> }
> --- linux.orig/fs/fs-writeback.c 2009-10-08 11:44:27.000000000 +0800
> +++ linux/fs/fs-writeback.c 2009-10-08 11:47:39.000000000 +0800
> @@ -284,6 +284,7 @@ static void bdi_calc_write_bandwidth(str
> void bdi_writeback_wait(struct backing_dev_info *bdi, long nr_pages)
> {
> struct dirty_throttle_task tt = {
> + .start_time = jiffies,
> .nr_pages = nr_pages,
> .complete = COMPLETION_INITIALIZER_ONSTACK(tt.complete),
> };
>
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [PATCH 18/45] writeback: introduce wait queue for balance_dirty_pages()
2009-10-08 5:59 ` KAMEZAWA Hiroyuki
@ 2009-10-08 6:07 ` Wu Fengguang
2009-10-08 6:28 ` Wu Fengguang
1 sibling, 0 replies; 116+ messages in thread
From: Wu Fengguang @ 2009-10-08 6:07 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: Andrew Morton, Theodore Tso, Christoph Hellwig, Dave Chinner,
Chris Mason, Peter Zijlstra, Li, Shaohua, Myklebust Trond,
jens.axboe@oracle.com, Jan Kara, Nick Piggin,
linux-fsdevel@vger.kernel.org, LKML
On Thu, Oct 08, 2009 at 01:59:07PM +0800, KAMEZAWA Hiroyuki wrote:
> On Thu, 8 Oct 2009 12:01:36 +0800
> Wu Fengguang <fengguang.wu@intel.com> wrote:
>
> > On Thu, Oct 08, 2009 at 10:40:37AM +0800, KAMEZAWA Hiroyuki wrote:
> > > Thank you for clarification.
> > > Then, hmm, %iowait (which 'top' shows) didn't work as designed and we need
> > > to update throttle_vm_writeout() and some in vmscan.c. Thanks for input.
> >
> > Thanks, you also remind me to do io_schedule() in the nfs writeback
> > wait queue :)
> >
> good side effect ;)
Done, hehe.
> > > BTW, I'd be glad if I could know "how many threads/ios are throttled now" per bdi.
> >
> > Good suggestion. How about this patch?
> >
>
> Seems attractive.
> Hmm..
> ==
> struct dirty_throttle_task {
> + pid_t owner_pid;
> ==
> and show it? (too verbose?)
I even thought of adding the task pointer - it won't go away anyway :)
Thanks,
Fengguang
>
> Regards,
> -Kame
>
> > CC: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> > ---
> > fs/fs-writeback.c | 1 +
> > include/linux/backing-dev.h | 2 ++
> > mm/backing-dev.c | 33 +++++++++++++++++++++++++++++++++
> > 3 files changed, 36 insertions(+)
> >
> > --- linux.orig/include/linux/backing-dev.h 2009-10-08 11:46:28.000000000 +0800
> > +++ linux/include/linux/backing-dev.h 2009-10-08 11:47:39.000000000 +0800
> > @@ -101,6 +101,7 @@ struct backing_dev_info {
> > #ifdef CONFIG_DEBUG_FS
> > struct dentry *debug_dir;
> > struct dentry *debug_stats;
> > + struct dentry *debug_throttle;
> > #endif
> > };
> >
> > @@ -116,6 +117,7 @@ struct backing_dev_info {
> > #define DIRTY_THROTTLE_PAGES_STOP (1 << 22)
> >
> > struct dirty_throttle_task {
> > + unsigned long start_time;
> > long nr_pages;
> > struct list_head list;
> > struct completion complete;
> > --- linux.orig/mm/backing-dev.c 2009-10-08 11:47:37.000000000 +0800
> > +++ linux/mm/backing-dev.c 2009-10-08 11:59:06.000000000 +0800
> > @@ -115,6 +115,23 @@ static int bdi_debug_stats_show(struct s
> > return 0;
> > }
> >
> > +static int bdi_debug_throttle_show(struct seq_file *m, void *v)
> > +{
> > + struct backing_dev_info *bdi = m->private;
> > + struct dirty_throttle_task *tt;
> > + unsigned long flags;
> > +
> > + spin_lock_irqsave(&bdi->throttle_lock, flags);
> > + list_for_each_entry(tt, &bdi->throttle_list, list) {
> > + seq_printf(m, "goal=%lukb\twaited=%lums\n",
> > + tt->nr_pages << (PAGE_SHIFT - 10),
> > + (jiffies - tt->start_time) * 1000 / HZ);
> > + }
> > + spin_unlock_irqrestore(&bdi->throttle_lock, flags);
> > +
> > + return 0;
> > +}
> > +
> > static int bdi_debug_stats_open(struct inode *inode, struct file *file)
> > {
> > return single_open(file, bdi_debug_stats_show, inode->i_private);
> > @@ -127,15 +144,31 @@ static const struct file_operations bdi_
> > .release = single_release,
> > };
> >
> > +static int bdi_debug_throttle_open(struct inode *inode, struct file *file)
> > +{
> > + return single_open(file, bdi_debug_throttle_show, inode->i_private);
> > +}
> > +
> > +static const struct file_operations bdi_debug_throttle_fops = {
> > + .open = bdi_debug_throttle_open,
> > + .read = seq_read,
> > + .llseek = seq_lseek,
> > + .release = single_release,
> > +};
> > +
> > static void bdi_debug_register(struct backing_dev_info *bdi, const char *name)
> > {
> > bdi->debug_dir = debugfs_create_dir(name, bdi_debug_root);
> > bdi->debug_stats = debugfs_create_file("stats", 0444, bdi->debug_dir,
> > bdi, &bdi_debug_stats_fops);
> > + bdi->debug_throttle = debugfs_create_file("throttle_list", 0444,
> > + bdi->debug_dir, bdi,
> > + &bdi_debug_throttle_fops);
> > }
> >
> > static void bdi_debug_unregister(struct backing_dev_info *bdi)
> > {
> > + debugfs_remove(bdi->debug_throttle);
> > debugfs_remove(bdi->debug_stats);
> > debugfs_remove(bdi->debug_dir);
> > }
> > --- linux.orig/fs/fs-writeback.c 2009-10-08 11:44:27.000000000 +0800
> > +++ linux/fs/fs-writeback.c 2009-10-08 11:47:39.000000000 +0800
> > @@ -284,6 +284,7 @@ static void bdi_calc_write_bandwidth(str
> > void bdi_writeback_wait(struct backing_dev_info *bdi, long nr_pages)
> > {
> > struct dirty_throttle_task tt = {
> > + .start_time = jiffies,
> > .nr_pages = nr_pages,
> > .complete = COMPLETION_INITIALIZER_ONSTACK(tt.complete),
> > };
> >
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [PATCH 18/45] writeback: introduce wait queue for balance_dirty_pages()
2009-10-08 5:59 ` KAMEZAWA Hiroyuki
2009-10-08 6:07 ` Wu Fengguang
@ 2009-10-08 6:28 ` Wu Fengguang
2009-10-08 6:39 ` KAMEZAWA Hiroyuki
1 sibling, 1 reply; 116+ messages in thread
From: Wu Fengguang @ 2009-10-08 6:28 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: Andrew Morton, Theodore Tso, Christoph Hellwig, Dave Chinner,
Chris Mason, Peter Zijlstra, Li, Shaohua, Myklebust Trond,
jens.axboe@oracle.com, Jan Kara, Nick Piggin,
linux-fsdevel@vger.kernel.org, LKML
On Thu, Oct 08, 2009 at 01:59:07PM +0800, KAMEZAWA Hiroyuki wrote:
> On Thu, 8 Oct 2009 12:01:36 +0800
> Wu Fengguang <fengguang.wu@intel.com> wrote:
>
> > On Thu, Oct 08, 2009 at 10:40:37AM +0800, KAMEZAWA Hiroyuki wrote:
> > > Thank you for clarification.
> > > Then, hmm, %iowait (which 'top' shows) didn't work as designed and we need
> > > to update throttle_vm_writeout() and some code in vmscan.c. Thanks for input.
> >
> > Thanks, you also remind me to do io_schedule() in the nfs writeback
> > wait queue :)
> >
> good side effect ;)
>
> > > BTW, I'd be glad if I could know "how many threads/ios are throttled now" per bdi.
> >
> > Good suggestion. How about this patch?
> >
>
> Seems attractive.
> Hmm..
> ==
> struct dirty_throttle_task {
> + pid_t owner_pid;
> ==
> and show it? (too verbose?)
Not at all, added comm too :)
---
writeback: show per-bdi throttled tasks
All currently throttled tasks will be listed, showing the pages to
writeback for them, and total wait time since blocked.
# cat /debug/bdi/0:16/throttle_list
goal=768kb waited=36ms pid=3551 comm=cp
CC: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
fs/fs-writeback.c | 6 +++--
include/linux/backing-dev.h | 3 ++
mm/backing-dev.c | 35 ++++++++++++++++++++++++++++++++++
3 files changed, 42 insertions(+), 2 deletions(-)
--- linux.orig/include/linux/backing-dev.h 2009-10-08 12:37:13.000000000 +0800
+++ linux/include/linux/backing-dev.h 2009-10-08 14:10:44.000000000 +0800
@@ -101,6 +101,7 @@ struct backing_dev_info {
#ifdef CONFIG_DEBUG_FS
struct dentry *debug_dir;
struct dentry *debug_stats;
+ struct dentry *debug_throttle;
#endif
};
@@ -116,6 +117,8 @@ struct backing_dev_info {
#define DIRTY_THROTTLE_PAGES_STOP (1 << 22)
struct dirty_throttle_task {
+ struct task_struct *sleeper;
+ unsigned long start_time;
long nr_pages;
struct list_head list;
struct completion complete;
--- linux.orig/mm/backing-dev.c 2009-10-08 12:37:42.000000000 +0800
+++ linux/mm/backing-dev.c 2009-10-08 14:12:35.000000000 +0800
@@ -115,6 +115,25 @@ static int bdi_debug_stats_show(struct s
return 0;
}
+static int bdi_debug_throttle_show(struct seq_file *m, void *v)
+{
+ struct backing_dev_info *bdi = m->private;
+ struct dirty_throttle_task *tt;
+ unsigned long flags;
+
+ spin_lock_irqsave(&bdi->throttle_lock, flags);
+ list_for_each_entry(tt, &bdi->throttle_list, list) {
+ seq_printf(m, "goal=%lukb\twaited=%lums\tpid=%d\tcomm=%s\n",
+ tt->nr_pages << (PAGE_SHIFT - 10),
+ (jiffies - tt->start_time) * 1000 / HZ,
+ tt->sleeper->pid,
+ tt->sleeper->comm);
+ }
+ spin_unlock_irqrestore(&bdi->throttle_lock, flags);
+
+ return 0;
+}
+
static int bdi_debug_stats_open(struct inode *inode, struct file *file)
{
return single_open(file, bdi_debug_stats_show, inode->i_private);
@@ -127,15 +146,31 @@ static const struct file_operations bdi_
.release = single_release,
};
+static int bdi_debug_throttle_open(struct inode *inode, struct file *file)
+{
+ return single_open(file, bdi_debug_throttle_show, inode->i_private);
+}
+
+static const struct file_operations bdi_debug_throttle_fops = {
+ .open = bdi_debug_throttle_open,
+ .read = seq_read,
+ .llseek = seq_lseek,
+ .release = single_release,
+};
+
static void bdi_debug_register(struct backing_dev_info *bdi, const char *name)
{
bdi->debug_dir = debugfs_create_dir(name, bdi_debug_root);
bdi->debug_stats = debugfs_create_file("stats", 0444, bdi->debug_dir,
bdi, &bdi_debug_stats_fops);
+ bdi->debug_throttle = debugfs_create_file("throttle_list", 0444,
+ bdi->debug_dir, bdi,
+ &bdi_debug_throttle_fops);
}
static void bdi_debug_unregister(struct backing_dev_info *bdi)
{
+ debugfs_remove(bdi->debug_throttle);
debugfs_remove(bdi->debug_stats);
debugfs_remove(bdi->debug_dir);
}
--- linux.orig/fs/fs-writeback.c 2009-10-08 12:37:24.000000000 +0800
+++ linux/fs/fs-writeback.c 2009-10-08 14:10:51.000000000 +0800
@@ -284,8 +284,10 @@ static void bdi_calc_write_bandwidth(str
void bdi_writeback_wait(struct backing_dev_info *bdi, long nr_pages)
{
struct dirty_throttle_task tt = {
- .nr_pages = nr_pages,
- .complete = COMPLETION_INITIALIZER_ONSTACK(tt.complete),
+ .sleeper = current,
+ .start_time = jiffies,
+ .nr_pages = nr_pages,
+ .complete = COMPLETION_INITIALIZER_ONSTACK(tt.complete),
};
unsigned long flags;
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [PATCH 18/45] writeback: introduce wait queue for balance_dirty_pages()
2009-10-08 6:28 ` Wu Fengguang
@ 2009-10-08 6:39 ` KAMEZAWA Hiroyuki
0 siblings, 0 replies; 116+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-10-08 6:39 UTC (permalink / raw)
To: Wu Fengguang
Cc: Andrew Morton, Theodore Tso, Christoph Hellwig, Dave Chinner,
Chris Mason, Peter Zijlstra, Li, Shaohua, Myklebust Trond,
jens.axboe@oracle.com, Jan Kara, Nick Piggin,
linux-fsdevel@vger.kernel.org, LKML
On Thu, 8 Oct 2009 14:28:09 +0800
Wu Fengguang <fengguang.wu@intel.com> wrote:
> On Thu, Oct 08, 2009 at 01:59:07PM +0800, KAMEZAWA Hiroyuki wrote:
> > On Thu, 8 Oct 2009 12:01:36 +0800
> > Wu Fengguang <fengguang.wu@intel.com> wrote:
> >
> > > On Thu, Oct 08, 2009 at 10:40:37AM +0800, KAMEZAWA Hiroyuki wrote:
> > > > Thank you for clarification.
> > > > Then, hmm, %iowait (which 'top' shows) didn't work as designed and we need
> > > > to update throttle_vm_writeout() and some code in vmscan.c. Thanks for input.
> > >
> > > Thanks, you also remind me to do io_schedule() in the nfs writeback
> > > wait queue :)
> > >
> > good side effect ;)
> >
> > > > BTW, I'd be glad if I could know "how many threads/ios are throttled now" per bdi.
> > >
> > > Good suggestion. How about this patch?
> > >
> >
> > Seems attractive.
> > Hmm..
> > ==
> > struct dirty_throttle_task {
> > + pid_t owner_pid;
> > ==
> > and show it? (too verbose?)
>
> Not at all, added comm too :)
>
Wow, nice.
Thank you,
-Kame
> ---
> writeback: show per-bdi throttled tasks
>
> All currently throttled tasks will be listed, showing the pages to
> writeback for them, and total wait time since blocked.
>
> # cat /debug/bdi/0:16/throttle_list
> goal=768kb waited=36ms pid=3551 comm=cp
>
>
> CC: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> ---
> fs/fs-writeback.c | 6 +++--
> include/linux/backing-dev.h | 3 ++
> mm/backing-dev.c | 35 ++++++++++++++++++++++++++++++++++
> 3 files changed, 42 insertions(+), 2 deletions(-)
>
> --- linux.orig/include/linux/backing-dev.h 2009-10-08 12:37:13.000000000 +0800
> +++ linux/include/linux/backing-dev.h 2009-10-08 14:10:44.000000000 +0800
> @@ -101,6 +101,7 @@ struct backing_dev_info {
> #ifdef CONFIG_DEBUG_FS
> struct dentry *debug_dir;
> struct dentry *debug_stats;
> + struct dentry *debug_throttle;
> #endif
> };
>
> @@ -116,6 +117,8 @@ struct backing_dev_info {
> #define DIRTY_THROTTLE_PAGES_STOP (1 << 22)
>
> struct dirty_throttle_task {
> + struct task_struct *sleeper;
> + unsigned long start_time;
> long nr_pages;
> struct list_head list;
> struct completion complete;
> --- linux.orig/mm/backing-dev.c 2009-10-08 12:37:42.000000000 +0800
> +++ linux/mm/backing-dev.c 2009-10-08 14:12:35.000000000 +0800
> @@ -115,6 +115,25 @@ static int bdi_debug_stats_show(struct s
> return 0;
> }
>
> +static int bdi_debug_throttle_show(struct seq_file *m, void *v)
> +{
> + struct backing_dev_info *bdi = m->private;
> + struct dirty_throttle_task *tt;
> + unsigned long flags;
> +
> + spin_lock_irqsave(&bdi->throttle_lock, flags);
> + list_for_each_entry(tt, &bdi->throttle_list, list) {
> + seq_printf(m, "goal=%lukb\twaited=%lums\tpid=%d\tcomm=%s\n",
> + tt->nr_pages << (PAGE_SHIFT - 10),
> + (jiffies - tt->start_time) * 1000 / HZ,
> + tt->sleeper->pid,
> + tt->sleeper->comm);
> + }
> + spin_unlock_irqrestore(&bdi->throttle_lock, flags);
> +
> + return 0;
> +}
> +
> static int bdi_debug_stats_open(struct inode *inode, struct file *file)
> {
> return single_open(file, bdi_debug_stats_show, inode->i_private);
> @@ -127,15 +146,31 @@ static const struct file_operations bdi_
> .release = single_release,
> };
>
> +static int bdi_debug_throttle_open(struct inode *inode, struct file *file)
> +{
> + return single_open(file, bdi_debug_throttle_show, inode->i_private);
> +}
> +
> +static const struct file_operations bdi_debug_throttle_fops = {
> + .open = bdi_debug_throttle_open,
> + .read = seq_read,
> + .llseek = seq_lseek,
> + .release = single_release,
> +};
> +
> static void bdi_debug_register(struct backing_dev_info *bdi, const char *name)
> {
> bdi->debug_dir = debugfs_create_dir(name, bdi_debug_root);
> bdi->debug_stats = debugfs_create_file("stats", 0444, bdi->debug_dir,
> bdi, &bdi_debug_stats_fops);
> + bdi->debug_throttle = debugfs_create_file("throttle_list", 0444,
> + bdi->debug_dir, bdi,
> + &bdi_debug_throttle_fops);
> }
>
> static void bdi_debug_unregister(struct backing_dev_info *bdi)
> {
> + debugfs_remove(bdi->debug_throttle);
> debugfs_remove(bdi->debug_stats);
> debugfs_remove(bdi->debug_dir);
> }
> --- linux.orig/fs/fs-writeback.c 2009-10-08 12:37:24.000000000 +0800
> +++ linux/fs/fs-writeback.c 2009-10-08 14:10:51.000000000 +0800
> @@ -284,8 +284,10 @@ static void bdi_calc_write_bandwidth(str
> void bdi_writeback_wait(struct backing_dev_info *bdi, long nr_pages)
> {
> struct dirty_throttle_task tt = {
> - .nr_pages = nr_pages,
> - .complete = COMPLETION_INITIALIZER_ONSTACK(tt.complete),
> + .sleeper = current,
> + .start_time = jiffies,
> + .nr_pages = nr_pages,
> + .complete = COMPLETION_INITIALIZER_ONSTACK(tt.complete),
> };
> unsigned long flags;
>
>
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [PATCH 18/45] writeback: introduce wait queue for balance_dirty_pages()
2009-10-08 1:01 ` KAMEZAWA Hiroyuki
2009-10-08 1:58 ` Wu Fengguang
@ 2009-10-08 8:05 ` Peter Zijlstra
1 sibling, 0 replies; 116+ messages in thread
From: Peter Zijlstra @ 2009-10-08 8:05 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: Wu Fengguang, Andrew Morton, Theodore Tso, Christoph Hellwig,
Dave Chinner, Chris Mason, Li Shaohua, Myklebust Trond,
jens.axboe@oracle.com, Jan Kara, Nick Piggin, linux-fsdevel, LKML
On Thu, 2009-10-08 at 10:01 +0900, KAMEZAWA Hiroyuki wrote:
> May I ask a question ? (maybe not directly related to this patch itself, sorry)
>
> Recent works as "writeback: switch to per-bdi threads for flushing data"
> removed congestion_wait() from balance_dirty_pages() and added
> schedule_timeout_interruptible().
>
> And this one replaces it with wake_up+wait_queue.
>
> IIUC, "iowait" cpustat data was calculated by runqueue->nr_iowait as
> == kernel/schec.c
> void account_idle_time(cputime_t cputime)
> {
> struct cpu_usage_stat *cpustat = &kstat_this_cpu.cpustat;
> cputime64_t cputime64 = cputime_to_cputime64(cputime);
> struct rq *rq = this_rq();
>
> if (atomic_read(&rq->nr_iowait) > 0)
> cpustat->iowait = cputime64_add(cpustat->iowait, cputime64);
> else
> cpustat->idle = cputime64_add(cpustat->idle, cputime64);
> }
> ==
> Then, for showing "cpu is in iowait", runqueue->nr_iowait should be modified
> at some places. In the old kernel, congestion_wait() et al. did that by calling
> io_schedule_timeout().
>
> How this runqueue->nr_iowait is handled now ?
Ah, I think you've got a good point, we need an io_schedule() there.
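A minimal sketch of what such an io_schedule()-style helper does with
rq->nr_iowait (illustrative only, assuming the simplified runqueue fields
quoted above and omitting details such as delay accounting; the caller is
expected to have set the task state beforehand):

/*
 * Sketch: bump nr_iowait around the sleep so that the idle time on this
 * CPU is accounted as iowait rather than plain idle by account_idle_time().
 */
long io_schedule_timeout_sketch(long timeout)
{
	struct rq *rq = this_rq();
	long ret;

	atomic_inc(&rq->nr_iowait);	/* this CPU now has a task in IO wait */
	ret = schedule_timeout(timeout);
	atomic_dec(&rq->nr_iowait);

	return ret;
}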
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [PATCH 18/45] writeback: introduce wait queue for balance_dirty_pages()
2009-10-08 1:58 ` Wu Fengguang
2009-10-08 2:40 ` KAMEZAWA Hiroyuki
@ 2009-10-08 8:08 ` Peter Zijlstra
2009-10-08 8:11 ` KAMEZAWA Hiroyuki
2009-10-08 8:36 ` Jens Axboe
1 sibling, 2 replies; 116+ messages in thread
From: Peter Zijlstra @ 2009-10-08 8:08 UTC (permalink / raw)
To: Wu Fengguang
Cc: KAMEZAWA Hiroyuki, Andrew Morton, Theodore Tso, Christoph Hellwig,
Dave Chinner, Chris Mason, Li, Shaohua, Myklebust Trond,
jens.axboe@oracle.com, Jan Kara, Nick Piggin,
linux-fsdevel@vger.kernel.org, LKML
On Thu, 2009-10-08 at 09:58 +0800, Wu Fengguang wrote:
>
> > How this runqueue->nr_iowait is handled now ?
>
> Good question. io_schedule() has an old comment for throttling IO wait:
>
> * But don't do that if it is a deliberate, throttling IO wait (this task
> * has set its backing_dev_info: the queue against which it should throttle)
> */
> void __sched io_schedule(void)
>
> So it looks like both Jens' patch and this one behave right in ignoring the
> iowait accounting for balance_dirty_pages() :)
Well it is a change in behaviour, and I think IOWAIT makes sense when
we're blocked due to io throttle..
Hmm?
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [PATCH 18/45] writeback: introduce wait queue for balance_dirty_pages()
2009-10-08 8:08 ` Peter Zijlstra
@ 2009-10-08 8:11 ` KAMEZAWA Hiroyuki
2009-10-08 8:36 ` Jens Axboe
1 sibling, 0 replies; 116+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-10-08 8:11 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Wu Fengguang, Andrew Morton, Theodore Tso, Christoph Hellwig,
Dave Chinner, Chris Mason, Li, Shaohua, Myklebust Trond,
jens.axboe@oracle.com, Jan Kara, Nick Piggin,
linux-fsdevel@vger.kernel.org, LKML
On Thu, 08 Oct 2009 10:08:36 +0200
Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> On Thu, 2009-10-08 at 09:58 +0800, Wu Fengguang wrote:
> >
> > > How this runqueue->nr_iowait is handled now ?
> >
> > Good question. io_schedule() has an old comment for throttling IO wait:
> >
> > * But don't do that if it is a deliberate, throttling IO wait (this task
> > * has set its backing_dev_info: the queue against which it should throttle)
> > */
> > void __sched io_schedule(void)
> >
> > So it looks like both Jens' patch and this one behave right in ignoring the
> > iowait accounting for balance_dirty_pages() :)
>
> Well it is a change in behaviour, and I think IOWAIT makes sense when
> we're blocked due to io throttle..
>
> Hmm?
>
The above comment "don't do that if it is a deliberate, throttling IO wait" is
really old but has been ignored.
I personally don't like to change the meaning of iowait in /proc/stat.
But I'm not sure which is better: changing the definition (which was ignored) or
fixing the behavior (which has not been correct for a very long time)...
Hmm?, too ;)
Regards,
-Kame
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [PATCH 18/45] writeback: introduce wait queue for balance_dirty_pages()
2009-10-08 8:08 ` Peter Zijlstra
2009-10-08 8:11 ` KAMEZAWA Hiroyuki
@ 2009-10-08 8:36 ` Jens Axboe
2009-10-09 2:52 ` [PATCH] writeback: account IO throttling wait as iowait Wu Fengguang
1 sibling, 1 reply; 116+ messages in thread
From: Jens Axboe @ 2009-10-08 8:36 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Wu Fengguang, KAMEZAWA Hiroyuki, Andrew Morton, Theodore Tso,
Christoph Hellwig, Dave Chinner, Chris Mason, Li, Shaohua,
Myklebust Trond, Jan Kara, Nick Piggin,
linux-fsdevel@vger.kernel.org, LKML
On Thu, Oct 08 2009, Peter Zijlstra wrote:
> On Thu, 2009-10-08 at 09:58 +0800, Wu Fengguang wrote:
> >
> > > How this runqueue->nr_iowait is handled now ?
> >
> > Good question. io_schedule() has an old comment for throttling IO wait:
> >
> > * But don't do that if it is a deliberate, throttling IO wait (this task
> > * has set its backing_dev_info: the queue against which it should throttle)
> > */
> > void __sched io_schedule(void)
> >
> > So it looks like both Jens' patch and this one behave right in ignoring the
> > iowait accounting for balance_dirty_pages() :)
>
> Well it is a change in behaviour, and I think IOWAIT makes sense when
> we're blocked due to io throttle..
>
> Hmm?
Yep agree, if we're deliberately waiting on IO, it should count as
iowait time.
--
Jens Axboe
^ permalink raw reply [flat|nested] 116+ messages in thread
* [PATCH] writeback: account IO throttling wait as iowait
2009-10-08 8:36 ` Jens Axboe
@ 2009-10-09 2:52 ` Wu Fengguang
2009-10-09 10:41 ` Jens Axboe
0 siblings, 1 reply; 116+ messages in thread
From: Wu Fengguang @ 2009-10-09 2:52 UTC (permalink / raw)
To: Jens Axboe
Cc: Peter Zijlstra, KAMEZAWA Hiroyuki, Andrew Morton, Theodore Tso,
Christoph Hellwig, Dave Chinner, Chris Mason, Li, Shaohua,
Myklebust Trond, Jan Kara, Nick Piggin,
linux-fsdevel@vger.kernel.org, LKML
On Thu, Oct 08, 2009 at 04:36:09PM +0800, Jens Axboe wrote:
> On Thu, Oct 08 2009, Peter Zijlstra wrote:
> > On Thu, 2009-10-08 at 09:58 +0800, Wu Fengguang wrote:
> > >
> > > > How this runqueue->nr_iowait is handled now ?
> > >
> > > Good question. io_schedule() has an old comment for throttling IO wait:
> > >
> > > * But don't do that if it is a deliberate, throttling IO wait (this task
> > > * has set its backing_dev_info: the queue against which it should throttle)
> > > */
> > > void __sched io_schedule(void)
> > >
> > > So it looks like both Jens' patch and this one behave right in ignoring the
> > > iowait accounting for balance_dirty_pages() :)
> >
> > Well it is a change in behaviour, and I think IOWAIT makes sense when
> > we're blocked due to io throttle..
> >
> > Hmm?
>
> Yep agree, if we're deliberately waiting on IO, it should count as
> iowait time.
Then let's revert to the old behavior :)
For one single cp, it increases iowait from 29% to 56%.
Before patch:
----total-cpu-usage---- -dsk/total- -net/total- ---paging-- ---system--
usr sys idl wai hiq siq| read writ| recv send| in out | int csw
0 4 64 28 0 3| 0 0 | 272k 10M| 0 0 |1854 863
0 6 69 23 0 3| 0 0 | 249k 11M| 0 0 |1709 865
0 6 64 27 0 4| 0 0 | 235k 10M| 0 0 |1807 788
0 4 61 30 0 4| 0 0 | 271k 12M| 0 0 |1910 898
0 4 72 21 0 4| 0 0 | 289k 13M| 0 0 |1832 905
0 6 58 35 0 2| 0 0 | 252k 11M| 0 0 |1713 900
0 4 54 38 0 4| 0 0 | 257k 11M| 0 0 |1777 841
0 5 59 30 0 7| 0 0 | 270k 12M| 0 0 |1758 836
After patch:
----total-cpu-usage---- -dsk/total- -net/total- ---paging-- ---system--
usr sys idl wai hiq siq| read writ| recv send| in out | int csw
0 5 35 57 0 4| 0 0 | 255k 11M| 0 0 |1705 879
0 4 38 53 0 4| 0 0 | 326k 14M| 0 0 |1940 980
0 3 36 59 0 2| 0 0 | 291k 13M| 0 0 |1970 970
0 4 28 66 0 2| 0 0 | 290k 13M| 0 0 |1805 928
0 6 38 54 0 3| 0 0 | 230k 10M| 0 0 |1866 842
0 5 44 49 0 4| 0 0 | 278k 12M| 0 0 |1808 868
Thanks,
Fengguang
---
writeback: account IO throttling wait as iowait
It makes sense to do IOWAIT when someone is blocked
due to IO throttle, as suggested by Kame and Peter.
There is an old comment for not doing IOWAIT on throttle,
however it has been mismatching the code for a long time.
If we stop accounting IOWAIT for 2.6.32, it could be an
undesirable behavior change. So restore the io_schedule.
CC: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
CC: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
kernel/sched.c | 3 ---
mm/page-writeback.c | 3 ++-
2 files changed, 2 insertions(+), 4 deletions(-)
--- linux.orig/mm/page-writeback.c 2009-10-09 10:40:19.000000000 +0800
+++ linux/mm/page-writeback.c 2009-10-09 10:40:20.000000000 +0800
@@ -566,7 +566,8 @@ static void balance_dirty_pages(struct a
if (pages_written >= write_chunk)
break; /* We've done our duty */
- schedule_timeout_interruptible(pause);
+ __set_current_state(TASK_INTERRUPTIBLE);
+ io_schedule_timeout(pause);
/*
* Increase the delay for each loop, up to our previous
--- linux.orig/kernel/sched.c 2009-10-09 10:40:30.000000000 +0800
+++ linux/kernel/sched.c 2009-10-09 10:40:51.000000000 +0800
@@ -6720,9 +6720,6 @@ EXPORT_SYMBOL(yield);
/*
* This task is about to go to sleep on IO. Increment rq->nr_iowait so
* that process accounting knows that this is a task in IO wait state.
- *
- * But don't do that if it is a deliberate, throttling IO wait (this task
- * has set its backing_dev_info: the queue against which it should throttle)
*/
void __sched io_schedule(void)
{
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [PATCH] writeback: account IO throttling wait as iowait
2009-10-09 2:52 ` [PATCH] writeback: account IO throttling wait as iowait Wu Fengguang
@ 2009-10-09 10:41 ` Jens Axboe
2009-10-09 10:58 ` Wu Fengguang
0 siblings, 1 reply; 116+ messages in thread
From: Jens Axboe @ 2009-10-09 10:41 UTC (permalink / raw)
To: Wu Fengguang
Cc: Peter Zijlstra, KAMEZAWA Hiroyuki, Andrew Morton, Theodore Tso,
Christoph Hellwig, Dave Chinner, Chris Mason, Li, Shaohua,
Myklebust Trond, Jan Kara, Nick Piggin,
linux-fsdevel@vger.kernel.org, LKML
On Fri, Oct 09 2009, Wu Fengguang wrote:
> On Thu, Oct 08, 2009 at 04:36:09PM +0800, Jens Axboe wrote:
> > On Thu, Oct 08 2009, Peter Zijlstra wrote:
> > > On Thu, 2009-10-08 at 09:58 +0800, Wu Fengguang wrote:
> > > >
> > > > > How this runqueue->nr_iowait is handled now ?
> > > >
> > > > Good question. io_schedule() has an old comment for throttling IO wait:
> > > >
> > > > * But don't do that if it is a deliberate, throttling IO wait (this task
> > > > * has set its backing_dev_info: the queue against which it should throttle)
> > > > */
> > > > void __sched io_schedule(void)
> > > >
> > > > So it looks like both Jens' patch and this one behave right in ignoring the
> > > > iowait accounting for balance_dirty_pages() :)
> > >
> > > Well it is a change in behaviour, and I think IOWAIT makes sense when
> > > we're blocked due to io throttle..
> > >
> > > Hmm?
> >
> > Yep agree, if we're deliberately waiting on IO, it should count as
> > iowait time.
>
> Then let's revert to the old behavior :)
>
> For one single cp, it increases iowait from 29% to 56%.
>
> Before patch:
>
> ----total-cpu-usage---- -dsk/total- -net/total- ---paging-- ---system--
> usr sys idl wai hiq siq| read writ| recv send| in out | int csw
> 0 4 64 28 0 3| 0 0 | 272k 10M| 0 0 |1854 863
> 0 6 69 23 0 3| 0 0 | 249k 11M| 0 0 |1709 865
> 0 6 64 27 0 4| 0 0 | 235k 10M| 0 0 |1807 788
> 0 4 61 30 0 4| 0 0 | 271k 12M| 0 0 |1910 898
> 0 4 72 21 0 4| 0 0 | 289k 13M| 0 0 |1832 905
> 0 6 58 35 0 2| 0 0 | 252k 11M| 0 0 |1713 900
> 0 4 54 38 0 4| 0 0 | 257k 11M| 0 0 |1777 841
> 0 5 59 30 0 7| 0 0 | 270k 12M| 0 0 |1758 836
>
> After patch:
>
> ----total-cpu-usage---- -dsk/total- -net/total- ---paging-- ---system--
> usr sys idl wai hiq siq| read writ| recv send| in out | int csw
> 0 5 35 57 0 4| 0 0 | 255k 11M| 0 0 |1705 879
> 0 4 38 53 0 4| 0 0 | 326k 14M| 0 0 |1940 980
> 0 3 36 59 0 2| 0 0 | 291k 13M| 0 0 |1970 970
> 0 4 28 66 0 2| 0 0 | 290k 13M| 0 0 |1805 928
> 0 6 38 54 0 3| 0 0 | 230k 10M| 0 0 |1866 842
> 0 5 44 49 0 4| 0 0 | 278k 12M| 0 0 |1808 868
>
> Thanks,
> Fengguang
> ---
> writeback: account IO throttling wait as iowait
>
> It makes sense to do IOWAIT when someone is blocked
> due to IO throttle, as suggested by Kame and Peter.
>
> There is an old comment for not doing IOWAIT on throttle,
> however it has been mismatching the code for a long time.
>
> If we stop accounting IOWAIT for 2.6.32, it could be an
> undesirable behavior change. So restore the io_schedule.
Thanks, queued up.
--
Jens Axboe
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [PATCH] writeback: account IO throttling wait as iowait
2009-10-09 10:41 ` Jens Axboe
@ 2009-10-09 10:58 ` Wu Fengguang
2009-10-09 11:01 ` Jens Axboe
0 siblings, 1 reply; 116+ messages in thread
From: Wu Fengguang @ 2009-10-09 10:58 UTC (permalink / raw)
To: Jens Axboe
Cc: Peter Zijlstra, KAMEZAWA Hiroyuki, Andrew Morton, Theodore Tso,
Christoph Hellwig, Dave Chinner, Chris Mason, Li, Shaohua,
Myklebust Trond, Jan Kara, Nick Piggin,
linux-fsdevel@vger.kernel.org, LKML
On Fri, Oct 09, 2009 at 06:41:05PM +0800, Jens Axboe wrote:
> > ---
> > writeback: account IO throttling wait as iowait
> >
> > It makes sense to do IOWAIT when someone is blocked
> > due to IO throttle, as suggested by Kame and Peter.
> >
> > There is an old comment for not doing IOWAIT on throttle,
> > however it has been mismatching the code for a long time.
> >
> > If we stop accounting IOWAIT for 2.6.32, it could be an
> > undesirable behavior change. So restore the io_schedule.
>
> Thanks, queued up.
Thank you. Would you also pick up this one if it's OK with you?
Thanks,
Fengguang
---
writeback: kill space in debugfs item name
The space is not script friendly, kill it.
CC: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
mm/backing-dev.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
--- linux.orig/mm/backing-dev.c 2009-10-09 10:05:27.000000000 +0800
+++ linux/mm/backing-dev.c 2009-10-09 18:55:27.000000000 +0800
@@ -92,7 +92,7 @@ static int bdi_debug_stats_show(struct s
"BdiDirtyThresh: %8lu kB\n"
"DirtyThresh: %8lu kB\n"
"BackgroundThresh: %8lu kB\n"
- "WriteBack threads:%8lu\n"
+ "WritebackThreads: %8lu\n"
"b_dirty: %8lu\n"
"b_io: %8lu\n"
"b_more_io: %8lu\n"
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [PATCH] writeback: account IO throttling wait as iowait
2009-10-09 10:58 ` Wu Fengguang
@ 2009-10-09 11:01 ` Jens Axboe
0 siblings, 0 replies; 116+ messages in thread
From: Jens Axboe @ 2009-10-09 11:01 UTC (permalink / raw)
To: Wu Fengguang
Cc: Peter Zijlstra, KAMEZAWA Hiroyuki, Andrew Morton, Theodore Tso,
Christoph Hellwig, Dave Chinner, Chris Mason, Li, Shaohua,
Myklebust Trond, Jan Kara, Nick Piggin,
linux-fsdevel@vger.kernel.org, LKML
On Fri, Oct 09 2009, Wu Fengguang wrote:
> On Fri, Oct 09, 2009 at 06:41:05PM +0800, Jens Axboe wrote:
> > > ---
> > > writeback: account IO throttling wait as iowait
> > >
> > > It makes sense to do IOWAIT when someone is blocked
> > > due to IO throttle, as suggested by Kame and Peter.
> > >
> > > There is an old comment for not doing IOWAIT on throttle,
> > > however it has been mismatching the code for a long time.
> > >
> > > If we stop accounting IOWAIT for 2.6.32, it could be an
> > > undesirable behavior change. So restore the io_schedule.
> >
> > Thanks, queued up.
>
> Thank you. Would you also pick up this one if it's OK to you?
Sure, looks appropriate.
--
Jens Axboe
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [PATCH 01/45] writeback: reduce calls to global_page_state in balance_dirty_pages()
2009-10-07 7:38 ` [PATCH 01/45] writeback: reduce calls to global_page_state in balance_dirty_pages() Wu Fengguang
@ 2009-10-09 15:12 ` Jan Kara
2009-10-09 15:18 ` Peter Zijlstra
2009-10-10 21:33 ` Wu Fengguang
0 siblings, 2 replies; 116+ messages in thread
From: Jan Kara @ 2009-10-09 15:12 UTC (permalink / raw)
To: Wu Fengguang
Cc: Andrew Morton, Theodore Tso, Christoph Hellwig, Dave Chinner,
Chris Mason, Peter Zijlstra, Li Shaohua, Myklebust Trond,
jens.axboe@oracle.com, Jan Kara, Nick Piggin, linux-fsdevel,
Richard Kennedy, LKML
Hi,
On Wed 07-10-09 15:38:19, Wu Fengguang wrote:
> From: Richard Kennedy <richard@rsk.demon.co.uk>
>
> Reducing the number of times balance_dirty_pages calls global_page_state
> reduces the cache references and so improves write performance on a
> variety of workloads.
>
> 'perf stats' of simple fio write tests shows the reduction in cache
> access.
> Where the test is fio 'write,mmap,600Mb,pre_read' on AMD AthlonX2 with
> 3Gb memory (dirty_threshold approx 600 Mb)
> running each test 10 times, dropping the fasted & slowest values then
> taking
> the average & standard deviation
>
> average (s.d.) in millions (10^6)
> 2.6.31-rc8 648.6 (14.6)
> +patch 620.1 (16.5)
>
> Achieving this reduction is by dropping clip_bdi_dirty_limit as it
> rereads the counters to apply the dirty_threshold and moving this check
> up into balance_dirty_pages where it has already read the counters.
>
> Also by rearrange the for loop to only contain one copy of the limit
> tests allows the pdflush test after the loop to use the local copies of
> the counters rather than rereading them.
>
> In the common case with no throttling it now calls global_page_state 5
> fewer times and bdi_stat 2 fewer.
Hmm, but the patch changes the behavior of balance_dirty_pages() in
several ways:
> -/*
> - * Clip the earned share of dirty pages to that which is actually available.
> - * This avoids exceeding the total dirty_limit when the floating averages
> - * fluctuate too quickly.
> - */
> -static void clip_bdi_dirty_limit(struct backing_dev_info *bdi,
> - unsigned long dirty, unsigned long *pbdi_dirty)
> -{
> - unsigned long avail_dirty;
> -
> - avail_dirty = global_page_state(NR_FILE_DIRTY) +
> - global_page_state(NR_WRITEBACK) +
> - global_page_state(NR_UNSTABLE_NFS) +
> - global_page_state(NR_WRITEBACK_TEMP);
> -
> - if (avail_dirty < dirty)
> - avail_dirty = dirty - avail_dirty;
> - else
> - avail_dirty = 0;
> -
> - avail_dirty += bdi_stat(bdi, BDI_RECLAIMABLE) +
> - bdi_stat(bdi, BDI_WRITEBACK);
> -
> - *pbdi_dirty = min(*pbdi_dirty, avail_dirty);
> -}
> -
> static inline void task_dirties_fraction(struct task_struct *tsk,
> long *numerator, long *denominator)
> {
> @@ -468,7 +442,6 @@ get_dirty_limits(unsigned long *pbackgro
> bdi_dirty = dirty * bdi->max_ratio / 100;
>
> *pbdi_dirty = bdi_dirty;
> - clip_bdi_dirty_limit(bdi, dirty, pbdi_dirty);
I don't see what test in balance_dirty_pages() should replace this
clipping... OTOH clipping does not seem to have too much effect on the
behavior of balance_dirty_pages - the limit we clip to (at least
BDI_WRITEBACK + BDI_RECLAIMABLE) is large enough so that we break from the
loop immediately. So just getting rid of the function is fine but
I'd update the changelog accordingly.
> @@ -503,16 +476,36 @@ static void balance_dirty_pages(struct a
> };
>
> get_dirty_limits(&background_thresh, &dirty_thresh,
> - &bdi_thresh, bdi);
> + &bdi_thresh, bdi);
>
> nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
> - global_page_state(NR_UNSTABLE_NFS);
> - nr_writeback = global_page_state(NR_WRITEBACK);
> + global_page_state(NR_UNSTABLE_NFS);
> + nr_writeback = global_page_state(NR_WRITEBACK) +
> + global_page_state(NR_WRITEBACK_TEMP);
>
> - bdi_nr_reclaimable = bdi_stat(bdi, BDI_RECLAIMABLE);
> - bdi_nr_writeback = bdi_stat(bdi, BDI_WRITEBACK);
> + /*
> + * In order to avoid the stacked BDI deadlock we need
> + * to ensure we accurately count the 'dirty' pages when
> + * the threshold is low.
> + *
> + * Otherwise it would be possible to get thresh+n pages
> + * reported dirty, even though there are thresh-m pages
> + * actually dirty; with m+n sitting in the percpu
> + * deltas.
> + */
> + if (bdi_thresh < 2*bdi_stat_error(bdi)) {
> + bdi_nr_reclaimable = bdi_stat_sum(bdi, BDI_RECLAIMABLE);
> + bdi_nr_writeback = bdi_stat_sum(bdi, BDI_WRITEBACK);
> + } else {
> + bdi_nr_reclaimable = bdi_stat(bdi, BDI_RECLAIMABLE);
> + bdi_nr_writeback = bdi_stat(bdi, BDI_WRITEBACK);
> + }
> +
> + dirty_exceeded =
> + (bdi_nr_reclaimable + bdi_nr_writeback >= bdi_thresh)
> + || (nr_reclaimable + nr_writeback >= dirty_thresh);
>
> - if (bdi_nr_reclaimable + bdi_nr_writeback <= bdi_thresh)
> + if (!dirty_exceeded)
> break;
Ugh, but this is not equivalent! We would block the writer on some BDI
without any dirty data if we are over global dirty limit. That didn't
happen before.
> /*
> @@ -521,7 +514,7 @@ static void balance_dirty_pages(struct a
> * when the bdi limits are ramping up.
> */
> if (nr_reclaimable + nr_writeback <
> - (background_thresh + dirty_thresh) / 2)
> + (background_thresh + dirty_thresh) / 2)
> break;
>
> if (!bdi->dirty_exceeded)
> @@ -539,33 +532,10 @@ static void balance_dirty_pages(struct a
> if (bdi_nr_reclaimable > bdi_thresh) {
> writeback_inodes_wbc(&wbc);
> pages_written += write_chunk - wbc.nr_to_write;
> - get_dirty_limits(&background_thresh, &dirty_thresh,
> - &bdi_thresh, bdi);
> - }
> -
> - /*
> - * In order to avoid the stacked BDI deadlock we need
> - * to ensure we accurately count the 'dirty' pages when
> - * the threshold is low.
> - *
> - * Otherwise it would be possible to get thresh+n pages
> - * reported dirty, even though there are thresh-m pages
> - * actually dirty; with m+n sitting in the percpu
> - * deltas.
> - */
> - if (bdi_thresh < 2*bdi_stat_error(bdi)) {
> - bdi_nr_reclaimable = bdi_stat_sum(bdi, BDI_RECLAIMABLE);
> - bdi_nr_writeback = bdi_stat_sum(bdi, BDI_WRITEBACK);
> - } else if (bdi_nr_reclaimable) {
> - bdi_nr_reclaimable = bdi_stat(bdi, BDI_RECLAIMABLE);
> - bdi_nr_writeback = bdi_stat(bdi, BDI_WRITEBACK);
> + /* don't wait if we've done enough */
> + if (pages_written >= write_chunk)
> + break;
> }
> -
> - if (bdi_nr_reclaimable + bdi_nr_writeback <= bdi_thresh)
> - break;
> - if (pages_written >= write_chunk)
> - break; /* We've done our duty */
> -
Here, we had an opportunity to break from the loop even if we didn't
manage to write everything (for example because per-bdi thread managed to
write enough or because enough IO has completed while we were trying to
write). After the patch, we will sleep. IMHO that's not good...
I'd think that if we did all that work in writeback_inodes_wbc we could
spend the effort on regetting and rechecking the stats...
> schedule_timeout_interruptible(pause);
>
> /*
> @@ -577,8 +547,7 @@ static void balance_dirty_pages(struct a
> pause = HZ / 10;
> }
>
> - if (bdi_nr_reclaimable + bdi_nr_writeback < bdi_thresh &&
> - bdi->dirty_exceeded)
> + if (!dirty_exceeded && bdi->dirty_exceeded)
> bdi->dirty_exceeded = 0;
Here we fail to clear dirty_exceeded if we are over global dirty limit
but not over per-bdi dirty limit...
> @@ -593,9 +562,7 @@ static void balance_dirty_pages(struct a
> * background_thresh, to keep the amount of dirty memory low.
> */
> if ((laptop_mode && pages_written) ||
> - (!laptop_mode && ((global_page_state(NR_FILE_DIRTY)
> - + global_page_state(NR_UNSTABLE_NFS))
> - > background_thresh)))
> + (!laptop_mode && (nr_reclaimable > background_thresh)))
> bdi_start_writeback(bdi, NULL, 0);
> }
This might be based on rather old values in case we break from the loop
after calling writeback_inodes_wbc.
Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [PATCH 01/45] writeback: reduce calls to global_page_state in balance_dirty_pages()
2009-10-09 15:12 ` Jan Kara
@ 2009-10-09 15:18 ` Peter Zijlstra
2009-10-09 15:47 ` Jan Kara
2009-10-10 21:33 ` Wu Fengguang
1 sibling, 1 reply; 116+ messages in thread
From: Peter Zijlstra @ 2009-10-09 15:18 UTC (permalink / raw)
To: Jan Kara
Cc: Wu Fengguang, Andrew Morton, Theodore Tso, Christoph Hellwig,
Dave Chinner, Chris Mason, Li Shaohua, Myklebust Trond,
jens.axboe@oracle.com, Nick Piggin, linux-fsdevel,
Richard Kennedy, LKML
On Fri, 2009-10-09 at 17:12 +0200, Jan Kara wrote:
> Ugh, but this is not equivalent! We would block the writer on some BDI
> without any dirty data if we are over global dirty limit. That didn't
> happen before.
It should have, we should throttle everything calling
balance_dirty_pages() when we're over the total limit.
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [PATCH 04/45] writeback: remove unused nonblocking and congestion checks
2009-10-07 7:38 ` [PATCH 04/45] writeback: remove unused nonblocking and congestion checks Wu Fengguang
@ 2009-10-09 15:26 ` Jan Kara
2009-10-10 13:47 ` Wu Fengguang
0 siblings, 1 reply; 116+ messages in thread
From: Jan Kara @ 2009-10-09 15:26 UTC (permalink / raw)
To: Wu Fengguang
Cc: Andrew Morton, Theodore Tso, Christoph Hellwig, Dave Chinner,
Chris Mason, Peter Zijlstra, Li Shaohua, Myklebust Trond,
jens.axboe@oracle.com, Jan Kara, Nick Piggin, linux-fsdevel, LKML
On Wed 07-10-09 15:38:22, Wu Fengguang wrote:
> - no one is calling wb_writeback and write_cache_pages with
> wbc.nonblocking=1 any more
> - lumpy pageout will want to do nonblocking writeback without the
> congestion wait
>
> So remove the congestion checks as suggested by Chris.
Looks good. Since encountered_congestion isn't used, you can delete it as
well... BTW, you might need to split this patch into per-fs chunks for the
sake of merging.
Honza
>
> CC: Chris Mason <chris.mason@oracle.com>
> CC: Jens Axboe <jens.axboe@oracle.com>
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> ---
> drivers/staging/pohmelfs/inode.c | 9 ---------
> fs/afs/write.c | 16 +---------------
> fs/cifs/file.c | 10 ----------
> fs/fs-writeback.c | 8 --------
> fs/gfs2/aops.c | 10 ----------
> fs/xfs/linux-2.6/xfs_aops.c | 6 +-----
> mm/page-writeback.c | 12 ------------
> 7 files changed, 2 insertions(+), 69 deletions(-)
>
> --- linux.orig/fs/fs-writeback.c 2009-10-06 23:31:54.000000000 +0800
> +++ linux/fs/fs-writeback.c 2009-10-06 23:31:59.000000000 +0800
> @@ -660,14 +660,6 @@ static void writeback_inodes_wb(struct b
> continue;
> }
>
> - if (wbc->nonblocking && bdi_write_congested(wb->bdi)) {
> - wbc->encountered_congestion = 1;
> - if (!is_blkdev_sb)
> - break; /* Skip a congested fs */
> - requeue_io(inode);
> - continue; /* Skip a congested blockdev */
> - }
> -
> /*
> * Was this inode dirtied after sync_sb_inodes was called?
> * This keeps sync from extra jobs and livelock.
> --- linux.orig/mm/page-writeback.c 2009-10-06 23:31:54.000000000 +0800
> +++ linux/mm/page-writeback.c 2009-10-06 23:31:59.000000000 +0800
> @@ -787,7 +787,6 @@ int write_cache_pages(struct address_spa
> struct writeback_control *wbc, writepage_t writepage,
> void *data)
> {
> - struct backing_dev_info *bdi = mapping->backing_dev_info;
> int ret = 0;
> int done = 0;
> struct pagevec pvec;
> @@ -800,11 +799,6 @@ int write_cache_pages(struct address_spa
> int range_whole = 0;
> long nr_to_write = wbc->nr_to_write;
>
> - if (wbc->nonblocking && bdi_write_congested(bdi)) {
> - wbc->encountered_congestion = 1;
> - return 0;
> - }
> -
> pagevec_init(&pvec, 0);
> if (wbc->range_cyclic) {
> writeback_index = mapping->writeback_index; /* prev offset */
> @@ -923,12 +917,6 @@ continue_unlock:
> break;
> }
> }
> -
> - if (wbc->nonblocking && bdi_write_congested(bdi)) {
> - wbc->encountered_congestion = 1;
> - done = 1;
> - break;
> - }
> }
> pagevec_release(&pvec);
> cond_resched();
> --- linux.orig/drivers/staging/pohmelfs/inode.c 2009-10-06 23:31:41.000000000 +0800
> +++ linux/drivers/staging/pohmelfs/inode.c 2009-10-06 23:31:59.000000000 +0800
> @@ -152,11 +152,6 @@ static int pohmelfs_writepages(struct ad
> int scanned = 0;
> int range_whole = 0;
>
> - if (wbc->nonblocking && bdi_write_congested(bdi)) {
> - wbc->encountered_congestion = 1;
> - return 0;
> - }
> -
> if (wbc->range_cyclic) {
> index = mapping->writeback_index; /* Start from prev offset */
> end = -1;
> @@ -248,10 +243,6 @@ retry:
>
> if (wbc->nr_to_write <= 0)
> done = 1;
> - if (wbc->nonblocking && bdi_write_congested(bdi)) {
> - wbc->encountered_congestion = 1;
> - done = 1;
> - }
>
> continue;
> out_continue:
> --- linux.orig/fs/afs/write.c 2009-10-06 23:31:41.000000000 +0800
> +++ linux/fs/afs/write.c 2009-10-06 23:31:59.000000000 +0800
> @@ -455,8 +455,6 @@ int afs_writepage(struct page *page, str
> }
>
> wbc->nr_to_write -= ret;
> - if (wbc->nonblocking && bdi_write_congested(bdi))
> - wbc->encountered_congestion = 1;
>
> _leave(" = 0");
> return 0;
> @@ -529,11 +527,6 @@ static int afs_writepages_region(struct
>
> wbc->nr_to_write -= ret;
>
> - if (wbc->nonblocking && bdi_write_congested(bdi)) {
> - wbc->encountered_congestion = 1;
> - break;
> - }
> -
> cond_resched();
> } while (index < end && wbc->nr_to_write > 0);
>
> @@ -554,18 +547,11 @@ int afs_writepages(struct address_space
>
> _enter("");
>
> - if (wbc->nonblocking && bdi_write_congested(bdi)) {
> - wbc->encountered_congestion = 1;
> - _leave(" = 0 [congest]");
> - return 0;
> - }
> -
> if (wbc->range_cyclic) {
> start = mapping->writeback_index;
> end = -1;
> ret = afs_writepages_region(mapping, wbc, start, end, &next);
> - if (start > 0 && wbc->nr_to_write > 0 && ret == 0 &&
> - !(wbc->nonblocking && wbc->encountered_congestion))
> + if (start > 0 && wbc->nr_to_write > 0 && ret == 0)
> ret = afs_writepages_region(mapping, wbc, 0, start,
> &next);
> mapping->writeback_index = next;
> --- linux.orig/fs/cifs/file.c 2009-10-06 23:31:41.000000000 +0800
> +++ linux/fs/cifs/file.c 2009-10-06 23:31:59.000000000 +0800
> @@ -1379,16 +1379,6 @@ static int cifs_writepages(struct addres
> return generic_writepages(mapping, wbc);
>
>
> - /*
> - * BB: Is this meaningful for a non-block-device file system?
> - * If it is, we should test it again after we do I/O
> - */
> - if (wbc->nonblocking && bdi_write_congested(bdi)) {
> - wbc->encountered_congestion = 1;
> - kfree(iov);
> - return 0;
> - }
> -
> xid = GetXid();
>
> pagevec_init(&pvec, 0);
> --- linux.orig/fs/gfs2/aops.c 2009-10-06 23:31:41.000000000 +0800
> +++ linux/fs/gfs2/aops.c 2009-10-06 23:31:59.000000000 +0800
> @@ -313,11 +313,6 @@ static int gfs2_write_jdata_pagevec(stru
>
> if (ret || (--(wbc->nr_to_write) <= 0))
> ret = 1;
> - if (wbc->nonblocking && bdi_write_congested(bdi)) {
> - wbc->encountered_congestion = 1;
> - ret = 1;
> - }
> -
> }
> gfs2_trans_end(sdp);
> return ret;
> @@ -348,11 +343,6 @@ static int gfs2_write_cache_jdata(struct
> int scanned = 0;
> int range_whole = 0;
>
> - if (wbc->nonblocking && bdi_write_congested(bdi)) {
> - wbc->encountered_congestion = 1;
> - return 0;
> - }
> -
> pagevec_init(&pvec, 0);
> if (wbc->range_cyclic) {
> index = mapping->writeback_index; /* Start from prev offset */
> --- linux.orig/fs/xfs/linux-2.6/xfs_aops.c 2009-10-06 23:31:41.000000000 +0800
> +++ linux/fs/xfs/linux-2.6/xfs_aops.c 2009-10-06 23:31:59.000000000 +0800
> @@ -890,12 +890,8 @@ xfs_convert_page(
>
> bdi = inode->i_mapping->backing_dev_info;
> wbc->nr_to_write--;
> - if (bdi_write_congested(bdi)) {
> - wbc->encountered_congestion = 1;
> + if (wbc->nr_to_write <= 0)
> done = 1;
> - } else if (wbc->nr_to_write <= 0) {
> - done = 1;
> - }
> }
> xfs_start_page_writeback(page, !page_dirty, count);
> }
>
>
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [PATCH 07/45] writeback: dont redirty tail an inode with dirty pages
2009-10-07 7:38 ` [PATCH 07/45] writeback: dont redirty tail an inode with dirty pages Wu Fengguang
@ 2009-10-09 15:45 ` Jan Kara
0 siblings, 0 replies; 116+ messages in thread
From: Jan Kara @ 2009-10-09 15:45 UTC (permalink / raw)
To: Wu Fengguang
Cc: Andrew Morton, Theodore Tso, Christoph Hellwig, Dave Chinner,
Chris Mason, Peter Zijlstra, Li Shaohua, Myklebust Trond,
jens.axboe@oracle.com, Jan Kara, Nick Piggin, linux-fsdevel, LKML
On Wed 07-10-09 15:38:25, Wu Fengguang wrote:
> This avoids delaying writeback for an expired (XFS) inode with lots of
> dirty pages, but no active dirtier at the moment.
OK, looks good.
Acked-by: Jan Kara <jack@suse.cz>
Honza
>
> CC: Dave Chinner <david@fromorbit.com>
> CC: Christoph Hellwig <hch@infradead.org>
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> ---
> fs/fs-writeback.c | 20 +++++++-------------
> 1 file changed, 7 insertions(+), 13 deletions(-)
>
> --- linux.orig/fs/fs-writeback.c 2009-10-06 23:37:57.000000000 +0800
> +++ linux/fs/fs-writeback.c 2009-10-06 23:38:28.000000000 +0800
> @@ -479,18 +479,7 @@ writeback_single_inode(struct inode *ino
> spin_lock(&inode_lock);
> inode->i_state &= ~I_SYNC;
> if (!(inode->i_state & (I_FREEING | I_CLEAR))) {
> - if ((inode->i_state & I_DIRTY_PAGES) && wbc->for_kupdate) {
> - /*
> - * More pages get dirtied by a fast dirtier.
> - */
> - goto select_queue;
> - } else if (inode->i_state & I_DIRTY) {
> - /*
> - * At least XFS will redirty the inode during the
> - * writeback (delalloc) and on io completion (isize).
> - */
> - redirty_tail(inode);
> - } else if (mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
> + if (mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
> /*
> * We didn't write back all the pages. nfs_writepages()
> * sometimes bales out without doing anything. Redirty
> @@ -512,7 +501,6 @@ writeback_single_inode(struct inode *ino
> * soon as the queue becomes uncongested.
> */
> inode->i_state |= I_DIRTY_PAGES;
> -select_queue:
> if (wbc->nr_to_write <= 0) {
> /*
> * slice used up: queue for next turn
> @@ -535,6 +523,12 @@ select_queue:
> inode->i_state |= I_DIRTY_PAGES;
> redirty_tail(inode);
> }
> + } else if (inode->i_state & I_DIRTY) {
> + /*
> + * At least XFS will redirty the inode during the
> + * writeback (delalloc) and on io completion (isize).
> + */
> + redirty_tail(inode);
> } else if (atomic_read(&inode->i_count)) {
> /*
> * The inode is clean, inuse
>
>
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [PATCH 01/45] writeback: reduce calls to global_page_state in balance_dirty_pages()
2009-10-09 15:18 ` Peter Zijlstra
@ 2009-10-09 15:47 ` Jan Kara
2009-10-11 2:28 ` Wu Fengguang
0 siblings, 1 reply; 116+ messages in thread
From: Jan Kara @ 2009-10-09 15:47 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Jan Kara, Wu Fengguang, Andrew Morton, Theodore Tso,
Christoph Hellwig, Dave Chinner, Chris Mason, Li Shaohua,
Myklebust Trond, jens.axboe@oracle.com, Nick Piggin,
linux-fsdevel, Richard Kennedy, LKML
On Fri 09-10-09 17:18:32, Peter Zijlstra wrote:
> On Fri, 2009-10-09 at 17:12 +0200, Jan Kara wrote:
> > Ugh, but this is not equivalent! We would block the writer on some BDI
> > without any dirty data if we are over global dirty limit. That didn't
> > happen before.
>
> It should have, we should throttle everything calling
> balance_dirty_pages() when we're over the total limit.
OK :) I agree it's reasonable. But Wu, please note this in the
changelog because it might be a substantial change for some loads.
Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [PATCH 04/45] writeback: remove unused nonblocking and congestion checks
2009-10-09 15:26 ` Jan Kara
@ 2009-10-10 13:47 ` Wu Fengguang
0 siblings, 0 replies; 116+ messages in thread
From: Wu Fengguang @ 2009-10-10 13:47 UTC (permalink / raw)
To: Jan Kara
Cc: Andrew Morton, Theodore Tso, Christoph Hellwig, Dave Chinner,
Chris Mason, Peter Zijlstra, Li, Shaohua, Myklebust Trond,
jens.axboe@oracle.com, Nick Piggin, linux-fsdevel@vger.kernel.org,
LKML
On Fri, Oct 09, 2009 at 11:26:29PM +0800, Jan Kara wrote:
> On Wed 07-10-09 15:38:22, Wu Fengguang wrote:
> > - no one is calling wb_writeback and write_cache_pages with
> > wbc.nonblocking=1 any more
> > - lumpy pageout will want to do nonblocking writeback without the
> > congestion wait
> >
> > So remove the congestion checks as suggested by Chris.
> Looks good. Since encountered_congestion isn't used, you can delete it as
> well... BTW, you might need to split this patch to per-fs chunks for the
> sake of merging.
OK, good suggestions to follow :)
Thanks,
Fengguang
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [PATCH 01/45] writeback: reduce calls to global_page_state in balance_dirty_pages()
2009-10-09 15:12 ` Jan Kara
2009-10-09 15:18 ` Peter Zijlstra
@ 2009-10-10 21:33 ` Wu Fengguang
2009-10-12 21:18 ` Jan Kara
1 sibling, 1 reply; 116+ messages in thread
From: Wu Fengguang @ 2009-10-10 21:33 UTC (permalink / raw)
To: Jan Kara
Cc: Andrew Morton, Theodore Tso, Christoph Hellwig, Dave Chinner,
Chris Mason, Peter Zijlstra, Li, Shaohua, Myklebust Trond,
jens.axboe@oracle.com, Nick Piggin, linux-fsdevel@vger.kernel.org,
Richard Kennedy, LKML
On Fri, Oct 09, 2009 at 11:12:31PM +0800, Jan Kara wrote:
> Hi,
>
> On Wed 07-10-09 15:38:19, Wu Fengguang wrote:
> > From: Richard Kennedy <richard@rsk.demon.co.uk>
> >
> > Reducing the number of times balance_dirty_pages calls global_page_state
> > reduces the cache references and so improves write performance on a
> > variety of workloads.
> >
> > 'perf stats' of simple fio write tests shows the reduction in cache
> > access.
> > Where the test is fio 'write,mmap,600Mb,pre_read' on AMD AthlonX2 with
> > 3Gb memory (dirty_threshold approx 600 Mb)
> > running each test 10 times, dropping the fasted & slowest values then
> > taking
> > the average & standard deviation
> >
> > average (s.d.) in millions (10^6)
> > 2.6.31-rc8 648.6 (14.6)
> > +patch 620.1 (16.5)
> >
> > Achieving this reduction is by dropping clip_bdi_dirty_limit as it
> > rereads the counters to apply the dirty_threshold and moving this check
> > up into balance_dirty_pages where it has already read the counters.
> >
> > Also by rearrange the for loop to only contain one copy of the limit
> > tests allows the pdflush test after the loop to use the local copies of
> > the counters rather than rereading them.
> >
> > In the common case with no throttling it now calls global_page_state 5
> > fewer times and bdi_stat 2 fewer.
> Hmm, but the patch changes the behavior of balance_dirty_pages() in
> several ways:
Yes, unfortunately the changelog failed to make that clear ..
> > -/*
> > - * Clip the earned share of dirty pages to that which is actually available.
> > - * This avoids exceeding the total dirty_limit when the floating averages
> > - * fluctuate too quickly.
> > - */
> > -static void clip_bdi_dirty_limit(struct backing_dev_info *bdi,
> > - unsigned long dirty, unsigned long *pbdi_dirty)
> > -{
> > - unsigned long avail_dirty;
> > -
> > - avail_dirty = global_page_state(NR_FILE_DIRTY) +
> > - global_page_state(NR_WRITEBACK) +
> > - global_page_state(NR_UNSTABLE_NFS) +
> > - global_page_state(NR_WRITEBACK_TEMP);
> > -
> > - if (avail_dirty < dirty)
> > - avail_dirty = dirty - avail_dirty;
> > - else
> > - avail_dirty = 0;
> > -
> > - avail_dirty += bdi_stat(bdi, BDI_RECLAIMABLE) +
> > - bdi_stat(bdi, BDI_WRITEBACK);
> > -
> > - *pbdi_dirty = min(*pbdi_dirty, avail_dirty);
> > -}
> > -
> > static inline void task_dirties_fraction(struct task_struct *tsk,
> > long *numerator, long *denominator)
> > {
> > @@ -468,7 +442,6 @@ get_dirty_limits(unsigned long *pbackgro
> > bdi_dirty = dirty * bdi->max_ratio / 100;
> >
> > *pbdi_dirty = bdi_dirty;
> > - clip_bdi_dirty_limit(bdi, dirty, pbdi_dirty);
> I don't see, what test in balance_dirty_limits() should replace this
> clipping... OTOH clipping does not seem to have too much effect on the
> behavior of balance_dirty_pages - the limit we clip to (at least
> BDI_WRITEBACK + BDI_RECLAIMABLE) is large enough so that we break from the
> loop immediately. So just getting rid of the function is fine but
> I'd update the changelog accordingly.
>
It essentially replaces clip_bdi_dirty_limit() with the explicit check
(nr_reclaimable + nr_writeback >= dirty_thresh) to avoid exceeding the
dirty limit. Since the bdi dirty limit is mostly accurate, we don't need
to clip routinely; a simple dirty limit check is enough.
I added the above text to the changelog :)
> > + dirty_exceeded =
> > + (bdi_nr_reclaimable + bdi_nr_writeback >= bdi_thresh)
> > + || (nr_reclaimable + nr_writeback >= dirty_thresh);
> >
> > - if (bdi_nr_reclaimable + bdi_nr_writeback <= bdi_thresh)
> > + if (!dirty_exceeded)
> > break;
> Ugh, but this is not equivalent! We would block the writer on some BDI
> without any dirty data if we are over global dirty limit. That didn't
> happen before.
This restores the (right) behavior of 2.6.18. And Peter had the same goal
in mind with clip_bdi_dirty_limit() ;)
> > + /* don't wait if we've done enough */
> > + if (pages_written >= write_chunk)
> > + break;
> > }
> > -
> > - if (bdi_nr_reclaimable + bdi_nr_writeback <= bdi_thresh)
> > - break;
> > - if (pages_written >= write_chunk)
> > - break; /* We've done our duty */
> > -
> Here, we had an opportunity to break from the loop even if we didn't
> manage to write everything (for example because per-bdi thread managed to
> write enough or because enough IO has completed while we were trying to
> write). After the patch, we will sleep. IMHO that's not good...
Note that the pages_written check is moved several lines up in the patch :)
> I'd think that if we did all that work in writeback_inodes_wbc we could
> spend the effort on regetting and rechecking the stats...
Yes, maybe. I didn't bother because the later throttle queue patch totally
removes the loop and hence the need to re-get the stats :)
> > schedule_timeout_interruptible(pause);
> >
> > /*
> > @@ -577,8 +547,7 @@ static void balance_dirty_pages(struct a
> > pause = HZ / 10;
> > }
> >
> > - if (bdi_nr_reclaimable + bdi_nr_writeback < bdi_thresh &&
> > - bdi->dirty_exceeded)
> > + if (!dirty_exceeded && bdi->dirty_exceeded)
> > bdi->dirty_exceeded = 0;
> Here we fail to clear dirty_exceeded if we are over global dirty limit
> but not over per-bdi dirty limit...
You must be mistaken: dirty_exceeded = (over bdi limit || over global limit),
so !dirty_exceeded = (!over bdi limit && !over global limit).
> > @@ -593,9 +562,7 @@ static void balance_dirty_pages(struct a
> > * background_thresh, to keep the amount of dirty memory low.
> > */
> > if ((laptop_mode && pages_written) ||
> > - (!laptop_mode && ((global_page_state(NR_FILE_DIRTY)
> > - + global_page_state(NR_UNSTABLE_NFS))
> > - > background_thresh)))
> > + (!laptop_mode && (nr_reclaimable > background_thresh)))
> > bdi_start_writeback(bdi, NULL, 0);
> > }
> This might be based on rather old values in case we break from the loop
> after calling writeback_inodes_wbc.
Yes, that's possible. It's safe because the bdi worker will double check
background_thresh. We can call bdi_start_writeback() as long as there is a
good possibility it is needed: nr_reclaimable is not likely to drop suddenly
during our writeout.
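For reference, the flusher's own check is roughly the following (a sketch of
the over_bground_thresh() test used by wb_writeback(); the exact helper and
its signature in this tree may differ slightly):

static inline bool over_bground_thresh(void)
{
        unsigned long background_thresh;
        unsigned long dirty_thresh;

        get_dirty_limits(&background_thresh, &dirty_thresh, NULL, NULL);

        /* fresh counters, so a caller acting on stale ones is harmless */
        return (global_page_state(NR_FILE_DIRTY) +
                global_page_state(NR_UNSTABLE_NFS) >= background_thresh);
}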
Thanks,
Fengguang
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [PATCH 01/45] writeback: reduce calls to global_page_state in balance_dirty_pages()
2009-10-09 15:47 ` Jan Kara
@ 2009-10-11 2:28 ` Wu Fengguang
2009-10-11 7:44 ` Peter Zijlstra
0 siblings, 1 reply; 116+ messages in thread
From: Wu Fengguang @ 2009-10-11 2:28 UTC (permalink / raw)
To: Jan Kara
Cc: Peter Zijlstra, Andrew Morton, Theodore Tso, Christoph Hellwig,
Dave Chinner, Chris Mason, Li, Shaohua, Myklebust Trond,
jens.axboe@oracle.com, Nick Piggin, linux-fsdevel@vger.kernel.org,
Richard Kennedy, LKML
On Fri, Oct 09, 2009 at 11:47:59PM +0800, Jan Kara wrote:
> On Fri 09-10-09 17:18:32, Peter Zijlstra wrote:
> > On Fri, 2009-10-09 at 17:12 +0200, Jan Kara wrote:
> > > Ugh, but this is not equivalent! We would block the writer on some BDI
> > > without any dirty data if we are over global dirty limit. That didn't
> > > happen before.
> >
> > It should have, we should throttle everything calling
> > balance_dirty_pages() when we're over the total limit.
> OK :) I agree it's reasonable. But Wu, please note this in the
> changelog because it might be a substantial change for some loads.
Thanks, I added the note by Peter :)
Note that the total limit check itself may not be sufficient. For
example, there is no nr_writeback limit for NFS (and maybe btrfs)
after removing the congestion waits. Therefore it is very possible that
nr_writeback => dirty_thresh
nr_dirty => 0
which is obviously undesirable: everything newly dirtied is soon put
to writeback. It violates the 30s expire time and the background
threshold rules, and will hurt write-and-truncate operations (i.e. temp
files).
So the better solution would be to impose a nr_writeback limit for
every filesystem that didn't already have one (the block io queue).
NFS used to have that limit with congestion_wait, but now we need
to do a wait queue for it.
With the nr_writeback wait queue, it can be guaranteed that once
balance_dirty_pages() asks for writing 1500 pages, it will be done
with necessary sleeping in the bdi flush thread. So we can safely
remove the loop and double checking of global dirty limit in
balance_dirty_pages().
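In sketch form, the wait queue idea looks like this (the waitqueue and the
per-bdi limit field are assumed names for illustration only; patch 20 carries
the real NFS implementation):

/* block new writeback submission while too many pages are in flight */
static void bdi_wait_writeback_space(struct backing_dev_info *bdi)
{
        wait_event(bdi->writeback_waitq,
                   bdi_stat(bdi, BDI_WRITEBACK) < bdi->writeback_limit);
}

/* called from the writeback completion path */
static void bdi_wake_writeback_waiters(struct backing_dev_info *bdi)
{
        if (bdi_stat(bdi, BDI_WRITEBACK) < bdi->writeback_limit)
                wake_up(&bdi->writeback_waitq);
}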
However, there is still one problem - there is no general
coordination between max nr_writeback and the background/dirty
limits.
It is possible (and very likely for some small memory systems) that
nr_writeback > dirty_thresh - background_thresh
10,000         20,000         15,000
In this case, an application may be throttled because
nr_reclaimable + nr_writeback > dirty_thresh
12,000           10,000         20,000
and it starts a background writeback work to do the job for it; however,
that work quits immediately because
nr_reclaimable < background_thresh
12,000           15,000
In the end, the application does not get throttled at dirty_thresh at all.
Instead, it will be throttled at (background_thresh + max_nr_writeback).
One solution (aka the old behavior) is to respect the dirty_thresh, by
not quitting background writeback when there are throttled tasks (this
patch). It has the drawback of background writeback not doing its job
_actively_. Instead, it will frequently be started and quit at the times
when applications enter and leave balance_dirty_pages().
In the above scheme, the background_thresh is disregarded. The other
options would be to disregard dirty_thresh (may be undesirable) or to
limit max_nr_writeback (not as easy).
It is still very possible for nr_dirty to go all the way down to 0 if
max_nr_writeback > background_thresh.
This is a bit twisted. Any ideas?
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
fs/fs-writeback.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
--- linux.orig/fs/fs-writeback.c 2009-10-11 09:19:49.000000000 +0800
+++ linux/fs/fs-writeback.c 2009-10-11 09:21:50.000000000 +0800
@@ -781,7 +781,8 @@ static long wb_writeback(struct bdi_writ
* For background writeout, stop when we are below the
* background dirty threshold
*/
- if (args->for_background && !over_bground_thresh())
+ if (args->for_background && !over_bground_thresh() &&
+ !list_empty(&wb->bdi->throttle_list))
break;
wbc.more_io = 0;
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [PATCH 01/45] writeback: reduce calls to global_page_state in balance_dirty_pages()
2009-10-11 2:28 ` Wu Fengguang
@ 2009-10-11 7:44 ` Peter Zijlstra
2009-10-11 10:50 ` Wu Fengguang
0 siblings, 1 reply; 116+ messages in thread
From: Peter Zijlstra @ 2009-10-11 7:44 UTC (permalink / raw)
To: Wu Fengguang
Cc: Jan Kara, Andrew Morton, Theodore Tso, Christoph Hellwig,
Dave Chinner, Chris Mason, Li, Shaohua, Myklebust Trond,
jens.axboe@oracle.com, Nick Piggin, linux-fsdevel@vger.kernel.org,
Richard Kennedy, LKML
On Sun, 2009-10-11 at 10:28 +0800, Wu Fengguang wrote:
>
> Note that the total limit check itself may not be sufficient. For
> example, there are no nr_writeback limit for NFS (and maybe btrfs)
> after removing the congestion waits. Therefore it is very possible
>
> nr_writeback => dirty_thresh
> nr_dirty => 0
>
> which is obviously undesirable: everything newly dirtied are soon put
> to writeback. It violates the 30s expire time and the background
> threshold rules, and will hurt write-and-truncate operations (ie. temp
> files).
>
> So the better solution would be to impose a nr_writeback limit for
> every filesystem that didn't already have one (the block io queue).
> NFS used to have that limit with congestion_wait, but now we need
> to do a wait queue for it.
>
> With the nr_writeback wait queue, it can be guaranteed that once
> balance_dirty_pages() asks for writing 1500 pages, it will be done
> with necessary sleeping in the bdi flush thread. So we can safely
> remove the loop and double checking of global dirty limit in
> balance_dirty_pages().
nr_reclaim = nr_dirty + nr_writeback + nr_unstable, so anything calling
into balance_dirty_pages() would still block on seeing such large
amounts of nr_writeback.
Having the constraint nr_dirty + nr_writeback + nr_unstable <
dirty_thresh should ensure we never have nr_writeback > dirty_thresh,
simply because you cannot dirty more, which then cannot be converted to
more writeback.
Or am I missing something?
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [PATCH 01/45] writeback: reduce calls to global_page_state in balance_dirty_pages()
2009-10-11 7:44 ` Peter Zijlstra
@ 2009-10-11 10:50 ` Wu Fengguang
2009-10-11 10:58 ` Peter Zijlstra
2009-10-11 11:25 ` Peter Zijlstra
0 siblings, 2 replies; 116+ messages in thread
From: Wu Fengguang @ 2009-10-11 10:50 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Jan Kara, Andrew Morton, Theodore Tso, Christoph Hellwig,
Dave Chinner, Chris Mason, Li, Shaohua, Myklebust Trond,
jens.axboe@oracle.com, Nick Piggin, linux-fsdevel@vger.kernel.org,
Richard Kennedy, LKML
On Sun, Oct 11, 2009 at 03:44:40PM +0800, Peter Zijlstra wrote:
> On Sun, 2009-10-11 at 10:28 +0800, Wu Fengguang wrote:
> >
> > Note that the total limit check itself may not be sufficient. For
> > example, there are no nr_writeback limit for NFS (and maybe btrfs)
> > after removing the congestion waits. Therefore it is very possible
> >
> > nr_writeback => dirty_thresh
> > nr_dirty => 0
> >
> > which is obviously undesirable: everything newly dirtied are soon put
> > to writeback. It violates the 30s expire time and the background
> > threshold rules, and will hurt write-and-truncate operations (ie. temp
> > files).
> >
> > So the better solution would be to impose a nr_writeback limit for
> > every filesystem that didn't already have one (the block io queue).
> > NFS used to have that limit with congestion_wait, but now we need
> > to do a wait queue for it.
> >
> > With the nr_writeback wait queue, it can be guaranteed that once
> > balance_dirty_pages() asks for writing 1500 pages, it will be done
> > with necessary sleeping in the bdi flush thread. So we can safely
> > remove the loop and double checking of global dirty limit in
> > balance_dirty_pages().
>
> nr_reclaim = nr_dirty + nr_writeback + nr_unstable, so anything calling
> into balance_dirty_pages() would still block on seeing such large
> amounts of nr_writeback.
Our terms are a bit different. In my previous mail,
nr_reclaim = nr_dirty + nr_unstable
and nr_writeback is added separately when comparing with dirty_thresh, just
as in the code in balance_dirty_pages().
But that's fine. You are right that the application will be blocked
and the dirty limit guaranteed, if we do
while (over dirty limit) {
bdi_writeback_wait(pages to write);
}
But it has a problem: as long as the bdi flush thread for NFS doesn't
limit nr_writeback, its nr_writeback will grow to near
(dirty_thresh - nr_unstable), and its nr_dirty will approach 0.
That's not desirable.
So I did this:
- while (over dirty limit) {
+ if (over dirty limit) {
bdi_writeback_wait(pages to write);
}
_after_ adding the NFS nr_writeback wait queue ([PATCH 20/45] NFS:
introduce writeback wait queue). With that it's safe to remove the
loop.
> Having the constraint nr_dirty + nr_writeback + nr_unstable <
> dirty_thresh should ensure we never have nr_writeback > dirty_thresh,
> simply because you cannot dirty more, which then cannot be converted to
> more writeback.
>
> Or am I missing something?
You are right with the assumption that the loop is still there.
Sorry for the confusion, but I mean that filesystems have to limit
nr_writeback (directly or indirectly via the block io queue),
otherwise it either drives nr_dirty to 0 (with the loop) or lets
nr_writeback grow out of control (without the loop).
Thanks,
Fengguang
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [PATCH 01/45] writeback: reduce calls to global_page_state in balance_dirty_pages()
2009-10-11 10:50 ` Wu Fengguang
@ 2009-10-11 10:58 ` Peter Zijlstra
2009-10-11 11:25 ` Peter Zijlstra
1 sibling, 0 replies; 116+ messages in thread
From: Peter Zijlstra @ 2009-10-11 10:58 UTC (permalink / raw)
To: Wu Fengguang
Cc: Jan Kara, Andrew Morton, Theodore Tso, Christoph Hellwig,
Dave Chinner, Chris Mason, Li, Shaohua, Myklebust Trond,
jens.axboe@oracle.com, Nick Piggin, linux-fsdevel@vger.kernel.org,
Richard Kennedy, LKML
On Sun, 2009-10-11 at 18:50 +0800, Wu Fengguang wrote:
>
> Sorry for the confusion, but I mean, filesystems have to limit
> nr_writeback (directly or indirectly via the block io queue),
> otherwise it either hit nr_dirty to 0 (with the loop), or let
> nr_writeback grow out of control (without the loop).
OK, it seems we are indeed in agreement.
Just making sure ;-)
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [PATCH 01/45] writeback: reduce calls to global_page_state in balance_dirty_pages()
2009-10-11 10:50 ` Wu Fengguang
2009-10-11 10:58 ` Peter Zijlstra
@ 2009-10-11 11:25 ` Peter Zijlstra
2009-10-12 1:26 ` Wu Fengguang
1 sibling, 1 reply; 116+ messages in thread
From: Peter Zijlstra @ 2009-10-11 11:25 UTC (permalink / raw)
To: Wu Fengguang
Cc: Jan Kara, Andrew Morton, Theodore Tso, Christoph Hellwig,
Dave Chinner, Chris Mason, Li, Shaohua, Myklebust Trond,
jens.axboe@oracle.com, Nick Piggin, linux-fsdevel@vger.kernel.org,
Richard Kennedy, LKML
On Sun, 2009-10-11 at 18:50 +0800, Wu Fengguang wrote:
>
> Sorry for the confusion, but I mean, filesystems have to limit
> nr_writeback (directly or indirectly via the block io queue),
> otherwise it either hit nr_dirty to 0 (with the loop), or let
> nr_writeback grow out of control (without the loop).
Doesn't this require the writeback queue to have a limit < dirty_thresh?
Or more specifically, for the bdi case:
bdi_dirty + bdi_writeback + bdi_unstable <= bdi_thresh
we require that the writeback queue be smaller than bdi_thresh, which
could be quite difficult, since bdi_thresh can easily be 0.
Without observing the bdi_thresh constraint we can have:
\Sum_(over bdis) writeback_queue_size
dirty pages outstanding, which could be significantly higher than
dirty_thresh.
Or am I confused again?
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [PATCH 01/45] writeback: reduce calls to global_page_state in balance_dirty_pages()
2009-10-11 11:25 ` Peter Zijlstra
@ 2009-10-12 1:26 ` Wu Fengguang
2009-10-12 9:07 ` Peter Zijlstra
0 siblings, 1 reply; 116+ messages in thread
From: Wu Fengguang @ 2009-10-12 1:26 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Jan Kara, Andrew Morton, Theodore Tso, Christoph Hellwig,
Dave Chinner, Chris Mason, Li, Shaohua, Myklebust Trond,
jens.axboe@oracle.com, Nick Piggin, linux-fsdevel@vger.kernel.org,
Richard Kennedy, LKML
On Sun, Oct 11, 2009 at 07:25:17PM +0800, Peter Zijlstra wrote:
> On Sun, 2009-10-11 at 18:50 +0800, Wu Fengguang wrote:
> >
> > Sorry for the confusion, but I mean, filesystems have to limit
> > nr_writeback (directly or indirectly via the block io queue),
> > otherwise it either hit nr_dirty to 0 (with the loop), or let
> > nr_writeback grow out of control (without the loop).
>
> Doesn't this require the writeback queue to have a limit < dirty_thresh?
Yes, this is the key (open) issue. For now we have nothing to keep
nr_writeback < dirty_thresh
> Or more specifically, for the bdi case:
>
> bdi_dirty + bdi_writeback + bdi_unstable <= bdi_thresh
>
> we require that the writeback queue be smaller than bdi_thresh, which
> could be quite difficult, since bdi_thresh can easily be 0.
We could apply a MIN_BDI_DIRTY_THRESH. The bdi threshold is
estimated from writeback events, so bdi_thresh must be non-zero to
allow some writeback pages in flight :)
> Without observing the bdi_thresh constraint we can have:
>
> \Sum_(over bdis) writeback_queue_size
>
> dirty pages outstanding, which could be significantly higher than
> dirty_thresh.
Yes. Maybe we could do some per-bdi and/or global writeback wait
queue (i.e. some generalized version of patch 20: NFS: introduce
writeback wait queue).
The per-bdi writeback queue size should ideally be proportional to its
available writeback bandwidth. MIN_BDI_DIRTY_THRESH could be defined
as (2*bdi_writeback_bandwidth) or something close. And if the resulting
bdi limits turn out to be too large for a small memory system, we just
let the global limit kick in. For such small memory systems, it is
very likely there is only one bdi. So we are not likely to lose
fairness by basing its limits on available memory instead of device
capability.
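In sketch form, what I have in mind is something like this
(bdi->write_bandwidth is an assumed field that the bandwidth estimation
patches would provide, in pages per second; the factor is purely
illustrative):

/*
 * Size the per-bdi writeback queue from the estimated write bandwidth,
 * i.e. allow roughly two seconds worth of pages in flight. The global
 * dirty limit still applies on top of this.
 */
static unsigned long bdi_writeback_queue_size(struct backing_dev_info *bdi)
{
        return 2 * bdi->write_bandwidth;
}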
Thanks,
Fengguang
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [PATCH 01/45] writeback: reduce calls to global_page_state in balance_dirty_pages()
2009-10-12 1:26 ` Wu Fengguang
@ 2009-10-12 9:07 ` Peter Zijlstra
2009-10-12 9:24 ` Wu Fengguang
0 siblings, 1 reply; 116+ messages in thread
From: Peter Zijlstra @ 2009-10-12 9:07 UTC (permalink / raw)
To: Wu Fengguang
Cc: Jan Kara, Andrew Morton, Theodore Tso, Christoph Hellwig,
Dave Chinner, Chris Mason, Li, Shaohua, Myklebust Trond,
jens.axboe@oracle.com, Nick Piggin, linux-fsdevel@vger.kernel.org,
Richard Kennedy, LKML
On Mon, 2009-10-12 at 09:26 +0800, Wu Fengguang wrote:
> On Sun, Oct 11, 2009 at 07:25:17PM +0800, Peter Zijlstra wrote:
> > On Sun, 2009-10-11 at 18:50 +0800, Wu Fengguang wrote:
> > >
> > > Sorry for the confusion, but I mean, filesystems have to limit
> > > nr_writeback (directly or indirectly via the block io queue),
> > > otherwise it either hit nr_dirty to 0 (with the loop), or let
> > > nr_writeback grow out of control (without the loop).
> >
> > Doesn't this require the writeback queue to have a limit < dirty_thresh?
>
> Yes, this is the key (open) issue. For now we have nothing to limit
>
> nr_writeback < dirty_thresh
>
> > Or more specifically, for the bdi case:
> >
> > bdi_dirty + bdi_writeback + bdi_unstable <= bdi_thresh
> >
> > we require that the writeback queue be smaller than bdi_thresh, which
> > could be quite difficult, since bdi_thresh can easily be 0.
>
> We could apply a MIN_BDI_DIRTY_THRESH. Because the bdi threshold is
> estimated from writeback events, so bdi_thresh must be non-zero to
> allow some writeback pages in flight :)
Not really, suppose you have 1000 NFS clients, of which you only use a
handful at a time.
Then the bdi_thresh will be 0 for most of them, and only when you switch
to one will it start growing. But it's perfectly reasonable to expect
bdi_thresh=0 to work. It just reverts to sync behaviour: we write out
everything and block until they're all gone from writeback state.
MIN_BDI_DIRTY_THRESH != 0, will have a side effect of imposing a max
number of BDIs on the system, I'm not sure you want to go there.
> > Without observing the bdi_thresh constraint we can have:
> >
> > \Sum_(over bdis) writeback_queue_size
> >
> > dirty pages outstanding, which could be significantly higher than
> > dirty_thresh.
>
> Yes. Maybe we could do some per-bdi and/or global writeback wait
> queue (ie. some generalized version of the patch 20: NFS: introduce
> writeback wait queue).
>
> The per-bdi writeback queue size should ideally be proportional to its
> available writeback bandwidth. MIN_BDI_DIRTY_THRESH could be defined
> to (2*bdi_writeback_bandwidth) or something close. And if the resulted
> bdi limits turn out to be too large for a small memory system, we just
> let the global limit kick in. For such small memory systems, it is
> very likely there are only one bdi. So it is not likely to lose
> fairness to base its limits on available memory instead of device
> capability.
I'm not seeing why. By simply keeping that loop we're good again, and
can have a writeback queue that works well in the saturated case.
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [PATCH 01/45] writeback: reduce calls to global_page_state in balance_dirty_pages()
2009-10-12 9:07 ` Peter Zijlstra
@ 2009-10-12 9:24 ` Wu Fengguang
0 siblings, 0 replies; 116+ messages in thread
From: Wu Fengguang @ 2009-10-12 9:24 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Jan Kara, Andrew Morton, Theodore Tso, Christoph Hellwig,
Dave Chinner, Chris Mason, Li, Shaohua, Myklebust Trond,
jens.axboe@oracle.com, Nick Piggin, linux-fsdevel@vger.kernel.org,
Richard Kennedy, LKML
On Mon, Oct 12, 2009 at 05:07:10PM +0800, Peter Zijlstra wrote:
> On Mon, 2009-10-12 at 09:26 +0800, Wu Fengguang wrote:
> > On Sun, Oct 11, 2009 at 07:25:17PM +0800, Peter Zijlstra wrote:
> > > On Sun, 2009-10-11 at 18:50 +0800, Wu Fengguang wrote:
> > > >
> > > > Sorry for the confusion, but I mean, filesystems have to limit
> > > > nr_writeback (directly or indirectly via the block io queue),
> > > > otherwise it either hit nr_dirty to 0 (with the loop), or let
> > > > nr_writeback grow out of control (without the loop).
> > >
> > > Doesn't this require the writeback queue to have a limit < dirty_thresh?
> >
> > Yes, this is the key (open) issue. For now we have nothing to limit
> >
> > nr_writeback < dirty_thresh
> >
> > > Or more specifically, for the bdi case:
> > >
> > > bdi_dirty + bdi_writeback + bdi_unstable <= bdi_thresh
> > >
> > > we require that the writeback queue be smaller than bdi_thresh, which
> > > could be quite difficult, since bdi_thresh can easily be 0.
> >
> > We could apply a MIN_BDI_DIRTY_THRESH. Because the bdi threshold is
> > estimated from writeback events, so bdi_thresh must be non-zero to
> > allow some writeback pages in flight :)
>
> Not really, suppose you have 1000 NFS clients, of which you only use a
> hand full at a time.
>
> Then the bdi_thresh will be 0 for most of them, and only when you switch
> to one it'll start growing. But it's perfectly reasonable to expect
> bdi_thresh=0 to work. It just reverts to sync behaviour, we write out
> everything and block until they're all gone from writeback state.
Ah I see. We still do writeback when bdi_thresh=0, with any
application blocked in balance_dirty_pages().
> MIN_BDI_DIRTY_THRESH != 0, will have a side effect of imposing a max
> number of BDIs on the system, I'm not sure you want to go there.
OK that's not a good idea.
> > > Without observing the bdi_thresh constraint we can have:
> > >
> > > \Sum_(over bdis) writeback_queue_size
> > >
> > > dirty pages outstanding, which could be significantly higher than
> > > dirty_thresh.
> >
> > Yes. Maybe we could do some per-bdi and/or global writeback wait
> > queue (ie. some generalized version of the patch 20: NFS: introduce
> > writeback wait queue).
> >
> > The per-bdi writeback queue size should ideally be proportional to its
> > available writeback bandwidth. MIN_BDI_DIRTY_THRESH could be defined
> > to (2*bdi_writeback_bandwidth) or something close. And if the resulted
> > bdi limits turn out to be too large for a small memory system, we just
> > let the global limit kick in. For such small memory systems, it is
> > very likely there are only one bdi. So it is not likely to lose
> > fairness to base its limits on available memory instead of device
> > capability.
>
> I'm not seeing why. By simply keeping that loop we're good again, and
> can have a writeback queue that works well in the saturated case.
OK, it looks better to keep the loop. Memory-tight systems may go
into the nr_dirty=0 situation, but that may not be an urgent problem
(their nr_dirty will be small anyway).
Thanks,
Fengguang
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [PATCH 01/45] writeback: reduce calls to global_page_state in balance_dirty_pages()
2009-10-10 21:33 ` Wu Fengguang
@ 2009-10-12 21:18 ` Jan Kara
2009-10-13 3:24 ` Wu Fengguang
0 siblings, 1 reply; 116+ messages in thread
From: Jan Kara @ 2009-10-12 21:18 UTC (permalink / raw)
To: Wu Fengguang
Cc: Jan Kara, Andrew Morton, Theodore Tso, Christoph Hellwig,
Dave Chinner, Chris Mason, Peter Zijlstra, Li, Shaohua,
Myklebust Trond, jens.axboe@oracle.com, Nick Piggin,
linux-fsdevel@vger.kernel.org, Richard Kennedy, LKML
On Sun 11-10-09 05:33:39, Wu Fengguang wrote:
> On Fri, Oct 09, 2009 at 11:12:31PM +0800, Jan Kara wrote:
> > > + /* don't wait if we've done enough */
> > > + if (pages_written >= write_chunk)
> > > + break;
> > > }
> > > -
> > > - if (bdi_nr_reclaimable + bdi_nr_writeback <= bdi_thresh)
> > > - break;
> > > - if (pages_written >= write_chunk)
> > > - break; /* We've done our duty */
> > > -
> > Here, we had an opportunity to break from the loop even if we didn't
> > manage to write everything (for example because per-bdi thread managed to
> > write enough or because enough IO has completed while we were trying to
> > write). After the patch, we will sleep. IMHO that's not good...
>
> Note that the pages_written check is moved several lines up in the patch :)
>
> > I'd think that if we did all that work in writeback_inodes_wbc we could
> > spend the effort on regetting and rechecking the stats...
>
> Yes maybe. I didn't care it because the later throttle queue patch totally
> removed the loop and hence to need to reget the stats :)
Yes, since the loop gets removed in the end, this does not matter. You
are right.
> > > schedule_timeout_interruptible(pause);
> > >
> > > /*
> > > @@ -577,8 +547,7 @@ static void balance_dirty_pages(struct a
> > > pause = HZ / 10;
> > > }
> > >
> > > - if (bdi_nr_reclaimable + bdi_nr_writeback < bdi_thresh &&
> > > - bdi->dirty_exceeded)
> > > + if (!dirty_exceeded && bdi->dirty_exceeded)
> > > bdi->dirty_exceeded = 0;
> > Here we fail to clear dirty_exceeded if we are over global dirty limit
> > but not over per-bdi dirty limit...
>
> You must be mistaken: dirty_exceeded = (over bdi limit || over global limit),
> so !dirty_exceeded = (!over bdi limit && !over global limit).
Exactly. Previously, the check was:
if (!over bdi limit)
bdi->dirty_exceeded = 0;
Now it is
if (!over bdi limit && !over global limit)
bdi->dirty_exceeded = 0;
That's clearly not equivalent which is what I was trying to point out.
But looking at where dirty_exceeded is used, your new way is probably more
useful. It's just a bit counterintuitive that bdi->dirty_exceeded is set
even if the per-bdi limit is not exceeded...
Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [PATCH 01/45] writeback: reduce calls to global_page_state in balance_dirty_pages()
2009-10-12 21:18 ` Jan Kara
@ 2009-10-13 3:24 ` Wu Fengguang
2009-10-13 8:41 ` Peter Zijlstra
2009-10-13 18:12 ` Jan Kara
0 siblings, 2 replies; 116+ messages in thread
From: Wu Fengguang @ 2009-10-13 3:24 UTC (permalink / raw)
To: Jan Kara
Cc: Andrew Morton, Theodore Tso, Christoph Hellwig, Dave Chinner,
Chris Mason, Peter Zijlstra, Li, Shaohua, Myklebust Trond,
jens.axboe@oracle.com, Nick Piggin, linux-fsdevel@vger.kernel.org,
Richard Kennedy, LKML
On Tue, Oct 13, 2009 at 05:18:38AM +0800, Jan Kara wrote:
> On Sun 11-10-09 05:33:39, Wu Fengguang wrote:
> > On Fri, Oct 09, 2009 at 11:12:31PM +0800, Jan Kara wrote:
> > > > + /* don't wait if we've done enough */
> > > > + if (pages_written >= write_chunk)
> > > > + break;
> > > > }
> > > > -
> > > > - if (bdi_nr_reclaimable + bdi_nr_writeback <= bdi_thresh)
> > > > - break;
> > > > - if (pages_written >= write_chunk)
> > > > - break; /* We've done our duty */
> > > > -
> > > Here, we had an opportunity to break from the loop even if we didn't
> > > manage to write everything (for example because per-bdi thread managed to
> > > write enough or because enough IO has completed while we were trying to
> > > write). After the patch, we will sleep. IMHO that's not good...
> >
> > Note that the pages_written check is moved several lines up in the patch :)
> >
> > > I'd think that if we did all that work in writeback_inodes_wbc we could
> > > spend the effort on regetting and rechecking the stats...
> >
> > Yes maybe. I didn't care it because the later throttle queue patch totally
> > removed the loop and hence to need to reget the stats :)
> Yes, since the loop gets removed in the end, this does not matter. You
> are right.
You are right too :) I followed your and Peter's advice to do the loop
and the recheck of the stats as follows:
static void balance_dirty_pages(struct address_space *mapping,
unsigned long write_chunk)
{
long nr_reclaimable, bdi_nr_reclaimable;
long nr_writeback, bdi_nr_writeback;
unsigned long background_thresh;
unsigned long dirty_thresh;
unsigned long bdi_thresh;
int dirty_exceeded;
struct backing_dev_info *bdi = mapping->backing_dev_info;
/*
* If sync() is in progress, curb the to-be-synced inodes regardless
* of dirty limits, so that a fast dirtier won't livelock the sync.
*/
if (unlikely(bdi->sync_time &&
S_ISREG(mapping->host->i_mode) &&
time_after_eq(bdi->sync_time,
mapping->host->dirtied_when))) {
write_chunk *= 2;
bdi_writeback_wait(bdi, write_chunk);
}
for (;;) {
nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
global_page_state(NR_UNSTABLE_NFS);
nr_writeback = global_page_state(NR_WRITEBACK) +
global_page_state(NR_WRITEBACK_TEMP);
global_dirty_thresh(&background_thresh, &dirty_thresh);
/*
* Throttle it only when the background writeback cannot
* catch-up. This avoids (excessively) small writeouts
* when the bdi limits are ramping up.
*/
if (nr_reclaimable + nr_writeback <
(background_thresh + dirty_thresh) / 2)
break;
bdi_thresh = bdi_dirty_thresh(bdi, dirty_thresh);
/*
* In order to avoid the stacked BDI deadlock we need
* to ensure we accurately count the 'dirty' pages when
* the threshold is low.
*
* Otherwise it would be possible to get thresh+n pages
* reported dirty, even though there are thresh-m pages
* actually dirty; with m+n sitting in the percpu
* deltas.
*/
if (bdi_thresh < 2*bdi_stat_error(bdi)) {
bdi_nr_reclaimable = bdi_stat_sum(bdi, BDI_RECLAIMABLE);
bdi_nr_writeback = bdi_stat_sum(bdi, BDI_WRITEBACK);
} else {
bdi_nr_reclaimable = bdi_stat(bdi, BDI_RECLAIMABLE);
bdi_nr_writeback = bdi_stat(bdi, BDI_WRITEBACK);
}
/*
* The bdi thresh is somehow "soft" limit derived from the
* global "hard" limit. The former helps to prevent heavy IO
* bdi or process from holding back light ones; The latter is
* the last resort safeguard.
*/
dirty_exceeded =
(bdi_nr_reclaimable + bdi_nr_writeback >= bdi_thresh)
|| (nr_reclaimable + nr_writeback >= dirty_thresh);
if (!dirty_exceeded)
break;
bdi->dirty_exceed_time = jiffies;
bdi_writeback_wait(bdi, write_chunk);
}
/*
* In laptop mode, we wait until hitting the higher threshold before
* starting background writeout, and then write out all the way down
* to the lower threshold. So slow writers cause minimal disk activity.
*
* In normal mode, we start background writeout at the lower
* background_thresh, to keep the amount of dirty memory low.
*/
if (!laptop_mode && (nr_reclaimable > background_thresh) &&
can_submit_background_writeback(bdi))
bdi_start_writeback(bdi, NULL, WB_FOR_BACKGROUND);
}
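bdi_writeback_wait() above comes from the throttle queue patch later in this
series. Conceptually the pairing with the flusher-side wakeup is roughly the
sketch below; the throttle_list field matches that patch, while the waiter
structure, the lock and the function bodies here are only illustrative:

struct throttle_waiter {
        struct list_head        list;
        long                    remaining;      /* pages still to be written for us */
        struct completion       done;
};

static void bdi_writeback_wait(struct backing_dev_info *bdi, long pages)
{
        struct throttle_waiter waiter = { .remaining = pages };

        init_completion(&waiter.done);
        spin_lock(&bdi->throttle_lock);                 /* assumed lock */
        list_add_tail(&waiter.list, &bdi->throttle_list);
        spin_unlock(&bdi->throttle_lock);

        bdi_start_writeback(bdi, NULL, 0);              /* kick the flusher if idle */
        wait_for_completion(&waiter.done);
}

/* called by the flush thread as pages complete writeback */
static void bdi_writeback_wakeup(struct backing_dev_info *bdi, long written)
{
        struct throttle_waiter *w, *tmp;

        spin_lock(&bdi->throttle_lock);
        list_for_each_entry_safe(w, tmp, &bdi->throttle_list, list) {
                w->remaining -= written;
                if (w->remaining > 0)
                        break;
                written = -w->remaining;        /* surplus credits the next waiter */
                list_del(&w->list);
                complete(&w->done);
        }
        spin_unlock(&bdi->throttle_lock);
}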
> > > > schedule_timeout_interruptible(pause);
> > > >
> > > > /*
> > > > @@ -577,8 +547,7 @@ static void balance_dirty_pages(struct a
> > > > pause = HZ / 10;
> > > > }
> > > >
> > > > - if (bdi_nr_reclaimable + bdi_nr_writeback < bdi_thresh &&
> > > > - bdi->dirty_exceeded)
> > > > + if (!dirty_exceeded && bdi->dirty_exceeded)
> > > > bdi->dirty_exceeded = 0;
> > > Here we fail to clear dirty_exceeded if we are over global dirty limit
> > > but not over per-bdi dirty limit...
> >
> > You must be mistaken: dirty_exceeded = (over bdi limit || over global limit),
> > so !dirty_exceeded = (!over bdi limit && !over global limit).
> Exactly. Previously, the check was:
> if (!over bdi limit)
> bdi->dirty_exceeded = 0;
>
> Now it is
> if (!over bdi limit && !over global limit)
> bdi->dirty_exceeded = 0;
>
> That's clearly not equivalent which is what I was trying to point out.
> But looking at where dirty_exceeded is used, your new way is probably more
> useful. It's just a bit counterintuitive that bdi->dirty_exceeded is set
> even if the per-bdi limit is not exceeded...
Yeah, good point. Since the per-bdi limits are more like "soft" limits
derived from the global "hard" limit, the code makes sense
with some comments and an updated changelog :)
This patch slightly changes behavior by replacing clip_bdi_dirty_limit()
with the explicit check (nr_reclaimable + nr_writeback >= dirty_thresh)
to avoid exceeding the dirty limit. Since the bdi dirty limit is mostly
accurate we don't need to clip routinely; a simple dirty limit check
is enough.
The check is necessary because, in principle, we should throttle
everything calling balance_dirty_pages() when we're over the total
limit, as said by Peter.
We now set and clear dirty_exceeded not only based on the bdi dirty limits,
but also on the global dirty limits. This is a bit counterintuitive, but
the global limits are the ultimate goal and shall always be imposed.
We may now start background writeback work based on outdated conditions.
That's safe because the bdi flush thread will (and has to) double check
the state. It reduces overall overheads because a test based on old
state still has a good chance of being right.
Thanks,
Fengguang
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [PATCH 01/45] writeback: reduce calls to global_page_state in balance_dirty_pages()
2009-10-13 3:24 ` Wu Fengguang
@ 2009-10-13 8:41 ` Peter Zijlstra
2009-10-13 18:12 ` Jan Kara
1 sibling, 0 replies; 116+ messages in thread
From: Peter Zijlstra @ 2009-10-13 8:41 UTC (permalink / raw)
To: Wu Fengguang
Cc: Jan Kara, Andrew Morton, Theodore Tso, Christoph Hellwig,
Dave Chinner, Chris Mason, Li, Shaohua, Myklebust Trond,
jens.axboe@oracle.com, Nick Piggin, linux-fsdevel@vger.kernel.org,
Richard Kennedy, LKML
On Tue, 2009-10-13 at 11:24 +0800, Wu Fengguang wrote:
> You are right too :) I followed you and Peter's advice to do the loop
> and the recheck of stats as follows:
> This patch slightly changes behavior by replacing clip_bdi_dirty_limit()
> with the explicit check (nr_reclaimable + nr_writeback >= dirty_thresh)
> to avoid exceeding the dirty limit. Since the bdi dirty limit is mostly
> accurate we don't need to do routinely clip. A simple dirty limit check
> would be enough.
>
> The check is necessary because, in principle we should throttle
> everything calling balance_dirty_pages() when we're over the total
> limit, as said by Peter.
>
> We now set and clear dirty_exceeded not only based on bdi dirty limits,
> but also on the global dirty limits. This is a bit counterintuitive, but
> the global limits are the ultimate goal and shall be always imposed.
>
> We may now start background writeback work based on outdated conditions.
> That's safe because the bdi flush thread will (and have to) double check
> the states. It reduces overall overheads because the test based on old
> states still have good chance to be right.
> static void balance_dirty_pages(struct address_space *mapping,
> unsigned long write_chunk)
> {
> long nr_reclaimable, bdi_nr_reclaimable;
> long nr_writeback, bdi_nr_writeback;
> unsigned long background_thresh;
> unsigned long dirty_thresh;
> unsigned long bdi_thresh;
> int dirty_exceeded;
> struct backing_dev_info *bdi = mapping->backing_dev_info;
>
> /*
> * If sync() is in progress, curb the to-be-synced inodes regardless
> * of dirty limits, so that a fast dirtier won't livelock the sync.
> */
> if (unlikely(bdi->sync_time &&
> S_ISREG(mapping->host->i_mode) &&
> time_after_eq(bdi->sync_time,
> mapping->host->dirtied_when))) {
> write_chunk *= 2;
> bdi_writeback_wait(bdi, write_chunk);
> }
>
> for (;;) {
> nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
> global_page_state(NR_UNSTABLE_NFS);
> nr_writeback = global_page_state(NR_WRITEBACK) +
> global_page_state(NR_WRITEBACK_TEMP);
>
> global_dirty_thresh(&background_thresh, &dirty_thresh);
>
> /*
> * Throttle it only when the background writeback cannot
> * catch-up. This avoids (excessively) small writeouts
> * when the bdi limits are ramping up.
> */
> if (nr_reclaimable + nr_writeback <
> (background_thresh + dirty_thresh) / 2)
> break;
>
> bdi_thresh = bdi_dirty_thresh(bdi, dirty_thresh);
>
> /*
> * In order to avoid the stacked BDI deadlock we need
> * to ensure we accurately count the 'dirty' pages when
> * the threshold is low.
> *
> * Otherwise it would be possible to get thresh+n pages
> * reported dirty, even though there are thresh-m pages
> * actually dirty; with m+n sitting in the percpu
> * deltas.
> */
> if (bdi_thresh < 2*bdi_stat_error(bdi)) {
> bdi_nr_reclaimable = bdi_stat_sum(bdi, BDI_RECLAIMABLE);
> bdi_nr_writeback = bdi_stat_sum(bdi, BDI_WRITEBACK);
> } else {
> bdi_nr_reclaimable = bdi_stat(bdi, BDI_RECLAIMABLE);
> bdi_nr_writeback = bdi_stat(bdi, BDI_WRITEBACK);
> }
>
> /*
> * The bdi thresh is somehow "soft" limit derived from the
> * global "hard" limit. The former helps to prevent heavy IO
> * bdi or process from holding back light ones; The latter is
> * the last resort safeguard.
> */
> dirty_exceeded =
> (bdi_nr_reclaimable + bdi_nr_writeback >= bdi_thresh)
> || (nr_reclaimable + nr_writeback >= dirty_thresh);
>
> if (!dirty_exceeded)
> break;
>
> bdi->dirty_exceed_time = jiffies;
>
> bdi_writeback_wait(bdi, write_chunk);
> }
>
> /*
> * In laptop mode, we wait until hitting the higher threshold before
> * starting background writeout, and then write out all the way down
> * to the lower threshold. So slow writers cause minimal disk activity.
> *
> * In normal mode, we start background writeout at the lower
> * background_thresh, to keep the amount of dirty memory low.
> */
> if (!laptop_mode && (nr_reclaimable > background_thresh) &&
> can_submit_background_writeback(bdi))
> bdi_start_writeback(bdi, NULL, WB_FOR_BACKGROUND);
> }
Looks good, Thanks Wu!
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [PATCH 01/45] writeback: reduce calls to global_page_state in balance_dirty_pages()
2009-10-13 3:24 ` Wu Fengguang
2009-10-13 8:41 ` Peter Zijlstra
@ 2009-10-13 18:12 ` Jan Kara
2009-10-13 18:28 ` Peter Zijlstra
1 sibling, 1 reply; 116+ messages in thread
From: Jan Kara @ 2009-10-13 18:12 UTC (permalink / raw)
To: Wu Fengguang
Cc: Jan Kara, Andrew Morton, Theodore Tso, Christoph Hellwig,
Dave Chinner, Chris Mason, Peter Zijlstra, Li, Shaohua,
Myklebust Trond, jens.axboe@oracle.com, Nick Piggin,
linux-fsdevel@vger.kernel.org, Richard Kennedy, LKML
On Tue 13-10-09 11:24:05, Wu Fengguang wrote:
> On Tue, Oct 13, 2009 at 05:18:38AM +0800, Jan Kara wrote:
> > On Sun 11-10-09 05:33:39, Wu Fengguang wrote:
> > > On Fri, Oct 09, 2009 at 11:12:31PM +0800, Jan Kara wrote:
> > > > > + /* don't wait if we've done enough */
> > > > > + if (pages_written >= write_chunk)
> > > > > + break;
> > > > > }
> > > > > -
> > > > > - if (bdi_nr_reclaimable + bdi_nr_writeback <= bdi_thresh)
> > > > > - break;
> > > > > - if (pages_written >= write_chunk)
> > > > > - break; /* We've done our duty */
> > > > > -
> > > > Here, we had an opportunity to break from the loop even if we didn't
> > > > manage to write everything (for example because per-bdi thread managed to
> > > > write enough or because enough IO has completed while we were trying to
> > > > write). After the patch, we will sleep. IMHO that's not good...
> > >
> > > Note that the pages_written check is moved several lines up in the patch :)
> > >
> > > > I'd think that if we did all that work in writeback_inodes_wbc we could
> > > > spend the effort on regetting and rechecking the stats...
> > >
> > > Yes maybe. I didn't care it because the later throttle queue patch totally
> > > removed the loop and hence to need to reget the stats :)
> > Yes, since the loop gets removed in the end, this does not matter. You
> > are right.
>
> You are right too :) I followed you and Peter's advice to do the loop
> and the recheck of stats as follows:
>
> static void balance_dirty_pages(struct address_space *mapping,
> unsigned long write_chunk)
> {
> long nr_reclaimable, bdi_nr_reclaimable;
> long nr_writeback, bdi_nr_writeback;
> unsigned long background_thresh;
> unsigned long dirty_thresh;
> unsigned long bdi_thresh;
> int dirty_exceeded;
> struct backing_dev_info *bdi = mapping->backing_dev_info;
>
> /*
> * If sync() is in progress, curb the to-be-synced inodes regardless
> * of dirty limits, so that a fast dirtier won't livelock the sync.
> */
> if (unlikely(bdi->sync_time &&
> S_ISREG(mapping->host->i_mode) &&
> time_after_eq(bdi->sync_time,
> mapping->host->dirtied_when))) {
> write_chunk *= 2;
> bdi_writeback_wait(bdi, write_chunk);
> }
>
> for (;;) {
> nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
> global_page_state(NR_UNSTABLE_NFS);
> nr_writeback = global_page_state(NR_WRITEBACK) +
> global_page_state(NR_WRITEBACK_TEMP);
>
> global_dirty_thresh(&background_thresh, &dirty_thresh);
>
> /*
> * Throttle it only when the background writeback cannot
> * catch-up. This avoids (excessively) small writeouts
> * when the bdi limits are ramping up.
> */
> if (nr_reclaimable + nr_writeback <
> (background_thresh + dirty_thresh) / 2)
> break;
>
> bdi_thresh = bdi_dirty_thresh(bdi, dirty_thresh);
>
> /*
> * In order to avoid the stacked BDI deadlock we need
> * to ensure we accurately count the 'dirty' pages when
> * the threshold is low.
> *
> * Otherwise it would be possible to get thresh+n pages
> * reported dirty, even though there are thresh-m pages
> * actually dirty; with m+n sitting in the percpu
> * deltas.
> */
> if (bdi_thresh < 2*bdi_stat_error(bdi)) {
> bdi_nr_reclaimable = bdi_stat_sum(bdi, BDI_RECLAIMABLE);
> bdi_nr_writeback = bdi_stat_sum(bdi, BDI_WRITEBACK);
> } else {
> bdi_nr_reclaimable = bdi_stat(bdi, BDI_RECLAIMABLE);
> bdi_nr_writeback = bdi_stat(bdi, BDI_WRITEBACK);
> }
>
> /*
> * The bdi thresh is somehow "soft" limit derived from the
> * global "hard" limit. The former helps to prevent heavy IO
> * bdi or process from holding back light ones; The latter is
> * the last resort safeguard.
> */
> dirty_exceeded =
> (bdi_nr_reclaimable + bdi_nr_writeback >= bdi_thresh)
> || (nr_reclaimable + nr_writeback >= dirty_thresh);
>
> if (!dirty_exceeded)
> break;
>
> bdi->dirty_exceed_time = jiffies;
>
> bdi_writeback_wait(bdi, write_chunk);
Hmm, probably you've discussed this in some other email but why do we
cycle in this loop until we get below dirty limit? We used to leave the
loop after writing write_chunk... So the time we spend in
balance_dirty_pages() is no longer limited, right?
> }
>
> /*
> * In laptop mode, we wait until hitting the higher threshold before
> * starting background writeout, and then write out all the way down
> * to the lower threshold. So slow writers cause minimal disk activity.
> *
> * In normal mode, we start background writeout at the lower
> * background_thresh, to keep the amount of dirty memory low.
> */
> if (!laptop_mode && (nr_reclaimable > background_thresh) &&
> can_submit_background_writeback(bdi))
> bdi_start_writeback(bdi, NULL, WB_FOR_BACKGROUND);
> }
>
> > > > > schedule_timeout_interruptible(pause);
> > > > >
> > > > > /*
> > > > > @@ -577,8 +547,7 @@ static void balance_dirty_pages(struct a
> > > > > pause = HZ / 10;
> > > > > }
> > > > >
> > > > > - if (bdi_nr_reclaimable + bdi_nr_writeback < bdi_thresh &&
> > > > > - bdi->dirty_exceeded)
> > > > > + if (!dirty_exceeded && bdi->dirty_exceeded)
> > > > > bdi->dirty_exceeded = 0;
> > > > Here we fail to clear dirty_exceeded if we are over global dirty limit
> > > > but not over per-bdi dirty limit...
> > >
> > > You must be mistaken: dirty_exceeded = (over bdi limit || over global limit),
> > > so !dirty_exceeded = (!over bdi limit && !over global limit).
> > Exactly. Previously, the check was:
> > if (!over bdi limit)
> > bdi->dirty_exceeded = 0;
> >
> > Now it is
> > if (!over bdi limit && !over global limit)
> > bdi->dirty_exceeded = 0;
> >
> > That's clearly not equivalent which is what I was trying to point out.
> > But looking at where dirty_exceeded is used, your new way is probably more
> > useful. It's just a bit counterintuitive that bdi->dirty_exceeded is set
> > even if the per-bdi limit is not exceeded...
>
> Yeah good point. Since the per-bdi limits are more about "soft" limits
> which are derived from the global "hard" limit, the code makes sense
> with some comments and updated changelog :)
>
> This patch slightly changes behavior by replacing clip_bdi_dirty_limit()
> with the explicit check (nr_reclaimable + nr_writeback >= dirty_thresh)
> to avoid exceeding the dirty limit. Since the bdi dirty limit is mostly
> accurate we don't need to do routinely clip. A simple dirty limit check
> would be enough.
>
> The check is necessary because, in principle we should throttle
> everything calling balance_dirty_pages() when we're over the total
> limit, as said by Peter.
>
> We now set and clear dirty_exceeded not only based on bdi dirty limits,
> but also on the global dirty limits. This is a bit counterintuitive, but
> the global limits are the ultimate goal and shall be always imposed.
>
> We may now start background writeback work based on outdated conditions.
> That's safe because the bdi flush thread will (and have to) double check
> the states. It reduces overall overheads because the test based on old
> states still have good chance to be right.
The new description is good, thanks!
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [PATCH 01/45] writeback: reduce calls to global_page_state in balance_dirty_pages()
2009-10-13 18:12 ` Jan Kara
@ 2009-10-13 18:28 ` Peter Zijlstra
2009-10-14 1:38 ` Wu Fengguang
0 siblings, 1 reply; 116+ messages in thread
From: Peter Zijlstra @ 2009-10-13 18:28 UTC (permalink / raw)
To: Jan Kara
Cc: Wu Fengguang, Andrew Morton, Theodore Tso, Christoph Hellwig,
Dave Chinner, Chris Mason, Li, Shaohua, Myklebust Trond,
jens.axboe@oracle.com, Nick Piggin, linux-fsdevel@vger.kernel.org,
Richard Kennedy, LKML
On Tue, 2009-10-13 at 20:12 +0200, Jan Kara wrote:
> > for (;;) {
> > nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
> > global_page_state(NR_UNSTABLE_NFS);
> > nr_writeback = global_page_state(NR_WRITEBACK) +
> > global_page_state(NR_WRITEBACK_TEMP);
> >
> > global_dirty_thresh(&background_thresh, &dirty_thresh);
> >
> > /*
> > * Throttle it only when the background writeback cannot
> > * catch-up. This avoids (excessively) small writeouts
> > * when the bdi limits are ramping up.
> > */
> > if (nr_reclaimable + nr_writeback <
> > (background_thresh + dirty_thresh) / 2)
> > break;
> >
> > bdi_thresh = bdi_dirty_thresh(bdi, dirty_thresh);
> >
> > /*
> > * In order to avoid the stacked BDI deadlock we need
> > * to ensure we accurately count the 'dirty' pages when
> > * the threshold is low.
> > *
> > * Otherwise it would be possible to get thresh+n pages
> > * reported dirty, even though there are thresh-m pages
> > * actually dirty; with m+n sitting in the percpu
> > * deltas.
> > */
> > if (bdi_thresh < 2*bdi_stat_error(bdi)) {
> > bdi_nr_reclaimable = bdi_stat_sum(bdi, BDI_RECLAIMABLE);
> > bdi_nr_writeback = bdi_stat_sum(bdi, BDI_WRITEBACK);
> > } else {
> > bdi_nr_reclaimable = bdi_stat(bdi, BDI_RECLAIMABLE);
> > bdi_nr_writeback = bdi_stat(bdi, BDI_WRITEBACK);
> > }
> >
> > /*
> > * The bdi thresh is somehow "soft" limit derived from the
> > * global "hard" limit. The former helps to prevent heavy IO
> > * bdi or process from holding back light ones; The latter is
> > * the last resort safeguard.
> > */
> > dirty_exceeded =
> > (bdi_nr_reclaimable + bdi_nr_writeback >= bdi_thresh)
> > || (nr_reclaimable + nr_writeback >= dirty_thresh);
> >
> > if (!dirty_exceeded)
> > break;
> >
> > bdi->dirty_exceed_time = jiffies;
> >
> > bdi_writeback_wait(bdi, write_chunk);
> Hmm, probably you've discussed this in some other email but why do we
> cycle in this loop until we get below dirty limit? We used to leave the
> loop after writing write_chunk... So the time we spend in
> balance_dirty_pages() is no longer limited, right?
Wu was saying that without the loop nr_writeback wasn't limited, but
since bdi_writeback_wakeup() is driven from writeout completion, I'm not
sure how again that was so.
We can move all of bdi_dirty to bdi_writeout, if the bdi writeout queue
permits, but it cannot grow beyond the total limit, since we're actually
waiting for writeout completion.
Possibly unstable is peculiar.
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [PATCH 01/45] writeback: reduce calls to global_page_state in balance_dirty_pages()
2009-10-13 18:28 ` Peter Zijlstra
@ 2009-10-14 1:38 ` Wu Fengguang
2009-10-14 11:22 ` Peter Zijlstra
0 siblings, 1 reply; 116+ messages in thread
From: Wu Fengguang @ 2009-10-14 1:38 UTC (permalink / raw)
To: Peter Zijlstra, Peter Staubach, Myklebust Trond
Cc: Jan Kara, Andrew Morton, Theodore Tso, Christoph Hellwig,
Dave Chinner, Chris Mason, Li, Shaohua, jens.axboe@oracle.com,
Nick Piggin, linux-fsdevel@vger.kernel.org, Richard Kennedy, LKML
On Wed, Oct 14, 2009 at 02:28:19AM +0800, Peter Zijlstra wrote:
> On Tue, 2009-10-13 at 20:12 +0200, Jan Kara wrote:
> > > for (;;) {
> > > nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
> > > global_page_state(NR_UNSTABLE_NFS);
> > > nr_writeback = global_page_state(NR_WRITEBACK) +
> > > global_page_state(NR_WRITEBACK_TEMP);
> > >
> > > global_dirty_thresh(&background_thresh, &dirty_thresh);
> > >
> > > /*
> > > * Throttle it only when the background writeback cannot
> > > * catch-up. This avoids (excessively) small writeouts
> > > * when the bdi limits are ramping up.
> > > */
> > > if (nr_reclaimable + nr_writeback <
> > > (background_thresh + dirty_thresh) / 2)
> > > break;
> > >
> > > bdi_thresh = bdi_dirty_thresh(bdi, dirty_thresh);
> > >
> > > /*
> > > * In order to avoid the stacked BDI deadlock we need
> > > * to ensure we accurately count the 'dirty' pages when
> > > * the threshold is low.
> > > *
> > > * Otherwise it would be possible to get thresh+n pages
> > > * reported dirty, even though there are thresh-m pages
> > > * actually dirty; with m+n sitting in the percpu
> > > * deltas.
> > > */
> > > if (bdi_thresh < 2*bdi_stat_error(bdi)) {
> > > bdi_nr_reclaimable = bdi_stat_sum(bdi, BDI_RECLAIMABLE);
> > > bdi_nr_writeback = bdi_stat_sum(bdi, BDI_WRITEBACK);
> > > } else {
> > > bdi_nr_reclaimable = bdi_stat(bdi, BDI_RECLAIMABLE);
> > > bdi_nr_writeback = bdi_stat(bdi, BDI_WRITEBACK);
> > > }
> > >
> > > /*
> > > * The bdi thresh is somehow "soft" limit derived from the
> > > * global "hard" limit. The former helps to prevent heavy IO
> > > * bdi or process from holding back light ones; The latter is
> > > * the last resort safeguard.
> > > */
> > > dirty_exceeded =
> > > (bdi_nr_reclaimable + bdi_nr_writeback >= bdi_thresh)
> > > || (nr_reclaimable + nr_writeback >= dirty_thresh);
> > >
> > > if (!dirty_exceeded)
> > > break;
> > >
> > > bdi->dirty_exceed_time = jiffies;
> > >
> > > bdi_writeback_wait(bdi, write_chunk);
> > Hmm, probably you've discussed this in some other email but why do we
> > cycle in this loop until we get below dirty limit? We used to leave the
> > loop after writing write_chunk... So the time we spend in
> > balance_dirty_pages() is no longer limited, right?
Right, this is a legitimate concern.
> Wu was saying that without the loop nr_writeback wasn't limited, but
> since bdi_writeback_wakeup() is driven from writeout completion, I'm not
> sure how again that was so.
Let me summarize the ideas :)
There are two cases:
- there is no bdi or block io queue to limit nr_writeback
This must be fixed. It either lets nr_writeback grow to dirty_thresh
(with the loop) and thus squeezes nr_dirty, or lets it grow out of control
totally (without the loop). The current state is: the nr_writeback wait
queue for NFS is there; the one for btrfs is still missing.
- there is an nr_writeback limit, but it is larger than dirty_thresh
In this case nr_dirty will be close to 0 regardless of the loop.
The loop will help to keep
nr_dirty + nr_writeback + nr_unstable < dirty_thresh
Without the loop, the "real" dirty threshold would be larger
(determined by the nr_writeback limit).
> We can move all of bdi_dirty to bdi_writeout, if the bdi writeout queue
> permits, but it cannot grow beyond the total limit, since we're actually
> waiting for writeout completion.
Yes, this explains the second case. It's a trade-off: the
nr_writeback limit cannot be trusted on small memory systems, so do
the loop to impose dirty_thresh, which unfortunately can hurt
responsiveness on all systems with prolonged wait times..
We could possibly test (nr_dirty < nr_writeback). If so, the
nr_writeback limit could be too large to justify the loop.
That still doesn't address the nr_dirty=0 problem for small memory systems,
but it should be acceptable since their nr_dirty will be small anyway.
> Possibly unstable is peculiar.
unstable can also go wild. I saw (in current linux-next, with the
following patch applied) balance_dirty_pages() sleeping for >30s waiting for
the NFS nr_unstable count to drop. That is, waiting for the dirty inode to
be _expired_ and written to disk on the server.
It's a general uncoordinated double-caching problem for NFS (and maybe others).
Thanks,
Fengguang
---
[ 45.614799] balance_dirty_pages sleeped 228ms
[ 45.954821] balance_dirty_pages sleeped 324ms
[ 46.294874] balance_dirty_pages sleeped 324ms
[ 46.638810] balance_dirty_pages sleeped 328ms
[ 46.670769] balance_dirty_pages sleeped 28ms
[ 46.802779] balance_dirty_pages sleeped 128ms
[ 46.934788] balance_dirty_pages sleeped 124ms
[ 47.066778] balance_dirty_pages sleeped 124ms
[ 47.198774] balance_dirty_pages sleeped 128ms
[ 47.330763] balance_dirty_pages sleeped 124ms
[ 47.462768] balance_dirty_pages sleeped 128ms
[ 47.594768] balance_dirty_pages sleeped 124ms
[ 47.662763] balance_dirty_pages sleeped 60ms
[ 47.798781] balance_dirty_pages sleeped 132ms
[ 47.871435] balance_dirty_pages sleeped 64ms
[ 48.002749] balance_dirty_pages sleeped 124ms
[ 48.138787] balance_dirty_pages sleeped 132ms
[ 48.270824] balance_dirty_pages sleeped 124ms
[ 48.410762] balance_dirty_pages sleeped 128ms
[ 48.542758] balance_dirty_pages sleeped 128ms
[ 48.678786] balance_dirty_pages sleeped 132ms
[ 48.810781] balance_dirty_pages sleeped 124ms
[ 48.946755] balance_dirty_pages sleeped 124ms
[ 49.182753] balance_dirty_pages sleeped 228ms
[ 49.318773] balance_dirty_pages sleeped 128ms
[ 49.666784] balance_dirty_pages sleeped 324ms
[ 49.914774] balance_dirty_pages sleeped 228ms
[ 79.998354] balance_dirty_pages sleeped 30068ms
[ 80.062346] balance_dirty_pages sleeped 60ms
[ 80.290414] balance_dirty_pages sleeped 224ms
[ 80.542413] balance_dirty_pages sleeped 228ms
[ 80.782384] balance_dirty_pages sleeped 228ms
[ 81.142379] balance_dirty_pages sleeped 336ms
[ 116.005926] balance_dirty_pages sleeped 34852ms
[ 141.049584] balance_dirty_pages sleeped 25040ms
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
mm/page-writeback.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)
--- linux.orig/mm/page-writeback.c 2009-10-09 10:22:58.000000000 +0800
+++ linux/mm/page-writeback.c 2009-10-09 10:31:53.000000000 +0800
@@ -490,6 +490,7 @@ static void balance_dirty_pages(struct a
unsigned long bdi_thresh;
unsigned long pages_written = 0;
unsigned long pause = 1;
+ unsigned long start = jiffies;
struct backing_dev_info *bdi = mapping->backing_dev_info;
@@ -566,7 +567,8 @@ static void balance_dirty_pages(struct a
if (pages_written >= write_chunk)
break; /* We've done our duty */
- schedule_timeout_interruptible(pause);
+ __set_current_state(TASK_INTERRUPTIBLE);
+ io_schedule_timeout(pause);
/*
* Increase the delay for each loop, up to our previous
@@ -577,6 +579,9 @@ static void balance_dirty_pages(struct a
pause = HZ / 10;
}
+ if (pause > 1)
+ printk("balance_dirty_pages sleeped %lums\n", (jiffies - start) * 1000/HZ);
+
if (bdi_nr_reclaimable + bdi_nr_writeback < bdi_thresh &&
bdi->dirty_exceeded)
bdi->dirty_exceeded = 0;
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [PATCH 01/45] writeback: reduce calls to global_page_state in balance_dirty_pages()
2009-10-14 1:38 ` Wu Fengguang
@ 2009-10-14 11:22 ` Peter Zijlstra
2009-10-17 5:30 ` Wu Fengguang
0 siblings, 1 reply; 116+ messages in thread
From: Peter Zijlstra @ 2009-10-14 11:22 UTC (permalink / raw)
To: Wu Fengguang
Cc: Peter Staubach, Myklebust Trond, Jan Kara, Andrew Morton,
Theodore Tso, Christoph Hellwig, Dave Chinner, Chris Mason,
Li, Shaohua, jens.axboe@oracle.com, Nick Piggin,
linux-fsdevel@vger.kernel.org, Richard Kennedy, LKML
On Wed, 2009-10-14 at 09:38 +0800, Wu Fengguang wrote:
> > > Hmm, probably you've discussed this in some other email but why do we
> > > cycle in this loop until we get below dirty limit? We used to leave the
> > > loop after writing write_chunk... So the time we spend in
> > > balance_dirty_pages() is no longer limited, right?
>
> Right, this is a legitimate concern.
Quite.
> > Wu was saying that without the loop nr_writeback wasn't limited, but
> > since bdi_writeback_wakeup() is driven from writeout completion, I'm not
> > sure how again that was so.
>
> Let me summarize the ideas :)
>
> There are two cases:
>
> - there are no bdi or block io queue to limit nr_writeback
> This must be fixed. It either let nr_writeback grow to dirty_thresh
> (with loop) and thus squeeze nr_dirty, or grow out of control
> totally (without loop). Current state is, the nr_writeback wait
> queue for NFS is there; the one for btrfs is still missing.
>
> - there is a nr_writeback limit, but is larger than dirty_thresh
> In this case nr_dirty will be close to 0 regardless of the loop.
> The loop will help to keep
> nr_dirty + nr_writeback + nr_unstable < dirty_thresh
> Without the loop, the "real" dirty threshold would be larger
> (determined by the nr_writeback limit).
>
> > We can move all of bdi_dirty to bdi_writeout, if the bdi writeout queue
> > permits, but it cannot grow beyond the total limit, since we're actually
> > waiting for writeout completion.
>
> Yes, this explains the second case. It's some trade-off like: the
> nr_writeback limit can not be trusted in small memory systems, so do
> the loop to impose the dirty_thresh, which unfortunately can hurt
> responsiveness on all systems with prolonged wait time..
Ok, so I'm still puzzled.
set_page_dirty()
balance_dirty_pages_ratelimited()
balance_dirty_pages_ratelimited_nr(1)
balance_dirty_pages(nr);
So we call balance_dirty_pages() with an appropriate count for each
set_page_dirty() successful invocation, right?
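For reference, the ratelimiting wrapper is roughly the following (a sketch
from memory of mm/page-writeback.c around this time; the per-CPU details and
the exact count passed down are approximate):

static DEFINE_PER_CPU(unsigned long, bdp_ratelimits);

void balance_dirty_pages_ratelimited_nr(struct address_space *mapping,
                                        unsigned long nr_pages_dirtied)
{
        unsigned long ratelimit;
        unsigned long *p;

        ratelimit = ratelimit_pages;            /* derived from memory size */
        if (mapping->backing_dev_info->dirty_exceeded)
                ratelimit = 8;                  /* throttle much sooner when over the limit */

        /*
         * Batch per-cpu dirtyings and only enter the expensive path once
         * the batch exceeds the ratelimit - the "ratelimit fuzz" below.
         */
        preempt_disable();
        p = &__get_cpu_var(bdp_ratelimits);
        *p += nr_pages_dirtied;
        if (unlikely(*p >= ratelimit)) {
                unsigned long nr = *p;

                *p = 0;
                preempt_enable();
                balance_dirty_pages(mapping, nr);
                return;
        }
        preempt_enable();
}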
balance_dirty_pages() guarantees that:
nr_dirty + nr_writeback + nr_unstable < dirty_thresh &&
(nr_dirty + nr_writeback + nr_unstable <
(dirty_thresh + background_thresh)/2 ||
bdi_dirty + bdi_writeback + bdi_unstable < bdi_thresh)
Now, without the loop and without a writeback limit, I still see no way to
actually generate more 'dirty' pages than dirty_thresh.
As soon as we hit dirty_thresh a process will wait for exactly the same
number of pages to get cleaned (writeback completed) as were dirtied
(+/- the ratelimit fuzz, which should even out over processes).
That should bound things to dirty_thresh -- the wait is on writeback
complete, so nr_writeback is bounded too.
[ I forgot the exact semantics of unstable, if we clear writeback before
unstable, we need to fix something ]
Now, a nr_writeback queue that limits writeback will still be useful,
esp for high speed devices. Once they ramp up and bdi_thresh exceeds the
queue size, it'll take effect. So you reap the benefits when needed.
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [PATCH 01/45] writeback: reduce calls to global_page_state in balance_dirty_pages()
2009-10-14 11:22 ` Peter Zijlstra
@ 2009-10-17 5:30 ` Wu Fengguang
0 siblings, 0 replies; 116+ messages in thread
From: Wu Fengguang @ 2009-10-17 5:30 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Peter Staubach, Myklebust Trond, Jan Kara, Andrew Morton,
Theodore Tso, Christoph Hellwig, Dave Chinner, Chris Mason,
Li, Shaohua, jens.axboe@oracle.com, Nick Piggin,
linux-fsdevel@vger.kernel.org, Richard Kennedy, LKML
On Wed, Oct 14, 2009 at 07:22:28PM +0800, Peter Zijlstra wrote:
> On Wed, 2009-10-14 at 09:38 +0800, Wu Fengguang wrote:
> > > > Hmm, probably you've discussed this in some other email but why do we
> > > > cycle in this loop until we get below dirty limit? We used to leave the
> > > > loop after writing write_chunk... So the time we spend in
> > > > balance_dirty_pages() is no longer limited, right?
> >
> > Right, this is a legitimate concern.
>
> Quite.
>
> > > Wu was saying that without the loop nr_writeback wasn't limited, but
> > > since bdi_writeback_wakeup() is driven from writeout completion, I'm not
> > > sure how again that was so.
> >
> > Let me summarize the ideas :)
> >
> > There are two cases:
> >
> > - there are no bdi or block io queue to limit nr_writeback
> > This must be fixed. It either let nr_writeback grow to dirty_thresh
> > (with loop) and thus squeeze nr_dirty, or grow out of control
> > totally (without loop). Current state is, the nr_writeback wait
> > queue for NFS is there; the one for btrfs is still missing.
> >
> > - there is a nr_writeback limit, but is larger than dirty_thresh
> > In this case nr_dirty will be close to 0 regardless of the loop.
> > The loop will help to keep
> > nr_dirty + nr_writeback + nr_unstable < dirty_thresh
> > Without the loop, the "real" dirty threshold would be larger
> > (determined by the nr_writeback limit).
> >
> > > We can move all of bdi_dirty to bdi_writeout, if the bdi writeout queue
> > > permits, but it cannot grow beyond the total limit, since we're actually
> > > waiting for writeout completion.
> >
> > Yes, this explains the second case. It's some trade-off like: the
> > nr_writeback limit can not be trusted in small memory systems, so do
> > the loop to impose the dirty_thresh, which unfortunately can hurt
> > responsiveness on all systems with prolonged wait time..
>
> Ok, so I'm still puzzled.
Big sorry - it's me that was confused (by some buggy tests).
> set_page_dirty()
>   balance_dirty_pages_ratelimited()
>     balance_dirty_pages_ratelimited_nr(1)
>       balance_dirty_pages(nr);
>
> So we call balance_dirty_pages() with an appropriate count for each
> successful set_page_dirty() invocation, right?
Right.
> balance_dirty_pages() guarantees that:
>
> nr_dirty + nr_writeback + nr_unstable < dirty_thresh &&
> (nr_dirty + nr_writeback + nr_unstable <
> (dirty_thresh + background_thresh)/2 ||
> bdi_dirty + bdi_writeback + bdi_unstable < bdi_thresh)
>
> Now without loop, without writeback limit, I still see no way to
> actually generate more 'dirty' pages than dirty_thresh.
>
> As soon as we hit dirty_thresh a process will wait for exactly the same
> amount of pages to get cleaned (writeback completed) as were dirtied
> (+/- the ratelimit fuzz which should even out over processes).
Ah yes, we now wait for writeback to _complete_ in bdi_writeback_wait(),
instead of merely _starting_ writeback as the old-fashioned writeback_inodes() did.
> That should bound things to dirty_thresh -- the wait is on writeback
> complete, so nr_writeback is bounded too.
Right. It was not bounded in the tests because bdi_writeback_wait()
quits _prematurely_: the background writeback finds it is already under
the background threshold, so it wakes up the throttled tasks and then
quits. Fixed by simply removing the wakeup-all in background writeback
and making this change:
		if (args->for_background && !over_bground_thresh() &&
+		    list_empty(&wb->bdi->throttle_list))
			break;
So now
- the throttled tasks are guaranteed to be woken up
- they will only be woken up in __bdi_writeout_inc()
- once woken, at least write_chunk pages have been written on behalf of each
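
A rough sketch of that wait/wakeup pairing, assuming the per-bdi throttle
list from this series; the waiter struct, bdi->throttle_lock and both helper
names below are illustrative, not the actual patch:

#include <linux/backing-dev.h>
#include <linux/completion.h>
#include <linux/list.h>
#include <linux/spinlock.h>

/* One waiter per throttled task, queued on the bdi (illustrative only). */
struct bdi_throttle_waiter {
	struct list_head	list;
	long			remaining;	/* pages still to complete */
	struct completion	done;
};

/* Stands in for bdi_writeback_wait(): block until write_chunk pages
 * have completed writeback on this bdi. */
static void bdi_throttle_wait_sketch(struct backing_dev_info *bdi,
				     long write_chunk)
{
	struct bdi_throttle_waiter w = {
		.remaining = write_chunk,
	};

	init_completion(&w.done);
	spin_lock(&bdi->throttle_lock);		/* assumed fields from this series */
	list_add_tail(&w.list, &bdi->throttle_list);
	spin_unlock(&bdi->throttle_lock);

	/* background writeback must keep running while throttle_list is busy */
	wait_for_completion(&w.done);
}

/* Called from __bdi_writeout_inc() for every page that finishes writeback:
 * charge the oldest waiter and wake it once its quota is met. */
static void bdi_throttle_writeout_sketch(struct backing_dev_info *bdi)
{
	struct bdi_throttle_waiter *w;

	spin_lock(&bdi->throttle_lock);
	if (!list_empty(&bdi->throttle_list)) {
		w = list_first_entry(&bdi->throttle_list,
				     struct bdi_throttle_waiter, list);
		if (--w->remaining <= 0) {
			list_del_init(&w->list);
			complete(&w->done);	/* write_chunk pages done: wake */
		}
	}
	spin_unlock(&bdi->throttle_lock);
}

So balance_dirty_pages() would call the wait helper instead of looping, and
every writeout completion would pay down the quota of the task at the head
of the list.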
> [ I forgot the exact semantics of unstable, if we clear writeback before
> unstable, we need to fix something ]
New tests show that NFS works fine without the loop and without an NFS
nr_writeback limit:
$ dd if=/dev/zero of=/mnt/test/zero3 bs=1M count=200 &
$ vmmon -d 1 nr_writeback nr_dirty nr_unstable
nr_writeback nr_dirty nr_unstable
0 2 0
0 2 0
0 22477 65
2 20849 1697
2 19153 3393
2 17420 5126
27825 7 5979
27816 0 41
26925 0 907
31064 286 159
32531 0 213
32548 0 89
32405 0 155
32464 0 98
32517 0 45
32560 0 194
32534 0 220
32601 0 222
32490 0 72
32447 0 115
32511 0 48
32535 0 216
32535 0 216
32535 0 216
32535 0 216
31555 0 1180
29732 0 3003
29277 0 3458
27721 0 5014
25955 0 6780
24356 0 8379
22763 0 9972
21083 0 11652
19371 0 13364
17564 0 15171
15781 0 16954
14005 0 18730
12230 0 20505
12177 0 20558
11383 0 21352
9489 0 23246
7621 0 25115
5866 0 26870
4790 0 27947
2962 0 29773
1089 0 31646
0 0 32735
0 0 32735
0 0 0
0 0 0
> Now, a nr_writeback queue that limits writeback will still be useful,
> esp for high speed devices. Once they ramp up and bdi_thresh exceeds the
> queue size, it'll take effect. So you reap the benefits when needed.
Right, the nr_writeback limit avoids the squeeze where

	nr_writeback -> dirty_thresh

combined with the constraint

	nr_dirty + nr_writeback < dirty_thresh

forces

	nr_dirty -> 0
Thanks for the clarification, it looks less obscure now :)
Thanks,
Fengguang
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [PATCH 31/45] writeback: sync old inodes first in background writeback
2009-10-07 7:38 ` [PATCH 31/45] writeback: sync old inodes first in background writeback Wu Fengguang
@ 2010-07-12 3:01 ` Christoph Hellwig
2010-07-12 15:24 ` Wu Fengguang
0 siblings, 1 reply; 116+ messages in thread
From: Christoph Hellwig @ 2010-07-12 3:01 UTC (permalink / raw)
To: Wu Fengguang
Cc: Andrew Morton, Theodore Tso, Christoph Hellwig, Dave Chinner,
Chris Mason, Peter Zijlstra, Li Shaohua, Myklebust Trond,
jens.axboe@oracle.com, Jan Kara, Nick Piggin, linux-fsdevel, LKML
On Wed, Oct 07, 2009 at 03:38:49PM +0800, Wu Fengguang wrote:
> A background flush work may run for ever. So it's reasonable for it to
> mimic the kupdate behavior of syncing old/expired inodes first.
I've looked at this a bit again after you pointed to this thread in
the direct reclaim thread, and I think we should be even more aggressive
in pushing out old inodes.
We basically have two types of I/O done from wb_do_writeback:
- either we want to write all inodes for a given bdi/superblock. That
includes all WB_SYNC_ALL callers, but also things like
writeback_inodes_sb and the wakeup_flusher_threads call from
sys_sync.
- or we have a specific goal, like for the background writeback or
the wakeup_flusher_threads from free_more_memory.
For the first case there's obviously no point in doing any
older_than_this processing as we write out all inodes anyway.
For the second case we should always do an older_than_this pass _first_.
Rationale: we really should get the old inodes out ASAP, so that we
keep the amount of changes lost on a crash within bounds.
Furthermore the callers only need N pages cleaned, and they don't care
where from. So if we can reach our goal with the older_than_this
writeback we're fine. If the writeback loop is long enough we can
keep doing more of these later on as well.
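
A minimal sketch of that ordering, assuming the expired-inodes cutoff can be
fed into the existing writeback loop; do_writeback_pass() below is a
hypothetical stand-in for that loop:

/* Illustrative only: serve a goal-driven request (background writeback,
 * free_more_memory) from expired inodes first, and fall back to the rest
 * of the dirty inodes only if that was not enough. */
static long writeback_old_inodes_first(struct bdi_writeback *wb,
				       struct wb_writeback_work *work)
{
	/* dirty_expire_interval is in centisecs, hence the * 10 */
	unsigned long expire = jiffies -
			msecs_to_jiffies(dirty_expire_interval * 10);
	long wrote;

	/* pass 1: only inodes dirtied before 'expire' */
	wrote = do_writeback_pass(wb, work, &expire);
	if (work->nr_pages <= 0)
		return wrote;		/* goal met with old inodes alone */

	/* pass 2: goal not met yet, continue with all dirty inodes */
	return wrote + do_writeback_pass(wb, work, NULL);
}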
Doing this should also help cleaning the code up a bit by moving the
wb_check_old_data_flush logic into wb_writeback and getting rid of the
for_kupdate parameter in struct wb_writeback_work. I'm not even sure
it's worth keeping it in struct writeback_control.
^ permalink raw reply [flat|nested] 116+ messages in thread
* Re: [PATCH 31/45] writeback: sync old inodes first in background writeback
2010-07-12 3:01 ` Christoph Hellwig
@ 2010-07-12 15:24 ` Wu Fengguang
0 siblings, 0 replies; 116+ messages in thread
From: Wu Fengguang @ 2010-07-12 15:24 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Andrew Morton, Theodore Tso, Dave Chinner, Chris Mason,
Peter Zijlstra, Li, Shaohua, Myklebust Trond,
jens.axboe@oracle.com, Jan Kara, Nick Piggin,
linux-fsdevel@vger.kernel.org, LKML
On Mon, Jul 12, 2010 at 11:01:29AM +0800, Christoph Hellwig wrote:
> On Wed, Oct 07, 2009 at 03:38:49PM +0800, Wu Fengguang wrote:
> > A background flush work may run for ever. So it's reasonable for it to
> > mimic the kupdate behavior of syncing old/expired inodes first.
>
> I've looked at this a bit again after you pointed to this thread in
> the direct reclaim thread, and I think we should be even more aggressive
> in pushing out old inodes.
Agreed.
> We basically have two types of I/O done from wb_do_writeback:
>
> - either we want to write all inodes for a given bdi/superblock. That
> includes all WB_SYNC_ALL callers, but also things like
> writeback_inodes_sb and the wakeup_flusher_threads call from
> sys_sync.
> - or we have a specific goal, like for the background writeback or
> the wakeup_flusher_threads from free_more_memory.
>
> For the first case there's obviously no point in doing any
> older_than_this processing as we write out all inodes anyway.
We may also do older_than_this even for the sync-the-whole-world case,
as long as this simplifies wb_writeback() and/or other code. This may
make a difference for slow devices.
> For the second case we should always do an older_than_this pass _first_.
Agree in general.
> Rationale: we really should get the old inodes out ASAP, so that we
> keep the amount of changes lost on a crash within bounds.
> Furthermore the callers only need N pages cleaned, and they don't care
> where from. So if we can reach our goal with the older_than_this
> writeback we're fine. If the writeback loop is long enough we can
> keep doing more of these later on as well.
Right.
> Doing this should also help cleaning the code up a bit by moving the
> wb_check_old_data_flush logic into wb_writeback and getting rid of the
> for_kupdate parameter in struct wb_writeback_work. I'm not even sure
> it's worth keeping it in struct writeback_control.
I'd also like to see fewer for_kupdate tests. Whether or not we can
totally get rid of the explicit for_kupdate case, there are always
four main writeback goals/semantics (a rough sketch follows the notes
below):
- periodic:   stop when all 30s-old inodes are written
- background: stop when the background threshold is reached
- nr_pages:   stop when nr_pages are written (or when everything is clean)
- sync:       stop when all older-than-sync-time inodes are written
Note that
- the "sync" goal is obviously a superset of the "periodic" goal
- the "background" goal may be expanded to include the "periodic" goal
- the latter three goals may all do some "periodic" goal loops, with
a moving "old" criterion.
Thanks,
Fengguang
^ permalink raw reply [flat|nested] 116+ messages in thread
end of thread
Thread overview: 116+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-10-07 7:38 [PATCH 00/45] some writeback experiments Wu Fengguang
2009-10-07 7:38 ` [PATCH 01/45] writeback: reduce calls to global_page_state in balance_dirty_pages() Wu Fengguang
2009-10-09 15:12 ` Jan Kara
2009-10-09 15:18 ` Peter Zijlstra
2009-10-09 15:47 ` Jan Kara
2009-10-11 2:28 ` Wu Fengguang
2009-10-11 7:44 ` Peter Zijlstra
2009-10-11 10:50 ` Wu Fengguang
2009-10-11 10:58 ` Peter Zijlstra
2009-10-11 11:25 ` Peter Zijlstra
2009-10-12 1:26 ` Wu Fengguang
2009-10-12 9:07 ` Peter Zijlstra
2009-10-12 9:24 ` Wu Fengguang
2009-10-10 21:33 ` Wu Fengguang
2009-10-12 21:18 ` Jan Kara
2009-10-13 3:24 ` Wu Fengguang
2009-10-13 8:41 ` Peter Zijlstra
2009-10-13 18:12 ` Jan Kara
2009-10-13 18:28 ` Peter Zijlstra
2009-10-14 1:38 ` Wu Fengguang
2009-10-14 11:22 ` Peter Zijlstra
2009-10-17 5:30 ` Wu Fengguang
2009-10-07 7:38 ` [PATCH 02/45] writeback: reduce calculation of bdi dirty thresholds Wu Fengguang
2009-10-07 7:38 ` [PATCH 03/45] ext4: remove unused parameter wbc from __ext4_journalled_writepage() Wu Fengguang
2009-10-07 7:38 ` [PATCH 04/45] writeback: remove unused nonblocking and congestion checks Wu Fengguang
2009-10-09 15:26 ` Jan Kara
2009-10-10 13:47 ` Wu Fengguang
2009-10-07 7:38 ` [PATCH 05/45] writeback: remove the always false bdi_cap_writeback_dirty() test Wu Fengguang
2009-10-07 7:38 ` [PATCH 06/45] writeback: use larger ratelimit when dirty_exceeded Wu Fengguang
2009-10-07 8:53 ` Peter Zijlstra
2009-10-07 9:17 ` Wu Fengguang
2009-10-07 7:38 ` [PATCH 07/45] writeback: dont redirty tail an inode with dirty pages Wu Fengguang
2009-10-09 15:45 ` Jan Kara
2009-10-07 7:38 ` [PATCH 08/45] writeback: quit on wrap for .range_cyclic (write_cache_pages) Wu Fengguang
2009-10-07 7:38 ` [PATCH 09/45] writeback: quit on wrap for .range_cyclic (pohmelfs) Wu Fengguang
2009-10-07 12:32 ` Evgeniy Polyakov
2009-10-07 14:23 ` Wu Fengguang
2009-10-07 7:38 ` [PATCH 10/45] writeback: quit on wrap for .range_cyclic (btrfs) Wu Fengguang
2009-10-07 7:38 ` [PATCH 11/45] writeback: quit on wrap for .range_cyclic (cifs) Wu Fengguang
2009-10-07 7:38 ` [PATCH 12/45] writeback: quit on wrap for .range_cyclic (ext4) Wu Fengguang
2009-10-07 7:38 ` [PATCH 13/45] writeback: quit on wrap for .range_cyclic (gfs2) Wu Fengguang
2009-10-07 7:38 ` [PATCH 14/45] writeback: quit on wrap for .range_cyclic (afs) Wu Fengguang
2009-10-07 7:38 ` [PATCH 15/45] writeback: fix queue_io() ordering Wu Fengguang
2009-10-07 7:38 ` [PATCH 16/45] writeback: merge for_kupdate and !for_kupdate cases Wu Fengguang
2009-10-07 7:38 ` [PATCH 17/45] writeback: only allow two background writeback works Wu Fengguang
2009-10-07 7:38 ` [PATCH 18/45] writeback: introduce wait queue for balance_dirty_pages() Wu Fengguang
2009-10-08 1:01 ` KAMEZAWA Hiroyuki
2009-10-08 1:58 ` Wu Fengguang
2009-10-08 2:40 ` KAMEZAWA Hiroyuki
2009-10-08 4:01 ` Wu Fengguang
2009-10-08 5:59 ` KAMEZAWA Hiroyuki
2009-10-08 6:07 ` Wu Fengguang
2009-10-08 6:28 ` Wu Fengguang
2009-10-08 6:39 ` KAMEZAWA Hiroyuki
2009-10-08 8:08 ` Peter Zijlstra
2009-10-08 8:11 ` KAMEZAWA Hiroyuki
2009-10-08 8:36 ` Jens Axboe
2009-10-09 2:52 ` [PATCH] writeback: account IO throttling wait as iowait Wu Fengguang
2009-10-09 10:41 ` Jens Axboe
2009-10-09 10:58 ` Wu Fengguang
2009-10-09 11:01 ` Jens Axboe
2009-10-08 8:05 ` [PATCH 18/45] writeback: introduce wait queue for balance_dirty_pages() Peter Zijlstra
2009-10-07 7:38 ` [PATCH 19/45] writeback: remove the loop in balance_dirty_pages() Wu Fengguang
2009-10-07 7:38 ` [PATCH 20/45] NFS: introduce writeback wait queue Wu Fengguang
2009-10-07 8:53 ` Peter Zijlstra
2009-10-07 9:07 ` Wu Fengguang
2009-10-07 9:15 ` Peter Zijlstra
2009-10-07 9:19 ` Wu Fengguang
2009-10-07 9:17 ` Nick Piggin
2009-10-07 9:52 ` Wu Fengguang
2009-10-07 7:38 ` [PATCH 21/45] writeback: estimate bdi write bandwidth Wu Fengguang
2009-10-07 8:53 ` Peter Zijlstra
2009-10-07 9:39 ` Wu Fengguang
2009-10-07 7:38 ` [PATCH 22/45] writeback: show bdi write bandwidth in debugfs Wu Fengguang
2009-10-07 7:38 ` [PATCH 23/45] writeback: kill space in debugfs item name Wu Fengguang
2009-10-07 7:38 ` [PATCH 24/45] writeback: remove global nr_to_write and use timeout instead Wu Fengguang
2009-10-07 7:38 ` [PATCH 25/45] writeback: convert wbc.nr_to_write to per-file parameter Wu Fengguang
2009-10-07 7:38 ` [PATCH 26/45] block: pass the non-rotational queue flag to backing_dev_info Wu Fengguang
2009-10-07 7:38 ` [PATCH 27/45] writeback: introduce wbc.for_background Wu Fengguang
2009-10-07 7:38 ` [PATCH 28/45] writeback: introduce wbc.nr_segments Wu Fengguang
2009-10-07 7:38 ` [PATCH 29/45] writeback: fix the shmem AOP_WRITEPAGE_ACTIVATE case Wu Fengguang
2009-10-07 11:57 ` Hugh Dickins
2009-10-07 14:00 ` Wu Fengguang
2009-10-07 7:38 ` [PATCH 30/45] vmscan: lumpy pageout Wu Fengguang
2009-10-07 7:38 ` [PATCH 31/45] writeback: sync old inodes first in background writeback Wu Fengguang
2010-07-12 3:01 ` Christoph Hellwig
2010-07-12 15:24 ` Wu Fengguang
2009-10-07 7:38 ` [PATCH 32/45] writeback: update kupdate expire timestamp on each scan of b_io Wu Fengguang
2009-10-07 7:38 ` [PATCH 34/45] writeback: sync livelock - kick background writeback Wu Fengguang
2009-10-07 7:38 ` [PATCH 35/45] writeback: sync livelock - use single timestamp for whole sync work Wu Fengguang
2009-10-07 7:38 ` [PATCH 36/45] writeback: sync livelock - curb dirty speed for inodes to be synced Wu Fengguang
2009-10-07 7:38 ` [PATCH 37/45] writeback: use timestamp to indicate dirty exceeded Wu Fengguang
2009-10-07 7:38 ` [PATCH 38/45] writeback: introduce queue b_more_io_wait Wu Fengguang
2009-10-07 7:38 ` [PATCH 39/45] writeback: remove wbc.more_io Wu Fengguang
2009-10-07 7:38 ` [PATCH 40/45] writeback: requeue_io_wait() on I_SYNC locked inode Wu Fengguang
2009-10-07 7:38 ` [PATCH 41/45] writeback: requeue_io_wait() on pages_skipped inode Wu Fengguang
2009-10-07 7:39 ` [PATCH 42/45] writeback: requeue_io_wait() on blocked inode Wu Fengguang
2009-10-07 7:39 ` [PATCH 43/45] writeback: requeue_io_wait() on fs redirtied inode Wu Fengguang
2009-10-07 7:39 ` [PATCH 44/45] NFS: remove NFS_INO_FLUSHING lock Wu Fengguang
2009-10-07 13:11 ` Peter Staubach
2009-10-07 13:32 ` Wu Fengguang
2009-10-07 13:59 ` Peter Staubach
2009-10-08 1:44 ` Wu Fengguang
2009-10-07 7:39 ` [PATCH 45/45] btrfs: fix race on syncing the btree inode Wu Fengguang
2009-10-07 8:53 ` [PATCH 00/45] some writeback experiments Peter Zijlstra
2009-10-07 10:17 ` [PATCH 14/45] writeback: quit on wrap for .range_cyclic (afs) David Howells
2009-10-07 10:21 ` Nick Piggin
2009-10-07 10:47 ` Wu Fengguang
2009-10-07 11:23 ` Nick Piggin
2009-10-07 12:21 ` Wu Fengguang
2009-10-07 13:47 ` [PATCH 00/45] some writeback experiments Peter Staubach
2009-10-07 15:18 ` Wu Fengguang
2009-10-08 5:33 ` Wu Fengguang
2009-10-08 5:44 ` Wu Fengguang
2009-10-07 14:26 ` Theodore Tso
2009-10-07 14:45 ` Wu Fengguang