* [PATCH 01/27] writeback: add bdi_dirty_limit() kernel-doc
2011-03-03 6:45 [PATCH 00/27] IO-less dirty throttling v6 Wu Fengguang
@ 2011-03-03 6:45 ` Wu Fengguang
2011-03-03 6:45 ` [PATCH 02/27] writeback: avoid duplicate balance_dirty_pages_ratelimited() calls Wu Fengguang
` (26 subsequent siblings)
27 siblings, 0 replies; 44+ messages in thread
From: Wu Fengguang @ 2011-03-03 6:45 UTC (permalink / raw)
To: Andrew Morton
Cc: Jan Kara, Wu Fengguang, Christoph Hellwig, Trond Myklebust,
Dave Chinner, Theodore Ts'o, Chris Mason, Peter Zijlstra,
Mel Gorman, Rik van Riel, KOSAKI Motohiro, Greg Thelen,
Minchan Kim, Vivek Goyal, Andrea Righi, Balbir Singh, linux-mm,
linux-fsdevel, LKML
[-- Attachment #1: writeback-task_dirty_limit-comment.patch --]
[-- Type: text/plain, Size: 1412 bytes --]
Clarify the bdi_dirty_limit() comment.
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
mm/page-writeback.c | 11 +++++++++--
1 file changed, 9 insertions(+), 2 deletions(-)
--- linux-next.orig/mm/page-writeback.c 2011-03-03 14:38:12.000000000 +0800
+++ linux-next/mm/page-writeback.c 2011-03-03 14:40:52.000000000 +0800
@@ -437,10 +437,17 @@ void global_dirty_limits(unsigned long *
*pdirty = dirty;
}
-/*
+/**
* bdi_dirty_limit - @bdi's share of dirty throttling threshold
+ * @bdi: the backing_dev_info to query
+ * @dirty: global dirty limit in pages
+ *
+ * Returns @bdi's dirty limit in pages. The term "dirty" in the context of
+ * dirty balancing includes all PG_dirty, PG_writeback and NFS unstable pages.
+ * Note that the "limit" in the name is not treated as a hard limit in
+ * balance_dirty_pages().
*
- * Allocate high/low dirty limits to fast/slow devices, in order to prevent
+ * It allocates high/low dirty limits to fast/slow devices, in order to prevent
* - starving fast devices
* - piling up dirty pages (that will take long time to sync) on slow devices
*
--
* [PATCH 02/27] writeback: avoid duplicate balance_dirty_pages_ratelimited() calls
2011-03-03 6:45 [PATCH 00/27] IO-less dirty throttling v6 Wu Fengguang
2011-03-03 6:45 ` [PATCH 01/27] writeback: add bdi_dirty_limit() kernel-doc Wu Fengguang
@ 2011-03-03 6:45 ` Wu Fengguang
2011-03-03 6:45 ` [PATCH 03/27] writeback: skip balance_dirty_pages() for in-memory fs Wu Fengguang
` (25 subsequent siblings)
27 siblings, 0 replies; 44+ messages in thread
From: Wu Fengguang @ 2011-03-03 6:45 UTC (permalink / raw)
To: Andrew Morton
Cc: Jan Kara, Wu Fengguang, Christoph Hellwig, Trond Myklebust,
Dave Chinner, Theodore Ts'o, Chris Mason, Peter Zijlstra,
Mel Gorman, Rik van Riel, KOSAKI Motohiro, Greg Thelen,
Minchan Kim, Vivek Goyal, Andrea Righi, Balbir Singh, linux-mm,
linux-fsdevel, LKML
[-- Attachment #1: writeback-fix-duplicate-bdp-calls.patch --]
[-- Type: text/plain, Size: 1433 bytes --]
When dd writes in 512-byte chunks, balance_dirty_pages_ratelimited() can
be called 8 times (4096/512) for the same page, even though the page is
only dirtied once. Fix it with a (slightly racy) PageDirty() test.
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
mm/filemap.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
--- linux-next.orig/mm/filemap.c 2011-03-02 14:18:52.000000000 +0800
+++ linux-next/mm/filemap.c 2011-03-02 14:20:07.000000000 +0800
@@ -2253,6 +2253,7 @@ static ssize_t generic_perform_write(str
long status = 0;
ssize_t written = 0;
unsigned int flags = 0;
+ unsigned int dirty;
/*
* Copies from kernel address space cannot fail (NFSD is a big user).
@@ -2301,6 +2302,7 @@ again:
pagefault_enable();
flush_dcache_page(page);
+ dirty = PageDirty(page);
mark_page_accessed(page);
status = a_ops->write_end(file, mapping, pos, bytes, copied,
page, fsdata);
@@ -2327,7 +2329,8 @@ again:
pos += copied;
written += copied;
- balance_dirty_pages_ratelimited(mapping);
+ if (!dirty)
+ balance_dirty_pages_ratelimited(mapping);
} while (iov_iter_count(i));
--
* [PATCH 03/27] writeback: skip balance_dirty_pages() for in-memory fs
2011-03-03 6:45 [PATCH 00/27] IO-less dirty throttling v6 Wu Fengguang
2011-03-03 6:45 ` [PATCH 01/27] writeback: add bdi_dirty_limit() kernel-doc Wu Fengguang
2011-03-03 6:45 ` [PATCH 02/27] writeback: avoid duplicate balance_dirty_pages_ratelimited() calls Wu Fengguang
@ 2011-03-03 6:45 ` Wu Fengguang
2011-03-03 6:45 ` [PATCH 04/27] writeback: reduce per-bdi dirty threshold ramp up time Wu Fengguang
` (24 subsequent siblings)
27 siblings, 0 replies; 44+ messages in thread
From: Wu Fengguang @ 2011-03-03 6:45 UTC (permalink / raw)
To: Andrew Morton
Cc: Jan Kara, Hugh Dickins, Peter Zijlstra, Rik van Riel,
Wu Fengguang, Christoph Hellwig, Trond Myklebust, Dave Chinner,
Theodore Ts'o, Chris Mason, Mel Gorman, KOSAKI Motohiro,
Greg Thelen, Minchan Kim, Vivek Goyal, Andrea Righi, Balbir Singh,
linux-mm, linux-fsdevel, LKML
[-- Attachment #1: writeback-trace-global-dirty-states-fix.patch --]
[-- Type: text/plain, Size: 3263 bytes --]
This avoids unnecessary checks and dirty throttling on tmpfs/ramfs.
It can also prevent
[ 388.126563] BUG: unable to handle kernel NULL pointer dereference at 0000000000000050
in the balance_dirty_pages tracepoint, which calls
dev_name(mapping->backing_dev_info->dev)
while shmem_backing_dev_info.dev is NULL.
Summary notes about the tmpfs/ramfs behavior changes:
In 2.6.36 and older kernels, tmpfs writes will sleep inside
balance_dirty_pages() as long as we are over the (dirty+background)/2
global throttle threshold. This is because both the bdi dirty pages and
the bdi threshold are 0 for tmpfs/ramfs, hence this test always
evaluates to TRUE:
dirty_exceeded =
(bdi_nr_reclaimable + bdi_nr_writeback >= bdi_thresh)
|| (nr_reclaimable + nr_writeback >= dirty_thresh);
For 2.6.37, someone complained that this logic does not allow users to
set vm.dirty_ratio=0, so commit 4cbec4c8b9 changed the test to
dirty_exceeded =
(bdi_nr_reclaimable + bdi_nr_writeback > bdi_thresh)
|| (nr_reclaimable + nr_writeback > dirty_thresh);
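For tmpfs/ramfs both sides of the bdi comparison are zero, so flipping
">=" to ">" flips the outcome; a minimal illustration:

	2.6.36:  0 + 0 >= 0   is TRUE   ->  writes get throttled
	2.6.37:  0 + 0 >  0   is FALSE  ->  the bdi test never triggers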
So 2.6.37 behaves differently for tmpfs/ramfs: they will never get
throttled unless the global dirty threshold is exceeded (which is very
unlikely to happen; when it does, it blocks many tasks).
I'd say the 2.6.36 behavior is very bad for tmpfs/ramfs: on a busy
writing server, tmpfs write()s may get livelocked! The "inadvertent"
throttling can hardly help any workload because of its "either no
throttling, or throttled to death" property.
So, with 2.6.37 as the baseline, this patch won't bring further
noticeable changes.
CC: Hugh Dickins <hughd@google.com>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
mm/page-writeback.c | 10 ++++------
1 file changed, 4 insertions(+), 6 deletions(-)
--- linux-next.orig/mm/page-writeback.c 2011-03-03 14:43:37.000000000 +0800
+++ linux-next/mm/page-writeback.c 2011-03-03 14:43:51.000000000 +0800
@@ -244,13 +244,8 @@ void task_dirty_inc(struct task_struct *
static void bdi_writeout_fraction(struct backing_dev_info *bdi,
long *numerator, long *denominator)
{
- if (bdi_cap_writeback_dirty(bdi)) {
- prop_fraction_percpu(&vm_completions, &bdi->completions,
+ prop_fraction_percpu(&vm_completions, &bdi->completions,
numerator, denominator);
- } else {
- *numerator = 0;
- *denominator = 1;
- }
}
static inline void task_dirties_fraction(struct task_struct *tsk,
@@ -495,6 +490,9 @@ static void balance_dirty_pages(struct a
bool dirty_exceeded = false;
struct backing_dev_info *bdi = mapping->backing_dev_info;
+ if (!bdi_cap_account_dirty(bdi))
+ return;
+
for (;;) {
struct writeback_control wbc = {
.sync_mode = WB_SYNC_NONE,
--
* [PATCH 04/27] writeback: reduce per-bdi dirty threshold ramp up time
2011-03-03 6:45 [PATCH 00/27] IO-less dirty throttling v6 Wu Fengguang
` (2 preceding siblings ...)
2011-03-03 6:45 ` [PATCH 03/27] writeback: skip balance_dirty_pages() for in-memory fs Wu Fengguang
@ 2011-03-03 6:45 ` Wu Fengguang
2011-03-03 6:45 ` [PATCH 05/27] btrfs: avoid duplicate balance_dirty_pages_ratelimited() calls Wu Fengguang
` (23 subsequent siblings)
27 siblings, 0 replies; 44+ messages in thread
From: Wu Fengguang @ 2011-03-03 6:45 UTC (permalink / raw)
To: Andrew Morton
Cc: Jan Kara, Peter Zijlstra, Richard Kennedy, Wu Fengguang,
Christoph Hellwig, Trond Myklebust, Dave Chinner,
Theodore Ts'o, Chris Mason, Mel Gorman, Rik van Riel,
KOSAKI Motohiro, Greg Thelen, Minchan Kim, Vivek Goyal,
Andrea Righi, Balbir Singh, linux-mm, linux-fsdevel, LKML
[-- Attachment #1: writeback-speedup-per-bdi-threshold-ramp-up.patch --]
[-- Type: text/plain, Size: 1097 bytes --]
Reduce the dampening for the control system, yielding faster
convergence. The change is a bit conservative, as smaller values may
lead to noticeable bdi threshold fluctuations in low-memory JBOD setups.
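For a rough sense of the magnitude (my reading of the vm_completions
proportion code; the exact constant is not critical):

	before:  shift = 2 + ilog2(dirty_total - 1)  ->  period ~= 4 * dirty_total completions
	after:   shift =     ilog2(dirty_total - 1)  ->  period ~=     dirty_total completions

so a bdi's measured share of the write-out completions, and hence its
dirty threshold, adapts roughly 4x faster.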
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
CC: Richard Kennedy <richard@rsk.demon.co.uk>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
mm/page-writeback.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
--- linux-next.orig/mm/page-writeback.c 2011-03-02 14:52:19.000000000 +0800
+++ linux-next/mm/page-writeback.c 2011-03-02 15:00:17.000000000 +0800
@@ -145,7 +145,7 @@ static int calc_period_shift(void)
else
dirty_total = (vm_dirty_ratio * determine_dirtyable_memory()) /
100;
- return 2 + ilog2(dirty_total - 1);
+ return ilog2(dirty_total - 1);
}
/*
--
* [PATCH 05/27] btrfs: avoid duplicate balance_dirty_pages_ratelimited() calls
2011-03-03 6:45 [PATCH 00/27] IO-less dirty throttling v6 Wu Fengguang
` (3 preceding siblings ...)
2011-03-03 6:45 ` [PATCH 04/27] writeback: reduce per-bdi dirty threshold ramp up time Wu Fengguang
@ 2011-03-03 6:45 ` Wu Fengguang
2011-03-03 6:45 ` [PATCH 06/27] btrfs: lower the dirty balance poll interval Wu Fengguang
` (22 subsequent siblings)
27 siblings, 0 replies; 44+ messages in thread
From: Wu Fengguang @ 2011-03-03 6:45 UTC (permalink / raw)
To: Andrew Morton
Cc: Jan Kara, Chris Mason, Wu Fengguang, Christoph Hellwig,
Trond Myklebust, Dave Chinner, Theodore Ts'o, Peter Zijlstra,
Mel Gorman, Rik van Riel, KOSAKI Motohiro, Greg Thelen,
Minchan Kim, Vivek Goyal, Andrea Righi, Balbir Singh, linux-mm,
linux-fsdevel, LKML
[-- Attachment #1: btrfs-fix-balance-size.patch --]
[-- Type: text/plain, Size: 4153 bytes --]
When doing 1KB sequential writes to the same page,
balance_dirty_pages_ratelimited() should be called once rather than 4
times. Failing to do so throttles all tasks much too heavily.
CC: Chris Mason <chris.mason@oracle.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
fs/btrfs/file.c | 11 +++++++----
fs/btrfs/ioctl.c | 6 ++++--
fs/btrfs/relocation.c | 6 ++++--
3 files changed, 15 insertions(+), 8 deletions(-)
--- linux-next.orig/fs/btrfs/file.c 2011-02-21 14:24:56.000000000 +0800
+++ linux-next/fs/btrfs/file.c 2011-02-21 14:37:34.000000000 +0800
@@ -770,7 +770,8 @@ out:
static noinline int prepare_pages(struct btrfs_root *root, struct file *file,
struct page **pages, size_t num_pages,
loff_t pos, unsigned long first_index,
- unsigned long last_index, size_t write_bytes)
+ unsigned long last_index, size_t write_bytes,
+ int *nr_dirtied)
{
struct extent_state *cached_state = NULL;
int i;
@@ -837,7 +838,8 @@ again:
GFP_NOFS);
}
for (i = 0; i < num_pages; i++) {
- clear_page_dirty_for_io(pages[i]);
+ if (!clear_page_dirty_for_io(pages[i]))
+ (*nr_dirtied)++;
set_page_extent_mapped(pages[i]);
WARN_ON(!PageLocked(pages[i]));
}
@@ -989,6 +991,7 @@ static ssize_t btrfs_file_aio_write(stru
}
while (iov_iter_count(&i) > 0) {
+ int nr_dirtied = 0;
size_t offset = pos & (PAGE_CACHE_SIZE - 1);
size_t write_bytes = min(iov_iter_count(&i),
nrptrs * (size_t)PAGE_CACHE_SIZE -
@@ -1015,7 +1018,7 @@ static ssize_t btrfs_file_aio_write(stru
ret = prepare_pages(root, file, pages, num_pages,
pos, first_index, last_index,
- write_bytes);
+ write_bytes, &nr_dirtied);
if (ret) {
btrfs_delalloc_release_space(inode,
num_pages << PAGE_CACHE_SHIFT);
@@ -1050,7 +1053,7 @@ static ssize_t btrfs_file_aio_write(stru
} else {
balance_dirty_pages_ratelimited_nr(
inode->i_mapping,
- dirty_pages);
+ nr_dirtied);
if (dirty_pages <
(root->leafsize >> PAGE_CACHE_SHIFT) + 1)
btrfs_btree_balance_dirty(root, 1);
--- linux-next.orig/fs/btrfs/ioctl.c 2011-02-21 14:24:56.000000000 +0800
+++ linux-next/fs/btrfs/ioctl.c 2011-02-21 14:26:21.000000000 +0800
@@ -654,6 +654,7 @@ static int btrfs_defrag_file(struct file
u64 skip = 0;
u64 defrag_end = 0;
unsigned long i;
+ int dirtied;
int ret;
int compress_type = BTRFS_COMPRESS_ZLIB;
@@ -766,7 +767,7 @@ again:
btrfs_set_extent_delalloc(inode, page_start, page_end, NULL);
ClearPageChecked(page);
- set_page_dirty(page);
+ dirtied = set_page_dirty(page);
unlock_extent(io_tree, page_start, page_end, GFP_NOFS);
loop_unlock:
@@ -774,7 +775,8 @@ loop_unlock:
page_cache_release(page);
mutex_unlock(&inode->i_mutex);
- balance_dirty_pages_ratelimited_nr(inode->i_mapping, 1);
+ if (dirtied)
+ balance_dirty_pages_ratelimited_nr(inode->i_mapping, 1);
i++;
}
--- linux-next.orig/fs/btrfs/relocation.c 2011-02-21 14:24:56.000000000 +0800
+++ linux-next/fs/btrfs/relocation.c 2011-02-21 14:26:21.000000000 +0800
@@ -2902,6 +2902,7 @@ static int relocate_file_extent_cluster(
struct file_ra_state *ra;
int nr = 0;
int ret = 0;
+ int dirtied;
if (!cluster->nr)
return 0;
@@ -2978,7 +2979,7 @@ static int relocate_file_extent_cluster(
}
btrfs_set_extent_delalloc(inode, page_start, page_end, NULL);
- set_page_dirty(page);
+ dirtied = set_page_dirty(page);
unlock_extent(&BTRFS_I(inode)->io_tree,
page_start, page_end, GFP_NOFS);
@@ -2986,7 +2987,8 @@ static int relocate_file_extent_cluster(
page_cache_release(page);
index++;
- balance_dirty_pages_ratelimited(inode->i_mapping);
+ if (dirtied)
+ balance_dirty_pages_ratelimited(inode->i_mapping);
btrfs_throttle(BTRFS_I(inode)->root);
}
WARN_ON(nr != cluster->nr);
--
* [PATCH 06/27] btrfs: lower the dirty balance poll interval
2011-03-03 6:45 [PATCH 00/27] IO-less dirty throttling v6 Wu Fengguang
` (4 preceding siblings ...)
2011-03-03 6:45 ` [PATCH 05/27] btrfs: avoid duplicate balance_dirty_pages_ratelimited() calls Wu Fengguang
@ 2011-03-03 6:45 ` Wu Fengguang
2011-03-04 6:22 ` Dave Chinner
2011-03-03 6:45 ` [PATCH 07/27] btrfs: wait on too many nr_async_bios Wu Fengguang
` (21 subsequent siblings)
27 siblings, 1 reply; 44+ messages in thread
From: Wu Fengguang @ 2011-03-03 6:45 UTC (permalink / raw)
To: Andrew Morton
Cc: Jan Kara, Chris Mason, Wu Fengguang, Christoph Hellwig,
Trond Myklebust, Dave Chinner, Theodore Ts'o, Peter Zijlstra,
Mel Gorman, Rik van Riel, KOSAKI Motohiro, Greg Thelen,
Minchan Kim, Vivek Goyal, Andrea Righi, Balbir Singh, linux-mm,
linux-fsdevel, LKML
[-- Attachment #1: btrfs-limit-nr-dirtied.patch --]
[-- Type: text/plain, Size: 1283 bytes --]
Call balance_dirty_pages_ratelimited_nr() for every 32 pages dirtied.
Tests show that the original, larger intervals can easily cause the bdi
dirty limit to be exceeded with 100 concurrent dd's.
CC: Chris Mason <chris.mason@oracle.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
fs/btrfs/file.c | 5 ++---
1 file changed, 2 insertions(+), 3 deletions(-)
--- linux-next.orig/fs/btrfs/file.c 2011-03-02 20:15:19.000000000 +0800
+++ linux-next/fs/btrfs/file.c 2011-03-02 20:35:07.000000000 +0800
@@ -949,9 +949,8 @@ static ssize_t btrfs_file_aio_write(stru
}
iov_iter_init(&i, iov, nr_segs, count, num_written);
- nrptrs = min((iov_iter_count(&i) + PAGE_CACHE_SIZE - 1) /
- PAGE_CACHE_SIZE, PAGE_CACHE_SIZE /
- (sizeof(struct page *)));
+ nrptrs = min(DIV_ROUND_UP(iov_iter_count(&i), PAGE_CACHE_SIZE),
+ min(32UL, PAGE_CACHE_SIZE / sizeof(struct page *)));
pages = kmalloc(nrptrs * sizeof(struct page *), GFP_KERNEL);
if (!pages) {
ret = -ENOMEM;
--
* Re: [PATCH 06/27] btrfs: lower the dirty balance poll interval
2011-03-03 6:45 ` [PATCH 06/27] btrfs: lower the dirty balance poll interval Wu Fengguang
@ 2011-03-04 6:22 ` Dave Chinner
2011-03-04 7:57 ` Wu Fengguang
0 siblings, 1 reply; 44+ messages in thread
From: Dave Chinner @ 2011-03-04 6:22 UTC (permalink / raw)
To: Wu Fengguang
Cc: Andrew Morton, Jan Kara, Chris Mason, Christoph Hellwig,
Trond Myklebust, Theodore Ts'o, Peter Zijlstra, Mel Gorman,
Rik van Riel, KOSAKI Motohiro, Greg Thelen, Minchan Kim,
Vivek Goyal, Andrea Righi, Balbir Singh, linux-mm, linux-fsdevel,
LKML
On Thu, Mar 03, 2011 at 02:45:11PM +0800, Wu Fengguang wrote:
> Call balance_dirty_pages_ratelimited_nr() for every 32 pages dirtied.
>
> Tests show that the original, larger intervals can easily cause the bdi
> dirty limit to be exceeded with 100 concurrent dd's.
>
> CC: Chris Mason <chris.mason@oracle.com>
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> ---
> fs/btrfs/file.c | 5 ++---
> 1 file changed, 2 insertions(+), 3 deletions(-)
>
> --- linux-next.orig/fs/btrfs/file.c 2011-03-02 20:15:19.000000000 +0800
> +++ linux-next/fs/btrfs/file.c 2011-03-02 20:35:07.000000000 +0800
> @@ -949,9 +949,8 @@ static ssize_t btrfs_file_aio_write(stru
> }
>
> iov_iter_init(&i, iov, nr_segs, count, num_written);
> - nrptrs = min((iov_iter_count(&i) + PAGE_CACHE_SIZE - 1) /
> - PAGE_CACHE_SIZE, PAGE_CACHE_SIZE /
> - (sizeof(struct page *)));
> + nrptrs = min(DIV_ROUND_UP(iov_iter_count(&i), PAGE_CACHE_SIZE),
> + min(32UL, PAGE_CACHE_SIZE / sizeof(struct page *)));
You're basically hardcoding the maximum to 32 pages here, because
PAGE_CACHE_SIZE / sizeof(page *) is always going to be much larger
than 32.
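To put numbers on that (assuming 4 KB pages and 8-byte pointers):

	PAGE_CACHE_SIZE / sizeof(struct page *) = 4096 / 8 = 512
	min(32UL, 512) = 32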
This means that you are effectively neutering the large write
efficiencies of btrfs - you're reducing the delayed allocation sizes
from 512 * PAGE_CACHE_SIZE down to 32 * PAGE_CACHE_SIZE. This will
increase the overhead of the write process for btrfs for large IOs.
Also, I've got some multipage write modifications that allow 1024
pages at a time between mapping/allocation calls with XFS - once
again for improving the efficiencies of the extent
mapping/allocations in the write path. If the new writeback
throttling algorithms don't work with large numbers of pages being
copied in a single go, then that's a problem.
As it is, if 100 concurrent dd's can overrun the dirty limit w/ 512
pages at a time, then 1000 concurrent dd's w/ 32 pages at a time is
just as likely to overrun it, too. We support 4096 CPU systems, so a
few thousand concurrent writers is not out of the question. Hence I
don't think just reducing the number of pages between dirty balance
calls is a sufficient solution....
Cheers,
Dave..
--
Dave Chinner
david@fromorbit.com
* Re: [PATCH 06/27] btrfs: lower the dirty balance poll interval
2011-03-04 6:22 ` Dave Chinner
@ 2011-03-04 7:57 ` Wu Fengguang
0 siblings, 0 replies; 44+ messages in thread
From: Wu Fengguang @ 2011-03-04 7:57 UTC (permalink / raw)
To: Dave Chinner
Cc: Andrew Morton, Jan Kara, Chris Mason, Christoph Hellwig,
Trond Myklebust, Theodore Ts'o, Peter Zijlstra, Mel Gorman,
Rik van Riel, KOSAKI Motohiro, Greg Thelen, Minchan Kim,
Vivek Goyal, Andrea Righi, Balbir Singh, linux-mm,
linux-fsdevel@vger.kernel.org, LKML
On Fri, Mar 04, 2011 at 02:22:17PM +0800, Dave Chinner wrote:
> On Thu, Mar 03, 2011 at 02:45:11PM +0800, Wu Fengguang wrote:
> > Call balance_dirty_pages_ratelimited_nr() for every 32 pages dirtied.
> >
> > Tests show that the original, larger intervals can easily cause the bdi
> > dirty limit to be exceeded with 100 concurrent dd's.
> >
> > CC: Chris Mason <chris.mason@oracle.com>
> > Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> > ---
> > fs/btrfs/file.c | 5 ++---
> > 1 file changed, 2 insertions(+), 3 deletions(-)
> >
> > --- linux-next.orig/fs/btrfs/file.c 2011-03-02 20:15:19.000000000 +0800
> > +++ linux-next/fs/btrfs/file.c 2011-03-02 20:35:07.000000000 +0800
> > @@ -949,9 +949,8 @@ static ssize_t btrfs_file_aio_write(stru
> > }
> >
> > iov_iter_init(&i, iov, nr_segs, count, num_written);
> > - nrptrs = min((iov_iter_count(&i) + PAGE_CACHE_SIZE - 1) /
> > - PAGE_CACHE_SIZE, PAGE_CACHE_SIZE /
> > - (sizeof(struct page *)));
> > + nrptrs = min(DIV_ROUND_UP(iov_iter_count(&i), PAGE_CACHE_SIZE),
> > + min(32UL, PAGE_CACHE_SIZE / sizeof(struct page *)));
>
> You're basically hardcoding the maximum to 32 pages here, because
> PAGE_CACHE_SIZE / sizeof(page *) is always going to be much larger
> than 32.
>
> This means that you are effectively neutering the large write
> efficiencies of btrfs - you're reducing the delayed allocation sizes
> from 512 * PAGE_CACHE_SIZE down to 32 * PAGE_CACHE_SIZE. This will
> increase the overhead of the write process for btrfs for large IOs.
>
> Also, I've got some multipage write modifications that allow 1024
> pages at a time between mapping/allocation calls with XFS - once
> again for improving the efficiencies of the extent
> mapping/allocations in the write path. If the new writeback
> throttling algorithms don't work with large numbers of pages being
> copied in a single go, then that's a problem.
>
> As it is, if 100 concurrent dd's can overrun the dirty limit w/ 512
> pages at a time, then 1000 concurrent dd's w/ 32 pages at a time is
> just as likely to overrun it, too. We support 4096 CPU systems, so a
> few thousand concurrent writers is not out of the question. Hence I
> don't think just reducing the number of pages between dirty balance
> calls is a sufficient solution....
Yes, I have probably been too nervous about temporarily exceeding the
dirty limit. I do keep an improvement patch in house; however, it adds
a btrfs dependency on the VFS changes, so it can only be submitted to
btrfs after those have been merged. As the 32-page limit will hurt
normal workloads, I'll drop it and merge it with the patch below.
Thanks,
Fengguang
---
--- linux-next.orig/fs/btrfs/file.c 2011-03-02 20:35:54.000000000 +0800
+++ linux-next/fs/btrfs/file.c 2011-03-02 20:34:07.000000000 +0800
@@ -950,7 +950,8 @@ static ssize_t btrfs_file_aio_write(stru
iov_iter_init(&i, iov, nr_segs, count, num_written);
nrptrs = min(DIV_ROUND_UP(iov_iter_count(&i), PAGE_CACHE_SIZE),
- min(32UL, PAGE_CACHE_SIZE / sizeof(struct page *)));
+ min(PAGE_CACHE_SIZE / sizeof(struct page *),
+ current->nr_dirtied_pause));
pages = kmalloc(nrptrs * sizeof(struct page *), GFP_KERNEL);
if (!pages) {
ret = -ENOMEM;
--
* [PATCH 07/27] btrfs: wait on too many nr_async_bios
2011-03-03 6:45 [PATCH 00/27] IO-less dirty throttling v6 Wu Fengguang
` (5 preceding siblings ...)
2011-03-03 6:45 ` [PATCH 06/27] btrfs: lower the dirty balance poll interval Wu Fengguang
@ 2011-03-03 6:45 ` Wu Fengguang
2011-03-03 6:45 ` [PATCH 08/27] nfs: dirty livelock prevention is now done in VFS Wu Fengguang
` (20 subsequent siblings)
27 siblings, 0 replies; 44+ messages in thread
From: Wu Fengguang @ 2011-03-03 6:45 UTC (permalink / raw)
To: Andrew Morton
Cc: Jan Kara, Wu Fengguang, Christoph Hellwig, Trond Myklebust,
Dave Chinner, Theodore Ts'o, Chris Mason, Peter Zijlstra,
Mel Gorman, Rik van Riel, KOSAKI Motohiro, Greg Thelen,
Minchan Kim, Vivek Goyal, Andrea Righi, Balbir Singh, linux-mm,
linux-fsdevel, LKML
[-- Attachment #1: btrfs-nr_async_bios-wait.patch --]
[-- Type: text/plain, Size: 1736 bytes --]
Tests show that btrfs is repeatedly moving _all_ PG_dirty pages into
PG_writeback state. It's desirable to have some limit on the number of
writeback pages.
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
fs/btrfs/disk-io.c | 7 +++++++
1 file changed, 7 insertions(+)
before patch:
http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/tests/3G/btrfs-1dd-1M-8p-2952M-2.6.37-rc5+-2010-12-08-21-30/vmstat-dirty-300.png
after patch:
http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/tests/3G/btrfs-1dd-1M-8p-2952M-2.6.37-rc5+-2010-12-08-21-14/vmstat-dirty-300.png
--- linux-next.orig/fs/btrfs/disk-io.c 2011-03-03 14:03:39.000000000 +0800
+++ linux-next/fs/btrfs/disk-io.c 2011-03-03 14:03:40.000000000 +0800
@@ -616,6 +616,7 @@ int btrfs_wq_submit_bio(struct btrfs_fs_
extent_submit_bio_hook_t *submit_bio_done)
{
struct async_submit_bio *async;
+ int limit;
async = kmalloc(sizeof(*async), GFP_NOFS);
if (!async)
@@ -643,6 +644,12 @@ int btrfs_wq_submit_bio(struct btrfs_fs_
btrfs_queue_worker(&fs_info->workers, &async->work);
+ limit = btrfs_async_submit_limit(fs_info);
+
+ if (atomic_read(&fs_info->nr_async_bios) > limit)
+ wait_event(fs_info->async_submit_wait,
+ (atomic_read(&fs_info->nr_async_bios) < limit));
+
while (atomic_read(&fs_info->async_submit_draining) &&
atomic_read(&fs_info->nr_async_submits)) {
wait_event(fs_info->async_submit_wait,
--
* [PATCH 08/27] nfs: dirty livelock prevention is now done in VFS
2011-03-03 6:45 [PATCH 00/27] IO-less dirty throttling v6 Wu Fengguang
` (6 preceding siblings ...)
2011-03-03 6:45 ` [PATCH 07/27] btrfs: wait on too many nr_async_bios Wu Fengguang
@ 2011-03-03 6:45 ` Wu Fengguang
2011-03-03 6:45 ` [PATCH 09/27] nfs: writeback pages wait queue Wu Fengguang
` (19 subsequent siblings)
27 siblings, 0 replies; 44+ messages in thread
From: Wu Fengguang @ 2011-03-03 6:45 UTC (permalink / raw)
To: Andrew Morton
Cc: Jan Kara, Trond Myklebust, Wu Fengguang, Christoph Hellwig,
Dave Chinner, Theodore Ts'o, Chris Mason, Peter Zijlstra,
Mel Gorman, Rik van Riel, KOSAKI Motohiro, Greg Thelen,
Minchan Kim, Vivek Goyal, Andrea Righi, Balbir Singh, linux-mm,
linux-fsdevel, LKML
[-- Attachment #1: nfs-revert-livelock-72cb77f4a5ac.patch --]
[-- Type: text/plain, Size: 3060 bytes --]
This reverts commit 72cb77f4a5 ("NFS: Throttle page dirtying while we're
flushing to disk"). The two problems it tries to address
- sync live lock
- out of order writes
are now all addressed in the VFS
- PAGECACHE_TAG_TOWRITE prevents sync live lock
- IO-less balance_dirty_pages() avoids concurrent writes
CC: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
fs/nfs/file.c | 9 ---------
fs/nfs/write.c | 11 -----------
include/linux/nfs_fs.h | 1 -
3 files changed, 21 deletions(-)
--- linux-next.orig/fs/nfs/file.c 2011-02-21 14:21:13.000000000 +0800
+++ linux-next/fs/nfs/file.c 2011-02-21 14:28:57.000000000 +0800
@@ -392,15 +392,6 @@ static int nfs_write_begin(struct file *
IOMODE_RW);
start:
- /*
- * Prevent starvation issues if someone is doing a consistency
- * sync-to-disk
- */
- ret = wait_on_bit(&NFS_I(mapping->host)->flags, NFS_INO_FLUSHING,
- nfs_wait_bit_killable, TASK_KILLABLE);
- if (ret)
- return ret;
-
page = grab_cache_page_write_begin(mapping, index, flags);
if (!page)
return -ENOMEM;
--- linux-next.orig/fs/nfs/write.c 2011-02-21 14:24:57.000000000 +0800
+++ linux-next/fs/nfs/write.c 2011-02-21 14:28:57.000000000 +0800
@@ -337,26 +337,15 @@ static int nfs_writepages_callback(struc
int nfs_writepages(struct address_space *mapping, struct writeback_control *wbc)
{
struct inode *inode = mapping->host;
- unsigned long *bitlock = &NFS_I(inode)->flags;
struct nfs_pageio_descriptor pgio;
int err;
- /* Stop dirtying of new pages while we sync */
- err = wait_on_bit_lock(bitlock, NFS_INO_FLUSHING,
- nfs_wait_bit_killable, TASK_KILLABLE);
- if (err)
- goto out_err;
-
nfs_inc_stats(inode, NFSIOS_VFSWRITEPAGES);
nfs_pageio_init_write(&pgio, inode, wb_priority(wbc));
err = write_cache_pages(mapping, wbc, nfs_writepages_callback, &pgio);
nfs_pageio_complete(&pgio);
- clear_bit_unlock(NFS_INO_FLUSHING, bitlock);
- smp_mb__after_clear_bit();
- wake_up_bit(bitlock, NFS_INO_FLUSHING);
-
if (err < 0)
goto out_err;
err = pgio.pg_error;
--- linux-next.orig/include/linux/nfs_fs.h 2011-02-21 14:24:59.000000000 +0800
+++ linux-next/include/linux/nfs_fs.h 2011-02-21 14:29:00.000000000 +0800
@@ -215,7 +215,6 @@ struct nfs_inode {
#define NFS_INO_ADVISE_RDPLUS (0) /* advise readdirplus */
#define NFS_INO_STALE (1) /* possible stale inode */
#define NFS_INO_ACL_LRU_SET (2) /* Inode is on the LRU list */
-#define NFS_INO_FLUSHING (4) /* inode is flushing out data */
#define NFS_INO_FSCACHE (5) /* inode can be cached by FS-Cache */
#define NFS_INO_FSCACHE_LOCK (6) /* FS-Cache cookie management lock */
#define NFS_INO_COMMIT (7) /* inode is committing unstable writes */
--
* [PATCH 09/27] nfs: writeback pages wait queue
2011-03-03 6:45 [PATCH 00/27] IO-less dirty throttling v6 Wu Fengguang
` (7 preceding siblings ...)
2011-03-03 6:45 ` [PATCH 08/27] nfs: dirty livelock prevention is now done in VFS Wu Fengguang
@ 2011-03-03 6:45 ` Wu Fengguang
2011-03-03 16:07 ` Peter Zijlstra
2011-03-03 16:08 ` Peter Zijlstra
2011-03-03 6:45 ` [PATCH 10/27] nfs: limit the commit size to reduce fluctuations Wu Fengguang
` (18 subsequent siblings)
27 siblings, 2 replies; 44+ messages in thread
From: Wu Fengguang @ 2011-03-03 6:45 UTC (permalink / raw)
To: Andrew Morton
Cc: Jan Kara, Jens Axboe, Chris Mason, Peter Zijlstra,
Trond Myklebust, Wu Fengguang, Christoph Hellwig, Dave Chinner,
Theodore Ts'o, Mel Gorman, Rik van Riel, KOSAKI Motohiro,
Greg Thelen, Minchan Kim, Vivek Goyal, Andrea Righi, Balbir Singh,
linux-mm, linux-fsdevel, LKML
[-- Attachment #1: writeback-nfs-request-queue.patch --]
[-- Type: text/plain, Size: 6157 bytes --]
The generic writeback routines are moving away from congestion_wait()
in favor of get_request_wait(), i.e. waiting on the block request queues.
Introduce the missing writeback wait queue for NFS; otherwise its
writeback pages will grow out of control, exhausting all PG_dirty pages.
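As a rough illustration of the thresholds introduced below (assuming
4 KB pages and, say, nfs_congestion_kb = 65536, i.e. 64 MB):

	limit          = 65536 KB >> (PAGE_SHIFT - 10) = 16384 pages
	NFS_WAIT_PAGES =  1024   >> (PAGE_SHIFT - 10) =   256 pages

	ASYNC writers are marked congested above 16384 in-flight writeback
	pages and woken below 16384 - min(16384/8, 256) = 16128 pages;
	SYNC writers use 2 * limit (32768) the same way, so they are never
	blocked behind ASYNC ones.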
CC: Jens Axboe <axboe@kernel.dk>
CC: Chris Mason <chris.mason@oracle.com>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
CC: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
fs/nfs/client.c | 2
fs/nfs/write.c | 93 +++++++++++++++++++++++++++++++-----
include/linux/nfs_fs_sb.h | 1
3 files changed, 85 insertions(+), 11 deletions(-)
--- linux-next.orig/fs/nfs/write.c 2011-03-03 14:03:40.000000000 +0800
+++ linux-next/fs/nfs/write.c 2011-03-03 14:03:40.000000000 +0800
@@ -185,11 +185,68 @@ static int wb_priority(struct writeback_
* NFS congestion control
*/
+#define NFS_WAIT_PAGES (1024L >> (PAGE_SHIFT - 10))
int nfs_congestion_kb;
-#define NFS_CONGESTION_ON_THRESH (nfs_congestion_kb >> (PAGE_SHIFT-10))
-#define NFS_CONGESTION_OFF_THRESH \
- (NFS_CONGESTION_ON_THRESH - (NFS_CONGESTION_ON_THRESH >> 2))
+/*
+ * SYNC requests will block on (2*limit) and wakeup on (2*limit-NFS_WAIT_PAGES)
+ * ASYNC requests will block on (limit) and wakeup on (limit - NFS_WAIT_PAGES)
+ * In this way SYNC writes will never be blocked by ASYNC ones.
+ */
+
+static void nfs_set_congested(long nr, struct backing_dev_info *bdi)
+{
+ long limit = nfs_congestion_kb >> (PAGE_SHIFT - 10);
+
+ if (nr > limit && !test_bit(BDI_async_congested, &bdi->state))
+ set_bdi_congested(bdi, BLK_RW_ASYNC);
+ else if (nr > 2 * limit && !test_bit(BDI_sync_congested, &bdi->state))
+ set_bdi_congested(bdi, BLK_RW_SYNC);
+}
+
+static void nfs_wait_contested(int is_sync,
+ struct backing_dev_info *bdi,
+ wait_queue_head_t *wqh)
+{
+ int waitbit = is_sync ? BDI_sync_congested : BDI_async_congested;
+ DEFINE_WAIT(wait);
+
+ if (!test_bit(waitbit, &bdi->state))
+ return;
+
+ for (;;) {
+ prepare_to_wait(&wqh[is_sync], &wait, TASK_UNINTERRUPTIBLE);
+ if (!test_bit(waitbit, &bdi->state))
+ break;
+
+ io_schedule();
+ }
+ finish_wait(&wqh[is_sync], &wait);
+}
+
+static void nfs_wakeup_congested(long nr,
+ struct backing_dev_info *bdi,
+ wait_queue_head_t *wqh)
+{
+ long limit = nfs_congestion_kb >> (PAGE_SHIFT - 10);
+
+ if (nr < 2 * limit - min(limit / 8, NFS_WAIT_PAGES)) {
+ if (test_bit(BDI_sync_congested, &bdi->state)) {
+ clear_bdi_congested(bdi, BLK_RW_SYNC);
+ smp_mb__after_clear_bit();
+ }
+ if (waitqueue_active(&wqh[BLK_RW_SYNC]))
+ wake_up(&wqh[BLK_RW_SYNC]);
+ }
+ if (nr < limit - min(limit / 8, NFS_WAIT_PAGES)) {
+ if (test_bit(BDI_async_congested, &bdi->state)) {
+ clear_bdi_congested(bdi, BLK_RW_ASYNC);
+ smp_mb__after_clear_bit();
+ }
+ if (waitqueue_active(&wqh[BLK_RW_ASYNC]))
+ wake_up(&wqh[BLK_RW_ASYNC]);
+ }
+}
static int nfs_set_page_writeback(struct page *page)
{
@@ -200,11 +257,8 @@ static int nfs_set_page_writeback(struct
struct nfs_server *nfss = NFS_SERVER(inode);
page_cache_get(page);
- if (atomic_long_inc_return(&nfss->writeback) >
- NFS_CONGESTION_ON_THRESH) {
- set_bdi_congested(&nfss->backing_dev_info,
- BLK_RW_ASYNC);
- }
+ nfs_set_congested(atomic_long_inc_return(&nfss->writeback),
+ &nfss->backing_dev_info);
}
return ret;
}
@@ -216,8 +270,10 @@ static void nfs_end_page_writeback(struc
end_page_writeback(page);
page_cache_release(page);
- if (atomic_long_dec_return(&nfss->writeback) < NFS_CONGESTION_OFF_THRESH)
- clear_bdi_congested(&nfss->backing_dev_info, BLK_RW_ASYNC);
+
+ nfs_wakeup_congested(atomic_long_dec_return(&nfss->writeback),
+ &nfss->backing_dev_info,
+ nfss->writeback_wait);
}
static struct nfs_page *nfs_find_and_lock_request(struct page *page, bool nonblock)
@@ -318,19 +374,34 @@ static int nfs_writepage_locked(struct p
int nfs_writepage(struct page *page, struct writeback_control *wbc)
{
+ struct inode *inode = page->mapping->host;
+ struct nfs_server *nfss = NFS_SERVER(inode);
int ret;
ret = nfs_writepage_locked(page, wbc);
unlock_page(page);
+
+ nfs_wait_contested(wbc->sync_mode == WB_SYNC_ALL,
+ &nfss->backing_dev_info,
+ nfss->writeback_wait);
+
return ret;
}
-static int nfs_writepages_callback(struct page *page, struct writeback_control *wbc, void *data)
+static int nfs_writepages_callback(struct page *page,
+ struct writeback_control *wbc, void *data)
{
+ struct inode *inode = page->mapping->host;
+ struct nfs_server *nfss = NFS_SERVER(inode);
int ret;
ret = nfs_do_writepage(page, wbc, data);
unlock_page(page);
+
+ nfs_wait_contested(wbc->sync_mode == WB_SYNC_ALL,
+ &nfss->backing_dev_info,
+ nfss->writeback_wait);
+
return ret;
}
--- linux-next.orig/include/linux/nfs_fs_sb.h 2011-03-03 14:03:38.000000000 +0800
+++ linux-next/include/linux/nfs_fs_sb.h 2011-03-03 14:03:40.000000000 +0800
@@ -102,6 +102,7 @@ struct nfs_server {
struct nfs_iostats __percpu *io_stats; /* I/O statistics */
struct backing_dev_info backing_dev_info;
atomic_long_t writeback; /* number of writeback pages */
+ wait_queue_head_t writeback_wait[2];
int flags; /* various flags */
unsigned int caps; /* server capabilities */
unsigned int rsize; /* read size */
--- linux-next.orig/fs/nfs/client.c 2011-03-03 14:03:38.000000000 +0800
+++ linux-next/fs/nfs/client.c 2011-03-03 14:03:40.000000000 +0800
@@ -1042,6 +1042,8 @@ static struct nfs_server *nfs_alloc_serv
INIT_LIST_HEAD(&server->delegations);
atomic_set(&server->active, 0);
+ init_waitqueue_head(&server->writeback_wait[BLK_RW_SYNC]);
+ init_waitqueue_head(&server->writeback_wait[BLK_RW_ASYNC]);
server->io_stats = nfs_alloc_iostats();
if (!server->io_stats) {
--
* Re: [PATCH 09/27] nfs: writeback pages wait queue
2011-03-03 6:45 ` [PATCH 09/27] nfs: writeback pages wait queue Wu Fengguang
@ 2011-03-03 16:07 ` Peter Zijlstra
2011-03-04 1:53 ` Wu Fengguang
2011-03-03 16:08 ` Peter Zijlstra
1 sibling, 1 reply; 44+ messages in thread
From: Peter Zijlstra @ 2011-03-03 16:07 UTC (permalink / raw)
To: Wu Fengguang
Cc: Andrew Morton, Jan Kara, Jens Axboe, Chris Mason, Trond Myklebust,
Christoph Hellwig, Dave Chinner, Theodore Ts'o, Mel Gorman,
Rik van Riel, KOSAKI Motohiro, Greg Thelen, Minchan Kim,
Vivek Goyal, Andrea Righi, Balbir Singh, linux-mm, linux-fsdevel,
LKML
On Thu, 2011-03-03 at 14:45 +0800, Wu Fengguang wrote:
> +static void nfs_wait_contested(int is_sync,
> + struct backing_dev_info *bdi,
> + wait_queue_head_t *wqh)
s/contested/congested/ ?
* Re: [PATCH 09/27] nfs: writeback pages wait queue
2011-03-03 16:07 ` Peter Zijlstra
@ 2011-03-04 1:53 ` Wu Fengguang
0 siblings, 0 replies; 44+ messages in thread
From: Wu Fengguang @ 2011-03-04 1:53 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Andrew Morton, Jan Kara, Jens Axboe, Chris Mason, Trond Myklebust,
Christoph Hellwig, Dave Chinner, Theodore Ts'o, Mel Gorman,
Rik van Riel, KOSAKI Motohiro, Greg Thelen, Minchan Kim,
Vivek Goyal, Andrea Righi, Balbir Singh, linux-mm,
linux-fsdevel@vger.kernel.org, LKML
On Fri, Mar 04, 2011 at 12:07:00AM +0800, Peter Zijlstra wrote:
> On Thu, 2011-03-03 at 14:45 +0800, Wu Fengguang wrote:
> > +static void nfs_wait_contested(int is_sync,
> > + struct backing_dev_info *bdi,
> > + wait_queue_head_t *wqh)
>
> s/contested/congested/ ?
Good catch. Will update in another email.
Thanks,
Fengguang
* Re: [PATCH 09/27] nfs: writeback pages wait queue
2011-03-03 6:45 ` [PATCH 09/27] nfs: writeback pages wait queue Wu Fengguang
2011-03-03 16:07 ` Peter Zijlstra
@ 2011-03-03 16:08 ` Peter Zijlstra
2011-03-04 2:01 ` Wu Fengguang
1 sibling, 1 reply; 44+ messages in thread
From: Peter Zijlstra @ 2011-03-03 16:08 UTC (permalink / raw)
To: Wu Fengguang
Cc: Andrew Morton, Jan Kara, Jens Axboe, Chris Mason, Trond Myklebust,
Christoph Hellwig, Dave Chinner, Theodore Ts'o, Mel Gorman,
Rik van Riel, KOSAKI Motohiro, Greg Thelen, Minchan Kim,
Vivek Goyal, Andrea Righi, Balbir Singh, linux-mm, linux-fsdevel,
LKML
On Thu, 2011-03-03 at 14:45 +0800, Wu Fengguang wrote:
> +static void nfs_wakeup_congested(long nr,
> + struct backing_dev_info *bdi,
> + wait_queue_head_t *wqh)
> +{
> + long limit = nfs_congestion_kb >> (PAGE_SHIFT - 10);
> +
> + if (nr < 2 * limit - min(limit / 8, NFS_WAIT_PAGES)) {
> + if (test_bit(BDI_sync_congested, &bdi->state)) {
> + clear_bdi_congested(bdi, BLK_RW_SYNC);
> + smp_mb__after_clear_bit();
> + }
> + if (waitqueue_active(&wqh[BLK_RW_SYNC]))
> + wake_up(&wqh[BLK_RW_SYNC]);
> + }
> + if (nr < limit - min(limit / 8, NFS_WAIT_PAGES)) {
> + if (test_bit(BDI_async_congested, &bdi->state)) {
> + clear_bdi_congested(bdi, BLK_RW_ASYNC);
> + smp_mb__after_clear_bit();
> + }
> + if (waitqueue_active(&wqh[BLK_RW_ASYNC]))
> + wake_up(&wqh[BLK_RW_ASYNC]);
> + }
> +}
memory barriers want a comment - always - explaining what they order and
against whom.
* Re: [PATCH 09/27] nfs: writeback pages wait queue
2011-03-03 16:08 ` Peter Zijlstra
@ 2011-03-04 2:01 ` Wu Fengguang
2011-03-04 9:10 ` Peter Zijlstra
0 siblings, 1 reply; 44+ messages in thread
From: Wu Fengguang @ 2011-03-04 2:01 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Andrew Morton, Jan Kara, Jens Axboe, Chris Mason, Trond Myklebust,
Christoph Hellwig, Dave Chinner, Theodore Ts'o, Mel Gorman,
Rik van Riel, KOSAKI Motohiro, Greg Thelen, Minchan Kim,
Vivek Goyal, Andrea Righi, Balbir Singh, linux-mm,
linux-fsdevel@vger.kernel.org, LKML
On Fri, Mar 04, 2011 at 12:08:01AM +0800, Peter Zijlstra wrote:
> On Thu, 2011-03-03 at 14:45 +0800, Wu Fengguang wrote:
> > +static void nfs_wakeup_congested(long nr,
> > + struct backing_dev_info *bdi,
> > + wait_queue_head_t *wqh)
> > +{
> > + long limit = nfs_congestion_kb >> (PAGE_SHIFT - 10);
> > +
> > + if (nr < 2 * limit - min(limit / 8, NFS_WAIT_PAGES)) {
> > + if (test_bit(BDI_sync_congested, &bdi->state)) {
> > + clear_bdi_congested(bdi, BLK_RW_SYNC);
> > + smp_mb__after_clear_bit();
> > + }
> > + if (waitqueue_active(&wqh[BLK_RW_SYNC]))
> > + wake_up(&wqh[BLK_RW_SYNC]);
> > + }
> > + if (nr < limit - min(limit / 8, NFS_WAIT_PAGES)) {
> > + if (test_bit(BDI_async_congested, &bdi->state)) {
> > + clear_bdi_congested(bdi, BLK_RW_ASYNC);
> > + smp_mb__after_clear_bit();
> > + }
> > + if (waitqueue_active(&wqh[BLK_RW_ASYNC]))
> > + wake_up(&wqh[BLK_RW_ASYNC]);
> > + }
> > +}
>
> memory barriers want a comment - always - explaining what they order and
> against whoem.
OK. Added this comment:
clear_bdi_congested(bdi, BLK_RW_SYNC);
/*
* On the following wake_up(), nfs_wait_congested()
* will see the cleared bit and quit.
*/
smp_mb__after_clear_bit();
}
if (waitqueue_active(&wqh[BLK_RW_SYNC]))
wake_up(&wqh[BLK_RW_SYNC]);
Thanks,
Fengguang
---
Subject: nfs: writeback pages wait queue
Date: Tue Aug 03 22:47:07 CST 2010
The generic writeback routines are moving away from congestion_wait()
in favor of get_request_wait(), i.e. waiting on the block request queues.
Introduce the missing writeback wait queue for NFS; otherwise its
writeback pages will grow out of control, exhausting all PG_dirty pages.
CC: Jens Axboe <axboe@kernel.dk>
CC: Chris Mason <chris.mason@oracle.com>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
CC: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
fs/nfs/client.c | 2
fs/nfs/write.c | 97 +++++++++++++++++++++++++++++++-----
include/linux/nfs_fs_sb.h | 1
3 files changed, 89 insertions(+), 11 deletions(-)
--- linux-next.orig/fs/nfs/write.c 2011-03-03 14:44:16.000000000 +0800
+++ linux-next/fs/nfs/write.c 2011-03-04 09:58:38.000000000 +0800
@@ -185,11 +185,72 @@ static int wb_priority(struct writeback_
* NFS congestion control
*/
+#define NFS_WAIT_PAGES (1024L >> (PAGE_SHIFT - 10))
int nfs_congestion_kb;
-#define NFS_CONGESTION_ON_THRESH (nfs_congestion_kb >> (PAGE_SHIFT-10))
-#define NFS_CONGESTION_OFF_THRESH \
- (NFS_CONGESTION_ON_THRESH - (NFS_CONGESTION_ON_THRESH >> 2))
+/*
+ * SYNC requests will block on (2*limit) and wakeup on (2*limit-NFS_WAIT_PAGES)
+ * ASYNC requests will block on (limit) and wakeup on (limit - NFS_WAIT_PAGES)
+ * In this way SYNC writes will never be blocked by ASYNC ones.
+ */
+
+static void nfs_set_congested(long nr, struct backing_dev_info *bdi)
+{
+ long limit = nfs_congestion_kb >> (PAGE_SHIFT - 10);
+
+ if (nr > limit && !test_bit(BDI_async_congested, &bdi->state))
+ set_bdi_congested(bdi, BLK_RW_ASYNC);
+ else if (nr > 2 * limit && !test_bit(BDI_sync_congested, &bdi->state))
+ set_bdi_congested(bdi, BLK_RW_SYNC);
+}
+
+static void nfs_wait_congested(int is_sync,
+ struct backing_dev_info *bdi,
+ wait_queue_head_t *wqh)
+{
+ int waitbit = is_sync ? BDI_sync_congested : BDI_async_congested;
+ DEFINE_WAIT(wait);
+
+ if (!test_bit(waitbit, &bdi->state))
+ return;
+
+ for (;;) {
+ prepare_to_wait(&wqh[is_sync], &wait, TASK_UNINTERRUPTIBLE);
+ if (!test_bit(waitbit, &bdi->state))
+ break;
+
+ io_schedule();
+ }
+ finish_wait(&wqh[is_sync], &wait);
+}
+
+static void nfs_wakeup_congested(long nr,
+ struct backing_dev_info *bdi,
+ wait_queue_head_t *wqh)
+{
+ long limit = nfs_congestion_kb >> (PAGE_SHIFT - 10);
+
+ if (nr < 2 * limit - min(limit / 8, NFS_WAIT_PAGES)) {
+ if (test_bit(BDI_sync_congested, &bdi->state)) {
+ clear_bdi_congested(bdi, BLK_RW_SYNC);
+ /*
+ * On the following wake_up(), nfs_wait_congested()
+ * will see the cleared bit and quit.
+ */
+ smp_mb__after_clear_bit();
+ }
+ if (waitqueue_active(&wqh[BLK_RW_SYNC]))
+ wake_up(&wqh[BLK_RW_SYNC]);
+ }
+ if (nr < limit - min(limit / 8, NFS_WAIT_PAGES)) {
+ if (test_bit(BDI_async_congested, &bdi->state)) {
+ clear_bdi_congested(bdi, BLK_RW_ASYNC);
+ smp_mb__after_clear_bit();
+ }
+ if (waitqueue_active(&wqh[BLK_RW_ASYNC]))
+ wake_up(&wqh[BLK_RW_ASYNC]);
+ }
+}
static int nfs_set_page_writeback(struct page *page)
{
@@ -200,11 +261,8 @@ static int nfs_set_page_writeback(struct
struct nfs_server *nfss = NFS_SERVER(inode);
page_cache_get(page);
- if (atomic_long_inc_return(&nfss->writeback) >
- NFS_CONGESTION_ON_THRESH) {
- set_bdi_congested(&nfss->backing_dev_info,
- BLK_RW_ASYNC);
- }
+ nfs_set_congested(atomic_long_inc_return(&nfss->writeback),
+ &nfss->backing_dev_info);
}
return ret;
}
@@ -216,8 +274,10 @@ static void nfs_end_page_writeback(struc
end_page_writeback(page);
page_cache_release(page);
- if (atomic_long_dec_return(&nfss->writeback) < NFS_CONGESTION_OFF_THRESH)
- clear_bdi_congested(&nfss->backing_dev_info, BLK_RW_ASYNC);
+
+ nfs_wakeup_congested(atomic_long_dec_return(&nfss->writeback),
+ &nfss->backing_dev_info,
+ nfss->writeback_wait);
}
static struct nfs_page *nfs_find_and_lock_request(struct page *page, bool nonblock)
@@ -318,19 +378,34 @@ static int nfs_writepage_locked(struct p
int nfs_writepage(struct page *page, struct writeback_control *wbc)
{
+ struct inode *inode = page->mapping->host;
+ struct nfs_server *nfss = NFS_SERVER(inode);
int ret;
ret = nfs_writepage_locked(page, wbc);
unlock_page(page);
+
+ nfs_wait_congested(wbc->sync_mode == WB_SYNC_ALL,
+ &nfss->backing_dev_info,
+ nfss->writeback_wait);
+
return ret;
}
-static int nfs_writepages_callback(struct page *page, struct writeback_control *wbc, void *data)
+static int nfs_writepages_callback(struct page *page,
+ struct writeback_control *wbc, void *data)
{
+ struct inode *inode = page->mapping->host;
+ struct nfs_server *nfss = NFS_SERVER(inode);
int ret;
ret = nfs_do_writepage(page, wbc, data);
unlock_page(page);
+
+ nfs_wait_congested(wbc->sync_mode == WB_SYNC_ALL,
+ &nfss->backing_dev_info,
+ nfss->writeback_wait);
+
return ret;
}
--- linux-next.orig/include/linux/nfs_fs_sb.h 2011-03-03 14:44:15.000000000 +0800
+++ linux-next/include/linux/nfs_fs_sb.h 2011-03-03 14:44:16.000000000 +0800
@@ -102,6 +102,7 @@ struct nfs_server {
struct nfs_iostats __percpu *io_stats; /* I/O statistics */
struct backing_dev_info backing_dev_info;
atomic_long_t writeback; /* number of writeback pages */
+ wait_queue_head_t writeback_wait[2];
int flags; /* various flags */
unsigned int caps; /* server capabilities */
unsigned int rsize; /* read size */
--- linux-next.orig/fs/nfs/client.c 2011-03-03 14:44:15.000000000 +0800
+++ linux-next/fs/nfs/client.c 2011-03-03 14:44:16.000000000 +0800
@@ -1042,6 +1042,8 @@ static struct nfs_server *nfs_alloc_serv
INIT_LIST_HEAD(&server->delegations);
atomic_set(&server->active, 0);
+ init_waitqueue_head(&server->writeback_wait[BLK_RW_SYNC]);
+ init_waitqueue_head(&server->writeback_wait[BLK_RW_ASYNC]);
server->io_stats = nfs_alloc_iostats();
if (!server->io_stats) {
* Re: [PATCH 09/27] nfs: writeback pages wait queue
2011-03-04 2:01 ` Wu Fengguang
@ 2011-03-04 9:10 ` Peter Zijlstra
2011-03-04 9:26 ` Peter Zijlstra
0 siblings, 1 reply; 44+ messages in thread
From: Peter Zijlstra @ 2011-03-04 9:10 UTC (permalink / raw)
To: Wu Fengguang
Cc: Andrew Morton, Jan Kara, Jens Axboe, Chris Mason, Trond Myklebust,
Christoph Hellwig, Dave Chinner, Theodore Ts'o, Mel Gorman,
Rik van Riel, KOSAKI Motohiro, Greg Thelen, Minchan Kim,
Vivek Goyal, Andrea Righi, Balbir Singh, linux-mm,
linux-fsdevel@vger.kernel.org, LKML
On Fri, 2011-03-04 at 10:01 +0800, Wu Fengguang wrote:
> clear_bdi_congested(bdi, BLK_RW_SYNC);
> /*
> * On the following wake_up(), nfs_wait_congested()
> * will see the cleared bit and quit.
> */
> smp_mb__after_clear_bit();
> }
> if (waitqueue_active(&wqh[BLK_RW_SYNC]))
> wake_up(&wqh[BLK_RW_SYNC]);
If I tell you that: try_to_wake_up() implies an smp_wmb(), do you then
still need this?
* Re: [PATCH 09/27] nfs: writeback pages wait queue
2011-03-04 9:10 ` Peter Zijlstra
@ 2011-03-04 9:26 ` Peter Zijlstra
2011-03-04 14:38 ` Wu Fengguang
0 siblings, 1 reply; 44+ messages in thread
From: Peter Zijlstra @ 2011-03-04 9:26 UTC (permalink / raw)
To: Wu Fengguang
Cc: Andrew Morton, Jan Kara, Jens Axboe, Chris Mason, Trond Myklebust,
Christoph Hellwig, Dave Chinner, Theodore Ts'o, Mel Gorman,
Rik van Riel, KOSAKI Motohiro, Greg Thelen, Minchan Kim,
Vivek Goyal, Andrea Righi, Balbir Singh, linux-mm,
linux-fsdevel@vger.kernel.org, LKML
On Fri, 2011-03-04 at 10:10 +0100, Peter Zijlstra wrote:
> On Fri, 2011-03-04 at 10:01 +0800, Wu Fengguang wrote:
> > clear_bdi_congested(bdi, BLK_RW_SYNC);
> > /*
> > * On the following wake_up(), nfs_wait_congested()
> > * will see the cleared bit and quit.
> > */
> > smp_mb__after_clear_bit();
> > }
> > if (waitqueue_active(&wqh[BLK_RW_SYNC]))
> > wake_up(&wqh[BLK_RW_SYNC]);
>
> If I tell you that: try_to_wake_up() implies an smp_wmb(), do you then
> still need this?
Also, there is no matching rmb/mb in nfs_wait_congested(); barriers
always come in pairs.
* Re: [PATCH 09/27] nfs: writeback pages wait queue
2011-03-04 9:26 ` Peter Zijlstra
@ 2011-03-04 14:38 ` Wu Fengguang
2011-03-04 14:41 ` Peter Zijlstra
0 siblings, 1 reply; 44+ messages in thread
From: Wu Fengguang @ 2011-03-04 14:38 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Andrew Morton, Jan Kara, Jens Axboe, Chris Mason, Trond Myklebust,
Christoph Hellwig, Dave Chinner, Theodore Ts'o, Mel Gorman,
Rik van Riel, KOSAKI Motohiro, Greg Thelen, Minchan Kim,
Vivek Goyal, Andrea Righi, Balbir Singh, linux-mm,
linux-fsdevel@vger.kernel.org, LKML
On Fri, Mar 04, 2011 at 05:26:35PM +0800, Peter Zijlstra wrote:
> On Fri, 2011-03-04 at 10:10 +0100, Peter Zijlstra wrote:
> > On Fri, 2011-03-04 at 10:01 +0800, Wu Fengguang wrote:
> > > clear_bdi_congested(bdi, BLK_RW_SYNC);
> > > /*
> > > * On the following wake_up(), nfs_wait_congested()
> > > * will see the cleared bit and quit.
> > > */
> > > smp_mb__after_clear_bit();
> > > }
> > > if (waitqueue_active(&wqh[BLK_RW_SYNC]))
> > > wake_up(&wqh[BLK_RW_SYNC]);
> >
> > If I tell you that: try_to_wake_up() implies an smp_wmb(), do you then
> > still need this?
>
> Also, there is no matching rmb,mb in nfs_wait_congested().. barrier
> always come in pairs.
Sorry for my ignorance of memory barriers. Looking at the
documentation, I noticed that prepare_to_wait(), by calling
set_current_state(), inserts a general memory barrier after storing
current->state, and, as you said, try_to_wake_up() inserts a write
barrier before storing the task's state. So the bit change done before
the wakeup is guaranteed to be observed by the woken-up task, with no
need for extra memory barriers.
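A rough sketch of that ordering argument (just an illustration of my
reading, not a formal proof):

	waker:   clear_bdi_congested(bdi, ...);   /* 1. clear the congested bit      */
	         wake_up(&wqh[...]);              /* 2. try_to_wake_up() implies wmb */

	waiter:  prepare_to_wait(...);            /* A. queue itself, set state (mb) */
	         if (!test_bit(...))              /* B. re-check the bit             */
	                 break;
	         io_schedule();                   /* C. sleep                        */

Whatever the interleaving, either B observes the bit cleared in 1, or 2
finds the waiter already queued by A and wakes it, so the wakeup cannot
be lost.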
The below patch removes the unnecessary smp_mb__after_clear_bit().
Thanks,
Fengguang
---
Subject: nfs: writeback pages wait queue
Date: Tue Aug 03 22:47:07 CST 2010
The generic writeback routines are moving away from congestion_wait()
in favor of get_request_wait(), i.e. waiting on the block request queues.
Introduce the missing writeback wait queue for NFS; otherwise its
writeback pages will grow out of control, exhausting all PG_dirty pages.
CC: Jens Axboe <axboe@kernel.dk>
CC: Chris Mason <chris.mason@oracle.com>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
CC: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
fs/nfs/client.c | 2
fs/nfs/write.c | 89 +++++++++++++++++++++++++++++++-----
include/linux/nfs_fs_sb.h | 1
3 files changed, 81 insertions(+), 11 deletions(-)
--- linux-next.orig/fs/nfs/write.c 2011-03-03 14:44:16.000000000 +0800
+++ linux-next/fs/nfs/write.c 2011-03-04 22:28:21.000000000 +0800
@@ -185,11 +185,64 @@ static int wb_priority(struct writeback_
* NFS congestion control
*/
+#define NFS_WAIT_PAGES (1024L >> (PAGE_SHIFT - 10))
int nfs_congestion_kb;
-#define NFS_CONGESTION_ON_THRESH (nfs_congestion_kb >> (PAGE_SHIFT-10))
-#define NFS_CONGESTION_OFF_THRESH \
- (NFS_CONGESTION_ON_THRESH - (NFS_CONGESTION_ON_THRESH >> 2))
+/*
+ * SYNC requests will block on (2*limit) and wakeup on (2*limit-NFS_WAIT_PAGES)
+ * ASYNC requests will block on (limit) and wakeup on (limit - NFS_WAIT_PAGES)
+ * In this way SYNC writes will never be blocked by ASYNC ones.
+ */
+
+static void nfs_set_congested(long nr, struct backing_dev_info *bdi)
+{
+ long limit = nfs_congestion_kb >> (PAGE_SHIFT - 10);
+
+ if (nr > limit && !test_bit(BDI_async_congested, &bdi->state))
+ set_bdi_congested(bdi, BLK_RW_ASYNC);
+ else if (nr > 2 * limit && !test_bit(BDI_sync_congested, &bdi->state))
+ set_bdi_congested(bdi, BLK_RW_SYNC);
+}
+
+static void nfs_wait_congested(int is_sync,
+ struct backing_dev_info *bdi,
+ wait_queue_head_t *wqh)
+{
+ int waitbit = is_sync ? BDI_sync_congested : BDI_async_congested;
+ DEFINE_WAIT(wait);
+
+ if (!test_bit(waitbit, &bdi->state))
+ return;
+
+ for (;;) {
+ prepare_to_wait(&wqh[is_sync], &wait, TASK_UNINTERRUPTIBLE);
+ if (!test_bit(waitbit, &bdi->state))
+ break;
+
+ io_schedule();
+ }
+ finish_wait(&wqh[is_sync], &wait);
+}
+
+static void nfs_wakeup_congested(long nr,
+ struct backing_dev_info *bdi,
+ wait_queue_head_t *wqh)
+{
+ long limit = nfs_congestion_kb >> (PAGE_SHIFT - 10);
+
+ if (nr < 2 * limit - min(limit / 8, NFS_WAIT_PAGES)) {
+ if (test_bit(BDI_sync_congested, &bdi->state))
+ clear_bdi_congested(bdi, BLK_RW_SYNC);
+ if (waitqueue_active(&wqh[BLK_RW_SYNC]))
+ wake_up(&wqh[BLK_RW_SYNC]);
+ }
+ if (nr < limit - min(limit / 8, NFS_WAIT_PAGES)) {
+ if (test_bit(BDI_async_congested, &bdi->state))
+ clear_bdi_congested(bdi, BLK_RW_ASYNC);
+ if (waitqueue_active(&wqh[BLK_RW_ASYNC]))
+ wake_up(&wqh[BLK_RW_ASYNC]);
+ }
+}
static int nfs_set_page_writeback(struct page *page)
{
@@ -200,11 +253,8 @@ static int nfs_set_page_writeback(struct
struct nfs_server *nfss = NFS_SERVER(inode);
page_cache_get(page);
- if (atomic_long_inc_return(&nfss->writeback) >
- NFS_CONGESTION_ON_THRESH) {
- set_bdi_congested(&nfss->backing_dev_info,
- BLK_RW_ASYNC);
- }
+ nfs_set_congested(atomic_long_inc_return(&nfss->writeback),
+ &nfss->backing_dev_info);
}
return ret;
}
@@ -216,8 +266,10 @@ static void nfs_end_page_writeback(struc
end_page_writeback(page);
page_cache_release(page);
- if (atomic_long_dec_return(&nfss->writeback) < NFS_CONGESTION_OFF_THRESH)
- clear_bdi_congested(&nfss->backing_dev_info, BLK_RW_ASYNC);
+
+ nfs_wakeup_congested(atomic_long_dec_return(&nfss->writeback),
+ &nfss->backing_dev_info,
+ nfss->writeback_wait);
}
static struct nfs_page *nfs_find_and_lock_request(struct page *page, bool nonblock)
@@ -318,19 +370,34 @@ static int nfs_writepage_locked(struct p
int nfs_writepage(struct page *page, struct writeback_control *wbc)
{
+ struct inode *inode = page->mapping->host;
+ struct nfs_server *nfss = NFS_SERVER(inode);
int ret;
ret = nfs_writepage_locked(page, wbc);
unlock_page(page);
+
+ nfs_wait_congested(wbc->sync_mode == WB_SYNC_ALL,
+ &nfss->backing_dev_info,
+ nfss->writeback_wait);
+
return ret;
}
-static int nfs_writepages_callback(struct page *page, struct writeback_control *wbc, void *data)
+static int nfs_writepages_callback(struct page *page,
+ struct writeback_control *wbc, void *data)
{
+ struct inode *inode = page->mapping->host;
+ struct nfs_server *nfss = NFS_SERVER(inode);
int ret;
ret = nfs_do_writepage(page, wbc, data);
unlock_page(page);
+
+ nfs_wait_congested(wbc->sync_mode == WB_SYNC_ALL,
+ &nfss->backing_dev_info,
+ nfss->writeback_wait);
+
return ret;
}
--- linux-next.orig/include/linux/nfs_fs_sb.h 2011-03-03 14:44:15.000000000 +0800
+++ linux-next/include/linux/nfs_fs_sb.h 2011-03-03 14:44:16.000000000 +0800
@@ -102,6 +102,7 @@ struct nfs_server {
struct nfs_iostats __percpu *io_stats; /* I/O statistics */
struct backing_dev_info backing_dev_info;
atomic_long_t writeback; /* number of writeback pages */
+ wait_queue_head_t writeback_wait[2];
int flags; /* various flags */
unsigned int caps; /* server capabilities */
unsigned int rsize; /* read size */
--- linux-next.orig/fs/nfs/client.c 2011-03-03 14:44:15.000000000 +0800
+++ linux-next/fs/nfs/client.c 2011-03-03 14:44:16.000000000 +0800
@@ -1042,6 +1042,8 @@ static struct nfs_server *nfs_alloc_serv
INIT_LIST_HEAD(&server->delegations);
atomic_set(&server->active, 0);
+ init_waitqueue_head(&server->writeback_wait[BLK_RW_SYNC]);
+ init_waitqueue_head(&server->writeback_wait[BLK_RW_ASYNC]);
server->io_stats = nfs_alloc_iostats();
if (!server->io_stats) {
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH 09/27] nfs: writeback pages wait queue
2011-03-04 14:38 ` Wu Fengguang
@ 2011-03-04 14:41 ` Peter Zijlstra
0 siblings, 0 replies; 44+ messages in thread
From: Peter Zijlstra @ 2011-03-04 14:41 UTC (permalink / raw)
To: Wu Fengguang
Cc: Andrew Morton, Jan Kara, Jens Axboe, Chris Mason, Trond Myklebust,
Christoph Hellwig, Dave Chinner, Theodore Ts'o, Mel Gorman,
Rik van Riel, KOSAKI Motohiro, Greg Thelen, Minchan Kim,
Vivek Goyal, Andrea Righi, Balbir Singh, linux-mm,
linux-fsdevel@vger.kernel.org, LKML
On Fri, 2011-03-04 at 22:38 +0800, Wu Fengguang wrote:
> Sorry for my ignorance of memory barriers..
n/p, everybody who encounters them seems to be a bit confused at first
and a lot more confused later :-)
> The below patch removes the unnecessary smp_mb__after_clear_bit().
OK, thanks!
--
^ permalink raw reply [flat|nested] 44+ messages in thread
* [PATCH 10/27] nfs: limit the commit size to reduce fluctuations
2011-03-03 6:45 [PATCH 00/27] IO-less dirty throttling v6 Wu Fengguang
` (8 preceding siblings ...)
2011-03-03 6:45 ` [PATCH 09/27] nfs: writeback pages wait queue Wu Fengguang
@ 2011-03-03 6:45 ` Wu Fengguang
2011-03-03 6:45 ` [PATCH 11/27] nfs: limit the commit range Wu Fengguang
` (17 subsequent siblings)
27 siblings, 0 replies; 44+ messages in thread
From: Wu Fengguang @ 2011-03-03 6:45 UTC (permalink / raw)
To: Andrew Morton
Cc: Jan Kara, Trond Myklebust, Wu Fengguang, Christoph Hellwig,
Dave Chinner, Theodore Ts'o, Chris Mason, Peter Zijlstra,
Mel Gorman, Rik van Riel, KOSAKI Motohiro, Greg Thelen,
Minchan Kim, Vivek Goyal, Andrea Righi, Balbir Singh, linux-mm,
linux-fsdevel, LKML
[-- Attachment #1: nfs-more-smooth-commit.patch --]
[-- Type: text/plain, Size: 1712 bytes --]
Limit the commit size to 1/8 of the dirty control scope, so that the
arrival of one commit will not knock the overall dirty pages out of the
scope.
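A quick userspace sketch of how the new trigger compares with the old
one (the npages/ncommit numbers below are invented, purely for
illustration):

#include <stdio.h>

int main(void)
{
        unsigned long npages = 32768;   /* pages known to the NFS inode */
        unsigned long ncommit;

        for (ncommit = 256; ncommit <= 4096; ncommit *= 2)
                printf("ncommit=%4lu: old rule %s, new rule %s\n", ncommit,
                       ncommit <= (npages >> 1) ? "defers" : "commits",
                       ncommit <= npages / 32   ? "defers" : "commits");
        return 0;
}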
CC: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
fs/nfs/write.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
After the patch, there are still drop-offs from the control scope,
http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v6/NFS/nfs-1dd-1M-8p-2945M-20%25-2.6.38-rc6-dt6+-2011-02-22-21-09/balance_dirty_pages-pages.png
due to bursty arrival of commits:
http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v6/NFS/nfs-1dd-1M-8p-2945M-20%25-2.6.38-rc6-dt6+-2011-02-22-21-09/nfs-commit.png
--- linux-next.orig/fs/nfs/write.c 2011-03-02 20:39:01.000000000 +0800
+++ linux-next/fs/nfs/write.c 2011-03-02 20:39:01.000000000 +0800
@@ -1492,9 +1492,10 @@ static int nfs_commit_unstable_pages(str
if (wbc->sync_mode == WB_SYNC_NONE) {
/* Don't commit yet if this is a non-blocking flush and there
- * are a lot of outstanding writes for this mapping.
+ * are a lot of outstanding writes for this mapping, until
+ * enough pages have been collected to commit.
*/
- if (nfsi->ncommit <= (nfsi->npages >> 1))
+ if (nfsi->ncommit <= nfsi->npages / 32 /* DIRTY_MARGIN */)
goto out_mark_dirty;
/* don't wait for the COMMIT response */
--
^ permalink raw reply [flat|nested] 44+ messages in thread
* [PATCH 11/27] nfs: limit the commit range
2011-03-03 6:45 [PATCH 00/27] IO-less dirty throttling v6 Wu Fengguang
` (9 preceding siblings ...)
2011-03-03 6:45 ` [PATCH 10/27] nfs: limit the commit size to reduce fluctuations Wu Fengguang
@ 2011-03-03 6:45 ` Wu Fengguang
2011-03-03 6:45 ` [PATCH 12/27] nfs: lower writeback threshold proportionally to dirty threshold Wu Fengguang
` (16 subsequent siblings)
27 siblings, 0 replies; 44+ messages in thread
From: Wu Fengguang @ 2011-03-03 6:45 UTC (permalink / raw)
To: Andrew Morton
Cc: Jan Kara, Trond Myklebust, Wu Fengguang, Christoph Hellwig,
Dave Chinner, Theodore Ts'o, Chris Mason, Peter Zijlstra,
Mel Gorman, Rik van Riel, KOSAKI Motohiro, Greg Thelen,
Minchan Kim, Vivek Goyal, Andrea Righi, Balbir Singh, linux-mm,
linux-fsdevel, LKML
[-- Attachment #1: nfs-commit-range.patch --]
[-- Type: text/plain, Size: 2970 bytes --]
Hopefully this will help limit the number of unstable pages synced at
one time, allow more timely return of the commit requests, and reduce
dirty throttle fluctuations.
CC: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
fs/nfs/write.c | 20 ++++++++++++++------
1 file changed, 14 insertions(+), 6 deletions(-)
--- linux-next.orig/fs/nfs/write.c 2010-12-25 10:13:34.000000000 +0800
+++ linux-next/fs/nfs/write.c 2010-12-25 10:13:35.000000000 +0800
@@ -1304,7 +1304,7 @@ static void nfs_commitdata_release(void
*/
static int nfs_commit_rpcsetup(struct list_head *head,
struct nfs_write_data *data,
- int how)
+ int how, pgoff_t offset, pgoff_t count)
{
struct nfs_page *first = nfs_list_entry(head->next);
struct inode *inode = first->wb_context->path.dentry->d_inode;
@@ -1336,8 +1336,8 @@ static int nfs_commit_rpcsetup(struct li
data->args.fh = NFS_FH(data->inode);
/* Note: we always request a commit of the entire inode */
- data->args.offset = 0;
- data->args.count = 0;
+ data->args.offset = offset;
+ data->args.count = count;
data->args.context = get_nfs_open_context(first->wb_context);
data->res.count = 0;
data->res.fattr = &data->fattr;
@@ -1360,7 +1360,8 @@ static int nfs_commit_rpcsetup(struct li
* Commit dirty pages
*/
static int
-nfs_commit_list(struct inode *inode, struct list_head *head, int how)
+nfs_commit_list(struct inode *inode, struct list_head *head, int how,
+ pgoff_t offset, pgoff_t count)
{
struct nfs_write_data *data;
struct nfs_page *req;
@@ -1371,7 +1372,7 @@ nfs_commit_list(struct inode *inode, str
goto out_bad;
/* Set up the argument struct */
- return nfs_commit_rpcsetup(head, data, how);
+ return nfs_commit_rpcsetup(head, data, how, offset, count);
out_bad:
while (!list_empty(head)) {
req = nfs_list_entry(head->next);
@@ -1453,6 +1454,8 @@ static const struct rpc_call_ops nfs_com
int nfs_commit_inode(struct inode *inode, int how)
{
LIST_HEAD(head);
+ pgoff_t first_index;
+ pgoff_t last_index;
int may_wait = how & FLUSH_SYNC;
int res = 0;
@@ -1460,9 +1463,14 @@ int nfs_commit_inode(struct inode *inode
goto out_mark_dirty;
spin_lock(&inode->i_lock);
res = nfs_scan_commit(inode, &head, 0, 0);
+ if (res) {
+ first_index = nfs_list_entry(head.next)->wb_index;
+ last_index = nfs_list_entry(head.prev)->wb_index;
+ }
spin_unlock(&inode->i_lock);
if (res) {
- int error = nfs_commit_list(inode, &head, how);
+ int error = nfs_commit_list(inode, &head, how, first_index,
+ last_index - first_index + 1);
if (error < 0)
return error;
if (may_wait)
--
^ permalink raw reply [flat|nested] 44+ messages in thread
* [PATCH 12/27] nfs: lower writeback threshold proportionally to dirty threshold
2011-03-03 6:45 [PATCH 00/27] IO-less dirty throttling v6 Wu Fengguang
` (10 preceding siblings ...)
2011-03-03 6:45 ` [PATCH 11/27] nfs: limit the commit range Wu Fengguang
@ 2011-03-03 6:45 ` Wu Fengguang
2011-03-03 6:45 ` [PATCH 13/27] writeback: account per-bdi accumulated written pages Wu Fengguang
` (15 subsequent siblings)
27 siblings, 0 replies; 44+ messages in thread
From: Wu Fengguang @ 2011-03-03 6:45 UTC (permalink / raw)
To: Andrew Morton
Cc: Jan Kara, Trond Myklebust, Wu Fengguang, Christoph Hellwig,
Dave Chinner, Theodore Ts'o, Chris Mason, Peter Zijlstra,
Mel Gorman, Rik van Riel, KOSAKI Motohiro, Greg Thelen,
Minchan Kim, Vivek Goyal, Andrea Righi, Balbir Singh, linux-mm,
linux-fsdevel, LKML
[-- Attachment #1: nfs-congestion-thresh.patch --]
[-- Type: text/plain, Size: 2288 bytes --]
nfs_congestion_kb controls the maximum number of writeback and
in-commit pages allowed. It's not reasonable for them to outnumber the
dirty and to-commit pages, so each of them should take no more than 1/4
of the dirty threshold.
Considering that nfs_init_writepagecache() is called at fresh boot,
when dirty_thresh is much higher than the real dirty limit seen after
user space has consumed lots of memory, use 1/8 instead.
We could update nfs_congestion_kb when the global dirty limit is
changed at runtime, but keep it simple for now.
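The capping arithmetic, as a small userspace sketch (the dirty_thresh
value is an arbitrary example in pages, and 4K pages are assumed):

#include <stdio.h>

#define PAGE_SHIFT 12

int main(void)
{
        unsigned long dirty_thresh = 100000;            /* pages, example only */
        unsigned long nfs_congestion_kb = 256 * 1024;   /* the existing 256MB cap */

        dirty_thresh <<= PAGE_SHIFT - 10;               /* pages -> kB */
        if (nfs_congestion_kb > dirty_thresh / 8)
                nfs_congestion_kb = dirty_thresh / 8;

        printf("nfs_congestion_kb capped to %lu kB\n", nfs_congestion_kb);
        return 0;
}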
CC: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
fs/nfs/write.c | 13 +++++++++++++
mm/page-writeback.c | 1 +
2 files changed, 14 insertions(+)
--- linux-next.orig/fs/nfs/write.c 2011-03-03 14:04:01.000000000 +0800
+++ linux-next/fs/nfs/write.c 2011-03-03 14:04:01.000000000 +0800
@@ -1651,6 +1651,9 @@ out:
int __init nfs_init_writepagecache(void)
{
+ unsigned long background_thresh;
+ unsigned long dirty_thresh;
+
nfs_wdata_cachep = kmem_cache_create("nfs_write_data",
sizeof(struct nfs_write_data),
0, SLAB_HWCACHE_ALIGN,
@@ -1688,6 +1691,16 @@ int __init nfs_init_writepagecache(void)
if (nfs_congestion_kb > 256*1024)
nfs_congestion_kb = 256*1024;
+ /*
+ * Limit to 1/8 dirty threshold, so that writeback+in_commit pages
+ * won't outnumber dirty+to_commit pages.
+ */
+ global_dirty_limits(&background_thresh, &dirty_thresh);
+ dirty_thresh <<= PAGE_SHIFT - 10;
+
+ if (nfs_congestion_kb > dirty_thresh / 8)
+ nfs_congestion_kb = dirty_thresh / 8;
+
return 0;
}
--- linux-next.orig/mm/page-writeback.c 2011-03-03 14:04:01.000000000 +0800
+++ linux-next/mm/page-writeback.c 2011-03-03 14:04:01.000000000 +0800
@@ -431,6 +431,7 @@ void global_dirty_limits(unsigned long *
*pbackground = background;
*pdirty = dirty;
}
+EXPORT_SYMBOL_GPL(global_dirty_limits);
/**
* bdi_dirty_limit - @bdi's share of dirty throttling threshold
--
^ permalink raw reply [flat|nested] 44+ messages in thread
* [PATCH 13/27] writeback: account per-bdi accumulated written pages
2011-03-03 6:45 [PATCH 00/27] IO-less dirty throttling v6 Wu Fengguang
` (11 preceding siblings ...)
2011-03-03 6:45 ` [PATCH 12/27] nfs: lower writeback threshold proportionally to dirty threshold Wu Fengguang
@ 2011-03-03 6:45 ` Wu Fengguang
2011-03-03 6:45 ` [PATCH 14/27] writeback: account per-bdi accumulated dirtied pages Wu Fengguang
` (14 subsequent siblings)
27 siblings, 0 replies; 44+ messages in thread
From: Wu Fengguang @ 2011-03-03 6:45 UTC (permalink / raw)
To: Andrew Morton
Cc: Jan Kara, Michael Rubin, Wu Fengguang, Christoph Hellwig,
Trond Myklebust, Dave Chinner, Theodore Ts'o, Chris Mason,
Peter Zijlstra, Mel Gorman, Rik van Riel, KOSAKI Motohiro,
Greg Thelen, Minchan Kim, Vivek Goyal, Andrea Righi, Balbir Singh,
linux-mm, linux-fsdevel, LKML
[-- Attachment #1: writeback-bdi-written.patch --]
[-- Type: text/plain, Size: 2648 bytes --]
From: Jan Kara <jack@suse.cz>
Introduce the BDI_WRITTEN counter. It will be used for estimating the
bdi's write bandwidth.
Peter Zijlstra <a.p.zijlstra@chello.nl>:
Move BDI_WRITTEN accounting into __bdi_writeout_inc().
This will cover and fix fuse, which only calls bdi_writeout_inc().
CC: Michael Rubin <mrubin@google.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
include/linux/backing-dev.h | 1 +
mm/backing-dev.c | 6 ++++--
mm/page-writeback.c | 1 +
3 files changed, 6 insertions(+), 2 deletions(-)
--- linux-next.orig/include/linux/backing-dev.h 2011-03-03 14:03:37.000000000 +0800
+++ linux-next/include/linux/backing-dev.h 2011-03-03 14:04:06.000000000 +0800
@@ -40,6 +40,7 @@ typedef int (congested_fn)(void *, int);
enum bdi_stat_item {
BDI_RECLAIMABLE,
BDI_WRITEBACK,
+ BDI_WRITTEN,
NR_BDI_STAT_ITEMS
};
--- linux-next.orig/mm/backing-dev.c 2011-03-03 14:03:37.000000000 +0800
+++ linux-next/mm/backing-dev.c 2011-03-03 14:04:06.000000000 +0800
@@ -92,6 +92,7 @@ static int bdi_debug_stats_show(struct s
"BdiDirtyThresh: %8lu kB\n"
"DirtyThresh: %8lu kB\n"
"BackgroundThresh: %8lu kB\n"
+ "BdiWritten: %8lu kB\n"
"b_dirty: %8lu\n"
"b_io: %8lu\n"
"b_more_io: %8lu\n"
@@ -99,8 +100,9 @@ static int bdi_debug_stats_show(struct s
"state: %8lx\n",
(unsigned long) K(bdi_stat(bdi, BDI_WRITEBACK)),
(unsigned long) K(bdi_stat(bdi, BDI_RECLAIMABLE)),
- K(bdi_thresh), K(dirty_thresh),
- K(background_thresh), nr_dirty, nr_io, nr_more_io,
+ K(bdi_thresh), K(dirty_thresh), K(background_thresh),
+ (unsigned long) K(bdi_stat(bdi, BDI_WRITTEN)),
+ nr_dirty, nr_io, nr_more_io,
!list_empty(&bdi->bdi_list), bdi->state);
#undef K
--- linux-next.orig/mm/page-writeback.c 2011-03-03 14:04:01.000000000 +0800
+++ linux-next/mm/page-writeback.c 2011-03-03 14:04:06.000000000 +0800
@@ -219,6 +219,7 @@ int dirty_bytes_handler(struct ctl_table
*/
static inline void __bdi_writeout_inc(struct backing_dev_info *bdi)
{
+ __inc_bdi_stat(bdi, BDI_WRITTEN);
__prop_inc_percpu_max(&vm_completions, &bdi->completions,
bdi->max_prop_frac);
}
--
^ permalink raw reply [flat|nested] 44+ messages in thread
* [PATCH 14/27] writeback: account per-bdi accumulated dirtied pages
2011-03-03 6:45 [PATCH 00/27] IO-less dirty throttling v6 Wu Fengguang
` (12 preceding siblings ...)
2011-03-03 6:45 ` [PATCH 13/27] writeback: account per-bdi accumulated written pages Wu Fengguang
@ 2011-03-03 6:45 ` Wu Fengguang
2011-03-03 6:45 ` [PATCH 15/27] writeback: bdi write bandwidth estimation Wu Fengguang
` (13 subsequent siblings)
27 siblings, 0 replies; 44+ messages in thread
From: Wu Fengguang @ 2011-03-03 6:45 UTC (permalink / raw)
To: Andrew Morton
Cc: Jan Kara, Michael Rubin, Peter Zijlstra, Wu Fengguang,
Christoph Hellwig, Trond Myklebust, Dave Chinner,
Theodore Ts'o, Chris Mason, Mel Gorman, Rik van Riel,
KOSAKI Motohiro, Greg Thelen, Minchan Kim, Vivek Goyal,
Andrea Righi, Balbir Singh, linux-mm, linux-fsdevel, LKML
[-- Attachment #1: writeback-bdi-dirtied.patch --]
[-- Type: text/plain, Size: 2416 bytes --]
Introduce the BDI_DIRTIED counter. It will be used for estimating the
bdi's dirty bandwidth.
CC: Jan Kara <jack@suse.cz>
CC: Michael Rubin <mrubin@google.com>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
include/linux/backing-dev.h | 1 +
mm/backing-dev.c | 2 ++
mm/page-writeback.c | 1 +
3 files changed, 4 insertions(+)
--- linux-next.orig/include/linux/backing-dev.h 2011-03-03 14:44:16.000000000 +0800
+++ linux-next/include/linux/backing-dev.h 2011-03-03 14:44:18.000000000 +0800
@@ -40,6 +40,7 @@ typedef int (congested_fn)(void *, int);
enum bdi_stat_item {
BDI_RECLAIMABLE,
BDI_WRITEBACK,
+ BDI_DIRTIED,
BDI_WRITTEN,
NR_BDI_STAT_ITEMS
};
--- linux-next.orig/mm/page-writeback.c 2011-03-03 14:44:16.000000000 +0800
+++ linux-next/mm/page-writeback.c 2011-03-03 14:44:18.000000000 +0800
@@ -1127,6 +1127,7 @@ void account_page_dirtied(struct page *p
__inc_zone_page_state(page, NR_FILE_DIRTY);
__inc_zone_page_state(page, NR_DIRTIED);
__inc_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
+ __inc_bdi_stat(mapping->backing_dev_info, BDI_DIRTIED);
task_dirty_inc(current);
task_io_account_write(PAGE_CACHE_SIZE);
}
--- linux-next.orig/mm/backing-dev.c 2011-03-03 14:44:16.000000000 +0800
+++ linux-next/mm/backing-dev.c 2011-03-03 14:44:18.000000000 +0800
@@ -92,6 +92,7 @@ static int bdi_debug_stats_show(struct s
"BdiDirtyThresh: %8lu kB\n"
"DirtyThresh: %8lu kB\n"
"BackgroundThresh: %8lu kB\n"
+ "BdiDirtied: %8lu kB\n"
"BdiWritten: %8lu kB\n"
"b_dirty: %8lu\n"
"b_io: %8lu\n"
@@ -101,6 +102,7 @@ static int bdi_debug_stats_show(struct s
(unsigned long) K(bdi_stat(bdi, BDI_WRITEBACK)),
(unsigned long) K(bdi_stat(bdi, BDI_RECLAIMABLE)),
K(bdi_thresh), K(dirty_thresh), K(background_thresh),
+ (unsigned long) K(bdi_stat(bdi, BDI_DIRTIED)),
(unsigned long) K(bdi_stat(bdi, BDI_WRITTEN)),
nr_dirty, nr_io, nr_more_io,
!list_empty(&bdi->bdi_list), bdi->state);
--
^ permalink raw reply [flat|nested] 44+ messages in thread
* [PATCH 15/27] writeback: bdi write bandwidth estimation
2011-03-03 6:45 [PATCH 00/27] IO-less dirty throttling v6 Wu Fengguang
` (13 preceding siblings ...)
2011-03-03 6:45 ` [PATCH 14/27] writeback: account per-bdi accumulated dirtied pages Wu Fengguang
@ 2011-03-03 6:45 ` Wu Fengguang
2011-03-03 6:45 ` [PATCH 16/27] writeback: smoothed global/bdi dirty pages Wu Fengguang
` (12 subsequent siblings)
27 siblings, 0 replies; 44+ messages in thread
From: Wu Fengguang @ 2011-03-03 6:45 UTC (permalink / raw)
To: Andrew Morton
Cc: Jan Kara, Li Shaohua, Peter Zijlstra, Wu Fengguang,
Christoph Hellwig, Trond Myklebust, Dave Chinner,
Theodore Ts'o, Chris Mason, Mel Gorman, Rik van Riel,
KOSAKI Motohiro, Greg Thelen, Minchan Kim, Vivek Goyal,
Andrea Righi, Balbir Singh, linux-mm, linux-fsdevel, LKML
[-- Attachment #1: writeback-write-bandwidth.patch --]
[-- Type: text/plain, Size: 7623 bytes --]
The estimate starts at 50MB/s and adapts to the real bandwidth within
seconds. It's pretty accurate for common filesystems.
The overhead won't be high, because the bdi bandwidth update only
occurs at >100ms intervals.
Initially it's only estimated in balance_dirty_pages(), because that is
the most reliable place to observe reasonably large bandwidth -- the
bdi is normally fully utilized when bdi_thresh is reached.
Shaohua then recommended also doing it in the flusher thread, to keep
the value updated when there is only periodic/background writeback and
no task is throttled.
The original plan was to use per-cpu vars for bdi->write_bandwidth.
However Peter suggested that this opens a window where some CPUs may
see outdated values. So switch to spinlock-protected global vars.
A global spinlock is used, with the intention of updating more global
state under it in subsequent patches.
The bandwidth is only updated when the disk is fully utilized: any
inactive period of more than 500ms will be skipped.
The estimation is not done purely in the flusher thread, because slow
devices may take dozens of seconds to write the initial 64MB chunk
(write_bandwidth starts at 50MB/s, which translates to a 64MB
nr_to_write). So it may take more than a minute to adapt to the
smallish bandwidth if the bandwidth is only updated in the flusher
thread.
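For clarity, here is a userspace model of the running-average update
done by __bdi_update_write_bandwidth() below (a sketch of the
arithmetic only; HZ=1000, the 100ms sample interval, 4K pages and the
100MB/s disk are all assumptions):

#include <stdio.h>

#define HZ      1000
#define PERIOD  4096                    /* roundup_pow_of_two(3 * HZ) */

static unsigned long update_bw(unsigned long old_bw,
                               unsigned long elapsed,   /* jiffies */
                               unsigned long written)   /* pages in interval */
{
        unsigned long long bw = (unsigned long long)written * HZ;

        if (elapsed > PERIOD / 2) {     /* clamp overly long intervals */
                bw /= elapsed;
                elapsed = PERIOD / 2;
                bw *= elapsed;
        }
        bw += (unsigned long long)old_bw * (PERIOD - elapsed);
        return bw / PERIOD;             /* pages per second */
}

int main(void)
{
        unsigned long bw = 50 << (20 - 12);     /* INIT_BW: 50MB/s in 4K pages */
        int i;

        /* the disk really writes 2560 pages per 100ms sample, i.e. ~100MB/s */
        for (i = 0; i < 50; i++)
                bw = update_bw(bw, HZ / 10, 2560);
        printf("estimated bandwidth: %lu pages/s (~%lu MB/s)\n", bw, bw >> 8);
        return 0;
}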
CC: Li Shaohua <shaohua.li@intel.com>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
fs/fs-writeback.c | 3 +
include/linux/backing-dev.h | 5 ++
include/linux/writeback.h | 11 ++++
mm/backing-dev.c | 12 +++++
mm/page-writeback.c | 79 ++++++++++++++++++++++++++++++++++
5 files changed, 110 insertions(+)
--- linux-next.orig/fs/fs-writeback.c 2011-03-03 14:43:50.000000000 +0800
+++ linux-next/fs/fs-writeback.c 2011-03-03 14:44:07.000000000 +0800
@@ -662,6 +662,7 @@ static long wb_writeback(struct bdi_writ
write_chunk = LONG_MAX;
wbc.wb_start = jiffies; /* livelock avoidance */
+ bdi_update_write_bandwidth(wb->bdi, wbc.wb_start);
for (;;) {
/*
* Stop writeback when nr_pages has been consumed
@@ -697,6 +698,8 @@ static long wb_writeback(struct bdi_writ
writeback_inodes_wb(wb, &wbc);
trace_wbc_writeback_written(&wbc, wb->bdi);
+ bdi_update_write_bandwidth(wb->bdi, wbc.wb_start);
+
work->nr_pages -= write_chunk - wbc.nr_to_write;
wrote += write_chunk - wbc.nr_to_write;
--- linux-next.orig/include/linux/backing-dev.h 2011-03-03 14:44:07.000000000 +0800
+++ linux-next/include/linux/backing-dev.h 2011-03-03 14:44:11.000000000 +0800
@@ -75,6 +75,11 @@ struct backing_dev_info {
struct percpu_counter bdi_stat[NR_BDI_STAT_ITEMS];
+ unsigned long bw_time_stamp;
+ unsigned long written_stamp;
+ unsigned long write_bandwidth;
+ unsigned long avg_bandwidth;
+
struct prop_local_percpu completions;
int dirty_exceeded;
--- linux-next.orig/include/linux/writeback.h 2011-03-03 14:43:50.000000000 +0800
+++ linux-next/include/linux/writeback.h 2011-03-03 14:44:10.000000000 +0800
@@ -128,6 +128,17 @@ void global_dirty_limits(unsigned long *
unsigned long bdi_dirty_limit(struct backing_dev_info *bdi,
unsigned long dirty);
+void bdi_update_bandwidth(struct backing_dev_info *bdi,
+ unsigned long thresh,
+ unsigned long dirty,
+ unsigned long bdi_dirty,
+ unsigned long start_time);
+static inline void bdi_update_write_bandwidth(struct backing_dev_info *bdi,
+ unsigned long start_time)
+{
+ bdi_update_bandwidth(bdi, 0, 0, 0, start_time);
+}
+
void page_writeback_init(void);
void balance_dirty_pages_ratelimited_nr(struct address_space *mapping,
unsigned long nr_pages_dirtied);
--- linux-next.orig/mm/backing-dev.c 2011-03-03 14:44:07.000000000 +0800
+++ linux-next/mm/backing-dev.c 2011-03-03 14:44:11.000000000 +0800
@@ -641,6 +641,11 @@ static void bdi_wb_init(struct bdi_write
setup_timer(&wb->wakeup_timer, wakeup_timer_fn, (unsigned long)bdi);
}
+/*
+ * initial write bandwidth: 50 MB/s
+ */
+#define INIT_BW (50 << (20 - PAGE_SHIFT))
+
int bdi_init(struct backing_dev_info *bdi)
{
int i, err;
@@ -663,6 +668,13 @@ int bdi_init(struct backing_dev_info *bd
}
bdi->dirty_exceeded = 0;
+
+ bdi->bw_time_stamp = jiffies;
+ bdi->written_stamp = 0;
+
+ bdi->write_bandwidth = INIT_BW;
+ bdi->avg_bandwidth = INIT_BW;
+
err = prop_local_init_percpu(&bdi->completions);
if (err) {
--- linux-next.orig/mm/page-writeback.c 2011-03-03 14:44:07.000000000 +0800
+++ linux-next/mm/page-writeback.c 2011-03-03 14:44:11.000000000 +0800
@@ -472,6 +472,79 @@ unsigned long bdi_dirty_limit(struct bac
return bdi_dirty;
}
+static void __bdi_update_write_bandwidth(struct backing_dev_info *bdi,
+ unsigned long elapsed,
+ unsigned long written)
+{
+ const unsigned long period = roundup_pow_of_two(3 * HZ);
+ unsigned long avg = bdi->avg_bandwidth;
+ unsigned long old = bdi->write_bandwidth;
+ unsigned long cur;
+ u64 bw;
+
+ bw = written - bdi->written_stamp;
+ bw *= HZ;
+ if (unlikely(elapsed > period / 2)) {
+ do_div(bw, elapsed);
+ elapsed = period / 2;
+ bw *= elapsed;
+ }
+ bw += (u64)bdi->write_bandwidth * (period - elapsed);
+ cur = bw >> ilog2(period);
+ bdi->write_bandwidth = cur;
+
+ /*
+ * one more level of smoothing
+ */
+ if (avg > old && old > cur)
+ avg -= (avg - old) >> 5;
+
+ if (avg < old && old < cur)
+ avg += (old - avg) >> 5;
+
+ bdi->avg_bandwidth = avg;
+}
+
+void bdi_update_bandwidth(struct backing_dev_info *bdi,
+ unsigned long thresh,
+ unsigned long dirty,
+ unsigned long bdi_dirty,
+ unsigned long start_time)
+{
+ static DEFINE_SPINLOCK(dirty_lock);
+ unsigned long now = jiffies;
+ unsigned long elapsed;
+ unsigned long written;
+
+ if (!spin_trylock(&dirty_lock))
+ return;
+
+ elapsed = now - bdi->bw_time_stamp;
+ written = percpu_counter_read(&bdi->bdi_stat[BDI_WRITTEN]);
+
+ /* skip quiet periods when disk bandwidth is under-utilized */
+ if (elapsed > HZ/2 &&
+ elapsed > now - start_time)
+ goto snapshot;
+
+ /*
+ * rate-limit, only update once every 100ms. Demand higher threshold
+ * on the flusher so that the throttled task(s) can do most updates.
+ */
+ if (!thresh && elapsed <= HZ/4)
+ goto unlock;
+ if (elapsed <= HZ/10)
+ goto unlock;
+
+ __bdi_update_write_bandwidth(bdi, elapsed, written);
+
+snapshot:
+ bdi->written_stamp = written;
+ bdi->bw_time_stamp = now;
+unlock:
+ spin_unlock(&dirty_lock);
+}
+
/*
* balance_dirty_pages() must be called by processes which are generating dirty
* data. It looks at the number of dirty pages in the machine and will force
@@ -491,6 +564,7 @@ static void balance_dirty_pages(struct a
unsigned long pause = 1;
bool dirty_exceeded = false;
struct backing_dev_info *bdi = mapping->backing_dev_info;
+ unsigned long start_time = jiffies;
if (!bdi_cap_account_dirty(bdi))
return;
@@ -539,6 +613,11 @@ static void balance_dirty_pages(struct a
bdi_nr_writeback = bdi_stat(bdi, BDI_WRITEBACK);
}
+ bdi_update_bandwidth(bdi, dirty_thresh,
+ nr_reclaimable + nr_writeback,
+ bdi_nr_reclaimable + bdi_nr_writeback,
+ start_time);
+
/*
* The bdi thresh is somehow "soft" limit derived from the
* global "hard" limit. The former helps to prevent heavy IO
--
^ permalink raw reply [flat|nested] 44+ messages in thread
* [PATCH 16/27] writeback: smoothed global/bdi dirty pages
2011-03-03 6:45 [PATCH 00/27] IO-less dirty throttling v6 Wu Fengguang
` (14 preceding siblings ...)
2011-03-03 6:45 ` [PATCH 15/27] writeback: bdi write bandwidth estimation Wu Fengguang
@ 2011-03-03 6:45 ` Wu Fengguang
2011-03-03 6:45 ` [PATCH 17/27] writeback: smoothed dirty threshold and limit Wu Fengguang
` (11 subsequent siblings)
27 siblings, 0 replies; 44+ messages in thread
From: Wu Fengguang @ 2011-03-03 6:45 UTC (permalink / raw)
To: Andrew Morton
Cc: Jan Kara, Wu Fengguang, Christoph Hellwig, Trond Myklebust,
Dave Chinner, Theodore Ts'o, Chris Mason, Peter Zijlstra,
Mel Gorman, Rik van Riel, KOSAKI Motohiro, Greg Thelen,
Minchan Kim, Vivek Goyal, Andrea Righi, Balbir Singh, linux-mm,
linux-fsdevel, LKML
[-- Attachment #1: writeback-smooth-dirty.patch --]
[-- Type: text/plain, Size: 3985 bytes --]
Maintain a smoothed version of dirty pages for use in the throttle
bandwidth calculations.
default_backing_dev_info.avg_dirty holds the smoothed global dirty
pages.
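A userspace sketch of the smoothing rule in bdi_update_dirty_smooth()
(the sample sequence is invented): avg is only nudged toward the new
sample when the previous sample already sits between avg and the new
one, so avg stays flat while dirty pages merely oscillate around the
balance point.

#include <stdio.h>

static unsigned long avg, old;

static void update_dirty_smooth(unsigned long dirty)
{
        if (!avg)
                avg = dirty;
        else if (avg < old && old <= dirty)     /* consistently rising: follow up */
                avg += (old - avg) >> 3;
        else if (avg > old && old >= dirty)     /* consistently falling: follow down */
                avg -= (avg - old) >> 3;
        /* otherwise dirty is oscillating around avg: hold avg steady */
        old = dirty;
}

int main(void)
{
        unsigned long samples[] = { 1000, 1100, 900, 1050, 950, 1200, 1300 };
        unsigned int i;

        for (i = 0; i < sizeof(samples) / sizeof(samples[0]); i++) {
                update_dirty_smooth(samples[i]);
                printf("dirty=%4lu avg=%4lu\n", samples[i], avg);
        }
        return 0;
}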
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
include/linux/backing-dev.h | 2 +
mm/backing-dev.c | 3 +
mm/page-writeback.c | 62 ++++++++++++++++++++++++++++++++++
3 files changed, 67 insertions(+)
--- linux-next.orig/mm/page-writeback.c 2011-03-03 14:44:07.000000000 +0800
+++ linux-next/mm/page-writeback.c 2011-03-03 14:44:10.000000000 +0800
@@ -472,6 +472,64 @@ unsigned long bdi_dirty_limit(struct bac
return bdi_dirty;
}
+static void bdi_update_dirty_smooth(struct backing_dev_info *bdi,
+ unsigned long dirty)
+{
+ unsigned long avg = bdi->avg_dirty;
+ unsigned long old = bdi->old_dirty;
+
+ if (unlikely(!avg)) {
+ avg = dirty;
+ goto update;
+ }
+
+ /*
+ * dirty pages are departing upwards, follow up
+ */
+ if (avg < old && old <= dirty) {
+ avg += (old - avg) >> 3;
+ goto update;
+ }
+
+ /*
+ * dirty pages are departing downwards, follow down
+ */
+ if (avg > old && old >= dirty) {
+ avg -= (avg - old) >> 3;
+ goto update;
+ }
+
+ /*
+ * This can filter out one half unnecessary updates when bdi_dirty is
+ * fluctuating around the balance point, and is most effective on XFS,
+ * whose pattern is
+ * .
+ * [.] dirty [-] avg . .
+ * . .
+ * . . . . . .
+ * --------------------------------------- . .
+ * . . . . . .
+ * . . . . . .
+ * . . . . . .
+ * . . . . . .
+ * . . . .
+ * . . . . (fluctuated)
+ * . . . .
+ * . . . .
+ *
+ * @avg will remain flat at the cost of being biased towards high. In
+ * practice the errors tend to be much smaller: thanks to more coarse
+ * grained fluctuations, @avg becomes the real average number for the
+ * last two rising lines of @dirty.
+ */
+ goto out;
+
+update:
+ bdi->avg_dirty = avg;
+out:
+ bdi->old_dirty = dirty;
+}
+
static void __bdi_update_write_bandwidth(struct backing_dev_info *bdi,
unsigned long elapsed,
unsigned long written)
@@ -537,6 +595,10 @@ void bdi_update_bandwidth(struct backing
goto unlock;
__bdi_update_write_bandwidth(bdi, elapsed, written);
+ if (thresh) {
+ bdi_update_dirty_smooth(bdi, bdi_dirty);
+ bdi_update_dirty_smooth(&default_backing_dev_info, dirty);
+ }
snapshot:
bdi->written_stamp = written;
--- linux-next.orig/include/linux/backing-dev.h 2011-03-03 14:44:07.000000000 +0800
+++ linux-next/include/linux/backing-dev.h 2011-03-03 14:44:10.000000000 +0800
@@ -79,6 +79,8 @@ struct backing_dev_info {
unsigned long written_stamp;
unsigned long write_bandwidth;
unsigned long avg_bandwidth;
+ unsigned long avg_dirty;
+ unsigned long old_dirty;
struct prop_local_percpu completions;
int dirty_exceeded;
--- linux-next.orig/mm/backing-dev.c 2011-03-03 14:44:07.000000000 +0800
+++ linux-next/mm/backing-dev.c 2011-03-03 14:44:10.000000000 +0800
@@ -675,6 +675,9 @@ int bdi_init(struct backing_dev_info *bd
bdi->write_bandwidth = INIT_BW;
bdi->avg_bandwidth = INIT_BW;
+ bdi->avg_dirty = 0;
+ bdi->old_dirty = 0;
+
err = prop_local_init_percpu(&bdi->completions);
if (err) {
--
^ permalink raw reply [flat|nested] 44+ messages in thread
* [PATCH 17/27] writeback: smoothed dirty threshold and limit
2011-03-03 6:45 [PATCH 00/27] IO-less dirty throttling v6 Wu Fengguang
` (15 preceding siblings ...)
2011-03-03 6:45 ` [PATCH 16/27] writeback: smoothed global/bdi dirty pages Wu Fengguang
@ 2011-03-03 6:45 ` Wu Fengguang
2011-03-03 6:45 ` [PATCH 18/27] writeback: enforce 1/4 gap between the dirty/background thresholds Wu Fengguang
` (10 subsequent siblings)
27 siblings, 0 replies; 44+ messages in thread
From: Wu Fengguang @ 2011-03-03 6:45 UTC (permalink / raw)
To: Andrew Morton
Cc: Jan Kara, Wu Fengguang, Christoph Hellwig, Trond Myklebust,
Dave Chinner, Theodore Ts'o, Chris Mason, Peter Zijlstra,
Mel Gorman, Rik van Riel, KOSAKI Motohiro, Greg Thelen,
Minchan Kim, Vivek Goyal, Andrea Righi, Balbir Singh, linux-mm,
linux-fsdevel, LKML
[-- Attachment #1: writeback-dirty-thresh-limit.patch --]
[-- Type: text/plain, Size: 6395 bytes --]
Both the global and bdi dirty thresholds may fluctuate undesirably.
- the start of a heavyweight application (e.g. KVM) may instantly knock
down determine_dirtyable_memory() and hence the global/bdi dirty
thresholds.
- in a JBOD setup, the bdi dirty thresholds are observed to fluctuate
more.
So maintain a smoothed version of the bdi dirty threshold in
bdi->dirty_threshold, and introduce the global dirty limit in
default_backing_dev_info.dirty_threshold.
The global limit can effectively mask out the impact of a sudden drop
in dirtyable memory. Without it, the dirtier tasks may be blocked in
the block area for ~10s after someone eats 500MB of memory; with the
limit, the dirtier tasks will be throttled at e.g. 1/8 => 1/4 => 1/2 =>
the original dirty bandwidth by the main control line, and bring the
dirty pages down at a reasonable speed.
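As an illustration of the slow tracking, here is a userspace sketch of
update_dirty_limit() when the threshold is suddenly knocked down
(DIRTY_MARGIN=32 as in this series; the page counts are invented):

#include <stdio.h>

#define DIRTY_MARGIN 32

static unsigned long limit;

static void update_dirty_limit(unsigned long thresh, unsigned long dirty)
{
        unsigned long min = dirty + limit / DIRTY_MARGIN;

        if (limit < thresh) {
                limit = thresh;         /* follow a rising thresh at once */
                return;
        }
        /* follow a falling thresh slowly: close 1/8 of the gap per update,
         * never dropping below dirty + limit/DIRTY_MARGIN */
        if (limit > thresh + thresh / (DIRTY_MARGIN * 8) && limit > min)
                limit -= (limit - (thresh > min ? thresh : min)) >> 3;
}

int main(void)
{
        unsigned long dirty = 90000;            /* pages currently dirty */
        int i;

        update_dirty_limit(100000, dirty);      /* thresh while memory was plentiful */
        for (i = 0; i < 10; i++) {              /* thresh suddenly drops to 60000 */
                update_dirty_limit(60000, dirty);
                printf("step %2d: limit=%lu\n", i, limit);
        }
        return 0;
}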
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
include/linux/backing-dev.h | 3 +
include/linux/writeback.h | 34 +++++++++++++++++
mm/backing-dev.c | 1
mm/page-writeback.c | 66 ++++++++++++++++++++++++++++++++++
4 files changed, 104 insertions(+)
--- linux-next.orig/include/linux/writeback.h 2011-03-03 14:44:07.000000000 +0800
+++ linux-next/include/linux/writeback.h 2011-03-03 14:44:07.000000000 +0800
@@ -12,6 +12,40 @@ struct backing_dev_info;
extern spinlock_t inode_lock;
/*
+ * 4MB minimal write chunk size
+ */
+#define MIN_WRITEBACK_PAGES (4096UL >> (PAGE_CACHE_SHIFT - 10))
+
+/*
+ * The 1/4 region under the global dirty thresh is for smooth dirty throttling:
+ *
+ * (thresh - 2*thresh/DIRTY_SCOPE, thresh)
+ *
+ * The 1/32 region under the global dirty limit will be more rigidly throttled:
+ *
+ * (limit - limit/DIRTY_MARGIN, limit)
+ *
+ * The 1/32 region above the global dirty limit will be put to maximum pauses:
+ *
+ * (limit, limit + limit/DIRTY_MARGIN)
+ *
+ * Further beyond, the dirtier task will enter a loop waiting (possibly long
+ * time) for the dirty pages to drop below (limit + limit/DIRTY_MARGIN).
+ *
+ * The last case may happen lightly when memory is very tight or at sudden
+ * workload rampup. Or under DoS situations such as a fork bomb where every new
+ * task dirties some more pages, or creating 10,000 tasks each writing to a USB
+ * key slowly in 4KB/s.
+ *
+ * The global dirty threshold is normally equal to global dirty limit, except
+ * when the system suddenly allocates a lot of anonymous memory and knocks down
+ * the global dirty threshold quickly, in which case the global dirty limit
+ * will follow down slowly to prevent livelocking all dirtier tasks.
+ */
+#define DIRTY_SCOPE 8
+#define DIRTY_MARGIN (DIRTY_SCOPE * 4)
+
+/*
* fs/fs-writeback.c
*/
enum writeback_sync_modes {
--- linux-next.orig/mm/page-writeback.c 2011-03-03 14:44:07.000000000 +0800
+++ linux-next/mm/page-writeback.c 2011-03-03 14:44:07.000000000 +0800
@@ -472,6 +472,24 @@ unsigned long bdi_dirty_limit(struct bac
return bdi_dirty;
}
+/*
+ * If we can dirty N more pages globally, honour N/8 to the bdi that runs low,
+ * so as to help it ramp up.
+ *
+ * It helps the chicken and egg problem: when bdi A (eg. /pub) is heavy dirtied
+ * and bdi B (eg. /) is light dirtied hence has 0 dirty limit, tasks writing to
+ * B always get heavily throttled and bdi B's dirty limit might never be able
+ * to grow up from 0.
+ */
+static unsigned long dirty_rampup_size(unsigned long dirty,
+ unsigned long thresh)
+{
+ if (thresh > dirty + MIN_WRITEBACK_PAGES)
+ return min(MIN_WRITEBACK_PAGES * 2, (thresh - dirty) / 8);
+
+ return MIN_WRITEBACK_PAGES / 8;
+}
+
static void bdi_update_dirty_smooth(struct backing_dev_info *bdi,
unsigned long dirty)
{
@@ -563,6 +581,50 @@ static void __bdi_update_write_bandwidth
bdi->avg_bandwidth = avg;
}
+static void update_dirty_limit(unsigned long thresh,
+ unsigned long dirty)
+{
+ unsigned long limit = default_backing_dev_info.dirty_threshold;
+ unsigned long min = dirty + limit / DIRTY_MARGIN;
+
+ if (limit < thresh) {
+ limit = thresh;
+ goto out;
+ }
+
+ /* take care not to follow into the brake area */
+ if (limit > thresh + thresh / (DIRTY_MARGIN * 8) &&
+ limit > min) {
+ limit -= (limit - max(thresh, min)) >> 3;
+ goto out;
+ }
+
+ return;
+out:
+ default_backing_dev_info.dirty_threshold = limit;
+}
+
+static void bdi_update_dirty_threshold(struct backing_dev_info *bdi,
+ unsigned long thresh,
+ unsigned long dirty)
+{
+ unsigned long old = bdi->old_dirty_threshold;
+ unsigned long avg = bdi->dirty_threshold;
+ unsigned long min;
+
+ min = dirty_rampup_size(dirty, thresh);
+ thresh = bdi_dirty_limit(bdi, thresh);
+
+ if (avg > old && old >= thresh)
+ avg -= (avg - old) >> 4;
+
+ if (avg < old && old <= thresh)
+ avg += (old - avg) >> 4;
+
+ bdi->dirty_threshold = max(avg, min);
+ bdi->old_dirty_threshold = thresh;
+}
+
void bdi_update_bandwidth(struct backing_dev_info *bdi,
unsigned long thresh,
unsigned long dirty,
@@ -594,6 +656,10 @@ void bdi_update_bandwidth(struct backing
if (elapsed <= HZ/10)
goto unlock;
+ if (thresh) {
+ update_dirty_limit(thresh, dirty);
+ bdi_update_dirty_threshold(bdi, thresh, dirty);
+ }
__bdi_update_write_bandwidth(bdi, elapsed, written);
if (thresh) {
bdi_update_dirty_smooth(bdi, bdi_dirty);
--- linux-next.orig/include/linux/backing-dev.h 2011-03-03 14:44:07.000000000 +0800
+++ linux-next/include/linux/backing-dev.h 2011-03-03 14:44:07.000000000 +0800
@@ -81,6 +81,9 @@ struct backing_dev_info {
unsigned long avg_bandwidth;
unsigned long avg_dirty;
unsigned long old_dirty;
+ unsigned long dirty_threshold;
+ unsigned long old_dirty_threshold;
+
struct prop_local_percpu completions;
int dirty_exceeded;
--- linux-next.orig/mm/backing-dev.c 2011-03-03 14:44:07.000000000 +0800
+++ linux-next/mm/backing-dev.c 2011-03-03 14:44:07.000000000 +0800
@@ -677,6 +677,7 @@ int bdi_init(struct backing_dev_info *bd
bdi->avg_dirty = 0;
bdi->old_dirty = 0;
+ bdi->dirty_threshold = MIN_WRITEBACK_PAGES;
err = prop_local_init_percpu(&bdi->completions);
--
^ permalink raw reply [flat|nested] 44+ messages in thread
* [PATCH 18/27] writeback: enforce 1/4 gap between the dirty/background thresholds
2011-03-03 6:45 [PATCH 00/27] IO-less dirty throttling v6 Wu Fengguang
` (16 preceding siblings ...)
2011-03-03 6:45 ` [PATCH 17/27] writeback: smoothed dirty threshold and limit Wu Fengguang
@ 2011-03-03 6:45 ` Wu Fengguang
2011-03-03 6:45 ` [PATCH 19/27] writeback: dirty throttle bandwidth control Wu Fengguang
` (9 subsequent siblings)
27 siblings, 0 replies; 44+ messages in thread
From: Wu Fengguang @ 2011-03-03 6:45 UTC (permalink / raw)
To: Andrew Morton
Cc: Jan Kara, Peter Zijlstra, Wu Fengguang, Christoph Hellwig,
Trond Myklebust, Dave Chinner, Theodore Ts'o, Chris Mason,
Mel Gorman, Rik van Riel, KOSAKI Motohiro, Greg Thelen,
Minchan Kim, Vivek Goyal, Andrea Righi, Balbir Singh, linux-mm,
linux-fsdevel, LKML
[-- Attachment #1: writeback-fix-oversize-background-thresh.patch --]
[-- Type: text/plain, Size: 2006 bytes --]
The change is virtually a no-op for the majority of users, who use the
default 10/20 background/dirty ratios. For the others, it's not clear
why they set the background ratio so close to the dirty ratio. Some
must set the background ratio equal to the dirty ratio, yet no one
seems to notice or complain that it's then silently halved under the
hood..
The other solution would be to return -EIO when setting a too-large
background threshold or a too-small dirty threshold. However that could
possibly break some out-of-order usage scenarios, eg.
echo 10 > /proc/sys/vm/dirty_ratio
echo 5 > /proc/sys/vm/dirty_background_ratio
The first echo would fail because the background ratio is still 10.
Such order-dependent behavior seems unpleasant for end users.
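A tiny userspace sketch of the clamp (DIRTY_SCOPE=8 as introduced
earlier in this series, so the enforced gap is dirty/4; the ratios and
memory size are examples):

#include <stdio.h>

#define DIRTY_SCOPE 8

int main(void)
{
        unsigned long available_memory = 1000000;       /* pages, example */
        unsigned long dirty_ratio = 20, background_ratio = 20;
        unsigned long dirty = dirty_ratio * available_memory / 100;
        unsigned long background = background_ratio * available_memory / 100;

        if (background > dirty - dirty / (DIRTY_SCOPE / 2))
                background = dirty - dirty / (DIRTY_SCOPE / 2);

        printf("dirty=%lu background=%lu (the old rule would halve to %lu)\n",
               dirty, background, dirty / 2);
        return 0;
}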
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
mm/page-writeback.c | 10 ++++++++--
1 file changed, 8 insertions(+), 2 deletions(-)
--- linux-next.orig/mm/page-writeback.c 2011-03-02 17:04:16.000000000 +0800
+++ linux-next/mm/page-writeback.c 2011-03-02 17:06:17.000000000 +0800
@@ -422,8 +422,14 @@ void global_dirty_limits(unsigned long *
else
background = (dirty_background_ratio * available_memory) / 100;
- if (background >= dirty)
- background = dirty / 2;
+ /*
+ * Ensure at least 1/4 gap between background and dirty thresholds, so
+ * that when dirty throttling starts at (background + dirty)/2, it's
+ * below or at the entrance of the soft dirty throttle scope.
+ */
+ if (background > dirty - dirty / (DIRTY_SCOPE / 2))
+ background = dirty - dirty / (DIRTY_SCOPE / 2);
+
tsk = current;
if (tsk->flags & PF_LESS_THROTTLE || rt_task(tsk)) {
background += background / 4;
--
^ permalink raw reply [flat|nested] 44+ messages in thread
* [PATCH 19/27] writeback: dirty throttle bandwidth control
2011-03-03 6:45 [PATCH 00/27] IO-less dirty throttling v6 Wu Fengguang
` (17 preceding siblings ...)
2011-03-03 6:45 ` [PATCH 18/27] writeback: enforce 1/4 gap between the dirty/background thresholds Wu Fengguang
@ 2011-03-03 6:45 ` Wu Fengguang
2011-03-07 21:34 ` Wu Fengguang
2011-03-29 21:08 ` Wu Fengguang
2011-03-03 6:45 ` [PATCH 20/27] writeback: IO-less balance_dirty_pages() Wu Fengguang
` (8 subsequent siblings)
27 siblings, 2 replies; 44+ messages in thread
From: Wu Fengguang @ 2011-03-03 6:45 UTC (permalink / raw)
To: Andrew Morton
Cc: Jan Kara, Wu Fengguang, Christoph Hellwig, Trond Myklebust,
Dave Chinner, Theodore Ts'o, Chris Mason, Peter Zijlstra,
Mel Gorman, Rik van Riel, KOSAKI Motohiro, Greg Thelen,
Minchan Kim, Vivek Goyal, Andrea Righi, Balbir Singh, linux-mm,
linux-fsdevel, LKML
[-- Attachment #1: writeback-control-algorithms.patch --]
[-- Type: text/plain, Size: 25801 bytes --]
balance_dirty_pages() has been using a very simple and robust
threshold-based throttle scheme. It automatically limits the dirty
rate, however in a very bumpy way that constantly blocks the dirtier
tasks for hundreds of milliseconds on a local ext4.
The new scheme is to expand the ON/OFF threshold to a larger scope in
which both the number of dirty pages and the dirty rate are explicitly
controlled. The basic ideas are
- position feedback control
At the center of the control scope is the setpoint/goal. When the
number of dirty pages goes higher/lower than the goal, the dirty rate
will be proportionally decreased/increased to prevent the dirty pages
from drifting away.
When the dirty pages drop low to the bottom of the control scope, or
rush high to the upper limit, the dirty rate will quickly be scaled
up/down, to the point of completely letting go of or completely
blocking the dirtier task.
- rate feedback control
What's the balanced dirty rate if the dirty pages sit exactly at the
goal? If there are N tasks dirtying pages on 1 disk at rate task_bw
MB/s, then task_bw should be balanced at write_bw/N, where write_bw is
the disk's write bandwidth. We call base_bw=write_bw/(N*sqrt(N)) the
disk's base throttle bandwidth. Each task will be allowed to dirty at
rate task_bw=base_bw/sqrt(task_weight), where task_weight=1/N reflects
how much of the system's dirty pages are dirtied by the task. So the
overall dirty rate dirty_bw=N*task_bw will match write_bw exactly.
(A small numeric sketch of these identities follows after this list.)
In practice we don't know base_bw beforehand, because we don't know
the exact value of N and cannot assume all tasks are equally weighted.
So a reference bandwidth ref_bw is estimated as the target of base_bw,
and base_bw will be adjusted step by step towards ref_bw. In each step,
ref_bw is calculated as (base_bw * pos_ratio * write_bw / dirty_bw):
when the (unknown number of) tasks are rate limited based on the
previous (base_bw*pos_ratio*sqrt(task_weight)), if the overall dirty
rate dirty_bw is M times write_bw, then base_bw shall be scaled by 1/M
to balance dirty_bw against write_bw. Note that pos_ratio is the result
of the position control; it will be 1 if the dirty pages are exactly at
the goal.
The ref_bw estimation will be pretty accurate if not for
(1) noise
(2) feedback delays between steps
(3) the mismatch between the number of dirty and writeback events,
caused by user space truncates and file system redirties
(1) can be smoothed out; (2) will decrease proportionally with the
adjustment size as base_bw gets close to ref_bw; (3) can ultimately be
fixed by accounting for the truncate/redirty events. For now we can
rely on the robustness of the base_bw update algorithms to deal with
the mismatches: no obvious imbalance is observed in ext4 workloads,
which have bursts of redirties and a large dirtied:written=3:2 ratio.
In theory, when the truncates/redirties make (write_bw/dirty_bw < 1),
ref_bw and base_bw will go low, driving up pos_ratio, which then
corrects (pos_ratio * write_bw / dirty_bw) back to 1, thus balancing
ref_bw at some point. What's more, bdi_update_throttle_bandwidth()
dictates that base_bw will only be updated when ref_bw and
pos_bw=base_bw*pos_ratio are both higher or both lower than base_bw.
So the higher pos_bw will effectively stop base_bw from approaching
the lower ref_bw.
In general, it's pretty safe and robust.
- the upper/lower bounds in the position control provide the ultimate
safeguard: in case the algorithms fly away, the worst case would be
the dirty pages continuously hitting the bounds with big fluctuations
in dirty rate -- basically similar to the current state.
- the base bandwidth update rules are accurate and robust enough for
base_bw to quickly adapt to a new workload and remain stable
thereafter. This is confirmed by a wide range of tests: base_bw only
becomes less stable when the control scope is smaller than the write
bandwidth, in which case pos_ratio is already fluctuating much more.
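As promised above, a small userspace numeric sketch of the
rate-feedback identities (N equal dirtiers sharing one disk; the
100MB/s write bandwidth and the sample numbers are invented; build
with -lm):

#include <math.h>
#include <stdio.h>

int main(void)
{
        double write_bw = 100.0;        /* MB/s, example disk bandwidth */
        int n;

        for (n = 1; n <= 16; n *= 2) {
                double base_bw = write_bw / (n * sqrt(n));
                double task_weight = 1.0 / n;
                double task_bw = base_bw / sqrt(task_weight);
                printf("N=%2d  base_bw=%6.2f  task_bw=%6.2f  total=%6.2f\n",
                       n, base_bw, task_bw, n * task_bw);
        }

        /* rate feedback: with dirty pages at the goal (pos_ratio == 1), an
         * observed dirty_bw of M*write_bw scales base_bw by 1/M */
        {
                double base_bw = 25.0, pos_ratio = 1.0, dirty_bw = 200.0;
                double ref_bw = base_bw * pos_ratio * write_bw / dirty_bw;
                printf("ref_bw=%.2f (base_bw halved, since dirty_bw = 2*write_bw)\n",
                       ref_bw);
        }
        return 0;
}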
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
include/linux/backing-dev.h | 10
include/linux/writeback.h | 7
mm/backing-dev.c | 1
mm/page-writeback.c | 478 ++++++++++++++++++++++++++++++++++
4 files changed, 495 insertions(+), 1 deletion(-)
--- linux-next.orig/include/linux/backing-dev.h 2011-03-03 14:44:22.000000000 +0800
+++ linux-next/include/linux/backing-dev.h 2011-03-03 14:44:27.000000000 +0800
@@ -76,18 +76,26 @@ struct backing_dev_info {
struct percpu_counter bdi_stat[NR_BDI_STAT_ITEMS];
unsigned long bw_time_stamp;
+ unsigned long dirtied_stamp;
unsigned long written_stamp;
unsigned long write_bandwidth;
unsigned long avg_bandwidth;
+ unsigned long long throttle_bandwidth;
+ unsigned long long reference_bandwidth;
+ unsigned long long old_ref_bandwidth;
unsigned long avg_dirty;
unsigned long old_dirty;
unsigned long dirty_threshold;
unsigned long old_dirty_threshold;
-
struct prop_local_percpu completions;
int dirty_exceeded;
+ /* last time exceeded (limit - limit/DIRTY_MARGIN) */
+ unsigned long dirty_exceed_time;
+ /* last time dropped below (background_thresh + dirty_thresh) / 2 */
+ unsigned long dirty_free_run;
+
unsigned int min_ratio;
unsigned int max_ratio, max_prop_frac;
--- linux-next.orig/include/linux/writeback.h 2011-03-03 14:44:22.000000000 +0800
+++ linux-next/include/linux/writeback.h 2011-03-03 14:44:23.000000000 +0800
@@ -46,6 +46,13 @@ extern spinlock_t inode_lock;
#define DIRTY_MARGIN (DIRTY_SCOPE * 4)
/*
+ * The base throttle bandwidth will be 1000 times smaller than write bandwidth
+ * when there are 100 concurrent heavy dirtiers. This shift can work with up to
+ * 40 bits dirty size and 2^16 concurrent dirtiers.
+ */
+#define BASE_BW_SHIFT 24
+
+/*
* fs/fs-writeback.c
*/
enum writeback_sync_modes {
--- linux-next.orig/mm/page-writeback.c 2011-03-03 14:44:23.000000000 +0800
+++ linux-next/mm/page-writeback.c 2011-03-03 14:44:27.000000000 +0800
@@ -496,6 +496,255 @@ static unsigned long dirty_rampup_size(u
return MIN_WRITEBACK_PAGES / 8;
}
+/*
+ * last time exceeded (limit - limit/DIRTY_MARGIN)
+ */
+static bool dirty_exceeded_recently(struct backing_dev_info *bdi,
+ unsigned long time_window)
+{
+ return jiffies - bdi->dirty_exceed_time <= time_window;
+}
+
+/*
+ * last time dropped below (thresh - 2*thresh/DIRTY_SCOPE + thresh/DIRTY_MARGIN)
+ */
+static bool dirty_free_run_recently(struct backing_dev_info *bdi,
+ unsigned long time_window)
+{
+ return jiffies - bdi->dirty_free_run <= time_window;
+}
+
+/*
+ * Position based bandwidth control.
+ *
+ * (1) hard dirty limiting areas
+ *
+ * The block area is required to stop large number of slow dirtiers, because
+ * the max pause area is only able to throttle a task at 1page/200ms=20KB/s.
+ *
+ * The max pause area is sufficient for normal workloads, and has the virtue
+ * of bounded latency for light dirtiers.
+ *
+ * The brake area is typically enough to hold off the dirtiers as long as the
+ * dirtyable memory is not so tight.
+ *
+ * The block area and max pause area are enforced inside the loop of
+ * balance_dirty_pages(). Others can be found in dirty_throttle_bandwidth().
+ *
+ * block area, loop until drop below the area -------------------|<===
+ * max pause area, sleep(max_pause) and return -----------|<=====>|
+ * brake area, bw scaled from 1 down to 0 ---|<=====>|
+ * --------------------------------------------------------o-------o-------o----
+ * ^ ^ ^
+ * limit - limit/DIRTY_MARGIN ---' | |
+ * limit -----------' |
+ * limit + limit/DIRTY_MARGIN -------------------'
+ *
+ * (2) global control areas
+ *
+ * The rampup area is for ramping up the base bandwidth whereas the above brake
+ * area is for scaling down the base bandwidth.
+ *
+ * The global thresh is typically equal to the above global limit. The
+ * difference is, @thresh is real-time computed from global_dirty_limits() and
+ * @limit is tracking @thresh at 100ms intervals in update_dirty_limit(). The
+ * point is to track @thresh slowly if it dropped below the number of dirty
+ * pages, so as to avoid unnecessarily entering the three areas in (1).
+ *
+ *rampup area setpoint/goal
+ *|<=======>| v
+ * |-------------------------------*-------------------------------|------------
+ * ^ ^ ^
+ * thresh - 2*thresh/DIRTY_SCOPE thresh - thresh/DIRTY_SCOPE thresh
+ *
+ * (3) bdi control areas
+ *
+ * The bdi reserve area tries to keep a reasonable number of dirty pages for
+ * preventing block queue underrun.
+ *
+ * reserve area, scale up bw as dirty pages drop low bdi_setpoint
+ * |<=============================================>| v
+ * |-------------------------------------------------------*-------|----------
+ * 0 bdi_thresh - bdi_thresh/DIRTY_SCOPE^ ^bdi_thresh
+ *
+ * (4) global/bdi control lines
+ *
+ * dirty_throttle_bandwidth() applies 2 main and 3 regional control lines for
+ * scaling up/down the base bandwidth based on the position of dirty pages.
+ *
+ * The two main control lines for the global/bdi control scopes do not end at
+ * thresh/bdi_thresh. They are centered at setpoint/bdi_setpoint and cover the
+ * whole [0, limit]. If the control line drops below 0 before reaching @limit,
+ * an auxiliary line will be setup to connect them. The below figure illustrates
+ * the main bdi control line with an auxiliary line extending it to @limit.
+ *
+ * This allows smoothly throttling down bdi_dirty back to normal if it starts
+ * high in situations like
+ * - start writing to a slow SD card and a fast disk at the same time. The SD
+ * card's bdi_dirty may rush to 5 times higher than bdi_setpoint.
+ * - the global/bdi dirty thresh/goal may be knocked down suddenly either on
+ * user request or on increased memory consumption.
+ *
+ * o
+ * o
+ * o [o] main control line
+ * o [*] auxiliary control line
+ * o
+ * o
+ * o
+ * o
+ * o
+ * o
+ * o--------------------- balance point, bw scale = 1
+ * | o
+ * | o
+ * | o
+ * | o
+ * | o
+ * | o
+ * | o------- connect point, bw scale = 1/2
+ * | .*
+ * | . *
+ * | . *
+ * | . *
+ * | . *
+ * | . *
+ * | . *
+ * [--------------------*-----------------------------.--------------------*]
+ * 0 bdi_setpoint bdi_origin limit
+ *
+ * The bdi control line: if (bdi_origin < limit), an auxiliary control line (*)
+ * will be setup to extend the main control line (o) to @limit.
+ */
+static unsigned long dirty_throttle_bandwidth(struct backing_dev_info *bdi,
+ unsigned long thresh,
+ unsigned long dirty,
+ unsigned long bdi_dirty,
+ struct task_struct *tsk)
+{
+ unsigned long limit = default_backing_dev_info.dirty_threshold;
+ unsigned long bdi_thresh = bdi->dirty_threshold;
+ unsigned long origin;
+ unsigned long goal;
+ unsigned long long span;
+ unsigned long long bw;
+
+ if (unlikely(dirty >= limit))
+ return 0;
+
+ /*
+ * global setpoint
+ */
+ origin = 2 * thresh;
+ goal = thresh - thresh / DIRTY_SCOPE;
+
+ if (unlikely(origin < limit && dirty > (goal + origin) / 2)) {
+ origin = limit;
+ goal = (goal + origin) / 2;
+ bw >>= 1;
+ }
+ bw = origin - dirty;
+ bw <<= BASE_BW_SHIFT;
+ do_div(bw, origin - goal + 1);
+
+ /*
+ * brake area to prevent global dirty exceeding
+ */
+ if (dirty > limit - limit / DIRTY_MARGIN) {
+ bw *= limit - dirty;
+ do_div(bw, limit / DIRTY_MARGIN + 1);
+ }
+
+ /*
+ * rampup area, immediately above the unthrottled free-run region.
+ * It's setup mainly to get an estimation of ref_bw for reliably
+ * ramping up the base bandwidth.
+ */
+ dirty = default_backing_dev_info.avg_dirty;
+ origin = thresh - thresh / (DIRTY_SCOPE/2) + thresh / DIRTY_MARGIN;
+ if (dirty < origin) {
+ span = (origin - dirty) * bw;
+ do_div(span, thresh / (8 * DIRTY_MARGIN) + 1);
+ bw += span;
+ }
+
+ /*
+ * bdi setpoint
+ */
+ if (unlikely(bdi_thresh > thresh))
+ bdi_thresh = thresh;
+ goal = bdi_thresh - bdi_thresh / DIRTY_SCOPE;
+ /*
+ * In JBOD case, bdi_thresh could fluctuate proportional to its own
+ * size. Otherwise the bdi write bandwidth is good for limiting the
+ * floating area, to compensate for the global control line being too
+ * flat in large memory systems.
+ */
+ span = (u64) bdi_thresh * (thresh - bdi_thresh) +
+ (2 * bdi->avg_bandwidth) * bdi_thresh;
+ do_div(span, thresh + 1);
+ origin = goal + 2 * span;
+
+ dirty = bdi->avg_dirty;
+ if (unlikely(dirty > goal + span)) {
+ if (dirty > limit)
+ return 0;
+ if (origin < limit) {
+ origin = limit;
+ goal += span;
+ bw >>= 1;
+ }
+ }
+ bw *= origin - dirty;
+ do_div(bw, origin - goal + 1);
+
+ /*
+ * bdi reserve area, safeguard against bdi dirty underflow and disk idle
+ */
+ origin = bdi_thresh - bdi_thresh / (DIRTY_SCOPE / 2);
+ if (bdi_dirty < origin)
+ bw = bw * origin / (bdi_dirty | 1);
+
+ /*
+ * honour light dirtiers higher bandwidth:
+ *
+ * bw *= sqrt(1 / task_dirty_weight);
+ */
+ if (tsk) {
+ unsigned long numerator, denominator;
+ const unsigned long priority_base = 1024;
+ unsigned long priority = priority_base;
+
+ /*
+ * Double the bandwidth for PF_LESS_THROTTLE (ie. nfsd) and
+ * real-time tasks.
+ */
+ if (tsk->flags & PF_LESS_THROTTLE || rt_task(tsk))
+ priority *= 2;
+
+ task_dirties_fraction(tsk, &numerator, &denominator);
+
+ denominator <<= 10;
+ denominator = denominator * priority / priority_base;
+ bw *= int_sqrt(denominator / (numerator + 1)) *
+ priority / priority_base;
+ bw >>= 5 + BASE_BW_SHIFT / 2;
+ bw = (unsigned long)bw * bdi->throttle_bandwidth;
+ bw >>= 2 * BASE_BW_SHIFT - BASE_BW_SHIFT / 2;
+
+ /*
+ * The avg_bandwidth bound is necessary because
+ * bdi_update_throttle_bandwidth() blindly sets base bandwidth
+ * to avg_bandwidth for more stable estimation, when it
+ * believes the current task is the only dirtier.
+ */
+ if (priority > priority_base)
+ return min((unsigned long)bw, bdi->avg_bandwidth);
+ }
+
+ return bw;
+}
+
static void bdi_update_dirty_smooth(struct backing_dev_info *bdi,
unsigned long dirty)
{
@@ -631,6 +880,230 @@ static void bdi_update_dirty_threshold(s
bdi->old_dirty_threshold = thresh;
}
+/*
+ * ref_bw typically fluctuates within a small range, with large isolated points
+ * from time to time. The smoothed reference_bandwidth can effectively filter
+ * out 1 such standalone point. When 2+ isolated points arrive together --
+ * observed in ext4 on sudden redirty -- reference_bandwidth may surge high and
+ * take a long time to return to normal, which is mostly counteracted by
+ * xref_bw and the other update restrictions in bdi_update_throttle_bandwidth().
+ */
+static void bdi_update_reference_bandwidth(struct backing_dev_info *bdi,
+ unsigned long ref_bw)
+{
+ unsigned long old = bdi->old_ref_bandwidth;
+ unsigned long avg = bdi->reference_bandwidth;
+
+ if (avg > old && old >= ref_bw && avg - old >= old - ref_bw)
+ avg -= (avg - old) >> 3;
+
+ if (avg < old && old <= ref_bw && old - avg >= ref_bw - old)
+ avg += (old - avg) >> 3;
+
+ bdi->reference_bandwidth = avg;
+ bdi->old_ref_bandwidth = ref_bw;
+}
+
+/*
+ * Base throttle bandwidth.
+ */
+static void bdi_update_throttle_bandwidth(struct backing_dev_info *bdi,
+ unsigned long thresh,
+ unsigned long dirty,
+ unsigned long bdi_dirty,
+ unsigned long dirtied,
+ unsigned long elapsed)
+{
+ unsigned long limit = default_backing_dev_info.dirty_threshold;
+ unsigned long margin = limit / DIRTY_MARGIN;
+ unsigned long goal = thresh - thresh / DIRTY_SCOPE;
+ unsigned long bdi_thresh = bdi->dirty_threshold;
+ unsigned long bdi_goal = bdi_thresh - bdi_thresh / DIRTY_SCOPE;
+ unsigned long long bw = bdi->throttle_bandwidth;
+ unsigned long long dirty_bw;
+ unsigned long long pos_bw;
+ unsigned long long delta;
+ unsigned long long ref_bw = 0;
+ unsigned long long xref_bw;
+ unsigned long pos_ratio;
+ unsigned long spread;
+
+ if (dirty > limit - margin)
+ bdi->dirty_exceed_time = jiffies;
+
+ if (dirty < thresh - thresh / (DIRTY_SCOPE/2) + margin)
+ bdi->dirty_free_run = jiffies;
+
+ /*
+ * The dirty rate should match the writeback rate exactly, except when
+ * dirty pages are truncated before IO submission. The mismatches are
+ * hopefully small and hence ignored. So a continuous stream of dirty
+ * page truncates will result in errors in ref_bw; fortunately pos_bw
+ * can effectively stop the base bw from being driven away endlessly
+ * by the errors.
+ *
+ * It'd be nicer for the filesystems not to redirty too many pages
+ * either on IO or lock contention, or on sub-page writes. ext4 is
+ * known to redirty pages in big bursts, leading to
+ * - surges of dirty_bw, which can be mostly safeguarded by the
+ * min/max'ed xref_bw
+ * - the temporary drop of task weight and hence surge of task bw
+ * It could possibly be fixed in the FS.
+ */
+ dirty_bw = (dirtied - bdi->dirtied_stamp) * HZ / elapsed;
+
+ pos_ratio = dirty_throttle_bandwidth(bdi, thresh, dirty,
+ bdi_dirty, NULL);
+ /*
+ * pos_bw = task_bw, assuming 100% task dirty weight
+ *
+ * (pos_bw > bw) means the position of the number of dirty pages is
+ * lower than the global and/or bdi setpoints. It does not necessarily
+ * mean the base throttle bandwidth is larger than its balanced value.
+ * The latter is likely only when
+ * - (position) the dirty pages are at some distance from the setpoint,
+ * - (speed) and are either standing still or departing from the setpoint.
+ */
+ pos_bw = (bw >> (BASE_BW_SHIFT/2)) * pos_ratio >>
+ (BASE_BW_SHIFT/2);
+
+ /*
+ * A typical desktop has only 1 task writing to 1 disk, in which case
+ * the dirtier task should be throttled at the disk's write bandwidth.
+ * Note that we ignore minor dirty/writeback mismatches such as
+ * redirties and truncated dirty pages.
+ */
+ if (bdi_thresh > thresh - thresh / 16) {
+ unsigned long numerator, denominator;
+
+ task_dirties_fraction(current, &numerator, &denominator);
+ if (numerator > denominator - denominator / 16)
+ ref_bw = bdi->avg_bandwidth << BASE_BW_SHIFT;
+ }
+ /*
+ * Otherwise there may be
+ * 1) N dd tasks writing to the current disk, or
+ * 2) X dd tasks and Y "rsync --bwlimit" tasks.
+ * The below estimation is accurate enough for (1). For (2), where not
+ * all tasks' dirty rates can be changed proportionally by adjusting the
+ * base throttle bandwidth, it would require multiple adjust-reestimate
+ * cycles to approach the rate matching point, which is not a big
+ * concern as we always take small steps to approach the target. The
+ * uncontrollable tasks may only slow down the progress.
+ */
+ if (!ref_bw) {
+ ref_bw = pos_ratio * bdi->avg_bandwidth;
+ do_div(ref_bw, dirty_bw | 1);
+ ref_bw = (bw >> (BASE_BW_SHIFT/2)) * (unsigned long)ref_bw >>
+ (BASE_BW_SHIFT/2);
+ }
+
+ /*
+ * The average dirty pages typically fluctuates within this scope.
+ */
+ spread = min(bdi->write_bandwidth / 8, bdi_thresh / DIRTY_MARGIN);
+
+ /*
+ * Update the base throttle bandwidth rigidly: eg. only try lowering it
+ * when both the global/bdi dirty pages are away from their setpoints,
+ * and are either standing still or continue departing away.
+ *
+ * The "+ avg_dirty / 256" tricks mainly help btrfs, which behaves
+ * amazingly smoothly. Its average dirty page count simply tracks closer
+ * and closer to the number of dirty pages without any overshooting,
+ * so its dirty pages may keep moving towards the setpoint with
+ * @avg_dirty approaching @dirty ever more slowly, yet hardly ever
+ * crossing it to trigger a base bandwidth update. What the trick does
+ * is "when @avg_dirty is _close enough_ to @dirty, it indicates slowed
+ * down @dirty change rate, hence the other inequalities are now a good
+ * indication of something unbalanced in the current bdi".
+ *
+ * In the cases of hitting the upper/lower margins, it's obviously
+ * necessary to adjust the (possibly very unbalanced) base bandwidth,
+ * unless the opposite margin has also been hit recently, which
+ * indicates that the dirty control scope may be smaller than the bdi
+ * write bandwidth and hence the dirty pages are quickly fluctuating
+ * between the upper/lower margins.
+ */
+ if (bw < pos_bw) {
+ if (dirty < goal &&
+ dirty <= default_backing_dev_info.avg_dirty +
+ (default_backing_dev_info.avg_dirty >> 8) &&
+ bdi->avg_dirty + spread < bdi_goal &&
+ bdi_dirty <= bdi->avg_dirty + (bdi->avg_dirty >> 8) &&
+ bdi_dirty <= bdi->old_dirty)
+ goto adjust;
+ if (dirty < thresh - thresh / (DIRTY_SCOPE/2) + margin &&
+ !dirty_exceeded_recently(bdi, HZ))
+ goto adjust;
+ }
+
+ if (bw > pos_bw) {
+ if (dirty > goal &&
+ dirty >= default_backing_dev_info.avg_dirty -
+ (default_backing_dev_info.avg_dirty >> 8) &&
+ bdi->avg_dirty > bdi_goal + spread &&
+ bdi_dirty >= bdi->avg_dirty - (bdi->avg_dirty >> 8) &&
+ bdi_dirty >= bdi->old_dirty)
+ goto adjust;
+ if (dirty > limit - margin &&
+ !dirty_free_run_recently(bdi, HZ))
+ goto adjust;
+ }
+
+ goto out;
+
+adjust:
+ /*
+ * The min/max'ed xref_bw is an effective safeguard. The most dangerous
+ * case that could unnecessarily disturb the base bandwidth is: when the
+ * control scope is roughly equal to the write bandwidth, the dirty
+ * pages may rush into the upper/lower margins regularly. It typically
+ * hits the upper margin in a blink, making a sudden drop of pos_bw and
+ * ref_bw. Assume 5 points A, b, c, D, E, where b and c see the suddenly
+ * dropped number of pages, and A, D, E are at the normal level. At point b,
+ * the xref_bw will be the good A; at c, the xref_bw will be the
+ * dragged-down-by-b reference_bandwidth which is bad; at D and E, the
+ * still-low reference_bandwidth will no longer bring the base
+ * bandwidth down, as xref_bw will take the larger values from D and E.
+ */
+ if (pos_bw > bw) {
+ xref_bw = min(ref_bw, bdi->old_ref_bandwidth);
+ xref_bw = min(xref_bw, bdi->reference_bandwidth);
+ if (xref_bw > bw)
+ delta = xref_bw - bw;
+ else
+ delta = 0;
+ } else {
+ xref_bw = max(ref_bw, bdi->old_ref_bandwidth);
+ xref_bw = max(xref_bw, bdi->reference_bandwidth);
+ if (xref_bw < bw)
+ delta = bw - xref_bw;
+ else
+ delta = 0;
+ }
+
+ /*
+ * Don't pursue 100% rate matching. It's impossible since the balanced
+ * rate itself is constantly fluctuating. So decrease the tracking speed
+ * when it gets close to the target. Also limit the step size in
+ * various ways to avoid overshooting.
+ */
+ delta >>= bw / (2 * delta + 1);
+ delta = min(delta, (u64)abs64(pos_bw - bw));
+ delta >>= 1;
+ delta = min(delta, bw / 8);
+
+ if (pos_bw > bw)
+ bw += delta;
+ else
+ bw -= delta;
+
+ bdi->throttle_bandwidth = bw;
+out:
+ bdi_update_reference_bandwidth(bdi, ref_bw);
+}
+
void bdi_update_bandwidth(struct backing_dev_info *bdi,
unsigned long thresh,
unsigned long dirty,
@@ -640,12 +1113,14 @@ void bdi_update_bandwidth(struct backing
static DEFINE_SPINLOCK(dirty_lock);
unsigned long now = jiffies;
unsigned long elapsed;
+ unsigned long dirtied;
unsigned long written;
if (!spin_trylock(&dirty_lock))
return;
elapsed = now - bdi->bw_time_stamp;
+ dirtied = percpu_counter_read(&bdi->bdi_stat[BDI_DIRTIED]);
written = percpu_counter_read(&bdi->bdi_stat[BDI_WRITTEN]);
/* skip quiet periods when disk bandwidth is under-utilized */
@@ -665,6 +1140,8 @@ void bdi_update_bandwidth(struct backing
if (thresh) {
update_dirty_limit(thresh, dirty);
bdi_update_dirty_threshold(bdi, thresh, dirty);
+ bdi_update_throttle_bandwidth(bdi, thresh, dirty,
+ bdi_dirty, dirtied, elapsed);
}
__bdi_update_write_bandwidth(bdi, elapsed, written);
if (thresh) {
@@ -673,6 +1150,7 @@ void bdi_update_bandwidth(struct backing
}
snapshot:
+ bdi->dirtied_stamp = dirtied;
bdi->written_stamp = written;
bdi->bw_time_stamp = now;
unlock:
--- linux-next.orig/mm/backing-dev.c 2011-03-03 14:44:22.000000000 +0800
+++ linux-next/mm/backing-dev.c 2011-03-03 14:44:27.000000000 +0800
@@ -674,6 +674,7 @@ int bdi_init(struct backing_dev_info *bd
bdi->write_bandwidth = INIT_BW;
bdi->avg_bandwidth = INIT_BW;
+ bdi->throttle_bandwidth = (u64)INIT_BW << BASE_BW_SHIFT;
bdi->avg_dirty = 0;
bdi->old_dirty = 0;
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH 19/27] writeback: dirty throttle bandwidth control
2011-03-03 6:45 ` [PATCH 19/27] writeback: dirty throttle bandwidth control Wu Fengguang
@ 2011-03-07 21:34 ` Wu Fengguang
2011-03-29 21:08 ` Wu Fengguang
1 sibling, 0 replies; 44+ messages in thread
From: Wu Fengguang @ 2011-03-07 21:34 UTC (permalink / raw)
To: Andrew Morton
Cc: Jan Kara, Christoph Hellwig, Trond Myklebust, Dave Chinner,
Theodore Ts'o, Chris Mason, Peter Zijlstra, Mel Gorman,
Rik van Riel, KOSAKI Motohiro, Greg Thelen, Minchan Kim,
Vivek Goyal, Andrea Righi, Balbir Singh, linux-mm,
linux-fsdevel@vger.kernel.org, LKML
On Thu, Mar 03, 2011 at 02:45:24PM +0800, Wu, Fengguang wrote:
> balance_dirty_pages() has been using a very simple and robust threshold
> based throttle scheme. It automatically limits the dirty rate down,
> however in a very bumpy way that constantly block the dirtier tasks for
> hundreds of milliseconds on a local ext4.
To get an idea of what exactly is going on in the current kernel, I
backported the balance_dirty_pages and global_page_state trace events
to 2.6.38-rc7 and ran the same test cases. The resulting graphs are
pretty striking.
In the worst NFS cases, the pause time frequently goes up to 20-30 seconds,
and the dirty progress is rather bumpy.
1-dd case
http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v6/vanilla/NFS/nfs-1dd-1M-8p-2945M-20%25-2.6.38-rc7+-2011-03-07-23-14/balance_dirty_pages-pause.png
http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v6/vanilla/NFS/nfs-1dd-1M-8p-2945M-20%25-2.6.38-rc7+-2011-03-07-23-14/global_dirtied_written.png
8-dd case
http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v6/vanilla/NFS/nfs-8dd-1M-8p-2945M-20%25-2.6.38-rc7+-2011-03-07-23-26/balance_dirty_pages-pause.png
http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v6/vanilla/NFS/nfs-8dd-1M-8p-2945M-20%25-2.6.38-rc7+-2011-03-07-23-26/balance_dirty_pages-task-bw.png
Writes to the USB key start with a long 30-second pause, followed by
many ~2-second pauses on ext4. XFS is better; btrfs performs best,
but can still show 7s and 2s long delays.
http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v6/vanilla/1UKEY+1HDD-3G/ext4-1dd-1M-8p-2945M-20%25-2.6.38-rc7+-2011-03-07-23-34/balance_dirty_pages-pause.png
http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v6/vanilla/1UKEY+1HDD-3G/xfs-1dd-1M-8p-2945M-20%25-2.6.38-rc7+-2011-03-07-23-56/balance_dirty_pages-pause.png
http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v6/vanilla/1UKEY+1HDD-3G/btrfs-1dd-1M-8p-2945M-20%25-2.6.38-rc7+-2011-03-08-00-14/balance_dirty_pages-pause.png
For normal writes to the HDD, ext4 has some >300ms pause times in the
1-dd case, >600ms in the 2-dd case, and >2s in the 8-dd case. The pause
time deteriorates roughly in proportion to the number of concurrent dd tasks.
http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v6/vanilla/4G/ext4-1dd-1M-8p-3911M-20%25-2.6.38-rc7+-2011-03-07-22-15/balance_dirty_pages-pause.png
http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v6/vanilla/4G/ext4-2dd-1M-8p-3911M-20%25-2.6.38-rc7+-2011-03-07-22-22/balance_dirty_pages-pause.png
http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v6/vanilla/4G/ext4-8dd-1M-8p-3911M-20%25-2.6.38-rc7+-2011-03-07-22-30/balance_dirty_pages-pause.png
XFS performs similarly
http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v6/vanilla/4G/xfs-8dd-1M-8p-3911M-20%25-2.6.38-rc7+-2011-03-07-22-08/balance_dirty_pages-pause.png
http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v6/vanilla/4G/xfs-8dd-1M-8p-3911M-20%25-2.6.38-rc7+-2011-03-07-22-08/balance_dirty_pages-task-bw.png
btrfs is better, typically having 1-2s max pause times in the 8-dd case
http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v6/vanilla/4G/btrfs-8dd-1M-8p-3911M-20%25-2.6.38-rc7+-2011-03-07-21-48/balance_dirty_pages-pause.png
The long pause times will obviously ruin the user experience. They may
also hurt performance. For example, if the dirtier is a simple "cp" or
"scp", the long pauses will break the readahead pipeline or the network
pipeline, leading to moments of underutilized disk/network bandwidth.
Compared to the above graphs, this patchset is able to keep latency
under control (less than the configured 200ms max pause time) in all
known cases, whether it be 1-dd or 1000-dd, on local file systems,
over NFS or on a USB key.
8-dd case
http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v6/vanilla/4G/ext4-8dd-1M-8p-3911M-20%25-2.6.38-rc7+-2011-03-07-22-30/balance_dirty_pages-task-bw.png
http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v6/4G/xfs-8dd-1M-8p-3927M-20%25-2.6.38-rc6-dt6+-2011-02-27-23-18/balance_dirty_pages-task-bw.png
128-dd case
http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v6/4G/xfs-128dd-1M-8p-3927M-20%25-2.6.38-rc6-dt6+-2011-02-27-23-25/balance_dirty_pages-pause.png
http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v6/4G/xfs-128dd-1M-8p-3927M-20%25-2.6.38-rc6-dt6+-2011-02-27-23-25/balance_dirty_pages-task-bw.png
1000-dd case
http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v6/10SSD-RAID0-64G/xfs-1000dd-1M-64p-64288M-20%25-2.6.38-rc6-dt6+-2011-02-28-10-40/balance_dirty_pages-pause.png
UKEY
http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v6/1UKEY+1HDD-3G/ext4-1dd-1M-8p-2975M-20%25-2.6.38-rc6-dt6+-2011-02-28-20-21/balance_dirty_pages-pause.png
http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v6/1UKEY+1HDD-3G/xfs-4dd-1M-8p-2945M-20%25-2.6.38-rc5-dt6+-2011-02-22-09-27/balance_dirty_pages-pause.png
NFS
http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v6/NFS/nfs-1dd-1M-8p-2945M-20%25-2.6.38-rc6-dt6+-2011-02-22-21-09/balance_dirty_pages-pause.png
http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v6/NFS/nfs-8dd-1M-8p-2945M-20%25-2.6.38-rc6-dt6+-2011-02-22-21-22/balance_dirty_pages-pause.png
Thanks,
Fengguang
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH 19/27] writeback: dirty throttle bandwidth control
2011-03-03 6:45 ` [PATCH 19/27] writeback: dirty throttle bandwidth control Wu Fengguang
2011-03-07 21:34 ` Wu Fengguang
@ 2011-03-29 21:08 ` Wu Fengguang
1 sibling, 0 replies; 44+ messages in thread
From: Wu Fengguang @ 2011-03-29 21:08 UTC (permalink / raw)
To: Andrew Morton
Cc: Jan Kara, Christoph Hellwig, Trond Myklebust, Dave Chinner,
Theodore Ts'o, Chris Mason, Peter Zijlstra, Mel Gorman,
Rik van Riel, KOSAKI Motohiro, Greg Thelen, Minchan Kim,
Vivek Goyal, Andrea Righi, Balbir Singh, linux-mm,
linux-fsdevel@vger.kernel.org, LKML
Hi,
This is the hard core of the patchset. Sorry, the original changelog is
way too detail-oriented. I'll try to provide a more general overview
to help explain the main ideas.
There are two major code paths in this IO-less dirty throttling scheme.
(1) on write() syscall
balance_dirty_pages(pages_dirtied)
{
task_bandwidth = bdi->base_bandwidth * pos_ratio /
sqrt(task_dirty_weight);
pause = pages_dirtied / task_bandwidth;
sleep(pause);
}
where pos_ratio is calculated in
dirty_throttle_bandwidth()
{
pos_ratio = 1.0;
if (nr_dirty < goal) scale up pos_ratio
if (nr_dirty > goal) scale down pos_ratio
if (bdi_dirty < bdi_goal) scale up pos_ratio
if (bdi_dirty > bdi_goal) scale down pos_ratio
if (nr_dirty close to dirty limit) scale down pos_ratio
if (bdi_dirty close to 0) scale up pos_ratio
}
(2) on every 100ms
bdi_update_bandwidth()
{
update bdi->base_bandwidth
update bdi->write_bandwidth
update smoothed dirty pages
update smoothed dirty threshold/limit
}
where bdi->base_bandwidth is updated in bdi_update_throttle_bandwidth()
to make sure that the bdi's
- dirty bandwidth (the rate dirty pages are created)
- write bandwidth (the rate dirty pages are cleaned)
will match if pos_ratio=1. The skeleton logic is:
bdi_update_throttle_bandwidth()
{
if (common case: 1 task writing to 1 disk)
ref_bw = bdi->write_bandwidth;
else
ref_bw = bdi->base_bandwidth * pos_ratio *
(bdi->write_bandwidth / dirty_bw);
if (dirty pages are departing from the dirty goals)
step bdi->base_bandwidth closer to ref_bw;
}
Basically, of the two core functions,
- dirty_throttle_bandwidth() is made of easy-to-understand policies,
except that the heavy integer arithmetic is not much fun.
- bdi_update_throttle_bandwidth() is a mechanical estimation/tracking
problem that is made tricky by lots of fluctuations. It does succeed
in getting a very smooth/stable bdi->base_bandwidth on top of the much
more fluctuating pos_ratio, bdi->write_bandwidth and dirty_bw.
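As a rough worked example of the rate matching above (purely illustrative
numbers, assuming pos_ratio = 1, 4KB pages and N equally weighted dirtiers
so that each task_dirty_weight = 1/N):
  write_bw = 120 MB/s, N = 4 dd tasks
  at balance, dirty_bw = N * task_bw = N * sqrt(N) * base_bw = write_bw
  => base_bw = write_bw / (N * sqrt(N)) = 120 / 8 = 15 MB/s
     task_bw = base_bw / sqrt(1/N)      = 15 * 2  = 30 MB/s per task
     dirty_bw = 4 * 30 MB/s             = 120 MB/s = write_bw
  a task that has just dirtied 128 pages (512KB) would then pause for
  pause = pages_dirtied / task_bw = 512KB / (30 MB/s) ~= 17 ms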
Thanks,
Fengguang
^ permalink raw reply [flat|nested] 44+ messages in thread
* [PATCH 20/27] writeback: IO-less balance_dirty_pages()
2011-03-03 6:45 [PATCH 00/27] IO-less dirty throttling v6 Wu Fengguang
` (18 preceding siblings ...)
2011-03-03 6:45 ` [PATCH 19/27] writeback: dirty throttle bandwidth control Wu Fengguang
@ 2011-03-03 6:45 ` Wu Fengguang
2011-03-03 6:45 ` [PATCH 21/27] writeback: show bdi write bandwidth in debugfs Wu Fengguang
` (7 subsequent siblings)
27 siblings, 0 replies; 44+ messages in thread
From: Wu Fengguang @ 2011-03-03 6:45 UTC (permalink / raw)
To: Andrew Morton
Cc: Jan Kara, Wu Fengguang, Christoph Hellwig, Trond Myklebust,
Dave Chinner, Theodore Ts'o, Chris Mason, Peter Zijlstra,
Mel Gorman, Rik van Riel, KOSAKI Motohiro, Greg Thelen,
Minchan Kim, Vivek Goyal, Andrea Righi, Balbir Singh, linux-mm,
linux-fsdevel, LKML
[-- Attachment #1: writeback-ioless-balance_dirty_pages.patch --]
[-- Type: text/plain, Size: 27231 bytes --]
As proposed by Chris, Dave and Jan, don't start foreground writeback IO
inside balance_dirty_pages(). Instead, simply let it idle sleep for some
time to throttle the dirtying task. In the mean while, kick off the
per-bdi flusher thread to do background writeback IO.
RATIONALE
=========
The current balance_dirty_pages() is rather IO inefficient.
- concurrent writeback of multiple inodes (Dave Chinner)
If every thread doing writes and being throttled starts foreground
writeback, we get N IO submitters from at least N different inodes at
the same time, and end up with N different sets of IO being issued with
potentially zero locality to each other, resulting in much lower
elevator sort/merge efficiency; hence we seek the disk all over the
place to service the different sets of IO.
OTOH, if there is only one submission thread, it doesn't jump between
inodes in the same way when congestion clears - it keeps writing to
the same inode, resulting in large related chunks of sequential IOs
being issued to the disk. This is more efficient than the above
foreground writeback because the elevator works better and the disk
seeks less.
- IO size too small for fast arrays and too large for slow USB sticks
The write_chunk used by the current balance_dirty_pages() cannot simply
be set to some large value (e.g. 128MB) for better IO efficiency,
because that could lead to user-perceivable stalls of more than 1 second.
Even the current 4MB write size may be too large for slow USB sticks.
The fact that balance_dirty_pages() starts IO itself couples the IO size
to the wait time, which makes it hard to pick a suitable IO size while
keeping the wait time under control.
For the above two reasons, it's much better to shift IO to the flusher
threads and let balance_dirty_pages() just wait for enough time or progress.
Jan Kara, Dave Chinner and I explored a scheme that lets
balance_dirty_pages() wait for enough writeback IO completions to
safeguard the dirty limit. However, it was found to have two problems:
- in large NUMA systems, the per-cpu counters may have big accounting
errors, leading to big throttle wait times and jitter.
- NFS may kill a large number of unstable pages with one single COMMIT.
Because the NFS server serves COMMIT with expensive fsync() IOs, it is
desirable to delay and reduce the number of COMMITs. So such bursty IO
completions, and the resulting large (and tiny) stall times of IO
completion based throttling, are unlikely to be optimized away.
So here is a pause time oriented approach, which tries to control the
pause time of each balance_dirty_pages() invocation, by controlling
the number of pages dirtied before calling balance_dirty_pages(), for
smooth and efficient dirty throttling:
- avoid useless (eg. zero pause time) balance_dirty_pages() calls
- avoid too small pause time (less than 10ms, which burns CPU power)
- avoid too large pause time (more than 200ms, which hurts responsiveness)
- avoid big fluctuations of pause times
For example, when doing a simple cp on ext4 with mem=4G HZ=250.
before patch, the pause time fluctuates from 0 to 324ms
(and the stall time may grow very large for slow devices)
[ 1237.139962] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=56
[ 1237.207489] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=0
[ 1237.225190] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=0
[ 1237.234488] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=0
[ 1237.244692] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=0
[ 1237.375231] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=31
[ 1237.443035] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=15
[ 1237.574630] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=31
[ 1237.642394] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=15
[ 1237.666320] balance_dirty_pages: write_chunk=1536 pages_written=57 pause=5
[ 1237.973365] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=81
[ 1238.212626] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=56
[ 1238.280431] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=15
[ 1238.412029] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=31
[ 1238.412791] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=0
after patch, the pause time remains stable around 32ms
cp-2687 [002] 1452.237012: balance_dirty_pages: weight=56% dirtied=128 pause=8
cp-2687 [002] 1452.246157: balance_dirty_pages: weight=56% dirtied=128 pause=8
cp-2687 [006] 1452.253043: balance_dirty_pages: weight=56% dirtied=128 pause=8
cp-2687 [006] 1452.261899: balance_dirty_pages: weight=57% dirtied=128 pause=8
cp-2687 [006] 1452.268939: balance_dirty_pages: weight=57% dirtied=128 pause=8
cp-2687 [002] 1452.276932: balance_dirty_pages: weight=57% dirtied=128 pause=8
cp-2687 [002] 1452.285889: balance_dirty_pages: weight=57% dirtied=128 pause=8
PSEUDO THROTTLE CODE
====================
balance_dirty_pages():
/* soft throttling */
if (within dirty control scope)
sleep (dirtied_pages / throttle_bandwidth)
/* max throttling */
if (dirty_limit exceeded)
sleep 200ms
/* block waiting */
while (dirty_limit+dirty_limit/32 exceeded)
sleep 200ms
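To connect the pseudo code with the per-task fields this patch adds
(nr_dirtied, nr_dirtied_pause), here is a much simplified C sketch of the
soft throttling entry path. It is only an illustration of the idea, not
the patch code: task_throttle_bw() is a made-up placeholder for the
base_bw * pos_ratio / sqrt(task_dirty_weight) computation, and the
max/block throttling branches, think time compensation and locking are
all left out.
	/* called from the page dirtying fast path */
	void balance_dirty_pages_ratelimited_nr(struct address_space *mapping,
						unsigned long nr_pages_dirtied)
	{
		current->nr_dirtied += nr_pages_dirtied;
		if (current->nr_dirtied < current->nr_dirtied_pause)
			return;			/* cheap: no throttling yet */
		balance_dirty_pages(mapping, current->nr_dirtied);
		current->nr_dirtied = 0;
	}
	static void balance_dirty_pages(struct address_space *mapping,
					unsigned long pages_dirtied)
	{
		unsigned long task_bw = task_throttle_bw();	/* pages/second */
		unsigned long pause;
		/* soft throttling: sleep instead of issuing foreground writeback */
		pause = HZ * pages_dirtied / (task_bw | 1);
		pause = min(pause, (unsigned long)MAX_PAUSE);
		__set_current_state(TASK_UNINTERRUPTIBLE);
		io_schedule_timeout(pause);
	}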
BEHAVIOR CHANGE
===============
Users will notice that applications get throttled once they cross the
global (background + dirty)/2 = 15% threshold, and are then balanced
around 17.5%. Before this patch, the behavior was to simply throttle
at 20% of dirtyable memory.
Since tasks are now soft throttled earlier than before, end users may
perceive a performance "slow down" if their application happens to
dirty more than 15% of dirtyable memory.
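For reference, the 15% and 17.5% figures follow from the control scope
used by this series (here assuming the default 20% dirty ratio and
DIRTY_SCOPE = 8, the value implied by these numbers; the constant itself
is defined in an earlier patch):
	thresh   = 20% of dirtyable memory
	freerun  = thresh - 2 * thresh / DIRTY_SCOPE = 20% - 5%   = 15%
	           (end of the unthrottled free-run region)
	setpoint = thresh - thresh / DIRTY_SCOPE     = 20% - 2.5% = 17.5%
	           (dirty pages are balanced around this point)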
THINK TIME
==========
The task's think time is taken into account when computing the final pause
time, which keeps the effective throttle bandwidth accurate. In the rare
case that the task slept longer than the period time, the extra sleep time
will be compensated for in the next period if it's not too big (<500ms).
Accumulated errors are carefully avoided as long as the task doesn't sleep
for too long. The two cases are illustrated below, followed by a small
code sketch of the bookkeeping.
case 1: period > think
pause = period - think
paused_when += pause
period time
|======================================>|
think time
|===============>|
------|----------------|----------------------|-----------
paused_when jiffies
case 2: period <= think
don't pause and reduce future pause time by:
paused_when += period
period time
|=========================>|
think time
|======================================>|
------|--------------------------+------------|-----------
paused_when jiffies
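A minimal code sketch of the bookkeeping in the two cases above
(illustrative only: unit conversions, clamping and the <500ms compensation
cap are omitted; task_bw stands for the task's throttle bandwidth in pages
per second, and paused_when is the task_struct field added by this patch):
	period = HZ * pages_dirtied / task_bw;	  /* jiffies this batch is worth */
	think  = jiffies - current->paused_when;  /* time spent dirtying/thinking */
	if (period > think) {
		/* case 1: sleep off the remainder of the period */
		pause = period - think;
		__set_current_state(TASK_UNINTERRUPTIBLE);
		io_schedule_timeout(pause);
		current->paused_when += pause;
	} else {
		/* case 2: no pause; credit the period against future pauses */
		current->paused_when += period;
	}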
BENCHMARKS
==========
The test box has a 4-core 3.2GHz CPU, 4GB mem and a SATA disk.
For each filesystem, the following command is run 3 times.
time (dd if=/dev/zero of=/tmp/10G bs=1M count=10240; sync); rm /tmp/10G
2.6.36-rc2-mm1 2.6.36-rc2-mm1+balance_dirty_pages
average real time
ext2 236.377s 232.144s -1.8%
ext3 226.245s 225.751s -0.2%
ext4 178.742s 179.343s +0.3%
xfs 183.562s 179.808s -2.0%
btrfs 179.044s 179.461s +0.2%
NFS 645.627s 628.937s -2.6%
average system time
ext2 22.142s 19.656s -11.2%
ext3 34.175s 32.462s -5.0%
ext4 23.440s 21.162s -9.7%
xfs 19.089s 16.069s -15.8%
btrfs 12.212s 11.670s -4.4%
NFS 16.807s 17.410s +3.6%
total user time
sum 0.136s 0.084s -38.2%
In a more recent run of the tests, it's in fact slightly slower.
ext2 49.500 MB/s 49.200 MB/s -0.6%
ext3 50.133 MB/s 50.000 MB/s -0.3%
ext4 64.000 MB/s 63.200 MB/s -1.2%
xfs 63.500 MB/s 63.167 MB/s -0.5%
btrfs 63.133 MB/s 63.033 MB/s -0.2%
NFS 16.833 MB/s 16.867 MB/s +0.2%
In general there are no big IO performance changes for desktop users,
except for some noticeable reduction of CPU overhead. It mainly
benefits file servers with heavy concurrent writers on fast storage
arrays, as can be demonstrated by 10/100 concurrent dd's on xfs:
- 1 dirtier case: the same
- 10 dirtiers case: CPU system time is reduced to 50%
- 100 dirtiers case: CPU system time is reduced to 10%,
IO size and throughput increases by 10%
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
include/linux/backing-dev.h | 1
include/linux/sched.h | 8
mm/backing-dev.c | 2
mm/memory_hotplug.c | 3
mm/page-writeback.c | 354 +++++++++++++++-------------------
5 files changed, 169 insertions(+), 199 deletions(-)
--- linux-next.orig/include/linux/sched.h 2011-03-03 14:43:49.000000000 +0800
+++ linux-next/include/linux/sched.h 2011-03-03 14:44:23.000000000 +0800
@@ -1487,6 +1487,14 @@ struct task_struct {
int make_it_fail;
#endif
struct prop_local_single dirties;
+ /*
+ * when (nr_dirtied >= nr_dirtied_pause), it's time to call
+ * balance_dirty_pages() for some dirty throttling pause
+ */
+ int nr_dirtied;
+ int nr_dirtied_pause;
+ unsigned long paused_when; /* start of a write-and-pause period */
+
#ifdef CONFIG_LATENCYTOP
int latency_record_count;
struct latency_record latency_record[LT_SAVECOUNT];
--- linux-next.orig/mm/page-writeback.c 2011-03-03 14:44:23.000000000 +0800
+++ linux-next/mm/page-writeback.c 2011-03-03 14:44:23.000000000 +0800
@@ -37,24 +37,9 @@
#include <trace/events/writeback.h>
/*
- * After a CPU has dirtied this many pages, balance_dirty_pages_ratelimited
- * will look to see if it needs to force writeback or throttling.
+ * Don't sleep more than 200ms at a time in balance_dirty_pages().
*/
-static long ratelimit_pages = 32;
-
-/*
- * When balance_dirty_pages decides that the caller needs to perform some
- * non-background writeback, this is how many pages it will attempt to write.
- * It should be somewhat larger than dirtied pages to ensure that reasonably
- * large amounts of I/O are submitted.
- */
-static inline long sync_writeback_pages(unsigned long dirtied)
-{
- if (dirtied < ratelimit_pages)
- dirtied = ratelimit_pages;
-
- return dirtied + dirtied / 2;
-}
+#define MAX_PAUSE max(HZ/5, 1)
/* The following parameters are exported via /proc/sys/vm */
@@ -257,36 +242,6 @@ static inline void task_dirties_fraction
}
/*
- * task_dirty_limit - scale down dirty throttling threshold for one task
- *
- * task specific dirty limit:
- *
- * dirty -= (dirty/8) * p_{t}
- *
- * To protect light/slow dirtying tasks from heavier/fast ones, we start
- * throttling individual tasks before reaching the bdi dirty limit.
- * Relatively low thresholds will be allocated to heavy dirtiers. So when
- * dirty pages grow large, heavy dirtiers will be throttled first, which will
- * effectively curb the growth of dirty pages. Light dirtiers with high enough
- * dirty threshold may never get throttled.
- */
-static unsigned long task_dirty_limit(struct task_struct *tsk,
- unsigned long bdi_dirty)
-{
- long numerator, denominator;
- unsigned long dirty = bdi_dirty;
- u64 inv = dirty >> 3;
-
- task_dirties_fraction(tsk, &numerator, &denominator);
- inv *= numerator;
- do_div(inv, denominator);
-
- dirty -= inv;
-
- return max(dirty, bdi_dirty/2);
-}
-
-/*
*
*/
static unsigned int bdi_min_ratio;
@@ -399,8 +354,6 @@ unsigned long determine_dirtyable_memory
* Calculate the dirty thresholds based on sysctl parameters
* - vm.dirty_background_ratio or vm.dirty_background_bytes
* - vm.dirty_ratio or vm.dirty_bytes
- * The dirty limits will be lifted by 1/4 for PF_LESS_THROTTLE (ie. nfsd) and
- * real-time tasks.
*/
void global_dirty_limits(unsigned long *pbackground, unsigned long *pdirty)
{
@@ -431,10 +384,6 @@ void global_dirty_limits(unsigned long *
background = dirty - dirty / (DIRTY_SCOPE / 2);
tsk = current;
- if (tsk->flags & PF_LESS_THROTTLE || rt_task(tsk)) {
- background += background / 4;
- dirty += dirty / 4;
- }
*pbackground = background;
*pdirty = dirty;
}
@@ -497,6 +446,23 @@ static unsigned long dirty_rampup_size(u
}
/*
+ * After a task dirtied this many pages, balance_dirty_pages_ratelimited_nr()
+ * will look to see if it needs to start dirty throttling.
+ *
+ * If ratelimit_pages is too low then big NUMA machines will call the expensive
+ * global_page_state() too often. So scale it near-sqrt to the safety margin
+ * (the number of pages we may dirty without exceeding the dirty limits).
+ */
+static unsigned long ratelimit_pages(unsigned long dirty,
+ unsigned long thresh)
+{
+ if (thresh > dirty)
+ return 1UL << (ilog2(thresh - dirty) >> 1);
+
+ return 1;
+}
+
+/*
* last time exceeded (limit - limit/DIRTY_MARGIN)
*/
static bool dirty_exceeded_recently(struct backing_dev_info *bdi,
@@ -1158,6 +1124,43 @@ unlock:
}
/*
+ * Limit pause time for small memory systems. If sleeping for too long time,
+ * the small pool of dirty/writeback pages may go empty and disk go idle.
+ */
+static unsigned long max_pause(struct backing_dev_info *bdi,
+ unsigned long bdi_dirty)
+{
+ unsigned long t; /* jiffies */
+
+ /* 1ms for every 1MB; may further consider bdi bandwidth */
+ t = bdi_dirty >> (30 - PAGE_CACHE_SHIFT - ilog2(HZ));
+ t += 2;
+
+ return min_t(unsigned long, t, MAX_PAUSE);
+}
+
+/*
+ * Scale up pause time for concurrent dirtiers in order to reduce CPU overheads.
+ * But ensure reasonably large [min_pause, max_pause] range size, so that
+ * nr_dirtied_pause (and hence future pause time) can stay reasonably stable.
+ */
+static unsigned long min_pause(struct backing_dev_info *bdi,
+ unsigned long max)
+{
+ unsigned long hi = ilog2(bdi->write_bandwidth);
+ unsigned long lo = ilog2(bdi->throttle_bandwidth) - BASE_BW_SHIFT;
+ unsigned long t = 1 + max / 8; /* jiffies */
+
+ if (lo >= hi)
+ return t;
+
+ /* (N * 10ms) on 2^N concurrent tasks */
+ t += (hi - lo) * (10 * HZ) / 1024;
+
+ return min(t, max / 2);
+}
+
+/*
* balance_dirty_pages() must be called by processes which are generating dirty
* data. It looks at the number of dirty pages in the machine and will force
* the caller to perform writeback if the system is over `vm_dirty_ratio'.
@@ -1165,49 +1168,34 @@ unlock:
* perform some writeout.
*/
static void balance_dirty_pages(struct address_space *mapping,
- unsigned long write_chunk)
+ unsigned long pages_dirtied)
{
- long nr_reclaimable, bdi_nr_reclaimable;
- long nr_writeback, bdi_nr_writeback;
+ unsigned long nr_reclaimable;
+ unsigned long nr_dirty;
+ unsigned long bdi_dirty; /* = file_dirty + writeback + unstable_nfs */
unsigned long background_thresh;
unsigned long dirty_thresh;
- unsigned long bdi_thresh;
- unsigned long pages_written = 0;
- unsigned long pause = 1;
- bool dirty_exceeded = false;
+ unsigned long bw;
+ unsigned long period;
+ unsigned long pause = 0;
+ unsigned long pause_max;
struct backing_dev_info *bdi = mapping->backing_dev_info;
unsigned long start_time = jiffies;
- if (!bdi_cap_account_dirty(bdi))
- return;
-
for (;;) {
- struct writeback_control wbc = {
- .sync_mode = WB_SYNC_NONE,
- .older_than_this = NULL,
- .nr_to_write = write_chunk,
- .range_cyclic = 1,
- };
-
+ /*
+ * Unstable writes are a feature of certain networked
+ * filesystems (i.e. NFS) in which data may have been
+ * written to the server's write cache, but has not yet
+ * been flushed to permanent storage.
+ */
nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
global_page_state(NR_UNSTABLE_NFS);
- nr_writeback = global_page_state(NR_WRITEBACK);
+ nr_dirty = nr_reclaimable + global_page_state(NR_WRITEBACK);
global_dirty_limits(&background_thresh, &dirty_thresh);
/*
- * Throttle it only when the background writeback cannot
- * catch-up. This avoids (excessively) small writeouts
- * when the bdi limits are ramping up.
- */
- if (nr_reclaimable + nr_writeback <=
- (background_thresh + dirty_thresh) / 2)
- break;
-
- bdi_thresh = bdi_dirty_limit(bdi, dirty_thresh);
- bdi_thresh = task_dirty_limit(current, bdi_thresh);
-
- /*
* In order to avoid the stacked BDI deadlock we need
* to ensure we accurately count the 'dirty' pages when
* the threshold is low.
@@ -1217,67 +1205,89 @@ static void balance_dirty_pages(struct a
* actually dirty; with m+n sitting in the percpu
* deltas.
*/
- if (bdi_thresh < 2*bdi_stat_error(bdi)) {
- bdi_nr_reclaimable = bdi_stat_sum(bdi, BDI_RECLAIMABLE);
- bdi_nr_writeback = bdi_stat_sum(bdi, BDI_WRITEBACK);
+ if (bdi->dirty_threshold < 2*bdi_stat_error(bdi)) {
+ bdi_dirty = bdi_stat_sum(bdi, BDI_RECLAIMABLE) +
+ bdi_stat_sum(bdi, BDI_WRITEBACK);
} else {
- bdi_nr_reclaimable = bdi_stat(bdi, BDI_RECLAIMABLE);
- bdi_nr_writeback = bdi_stat(bdi, BDI_WRITEBACK);
+ bdi_dirty = bdi_stat(bdi, BDI_RECLAIMABLE) +
+ bdi_stat(bdi, BDI_WRITEBACK);
}
- bdi_update_bandwidth(bdi, dirty_thresh,
- nr_reclaimable + nr_writeback,
- bdi_nr_reclaimable + bdi_nr_writeback,
- start_time);
-
/*
- * The bdi thresh is somehow "soft" limit derived from the
- * global "hard" limit. The former helps to prevent heavy IO
- * bdi or process from holding back light ones; The latter is
- * the last resort safeguard.
+ * Throttle it only when the background writeback cannot
+ * catch-up. This avoids (excessively) small writeouts
+ * when the bdi limits are ramping up.
*/
- dirty_exceeded =
- (bdi_nr_reclaimable + bdi_nr_writeback > bdi_thresh)
- || (nr_reclaimable + nr_writeback > dirty_thresh);
-
- if (!dirty_exceeded)
+ if (nr_dirty <= (background_thresh + dirty_thresh) / 2) {
+ current->paused_when = jiffies;
+ current->nr_dirtied = 0;
break;
+ }
- if (!bdi->dirty_exceeded)
- bdi->dirty_exceeded = 1;
+ bdi_update_bandwidth(bdi, dirty_thresh, nr_dirty,
+ bdi_dirty, start_time);
- /* Note: nr_reclaimable denotes nr_dirty + nr_unstable.
- * Unstable writes are a feature of certain networked
- * filesystems (i.e. NFS) in which data may have been
- * written to the server's write cache, but has not yet
- * been flushed to permanent storage.
- * Only move pages to writeback if this bdi is over its
- * threshold otherwise wait until the disk writes catch
- * up.
+ if (unlikely(!writeback_in_progress(bdi)))
+ bdi_start_background_writeback(bdi);
+
+ pause_max = max_pause(bdi, bdi_dirty);
+
+ bw = dirty_throttle_bandwidth(bdi, dirty_thresh, nr_dirty,
+ bdi_dirty, current);
+ if (unlikely(bw == 0)) {
+ period = pause_max;
+ pause = pause_max;
+ goto pause;
+ }
+ period = (HZ * pages_dirtied + bw / 2) / bw;
+ pause = current->paused_when + period - jiffies;
+ /*
+ * Take it as long think time if pause falls into (-10s, 0).
+ * If it's less than 500ms (ext2 blocks the dirtier task for
+ * up to 400ms from time to time on 1-HDD; so does xfs, however
+ * at much less frequency), try to compensate it in future by
+ * updating the virtual time; otherwise just reset the time, as
+ * it may be a light dirtier.
*/
- trace_wbc_balance_dirty_start(&wbc, bdi);
- if (bdi_nr_reclaimable > bdi_thresh) {
- writeback_inodes_wb(&bdi->wb, &wbc);
- pages_written += write_chunk - wbc.nr_to_write;
- trace_wbc_balance_dirty_written(&wbc, bdi);
- if (pages_written >= write_chunk)
- break; /* We've done our duty */
+ if (unlikely(-pause < HZ*10)) {
+ if (-pause > HZ/2) {
+ current->paused_when = jiffies;
+ current->nr_dirtied = 0;
+ pause = 0;
+ } else if (period) {
+ current->paused_when += period;
+ current->nr_dirtied = 0;
+ pause = 1;
+ } else
+ current->nr_dirtied_pause <<= 1;
+ break;
}
- trace_wbc_balance_dirty_wait(&wbc, bdi);
+ if (pause > pause_max)
+ pause = pause_max;
+
+pause:
+ current->paused_when = jiffies;
__set_current_state(TASK_UNINTERRUPTIBLE);
io_schedule_timeout(pause);
+ current->paused_when += pause;
+ current->nr_dirtied = 0;
- /*
- * Increase the delay for each loop, up to our previous
- * default of taking a 100ms nap.
- */
- pause <<= 1;
- if (pause > HZ / 10)
- pause = HZ / 10;
+ if (nr_dirty < default_backing_dev_info.dirty_threshold +
+ default_backing_dev_info.dirty_threshold / DIRTY_MARGIN)
+ break;
}
- if (!dirty_exceeded && bdi->dirty_exceeded)
- bdi->dirty_exceeded = 0;
+ if (pause == 0)
+ current->nr_dirtied_pause =
+ ratelimit_pages(nr_dirty, dirty_thresh);
+ else if (pause <= min_pause(bdi, pause_max))
+ current->nr_dirtied_pause += current->nr_dirtied_pause / 32 + 1;
+ else if (pause >= pause_max)
+ /*
+ * when repeated, writing 1 page per 100ms on slow devices,
+ * i-(i+2)/4 will be able to reach 1 but never reduce to 0.
+ */
+ current->nr_dirtied_pause -= (current->nr_dirtied_pause+2) >> 2;
if (writeback_in_progress(bdi))
return;
@@ -1290,8 +1300,10 @@ static void balance_dirty_pages(struct a
* In normal mode, we start background writeout at the lower
* background_thresh, to keep the amount of dirty memory low.
*/
- if ((laptop_mode && pages_written) ||
- (!laptop_mode && (nr_reclaimable > background_thresh)))
+ if (laptop_mode)
+ return;
+
+ if (nr_reclaimable > background_thresh)
bdi_start_background_writeback(bdi);
}
@@ -1305,8 +1317,6 @@ void set_page_dirty_balance(struct page
}
}
-static DEFINE_PER_CPU(unsigned long, bdp_ratelimits) = 0;
-
/**
* balance_dirty_pages_ratelimited_nr - balance dirty memory state
* @mapping: address_space which was dirtied
@@ -1316,36 +1326,35 @@ static DEFINE_PER_CPU(unsigned long, bdp
* which was newly dirtied. The function will periodically check the system's
* dirty state and will initiate writeback if needed.
*
- * On really big machines, get_writeback_state is expensive, so try to avoid
+ * On really big machines, global_page_state() is expensive, so try to avoid
* calling it too often (ratelimiting). But once we're over the dirty memory
- * limit we decrease the ratelimiting by a lot, to prevent individual processes
- * from overshooting the limit by (ratelimit_pages) each.
+ * limit we disable the ratelimiting, to prevent individual processes from
+ * overshooting the limit by (ratelimit_pages) each.
*/
void balance_dirty_pages_ratelimited_nr(struct address_space *mapping,
unsigned long nr_pages_dirtied)
{
- unsigned long ratelimit;
- unsigned long *p;
+ struct backing_dev_info *bdi = mapping->backing_dev_info;
- ratelimit = ratelimit_pages;
- if (mapping->backing_dev_info->dirty_exceeded)
- ratelimit = 8;
+ if (!bdi_cap_account_dirty(bdi))
+ return;
+
+ current->nr_dirtied += nr_pages_dirtied;
+
+ if (dirty_exceeded_recently(bdi, MAX_PAUSE)) {
+ unsigned long max = current->nr_dirtied +
+ (128 >> (PAGE_SHIFT - 10));
+
+ if (current->nr_dirtied_pause > max)
+ current->nr_dirtied_pause = max;
+ }
/*
* Check the rate limiting. Also, we do not want to throttle real-time
* tasks in balance_dirty_pages(). Period.
*/
- preempt_disable();
- p = &__get_cpu_var(bdp_ratelimits);
- *p += nr_pages_dirtied;
- if (unlikely(*p >= ratelimit)) {
- ratelimit = sync_writeback_pages(*p);
- *p = 0;
- preempt_enable();
- balance_dirty_pages(mapping, ratelimit);
- return;
- }
- preempt_enable();
+ if (unlikely(current->nr_dirtied >= current->nr_dirtied_pause))
+ balance_dirty_pages(mapping, current->nr_dirtied);
}
EXPORT_SYMBOL(balance_dirty_pages_ratelimited_nr);
@@ -1433,44 +1442,6 @@ void laptop_sync_completion(void)
#endif
/*
- * If ratelimit_pages is too high then we can get into dirty-data overload
- * if a large number of processes all perform writes at the same time.
- * If it is too low then SMP machines will call the (expensive)
- * get_writeback_state too often.
- *
- * Here we set ratelimit_pages to a level which ensures that when all CPUs are
- * dirtying in parallel, we cannot go more than 3% (1/32) over the dirty memory
- * thresholds before writeback cuts in.
- *
- * But the limit should not be set too high. Because it also controls the
- * amount of memory which the balance_dirty_pages() caller has to write back.
- * If this is too large then the caller will block on the IO queue all the
- * time. So limit it to four megabytes - the balance_dirty_pages() caller
- * will write six megabyte chunks, max.
- */
-
-void writeback_set_ratelimit(void)
-{
- ratelimit_pages = vm_total_pages / (num_online_cpus() * 32);
- if (ratelimit_pages < 16)
- ratelimit_pages = 16;
- if (ratelimit_pages * PAGE_CACHE_SIZE > 4096 * 1024)
- ratelimit_pages = (4096 * 1024) / PAGE_CACHE_SIZE;
-}
-
-static int __cpuinit
-ratelimit_handler(struct notifier_block *self, unsigned long u, void *v)
-{
- writeback_set_ratelimit();
- return NOTIFY_DONE;
-}
-
-static struct notifier_block __cpuinitdata ratelimit_nb = {
- .notifier_call = ratelimit_handler,
- .next = NULL,
-};
-
-/*
* Called early on to tune the page writeback dirty limits.
*
* We used to scale dirty pages according to how total memory
@@ -1492,9 +1463,6 @@ void __init page_writeback_init(void)
{
int shift;
- writeback_set_ratelimit();
- register_cpu_notifier(&ratelimit_nb);
-
shift = calc_period_shift();
prop_descriptor_init(&vm_completions, shift);
prop_descriptor_init(&vm_dirties, shift);
--- linux-next.orig/include/linux/backing-dev.h 2011-03-03 14:44:23.000000000 +0800
+++ linux-next/include/linux/backing-dev.h 2011-03-03 14:44:23.000000000 +0800
@@ -89,7 +89,6 @@ struct backing_dev_info {
unsigned long old_dirty_threshold;
struct prop_local_percpu completions;
- int dirty_exceeded;
/* last time exceeded (limit - limit/DIRTY_MARGIN) */
unsigned long dirty_exceed_time;
--- linux-next.orig/mm/memory_hotplug.c 2011-03-03 14:43:49.000000000 +0800
+++ linux-next/mm/memory_hotplug.c 2011-03-03 14:44:23.000000000 +0800
@@ -468,8 +468,6 @@ int online_pages(unsigned long pfn, unsi
vm_total_pages = nr_free_pagecache_pages();
- writeback_set_ratelimit();
-
if (onlined_pages)
memory_notify(MEM_ONLINE, &arg);
unlock_memory_hotplug();
@@ -901,7 +899,6 @@ repeat:
}
vm_total_pages = nr_free_pagecache_pages();
- writeback_set_ratelimit();
memory_notify(MEM_OFFLINE, &arg);
unlock_memory_hotplug();
--- linux-next.orig/mm/backing-dev.c 2011-03-03 14:44:23.000000000 +0800
+++ linux-next/mm/backing-dev.c 2011-03-03 14:44:23.000000000 +0800
@@ -667,8 +667,6 @@ int bdi_init(struct backing_dev_info *bd
goto err;
}
- bdi->dirty_exceeded = 0;
-
bdi->bw_time_stamp = jiffies;
bdi->written_stamp = 0;
^ permalink raw reply [flat|nested] 44+ messages in thread
* [PATCH 21/27] writeback: show bdi write bandwidth in debugfs
2011-03-03 6:45 [PATCH 00/27] IO-less dirty throttling v6 Wu Fengguang
` (19 preceding siblings ...)
2011-03-03 6:45 ` [PATCH 20/27] writeback: IO-less balance_dirty_pages() Wu Fengguang
@ 2011-03-03 6:45 ` Wu Fengguang
2011-03-03 6:45 ` [PATCH 22/27] writeback: trace dirty_throttle_bandwidth Wu Fengguang
` (6 subsequent siblings)
27 siblings, 0 replies; 44+ messages in thread
From: Wu Fengguang @ 2011-03-03 6:45 UTC (permalink / raw)
To: Andrew Morton
Cc: Jan Kara, Theodore Tso, Peter Zijlstra, Wu Fengguang,
Christoph Hellwig, Trond Myklebust, Dave Chinner, Chris Mason,
Mel Gorman, Rik van Riel, KOSAKI Motohiro, Greg Thelen,
Minchan Kim, Vivek Goyal, Andrea Righi, Balbir Singh, linux-mm,
linux-fsdevel, LKML
[-- Attachment #1: writeback-bandwidth-show.patch --]
[-- Type: text/plain, Size: 2593 bytes --]
Add a "BdiWriteBandwidth" entry (and indent others) in /debug/bdi/*/stats.
Also increase the numeric field width to 10, to keep the possibly
huge BdiWritten number aligned, at least on desktop systems.
This will break user space tools if they are dumb enough to depend on
the exact number of white spaces.
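For illustration, the resulting /debug/bdi/<dev>/stats output would look
roughly like this (the field names come from the patch below; the numbers
and the exact label padding are made up):

BdiWriteback:            14848 kB
BdiReclaimable:         104448 kB
BdiDirtyThresh:         201216 kB
DirtyThresh:            201216 kB
BackgroundThresh:       100608 kB
BdiDirtied:           58724352 kB
BdiWritten:           58703872 kB
BdiWriteBandwidth:       57344 kBps
b_dirty:                     4
b_io:                        0
b_more_io:                   0
bdi_list:                    1
state:                       8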
CC: Theodore Ts'o <tytso@mit.edu>
CC: Jan Kara <jack@suse.cz>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
mm/backing-dev.c | 34 ++++++++++++++++++++--------------
1 file changed, 20 insertions(+), 14 deletions(-)
--- linux-next.orig/mm/backing-dev.c 2011-03-02 17:40:05.000000000 +0800
+++ linux-next/mm/backing-dev.c 2011-03-02 17:40:41.000000000 +0800
@@ -87,24 +87,30 @@ static int bdi_debug_stats_show(struct s
#define K(x) ((x) << (PAGE_SHIFT - 10))
seq_printf(m,
- "BdiWriteback: %8lu kB\n"
- "BdiReclaimable: %8lu kB\n"
- "BdiDirtyThresh: %8lu kB\n"
- "DirtyThresh: %8lu kB\n"
- "BackgroundThresh: %8lu kB\n"
- "BdiDirtied: %8lu kB\n"
- "BdiWritten: %8lu kB\n"
- "b_dirty: %8lu\n"
- "b_io: %8lu\n"
- "b_more_io: %8lu\n"
- "bdi_list: %8u\n"
- "state: %8lx\n",
+ "BdiWriteback: %10lu kB\n"
+ "BdiReclaimable: %10lu kB\n"
+ "BdiDirtyThresh: %10lu kB\n"
+ "DirtyThresh: %10lu kB\n"
+ "BackgroundThresh: %10lu kB\n"
+ "BdiDirtied: %10lu kB\n"
+ "BdiWritten: %10lu kB\n"
+ "BdiWriteBandwidth: %10lu kBps\n"
+ "b_dirty: %10lu\n"
+ "b_io: %10lu\n"
+ "b_more_io: %10lu\n"
+ "bdi_list: %10u\n"
+ "state: %10lx\n",
(unsigned long) K(bdi_stat(bdi, BDI_WRITEBACK)),
(unsigned long) K(bdi_stat(bdi, BDI_RECLAIMABLE)),
- K(bdi_thresh), K(dirty_thresh), K(background_thresh),
+ K(bdi_thresh),
+ K(dirty_thresh),
+ K(background_thresh),
(unsigned long) K(bdi_stat(bdi, BDI_DIRTIED)),
(unsigned long) K(bdi_stat(bdi, BDI_WRITTEN)),
- nr_dirty, nr_io, nr_more_io,
+ (unsigned long) K(bdi->write_bandwidth),
+ nr_dirty,
+ nr_io,
+ nr_more_io,
!list_empty(&bdi->bdi_list), bdi->state);
#undef K
--
^ permalink raw reply [flat|nested] 44+ messages in thread
* [PATCH 22/27] writeback: trace dirty_throttle_bandwidth
2011-03-03 6:45 [PATCH 00/27] IO-less dirty throttling v6 Wu Fengguang
` (20 preceding siblings ...)
2011-03-03 6:45 ` [PATCH 21/27] writeback: show bdi write bandwidth in debugfs Wu Fengguang
@ 2011-03-03 6:45 ` Wu Fengguang
2011-03-03 6:45 ` [PATCH 23/27] writeback: trace balance_dirty_pages Wu Fengguang
` (5 subsequent siblings)
27 siblings, 0 replies; 44+ messages in thread
From: Wu Fengguang @ 2011-03-03 6:45 UTC (permalink / raw)
To: Andrew Morton
Cc: Jan Kara, Wu Fengguang, Christoph Hellwig, Trond Myklebust,
Dave Chinner, Theodore Ts'o, Chris Mason, Peter Zijlstra,
Mel Gorman, Rik van Riel, KOSAKI Motohiro, Greg Thelen,
Minchan Kim, Vivek Goyal, Andrea Righi, Balbir Singh, linux-mm,
linux-fsdevel, LKML
[-- Attachment #1: writeback-trace-throttle-bandwidth.patch --]
[-- Type: text/plain, Size: 3004 bytes --]
The new trace event provides critical information for understanding how
the various throttle bandwidths are updated.
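The event can be sampled like the balance_dirty_pages tracepoint used
later in this series (the /debug/tracing layout is assumed to be the
same):

echo 1 > /debug/tracing/events/writeback/dirty_throttle_bandwidth/enable
cat /debug/tracing/trace_pipe

Each line then reports write_bw/avg_bw/dirty_bw in KB/s together with
base_bw/pos_bw/ref_bw/aref_bw, per the TP_printk() format below.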
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
include/trace/events/writeback.h | 49 +++++++++++++++++++++++++++++
mm/page-writeback.c | 1
2 files changed, 50 insertions(+)
--- linux-next.orig/mm/page-writeback.c 2011-03-03 14:44:31.000000000 +0800
+++ linux-next/mm/page-writeback.c 2011-03-03 14:44:38.000000000 +0800
@@ -1068,6 +1068,7 @@ adjust:
bdi->throttle_bandwidth = bw;
out:
bdi_update_reference_bandwidth(bdi, ref_bw);
+ trace_dirty_throttle_bandwidth(bdi, dirty_bw, pos_bw, ref_bw);
}
void bdi_update_bandwidth(struct backing_dev_info *bdi,
--- linux-next.orig/include/trace/events/writeback.h 2011-03-03 14:43:49.000000000 +0800
+++ linux-next/include/trace/events/writeback.h 2011-03-03 14:44:38.000000000 +0800
@@ -152,6 +152,55 @@ DEFINE_WBC_EVENT(wbc_balance_dirty_writt
DEFINE_WBC_EVENT(wbc_balance_dirty_wait);
DEFINE_WBC_EVENT(wbc_writepage);
+#define KBps(x) ((x) << (PAGE_SHIFT - 10))
+#define Bps(x) ((x) >> (BASE_BW_SHIFT - PAGE_SHIFT))
+
+TRACE_EVENT(dirty_throttle_bandwidth,
+
+ TP_PROTO(struct backing_dev_info *bdi,
+ unsigned long dirty_bw,
+ unsigned long long pos_bw,
+ unsigned long long ref_bw),
+
+ TP_ARGS(bdi, dirty_bw, pos_bw, ref_bw),
+
+ TP_STRUCT__entry(
+ __array(char, bdi, 32)
+ __field(unsigned long, write_bw)
+ __field(unsigned long, avg_bw)
+ __field(unsigned long, dirty_bw)
+ __field(unsigned long long, base_bw)
+ __field(unsigned long long, pos_bw)
+ __field(unsigned long long, ref_bw)
+ __field(unsigned long long, avg_ref_bw)
+ ),
+
+ TP_fast_assign(
+ strlcpy(__entry->bdi, dev_name(bdi->dev), 32);
+ __entry->write_bw = KBps(bdi->write_bandwidth);
+ __entry->avg_bw = KBps(bdi->avg_bandwidth);
+ __entry->dirty_bw = KBps(dirty_bw);
+ __entry->base_bw = Bps(bdi->throttle_bandwidth);
+ __entry->pos_bw = Bps(pos_bw);
+ __entry->ref_bw = Bps(ref_bw);
+ __entry->avg_ref_bw = Bps(bdi->reference_bandwidth);
+ ),
+
+
+ TP_printk("bdi %s: "
+ "write_bw=%lu avg_bw=%lu dirty_bw=%lu "
+ "base_bw=%llu pos_bw=%llu ref_bw=%llu aref_bw=%llu",
+ __entry->bdi,
+ __entry->write_bw, /* bdi write bandwidth */
+ __entry->avg_bw, /* bdi avg write bandwidth */
+ __entry->dirty_bw, /* bdi dirty bandwidth */
+ __entry->base_bw, /* base throttle bandwidth */
+ __entry->pos_bw, /* position control bandwidth */
+ __entry->ref_bw, /* reference throttle bandwidth */
+ __entry->avg_ref_bw /* smoothed reference bandwidth */
+ )
+);
+
DECLARE_EVENT_CLASS(writeback_congest_waited_template,
TP_PROTO(unsigned int usec_timeout, unsigned int usec_delayed),
--
^ permalink raw reply [flat|nested] 44+ messages in thread
* [PATCH 23/27] writeback: trace balance_dirty_pages
2011-03-03 6:45 [PATCH 00/27] IO-less dirty throttling v6 Wu Fengguang
` (21 preceding siblings ...)
2011-03-03 6:45 ` [PATCH 22/27] writeback: trace dirty_throttle_bandwidth Wu Fengguang
@ 2011-03-03 6:45 ` Wu Fengguang
2011-03-03 6:45 ` [PATCH 24/27] writeback: trace global_dirty_state Wu Fengguang
` (4 subsequent siblings)
27 siblings, 0 replies; 44+ messages in thread
From: Wu Fengguang @ 2011-03-03 6:45 UTC (permalink / raw)
To: Andrew Morton
Cc: Jan Kara, Wu Fengguang, Christoph Hellwig, Trond Myklebust,
Dave Chinner, Theodore Ts'o, Chris Mason, Peter Zijlstra,
Mel Gorman, Rik van Riel, KOSAKI Motohiro, Greg Thelen,
Minchan Kim, Vivek Goyal, Andrea Righi, Balbir Singh, linux-mm,
linux-fsdevel, LKML
[-- Attachment #1: writeback-trace-balance_dirty_pages.patch --]
[-- Type: text/plain, Size: 8969 bytes --]
The trace event would be useful for analyzing the dynamics of the
throttling algorithms, and helpful for debugging user-reported problems.
Here is an interesting test that verifies the theory with balance_dirty_pages()
tracing. On a partition that can do ~60MB/s, a sparse file is created and
4 rsync tasks with different write bandwidth limits are started:
dd if=/dev/zero of=/mnt/1T bs=1M count=1 seek=1024000
echo 1 > /debug/tracing/events/writeback/balance_dirty_pages/enable
rsync localhost:/mnt/1T /mnt/a --bwlimit 10000&
rsync localhost:/mnt/1T /mnt/A --bwlimit 10000&
rsync localhost:/mnt/1T /mnt/b --bwlimit 20000&
rsync localhost:/mnt/1T /mnt/c --bwlimit 30000&
Trace outputs within 0.1 second, grouped by tasks:
rsync-3824 [004] 15002.076447: balance_dirty_pages: bdi=btrfs-2 weight=15% limit=130876 gap=5340 dirtied=192 pause=20
rsync-3822 [003] 15002.091701: balance_dirty_pages: bdi=btrfs-2 weight=15% limit=130777 gap=5113 dirtied=192 pause=20
rsync-3821 [006] 15002.004667: balance_dirty_pages: bdi=btrfs-2 weight=30% limit=129570 gap=3714 dirtied=64 pause=8
rsync-3821 [006] 15002.012654: balance_dirty_pages: bdi=btrfs-2 weight=30% limit=129589 gap=3733 dirtied=64 pause=8
rsync-3821 [006] 15002.021838: balance_dirty_pages: bdi=btrfs-2 weight=30% limit=129604 gap=3748 dirtied=64 pause=8
rsync-3821 [004] 15002.091193: balance_dirty_pages: bdi=btrfs-2 weight=29% limit=129583 gap=3983 dirtied=64 pause=8
rsync-3821 [004] 15002.102729: balance_dirty_pages: bdi=btrfs-2 weight=29% limit=129594 gap=3802 dirtied=64 pause=8
rsync-3821 [000] 15002.109252: balance_dirty_pages: bdi=btrfs-2 weight=29% limit=129619 gap=3827 dirtied=64 pause=8
rsync-3823 [002] 15002.009029: balance_dirty_pages: bdi=btrfs-2 weight=39% limit=128762 gap=2842 dirtied=64 pause=12
rsync-3823 [002] 15002.021598: balance_dirty_pages: bdi=btrfs-2 weight=39% limit=128813 gap=3021 dirtied=64 pause=12
rsync-3823 [003] 15002.032973: balance_dirty_pages: bdi=btrfs-2 weight=39% limit=128805 gap=2885 dirtied=64 pause=12
rsync-3823 [003] 15002.048800: balance_dirty_pages: bdi=btrfs-2 weight=39% limit=128823 gap=2967 dirtied=64 pause=12
rsync-3823 [003] 15002.060728: balance_dirty_pages: bdi=btrfs-2 weight=39% limit=128821 gap=3221 dirtied=64 pause=12
rsync-3823 [000] 15002.073152: balance_dirty_pages: bdi=btrfs-2 weight=39% limit=128825 gap=3225 dirtied=64 pause=12
rsync-3823 [005] 15002.090111: balance_dirty_pages: bdi=btrfs-2 weight=39% limit=128782 gap=3214 dirtied=64 pause=12
rsync-3823 [004] 15002.102520: balance_dirty_pages: bdi=btrfs-2 weight=39% limit=128764 gap=3036 dirtied=64 pause=12
The data vividly show that
- the heaviest writer is throttled a bit (weight=39%)
- the lighter writers run at full speed (weight=15%,15%,30%)
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
include/linux/writeback.h | 5 +
include/trace/events/writeback.h | 92 ++++++++++++++++++++++++++++-
mm/page-writeback.c | 22 ++++++
3 files changed, 114 insertions(+), 5 deletions(-)
--- linux-next.orig/include/trace/events/writeback.h 2011-03-03 14:44:38.000000000 +0800
+++ linux-next/include/trace/events/writeback.h 2011-03-03 14:44:39.000000000 +0800
@@ -147,9 +147,6 @@ DEFINE_EVENT(wbc_class, name, \
DEFINE_WBC_EVENT(wbc_writeback_start);
DEFINE_WBC_EVENT(wbc_writeback_written);
DEFINE_WBC_EVENT(wbc_writeback_wait);
-DEFINE_WBC_EVENT(wbc_balance_dirty_start);
-DEFINE_WBC_EVENT(wbc_balance_dirty_written);
-DEFINE_WBC_EVENT(wbc_balance_dirty_wait);
DEFINE_WBC_EVENT(wbc_writepage);
#define KBps(x) ((x) << (PAGE_SHIFT - 10))
@@ -201,6 +198,95 @@ TRACE_EVENT(dirty_throttle_bandwidth,
)
);
+TRACE_EVENT(balance_dirty_pages,
+
+ TP_PROTO(struct backing_dev_info *bdi,
+ unsigned long thresh,
+ unsigned long dirty,
+ unsigned long bdi_dirty,
+ unsigned long task_bw,
+ unsigned long dirtied,
+ unsigned long period,
+ long pause,
+ unsigned long start_time),
+
+ TP_ARGS(bdi, thresh, dirty, bdi_dirty,
+ task_bw, dirtied, period, pause, start_time),
+
+ TP_STRUCT__entry(
+ __array( char, bdi, 32)
+ __field(unsigned long, bdi_weight)
+ __field(unsigned long, task_weight)
+ __field(unsigned long, limit)
+ __field(unsigned long, goal)
+ __field(unsigned long, dirty)
+ __field(unsigned long, bdi_goal)
+ __field(unsigned long, bdi_dirty)
+ __field(unsigned long, avg_dirty)
+ __field(unsigned long, base_bw)
+ __field(unsigned long, task_bw)
+ __field(unsigned long, dirtied)
+ __field(unsigned long, period)
+ __field( long, think)
+ __field( long, pause)
+ __field(unsigned long, paused)
+ ),
+
+ TP_fast_assign(
+ long numerator;
+ long denominator;
+
+ strlcpy(__entry->bdi, dev_name(bdi->dev), 32);
+
+ bdi_writeout_fraction(bdi, &numerator, &denominator);
+ __entry->bdi_weight = 1000 * numerator / denominator;
+ task_dirties_fraction(current, &numerator, &denominator);
+ __entry->task_weight = 1000 * numerator / denominator;
+
+ __entry->limit = default_backing_dev_info.dirty_threshold;
+ __entry->goal = thresh - thresh / DIRTY_SCOPE;
+ __entry->dirty = dirty;
+ __entry->bdi_goal = bdi->dirty_threshold -
+ bdi->dirty_threshold / DIRTY_SCOPE;
+ __entry->bdi_dirty = bdi_dirty;
+ __entry->avg_dirty = bdi->avg_dirty;
+ __entry->base_bw = KBps(bdi->throttle_bandwidth) >>
+ BASE_BW_SHIFT;
+ __entry->task_bw = KBps(task_bw);
+ __entry->dirtied = dirtied;
+ __entry->think = current->paused_when == 0 ? 0 :
+ (long)(jiffies - current->paused_when) * 1000 / HZ;
+ __entry->period = period * 1000 / HZ;
+ __entry->pause = pause * 1000 / HZ;
+ __entry->paused = (jiffies - start_time) * 1000 / HZ;
+ ),
+
+
+ TP_printk("bdi %s: bdi_weight=%lu task_weight=%lu "
+ "limit=%lu goal=%lu dirty=%lu "
+ "bdi_goal=%lu bdi_dirty=%lu avg_dirty=%lu "
+ "base_bw=%lu task_bw=%lu "
+ "dirtied=%lu "
+ "period=%lu think=%ld pause=%ld paused=%lu",
+ __entry->bdi,
+ __entry->bdi_weight,
+ __entry->task_weight,
+ __entry->limit,
+ __entry->goal,
+ __entry->dirty,
+ __entry->bdi_goal,
+ __entry->bdi_dirty,
+ __entry->avg_dirty,
+ __entry->base_bw, /* base throttle bandwidth */
+ __entry->task_bw, /* task throttle bandwidth */
+ __entry->dirtied,
+ __entry->period, /* ms */
+ __entry->think, /* ms */
+ __entry->pause, /* ms */
+ __entry->paused /* ms */
+ )
+);
+
DECLARE_EVENT_CLASS(writeback_congest_waited_template,
TP_PROTO(unsigned int usec_timeout, unsigned int usec_delayed),
--- linux-next.orig/mm/page-writeback.c 2011-03-03 14:44:38.000000000 +0800
+++ linux-next/mm/page-writeback.c 2011-03-03 14:44:39.000000000 +0800
@@ -227,14 +227,14 @@ void task_dirty_inc(struct task_struct *
/*
* Obtain an accurate fraction of the BDI's portion.
*/
-static void bdi_writeout_fraction(struct backing_dev_info *bdi,
+void bdi_writeout_fraction(struct backing_dev_info *bdi,
long *numerator, long *denominator)
{
prop_fraction_percpu(&vm_completions, &bdi->completions,
numerator, denominator);
}
-static inline void task_dirties_fraction(struct task_struct *tsk,
+void task_dirties_fraction(struct task_struct *tsk,
long *numerator, long *denominator)
{
prop_fraction_single(&vm_dirties, &tsk->dirties,
@@ -1251,6 +1251,15 @@ static void balance_dirty_pages(struct a
* it may be a light dirtier.
*/
if (unlikely(-pause < HZ*10)) {
+ trace_balance_dirty_pages(bdi,
+ dirty_thresh,
+ nr_dirty,
+ bdi_dirty,
+ bw,
+ pages_dirtied,
+ period,
+ pause,
+ start_time);
if (-pause > HZ/2) {
current->paused_when = jiffies;
current->nr_dirtied = 0;
@@ -1267,6 +1276,15 @@ static void balance_dirty_pages(struct a
pause = pause_max;
pause:
+ trace_balance_dirty_pages(bdi,
+ dirty_thresh,
+ nr_dirty,
+ bdi_dirty,
+ bw,
+ pages_dirtied,
+ period,
+ pause,
+ start_time);
current->paused_when = jiffies;
__set_current_state(TASK_UNINTERRUPTIBLE);
io_schedule_timeout(pause);
--- linux-next.orig/include/linux/writeback.h 2011-03-03 14:44:38.000000000 +0800
+++ linux-next/include/linux/writeback.h 2011-03-03 14:44:39.000000000 +0800
@@ -169,6 +169,11 @@ void global_dirty_limits(unsigned long *
unsigned long bdi_dirty_limit(struct backing_dev_info *bdi,
unsigned long dirty);
+void bdi_writeout_fraction(struct backing_dev_info *bdi,
+ long *numerator, long *denominator);
+void task_dirties_fraction(struct task_struct *tsk,
+ long *numerator, long *denominator);
+
void bdi_update_bandwidth(struct backing_dev_info *bdi,
unsigned long thresh,
unsigned long dirty,
--
^ permalink raw reply [flat|nested] 44+ messages in thread
* [PATCH 24/27] writeback: trace global_dirty_state
2011-03-03 6:45 [PATCH 00/27] IO-less dirty throttling v6 Wu Fengguang
` (22 preceding siblings ...)
2011-03-03 6:45 ` [PATCH 23/27] writeback: trace balance_dirty_pages Wu Fengguang
@ 2011-03-03 6:45 ` Wu Fengguang
2011-03-03 6:45 ` [PATCH 25/27] writeback: make nr_to_write a per-file limit Wu Fengguang
` (3 subsequent siblings)
27 siblings, 0 replies; 44+ messages in thread
From: Wu Fengguang @ 2011-03-03 6:45 UTC (permalink / raw)
To: Andrew Morton
Cc: Jan Kara, Wu Fengguang, Christoph Hellwig, Trond Myklebust,
Dave Chinner, Theodore Ts'o, Chris Mason, Peter Zijlstra,
Mel Gorman, Rik van Riel, KOSAKI Motohiro, Greg Thelen,
Minchan Kim, Vivek Goyal, Andrea Righi, Balbir Singh, linux-mm,
linux-fsdevel, LKML
[-- Attachment #1: writeback-trace-global-dirty-states.patch --]
[-- Type: text/plain, Size: 2800 bytes --]
Add the global_dirty_state trace event to show the global dirty page
counts and thresholds at each balance_dirty_pages() loop.
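As with the other writeback tracepoints in this series, it can be
enabled via (the /debug/tracing layout is assumed to be the same):

echo 1 > /debug/tracing/events/writeback/global_dirty_state/enable

Each balance_dirty_pages() loop then emits one line with the
dirty/writeback/unstable counts, the background and dirty thresholds,
the remaining gap, the task's poll threshold and the cumulative
dirtied/written counters, per the TP_printk() format below.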
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
include/trace/events/writeback.h | 48 +++++++++++++++++++++++++++++
mm/page-writeback.c | 1
2 files changed, 49 insertions(+)
--- linux-next.orig/mm/page-writeback.c 2011-03-03 14:02:58.000000000 +0800
+++ linux-next/mm/page-writeback.c 2011-03-03 14:03:24.000000000 +0800
@@ -386,6 +386,7 @@ void global_dirty_limits(unsigned long *
tsk = current;
*pbackground = background;
*pdirty = dirty;
+ trace_global_dirty_state(background, dirty);
}
EXPORT_SYMBOL_GPL(global_dirty_limits);
--- linux-next.orig/include/trace/events/writeback.h 2011-03-03 14:02:58.000000000 +0800
+++ linux-next/include/trace/events/writeback.h 2011-03-03 14:03:24.000000000 +0800
@@ -287,6 +287,54 @@ TRACE_EVENT(balance_dirty_pages,
)
);
+TRACE_EVENT(global_dirty_state,
+
+ TP_PROTO(unsigned long background_thresh,
+ unsigned long dirty_thresh
+ ),
+
+ TP_ARGS(background_thresh,
+ dirty_thresh
+ ),
+
+ TP_STRUCT__entry(
+ __field(unsigned long, nr_dirty)
+ __field(unsigned long, nr_writeback)
+ __field(unsigned long, nr_unstable)
+ __field(unsigned long, background_thresh)
+ __field(unsigned long, dirty_thresh)
+ __field(unsigned long, poll_thresh)
+ __field(unsigned long, nr_dirtied)
+ __field(unsigned long, nr_written)
+ ),
+
+ TP_fast_assign(
+ __entry->nr_dirty = global_page_state(NR_FILE_DIRTY);
+ __entry->nr_writeback = global_page_state(NR_WRITEBACK);
+ __entry->nr_unstable = global_page_state(NR_UNSTABLE_NFS);
+ __entry->nr_dirtied = global_page_state(NR_DIRTIED);
+ __entry->nr_written = global_page_state(NR_WRITTEN);
+ __entry->background_thresh = background_thresh;
+ __entry->dirty_thresh = dirty_thresh;
+ __entry->poll_thresh = current->nr_dirtied_pause;
+ ),
+
+ TP_printk("dirty=%lu writeback=%lu unstable=%lu "
+ "bg_thresh=%lu thresh=%lu gap=%ld poll=%ld "
+ "dirtied=%lu written=%lu",
+ __entry->nr_dirty,
+ __entry->nr_writeback,
+ __entry->nr_unstable,
+ __entry->background_thresh,
+ __entry->dirty_thresh,
+ __entry->dirty_thresh - __entry->nr_dirty -
+ __entry->nr_writeback - __entry->nr_unstable,
+ __entry->poll_thresh,
+ __entry->nr_dirtied,
+ __entry->nr_written
+ )
+);
+
DECLARE_EVENT_CLASS(writeback_congest_waited_template,
TP_PROTO(unsigned int usec_timeout, unsigned int usec_delayed),
--
^ permalink raw reply [flat|nested] 44+ messages in thread
* [PATCH 25/27] writeback: make nr_to_write a per-file limit
2011-03-03 6:45 [PATCH 00/27] IO-less dirty throttling v6 Wu Fengguang
` (23 preceding siblings ...)
2011-03-03 6:45 ` [PATCH 24/27] writeback: trace global_dirty_state Wu Fengguang
@ 2011-03-03 6:45 ` Wu Fengguang
2011-03-03 6:45 ` [PATCH 26/27] writeback: scale IO chunk size up to device bandwidth Wu Fengguang
` (2 subsequent siblings)
27 siblings, 0 replies; 44+ messages in thread
From: Wu Fengguang @ 2011-03-03 6:45 UTC (permalink / raw)
To: Andrew Morton
Cc: Jan Kara, Wu Fengguang, Christoph Hellwig, Trond Myklebust,
Dave Chinner, Theodore Ts'o, Chris Mason, Peter Zijlstra,
Mel Gorman, Rik van Riel, KOSAKI Motohiro, Greg Thelen,
Minchan Kim, Vivek Goyal, Andrea Righi, Balbir Singh, linux-mm,
linux-fsdevel, LKML
[-- Attachment #1: writeback-single-file-limit.patch --]
[-- Type: text/plain, Size: 2184 bytes --]
This ensures a full 4MB or larger writeback size for large dirty files.
CC: Jan Kara <jack@suse.cz>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
fs/fs-writeback.c | 11 +++++++++++
include/linux/writeback.h | 1 +
2 files changed, 12 insertions(+)
--- linux-next.orig/fs/fs-writeback.c 2011-03-03 14:02:53.000000000 +0800
+++ linux-next/fs/fs-writeback.c 2011-03-03 14:03:32.000000000 +0800
@@ -330,6 +330,8 @@ static int
writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
{
struct address_space *mapping = inode->i_mapping;
+ long per_file_limit = wbc->per_file_limit;
+ long uninitialized_var(nr_to_write);
unsigned dirty;
int ret;
@@ -365,8 +367,16 @@ writeback_single_inode(struct inode *ino
inode->i_state &= ~I_DIRTY_PAGES;
spin_unlock(&inode_lock);
+ if (per_file_limit) {
+ nr_to_write = wbc->nr_to_write;
+ wbc->nr_to_write = per_file_limit;
+ }
+
ret = do_writepages(mapping, wbc);
+ if (per_file_limit)
+ wbc->nr_to_write += nr_to_write - per_file_limit;
+
/*
* Make sure to wait on the data before writing out the metadata.
* This is important for filesystems that modify metadata on data
@@ -689,6 +699,7 @@ static long wb_writeback(struct bdi_writ
wbc.more_io = 0;
wbc.nr_to_write = write_chunk;
+ wbc.per_file_limit = write_chunk;
wbc.pages_skipped = 0;
trace_wbc_writeback_start(&wbc, wb->bdi);
--- linux-next.orig/include/linux/writeback.h 2011-03-03 14:02:53.000000000 +0800
+++ linux-next/include/linux/writeback.h 2011-03-03 14:03:32.000000000 +0800
@@ -74,6 +74,7 @@ struct writeback_control {
extra jobs and livelock */
long nr_to_write; /* Write this many pages, and decrement
this for each page written */
+ long per_file_limit; /* Write this many pages for one file */
long pages_skipped; /* Pages which were not written */
/*
--
^ permalink raw reply [flat|nested] 44+ messages in thread
* [PATCH 26/27] writeback: scale IO chunk size up to device bandwidth
2011-03-03 6:45 [PATCH 00/27] IO-less dirty throttling v6 Wu Fengguang
` (24 preceding siblings ...)
2011-03-03 6:45 ` [PATCH 25/27] writeback: make nr_to_write a per-file limit Wu Fengguang
@ 2011-03-03 6:45 ` Wu Fengguang
2011-03-03 6:45 ` [PATCH 27/27] writeback: trace writeback_single_inode Wu Fengguang
2011-03-03 20:12 ` [PATCH 00/27] IO-less dirty throttling v6 Vivek Goyal
27 siblings, 0 replies; 44+ messages in thread
From: Wu Fengguang @ 2011-03-03 6:45 UTC (permalink / raw)
To: Andrew Morton
Cc: Jan Kara, Theodore Tso, Dave Chinner, Chris Mason, Peter Zijlstra,
Wu Fengguang, Christoph Hellwig, Trond Myklebust, Mel Gorman,
Rik van Riel, KOSAKI Motohiro, Greg Thelen, Minchan Kim,
Vivek Goyal, Andrea Righi, Balbir Singh, linux-mm, linux-fsdevel,
LKML
[-- Attachment #1: writeback-128M-MAX_WRITEBACK_PAGES.patch --]
[-- Type: text/plain, Size: 4820 bytes --]
Originally, MAX_WRITEBACK_PAGES was hard-coded to 1024 because of a
concern of not holding I_SYNC for too long. (At least, that was the
comment previously.) This doesn't make sense now because the only
time we wait for I_SYNC is if we are calling sync or fsync, and in
that case we need to write out all of the data anyway. Previously
there may have been other code paths that waited on I_SYNC, but not
any more. -- Theodore Ts'o
According to Christoph, the current writeback size is way too small,
and XFS had a hack that bumped nr_to_write to four times the value
sent by the VM in order to saturate medium-sized RAID arrays. This
value was also problematic for ext4, as it caused large files
to become interleaved on disk in 8 megabyte chunks (we bumped up
nr_to_write by a factor of two).
So remove the MAX_WRITEBACK_PAGES constraint totally. The writeback chunk
will adapt to as much as the storage device can write within 1 second.
For a typical hard disk, the resulting chunk size will be 32MB or 64MB.
XFS is observed to do IO completions in batches, and the batch size is
equal to the write chunk size. To keep dirty pages from suddenly dropping
out of balance_dirty_pages()'s dirty control scope and creating large
fluctuations, the chunk size is also limited to half the control scope.
http://bugzilla.kernel.org/show_bug.cgi?id=13930
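A rough worked example (assuming 4KB pages, that bdi->avg_bandwidth is
tracked in pages per second, and that the dirty_threshold/DIRTY_SCOPE
clamp in writeback_chunk_size() does not kick in): a disk that writes
~60MB/s averages about 15360 pages/s, and rounddown_pow_of_two(15360) =
8192 pages = 32MB, which becomes the per-inode write chunk.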
CC: Theodore Ts'o <tytso@mit.edu>
CC: Dave Chinner <david@fromorbit.com>
CC: Chris Mason <chris.mason@oracle.com>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
fs/fs-writeback.c | 61 ++++++++++++++++++++++++--------------------
1 file changed, 34 insertions(+), 27 deletions(-)
--- linux-next.orig/fs/fs-writeback.c 2011-03-02 17:24:06.000000000 +0800
+++ linux-next/fs/fs-writeback.c 2011-03-02 17:24:10.000000000 +0800
@@ -594,15 +594,6 @@ static void __writeback_inodes_sb(struct
spin_unlock(&inode_lock);
}
-/*
- * The maximum number of pages to writeout in a single bdi flush/kupdate
- * operation. We do this so we don't hold I_SYNC against an inode for
- * enormous amounts of time, which would block a userspace task which has
- * been forced to throttle against that inode. Also, the code reevaluates
- * the dirty each time it has written this many pages.
- */
-#define MAX_WRITEBACK_PAGES 1024
-
static inline bool over_bground_thresh(void)
{
unsigned long background_thresh, dirty_thresh;
@@ -614,6 +605,39 @@ static inline bool over_bground_thresh(v
}
/*
+ * Give each inode a nr_to_write that can complete within 1 second.
+ */
+static unsigned long writeback_chunk_size(struct backing_dev_info *bdi,
+ int sync_mode)
+{
+ unsigned long pages;
+
+ /*
+ * WB_SYNC_ALL mode does livelock avoidance by syncing dirty
+ * inodes/pages in one big loop. Setting wbc.nr_to_write=LONG_MAX
+ * here avoids calling into writeback_inodes_wb() more than once.
+ *
+ * The intended call sequence for WB_SYNC_ALL writeback is:
+ *
+ * wb_writeback()
+ * __writeback_inodes_sb() <== called only once
+ * write_cache_pages() <== called once for each inode
+ * (quickly) tag currently dirty pages
+ * (maybe slowly) sync all tagged pages
+ */
+ if (sync_mode == WB_SYNC_ALL)
+ return LONG_MAX;
+
+ pages = min(bdi->avg_bandwidth,
+ bdi->dirty_threshold / DIRTY_SCOPE);
+
+ if (pages <= MIN_WRITEBACK_PAGES)
+ return MIN_WRITEBACK_PAGES;
+
+ return rounddown_pow_of_two(pages);
+}
+
+/*
* Explicit flushing or periodic writeback of "old" data.
*
* Define "old": the first time one of an inode's pages is dirtied, we mark the
@@ -653,24 +677,6 @@ static long wb_writeback(struct bdi_writ
wbc.range_end = LLONG_MAX;
}
- /*
- * WB_SYNC_ALL mode does livelock avoidance by syncing dirty
- * inodes/pages in one big loop. Setting wbc.nr_to_write=LONG_MAX
- * here avoids calling into writeback_inodes_wb() more than once.
- *
- * The intended call sequence for WB_SYNC_ALL writeback is:
- *
- * wb_writeback()
- * __writeback_inodes_sb() <== called only once
- * write_cache_pages() <== called once for each inode
- * (quickly) tag currently dirty pages
- * (maybe slowly) sync all tagged pages
- */
- if (wbc.sync_mode == WB_SYNC_NONE)
- write_chunk = MAX_WRITEBACK_PAGES;
- else
- write_chunk = LONG_MAX;
-
wbc.wb_start = jiffies; /* livelock avoidance */
bdi_update_write_bandwidth(wb->bdi, wbc.wb_start);
for (;;) {
@@ -698,6 +704,7 @@ static long wb_writeback(struct bdi_writ
break;
wbc.more_io = 0;
+ write_chunk = writeback_chunk_size(wb->bdi, wbc.sync_mode);
wbc.nr_to_write = write_chunk;
wbc.per_file_limit = write_chunk;
wbc.pages_skipped = 0;
^ permalink raw reply [flat|nested] 44+ messages in thread
* [PATCH 27/27] writeback: trace writeback_single_inode
2011-03-03 6:45 [PATCH 00/27] IO-less dirty throttling v6 Wu Fengguang
` (25 preceding siblings ...)
2011-03-03 6:45 ` [PATCH 26/27] writeback: scale IO chunk size up to device bandwidth Wu Fengguang
@ 2011-03-03 6:45 ` Wu Fengguang
2011-03-03 20:12 ` [PATCH 00/27] IO-less dirty throttling v6 Vivek Goyal
27 siblings, 0 replies; 44+ messages in thread
From: Wu Fengguang @ 2011-03-03 6:45 UTC (permalink / raw)
To: Andrew Morton
Cc: Jan Kara, Wu Fengguang, Christoph Hellwig, Trond Myklebust,
Dave Chinner, Theodore Ts'o, Chris Mason, Peter Zijlstra,
Mel Gorman, Rik van Riel, KOSAKI Motohiro, Greg Thelen,
Minchan Kim, Vivek Goyal, Andrea Righi, Balbir Singh, linux-mm,
linux-fsdevel, LKML
[-- Attachment #1: writeback-trace-writeback_single_inode.patch --]
[-- Type: text/plain, Size: 3795 bytes --]
It is valuable to know how the dirty inodes are iterated over and how much IO is done for each of them.
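As with the other tracepoints in this series, it can be enabled via
(the /debug/tracing layout is assumed to be the same):

echo 1 > /debug/tracing/events/writeback/writeback_single_inode/enable

Each call then logs the bdi name, inode number, inode state flags,
dirty age, pages written, the remaining nr_to_write and the
writeback_index, per the TP_printk() format below.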
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
fs/fs-writeback.c | 12 +++---
include/trace/events/writeback.h | 56 +++++++++++++++++++++++++++++
2 files changed, 63 insertions(+), 5 deletions(-)
--- linux-next.orig/fs/fs-writeback.c 2011-03-02 17:24:06.000000000 +0800
+++ linux-next/fs/fs-writeback.c 2011-03-02 17:24:06.000000000 +0800
@@ -331,7 +331,7 @@ writeback_single_inode(struct inode *ino
{
struct address_space *mapping = inode->i_mapping;
long per_file_limit = wbc->per_file_limit;
- long uninitialized_var(nr_to_write);
+ long nr_to_write = wbc->nr_to_write;
unsigned dirty;
int ret;
@@ -351,7 +351,8 @@ writeback_single_inode(struct inode *ino
*/
if (wbc->sync_mode != WB_SYNC_ALL) {
requeue_io(inode);
- return 0;
+ ret = 0;
+ goto out;
}
/*
@@ -367,10 +368,8 @@ writeback_single_inode(struct inode *ino
inode->i_state &= ~I_DIRTY_PAGES;
spin_unlock(&inode_lock);
- if (per_file_limit) {
- nr_to_write = wbc->nr_to_write;
+ if (per_file_limit)
wbc->nr_to_write = per_file_limit;
- }
ret = do_writepages(mapping, wbc);
@@ -446,6 +445,9 @@ writeback_single_inode(struct inode *ino
}
}
inode_sync_complete(inode);
+out:
+ trace_writeback_single_inode(inode, wbc,
+ nr_to_write - wbc->nr_to_write);
return ret;
}
--- linux-next.orig/include/trace/events/writeback.h 2011-03-02 17:24:06.000000000 +0800
+++ linux-next/include/trace/events/writeback.h 2011-03-02 17:24:06.000000000 +0800
@@ -10,6 +10,19 @@
struct wb_writeback_work;
+#define show_inode_state(state) \
+ __print_flags(state, "|", \
+ {I_DIRTY_SYNC, "I_DIRTY_SYNC"}, \
+ {I_DIRTY_DATASYNC, "I_DIRTY_DATASYNC"}, \
+ {I_DIRTY_PAGES, "I_DIRTY_PAGES"}, \
+ {I_NEW, "I_NEW"}, \
+ {I_WILL_FREE, "I_WILL_FREE"}, \
+ {I_FREEING, "I_FREEING"}, \
+ {I_CLEAR, "I_CLEAR"}, \
+ {I_SYNC, "I_SYNC"}, \
+ {I_REFERENCED, "I_REFERENCED"} \
+ )
+
DECLARE_EVENT_CLASS(writeback_work_class,
TP_PROTO(struct backing_dev_info *bdi, struct wb_writeback_work *work),
TP_ARGS(bdi, work),
@@ -149,6 +162,49 @@ DEFINE_WBC_EVENT(wbc_writeback_written);
DEFINE_WBC_EVENT(wbc_writeback_wait);
DEFINE_WBC_EVENT(wbc_writepage);
+TRACE_EVENT(writeback_single_inode,
+
+ TP_PROTO(struct inode *inode,
+ struct writeback_control *wbc,
+ unsigned long wrote
+ ),
+
+ TP_ARGS(inode, wbc, wrote),
+
+ TP_STRUCT__entry(
+ __array(char, name, 32)
+ __field(unsigned long, ino)
+ __field(unsigned long, state)
+ __field(unsigned long, age)
+ __field(unsigned long, wrote)
+ __field(long, nr_to_write)
+ __field(unsigned long, writeback_index)
+ ),
+
+ TP_fast_assign(
+ strncpy(__entry->name,
+ dev_name(inode->i_mapping->backing_dev_info->dev), 32);
+ __entry->ino = inode->i_ino;
+ __entry->state = inode->i_state;
+ __entry->age = (jiffies - inode->dirtied_when) *
+ 1000 / HZ;
+ __entry->wrote = wrote;
+ __entry->nr_to_write = wbc->nr_to_write;
+ __entry->writeback_index = inode->i_mapping->writeback_index;
+ ),
+
+ TP_printk("bdi %s: ino=%lu state=%s age=%lu "
+ "wrote=%lu to_write=%ld index=%lu",
+ __entry->name,
+ __entry->ino,
+ show_inode_state(__entry->state),
+ __entry->age,
+ __entry->wrote,
+ __entry->nr_to_write,
+ __entry->writeback_index
+ )
+);
+
#define KBps(x) ((x) << (PAGE_SHIFT - 10))
#define Bps(x) ((x) >> (BASE_BW_SHIFT - PAGE_SHIFT))
--
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH 00/27] IO-less dirty throttling v6
2011-03-03 6:45 [PATCH 00/27] IO-less dirty throttling v6 Wu Fengguang
` (26 preceding siblings ...)
2011-03-03 6:45 ` [PATCH 27/27] writeback: trace writeback_single_inode Wu Fengguang
@ 2011-03-03 20:12 ` Vivek Goyal
2011-03-03 20:48 ` Vivek Goyal
27 siblings, 1 reply; 44+ messages in thread
From: Vivek Goyal @ 2011-03-03 20:12 UTC (permalink / raw)
To: Wu Fengguang
Cc: Andrew Morton, Jan Kara, Christoph Hellwig, Trond Myklebust,
Dave Chinner, Theodore Ts'o, Chris Mason, Peter Zijlstra,
Mel Gorman, Rik van Riel, KOSAKI Motohiro, Greg Thelen,
Minchan Kim, Andrea Righi, Balbir Singh, linux-mm, linux-fsdevel,
LKML
On Thu, Mar 03, 2011 at 02:45:05PM +0800, Wu Fengguang wrote:
[..]
> - serve as simple IO controllers: if provide an interface for the user
> to set task_bw directly (by returning the user specified value
> directly at the beginning of dirty_throttle_bandwidth(), plus always
> throttle such tasks even under the background dirty threshold), we get
> a bandwidth based per-task async write IO controller; let the user
> scale up/down the @priority parameter in dirty_throttle_bandwidth(),
> we get a priority based IO controller. It's possible to extend the
> capabilities to the scope of cgroup, too.
>
Hi Fengguang,
Above simple IO controller capabilities sound interesting and I was
looking at the patch to figure out the details.
You seem to be mentioning that user can explicitly set the upper rate
limit per task for async IO. Can't really figure that out where is the
interface for setting such upper limits. Can you please point me to that.
Thanks
Vivek
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH 00/27] IO-less dirty throttling v6
2011-03-03 20:12 ` [PATCH 00/27] IO-less dirty throttling v6 Vivek Goyal
@ 2011-03-03 20:48 ` Vivek Goyal
2011-03-04 9:06 ` Wu Fengguang
0 siblings, 1 reply; 44+ messages in thread
From: Vivek Goyal @ 2011-03-03 20:48 UTC (permalink / raw)
To: Wu Fengguang
Cc: Andrew Morton, Jan Kara, Christoph Hellwig, Trond Myklebust,
Dave Chinner, Theodore Ts'o, Chris Mason, Peter Zijlstra,
Mel Gorman, Rik van Riel, KOSAKI Motohiro, Greg Thelen,
Minchan Kim, Andrea Righi, Balbir Singh, linux-mm, linux-fsdevel,
LKML
On Thu, Mar 03, 2011 at 03:12:26PM -0500, Vivek Goyal wrote:
> On Thu, Mar 03, 2011 at 02:45:05PM +0800, Wu Fengguang wrote:
>
> [..]
> > - serve as simple IO controllers: if provide an interface for the user
> > to set task_bw directly (by returning the user specified value
> > directly at the beginning of dirty_throttle_bandwidth(), plus always
> > throttle such tasks even under the background dirty threshold), we get
> > a bandwidth based per-task async write IO controller; let the user
> > scale up/down the @priority parameter in dirty_throttle_bandwidth(),
> > we get a priority based IO controller. It's possible to extend the
> > capabilities to the scope of cgroup, too.
> >
>
> Hi Fengguang,
>
> Above simple IO controller capabilities sound interesting and I was
> looking at the patch to figure out the details.
>
> You seem to be mentioning that user can explicitly set the upper rate
> limit per task for async IO. Can't really figure that out where is the
> interface for setting such upper limits. Can you please point me to that.
Never mind. Jeff moyer pointed out that you mentioned above as possible
future enhancements on top of this patchset.
Thanks
Vivek
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH 00/27] IO-less dirty throttling v6
2011-03-03 20:48 ` Vivek Goyal
@ 2011-03-04 9:06 ` Wu Fengguang
2011-04-04 18:12 ` async write IO controllers Wu Fengguang
0 siblings, 1 reply; 44+ messages in thread
From: Wu Fengguang @ 2011-03-04 9:06 UTC (permalink / raw)
To: Vivek Goyal
Cc: Andrew Morton, Jan Kara, Christoph Hellwig, Trond Myklebust,
Dave Chinner, Theodore Ts'o, Chris Mason, Peter Zijlstra,
Mel Gorman, Rik van Riel, KOSAKI Motohiro, Greg Thelen,
Minchan Kim, Andrea Righi, Balbir Singh, linux-mm,
linux-fsdevel@vger.kernel.org, LKML
On Fri, Mar 04, 2011 at 04:48:27AM +0800, Vivek Goyal wrote:
> On Thu, Mar 03, 2011 at 03:12:26PM -0500, Vivek Goyal wrote:
> > On Thu, Mar 03, 2011 at 02:45:05PM +0800, Wu Fengguang wrote:
> >
> > [..]
> > > - serve as simple IO controllers: if provide an interface for the user
> > > to set task_bw directly (by returning the user specified value
> > > directly at the beginning of dirty_throttle_bandwidth(), plus always
> > > throttle such tasks even under the background dirty threshold), we get
> > > a bandwidth based per-task async write IO controller; let the user
> > > scale up/down the @priority parameter in dirty_throttle_bandwidth(),
> > > we get a priority based IO controller. It's possible to extend the
> > > capabilities to the scope of cgroup, too.
> > >
> >
> > Hi Fengguang,
> >
> > Above simple IO controller capabilities sound interesting and I was
> > looking at the patch to figure out the details.
> >
> > You seem to be mentioning that user can explicitly set the upper rate
> > limit per task for async IO. Can't really figure that out where is the
> > interface for setting such upper limits. Can you please point me to that.
>
> Never mind. Jeff moyer pointed out that you mentioned above as possible
> future enhancements on top of this patchset.
Hi Vivek,
Here is an update to show the bandwidth limit possibility. I tested it by
starting 8 or 10 concurrent dd's, doing "ulimit -m $((i<<10))" before
starting the i'th dd. The progress of the first 3 dd's is shown in the
following graphs.
http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v6/BW-LIMIT/xfs-10dd-1M-8p-2975M-20%25-2.6.38-rc7-dt6+-2011-03-04-16-22/balance_dirty_pages-task-bw.png
http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v6/BW-LIMIT/xfs-8dd-1M-8p-2975M-20%25-2.6.38-rc7-dt6+-2011-03-04-16-15/balance_dirty_pages-task-bw.png
http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v6/BW-LIMIT/ext4-10dd-1M-8p-2975M-20%25-2.6.38-rc7-dt6+-2011-03-04-16-29/balance_dirty_pages-task-bw.png
http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v6/BW-LIMIT/btrfs-10dd-1M-8p-2975M-20%25-2.6.38-rc7-dt6+-2011-03-04-16-35/balance_dirty_pages-task-bw.png
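A rough sketch of the per-task limit test described above (the mount
point, file size and dd arguments are illustrative assumptions; the
RLIMIT_RSS value set by ulimit -m is what the patch below reuses as the
per-task bandwidth limit):

for i in $(seq 10)
do
	(
		ulimit -m $((i<<10))	# limit the i'th dd to ~i MiB/s
		dd if=/dev/zero of=/fs/zero-$i bs=1M count=1000
	) &
done
wait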
The bandwidth limit is not perfect in two of the above cases:
- the xfs 10dd case: tasks could be hard throttled when the dirty limit is exceeded
- the ext4 10dd case: the filesystem introduces >500ms latencies (smaller ones will be compensated for)
Thanks,
Fengguang
---
Subject: writeback: per-task async write bandwidth limit
Date: Fri Mar 04 10:38:04 CST 2011
XXX: the user interface is reusing RLIMIT_RSS for now.
CC: Vivek Goyal <vgoyal@redhat.com>
CC: Andrea Righi <arighi@develer.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
mm/page-writeback.c | 20 ++++++++++++++++----
1 file changed, 16 insertions(+), 4 deletions(-)
--- linux-next.orig/mm/page-writeback.c 2011-03-04 10:33:06.000000000 +0800
+++ linux-next/mm/page-writeback.c 2011-03-04 16:03:52.000000000 +0800
@@ -428,6 +428,11 @@ unsigned long bdi_dirty_limit(struct bac
return bdi_dirty;
}
+static unsigned long hard_dirty_limit(unsigned long thresh)
+{
+ return max(thresh, default_backing_dev_info.dirty_threshold);
+}
+
/*
* If we can dirty N more pages globally, honour N/8 to the bdi that runs low,
* so as to help it ramp up.
@@ -589,7 +594,7 @@ static unsigned long dirty_throttle_band
unsigned long bdi_dirty,
struct task_struct *tsk)
{
- unsigned long limit = default_backing_dev_info.dirty_threshold;
+ unsigned long limit = hard_dirty_limit(thresh);
unsigned long bdi_thresh = bdi->dirty_threshold;
unsigned long origin;
unsigned long goal;
@@ -1221,6 +1226,11 @@ static void balance_dirty_pages(struct a
* when the bdi limits are ramping up.
*/
if (nr_dirty <= (background_thresh + dirty_thresh) / 2) {
+ if (current->signal->rlim[RLIMIT_RSS].rlim_cur !=
+ RLIM_INFINITY) {
+ pause_max = MAX_PAUSE;
+ goto calc_bw;
+ }
current->paused_when = jiffies;
current->nr_dirtied = 0;
break;
@@ -1233,7 +1243,7 @@ static void balance_dirty_pages(struct a
bdi_start_background_writeback(bdi);
pause_max = max_pause(bdi, bdi_dirty);
-
+calc_bw:
bw = dirty_throttle_bandwidth(bdi, dirty_thresh, nr_dirty,
bdi_dirty, current);
if (unlikely(bw == 0)) {
@@ -1241,6 +1251,8 @@ static void balance_dirty_pages(struct a
pause = pause_max;
goto pause;
}
+ bw = min(bw, current->signal->rlim[RLIMIT_RSS].rlim_cur >>
+ PAGE_SHIFT);
period = (HZ * pages_dirtied + bw / 2) / bw;
pause = current->paused_when + period - jiffies;
/*
@@ -1292,8 +1304,8 @@ pause:
current->paused_when += pause;
current->nr_dirtied = 0;
- if (nr_dirty < default_backing_dev_info.dirty_threshold +
- default_backing_dev_info.dirty_threshold / DIRTY_MARGIN)
+ dirty_thresh = hard_dirty_limit(dirty_thresh);
+ if (nr_dirty < dirty_thresh + dirty_thresh / DIRTY_MARGIN)
break;
}
--
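For illustration with made-up numbers: the RLIMIT-derived bw above is the
limit converted to pages per second, and it feeds straight into the
existing pause computation. With a 1 MiB/s limit (bw = 256 pages/s for 4K
pages) and pages_dirtied = 32, period = HZ * 32 / 256 = HZ / 8 jiffies,
so the task sleeps about 125ms for every 128KB it dirties.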
^ permalink raw reply [flat|nested] 44+ messages in thread
* async write IO controllers
2011-03-04 9:06 ` Wu Fengguang
@ 2011-04-04 18:12 ` Wu Fengguang
0 siblings, 0 replies; 44+ messages in thread
From: Wu Fengguang @ 2011-04-04 18:12 UTC (permalink / raw)
To: Vivek Goyal
Cc: Andrew Morton, Jan Kara, Christoph Hellwig, Trond Myklebust,
Dave Chinner, Theodore Ts'o, Chris Mason, Peter Zijlstra,
Mel Gorman, Rik van Riel, KOSAKI Motohiro, Greg Thelen,
Minchan Kim, Andrea Righi, Balbir Singh, linux-mm,
linux-fsdevel@vger.kernel.org, LKML
[-- Attachment #1: Type: text/plain, Size: 1633 bytes --]
Hi Vivek,
To explore the possibility of an integrated async write cgroup IO
controller in balance_dirty_pages(), I put together the attached patches.
They should serve well to illustrate the basic ideas.
The work is based on Andrea's two supporting patches and a slightly
simplified and improved version of this v6 patchset.
root@fat ~# cat test-blkio-cgroup.sh
#!/bin/sh
mount /dev/sda7 /fs
rmdir /cgroup/async_write
mkdir /cgroup/async_write
echo $$ > /cgroup/async_write/tasks
# echo "8:16 1048576" > /cgroup/async_write/blkio.throttle.read_bps_device
dd if=/dev/zero of=/fs/zero1 bs=1M count=100 &
dd if=/dev/zero of=/fs/zero2 bs=1M count=100 &
2-dd case:
root@fat ~# 100+0 records in
100+0 records out
104857600 bytes (105 MB) copied100+0 records in
100+0 records out
, 11.9477 s, 8.8 MB/s
104857600 bytes (105 MB) copied, 11.9496 s, 8.8 MB/s
1-dd case:
root@fat ~# 100+0 records in
100+0 records out
104857600 bytes (105 MB) copied, 6.21919 s, 16.9 MB/s
The patch hard codes a limit of 16MiB/s (i.e. 16.8MB/s). So the 1-dd case
is pretty accurate, while the 2-dd case leaks a bit (2 x 8.8 = 17.6MB/s
against the 16.8MB/s target) due to the time it takes to bring the
throttle bandwidth down from its initial value of 16MiB/s to 8MiB/s.
This could be compensated for by some position control in the future,
so that it won't leak in normal cases.
As for the main bits: blkcg_update_throttle_bandwidth() is in fact a
minimal version of bdi_update_throttle_bandwidth(), and
blkcg_update_bandwidth() is likewise a cut-down version of
bdi_update_bandwidth().
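In other words (my reading of the second attachment), the control step
amounts to

  new_bw = (bw + bw * target_bw / dirty_bw) / 2

so if the cgroup's tasks are dirtying at 32MiB/s against the 16MiB/s
target while throttled at bw, the next round throttles at 0.75 * bw;
since the measured dirty bandwidth follows the throttle bandwidth, this
converges geometrically toward the target.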
Thanks,
Fengguang
[-- Attachment #2: blk-cgroup-nr-dirtied.patch --]
[-- Type: text/x-diff, Size: 1920 bytes --]
Subject: blkcg: dirty rate accounting
Date: Sat Apr 02 20:15:28 CST 2011
To be used by the balance_dirty_pages() async write IO controller.
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
block/blk-cgroup.c | 4 ++++
include/linux/blk-cgroup.h | 1 +
mm/page-writeback.c | 4 ++++
3 files changed, 9 insertions(+)
--- linux-next.orig/block/blk-cgroup.c 2011-04-02 20:17:08.000000000 +0800
+++ linux-next/block/blk-cgroup.c 2011-04-02 21:59:24.000000000 +0800
@@ -1458,6 +1458,7 @@ static void blkiocg_destroy(struct cgrou
free_css_id(&blkio_subsys, &blkcg->css);
rcu_read_unlock();
+ percpu_counter_destroy(&blkcg->nr_dirtied);
if (blkcg != &blkio_root_cgroup)
kfree(blkcg);
}
@@ -1483,6 +1484,9 @@ done:
INIT_HLIST_HEAD(&blkcg->blkg_list);
INIT_LIST_HEAD(&blkcg->policy_list);
+
+ percpu_counter_init(&blkcg->nr_dirtied, 0);
+
return &blkcg->css;
}
--- linux-next.orig/include/linux/blk-cgroup.h 2011-04-02 20:17:08.000000000 +0800
+++ linux-next/include/linux/blk-cgroup.h 2011-04-02 21:59:02.000000000 +0800
@@ -111,6 +111,7 @@ struct blkio_cgroup {
spinlock_t lock;
struct hlist_head blkg_list;
struct list_head policy_list; /* list of blkio_policy_node */
+ struct percpu_counter nr_dirtied;
};
struct blkio_group_stats {
--- linux-next.orig/mm/page-writeback.c 2011-04-02 20:17:08.000000000 +0800
+++ linux-next/mm/page-writeback.c 2011-04-02 21:59:02.000000000 +0800
@@ -34,6 +34,7 @@
#include <linux/syscalls.h>
#include <linux/buffer_head.h>
#include <linux/pagevec.h>
+#include <linux/blk-cgroup.h>
#include <trace/events/writeback.h>
/*
@@ -221,6 +222,9 @@ EXPORT_SYMBOL_GPL(bdi_writeout_inc);
void task_dirty_inc(struct task_struct *tsk)
{
+ struct blkio_cgroup *blkcg = task_to_blkio_cgroup(tsk);
+ if (blkcg)
+ __percpu_counter_add(&blkcg->nr_dirtied, 1, BDI_STAT_BATCH);
prop_inc_single(&vm_dirties, &tsk->dirties);
}
[-- Attachment #3: writeback-io-controller.patch --]
[-- Type: text/x-diff, Size: 5134 bytes --]
Subject: writeback: async write IO controllers
Date: Fri Mar 04 10:38:04 CST 2011
- a bare per-task async write IO controller
- a bare per-cgroup async write IO controller
XXX: the per-task user interface is reusing RLIMIT_RSS for now.
XXX: the per-cgroup user interface is missing
CC: Vivek Goyal <vgoyal@redhat.com>
CC: Andrea Righi <arighi@develer.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
block/blk-cgroup.c | 2
include/linux/blk-cgroup.h | 4 +
mm/page-writeback.c | 86 +++++++++++++++++++++++++++++++----
3 files changed, 84 insertions(+), 8 deletions(-)
--- linux-next.orig/mm/page-writeback.c 2011-04-05 01:26:38.000000000 +0800
+++ linux-next/mm/page-writeback.c 2011-04-05 01:26:53.000000000 +0800
@@ -1117,6 +1117,49 @@ static unsigned long max_pause(struct ba
return clamp_val(t, MIN_PAUSE, MAX_PAUSE);
}
+static void blkcg_update_throttle_bandwidth(struct blkio_cgroup *blkcg,
+ unsigned long dirtied,
+ unsigned long elapsed)
+{
+ unsigned long bw = blkcg->throttle_bandwidth;
+ unsigned long long ref_bw;
+ unsigned long dirty_bw;
+
+ ref_bw = blkcg->async_write_bps >> (3 + PAGE_SHIFT - RATIO_SHIFT);
+ dirty_bw = ((dirtied - blkcg->dirtied_stamp)*HZ + elapsed/2) / elapsed;
+ do_div(ref_bw, dirty_bw | 1);
+ ref_bw = bw * ref_bw >> RATIO_SHIFT;
+
+ blkcg->throttle_bandwidth = (bw + ref_bw) / 2;
+}
+
+void blkcg_update_bandwidth(struct blkio_cgroup *blkcg)
+{
+ unsigned long now = jiffies;
+ unsigned long dirtied;
+ unsigned long elapsed;
+
+ if (!blkcg)
+ return;
+ if (!spin_trylock(&blkcg->lock))
+ return;
+
+ elapsed = now - blkcg->bw_time_stamp;
+ dirtied = percpu_counter_read(&blkcg->nr_dirtied);
+
+ if (elapsed > MAX_PAUSE * 2)
+ goto snapshot;
+ if (elapsed <= MAX_PAUSE)
+ goto unlock;
+
+ blkcg_update_throttle_bandwidth(blkcg, dirtied, elapsed);
+snapshot:
+ blkcg->dirtied_stamp = dirtied;
+ blkcg->bw_time_stamp = now;
+unlock:
+ spin_unlock(&blkcg->lock);
+}
+
/*
* balance_dirty_pages() must be called by processes which are generating dirty
* data. It looks at the number of dirty pages in the machine and will force
@@ -1139,6 +1182,10 @@ static void balance_dirty_pages(struct a
unsigned long pause_max;
struct backing_dev_info *bdi = mapping->backing_dev_info;
unsigned long start_time = jiffies;
+ struct blkio_cgroup *blkcg = task_to_blkio_cgroup(current);
+
+ if (blkcg == &blkio_root_cgroup)
+ blkcg = NULL;
for (;;) {
unsigned long now = jiffies;
@@ -1178,6 +1225,15 @@ static void balance_dirty_pages(struct a
* when the bdi limits are ramping up.
*/
if (nr_dirty <= (background_thresh + dirty_thresh) / 2) {
+ if (blkcg) {
+ pause_max = max_pause(bdi, 0);
+ goto cgroup_ioc;
+ }
+ if (current->signal->rlim[RLIMIT_RSS].rlim_cur !=
+ RLIM_INFINITY) {
+ pause_max = max_pause(bdi, 0);
+ goto task_ioc;
+ }
current->paused_when = now;
current->nr_dirtied = 0;
break;
@@ -1190,21 +1246,35 @@ static void balance_dirty_pages(struct a
bdi_start_background_writeback(bdi);
pause_max = max_pause(bdi, bdi_dirty);
-
base_bw = bdi->throttle_bandwidth;
- /*
- * Double the bandwidth for PF_LESS_THROTTLE (ie. nfsd) and
- * real-time tasks.
- */
- if (current->flags & PF_LESS_THROTTLE || rt_task(current))
- base_bw *= 2;
bw = position_ratio(bdi, dirty_thresh, nr_dirty, bdi_dirty);
if (unlikely(bw == 0)) {
period = pause_max;
pause = pause_max;
goto pause;
}
- bw = base_bw * (u64)bw >> RATIO_SHIFT;
+ bw = (u64)base_bw * bw >> RATIO_SHIFT;
+ if (blkcg && bw > blkcg->throttle_bandwidth) {
+cgroup_ioc:
+ blkcg_update_bandwidth(blkcg);
+ bw = blkcg->throttle_bandwidth;
+ base_bw = bw;
+ }
+ if (bw > current->signal->rlim[RLIMIT_RSS].rlim_cur >>
+ PAGE_SHIFT) {
+task_ioc:
+ bw = current->signal->rlim[RLIMIT_RSS].rlim_cur >>
+ PAGE_SHIFT;
+ base_bw = bw;
+ }
+ /*
+ * Double the bandwidth for PF_LESS_THROTTLE (ie. nfsd) and
+ * real-time tasks.
+ */
+ if (current->flags & PF_LESS_THROTTLE || rt_task(current)) {
+ bw *= 2;
+ base_bw = bw;
+ }
period = (HZ * pages_dirtied + bw / 2) / (bw | 1);
pause = current->paused_when + period - now;
/*
--- linux-next.orig/block/blk-cgroup.c 2011-04-05 01:26:38.000000000 +0800
+++ linux-next/block/blk-cgroup.c 2011-04-05 01:26:39.000000000 +0800
@@ -1486,6 +1486,8 @@ done:
INIT_LIST_HEAD(&blkcg->policy_list);
percpu_counter_init(&blkcg->nr_dirtied, 0);
+ blkcg->async_write_bps = 16 << 23; /* XXX: tunable interface */
+ blkcg->throttle_bandwidth = 16 << (20 - PAGE_SHIFT);
return &blkcg->css;
}
--- linux-next.orig/include/linux/blk-cgroup.h 2011-04-05 01:26:38.000000000 +0800
+++ linux-next/include/linux/blk-cgroup.h 2011-04-05 01:26:39.000000000 +0800
@@ -112,6 +112,10 @@ struct blkio_cgroup {
struct hlist_head blkg_list;
struct list_head policy_list; /* list of blkio_policy_node */
struct percpu_counter nr_dirtied;
+ unsigned long bw_time_stamp;
+ unsigned long dirtied_stamp;
+ unsigned long throttle_bandwidth;
+ unsigned long async_write_bps;
};
struct blkio_group_stats {
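The changelog notes that the per-cgroup user interface is still missing.
For discussion only, here is a hypothetical sketch of what the knob could
look like on top of block/blk-cgroup.c; the file name and handlers are my
invention, not part of the patches above:

  /* hypothetical: expose blkcg->async_write_bps as a cgroup file */
  static u64 blkiocg_async_write_bps_read(struct cgroup *cgrp,
                                          struct cftype *cft)
  {
          struct blkio_cgroup *blkcg = cgroup_to_blkio_cgroup(cgrp);

          return blkcg->async_write_bps;
  }

  static int blkiocg_async_write_bps_write(struct cgroup *cgrp,
                                           struct cftype *cft, u64 val)
  {
          struct blkio_cgroup *blkcg = cgroup_to_blkio_cgroup(cgrp);

          /* single word store; a real knob may want to take blkcg->lock */
          blkcg->async_write_bps = val;
          return 0;
  }

  static struct cftype blkio_async_write_files[] = {
          {
                  .name = "throttle.async_write_bps",
                  .read_u64 = blkiocg_async_write_bps_read,
                  .write_u64 = blkiocg_async_write_bps_write,
          },
  };

The array would be registered from blkiocg_populate() via
cgroup_add_files(), or folded into the existing blkio_files[] table. The
test script above could then write the desired limit into
/cgroup/async_write/blkio.throttle.async_write_bps, in whatever unit
async_write_bps ends up using (the patch initializes it to 16 << 23 for
a 16MiB/s limit).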
^ permalink raw reply [flat|nested] 44+ messages in thread