* [PATCH 00/47] IO-less dirty throttling v3
From: Wu Fengguang @ 2010-12-13 6:42 UTC (permalink / raw)
To: Andrew Morton
Cc: Jan Kara, Christoph Hellwig, Trond Myklebust, Dave Chinner,
Theodore Ts'o, Chris Mason, Peter Zijlstra, Mel Gorman,
Rik van Riel, KOSAKI Motohiro, Greg Thelen, Minchan Kim,
Wu Fengguang, linux-mm, linux-fsdevel, LKML
Andrew,
I'm glad to release this extensively tested v3 IO-less dirty throttling
patchset. It's based on 2.6.37-rc5 and Jan's sync livelock patches.
Given its trickiness and the possibility of side effects, independent tests
are highly welcome. Here is the git tree for easy access:
git://git.kernel.org/pub/scm/linux/kernel/git/wfg/writeback.git dirty-throttling-v3
Andrew, I followed your suggestion to add some trace points, and went further
to write scripts that automate the tests and visualize the collected trace,
iostat and vmstat data. The help has been tremendous. The tests and data
analyses paved the way to many fixes and algorithm improvements.
It still took a long time. The most challenging tasks were the fluctuations
with 100+ dd's and on NFS, and various imperfections in the control system and
in many filesystems. I wouldn't have been able to get this far without the
help of the pretty graphs, and I believe they'll continue to make future
maintenance easy. To identify problems reported by end users, I can just ask
for the traces, turn them into graphs and quickly get an overview of the problem.
The most up-to-date graphs and the corresponding scripts are uploaded to
http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/tests
Here you may find and compare test results for this patchset (2.6.37-rc5+) and
for the vanilla kernel (2.6.37-rc5). Filesystem developers may be interested
in taking a look at the dynamics.
The control algorithms are generally doing well in the recent graphs.
There are regular fluctuations in the number of dirty pages, however they
mostly originate from underneath: the low level reports IO completion in
units of 1MB, 32MB or even more, leading to sudden drops in the dirty
page count.
The tests cover the common scenarios:
- ext2, ext3, ext4, xfs, btrfs, nfs
- 256M, 512M, 3G, 16G memory sizes
- single disk and 12-disk array
- 1, 2, 10, 100, 1000 concurrent dd's
They disclosed lots of imperfections and bugs in
1) this patchset
2) filesystems not working well with the new paradigm
3) filesystem problems that also exist in the vanilla kernel
I managed to fix case (1) and most of (2), and reported (3).
Below are some interesting graphs illustrating the problems.
BTRFS
case (3) problem, nr_dirty going all the way down to 0, fixed by
[PATCH 38/47] btrfs: wait on too many nr_async_bios
http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/tests/3G/btrfs-1dd-1K-8p-2953M-2.6.37-rc3+-2010-11-30-17/vmstat-dirty.png
http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/tests/3G/btrfs-1dd-1M-8p-2952M-2.6.37-rc5-2010-12-10-21-23/vmstat-dirty.png
http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/tests/3G/btrfs-1dd-1M-8p-2952M-2.6.37-rc5+-2010-12-08-21-30/vmstat-dirty-300.png
after fix
http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/tests/3G/btrfs-1dd-1M-8p-2952M-2.6.37-rc5+-2010-12-08-21-14/vmstat-dirty-300.png
case (3) problem, not good looking but otherwise harmless, not fixed yet
http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/tests/3G/btrfs-1dd-1K-8p-2953M-2.6.37-rc3+-2010-11-30-14/vmstat-written.png
The root cause is that btrfs always clears the page dirty bit at the end of
prepare_pages() and then sets it dirty again in dirty_and_release_pages().
This leads to duplicate dirty accounting on 1KB-sized writes.
case (3) problem, bdi limit exceeded on 10+ concurrent dd's, fixed by
[PATCH 37/47] btrfs: lower the dirty balancing rate limit
http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/tests/3G/btrfs-100dd-1M-8p-2953M-2.6.37-rc3+-2010-12-02-20/vmstat-dirty.png
http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/tests/3G/btrfs-100dd-1M-8p-2953M-2.6.37-rc3+-2010-12-02-20/dirty-pages.png
case (2) problem, not root caused yet
in the vanilla kernel, the dirty/writeback pages are interesting
http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/tests/16G-12HDD-RAID0/btrfs-1000dd-1M-24p-15976M-2.6.37-rc5-2010-12-10-14-37/vmstat-dirty.png
but performance is still excellent
http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/tests/16G-12HDD-RAID0/btrfs-1000dd-1M-24p-15976M-2.6.37-rc5-2010-12-10-14-37/iostat-bw.png
with IO-less balance_dirty_pages(), it's much slower
http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/tests/16G-12HDD-RAID0/btrfs-1000dd-1M-24p-15976M-2.6.37-rc5+-2010-12-10-03-54/iostat-bw.png
dirty pages go very low
http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/tests/16G-12HDD-RAID0/btrfs-1000dd-1M-24p-15976M-2.6.37-rc5+-2010-12-10-03-54/vmstat-dirty.png
with only 20% disk util
http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/tests/16G-12HDD-RAID0/btrfs-1000dd-1M-24p-15976M-2.6.37-rc5+-2010-12-10-03-54/iostat-util.png
EXT4
case (3) problem, possibly a memory leak, not root caused yet
http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/tests/16G-12HDD-RAID0/ext4-100dd-1M-24p-15976M-2.6.37-rc5+-2010-12-09-23-40/dirty-pages.png
case (3) problem, burst-of-redirty, known issue with data=ordered, would be non-trivial to fix
http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/tests/3G/ext4-1dd-1M-8p-2952M-2.6.37-rc5+-2010-12-09-00-37/dirty-pages-3000.png
the workaround now is to mount with data=writeback
http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/tests/3G/ext4_wb-1dd-1M-8p-2952M-2.6.37-rc5+-2010-12-12-13-40/dirty-pages.png
NFS
There are some hard problems:
- large fluctuations of everything
- writeback/unstable pages squeezing dirty pages
- the dirtiers may sometimes stall for 1-2 seconds because no COMMITs return
during that time; this is hard to fix on the client side
before the patches
http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/tests/3G/nfs-1dd-1M-8p-2952M-2.6.37-rc5-2010-12-11-10-31/vmstat-dirty.png
http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/tests/3G/nfs-100dd-1M-8p-2952M-2.6.37-rc5-2010-12-10-12-40/vmstat-dirty.png
http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/tests/3G/nfs-1dd-4K-8p-2953M-2.6.37-rc3+-2010-11-29-10/vmstat-dirty.png
http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/tests/3G/nfs-1dd-4K-8p-2953M-2.6.37-rc3+-2010-11-29-10/dirty-bandwidth.png
after patches
http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/tests/3G/nfs-1dd-1M-8p-2952M-2.6.37-rc5+-2010-12-09-03-04/vmstat-dirty.png
http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/tests/3G/nfs-1dd-1M-8p-2952M-2.6.37-rc5+-2010-12-09-03-04/dirty-bandwidth-3000.png
http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/tests/3G/nfs-100dd-1M-8p-2952M-2.6.37-rc5+-2010-12-09-03-23/vmstat-dirty.png
burst of commit submits/returns
http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/tests/3G/nfs-100dd-1M-8p-2953M-2.6.37-rc3+-2010-12-03-01/nfs-commit-1000.png
after fix
http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/tests/3G/nfs-1dd-1M-8p-2952M-2.6.37-rc5+-2010-12-09-03-04/nfs-commit-300.png
The 1-second stalls happen at around 317s and 321s. Fortunately they only
happen with 10+ concurrent dd's, which is not a typical NFS client workload.
http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/tests/3G/nfs-100dd-1M-8p-2952M-2.6.37-rc5+-2010-12-09-03-23/nfs-commit-300.png
XFS
performs mostly ideally, except for some trivial imperfections: in places
the lines are not straight.
dirty/writeback pages
http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/tests/16G-12HDD-RAID0/xfs-1000dd-1M-24p-15976M-2.6.37-rc5-2010-12-10-18-18/vmstat-dirty.png
avg queue size and wait time
http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/tests/16G-12HDD-RAID0/xfs-1000dd-1M-24p-15976M-2.6.37-rc5+-2010-12-10-02-53/iostat-misc.png
bandwidth
http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/tests/16G-12HDD-RAID0/xfs-1000dd-1M-24p-15976M-2.6.37-rc5+-2010-12-10-02-53/dirty-bandwidth.png
Changes from v2 <http://lkml.org/lkml/2010/11/16/728>
- lock protected bdi bandwidth estimation
- user space think time compensation
- raise max pause time to 200ms for lower CPU overheads on concurrent dirtiers
- control system enhancements to handle large pause time and huge number of tasks
- concurrent dd test suite and a lot of tests
- adaptively scale up the writeback chunk size
- make it right for small memory systems
- various bug fixes
- new trace points
Changes from initial RFC <http://thread.gmane.org/gmane.linux.kernel.mm/52966>
- adaptive rate limiting, to reduce overheads when under the throttle threshold
- prevent overrunning dirty limit on lots of concurrent dirtiers
- add Documentation/filesystems/writeback-throttling-design.txt
- lower max pause time from 200ms to 100ms; min pause time from 10ms to 1jiffy
- don't drop the laptop mode code
- update and comment the trace event
- benchmarks on concurrent dd and fs_mark covering both large and tiny files
- bdi->write_bandwidth updates should be rate limited on concurrent dirtiers,
otherwise it drifts fast and fluctuates
- don't call balance_dirty_pages_ratelimited() when writing to already dirtied
pages, otherwise the task will be throttled too much
bdi dirty limit fixes
[PATCH 01/47] writeback: enabling gate limit for light dirtied bdi
[PATCH 02/47] writeback: safety margin for bdi stat error
v2 patches rebased onto the above two fixes
[PATCH 03/47] writeback: IO-less balance_dirty_pages()
[PATCH 04/47] writeback: consolidate variable names in balance_dirty_pages()
[PATCH 05/47] writeback: per-task rate limit on balance_dirty_pages()
[PATCH 06/47] writeback: prevent duplicate balance_dirty_pages_ratelimited() calls
[PATCH 07/47] writeback: account per-bdi accumulated written pages
[PATCH 08/47] writeback: bdi write bandwidth estimation
[PATCH 09/47] writeback: show bdi write bandwidth in debugfs
[PATCH 10/47] writeback: quit throttling when bdi dirty pages dropped low
[PATCH 11/47] writeback: reduce per-bdi dirty threshold ramp up time
[PATCH 12/47] writeback: make reasonable gap between the dirty/background thresholds
[PATCH 13/47] writeback: scale down max throttle bandwidth on concurrent dirtiers
[PATCH 14/47] writeback: add trace event for balance_dirty_pages()
[PATCH 15/47] writeback: make nr_to_write a per-file limit
trivial fixes for v2
[PATCH 16/47] writeback: make-nr_to_write-a-per-file-limit fix
[PATCH 17/47] writeback: do uninterruptible sleep in balance_dirty_pages()
[PATCH 18/47] writeback: move BDI_WRITTEN accounting into __bdi_writeout_inc()
[PATCH 19/47] writeback: fix increment of nr_dirtied_pause
[PATCH 20/47] writeback: use do_div in bw calculation
[PATCH 21/47] writeback: prevent divide error on tiny HZ
[PATCH 22/47] writeback: prevent bandwidth calculation overflow
spinlock protected bandwidth estimation, as suggested by Peter
[PATCH 23/47] writeback: spinlock protected bdi bandwidth update
algorithm updates
[PATCH 24/47] writeback: increase pause time on concurrent dirtiers
[PATCH 25/47] writeback: make it easier to break from a dirty exceeded bdi
[PATCH 26/47] writeback: start background writeback earlier
[PATCH 27/47] writeback: user space think time compensation
[PATCH 28/47] writeback: bdi base throttle bandwidth
[PATCH 29/47] writeback: smoothed bdi dirty pages
[PATCH 30/47] writeback: adapt max balance pause time to memory size
[PATCH 31/47] writeback: increase min pause time on concurrent dirtiers
trace points
[PATCH 32/47] writeback: extend balance_dirty_pages() trace event
[PATCH 33/47] writeback: trace global dirty page states
[PATCH 34/47] writeback: trace writeback_single_inode()
larger writeback chunk size
[PATCH 35/47] writeback: scale IO chunk size up to device bandwidth
btrfs fixes
[PATCH 36/47] btrfs: dont call balance_dirty_pages_ratelimited() on already dirty pages
[PATCH 37/47] btrfs: lower the dirty balancing rate limit
[PATCH 38/47] btrfs: wait on too many nr_async_bios
nfs fixes
[PATCH 39/47] nfs: livelock prevention is now done in VFS
[PATCH 40/47] NFS: writeback pages wait queue
[PATCH 41/47] nfs: in-commit pages accounting and wait queue
[PATCH 42/47] nfs: heuristics to avoid commit
[PATCH 43/47] nfs: dont change wbc->nr_to_write in write_inode()
[PATCH 44/47] nfs: limit the range of commits
[PATCH 45/47] nfs: adapt congestion threshold to dirty threshold
[PATCH 46/47] nfs: trace nfs_commit_unstable_pages()
[PATCH 47/47] nfs: trace nfs_commit_release()
Documentation/filesystems/writeback-throttling-design.txt | 210 ++++
fs/btrfs/disk-io.c | 7
fs/btrfs/file.c | 16
fs/btrfs/ioctl.c | 6
fs/btrfs/relocation.c | 6
fs/fs-writeback.c | 85 +
fs/nfs/client.c | 3
fs/nfs/file.c | 9
fs/nfs/write.c | 241 +++-
include/linux/backing-dev.h | 9
include/linux/nfs_fs.h | 1
include/linux/nfs_fs_sb.h | 3
include/linux/sched.h | 8
include/linux/writeback.h | 26
include/trace/events/nfs.h | 89 +
include/trace/events/writeback.h | 195 +++
mm/backing-dev.c | 32
mm/filemap.c | 5
mm/memory_hotplug.c | 3
mm/page-writeback.c | 518 +++++++---
20 files changed, 1212 insertions(+), 260 deletions(-)
Thanks,
Fengguang
--
* [PATCH 01/47] writeback: enabling gate limit for light dirtied bdi
From: Wu Fengguang @ 2010-12-13 6:42 UTC (permalink / raw)
To: Andrew Morton
Cc: Jan Kara, Rik van Riel, Peter Zijlstra, Wu Fengguang,
Christoph Hellwig, Trond Myklebust, Dave Chinner,
Theodore Ts'o, Chris Mason, Mel Gorman, KOSAKI Motohiro,
Greg Thelen, Minchan Kim, linux-mm, linux-fsdevel, LKML
[-- Attachment #1: writeback-min-bdi-dirty-limit.patch --]
[-- Type: text/plain, Size: 4587 bytes --]
I noticed that my NFSROOT test system responds slowly when there is a
heavy dd to a local disk. Traces show that the NFSROOT's bdi limit is
near 0 and many tasks in the system are repeatedly stuck in
balance_dirty_pages().
There are two generic problems:
- light dirtiers at one device (more often than not the rootfs) get
heavily impacted by heavy dirtiers on another, independent device
- the lightly dirtied device is heavily throttled because its bdi
limit=0, and the heavy throttling may in turn hold its bdi limit at 0,
as it cannot dirty fast enough to grow the bdi's proportional weight.
Fix it by introducing a "low pass" gate: a small (<=32MB) value that is
reserved from the global dirty margin and can be safely "stolen" by a
bdi that runs low. It does not need to be big to help the bdi gain its
initial weight.
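For illustration only, here is a condensed sketch of the gate in plain C.
gated_bdi_limit() is a hypothetical helper, not the function added by this
patch; the real code below folds the same logic into bdi_dirty_limit():

	static unsigned long gated_bdi_limit(unsigned long bdi_share,	/* bdi's proportional share */
					     unsigned long limit,	/* global dirty limit, in pages */
					     unsigned long dirty)	/* current dirty pages */
	{
		/* reserve ~1% of the global limit, capped at 32MB */
		unsigned long margin = min(limit / 128,
					   32768UL >> (PAGE_SHIFT - 10));

		limit -= margin;

		/* grant a bdi that runs low half of the remaining headroom */
		if (limit > dirty && bdi_share < (limit - dirty) / 2)
			bdi_share = (limit - dirty) / 2;

		return bdi_share;
	}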
Acked-by: Rik van Riel <riel@redhat.com>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
include/linux/writeback.h | 3 ++-
mm/backing-dev.c | 2 +-
mm/page-writeback.c | 29 ++++++++++++++++++++++++++---
3 files changed, 29 insertions(+), 5 deletions(-)
--- linux-next.orig/mm/page-writeback.c 2010-12-08 23:28:19.000000000 +0800
+++ linux-next/mm/page-writeback.c 2010-12-08 23:30:45.000000000 +0800
@@ -443,13 +443,26 @@ void global_dirty_limits(unsigned long *
*
* The bdi's share of dirty limit will be adapting to its throughput and
* bounded by the bdi->min_ratio and/or bdi->max_ratio parameters, if set.
- */
-unsigned long bdi_dirty_limit(struct backing_dev_info *bdi, unsigned long dirty)
+ *
+ * There is a chicken and egg problem: when bdi A (eg. /pub) is heavily dirtied
+ * and bdi B (eg. /) is lightly dirtied hence has a 0 dirty limit, tasks writing to
+ * B always get heavily throttled and bdi B's dirty limit might never be able
+ * to grow up from 0. So we do tricks to reserve some global margin and honour
+ * it to the bdi's that run low.
+ */
+unsigned long bdi_dirty_limit(struct backing_dev_info *bdi,
+ unsigned long dirty,
+ unsigned long dirty_pages)
{
u64 bdi_dirty;
long numerator, denominator;
/*
+ * Provide a global safety margin of ~1%, or up to 32MB for a 20GB box.
+ */
+ dirty -= min(dirty / 128, 32768UL >> (PAGE_SHIFT-10));
+
+ /*
* Calculate this BDI's share of the dirty ratio.
*/
bdi_writeout_fraction(bdi, &numerator, &denominator);
@@ -459,6 +472,15 @@ unsigned long bdi_dirty_limit(struct bac
do_div(bdi_dirty, denominator);
bdi_dirty += (dirty * bdi->min_ratio) / 100;
+
+ /*
+ * If we can dirty N more pages globally, honour N/2 to the bdi that
+ * runs low, so as to help it ramp up.
+ */
+ if (unlikely(bdi_dirty < (dirty - dirty_pages) / 2 &&
+ dirty > dirty_pages))
+ bdi_dirty = (dirty - dirty_pages) / 2;
+
if (bdi_dirty > (dirty * bdi->max_ratio) / 100)
bdi_dirty = dirty * bdi->max_ratio / 100;
@@ -508,7 +530,8 @@ static void balance_dirty_pages(struct a
(background_thresh + dirty_thresh) / 2)
break;
- bdi_thresh = bdi_dirty_limit(bdi, dirty_thresh);
+ bdi_thresh = bdi_dirty_limit(bdi, dirty_thresh,
+ nr_reclaimable + nr_writeback);
bdi_thresh = task_dirty_limit(current, bdi_thresh);
/*
--- linux-next.orig/mm/backing-dev.c 2010-12-08 23:28:19.000000000 +0800
+++ linux-next/mm/backing-dev.c 2010-12-08 23:28:43.000000000 +0800
@@ -83,7 +83,7 @@ static int bdi_debug_stats_show(struct s
spin_unlock(&inode_lock);
global_dirty_limits(&background_thresh, &dirty_thresh);
- bdi_thresh = bdi_dirty_limit(bdi, dirty_thresh);
+ bdi_thresh = bdi_dirty_limit(bdi, dirty_thresh, dirty_thresh);
#define K(x) ((x) << (PAGE_SHIFT - 10))
seq_printf(m,
--- linux-next.orig/include/linux/writeback.h 2010-12-08 23:28:19.000000000 +0800
+++ linux-next/include/linux/writeback.h 2010-12-08 23:28:43.000000000 +0800
@@ -126,7 +126,8 @@ int dirty_writeback_centisecs_handler(st
void global_dirty_limits(unsigned long *pbackground, unsigned long *pdirty);
unsigned long bdi_dirty_limit(struct backing_dev_info *bdi,
- unsigned long dirty);
+ unsigned long dirty,
+ unsigned long dirty_pages);
void page_writeback_init(void);
void balance_dirty_pages_ratelimited_nr(struct address_space *mapping,
--
* [PATCH 02/47] writeback: safety margin for bdi stat error
From: Wu Fengguang @ 2010-12-13 6:42 UTC (permalink / raw)
To: Andrew Morton
Cc: Jan Kara, Peter Zijlstra, Wu Fengguang, Christoph Hellwig,
Trond Myklebust, Dave Chinner, Theodore Ts'o, Chris Mason,
Mel Gorman, Rik van Riel, KOSAKI Motohiro, Greg Thelen,
Minchan Kim, linux-mm, linux-fsdevel, LKML
[-- Attachment #1: writeback-bdi-error.patch --]
[-- Type: text/plain, Size: 2738 bytes --]
In a simple dd test on an 8p system with "mem=256M", I found that all the
light dirtier tasks on the root fs get heavily throttled. That happens
because the global limit is exceeded. It's unbelievable at first sight,
because the test fs doing the heavy dd is under its bdi limit. After
doing some tracing, it was discovered that
bdi_dirty < bdi_dirty_limit() < global_dirty_limit() < nr_dirty
So the root cause is that bdi_dirty is well under the global nr_dirty
due to accounting errors. This could be fixed by using bdi_stat_sum(),
however that's costly on large NUMA machines. So do a less costly fix
of lowering the bdi limit, so that the accounting errors won't lead to
the absurd situation "global limit exceeded but bdi limit not exceeded".
This provides a guarantee when there is only one heavily dirtied bdi, and
works opportunistically for 2+ heavily dirtied bdi's (hopefully they won't
accumulate big errors _and_ exceed their bdi limits at the same time).
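A rough, purely illustrative calculation (the real bound depends on the
per-CPU counter batch size and the CPU count): if each per-CPU counter is
allowed to lag by, say, 32 pages, then on a 16-CPU box bdi_stat_error() is
about 16 * 32 = 512 pages, i.e. 2MB with 4KB pages. Subtracting that from
the global limit before computing the bdi limit compensates for the
worst-case under-reporting of the approximate bdi_dirty counters, so on a
single heavily dirtied bdi "bdi under its limit" can no longer coexist with
"global limit exceeded".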
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
mm/page-writeback.c | 18 ++++++++++++++++--
1 file changed, 16 insertions(+), 2 deletions(-)
--- linux-next.orig/mm/page-writeback.c 2010-12-08 22:44:21.000000000 +0800
+++ linux-next/mm/page-writeback.c 2010-12-08 22:44:21.000000000 +0800
@@ -434,10 +434,16 @@ void global_dirty_limits(unsigned long *
*pdirty = dirty;
}
-/*
+/**
* bdi_dirty_limit - @bdi's share of dirty throttling threshold
+ * @bdi: the backing_dev_info to query
+ * @dirty: global dirty limit in pages
+ * @dirty_pages: current number of dirty pages
*
- * Allocate high/low dirty limits to fast/slow devices, in order to prevent
+ * Returns @bdi's dirty limit in pages. The term "dirty" in the context of
+ * dirty balancing includes all PG_dirty, PG_writeback and NFS unstable pages.
+ *
+ * It allocates high/low dirty limits to fast/slow devices, in order to prevent
* - starving fast devices
* - piling up dirty pages (that will take long time to sync) on slow devices
*
@@ -458,6 +464,14 @@ unsigned long bdi_dirty_limit(struct bac
long numerator, denominator;
/*
+ * try to prevent "global limit exceeded but bdi limit not exceeded"
+ */
+ if (likely(dirty > bdi_stat_error(bdi)))
+ dirty -= bdi_stat_error(bdi);
+ else
+ return 0;
+
+ /*
* Provide a global safety margin of ~1%, or up to 32MB for a 20GB box.
*/
dirty -= min(dirty / 128, 32768ULL >> (PAGE_SHIFT-10));
--
* [PATCH 03/47] writeback: IO-less balance_dirty_pages()
From: Wu Fengguang @ 2010-12-13 6:42 UTC (permalink / raw)
To: Andrew Morton
Cc: Jan Kara, Chris Mason, Dave Chinner, Peter Zijlstra, Jens Axboe,
Wu Fengguang, Christoph Hellwig, Trond Myklebust,
Theodore Ts'o, Mel Gorman, Rik van Riel, KOSAKI Motohiro,
Greg Thelen, Minchan Kim, linux-mm, linux-fsdevel, LKML
[-- Attachment #1: writeback-bw-throttle.patch --]
[-- Type: text/plain, Size: 28983 bytes --]
As proposed by Chris, Dave and Jan, don't start foreground writeback IO
inside balance_dirty_pages(). Instead, simply let it idle sleep for some
time to throttle the dirtying task. Meanwhile, kick off the per-bdi
flusher thread to do the background writeback IO.
This patch introduces the basic framework, which will be further
consolidated by the next patches.
RATIONALE
=========
The current balance_dirty_pages() is rather IO inefficient.
- concurrent writeback of multiple inodes (Dave Chinner)
If every thread doing writes and being throttled starts foreground
writeback, it leads to N IO submitters from at least N different
inodes at the same time, ending up with N different sets of IO being
issued with potentially zero locality to each other, resulting in
much lower elevator sort/merge efficiency, and hence the disk seeks
all over the place to service the different sets of IO.
OTOH, if there is only one submission thread, it doesn't jump between
inodes in the same way when congestion clears - it keeps writing to
the same inode, resulting in large related chunks of sequential IOs
being issued to the disk. This is more efficient than the above
foreground writeback because the elevator works better and the disk
seeks less.
- IO size too small for fast arrays and too large for slow USB sticks
The write_chunk used by the current balance_dirty_pages() cannot be
directly set to some large value (eg. 128MB) for better IO efficiency,
because that could lead to user-perceivable stalls of more than 1 second.
Even the current 4MB write size may be too large for slow USB sticks.
The fact that balance_dirty_pages() starts IO by itself couples the
IO size to the wait time, which makes it hard to pick a suitable IO size
while keeping the wait time under control.
For the above two reasons, it's much better to shift IO to the flusher
threads and let balance_dirty_pages() just wait for enough time or progress.
Jan Kara, Dave Chinner and I explored the scheme of letting
balance_dirty_pages() wait for enough writeback IO completions to
safeguard the dirty limit. However it was found to have two problems:
- in large NUMA systems, the per-cpu counters may have big accounting
errors, leading to big throttle wait times and jitters.
- NFS may kill a large amount of unstable pages with one single COMMIT.
Because the NFS server serves COMMIT with expensive fsync() IOs, it is
desirable to delay and reduce the number of COMMITs. So it's not
likely that such bursty IO completions, and the resulting large (and
tiny) stall times, can be optimized away in IO-completion-based throttling.
So here is a pause time oriented approach, which tries to control the
pause time of each balance_dirty_pages() invocation, by controlling
the number of pages dirtied before calling balance_dirty_pages(), for
smooth and efficient dirty throttling:
- avoid useless (eg. zero pause time) balance_dirty_pages() calls
- avoid too small pause time (less than 10ms, which burns CPU power)
- avoid too large pause time (more than 100ms, which hurts responsiveness)
- avoid big fluctuations of pause times
For example, when doing a simple cp on ext4 with mem=4G and HZ=250:
before the patch, the pause time fluctuates from 0 to 324ms
(and the stall time may grow very large on slow devices)
[ 1237.139962] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=56
[ 1237.207489] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=0
[ 1237.225190] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=0
[ 1237.234488] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=0
[ 1237.244692] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=0
[ 1237.375231] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=31
[ 1237.443035] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=15
[ 1237.574630] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=31
[ 1237.642394] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=15
[ 1237.666320] balance_dirty_pages: write_chunk=1536 pages_written=57 pause=5
[ 1237.973365] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=81
[ 1238.212626] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=56
[ 1238.280431] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=15
[ 1238.412029] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=31
[ 1238.412791] balance_dirty_pages: write_chunk=1536 pages_written=0 pause=0
after the patch, the pause time remains stable at around 32ms
cp-2687 [002] 1452.237012: balance_dirty_pages: weight=56% dirtied=128 pause=8
cp-2687 [002] 1452.246157: balance_dirty_pages: weight=56% dirtied=128 pause=8
cp-2687 [006] 1452.253043: balance_dirty_pages: weight=56% dirtied=128 pause=8
cp-2687 [006] 1452.261899: balance_dirty_pages: weight=57% dirtied=128 pause=8
cp-2687 [006] 1452.268939: balance_dirty_pages: weight=57% dirtied=128 pause=8
cp-2687 [002] 1452.276932: balance_dirty_pages: weight=57% dirtied=128 pause=8
cp-2687 [002] 1452.285889: balance_dirty_pages: weight=57% dirtied=128 pause=8
CONTROL SYSTEM
==============
The current task_dirty_limit() adjusts bdi_dirty_limit to get
task_dirty_limit according to the dirty "weight" of the current task,
which is the percentage of pages recently dirtied by the task. If 100%
of the pages were recently dirtied by the task, it lowers bdi_dirty_limit
by 1/8. If only 1% of the pages were dirtied by the task, it returns an
almost unmodified bdi_dirty_limit. In this way, a heavy dirtier will get
blocked at task_dirty_limit=(bdi_dirty_limit-bdi_dirty_limit/8) while
allowing a light dirtier to progress (the latter won't be blocked
because R << B in fig.1).
Fig.1 before patch, a heavy dirtier and a light dirtier
R
----------------------------------------------+-o---------------------------*--|
L A B T
T: bdi_dirty_limit, as returned by bdi_dirty_limit()
L: T - T/8
R: bdi_reclaimable + bdi_writeback
A: task_dirty_limit for a heavy dirtier ~= R ~= L
B: task_dirty_limit for a light dirtier ~= T
Since each process has its own dirty limit, we reuse A/B for the tasks as
well as their dirty limits.
If B is a newly started heavy dirtier, then it will slowly gain weight
and A will lose weight. The task_dirty_limit for A and B will be
approaching the center of region (L, T) and eventually stabilize there.
Fig.2 before patch, two heavy dirtiers converging to the same threshold
R
----------------------------------------------+--------------o-*---------------|
L A B T
Fig.3 after patch, one heavy dirtier
|
throttle_bandwidth ~= bdi_bandwidth => o
| o
| o
| o
| o
| o
La| o
----------------------------------------------+-+-------------o----------------|
R A T
T: bdi_dirty_limit
A: task_dirty_limit = T - Wa * T/16
La: task_throttle_thresh = A - A/16
R: bdi_dirty_pages = bdi_reclaimable + bdi_writeback ~= La
Now for IO-less balance_dirty_pages(), let's do it in a "bandwidth control"
way. In fig.3, a soft dirty limit region (La, A) is introduced. When R enters
this region, the task may be throttled for J jiffies on every N pages it dirtied.
Let's call (N/J) the "throttle bandwidth". It is computed by the following formula:
throttle_bandwidth = bdi_bandwidth * (A - R) / (A - La)
where
A = T - Wa * T/16
La = A - A/16
where Wa is the task weight for A. It's 0 for a very light dirtier and 1 for
the single heavy dirtier (that consumes 100% of the bdi write bandwidth). The
task weight will be updated independently by task_dirty_inc() at
set_page_dirty() time.
When R < La, we don't throttle it at all.
When R > A, the code will detect the negative value and choose to pause for
100ms (the upper pause boundary), then loop over again.
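To make the formula concrete, here is a worked example with purely
illustrative numbers (not taken from any of the tests above). Assume
T = 160000 pages, one heavy dirtier so Wa ~= 1, and bdi_bandwidth = 50MB/s
(12800 pages/s with 4KB pages). Then
A = T - T/16 = 150000 pages
La = A - A/16 ~= 140625 pages, so A - La ~= 9375 pages
If R = 145000 pages:
throttle_bandwidth = 12800 * (150000 - 145000) / 9375 ~= 6800 pages/s
and a task that dirtied N = 128 pages will pause for roughly
128 / 6800 s ~= 19ms (about 5 jiffies at HZ=250).
If R creeps up to 148000 pages, throttle_bandwidth drops to ~2700 pages/s and
the same 128 pages cost a ~47ms pause, which slows the dirtier down and pulls
R back towards La -- that is the negative feedback the region (La, A) provides.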
PSEUDO CODE
===========
balance_dirty_pages():
/* soft throttling */
if (task_throttle_thresh exceeded)
sleep (task_dirtied_pages / throttle_bandwidth)
/* hard throttling */
while (task_dirty_limit exceeded) {
sleep 100ms
if (bdi_dirty_pages dropped more than task_dirtied_pages)
break
}
/* global hard limit */
while (dirty_limit exceeded)
sleep 100ms
Basically there are three levels of throttling now.
- normally the dirtier will be adaptively throttled with good timing
- when task_dirty_limit is exceeded, the task will be throttled until the
bdi dirty/writeback pages drop by a reasonably large amount
- when dirty_thresh is exceeded, the task can be throttled for an
arbitrarily long time
BEHAVIOR CHANGE
===============
Users will notice that applications get throttled once they cross the
global (background + dirty)/2=15% threshold. For a single
"cp", it could be soft throttled at 8*bdi->write_bandwidth around 15%
dirty pages, and be balanced at speed bdi->write_bandwidth around 17.5%
dirty pages. Before the patch, the behavior was to just throttle it at
17.5% dirty pages.
Since the task will be soft throttled earlier than before, it may be
perceived by end users as a performance "slowdown" if their application
happens to dirty more than ~15% of memory.
BENCHMARKS
==========
The test box has a 4-core 3.2GHz CPU, 4GB mem and a SATA disk.
For each filesystem, the following command is run 3 times.
time (dd if=/dev/zero of=/tmp/10G bs=1M count=10240; sync); rm /tmp/10G
2.6.36-rc2-mm1 2.6.36-rc2-mm1+balance_dirty_pages
average real time
ext2 236.377s 232.144s -1.8%
ext3 226.245s 225.751s -0.2%
ext4 178.742s 179.343s +0.3%
xfs 183.562s 179.808s -2.0%
btrfs 179.044s 179.461s +0.2%
NFS 645.627s 628.937s -2.6%
average system time
ext2 22.142s 19.656s -11.2%
ext3 34.175s 32.462s -5.0%
ext4 23.440s 21.162s -9.7%
xfs 19.089s 16.069s -15.8%
btrfs 12.212s 11.670s -4.4%
NFS 16.807s 17.410s +3.6%
total user time
sum 0.136s 0.084s -38.2%
In a more recent run of the tests, it's in fact slightly slower.
ext2 49.500 MB/s 49.200 MB/s -0.6%
ext3 50.133 MB/s 50.000 MB/s -0.3%
ext4 64.000 MB/s 63.200 MB/s -1.2%
xfs 63.500 MB/s 63.167 MB/s -0.5%
btrfs 63.133 MB/s 63.033 MB/s -0.2%
NFS 16.833 MB/s 16.867 MB/s +0.2%
In general there are no big IO performance changes for desktop users,
except for some noticeable reduction of CPU overheads. It mainly
benefits file servers with heavy concurrent writers on fast storage
arrays. As can be demonstrated by 10/100 concurrent dd's on xfs:
- 1 dirtier case: the same
- 10 dirtiers case: CPU system time is reduced to 50%
- 100 dirtiers case: CPU system time is reduced to 10%, IO size and throughput increase by 10%
2.6.37-rc2 2.6.37-rc1-next-20101115+
---------------------------------------- ----------------------------------------
%system wkB/s avgrq-sz %system wkB/s avgrq-sz
100dd 30.916 37843.000 748.670 3.079 41654.853 822.322
100dd 30.501 37227.521 735.754 3.744 41531.725 820.360
10dd 39.442 47745.021 900.935 20.756 47951.702 901.006
10dd 39.204 47484.616 899.330 20.550 47970.093 900.247
1dd 13.046 57357.468 910.659 13.060 57632.715 909.212
1dd 12.896 56433.152 909.861 12.467 56294.440 909.644
The CPU overheads in 2.6.37-rc1-next-20101115+ are higher than in
2.6.36-rc2-mm1+balance_dirty_pages; this may be because the pause time
stabilizes at lower values after some algorithm adjustments (eg.
reducing the minimal pause time from 10ms to 1 jiffy in the new version),
leading to many more balance_dirty_pages() calls. The different pause
times also explain the different system times for the 1/10/100dd cases
on the same 2.6.37-rc1-next-20101115+.
CC: Chris Mason <chris.mason@oracle.com>
CC: Dave Chinner <david@fromorbit.com>
CC: Jan Kara <jack@suse.cz>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
CC: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
Documentation/filesystems/writeback-throttling-design.txt | 210 ++++++++++
include/linux/writeback.h | 10
mm/page-writeback.c | 85 +---
3 files changed, 249 insertions(+), 56 deletions(-)
--- linux-next.orig/include/linux/writeback.h 2010-12-08 22:44:21.000000000 +0800
+++ linux-next/include/linux/writeback.h 2010-12-08 22:44:22.000000000 +0800
@@ -12,6 +12,16 @@ struct backing_dev_info;
extern spinlock_t inode_lock;
/*
+ * The 1/8 region under the bdi dirty threshold is set aside for elastic
+ * throttling. In rare cases when the threshold is exceeded, more rigid
+ * throttling will be imposed, which will inevitably stall the dirtier task
+ * for seconds (or more) at _one_ time. The rare case could be a fork bomb
+ * where every new task dirties some more pages.
+ */
+#define BDI_SOFT_DIRTY_LIMIT 8
+#define TASK_SOFT_DIRTY_LIMIT (BDI_SOFT_DIRTY_LIMIT * 2)
+
+/*
* fs/fs-writeback.c
*/
enum writeback_sync_modes {
--- linux-next.orig/mm/page-writeback.c 2010-12-08 22:44:21.000000000 +0800
+++ linux-next/mm/page-writeback.c 2010-12-08 22:44:22.000000000 +0800
@@ -42,20 +42,6 @@
*/
static long ratelimit_pages = 32;
-/*
- * When balance_dirty_pages decides that the caller needs to perform some
- * non-background writeback, this is how many pages it will attempt to write.
- * It should be somewhat larger than dirtied pages to ensure that reasonably
- * large amounts of I/O are submitted.
- */
-static inline long sync_writeback_pages(unsigned long dirtied)
-{
- if (dirtied < ratelimit_pages)
- dirtied = ratelimit_pages;
-
- return dirtied + dirtied / 2;
-}
-
/* The following parameters are exported via /proc/sys/vm */
/*
@@ -279,7 +265,7 @@ static unsigned long task_dirty_limit(st
{
long numerator, denominator;
unsigned long dirty = bdi_dirty;
- u64 inv = dirty >> 3;
+ u64 inv = dirty / TASK_SOFT_DIRTY_LIMIT;
task_dirties_fraction(tsk, &numerator, &denominator);
inv *= numerator;
@@ -509,26 +495,25 @@ unsigned long bdi_dirty_limit(struct bac
* perform some writeout.
*/
static void balance_dirty_pages(struct address_space *mapping,
- unsigned long write_chunk)
+ unsigned long pages_dirtied)
{
long nr_reclaimable, bdi_nr_reclaimable;
long nr_writeback, bdi_nr_writeback;
unsigned long background_thresh;
unsigned long dirty_thresh;
unsigned long bdi_thresh;
- unsigned long pages_written = 0;
- unsigned long pause = 1;
+ unsigned long bw;
+ unsigned long pause;
bool dirty_exceeded = false;
struct backing_dev_info *bdi = mapping->backing_dev_info;
for (;;) {
- struct writeback_control wbc = {
- .sync_mode = WB_SYNC_NONE,
- .older_than_this = NULL,
- .nr_to_write = write_chunk,
- .range_cyclic = 1,
- };
-
+ /*
+ * Unstable writes are a feature of certain networked
+ * filesystems (i.e. NFS) in which data may have been
+ * written to the server's write cache, but has not yet
+ * been flushed to permanent storage.
+ */
nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
global_page_state(NR_UNSTABLE_NFS);
nr_writeback = global_page_state(NR_WRITEBACK);
@@ -566,6 +551,23 @@ static void balance_dirty_pages(struct a
bdi_nr_writeback = bdi_stat(bdi, BDI_WRITEBACK);
}
+ if (bdi_nr_reclaimable + bdi_nr_writeback >= bdi_thresh) {
+ pause = HZ/10;
+ goto pause;
+ }
+
+ bw = 100 << 20; /* use static 100MB/s for the moment */
+
+ bw = bw * (bdi_thresh - (bdi_nr_reclaimable + bdi_nr_writeback));
+ bw = bw / (bdi_thresh / TASK_SOFT_DIRTY_LIMIT + 1);
+
+ pause = HZ * (pages_dirtied << PAGE_CACHE_SHIFT) / (bw + 1);
+ pause = clamp_val(pause, 1, HZ/10);
+
+pause:
+ __set_current_state(TASK_INTERRUPTIBLE);
+ io_schedule_timeout(pause);
+
/*
* The bdi thresh is somehow "soft" limit derived from the
* global "hard" limit. The former helps to prevent heavy IO
@@ -581,35 +583,6 @@ static void balance_dirty_pages(struct a
if (!bdi->dirty_exceeded)
bdi->dirty_exceeded = 1;
-
- /* Note: nr_reclaimable denotes nr_dirty + nr_unstable.
- * Unstable writes are a feature of certain networked
- * filesystems (i.e. NFS) in which data may have been
- * written to the server's write cache, but has not yet
- * been flushed to permanent storage.
- * Only move pages to writeback if this bdi is over its
- * threshold otherwise wait until the disk writes catch
- * up.
- */
- trace_wbc_balance_dirty_start(&wbc, bdi);
- if (bdi_nr_reclaimable > bdi_thresh) {
- writeback_inodes_wb(&bdi->wb, &wbc);
- pages_written += write_chunk - wbc.nr_to_write;
- trace_wbc_balance_dirty_written(&wbc, bdi);
- if (pages_written >= write_chunk)
- break; /* We've done our duty */
- }
- trace_wbc_balance_dirty_wait(&wbc, bdi);
- __set_current_state(TASK_INTERRUPTIBLE);
- io_schedule_timeout(pause);
-
- /*
- * Increase the delay for each loop, up to our previous
- * default of taking a 100ms nap.
- */
- pause <<= 1;
- if (pause > HZ / 10)
- pause = HZ / 10;
}
if (!dirty_exceeded && bdi->dirty_exceeded)
@@ -626,7 +599,7 @@ static void balance_dirty_pages(struct a
* In normal mode, we start background writeout at the lower
* background_thresh, to keep the amount of dirty memory low.
*/
- if ((laptop_mode && pages_written) ||
+ if ((laptop_mode && dirty_exceeded) ||
(!laptop_mode && (nr_reclaimable > background_thresh)))
bdi_start_background_writeback(bdi);
}
@@ -675,7 +648,7 @@ void balance_dirty_pages_ratelimited_nr(
p = &__get_cpu_var(bdp_ratelimits);
*p += nr_pages_dirtied;
if (unlikely(*p >= ratelimit)) {
- ratelimit = sync_writeback_pages(*p);
+ ratelimit = *p;
*p = 0;
preempt_enable();
balance_dirty_pages(mapping, ratelimit);
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux-next/Documentation/filesystems/writeback-throttling-design.txt 2010-12-08 22:44:22.000000000 +0800
@@ -0,0 +1,210 @@
+writeback throttling design
+---------------------------
+
+introduction to dirty throttling
+--------------------------------
+
+The write(2) syscall normally does a buffered write that creates dirty page
+cache pages to hold the data and returns immediately. The dirty pages will
+eventually be written to disk, or be dropped by unlink()/truncate().
+
+The delayed writeback of dirty pages enables the kernel to optimize the IO:
+
+- turn IO into async ones, which avoids blocking the tasks
+- submit IO as a batch for better throughput
+- avoid IO at all for temp files
+
+However, there have to be some limits on the number of allowable dirty pages.
+Typically applications are able to dirty pages more quickly than storage
+devices can write them. When approaching the dirty limits, the dirtier tasks
+will be throttled (put to brief sleeps from time to time) by
+balance_dirty_pages() in order to balance the dirty speed and writeback speed.
+
+dirty limits
+------------
+
+The dirty limit defaults to 20% reclaimable memory, and can be tuned via one of
+the following sysctl interfaces:
+
+ /proc/sys/vm/dirty_ratio
+ /proc/sys/vm/dirty_bytes
+
+The ultimate goal of balance_dirty_pages() is to keep the global dirty pages
+under control.
+
+ dirty_limit = dirty_ratio * free_reclaimable_pages
+
+However a global threshold may create deadlock for stacked BDIs (loop, FUSE and
+local NFS mounts). When A writes to B, and A generates enough dirty pages to
+get throttled, B will never start writeback until the dirty pages go away.
+
+Another problem is inter device starvation. When there are concurrent writes to
+a slow device and a fast one, the latter may well be starved due to unnecessary
+throttling on its dirtier tasks, leading to big IO performance drop.
+
+The solution is to split the global dirty limit into per-bdi limits among all
+the backing devices and scale writeback cache per backing device, proportional
+to its writeout speed.
+
+ bdi_dirty_limit = bdi_weight * dirty_limit
+
+where bdi_weight (ranging from 0 to 1) reflects the recent writeout speed of
+the BDI.
+
+We further scale the bdi dirty limit inversely with the task's dirty rate.
+This makes heavy writers have a lower dirty limit than the occasional writer,
+to prevent a heavy dd from slowing down all other light writers in the system.
+
+ task_dirty_limit = bdi_dirty_limit - task_weight * bdi_dirty_limit/16
+
+pause time
+----------
+
+The main task of dirty throttling is to determine when and how long to pause
+the current dirtier task. Basically we want to
+
+- avoid too small pause time (less than 1 jiffy, which burns CPU power)
+- avoid too large pause time (more than 100ms, which hurts responsiveness)
+- avoid big fluctuations of pause times
+
+To smoothly control the pause time, we do soft throttling in a small region
+under task_dirty_limit, starting from
+
+ task_throttle_thresh = task_dirty_limit - task_dirty_limit/16
+
+In fig.1, when bdi_dirty_pages falls into
+
+ [0, La]: do nothing
+ [La, A]: do soft throttling
+ [A, inf]: do hard throttling
+
+Where hard throttling is to wait until bdi_dirty_pages falls more than
+task_dirtied_pages (the pages dirtied by the task since its last throttle
+time). It's "hard" because it may end up waiting for long time.
+
+Fig.1 dirty throttling regions
+ o
+ o
+ o
+ o
+ o
+ o
+ o
+ o
+----------------------------------------------+---------------o----------------|
+ La A T
+ no throttle soft throttle hard throttle
+ T: bdi_dirty_limit
+ A: task_dirty_limit = T - task_weight * T/16
+ La: task_throttle_thresh = A - A/16
+
+Soft dirty throttling is to pause the dirtier task for J:pause_time jiffies on
+every N:task_dirtied_pages pages it dirtied. Let's call (N/J) the "throttle
+bandwidth". It is computed by the following formula:
+
+ task_dirty_limit - bdi_dirty_pages
+throttle_bandwidth = bdi_bandwidth * ----------------------------------
+ task_dirty_limit/16
+
+where bdi_bandwidth is the BDI's estimated write speed.
+
+Given the throttle_bandwidth for a task, we select a suitable N, so that when
+the task dirties that many pages, it enters balance_dirty_pages() to sleep for
+roughly J jiffies. N is adaptive to storage and task write speeds, so that the
+task always gets a suitable (not too long or too small) pause time.
+
+dynamics
+--------
+
+When there is one heavy dirtier, bdi_dirty_pages will keep growing until
+exceeding the low threshold of the task's soft throttling region [La, A].
+At which point (La) the task will be controlled under speed
+throttle_bandwidth=bdi_bandwidth (fig.2) and remain stable there.
+
+Fig.2 one heavy dirtier
+
+ throttle_bandwidth ~= bdi_bandwidth => o
+ | o
+ | o
+ | o
+ | o
+ | o
+ | o
+ La| o
+----------------------------------------------+---------------o----------------|
+ R A T
+ R: bdi_dirty_pages ~= La
+
+When there comes a new dd task B, task_weight_B will gradually grow from 0 to
+50% while task_weight_A will decrease from 100% to 50%. When task_weight_B is
+still small, B is considered a light dirtier and is allowed to dirty pages much
+faster than the bdi write bandwidth. In fact initially it won't be throttled at
+all when R < Lb where Lb = B - B/16 and B ~= T.
+
+Fig.3 an old dd (A) + a newly started dd (B)
+
+ throttle bandwidth => *
+ | *
+ | *
+ | *
+ | *
+ | *
+ | *
+ | *
+ throttle bandwidth => o *
+ | o *
+ | o *
+ | o *
+ | o *
+ | o *
+ | o *
+------------------------------------------------+-------------o---------------*|
+ R A BT
+
+So R:bdi_dirty_pages will grow large. As task_weight_A and task_weight_B
+converge to 50%, the points A, B will go towards each other (fig.4) and
+eventually coincide with each other. R will stabilize around A-A/32 where
+A=B=T-0.5*T/16. throttle_bandwidth will stabilize around bdi_bandwidth/2.
+
+Note that the application "think+dirty time" is ignored for simplicity in the
+above discussions. With non-zero user space think time, the balance point will
+drift slightly, which is otherwise not a big deal.
+
+Fig.4 the two dd's converging to the same bandwidth
+
+ |
+ throttle bandwidth => *
+ | *
+ throttle bandwidth => o *
+ | o *
+ | o *
+ | o *
+ | o *
+ | o *
+---------------------------------------------------------+-----------o---*-----|
+ R A B T
+
+There won't be big oscillations between A and B, because as soon as A coincides
+with B, their throttle_bandwidth and hence dirty speed will be equal, A's
+weight will stop decreasing and B's weight will stop growing, so the two points
+won't keep moving and cross each other.
+
+Sure there are always oscillations of bdi_dirty_pages as long as the dirtier
+task alternately dirties and pauses. But they will be bounded. When there is 1
+heavy dirtier, the error bound will be (pause_time * bdi_bandwidth). When there
+are 2 heavy dirtiers, the max error is 2 * (pause_time * bdi_bandwidth/2),
+which remains the same as in the 1 dirtier case (given the same pause time). In
+fact the more dirtier tasks there are, the smaller the error will be, since the
+dirtier tasks are not likely to sleep at the same time.
+
+References
+----------
+
+Smarter write throttling
+http://lwn.net/Articles/245600/
+
+Flushing out pdflush
+http://lwn.net/Articles/326552/
+
+Dirty throttling slides
+http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling.pdf
--
* [PATCH 04/47] writeback: consolidate variable names in balance_dirty_pages()
From: Wu Fengguang @ 2010-12-13 6:42 UTC (permalink / raw)
To: Andrew Morton
Cc: Jan Kara, Wu Fengguang, Christoph Hellwig, Trond Myklebust,
Dave Chinner, Theodore Ts'o, Chris Mason, Peter Zijlstra,
Mel Gorman, Rik van Riel, KOSAKI Motohiro, Greg Thelen,
Minchan Kim, linux-mm, linux-fsdevel, LKML
[-- Attachment #1: writeback-cleanup-name-merge.patch --]
[-- Type: text/plain, Size: 3541 bytes --]
There are lots of lengthy tests in the code. Let's compact the variable names:
nr_dirty = NR_FILE_DIRTY + NR_WRITEBACK + NR_UNSTABLE_NFS
balance_dirty_pages() only cares about the above dirty sum, except
in one place -- when starting background writeback.
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
mm/page-writeback.c | 30 ++++++++++++++----------------
1 file changed, 14 insertions(+), 16 deletions(-)
--- linux-next.orig/mm/page-writeback.c 2010-12-08 22:44:22.000000000 +0800
+++ linux-next/mm/page-writeback.c 2010-12-08 22:44:22.000000000 +0800
@@ -497,8 +497,9 @@ unsigned long bdi_dirty_limit(struct bac
static void balance_dirty_pages(struct address_space *mapping,
unsigned long pages_dirtied)
{
- long nr_reclaimable, bdi_nr_reclaimable;
- long nr_writeback, bdi_nr_writeback;
+ long nr_reclaimable;
+ long nr_dirty;
+ long bdi_dirty; /* = file_dirty + writeback + unstable_nfs */
unsigned long background_thresh;
unsigned long dirty_thresh;
unsigned long bdi_thresh;
@@ -516,7 +517,7 @@ static void balance_dirty_pages(struct a
*/
nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
global_page_state(NR_UNSTABLE_NFS);
- nr_writeback = global_page_state(NR_WRITEBACK);
+ nr_dirty = nr_reclaimable + global_page_state(NR_WRITEBACK);
global_dirty_limits(&background_thresh, &dirty_thresh);
@@ -525,12 +526,10 @@ static void balance_dirty_pages(struct a
* catch-up. This avoids (excessively) small writeouts
* when the bdi limits are ramping up.
*/
- if (nr_reclaimable + nr_writeback <=
- (background_thresh + dirty_thresh) / 2)
+ if (nr_dirty <= (background_thresh + dirty_thresh) / 2)
break;
- bdi_thresh = bdi_dirty_limit(bdi, dirty_thresh,
- nr_reclaimable + nr_writeback);
+ bdi_thresh = bdi_dirty_limit(bdi, dirty_thresh, nr_dirty);
bdi_thresh = task_dirty_limit(current, bdi_thresh);
/*
@@ -544,21 +543,21 @@ static void balance_dirty_pages(struct a
* deltas.
*/
if (bdi_thresh < 2*bdi_stat_error(bdi)) {
- bdi_nr_reclaimable = bdi_stat_sum(bdi, BDI_RECLAIMABLE);
- bdi_nr_writeback = bdi_stat_sum(bdi, BDI_WRITEBACK);
+ bdi_dirty = bdi_stat_sum(bdi, BDI_RECLAIMABLE) +
+ bdi_stat_sum(bdi, BDI_WRITEBACK);
} else {
- bdi_nr_reclaimable = bdi_stat(bdi, BDI_RECLAIMABLE);
- bdi_nr_writeback = bdi_stat(bdi, BDI_WRITEBACK);
+ bdi_dirty = bdi_stat(bdi, BDI_RECLAIMABLE) +
+ bdi_stat(bdi, BDI_WRITEBACK);
}
- if (bdi_nr_reclaimable + bdi_nr_writeback >= bdi_thresh) {
+ if (bdi_dirty >= bdi_thresh) {
pause = HZ/10;
goto pause;
}
bw = 100 << 20; /* use static 100MB/s for the moment */
- bw = bw * (bdi_thresh - (bdi_nr_reclaimable + bdi_nr_writeback));
+ bw = bw * (bdi_thresh - bdi_dirty);
bw = bw / (bdi_thresh / TASK_SOFT_DIRTY_LIMIT + 1);
pause = HZ * (pages_dirtied << PAGE_CACHE_SHIFT) / (bw + 1);
@@ -574,9 +573,8 @@ pause:
* bdi or process from holding back light ones; The latter is
* the last resort safeguard.
*/
- dirty_exceeded =
- (bdi_nr_reclaimable + bdi_nr_writeback > bdi_thresh)
- || (nr_reclaimable + nr_writeback > dirty_thresh);
+ dirty_exceeded = (bdi_dirty > bdi_thresh) ||
+ (nr_dirty > dirty_thresh);
if (!dirty_exceeded)
break;
--
* [PATCH 05/47] writeback: per-task rate limit on balance_dirty_pages()
From: Wu Fengguang @ 2010-12-13 6:42 UTC (permalink / raw)
To: Andrew Morton
Cc: Jan Kara, Wu Fengguang, Christoph Hellwig, Trond Myklebust,
Dave Chinner, Theodore Ts'o, Chris Mason, Peter Zijlstra,
Mel Gorman, Rik van Riel, KOSAKI Motohiro, Greg Thelen,
Minchan Kim, linux-mm, linux-fsdevel, LKML
[-- Attachment #1: writeback-per-task-dirty-count.patch --]
[-- Type: text/plain, Size: 9629 bytes --]
Try to limit the dirty throttle pause time to the range [1 jiffy, 100 ms],
by controlling how many pages can be dirtied before inserting a pause.
The dirty count will be directly billed to the task struct. Slow start
and quick back off are employed, so that the stable range will be biased
towards less than 50ms. Another intention is fine timing control for
slow devices, which may need to do full 100ms pauses for every single page.
The switch from a per-cpu to a per-task rate limit makes it easier to exceed
the global dirty limit with a fork bomb, where each new task dirties 1 page,
sleeps for 10 minutes and then continues to dirty 1000 more pages. The caveat
is, when it dirties the first page, it may be granted a high nr_dirtied_pause
because nr_dirty is still low at that time. In this way lots of tasks
get free tickets to dirty more pages than allowed. The solution is
to disable rate limiting (ie. to ignore nr_dirtied_pause) entirely once
the bdi becomes dirty exceeded.
Note that some filesystems will dirty a batch of pages before calling
balance_dirty_pages_ratelimited_nr(). This saves a little CPU overhead
at the cost of possibly overrunning the dirty limits a bit and/or, in the
case of very slow devices, pausing the application for much more than
100ms at a time. This is a trade-off, and seems a reasonable optimization
as long as the batch size is kept within a dozen pages.
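To illustrate the slow start / quick back off with made-up numbers (the
update rules themselves are the ones added to balance_dirty_pages() below):
on a pause of exactly 1 jiffy, nr_dirtied_pause grows by 1/32, e.g.
256 -> 264 -> 272 -> ..., slowly lengthening the next pause; as soon as a
round hits the 100ms cap, it is cut by roughly 1/4 via i - (i+2)/4, e.g.
512 -> 384, which biases the steady state below 50ms. On a very slow device
that keeps hitting 100ms pauses, the same rule decays
256 -> 192 -> 144 -> 108 -> 81 -> ... and bottoms out at 1 (1 - (1+2)/4 = 1),
giving the "1 page per 100ms" behavior mentioned above.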
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
include/linux/sched.h | 7 ++
mm/memory_hotplug.c | 3
mm/page-writeback.c | 126 ++++++++++++++++++----------------------
3 files changed, 65 insertions(+), 71 deletions(-)
--- linux-next.orig/include/linux/sched.h 2010-12-08 22:43:58.000000000 +0800
+++ linux-next/include/linux/sched.h 2010-12-08 22:44:23.000000000 +0800
@@ -1471,6 +1471,13 @@ struct task_struct {
int make_it_fail;
#endif
struct prop_local_single dirties;
+ /*
+ * when (nr_dirtied >= nr_dirtied_pause), it's time to call
+ * balance_dirty_pages() for some dirty throttling pause
+ */
+ int nr_dirtied;
+ int nr_dirtied_pause;
+
#ifdef CONFIG_LATENCYTOP
int latency_record_count;
struct latency_record latency_record[LT_SAVECOUNT];
--- linux-next.orig/mm/page-writeback.c 2010-12-08 22:44:22.000000000 +0800
+++ linux-next/mm/page-writeback.c 2010-12-08 22:44:23.000000000 +0800
@@ -36,12 +36,6 @@
#include <linux/pagevec.h>
#include <trace/events/writeback.h>
-/*
- * After a CPU has dirtied this many pages, balance_dirty_pages_ratelimited
- * will look to see if it needs to force writeback or throttling.
- */
-static long ratelimit_pages = 32;
-
/* The following parameters are exported via /proc/sys/vm */
/*
@@ -488,6 +482,40 @@ unsigned long bdi_dirty_limit(struct bac
}
/*
+ * After a task dirtied this many pages, balance_dirty_pages_ratelimited_nr()
+ * will look to see if it needs to start dirty throttling.
+ *
+ * If ratelimit_pages is too low then big NUMA machines will call the expensive
+ * global_page_state() too often. So scale it adaptively to the safety margin
+ * (the number of pages we may dirty without exceeding the dirty limits).
+ */
+static unsigned long ratelimit_pages(struct backing_dev_info *bdi)
+{
+ unsigned long background_thresh;
+ unsigned long dirty_thresh;
+ unsigned long dirty_pages;
+
+ global_dirty_limits(&background_thresh, &dirty_thresh);
+ dirty_pages = global_page_state(NR_FILE_DIRTY) +
+ global_page_state(NR_WRITEBACK) +
+ global_page_state(NR_UNSTABLE_NFS);
+
+ if (dirty_pages <= (dirty_thresh + background_thresh) / 2)
+ goto out;
+
+ dirty_thresh = bdi_dirty_limit(bdi, dirty_thresh, dirty_pages);
+ dirty_pages = bdi_stat(bdi, BDI_RECLAIMABLE) +
+ bdi_stat(bdi, BDI_WRITEBACK);
+
+ if (dirty_pages < dirty_thresh)
+ goto out;
+
+ return 1;
+out:
+ return 1 + int_sqrt(dirty_thresh - dirty_pages);
+}
+
+/*
* balance_dirty_pages() must be called by processes which are generating dirty
* data. It looks at the number of dirty pages in the machine and will force
* the caller to perform writeback if the system is over `vm_dirty_ratio'.
@@ -504,7 +532,7 @@ static void balance_dirty_pages(struct a
unsigned long dirty_thresh;
unsigned long bdi_thresh;
unsigned long bw;
- unsigned long pause;
+ unsigned long pause = 0;
bool dirty_exceeded = false;
struct backing_dev_info *bdi = mapping->backing_dev_info;
@@ -586,6 +614,17 @@ pause:
if (!dirty_exceeded && bdi->dirty_exceeded)
bdi->dirty_exceeded = 0;
+ if (pause == 0 && nr_dirty < background_thresh)
+ current->nr_dirtied_pause = ratelimit_pages(bdi);
+ else if (pause == 1)
+ current->nr_dirtied_pause += current->nr_dirtied_pause >> 5;
+ else if (pause >= HZ/10)
+ /*
+ * when repeated, writing 1 page per 100ms on slow devices,
+ * i-(i+2)/4 will be able to reach 1 but never reduce to 0.
+ */
+ current->nr_dirtied_pause -= (current->nr_dirtied_pause+2) >> 2;
+
if (writeback_in_progress(bdi))
return;
@@ -612,8 +651,6 @@ void set_page_dirty_balance(struct page
}
}
-static DEFINE_PER_CPU(unsigned long, bdp_ratelimits) = 0;
-
/**
* balance_dirty_pages_ratelimited_nr - balance dirty memory state
* @mapping: address_space which was dirtied
@@ -623,36 +660,30 @@ static DEFINE_PER_CPU(unsigned long, bdp
* which was newly dirtied. The function will periodically check the system's
* dirty state and will initiate writeback if needed.
*
- * On really big machines, get_writeback_state is expensive, so try to avoid
+ * On really big machines, global_page_state() is expensive, so try to avoid
* calling it too often (ratelimiting). But once we're over the dirty memory
- * limit we decrease the ratelimiting by a lot, to prevent individual processes
- * from overshooting the limit by (ratelimit_pages) each.
+ * limit we disable the ratelimiting, to prevent individual processes from
+ * overshooting the limit by (ratelimit_pages) each.
*/
void balance_dirty_pages_ratelimited_nr(struct address_space *mapping,
unsigned long nr_pages_dirtied)
{
- unsigned long ratelimit;
- unsigned long *p;
+ struct backing_dev_info *bdi = mapping->backing_dev_info;
+
+ current->nr_dirtied += nr_pages_dirtied;
- ratelimit = ratelimit_pages;
- if (mapping->backing_dev_info->dirty_exceeded)
- ratelimit = 8;
+ if (unlikely(!current->nr_dirtied_pause))
+ current->nr_dirtied_pause = ratelimit_pages(bdi);
/*
* Check the rate limiting. Also, we do not want to throttle real-time
* tasks in balance_dirty_pages(). Period.
*/
- preempt_disable();
- p = &__get_cpu_var(bdp_ratelimits);
- *p += nr_pages_dirtied;
- if (unlikely(*p >= ratelimit)) {
- ratelimit = *p;
- *p = 0;
- preempt_enable();
- balance_dirty_pages(mapping, ratelimit);
- return;
+ if (unlikely(current->nr_dirtied >= current->nr_dirtied_pause ||
+ bdi->dirty_exceeded)) {
+ balance_dirty_pages(mapping, current->nr_dirtied);
+ current->nr_dirtied = 0;
}
- preempt_enable();
}
EXPORT_SYMBOL(balance_dirty_pages_ratelimited_nr);
@@ -740,44 +771,6 @@ void laptop_sync_completion(void)
#endif
/*
- * If ratelimit_pages is too high then we can get into dirty-data overload
- * if a large number of processes all perform writes at the same time.
- * If it is too low then SMP machines will call the (expensive)
- * get_writeback_state too often.
- *
- * Here we set ratelimit_pages to a level which ensures that when all CPUs are
- * dirtying in parallel, we cannot go more than 3% (1/32) over the dirty memory
- * thresholds before writeback cuts in.
- *
- * But the limit should not be set too high. Because it also controls the
- * amount of memory which the balance_dirty_pages() caller has to write back.
- * If this is too large then the caller will block on the IO queue all the
- * time. So limit it to four megabytes - the balance_dirty_pages() caller
- * will write six megabyte chunks, max.
- */
-
-void writeback_set_ratelimit(void)
-{
- ratelimit_pages = vm_total_pages / (num_online_cpus() * 32);
- if (ratelimit_pages < 16)
- ratelimit_pages = 16;
- if (ratelimit_pages * PAGE_CACHE_SIZE > 4096 * 1024)
- ratelimit_pages = (4096 * 1024) / PAGE_CACHE_SIZE;
-}
-
-static int __cpuinit
-ratelimit_handler(struct notifier_block *self, unsigned long u, void *v)
-{
- writeback_set_ratelimit();
- return NOTIFY_DONE;
-}
-
-static struct notifier_block __cpuinitdata ratelimit_nb = {
- .notifier_call = ratelimit_handler,
- .next = NULL,
-};
-
-/*
* Called early on to tune the page writeback dirty limits.
*
* We used to scale dirty pages according to how total memory
@@ -799,9 +792,6 @@ void __init page_writeback_init(void)
{
int shift;
- writeback_set_ratelimit();
- register_cpu_notifier(&ratelimit_nb);
-
shift = calc_period_shift();
prop_descriptor_init(&vm_completions, shift);
prop_descriptor_init(&vm_dirties, shift);
--- linux-next.orig/mm/memory_hotplug.c 2010-12-08 22:43:58.000000000 +0800
+++ linux-next/mm/memory_hotplug.c 2010-12-08 22:44:23.000000000 +0800
@@ -446,8 +446,6 @@ int online_pages(unsigned long pfn, unsi
vm_total_pages = nr_free_pagecache_pages();
- writeback_set_ratelimit();
-
if (onlined_pages)
memory_notify(MEM_ONLINE, &arg);
@@ -877,7 +875,6 @@ repeat:
}
vm_total_pages = nr_free_pagecache_pages();
- writeback_set_ratelimit();
memory_notify(MEM_OFFLINE, &arg);
unlock_system_sleep();
--
* [PATCH 06/47] writeback: prevent duplicate balance_dirty_pages_ratelimited() calls
2010-12-13 6:42 [PATCH 00/47] IO-less dirty throttling v3 Wu Fengguang
` (4 preceding siblings ...)
2010-12-13 6:42 ` [PATCH 05/47] writeback: per-task rate limit on balance_dirty_pages() Wu Fengguang
@ 2010-12-13 6:42 ` Wu Fengguang
2010-12-13 6:42 ` [PATCH 07/47] writeback: account per-bdi accumulated written pages Wu Fengguang
` (41 subsequent siblings)
47 siblings, 0 replies; 51+ messages in thread
From: Wu Fengguang @ 2010-12-13 6:42 UTC (permalink / raw)
To: Andrew Morton
Cc: Jan Kara, Wu Fengguang, Christoph Hellwig, Trond Myklebust,
Dave Chinner, Theodore Ts'o, Chris Mason, Peter Zijlstra,
Mel Gorman, Rik van Riel, KOSAKI Motohiro, Greg Thelen,
Minchan Kim, linux-mm, linux-fsdevel, LKML
[-- Attachment #1: writeback-fix-duplicate-bdp-calls.patch --]
[-- Type: text/plain, Size: 1421 bytes --]
When dd'ing in 512-byte chunks, balance_dirty_pages_ratelimited() used
to be called 8 times for the same page (once per 512-byte segment of a
4KB page), even though the page is only dirtied once. Fix it with a
(slightly racy) PageDirty() test.
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
mm/filemap.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
--- linux-next.orig/mm/filemap.c 2010-12-08 22:43:58.000000000 +0800
+++ linux-next/mm/filemap.c 2010-12-08 22:44:23.000000000 +0800
@@ -2244,6 +2244,7 @@ static ssize_t generic_perform_write(str
long status = 0;
ssize_t written = 0;
unsigned int flags = 0;
+ unsigned int dirty;
/*
* Copies from kernel address space cannot fail (NFSD is a big user).
@@ -2292,6 +2293,7 @@ again:
pagefault_enable();
flush_dcache_page(page);
+ dirty = PageDirty(page);
mark_page_accessed(page);
status = a_ops->write_end(file, mapping, pos, bytes, copied,
page, fsdata);
@@ -2318,7 +2320,8 @@ again:
pos += copied;
written += copied;
- balance_dirty_pages_ratelimited(mapping);
+ if (!dirty)
+ balance_dirty_pages_ratelimited(mapping);
} while (iov_iter_count(i));
--
* [PATCH 07/47] writeback: account per-bdi accumulated written pages
2010-12-13 6:42 [PATCH 00/47] IO-less dirty throttling v3 Wu Fengguang
` (5 preceding siblings ...)
2010-12-13 6:42 ` [PATCH 06/47] writeback: prevent duplicate balance_dirty_pages_ratelimited() calls Wu Fengguang
@ 2010-12-13 6:42 ` Wu Fengguang
2010-12-13 6:42 ` [PATCH 08/47] writeback: bdi write bandwidth estimation Wu Fengguang
` (40 subsequent siblings)
47 siblings, 0 replies; 51+ messages in thread
From: Wu Fengguang @ 2010-12-13 6:42 UTC (permalink / raw)
To: Andrew Morton
Cc: Jan Kara, Wu Fengguang, Christoph Hellwig, Trond Myklebust,
Dave Chinner, Theodore Ts'o, Chris Mason, Peter Zijlstra,
Mel Gorman, Rik van Riel, KOSAKI Motohiro, Greg Thelen,
Minchan Kim, linux-mm, linux-fsdevel, LKML
[-- Attachment #1: writeback-bdi-written.patch --]
[-- Type: text/plain, Size: 2430 bytes --]
From: Jan Kara <jack@suse.cz>
Introduce the BDI_WRITTEN counter. It will be used for estimating the
bdi's write bandwidth.
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
include/linux/backing-dev.h | 1 +
mm/backing-dev.c | 6 ++++--
mm/page-writeback.c | 1 +
3 files changed, 6 insertions(+), 2 deletions(-)
--- linux-next.orig/include/linux/backing-dev.h 2010-12-08 22:43:58.000000000 +0800
+++ linux-next/include/linux/backing-dev.h 2010-12-08 22:44:24.000000000 +0800
@@ -40,6 +40,7 @@ typedef int (congested_fn)(void *, int);
enum bdi_stat_item {
BDI_RECLAIMABLE,
BDI_WRITEBACK,
+ BDI_WRITTEN,
NR_BDI_STAT_ITEMS
};
--- linux-next.orig/mm/backing-dev.c 2010-12-08 22:44:21.000000000 +0800
+++ linux-next/mm/backing-dev.c 2010-12-08 22:44:24.000000000 +0800
@@ -92,6 +92,7 @@ static int bdi_debug_stats_show(struct s
"BdiDirtyThresh: %8lu kB\n"
"DirtyThresh: %8lu kB\n"
"BackgroundThresh: %8lu kB\n"
+ "BdiWritten: %8lu kB\n"
"b_dirty: %8lu\n"
"b_io: %8lu\n"
"b_more_io: %8lu\n"
@@ -99,8 +100,9 @@ static int bdi_debug_stats_show(struct s
"state: %8lx\n",
(unsigned long) K(bdi_stat(bdi, BDI_WRITEBACK)),
(unsigned long) K(bdi_stat(bdi, BDI_RECLAIMABLE)),
- K(bdi_thresh), K(dirty_thresh),
- K(background_thresh), nr_dirty, nr_io, nr_more_io,
+ K(bdi_thresh), K(dirty_thresh), K(background_thresh),
+ (unsigned long) K(bdi_stat(bdi, BDI_WRITTEN)),
+ nr_dirty, nr_io, nr_more_io,
!list_empty(&bdi->bdi_list), bdi->state);
#undef K
--- linux-next.orig/mm/page-writeback.c 2010-12-08 22:44:23.000000000 +0800
+++ linux-next/mm/page-writeback.c 2010-12-08 22:44:24.000000000 +0800
@@ -1329,6 +1329,7 @@ int test_clear_page_writeback(struct pag
PAGECACHE_TAG_WRITEBACK);
if (bdi_cap_account_writeback(bdi)) {
__dec_bdi_stat(bdi, BDI_WRITEBACK);
+ __inc_bdi_stat(bdi, BDI_WRITTEN);
__bdi_writeout_inc(bdi);
}
}
--
* [PATCH 08/47] writeback: bdi write bandwidth estimation
2010-12-13 6:42 [PATCH 00/47] IO-less dirty throttling v3 Wu Fengguang
` (6 preceding siblings ...)
2010-12-13 6:42 ` [PATCH 07/47] writeback: account per-bdi accumulated written pages Wu Fengguang
@ 2010-12-13 6:42 ` Wu Fengguang
2010-12-13 6:42 ` [PATCH 09/47] writeback: show bdi write bandwidth in debugfs Wu Fengguang
` (39 subsequent siblings)
47 siblings, 0 replies; 51+ messages in thread
From: Wu Fengguang @ 2010-12-13 6:42 UTC (permalink / raw)
To: Andrew Morton
Cc: Jan Kara, Li Shaohua, Wu Fengguang, Christoph Hellwig,
Trond Myklebust, Dave Chinner, Theodore Ts'o, Chris Mason,
Peter Zijlstra, Mel Gorman, Rik van Riel, KOSAKI Motohiro,
Greg Thelen, Minchan Kim, linux-mm, linux-fsdevel, LKML
[-- Attachment #1: writeback-bandwidth-estimation-in-flusher.patch --]
[-- Type: text/plain, Size: 6519 bytes --]
The estimation value will start from 100MB/s and adapt to the real
bandwidth in seconds. It's pretty accurate for common filesystems.
As the first use case, it replaces the fixed 100MB/s value used for
throttle bandwidth calculation in balance_dirty_pages().
The overhead won't be high because the bdi bandwidth update only occurs
at intervals of 10ms or more.
At first it was only estimated in balance_dirty_pages(), because that is
the most reliable place to observe reasonably large bandwidth -- the bdi
is normally fully utilized when bdi_thresh is reached.
Then Shaohua recommended also doing it in the flusher thread, to keep
the value updated when there is only periodic/background writeback and
no task is being throttled.
The estimation cannot be done purely in the flusher thread because that
is not sufficient for NFS: NFS writeback won't block at
get_request_wait(), so it tends to complete quickly. Another problem is
that slow devices may take dozens of seconds to write the initial 64MB
chunk (write_bandwidth starts at 100MB/s, which translates to a 64MB
nr_to_write). So it may take more than a minute to adapt to the small
real bandwidth if the estimate is only updated in the flusher thread.
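The estimate itself (see the patch below, rounding terms omitted) is a
simple weighted moving average. Every elapsed >= 10ms, the instant
bandwidth is computed as
	bw = written * PAGE_CACHE_SIZE * HZ / elapsed	(bytes/second)
and folded into the running estimate as
	write_bandwidth = (write_bandwidth * (1024 - w) + bw * w) / 1024
	w = min(elapsed / (HZ/100), 128)
so a single sample can move the estimate by at most 128/1024 = 12.5%,
which is why it converges from the initial 100MB/s within seconds
instead of jumping around.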
CC: Li Shaohua <shaohua.li@intel.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
fs/fs-writeback.c | 5 ++++
include/linux/backing-dev.h | 2 +
include/linux/writeback.h | 3 ++
mm/backing-dev.c | 1
mm/page-writeback.c | 41 +++++++++++++++++++++++++++++++++-
5 files changed, 51 insertions(+), 1 deletion(-)
--- linux-next.orig/include/linux/backing-dev.h 2010-12-08 22:44:24.000000000 +0800
+++ linux-next/include/linux/backing-dev.h 2010-12-08 22:44:24.000000000 +0800
@@ -75,6 +75,8 @@ struct backing_dev_info {
struct percpu_counter bdi_stat[NR_BDI_STAT_ITEMS];
struct prop_local_percpu completions;
+ unsigned long write_bandwidth_update_time;
+ int write_bandwidth;
int dirty_exceeded;
unsigned int min_ratio;
--- linux-next.orig/mm/backing-dev.c 2010-12-08 22:44:24.000000000 +0800
+++ linux-next/mm/backing-dev.c 2010-12-08 22:44:24.000000000 +0800
@@ -660,6 +660,7 @@ int bdi_init(struct backing_dev_info *bd
goto err;
}
+ bdi->write_bandwidth = 100 << 20;
bdi->dirty_exceeded = 0;
err = prop_local_init_percpu(&bdi->completions);
--- linux-next.orig/fs/fs-writeback.c 2010-12-08 22:44:22.000000000 +0800
+++ linux-next/fs/fs-writeback.c 2010-12-08 22:44:24.000000000 +0800
@@ -635,6 +635,8 @@ static long wb_writeback(struct bdi_writ
.range_cyclic = work->range_cyclic,
};
unsigned long oldest_jif;
+ unsigned long bw_time;
+ s64 bw_written = 0;
long wrote = 0;
long write_chunk;
struct inode *inode;
@@ -668,6 +670,8 @@ static long wb_writeback(struct bdi_writ
write_chunk = LONG_MAX;
wbc.wb_start = jiffies; /* livelock avoidance */
+ bdi_update_write_bandwidth(wb->bdi, &bw_time, &bw_written);
+
for (;;) {
/*
* Stop writeback when nr_pages has been consumed
@@ -702,6 +706,7 @@ static long wb_writeback(struct bdi_writ
else
writeback_inodes_wb(wb, &wbc);
trace_wbc_writeback_written(&wbc, wb->bdi);
+ bdi_update_write_bandwidth(wb->bdi, &bw_time, &bw_written);
work->nr_pages -= write_chunk - wbc.nr_to_write;
wrote += write_chunk - wbc.nr_to_write;
--- linux-next.orig/mm/page-writeback.c 2010-12-08 22:44:24.000000000 +0800
+++ linux-next/mm/page-writeback.c 2010-12-08 22:44:24.000000000 +0800
@@ -515,6 +515,41 @@ out:
return 1 + int_sqrt(dirty_thresh - dirty_pages);
}
+void bdi_update_write_bandwidth(struct backing_dev_info *bdi,
+ unsigned long *bw_time,
+ s64 *bw_written)
+{
+ unsigned long written;
+ unsigned long elapsed;
+ unsigned long bw;
+ unsigned long w;
+
+ if (*bw_written == 0)
+ goto snapshot;
+
+ elapsed = jiffies - *bw_time;
+ if (elapsed < HZ/100)
+ return;
+
+ /*
+ * When there lots of tasks throttled in balance_dirty_pages(), they
+ * will each try to update the bandwidth for the same period, making
+ * the bandwidth drift much faster than the desired rate (as in the
+ * single dirtier case). So do some rate limiting.
+ */
+ if (jiffies - bdi->write_bandwidth_update_time < elapsed)
+ goto snapshot;
+
+ written = percpu_counter_read(&bdi->bdi_stat[BDI_WRITTEN]) - *bw_written;
+ bw = (HZ * PAGE_CACHE_SIZE * written + elapsed/2) / elapsed;
+ w = min(elapsed / (HZ/100), 128UL);
+ bdi->write_bandwidth = (bdi->write_bandwidth * (1024-w) + bw * w) >> 10;
+ bdi->write_bandwidth_update_time = jiffies;
+snapshot:
+ *bw_written = percpu_counter_read(&bdi->bdi_stat[BDI_WRITTEN]);
+ *bw_time = jiffies;
+}
+
/*
* balance_dirty_pages() must be called by processes which are generating dirty
* data. It looks at the number of dirty pages in the machine and will force
@@ -535,6 +570,8 @@ static void balance_dirty_pages(struct a
unsigned long pause = 0;
bool dirty_exceeded = false;
struct backing_dev_info *bdi = mapping->backing_dev_info;
+ unsigned long bw_time;
+ s64 bw_written = 0;
for (;;) {
/*
@@ -583,7 +620,7 @@ static void balance_dirty_pages(struct a
goto pause;
}
- bw = 100 << 20; /* use static 100MB/s for the moment */
+ bw = bdi->write_bandwidth;
bw = bw * (bdi_thresh - bdi_dirty);
bw = bw / (bdi_thresh / TASK_SOFT_DIRTY_LIMIT + 1);
@@ -592,8 +629,10 @@ static void balance_dirty_pages(struct a
pause = clamp_val(pause, 1, HZ/10);
pause:
+ bdi_update_write_bandwidth(bdi, &bw_time, &bw_written);
__set_current_state(TASK_INTERRUPTIBLE);
io_schedule_timeout(pause);
+ bdi_update_write_bandwidth(bdi, &bw_time, &bw_written);
/*
* The bdi thresh is somehow "soft" limit derived from the
--- linux-next.orig/include/linux/writeback.h 2010-12-08 22:44:22.000000000 +0800
+++ linux-next/include/linux/writeback.h 2010-12-08 22:44:24.000000000 +0800
@@ -138,6 +138,9 @@ void global_dirty_limits(unsigned long *
unsigned long bdi_dirty_limit(struct backing_dev_info *bdi,
unsigned long dirty,
unsigned long dirty_pages);
+void bdi_update_write_bandwidth(struct backing_dev_info *bdi,
+ unsigned long *bw_time,
+ s64 *bw_written);
void page_writeback_init(void);
void balance_dirty_pages_ratelimited_nr(struct address_space *mapping,
--
* [PATCH 09/47] writeback: show bdi write bandwidth in debugfs
2010-12-13 6:42 [PATCH 00/47] IO-less dirty throttling v3 Wu Fengguang
` (7 preceding siblings ...)
2010-12-13 6:42 ` [PATCH 08/47] writeback: bdi write bandwidth estimation Wu Fengguang
@ 2010-12-13 6:42 ` Wu Fengguang
2010-12-13 6:42 ` [PATCH 10/47] writeback: quit throttling when bdi dirty pages dropped low Wu Fengguang
` (38 subsequent siblings)
47 siblings, 0 replies; 51+ messages in thread
From: Wu Fengguang @ 2010-12-13 6:42 UTC (permalink / raw)
To: Andrew Morton
Cc: Jan Kara, Theodore Tso, Peter Zijlstra, Wu Fengguang,
Christoph Hellwig, Trond Myklebust, Dave Chinner, Chris Mason,
Mel Gorman, Rik van Riel, KOSAKI Motohiro, Greg Thelen,
Minchan Kim, linux-mm, linux-fsdevel, LKML
[-- Attachment #1: writeback-bandwidth-show.patch --]
[-- Type: text/plain, Size: 2330 bytes --]
Add a "BdiWriteBandwidth" entry (and indent others) in /debug/bdi/*/stats.
btw increase digital field width to 10, for keeping the possibly
huge BdiWritten number aligned at least for desktop systems.
This will break user space tools if they are dumb enough to depend on
the number of white spaces.
CC: Theodore Ts'o <tytso@mit.edu>
CC: Jan Kara <jack@suse.cz>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
mm/backing-dev.c | 24 +++++++++++++-----------
1 file changed, 13 insertions(+), 11 deletions(-)
--- linux-next.orig/mm/backing-dev.c 2010-12-08 22:44:24.000000000 +0800
+++ linux-next/mm/backing-dev.c 2010-12-08 22:44:24.000000000 +0800
@@ -87,21 +87,23 @@ static int bdi_debug_stats_show(struct s
#define K(x) ((x) << (PAGE_SHIFT - 10))
seq_printf(m,
- "BdiWriteback: %8lu kB\n"
- "BdiReclaimable: %8lu kB\n"
- "BdiDirtyThresh: %8lu kB\n"
- "DirtyThresh: %8lu kB\n"
- "BackgroundThresh: %8lu kB\n"
- "BdiWritten: %8lu kB\n"
- "b_dirty: %8lu\n"
- "b_io: %8lu\n"
- "b_more_io: %8lu\n"
- "bdi_list: %8u\n"
- "state: %8lx\n",
+ "BdiWriteback: %10lu kB\n"
+ "BdiReclaimable: %10lu kB\n"
+ "BdiDirtyThresh: %10lu kB\n"
+ "DirtyThresh: %10lu kB\n"
+ "BackgroundThresh: %10lu kB\n"
+ "BdiWritten: %10lu kB\n"
+ "BdiWriteBandwidth: %10lu kBps\n"
+ "b_dirty: %10lu\n"
+ "b_io: %10lu\n"
+ "b_more_io: %10lu\n"
+ "bdi_list: %10u\n"
+ "state: %10lx\n",
(unsigned long) K(bdi_stat(bdi, BDI_WRITEBACK)),
(unsigned long) K(bdi_stat(bdi, BDI_RECLAIMABLE)),
K(bdi_thresh), K(dirty_thresh), K(background_thresh),
(unsigned long) K(bdi_stat(bdi, BDI_WRITTEN)),
+ (unsigned long) bdi->write_bandwidth >> 10,
nr_dirty, nr_io, nr_more_io,
!list_empty(&bdi->bdi_list), bdi->state);
#undef K
--
* [PATCH 10/47] writeback: quit throttling when bdi dirty pages dropped low
2010-12-13 6:42 [PATCH 00/47] IO-less dirty throttling v3 Wu Fengguang
` (8 preceding siblings ...)
2010-12-13 6:42 ` [PATCH 09/47] writeback: show bdi write bandwidth in debugfs Wu Fengguang
@ 2010-12-13 6:42 ` Wu Fengguang
2010-12-13 6:43 ` [PATCH 11/47] writeback: reduce per-bdi dirty threshold ramp up time Wu Fengguang
` (37 subsequent siblings)
47 siblings, 0 replies; 51+ messages in thread
From: Wu Fengguang @ 2010-12-13 6:42 UTC (permalink / raw)
To: Andrew Morton
Cc: Jan Kara, Wu Fengguang, Christoph Hellwig, Trond Myklebust,
Dave Chinner, Theodore Ts'o, Chris Mason, Peter Zijlstra,
Mel Gorman, Rik van Riel, KOSAKI Motohiro, Greg Thelen,
Minchan Kim, linux-mm, linux-fsdevel, LKML
[-- Attachment #1: writeback-bdi-throttle-break.patch --]
[-- Type: text/plain, Size: 2879 bytes --]
Tests show that bdi_thresh may take minutes to ramp up on a typical
desktop. The time should be improvable but cannot be eliminated totally.
So when (background_thresh + dirty_thresh)/2 is reached and
balance_dirty_pages() starts to throttle the task, it will suddenly find
the (still low and ramping up) bdi_thresh is exceeded _excessively_. Here
we definitely don't want to stall the task for a whole minute (e.g. when
it's writing to a USB stick). So introduce an alternative way to break
out of the loop when the bdi dirty/writeback page count has dropped by a
reasonable amount.
When dirty_background_ratio is set close to dirty_ratio, bdi_thresh may
also be constantly exceeded due to the task_dirty_limit() gap. This is
addressed by another patch to lower the background threshold when
necessary.
It will take at least 100ms before trying to break out.
Note that this opens the chance that, during normal operation, a huge
number of slow dirtiers writing to a really slow device might manage to
outrun bdi_thresh. But the risk is pretty low: it takes at least one
100ms sleep loop to break out, and the global limit is still enforced.
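In code terms, the new break condition is
	bdi_prev_dirty - bdi_dirty > bdi->write_bandwidth >> (PAGE_CACHE_SHIFT + 2)
i.e. the bytes/second bandwidth estimate converted to pages and divided
by 4: the loop is left once more than ~250ms worth of writeback has been
cleaned off the bdi during the pause(s), provided nr_dirty is still
below dirty_thresh.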
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
mm/page-writeback.c | 20 ++++++++++++++++++++
1 file changed, 20 insertions(+)
--- linux-next.orig/mm/page-writeback.c 2010-12-08 22:44:24.000000000 +0800
+++ linux-next/mm/page-writeback.c 2010-12-08 22:44:25.000000000 +0800
@@ -563,6 +563,7 @@ static void balance_dirty_pages(struct a
long nr_reclaimable;
long nr_dirty;
long bdi_dirty; /* = file_dirty + writeback + unstable_nfs */
+ long bdi_prev_dirty = 0;
unsigned long background_thresh;
unsigned long dirty_thresh;
unsigned long bdi_thresh;
@@ -615,6 +616,25 @@ static void balance_dirty_pages(struct a
bdi_stat(bdi, BDI_WRITEBACK);
}
+ /*
+ * bdi_thresh takes time to ramp up from the initial 0,
+ * especially for slow devices.
+ *
+ * It's possible that at the moment dirty throttling starts,
+ * bdi_dirty = nr_dirty
+ * = (background_thresh + dirty_thresh) / 2
+ * >> bdi_thresh
+ * Then the task could be blocked for a dozen second to flush
+ * all the exceeded (bdi_dirty - bdi_thresh) pages. So offer a
+ * complementary way to break out of the loop when 250ms worth
+ * of dirty pages have been cleaned during our pause time.
+ */
+ if (nr_dirty < dirty_thresh &&
+ bdi_prev_dirty - bdi_dirty >
+ bdi->write_bandwidth >> (PAGE_CACHE_SHIFT + 2))
+ break;
+ bdi_prev_dirty = bdi_dirty;
+
if (bdi_dirty >= bdi_thresh) {
pause = HZ/10;
goto pause;
--
* [PATCH 11/47] writeback: reduce per-bdi dirty threshold ramp up time
2010-12-13 6:42 [PATCH 00/47] IO-less dirty throttling v3 Wu Fengguang
` (9 preceding siblings ...)
2010-12-13 6:42 ` [PATCH 10/47] writeback: quit throttling when bdi dirty pages dropped low Wu Fengguang
@ 2010-12-13 6:43 ` Wu Fengguang
2010-12-13 6:43 ` [PATCH 12/47] writeback: make reasonable gap between the dirty/background thresholds Wu Fengguang
` (36 subsequent siblings)
47 siblings, 0 replies; 51+ messages in thread
From: Wu Fengguang @ 2010-12-13 6:43 UTC (permalink / raw)
To: Andrew Morton
Cc: Jan Kara, Peter Zijlstra, Richard Kennedy, Wu Fengguang,
Christoph Hellwig, Trond Myklebust, Dave Chinner,
Theodore Ts'o, Chris Mason, Mel Gorman, Rik van Riel,
KOSAKI Motohiro, Greg Thelen, Minchan Kim, linux-mm,
linux-fsdevel, LKML
[-- Attachment #1: writeback-speedup-per-bdi-threshold-ramp-up.patch --]
[-- Type: text/plain, Size: 1817 bytes --]
Reduce the dampening for the control system, yielding faster
convergence.
Currently it converges at a snail's pace for slow devices (on the order
of minutes). For really fast storage the convergence speed should
already be fine; it makes sense to also make it reasonably fast for
typical desktops.
After the patch, it converges in ~10 seconds for 60MB/s writes and 4GB
mem. So expect ~1s for a fast 600MB/s storage under 4GB mem, or ~4s
under 16GB mem, which seems reasonable.
$ while true; do grep BdiDirtyThresh /debug/bdi/8:0/stats; sleep 1; done
BdiDirtyThresh: 0 kB
BdiDirtyThresh: 118748 kB
BdiDirtyThresh: 214280 kB
BdiDirtyThresh: 303868 kB
BdiDirtyThresh: 376528 kB
BdiDirtyThresh: 411180 kB
BdiDirtyThresh: 448636 kB
BdiDirtyThresh: 472260 kB
BdiDirtyThresh: 490924 kB
BdiDirtyThresh: 499596 kB
BdiDirtyThresh: 507068 kB
...
DirtyThresh: 530392 kB
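The underlying change is a one-liner in calc_period_shift():
2 + ilog2(dirty_total - 1) becomes ilog2(dirty_total - 1) - 1, i.e. the
shift drops by 3 and the period of the vm_completions/vm_dirties
proportion estimation shrinks by a factor of 8, hence the much faster
ramp up shown above.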
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
CC: Richard Kennedy <richard@rsk.demon.co.uk>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
mm/page-writeback.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
--- linux-next.orig/mm/page-writeback.c 2010-12-08 22:44:25.000000000 +0800
+++ linux-next/mm/page-writeback.c 2010-12-08 22:44:25.000000000 +0800
@@ -125,7 +125,7 @@ static int calc_period_shift(void)
else
dirty_total = (vm_dirty_ratio * determine_dirtyable_memory()) /
100;
- return 2 + ilog2(dirty_total - 1);
+ return ilog2(dirty_total - 1) - 1;
}
/*
--
* [PATCH 12/47] writeback: make reasonable gap between the dirty/background thresholds
2010-12-13 6:42 [PATCH 00/47] IO-less dirty throttling v3 Wu Fengguang
` (10 preceding siblings ...)
2010-12-13 6:43 ` [PATCH 11/47] writeback: reduce per-bdi dirty threshold ramp up time Wu Fengguang
@ 2010-12-13 6:43 ` Wu Fengguang
2010-12-13 6:43 ` [PATCH 13/47] writeback: scale down max throttle bandwidth on concurrent dirtiers Wu Fengguang
` (35 subsequent siblings)
47 siblings, 0 replies; 51+ messages in thread
From: Wu Fengguang @ 2010-12-13 6:43 UTC (permalink / raw)
To: Andrew Morton
Cc: Jan Kara, Wu Fengguang, Christoph Hellwig, Trond Myklebust,
Dave Chinner, Theodore Ts'o, Chris Mason, Peter Zijlstra,
Mel Gorman, Rik van Riel, KOSAKI Motohiro, Greg Thelen,
Minchan Kim, linux-mm, linux-fsdevel, LKML
[-- Attachment #1: writeback-fix-oversize-background-thresh.patch --]
[-- Type: text/plain, Size: 1600 bytes --]
The change is virtually a no-op for the majority of users, who use the
default 10/20 background/dirty ratios. As for the others, I don't know
why they would set the background ratio that close to the dirty ratio.
Some may set the background ratio equal to the dirty ratio, but nobody
seems to notice or complain that it is then silently halved under the
hood.
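As a worked example (assuming BDI_SOFT_DIRTY_LIMIT is 8, which is what
the "1/4 gap" comment in the patch implies): the background threshold is
now capped at dirty - dirty*2/8, i.e. 3/4 of the dirty threshold, so
setting dirty_background_ratio equal to a dirty_ratio of 20 yields an
effective background threshold of 15% instead of the silently halved
10%.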
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
mm/page-writeback.c | 11 +++++++++--
1 file changed, 9 insertions(+), 2 deletions(-)
--- linux-next.orig/mm/page-writeback.c 2010-12-08 22:44:25.000000000 +0800
+++ linux-next/mm/page-writeback.c 2010-12-08 22:44:25.000000000 +0800
@@ -403,8 +403,15 @@ void global_dirty_limits(unsigned long *
else
background = (dirty_background_ratio * available_memory) / 100;
- if (background >= dirty)
- background = dirty / 2;
+ /*
+ * Ensure at least 1/4 gap between background and dirty thresholds, so
+ * that when dirty throttling starts at (background + dirty)/2, it's at
+ * the entrance of bdi soft throttle threshold, so as to avoid being
+ * hard throttled.
+ */
+ if (background > dirty - dirty * 2 / BDI_SOFT_DIRTY_LIMIT)
+ background = dirty - dirty * 2 / BDI_SOFT_DIRTY_LIMIT;
+
tsk = current;
if (tsk->flags & PF_LESS_THROTTLE || rt_task(tsk)) {
background += background / 4;
--
* [PATCH 13/47] writeback: scale down max throttle bandwidth on concurrent dirtiers
2010-12-13 6:42 [PATCH 00/47] IO-less dirty throttling v3 Wu Fengguang
` (11 preceding siblings ...)
2010-12-13 6:43 ` [PATCH 12/47] writeback: make reasonable gap between the dirty/background thresholds Wu Fengguang
@ 2010-12-13 6:43 ` Wu Fengguang
2010-12-13 6:43 ` [PATCH 14/47] writeback: add trace event for balance_dirty_pages() Wu Fengguang
` (34 subsequent siblings)
47 siblings, 0 replies; 51+ messages in thread
From: Wu Fengguang @ 2010-12-13 6:43 UTC (permalink / raw)
To: Andrew Morton
Cc: Jan Kara, Wu Fengguang, Christoph Hellwig, Trond Myklebust,
Dave Chinner, Theodore Ts'o, Chris Mason, Peter Zijlstra,
Mel Gorman, Rik van Riel, KOSAKI Motohiro, Greg Thelen,
Minchan Kim, linux-mm, linux-fsdevel, LKML
[-- Attachment #1: writeback-adaptive-throttle-bandwidth.patch --]
[-- Type: text/plain, Size: 3754 bytes --]
This will noticeably reduce the fluctuations of pause time when there
are 100+ concurrent dirtiers.
The more parallel dirtiers (1 dirtier => 4 dirtiers), the smaller the
bandwidth each dirtier gets to share (bdi_bandwidth => bdi_bandwidth/4),
the smaller the gap to the dirty limit ((C-A) => (C-B)), and the less
stable the pause time will be (given the same fluctuation of bdi_dirty).
For example, if A drifts to A', its pause time may drift from 5ms to
6ms, while B drifting to B' may go from 50ms to 90ms. That is a much
larger fluctuation, in relative ratio as well as in absolute time.
Fig.1 before patch, gap (C-B) is too low to get smooth pause time
throttle_bandwidth_A = bdi_bandwidth .........o
                                              | o <= A'
                                              |   o
                                              |     o
                                              |       o
                                              |         o
throttle_bandwidth_B = bdi_bandwidth / 4 .....|...........o
                                              |           | o <= B'
----------------------------------------------+-----------+---o
                                              A           B   C
The solution is to lower the slope of the throttle line accordingly,
which makes B stabilize at a point farther away from C.
Fig.2 after patch
throttle_bandwidth_A = bdi_bandwidth .........o
                                              | o <= A'
                                              |   o
                                              |     o
lowered max throttle bandwidth for B ===>         *   o
                                              |     *   o
throttle_bandwidth_B = bdi_bandwidth / 4 .............*   o
                                              |       | *   o
----------------------------------------------+-------+-------o
                                              A       B       C
Note that C is actually at different points in the 1-dirtier and
4-dirtiers cases, but for easy graphing we draw them at the same place.
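Ignoring the +1 rounding guards, the throttle bandwidth below becomes
	bw = bdi_bw * (bdi_thresh - bdi_dirty) / (bdi_thresh / BDI_SOFT_DIRTY_LIMIT)
	            * (task_thresh - bdi_dirty) / (bdi_thresh / TASK_SOFT_DIRTY_LIMIT)
The extra (bdi_thresh - bdi_dirty) factor is what lowers the line as
bdi_dirty creeps towards bdi_thresh under many concurrent dirtiers --
the '*' line in Fig.2 above.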
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
mm/page-writeback.c | 16 +++++++++++++---
1 file changed, 13 insertions(+), 3 deletions(-)
--- linux-next.orig/mm/page-writeback.c 2010-12-08 22:44:25.000000000 +0800
+++ linux-next/mm/page-writeback.c 2010-12-08 22:44:26.000000000 +0800
@@ -574,6 +574,7 @@ static void balance_dirty_pages(struct a
unsigned long background_thresh;
unsigned long dirty_thresh;
unsigned long bdi_thresh;
+ unsigned long task_thresh;
unsigned long bw;
unsigned long pause = 0;
bool dirty_exceeded = false;
@@ -603,7 +604,7 @@ static void balance_dirty_pages(struct a
break;
bdi_thresh = bdi_dirty_limit(bdi, dirty_thresh, nr_dirty);
- bdi_thresh = task_dirty_limit(current, bdi_thresh);
+ task_thresh = task_dirty_limit(current, bdi_thresh);
/*
* In order to avoid the stacked BDI deadlock we need
@@ -642,14 +643,23 @@ static void balance_dirty_pages(struct a
break;
bdi_prev_dirty = bdi_dirty;
- if (bdi_dirty >= bdi_thresh) {
+ if (bdi_dirty >= task_thresh) {
pause = HZ/10;
goto pause;
}
+ /*
+ * When bdi_dirty grows closer to bdi_thresh, it indicates more
+ * concurrent dirtiers. Proportionally lower the max throttle
+ * bandwidth. This will resist bdi_dirty from approaching to
+ * close to task_thresh, and help reduce fluctuations of pause
+ * time when there are lots of dirtiers.
+ */
bw = bdi->write_bandwidth;
-
bw = bw * (bdi_thresh - bdi_dirty);
+ bw = bw / (bdi_thresh / BDI_SOFT_DIRTY_LIMIT + 1);
+
+ bw = bw * (task_thresh - bdi_dirty);
bw = bw / (bdi_thresh / TASK_SOFT_DIRTY_LIMIT + 1);
pause = HZ * (pages_dirtied << PAGE_CACHE_SHIFT) / (bw + 1);
* [PATCH 14/47] writeback: add trace event for balance_dirty_pages()
2010-12-13 6:42 [PATCH 00/47] IO-less dirty throttling v3 Wu Fengguang
` (12 preceding siblings ...)
2010-12-13 6:43 ` [PATCH 13/47] writeback: scale down max throttle bandwidth on concurrent dirtiers Wu Fengguang
@ 2010-12-13 6:43 ` Wu Fengguang
2010-12-13 6:43 ` [PATCH 15/47] writeback: make nr_to_write a per-file limit Wu Fengguang
` (33 subsequent siblings)
47 siblings, 0 replies; 51+ messages in thread
From: Wu Fengguang @ 2010-12-13 6:43 UTC (permalink / raw)
To: Andrew Morton
Cc: Jan Kara, Wu Fengguang, Christoph Hellwig, Trond Myklebust,
Dave Chinner, Theodore Ts'o, Chris Mason, Peter Zijlstra,
Mel Gorman, Rik van Riel, KOSAKI Motohiro, Greg Thelen,
Minchan Kim, linux-mm, linux-fsdevel, LKML
[-- Attachment #1: writeback-trace-balance_dirty_pages.patch --]
[-- Type: text/plain, Size: 5875 bytes --]
Here is an interesting test to verify the theory with
balance_dirty_pages() tracing. On a partition that can do ~60MB/s, a
sparse file is created and 4 rsync tasks with different write bandwidth
limits are started:
dd if=/dev/zero of=/mnt/1T bs=1M count=1 seek=1024000
echo 1 > /debug/tracing/events/writeback/balance_dirty_pages/enable
rsync localhost:/mnt/1T /mnt/a --bwlimit 10000&
rsync localhost:/mnt/1T /mnt/A --bwlimit 10000&
rsync localhost:/mnt/1T /mnt/b --bwlimit 20000&
rsync localhost:/mnt/1T /mnt/c --bwlimit 30000&
Trace outputs within 0.1 second, grouped by tasks:
rsync-3824 [004] 15002.076447: balance_dirty_pages: bdi=btrfs-2 weight=15% limit=130876 gap=5340 dirtied=192 pause=20
rsync-3822 [003] 15002.091701: balance_dirty_pages: bdi=btrfs-2 weight=15% limit=130777 gap=5113 dirtied=192 pause=20
rsync-3821 [006] 15002.004667: balance_dirty_pages: bdi=btrfs-2 weight=30% limit=129570 gap=3714 dirtied=64 pause=8
rsync-3821 [006] 15002.012654: balance_dirty_pages: bdi=btrfs-2 weight=30% limit=129589 gap=3733 dirtied=64 pause=8
rsync-3821 [006] 15002.021838: balance_dirty_pages: bdi=btrfs-2 weight=30% limit=129604 gap=3748 dirtied=64 pause=8
rsync-3821 [004] 15002.091193: balance_dirty_pages: bdi=btrfs-2 weight=29% limit=129583 gap=3983 dirtied=64 pause=8
rsync-3821 [004] 15002.102729: balance_dirty_pages: bdi=btrfs-2 weight=29% limit=129594 gap=3802 dirtied=64 pause=8
rsync-3821 [000] 15002.109252: balance_dirty_pages: bdi=btrfs-2 weight=29% limit=129619 gap=3827 dirtied=64 pause=8
rsync-3823 [002] 15002.009029: balance_dirty_pages: bdi=btrfs-2 weight=39% limit=128762 gap=2842 dirtied=64 pause=12
rsync-3823 [002] 15002.021598: balance_dirty_pages: bdi=btrfs-2 weight=39% limit=128813 gap=3021 dirtied=64 pause=12
rsync-3823 [003] 15002.032973: balance_dirty_pages: bdi=btrfs-2 weight=39% limit=128805 gap=2885 dirtied=64 pause=12
rsync-3823 [003] 15002.048800: balance_dirty_pages: bdi=btrfs-2 weight=39% limit=128823 gap=2967 dirtied=64 pause=12
rsync-3823 [003] 15002.060728: balance_dirty_pages: bdi=btrfs-2 weight=39% limit=128821 gap=3221 dirtied=64 pause=12
rsync-3823 [000] 15002.073152: balance_dirty_pages: bdi=btrfs-2 weight=39% limit=128825 gap=3225 dirtied=64 pause=12
rsync-3823 [005] 15002.090111: balance_dirty_pages: bdi=btrfs-2 weight=39% limit=128782 gap=3214 dirtied=64 pause=12
rsync-3823 [004] 15002.102520: balance_dirty_pages: bdi=btrfs-2 weight=39% limit=128764 gap=3036 dirtied=64 pause=12
The data vividly show that
- the heaviest writer is throttled a bit (weight=39%)
- the lighter writers run at full speed (weight=15%,15%,30%)
rsync is smart enough to compensate for the in-kernel pause time
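For reference, the task_weight field is computed by the BDP_PERCENT
macro below as (bdi_limit - task_limit) * TASK_SOFT_DIRTY_LIMIT * 100 /
bdi_limit: task_dirty_limit() lowers each task's limit roughly in
proportion to its share of recently dirtied pages, so rescaling that gap
recovers the share as a percentage -- which is also why the four weights
above add up to roughly 100% (15+15+30+39).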
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
include/trace/events/writeback.h | 61 +++++++++++++++++++++++++++--
mm/page-writeback.c | 6 ++
2 files changed, 64 insertions(+), 3 deletions(-)
--- linux-next.orig/include/trace/events/writeback.h 2010-12-08 22:44:20.000000000 +0800
+++ linux-next/include/trace/events/writeback.h 2010-12-08 22:44:26.000000000 +0800
@@ -147,11 +147,66 @@ DEFINE_EVENT(wbc_class, name, \
DEFINE_WBC_EVENT(wbc_writeback_start);
DEFINE_WBC_EVENT(wbc_writeback_written);
DEFINE_WBC_EVENT(wbc_writeback_wait);
-DEFINE_WBC_EVENT(wbc_balance_dirty_start);
-DEFINE_WBC_EVENT(wbc_balance_dirty_written);
-DEFINE_WBC_EVENT(wbc_balance_dirty_wait);
DEFINE_WBC_EVENT(wbc_writepage);
+#define BDP_PERCENT(a, b, c) ((__entry->a - __entry->b) * 100 * c + \
+ __entry->bdi_limit/2) / (__entry->bdi_limit|1)
+TRACE_EVENT(balance_dirty_pages,
+
+ TP_PROTO(struct backing_dev_info *bdi,
+ long bdi_dirty,
+ long bdi_limit,
+ long task_limit,
+ long pages_dirtied,
+ long pause),
+
+ TP_ARGS(bdi, bdi_dirty, bdi_limit, task_limit,
+ pages_dirtied, pause),
+
+ TP_STRUCT__entry(
+ __array(char, bdi, 32)
+ __field(long, bdi_dirty)
+ __field(long, bdi_limit)
+ __field(long, task_limit)
+ __field(long, pages_dirtied)
+ __field(long, pause)
+ ),
+
+ TP_fast_assign(
+ strlcpy(__entry->bdi, dev_name(bdi->dev), 32);
+ __entry->bdi_dirty = bdi_dirty;
+ __entry->bdi_limit = bdi_limit;
+ __entry->task_limit = task_limit;
+ __entry->pages_dirtied = pages_dirtied;
+ __entry->pause = pause * 1000 / HZ;
+ ),
+
+
+ /*
+ * [..............soft throttling range............]
+ * ^ |<=========== bdi_gap =========>|
+ * (background+dirty)/2 |<== task_gap ==>|
+ * -------------------|-------+----------------|--------------|
+ * (bdi_limit * 7/8)^ ^bdi_dirty ^task_limit ^bdi_limit
+ *
+ * Reasonable large gaps help produce smooth pause times.
+ */
+ TP_printk("bdi=%s bdi_dirty=%lu bdi_limit=%lu task_limit=%lu "
+ "task_weight=%ld%% task_gap=%ld%% bdi_gap=%ld%% "
+ "pages_dirtied=%lu pause=%lu",
+ __entry->bdi,
+ __entry->bdi_dirty,
+ __entry->bdi_limit,
+ __entry->task_limit,
+ /* task weight: proportion of recent dirtied pages */
+ BDP_PERCENT(bdi_limit, task_limit, TASK_SOFT_DIRTY_LIMIT),
+ BDP_PERCENT(task_limit, bdi_dirty, TASK_SOFT_DIRTY_LIMIT),
+ BDP_PERCENT(bdi_limit, bdi_dirty, BDI_SOFT_DIRTY_LIMIT),
+ __entry->pages_dirtied,
+ __entry->pause
+ )
+);
+
DECLARE_EVENT_CLASS(writeback_congest_waited_template,
TP_PROTO(unsigned int usec_timeout, unsigned int usec_delayed),
--- linux-next.orig/mm/page-writeback.c 2010-12-08 22:44:26.000000000 +0800
+++ linux-next/mm/page-writeback.c 2010-12-08 22:44:26.000000000 +0800
@@ -666,6 +666,12 @@ static void balance_dirty_pages(struct a
pause = clamp_val(pause, 1, HZ/10);
pause:
+ trace_balance_dirty_pages(bdi,
+ bdi_dirty,
+ bdi_thresh,
+ task_thresh,
+ pages_dirtied,
+ pause);
bdi_update_write_bandwidth(bdi, &bw_time, &bw_written);
__set_current_state(TASK_INTERRUPTIBLE);
io_schedule_timeout(pause);
* [PATCH 15/47] writeback: make nr_to_write a per-file limit
2010-12-13 6:42 [PATCH 00/47] IO-less dirty throttling v3 Wu Fengguang
` (13 preceding siblings ...)
2010-12-13 6:43 ` [PATCH 14/47] writeback: add trace event for balance_dirty_pages() Wu Fengguang
@ 2010-12-13 6:43 ` Wu Fengguang
2010-12-13 6:43 ` [PATCH 16/47] writeback: make-nr_to_write-a-per-file-limit fix Wu Fengguang
` (32 subsequent siblings)
47 siblings, 0 replies; 51+ messages in thread
From: Wu Fengguang @ 2010-12-13 6:43 UTC (permalink / raw)
To: Andrew Morton
Cc: Jan Kara, Wu Fengguang, Christoph Hellwig, Trond Myklebust,
Dave Chinner, Theodore Ts'o, Chris Mason, Peter Zijlstra,
Mel Gorman, Rik van Riel, KOSAKI Motohiro, Greg Thelen,
Minchan Kim, linux-mm, linux-fsdevel, LKML
[-- Attachment #1: writeback-single-file-limit.patch --]
[-- Type: text/plain, Size: 2160 bytes --]
This ensures full 4MB (or larger) writeback size for large dirty files.
CC: Jan Kara <jack@suse.cz>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
fs/fs-writeback.c | 11 +++++++++++
include/linux/writeback.h | 1 +
2 files changed, 12 insertions(+)
--- linux-next.orig/fs/fs-writeback.c 2010-12-08 22:44:24.000000000 +0800
+++ linux-next/fs/fs-writeback.c 2010-12-08 22:44:26.000000000 +0800
@@ -330,6 +330,8 @@ static int
writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
{
struct address_space *mapping = inode->i_mapping;
+ long per_file_limit = wbc->per_file_limit;
+ long nr_to_write;
unsigned dirty;
int ret;
@@ -365,8 +367,16 @@ writeback_single_inode(struct inode *ino
inode->i_state &= ~I_DIRTY_PAGES;
spin_unlock(&inode_lock);
+ if (per_file_limit) {
+ nr_to_write = wbc->nr_to_write;
+ wbc->nr_to_write = per_file_limit;
+ }
+
ret = do_writepages(mapping, wbc);
+ if (per_file_limit)
+ wbc->nr_to_write += nr_to_write - per_file_limit;
+
/*
* Make sure to wait on the data before writing out the metadata.
* This is important for filesystems that modify metadata on data
@@ -698,6 +708,7 @@ static long wb_writeback(struct bdi_writ
wbc.more_io = 0;
wbc.nr_to_write = write_chunk;
+ wbc.per_file_limit = write_chunk;
wbc.pages_skipped = 0;
trace_wbc_writeback_start(&wbc, wb->bdi);
--- linux-next.orig/include/linux/writeback.h 2010-12-08 22:44:24.000000000 +0800
+++ linux-next/include/linux/writeback.h 2010-12-08 22:44:26.000000000 +0800
@@ -43,6 +43,7 @@ struct writeback_control {
extra jobs and livelock */
long nr_to_write; /* Write this many pages, and decrement
this for each page written */
+ long per_file_limit; /* Write this many pages for one file */
long pages_skipped; /* Pages which were not written */
/*
--
* [PATCH 16/47] writeback: make-nr_to_write-a-per-file-limit fix
2010-12-13 6:42 [PATCH 00/47] IO-less dirty throttling v3 Wu Fengguang
` (14 preceding siblings ...)
2010-12-13 6:43 ` [PATCH 15/47] writeback: make nr_to_write a per-file limit Wu Fengguang
@ 2010-12-13 6:43 ` Wu Fengguang
2010-12-13 6:43 ` [PATCH 17/47] writeback: do uninterruptible sleep in balance_dirty_pages() Wu Fengguang
` (31 subsequent siblings)
47 siblings, 0 replies; 51+ messages in thread
From: Wu Fengguang @ 2010-12-13 6:43 UTC (permalink / raw)
To: Andrew Morton
Cc: Jan Kara, Chris Mason, Christoph Hellwig, Dave Chinner,
Jens Axboe, KOSAKI Motohiro, Li Shaohua, Mel Gorman,
Michael Rubin, Peter Zijlstra, Richard Kennedy, Rik van Riel,
Theodore Tso, Wu Fengguang, Trond Myklebust, Greg Thelen,
Minchan Kim, linux-mm, linux-fsdevel, LKML
[-- Attachment #1: writeback-make-nr_to_write-a-per-file-limit-fix.patch --]
[-- Type: text/plain, Size: 1670 bytes --]
From: Andrew Morton <akpm@linux-foundation.org>
older gcc's are dumb:
fs/fs-writeback.c: In function 'writeback_single_inode':
fs/fs-writeback.c:334: warning: 'nr_to_write' may be used uninitialized in this function
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Li Shaohua <shaohua.li@intel.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Michael Rubin <mrubin@google.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Richard Kennedy <richard@rsk.demon.co.uk>
Cc: Rik van Riel <riel@redhat.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
LKML-Reference: <201011180023.oAI0NXFl014362@imap1.linux-foundation.org>
---
fs/fs-writeback.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
--- linux-next.orig/fs/fs-writeback.c 2010-12-08 22:44:26.000000000 +0800
+++ linux-next/fs/fs-writeback.c 2010-12-08 22:44:27.000000000 +0800
@@ -331,7 +331,7 @@ writeback_single_inode(struct inode *ino
{
struct address_space *mapping = inode->i_mapping;
long per_file_limit = wbc->per_file_limit;
- long nr_to_write;
+ long uninitialized_var(nr_to_write);
unsigned dirty;
int ret;
--
* [PATCH 17/47] writeback: do uninterruptible sleep in balance_dirty_pages()
2010-12-13 6:42 [PATCH 00/47] IO-less dirty throttling v3 Wu Fengguang
` (15 preceding siblings ...)
2010-12-13 6:43 ` [PATCH 16/47] writeback: make-nr_to_write-a-per-file-limit fix Wu Fengguang
@ 2010-12-13 6:43 ` Wu Fengguang
2010-12-13 6:43 ` [PATCH 18/47] writeback: move BDI_WRITTEN accounting into __bdi_writeout_inc() Wu Fengguang
` (30 subsequent siblings)
47 siblings, 0 replies; 51+ messages in thread
From: Wu Fengguang @ 2010-12-13 6:43 UTC (permalink / raw)
To: Andrew Morton
Cc: Jan Kara, Wu Fengguang, Christoph Hellwig, Trond Myklebust,
Dave Chinner, Theodore Ts'o, Chris Mason, Peter Zijlstra,
Mel Gorman, Rik van Riel, KOSAKI Motohiro, Greg Thelen,
Minchan Kim, linux-mm, linux-fsdevel, LKML
[-- Attachment #1: writeback-pause-TASK_UNINTERRUPTIBLE.patch --]
[-- Type: text/plain, Size: 1256 bytes --]
Comments from Andrew Morton:
Using TASK_INTERRUPTIBLE in balance_dirty_pages() seems wrong. If it's
going to do that then it must break out if signal_pending(), otherwise
it's pretty much guaranteed to degenerate into a busywait loop. Plus
we *do* want these processes to appear in D state and to contribute to
load average.
So it should be TASK_UNINTERRUPTIBLE.
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
mm/page-writeback.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
--- linux-next.orig/mm/page-writeback.c 2010-12-08 22:44:26.000000000 +0800
+++ linux-next/mm/page-writeback.c 2010-12-08 22:44:27.000000000 +0800
@@ -673,7 +673,7 @@ pause:
pages_dirtied,
pause);
bdi_update_write_bandwidth(bdi, &bw_time, &bw_written);
- __set_current_state(TASK_INTERRUPTIBLE);
+ __set_current_state(TASK_UNINTERRUPTIBLE);
io_schedule_timeout(pause);
bdi_update_write_bandwidth(bdi, &bw_time, &bw_written);
--
* [PATCH 18/47] writeback: move BDI_WRITTEN accounting into __bdi_writeout_inc()
2010-12-13 6:42 [PATCH 00/47] IO-less dirty throttling v3 Wu Fengguang
` (16 preceding siblings ...)
2010-12-13 6:43 ` [PATCH 17/47] writeback: do uninterruptible sleep in balance_dirty_pages() Wu Fengguang
@ 2010-12-13 6:43 ` Wu Fengguang
2010-12-13 6:43 ` [PATCH 19/47] writeback: fix increasement of nr_dirtied_pause Wu Fengguang
` (29 subsequent siblings)
47 siblings, 0 replies; 51+ messages in thread
From: Wu Fengguang @ 2010-12-13 6:43 UTC (permalink / raw)
To: Andrew Morton
Cc: Jan Kara, Peter Zijlstra, Wu Fengguang, Christoph Hellwig,
Trond Myklebust, Dave Chinner, Theodore Ts'o, Chris Mason,
Mel Gorman, Rik van Riel, KOSAKI Motohiro, Greg Thelen,
Minchan Kim, linux-mm, linux-fsdevel, LKML
[-- Attachment #1: writeback-bdi-written-fix.patch --]
[-- Type: text/plain, Size: 1227 bytes --]
This will cover and fix fuse, which only calls bdi_writeout_inc(). -Peter
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
mm/page-writeback.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
--- linux-next.orig/mm/page-writeback.c 2010-12-08 22:44:27.000000000 +0800
+++ linux-next/mm/page-writeback.c 2010-12-08 22:44:27.000000000 +0800
@@ -199,6 +199,7 @@ int dirty_bytes_handler(struct ctl_table
*/
static inline void __bdi_writeout_inc(struct backing_dev_info *bdi)
{
+ __inc_bdi_stat(bdi, BDI_WRITTEN);
__prop_inc_percpu_max(&vm_completions, &bdi->completions,
bdi->max_prop_frac);
}
@@ -1411,7 +1412,6 @@ int test_clear_page_writeback(struct pag
PAGECACHE_TAG_WRITEBACK);
if (bdi_cap_account_writeback(bdi)) {
__dec_bdi_stat(bdi, BDI_WRITEBACK);
- __inc_bdi_stat(bdi, BDI_WRITTEN);
__bdi_writeout_inc(bdi);
}
}
--
* [PATCH 19/47] writeback: fix increasement of nr_dirtied_pause
2010-12-13 6:42 [PATCH 00/47] IO-less dirty throttling v3 Wu Fengguang
` (17 preceding siblings ...)
2010-12-13 6:43 ` [PATCH 18/47] writeback: move BDI_WRITTEN accounting into __bdi_writeout_inc() Wu Fengguang
@ 2010-12-13 6:43 ` Wu Fengguang
2010-12-13 6:43 ` [PATCH 20/47] writeback: use do_div in bw calculation Wu Fengguang
` (28 subsequent siblings)
47 siblings, 0 replies; 51+ messages in thread
From: Wu Fengguang @ 2010-12-13 6:43 UTC (permalink / raw)
To: Andrew Morton
Cc: Jan Kara, Wu Fengguang, Christoph Hellwig, Trond Myklebust,
Dave Chinner, Theodore Ts'o, Chris Mason, Peter Zijlstra,
Mel Gorman, Rik van Riel, KOSAKI Motohiro, Greg Thelen,
Minchan Kim, linux-mm, linux-fsdevel, LKML
[-- Attachment #1: writeback-fix-increase-nr_dirtied_pause.patch --]
[-- Type: text/plain, Size: 1154 bytes --]
Fix a bug where
current->nr_dirtied_pause += current->nr_dirtied_pause >> 5;
does not effectively increase nr_dirtied_pause when it is less than 32,
so nr_dirtied_pause may never grow.
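A concrete example: with nr_dirtied_pause = 16, the ">> 5" term is
16/32 = 0, so the value stays at 16 forever; with the fixed formula it
becomes 16 + 16/32 + 1 = 17 on the next round and can keep growing.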
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
mm/page-writeback.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
--- linux-next.orig/mm/page-writeback.c 2010-12-08 22:44:27.000000000 +0800
+++ linux-next/mm/page-writeback.c 2010-12-08 22:44:28.000000000 +0800
@@ -700,7 +700,7 @@ pause:
if (pause == 0 && nr_dirty < background_thresh)
current->nr_dirtied_pause = ratelimit_pages(bdi);
else if (pause == 1)
- current->nr_dirtied_pause += current->nr_dirtied_pause >> 5;
+ current->nr_dirtied_pause += current->nr_dirtied_pause / 32 + 1;
else if (pause >= HZ/10)
/*
* when repeated, writing 1 page per 100ms on slow devices,
--
* [PATCH 20/47] writeback: use do_div in bw calculation
2010-12-13 6:42 [PATCH 00/47] IO-less dirty throttling v3 Wu Fengguang
` (18 preceding siblings ...)
2010-12-13 6:43 ` [PATCH 19/47] writeback: fix increasement of nr_dirtied_pause Wu Fengguang
@ 2010-12-13 6:43 ` Wu Fengguang
2010-12-13 6:43 ` [PATCH 21/47] writeback: prevent divide error on tiny HZ Wu Fengguang
` (27 subsequent siblings)
47 siblings, 0 replies; 51+ messages in thread
From: Wu Fengguang @ 2010-12-13 6:43 UTC (permalink / raw)
To: Andrew Morton
Cc: Jan Kara, Peter Zijlstra, Wu Fengguang, Christoph Hellwig,
Trond Myklebust, Dave Chinner, Theodore Ts'o, Chris Mason,
Mel Gorman, Rik van Riel, KOSAKI Motohiro, Greg Thelen,
Minchan Kim, linux-mm, linux-fsdevel, LKML
[-- Attachment #1: writeback-use-do_div.patch --]
[-- Type: text/plain, Size: 837 bytes --]
cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
mm/page-writeback.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
--- linux-next.orig/mm/page-writeback.c 2010-12-08 22:44:28.000000000 +0800
+++ linux-next/mm/page-writeback.c 2010-12-08 22:44:28.000000000 +0800
@@ -658,10 +658,10 @@ static void balance_dirty_pages(struct a
*/
bw = bdi->write_bandwidth;
bw = bw * (bdi_thresh - bdi_dirty);
- bw = bw / (bdi_thresh / BDI_SOFT_DIRTY_LIMIT + 1);
+ do_div(bw, bdi_thresh / BDI_SOFT_DIRTY_LIMIT + 1);
bw = bw * (task_thresh - bdi_dirty);
- bw = bw / (bdi_thresh / TASK_SOFT_DIRTY_LIMIT + 1);
+ do_div(bw, bdi_thresh / TASK_SOFT_DIRTY_LIMIT + 1);
pause = HZ * (pages_dirtied << PAGE_CACHE_SHIFT) / (bw + 1);
pause = clamp_val(pause, 1, HZ/10);
* [PATCH 21/47] writeback: prevent divide error on tiny HZ
2010-12-13 6:42 [PATCH 00/47] IO-less dirty throttling v3 Wu Fengguang
` (19 preceding siblings ...)
2010-12-13 6:43 ` [PATCH 20/47] writeback: use do_div in bw calculation Wu Fengguang
@ 2010-12-13 6:43 ` Wu Fengguang
2010-12-13 6:43 ` [PATCH 22/47] writeback: prevent bandwidth calculation overflow Wu Fengguang
` (26 subsequent siblings)
47 siblings, 0 replies; 51+ messages in thread
From: Wu Fengguang @ 2010-12-13 6:43 UTC (permalink / raw)
To: Andrew Morton
Cc: Jan Kara, Peter Zijlstra, Wu Fengguang, Christoph Hellwig,
Trond Myklebust, Dave Chinner, Theodore Ts'o, Chris Mason,
Mel Gorman, Rik van Riel, KOSAKI Motohiro, Greg Thelen,
Minchan Kim, linux-mm, linux-fsdevel, LKML
[-- Attachment #1: writeback-bandwidth-HZ-fix.patch --]
[-- Type: text/plain, Size: 1914 bytes --]
As suggested by Andrew and Peter:
I do recall hearing of people who set HZ very low, perhaps because their
huge machines were seeing performance problems when the timer tick went
off. Probably there's no need to do that any more.
But still, we shouldn't hard-wire the (HZ >= 100) assumption if we don't
absolutely need to, and I don't think it is absolutely needed here.
People who do cpu bring-up on very slow FPGAs also lower HZ as far as
possible.
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
mm/page-writeback.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
--- linux-next.orig/mm/page-writeback.c 2010-12-08 22:44:28.000000000 +0800
+++ linux-next/mm/page-writeback.c 2010-12-08 22:44:28.000000000 +0800
@@ -527,6 +527,7 @@ void bdi_update_write_bandwidth(struct b
unsigned long *bw_time,
s64 *bw_written)
{
+ const unsigned long unit_time = max(HZ/100, 1);
unsigned long written;
unsigned long elapsed;
unsigned long bw;
@@ -536,7 +537,7 @@ void bdi_update_write_bandwidth(struct b
goto snapshot;
elapsed = jiffies - *bw_time;
- if (elapsed < HZ/100)
+ if (elapsed < unit_time)
return;
/*
@@ -550,7 +551,7 @@ void bdi_update_write_bandwidth(struct b
written = percpu_counter_read(&bdi->bdi_stat[BDI_WRITTEN]) - *bw_written;
bw = (HZ * PAGE_CACHE_SIZE * written + elapsed/2) / elapsed;
- w = min(elapsed / (HZ/100), 128UL);
+ w = min(elapsed / unit_time, 128UL);
bdi->write_bandwidth = (bdi->write_bandwidth * (1024-w) + bw * w) >> 10;
bdi->write_bandwidth_update_time = jiffies;
snapshot:
--
* [PATCH 22/47] writeback: prevent bandwidth calculation overflow
2010-12-13 6:42 [PATCH 00/47] IO-less dirty throttling v3 Wu Fengguang
` (20 preceding siblings ...)
2010-12-13 6:43 ` [PATCH 21/47] writeback: prevent divide error on tiny HZ Wu Fengguang
@ 2010-12-13 6:43 ` Wu Fengguang
2010-12-13 6:43 ` [PATCH 23/47] writeback: spinlock protected bdi bandwidth update Wu Fengguang
` (25 subsequent siblings)
47 siblings, 0 replies; 51+ messages in thread
From: Wu Fengguang @ 2010-12-13 6:43 UTC (permalink / raw)
To: Andrew Morton
Cc: Jan Kara, Peter Zijlstra, Rik van Riel, Wu Fengguang,
Christoph Hellwig, Trond Myklebust, Dave Chinner,
Theodore Ts'o, Chris Mason, Mel Gorman, KOSAKI Motohiro,
Greg Thelen, Minchan Kim, linux-mm, linux-fsdevel, LKML
[-- Attachment #1: writeback-prevent-bw-overflow.patch --]
[-- Type: text/plain, Size: 4239 bytes --]
On a 32-bit kernel, bdi->write_bandwidth can express at most 4GB/s.
However the current calculation code can overflow once the disk
bandwidth reaches 800MB/s. Fix it by using "long long" and div64_u64()
in the calculations.
Further, change its unit from bytes/second to pages/second, which
allows expressing up to 16TB/s bandwidth on a 32-bit kernel.
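For reference, the arithmetic behind the two limits above (assuming
4KB pages):
	32-bit unsigned long max:  2^32 - 1 ~= 4.29e9
	in bytes/second:           ~4 GB/s
	in pages/second:           4.29e9 * 4KB ~= 16 TB/s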
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
include/linux/backing-dev.h | 5 +++--
mm/backing-dev.c | 4 ++--
mm/page-writeback.c | 14 +++++++-------
3 files changed, 12 insertions(+), 11 deletions(-)
--- linux-next.orig/mm/page-writeback.c 2010-12-08 22:44:28.000000000 +0800
+++ linux-next/mm/page-writeback.c 2010-12-08 22:44:29.000000000 +0800
@@ -531,7 +531,7 @@ void bdi_update_write_bandwidth(struct b
unsigned long written;
unsigned long elapsed;
unsigned long bw;
- unsigned long w;
+ unsigned long long w;
if (*bw_written == 0)
goto snapshot;
@@ -550,9 +550,10 @@ void bdi_update_write_bandwidth(struct b
goto snapshot;
written = percpu_counter_read(&bdi->bdi_stat[BDI_WRITTEN]) - *bw_written;
- bw = (HZ * PAGE_CACHE_SIZE * written + elapsed/2) / elapsed;
+ bw = (HZ * written + elapsed / 2) / elapsed;
w = min(elapsed / unit_time, 128UL);
- bdi->write_bandwidth = (bdi->write_bandwidth * (1024-w) + bw * w) >> 10;
+ bdi->write_bandwidth = (bdi->write_bandwidth * (1024-w) +
+ bw * w + 1023) >> 10;
bdi->write_bandwidth_update_time = jiffies;
snapshot:
*bw_written = percpu_counter_read(&bdi->bdi_stat[BDI_WRITTEN]);
@@ -577,7 +578,7 @@ static void balance_dirty_pages(struct a
unsigned long dirty_thresh;
unsigned long bdi_thresh;
unsigned long task_thresh;
- unsigned long bw;
+ unsigned long long bw;
unsigned long pause = 0;
bool dirty_exceeded = false;
struct backing_dev_info *bdi = mapping->backing_dev_info;
@@ -640,8 +641,7 @@ static void balance_dirty_pages(struct a
* of dirty pages have been cleaned during our pause time.
*/
if (nr_dirty < dirty_thresh &&
- bdi_prev_dirty - bdi_dirty >
- bdi->write_bandwidth >> (PAGE_CACHE_SHIFT + 2))
+ bdi_prev_dirty - bdi_dirty > (long)bdi->write_bandwidth / 4)
break;
bdi_prev_dirty = bdi_dirty;
@@ -664,7 +664,7 @@ static void balance_dirty_pages(struct a
bw = bw * (task_thresh - bdi_dirty);
do_div(bw, bdi_thresh / TASK_SOFT_DIRTY_LIMIT + 1);
- pause = HZ * (pages_dirtied << PAGE_CACHE_SHIFT) / (bw + 1);
+ pause = HZ * pages_dirtied / ((unsigned long)bw + 1);
pause = clamp_val(pause, 1, HZ/10);
pause:
--- linux-next.orig/mm/backing-dev.c 2010-12-08 22:44:24.000000000 +0800
+++ linux-next/mm/backing-dev.c 2010-12-08 22:44:29.000000000 +0800
@@ -103,7 +103,7 @@ static int bdi_debug_stats_show(struct s
(unsigned long) K(bdi_stat(bdi, BDI_RECLAIMABLE)),
K(bdi_thresh), K(dirty_thresh), K(background_thresh),
(unsigned long) K(bdi_stat(bdi, BDI_WRITTEN)),
- (unsigned long) bdi->write_bandwidth >> 10,
+ (unsigned long) K(bdi->write_bandwidth),
nr_dirty, nr_io, nr_more_io,
!list_empty(&bdi->bdi_list), bdi->state);
#undef K
@@ -662,7 +662,7 @@ int bdi_init(struct backing_dev_info *bd
goto err;
}
- bdi->write_bandwidth = 100 << 20;
+ bdi->write_bandwidth = (100 << 20) / PAGE_CACHE_SIZE;
bdi->dirty_exceeded = 0;
err = prop_local_init_percpu(&bdi->completions);
--- linux-next.orig/include/linux/backing-dev.h 2010-12-08 22:44:24.000000000 +0800
+++ linux-next/include/linux/backing-dev.h 2010-12-08 22:44:29.000000000 +0800
@@ -74,9 +74,10 @@ struct backing_dev_info {
struct percpu_counter bdi_stat[NR_BDI_STAT_ITEMS];
- struct prop_local_percpu completions;
+ unsigned long write_bandwidth;
unsigned long write_bandwidth_update_time;
- int write_bandwidth;
+
+ struct prop_local_percpu completions;
int dirty_exceeded;
unsigned int min_ratio;
--
* [PATCH 23/47] writeback: spinlock protected bdi bandwidth update
2010-12-13 6:42 [PATCH 00/47] IO-less dirty throttling v3 Wu Fengguang
` (21 preceding siblings ...)
2010-12-13 6:43 ` [PATCH 22/47] writeback: prevent bandwidth calculation overflow Wu Fengguang
@ 2010-12-13 6:43 ` Wu Fengguang
2010-12-13 6:43 ` [PATCH 24/47] writeback: increase pause time on concurrent dirtiers Wu Fengguang
` (24 subsequent siblings)
47 siblings, 0 replies; 51+ messages in thread
From: Wu Fengguang @ 2010-12-13 6:43 UTC (permalink / raw)
To: Andrew Morton
Cc: Jan Kara, Peter Zijlstra, Wu Fengguang, Christoph Hellwig,
Trond Myklebust, Dave Chinner, Theodore Ts'o, Chris Mason,
Mel Gorman, Rik van Riel, KOSAKI Motohiro, Greg Thelen,
Minchan Kim, linux-mm, linux-fsdevel, LKML
[-- Attachment #1: writeback-trylock.patch --]
[-- Type: text/plain, Size: 6992 bytes --]
The original plan was to use per-CPU vars for bdi->write_bandwidth.
However Peter pointed out that this opens a window in which some CPUs
may see outdated values. So switch to spinlock-protected global vars.
The bandwidth is only updated when the disk is fully utilized: any
inactive period of more than 500ms will be skipped.
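For clarity, the update in __bdi_update_write_bandwidth() below boils
down to a weighted average over a roughly one second window, with
period = roundup_pow_of_two(HZ):
	sample_bw = written * HZ / elapsed		/* pages/second */
	new_bw    = (old_bw * (period - elapsed) +
	             sample_bw * elapsed) / period
where elapsed is capped at period/2, so a single sample never
contributes more than half of the new value.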
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
fs/fs-writeback.c | 7 +--
include/linux/backing-dev.h | 4 +
include/linux/writeback.h | 13 ++++-
mm/backing-dev.c | 4 +
mm/page-writeback.c | 74 +++++++++++++++++++---------------
5 files changed, 62 insertions(+), 40 deletions(-)
--- linux-next.orig/mm/page-writeback.c 2010-12-08 22:44:29.000000000 +0800
+++ linux-next/mm/page-writeback.c 2010-12-08 22:44:29.000000000 +0800
@@ -523,41 +523,54 @@ out:
return 1 + int_sqrt(dirty_thresh - dirty_pages);
}
-void bdi_update_write_bandwidth(struct backing_dev_info *bdi,
- unsigned long *bw_time,
- s64 *bw_written)
+static void __bdi_update_write_bandwidth(struct backing_dev_info *bdi,
+ unsigned long elapsed,
+ unsigned long written)
+{
+ const unsigned long period = roundup_pow_of_two(HZ);
+ u64 bw;
+
+ bw = written - bdi->written_stamp;
+ bw *= HZ;
+ if (elapsed > period / 2) {
+ do_div(bw, elapsed);
+ elapsed = period / 2;
+ bw *= elapsed;
+ }
+ bw += (u64)bdi->write_bandwidth * (period - elapsed);
+ bdi->write_bandwidth = bw >> ilog2(period);
+}
+
+void bdi_update_bandwidth(struct backing_dev_info *bdi,
+ unsigned long start_time,
+ unsigned long bdi_dirty,
+ unsigned long bdi_thresh)
{
- const unsigned long unit_time = max(HZ/100, 1);
- unsigned long written;
unsigned long elapsed;
- unsigned long bw;
- unsigned long long w;
-
- if (*bw_written == 0)
- goto snapshot;
+ unsigned long written;
- elapsed = jiffies - *bw_time;
- if (elapsed < unit_time)
+ if (!spin_trylock(&bdi->bw_lock))
return;
- /*
- * When there lots of tasks throttled in balance_dirty_pages(), they
- * will each try to update the bandwidth for the same period, making
- * the bandwidth drift much faster than the desired rate (as in the
- * single dirtier case). So do some rate limiting.
- */
- if (jiffies - bdi->write_bandwidth_update_time < elapsed)
+ elapsed = jiffies - bdi->bw_time_stamp;
+ written = percpu_counter_read(&bdi->bdi_stat[BDI_WRITTEN]);
+
+ /* skip quiet periods when disk bandwidth is under-utilized */
+ if (elapsed > HZ/2 &&
+ elapsed > jiffies - start_time)
goto snapshot;
- written = percpu_counter_read(&bdi->bdi_stat[BDI_WRITTEN]) - *bw_written;
- bw = (HZ * written + elapsed / 2) / elapsed;
- w = min(elapsed / unit_time, 128UL);
- bdi->write_bandwidth = (bdi->write_bandwidth * (1024-w) +
- bw * w + 1023) >> 10;
- bdi->write_bandwidth_update_time = jiffies;
+ /* rate-limit, only update once every 100ms */
+ if (elapsed <= HZ/10)
+ goto unlock;
+
+ __bdi_update_write_bandwidth(bdi, elapsed, written);
+
snapshot:
- *bw_written = percpu_counter_read(&bdi->bdi_stat[BDI_WRITTEN]);
- *bw_time = jiffies;
+ bdi->written_stamp = written;
+ bdi->bw_time_stamp = jiffies;
+unlock:
+ spin_unlock(&bdi->bw_lock);
}
/*
@@ -582,8 +595,7 @@ static void balance_dirty_pages(struct a
unsigned long pause = 0;
bool dirty_exceeded = false;
struct backing_dev_info *bdi = mapping->backing_dev_info;
- unsigned long bw_time;
- s64 bw_written = 0;
+ unsigned long start_time = jiffies;
for (;;) {
/*
@@ -645,6 +657,8 @@ static void balance_dirty_pages(struct a
break;
bdi_prev_dirty = bdi_dirty;
+ bdi_update_bandwidth(bdi, start_time, bdi_dirty, bdi_thresh);
+
if (bdi_dirty >= task_thresh) {
pause = HZ/10;
goto pause;
@@ -674,10 +688,8 @@ pause:
task_thresh,
pages_dirtied,
pause);
- bdi_update_write_bandwidth(bdi, &bw_time, &bw_written);
__set_current_state(TASK_UNINTERRUPTIBLE);
io_schedule_timeout(pause);
- bdi_update_write_bandwidth(bdi, &bw_time, &bw_written);
/*
* The bdi thresh is somehow "soft" limit derived from the
--- linux-next.orig/include/linux/backing-dev.h 2010-12-08 22:44:29.000000000 +0800
+++ linux-next/include/linux/backing-dev.h 2010-12-08 22:44:29.000000000 +0800
@@ -74,8 +74,10 @@ struct backing_dev_info {
struct percpu_counter bdi_stat[NR_BDI_STAT_ITEMS];
+ spinlock_t bw_lock;
+ unsigned long bw_time_stamp;
+ unsigned long written_stamp;
unsigned long write_bandwidth;
- unsigned long write_bandwidth_update_time;
struct prop_local_percpu completions;
int dirty_exceeded;
--- linux-next.orig/mm/backing-dev.c 2010-12-08 22:44:29.000000000 +0800
+++ linux-next/mm/backing-dev.c 2010-12-08 22:44:29.000000000 +0800
@@ -662,7 +662,9 @@ int bdi_init(struct backing_dev_info *bd
goto err;
}
- bdi->write_bandwidth = (100 << 20) / PAGE_CACHE_SIZE;
+ spin_lock_init(&bdi->bw_lock);
+ bdi->write_bandwidth = 100 << (20 - PAGE_SHIFT); /* 100 MB/s */
+
bdi->dirty_exceeded = 0;
err = prop_local_init_percpu(&bdi->completions);
--- linux-next.orig/fs/fs-writeback.c 2010-12-08 22:44:27.000000000 +0800
+++ linux-next/fs/fs-writeback.c 2010-12-08 22:44:29.000000000 +0800
@@ -645,8 +645,6 @@ static long wb_writeback(struct bdi_writ
.range_cyclic = work->range_cyclic,
};
unsigned long oldest_jif;
- unsigned long bw_time;
- s64 bw_written = 0;
long wrote = 0;
long write_chunk;
struct inode *inode;
@@ -680,7 +678,7 @@ static long wb_writeback(struct bdi_writ
write_chunk = LONG_MAX;
wbc.wb_start = jiffies; /* livelock avoidance */
- bdi_update_write_bandwidth(wb->bdi, &bw_time, &bw_written);
+ bdi_update_write_bandwidth(wb->bdi, wbc.wb_start);
for (;;) {
/*
@@ -717,7 +715,8 @@ static long wb_writeback(struct bdi_writ
else
writeback_inodes_wb(wb, &wbc);
trace_wbc_writeback_written(&wbc, wb->bdi);
- bdi_update_write_bandwidth(wb->bdi, &bw_time, &bw_written);
+
+ bdi_update_write_bandwidth(wb->bdi, wbc.wb_start);
work->nr_pages -= write_chunk - wbc.nr_to_write;
wrote += write_chunk - wbc.nr_to_write;
--- linux-next.orig/include/linux/writeback.h 2010-12-08 22:44:26.000000000 +0800
+++ linux-next/include/linux/writeback.h 2010-12-08 22:44:29.000000000 +0800
@@ -139,9 +139,16 @@ void global_dirty_limits(unsigned long *
unsigned long bdi_dirty_limit(struct backing_dev_info *bdi,
unsigned long dirty,
unsigned long dirty_pages);
-void bdi_update_write_bandwidth(struct backing_dev_info *bdi,
- unsigned long *bw_time,
- s64 *bw_written);
+
+void bdi_update_bandwidth(struct backing_dev_info *bdi,
+ unsigned long start_time,
+ unsigned long bdi_dirty,
+ unsigned long bdi_thresh);
+static inline void bdi_update_write_bandwidth(struct backing_dev_info *bdi,
+ unsigned long start_time)
+{
+ bdi_update_bandwidth(bdi, start_time, 0, 0);
+}
void page_writeback_init(void);
void balance_dirty_pages_ratelimited_nr(struct address_space *mapping,
* [PATCH 24/47] writeback: increase pause time on concurrent dirtiers
2010-12-13 6:42 [PATCH 00/47] IO-less dirty throttling v3 Wu Fengguang
` (22 preceding siblings ...)
2010-12-13 6:43 ` [PATCH 23/47] writeback: spinlock protected bdi bandwidth update Wu Fengguang
@ 2010-12-13 6:43 ` Wu Fengguang
2010-12-13 6:43 ` [PATCH 25/47] writeback: make it easier to break from a dirty exceeded bdi Wu Fengguang
` (23 subsequent siblings)
47 siblings, 0 replies; 51+ messages in thread
From: Wu Fengguang @ 2010-12-13 6:43 UTC (permalink / raw)
To: Andrew Morton
Cc: Jan Kara, Dave Chinner, Wu Fengguang, Christoph Hellwig,
Trond Myklebust, Theodore Ts'o, Chris Mason, Peter Zijlstra,
Mel Gorman, Rik van Riel, KOSAKI Motohiro, Greg Thelen,
Minchan Kim, linux-mm, linux-fsdevel, LKML
[-- Attachment #1: writeback-200ms-pause-time.patch --]
[-- Type: text/plain, Size: 1956 bytes --]
Increase the max pause time to 200ms, and make it work for (HZ < 5).
The larger 200ms pause helps reduce overheads in server workloads with
lots of concurrent dirtier tasks.
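For illustration, how MAX_PAUSE = max(HZ/5, 1) behaves across a few HZ
values:
	HZ = 1000:  200 jiffies = 200ms
	HZ =  250:   50 jiffies = 200ms
	HZ =    4:    1 jiffy   (HZ/5 would be 0, hence the max())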
CC: Dave Chinner <david@fromorbit.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
mm/page-writeback.c | 11 ++++++++---
1 file changed, 8 insertions(+), 3 deletions(-)
--- linux-next.orig/mm/page-writeback.c 2010-12-09 11:52:05.000000000 +0800
+++ linux-next/mm/page-writeback.c 2010-12-09 11:54:05.000000000 +0800
@@ -36,6 +36,11 @@
#include <linux/pagevec.h>
#include <trace/events/writeback.h>
+/*
+ * Don't sleep more than 200ms at a time in balance_dirty_pages().
+ */
+#define MAX_PAUSE max(HZ/5, 1)
+
/* The following parameters are exported via /proc/sys/vm */
/*
@@ -660,7 +665,7 @@ static void balance_dirty_pages(struct a
bdi_update_bandwidth(bdi, start_time, bdi_dirty, bdi_thresh);
if (bdi_dirty >= task_thresh) {
- pause = HZ/10;
+ pause = MAX_PAUSE;
goto pause;
}
@@ -679,7 +684,7 @@ static void balance_dirty_pages(struct a
do_div(bw, bdi_thresh / TASK_SOFT_DIRTY_LIMIT + 1);
pause = HZ * pages_dirtied / ((unsigned long)bw + 1);
- pause = clamp_val(pause, 1, HZ/10);
+ pause = clamp_val(pause, 1, MAX_PAUSE);
pause:
trace_balance_dirty_pages(bdi,
@@ -714,7 +719,7 @@ pause:
current->nr_dirtied_pause = ratelimit_pages(bdi);
else if (pause == 1)
current->nr_dirtied_pause += current->nr_dirtied_pause / 32 + 1;
- else if (pause >= HZ/10)
+ else if (pause >= MAX_PAUSE)
/*
* when repeated, writing 1 page per 100ms on slow devices,
* i-(i+2)/4 will be able to reach 1 but never reduce to 0.
--
* [PATCH 25/47] writeback: make it easier to break from a dirty exceeded bdi
2010-12-13 6:42 [PATCH 00/47] IO-less dirty throttling v3 Wu Fengguang
` (23 preceding siblings ...)
2010-12-13 6:43 ` [PATCH 24/47] writeback: increase pause time on concurrent dirtiers Wu Fengguang
@ 2010-12-13 6:43 ` Wu Fengguang
2010-12-13 6:43 ` [PATCH 26/47] writeback: start background writeback earlier Wu Fengguang
` (22 subsequent siblings)
47 siblings, 0 replies; 51+ messages in thread
From: Wu Fengguang @ 2010-12-13 6:43 UTC (permalink / raw)
To: Andrew Morton
Cc: Jan Kara, Wu Fengguang, Christoph Hellwig, Trond Myklebust,
Dave Chinner, Theodore Ts'o, Chris Mason, Peter Zijlstra,
Mel Gorman, Rik van Riel, KOSAKI Motohiro, Greg Thelen,
Minchan Kim, linux-mm, linux-fsdevel, LKML
[-- Attachment #1: writeback-bdi-throttle-break-fix.patch --]
[-- Type: text/plain, Size: 1817 bytes --]
The break is designed mainly to help the single task case.
For the 1-dd case, it looks better to lower the break threshold to
125ms worth of data. After all, it's not easy for the dirty pages to
drop by 250ms worth of data when the task only slept 200ms (note: the
max pause time has been doubled to reduce overheads when there are
lots of concurrent dirtiers).
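A rough example of what the new threshold means (illustrative numbers,
assuming 4KB pages): with bdi->write_bandwidth estimated at 80MB/s =
20480 pages/s, the loop now breaks once bdi_dirty has dropped by more
than 20480/8 = 2560 pages ~= 10MB, i.e. 125ms worth of writeout,
instead of the previous 20MB / 250ms.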
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
mm/page-writeback.c | 10 +++++-----
1 file changed, 5 insertions(+), 5 deletions(-)
--- linux-next.orig/mm/page-writeback.c 2010-12-09 12:19:22.000000000 +0800
+++ linux-next/mm/page-writeback.c 2010-12-09 12:26:56.000000000 +0800
@@ -652,13 +652,13 @@ static void balance_dirty_pages(struct a
* bdi_dirty = nr_dirty
* = (background_thresh + dirty_thresh) / 2
* >> bdi_thresh
- * Then the task could be blocked for a dozen second to flush
- * all the exceeded (bdi_dirty - bdi_thresh) pages. So offer a
- * complementary way to break out of the loop when 250ms worth
+ * Then the task could be blocked for many seconds to flush all
+ * the exceeded (bdi_dirty - bdi_thresh) pages. So offer a
+ * complementary way to break out of the loop when 125ms worth
* of dirty pages have been cleaned during our pause time.
*/
- if (nr_dirty < dirty_thresh &&
- bdi_prev_dirty - bdi_dirty > (long)bdi->write_bandwidth / 4)
+ if (nr_dirty <= dirty_thresh &&
+ bdi_prev_dirty - bdi_dirty > (long)bdi->write_bandwidth / 8)
break;
bdi_prev_dirty = bdi_dirty;
--
* [PATCH 26/47] writeback: start background writeback earlier
2010-12-13 6:42 [PATCH 00/47] IO-less dirty throttling v3 Wu Fengguang
` (24 preceding siblings ...)
2010-12-13 6:43 ` [PATCH 25/47] writeback: make it easier to break from a dirty exceeded bdi Wu Fengguang
@ 2010-12-13 6:43 ` Wu Fengguang
2010-12-13 6:43 ` [PATCH 27/47] writeback: user space think time compensation Wu Fengguang
` (21 subsequent siblings)
47 siblings, 0 replies; 51+ messages in thread
From: Wu Fengguang @ 2010-12-13 6:43 UTC (permalink / raw)
To: Andrew Morton
Cc: Jan Kara, Wu Fengguang, Christoph Hellwig, Trond Myklebust,
Dave Chinner, Theodore Ts'o, Chris Mason, Peter Zijlstra,
Mel Gorman, Rik van Riel, KOSAKI Motohiro, Greg Thelen,
Minchan Kim, linux-mm, linux-fsdevel, LKML
[-- Attachment #1: writeback-kick-background-early.patch --]
[-- Type: text/plain, Size: 1587 bytes --]
It's possible for someone to suddenly eat lots of memory, leading to a
sudden drop of the global dirty limit. A dirtier task may then get hard
throttled immediately, without any previous balance_dirty_pages() call
having kicked off background writeback.
In this case we need to check for background writeback earlier in the
loop, to avoid stalling the application for a very long time. This was
not a problem before the IO-less balance_dirty_pages(), which would try
to write something itself and then break out of the loop regardless of
the global limit.
Another case this check helps is when the dirty limit lies too close to
the background threshold, so that someone manages to jump directly
above the pause threshold (background+dirty)/2.
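For illustration (numbers made up): with ~4GB of dirtyable memory and a
20% dirty ratio, dirty_thresh is ~800MB. If an application suddenly
allocates 3GB of anonymous memory, dirty_thresh collapses to ~200MB; a
dirtier that already has 300MB of dirty pages outstanding will then be
throttled on its very next balance_dirty_pages() call, possibly before
background writeback has even been started.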
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
mm/page-writeback.c | 3 +++
1 file changed, 3 insertions(+)
--- linux-next.orig/mm/page-writeback.c 2010-12-08 23:54:36.000000000 +0800
+++ linux-next/mm/page-writeback.c 2010-12-08 23:56:55.000000000 +0800
@@ -662,6 +662,9 @@ static void balance_dirty_pages(struct a
break;
bdi_prev_dirty = bdi_dirty;
+ if (unlikely(!writeback_in_progress(bdi)))
+ bdi_start_background_writeback(bdi);
+
bdi_update_bandwidth(bdi, start_time, bdi_dirty, bdi_thresh);
if (bdi_dirty >= task_thresh) {
--
* [PATCH 27/47] writeback: user space think time compensation
2010-12-13 6:42 [PATCH 00/47] IO-less dirty throttling v3 Wu Fengguang
` (25 preceding siblings ...)
2010-12-13 6:43 ` [PATCH 26/47] writeback: start background writeback earlier Wu Fengguang
@ 2010-12-13 6:43 ` Wu Fengguang
2010-12-13 6:43 ` [PATCH 28/47] writeback: bdi base throttle bandwidth Wu Fengguang
` (20 subsequent siblings)
47 siblings, 0 replies; 51+ messages in thread
From: Wu Fengguang @ 2010-12-13 6:43 UTC (permalink / raw)
To: Andrew Morton
Cc: Jan Kara, Wu Fengguang, Christoph Hellwig, Trond Myklebust,
Dave Chinner, Theodore Ts'o, Chris Mason, Peter Zijlstra,
Mel Gorman, Rik van Riel, KOSAKI Motohiro, Greg Thelen,
Minchan Kim, linux-mm, linux-fsdevel, LKML
[-- Attachment #1: writeback-task-last-dirty-time.patch --]
[-- Type: text/plain, Size: 3647 bytes --]
Take the task's think time into account when computing the final pause
time. This makes the throttle bandwidth more accurate. In the rare case
that the task slept longer than the period time, the extra sleep time
will also be compensated for in the next period if it's not too big
(<100ms). Accumulated errors are carefully avoided as long as the task
doesn't sleep for too long.
case 1: period > think

        pause = period - think
        paused_when += pause

                           period time
              |======================================>|
                  think time
              |===============>|
        ------|----------------|----------------------|-----------
           paused_when       jiffies

case 2: period <= think

        don't pause and reduce future pause time by:
        paused_when += period

                       period time
              |=========================>|
                          think time
              |======================================>|
        ------|--------------------------+------------|-----------
           paused_when                              jiffies
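A numeric walk-through of the two cases (illustrative only, assuming
HZ=1000 so 1 jiffy = 1ms):
	case 1: paused_when = 1000, period = 100, task returns at
	        jiffies = 1040 (think = 40)
	        pause = paused_when + period - jiffies = 60
	        -> sleep 60ms; paused_when advances to 1100
	case 2: same, but the task only returns at jiffies = 1130
	        pause = 1000 + 100 - 1130 = -30 (period < think)
	        -> don't sleep; the 30ms overshoot is < 100ms, so
	           paused_when += period and the excess think time is
	           credited against the next period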
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
include/linux/sched.h | 1 +
mm/page-writeback.c | 22 ++++++++++++++++++++--
2 files changed, 21 insertions(+), 2 deletions(-)
--- linux-next.orig/include/linux/sched.h 2010-12-09 11:50:59.000000000 +0800
+++ linux-next/include/linux/sched.h 2010-12-09 11:54:28.000000000 +0800
@@ -1477,6 +1477,7 @@ struct task_struct {
*/
int nr_dirtied;
int nr_dirtied_pause;
+ unsigned long paused_when; /* start of a write-and-pause period */
#ifdef CONFIG_LATENCYTOP
int latency_record_count;
--- linux-next.orig/mm/page-writeback.c 2010-12-09 11:54:10.000000000 +0800
+++ linux-next/mm/page-writeback.c 2010-12-09 12:00:53.000000000 +0800
@@ -597,6 +597,7 @@ static void balance_dirty_pages(struct a
unsigned long bdi_thresh;
unsigned long task_thresh;
unsigned long long bw;
+ unsigned long period;
unsigned long pause = 0;
bool dirty_exceeded = false;
struct backing_dev_info *bdi = mapping->backing_dev_info;
@@ -667,7 +668,7 @@ static void balance_dirty_pages(struct a
bdi_update_bandwidth(bdi, start_time, bdi_dirty, bdi_thresh);
- if (bdi_dirty >= task_thresh) {
+ if (bdi_dirty >= task_thresh || nr_dirty > dirty_thresh) {
pause = MAX_PAUSE;
goto pause;
}
@@ -686,7 +687,22 @@ static void balance_dirty_pages(struct a
bw = bw * (task_thresh - bdi_dirty);
do_div(bw, bdi_thresh / TASK_SOFT_DIRTY_LIMIT + 1);
- pause = HZ * pages_dirtied / ((unsigned long)bw + 1);
+ period = HZ * pages_dirtied / ((unsigned long)bw + 1) + 1;
+ pause = current->paused_when + period - jiffies;
+ /*
+ * Take it as long think time if pause falls into (-10s, 0).
+ * If it's less than 100ms, try to compensate it in future by
+ * updating the virtual time; otherwise just reset the time, as
+ * it may be a light dirtier.
+ */
+ if (unlikely(-pause < HZ*10)) {
+ if (-pause <= HZ/10)
+ current->paused_when += period;
+ else
+ current->paused_when = jiffies;
+ pause = 1;
+ break;
+ }
pause = clamp_val(pause, 1, MAX_PAUSE);
pause:
@@ -696,8 +712,10 @@ pause:
task_thresh,
pages_dirtied,
pause);
+ current->paused_when = jiffies;
__set_current_state(TASK_UNINTERRUPTIBLE);
io_schedule_timeout(pause);
+ current->paused_when += pause;
/*
* The bdi thresh is somehow "soft" limit derived from the
--
* [PATCH 28/47] writeback: bdi base throttle bandwidth
2010-12-13 6:42 [PATCH 00/47] IO-less dirty throttling v3 Wu Fengguang
` (26 preceding siblings ...)
2010-12-13 6:43 ` [PATCH 27/47] writeback: user space think time compensation Wu Fengguang
@ 2010-12-13 6:43 ` Wu Fengguang
2010-12-13 6:43 ` [PATCH 29/47] writeback: smoothed bdi dirty pages Wu Fengguang
` (19 subsequent siblings)
47 siblings, 0 replies; 51+ messages in thread
From: Wu Fengguang @ 2010-12-13 6:43 UTC (permalink / raw)
To: Andrew Morton
Cc: Jan Kara, Wu Fengguang, Christoph Hellwig, Trond Myklebust,
Dave Chinner, Theodore Ts'o, Chris Mason, Peter Zijlstra,
Mel Gorman, Rik van Riel, KOSAKI Motohiro, Greg Thelen,
Minchan Kim, linux-mm, linux-fsdevel, LKML
[-- Attachment #1: writeback-bw-for-concurrent-dirtiers.patch --]
[-- Type: text/plain, Size: 6713 bytes --]
This basically does
- task_bw = linear_function(task_weight, bdi_dirty, bdi->write_bandwidth)
+ task_bw = linear_function(task_weight, bdi_dirty, bdi->throttle_bandwidth)
where
                            adapt to
bdi->throttle_bandwidth ================> bdi->write_bandwidth / N
                        stabilize around

N = number of concurrent heavy dirtier tasks
    (light dirtiers will have little effect)
It offers two great benefits:
1) in many configurations (eg. NFS), bdi->write_bandwidth fluctuates a
   lot (more than 100%) by nature. bdi->throttle_bandwidth will be much
   more stable, normally a flat line in the time-bandwidth graph.
2) bdi->throttle_bandwidth will be close to the final task_bw in the
   stable state. In contrast, bdi->write_bandwidth is N times larger
   than task_bw.
Given N=4, bdi_dirty will float around A before the patch, and we want
it to stabilize around B by lowering the slope of the control line, so
that when bdi_dirty fluctuates by the same delta (to points A'/B'), the
corresponding fluctuation of task_bw is reduced to 1/4. The benefit is
obvious: with 1000 concurrent dirtiers the fluctuations would quickly
go out of control; with this patch, the max fluctuations are virtually
the same as in the single dirtier case. In this way, the control system
can scale to an arbitrarily large number of dirtiers.
fig.1 before patch

bdi->write_bandwidth ........o
                               o
                                 o
                                   o
                                     o
                                       o
                                         o
                                           o
                                             o
                                               o
                                                 o
                                                   o
task_bw = bdi->write_bandwidth / 4 ..................o
                                                     |o
                                                     | o
                                                     |  o  <= A'
-----------------------------------------------------+---o---------
                                                      A   C

fig.2 after patch

task_bw = bdi->throttle_bandwidth ........o
        = bdi->write_bandwidth / 4        |   o         <= B'
                                          |       o
                                          |           o
------------------------------------------+---------------o------
                                           B               C
The added complexity is that it takes some time for
bdi->throttle_bandwidth to adapt to the workload:
- 2 seconds to scale up to 10 times more dirtier tasks
- 10 seconds to scale down to 10 times fewer dirtier tasks
The slower adaptation time when tasks go away is not a big problem,
because the control line is not linear. At worst, bdi_dirty will drop
below the 15% throttle threshold, where the tasks won't be throttled
at all.
When the system has dirtiers of different speeds,
bdi->throttle_bandwidth will adapt to around the fastest one.
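A concrete (purely illustrative) example: with bdi->write_bandwidth at
~80MB/s and N=4 dd tasks of equal speed, bdi->throttle_bandwidth will
settle around 20MB/s. Each task's pause is then computed from a base
close to its own fair share, so a given wobble of bdi_dirty moves
task_bw by only about 1/4 of what it would if the full 80MB/s were
used as the base; this is exactly the slope reduction shown in fig.2.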
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
include/linux/backing-dev.h | 1
mm/backing-dev.c | 1
mm/page-writeback.c | 42 +++++++++++++++++++++++++++++++++-
3 files changed, 43 insertions(+), 1 deletion(-)
--- linux-next.orig/include/linux/backing-dev.h 2010-12-09 11:50:58.000000000 +0800
+++ linux-next/include/linux/backing-dev.h 2010-12-09 12:01:39.000000000 +0800
@@ -78,6 +78,7 @@ struct backing_dev_info {
unsigned long bw_time_stamp;
unsigned long written_stamp;
unsigned long write_bandwidth;
+ unsigned long throttle_bandwidth;
struct prop_local_percpu completions;
int dirty_exceeded;
--- linux-next.orig/mm/page-writeback.c 2010-12-09 12:00:53.000000000 +0800
+++ linux-next/mm/page-writeback.c 2010-12-09 12:01:39.000000000 +0800
@@ -528,6 +528,45 @@ out:
return 1 + int_sqrt(dirty_thresh - dirty_pages);
}
+/*
+ * The bdi throttle bandwidth is introduced for resisting bdi_dirty from
+ * getting too close to task_thresh. It allows scaling up to 1000+ concurrent
+ * dirtier tasks while keeping the fluctuation level flat.
+ */
+static void __bdi_update_throttle_bandwidth(struct backing_dev_info *bdi,
+ unsigned long dirty,
+ unsigned long thresh)
+{
+ unsigned long gap = thresh / TASK_SOFT_DIRTY_LIMIT + 1;
+ unsigned long bw = bdi->throttle_bandwidth;
+
+ if (dirty > thresh)
+ return;
+
+ /* adapt to concurrent dirtiers */
+ if (dirty > thresh - gap) {
+ bw -= bw >> (3 + 4 * (thresh - dirty) / gap);
+ goto out;
+ }
+
+ /* adapt to one single dirtier */
+ if (dirty > thresh - gap * 2 + gap / 4 &&
+ bw > bdi->write_bandwidth + bdi->write_bandwidth / 2) {
+ bw -= bw >> (3 + 4 * (thresh - dirty - gap) / gap);
+ goto out;
+ }
+
+ if (dirty <= thresh - gap * 2 - gap / 2 &&
+ bw < bdi->write_bandwidth - bdi->write_bandwidth / 2) {
+ bw += (bw >> 4) + 1;
+ goto out;
+ }
+
+ return;
+out:
+ bdi->throttle_bandwidth = bw;
+}
+
static void __bdi_update_write_bandwidth(struct backing_dev_info *bdi,
unsigned long elapsed,
unsigned long written)
@@ -570,6 +609,7 @@ void bdi_update_bandwidth(struct backing
goto unlock;
__bdi_update_write_bandwidth(bdi, elapsed, written);
+ __bdi_update_throttle_bandwidth(bdi, bdi_dirty, bdi_thresh);
snapshot:
bdi->written_stamp = written;
@@ -680,7 +720,7 @@ static void balance_dirty_pages(struct a
* close to task_thresh, and help reduce fluctuations of pause
* time when there are lots of dirtiers.
*/
- bw = bdi->write_bandwidth;
+ bw = bdi->throttle_bandwidth;
bw = bw * (bdi_thresh - bdi_dirty);
do_div(bw, bdi_thresh / BDI_SOFT_DIRTY_LIMIT + 1);
--- linux-next.orig/mm/backing-dev.c 2010-12-09 11:50:58.000000000 +0800
+++ linux-next/mm/backing-dev.c 2010-12-09 12:01:39.000000000 +0800
@@ -664,6 +664,7 @@ int bdi_init(struct backing_dev_info *bd
spin_lock_init(&bdi->bw_lock);
bdi->write_bandwidth = 100 << (20 - PAGE_SHIFT); /* 100 MB/s */
+ bdi->throttle_bandwidth = 100 << (20 - PAGE_SHIFT);
bdi->dirty_exceeded = 0;
err = prop_local_init_percpu(&bdi->completions);
--
* [PATCH 29/47] writeback: smoothed bdi dirty pages
2010-12-13 6:42 [PATCH 00/47] IO-less dirty throttling v3 Wu Fengguang
` (27 preceding siblings ...)
2010-12-13 6:43 ` [PATCH 28/47] writeback: bdi base throttle bandwidth Wu Fengguang
@ 2010-12-13 6:43 ` Wu Fengguang
2010-12-13 6:43 ` [PATCH 30/47] writeback: adapt max balance pause time to memory size Wu Fengguang
` (18 subsequent siblings)
47 siblings, 0 replies; 51+ messages in thread
From: Wu Fengguang @ 2010-12-13 6:43 UTC (permalink / raw)
To: Andrew Morton
Cc: Jan Kara, Wu Fengguang, Christoph Hellwig, Trond Myklebust,
Dave Chinner, Theodore Ts'o, Chris Mason, Peter Zijlstra,
Mel Gorman, Rik van Riel, KOSAKI Motohiro, Greg Thelen,
Minchan Kim, linux-mm, linux-fsdevel, LKML
[-- Attachment #1: writeback-smoothed-bdi_dirty.patch --]
[-- Type: text/plain, Size: 4303 bytes --]
This basically does
- task_bw = linear_function(task_weight, bdi_dirty, bdi->throttle_bandwidth)
+ task_bw = linear_function(task_weight, avg_dirty, bdi->throttle_bandwidth)
So that the fluctuations of bdi_dirty, as seen by the control system,
can roughly be cut in half.
The main problem is that bdi_dirty regularly drops suddenly by dozens
of megabytes on NFS upon the completion of COMMIT requests. The same
problem, though less severe, exists for btrfs, xfs and maybe some
other types of storage. avg_dirty helps filter out such downward
spikes.
Upward spikes are also possible, and if they do happen, are better
fixed in the FS code. To avoid exceeding the dirty limits, once
bdi_dirty exceeds avg_dirty, the higher value will instantly be used
as the feedback to the control system. So, for the sake of safety, the
control system does not filter out upward spikes.
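For example (illustrative numbers): if avg_dirty and bdi_dirty are
both at 200MB and an NFS COMMIT completion suddenly drops bdi_dirty to
160MB, the smoothed value moves by only (200MB - 160MB) / 16 = 2.5MB
per ~100ms update, so the pause computation sees a gentle decline
rather than a 40MB cliff. A sudden rise from 200MB to 240MB, on the
other hand, takes effect immediately, since the higher of the two
values is always fed back.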
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
include/linux/backing-dev.h | 2 +
mm/page-writeback.c | 44 ++++++++++++++++++++++++++++++----
2 files changed, 42 insertions(+), 4 deletions(-)
--- linux-next.orig/include/linux/backing-dev.h 2010-12-09 12:08:16.000000000 +0800
+++ linux-next/include/linux/backing-dev.h 2010-12-09 12:08:18.000000000 +0800
@@ -79,6 +79,8 @@ struct backing_dev_info {
unsigned long written_stamp;
unsigned long write_bandwidth;
unsigned long throttle_bandwidth;
+ unsigned long avg_dirty;
+ unsigned long old_dirty;
struct prop_local_percpu completions;
int dirty_exceeded;
--- linux-next.orig/mm/page-writeback.c 2010-12-09 12:08:16.000000000 +0800
+++ linux-next/mm/page-writeback.c 2010-12-09 12:08:18.000000000 +0800
@@ -528,6 +528,36 @@ out:
return 1 + int_sqrt(dirty_thresh - dirty_pages);
}
+static void __bdi_update_dirty_smooth(struct backing_dev_info *bdi,
+ unsigned long dirty,
+ unsigned long thresh)
+{
+ unsigned long avg = bdi->avg_dirty;
+ unsigned long old = bdi->old_dirty;
+
+ /* skip call from the flusher */
+ if (!thresh)
+ return;
+
+ if (avg > thresh) {
+ avg = dirty;
+ goto update;
+ }
+
+ if (dirty <= avg && dirty >= old)
+ goto out;
+
+ if (dirty >= avg && dirty <= old)
+ goto out;
+
+ avg = (avg * 15 + dirty) / 16;
+
+update:
+ bdi->avg_dirty = avg;
+out:
+ bdi->old_dirty = dirty;
+}
+
/*
* The bdi throttle bandwidth is introduced for resisting bdi_dirty from
* getting too close to task_thresh. It allows scaling up to 1000+ concurrent
@@ -608,8 +638,9 @@ void bdi_update_bandwidth(struct backing
if (elapsed <= HZ/10)
goto unlock;
+ __bdi_update_dirty_smooth(bdi, bdi_dirty, bdi_thresh);
__bdi_update_write_bandwidth(bdi, elapsed, written);
- __bdi_update_throttle_bandwidth(bdi, bdi_dirty, bdi_thresh);
+ __bdi_update_throttle_bandwidth(bdi, bdi->avg_dirty, bdi_thresh);
snapshot:
bdi->written_stamp = written;
@@ -631,6 +662,7 @@ static void balance_dirty_pages(struct a
long nr_reclaimable;
long nr_dirty;
long bdi_dirty; /* = file_dirty + writeback + unstable_nfs */
+ long avg_dirty; /* smoothed bdi_dirty */
long bdi_prev_dirty = 0;
unsigned long background_thresh;
unsigned long dirty_thresh;
@@ -708,7 +740,11 @@ static void balance_dirty_pages(struct a
bdi_update_bandwidth(bdi, start_time, bdi_dirty, bdi_thresh);
- if (bdi_dirty >= task_thresh || nr_dirty > dirty_thresh) {
+ avg_dirty = bdi->avg_dirty;
+ if (avg_dirty < bdi_dirty || avg_dirty > task_thresh)
+ avg_dirty = bdi_dirty;
+
+ if (avg_dirty >= task_thresh || nr_dirty > dirty_thresh) {
pause = MAX_PAUSE;
goto pause;
}
@@ -721,10 +757,10 @@ static void balance_dirty_pages(struct a
* time when there are lots of dirtiers.
*/
bw = bdi->throttle_bandwidth;
- bw = bw * (bdi_thresh - bdi_dirty);
+ bw = bw * (bdi_thresh - avg_dirty);
do_div(bw, bdi_thresh / BDI_SOFT_DIRTY_LIMIT + 1);
- bw = bw * (task_thresh - bdi_dirty);
+ bw = bw * (task_thresh - avg_dirty);
do_div(bw, bdi_thresh / TASK_SOFT_DIRTY_LIMIT + 1);
period = HZ * pages_dirtied / ((unsigned long)bw + 1) + 1;
--
* [PATCH 30/47] writeback: adapt max balance pause time to memory size
2010-12-13 6:42 [PATCH 00/47] IO-less dirty throttling v3 Wu Fengguang
` (28 preceding siblings ...)
2010-12-13 6:43 ` [PATCH 29/47] writeback: smoothed bdi dirty pages Wu Fengguang
@ 2010-12-13 6:43 ` Wu Fengguang
2010-12-13 6:43 ` [PATCH 31/47] writeback: increase min pause time on concurrent dirtiers Wu Fengguang
` (17 subsequent siblings)
47 siblings, 0 replies; 51+ messages in thread
From: Wu Fengguang @ 2010-12-13 6:43 UTC (permalink / raw)
To: Andrew Morton
Cc: Jan Kara, Wu Fengguang, Christoph Hellwig, Trond Myklebust,
Dave Chinner, Theodore Ts'o, Chris Mason, Peter Zijlstra,
Mel Gorman, Rik van Riel, KOSAKI Motohiro, Greg Thelen,
Minchan Kim, linux-mm, linux-fsdevel, LKML
[-- Attachment #1: writeback-max-pause-time-for-small-memory-system.patch --]
[-- Type: text/plain, Size: 2914 bytes --]
For small memory systems, sleeping for 200ms at a time is overkill.
Given a 4MB dirty limit, all the dirty/writeback pages can be written
to an 80MB/s disk within 50ms. If the task goes to sleep for 200ms
after dirtying 4MB, the disk will sit idle for 150ms with no new data
to feed it.
So allow up to N milliseconds of pause time for a (4*N) MB bdi dirty
limit. On a typical 4GB desktop, the max pause time will be ~150ms.
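The arithmetic behind the shift in max_pause() below, assuming 4KB
pages and HZ=1000 (so the shift count is 32 - 12 - 10 = 10, and the
small t += 2 margin is ignored):
	t = bdi_thresh >> 10			/* bdi_thresh in pages */
	bdi_thresh =   4MB =   1024 pages  ->  t =   1 jiffy   =   1ms
	bdi_thresh = 600MB = 153600 pages  ->  t = 150 jiffies = 150ms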
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
mm/page-writeback.c | 25 ++++++++++++++++++++++---
1 file changed, 22 insertions(+), 3 deletions(-)
--- linux-next.orig/mm/page-writeback.c 2010-12-09 12:19:22.000000000 +0800
+++ linux-next/mm/page-writeback.c 2010-12-09 12:23:26.000000000 +0800
@@ -650,6 +650,22 @@ unlock:
}
/*
+ * Limit pause time for small memory systems. If sleeping for too long time,
+ * the small pool of dirty/writeback pages may go empty and disk go idle.
+ */
+static unsigned long max_pause(unsigned long bdi_thresh)
+{
+ unsigned long t;
+
+ /* 1ms for every 4MB */
+ t = bdi_thresh >> (32 - PAGE_CACHE_SHIFT -
+ ilog2(roundup_pow_of_two(HZ)));
+ t += 2;
+
+ return min_t(unsigned long, t, MAX_PAUSE);
+}
+
+/*
* balance_dirty_pages() must be called by processes which are generating dirty
* data. It looks at the number of dirty pages in the machine and will force
* the caller to perform writeback if the system is over `vm_dirty_ratio'.
@@ -671,6 +687,7 @@ static void balance_dirty_pages(struct a
unsigned long long bw;
unsigned long period;
unsigned long pause = 0;
+ unsigned long pause_max;
bool dirty_exceeded = false;
struct backing_dev_info *bdi = mapping->backing_dev_info;
unsigned long start_time = jiffies;
@@ -744,8 +761,10 @@ static void balance_dirty_pages(struct a
if (avg_dirty < bdi_dirty || avg_dirty > task_thresh)
avg_dirty = bdi_dirty;
+ pause_max = max_pause(bdi_thresh);
+
if (avg_dirty >= task_thresh || nr_dirty > dirty_thresh) {
- pause = MAX_PAUSE;
+ pause = pause_max;
goto pause;
}
@@ -779,7 +798,7 @@ static void balance_dirty_pages(struct a
pause = 1;
break;
}
- pause = clamp_val(pause, 1, MAX_PAUSE);
+ pause = clamp_val(pause, 1, pause_max);
pause:
trace_balance_dirty_pages(bdi,
@@ -816,7 +835,7 @@ pause:
current->nr_dirtied_pause = ratelimit_pages(bdi);
else if (pause == 1)
current->nr_dirtied_pause += current->nr_dirtied_pause / 32 + 1;
- else if (pause >= MAX_PAUSE)
+ else if (pause >= pause_max)
/*
* when repeated, writing 1 page per 100ms on slow devices,
* i-(i+2)/4 will be able to reach 1 but never reduce to 0.
--
* [PATCH 31/47] writeback: increase min pause time on concurrent dirtiers
2010-12-13 6:42 [PATCH 00/47] IO-less dirty throttling v3 Wu Fengguang
` (29 preceding siblings ...)
2010-12-13 6:43 ` [PATCH 30/47] writeback: adapt max balance pause time to memory size Wu Fengguang
@ 2010-12-13 6:43 ` Wu Fengguang
2010-12-13 6:43 ` [PATCH 32/47] writeback: extend balance_dirty_pages() trace event Wu Fengguang
` (16 subsequent siblings)
47 siblings, 0 replies; 51+ messages in thread
From: Wu Fengguang @ 2010-12-13 6:43 UTC (permalink / raw)
To: Andrew Morton
Cc: Jan Kara, Dave Chinner, Wu Fengguang, Christoph Hellwig,
Trond Myklebust, Theodore Ts'o, Chris Mason, Peter Zijlstra,
Mel Gorman, Rik van Riel, KOSAKI Motohiro, Greg Thelen,
Minchan Kim, linux-mm, linux-fsdevel, LKML
[-- Attachment #1: writeback-min-pause-time-for-concurrent-dirtiers.patch --]
[-- Type: text/plain, Size: 2104 bytes --]
Target a >60ms pause time when there are 100+ heavy dirtiers per bdi
(it will average around 100ms given the 200ms max pause time).
It's OK for 1 dd task doing 100MB/s to be throttle-paused 100 times
per second. However, when there are 100 tasks writing to the same
disk, that sums up to 100*100 balance_dirty_pages() calls per second
and may lead to massive cacheline bouncing when accessing the global
page states on NUMA machines. Even on single socket boxes, we easily
see >10% CPU time reduction by increasing the pause time.
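How the formula below reaches that target (illustrative, HZ=1000):
with 128 concurrent dirtiers of similar speed, bdi->throttle_bandwidth
ends up roughly 128 times below bdi->write_bandwidth, so
hi - lo = ilog2(write_bandwidth) - ilog2(throttle_bandwidth) ~= 7 and
	t = (hi - lo) * (10 * HZ) / 1024 = 7 * 10000 / 1024 ~= 68 jiffies
i.e. ~68ms, which is then clamped to [1, max_pause/2].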
CC: Dave Chinner <david@fromorbit.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
mm/page-writeback.c | 23 ++++++++++++++++++++++-
1 file changed, 22 insertions(+), 1 deletion(-)
--- linux-next.orig/mm/page-writeback.c 2010-12-09 12:24:45.000000000 +0800
+++ linux-next/mm/page-writeback.c 2010-12-09 12:24:47.000000000 +0800
@@ -666,6 +666,27 @@ static unsigned long max_pause(unsigned
}
/*
+ * Scale up pause time for concurrent dirtiers in order to reduce CPU overheads.
+ * But ensure reasonably large [min_pause, max_pause] range size, so that
+ * nr_dirtied_pause (and hence future pause time) can stay reasonably stable.
+ */
+static unsigned long min_pause(struct backing_dev_info *bdi,
+ unsigned long max)
+{
+ unsigned long hi = ilog2(bdi->write_bandwidth);
+ unsigned long lo = ilog2(bdi->throttle_bandwidth);
+ unsigned long t;
+
+ if (lo >= hi)
+ return 1;
+
+ /* (N * 10ms) on 2^N concurrent tasks */
+ t = (hi - lo) * (10 * HZ) / 1024;
+
+ return clamp_val(t, 1, max / 2);
+}
+
+/*
* balance_dirty_pages() must be called by processes which are generating dirty
* data. It looks at the number of dirty pages in the machine and will force
* the caller to perform writeback if the system is over `vm_dirty_ratio'.
@@ -833,7 +854,7 @@ pause:
if (pause == 0 && nr_dirty < background_thresh)
current->nr_dirtied_pause = ratelimit_pages(bdi);
- else if (pause == 1)
+ else if (pause <= min_pause(bdi, pause_max))
current->nr_dirtied_pause += current->nr_dirtied_pause / 32 + 1;
else if (pause >= pause_max)
/*
* [PATCH 32/47] writeback: extend balance_dirty_pages() trace event
2010-12-13 6:42 [PATCH 00/47] IO-less dirty throttling v3 Wu Fengguang
` (30 preceding siblings ...)
2010-12-13 6:43 ` [PATCH 31/47] writeback: increase min pause time on concurrent dirtiers Wu Fengguang
@ 2010-12-13 6:43 ` Wu Fengguang
2010-12-13 6:43 ` [PATCH 33/47] writeback: trace global dirty page states Wu Fengguang
` (15 subsequent siblings)
47 siblings, 0 replies; 51+ messages in thread
From: Wu Fengguang @ 2010-12-13 6:43 UTC (permalink / raw)
To: Andrew Morton
Cc: Jan Kara, Wu Fengguang, Christoph Hellwig, Trond Myklebust,
Dave Chinner, Theodore Ts'o, Chris Mason, Peter Zijlstra,
Mel Gorman, Rik van Riel, KOSAKI Motohiro, Greg Thelen,
Minchan Kim, linux-mm, linux-fsdevel, LKML
[-- Attachment #1: writeback-trace-add-fields.patch --]
[-- Type: text/plain, Size: 5047 bytes --]
Make it more useful for analyzing the dynamics of the throttling
algorithms, and helpful for debugging user-reported problems.
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
include/trace/events/writeback.h | 52 +++++++++++++++++++++--------
mm/page-writeback.c | 14 +++++++
2 files changed, 53 insertions(+), 13 deletions(-)
--- linux-next.orig/include/trace/events/writeback.h 2010-12-09 12:21:04.000000000 +0800
+++ linux-next/include/trace/events/writeback.h 2010-12-09 12:24:49.000000000 +0800
@@ -149,35 +149,53 @@ DEFINE_WBC_EVENT(wbc_writeback_written);
DEFINE_WBC_EVENT(wbc_writeback_wait);
DEFINE_WBC_EVENT(wbc_writepage);
+#define KBps(x) ((x) << (PAGE_SHIFT - 10))
#define BDP_PERCENT(a, b, c) ((__entry->a - __entry->b) * 100 * c + \
__entry->bdi_limit/2) / (__entry->bdi_limit|1)
+
TRACE_EVENT(balance_dirty_pages,
TP_PROTO(struct backing_dev_info *bdi,
long bdi_dirty,
+ long avg_dirty,
long bdi_limit,
long task_limit,
- long pages_dirtied,
+ long dirtied,
+ long task_bw,
+ long period,
long pause),
- TP_ARGS(bdi, bdi_dirty, bdi_limit, task_limit,
- pages_dirtied, pause),
+ TP_ARGS(bdi, bdi_dirty, avg_dirty, bdi_limit, task_limit,
+ dirtied, task_bw, period, pause),
TP_STRUCT__entry(
__array(char, bdi, 32)
__field(long, bdi_dirty)
+ __field(long, avg_dirty)
__field(long, bdi_limit)
__field(long, task_limit)
- __field(long, pages_dirtied)
+ __field(long, dirtied)
+ __field(long, bdi_bw)
+ __field(long, base_bw)
+ __field(long, task_bw)
+ __field(long, period)
+ __field(long, think)
__field(long, pause)
),
TP_fast_assign(
strlcpy(__entry->bdi, dev_name(bdi->dev), 32);
__entry->bdi_dirty = bdi_dirty;
+ __entry->avg_dirty = avg_dirty;
__entry->bdi_limit = bdi_limit;
__entry->task_limit = task_limit;
- __entry->pages_dirtied = pages_dirtied;
+ __entry->dirtied = dirtied;
+ __entry->bdi_bw = KBps(bdi->write_bandwidth);
+ __entry->base_bw = KBps(bdi->throttle_bandwidth);
+ __entry->task_bw = KBps(task_bw);
+ __entry->think = current->paused_when == 0 ? 0 :
+ (long)(jiffies - current->paused_when) * 1000 / HZ;
+ __entry->period = period * 1000 / HZ;
__entry->pause = pause * 1000 / HZ;
),
@@ -191,19 +209,27 @@ TRACE_EVENT(balance_dirty_pages,
*
* Reasonable large gaps help produce smooth pause times.
*/
- TP_printk("bdi=%s bdi_dirty=%lu bdi_limit=%lu task_limit=%lu "
- "task_weight=%ld%% task_gap=%ld%% bdi_gap=%ld%% "
- "pages_dirtied=%lu pause=%lu",
+ TP_printk("bdi %s: "
+ "bdi_limit=%lu task_limit=%lu bdi_dirty=%lu avg_dirty=%lu "
+ "bdi_gap=%ld%% task_gap=%ld%% task_weight=%ld%% "
+ "bdi_bw=%lu base_bw=%lu task_bw=%lu "
+ "dirtied=%lu period=%lu think=%ld pause=%ld",
__entry->bdi,
- __entry->bdi_dirty,
__entry->bdi_limit,
__entry->task_limit,
+ __entry->bdi_dirty,
+ __entry->avg_dirty,
+ BDP_PERCENT(bdi_limit, bdi_dirty, BDI_SOFT_DIRTY_LIMIT),
+ BDP_PERCENT(task_limit, avg_dirty, TASK_SOFT_DIRTY_LIMIT),
/* task weight: proportion of recent dirtied pages */
BDP_PERCENT(bdi_limit, task_limit, TASK_SOFT_DIRTY_LIMIT),
- BDP_PERCENT(task_limit, bdi_dirty, TASK_SOFT_DIRTY_LIMIT),
- BDP_PERCENT(bdi_limit, bdi_dirty, BDI_SOFT_DIRTY_LIMIT),
- __entry->pages_dirtied,
- __entry->pause
+ __entry->bdi_bw, /* bdi write bandwidth */
+ __entry->base_bw, /* bdi base throttle bandwidth */
+ __entry->task_bw, /* task throttle bandwidth */
+ __entry->dirtied,
+ __entry->period, /* ms */
+ __entry->think, /* ms */
+ __entry->pause /* ms */
)
);
--- linux-next.orig/mm/page-writeback.c 2010-12-09 12:24:47.000000000 +0800
+++ linux-next/mm/page-writeback.c 2010-12-09 12:24:49.000000000 +0800
@@ -785,6 +785,8 @@ static void balance_dirty_pages(struct a
pause_max = max_pause(bdi_thresh);
if (avg_dirty >= task_thresh || nr_dirty > dirty_thresh) {
+ bw = 0;
+ period = 0;
pause = pause_max;
goto pause;
}
@@ -812,6 +814,15 @@ static void balance_dirty_pages(struct a
* it may be a light dirtier.
*/
if (unlikely(-pause < HZ*10)) {
+ trace_balance_dirty_pages(bdi,
+ bdi_dirty,
+ avg_dirty,
+ bdi_thresh,
+ task_thresh,
+ pages_dirtied,
+ bw,
+ period,
+ pause);
if (-pause <= HZ/10)
current->paused_when += period;
else
@@ -824,9 +835,12 @@ static void balance_dirty_pages(struct a
pause:
trace_balance_dirty_pages(bdi,
bdi_dirty,
+ avg_dirty,
bdi_thresh,
task_thresh,
pages_dirtied,
+ bw,
+ period,
pause);
current->paused_when = jiffies;
__set_current_state(TASK_UNINTERRUPTIBLE);
--
* [PATCH 33/47] writeback: trace global dirty page states
2010-12-13 6:42 [PATCH 00/47] IO-less dirty throttling v3 Wu Fengguang
` (31 preceding siblings ...)
2010-12-13 6:43 ` [PATCH 32/47] writeback: extend balance_dirty_pages() trace event Wu Fengguang
@ 2010-12-13 6:43 ` Wu Fengguang
2010-12-13 6:43 ` [PATCH 34/47] writeback: trace writeback_single_inode() Wu Fengguang
` (14 subsequent siblings)
47 siblings, 0 replies; 51+ messages in thread
From: Wu Fengguang @ 2010-12-13 6:43 UTC (permalink / raw)
To: Andrew Morton
Cc: Jan Kara, Wu Fengguang, Christoph Hellwig, Trond Myklebust,
Dave Chinner, Theodore Ts'o, Chris Mason, Peter Zijlstra,
Mel Gorman, Rik van Riel, KOSAKI Motohiro, Greg Thelen,
Minchan Kim, linux-mm, linux-fsdevel, LKML
[-- Attachment #1: writeback-trace-global-dirty-states.patch --]
[-- Type: text/plain, Size: 3706 bytes --]
Add the balance_dirty_state trace event for showing the global dirty
page counts and thresholds on each balance_dirty_pages() loop
iteration.
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
include/trace/events/writeback.h | 57 +++++++++++++++++++++++++++++
mm/page-writeback.c | 15 ++++++-
2 files changed, 69 insertions(+), 3 deletions(-)
--- linux-next.orig/mm/page-writeback.c 2010-12-09 12:24:49.000000000 +0800
+++ linux-next/mm/page-writeback.c 2010-12-09 12:24:52.000000000 +0800
@@ -720,12 +720,21 @@ static void balance_dirty_pages(struct a
* written to the server's write cache, but has not yet
* been flushed to permanent storage.
*/
- nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
- global_page_state(NR_UNSTABLE_NFS);
- nr_dirty = nr_reclaimable + global_page_state(NR_WRITEBACK);
+ nr_reclaimable = global_page_state(NR_FILE_DIRTY);
+ bdi_dirty = global_page_state(NR_UNSTABLE_NFS);
+ nr_dirty = global_page_state(NR_WRITEBACK);
global_dirty_limits(&background_thresh, &dirty_thresh);
+ trace_balance_dirty_state(mapping,
+ nr_reclaimable,
+ nr_dirty,
+ bdi_dirty,
+ background_thresh,
+ dirty_thresh);
+ nr_reclaimable += bdi_dirty;
+ nr_dirty += nr_reclaimable;
+
/*
* Throttle it only when the background writeback cannot
* catch-up. This avoids (excessively) small writeouts
--- linux-next.orig/include/trace/events/writeback.h 2010-12-09 12:24:49.000000000 +0800
+++ linux-next/include/trace/events/writeback.h 2010-12-09 12:24:52.000000000 +0800
@@ -149,6 +149,63 @@ DEFINE_WBC_EVENT(wbc_writeback_written);
DEFINE_WBC_EVENT(wbc_writeback_wait);
DEFINE_WBC_EVENT(wbc_writepage);
+TRACE_EVENT(balance_dirty_state,
+
+ TP_PROTO(struct address_space *mapping,
+ unsigned long nr_dirty,
+ unsigned long nr_writeback,
+ unsigned long nr_unstable,
+ unsigned long background_thresh,
+ unsigned long dirty_thresh
+ ),
+
+ TP_ARGS(mapping,
+ nr_dirty,
+ nr_writeback,
+ nr_unstable,
+ background_thresh,
+ dirty_thresh
+ ),
+
+ TP_STRUCT__entry(
+ __array(char, bdi, 32)
+ __field(unsigned long, ino)
+ __field(unsigned long, nr_dirty)
+ __field(unsigned long, nr_writeback)
+ __field(unsigned long, nr_unstable)
+ __field(unsigned long, background_thresh)
+ __field(unsigned long, dirty_thresh)
+ __field(unsigned long, task_dirtied_pause)
+ ),
+
+ TP_fast_assign(
+ strlcpy(__entry->bdi,
+ dev_name(mapping->backing_dev_info->dev), 32);
+ __entry->ino = mapping->host->i_ino;
+ __entry->nr_dirty = nr_dirty;
+ __entry->nr_writeback = nr_writeback;
+ __entry->nr_unstable = nr_unstable;
+ __entry->background_thresh = background_thresh;
+ __entry->dirty_thresh = dirty_thresh;
+ __entry->task_dirtied_pause = current->nr_dirtied_pause;
+ ),
+
+ TP_printk("bdi %s: dirty=%lu wb=%lu unstable=%lu "
+ "bg_thresh=%lu thresh=%lu gap=%ld "
+ "poll_thresh=%lu ino=%lu",
+ __entry->bdi,
+ __entry->nr_dirty,
+ __entry->nr_writeback,
+ __entry->nr_unstable,
+ __entry->background_thresh,
+ __entry->dirty_thresh,
+ __entry->dirty_thresh - __entry->nr_dirty -
+ __entry->nr_writeback - __entry->nr_unstable,
+ __entry->task_dirtied_pause,
+ __entry->ino
+ )
+);
+
#define KBps(x) ((x) << (PAGE_SHIFT - 10))
#define BDP_PERCENT(a, b, c) ((__entry->a - __entry->b) * 100 * c + \
__entry->bdi_limit/2) / (__entry->bdi_limit|1)
--
* [PATCH 34/47] writeback: trace writeback_single_inode()
2010-12-13 6:42 [PATCH 00/47] IO-less dirty throttling v3 Wu Fengguang
` (32 preceding siblings ...)
2010-12-13 6:43 ` [PATCH 33/47] writeback: trace global dirty page states Wu Fengguang
@ 2010-12-13 6:43 ` Wu Fengguang
2010-12-13 6:43 ` [PATCH 35/47] writeback: scale IO chunk size up to device bandwidth Wu Fengguang
` (13 subsequent siblings)
47 siblings, 0 replies; 51+ messages in thread
From: Wu Fengguang @ 2010-12-13 6:43 UTC (permalink / raw)
To: Andrew Morton
Cc: Jan Kara, Wu Fengguang, Christoph Hellwig, Trond Myklebust,
Dave Chinner, Theodore Ts'o, Chris Mason, Peter Zijlstra,
Mel Gorman, Rik van Riel, KOSAKI Motohiro, Greg Thelen,
Minchan Kim, linux-mm, linux-fsdevel, LKML
[-- Attachment #1: writeback-trace-writeback_single_inode.patch --]
[-- Type: text/plain, Size: 3599 bytes --]
It is valuable to know how the inodes are iterated over and how much IO
is done on each of them.
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
fs/fs-writeback.c | 12 +++---
include/trace/events/writeback.h | 52 +++++++++++++++++++++++++++++
2 files changed, 59 insertions(+), 5 deletions(-)
--- linux-next.orig/fs/fs-writeback.c 2010-12-09 22:15:49.000000000 +0800
+++ linux-next/fs/fs-writeback.c 2010-12-09 22:15:50.000000000 +0800
@@ -331,7 +331,7 @@ writeback_single_inode(struct inode *ino
{
struct address_space *mapping = inode->i_mapping;
long per_file_limit = wbc->per_file_limit;
- long uninitialized_var(nr_to_write);
+ long nr_to_write = wbc->nr_to_write;
unsigned dirty;
int ret;
@@ -351,7 +351,8 @@ writeback_single_inode(struct inode *ino
*/
if (wbc->sync_mode != WB_SYNC_ALL) {
requeue_io(inode);
- return 0;
+ ret = 0;
+ goto out;
}
/*
@@ -367,10 +368,8 @@ writeback_single_inode(struct inode *ino
inode->i_state &= ~I_DIRTY_PAGES;
spin_unlock(&inode_lock);
- if (per_file_limit) {
- nr_to_write = wbc->nr_to_write;
+ if (per_file_limit)
wbc->nr_to_write = per_file_limit;
- }
ret = do_writepages(mapping, wbc);
@@ -446,6 +445,9 @@ writeback_single_inode(struct inode *ino
}
}
inode_sync_complete(inode);
+out:
+ trace_writeback_single_inode(inode, wbc,
+ nr_to_write - wbc->nr_to_write);
return ret;
}
--- linux-next.orig/include/trace/events/writeback.h 2010-12-09 22:15:49.000000000 +0800
+++ linux-next/include/trace/events/writeback.h 2010-12-09 22:21:41.000000000 +0800
@@ -10,6 +10,19 @@
struct wb_writeback_work;
+#define show_inode_state(state) \
+ __print_flags(state, "|", \
+ {I_DIRTY_SYNC, "I_DIRTY_SYNC"}, \
+ {I_DIRTY_DATASYNC, "I_DIRTY_DATASYNC"}, \
+ {I_DIRTY_PAGES, "I_DIRTY_PAGES"}, \
+ {I_NEW, "I_NEW"}, \
+ {I_WILL_FREE, "I_WILL_FREE"}, \
+ {I_FREEING, "I_FREEING"}, \
+ {I_CLEAR, "I_CLEAR"}, \
+ {I_SYNC, "I_SYNC"}, \
+ {I_REFERENCED, "I_REFERENCED"} \
+ )
+
DECLARE_EVENT_CLASS(writeback_work_class,
TP_PROTO(struct backing_dev_info *bdi, struct wb_writeback_work *work),
TP_ARGS(bdi, work),
@@ -149,6 +162,45 @@ DEFINE_WBC_EVENT(wbc_writeback_written);
DEFINE_WBC_EVENT(wbc_writeback_wait);
DEFINE_WBC_EVENT(wbc_writepage);
+TRACE_EVENT(writeback_single_inode,
+
+ TP_PROTO(struct inode *inode,
+ struct writeback_control *wbc,
+ unsigned long wrote
+ ),
+
+ TP_ARGS(inode, wbc, wrote),
+
+ TP_STRUCT__entry(
+ __array(char, name, 32)
+ __field(unsigned long, ino)
+ __field(unsigned long, state)
+ __field(unsigned long, age)
+ __field(unsigned long, wrote)
+ __field(long, nr_to_write)
+ ),
+
+ TP_fast_assign(
+ strncpy(__entry->name,
+ dev_name(inode->i_mapping->backing_dev_info->dev), 32);
+ __entry->ino = inode->i_ino;
+ __entry->state = inode->i_state;
+ __entry->age = (jiffies - inode->dirtied_when) *
+ 1000 / HZ;
+ __entry->wrote = wrote;
+ __entry->nr_to_write = wbc->nr_to_write;
+ ),
+
+ TP_printk("bdi %s: ino=%lu state=%s age=%lu wrote=%lu to_write=%ld",
+ __entry->name,
+ __entry->ino,
+ show_inode_state(__entry->state),
+ __entry->age,
+ __entry->wrote,
+ __entry->nr_to_write
+ )
+);
+
TRACE_EVENT(balance_dirty_state,
TP_PROTO(struct address_space *mapping,
--
^ permalink raw reply [flat|nested] 51+ messages in thread
* [PATCH 35/47] writeback: scale IO chunk size up to device bandwidth
2010-12-13 6:42 [PATCH 00/47] IO-less dirty throttling v3 Wu Fengguang
` (33 preceding siblings ...)
2010-12-13 6:43 ` [PATCH 34/47] writeback: trace writeback_single_inode() Wu Fengguang
@ 2010-12-13 6:43 ` Wu Fengguang
2010-12-13 6:43 ` [PATCH 36/47] btrfs: dont call balance_dirty_pages_ratelimited() on already dirty pages Wu Fengguang
` (12 subsequent siblings)
47 siblings, 0 replies; 51+ messages in thread
From: Wu Fengguang @ 2010-12-13 6:43 UTC (permalink / raw)
To: Andrew Morton
Cc: Jan Kara, Theodore Tso, Dave Chinner, Chris Mason, Peter Zijlstra,
Wu Fengguang, Christoph Hellwig, Trond Myklebust, Mel Gorman,
Rik van Riel, KOSAKI Motohiro, Greg Thelen, Minchan Kim, linux-mm,
linux-fsdevel, LKML
[-- Attachment #1: writeback-128M-MAX_WRITEBACK_PAGES.patch --]
[-- Type: text/plain, Size: 5257 bytes --]
Originally, MAX_WRITEBACK_PAGES was hard-coded to 1024 because of a
concern of not holding I_SYNC for too long. (At least, that was the
comment previously.) This doesn't make sense now because the only
time we wait for I_SYNC is if we are calling sync or fsync, and in
that case we need to write out all of the data anyway. Previously
there may have been other code paths that waited on I_SYNC, but not
any more. -- Theodore Ts'o
According to Christoph, the current writeback size is way too small,
and XFS had a hack that bumped out nr_to_write to four times the value
sent by the VM to be able to saturate medium-sized RAID arrays. This
value was also problematic for ext4, as it caused large files to become
interleaved on disk in 8 megabyte chunks (we bumped up nr_to_write by a
factor of two).
So remove the MAX_WRITEBACK_PAGES constraint entirely. The write chunk
will adapt to however much the storage device can write within 1 second.
For a typical hard disk, the resulting chunk size will be 32MB or 64MB.
http://bugzilla.kernel.org/show_bug.cgi?id=13930
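As a rough worked example (assuming 4KB pages and that bdi->write_bandwidth
is tracked in pages per second): a disk sustaining about 60MB/s gives
write_bandwidth ~= 15360 pages, which rounddown_pow_of_two() reduces to
8192 pages, i.e. a 32MB chunk; at about 80MB/s the result is 16384 pages,
i.e. 64MB.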
CC: Theodore Ts'o <tytso@mit.edu>
CC: Dave Chinner <david@fromorbit.com>
CC: Chris Mason <chris.mason@oracle.com>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
fs/fs-writeback.c | 60 +++++++++++++++++++-----------------
include/linux/writeback.h | 5 +++
2 files changed, 38 insertions(+), 27 deletions(-)
--- linux-next.orig/fs/fs-writeback.c 2010-12-09 12:24:57.000000000 +0800
+++ linux-next/fs/fs-writeback.c 2010-12-09 12:24:58.000000000 +0800
@@ -602,15 +602,6 @@ static void __writeback_inodes_sb(struct
spin_unlock(&inode_lock);
}
-/*
- * The maximum number of pages to writeout in a single bdi flush/kupdate
- * operation. We do this so we don't hold I_SYNC against an inode for
- * enormous amounts of time, which would block a userspace task which has
- * been forced to throttle against that inode. Also, the code reevaluates
- * the dirty each time it has written this many pages.
- */
-#define MAX_WRITEBACK_PAGES 1024
-
static inline bool over_bground_thresh(void)
{
unsigned long background_thresh, dirty_thresh;
@@ -622,6 +613,38 @@ static inline bool over_bground_thresh(v
}
/*
+ * Give each inode a nr_to_write that can complete within 1 second.
+ */
+static unsigned long writeback_chunk_size(struct backing_dev_info *bdi,
+ int sync_mode)
+{
+ unsigned long pages;
+
+ /*
+ * WB_SYNC_ALL mode does livelock avoidance by syncing dirty
+ * inodes/pages in one big loop. Setting wbc.nr_to_write=LONG_MAX
+ * here avoids calling into writeback_inodes_wb() more than once.
+ *
+ * The intended call sequence for WB_SYNC_ALL writeback is:
+ *
+ * wb_writeback()
+ * __writeback_inodes_sb() <== called only once
+ * write_cache_pages() <== called once for each inode
+ * (quickly) tag currently dirty pages
+ * (maybe slowly) sync all tagged pages
+ */
+ if (sync_mode == WB_SYNC_ALL)
+ return LONG_MAX;
+
+ pages = bdi->write_bandwidth;
+
+ if (pages < MIN_WRITEBACK_PAGES)
+ return MIN_WRITEBACK_PAGES;
+
+ return rounddown_pow_of_two(pages);
+}
+
+/*
* Explicit flushing or periodic writeback of "old" data.
*
* Define "old": the first time one of an inode's pages is dirtied, we mark the
@@ -661,24 +684,6 @@ static long wb_writeback(struct bdi_writ
wbc.range_end = LLONG_MAX;
}
- /*
- * WB_SYNC_ALL mode does livelock avoidance by syncing dirty
- * inodes/pages in one big loop. Setting wbc.nr_to_write=LONG_MAX
- * here avoids calling into writeback_inodes_wb() more than once.
- *
- * The intended call sequence for WB_SYNC_ALL writeback is:
- *
- * wb_writeback()
- * __writeback_inodes_sb() <== called only once
- * write_cache_pages() <== called once for each inode
- * (quickly) tag currently dirty pages
- * (maybe slowly) sync all tagged pages
- */
- if (wbc.sync_mode == WB_SYNC_NONE)
- write_chunk = MAX_WRITEBACK_PAGES;
- else
- write_chunk = LONG_MAX;
-
wbc.wb_start = jiffies; /* livelock avoidance */
bdi_update_write_bandwidth(wb->bdi, wbc.wb_start);
@@ -707,6 +712,7 @@ static long wb_writeback(struct bdi_writ
break;
wbc.more_io = 0;
+ write_chunk = writeback_chunk_size(wb->bdi, wbc.sync_mode);
wbc.nr_to_write = write_chunk;
wbc.per_file_limit = write_chunk;
wbc.pages_skipped = 0;
--- linux-next.orig/include/linux/writeback.h 2010-12-09 12:21:03.000000000 +0800
+++ linux-next/include/linux/writeback.h 2010-12-09 12:24:58.000000000 +0800
@@ -22,6 +22,11 @@ extern spinlock_t inode_lock;
#define TASK_SOFT_DIRTY_LIMIT (BDI_SOFT_DIRTY_LIMIT * 2)
/*
+ * 4MB minimal write chunk size
+ */
+#define MIN_WRITEBACK_PAGES (4096 >> (PAGE_CACHE_SHIFT - 10))
+
+/*
* fs/fs-writeback.c
*/
enum writeback_sync_modes {
--
^ permalink raw reply [flat|nested] 51+ messages in thread
* [PATCH 36/47] btrfs: dont call balance_dirty_pages_ratelimited() on already dirty pages
2010-12-13 6:42 [PATCH 00/47] IO-less dirty throttling v3 Wu Fengguang
` (34 preceding siblings ...)
2010-12-13 6:43 ` [PATCH 35/47] writeback: scale IO chunk size up to device bandwidth Wu Fengguang
@ 2010-12-13 6:43 ` Wu Fengguang
2010-12-13 6:43 ` [PATCH 37/47] btrfs: lower the dirty balancing rate limit Wu Fengguang
` (11 subsequent siblings)
47 siblings, 0 replies; 51+ messages in thread
From: Wu Fengguang @ 2010-12-13 6:43 UTC (permalink / raw)
To: Andrew Morton
Cc: Jan Kara, Chris Mason, Wu Fengguang, Christoph Hellwig,
Trond Myklebust, Dave Chinner, Theodore Ts'o, Peter Zijlstra,
Mel Gorman, Rik van Riel, KOSAKI Motohiro, Greg Thelen,
Minchan Kim, linux-mm, linux-fsdevel, LKML
[-- Attachment #1: btrfs-fix-balance-size.patch --]
[-- Type: text/plain, Size: 4150 bytes --]
When doing 1KB sequential writes to the same page,
balance_dirty_pages_ratelimited() should be called once rather than 4
times. Failing to do so throttles all tasks far too heavily.
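For instance, with 4KB page cache pages a bs=1k sequential writer dirties
each page in four separate write() calls; previously every one of those
calls charged a full page to balance_dirty_pages_ratelimited_nr(), so the
task was throttled as if it had dirtied four times the data it actually
did. With this patch only the call that first dirties the page is counted.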
CC: Chris Mason <chris.mason@oracle.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
fs/btrfs/file.c | 11 +++++++----
fs/btrfs/ioctl.c | 6 ++++--
fs/btrfs/relocation.c | 6 ++++--
3 files changed, 15 insertions(+), 8 deletions(-)
--- linux-next.orig/fs/btrfs/file.c 2010-12-09 12:21:03.000000000 +0800
+++ linux-next/fs/btrfs/file.c 2010-12-09 12:24:59.000000000 +0800
@@ -762,7 +762,8 @@ out:
static noinline int prepare_pages(struct btrfs_root *root, struct file *file,
struct page **pages, size_t num_pages,
loff_t pos, unsigned long first_index,
- unsigned long last_index, size_t write_bytes)
+ unsigned long last_index, size_t write_bytes,
+ int *nr_dirtied)
{
struct extent_state *cached_state = NULL;
int i;
@@ -825,7 +826,8 @@ again:
GFP_NOFS);
}
for (i = 0; i < num_pages; i++) {
- clear_page_dirty_for_io(pages[i]);
+ if (!clear_page_dirty_for_io(pages[i]))
+ (*nr_dirtied)++;
set_page_extent_mapped(pages[i]);
WARN_ON(!PageLocked(pages[i]));
}
@@ -966,6 +968,7 @@ static ssize_t btrfs_file_aio_write(stru
offset);
size_t num_pages = (write_bytes + PAGE_CACHE_SIZE - 1) >>
PAGE_CACHE_SHIFT;
+ int nr_dirtied = 0;
WARN_ON(num_pages > nrptrs);
memset(pages, 0, sizeof(struct page *) * nrptrs);
@@ -976,7 +979,7 @@ static ssize_t btrfs_file_aio_write(stru
ret = prepare_pages(root, file, pages, num_pages,
pos, first_index, last_index,
- write_bytes);
+ write_bytes, &nr_dirtied);
if (ret) {
btrfs_delalloc_release_space(inode, write_bytes);
goto out;
@@ -1000,7 +1003,7 @@ static ssize_t btrfs_file_aio_write(stru
pos + write_bytes - 1);
} else {
balance_dirty_pages_ratelimited_nr(inode->i_mapping,
- num_pages);
+ nr_dirtied);
if (num_pages <
(root->leafsize >> PAGE_CACHE_SHIFT) + 1)
btrfs_btree_balance_dirty(root, 1);
--- linux-next.orig/fs/btrfs/ioctl.c 2010-12-09 12:21:03.000000000 +0800
+++ linux-next/fs/btrfs/ioctl.c 2010-12-09 12:24:59.000000000 +0800
@@ -647,6 +647,7 @@ static int btrfs_defrag_file(struct file
u64 skip = 0;
u64 defrag_end = 0;
unsigned long i;
+ int dirtied;
int ret;
if (inode->i_size == 0)
@@ -751,7 +752,7 @@ again:
btrfs_set_extent_delalloc(inode, page_start, page_end, NULL);
ClearPageChecked(page);
- set_page_dirty(page);
+ dirtied = set_page_dirty(page);
unlock_extent(io_tree, page_start, page_end, GFP_NOFS);
loop_unlock:
@@ -759,7 +760,8 @@ loop_unlock:
page_cache_release(page);
mutex_unlock(&inode->i_mutex);
- balance_dirty_pages_ratelimited_nr(inode->i_mapping, 1);
+ if (dirtied)
+ balance_dirty_pages_ratelimited_nr(inode->i_mapping, 1);
i++;
}
--- linux-next.orig/fs/btrfs/relocation.c 2010-12-09 12:21:03.000000000 +0800
+++ linux-next/fs/btrfs/relocation.c 2010-12-09 12:24:59.000000000 +0800
@@ -2894,6 +2894,7 @@ static int relocate_file_extent_cluster(
struct file_ra_state *ra;
int nr = 0;
int ret = 0;
+ int dirtied;
if (!cluster->nr)
return 0;
@@ -2970,7 +2971,7 @@ static int relocate_file_extent_cluster(
}
btrfs_set_extent_delalloc(inode, page_start, page_end, NULL);
- set_page_dirty(page);
+ dirtied = set_page_dirty(page);
unlock_extent(&BTRFS_I(inode)->io_tree,
page_start, page_end, GFP_NOFS);
@@ -2978,7 +2979,8 @@ static int relocate_file_extent_cluster(
page_cache_release(page);
index++;
- balance_dirty_pages_ratelimited(inode->i_mapping);
+ if (dirtied)
+ balance_dirty_pages_ratelimited(inode->i_mapping);
btrfs_throttle(BTRFS_I(inode)->root);
}
WARN_ON(nr != cluster->nr);
--
^ permalink raw reply [flat|nested] 51+ messages in thread
* [PATCH 37/47] btrfs: lower the dirty balancing rate limit
2010-12-13 6:42 [PATCH 00/47] IO-less dirty throttling v3 Wu Fengguang
` (35 preceding siblings ...)
2010-12-13 6:43 ` [PATCH 36/47] btrfs: dont call balance_dirty_pages_ratelimited() on already dirty pages Wu Fengguang
@ 2010-12-13 6:43 ` Wu Fengguang
2010-12-13 6:43 ` [PATCH 38/47] btrfs: wait on too many nr_async_bios Wu Fengguang
` (10 subsequent siblings)
47 siblings, 0 replies; 51+ messages in thread
From: Wu Fengguang @ 2010-12-13 6:43 UTC (permalink / raw)
To: Andrew Morton
Cc: Jan Kara, Chris Mason, Wu Fengguang, Christoph Hellwig,
Trond Myklebust, Dave Chinner, Theodore Ts'o, Peter Zijlstra,
Mel Gorman, Rik van Riel, KOSAKI Motohiro, Greg Thelen,
Minchan Kim, linux-mm, linux-fsdevel, LKML
[-- Attachment #1: btrfs-limit-nr-dirtied.patch --]
[-- Type: text/plain, Size: 1314 bytes --]
Call balance_dirty_pages_ratelimited_nr() for every 16 pages dirtied.
Experiments show that the larger intervals in the original code can
easily cause the bdi dirty limit to be exceeded with 100 concurrent dd's.
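As a worked example (assuming 4KB pages and 8-byte pointers): the old
bound of PAGE_CACHE_SIZE / sizeof(struct page *) allowed up to 512 pages
(2MB) per write loop iteration, while the new 16UL clamp limits each
iteration to 16 pages (64KB), so balance_dirty_pages_ratelimited_nr() is
consulted at least once per 64KB written.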
CC: Chris Mason <chris.mason@oracle.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
fs/btrfs/file.c | 5 ++---
1 file changed, 2 insertions(+), 3 deletions(-)
--- linux-next.orig/fs/btrfs/file.c 2010-12-09 12:24:59.000000000 +0800
+++ linux-next/fs/btrfs/file.c 2010-12-09 12:25:00.000000000 +0800
@@ -924,9 +924,8 @@ static ssize_t btrfs_file_aio_write(stru
}
iov_iter_init(&i, iov, nr_segs, count, num_written);
- nrptrs = min((iov_iter_count(&i) + PAGE_CACHE_SIZE - 1) /
- PAGE_CACHE_SIZE, PAGE_CACHE_SIZE /
- (sizeof(struct page *)));
+ nrptrs = min(DIV_ROUND_UP(iov_iter_count(&i), PAGE_CACHE_SIZE),
+ min(16UL, PAGE_CACHE_SIZE / (sizeof(struct page *))));
pages = kmalloc(nrptrs * sizeof(struct page *), GFP_KERNEL);
/* generic_write_checks can change our pos */
--
^ permalink raw reply [flat|nested] 51+ messages in thread
* [PATCH 38/47] btrfs: wait on too many nr_async_bios
2010-12-13 6:42 [PATCH 00/47] IO-less dirty throttling v3 Wu Fengguang
` (36 preceding siblings ...)
2010-12-13 6:43 ` [PATCH 37/47] btrfs: lower the dirty balancing rate limit Wu Fengguang
@ 2010-12-13 6:43 ` Wu Fengguang
2010-12-13 6:43 ` [PATCH 39/47] nfs: livelock prevention is now done in VFS Wu Fengguang
` (9 subsequent siblings)
47 siblings, 0 replies; 51+ messages in thread
From: Wu Fengguang @ 2010-12-13 6:43 UTC (permalink / raw)
To: Andrew Morton
Cc: Jan Kara, Wu Fengguang, Christoph Hellwig, Trond Myklebust,
Dave Chinner, Theodore Ts'o, Chris Mason, Peter Zijlstra,
Mel Gorman, Rik van Riel, KOSAKI Motohiro, Greg Thelen,
Minchan Kim, linux-mm, linux-fsdevel, LKML
[-- Attachment #1: btrfs-nr_async_bios-wait.patch --]
[-- Type: text/plain, Size: 1729 bytes --]
Tests show that btrfs is repeatedly moving _all_ PG_dirty pages into
PG_writeback state. It's desirable to have some limit on the number of
writeback pages.
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
fs/btrfs/disk-io.c | 7 +++++++
1 file changed, 7 insertions(+)
before patch:
http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/tests/3G/btrfs-1dd-1M-8p-2952M-2.6.37-rc5+-2010-12-08-21-30/vmstat-dirty-300.png
after patch:
http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/tests/3G/btrfs-1dd-1M-8p-2952M-2.6.37-rc5+-2010-12-08-21-14/vmstat-dirty-300.png
--- linux-next.orig/fs/btrfs/disk-io.c 2010-12-09 12:21:03.000000000 +0800
+++ linux-next/fs/btrfs/disk-io.c 2010-12-09 12:25:00.000000000 +0800
@@ -590,6 +590,7 @@ int btrfs_wq_submit_bio(struct btrfs_fs_
extent_submit_bio_hook_t *submit_bio_done)
{
struct async_submit_bio *async;
+ int limit;
async = kmalloc(sizeof(*async), GFP_NOFS);
if (!async)
@@ -617,6 +618,12 @@ int btrfs_wq_submit_bio(struct btrfs_fs_
btrfs_queue_worker(&fs_info->workers, &async->work);
+ limit = btrfs_async_submit_limit(fs_info);
+
+ if (atomic_read(&fs_info->nr_async_bios) > limit)
+ wait_event(fs_info->async_submit_wait,
+ (atomic_read(&fs_info->nr_async_bios) < limit));
+
while (atomic_read(&fs_info->async_submit_draining) &&
atomic_read(&fs_info->nr_async_submits)) {
wait_event(fs_info->async_submit_wait,
--
^ permalink raw reply [flat|nested] 51+ messages in thread
* [PATCH 39/47] nfs: livelock prevention is now done in VFS
2010-12-13 6:42 [PATCH 00/47] IO-less dirty throttling v3 Wu Fengguang
` (37 preceding siblings ...)
2010-12-13 6:43 ` [PATCH 38/47] btrfs: wait on too many nr_async_bios Wu Fengguang
@ 2010-12-13 6:43 ` Wu Fengguang
2010-12-13 6:43 ` [PATCH 40/47] nfs: writeback pages wait queue Wu Fengguang
` (8 subsequent siblings)
47 siblings, 0 replies; 51+ messages in thread
From: Wu Fengguang @ 2010-12-13 6:43 UTC (permalink / raw)
To: Andrew Morton
Cc: Jan Kara, Trond Myklebust, Wu Fengguang, Christoph Hellwig,
Dave Chinner, Theodore Ts'o, Chris Mason, Peter Zijlstra,
Mel Gorman, Rik van Riel, KOSAKI Motohiro, Greg Thelen,
Minchan Kim, linux-mm, linux-fsdevel, LKML
[-- Attachment #1: nfs-revert-livelock-72cb77f4a5ac.patch --]
[-- Type: text/plain, Size: 2760 bytes --]
This reverts commit 72cb77f4a5 ("NFS: Throttle page dirtying while we're
flushing to disk"). The two problems it tried to address
- sync livelock
- out-of-order writes
are now both handled in the VFS:
- PAGECACHE_TAG_TOWRITE prevents the sync livelock
- the IO-less balance_dirty_pages() avoids concurrent writes
CC: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
fs/nfs/file.c | 9 ---------
fs/nfs/write.c | 11 -----------
include/linux/nfs_fs.h | 1 -
3 files changed, 21 deletions(-)
--- linux-next.orig/fs/nfs/file.c 2010-12-09 12:21:03.000000000 +0800
+++ linux-next/fs/nfs/file.c 2010-12-09 12:25:01.000000000 +0800
@@ -392,15 +392,6 @@ static int nfs_write_begin(struct file *
IOMODE_RW);
start:
- /*
- * Prevent starvation issues if someone is doing a consistency
- * sync-to-disk
- */
- ret = wait_on_bit(&NFS_I(mapping->host)->flags, NFS_INO_FLUSHING,
- nfs_wait_bit_killable, TASK_KILLABLE);
- if (ret)
- return ret;
-
page = grab_cache_page_write_begin(mapping, index, flags);
if (!page)
return -ENOMEM;
--- linux-next.orig/fs/nfs/write.c 2010-12-09 12:21:03.000000000 +0800
+++ linux-next/fs/nfs/write.c 2010-12-09 12:25:01.000000000 +0800
@@ -337,26 +337,15 @@ static int nfs_writepages_callback(struc
int nfs_writepages(struct address_space *mapping, struct writeback_control *wbc)
{
struct inode *inode = mapping->host;
- unsigned long *bitlock = &NFS_I(inode)->flags;
struct nfs_pageio_descriptor pgio;
int err;
- /* Stop dirtying of new pages while we sync */
- err = wait_on_bit_lock(bitlock, NFS_INO_FLUSHING,
- nfs_wait_bit_killable, TASK_KILLABLE);
- if (err)
- goto out_err;
-
nfs_inc_stats(inode, NFSIOS_VFSWRITEPAGES);
nfs_pageio_init_write(&pgio, inode, wb_priority(wbc));
err = write_cache_pages(mapping, wbc, nfs_writepages_callback, &pgio);
nfs_pageio_complete(&pgio);
- clear_bit_unlock(NFS_INO_FLUSHING, bitlock);
- smp_mb__after_clear_bit();
- wake_up_bit(bitlock, NFS_INO_FLUSHING);
-
if (err < 0)
goto out_err;
err = pgio.pg_error;
--- linux-next.orig/include/linux/nfs_fs.h 2010-12-09 12:21:03.000000000 +0800
+++ linux-next/include/linux/nfs_fs.h 2010-12-09 12:25:01.000000000 +0800
@@ -216,7 +216,6 @@ struct nfs_inode {
#define NFS_INO_STALE (1) /* possible stale inode */
#define NFS_INO_ACL_LRU_SET (2) /* Inode is on the LRU list */
#define NFS_INO_MOUNTPOINT (3) /* inode is remote mountpoint */
-#define NFS_INO_FLUSHING (4) /* inode is flushing out data */
#define NFS_INO_FSCACHE (5) /* inode can be cached by FS-Cache */
#define NFS_INO_FSCACHE_LOCK (6) /* FS-Cache cookie management lock */
#define NFS_INO_COMMIT (7) /* inode is committing unstable writes */
^ permalink raw reply [flat|nested] 51+ messages in thread
* [PATCH 40/47] nfs: writeback pages wait queue
2010-12-13 6:42 [PATCH 00/47] IO-less dirty throttling v3 Wu Fengguang
` (38 preceding siblings ...)
2010-12-13 6:43 ` [PATCH 39/47] nfs: livelock prevention is now done in VFS Wu Fengguang
@ 2010-12-13 6:43 ` Wu Fengguang
2010-12-13 6:43 ` [PATCH 41/47] nfs: in-commit pages accounting and " Wu Fengguang
` (7 subsequent siblings)
47 siblings, 0 replies; 51+ messages in thread
From: Wu Fengguang @ 2010-12-13 6:43 UTC (permalink / raw)
To: Andrew Morton
Cc: Jan Kara, Jens Axboe, Chris Mason, Peter Zijlstra,
Trond Myklebust, Wu Fengguang, Christoph Hellwig, Dave Chinner,
Theodore Ts'o, Mel Gorman, Rik van Riel, KOSAKI Motohiro,
Greg Thelen, Minchan Kim, linux-mm, linux-fsdevel, LKML
[-- Attachment #1: writeback-nfs-request-queue.patch --]
[-- Type: text/plain, Size: 6156 bytes --]
The generic writeback routines are moving away from congestion_wait()
in favor of get_request_wait(), i.e. waiting on the block request queues.
Introduce the missing writeback wait queue for NFS; otherwise its
writeback pages can grow out of control and exhaust all PG_dirty pages.
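To make the thresholds concrete, assume 4KB pages and
nfs_congestion_kb = 65536 (64MB) purely for illustration: limit is then
16384 pages and NFS_WAIT_PAGES is 256 pages, so async writers block once
more than 16384 pages are under writeback and are woken when the count
drops below 16128, while sync writers only block above 32768 pages and
wake below 32512.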
CC: Jens Axboe <jens.axboe@oracle.com>
CC: Chris Mason <chris.mason@oracle.com>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
CC: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
fs/nfs/client.c | 2
fs/nfs/write.c | 93 +++++++++++++++++++++++++++++++-----
include/linux/nfs_fs_sb.h | 1
3 files changed, 85 insertions(+), 11 deletions(-)
--- linux-next.orig/fs/nfs/write.c 2010-12-09 12:25:01.000000000 +0800
+++ linux-next/fs/nfs/write.c 2010-12-09 12:25:01.000000000 +0800
@@ -185,11 +185,68 @@ static int wb_priority(struct writeback_
* NFS congestion control
*/
+#define NFS_WAIT_PAGES (1024L >> (PAGE_SHIFT - 10))
int nfs_congestion_kb;
-#define NFS_CONGESTION_ON_THRESH (nfs_congestion_kb >> (PAGE_SHIFT-10))
-#define NFS_CONGESTION_OFF_THRESH \
- (NFS_CONGESTION_ON_THRESH - (NFS_CONGESTION_ON_THRESH >> 2))
+/*
+ * SYNC requests will block on (2*limit) and wakeup on (2*limit-NFS_WAIT_PAGES)
+ * ASYNC requests will block on (limit) and wakeup on (limit - NFS_WAIT_PAGES)
+ * In this way SYNC writes will never be blocked by ASYNC ones.
+ */
+
+static void nfs_set_congested(long nr, struct backing_dev_info *bdi)
+{
+ long limit = nfs_congestion_kb >> (PAGE_SHIFT - 10);
+
+ if (nr > limit && !test_bit(BDI_async_congested, &bdi->state))
+ set_bdi_congested(bdi, BLK_RW_ASYNC);
+ else if (nr > 2 * limit && !test_bit(BDI_sync_congested, &bdi->state))
+ set_bdi_congested(bdi, BLK_RW_SYNC);
+}
+
+static void nfs_wait_contested(int is_sync,
+ struct backing_dev_info *bdi,
+ wait_queue_head_t *wqh)
+{
+ int waitbit = is_sync ? BDI_sync_congested : BDI_async_congested;
+ DEFINE_WAIT(wait);
+
+ if (!test_bit(waitbit, &bdi->state))
+ return;
+
+ for (;;) {
+ prepare_to_wait(&wqh[is_sync], &wait, TASK_UNINTERRUPTIBLE);
+ if (!test_bit(waitbit, &bdi->state))
+ break;
+
+ io_schedule();
+ }
+ finish_wait(&wqh[is_sync], &wait);
+}
+
+static void nfs_wakeup_congested(long nr,
+ struct backing_dev_info *bdi,
+ wait_queue_head_t *wqh)
+{
+ long limit = nfs_congestion_kb >> (PAGE_SHIFT - 10);
+
+ if (nr < 2 * limit - min(limit / 8, NFS_WAIT_PAGES)) {
+ if (test_bit(BDI_sync_congested, &bdi->state)) {
+ clear_bdi_congested(bdi, BLK_RW_SYNC);
+ smp_mb__after_clear_bit();
+ }
+ if (waitqueue_active(&wqh[BLK_RW_SYNC]))
+ wake_up(&wqh[BLK_RW_SYNC]);
+ }
+ if (nr < limit - min(limit / 8, NFS_WAIT_PAGES)) {
+ if (test_bit(BDI_async_congested, &bdi->state)) {
+ clear_bdi_congested(bdi, BLK_RW_ASYNC);
+ smp_mb__after_clear_bit();
+ }
+ if (waitqueue_active(&wqh[BLK_RW_ASYNC]))
+ wake_up(&wqh[BLK_RW_ASYNC]);
+ }
+}
static int nfs_set_page_writeback(struct page *page)
{
@@ -200,11 +257,8 @@ static int nfs_set_page_writeback(struct
struct nfs_server *nfss = NFS_SERVER(inode);
page_cache_get(page);
- if (atomic_long_inc_return(&nfss->writeback) >
- NFS_CONGESTION_ON_THRESH) {
- set_bdi_congested(&nfss->backing_dev_info,
- BLK_RW_ASYNC);
- }
+ nfs_set_congested(atomic_long_inc_return(&nfss->writeback),
+ &nfss->backing_dev_info);
}
return ret;
}
@@ -216,8 +270,10 @@ static void nfs_end_page_writeback(struc
end_page_writeback(page);
page_cache_release(page);
- if (atomic_long_dec_return(&nfss->writeback) < NFS_CONGESTION_OFF_THRESH)
- clear_bdi_congested(&nfss->backing_dev_info, BLK_RW_ASYNC);
+
+ nfs_wakeup_congested(atomic_long_dec_return(&nfss->writeback),
+ &nfss->backing_dev_info,
+ nfss->writeback_wait);
}
static struct nfs_page *nfs_find_and_lock_request(struct page *page, bool nonblock)
@@ -318,19 +374,34 @@ static int nfs_writepage_locked(struct p
int nfs_writepage(struct page *page, struct writeback_control *wbc)
{
+ struct inode *inode = page->mapping->host;
+ struct nfs_server *nfss = NFS_SERVER(inode);
int ret;
ret = nfs_writepage_locked(page, wbc);
unlock_page(page);
+
+ nfs_wait_contested(wbc->sync_mode == WB_SYNC_ALL,
+ &nfss->backing_dev_info,
+ nfss->writeback_wait);
+
return ret;
}
-static int nfs_writepages_callback(struct page *page, struct writeback_control *wbc, void *data)
+static int nfs_writepages_callback(struct page *page,
+ struct writeback_control *wbc, void *data)
{
+ struct inode *inode = page->mapping->host;
+ struct nfs_server *nfss = NFS_SERVER(inode);
int ret;
ret = nfs_do_writepage(page, wbc, data);
unlock_page(page);
+
+ nfs_wait_contested(wbc->sync_mode == WB_SYNC_ALL,
+ &nfss->backing_dev_info,
+ nfss->writeback_wait);
+
return ret;
}
--- linux-next.orig/include/linux/nfs_fs_sb.h 2010-12-09 12:21:03.000000000 +0800
+++ linux-next/include/linux/nfs_fs_sb.h 2010-12-09 12:25:01.000000000 +0800
@@ -106,6 +106,7 @@ struct nfs_server {
struct nfs_iostats __percpu *io_stats; /* I/O statistics */
struct backing_dev_info backing_dev_info;
atomic_long_t writeback; /* number of writeback pages */
+ wait_queue_head_t writeback_wait[2];
int flags; /* various flags */
unsigned int caps; /* server capabilities */
unsigned int rsize; /* read size */
--- linux-next.orig/fs/nfs/client.c 2010-12-09 12:21:03.000000000 +0800
+++ linux-next/fs/nfs/client.c 2010-12-09 12:25:01.000000000 +0800
@@ -1006,6 +1006,8 @@ static struct nfs_server *nfs_alloc_serv
INIT_LIST_HEAD(&server->master_link);
atomic_set(&server->active, 0);
+ init_waitqueue_head(&server->writeback_wait[BLK_RW_SYNC]);
+ init_waitqueue_head(&server->writeback_wait[BLK_RW_ASYNC]);
server->io_stats = nfs_alloc_iostats();
if (!server->io_stats) {
--
^ permalink raw reply [flat|nested] 51+ messages in thread
* [PATCH 41/47] nfs: in-commit pages accounting and wait queue
2010-12-13 6:42 [PATCH 00/47] IO-less dirty throttling v3 Wu Fengguang
` (39 preceding siblings ...)
2010-12-13 6:43 ` [PATCH 40/47] nfs: writeback pages wait queue Wu Fengguang
@ 2010-12-13 6:43 ` Wu Fengguang
2010-12-13 6:43 ` [PATCH 42/47] nfs: heuristics to avoid commit Wu Fengguang
` (6 subsequent siblings)
47 siblings, 0 replies; 51+ messages in thread
From: Wu Fengguang @ 2010-12-13 6:43 UTC (permalink / raw)
To: Andrew Morton
Cc: Jan Kara, Trond Myklebust, Wu Fengguang, Christoph Hellwig,
Dave Chinner, Theodore Ts'o, Chris Mason, Peter Zijlstra,
Mel Gorman, Rik van Riel, KOSAKI Motohiro, Greg Thelen,
Minchan Kim, linux-mm, linux-fsdevel, LKML
[-- Attachment #1: writeback-nfs-in-commit.patch --]
[-- Type: text/plain, Size: 6146 bytes --]
When doing 10+ concurrent dd's, I observed very bumpy commit submission
(partly because the dd's are started at the same time, and hence reach
4MB of to-commit pages at the same time). Basically we rely on the server
to complete and return write/commit requests, and want both to progress
smoothly without consuming too many pages. The write request wait queue
is not enough, since it is mainly network bound. So add another wait
queue for commit requests. Only async writes need to sleep on this queue.
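For illustration, with the same assumed nfs_congestion_kb = 65536 (64MB)
and 4KB pages: limit is 16384 pages, so an async nfs_commit_inode()
caller sleeps in nfs_commit_wait() whenever 16384 or more pages are
already in commit, and nfs_commit_wakeup() wakes the queue once completed
commits bring the count below limit - limit/8 = 14336 pages, which
provides some hysteresis against bumpy submission.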
cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
fs/nfs/client.c | 1
fs/nfs/write.c | 51 ++++++++++++++++++++++++++++++------
include/linux/nfs_fs_sb.h | 2 +
3 files changed, 46 insertions(+), 8 deletions(-)
--- linux-next.orig/fs/nfs/write.c 2010-12-09 12:25:01.000000000 +0800
+++ linux-next/fs/nfs/write.c 2010-12-09 12:25:02.000000000 +0800
@@ -516,7 +516,7 @@ nfs_mark_request_commit(struct nfs_page
}
static int
-nfs_clear_request_commit(struct nfs_page *req)
+nfs_clear_request_commit(struct inode *inode, struct nfs_page *req)
{
struct page *page = req->wb_page;
@@ -554,7 +554,7 @@ nfs_mark_request_commit(struct nfs_page
}
static inline int
-nfs_clear_request_commit(struct nfs_page *req)
+nfs_clear_request_commit(struct inode *inode, struct nfs_page *req)
{
return 0;
}
@@ -599,8 +599,10 @@ nfs_scan_commit(struct inode *inode, str
return 0;
ret = nfs_scan_list(nfsi, dst, idx_start, npages, NFS_PAGE_TAG_COMMIT);
- if (ret > 0)
+ if (ret > 0) {
nfsi->ncommit -= ret;
+ atomic_long_add(ret, &NFS_SERVER(inode)->in_commit);
+ }
if (nfs_need_commit(NFS_I(inode)))
__mark_inode_dirty(inode, I_DIRTY_DATASYNC);
return ret;
@@ -668,7 +670,7 @@ static struct nfs_page *nfs_try_to_updat
spin_lock(&inode->i_lock);
}
- if (nfs_clear_request_commit(req) &&
+ if (nfs_clear_request_commit(inode, req) &&
radix_tree_tag_clear(&NFS_I(inode)->nfs_page_tree,
req->wb_index, NFS_PAGE_TAG_COMMIT) != NULL)
NFS_I(inode)->ncommit--;
@@ -1271,6 +1273,34 @@ int nfs_writeback_done(struct rpc_task *
#if defined(CONFIG_NFS_V3) || defined(CONFIG_NFS_V4)
+static void nfs_commit_wait(struct nfs_server *nfss)
+{
+ long limit = nfs_congestion_kb >> (PAGE_SHIFT - 10);
+ DEFINE_WAIT(wait);
+
+ if (atomic_long_read(&nfss->in_commit) < limit)
+ return;
+
+ for (;;) {
+ prepare_to_wait(&nfss->in_commit_wait, &wait,
+ TASK_UNINTERRUPTIBLE);
+ if (atomic_long_read(&nfss->in_commit) < limit)
+ break;
+
+ io_schedule();
+ }
+ finish_wait(&nfss->in_commit_wait, &wait);
+}
+
+static void nfs_commit_wakeup(struct nfs_server *nfss)
+{
+ long limit = nfs_congestion_kb >> (PAGE_SHIFT - 10);
+
+ if (atomic_long_read(&nfss->in_commit) < limit - limit / 8 &&
+ waitqueue_active(&nfss->in_commit_wait))
+ wake_up(&nfss->in_commit_wait);
+}
+
static int nfs_commit_set_lock(struct nfs_inode *nfsi, int may_wait)
{
if (!test_and_set_bit(NFS_INO_COMMIT, &nfsi->flags))
@@ -1376,6 +1406,7 @@ nfs_commit_list(struct inode *inode, str
req = nfs_list_entry(head->next);
nfs_list_remove_request(req);
nfs_mark_request_commit(req);
+ atomic_long_dec(&NFS_SERVER(inode)->in_commit);
dec_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);
dec_bdi_stat(req->wb_page->mapping->backing_dev_info,
BDI_RECLAIMABLE);
@@ -1409,7 +1440,8 @@ static void nfs_commit_release(void *cal
while (!list_empty(&data->pages)) {
req = nfs_list_entry(data->pages.next);
nfs_list_remove_request(req);
- nfs_clear_request_commit(req);
+ nfs_clear_request_commit(data->inode, req);
+ atomic_long_dec(&NFS_SERVER(data->inode)->in_commit);
dprintk("NFS: commit (%s/%lld %d@%lld)",
req->wb_context->path.dentry->d_inode->i_sb->s_id,
@@ -1438,6 +1470,7 @@ static void nfs_commit_release(void *cal
nfs_clear_page_tag_locked(req);
}
nfs_commit_clear_lock(NFS_I(data->inode));
+ nfs_commit_wakeup(NFS_SERVER(data->inode));
nfs_commitdata_release(calldata);
}
@@ -1452,11 +1485,13 @@ static const struct rpc_call_ops nfs_com
int nfs_commit_inode(struct inode *inode, int how)
{
LIST_HEAD(head);
- int may_wait = how & FLUSH_SYNC;
+ int sync = how & FLUSH_SYNC;
int res = 0;
- if (!nfs_commit_set_lock(NFS_I(inode), may_wait))
+ if (!nfs_commit_set_lock(NFS_I(inode), sync))
goto out_mark_dirty;
+ if (!sync)
+ nfs_commit_wait(NFS_SERVER(inode));
spin_lock(&inode->i_lock);
res = nfs_scan_commit(inode, &head, 0, 0);
spin_unlock(&inode->i_lock);
@@ -1464,7 +1499,7 @@ int nfs_commit_inode(struct inode *inode
int error = nfs_commit_list(inode, &head, how);
if (error < 0)
return error;
- if (may_wait)
+ if (sync)
wait_on_bit(&NFS_I(inode)->flags, NFS_INO_COMMIT,
nfs_wait_bit_killable,
TASK_KILLABLE);
--- linux-next.orig/include/linux/nfs_fs_sb.h 2010-12-09 12:25:01.000000000 +0800
+++ linux-next/include/linux/nfs_fs_sb.h 2010-12-09 12:25:02.000000000 +0800
@@ -107,6 +107,8 @@ struct nfs_server {
struct backing_dev_info backing_dev_info;
atomic_long_t writeback; /* number of writeback pages */
wait_queue_head_t writeback_wait[2];
+ atomic_long_t in_commit; /* number of in-commit pages */
+ wait_queue_head_t in_commit_wait;
int flags; /* various flags */
unsigned int caps; /* server capabilities */
unsigned int rsize; /* read size */
--- linux-next.orig/fs/nfs/client.c 2010-12-09 12:25:01.000000000 +0800
+++ linux-next/fs/nfs/client.c 2010-12-09 12:25:02.000000000 +0800
@@ -1008,6 +1008,7 @@ static struct nfs_server *nfs_alloc_serv
atomic_set(&server->active, 0);
init_waitqueue_head(&server->writeback_wait[BLK_RW_SYNC]);
init_waitqueue_head(&server->writeback_wait[BLK_RW_ASYNC]);
+ init_waitqueue_head(&server->in_commit_wait);
server->io_stats = nfs_alloc_iostats();
if (!server->io_stats) {
--
^ permalink raw reply [flat|nested] 51+ messages in thread
* [PATCH 42/47] nfs: heuristics to avoid commit
2010-12-13 6:42 [PATCH 00/47] IO-less dirty throttling v3 Wu Fengguang
` (40 preceding siblings ...)
2010-12-13 6:43 ` [PATCH 41/47] nfs: in-commit pages accounting and " Wu Fengguang
@ 2010-12-13 6:43 ` Wu Fengguang
2010-12-13 6:43 ` [PATCH 43/47] nfs: dont change wbc->nr_to_write in write_inode() Wu Fengguang
` (5 subsequent siblings)
47 siblings, 0 replies; 51+ messages in thread
From: Wu Fengguang @ 2010-12-13 6:43 UTC (permalink / raw)
To: Andrew Morton
Cc: Jan Kara, Wu Fengguang, Christoph Hellwig, Trond Myklebust,
Dave Chinner, Theodore Ts'o, Chris Mason, Peter Zijlstra,
Mel Gorman, Rik van Riel, KOSAKI Motohiro, Greg Thelen,
Minchan Kim, linux-mm, linux-fsdevel, LKML
[-- Attachment #1: writeback-nfs-should-commit.patch --]
[-- Type: text/plain, Size: 2199 bytes --]
The heuristics introduced by commit 420e3646 ("NFS: Reduce the number of
unnecessary COMMIT calls") do not work well for large inodes that are
being actively written to.
Refine the criteria: commit when the inode
- has gone quiet (all data transferred to the server), or
- has accumulated >= 4MB of data to commit (so it will be a large IO), or
- there are too few active commits (hence too little active IO) in the server
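With 4KB pages, MIN_WRITEBACK_PAGES (from the earlier chunk size patch)
is 1024 pages, so the second criterion fires once an inode has
accumulated at least 4MB of to-commit data.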
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
fs/nfs/write.c | 31 ++++++++++++++++++++++++++-----
1 file changed, 26 insertions(+), 5 deletions(-)
--- linux-next.orig/fs/nfs/write.c 2010-12-09 12:19:24.000000000 +0800
+++ linux-next/fs/nfs/write.c 2010-12-09 12:20:20.000000000 +0800
@@ -1518,17 +1518,38 @@ out_mark_dirty:
return res;
}
-static int nfs_commit_unstable_pages(struct inode *inode, struct writeback_control *wbc)
+static bool nfs_should_commit(struct inode *inode,
+ struct writeback_control *wbc)
{
+ struct nfs_server *nfss = NFS_SERVER(inode);
struct nfs_inode *nfsi = NFS_I(inode);
+ unsigned long npages = nfsi->npages;
+ unsigned long to_commit = nfsi->ncommit;
+ unsigned long in_commit = atomic_long_read(&nfss->in_commit);
+
+ /* no more active writes */
+ if (to_commit == npages)
+ return true;
+
+ /* big enough */
+ if (to_commit >= MIN_WRITEBACK_PAGES)
+ return true;
+
+ /* active commits drop low: kick more IO for the server disk */
+ if (to_commit > in_commit / 2)
+ return true;
+
+ return false;
+}
+
+static int nfs_commit_unstable_pages(struct inode *inode,
+ struct writeback_control *wbc)
+{
int flags = FLUSH_SYNC;
int ret = 0;
if (wbc->sync_mode == WB_SYNC_NONE) {
- /* Don't commit yet if this is a non-blocking flush and there
- * are a lot of outstanding writes for this mapping.
- */
- if (nfsi->ncommit <= (nfsi->npages >> 1))
+ if (!nfs_should_commit(inode, wbc))
goto out_mark_dirty;
/* don't wait for the COMMIT response */
--
^ permalink raw reply [flat|nested] 51+ messages in thread
* [PATCH 43/47] nfs: dont change wbc->nr_to_write in write_inode()
2010-12-13 6:42 [PATCH 00/47] IO-less dirty throttling v3 Wu Fengguang
` (41 preceding siblings ...)
2010-12-13 6:43 ` [PATCH 42/47] nfs: heuristics to avoid commit Wu Fengguang
@ 2010-12-13 6:43 ` Wu Fengguang
2010-12-13 6:43 ` [PATCH 44/47] nfs: limit the range of commits Wu Fengguang
` (4 subsequent siblings)
47 siblings, 0 replies; 51+ messages in thread
From: Wu Fengguang @ 2010-12-13 6:43 UTC (permalink / raw)
To: Andrew Morton
Cc: Jan Kara, Trond Myklebust, Wu Fengguang, Christoph Hellwig,
Dave Chinner, Theodore Ts'o, Chris Mason, Peter Zijlstra,
Mel Gorman, Rik van Riel, KOSAKI Motohiro, Greg Thelen,
Minchan Kim, linux-mm, linux-fsdevel, LKML
[-- Attachment #1: writeback-nfs-commit-remove-nr_to_write.patch --]
[-- Type: text/plain, Size: 1122 bytes --]
This was introduced in commit 420e3646 ("NFS: Reduce the number of
unnecessary COMMIT calls") and no longer seems necessary.
CC: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
fs/nfs/write.c | 9 +--------
1 file changed, 1 insertion(+), 8 deletions(-)
--- linux-next.orig/fs/nfs/write.c 2010-12-09 12:20:20.000000000 +0800
+++ linux-next/fs/nfs/write.c 2010-12-09 12:20:30.000000000 +0800
@@ -1557,15 +1557,8 @@ static int nfs_commit_unstable_pages(str
}
ret = nfs_commit_inode(inode, flags);
- if (ret >= 0) {
- if (wbc->sync_mode == WB_SYNC_NONE) {
- if (ret < wbc->nr_to_write)
- wbc->nr_to_write -= ret;
- else
- wbc->nr_to_write = 0;
- }
+ if (ret >= 0)
return 0;
- }
out_mark_dirty:
__mark_inode_dirty(inode, I_DIRTY_DATASYNC);
return ret;
--
^ permalink raw reply [flat|nested] 51+ messages in thread
* [PATCH 44/47] nfs: limit the range of commits
2010-12-13 6:42 [PATCH 00/47] IO-less dirty throttling v3 Wu Fengguang
` (42 preceding siblings ...)
2010-12-13 6:43 ` [PATCH 43/47] nfs: dont change wbc->nr_to_write in write_inode() Wu Fengguang
@ 2010-12-13 6:43 ` Wu Fengguang
2010-12-13 6:43 ` [PATCH 45/47] nfs: adapt congestion threshold to dirty threshold Wu Fengguang
` (3 subsequent siblings)
47 siblings, 0 replies; 51+ messages in thread
From: Wu Fengguang @ 2010-12-13 6:43 UTC (permalink / raw)
To: Andrew Morton
Cc: Jan Kara, Trond Myklebust, Wu Fengguang, Christoph Hellwig,
Dave Chinner, Theodore Ts'o, Chris Mason, Peter Zijlstra,
Mel Gorman, Rik van Riel, KOSAKI Motohiro, Greg Thelen,
Minchan Kim, linux-mm, linux-fsdevel, LKML
[-- Attachment #1: nfs-commit-range.patch --]
[-- Type: text/plain, Size: 2970 bytes --]
Hopefully this will limit the number of unstable pages synced at one
time, allow more timely return of commit requests, and reduce dirty
throttle fluctuations.
CC: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
fs/nfs/write.c | 20 ++++++++++++++------
1 file changed, 14 insertions(+), 6 deletions(-)
--- linux-next.orig/fs/nfs/write.c 2010-12-08 22:54:08.000000000 +0800
+++ linux-next/fs/nfs/write.c 2010-12-08 22:54:13.000000000 +0800
@@ -1333,7 +1333,7 @@ static void nfs_commitdata_release(void
*/
static int nfs_commit_rpcsetup(struct list_head *head,
struct nfs_write_data *data,
- int how)
+ int how, pgoff_t offset, pgoff_t count)
{
struct nfs_page *first = nfs_list_entry(head->next);
struct inode *inode = first->wb_context->path.dentry->d_inode;
@@ -1365,8 +1365,8 @@ static int nfs_commit_rpcsetup(struct li
data->args.fh = NFS_FH(data->inode);
/* Note: we always request a commit of the entire inode */
- data->args.offset = 0;
- data->args.count = 0;
+ data->args.offset = offset;
+ data->args.count = count;
data->args.context = get_nfs_open_context(first->wb_context);
data->res.count = 0;
data->res.fattr = &data->fattr;
@@ -1389,7 +1389,8 @@ static int nfs_commit_rpcsetup(struct li
* Commit dirty pages
*/
static int
-nfs_commit_list(struct inode *inode, struct list_head *head, int how)
+nfs_commit_list(struct inode *inode, struct list_head *head, int how,
+ pgoff_t offset, pgoff_t count)
{
struct nfs_write_data *data;
struct nfs_page *req;
@@ -1400,7 +1401,7 @@ nfs_commit_list(struct inode *inode, str
goto out_bad;
/* Set up the argument struct */
- return nfs_commit_rpcsetup(head, data, how);
+ return nfs_commit_rpcsetup(head, data, how, offset, count);
out_bad:
while (!list_empty(head)) {
req = nfs_list_entry(head->next);
@@ -1485,6 +1486,8 @@ static const struct rpc_call_ops nfs_com
int nfs_commit_inode(struct inode *inode, int how)
{
LIST_HEAD(head);
+ pgoff_t first_index;
+ pgoff_t last_index;
int sync = how & FLUSH_SYNC;
int res = 0;
@@ -1494,9 +1497,14 @@ int nfs_commit_inode(struct inode *inode
nfs_commit_wait(NFS_SERVER(inode));
spin_lock(&inode->i_lock);
res = nfs_scan_commit(inode, &head, 0, 0);
+ if (res) {
+ first_index = nfs_list_entry(head.next)->wb_index;
+ last_index = nfs_list_entry(head.prev)->wb_index;
+ }
spin_unlock(&inode->i_lock);
if (res) {
- int error = nfs_commit_list(inode, &head, how);
+ int error = nfs_commit_list(inode, &head, how, first_index,
+ last_index - first_index + 1);
if (error < 0)
return error;
if (sync)
--
^ permalink raw reply [flat|nested] 51+ messages in thread
* [PATCH 45/47] nfs: adapt congestion threshold to dirty threshold
2010-12-13 6:42 [PATCH 00/47] IO-less dirty throttling v3 Wu Fengguang
` (43 preceding siblings ...)
2010-12-13 6:43 ` [PATCH 44/47] nfs: limit the range of commits Wu Fengguang
@ 2010-12-13 6:43 ` Wu Fengguang
2010-12-13 6:43 ` [PATCH 46/47] nfs: trace nfs_commit_unstable_pages() Wu Fengguang
` (2 subsequent siblings)
47 siblings, 0 replies; 51+ messages in thread
From: Wu Fengguang @ 2010-12-13 6:43 UTC (permalink / raw)
To: Andrew Morton
Cc: Jan Kara, Trond Myklebust, Wu Fengguang, Christoph Hellwig,
Dave Chinner, Theodore Ts'o, Chris Mason, Peter Zijlstra,
Mel Gorman, Rik van Riel, KOSAKI Motohiro, Greg Thelen,
Minchan Kim, linux-mm, linux-fsdevel, LKML
[-- Attachment #1: nfs-congestion-thresh.patch --]
[-- Type: text/plain, Size: 1873 bytes --]
nfs_congestion_kb controls the maximum number of writeback and in-commit
pages. It's not reasonable for them to outnumber the dirty and to-commit
pages, so each should take no more than 1/4 of the dirty threshold.
Considering that nfs_init_writepagecache() is called at fresh boot time,
when dirty_thresh is much higher than the real dirty limit seen after
user space has consumed lots of memory, use 1/8 instead.
We might update nfs_congestion_kb when the global dirty limit is changed
at runtime, but keep it simple for now.
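As a made-up example: if global_dirty_limits() reports a dirty_thresh of
200000 pages on a 4KB-page machine, the shift converts that to 800000 KB
and nfs_congestion_kb gets capped at 100000 KB (roughly 98MB), well under
the existing 256MB ceiling that large-memory machines would otherwise hit.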
CC: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
fs/nfs/write.c | 13 +++++++++++++
1 file changed, 13 insertions(+)
--- linux-next.orig/fs/nfs/write.c 2010-12-08 22:44:37.000000000 +0800
+++ linux-next/fs/nfs/write.c 2010-12-08 22:44:37.000000000 +0800
@@ -1698,6 +1698,9 @@ out:
int __init nfs_init_writepagecache(void)
{
+ unsigned long background_thresh;
+ unsigned long dirty_thresh;
+
nfs_wdata_cachep = kmem_cache_create("nfs_write_data",
sizeof(struct nfs_write_data),
0, SLAB_HWCACHE_ALIGN,
@@ -1735,6 +1738,16 @@ int __init nfs_init_writepagecache(void)
if (nfs_congestion_kb > 256*1024)
nfs_congestion_kb = 256*1024;
+ /*
+ * Limit to 1/8 dirty threshold, so that writeback+in_commit pages
+ * won't overnumber dirty+to_commit pages.
+ */
+ global_dirty_limits(&background_thresh, &dirty_thresh);
+ dirty_thresh <<= PAGE_SHIFT - 10;
+
+ if (nfs_congestion_kb > dirty_thresh / 8)
+ nfs_congestion_kb = dirty_thresh / 8;
+
return 0;
}
--
^ permalink raw reply [flat|nested] 51+ messages in thread
* [PATCH 46/47] nfs: trace nfs_commit_unstable_pages()
2010-12-13 6:42 [PATCH 00/47] IO-less dirty throttling v3 Wu Fengguang
` (44 preceding siblings ...)
2010-12-13 6:43 ` [PATCH 45/47] nfs: adapt congestion threshold to dirty threshold Wu Fengguang
@ 2010-12-13 6:43 ` Wu Fengguang
2010-12-13 6:43 ` [PATCH 47/47] nfs: trace nfs_commit_release() Wu Fengguang
2010-12-13 11:27 ` [PATCH 00/47] IO-less dirty throttling v3 Peter Zijlstra
47 siblings, 0 replies; 51+ messages in thread
From: Wu Fengguang @ 2010-12-13 6:43 UTC (permalink / raw)
To: Andrew Morton
Cc: Jan Kara, Trond Myklebust, Wu Fengguang, Christoph Hellwig,
Dave Chinner, Theodore Ts'o, Chris Mason, Peter Zijlstra,
Mel Gorman, Rik van Riel, KOSAKI Motohiro, Greg Thelen,
Minchan Kim, linux-mm, linux-fsdevel, LKML
[-- Attachment #1: nfs-trace-write_inode.patch --]
[-- Type: text/plain, Size: 2488 bytes --]
CC: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
fs/nfs/write.c | 10 ++++--
include/trace/events/nfs.h | 58 +++++++++++++++++++++++++++++++++++
2 files changed, 66 insertions(+), 2 deletions(-)
--- linux-next.orig/fs/nfs/write.c 2010-12-08 22:44:37.000000000 +0800
+++ linux-next/fs/nfs/write.c 2010-12-08 22:44:38.000000000 +0800
@@ -29,6 +29,9 @@
#include "nfs4_fs.h"
#include "fscache.h"
+#define CREATE_TRACE_POINTS
+#include <trace/events/nfs.h>
+
#define NFSDBG_FACILITY NFSDBG_PAGECACHE
#define MIN_POOL_WRITE (32)
@@ -1566,10 +1569,13 @@ static int nfs_commit_unstable_pages(str
ret = nfs_commit_inode(inode, flags);
if (ret >= 0)
- return 0;
+ goto out;
+
out_mark_dirty:
__mark_inode_dirty(inode, I_DIRTY_DATASYNC);
- return ret;
+out:
+ trace_nfs_commit_unstable_pages(inode, wbc, flags, ret);
+ return ret >= 0 ? 0 : ret;
}
#else
static int nfs_commit_unstable_pages(struct inode *inode, struct writeback_control *wbc)
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux-next/include/trace/events/nfs.h 2010-12-08 22:44:38.000000000 +0800
@@ -0,0 +1,58 @@
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM nfs
+
+#if !defined(_TRACE_NFS_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_NFS_H
+
+#include <linux/nfs_fs.h>
+
+
+TRACE_EVENT(nfs_commit_unstable_pages,
+
+ TP_PROTO(struct inode *inode,
+ struct writeback_control *wbc,
+ int sync,
+ int ret
+ ),
+
+ TP_ARGS(inode, wbc, sync, ret),
+
+ TP_STRUCT__entry(
+ __array(char, name, 32)
+ __field(unsigned long, ino)
+ __field(unsigned long, npages)
+ __field(unsigned long, in_commit)
+ __field(unsigned long, write_chunk)
+ __field(int, sync)
+ __field(int, ret)
+ ),
+
+ TP_fast_assign(
+ strncpy(__entry->name,
+ dev_name(inode->i_mapping->backing_dev_info->dev), 32);
+ __entry->ino = inode->i_ino;
+ __entry->npages = NFS_I(inode)->npages;
+ __entry->in_commit =
+ atomic_long_read(&NFS_SERVER(inode)->in_commit);
+ __entry->write_chunk = wbc->per_file_limit;
+ __entry->sync = sync;
+ __entry->ret = ret;
+ ),
+
+ TP_printk("bdi %s: ino=%lu npages=%ld "
+ "incommit=%lu write_chunk=%lu sync=%d ret=%d",
+ __entry->name,
+ __entry->ino,
+ __entry->npages,
+ __entry->in_commit,
+ __entry->write_chunk,
+ __entry->sync,
+ __entry->ret
+ )
+);
+
+
+#endif /* _TRACE_NFS_H */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>
^ permalink raw reply [flat|nested] 51+ messages in thread
* [PATCH 47/47] nfs: trace nfs_commit_release()
2010-12-13 6:42 [PATCH 00/47] IO-less dirty throttling v3 Wu Fengguang
` (45 preceding siblings ...)
2010-12-13 6:43 ` [PATCH 46/47] nfs: trace nfs_commit_unstable_pages() Wu Fengguang
@ 2010-12-13 6:43 ` Wu Fengguang
2010-12-13 11:27 ` [PATCH 00/47] IO-less dirty throttling v3 Peter Zijlstra
47 siblings, 0 replies; 51+ messages in thread
From: Wu Fengguang @ 2010-12-13 6:43 UTC (permalink / raw)
To: Andrew Morton
Cc: Jan Kara, Wu Fengguang, Christoph Hellwig, Trond Myklebust,
Dave Chinner, Theodore Ts'o, Chris Mason, Peter Zijlstra,
Mel Gorman, Rik van Riel, KOSAKI Motohiro, Greg Thelen,
Minchan Kim, linux-mm, linux-fsdevel, LKML
[-- Attachment #1: trace-nfs-commit-release.patch --]
[-- Type: text/plain, Size: 1823 bytes --]
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
fs/nfs/write.c | 3 +++
include/trace/events/nfs.h | 31 +++++++++++++++++++++++++++++++
2 files changed, 34 insertions(+)
--- linux-next.orig/fs/nfs/write.c 2010-12-08 22:44:38.000000000 +0800
+++ linux-next/fs/nfs/write.c 2010-12-08 22:44:38.000000000 +0800
@@ -1475,6 +1475,9 @@ static void nfs_commit_release(void *cal
}
nfs_commit_clear_lock(NFS_I(data->inode));
nfs_commit_wakeup(NFS_SERVER(data->inode));
+ trace_nfs_commit_release(data->inode,
+ data->args.offset,
+ data->args.count);
nfs_commitdata_release(calldata);
}
--- linux-next.orig/include/trace/events/nfs.h 2010-12-08 22:44:38.000000000 +0800
+++ linux-next/include/trace/events/nfs.h 2010-12-08 22:44:38.000000000 +0800
@@ -51,6 +51,37 @@ TRACE_EVENT(nfs_commit_unstable_pages,
)
);
+TRACE_EVENT(nfs_commit_release,
+
+ TP_PROTO(struct inode *inode,
+ unsigned long offset,
+ unsigned long len),
+
+ TP_ARGS(inode, offset, len),
+
+ TP_STRUCT__entry(
+ __array(char, name, 32)
+ __field(unsigned long, ino)
+ __field(unsigned long, offset)
+ __field(unsigned long, len)
+ ),
+
+ TP_fast_assign(
+ strncpy(__entry->name,
+ dev_name(inode->i_mapping->backing_dev_info->dev), 32);
+ __entry->ino = inode->i_ino;
+ __entry->offset = offset;
+ __entry->len = len;
+ ),
+
+ TP_printk("bdi %s: ino=%lu offset=%lu len=%lu",
+ __entry->name,
+ __entry->ino,
+ __entry->offset,
+ __entry->len
+ )
+);
+
#endif /* _TRACE_NFS_H */
--
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 00/47] IO-less dirty throttling v3
2010-12-13 6:42 [PATCH 00/47] IO-less dirty throttling v3 Wu Fengguang
` (46 preceding siblings ...)
2010-12-13 6:43 ` [PATCH 47/47] nfs: trace nfs_commit_release() Wu Fengguang
@ 2010-12-13 11:27 ` Peter Zijlstra
2010-12-13 11:49 ` Wu Fengguang
47 siblings, 1 reply; 51+ messages in thread
From: Peter Zijlstra @ 2010-12-13 11:27 UTC (permalink / raw)
To: Wu Fengguang
Cc: Andrew Morton, Jan Kara, Christoph Hellwig, Trond Myklebust,
Dave Chinner, Theodore Ts'o, Chris Mason, Mel Gorman,
Rik van Riel, KOSAKI Motohiro, Greg Thelen, Minchan Kim, linux-mm,
linux-fsdevel, LKML
On Mon, 2010-12-13 at 14:42 +0800, Wu Fengguang wrote:
> bdi dirty limit fixes
> [PATCH 01/47] writeback: enabling gate limit for light dirtied bdi
> [PATCH 02/47] writeback: safety margin for bdi stat error
>
> v2 patches rebased onto the above two fixes
> [PATCH 03/47] writeback: IO-less balance_dirty_pages()
> [PATCH 04/47] writeback: consolidate variable names in balance_dirty_pages()
> [PATCH 05/47] writeback: per-task rate limit on balance_dirty_pages()
> [PATCH 06/47] writeback: prevent duplicate balance_dirty_pages_ratelimited() calls
> [PATCH 07/47] writeback: account per-bdi accumulated written pages
> [PATCH 08/47] writeback: bdi write bandwidth estimation
> [PATCH 09/47] writeback: show bdi write bandwidth in debugfs
> [PATCH 10/47] writeback: quit throttling when bdi dirty pages dropped low
> [PATCH 11/47] writeback: reduce per-bdi dirty threshold ramp up time
> [PATCH 12/47] writeback: make reasonable gap between the dirty/background thresholds
> [PATCH 13/47] writeback: scale down max throttle bandwidth on concurrent dirtiers
> [PATCH 14/47] writeback: add trace event for balance_dirty_pages()
> [PATCH 15/47] writeback: make nr_to_write a per-file limit
>
> trivial fixes for v2
> [PATCH 16/47] writeback: make-nr_to_write-a-per-file-limit fix
> [PATCH 17/47] writeback: do uninterruptible sleep in balance_dirty_pages()
> [PATCH 18/47] writeback: move BDI_WRITTEN accounting into __bdi_writeout_inc()
> [PATCH 19/47] writeback: fix increasement of nr_dirtied_pause
> [PATCH 20/47] writeback: use do_div in bw calculation
> [PATCH 21/47] writeback: prevent divide error on tiny HZ
> [PATCH 22/47] writeback: prevent bandwidth calculation overflow
>
> spinlock protected bandwidth estimation, as suggested by Peter
> [PATCH 23/47] writeback: spinlock protected bdi bandwidth update
>
> algorithm updates
> [PATCH 24/47] writeback: increase pause time on concurrent dirtiers
> [PATCH 25/47] writeback: make it easier to break from a dirty exceeded bdi
> [PATCH 26/47] writeback: start background writeback earlier
> [PATCH 27/47] writeback: user space think time compensation
> [PATCH 28/47] writeback: bdi base throttle bandwidth
> [PATCH 29/47] writeback: smoothed bdi dirty pages
> [PATCH 30/47] writeback: adapt max balance pause time to memory size
> [PATCH 31/47] writeback: increase min pause time on concurrent dirtiers
I would think it would be easier for review to fold all this back into
sensible patches.
Reviewing is lots easier if the patches present logical steps. The
presented series will have us looking back and forth, review patch, find
bugs, then scan fwd to see if the bug has been solved, etc..
--
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 00/47] IO-less dirty throttling v3
2010-12-13 11:27 ` [PATCH 00/47] IO-less dirty throttling v3 Peter Zijlstra
@ 2010-12-13 11:49 ` Wu Fengguang
2010-12-13 12:38 ` Peter Zijlstra
0 siblings, 1 reply; 51+ messages in thread
From: Wu Fengguang @ 2010-12-13 11:49 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Andrew Morton, Jan Kara, Christoph Hellwig, Trond Myklebust,
Dave Chinner, Theodore Ts'o, Chris Mason, Mel Gorman,
Rik van Riel, KOSAKI Motohiro, Greg Thelen, Minchan Kim, linux-mm,
linux-fsdevel@vger.kernel.org, LKML
On Mon, Dec 13, 2010 at 07:27:11PM +0800, Peter Zijlstra wrote:
> On Mon, 2010-12-13 at 14:42 +0800, Wu Fengguang wrote:
> > bdi dirty limit fixes
> > [PATCH 01/47] writeback: enabling gate limit for light dirtied bdi
> > [PATCH 02/47] writeback: safety margin for bdi stat error
> >
> > v2 patches rebased onto the above two fixes
> > [PATCH 03/47] writeback: IO-less balance_dirty_pages()
> > [PATCH 04/47] writeback: consolidate variable names in balance_dirty_pages()
> > [PATCH 05/47] writeback: per-task rate limit on balance_dirty_pages()
> > [PATCH 06/47] writeback: prevent duplicate balance_dirty_pages_ratelimited() calls
> > [PATCH 07/47] writeback: account per-bdi accumulated written pages
> > [PATCH 08/47] writeback: bdi write bandwidth estimation
> > [PATCH 09/47] writeback: show bdi write bandwidth in debugfs
> > [PATCH 10/47] writeback: quit throttling when bdi dirty pages dropped low
> > [PATCH 11/47] writeback: reduce per-bdi dirty threshold ramp up time
> > [PATCH 12/47] writeback: make reasonable gap between the dirty/background thresholds
> > [PATCH 13/47] writeback: scale down max throttle bandwidth on concurrent dirtiers
> > [PATCH 14/47] writeback: add trace event for balance_dirty_pages()
> > [PATCH 15/47] writeback: make nr_to_write a per-file limit
> >
> > trivial fixes for v2
> > [PATCH 16/47] writeback: make-nr_to_write-a-per-file-limit fix
> > [PATCH 17/47] writeback: do uninterruptible sleep in balance_dirty_pages()
> > [PATCH 18/47] writeback: move BDI_WRITTEN accounting into __bdi_writeout_inc()
> > [PATCH 19/47] writeback: fix increasement of nr_dirtied_pause
> > [PATCH 20/47] writeback: use do_div in bw calculation
> > [PATCH 21/47] writeback: prevent divide error on tiny HZ
> > [PATCH 22/47] writeback: prevent bandwidth calculation overflow
> >
> > spinlock protected bandwidth estimation, as suggested by Peter
> > [PATCH 23/47] writeback: spinlock protected bdi bandwidth update
> >
> > algorithm updates
> > [PATCH 24/47] writeback: increase pause time on concurrent dirtiers
> > [PATCH 25/47] writeback: make it easier to break from a dirty exceeded bdi
> > [PATCH 26/47] writeback: start background writeback earlier
> > [PATCH 27/47] writeback: user space think time compensation
> > [PATCH 28/47] writeback: bdi base throttle bandwidth
> > [PATCH 29/47] writeback: smoothed bdi dirty pages
> > [PATCH 30/47] writeback: adapt max balance pause time to memory size
> > [PATCH 31/47] writeback: increase min pause time on concurrent dirtiers
>
> I would think it would be easier to review if all this were folded back
> into sensible patches.
>
> Reviewing is a lot easier if the patches present logical steps. The
> presented series will have us looking back and forth: review a patch, find
> bugs, then scan forward to see if the bug has been solved, etc.
Good suggestion. Sorry, I did plan to fold them at some later time.
I'll do a new version that folds patches 16-25. Patches 26-31 will be
retained since they are logical enhancements that do not involve
back-and-forth changes. Patch 12 will be removed as it does not seem
absolutely necessary -- let users configure whatever they feel is OK,
even if it means the throttling algorithms will work under somewhat
suboptimal conditions.
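For reference, the idea of keeping such a gap can be sketched roughly as
below. This is only an illustration of the concept, not code from patch 12;
the helper name and the 1/2 clamp factor are made up for the example:

/*
 * Illustrative sketch: keep the background threshold well below the
 * dirty threshold, so background writeback starts before dirtiers get
 * throttled.  If dirty_background_* is configured at or above dirty_*,
 * pull it back to half of the dirty threshold.
 */
static unsigned long clamped_bg_thresh(unsigned long bg_thresh,
                                       unsigned long thresh)
{
        if (bg_thresh >= thresh)
                bg_thresh = thresh / 2;
        return bg_thresh;
}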
Thanks,
Fengguang
* Re: [PATCH 00/47] IO-less dirty throttling v3
2010-12-13 11:49 ` Wu Fengguang
@ 2010-12-13 12:38 ` Peter Zijlstra
0 siblings, 0 replies; 51+ messages in thread
From: Peter Zijlstra @ 2010-12-13 12:38 UTC (permalink / raw)
To: Wu Fengguang
Cc: Andrew Morton, Jan Kara, Christoph Hellwig, Trond Myklebust,
Dave Chinner, Theodore Ts'o, Chris Mason, Mel Gorman,
Rik van Riel, KOSAKI Motohiro, Greg Thelen, Minchan Kim, linux-mm,
linux-fsdevel@vger.kernel.org, LKML
On Mon, 2010-12-13 at 19:49 +0800, Wu Fengguang wrote:
> > Reviewing is a lot easier if the patches present logical steps. The
> > presented series will have us looking back and forth: review a patch, find
> > bugs, then scan forward to see if the bug has been solved, etc.
>
> Good suggestion. Sorry, I did plan to fold them at some later time.
> I'll do a new version that folds patches 16-25. Patches 26-31 will be
> retained since they are logical enhancements that do not involve
> back-and-forth changes. Patch 12 will be removed as it does not seem
> absolutely necessary -- let users configure whatever they feel is OK,
> even if it means the throttling algorithms will work under somewhat
> suboptimal conditions.
>
Thanks, much appreciated! I'll await this new series.
Thread overview: 51+ messages
2010-12-13 6:42 [PATCH 00/47] IO-less dirty throttling v3 Wu Fengguang
2010-12-13 6:42 ` [PATCH 01/47] writeback: enabling gate limit for light dirtied bdi Wu Fengguang
2010-12-13 6:42 ` [PATCH 02/47] writeback: safety margin for bdi stat error Wu Fengguang
2010-12-13 6:42 ` [PATCH 03/47] writeback: IO-less balance_dirty_pages() Wu Fengguang
2010-12-13 6:42 ` [PATCH 04/47] writeback: consolidate variable names in balance_dirty_pages() Wu Fengguang
2010-12-13 6:42 ` [PATCH 05/47] writeback: per-task rate limit on balance_dirty_pages() Wu Fengguang
2010-12-13 6:42 ` [PATCH 06/47] writeback: prevent duplicate balance_dirty_pages_ratelimited() calls Wu Fengguang
2010-12-13 6:42 ` [PATCH 07/47] writeback: account per-bdi accumulated written pages Wu Fengguang
2010-12-13 6:42 ` [PATCH 08/47] writeback: bdi write bandwidth estimation Wu Fengguang
2010-12-13 6:42 ` [PATCH 09/47] writeback: show bdi write bandwidth in debugfs Wu Fengguang
2010-12-13 6:42 ` [PATCH 10/47] writeback: quit throttling when bdi dirty pages dropped low Wu Fengguang
2010-12-13 6:43 ` [PATCH 11/47] writeback: reduce per-bdi dirty threshold ramp up time Wu Fengguang
2010-12-13 6:43 ` [PATCH 12/47] writeback: make reasonable gap between the dirty/background thresholds Wu Fengguang
2010-12-13 6:43 ` [PATCH 13/47] writeback: scale down max throttle bandwidth on concurrent dirtiers Wu Fengguang
2010-12-13 6:43 ` [PATCH 14/47] writeback: add trace event for balance_dirty_pages() Wu Fengguang
2010-12-13 6:43 ` [PATCH 15/47] writeback: make nr_to_write a per-file limit Wu Fengguang
2010-12-13 6:43 ` [PATCH 16/47] writeback: make-nr_to_write-a-per-file-limit fix Wu Fengguang
2010-12-13 6:43 ` [PATCH 17/47] writeback: do uninterruptible sleep in balance_dirty_pages() Wu Fengguang
2010-12-13 6:43 ` [PATCH 18/47] writeback: move BDI_WRITTEN accounting into __bdi_writeout_inc() Wu Fengguang
2010-12-13 6:43 ` [PATCH 19/47] writeback: fix increasement of nr_dirtied_pause Wu Fengguang
2010-12-13 6:43 ` [PATCH 20/47] writeback: use do_div in bw calculation Wu Fengguang
2010-12-13 6:43 ` [PATCH 21/47] writeback: prevent divide error on tiny HZ Wu Fengguang
2010-12-13 6:43 ` [PATCH 22/47] writeback: prevent bandwidth calculation overflow Wu Fengguang
2010-12-13 6:43 ` [PATCH 23/47] writeback: spinlock protected bdi bandwidth update Wu Fengguang
2010-12-13 6:43 ` [PATCH 24/47] writeback: increase pause time on concurrent dirtiers Wu Fengguang
2010-12-13 6:43 ` [PATCH 25/47] writeback: make it easier to break from a dirty exceeded bdi Wu Fengguang
2010-12-13 6:43 ` [PATCH 26/47] writeback: start background writeback earlier Wu Fengguang
2010-12-13 6:43 ` [PATCH 27/47] writeback: user space think time compensation Wu Fengguang
2010-12-13 6:43 ` [PATCH 28/47] writeback: bdi base throttle bandwidth Wu Fengguang
2010-12-13 6:43 ` [PATCH 29/47] writeback: smoothed bdi dirty pages Wu Fengguang
2010-12-13 6:43 ` [PATCH 30/47] writeback: adapt max balance pause time to memory size Wu Fengguang
2010-12-13 6:43 ` [PATCH 31/47] writeback: increase min pause time on concurrent dirtiers Wu Fengguang
2010-12-13 6:43 ` [PATCH 32/47] writeback: extend balance_dirty_pages() trace event Wu Fengguang
2010-12-13 6:43 ` [PATCH 33/47] writeback: trace global dirty page states Wu Fengguang
2010-12-13 6:43 ` [PATCH 34/47] writeback: trace writeback_single_inode() Wu Fengguang
2010-12-13 6:43 ` [PATCH 35/47] writeback: scale IO chunk size up to device bandwidth Wu Fengguang
2010-12-13 6:43 ` [PATCH 36/47] btrfs: dont call balance_dirty_pages_ratelimited() on already dirty pages Wu Fengguang
2010-12-13 6:43 ` [PATCH 37/47] btrfs: lower the dirty balacing rate limit Wu Fengguang
2010-12-13 6:43 ` [PATCH 38/47] btrfs: wait on too many nr_async_bios Wu Fengguang
2010-12-13 6:43 ` [PATCH 39/47] nfs: livelock prevention is now done in VFS Wu Fengguang
2010-12-13 6:43 ` [PATCH 40/47] nfs: writeback pages wait queue Wu Fengguang
2010-12-13 6:43 ` [PATCH 41/47] nfs: in-commit pages accounting and " Wu Fengguang
2010-12-13 6:43 ` [PATCH 42/47] nfs: heuristics to avoid commit Wu Fengguang
2010-12-13 6:43 ` [PATCH 43/47] nfs: dont change wbc->nr_to_write in write_inode() Wu Fengguang
2010-12-13 6:43 ` [PATCH 44/47] nfs: limit the range of commits Wu Fengguang
2010-12-13 6:43 ` [PATCH 45/47] nfs: adapt congestion threshold to dirty threshold Wu Fengguang
2010-12-13 6:43 ` [PATCH 46/47] nfs: trace nfs_commit_unstable_pages() Wu Fengguang
2010-12-13 6:43 ` [PATCH 47/47] nfs: trace nfs_commit_release() Wu Fengguang
2010-12-13 11:27 ` [PATCH 00/47] IO-less dirty throttling v3 Peter Zijlstra
2010-12-13 11:49 ` Wu Fengguang
2010-12-13 12:38 ` Peter Zijlstra