[PATCH 07/35] writeback: per-task rate limit on balance_dirty_pages()

linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

From: Wu Fengguang <fengguang.wu@intel.com>
To: Andrew Morton <akpm@linux-foundation.org>
Cc: Jan Kara <jack@suse.cz>, Wu Fengguang <fengguang.wu@intel.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Rik van Riel <riel@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: linux-mm <linux-mm@kvack.org>
Cc: <linux-fsdevel@vger.kernel.org>
Cc: LKML <linux-kernel@vger.kernel.org>
Subject: [PATCH 07/35] writeback: per-task rate limit on balance_dirty_pages()
Date: Mon, 13 Dec 2010 22:46:53 +0800	[thread overview]
Message-ID: <20101213150327.215545584@intel.com> (raw)
In-Reply-To: 20101213144646.341970461@intel.com

[-- Attachment #1: writeback-per-task-dirty-count.patch --]
[-- Type: text/plain, Size: 9352 bytes --]

Try to limit the dirty throttle pause time in range [1 jiffy, 100 ms],
by controlling how many pages can be dirtied before inserting a pause.

The dirty count will be directly billed to the task struct. Slow start
and quick back off is employed, so that the stable range will be biased
towards less than 50ms. Another intention is for fine timing control of
slow devices, which may need to do full 100ms pauses for every 1 page.

The switch from per-cpu to per-task rate limit makes it easier to exceed
the global dirty limit with a fork bomb, where each new task dirties 1 page,
sleep 10m and continue to dirty 1000 more pages. The caveat is, when it
dirties the first page, it may be honoured a high nr_dirtied_pause
because nr_dirty is still low at that time. In this way lots of tasks
get the free tickets to dirty more pages than allowed. The solution is
to disable rate limiting (ie. to ignore nr_dirtied_pause) totally once
the bdi becomes dirty exceeded.

Note that some filesystems will dirty a batch of pages before calling
balance_dirty_pages_ratelimited_nr(). They saves a little CPU overheads
at the cost of possibly overrunning the dirty limits a bit and/or in the
case of very slow devices, pause the application for much more than
100ms at a time. This is a trade-off, and seems reasonable optimization
as long as the batch size is controlled within a dozen pages.

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 include/linux/sched.h |    7 ++
 mm/memory_hotplug.c   |    3 
 mm/page-writeback.c   |  126 ++++++++++++++++++----------------------
 3 files changed, 65 insertions(+), 71 deletions(-)

--- linux-next.orig/include/linux/sched.h	2010-12-13 21:45:57.000000000 +0800
+++ linux-next/include/linux/sched.h	2010-12-13 21:46:13.000000000 +0800
@@ -1471,6 +1471,13 @@ struct task_struct {
 	int make_it_fail;
 #endif
 	struct prop_local_single dirties;
+	/*
+	 * when (nr_dirtied >= nr_dirtied_pause), it's time to call
+	 * balance_dirty_pages() for some dirty throttling pause
+	 */
+	int nr_dirtied;
+	int nr_dirtied_pause;
+
 #ifdef CONFIG_LATENCYTOP
 	int latency_record_count;
 	struct latency_record latency_record[LT_SAVECOUNT];
--- linux-next.orig/mm/page-writeback.c	2010-12-13 21:46:12.000000000 +0800
+++ linux-next/mm/page-writeback.c	2010-12-13 21:46:13.000000000 +0800
@@ -37,12 +37,6 @@
 #include <trace/events/writeback.h>
 
 /*
- * After a CPU has dirtied this many pages, balance_dirty_pages_ratelimited
- * will look to see if it needs to force writeback or throttling.
- */
-static long ratelimit_pages = 32;
-
-/*
  * Don't sleep more than 200ms at a time in balance_dirty_pages().
  */
 #define MAX_PAUSE	max(HZ/5, 1)
@@ -493,6 +487,40 @@ unsigned long bdi_dirty_limit(struct bac
 }
 
 /*
+ * After a task dirtied this many pages, balance_dirty_pages_ratelimited_nr()
+ * will look to see if it needs to start dirty throttling.
+ *
+ * If ratelimit_pages is too low then big NUMA machines will call the expensive
+ * global_page_state() too often. So scale it adaptively to the safety margin
+ * (the number of pages we may dirty without exceeding the dirty limits).
+ */
+static unsigned long ratelimit_pages(struct backing_dev_info *bdi)
+{
+	unsigned long background_thresh;
+	unsigned long dirty_thresh;
+	unsigned long dirty_pages;
+
+	global_dirty_limits(&background_thresh, &dirty_thresh);
+	dirty_pages = global_page_state(NR_FILE_DIRTY) +
+		      global_page_state(NR_WRITEBACK) +
+		      global_page_state(NR_UNSTABLE_NFS);
+
+	if (dirty_pages <= (dirty_thresh + background_thresh) / 2)
+		goto out;
+
+	dirty_thresh = bdi_dirty_limit(bdi, dirty_thresh, dirty_pages);
+	dirty_pages  = bdi_stat(bdi, BDI_RECLAIMABLE) +
+		       bdi_stat(bdi, BDI_WRITEBACK);
+
+	if (dirty_pages < dirty_thresh)
+		goto out;
+
+	return 1;
+out:
+	return 1 + int_sqrt(dirty_thresh - dirty_pages);
+}
+
+/*
  * balance_dirty_pages() must be called by processes which are generating dirty
  * data.  It looks at the number of dirty pages in the machine and will force
  * the caller to perform writeback if the system is over `vm_dirty_ratio'.
@@ -509,7 +537,7 @@ static void balance_dirty_pages(struct a
 	unsigned long dirty_thresh;
 	unsigned long bdi_thresh;
 	unsigned long bw;
-	unsigned long pause;
+	unsigned long pause = 0;
 	bool dirty_exceeded = false;
 	struct backing_dev_info *bdi = mapping->backing_dev_info;
 
@@ -591,6 +619,17 @@ pause:
 	if (!dirty_exceeded && bdi->dirty_exceeded)
 		bdi->dirty_exceeded = 0;
 
+	if (pause == 0 && nr_dirty < background_thresh)
+		current->nr_dirtied_pause = ratelimit_pages(bdi);
+	else if (pause == 1)
+		current->nr_dirtied_pause += current->nr_dirtied_pause / 32 + 1;
+	else if (pause >= MAX_PAUSE)
+		/*
+		 * when repeated, writing 1 page per 100ms on slow devices,
+		 * i-(i+2)/4 will be able to reach 1 but never reduce to 0.
+		 */
+		current->nr_dirtied_pause -= (current->nr_dirtied_pause+2) >> 2;
+
 	if (writeback_in_progress(bdi))
 		return;
 
@@ -617,8 +656,6 @@ void set_page_dirty_balance(struct page 
 	}
 }
 
-static DEFINE_PER_CPU(unsigned long, bdp_ratelimits) = 0;
-
 /**
  * balance_dirty_pages_ratelimited_nr - balance dirty memory state
  * @mapping: address_space which was dirtied
@@ -628,36 +665,30 @@ static DEFINE_PER_CPU(unsigned long, bdp
  * which was newly dirtied.  The function will periodically check the system's
  * dirty state and will initiate writeback if needed.
  *
- * On really big machines, get_writeback_state is expensive, so try to avoid
+ * On really big machines, global_page_state() is expensive, so try to avoid
  * calling it too often (ratelimiting).  But once we're over the dirty memory
- * limit we decrease the ratelimiting by a lot, to prevent individual processes
- * from overshooting the limit by (ratelimit_pages) each.
+ * limit we disable the ratelimiting, to prevent individual processes from
+ * overshooting the limit by (ratelimit_pages) each.
  */
 void balance_dirty_pages_ratelimited_nr(struct address_space *mapping,
 					unsigned long nr_pages_dirtied)
 {
-	unsigned long ratelimit;
-	unsigned long *p;
+	struct backing_dev_info *bdi = mapping->backing_dev_info;
+
+	current->nr_dirtied += nr_pages_dirtied;
 
-	ratelimit = ratelimit_pages;
-	if (mapping->backing_dev_info->dirty_exceeded)
-		ratelimit = 8;
+	if (unlikely(!current->nr_dirtied_pause))
+		current->nr_dirtied_pause = ratelimit_pages(bdi);
 
 	/*
 	 * Check the rate limiting. Also, we do not want to throttle real-time
 	 * tasks in balance_dirty_pages(). Period.
 	 */
-	preempt_disable();
-	p =  &__get_cpu_var(bdp_ratelimits);
-	*p += nr_pages_dirtied;
-	if (unlikely(*p >= ratelimit)) {
-		ratelimit = *p;
-		*p = 0;
-		preempt_enable();
-		balance_dirty_pages(mapping, ratelimit);
-		return;
+	if (unlikely(current->nr_dirtied >= current->nr_dirtied_pause ||
+		     bdi->dirty_exceeded)) {
+		balance_dirty_pages(mapping, current->nr_dirtied);
+		current->nr_dirtied = 0;
 	}
-	preempt_enable();
 }
 EXPORT_SYMBOL(balance_dirty_pages_ratelimited_nr);
 
@@ -745,44 +776,6 @@ void laptop_sync_completion(void)
 #endif
 
 /*
- * If ratelimit_pages is too high then we can get into dirty-data overload
- * if a large number of processes all perform writes at the same time.
- * If it is too low then SMP machines will call the (expensive)
- * get_writeback_state too often.
- *
- * Here we set ratelimit_pages to a level which ensures that when all CPUs are
- * dirtying in parallel, we cannot go more than 3% (1/32) over the dirty memory
- * thresholds before writeback cuts in.
- *
- * But the limit should not be set too high.  Because it also controls the
- * amount of memory which the balance_dirty_pages() caller has to write back.
- * If this is too large then the caller will block on the IO queue all the
- * time.  So limit it to four megabytes - the balance_dirty_pages() caller
- * will write six megabyte chunks, max.
- */
-
-void writeback_set_ratelimit(void)
-{
-	ratelimit_pages = vm_total_pages / (num_online_cpus() * 32);
-	if (ratelimit_pages < 16)
-		ratelimit_pages = 16;
-	if (ratelimit_pages * PAGE_CACHE_SIZE > 4096 * 1024)
-		ratelimit_pages = (4096 * 1024) / PAGE_CACHE_SIZE;
-}
-
-static int __cpuinit
-ratelimit_handler(struct notifier_block *self, unsigned long u, void *v)
-{
-	writeback_set_ratelimit();
-	return NOTIFY_DONE;
-}
-
-static struct notifier_block __cpuinitdata ratelimit_nb = {
-	.notifier_call	= ratelimit_handler,
-	.next		= NULL,
-};
-
-/*
  * Called early on to tune the page writeback dirty limits.
  *
  * We used to scale dirty pages according to how total memory
@@ -804,9 +797,6 @@ void __init page_writeback_init(void)
 {
 	int shift;
 
-	writeback_set_ratelimit();
-	register_cpu_notifier(&ratelimit_nb);
-
 	shift = calc_period_shift();
 	prop_descriptor_init(&vm_completions, shift);
 	prop_descriptor_init(&vm_dirties, shift);
--- linux-next.orig/mm/memory_hotplug.c	2010-12-13 21:45:57.000000000 +0800
+++ linux-next/mm/memory_hotplug.c	2010-12-13 21:46:13.000000000 +0800
@@ -446,8 +446,6 @@ int online_pages(unsigned long pfn, unsi
 
 	vm_total_pages = nr_free_pagecache_pages();
 
-	writeback_set_ratelimit();
-
 	if (onlined_pages)
 		memory_notify(MEM_ONLINE, &arg);
 
@@ -877,7 +875,6 @@ repeat:
 	}
 
 	vm_total_pages = nr_free_pagecache_pages();
-	writeback_set_ratelimit();
 
 	memory_notify(MEM_OFFLINE, &arg);
 	unlock_system_sleep();

next prev parent reply	other threads:[~2010-12-13 14:46 UTC|newest]

Thread overview: 87+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2010-12-13 14:46 [PATCH 00/35] IO-less dirty throttling v4 Wu Fengguang
2010-12-13 14:46 ` [PATCH 01/35] writeback: enabling gate limit for light dirtied bdi Wu Fengguang
2011-01-12 21:43   ` Jan Kara
2011-01-13  3:44     ` Wu Fengguang
2011-01-13  3:58       ` Wu Fengguang
2011-01-13 19:26       ` Peter Zijlstra
2011-01-14  3:21         ` Wu Fengguang
2010-12-13 14:46 ` [PATCH 02/35] writeback: safety margin for bdi stat error Wu Fengguang
2011-01-12 21:59   ` Jan Kara
2011-01-13  4:14     ` Wu Fengguang
2011-01-13 10:38       ` Jan Kara
2011-01-13 10:41         ` Wu Fengguang
2010-12-13 14:46 ` [PATCH 03/35] writeback: prevent duplicate balance_dirty_pages_ratelimited() calls Wu Fengguang
2010-12-13 14:46 ` [PATCH 04/35] writeback: reduce per-bdi dirty threshold ramp up time Wu Fengguang
2010-12-14 13:37   ` Richard Kennedy
2010-12-14 13:59     ` Wu Fengguang
2010-12-14 14:33       ` Wu Fengguang
2010-12-14 14:39         ` Wu Fengguang
2010-12-14 14:50           ` Peter Zijlstra
2010-12-14 15:15             ` Wu Fengguang
2010-12-14 15:26               ` Wu Fengguang
2010-12-14 14:56           ` Wu Fengguang
2010-12-15 18:48       ` Richard Kennedy
2010-12-17 13:07         ` Wu Fengguang
2010-12-13 14:46 ` [PATCH 05/35] writeback: IO-less balance_dirty_pages() Wu Fengguang
2010-12-13 14:46 ` [PATCH 06/35] writeback: consolidate variable names in balance_dirty_pages() Wu Fengguang
2010-12-13 14:46 ` Wu Fengguang [this message]
2010-12-13 14:46 ` [PATCH 08/35] writeback: user space think time compensation Wu Fengguang
2010-12-13 14:46 ` [PATCH 09/35] writeback: account per-bdi accumulated written pages Wu Fengguang
2010-12-13 14:46 ` [PATCH 10/35] writeback: bdi write bandwidth estimation Wu Fengguang
2010-12-13 14:46 ` [PATCH 11/35] writeback: show bdi write bandwidth in debugfs Wu Fengguang
2010-12-13 14:46 ` [PATCH 12/35] writeback: scale down max throttle bandwidth on concurrent dirtiers Wu Fengguang
2010-12-14  1:21   ` Yan, Zheng
2010-12-14  7:00     ` Wu Fengguang
2010-12-14 13:00       ` Wu Fengguang
2010-12-13 14:46 ` [PATCH 13/35] writeback: bdi base throttle bandwidth Wu Fengguang
2010-12-13 14:47 ` [PATCH 14/35] writeback: smoothed bdi dirty pages Wu Fengguang
2010-12-13 14:47 ` [PATCH 15/35] writeback: adapt max balance pause time to memory size Wu Fengguang
2010-12-13 14:47 ` [PATCH 16/35] writeback: increase min pause time on concurrent dirtiers Wu Fengguang
2010-12-13 18:23   ` Valdis.Kletnieks
2010-12-14  6:51     ` Wu Fengguang
2010-12-14 18:42       ` Valdis.Kletnieks
2010-12-14 18:55         ` Peter Zijlstra
2010-12-14 20:13           ` Valdis.Kletnieks
2010-12-14 20:24             ` Peter Zijlstra
2010-12-14 20:37               ` Valdis.Kletnieks
2010-12-13 14:47 ` [PATCH 17/35] writeback: quit throttling when bdi dirty pages dropped low Wu Fengguang
2010-12-16  5:17   ` Wu Fengguang
2010-12-13 14:47 ` [PATCH 18/35] writeback: start background writeback earlier Wu Fengguang
2010-12-16  5:37   ` Wu Fengguang
2010-12-13 14:47 ` [PATCH 19/35] writeback: make nr_to_write a per-file limit Wu Fengguang
2010-12-13 14:47 ` [PATCH 20/35] writeback: scale IO chunk size up to device bandwidth Wu Fengguang
2010-12-13 14:47 ` [PATCH 21/35] writeback: trace balance_dirty_pages() Wu Fengguang
2010-12-13 14:47 ` [PATCH 22/35] writeback: trace global dirty page states Wu Fengguang
2010-12-17  2:19   ` Wu Fengguang
2010-12-17  3:11     ` Wu Fengguang
2010-12-17  6:52     ` Hugh Dickins
2010-12-17  9:31       ` Wu Fengguang
2010-12-17 11:21       ` [PATCH] writeback: skip balance_dirty_pages() for in-memory fs Wu Fengguang
2010-12-17 14:21         ` Rik van Riel
2010-12-17 15:34         ` Minchan Kim
2010-12-17 15:42           ` Minchan Kim
2010-12-21  5:59         ` Hugh Dickins
2010-12-21  9:39           ` Wu Fengguang
2010-12-30  3:15             ` Hugh Dickins
2010-12-13 14:47 ` [PATCH 23/35] writeback: trace writeback_single_inode() Wu Fengguang
2010-12-13 14:47 ` [PATCH 24/35] btrfs: dont call balance_dirty_pages_ratelimited() on already dirty pages Wu Fengguang
2010-12-13 14:47 ` [PATCH 25/35] btrfs: lower the dirty balacing rate limit Wu Fengguang
2010-12-13 14:47 ` [PATCH 26/35] btrfs: wait on too many nr_async_bios Wu Fengguang
2010-12-13 14:47 ` [PATCH 27/35] nfs: livelock prevention is now done in VFS Wu Fengguang
2010-12-13 14:47 ` [PATCH 28/35] nfs: writeback pages wait queue Wu Fengguang
2010-12-13 14:47 ` [PATCH 29/35] nfs: in-commit pages accounting and " Wu Fengguang
2010-12-13 21:15   ` Trond Myklebust
2010-12-14 15:40     ` Wu Fengguang
2010-12-14 15:57       ` Trond Myklebust
2010-12-15 15:07         ` Wu Fengguang
2010-12-13 14:47 ` [PATCH 30/35] nfs: heuristics to avoid commit Wu Fengguang
2010-12-13 20:53   ` Trond Myklebust
2010-12-14  8:20     ` Wu Fengguang
2010-12-13 14:47 ` [PATCH 31/35] nfs: dont change wbc->nr_to_write in write_inode() Wu Fengguang
2010-12-13 21:01   ` Trond Myklebust
2010-12-14 15:53     ` Wu Fengguang
2010-12-13 14:47 ` [PATCH 32/35] nfs: limit the range of commits Wu Fengguang
2010-12-13 14:47 ` [PATCH 33/35] nfs: adapt congestion threshold to dirty threshold Wu Fengguang
2010-12-13 14:47 ` [PATCH 34/35] nfs: trace nfs_commit_unstable_pages() Wu Fengguang
2010-12-13 14:47 ` [PATCH 35/35] nfs: trace nfs_commit_release() Wu Fengguang
     [not found] ` <AANLkTinFeu7LMaDFgUcP3r2oqVHE5bei3T5JTPGBNvS9@mail.gmail.com>
2010-12-14  4:59   ` [PATCH 00/35] IO-less dirty throttling v4 Wu Fengguang

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20101213150327.215545584@intel.com \
    --to=fengguang.wu@intel.com \
    --cc=akpm@linux-foundation.org \
    --cc=jack@suse.cz \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).