From: Wu Fengguang <fengguang.wu@intel.com>
To: <linux-fsdevel@vger.kernel.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>,
Wu Fengguang <fengguang.wu@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Andrea Righi <arighi@develer.com>
Cc: linux-mm <linux-mm@kvack.org>
Cc: LKML <linux-kernel@vger.kernel.org>
Subject: [PATCH 03/18] writeback: dirty rate control
Date: Sun, 04 Sep 2011 09:53:08 +0800
Message-ID: <20110904020914.980576896@intel.com>
In-Reply-To: <20110904015305.367445271@intel.com>
It's all about bdi->dirty_ratelimit, which aims to be (write_bw / N)
when there are N dd tasks.
On write() syscall, use bdi->dirty_ratelimit
============================================
    balance_dirty_pages(pages_dirtied)
    {
        task_ratelimit = bdi->dirty_ratelimit * bdi_position_ratio();
        pause = pages_dirtied / task_ratelimit;
        sleep(pause);
    }
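
As a rough illustration of the pause computation above, here is a minimal
user-space sketch. It is not the kernel code: dirty_pause_ms() and
POS_RATIO_SHIFT are hypothetical names, rates are assumed to be in pages per
second, and pos_ratio is assumed to be a fixed-point value where 1024 means 1.0.

```c
#include <stdint.h>

/* Hypothetical fixed-point scale for pos_ratio: 1 << 10 represents 1.0. */
#define POS_RATIO_SHIFT 10

/*
 * Pause, in milliseconds, for a task that has just dirtied pages_dirtied
 * pages. dirty_ratelimit is in pages per second (an assumption of this
 * sketch); pos_ratio is scaled by 1 << POS_RATIO_SHIFT.
 */
static unsigned long dirty_pause_ms(unsigned long pages_dirtied,
				    unsigned long dirty_ratelimit,
				    unsigned long pos_ratio)
{
	uint64_t task_ratelimit;

	/* task_ratelimit = dirty_ratelimit * pos_ratio, in fixed point */
	task_ratelimit = ((uint64_t)dirty_ratelimit * pos_ratio) >> POS_RATIO_SHIFT;
	if (task_ratelimit == 0)
		task_ratelimit = 1;	/* guard against division by zero */

	/* pause = pages_dirtied / task_ratelimit, converted to ms */
	return (unsigned long)((uint64_t)pages_dirtied * 1000 / task_ratelimit);
}
```

With a base rate of 3200 pages/s and pos_ratio at 1.0, a task that dirtied 32
pages sleeps for 10ms; halving pos_ratio doubles the pause.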
Every 200ms, update bdi->dirty_ratelimit
========================================
    bdi_update_dirty_ratelimit()
    {
        task_ratelimit = bdi->dirty_ratelimit * bdi_position_ratio();
        balanced_dirty_ratelimit = task_ratelimit * write_bw / dirty_rate;
        bdi->dirty_ratelimit = balanced_dirty_ratelimit;
    }
Estimation of balanced bdi->dirty_ratelimit
===========================================
balanced task_ratelimit
-----------------------
balance_dirty_pages() needs to throttle tasks dirtying pages such that
the total amount of dirty pages stays below the specified dirty limit in
order to avoid memory deadlocks. Furthermore we desire fairness in that
tasks get throttled proportionally to the amount of pages they dirty.

IOW we want to throttle tasks such that we match the dirty rate to the
writeout bandwidth; this yields a stable amount of dirty pages:

    dirty_rate == write_bw                                          (1)

The fairness requirement gives us:

    task_ratelimit = balanced_dirty_ratelimit
                  == write_bw / N                                   (2)

where N is the number of dd tasks. We don't know N beforehand, but
can still estimate balanced_dirty_ratelimit within 200ms.

Start by throttling each dd task at rate

    task_ratelimit = task_ratelimit_0                               (3)
                     (any non-zero initial value is OK)

After 200ms, we measured

    dirty_rate = # of pages dirtied by all dd's / 200ms
    write_bw   = # of pages written to the disk / 200ms

For the aggressive dd dirtiers, the equality holds

    dirty_rate == N * task_rate
               == N * task_ratelimit_0                              (4)
Or
    task_ratelimit_0 == dirty_rate / N                              (5)

Now we conclude that the balanced task ratelimit can be estimated by

                                                   write_bw
    balanced_dirty_ratelimit = task_ratelimit_0 * ----------        (6)
                                                  dirty_rate

This holds because substituting (5) into (6) gives the fair share (2),
which in turn yields the desired equality (1):

                                                    write_bw
    balanced_dirty_ratelimit == (dirty_rate / N) * ----------
                                                   dirty_rate
                             == write_bw / N

Then using the balanced task ratelimit we can compute task pause times like:

    task_pause = task->nr_dirtied / task_ratelimit
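
The estimate in (6) can be checked with a small worked example. The sketch
below is a user-space illustration only; measured_dirty_rate() and
balanced_ratelimit() are hypothetical helper names, and the dirtiers are
assumed to be N identical, fully throttled dd tasks so that (4) holds exactly.

```c
#include <stdint.h>

/* Equation (4): N aggressive dirtiers, each running at task_ratelimit_0. */
static unsigned long measured_dirty_rate(unsigned long n,
					 unsigned long task_ratelimit_0)
{
	return n * task_ratelimit_0;
}

/*
 * Equation (6): scale the rate that was applied over the last period by
 * write_bw / dirty_rate. The "| 1" keeps the divisor non-zero.
 */
static unsigned long balanced_ratelimit(unsigned long task_ratelimit_0,
					unsigned long write_bw,
					unsigned long dirty_rate)
{
	return (unsigned long)((uint64_t)task_ratelimit_0 * write_bw /
			       (dirty_rate | 1));
}
```

For N = 5 tasks started at an arbitrary 1001 pages/s against a disk writing
25000 pages/s, the measured dirty_rate is 5005 and a single update gives
1001 * 25000 / 5005 = 5000, which is write_bw / N. Note that N itself never
appears in the computation.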
task_ratelimit with position control
------------------------------------
However, while the above gives us means of matching the dirty rate to
the writeout bandwidth, it at best provides us with a stable dirty page
count (assuming a static system). In order to control the dirty page
count such that it is high enough to provide performance, but does not
exceed the specified limit, we need another control.

The dirty position control works by extending (2) to

    task_ratelimit = balanced_dirty_ratelimit * pos_ratio           (7)

where pos_ratio is a negative feedback function that is subject to

1) f(setpoint) = 1.0
2) df/dx < 0

That is, if the dirty pages are ABOVE the setpoint, we throttle each
task a bit more HEAVILY than balanced_dirty_ratelimit, so that the dirty
pages are created less fast than they are cleaned, and thus DROP to the
setpoint (and the reverse).
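
To make the two constraints on pos_ratio concrete, here is one function that
satisfies them. It is a deliberately simple linear ramp chosen for
illustration and is not claimed to be the curve that bdi_position_ratio()
implements; linear_pos_ratio(), POS_RATIO_SHIFT and the limit parameter are
assumptions of this sketch.

```c
#include <stdint.h>

/* Hypothetical fixed-point scale: POS_RATIO_ONE represents 1.0. */
#define POS_RATIO_SHIFT	10
#define POS_RATIO_ONE	((int64_t)1 << POS_RATIO_SHIFT)

/*
 * Linear negative feedback: 1.0 at the setpoint, falling to 0 at the limit
 * and rising above 1.0 below the setpoint. All arguments are page counts.
 */
static unsigned long linear_pos_ratio(unsigned long dirty,
				      unsigned long setpoint,
				      unsigned long limit)
{
	int64_t span = (int64_t)limit - (int64_t)setpoint;
	int64_t ratio;

	if (span <= 0)
		return (unsigned long)POS_RATIO_ONE;	/* degenerate input */

	/* f(x) = 1 + (setpoint - x) / (limit - setpoint), so df/dx < 0 */
	ratio = POS_RATIO_ONE +
		((int64_t)setpoint - (int64_t)dirty) * POS_RATIO_ONE / span;
	if (ratio < 0)
		ratio = 0;	/* at or beyond the limit: stop dirtying */

	return (unsigned long)ratio;
}
```

With setpoint = 1000 and limit = 2000 pages, the ratio is 1.0 (1024) at 1000
dirty pages, 0.5 (512) at 1500, and 1.5 (1536) at 500.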
Based on (7) and the assumption that both dirty_ratelimit and pos_ratio
remain CONSTANT for the past 200ms, we get

    task_ratelimit_0 = balanced_dirty_ratelimit * pos_ratio         (8)

Putting (8) into (6), we get the formula used in
bdi_update_dirty_ratelimit():

                                             write_bw
    balanced_dirty_ratelimit *= pos_ratio * ----------              (9)
                                            dirty_rate
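
One way to see why pos_ratio has to appear in (9) is to compare it with a
pure rate feedback that omits it. The sketch below is a user-space
illustration with hypothetical function names; it assumes N identical dd
tasks that always dirty at exactly their throttle rate, and it holds
pos_ratio fixed for one update period.

```c
/* Pure rate feedback: rate * write_bw / dirty_rate, ignoring pos_ratio. */
static double next_rate_pure(double rate, double pos_ratio,
			     double write_bw, int n)
{
	/* each task was throttled at rate * pos_ratio during the period */
	double dirty_rate = n * rate * pos_ratio;

	return rate * write_bw / dirty_rate;
}

/* Formula (9): the same update with pos_ratio folded in. */
static double next_rate_with_pos(double rate, double pos_ratio,
				 double write_bw, int n)
{
	double dirty_rate = n * rate * pos_ratio;

	return rate * pos_ratio * write_bw / dirty_rate;
}
```

Take N = 4, write_bw = 1000 pages/s, pos_ratio = 0.5 and a base rate of 500,
which is twice the fair share. The tasks dirty at 4 * 500 * 0.5 = 1000, equal
to write_bw, so the pure feedback returns 500 again and never corrects the
base rate. Formula (9) returns 250 = write_bw / N in a single step.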
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/backing-dev.h |    7 ++
 mm/backing-dev.c            |    1
 mm/page-writeback.c         |   82 +++++++++++++++++++++++++++++++++-
 3 files changed, 88 insertions(+), 2 deletions(-)

--- linux-next.orig/include/linux/backing-dev.h 2011-08-26 13:53:40.000000000 +0800
+++ linux-next/include/linux/backing-dev.h 2011-08-26 13:54:13.000000000 +0800
@@ -75,10 +75,17 @@ struct backing_dev_info {
 	struct percpu_counter bdi_stat[NR_BDI_STAT_ITEMS];
 	unsigned long bw_time_stamp;	/* last time write bw is updated */
+	unsigned long dirtied_stamp;
 	unsigned long written_stamp;	/* pages written at bw_time_stamp */
 	unsigned long write_bandwidth;	/* the estimated write bandwidth */
 	unsigned long avg_write_bandwidth; /* further smoothed write bw */
+	/*
+	 * The base dirty throttle rate, re-calculated on every 200ms.
+	 * All the bdi tasks' dirty rate will be curbed under it.
+	 */
+	unsigned long dirty_ratelimit;
+
 	struct prop_local_percpu completions;
 	int dirty_exceeded;
--- linux-next.orig/mm/backing-dev.c 2011-08-26 13:53:40.000000000 +0800
+++ linux-next/mm/backing-dev.c 2011-08-26 13:54:13.000000000 +0800
@@ -670,6 +670,7 @@ int bdi_init(struct backing_dev_info *bd
 	bdi->bw_time_stamp = jiffies;
 	bdi->written_stamp = 0;
+	bdi->dirty_ratelimit = INIT_BW;
 	bdi->write_bandwidth = INIT_BW;
 	bdi->avg_write_bandwidth = INIT_BW;
--- linux-next.orig/mm/page-writeback.c 2011-08-26 13:52:42.000000000 +0800
+++ linux-next/mm/page-writeback.c 2011-08-26 15:52:42.000000000 +0800
@@ -787,6 +787,78 @@ static void global_update_bandwidth(unsi
 	spin_unlock(&dirty_lock);
 }
+/*
+ * Maintain bdi->dirty_ratelimit, the base dirty throttle rate.
+ *
+ * Normal bdi tasks will be curbed at or below it in the long term.
+ * Obviously it should be around (write_bw / N) when there are N dd tasks.
+ */
+static void bdi_update_dirty_ratelimit(struct backing_dev_info *bdi,
+				       unsigned long thresh,
+				       unsigned long bg_thresh,
+				       unsigned long dirty,
+				       unsigned long bdi_thresh,
+				       unsigned long bdi_dirty,
+				       unsigned long dirtied,
+				       unsigned long elapsed)
+{
+	unsigned long write_bw = bdi->avg_write_bandwidth;
+	unsigned long dirty_ratelimit = bdi->dirty_ratelimit;
+	unsigned long dirty_rate;
+	unsigned long task_ratelimit;
+	unsigned long balanced_dirty_ratelimit;
+	unsigned long pos_ratio;
+
+	/*
+	 * The dirty rate will match the writeout rate in the long term,
+	 * except when dirty pages are truncated by userspace or re-dirtied
+	 * by FS.
+	 */
+	dirty_rate = (dirtied - bdi->dirtied_stamp) * HZ / elapsed;
+
+	pos_ratio = bdi_position_ratio(bdi, thresh, bg_thresh, dirty,
+				       bdi_thresh, bdi_dirty);
+	/*
+	 * task_ratelimit reflects each dd's dirty rate for the past 200ms.
+	 */
+	task_ratelimit = (u64)dirty_ratelimit *
+					pos_ratio >> RATELIMIT_CALC_SHIFT;
+
+	/*
+	 * A linear estimation of the "balanced" throttle rate. The theory is,
+	 * if there are N dd tasks, each throttled at task_ratelimit, the bdi's
+	 * dirty_rate will be measured to be (N * task_ratelimit). So the below
+	 * formula will yield the balanced rate limit (write_bw / N).
+	 *
+	 * Note that the expanded form is not a pure rate feedback:
+	 *	rate_(i+1) = rate_(i) * (write_bw / dirty_rate)		     (1)
+	 * but also takes pos_ratio into account:
+	 *	rate_(i+1) = rate_(i) * (write_bw / dirty_rate) * pos_ratio  (2)
+	 *
+	 * (1) is not realistic because pos_ratio also takes part in balancing
+	 * the dirty rate.  Consider the state
+	 *	pos_ratio = 0.5						     (3)
+	 *	rate = 2 * (write_bw / N)				     (4)
+	 * If (1) is used, it will get stuck in that state! Because each dd
+	 * will be throttled at
+	 *	task_ratelimit = pos_ratio * rate = (write_bw / N)	     (5)
+	 * yielding
+	 *	dirty_rate = N * task_ratelimit = write_bw		     (6)
+	 * Putting (6) into (1), we get
+	 *	rate_(i+1) = rate_(i)					     (7)
+	 *
+	 * So we end up using (2) to always keep
+	 *	rate_(i+1) ~= (write_bw / N)				     (8)
+	 * regardless of the value of pos_ratio. As long as (8) is satisfied,
+	 * pos_ratio is able to drive itself to 1.0, which is not only where
+	 * the dirty count meets the setpoint, but also where the slope of
+	 * pos_ratio is flattest and hence task_ratelimit fluctuates least.
+	 */
+	balanced_dirty_ratelimit = div_u64((u64)task_ratelimit * write_bw,
+					   dirty_rate | 1);
+
+	bdi->dirty_ratelimit = max(balanced_dirty_ratelimit, 1UL);
+}
+
 void __bdi_update_bandwidth(struct backing_dev_info *bdi,
 			    unsigned long thresh,
 			    unsigned long bg_thresh,
@@ -797,6 +869,7 @@ void __bdi_update_bandwidth(struct backi
 {
 	unsigned long now = jiffies;
 	unsigned long elapsed = now - bdi->bw_time_stamp;
+	unsigned long dirtied;
 	unsigned long written;
 	/*
@@ -805,6 +878,7 @@ void __bdi_update_bandwidth(struct backi
 	if (elapsed < BANDWIDTH_INTERVAL)
 		return;
+	dirtied = percpu_counter_read(&bdi->bdi_stat[BDI_DIRTIED]);
 	written = percpu_counter_read(&bdi->bdi_stat[BDI_WRITTEN]);
 	/*
@@ -814,12 +888,16 @@ void __bdi_update_bandwidth(struct backi
 	if (elapsed > HZ && time_before(bdi->bw_time_stamp, start_time))
 		goto snapshot;
-	if (thresh)
+	if (thresh) {
 		global_update_bandwidth(thresh, dirty, now);
-
+		bdi_update_dirty_ratelimit(bdi, thresh, bg_thresh, dirty,
+					   bdi_thresh, bdi_dirty,
+					   dirtied, elapsed);
+	}
 	bdi_update_write_bandwidth(bdi, elapsed, written);
 snapshot:
+	bdi->dirtied_stamp = dirtied;
 	bdi->written_stamp = written;
 	bdi->bw_time_stamp = now;
 }