From: Wu Fengguang <fengguang.wu@intel.com>
To: Andrew Morton <akpm@linux-foundation.org>
Cc: "linux-fsdevel@vger.kernel.org" <linux-fsdevel@vger.kernel.org>,
Jan Kara <jack@suse.cz>, Dave Chinner <david@fromorbit.com>,
Christoph Hellwig <hch@infradead.org>,
LKML <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH 4/7] writeback: introduce max-pause and pass-good dirty limits
Date: Thu, 23 Jun 2011 21:18:01 +0800 [thread overview]
Message-ID: <20110623131801.GA11665@localhost> (raw)
In-Reply-To: <20110621172043.51621aa9.akpm@linux-foundation.org>
Andrew,
On Wed, Jun 22, 2011 at 08:20:43AM +0800, Andrew Morton wrote:
> On Sun, 19 Jun 2011 23:01:12 +0800
> Wu Fengguang <fengguang.wu@intel.com> wrote:
>
> > The max-pause limit helps to keep the sleep time inside
> > balance_dirty_pages() within 200ms. The 200ms max sleep means per task
> > rate limit of 8pages/200ms=160KB/s, which normally is enough to stop
> > dirtiers from continue pushing the dirty pages high, unless there are
> > a sufficient large number of slow dirtiers (ie. 500 tasks doing 160KB/s
> > will still sum up to 80MB/s, reaching the write bandwidth of a slow disk).
> >
> > The pass-good limit helps to let go of the good bdi's in the presence of
> > a blocked bdi (ie. NFS server not responding) or slow USB disk which for
> > some reason build up a large number of initial dirty pages that refuse
> > to go away anytime soon.
>
> The hard-wired numbers and hard-wired assumptions about device speeds
> shouldn't be here at all. They will be sub-optimal (and sometimes
> extremely so) for all cases. They will become wronger over time. Or
> less wrong, depending upon which way they were originally wrong.
Yeah the changelog contains many specific numbers as examples..
The 200ms is mainly based on the maximum acceptable human perceptible
delays. It may be too large in some cases, so the value will be
decreased adaptively in a patch to be submitted later.
The DIRTY_MAXPAUSE and DIRTY_PASSGOOD are to get proportional values
to dirty_thresh, so is reasonably taken as adaptive.
> > + dirty_thresh = hard_dirty_limit(dirty_thresh);
> > + if (nr_dirty < dirty_thresh + dirty_thresh / DIRTY_MAXPAUSE &&
> > + jiffies - start_time > MAX_PAUSE)
> > + break;
> > + if (nr_dirty < dirty_thresh + dirty_thresh / DIRTY_PASSGOOD &&
> > + bdi_dirty < bdi_thresh)
> > + break;
>
> It appears that despite their similarity, DIRTY_MAXPAUSE is a
> dimensionless value whereas the units of MAX_PAUSE is jiffies. Perhaps
> more care in naming choices would clarify things like this.
OK. I'll add _AREA suffixes to the DIRTY_MAXPAUSE macros to tell the
difference.
> The first comparison might be clearer if it used time_after().
Yeah, changed to time_after().
> Both statements need comments explaining what they do and *why they
Here is the patch with more comments :)
Thanks,
Fengguang
---
Subject: writeback: introduce max-pause and pass-good dirty limits
Date: Sun Jun 19 22:18:42 CST 2011
The max-pause limit helps to keep the sleep time inside
balance_dirty_pages() within MAX_PAUSE=200ms. The 200ms max sleep means
per task rate limit of 8pages/200ms=160KB/s when dirty exceeded, which
normally is enough to stop dirtiers from continue pushing the dirty
pages high, unless there are a sufficient large number of slow dirtiers
(eg. 500 tasks doing 160KB/s will still sum up to 80MB/s, exceeding the
write bandwidth of a slow disk and hence accumulating more and more dirty
pages).
The pass-good limit helps to let go of the good bdi's in the presence of
a blocked bdi (ie. NFS server not responding) or slow USB disk which for
some reason build up a large number of initial dirty pages that refuse
to go away anytime soon.
For example, given two bdi's A and B and the initial state
bdi_thresh_A = dirty_thresh / 2
bdi_thresh_B = dirty_thresh / 2
bdi_dirty_A = dirty_thresh / 2
bdi_dirty_B = dirty_thresh / 2
Then A get blocked, after a dozen seconds
bdi_thresh_A = 0
bdi_thresh_B = dirty_thresh
bdi_dirty_A = dirty_thresh / 2
bdi_dirty_B = dirty_thresh / 2
The (bdi_dirty_B < bdi_thresh_B) test is now useless and the dirty pages
will be effectively throttled by condition (nr_dirty < dirty_thresh).
This has two problems:
(1) we lose the protections for light dirtiers
(2) balance_dirty_pages() effectively becomes IO-less because the
(bdi_nr_reclaimable > bdi_thresh) test won't be true. This is good
for IO, but balance_dirty_pages() loses an important way to break
out of the loop which leads to more spread out throttle delays.
DIRTY_PASSGOOD_AREA can eliminate the above issues. The only problem is,
DIRTY_PASSGOOD_AREA needs to be defined as 2 to fully cover the above
example while this patch uses the more conservative value 8 so as not to
surprise people with too many dirty pages than expected.
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
include/linux/writeback.h | 21 +++++++++++++++++++++
mm/page-writeback.c | 28 ++++++++++++++++++++++++++++
2 files changed, 49 insertions(+)
--- linux-next.orig/include/linux/writeback.h 2011-06-23 10:31:40.000000000 +0800
+++ linux-next/include/linux/writeback.h 2011-06-23 10:31:40.000000000 +0800
@@ -7,6 +7,27 @@
#include <linux/sched.h>
#include <linux/fs.h>
+/*
+ * The 1/16 region above the global dirty limit will be put to maximum pauses:
+ *
+ * (limit, limit + limit/DIRTY_MAXPAUSE_AREA)
+ *
+ * The 1/16 region above the max-pause region, dirty exceeded bdi's will be put
+ * to loops:
+ *
+ * (limit + limit/DIRTY_MAXPAUSE_AREA, limit + limit/DIRTY_PASSGOOD_AREA)
+ *
+ * Further beyond, all dirtier tasks will enter a loop waiting (possibly long
+ * time) for the dirty pages to drop, unless written enough pages.
+ *
+ * The global dirty threshold is normally equal to the global dirty limit,
+ * except when the system suddenly allocates a lot of anonymous memory and
+ * knocks down the global dirty threshold quickly, in which case the global
+ * dirty limit will follow down slowly to prevent livelocking all dirtier tasks.
+ */
+#define DIRTY_MAXPAUSE_AREA 16
+#define DIRTY_PASSGOOD_AREA 8
+
struct backing_dev_info;
/*
--- linux-next.orig/mm/page-writeback.c 2011-06-23 10:31:40.000000000 +0800
+++ linux-next/mm/page-writeback.c 2011-06-23 10:59:47.000000000 +0800
@@ -399,6 +399,11 @@ unsigned long determine_dirtyable_memory
return x + 1; /* Ensure that we never return 0 */
}
+static unsigned long hard_dirty_limit(unsigned long thresh)
+{
+ return max(thresh, global_dirty_limit);
+}
+
/*
* global_dirty_limits - background-writeback and dirty-throttling thresholds
*
@@ -716,6 +721,29 @@ static void balance_dirty_pages(struct a
io_schedule_timeout(pause);
trace_balance_dirty_wait(bdi);
+ dirty_thresh = hard_dirty_limit(dirty_thresh);
+ /*
+ * max-pause area. If dirty exceeded but still within this
+ * area, no need to sleep for more than 200ms: (a) 8 pages per
+ * 200ms is typically more than enough to curb heavy dirtiers;
+ * (b) the pause time limit makes the dirtiers more responsive.
+ */
+ if (nr_dirty < dirty_thresh +
+ dirty_thresh / DIRTY_MAXPAUSE_AREA &&
+ time_after(jiffies, start_time + MAX_PAUSE))
+ break;
+ /*
+ * pass-good area. When some bdi gets blocked (eg. NFS server
+ * not responding), or write bandwidth dropped dramatically due
+ * to concurrent reads, or dirty threshold suddenly dropped and
+ * the dirty pages cannot be brought down anytime soon (eg. on
+ * slow USB stick), at least let go of the good bdi's.
+ */
+ if (nr_dirty < dirty_thresh +
+ dirty_thresh / DIRTY_PASSGOOD_AREA &&
+ bdi_dirty < bdi_thresh)
+ break;
+
/*
* Increase the delay for each loop, up to our previous
* default of taking a 100ms nap.
next prev parent reply other threads:[~2011-06-23 13:18 UTC|newest]
Thread overview: 29+ messages / expand[flat|nested] mbox.gz Atom feed top
2011-06-19 15:01 [PATCH 0/7] more writeback patches for 3.1 Wu Fengguang
2011-06-19 15:01 ` [PATCH 1/7] writeback: consolidate variable names in balance_dirty_pages() Wu Fengguang
2011-06-20 7:45 ` Christoph Hellwig
2011-06-19 15:01 ` [PATCH 2/7] writeback: add parameters to __bdi_update_bandwidth() Wu Fengguang
2011-06-19 15:31 ` Christoph Hellwig
2011-06-19 15:35 ` Wu Fengguang
2011-06-19 15:01 ` [PATCH 3/7] writeback: introduce smoothed global dirty limit Wu Fengguang
2011-06-19 15:36 ` Christoph Hellwig
2011-06-19 15:55 ` Wu Fengguang
2011-06-21 23:59 ` Andrew Morton
2011-06-22 14:11 ` Wu Fengguang
2011-06-20 21:18 ` Jan Kara
2011-06-21 14:24 ` Wu Fengguang
2011-06-22 0:04 ` Andrew Morton
2011-06-22 14:24 ` Wu Fengguang
2011-06-19 15:01 ` [PATCH 4/7] writeback: introduce max-pause and pass-good dirty limits Wu Fengguang
2011-06-22 0:20 ` Andrew Morton
2011-06-23 13:18 ` Wu Fengguang [this message]
2011-06-19 15:01 ` [PATCH 5/7] writeback: make writeback_control.nr_to_write straight Wu Fengguang
2011-06-19 15:35 ` Christoph Hellwig
2011-06-19 16:14 ` Wu Fengguang
2011-06-19 15:01 ` [PATCH 6/7] writeback: scale IO chunk size up to half device bandwidth Wu Fengguang
2011-06-19 15:01 ` [PATCH 7/7] writeback: timestamp based bdi dirty_exceeded state Wu Fengguang
2011-06-20 20:09 ` Christoph Hellwig
2011-06-21 10:00 ` Steven Whitehouse
2011-06-20 21:38 ` Jan Kara
2011-06-21 15:07 ` Wu Fengguang
2011-06-21 21:14 ` Jan Kara
2011-06-22 14:37 ` Wu Fengguang
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20110623131801.GA11665@localhost \
--to=fengguang.wu@intel.com \
--cc=akpm@linux-foundation.org \
--cc=david@fromorbit.com \
--cc=hch@infradead.org \
--cc=jack@suse.cz \
--cc=linux-fsdevel@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).