From: Jan Kara <jack@suse.cz>
To: Wu Fengguang <fengguang.wu@intel.com>
Cc: linux-fsdevel@vger.kernel.org, Jan Kara <jack@suse.cz>,
Peter Zijlstra <a.p.zijlstra@chello.nl>,
Christoph Hellwig <hch@lst.de>,
Andrew Morton <akpm@linux-foundation.org>,
LKML <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH 6/7] writeback: dirty ratelimit - think time compensation
Date: Wed, 7 Dec 2011 17:14:40 +0100 [thread overview]
Message-ID: <20111207161440.GL4622@quack.suse.cz> (raw)
In-Reply-To: <20111128140513.653000555@intel.com>
On Mon 28-11-11 21:53:44, Wu Fengguang wrote:
> Compensate the task's think time when computing the final pause time,
> so that ->dirty_ratelimit can be executed accurately.
>
> think time := time spend outside of balance_dirty_pages()
>
> In the rare case that the task slept longer than the 200ms period time
> (result in negative pause time), the sleep time will be compensated in
> the following periods, too, if it's less than 1 second.
>
> Accumulated errors are carefully avoided as long as the max pause area
> is not hitted.
>
> Pseudo code:
>
> period = pages_dirtied / task_ratelimit;
> think = jiffies - dirty_paused_when;
> pause = period - think;
>
> 1) normal case: period > think
>
> pause = period - think
> dirty_paused_when = jiffies + pause
> nr_dirtied = 0
>
> period time
> |===============================>|
> think time pause time
> |===============>|==============>|
> ------|----------------|---------------|------------------------
> dirty_paused_when jiffies
>
>
> 2) no pause case: period <= think
>
> don't pause; reduce future pause time by:
> dirty_paused_when += period
> nr_dirtied = 0
>
> period time
> |===============================>|
> think time
> |===================================================>|
> ------|--------------------------------+-------------------|----
> dirty_paused_when jiffies
>
> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Looks good. You can add:
Acked-by: Jan Kara <jack@suse.cz>
Honza
> ---
> include/linux/sched.h | 1
> include/trace/events/writeback.h | 14 ++++++++---
> kernel/fork.c | 1
> mm/page-writeback.c | 36 +++++++++++++++++++++++++----
> 4 files changed, 45 insertions(+), 7 deletions(-)
>
> --- linux-next.orig/include/linux/sched.h 2011-11-17 20:12:22.000000000 +0800
> +++ linux-next/include/linux/sched.h 2011-11-17 20:12:35.000000000 +0800
> @@ -1527,6 +1527,7 @@ struct task_struct {
> */
> int nr_dirtied;
> int nr_dirtied_pause;
> + unsigned long dirty_paused_when; /* start of a write-and-pause period */
>
> #ifdef CONFIG_LATENCYTOP
> int latency_record_count;
> --- linux-next.orig/mm/page-writeback.c 2011-11-17 20:12:22.000000000 +0800
> +++ linux-next/mm/page-writeback.c 2011-11-17 20:12:38.000000000 +0800
> @@ -1010,6 +1010,7 @@ static void balance_dirty_pages(struct a
> unsigned long background_thresh;
> unsigned long dirty_thresh;
> unsigned long bdi_thresh;
> + long period;
> long pause = 0;
> long uninitialized_var(max_pause);
> bool dirty_exceeded = false;
> @@ -1020,6 +1021,8 @@ static void balance_dirty_pages(struct a
> unsigned long start_time = jiffies;
>
> for (;;) {
> + unsigned long now = jiffies;
> +
> /*
> * Unstable writes are a feature of certain networked
> * filesystems (i.e. NFS) in which data may have been
> @@ -1039,8 +1042,11 @@ static void balance_dirty_pages(struct a
> */
> freerun = dirty_freerun_ceiling(dirty_thresh,
> background_thresh);
> - if (nr_dirty <= freerun)
> + if (nr_dirty <= freerun) {
> + current->dirty_paused_when = now;
> + current->nr_dirtied = 0;
> break;
> + }
>
> if (unlikely(!writeback_in_progress(bdi)))
> bdi_start_background_writeback(bdi);
> @@ -1098,10 +1104,21 @@ static void balance_dirty_pages(struct a
> task_ratelimit = ((u64)dirty_ratelimit * pos_ratio) >>
> RATELIMIT_CALC_SHIFT;
> if (unlikely(task_ratelimit == 0)) {
> + period = max_pause;
> pause = max_pause;
> goto pause;
> }
> - pause = HZ * pages_dirtied / task_ratelimit;
> + period = HZ * pages_dirtied / task_ratelimit;
> + pause = period;
> + if (current->dirty_paused_when)
> + pause -= now - current->dirty_paused_when;
> + /*
> + * For less than 1s think time (ext3/4 may block the dirtier
> + * for up to 800ms from time to time on 1-HDD; so does xfs,
> + * however at much less frequency), try to compensate it in
> + * future periods by updating the virtual time; otherwise just
> + * do a reset, as it may be a light dirtier.
> + */
> if (unlikely(pause <= 0)) {
> trace_balance_dirty_pages(bdi,
> dirty_thresh,
> @@ -1112,8 +1129,16 @@ static void balance_dirty_pages(struct a
> dirty_ratelimit,
> task_ratelimit,
> pages_dirtied,
> + period,
> pause,
> start_time);
> + if (pause < -HZ) {
> + current->dirty_paused_when = now;
> + current->nr_dirtied = 0;
> + } else if (period) {
> + current->dirty_paused_when += period;
> + current->nr_dirtied = 0;
> + }
> pause = 1; /* avoid resetting nr_dirtied_pause below */
> break;
> }
> @@ -1129,11 +1154,15 @@ pause:
> dirty_ratelimit,
> task_ratelimit,
> pages_dirtied,
> + period,
> pause,
> start_time);
> __set_current_state(TASK_KILLABLE);
> io_schedule_timeout(pause);
>
> + current->dirty_paused_when = now + pause;
> + current->nr_dirtied = 0;
> +
> /*
> * This is typically equal to (nr_dirty < dirty_thresh) and can
> * also keep "1000+ dd on a slow USB stick" under control.
> @@ -1148,11 +1177,10 @@ pause:
> if (!dirty_exceeded && bdi->dirty_exceeded)
> bdi->dirty_exceeded = 0;
>
> - current->nr_dirtied = 0;
> if (pause == 0) { /* in freerun area */
> current->nr_dirtied_pause =
> dirty_poll_interval(nr_dirty, dirty_thresh);
> - } else if (pause <= max_pause / 4 &&
> + } else if (period <= max_pause / 4 &&
> pages_dirtied >= current->nr_dirtied_pause) {
> current->nr_dirtied_pause = clamp_val(
> dirty_ratelimit * (max_pause / 2) / HZ,
> --- linux-next.orig/kernel/fork.c 2011-11-17 20:12:23.000000000 +0800
> +++ linux-next/kernel/fork.c 2011-11-17 20:12:35.000000000 +0800
> @@ -1296,6 +1296,7 @@ static struct task_struct *copy_process(
>
> p->nr_dirtied = 0;
> p->nr_dirtied_pause = 128 >> (PAGE_SHIFT - 10);
> + p->dirty_paused_when = 0;
>
> /*
> * Ok, make it visible to the rest of the system.
> --- linux-next.orig/include/trace/events/writeback.h 2011-11-17 19:13:41.000000000 +0800
> +++ linux-next/include/trace/events/writeback.h 2011-11-17 20:12:35.000000000 +0800
> @@ -289,12 +289,13 @@ TRACE_EVENT(balance_dirty_pages,
> unsigned long dirty_ratelimit,
> unsigned long task_ratelimit,
> unsigned long dirtied,
> + unsigned long period,
> long pause,
> unsigned long start_time),
>
> TP_ARGS(bdi, thresh, bg_thresh, dirty, bdi_thresh, bdi_dirty,
> dirty_ratelimit, task_ratelimit,
> - dirtied, pause, start_time),
> + dirtied, period, pause, start_time),
>
> TP_STRUCT__entry(
> __array( char, bdi, 32)
> @@ -309,6 +310,8 @@ TRACE_EVENT(balance_dirty_pages,
> __field(unsigned int, dirtied_pause)
> __field(unsigned long, paused)
> __field( long, pause)
> + __field(unsigned long, period)
> + __field( long, think)
> ),
>
> TP_fast_assign(
> @@ -325,6 +328,9 @@ TRACE_EVENT(balance_dirty_pages,
> __entry->task_ratelimit = KBps(task_ratelimit);
> __entry->dirtied = dirtied;
> __entry->dirtied_pause = current->nr_dirtied_pause;
> + __entry->think = current->dirty_paused_when == 0 ? 0 :
> + (long)(jiffies - current->dirty_paused_when) * 1000/HZ;
> + __entry->period = period * 1000 / HZ;
> __entry->pause = pause * 1000 / HZ;
> __entry->paused = (jiffies - start_time) * 1000 / HZ;
> ),
> @@ -335,7 +341,7 @@ TRACE_EVENT(balance_dirty_pages,
> "bdi_setpoint=%lu bdi_dirty=%lu "
> "dirty_ratelimit=%lu task_ratelimit=%lu "
> "dirtied=%u dirtied_pause=%u "
> - "paused=%lu pause=%ld",
> + "paused=%lu pause=%ld period=%lu think=%ld",
> __entry->bdi,
> __entry->limit,
> __entry->setpoint,
> @@ -347,7 +353,9 @@ TRACE_EVENT(balance_dirty_pages,
> __entry->dirtied,
> __entry->dirtied_pause,
> __entry->paused, /* ms */
> - __entry->pause /* ms */
> + __entry->pause, /* ms */
> + __entry->period, /* ms */
> + __entry->think /* ms */
> )
> );
>
>
>
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
next prev parent reply other threads:[~2011-12-07 16:14 UTC|newest]
Thread overview: 18+ messages / expand[flat|nested] mbox.gz Atom feed top
2011-11-28 13:53 [PATCH 0/7] dirty throttling bits for 3.3 (v2) Wu Fengguang
2011-11-28 13:53 ` [PATCH 1/7] writeback: balanced_rate cannot exceed write bandwidth Wu Fengguang
2011-12-07 10:21 ` Jan Kara
2011-11-28 13:53 ` [PATCH 2/7] writeback: charge leaked page dirties to active tasks Wu Fengguang
2011-12-07 10:23 ` Jan Kara
2011-11-28 13:53 ` [PATCH 3/7] writeback: fix dirtied pages accounting on sub-page writes Wu Fengguang
2011-12-07 10:53 ` Jan Kara
2011-12-07 12:08 ` Wu Fengguang
2011-12-07 16:07 ` Jan Kara
2011-12-08 2:44 ` Wu Fengguang
2011-11-28 13:53 ` [PATCH 4/7] writeback: fix dirtied pages accounting on redirty Wu Fengguang
2011-12-07 16:09 ` Jan Kara
2011-11-28 13:53 ` [PATCH 5/7] btrfs: fix dirtied pages accounting on sub-page writes Wu Fengguang
2011-11-28 14:16 ` Wu Fengguang
2011-11-28 13:53 ` [PATCH 6/7] writeback: dirty ratelimit - think time compensation Wu Fengguang
2011-12-07 16:14 ` Jan Kara [this message]
2011-11-28 13:53 ` [PATCH 7/7] writeback: comment on the bdi dirty threshold Wu Fengguang
2011-12-07 10:57 ` Jan Kara
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20111207161440.GL4622@quack.suse.cz \
--to=jack@suse.cz \
--cc=a.p.zijlstra@chello.nl \
--cc=akpm@linux-foundation.org \
--cc=fengguang.wu@intel.com \
--cc=hch@lst.de \
--cc=linux-fsdevel@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).