All of lore.kernel.org
 help / color / mirror / Atom feed
From: Dave Chinner <david@fromorbit.com>
To: Jan Kara <jack@suse.cz>
Cc: linux-fsdevel@vger.kernel.org, linux-mm@kvack.org,
	hch@infradead.org, akpm@linux-foundation.org,
	peterz@infradead.org, wfg@mail.ustc.edu.cn
Subject: Re: [PATCH RFC] mm: Implement balance_dirty_pages() through waiting for flusher thread
Date: Fri, 18 Jun 2010 16:09:01 +1000	[thread overview]
Message-ID: <20100618060901.GA6590@dastard> (raw)
In-Reply-To: <1276797878-28893-1-git-send-email-jack@suse.cz>

On Thu, Jun 17, 2010 at 08:04:38PM +0200, Jan Kara wrote:
> This patch changes balance_dirty_pages() throttling so that the function does
> not submit writes on its own but rather waits for flusher thread to do enough
> writes. This has an advantage that we have a single source of IO allowing for
> better writeback locality. Also we do not have to reenter filesystems from a
> non-trivial context.
> 
> The waiting is implemented as follows: Each BDI has a FIFO of wait requests.
> When balance_dirty_pages() finds it needs to throttle the writer, it adds a
> request for writing write_chunk of pages to the FIFO and waits until the
> request is fulfilled or we drop below dirty limits. A flusher thread tracks
> amount of pages it has written and when enough pages are written since the
> first request in the FIFO became first, it wakes up the waiter and removes
> request from the FIFO (thus another request becomes first and flusher thread
> starts writing out on behalf of this request).
> 
> CC: hch@infradead.org
> CC: akpm@linux-foundation.org
> CC: peterz@infradead.org
> CC: wfg@mail.ustc.edu.cn
> Signed-off-by: Jan Kara <jack@suse.cz>
> ---
>  include/linux/backing-dev.h |   18 +++++
>  include/linux/writeback.h   |    1 +
>  mm/backing-dev.c            |  119 +++++++++++++++++++++++++++++++
>  mm/page-writeback.c         |  164 +++++++++++++++----------------------------
>  4 files changed, 195 insertions(+), 107 deletions(-)
> 
>  This is mostly a proof-of-concept patch. It's mostly untested and probably
> doesn't completely work. I was just thinking about this for some time already
> and would like to get some feedback on the approach...

It matches the way I'd like to see writeback throttling work.
I think this approach is likely to have less jitter and latency
because throttling is not tied to a specific set of pages or IOs to
complete.

> 
> +++ b/mm/backing-dev.c
> @@ -390,6 +390,122 @@ static void sync_supers_timer_fn(unsigned long unused)
>  	bdi_arm_supers_timer();
>  }
>  
> +/*
> + * Remove writer from the list, update wb_written_head as needed
> + *
> + * Needs wb_written_wait.lock held
> + */
> +static void bdi_remove_writer(struct backing_dev_info *bdi,
> +			      struct bdi_written_count *wc)
> +{
> +	int first = 0;
> +
> +	if (bdi->wb_written_list.next == &wc->list)
> +		first = 1;
> +	/* Initialize list so that entry owner knows it's removed */
> +	list_del_init(&wc->list);
> +	if (first) {
> +		if (list_empty(&bdi->wb_written_list)) {
> +			bdi->wb_written_head = ~(u64)0;
> +			return;
> +		}
> +		wc = list_entry(bdi->wb_written_list.next,
> +				struct bdi_written_count,
> +				list);

list_first_entry()?

> +		bdi->wb_written_head = bdi_stat(bdi, BDI_WRITTEN) + wc->written;

The resolution of the percpu counters is an issue here, I think.
percpu counters update in batches of 32 counts per CPU. wc->written
is going to have a value of roughly 8 or 32 depending on whether
bdi->dirty_exceeded is set or not. I note that you take this into
account when checking dirty threshold limits, but it doesn't appear
to be taken in to here.

That means for ratelimits of 8 pages on writes and only one CPU doing
IO completion wakeups are only going to occur in completion batches
of 32 pages. we'll end up with something like:

waiter, written		count at head	count when woken
 A, 8			n			n + 32
 B, 8			n + 32			n + 64
 C, 8			n + 64			n + 96
.....

And so on. This isn't necessarily bad - we'll throttle for longer
than we strictly need to - but the cumulative counter resolution
error gets worse as the number of CPUs doing IO completion grows.
Worst case ends up at for (num cpus * 31) + 1 pages of writeback for
just the first waiter. For an arbitrary FIFO queue of depth d, the
worst case is more like d * (num cpus * 31 + 1).

Hence I think that instead of calculating the next wakeup threshold
when the head changes the wakeup threshold needs to be a function of
the FIFO depth. That is, it is calculated at queueing time from the
current tail of the queue.

e.g. something like this when queuing:

	if (list_empty(&bdi->wb_written_list))
		wc->wakeup_at = bdi_stat(bdi, BDI_WRITTEN) + written;
		bdi->wb_written_head = wc->wakeup_at;
	else {
		tail = list_last_entry(&bdi->wb_written_list);
		wc->wakeup_at = tail->wakeup_at + written;
	}
	list_add_tail(&wc->list, &bdi->wb_written_list);

And this when the wakeup threshold is tripped:

	written = bdi_stat(bdi, BDI_WRITTEN);
	while (!list_empty(&bdi->wb_written_list)) {
		wc = list_first_entry();

		if (wc->wakeup_at > written)
			break;

		list_del_init(wc)
		wakeup(wc)
	}

	if (!list_empty(&bdi->wb_written_list)) {
		wc = list_first_entry();
		bdi->wb_written_head = wc->wakeup_at;
	} else
		bdi->wb_written_head = ~0ULL;

This takes the counter resolution completely out of the picture - if
the counter resolution is 32, and there are 4 waiters on the fifo
each waiting for 8 pages, then a single tick of the counter will
wake them all up.

Other than that, it seems like a good approach to me...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

  reply	other threads:[~2010-06-18  6:09 UTC|newest]

Thread overview: 63+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2010-06-17 18:04 [PATCH RFC] mm: Implement balance_dirty_pages() through waiting for flusher thread Jan Kara
2010-06-17 18:04 ` Jan Kara
2010-06-18  6:09 ` Dave Chinner [this message]
2010-06-18  9:11   ` Peter Zijlstra
2010-06-18 23:29     ` Dave Chinner
2010-06-21 23:36   ` Jan Kara
2010-06-22  5:44     ` Dave Chinner
2010-06-22  6:14       ` Andrew Morton
2010-06-22  7:45         ` Peter Zijlstra
2010-06-22  8:24           ` Andrew Morton
2010-06-22  8:52             ` Peter Zijlstra
2010-06-22 10:09         ` Dave Chinner
2010-06-22 13:17           ` Jan Kara
2010-06-22 13:17             ` Jan Kara
2010-06-22 13:52             ` Wu Fengguang
2010-06-22 13:52               ` Wu Fengguang
2010-06-22 13:59               ` Peter Zijlstra
2010-06-22 13:59                 ` Peter Zijlstra
2010-06-22 14:00               ` Peter Zijlstra
2010-06-22 14:36                 ` Wu Fengguang
2010-06-22 14:02               ` Jan Kara
2010-06-22 14:02                 ` Jan Kara
2010-06-22 14:24                 ` Wu Fengguang
2010-06-22 14:24                   ` Wu Fengguang
2010-06-22 22:29                 ` Dave Chinner
2010-06-23 13:15                   ` Jan Kara
2010-06-23 13:15                     ` Jan Kara
2010-06-23 23:06                     ` Dave Chinner
2010-06-22 14:31               ` Christoph Hellwig
2010-06-22 14:31                 ` Christoph Hellwig
2010-06-22 14:38                 ` Jan Kara
2010-06-22 14:38                   ` Jan Kara
2010-06-22 22:45                   ` Dave Chinner
2010-06-23  1:34                     ` Wu Fengguang
2010-06-23  1:34                       ` Wu Fengguang
2010-06-23  3:06                       ` Dave Chinner
2010-06-23  3:22                         ` Wu Fengguang
2010-06-23  3:22                           ` Wu Fengguang
2010-06-23  6:03                           ` Dave Chinner
2010-06-23  6:03                             ` Dave Chinner
2010-06-23  6:25                             ` Wu Fengguang
2010-06-23  6:25                               ` Wu Fengguang
2010-06-23 23:42                               ` Dave Chinner
2010-06-23 23:42                                 ` Dave Chinner
2010-06-22 14:41                 ` Wu Fengguang
2010-06-22 11:19       ` Jan Kara
2010-06-22 11:19         ` Jan Kara
2010-06-18 10:21 ` Peter Zijlstra
2010-06-21 13:31   ` Jan Kara
2010-06-18 10:21 ` Peter Zijlstra
2010-06-21 14:02   ` Jan Kara
2010-06-21 14:02     ` Jan Kara
2010-06-21 14:10     ` Jan Kara
2010-06-21 14:10       ` Jan Kara
2010-06-21 14:12       ` Peter Zijlstra
2010-06-18 10:21 ` Peter Zijlstra
2010-06-21 13:42   ` Jan Kara
2010-06-21 13:42     ` Jan Kara
2010-06-22  4:07     ` Wu Fengguang
2010-06-22  4:07       ` Wu Fengguang
2010-06-22 13:27       ` Jan Kara
2010-06-22 13:27         ` Jan Kara
2010-06-22 13:33         ` Wu Fengguang

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20100618060901.GA6590@dastard \
    --to=david@fromorbit.com \
    --cc=akpm@linux-foundation.org \
    --cc=hch@infradead.org \
    --cc=jack@suse.cz \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=peterz@infradead.org \
    --cc=wfg@mail.ustc.edu.cn \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.