linux-fsdevel.vger.kernel.org archive mirror
From: Wu Fengguang <fengguang.wu@intel.com>
To: Jan Kara <jack@suse.cz>
Cc: Peter Zijlstra <peterz@infradead.org>,
	Chris Mason <chris.mason@oracle.com>,
	Artem Bityutskiy <dedekind1@gmail.com>,
	Jens Axboe <jens.axboe@oracle.com>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"linux-fsdevel@vger.kernel.org" <linux-fsdevel@vger.kernel.org>,
	"david@fromorbit.com" <david@fromorbit.com>,
	"hch@infradead.org" <hch@infradead.org>,
	"akpm@linux-foundation.org" <akpm@linux-foundation.org>,
	Theodore Ts'o <tytso@mit.edu>
Subject: Re: [PATCH 8/8] vm: Add an tuning knob for vm.max_writeback_mb
Date: Fri, 2 Oct 2009 18:34:55 +0800	[thread overview]
Message-ID: <20091002103455.GA13308@localhost> (raw)
In-Reply-To: <20091002095459.GA24990@duck.suse.cz>

On Fri, Oct 02, 2009 at 05:54:59PM +0800, Jan Kara wrote:
> On Fri 02-10-09 10:25:12, Wu Fengguang wrote:
> > On Fri, Oct 02, 2009 at 05:35:23AM +0800, Jan Kara wrote:
> > > On Thu 01-10-09 22:54:43, Wu Fengguang wrote:
> > > > > > >   You probably didn't understand my comment in the previous email. This is
> > > > > > > too late to wakeup all the tasks. There are two limits - background_limit
> > > > > > > (set to 5%) and dirty_limit (set to 10%). When amount of dirty data is
> > > > > > > above background_limit, we start the writeback but we don't throttle tasks
> > > > > > > yet. We start throttling tasks only when amount of dirty data on the bdi
> > > > > > > exceeds the part of the dirty limit belonging to the bdi. In case of a
> > > > > > > single bdi, this means we start throttling threads only when 10% of memory
> > > > > > > is dirty. To keep this behavior, we have to wakeup waiting threads as soon
> > > > > > > as their BDI gets below the dirty limit or when global number of dirty
> > > > > > > pages gets below (background_limit + dirty_limit) / 2.
> > > > > > 
> > > > > > Sure, but the design goal is to wakeup the throttled tasks in the
> > > > > > __bdi_writeout_inc() path instead of here. As long as some (background)
> > > > > > writeback is running, __bdi_writeout_inc() will be called to wakeup
> > > > > > the tasks.  This "unthrottle all on exit of background writeback" is
> > > > > > merely a safeguard, since once background writeback (which could be
> > > > > > queued by the throttled task itself, in bdi_writeback_wait) exits, the
> > > > > > calls to __bdi_writeout_inc() is likely to stop.
> > > > >   The thing is: In the old code, tasks returned from balance_dirty_pages()
> > > > > as soon as we got below dirty_limit, regardless of how much they managed to
> > > > > write. So we want to wake them up from waiting as soon as we get below the
> > > > > dirty limit (maybe a bit later so that they don't immediately block again
> > > > > but I hope you get the point).
> > > > 
> > > > Ah good catch!  However overhitting the threshold by 1MB (maybe more with
> > > > concurrent dirtiers) should not be a problem. As you said, that avoids the
> > > > task being immediately blocked again.
> > > > 
> > > > The old code does the dirty_limit check in an opportunistic manner. There were
> > > > no guarantee. 2.6.32 further weakens it with the removal of congestion back off.
> > >   Sure, there are no guarantees but if we let threads sleep in
> > > balance_dirty_pages longer than necessary it will have a performance impact
> > > (application will sleep instead of doing useful work). So we should better
> > > make sure applications sleep as few as necessary in balance_dirty_pages.
> > 
> > To avoid long sleep, we limit write_chunk size for balance_dirty_pages.
> > That's all we need.  The "abort earlier if below dirty_limit" logic is
> > not necessary (or even undesirable) in three ways.
> > - just found that pre-31 kernels will normally succeed in writing the
> >   whole write_chunk because nonblocking=0, thus it won't backoff on
> >   congestion. So it's not over_bground_thresh() but over_dirty_limit()
> >   that will change behavior.
>   OK, good point.
> 
> > - whether it be abort on over_bground_thresh() or over_dirty_limit(),
> >   there is some constant threshold around which applications are
> >   throttled. The exact threshold level won't change the throttled
> >   dirty throughput. It is determined by the write IO throughput the
> >   block device can handle.
>   But the aim is to throttle applications at higher limit than a limit at
> which we start pdflush-style writeback. So that if writeback thread is fast
> enough to flush the data, applications don't get throttled at all. That's
> the reason for a difference between dirty_thresh and background_thresh.

When doing over_bground_thresh(), the real threshold won't be far from the dirty_limit.
- for a single dirtier, the threshold may be (dirty_limit - 4MB).
- for N dirtiers, it may be (dirty_limit - N*1MB) in the worst case (the
  ratelimit will back off on dirty_exceeded). However it's highly
  unlikely to reach the worst case: with so many dirtiers and so much
  dirtying pressure, a small fraction of "unthrottled at the moment"
  dirtiers will be able to pump the dirty pages up to the dirty limit.
  Since the dirtiers are unthrottled one by one, it is unlikely for
  them all to block at the same time. Statistically, the larger N is,
  the lower the probability that all N processes enqueue at the same
  time; it's an exponentially decreasing function.

> > - The over_bground_thresh() check is merely a safeguard which is not
> >   relevant in 99.9% time. But when increased to over_dirty_limit(), it
> >   may become a hot wakeup path comparable to the __bdi_writeout_inc()
> >   path.  The problem of this wakeup path is, it is "wakeup all". It's
> >   preferable to wake up processes one by one in __bdi_writeout_inc().
>   Well, it depends on the number of applications writing data (if there are
> 100 threads writing data, the last would get unblocked after 400 MB are
> written assuming ratelimit_pages = 1024). So in this case there are high
> chances that quite some threads will get woken up because we reach even
> background_thresh.

There is such a chance, but it should be extremely low in probability :)

>   What I'm in fact a bit worried about is the latency - in the example
> above it can take quite a long time for an application to be woken in
> balance_dirty_pages (that's not a new problem I agree). When the threads

No worry, it's fine :) The over_dirty_limit() check could make things
better, but it is not a guarantee. In fact there is no guarantee of
latency at all when so many dirtiers are competing for the IO channel..

> are writing continuously losts of data, there's no way around this. But
> when it was just a short spike of IO, we'd win if we woke those threads
> earlier. But OK, probaly we can sort that out later.

Yes, in this case it would be beneficial. The good thing is, the
over_dirty_limit() check would be trivial to add if necessary.

Thanks,
Fengguang

