linux-fsdevel.vger.kernel.org archive mirror
From: Chris Mason <chris.mason@oracle.com>
To: Peter Zijlstra <peterz@infradead.org>
Cc: Artem Bityutskiy <dedekind1@gmail.com>,
	Jens Axboe <jens.axboe@oracle.com>,
	linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	david@fromorbit.com, hch@infradead.org,
	akpm@linux-foundation.org, jack@suse.cz,
	"Theodore Ts'o" <tytso@mit.edu>,
	Wu Fengguang <fengguang.wu@intel.com>
Subject: Re: [PATCH 8/8] vm: Add an tuning knob for vm.max_writeback_mb
Date: Tue, 8 Sep 2009 13:57:56 -0400
Message-ID: <20090908175756.GG2975@think>
In-Reply-To: <1252431974.7746.151.camel@twins>

On Tue, Sep 08, 2009 at 07:46:14PM +0200, Peter Zijlstra wrote:
> On Tue, 2009-09-08 at 13:28 -0400, Chris Mason wrote:
> > > Right, so what can we do to make it useful? I think the intent is to
> > > limit the number of pages in writeback and provide some progress
> > > feedback to the vm.
> > > 
> > > Going by your experience we're failing there.
> > 
> > Well, congestion_wait is a stop sign but not a queue.  So, if you're
> > being nice and honoring congestion but another process (say O_DIRECT
> > random writes) doesn't, then you back off forever and none of your IO
> > gets done.
> > 
> > To get around this, you can add code to make sure that you do
> > _some_ io, but this isn't enough for your work to get done
> > quickly, and you do end up waiting in get_request() so the async
> > benefits of using the congestion test go away.
> > 
> > If we changed everyone to honor congestion, we end up with a poll model
> > because a ton of congestion_wait() callers create a thundering herd.
> > 
> > So, we could add a queue, and then congestion_wait() would look a lot
> > like get_request_wait().  I'd rather that everyone just used
> > get_request_wait, and then have us fix any latency problems in the
> > elevator.
> 
> Except you'd need to lift it to the BDI layer, because not all backing
> devices are a block device.
> 
> Making it into a per-bdi queue sounds good to me though.
> 
> > For me, perfect would be one or more threads per-bdi doing the
> > writeback, and never checking for congestion (like what Jens' code
> > does).  The congestion_wait inside balance_dirty_pages() is really just
> > a schedule_timeout(), on a fully loaded box the congestion doesn't go
> > away anyway.  We should switch that to a saner system of waiting for
> > progress on the bdi writeback + dirty thresholds.
> 
> Right, one of the things we could possibly do is tie into
> __bdi_writeout_inc() and test levels there once every so often and then
> flip a bit when we're low enough to stop writing.
> 
> > Btrfs would love to be able to send down a bio non-blocking.  That would
> > let me get rid of the congestion check I have today (I think Jens said
> > that would be an easy change and then I talked him into some small mods
> > of the writeback path).
> 
> Won't that land us in trouble because the amount of writeback will
> become unwieldy?

The btrfs usage is a little different.  I've got a pile of bios all
set up and ready for submission, and I'm trying to send them down to N
devices from one thread.  So, if a given submit_bio call is going to
block, I'd rather move on to another device.

This is really what pdflush is using congestion for too; the difference
is that I've already got the bios made.
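
As a rough illustration of the "move on to another device" idea, here is
a user-space toy model in Python, not kernel code.  The Device class,
try_submit() and submit_pass() are hypothetical names made up for this
sketch, and a bounded in-flight counter stands in for a request queue
that would otherwise make the submitter sleep.

```python
from collections import deque


class Device:
    """Toy stand-in for a block device with a bounded request queue."""

    def __init__(self, name, queue_depth):
        self.name = name
        self.queue_depth = queue_depth  # assumed limit on in-flight requests
        self.in_flight = 0

    def try_submit(self, bio):
        """Non-blocking submit: report failure instead of sleeping."""
        if self.in_flight >= self.queue_depth:
            return False
        self.in_flight += 1
        return True

    def complete(self, count=1):
        """Pretend that `count` in-flight requests have finished."""
        self.in_flight = max(0, self.in_flight - count)


def submit_pass(devices, pending):
    """Make one pass over all devices from a single thread.

    `pending` maps a device name to a deque of already-built bios.
    When a device would block, leave its remaining bios queued and
    move on to the next device.  Returns the (device, bio) pairs sent.
    """
    submitted = []
    for dev in devices:
        queue = pending[dev.name]
        while queue:
            if not dev.try_submit(queue[0]):
                break  # this device would block; try the next one
            submitted.append((dev.name, queue.popleft()))
    return submitted
```

The bios that could not be sent stay at the head of their per-device
queue, so a later pass picks them up once the device has drained.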

> 
> > > > > Now, suppose it were to do something useful, I'd think we'd want to
> > > > > limit write-out to whatever it takes to saturate the BDI.
> > > > 
> > > > If we don't want a blanket increase, 
> > > 
> > > The thing is, this sysctl seems an utter cop out, we can't even explain
> > > how to calculate a number that'll work for a situation, the best we can
> > > do is say, prod at it and pray -- that's not good.
> > > 
> > > Last time I also asked if an increased number is good for every
> > > situation, I have a machine with a RAID5 array and USB storage, will it
> > > harm either situation?
> > 
> > If the goal is to make sure that pdflush or balance_dirty_pages only
> > does IO until some condition is met, we should add a flag to the bdi
> > that gets set when that condition is met.  Things will go a lot more
> > smoothly than magic numbers.
> 
> Agreed - and from what I can make out, that really is the only goal
> here.
> 
> > Then we can add the fs_hint as another change so the FS can tell
> > write_cache_pages callers how to do optimal IO based on its allocation
> > decisions.
> 
> I think you lost me here, but I think you mean to provide some FS
> specific feedback to the generic write page routines -- whatever
> works ;-)

Going back to the streaming writer case, pretend the FS just created a
nice fat 256MB extent out of delalloc pages, but after we wrote the first
4k, we dropped below the dirty threshold and IO is no longer "required".

It would be silly to just write 4k.  We know we have a contiguous
area 256MB long on disk and 256MB of dirty pages.  In this case, pdflush
(or Jens' bdi threads) want to write some large portion of that 256MB.

You might argue a balance_dirty_pages caller wants to return quickly,
but even then we'd want to write at least 128k.
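
To put numbers on that, here is a small Python sketch of the decision,
not anything from the patch series.  writeback_goal() and
FOREGROUND_FLOOR are hypothetical names, and two things are assumptions
of the sketch: the background flusher finishes the whole contiguous
extent (one reading of "some large portion"), and a
balance_dirty_pages-style caller writes exactly the 128k floor rather
than more.

```python
KIB = 1024
MIB = 1024 * KIB

# Assumed floor for a foreground caller, taken from the 128k figure above.
FOREGROUND_FLOOR = 128 * KIB


def writeback_goal(contiguous_dirty_bytes, written_bytes, foreground):
    """How many more bytes to write after dropping below the dirty threshold.

    `contiguous_dirty_bytes` is the size of the contiguous on-disk extent
    that is fully backed by dirty pages (the FS hint); `written_bytes` is
    how much of it has already gone out.
    """
    remaining = max(0, contiguous_dirty_bytes - written_bytes)
    if foreground:
        # The caller wants to return quickly, but should still issue a
        # reasonably sized IO instead of stopping after a single page.
        return min(remaining, FOREGROUND_FLOOR)
    # Background flusher: assumed here to finish the contiguous extent.
    return remaining
```

For the 256MB extent with 4k already written, the background case
yields the remaining 256MB minus 4k and the foreground case yields
128k.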

-chris

