linux-fsdevel.vger.kernel.org archive mirror
From: Chris Mason <chris.mason@oracle.com>
To: Peter Zijlstra <peterz@infradead.org>
Cc: Artem Bityutskiy <dedekind1@gmail.com>,
	Jens Axboe <jens.axboe@oracle.com>,
	linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	david@fromorbit.com, hch@infradead.org,
	akpm@linux-foundation.org, jack@suse.cz,
	"Theodore Ts'o" <tytso@mit.edu>
Subject: Re: [PATCH 8/8] vm: Add an tuning knob for vm.max_writeback_mb
Date: Tue, 8 Sep 2009 12:29:36 -0400
Message-ID: <20090908162936.GA2975@think>
In-Reply-To: <1252425983.7746.120.camel@twins>

On Tue, Sep 08, 2009 at 06:06:23PM +0200, Peter Zijlstra wrote:
> On Tue, 2009-09-08 at 13:37 +0300, Artem Bityutskiy wrote:
> > Hi,
> > 
> > On 09/08/2009 12:23 PM, Jens Axboe wrote:
> > > From: Theodore Ts'o<tytso@mit.edu>
> > >
> > > Originally, MAX_WRITEBACK_PAGES was hard-coded to 1024 because of a
> > > concern of not holding I_SYNC for too long.  (At least, that was the
> > > comment previously.)  This doesn't make sense now because the only
> > > time we wait for I_SYNC is if we are calling sync or fsync, and in
> > > that case we need to write out all of the data anyway.  Previously
> > > there may have been other code paths that waited on I_SYNC, but not
> > > any more.
> > >
> > > According to Christoph, the current writeback size is way too small,
> > > and XFS had a hack that bumped out nr_to_write to four times the value
> > > sent by the VM to be able to saturate medium-sized RAID arrays.  This
> > > > value was also problematic for ext4, as it caused large files
> > > > to become interleaved on disk in 8 megabyte chunks (we bumped up
> > > > the nr_to_write by a factor of two).
> > >
> > > So, in this patch, we make the MAX_WRITEBACK_PAGES a tunable,
> > > max_writeback_mb, and set it to a default value of 128 megabytes.
> > >
> > > http://bugzilla.kernel.org/show_bug.cgi?id=13930
> > >
> > > Signed-off-by: "Theodore Ts'o"<tytso@mit.edu>
> > > Signed-off-by: Jens Axboe<jens.axboe@oracle.com>
> > 
> > Would be nice to update doc files like
> > 
> > Documentation/sysctl/vm.txt
> > Documentation/filesystems/proc.txt
> 
> I'm still not convinced this knob is worth the patch and I'm inclined to
> flat out NAK it..
> 
> The whole point of MAX_WRITEBACK_PAGES seems to occasionally check the
> dirty stats again and not write out too much.

The problem is that 'too much' is a very abstract thing.  When a process
is stuck in balance_dirty_pages(), we want it to do the minimal amount
of work (or waiting) required to get it safely back inside file_write().

> 
> Clearly the current limit isn't sufficient for some people,
>  - xfs/btrfs seem generally stuck in balance_dirty_pages()'s
> congestion_wait()
>  - ext4 generates inconveniently small extents

These are actually two different sides of the same problem.  The filesystem
knows that bytes 0-N in the file are set up for delayed allocation.
Writepage is called on byte 0, and now the filesystem gets to decide how
big an extent to make.

It could decide to make an extent based on the total number of bytes
under delayed allocation, and hope the caller of writepage will be kind
enough to send down the pages contiguously afterward (xfs), or it could
make a smaller extent based on something closer to the total number of
bytes this particular writepages() call plans on writing (I guess what
ext4 is doing).

Either way, if pdflush or the bdi thread or whoever ends up switching to
another file during a big streaming write, the end result is that we
fragment.  We may fragment the file (ext4) or we may fragment the
writeback (xfs), but the end result isn't good.
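
To make the fragmentation argument concrete, here is a rough userspace
model (not kernel code).  It assumes the ext4-like case where disk blocks
are handed out sequentially at writeback time, and a flusher that visits
the files round-robin with a budget of nr_to_write pages per visit.  The
names simulate_extents, sim_file and MAX_FILES are made up for this sketch.

```c
#include <stddef.h>

#define MAX_FILES 16	/* arbitrary cap for this toy model */

struct sim_file {
	size_t remaining;	/* dirty pages still to be written */
	size_t next_block;	/* block that would extend the last extent */
	size_t extents;		/* extents allocated so far */
};

/*
 * Round-robin over nr_files equally sized files, writing at most
 * nr_to_write pages per visit and allocating blocks from a single
 * sequential cursor.  Returns the extent count of the first file
 * (all files are the same size, so they all end up with the same count).
 */
size_t simulate_extents(size_t nr_files, size_t file_pages, size_t nr_to_write)
{
	struct sim_file files[MAX_FILES];
	size_t disk_cursor = 0;
	size_t left = nr_files;
	size_t i;

	if (nr_files == 0 || nr_files > MAX_FILES ||
	    file_pages == 0 || nr_to_write == 0)
		return 0;

	for (i = 0; i < nr_files; i++) {
		files[i].remaining = file_pages;
		files[i].next_block = 0;
		files[i].extents = 0;
	}

	while (left > 0) {
		for (i = 0; i < nr_files; i++) {
			struct sim_file *f = &files[i];
			size_t chunk;

			if (f->remaining == 0)
				continue;

			chunk = f->remaining < nr_to_write ?
				f->remaining : nr_to_write;

			/* A new extent starts unless we continue right
			 * where this file's previous chunk ended. */
			if (f->extents == 0 || f->next_block != disk_cursor)
				f->extents++;

			disk_cursor += chunk;
			f->next_block = disk_cursor;
			f->remaining -= chunk;
			if (f->remaining == 0)
				left--;
		}
	}

	return files[0].extents;
}
```

In this model, two 32768-page files written with a 1024-page budget end up
with 32 extents each, while a single writer, or a budget large enough to
cover the whole file, gives one extent.  Real allocators are smarter than a
single cursor, so treat the numbers as an illustration only.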

Looking at two xfs examples, this is the IO for two concurrent streaming
writers (two different files) on 2.6.31-rc8 (pdflush is doing all the IO
in this graph, sorry the legend colors wrapped on me).  If you squint,
you can kind of see the fingers of IO as pdflush switches between files.

http://oss.oracle.com/~mason/seekwatcher/xfs-tag.png

And here is the IO when XFS forces nr_to_write much higher with a patch
from Christoph:

http://oss.oracle.com/~mason/seekwatcher/xfs-extend-tag.png

These graphs would look the same no matter what I did with
congestion_wait().  The first graph is slower just because pdflush
switches from one file to another.

> 
> 
> The first seems to suggest to me the number isn't well balanced against
> whatever drives congestion_wait() (that thing still gives me a
> head-ache).
> 
> # git grep clear_bdi_congested
> drivers/block/pktcdvd.c:                clear_bdi_congested(&pd->disk->queue->backing_dev_info,
> fs/fuse/dev.c:                  clear_bdi_congested(&fc->bdi, BLK_RW_SYNC);
> fs/fuse/dev.c:                  clear_bdi_congested(&fc->bdi, BLK_RW_ASYNC);
> fs/nfs/write.c:         clear_bdi_congested(&nfss->backing_dev_info, BLK_RW_ASYNC);
> include/linux/backing-dev.h:void clear_bdi_congested(struct backing_dev_info *bdi, int sync);
> include/linux/blkdev.h: clear_bdi_congested(&q->backing_dev_info, sync);
> mm/backing-dev.c:void clear_bdi_congested(struct backing_dev_info *bdi, int sync)
> mm/backing-dev.c:EXPORT_SYMBOL(clear_bdi_congested);
> 
> Suggests that regular block devices don't even manage device congestion
> and it reverts to a simple timeout -- should we fix that?

Look for blk_clear_queue_congested().  It is managed; I personally just
don't think it is very useful.  But, that's a different thread ;)

> 
> Now, suppose it were to do something useful, I'd think we'd want to
> limit write-out to whatever it takes to saturate the BDI.

If we don't want a blanket increase, I'd suggest that we just give the
FS a way to say: 'I know nr_to_write is only 32, but if you just write a
few blocks more, the system will be better off'.

Something like wbc->fs_write_hint

This way, when the FS allocates a great big contiguous delalloc extent,
it can set the wbc to reflect that we've got cheap and easy IO here.
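
A rough sketch of how such a hint could behave, as a standalone userspace
model.  Only the field name fs_write_hint comes from the suggestion above;
the struct wbc_model, both helper functions and the max_pages cap are
assumptions made up for illustration, and this is not the real
struct writeback_control.

```c
#include <stddef.h>

/* Toy stand-in for the writeback control block; hypothetical layout. */
struct wbc_model {
	long nr_to_write;	/* pages the caller asked for */
	long fs_write_hint;	/* extra pages the FS says are cheap to write */
};

/*
 * Filesystem side (hypothetical): after allocating a contiguous delalloc
 * extent of extent_pages, record how many pages beyond nr_to_write could
 * be written without leaving that extent.
 */
void fs_set_write_hint(struct wbc_model *wbc, long extent_pages)
{
	if (extent_pages > wbc->nr_to_write)
		wbc->fs_write_hint = extent_pages - wbc->nr_to_write;
	else
		wbc->fs_write_hint = 0;
}

/*
 * Writeback side (hypothetical): honor the hint, but never exceed a cap
 * chosen by the caller, so the dirty stats still get rechecked.
 */
long effective_nr_to_write(const struct wbc_model *wbc, long max_pages)
{
	long want = wbc->nr_to_write + wbc->fs_write_hint;

	return want < max_pages ? want : max_pages;
}
```

With nr_to_write at 32 and a 2048-page extent, the hint becomes 2016 and
the effective budget is 2048 pages, or whatever cap the caller passes if
that is smaller.  The cap is there so the VM keeps the final say.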

> 
> 
> As to the extents, shouldn't ext4 allocate extents based on the amount
> of dirty pages in the file instead of however much we're going to write
> out now?

It probably does a mixture of both.

-chris

