From: Theodore Tso <tytso@mit.edu>
To: Chris Mason <chris.mason@oracle.com>,
Peter Zijlstra <peterz@infradead.org>,
Artem Bityutskiy <dedekind1@gmail.com>,
Jens Axboe <jens.axboe@oracle.com>, linux-kernel@vger.kernel.org
Subject: Re: [PATCH 8/8] vm: Add an tuning knob for vm.max_writeback_mb
Date: Tue, 8 Sep 2009 14:06:01 -0400
Message-ID: <20090908180601.GN22901@mit.edu>
In-Reply-To: <20090908162936.GA2975@think>

On Tue, Sep 08, 2009 at 12:29:36PM -0400, Chris Mason wrote:
> >
> > Clearly the current limit isn't sufficient for some people,
> > - xfs/btrfs seem generally stuck in balance_dirty_pages()'s
> > congestion_wait()
> > - ext4 generates inconveniently small extents
>
> This is actually two different sides of the same problem. The filesystem
> knows that bytes 0-N in the file are setup for delayed allocation.
> Writepage is called on byte 0, and now the filesystem gets to decide how
> big an extent to make.
>
> It could decide to make an extent based on the total number of bytes
> under delayed allocation, and hope the caller of writepage will be kind
> enough to send down the pages contiguously afterward (xfs), or it could
> make a smaller extent based on something closer to the total number of
> bytes this particular writepages() call plans on writing (I guess what
> ext4 is doing).
>
> Either way, if pdflush or the bdi thread or whoever ends up switching to
> another file during a big streaming write, the end result is that we
> fragment. We may fragment the file (ext4) or we may fragment the
> writeback (xfs), but the end result isn't good.

Yep; the question is whether we want to fragment the read operation in
the future (ext4) or the write operation now (XFS).

> > Now, suppose it were to do something useful, I'd think we'd want to
> > limit write-out to whatever it takes to saturate the BDI.
>
> If we don't want a blanket increase, I'd suggest that we just give the
> FS a way to say: 'I know nr_to_write is only 32, but if you just write a
> few blocks more, the system will be better off'.

Well, we can mostly do this now, using the XFS hack:

	wbc->nr_to_write *= 4;

Which is another way of saying, we *know* the page writeback routines
are on crack, so we'll ignore their suggestion of how many pages to
write, and we'll try to write more than what they asked us to write.
(This wasn't a proposed change; it's in Linux 2.6 mainline already;
see fs/xfs/linux-2.6/xfs_aops.c, in xfs_vm_writepage). The fact that
filesystems are playing games like this should be a clear indication
that things are badly broken above....
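
For readers without the XFS source at hand, here is a minimal userspace
sketch of what that one-liner amounts to. The struct wbc_sketch type, the
XFS_STYLE_BOOST macro, and boost_nr_to_write() are hypothetical names made
up for illustration; the real code operates on the kernel's struct
writeback_control inside xfs_vm_writepage, and only the multiply-by-four
comes from the line quoted above.

```c
#include <assert.h>

/* Hypothetical stand-in for the kernel's struct writeback_control;
 * only the one field discussed in this mail is modelled. */
struct wbc_sketch {
	long nr_to_write;	/* pages the writeback code asked us to write */
};

/* Multiplier taken from the XFS line quoted in the mail. */
#define XFS_STYLE_BOOST 4

/* Mimic the hack: widen the page budget before starting writeback,
 * so the filesystem writes more than the caller suggested. */
void boost_nr_to_write(struct wbc_sketch *wbc)
{
	assert(wbc);
	wbc->nr_to_write *= XFS_STYLE_BOOST;
}
```

With an illustrative request of 1024 pages, the filesystem would treat
its budget as 4096 pages, whatever the caller actually intended.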

> > As to the extents, shouldn't ext4 allocate extents based on the amount
> > of dirty pages in the file instead of however much we're going to write
> > out now?
>
> It probably does a mixture of both.

It does do a mixture, but in a fairly primitive way. I was thinking
about writing some ugly code to determine more precisely how many
dirty, delayed-allocation pages exist beyond what we've currently been
asked to write, but it seemed like most of the problem would be solved
by having the page writeback routines send more pages down to the
filesystem, instead of having the filesystem work around brain damage
in the VM writeback routines.
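
To make the "ugly code" idea concrete, here is a rough userspace sketch;
it is not ext4 code. It assumes a hypothetical per-page flag array
standing in for the page cache, with made-up PG_SK_DIRTY and
PG_SK_DELALLOC bits, and a hypothetical helper count_delalloc_beyond()
that counts how many contiguous dirty, delayed-allocation pages follow
the range the caller asked us to write.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-page state bits, for this sketch only. */
enum { PG_SK_DIRTY = 1 << 0, PG_SK_DELALLOC = 1 << 1 };

/*
 * Count the contiguous pages that are both dirty and delayed-allocation,
 * starting just past the requested range [start, start + nr_to_write).
 * Returns 0 if the requested range already reaches the end of the file.
 */
size_t count_delalloc_beyond(const unsigned char *pages, size_t npages,
			     size_t start, size_t nr_to_write)
{
	const unsigned char want = PG_SK_DIRTY | PG_SK_DELALLOC;
	size_t first = start + nr_to_write;
	size_t n = 0;

	assert(pages);
	/* Stop at the first page that is clean or already allocated. */
	while (first + n < npages && (pages[first + n] & want) == want)
		n++;
	return n;
}
```

A filesystem could add the returned count to nr_to_write when sizing an
extent, which is the kind of workaround that would be unnecessary if the
writeback code simply sent more pages down.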
- Ted