From: Nick Piggin <npiggin@kernel.dk>
To: Christoph Hellwig <hch@infradead.org>
Cc: Nick Piggin <npiggin@kernel.dk>,
linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
Andrew Morton <akpm@linux-foundation.org>
Subject: Re: [patch 8/8] fs: add i_op->sync_inode
Date: Fri, 7 Jan 2011 15:47:34 +1100 [thread overview]
Message-ID: <20110107044734.GA4552@amd> (raw)
In-Reply-To: <20110106204510.GA2872@infradead.org>
On Thu, Jan 06, 2011 at 03:45:10PM -0500, Christoph Hellwig wrote:
> > > The problem is that currently we almost never do a pure blocking
> > > ->write_inode. The way the sync code is written we always do a
> > > non-blocking one first, then a blocking one. If you always do the
> > > synchronous one we'll get a lot more overhead - the first previous
> > > asynchronous one will write the inode (be it just into the log, or for
> > > real), then we write back data, and then we'll have to write it again
> > > becaus it has usually been redirtied again due to the data writeback in
> > > the meantime.
> >
> > It doesn't matter, the integrity still has to be enforced in .sync_fs,
> > because sync .write_inode may *never* get called, because of the fact
> > that async .write_inode also clears the inode metadata dirty bits.
> >
> > So if .sync_fs has to enforce integrity anyway, then you don't ever need
> > to do any actual waiting in your sync .write_inode. See?
>
> I'm not talking about the actual waiting. Right now we have two
> different use cases for ->write_inode:
>
> 1) sync_mode == WB_SYNC_NONE
>
> This tells the filesystem to start an opportunistic writeout.
>
> 2) sync_mode == WB_SYNC_ALL
>
> This tells the filesystem it needs to to a mandatory writeout.
>
> Note that writeout is losely defined. If a filesystems isn't
> exportable or implements the commit_metadata operation it's indeed
> enough to synchronize the state into internal fs data just enough for
> ->sync_fs.
>
> Or that's how it should be. As you pointed out the way the writeback
> code treats the WB_SYNC_NONE writeouts makes this not work as expected.
>
> There's various ways to fix this:
>
> 1) the one you advocate, that is treating all ->write_inode calls as
> if they were WB_SYNC_ALL. This does fix the issue of incorrectly
> updating the dirty state, but causes a lot of additional I/O -
> the way the sync process is designed we basically always call
> ->write_inode with WB_SYNC_NONE first, and then with WB_SYNC_ALL
> 2) keep the WB_SYNC_NONE calls, but never update dirty state for them.
> This also fixes the i_dirty state updates, but allows filesystems
> that keep internal dirty state to be smarted about avoiding I/O
> 3) remove the calls to ->write_inode with WB_SYNC_NONE. This might
> work well for calls from the sync() callchain, but we rely on
> inode background writeback from the flusher threads in lots of
> places. Note that we really do not want to block the flusher
> threads with blocking writes, which is another argument against
> (1).
> 4) require ->write_inode to update the dirty state itself after
> the inode is on disk or in a data structure caught by ->sync_fs.
> This keeps optimal behaviour, but requires a lot of code changes.
>
> If we want a quick fix only (2) seems feasibly to me, with the option
> of implementing (4) and parts of (3) later on.
No, you misunderstand 1. I am saying they should be treated as
WB_SYNC_NONE.
In fact 2 would cause much more IO, because dirty writeout would
never clean them so it will just keep writing them out. I don't
know how 2 could be feasible.
> > > We need to propagate the VFS dirty state into the fs-internal state,
> > > e.g. for XFS start a transaction. The reason for that is that the VFS
> > > simply writes timestamps into the inode and marks it dirty instead of
> > > telling the filesystem about timestamp updates. For XFS in
> > > 2.6.38+ timestamp updates and i_size updates are the only unlogged
> > > metadata changes, and thus now the only thing going through
> > > ->write_inode.
> >
> > Well then you have a bug, because a sync .write_inode *may never get
> > called*. You may only get an async one, even in the case of fsync,
> > because async writeback clears the vfs dirty bits.
>
> Yes, the bug about updating the dirty state for WB_SYNC_NONE affects
So, back to my original question: what is the performance problem
with treating write_inode as WB_SYNC_NONE, and then having .fsync
and .sync_fs do the integrity?
next prev parent reply other threads:[~2011-01-07 4:47 UTC|newest]
Thread overview: 41+ messages / expand[flat|nested] mbox.gz Atom feed top
2010-12-18 1:46 [patch 0/8] Inode data integrity patches Nick Piggin
2010-12-18 1:46 ` [patch 1/8] fs: mark_inode_dirty barrier fix Nick Piggin
2010-12-18 1:46 ` [patch 2/8] fs: simple fsync race fix Nick Piggin
2010-12-18 1:46 ` [patch 3/8] fs: introduce inode writeback helpers Nick Piggin
2010-12-18 1:46 ` [patch 4/8] fs: preserve inode dirty bits on failed metadata writeback Nick Piggin
2010-12-18 1:46 ` [patch 5/8] fs: ext2 inode sync fix Nick Piggin
2011-01-07 19:08 ` Ted Ts'o
2010-12-18 1:46 ` [patch 6/8] fs: fsync optimisations Nick Piggin
2010-12-18 1:46 ` [patch 7/8] fs: fix or note I_DIRTY handling bugs in filesystems Nick Piggin
2010-12-29 15:01 ` Christoph Hellwig
2011-01-03 15:03 ` Steven Whitehouse
2011-01-03 16:58 ` Christoph Hellwig
2011-01-04 7:12 ` Nick Piggin
2011-01-04 14:22 ` Steven Whitehouse
2011-01-04 6:04 ` Nick Piggin
2011-01-04 6:39 ` Christoph Hellwig
2011-01-04 7:52 ` Nick Piggin
2011-01-04 9:13 ` Christoph Hellwig
2011-01-04 9:28 ` Nick Piggin
2010-12-18 1:46 ` [patch 8/8] fs: add i_op->sync_inode Nick Piggin
2010-12-29 15:12 ` Christoph Hellwig
2011-01-04 6:27 ` Nick Piggin
2011-01-04 6:57 ` Christoph Hellwig
2011-01-04 8:03 ` Nick Piggin
2011-01-04 8:31 ` Nick Piggin
2011-01-04 9:25 ` Christoph Hellwig
2011-01-04 9:52 ` Nick Piggin
2011-01-06 20:49 ` Christoph Hellwig
2011-01-07 4:48 ` Nick Piggin
2011-01-07 7:25 ` Christoph Hellwig
2011-01-11 3:44 ` Nick Piggin
2011-01-04 9:25 ` Christoph Hellwig
2011-01-04 9:49 ` Nick Piggin
2011-01-06 20:45 ` Christoph Hellwig
2011-01-07 4:47 ` Nick Piggin [this message]
2011-01-07 7:24 ` Christoph Hellwig
2011-01-07 7:29 ` Christoph Hellwig
2011-01-07 13:10 ` Christoph Hellwig
2011-01-07 18:30 ` Ted Ts'o
2011-01-07 18:32 ` Christoph Hellwig
2011-01-07 19:06 ` Ted Ts'o
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20110107044734.GA4552@amd \
--to=npiggin@kernel.dk \
--cc=akpm@linux-foundation.org \
--cc=hch@infradead.org \
--cc=linux-fsdevel@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).