From: Nick Piggin <npiggin@kernel.dk>
To: Christoph Hellwig <hch@infradead.org>
Cc: Nick Piggin <npiggin@kernel.dk>,
linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
Andrew Morton <akpm@linux-foundation.org>
Subject: Re: [patch 8/8] fs: add i_op->sync_inode
Date: Tue, 4 Jan 2011 19:03:23 +1100 [thread overview]
Message-ID: <20110104080323.GC4090@amd> (raw)
In-Reply-To: <20110104065736.GA8013@infradead.org>
On Tue, Jan 04, 2011 at 01:57:37AM -0500, Christoph Hellwig wrote:
> On Tue, Jan 04, 2011 at 05:27:25PM +1100, Nick Piggin wrote:
> > Basically the blocking or non blocking behaviour of ->write_inode is
> > irrelevant and will go away. Specific behaviour cannot be relied upon from
> > generic code, only filesystems. But even filesystems should not
> > differentiate between blocking and non-blocking (where they do, they
> > have been buggy). That is the *other* big bug in the code (the first one
> > being syncing code not waiting for writeback).
> >
> > So after my series, we guarantee ->write_inode is called after the inode
> > has been dirtied. We make no guarantees about what mode it is called in
> > (blocking or non blocking). So the filesystem should either have _all_
> > write_inode calls do sync writeout, or have them all do part of the op
> > (eg. clean struct inode by dirtying pagecache) and then syncing in fsync
> > and sync_fs. In the latter scheme, it doesn't make sense to do anything
> > more in the case of a "sync" call.
>
> There is absolutely no problem with writing out inodes asynchronously,
> it's just that the writeback code can't really cope with it right now.
> The fix is to only update i_state on I/O completion.
I don't know what you mean by this. As a filesystem generic API,
.write_inode says almost nothing. Some filesystems will do one thing,
some may do something in future after patches to the vfs to allow more
fancy work, others do other things, generic code can't assume any of
that.
> The blocking vs non-blocking difference is actually quite important for
> performance still.
I don't see how it could be. All integrity points still have to wait for
all potentially non blocking write_inode, so if that's the case just make
*all* the write_inode non blocking to begin with. Simpler code and
you'd get more performance at least in the sync(3) case where writeback
and waiting of multiple inodes can be batched.
> In fact for most modern filesystems we simply don't
> every want to do a normal writeout for the !sync case. Either it would
> be a no-op, or do something like we do in xfs currently, where we make
> sure to read the inode to start the read/modify/write cycle early, and
> put it at the end of the delayed write queue. The sync one on the other
> hand just logs the inode, so that the log force in ->sync_fs/etc can
> write them all out in one go.
Great, so why would you ever do a sync .write_inode?
> > > As for the actualy sync_inode operation: I don't think it's helpful.
> > > The *_sync_inode helpers in ext2 and fat are fine, but there's little
> > > point in going through an iops vector for it. Also adding the file
> > > syncing really complicates the API for now reason, epecially with
> > > the range interface.
> >
> > It does have a reason, which is the nfsd syncing callback -- using
> > sync_inode_metadata there is actually wrong and should really be
> > replaced with a warning that the fs cannot support such syncing.
> >
> > See the problem I explained above -- it really needs to do a full
> > fsync call. But it doesn't have a file descriptor, and most filesystems
> > don't need one, so I think a sync_inode operation is nice.
>
> It doesn't need to do an fsync, it's actually a quite different
> operations. That's why we added the ->commit_metadata operations. What
> NFSD really wants is to guarantee synchronous metadata operations.
> Traditionally unix required the filesystems to be mounted using -o wsync
> for that, but we try to emulate that from NFSD without affecting other
> access to the filesystem. The ->commit_metadata is there to ensure any
> previously started async operations completes. It does not have to
> push all dirty state to disk like fsync. Just compare the complexities
> of xfs_file_fsync and xfs_fs_nfs_commit_metadata - the latter simply
> checks if the inode has been logged, and if yes forces the log to disk
> up to the last operation on this inode. fsync does the same force if
> the inode is not dirty, but otherwise has to start a new transactions.
> It also has to wait for pending I/O completions before dealing with
> metadata, and issue barriers to data devices, which is completly
> superflous for nfs operations.
Thanks for the interesting comments, but I didn't say it needs an fsync,
I said sync_inode_metadata is not sufficient by definition because it
guarantees nothing.
> > Also giving filesystems the flexibility to do the data writeout can
> > be more optimal by doing all writeout at once or ordering to suit their
> > writeback patterns. Having range data could give some performance
> > advantages writing back mappings or commit operations over network. I
> > don't see it as a big complexity. It gives a chance to do range fsyncs
> > and sync_file_range and such properly too.
>
> We currently do all that just fine before calling into ->fsync.
What do you mean we do all that just fine? A filesystem can't schedule
data submission and waiting optimally versus metadata, it can't do
metadata operations or remote commits corresponding to data ranges, and
it doesn't support nfs sync operations.
next prev parent reply other threads:[~2011-01-04 8:03 UTC|newest]
Thread overview: 41+ messages / expand[flat|nested] mbox.gz Atom feed top
2010-12-18 1:46 [patch 0/8] Inode data integrity patches Nick Piggin
2010-12-18 1:46 ` [patch 1/8] fs: mark_inode_dirty barrier fix Nick Piggin
2010-12-18 1:46 ` [patch 2/8] fs: simple fsync race fix Nick Piggin
2010-12-18 1:46 ` [patch 3/8] fs: introduce inode writeback helpers Nick Piggin
2010-12-18 1:46 ` [patch 4/8] fs: preserve inode dirty bits on failed metadata writeback Nick Piggin
2010-12-18 1:46 ` [patch 5/8] fs: ext2 inode sync fix Nick Piggin
2011-01-07 19:08 ` Ted Ts'o
2010-12-18 1:46 ` [patch 6/8] fs: fsync optimisations Nick Piggin
2010-12-18 1:46 ` [patch 7/8] fs: fix or note I_DIRTY handling bugs in filesystems Nick Piggin
2010-12-29 15:01 ` Christoph Hellwig
2011-01-03 15:03 ` Steven Whitehouse
2011-01-03 16:58 ` Christoph Hellwig
2011-01-04 7:12 ` Nick Piggin
2011-01-04 14:22 ` Steven Whitehouse
2011-01-04 6:04 ` Nick Piggin
2011-01-04 6:39 ` Christoph Hellwig
2011-01-04 7:52 ` Nick Piggin
2011-01-04 9:13 ` Christoph Hellwig
2011-01-04 9:28 ` Nick Piggin
2010-12-18 1:46 ` [patch 8/8] fs: add i_op->sync_inode Nick Piggin
2010-12-29 15:12 ` Christoph Hellwig
2011-01-04 6:27 ` Nick Piggin
2011-01-04 6:57 ` Christoph Hellwig
2011-01-04 8:03 ` Nick Piggin [this message]
2011-01-04 8:31 ` Nick Piggin
2011-01-04 9:25 ` Christoph Hellwig
2011-01-04 9:52 ` Nick Piggin
2011-01-06 20:49 ` Christoph Hellwig
2011-01-07 4:48 ` Nick Piggin
2011-01-07 7:25 ` Christoph Hellwig
2011-01-11 3:44 ` Nick Piggin
2011-01-04 9:25 ` Christoph Hellwig
2011-01-04 9:49 ` Nick Piggin
2011-01-06 20:45 ` Christoph Hellwig
2011-01-07 4:47 ` Nick Piggin
2011-01-07 7:24 ` Christoph Hellwig
2011-01-07 7:29 ` Christoph Hellwig
2011-01-07 13:10 ` Christoph Hellwig
2011-01-07 18:30 ` Ted Ts'o
2011-01-07 18:32 ` Christoph Hellwig
2011-01-07 19:06 ` Ted Ts'o
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20110104080323.GC4090@amd \
--to=npiggin@kernel.dk \
--cc=akpm@linux-foundation.org \
--cc=hch@infradead.org \
--cc=linux-fsdevel@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).