linux-fsdevel.vger.kernel.org archive mirror
From: Nick Piggin <npiggin@kernel.dk>
To: Christoph Hellwig <hch@infradead.org>
Cc: Nick Piggin <npiggin@kernel.dk>,
	linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
	Andrew Morton <akpm@linux-foundation.org>
Subject: Re: [patch 8/8] fs: add i_op->sync_inode
Date: Fri, 7 Jan 2011 15:47:34 +1100
Message-ID: <20110107044734.GA4552@amd>
In-Reply-To: <20110106204510.GA2872@infradead.org>

On Thu, Jan 06, 2011 at 03:45:10PM -0500, Christoph Hellwig wrote:
> > > The problem is that currently we almost never do a pure blocking
> > > ->write_inode.  The way the sync code is written we always do a
> > > non-blocking one first, then a blocking one.  If you always do the
> > > synchronous one we'll get a lot more overhead - the previous
> > > asynchronous one will write the inode (be it just into the log, or for
> > > real), then we write back data, and then we'll have to write it again
> > > because it has usually been redirtied due to the data writeback in
> > > the meantime.
> > 
> > It doesn't matter, the integrity still has to be enforced in .sync_fs,
> > because sync .write_inode may *never* get called, because of the fact
> > that async .write_inode also clears the inode metadata dirty bits.
> > 
> > So if .sync_fs has to enforce integrity anyway, then you don't ever need
> > to do any actual waiting in your sync .write_inode. See?
> 
> I'm not talking about the actual waiting.  Right now we have two
> different use cases for ->write_inode:
> 
>  1) sync_mode == WB_SYNC_NONE
> 
> 	This tells the filesystem to start an opportunistic writeout.
> 
>  2) sync_mode == WB_SYNC_ALL
> 
> 	This tells the filesystem it needs to do a mandatory writeout.
> 
> Note that writeout is loosely defined.  If a filesystem isn't
> exportable or implements the commit_metadata operation it's indeed
> enough to synchronize the state into internal fs data just enough for
> ->sync_fs.
> 
> Or that's how it should be.  As you pointed out the way the writeback
> code treats the WB_SYNC_NONE writeouts makes this not work as expected.
> 
> There are various ways to fix this:
> 
>  1) the one you advocate, that is treating all ->write_inode calls as
>     if they were WB_SYNC_ALL.  This does fix the issue of incorrectly
>     updating the dirty state, but causes a lot of additional I/O -
>     the way the sync process is designed we basically always call
>     ->write_inode with WB_SYNC_NONE first, and then with WB_SYNC_ALL
>  2) keep the WB_SYNC_NONE calls, but never update dirty state for them.
>     This also fixes the i_dirty state updates, but allows filesystems
>     that keep internal dirty state to be smarter about avoiding I/O
>  3) remove the calls to ->write_inode with WB_SYNC_NONE.  This might
>     work well for calls from the sync() callchain, but we rely on
>     inode background writeback from the flusher threads in lots of
>     places.  Note that we really do not want to block the flusher
>     threads with blocking writes, which is another argument against
>     (1).
>  4) require ->write_inode to update the dirty state itself after
>     the inode is on disk or in a data structure caught by ->sync_fs.
>     This keeps optimal behaviour, but requires a lot of code changes.
> 
> If we want a quick fix only (2) seems feasible to me, with the option
> of implementing (4) and parts of (3) later on.

No, you misunderstand 1. I am saying they should be treated as
WB_SYNC_NONE.

In fact 2 would cause much more IO, because writeout would never
clean the dirty inodes, so it would just keep writing them out. I
don't know how 2 could be feasible.
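
The I/O argument against option 2 can be illustrated with a toy user-space model. This is only a sketch, not kernel code: the struct, the function name and the single `dirty` flag are hypothetical stand-ins for the VFS I_DIRTY bits, and it assumes the inode is not redirtied between passes.

```c
#include <stdbool.h>

/* Toy model of one inode as seen by background (WB_SYNC_NONE) writeback. */
struct toy_inode {
	bool dirty;	/* stands in for the VFS dirty bits */
	int writes;	/* how many times the inode was written out */
};

/*
 * Run a number of background writeback passes over one dirty inode.
 * If clean_on_sync_none is false (option 2), a non-blocking writeout
 * never clears the dirty state, so every pass writes the inode again.
 */
static int count_writes(int passes, bool clean_on_sync_none)
{
	struct toy_inode inode = { .dirty = true, .writes = 0 };

	for (int i = 0; i < passes; i++) {
		if (!inode.dirty)
			continue;	/* clean inodes are skipped */
		inode.writes++;
		if (clean_on_sync_none)
			inode.dirty = false;
	}
	return inode.writes;
}
```

In this model, five passes under option 2 cost five writeouts, while clearing the dirty state on the first pass costs one.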


> > > We need to propagate the VFS dirty state into the fs-internal state,
> > > e.g. for XFS start a transaction.  The reason for that is that the VFS
> > > simply writes timestamps into the inode and marks it dirty instead of
> > > telling the filesystem about timestamp updates.  For XFS in
> > > 2.6.38+ timestamp updates and i_size updates are the only unlogged
> > > metadata changes, and thus now the only thing going through
> > > ->write_inode.
> > 
> > Well then you have a bug, because a sync .write_inode *may never get
> > called*. You may only get an async one, even in the case of fsync,
> > because async writeback clears the vfs dirty bits.
> 
> Yes, the bug about updating the dirty state for WB_SYNC_NONE affects

So, back to my original question: what is the performance problem
with treating write_inode as WB_SYNC_NONE, and then having .fsync
and .sync_fs do the integrity?
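
The scheme asked about here can be sketched in the same toy style. Again this is a hypothetical user-space model, not any filesystem's real code: `toy_write_inode` only pushes state into fs-internal bookkeeping and clears the VFS dirty bit, and `toy_sync_fs` is assumed to be the one place that makes that state durable.

```c
#include <stdbool.h>

struct toy_fs_inode {
	bool vfs_dirty;		/* VFS-level dirty bit */
	bool fs_pending;	/* fs-internal state, not yet durable */
	bool on_disk;		/* durable on stable storage */
	int write_inode_calls;
};

/* Always non-blocking: hand the inode to the fs, wait for nothing. */
static void toy_write_inode(struct toy_fs_inode *in)
{
	in->write_inode_calls++;
	in->fs_pending = true;
	in->vfs_dirty = false;	/* this is why a later sync call may never happen */
}

/* Writeback only calls write_inode for inodes the VFS sees as dirty. */
static void toy_writeback(struct toy_fs_inode *in)
{
	if (in->vfs_dirty)
		toy_write_inode(in);
}

/* Integrity point: flush whatever the fs has pending. */
static void toy_sync_fs(struct toy_fs_inode *in)
{
	if (in->fs_pending) {
		in->on_disk = true;
		in->fs_pending = false;
	}
}

/* Async pass, then the "sync" pass, then sync_fs. */
static struct toy_fs_inode toy_sync_sequence(void)
{
	struct toy_fs_inode in = { .vfs_dirty = true };

	toy_writeback(&in);	/* background pass writes and cleans */
	toy_writeback(&in);	/* sync pass finds the inode clean: no call */
	toy_sync_fs(&in);	/* integrity is enforced here instead */
	return in;
}
```

The second writeback pass never reaches `toy_write_inode`, yet the inode still ends up durable because `toy_sync_fs` does the waiting.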

