All of lore.kernel.org
 help / color / mirror / Atom feed
From: Jeff Layton <jlayton@kernel.org>
To: "J. Bruce Fields" <bfields@fieldses.org>, Jan Kara <jack@suse.cz>
Cc: linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
	viro@zeniv.linux.org.uk, linux-nfs@vger.kernel.org,
	neilb@suse.de, jack@suse.de, linux-ext4@vger.kernel.org,
	tytso@mit.edu, adilger.kernel@dilger.ca,
	linux-xfs@vger.kernel.org, darrick.wong@oracle.com,
	david@fromorbit.com, linux-btrfs@vger.kernel.org, clm@fb.com,
	jbacik@fb.com, dsterba@suse.com, linux-integrity@vger.kernel.org,
	zohar@linux.vnet.ibm.com, dmitry.kasatkin@gmail.com,
	linux-afs@lists.infradead.org, dhowells@redhat.com,
	jaltman@auristor.com
Subject: Re: [PATCH v3 19/19] fs: handle inode->i_version more efficiently
Date: Mon, 18 Dec 2017 14:35:20 -0500	[thread overview]
Message-ID: <1513625720.3422.68.camel@kernel.org> (raw)
In-Reply-To: <20171218173651.GB12454@fieldses.org>

On Mon, 2017-12-18 at 12:36 -0500, J. Bruce Fields wrote:
> On Mon, Dec 18, 2017 at 12:22:20PM -0500, Jeff Layton wrote:
> > On Mon, 2017-12-18 at 17:34 +0100, Jan Kara wrote:
> > > On Mon 18-12-17 10:11:56, Jeff Layton wrote:
> > > >  static inline bool
> > > >  inode_maybe_inc_iversion(struct inode *inode, bool force)
> > > >  {
> > > > -	atomic64_t *ivp = (atomic64_t *)&inode->i_version;
> > > > +	u64 cur, old, new;
> > > >  
> > > > -	atomic64_inc(ivp);
> > > > +	cur = (u64)atomic64_read(&inode->i_version);
> > > > +	for (;;) {
> > > > +		/* If flag is clear then we needn't do anything */
> > > > +		if (!force && !(cur & I_VERSION_QUERIED))
> > > > +			return false;
> > > 
> > > The fast path here misses any memory barrier. Thus it seems this query
> > > could be in theory reordered before any store that happened to modify the
> > > inode? Or maybe we could race and miss the fact that in fact this i_version
> > > has already been queried? But maybe there's some higher level locking that
> > > makes sure this is all a non-issue... But in that case it would deserve
> > > some comment I guess.
> > > 
> > 
> > There's no higher-level locking. Getting locking out of this codepath is
> > a good thing IMO. The larger question here is whether we really care
> > about ordering this with anything else.
> > 
> > The i_version, as implemented today, is not ordered with actual changes
> > to the inode. We only take the i_lock today when modifying it, not when
> > querying it. It's possible today that you could see the results of a
> > change and then do a fetch of the i_version that doesn't show an
> > increment vs. a previous change.
> > 
> > It'd be nice if this were atomic with the actual changes that it
> > represents, but I think that would be prohibitively expensive. That may
> > be something we need to address. I'm not sure we really want to do it as
> > part of this patchset though.
> 
> I don't think we need full atomicity, but we *do* need the i_version
> increment to be seen as ordered after the inode change.  That should be
> relatively inexpensive, yes?
> 
> Otherwise an inode change could be overlooked indefinitely, because a
> client could cache the old contents with the new i_version value and
> never see the i_version change again as long as the inode isn't touched
> again.
> 
> (If the client instead caches new data with the old inode version, there
> may be an unnecessary cache invalidation next time the client queries
> the change attribute.  That's a performance bug, but not really a
> correctness problem, at least for clients depending only on
> close-to-open semantics: they'll only depend on seeing the new change
> attribute after the change has been flushed to disk.  The performance
> problem should be mitigated by the race being very rare.)

Perhaps something like this patch squashed into #19?

This should tighten up the initial loads and stores, like Jan was
suggesting (I think). The increments are done via cmpxchg so they
already have a full implied barrier (IIUC), and we don't exit the loop
until it succeeds or it looks like the flag is set.

To be clear though:

This series does not try to order the i_version update with any other
inode metadata or data (though we generally do update it after a write
is done, rather than before or during).

Inode field updates are not at all synchronized either and this series
does not change that. A stat() is also generally done locklessly so you
can easily get a half-baked attributes there if you hit the timing
wrong.

-------------------------8<--------------------------

[PATCH] SQUASH: add memory barriers around i_version accesses

Signed-off-by: Jeff Layton <jlayton@redhat.com>
---
 include/linux/iversion.h | 40 +++++++++++++++++++++-------------------
 1 file changed, 21 insertions(+), 19 deletions(-)

diff --git a/include/linux/iversion.h b/include/linux/iversion.h
index a9fbf99709df..39ec9aa9e08e 100644
--- a/include/linux/iversion.h
+++ b/include/linux/iversion.h
@@ -87,6 +87,25 @@ static inline void
 inode_set_iversion_raw(struct inode *inode, const u64 val)
 {
 	atomic64_set(&inode->i_version, val);
+	smp_wmb();
+}
+
+/**
+ * inode_peek_iversion_raw - grab a "raw" iversion value
+ * @inode: inode from which i_version should be read
+ *
+ * Grab a "raw" inode->i_version value and return it. The i_version is not
+ * flagged or converted in any way. This is mostly used to access a self-managed
+ * i_version.
+ *
+ * With those filesystems, we want to treat the i_version as an entirely
+ * opaque value.
+ */
+static inline u64
+inode_peek_iversion_raw(const struct inode *inode)
+{
+	smp_rmb();
+	return atomic64_read(&inode->i_version);
 }
 
 /**
@@ -152,7 +171,7 @@ inode_maybe_inc_iversion(struct inode *inode, bool force)
 {
 	u64 cur, old, new;
 
-	cur = (u64)atomic64_read(&inode->i_version);
+	cur = inode_peek_iversion_raw(inode);
 	for (;;) {
 		/* If flag is clear then we needn't do anything */
 		if (!force && !(cur & I_VERSION_QUERIED))
@@ -183,23 +202,6 @@ inode_inc_iversion(struct inode *inode)
 	inode_maybe_inc_iversion(inode, true);
 }
 
-/**
- * inode_peek_iversion_raw - grab a "raw" iversion value
- * @inode: inode from which i_version should be read
- *
- * Grab a "raw" inode->i_version value and return it. The i_version is not
- * flagged or converted in any way. This is mostly used to access a self-managed
- * i_version.
- *
- * With those filesystems, we want to treat the i_version as an entirely
- * opaque value.
- */
-static inline u64
-inode_peek_iversion_raw(const struct inode *inode)
-{
-	return atomic64_read(&inode->i_version);
-}
-
 /**
  * inode_iversion_need_inc - is the i_version in need of being incremented?
  * @inode: inode to check
@@ -248,7 +250,7 @@ inode_query_iversion(struct inode *inode)
 {
 	u64 cur, old, new;
 
-	cur = atomic64_read(&inode->i_version);
+	cur = inode_peek_iversion_raw(inode);
 	for (;;) {
 		/* If flag is already set, then no need to swap */
 		if (cur & I_VERSION_QUERIED)
-- 
2.14.3


  reply	other threads:[~2017-12-18 19:35 UTC|newest]

Thread overview: 43+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-12-18 15:11 [PATCH v3 00/19] fs: rework and optimize i_version handling in filesystems Jeff Layton
2017-12-18 15:11 ` [PATCH v3 01/19] fs: new API for handling inode->i_version Jeff Layton
2017-12-18 17:46   ` Jeff Layton
2017-12-18 15:11 ` [PATCH v3 02/19] fs: don't take the i_lock in inode_inc_iversion Jeff Layton
2017-12-18 15:11 ` [PATCH v3 03/19] fat: convert to new i_version API Jeff Layton
2017-12-18 15:11 ` [PATCH v3 04/19] affs: " Jeff Layton
2017-12-18 15:11 ` [PATCH v3 05/19] afs: " Jeff Layton
2017-12-18 15:11 ` [PATCH v3 06/19] btrfs: " Jeff Layton
2017-12-18 15:11   ` Jeff Layton
2017-12-18 15:11 ` [PATCH v3 07/19] exofs: switch " Jeff Layton
2017-12-18 15:11 ` [PATCH v3 08/19] ext2: convert " Jeff Layton
2017-12-18 15:11 ` [PATCH v3 09/19] ext4: " Jeff Layton
2017-12-18 15:11 ` [PATCH v3 10/19] nfs: " Jeff Layton
2017-12-18 15:11 ` [PATCH v3 11/19] nfsd: " Jeff Layton
2017-12-18 15:11 ` [PATCH v3 12/19] ocfs2: " Jeff Layton
2017-12-18 15:11   ` Jeff Layton
2017-12-18 15:11 ` [PATCH v3 13/19] ufs: use " Jeff Layton
2017-12-18 15:11 ` [PATCH v3 14/19] xfs: convert to " Jeff Layton
2017-12-18 15:11 ` [PATCH v3 15/19] IMA: switch IMA over " Jeff Layton
2017-12-18 15:11   ` Jeff Layton
2017-12-18 15:11 ` [PATCH v3 16/19] fs: only set S_VERSION when updating times if necessary Jeff Layton
2017-12-18 16:07   ` Jan Kara
2017-12-18 16:07     ` Jan Kara
2017-12-18 17:25     ` Jeff Layton
2017-12-18 15:11 ` [PATCH v3 17/19] xfs: avoid setting XFS_ILOG_CORE if i_version doesn't need incrementing Jeff Layton
2017-12-18 15:11 ` [PATCH v3 18/19] btrfs: only dirty the inode in btrfs_update_time if something was changed Jeff Layton
2017-12-18 15:11 ` [PATCH v3 19/19] fs: handle inode->i_version more efficiently Jeff Layton
2017-12-18 16:34   ` Jan Kara
2017-12-18 17:22     ` Jeff Layton
2017-12-18 17:36       ` J. Bruce Fields
2017-12-18 19:35         ` Jeff Layton [this message]
2017-12-18 22:07           ` Dave Chinner
2017-12-18 22:07             ` Dave Chinner
2017-12-20 14:03             ` Jeff Layton
2017-12-20 14:03               ` Jeff Layton
2017-12-20 16:41               ` Jan Kara
2017-12-20 16:41                 ` Jan Kara
2017-12-21 11:25                 ` Jeff Layton
2017-12-21 11:25                   ` Jeff Layton
2017-12-21 11:48                   ` Jan Kara
2017-12-19  9:29       ` Jan Kara
2017-12-19 17:14         ` Jeff Layton
2017-12-19 17:14           ` Jeff Layton

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1513625720.3422.68.camel@kernel.org \
    --to=jlayton@kernel.org \
    --cc=adilger.kernel@dilger.ca \
    --cc=bfields@fieldses.org \
    --cc=clm@fb.com \
    --cc=darrick.wong@oracle.com \
    --cc=david@fromorbit.com \
    --cc=dhowells@redhat.com \
    --cc=dmitry.kasatkin@gmail.com \
    --cc=dsterba@suse.com \
    --cc=jack@suse.cz \
    --cc=jack@suse.de \
    --cc=jaltman@auristor.com \
    --cc=jbacik@fb.com \
    --cc=linux-afs@lists.infradead.org \
    --cc=linux-btrfs@vger.kernel.org \
    --cc=linux-ext4@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-integrity@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-nfs@vger.kernel.org \
    --cc=linux-xfs@vger.kernel.org \
    --cc=neilb@suse.de \
    --cc=tytso@mit.edu \
    --cc=viro@zeniv.linux.org.uk \
    --cc=zohar@linux.vnet.ibm.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.