From: Jeff Layton <jlayton@kernel.org>
To: Jan Kara <jack@suse.cz>
Cc: linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
	viro@zeniv.linux.org.uk, linux-nfs@vger.kernel.org,
	bfields@fieldses.org, neilb@suse.de, jack@suse.de,
	linux-ext4@vger.kernel.org, tytso@mit.edu,
	adilger.kernel@dilger.ca, linux-xfs@vger.kernel.org,
	darrick.wong@oracle.com, david@fromorbit.com,
	linux-btrfs@vger.kernel.org, clm@fb.com, jbacik@fb.com,
	dsterba@suse.com, linux-integrity@vger.kernel.org,
	zohar@linux.vnet.ibm.com, dmitry.kasatkin@gmail.com,
	linux-afs@lists.infradead.org, dhowells@redhat.com,
	jaltman@auristor.com
Subject: Re: [PATCH v3 19/19] fs: handle inode->i_version more efficiently
Date: Tue, 19 Dec 2017 12:14:30 -0500
Message-ID: <1513703670.20392.21.camel@kernel.org>
In-Reply-To: <20171219092947.GC2277@quack2.suse.cz>

On Tue, 2017-12-19 at 10:29 +0100, Jan Kara wrote:
> On Mon 18-12-17 12:22:20, Jeff Layton wrote:
> > On Mon, 2017-12-18 at 17:34 +0100, Jan Kara wrote:
> > > On Mon 18-12-17 10:11:56, Jeff Layton wrote:
> > > >  static inline bool
> > > >  inode_maybe_inc_iversion(struct inode *inode, bool force)
> > > >  {
> > > > -	atomic64_t *ivp = (atomic64_t *)&inode->i_version;
> > > > +	u64 cur, old, new;
> > > >  
> > > > -	atomic64_inc(ivp);
> > > > +	cur = (u64)atomic64_read(&inode->i_version);
> > > > +	for (;;) {
> > > > +		/* If flag is clear then we needn't do anything */
> > > > +		if (!force && !(cur & I_VERSION_QUERIED))
> > > > +			return false;
> > > 
> > > The fast path here misses any memory barrier. Thus it seems this query
> > > could be in theory reordered before any store that happened to modify the
> > > inode? Or maybe we could race and miss the fact that in fact this i_version
> > > has already been queried? But maybe there's some higher level locking that
> > > makes sure this is all a non-issue... But in that case it would deserve
> > > some comment I guess.
> > > 
> > 
> > There's no higher-level locking. Getting locking out of this codepath is
> > a good thing IMO. The larger question here is whether we really care
> > about ordering this with anything else.
> > 
> > The i_version, as implemented today, is not ordered with actual changes
> > to the inode. We only take the i_lock today when modifying it, not when
> > querying it. It's possible today that you could see the results of a
> > change and then do a fetch of the i_version that doesn't show an
> > increment vs. a previous change.
> 
> Yeah, so I don't suggest that you should fix unrelated issues but original
> i_lock protection did actually provide memory barriers (although
> semi-permeable, but in practice they are very often enough) and your patch
> removing those could have changed a theoretical issue to a practical
> problem. So at least preserving that original acquire-release semantics
> of i_version handling would be IMHO good.
> 

Agreed. I've no objection to memory barriers here and I'm looking at
that; I just need to go over Dave's comments and memory-barriers.txt
(again!) to make sure I get them right.
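
To make the acquire-release point concrete, here is a minimal userspace
sketch using C11 atomics rather than the kernel's atomic64_t API. It is
not the patch: struct model_inode, model_publish_change() and
model_observe() are hypothetical names made up for illustration, and the
flag bit is left out. It only shows the pairing under discussion: the
writer bumps the version with release ordering after changing the data,
and the reader loads the version with acquire ordering before reading
the data.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-in for an inode: one payload field plus a version. */
struct model_inode {
	_Atomic uint64_t size;       /* the data the version describes */
	_Atomic uint64_t i_version;  /* change counter */
};

/* Writer: change the data first, then publish the bump with release. */
static void model_publish_change(struct model_inode *inode, uint64_t new_size)
{
	assert(inode);
	atomic_store_explicit(&inode->size, new_size, memory_order_relaxed);
	atomic_fetch_add_explicit(&inode->i_version, 1, memory_order_release);
}

/* Reader: acquire-load the version first, then read the data. */
static uint64_t model_observe(struct model_inode *inode, uint64_t *size_out)
{
	uint64_t v;

	assert(inode && size_out);
	v = atomic_load_explicit(&inode->i_version, memory_order_acquire);
	*size_out = atomic_load_explicit(&inode->size, memory_order_relaxed);
	return v;
}
```

With this pairing, a reader that observes a given version also observes
the data written before that bump (or something newer). Whether that is
the guarantee we want for i_version is the open question above.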

> > It'd be nice if this were atomic with the actual changes that it
> > represents, but I think that would be prohibitively expensive. That may
> > be something we need to address. I'm not sure we really want to do it as
> > part of this patchset though.
> > 
> > > > +
> > > > +		/* Since lowest bit is flag, add 2 to avoid it */
> > > > +		new = (cur & ~I_VERSION_QUERIED) + I_VERSION_INCREMENT;
> > > > +
> > > > +		old = atomic64_cmpxchg(&inode->i_version, cur, new);
> > > > +		if (likely(old == cur))
> > > > +			break;
> > > > +		cur = old;
> > > > +	}
> > > >  	return true;
> > > >  }
> > > >  
> > > 
> > > ...
> > > 
> > > >  static inline u64
> > > >  inode_query_iversion(struct inode *inode)
> > > >  {
> > > > -	return inode_peek_iversion(inode);
> > > > +	u64 cur, old, new;
> > > > +
> > > > +	cur = atomic64_read(&inode->i_version);
> > > > +	for (;;) {
> > > > +		/* If flag is already set, then no need to swap */
> > > > +		if (cur & I_VERSION_QUERIED)
> > > > +			break;
> > > > +
> > > > +		new = cur | I_VERSION_QUERIED;
> > > > +		old = atomic64_cmpxchg(&inode->i_version, cur, new);
> > > > +		if (old == cur)
> > > > +			break;
> > > > +		cur = old;
> > > > +	}
> > > 
> > > Why not just use atomic64_or() here?
> > > 
> > 
> > If the cmpxchg fails, then either:
> > 
> > 1) it was incremented
> > 2) someone flagged it QUERIED
> > 
> > If an increment happened then we don't need to flag it as QUERIED if
> > we're returning an older value. If we use atomic64_or, then we can't
> > tell if an increment happened so we'd end up potentially flagging it
> > more than necessary.
> > 
> > In principle, either outcome is technically OK and we don't have to loop
> > if the cmpxchg doesn't work. That said, if we think there might be a
> > later i_version available, then I think we probably want to try to query
> > it again so we can return as late a one as possible.
> 
> OK, makes sense. I'm just a bit wary of cmpxchg loops as they tend to
> behave pretty badly in contended cases but I guess i_version won't be
> hammered *that* hard.
> 

That's the principle I'm operating under here, and I think it's valid
for almost all workloads. Incrementing the i_version on parallel writes
should be mostly uncontended now, whereas they had to serialize on
the i_lock before.
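
As a rough illustration of why that holds, here is a userspace model of
the two loops from the quoted hunks, written with C11 atomics. The
model_* names and struct model_inode are hypothetical. The flag and
increment values follow the "lowest bit is flag, add 2" comment in the
patch. The final right shift in the query function is an assumption,
since the quoted hunk stops before the return statement.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define I_VERSION_QUERIED	1ULL	/* lowest bit: value has been queried */
#define I_VERSION_INCREMENT	2ULL	/* step that leaves the flag bit alone */

struct model_inode {
	_Atomic uint64_t i_version;
};

/* Bump the counter only if forced or if someone queried the old value. */
static bool model_maybe_inc_iversion(struct model_inode *inode, bool force)
{
	uint64_t cur, new;

	assert(inode);
	cur = atomic_load(&inode->i_version);
	for (;;) {
		/* Fast path: nobody looked, so no store is needed. */
		if (!force && !(cur & I_VERSION_QUERIED))
			return false;

		new = (cur & ~I_VERSION_QUERIED) + I_VERSION_INCREMENT;
		/* On failure, cur is refreshed with the current value. */
		if (atomic_compare_exchange_weak(&inode->i_version, &cur, new))
			return true;
	}
}

/* Return the counter and make sure the QUERIED flag is set on it. */
static uint64_t model_query_iversion(struct model_inode *inode)
{
	uint64_t cur;

	assert(inode);
	cur = atomic_load(&inode->i_version);
	for (;;) {
		/* Flag already set: no store needed. */
		if (cur & I_VERSION_QUERIED)
			break;
		if (atomic_compare_exchange_weak(&inode->i_version, &cur,
						 cur | I_VERSION_QUERIED))
			break;
	}
	return cur >> 1;	/* assumed: strip the flag bit for callers */
}
```

In this model, writes to an inode that nobody has queried never get past
the initial load, so parallel writers do not contend on the counter at
all.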

The pessimal case here, I think, is parallel increments and queries. We
may see that sort of workload under knfsd, but I'm fine with giving
knfsd a small performance hit to help performance on other workloads.

While we're on the subject of looping here, should I add a cpu_relax()
into these loops?
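
For the sake of discussion, this is roughly where such a call would sit,
again as a userspace C11 sketch and not kernel code. model_cpu_relax()
is a hypothetical stand-in for cpu_relax(): it is only a compiler
barrier plus a counter so the retry path can be observed.
model_inc_with_relax() is likewise made up for illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Counts how often the retry path ran; for observation only. */
static _Atomic unsigned long model_relax_calls;

/* Hypothetical stand-in for cpu_relax(): compiler barrier, no CPU hint. */
static inline void model_cpu_relax(void)
{
	atomic_fetch_add_explicit(&model_relax_calls, 1, memory_order_relaxed);
	atomic_signal_fence(memory_order_seq_cst);
}

/* Add 2 to *v with a compare-and-swap loop, relaxing after each failure. */
static uint64_t model_inc_with_relax(_Atomic uint64_t *v)
{
	uint64_t cur;

	assert(v);
	cur = atomic_load(v);
	/* A failed exchange refreshes cur, so each retry uses fresh data. */
	while (!atomic_compare_exchange_strong(v, &cur, cur + 2))
		model_cpu_relax();
	return cur + 2;
}
```

Note that the relax call is only reached after a failed exchange, so the
uncontended path is unchanged either way.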

Thanks for the review so far!
-- 
Jeff Layton <jlayton@kernel.org>
