From: Jan Kara <jack@suse.cz>
To: Jeff Layton <jlayton@kernel.org>
Cc: Dave Chinner <david@fromorbit.com>, Jan Kara <jack@suse.cz>,
"J. Bruce Fields" <bfields@fieldses.org>,
linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
viro@zeniv.linux.org.uk, linux-nfs@vger.kernel.org,
neilb@suse.de, jack@suse.de, linux-ext4@vger.kernel.org,
tytso@mit.edu, adilger.kernel@dilger.ca,
linux-xfs@vger.kernel.org, darrick.wong@oracle.com,
linux-btrfs@vger.kernel.org, clm@fb.com, jbacik@fb.com,
dsterba@suse.com, linux-integrity@vger.kernel.org,
zohar@linux.vnet.ibm.com, dmitry.kasatkin@gmail.com,
linux-afs@lists.infradead.org, dhowells@redhat.com,
jaltman@auristor.com
Subject: Re: [PATCH v3 19/19] fs: handle inode->i_version more efficiently
Date: Wed, 20 Dec 2017 17:41:50 +0100 [thread overview]
Message-ID: <20171220164150.GF31584@quack2.suse.cz> (raw)
In-Reply-To: <1513778586.4513.18.camel@kernel.org>
On Wed 20-12-17 09:03:06, Jeff Layton wrote:
> On Tue, 2017-12-19 at 09:07 +1100, Dave Chinner wrote:
> > On Mon, Dec 18, 2017 at 02:35:20PM -0500, Jeff Layton wrote:
> > > [PATCH] SQUASH: add memory barriers around i_version accesses
> >
> > Why explicit memory barriers rather than annotating the operations
> > with the required semantics and getting the barriers the arch
> > requires automatically? I suspect this should be using
> > atomic_read_acquire() and atomic_cmpxchg_release(), because AFAICT
> > the atomic_cmpxchg needs to have release semantics to match the
> > acquire semantics needed for the load of the current value.
> >
> > From include/linux/atomics.h:
> >
> > * For compound atomics performing both a load and a store, ACQUIRE
> > * semantics apply only to the load and RELEASE semantics only to the
> > * store portion of the operation. Note that a failed cmpxchg_acquire
> > * does -not- imply any memory ordering constraints.
> >
> > Memory barriers hurt my brain. :/
> >
> > At minimum, shouldn't the atomic op specific barriers be used rather
> > than full memory barriers? i.e:
> >
>
> They hurt my brain too. Yes, definitely atomic-specific barriers should
> be used here instead, since this is an atomic64_t now.
>
> After going over the docs again...my understanding has always been that
> you primarily need memory barriers to order accesses to different areas
> of memory.
That is correct.
> As Jan and I were discussing in the other thread, i_version is not
> synchronized with anything else. In this code, we're only dealing with a
> single 64-bit word. I don't think there are any races there wrt the API
> itself.
There are not but it is like saying that lock implementation is correct
because the lock state does not get corrupted ;). Who cares about protected
data...
> The "legacy" inode_inc_iversion() however _does_ have implied memory
> barriers from the i_lock. There could be some subtle de-facto ordering
> there, so I think we probably do want some barriers in here if only to
> preserve that. It's not likely to cost much, and may save us tracking
> down some fiddly bugs.
>
> What about this patch? Note that I've only added barriers to
> inode_maybe_inc_iversion. I don't see that we need it for the other
> functions, but please do tell me if I'm wrong there:
>
> --------------------8<---------------------------
>
> [PATCH] SQUASH: add memory barriers around i_version accesses
>
> Signed-off-by: Jeff Layton <jlayton@redhat.com>
> ---
> include/linux/iversion.h | 53 +++++++++++++++++++++++++++++-------------------
> 1 file changed, 32 insertions(+), 21 deletions(-)
>
> diff --git a/include/linux/iversion.h b/include/linux/iversion.h
> index a9fbf99709df..02187a3bec3b 100644
> --- a/include/linux/iversion.h
> +++ b/include/linux/iversion.h
> @@ -89,6 +89,23 @@ inode_set_iversion_raw(struct inode *inode, const u64 val)
> atomic64_set(&inode->i_version, val);
> }
>
> +/**
> + * inode_peek_iversion_raw - grab a "raw" iversion value
> + * @inode: inode from which i_version should be read
> + *
> + * Grab a "raw" inode->i_version value and return it. The i_version is not
> + * flagged or converted in any way. This is mostly used to access a self-managed
> + * i_version.
> + *
> + * With those filesystems, we want to treat the i_version as an entirely
> + * opaque value.
> + */
> +static inline u64
> +inode_peek_iversion_raw(const struct inode *inode)
> +{
> + return atomic64_read(&inode->i_version);
> +}
> +
> /**
> * inode_set_iversion - set i_version to a particular value
> * @inode: inode to set
> @@ -152,7 +169,16 @@ inode_maybe_inc_iversion(struct inode *inode, bool force)
> {
> u64 cur, old, new;
>
> - cur = (u64)atomic64_read(&inode->i_version);
> + /*
> + * The i_version field is not strictly ordered with any other inode
> + * information, but the legacy inode_inc_iversion code used a spinlock
> + * to serialize increments.
> + *
> + * This code adds full memory barriers to ensure that any de-facto
> + * ordering with other info is preserved.
> + */
> + smp_mb__before_atomic();
This should be just smp_mb(). __before_atomic() pairs with atomic
operations like atomic_inc(). atomic_read() is completely unordered
operation (happens to be plain memory read on x86) and so __before_atomic()
is not enough.
> + cur = inode_peek_iversion_raw(inode);
> for (;;) {
> /* If flag is clear then we needn't do anything */
> if (!force && !(cur & I_VERSION_QUERIED))
> @@ -162,8 +188,10 @@ inode_maybe_inc_iversion(struct inode *inode, bool force)
> new = (cur & ~I_VERSION_QUERIED) + I_VERSION_INCREMENT;
>
> old = atomic64_cmpxchg(&inode->i_version, cur, new);
> - if (likely(old == cur))
> + if (likely(old == cur)) {
> + smp_mb__after_atomic();
I don't think you need this. Cmpxchg is guaranteed to be full memory
barrier - from Documentation/atomic_t.txt:
- RMW operations that have a return value are fully ordered;
> break;
> + }
> cur = old;
> }
> return true;
...
> @@ -248,7 +259,7 @@ inode_query_iversion(struct inode *inode)
> {
> u64 cur, old, new;
>
> - cur = atomic64_read(&inode->i_version);
> + cur = inode_peek_iversion_raw(inode);
> for (;;) {
> /* If flag is already set, then no need to swap */
> if (cur & I_VERSION_QUERIED)
And here I'd expect smp_mb() after inode_peek_iversion_raw() (actually be
needed only if you are not going to do cmpxchg as that implies barrier as
well). "Safe" use of i_version would be:
Update:
modify inode
inode_maybe_inc_iversion(inode)
Read:
my_version = inode_query_iversion(inode)
get inode data
And you need to make sure 'get inode data' does not get speculatively
evaluated before you actually sample i_version so that you are guaranteed
that if data changes, you will observe larger i_version in the future.
Also please add a comment smp_mb() in inode_maybe_inc_iversion() like:
/* This barrier pairs with the barrier in inode_query_iversion() */
and a similar comment to inode_query_iversion(). Because memory barriers
make sense only in pairs (see SMP BARRIER PAIRING in
Documentation/memory-barriers.txt).
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
WARNING: multiple messages have this Message-ID (diff)
From: Jan Kara <jack-AlSwsSmVLrQ@public.gmane.org>
To: Jeff Layton <jlayton-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
Cc: Dave Chinner <david-FqsqvQoI3Ljby3iVrkZq2A@public.gmane.org>,
Jan Kara <jack-AlSwsSmVLrQ@public.gmane.org>,
"J. Bruce Fields"
<bfields-uC3wQj2KruNg9hUCZPvPmw@public.gmane.org>,
linux-fsdevel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn@public.gmane.org,
linux-nfs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
neilb-l3A5Bk7waGM@public.gmane.org,
jack-l3A5Bk7waGM@public.gmane.org,
linux-ext4-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
tytso-3s7WtUTddSA@public.gmane.org,
adilger.kernel-m1MBpc4rdrD3fQ9qLvQP4Q@public.gmane.org,
linux-xfs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
darrick.wong-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org,
linux-btrfs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
clm-b10kYP2dOMg@public.gmane.org,
jbacik-b10kYP2dOMg@public.gmane.org,
dsterba-IBi9RG/b67k@public.gmane.org,
linux-integrity-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
zohar-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org,
dmitry.kasatkin-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org,
linux-afs-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r@public.gmane.org,
dhowells-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org,
jaltman-hRzEac23uH1Wk0Htik3J/w@public.gmane.org
Subject: Re: [PATCH v3 19/19] fs: handle inode->i_version more efficiently
Date: Wed, 20 Dec 2017 17:41:50 +0100 [thread overview]
Message-ID: <20171220164150.GF31584@quack2.suse.cz> (raw)
In-Reply-To: <1513778586.4513.18.camel-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
On Wed 20-12-17 09:03:06, Jeff Layton wrote:
> On Tue, 2017-12-19 at 09:07 +1100, Dave Chinner wrote:
> > On Mon, Dec 18, 2017 at 02:35:20PM -0500, Jeff Layton wrote:
> > > [PATCH] SQUASH: add memory barriers around i_version accesses
> >
> > Why explicit memory barriers rather than annotating the operations
> > with the required semantics and getting the barriers the arch
> > requires automatically? I suspect this should be using
> > atomic_read_acquire() and atomic_cmpxchg_release(), because AFAICT
> > the atomic_cmpxchg needs to have release semantics to match the
> > acquire semantics needed for the load of the current value.
> >
> > From include/linux/atomics.h:
> >
> > * For compound atomics performing both a load and a store, ACQUIRE
> > * semantics apply only to the load and RELEASE semantics only to the
> > * store portion of the operation. Note that a failed cmpxchg_acquire
> > * does -not- imply any memory ordering constraints.
> >
> > Memory barriers hurt my brain. :/
> >
> > At minimum, shouldn't the atomic op specific barriers be used rather
> > than full memory barriers? i.e:
> >
>
> They hurt my brain too. Yes, definitely atomic-specific barriers should
> be used here instead, since this is an atomic64_t now.
>
> After going over the docs again...my understanding has always been that
> you primarily need memory barriers to order accesses to different areas
> of memory.
That is correct.
> As Jan and I were discussing in the other thread, i_version is not
> synchronized with anything else. In this code, we're only dealing with a
> single 64-bit word. I don't think there are any races there wrt the API
> itself.
There are not but it is like saying that lock implementation is correct
because the lock state does not get corrupted ;). Who cares about protected
data...
> The "legacy" inode_inc_iversion() however _does_ have implied memory
> barriers from the i_lock. There could be some subtle de-facto ordering
> there, so I think we probably do want some barriers in here if only to
> preserve that. It's not likely to cost much, and may save us tracking
> down some fiddly bugs.
>
> What about this patch? Note that I've only added barriers to
> inode_maybe_inc_iversion. I don't see that we need it for the other
> functions, but please do tell me if I'm wrong there:
>
> --------------------8<---------------------------
>
> [PATCH] SQUASH: add memory barriers around i_version accesses
>
> Signed-off-by: Jeff Layton <jlayton-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> ---
> include/linux/iversion.h | 53 +++++++++++++++++++++++++++++-------------------
> 1 file changed, 32 insertions(+), 21 deletions(-)
>
> diff --git a/include/linux/iversion.h b/include/linux/iversion.h
> index a9fbf99709df..02187a3bec3b 100644
> --- a/include/linux/iversion.h
> +++ b/include/linux/iversion.h
> @@ -89,6 +89,23 @@ inode_set_iversion_raw(struct inode *inode, const u64 val)
> atomic64_set(&inode->i_version, val);
> }
>
> +/**
> + * inode_peek_iversion_raw - grab a "raw" iversion value
> + * @inode: inode from which i_version should be read
> + *
> + * Grab a "raw" inode->i_version value and return it. The i_version is not
> + * flagged or converted in any way. This is mostly used to access a self-managed
> + * i_version.
> + *
> + * With those filesystems, we want to treat the i_version as an entirely
> + * opaque value.
> + */
> +static inline u64
> +inode_peek_iversion_raw(const struct inode *inode)
> +{
> + return atomic64_read(&inode->i_version);
> +}
> +
> /**
> * inode_set_iversion - set i_version to a particular value
> * @inode: inode to set
> @@ -152,7 +169,16 @@ inode_maybe_inc_iversion(struct inode *inode, bool force)
> {
> u64 cur, old, new;
>
> - cur = (u64)atomic64_read(&inode->i_version);
> + /*
> + * The i_version field is not strictly ordered with any other inode
> + * information, but the legacy inode_inc_iversion code used a spinlock
> + * to serialize increments.
> + *
> + * This code adds full memory barriers to ensure that any de-facto
> + * ordering with other info is preserved.
> + */
> + smp_mb__before_atomic();
This should be just smp_mb(). __before_atomic() pairs with atomic
operations like atomic_inc(). atomic_read() is completely unordered
operation (happens to be plain memory read on x86) and so __before_atomic()
is not enough.
> + cur = inode_peek_iversion_raw(inode);
> for (;;) {
> /* If flag is clear then we needn't do anything */
> if (!force && !(cur & I_VERSION_QUERIED))
> @@ -162,8 +188,10 @@ inode_maybe_inc_iversion(struct inode *inode, bool force)
> new = (cur & ~I_VERSION_QUERIED) + I_VERSION_INCREMENT;
>
> old = atomic64_cmpxchg(&inode->i_version, cur, new);
> - if (likely(old == cur))
> + if (likely(old == cur)) {
> + smp_mb__after_atomic();
I don't think you need this. Cmpxchg is guaranteed to be full memory
barrier - from Documentation/atomic_t.txt:
- RMW operations that have a return value are fully ordered;
> break;
> + }
> cur = old;
> }
> return true;
...
> @@ -248,7 +259,7 @@ inode_query_iversion(struct inode *inode)
> {
> u64 cur, old, new;
>
> - cur = atomic64_read(&inode->i_version);
> + cur = inode_peek_iversion_raw(inode);
> for (;;) {
> /* If flag is already set, then no need to swap */
> if (cur & I_VERSION_QUERIED)
And here I'd expect smp_mb() after inode_peek_iversion_raw() (actually be
needed only if you are not going to do cmpxchg as that implies barrier as
well). "Safe" use of i_version would be:
Update:
modify inode
inode_maybe_inc_iversion(inode)
Read:
my_version = inode_query_iversion(inode)
get inode data
And you need to make sure 'get inode data' does not get speculatively
evaluated before you actually sample i_version so that you are guaranteed
that if data changes, you will observe larger i_version in the future.
Also please add a comment smp_mb() in inode_maybe_inc_iversion() like:
/* This barrier pairs with the barrier in inode_query_iversion() */
and a similar comment to inode_query_iversion(). Because memory barriers
make sense only in pairs (see SMP BARRIER PAIRING in
Documentation/memory-barriers.txt).
Honza
--
Jan Kara <jack-IBi9RG/b67k@public.gmane.org>
SUSE Labs, CR
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
next prev parent reply other threads:[~2017-12-20 16:41 UTC|newest]
Thread overview: 43+ messages / expand[flat|nested] mbox.gz Atom feed top
2017-12-18 15:11 [PATCH v3 00/19] fs: rework and optimize i_version handling in filesystems Jeff Layton
2017-12-18 15:11 ` [PATCH v3 01/19] fs: new API for handling inode->i_version Jeff Layton
2017-12-18 17:46 ` Jeff Layton
2017-12-18 15:11 ` [PATCH v3 02/19] fs: don't take the i_lock in inode_inc_iversion Jeff Layton
2017-12-18 15:11 ` [PATCH v3 03/19] fat: convert to new i_version API Jeff Layton
2017-12-18 15:11 ` [PATCH v3 04/19] affs: " Jeff Layton
2017-12-18 15:11 ` [PATCH v3 05/19] afs: " Jeff Layton
2017-12-18 15:11 ` [PATCH v3 06/19] btrfs: " Jeff Layton
2017-12-18 15:11 ` Jeff Layton
2017-12-18 15:11 ` [PATCH v3 07/19] exofs: switch " Jeff Layton
2017-12-18 15:11 ` [PATCH v3 08/19] ext2: convert " Jeff Layton
2017-12-18 15:11 ` [PATCH v3 09/19] ext4: " Jeff Layton
2017-12-18 15:11 ` [PATCH v3 10/19] nfs: " Jeff Layton
2017-12-18 15:11 ` [PATCH v3 11/19] nfsd: " Jeff Layton
2017-12-18 15:11 ` [PATCH v3 12/19] ocfs2: " Jeff Layton
2017-12-18 15:11 ` Jeff Layton
2017-12-18 15:11 ` [PATCH v3 13/19] ufs: use " Jeff Layton
2017-12-18 15:11 ` [PATCH v3 14/19] xfs: convert to " Jeff Layton
2017-12-18 15:11 ` [PATCH v3 15/19] IMA: switch IMA over " Jeff Layton
2017-12-18 15:11 ` Jeff Layton
2017-12-18 15:11 ` [PATCH v3 16/19] fs: only set S_VERSION when updating times if necessary Jeff Layton
2017-12-18 16:07 ` Jan Kara
2017-12-18 16:07 ` Jan Kara
2017-12-18 17:25 ` Jeff Layton
2017-12-18 15:11 ` [PATCH v3 17/19] xfs: avoid setting XFS_ILOG_CORE if i_version doesn't need incrementing Jeff Layton
2017-12-18 15:11 ` [PATCH v3 18/19] btrfs: only dirty the inode in btrfs_update_time if something was changed Jeff Layton
2017-12-18 15:11 ` [PATCH v3 19/19] fs: handle inode->i_version more efficiently Jeff Layton
2017-12-18 16:34 ` Jan Kara
2017-12-18 17:22 ` Jeff Layton
2017-12-18 17:36 ` J. Bruce Fields
2017-12-18 19:35 ` Jeff Layton
2017-12-18 22:07 ` Dave Chinner
2017-12-18 22:07 ` Dave Chinner
2017-12-20 14:03 ` Jeff Layton
2017-12-20 14:03 ` Jeff Layton
2017-12-20 16:41 ` Jan Kara [this message]
2017-12-20 16:41 ` Jan Kara
2017-12-21 11:25 ` Jeff Layton
2017-12-21 11:25 ` Jeff Layton
2017-12-21 11:48 ` Jan Kara
2017-12-19 9:29 ` Jan Kara
2017-12-19 17:14 ` Jeff Layton
2017-12-19 17:14 ` Jeff Layton
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20171220164150.GF31584@quack2.suse.cz \
--to=jack@suse.cz \
--cc=adilger.kernel@dilger.ca \
--cc=bfields@fieldses.org \
--cc=clm@fb.com \
--cc=darrick.wong@oracle.com \
--cc=david@fromorbit.com \
--cc=dhowells@redhat.com \
--cc=dmitry.kasatkin@gmail.com \
--cc=dsterba@suse.com \
--cc=jack@suse.de \
--cc=jaltman@auristor.com \
--cc=jbacik@fb.com \
--cc=jlayton@kernel.org \
--cc=linux-afs@lists.infradead.org \
--cc=linux-btrfs@vger.kernel.org \
--cc=linux-ext4@vger.kernel.org \
--cc=linux-fsdevel@vger.kernel.org \
--cc=linux-integrity@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-nfs@vger.kernel.org \
--cc=linux-xfs@vger.kernel.org \
--cc=neilb@suse.de \
--cc=tytso@mit.edu \
--cc=viro@zeniv.linux.org.uk \
--cc=zohar@linux.vnet.ibm.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.