* [PATCH] vfs: elide smp_mb in iversion handling in the common case
From: Mateusz Guzik @ 2024-08-15 8:33 UTC
To: brauner; +Cc: viro, jack, linux-kernel, linux-fsdevel, Mateusz Guzik
According to bpftrace on these routines, most calls result in a cmpxchg,
which already provides the same barrier guarantee.

In inode_maybe_inc_iversion the elision is possible because even if a
stale value is read due to the now-missing smp_mb fence, the issue is
going to correct itself after the cmpxchg. If it turns out no cmpxchg
will be issued, the fence + reload are there, bringing back the previous
behavior.
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
---
Chances are this entire barrier guarantee is of no significance, but I'm
not signing up to review it.
I verified the force flag is not *always* set (but it is set in the most common case).
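
As a side note, the reason a possibly stale first read is harmless when
cmpxchg does get issued: atomic64_try_cmpxchg() writes the current value
back into 'cur' on failure and is a full barrier itself. A simplified
sketch (not the patch, which also handles the QUERIED flag):

	cur = inode_peek_iversion_raw(inode);	/* may be stale: no fence */
	do {
		new = cur + I_VERSION_INCREMENT;
	} while (!atomic64_try_cmpxchg(&inode->i_version, &cur, new));
	/* a stale 'cur' only costs one extra trip around the loop */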
fs/libfs.c | 28 ++++++++++++++++++----------
1 file changed, 18 insertions(+), 10 deletions(-)
diff --git a/fs/libfs.c b/fs/libfs.c
index 8aa34870449f..61ae4811270a 100644
--- a/fs/libfs.c
+++ b/fs/libfs.c
@@ -1990,13 +1990,19 @@ bool inode_maybe_inc_iversion(struct inode *inode, bool force)
* information, but the legacy inode_inc_iversion code used a spinlock
* to serialize increments.
*
- * Here, we add full memory barriers to ensure that any de-facto
- * ordering with other info is preserved.
+ * We add a full memory barrier to ensure that any de facto ordering
+ * with other state is preserved (either implicitly coming from cmpxchg
+ * or explicitly from smp_mb if we don't know upfront if we will execute
+ * the former).
*
- * This barrier pairs with the barrier in inode_query_iversion()
+ * These barriers pair with inode_query_iversion().
*/
- smp_mb();
cur = inode_peek_iversion_raw(inode);
+ if (!force && !(cur & I_VERSION_QUERIED)) {
+ smp_mb();
+ cur = inode_peek_iversion_raw(inode);
+ }
+
do {
/* If flag is clear then we needn't do anything */
if (!force && !(cur & I_VERSION_QUERIED))
@@ -2025,20 +2031,22 @@ EXPORT_SYMBOL(inode_maybe_inc_iversion);
u64 inode_query_iversion(struct inode *inode)
{
u64 cur, new;
+ bool fenced = false;
+ /*
+ * Memory barriers (implicit in cmpxchg, explicit in smp_mb) pair with
+ * inode_maybe_inc_iversion(), see that routine for more details.
+ */
cur = inode_peek_iversion_raw(inode);
do {
/* If flag is already set, then no need to swap */
if (cur & I_VERSION_QUERIED) {
- /*
- * This barrier (and the implicit barrier in the
- * cmpxchg below) pairs with the barrier in
- * inode_maybe_inc_iversion().
- */
- smp_mb();
+ if (!fenced)
+ smp_mb();
break;
}
+ fenced = true;
new = cur | I_VERSION_QUERIED;
} while (!atomic64_try_cmpxchg(&inode->i_version, &cur, new));
return cur >> I_VERSION_QUERIED_SHIFT;
--
2.43.0
* Re: [PATCH] vfs: elide smp_mb in iversion handling in the common case
From: Christian Brauner @ 2024-08-16 10:56 UTC
To: Mateusz Guzik; +Cc: Christian Brauner, viro, jack, linux-kernel, linux-fsdevel
On Thu, 15 Aug 2024 10:33:10 +0200, Mateusz Guzik wrote:
> According to bpftrace on these routines, most calls result in a cmpxchg,
> which already provides the same barrier guarantee.
>
> In inode_maybe_inc_iversion the elision is possible because even if a
> stale value is read due to the now-missing smp_mb fence, the issue is
> going to correct itself after the cmpxchg. If it turns out no cmpxchg
> will be issued, the fence + reload are there, bringing back the previous
> behavior.
>
> [...]
Applied to the vfs.misc branch of the vfs/vfs.git tree.
Patches in the vfs.misc branch should appear in linux-next soon.
Please report any outstanding bugs that were missed during review in a
new review to the original patch series, allowing us to drop it.
It's encouraged to provide Acked-bys and Reviewed-bys even though the
patch has now been applied. If possible, patch trailers will be updated.
Note that commit hashes shown below are subject to change due to rebase,
trailer updates or similar. If in doubt, please check the listed branch.
tree: https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git
branch: vfs.misc
[1/1] vfs: elide smp_mb in iversion handling in the common case
https://git.kernel.org/vfs/vfs/c/5570f04d0bb1
* Re: [PATCH] vfs: elide smp_mb in iversion handling in the common case
From: Jan Kara @ 2024-08-27 10:00 UTC
To: Mateusz Guzik
Cc: brauner, viro, jack, linux-kernel, linux-fsdevel, Jeff Layton
On Thu 15-08-24 10:33:10, Mateusz Guzik wrote:
> According to bpftrace on these routines, most calls result in a cmpxchg,
> which already provides the same barrier guarantee.
>
> In inode_maybe_inc_iversion the elision is possible because even if a
> stale value is read due to the now-missing smp_mb fence, the issue is
> going to correct itself after the cmpxchg. If it turns out no cmpxchg
> will be issued, the fence + reload are there, bringing back the previous
> behavior.
>
> Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
> ---
>
> Chances are this entire barrier guarantee is of no significance, but I'm
> not signing up to review it.
Jeff might have a ready answer here - added to CC. I think the barrier is
needed in principle so that you can guarantee that after a data change you
will be able to observe an i_version change.
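Roughly, the pairing looks like this (a simplified sketch of the two
paths, not the actual code):

	writer (inc_iversion)		reader (query_iversion)
	---------------------		-----------------------
	modify data			set I_VERSION_QUERIED
	smp_mb() / cmpxchg		smp_mb() / cmpxchg
	read QUERIED flag		read data

With full barriers on both sides it cannot happen that the reader misses
the data change *and* the writer misses the QUERIED flag; so if the
reader saw stale data, the writer is guaranteed to bump i_version and
the reader's next query will observe a change.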
> I verified the force flag is not *always* set (but it is set in the most
> common case).
Well, I'm not convinced the more complicated code is really worth it.
'force' will be set when we update timestamps, which happens once per tick
(usually 1-4 ms). So that is the common case on a lightly / moderately
loaded system. On a heavily write(2)-loaded system, 'force' should be
mostly false, and unless you also heavily stat(2) the modified files, the
common path is exactly the "if (!force && !(cur & I_VERSION_QUERIED))"
branch. So saving one smp_mb() per couple of ms (per inode) on a
moderately loaded system doesn't seem like a noticeable win...
Honza
> diff --git a/fs/libfs.c b/fs/libfs.c
> index 8aa34870449f..61ae4811270a 100644
> --- a/fs/libfs.c
> +++ b/fs/libfs.c
> @@ -1990,13 +1990,19 @@ bool inode_maybe_inc_iversion(struct inode *inode, bool force)
> * information, but the legacy inode_inc_iversion code used a spinlock
> * to serialize increments.
> *
> - * Here, we add full memory barriers to ensure that any de-facto
> - * ordering with other info is preserved.
> + * We add a full memory barrier to ensure that any de facto ordering
> + * with other state is preserved (either implicitly coming from cmpxchg
> + * or explicitly from smp_mb if we don't know upfront if we will execute
> + * the former).
> *
> - * This barrier pairs with the barrier in inode_query_iversion()
> + * These barriers pair with inode_query_iversion().
> */
> - smp_mb();
> cur = inode_peek_iversion_raw(inode);
> + if (!force && !(cur & I_VERSION_QUERIED)) {
> + smp_mb();
> + cur = inode_peek_iversion_raw(inode);
> + }
> +
> do {
> /* If flag is clear then we needn't do anything */
> if (!force && !(cur & I_VERSION_QUERIED))
> @@ -2025,20 +2031,22 @@ EXPORT_SYMBOL(inode_maybe_inc_iversion);
> u64 inode_query_iversion(struct inode *inode)
> {
> u64 cur, new;
> + bool fenced = false;
>
> + /*
> + * Memory barriers (implicit in cmpxchg, explicit in smp_mb) pair with
> + * inode_maybe_inc_iversion(), see that routine for more details.
> + */
> cur = inode_peek_iversion_raw(inode);
> do {
> /* If flag is already set, then no need to swap */
> if (cur & I_VERSION_QUERIED) {
> - /*
> - * This barrier (and the implicit barrier in the
> - * cmpxchg below) pairs with the barrier in
> - * inode_maybe_inc_iversion().
> - */
> - smp_mb();
> + if (!fenced)
> + smp_mb();
> break;
> }
>
> + fenced = true;
> new = cur | I_VERSION_QUERIED;
> } while (!atomic64_try_cmpxchg(&inode->i_version, &cur, new));
> return cur >> I_VERSION_QUERIED_SHIFT;
> --
> 2.43.0
>
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
* Re: [PATCH] vfs: elide smp_mb in iversion handling in the common case
From: Mateusz Guzik @ 2024-08-27 10:21 UTC
To: Jan Kara; +Cc: brauner, viro, linux-kernel, linux-fsdevel, Jeff Layton
On Tue, Aug 27, 2024 at 12:00 PM Jan Kara <jack@suse.cz> wrote:
>
> On Thu 15-08-24 10:33:10, Mateusz Guzik wrote:
> > According to bpftrace on these routines, most calls result in a cmpxchg,
> > which already provides the same barrier guarantee.
> >
> > In inode_maybe_inc_iversion the elision is possible because even if a
> > stale value is read due to the now-missing smp_mb fence, the issue is
> > going to correct itself after the cmpxchg. If it turns out no cmpxchg
> > will be issued, the fence + reload are there, bringing back the previous
> > behavior.
> >
> > Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
> > ---
> >
> > Chances are this entire barrier guarantee is of no significance, but I'm
> > not signing up to review it.
>
> Jeff might have a ready answer here - added to CC. I think the barrier is
> needed in principle so that you can guarantee that after a data change you
> will be able to observe an i_version change.
>
That is the description. I am saying it is unclear whether anyone needs
it, and I am not interested in spending time finding out.
> > I verified the force flag is not *always* set (but it is set in the most
> > common case).
>
> Well, I'm not convinced the more complicated code is really worth it.
> 'force' will be set when we update timestamps, which happens once per tick
> (usually 1-4 ms). So that is the common case on a lightly / moderately
> loaded system. On a heavily write(2)-loaded system, 'force' should be
> mostly false, and unless you also heavily stat(2) the modified files, the
> common path is exactly the "if (!force && !(cur & I_VERSION_QUERIED))"
> branch. So saving one smp_mb() per couple of ms (per inode) on a
> moderately loaded system doesn't seem like a noticeable win...
>
inode_maybe_inc_iversion is used a lot and most commonly *with* the force flag.
You can try it out yourself:

bpftrace -e 'kprobe:inode_maybe_inc_iversion { @[kstack(), arg1] = count(); }'

and run your favourite fs-using workload. For example, this is a top
backtrace from a few seconds of building the kernel:
@[
inode_maybe_inc_iversion+5
inode_update_timestamps+238
generic_update_time+19
file_update_time+125
shmem_file_write_iter+118
vfs_write+599
ksys_write+103
do_syscall_64+82
entry_SYSCALL_64_after_hwframe+118
, 1]: 1670
The '1' at the end indicates the 'force' flag set to 1.
This also shows up on unlink et al.
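
If you only care about the force-flag distribution and not the stacks,
a variant along these lines does it (same probe, arg1 being the force
bool):

bpftrace -e 'kprobe:inode_maybe_inc_iversion { @force[arg1] = count(); }'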
The context here is that the VFS is dog slow single-threaded, in part
due to spurious barriers sneaked in all over the place. Here is another
example:
https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git/commit/?h=vfs.misc&id=30eb7cc03875b508ce6683c8b3cf6e88442a2f87
--
Mateusz Guzik <mjguzik gmail.com>
* Re: [PATCH] vfs: elide smp_mb in iversion handling in the common case
From: Jeff Layton @ 2024-08-27 10:50 UTC
To: Mateusz Guzik, brauner; +Cc: viro, jack, linux-kernel, linux-fsdevel
On Thu, 2024-08-15 at 10:33 +0200, Mateusz Guzik wrote:
> According to bpftrace on these routines, most calls result in a cmpxchg,
> which already provides the same barrier guarantee.
>
> In inode_maybe_inc_iversion the elision is possible because even if a
> stale value is read due to the now-missing smp_mb fence, the issue is
> going to correct itself after the cmpxchg. If it turns out no cmpxchg
> will be issued, the fence + reload are there, bringing back the previous
> behavior.
>
> Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
> ---
>
> Chances are this entire barrier guarantee is of no significance, but I'm
> not signing up to review it.
>
> I verified the force flag is not *always* set (but it is set in the most common case).
>
> fs/libfs.c | 28 ++++++++++++++++++----------
> 1 file changed, 18 insertions(+), 10 deletions(-)
>
> diff --git a/fs/libfs.c b/fs/libfs.c
> index 8aa34870449f..61ae4811270a 100644
> --- a/fs/libfs.c
> +++ b/fs/libfs.c
> @@ -1990,13 +1990,19 @@ bool inode_maybe_inc_iversion(struct inode *inode, bool force)
> * information, but the legacy inode_inc_iversion code used a spinlock
> * to serialize increments.
> *
> - * Here, we add full memory barriers to ensure that any de-facto
> - * ordering with other info is preserved.
> + * We add a full memory barrier to ensure that any de facto ordering
> + * with other state is preserved (either implicitly coming from cmpxchg
> + * or explicitly from smp_mb if we don't know upfront if we will execute
> + * the former).
> *
> - * This barrier pairs with the barrier in inode_query_iversion()
> + * These barriers pair with inode_query_iversion().
> */
> - smp_mb();
> cur = inode_peek_iversion_raw(inode);
> + if (!force && !(cur & I_VERSION_QUERIED)) {
> + smp_mb();
> + cur = inode_peek_iversion_raw(inode);
> + }
> +
> do {
> /* If flag is clear then we needn't do anything */
> if (!force && !(cur & I_VERSION_QUERIED))
> @@ -2025,20 +2031,22 @@ EXPORT_SYMBOL(inode_maybe_inc_iversion);
> u64 inode_query_iversion(struct inode *inode)
> {
> u64 cur, new;
> + bool fenced = false;
>
> + /*
> + * Memory barriers (implicit in cmpxchg, explicit in smp_mb) pair with
> + * inode_maybe_inc_iversion(), see that routine for more details.
> + */
> cur = inode_peek_iversion_raw(inode);
> do {
> /* If flag is already set, then no need to swap */
> if (cur & I_VERSION_QUERIED) {
> - /*
> - * This barrier (and the implicit barrier in the
> - * cmpxchg below) pairs with the barrier in
> - * inode_maybe_inc_iversion().
> - */
> - smp_mb();
> + if (!fenced)
> + smp_mb();
> break;
> }
>
> + fenced = true;
> new = cur | I_VERSION_QUERIED;
> } while (!atomic64_try_cmpxchg(&inode->i_version, &cur, new));
> return cur >> I_VERSION_QUERIED_SHIFT;
I'll look over the patch in more detail later, but...
Primarily, we set "force" when we know we're going to be doing a
journal transaction anyway. The idea there was that since we're doing
that (usually to update timestamps or something), we might as well bump
the iversion too.
For instance, XFS does this:
inode_maybe_inc_iversion(inode, flags & XFS_ILOG_CORE)
...and XFS_ILOG_CORE is set when there is already a transaction queued.
That isn't actually required. If someone hasn't queried this value then
we're not required to bump the iversion at all. We could just elide
some of those "force" cases as well.
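
Hypothetically (untested), a call site like the XFS one above could just
stop passing the hint and let the QUERIED check decide:

	/* hypothetical: bump only if someone actually queried */
	inode_maybe_inc_iversion(inode, false);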
--
Jeff Layton <jlayton@kernel.org>
* Re: [PATCH] vfs: elide smp_mb in iversion handling in the common case
From: Jeff Layton @ 2024-08-27 12:29 UTC
To: Mateusz Guzik, brauner; +Cc: viro, jack, linux-kernel, linux-fsdevel
On Thu, 2024-08-15 at 10:33 +0200, Mateusz Guzik wrote:
> According to bpftrace on these routines, most calls result in a cmpxchg,
> which already provides the same barrier guarantee.
>
> In inode_maybe_inc_iversion the elision is possible because even if a
> stale value is read due to the now-missing smp_mb fence, the issue is
> going to correct itself after the cmpxchg. If it turns out no cmpxchg
> will be issued, the fence + reload are there, bringing back the previous
> behavior.
>
> Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
> ---
>
> Chances are this entire barrier guarantee is of no significance, but I'm
> not signing up to review it.
>
> I verified the force flag is not *always* set (but it is set in the most common case).
>
> fs/libfs.c | 28 ++++++++++++++++++----------
> 1 file changed, 18 insertions(+), 10 deletions(-)
>
> diff --git a/fs/libfs.c b/fs/libfs.c
> index 8aa34870449f..61ae4811270a 100644
> --- a/fs/libfs.c
> +++ b/fs/libfs.c
> @@ -1990,13 +1990,19 @@ bool inode_maybe_inc_iversion(struct inode *inode, bool force)
> * information, but the legacy inode_inc_iversion code used a spinlock
> * to serialize increments.
> *
> - * Here, we add full memory barriers to ensure that any de-facto
> - * ordering with other info is preserved.
> + * We add a full memory barrier to ensure that any de facto ordering
> + * with other state is preserved (either implicitly coming from cmpxchg
> + * or explicitly from smp_mb if we don't know upfront if we will execute
> + * the former).
> *
> - * This barrier pairs with the barrier in inode_query_iversion()
> + * These barriers pair with inode_query_iversion().
> */
> - smp_mb();
> cur = inode_peek_iversion_raw(inode);
> + if (!force && !(cur & I_VERSION_QUERIED)) {
> + smp_mb();
> + cur = inode_peek_iversion_raw(inode);
> + }
> +
> do {
> /* If flag is clear then we needn't do anything */
> if (!force && !(cur & I_VERSION_QUERIED))
> @@ -2025,20 +2031,22 @@ EXPORT_SYMBOL(inode_maybe_inc_iversion);
> u64 inode_query_iversion(struct inode *inode)
> {
> u64 cur, new;
> + bool fenced = false;
>
> + /*
> + * Memory barriers (implicit in cmpxchg, explicit in smp_mb) pair with
> + * inode_maybe_inc_iversion(), see that routine for more details.
> + */
> cur = inode_peek_iversion_raw(inode);
> do {
> /* If flag is already set, then no need to swap */
> if (cur & I_VERSION_QUERIED) {
> - /*
> - * This barrier (and the implicit barrier in the
> - * cmpxchg below) pairs with the barrier in
> - * inode_maybe_inc_iversion().
> - */
> - smp_mb();
> + if (!fenced)
> + smp_mb();
> break;
> }
>
> + fenced = true;
> new = cur | I_VERSION_QUERIED;
> } while (!atomic64_try_cmpxchg(&inode->i_version, &cur, new));
> return cur >> I_VERSION_QUERIED_SHIFT;
This looks reasonable to me.
As I said in my earlier email, we could stop setting "force" in more
cases, as it's not usually required. The only reason we set it to true
in places like inode_update_timestamps is that we're updating the
timestamps on disk anyway, so we might as well increment the change
attribute too.

That's not specifically required though. The only real requirement for
the change attribute is that two samples of it must be different if,
between the samples, the inode is changed in a way that would alter the
ctime. If these cmpxchgs are slowing down real workloads, we could stop
doing that when it hasn't been queried.
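
To spell that requirement out (an illustration, not kernel code):

	u64 v1 = inode_query_iversion(inode);	/* marks it QUERIED */
	/* ... change that would alter the ctime happens here ... */
	u64 v2 = inode_query_iversion(inode);
	/* required: v2 != v1; absent such a change, v2 == v1 is fine */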
Reviewed-by: Jeff Layton <jlayton@kernel.org>