All of lore.kernel.org
 help / color / mirror / Atom feed
* [RFC] why do we need smp_rmb/smp_wmb pair in fd_install()/expand_fdtable()?
@ 2024-08-08  2:50 Al Viro
  2024-08-08  3:06 ` Linus Torvalds
  0 siblings, 1 reply; 27+ messages in thread
From: Al Viro @ 2024-08-08  2:50 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Linus Torvalds, Mateusz Guzik, Eric Dumazet, Paul E. McKenney

	Take a look at fs/file.c:expand_fdtable() and fs/file.c:fd_install()

In the end of the former:
        copy_fdtable(new_fdt, cur_fdt);
        rcu_assign_pointer(files->fdt, new_fdt);
        if (cur_fdt != &files->fdtab)
                call_rcu(&cur_fdt->rcu, free_fdtable_rcu);
        /* coupled with smp_rmb() in fd_install() */
        smp_wmb();
        return 1;
the only caller (expand_files()) has
        expanded = expand_fdtable(files, nr);
        files->resize_in_progress = false;

OTOH, in fd_install() there's this:
        if (unlikely(files->resize_in_progress)) {
                rcu_read_unlock_sched();
                spin_lock(&files->file_lock);
                fdt = files_fdtable(files);
                WARN_ON(fdt->fd[fd] != NULL);
                rcu_assign_pointer(fdt->fd[fd], file);
                spin_unlock(&files->file_lock);
                return;
        }
        /* coupled with smp_wmb() in expand_fdtable() */
        smp_rmb();
        fdt = rcu_dereference_sched(files->fdt);

What's the problem with droping both barriers and turning that
into
        expanded = expand_fdtable(files, nr);
        smp_store_release(&files->resize_in_progress, false);
and
        if (unlikely(smp_load_acquire(&files->resize_in_progress))) {
		....
                return;
        }
        fdt = rcu_dereference_sched(files->fdt);
resp.?  Anyone who sees ->resize_in_progress being false will
see all stores prior to setting ->fdt on the expand size...

That went into the tree in 8a81252b774b "fs/file.c: don't acquire
files->file_lock in fd_install()".  I don't remember that having been
discussed back then, but it had been 7 years ago...
<check the thread on lore>
Nobody asked, AFAICS.

Is there any problem with going that way?  Because I'm pretty sure
that it's would be cheaper to use store_release/load_acquire instead
of paying for smp_rmb() on the fd_install() side...

Am I missing something subtle here?

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC] why do we need smp_rmb/smp_wmb pair in fd_install()/expand_fdtable()?
  2024-08-08  2:50 [RFC] why do we need smp_rmb/smp_wmb pair in fd_install()/expand_fdtable()? Al Viro
@ 2024-08-08  3:06 ` Linus Torvalds
  2024-08-08  3:35   ` Al Viro
  2024-08-08 13:20   ` [RFC] why do we need smp_rmb/smp_wmb pair in fd_install()/expand_fdtable()? Christian Brauner
  0 siblings, 2 replies; 27+ messages in thread
From: Linus Torvalds @ 2024-08-08  3:06 UTC (permalink / raw)
  To: Al Viro; +Cc: linux-fsdevel, Mateusz Guzik, Eric Dumazet, Paul E. McKenney

On Wed, 7 Aug 2024 at 19:50, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> What's the problem with droping both barriers and turning that
> into
>         expanded = expand_fdtable(files, nr);
>         smp_store_release(&files->resize_in_progress, false);
> and
>         if (unlikely(smp_load_acquire(&files->resize_in_progress))) {
>                 ....
>                 return;
>         }

That should be fine. smp_store_release()->smp_load_acquire() is the
more modern model, and the better one. But I think we simply have a
long history of using the old smp_wmb()->smp_rmb() model, so we have a
lot of code that does that.

On x86, there's basically no difference - in all cases it ends up
being just an instruction scheduling barrier.

On arm64, store_release->load_acquire is likely better, but obviously
micro-architectural implementation issues might make it a wash.

On other architectures, there probably isn't a huge difference, but
acquire/release can be more expensive if the architecture is
explicitly designed for the old-style rmb/wmb model.

So on alpha, for example, store_release->load_acquire ends up being a
full memory barrier in both cases (rmb is always a full memory barrier
on alpha), which is hugely more expensive than wmb (well, again, in
theory this is all obviously dependent on microarchitectures, but wmb
in particular is very cheap unless the uarch really screwed the pooch
and just messed up its barriers entirely).

End result: wmb/rmb is usually never _horrific_, while release/acquire
can be rather expensive on bad machines.

But release/acquire is the RightThing(tm), and the fact that alpha
based its ordering on the bad old model is not really our problem.

So I'm ok with just saying "screw bad memory orderings, go with the
modern model"

             Linus

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC] why do we need smp_rmb/smp_wmb pair in fd_install()/expand_fdtable()?
  2024-08-08  3:06 ` Linus Torvalds
@ 2024-08-08  3:35   ` Al Viro
  2024-08-08  3:46     ` Al Viro
  2024-08-08 13:20   ` [RFC] why do we need smp_rmb/smp_wmb pair in fd_install()/expand_fdtable()? Christian Brauner
  1 sibling, 1 reply; 27+ messages in thread
From: Al Viro @ 2024-08-08  3:35 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: linux-fsdevel, Mateusz Guzik, Eric Dumazet, Paul E. McKenney

On Wed, Aug 07, 2024 at 08:06:31PM -0700, Linus Torvalds wrote:

> That should be fine. smp_store_release()->smp_load_acquire() is the
> more modern model, and the better one. But I think we simply have a
> long history of using the old smp_wmb()->smp_rmb() model, so we have a
> lot of code that does that.
> 
> On x86, there's basically no difference - in all cases it ends up
> being just an instruction scheduling barrier.
> 
> On arm64, store_release->load_acquire is likely better, but obviously
> micro-architectural implementation issues might make it a wash.
> 
> On other architectures, there probably isn't a huge difference, but
> acquire/release can be more expensive if the architecture is
> explicitly designed for the old-style rmb/wmb model.
> 
> So on alpha, for example, store_release->load_acquire ends up being a
> full memory barrier in both cases (rmb is always a full memory barrier
> on alpha), which is hugely more expensive than wmb (well, again, in
> theory this is all obviously dependent on microarchitectures, but wmb
> in particular is very cheap unless the uarch really screwed the pooch
> and just messed up its barriers entirely).
> 
> End result: wmb/rmb is usually never _horrific_, while release/acquire
> can be rather expensive on bad machines.
> 
> But release/acquire is the RightThing(tm), and the fact that alpha
> based its ordering on the bad old model is not really our problem.

alpha would have fuckloads of full barriers simply from all those READ_ONCE()
in rcu reads...

smp_rmb() is on the side that is much hotter - fd_install() vs. up to what, 25 calls
of expand_fdtable() per files_struct instance history in the worst possible case?
With rather big memcpy() done by those calls, at that...

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC] why do we need smp_rmb/smp_wmb pair in fd_install()/expand_fdtable()?
  2024-08-08  3:35   ` Al Viro
@ 2024-08-08  3:46     ` Al Viro
  2024-08-08  6:08       ` Mateusz Guzik
  0 siblings, 1 reply; 27+ messages in thread
From: Al Viro @ 2024-08-08  3:46 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: linux-fsdevel, Mateusz Guzik, Eric Dumazet, Paul E. McKenney

On Thu, Aug 08, 2024 at 04:35:05AM +0100, Al Viro wrote:
> On Wed, Aug 07, 2024 at 08:06:31PM -0700, Linus Torvalds wrote:

> > But release/acquire is the RightThing(tm), and the fact that alpha
> > based its ordering on the bad old model is not really our problem.
> 
> alpha would have fuckloads of full barriers simply from all those READ_ONCE()
> in rcu reads...
> 
> smp_rmb() is on the side that is much hotter - fd_install() vs. up to what, 25 calls
> of expand_fdtable() per files_struct instance history in the worst possible case?
> With rather big memcpy() done by those calls, at that...

BTW, an alternative would be to have LSB of ->fdt (or ->fd, if we try to
eliminate that extra dereference) for ->resize_in_progress.  Then no barrier
is needed for ordering of those.  Would cost an extra &~1 on ->fdt fetches,
though...

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC] why do we need smp_rmb/smp_wmb pair in fd_install()/expand_fdtable()?
  2024-08-08  3:46     ` Al Viro
@ 2024-08-08  6:08       ` Mateusz Guzik
  2024-12-05 12:03         ` [RFC PATCH] fs: elide the smp_rmb fence in fd_install() Mateusz Guzik
  0 siblings, 1 reply; 27+ messages in thread
From: Mateusz Guzik @ 2024-08-08  6:08 UTC (permalink / raw)
  To: Al Viro; +Cc: Linus Torvalds, linux-fsdevel, Eric Dumazet, Paul E. McKenney

On Thu, Aug 8, 2024 at 5:46 AM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> On Thu, Aug 08, 2024 at 04:35:05AM +0100, Al Viro wrote:
> > On Wed, Aug 07, 2024 at 08:06:31PM -0700, Linus Torvalds wrote:
>
> > > But release/acquire is the RightThing(tm), and the fact that alpha
> > > based its ordering on the bad old model is not really our problem.
> >
> > alpha would have fuckloads of full barriers simply from all those READ_ONCE()
> > in rcu reads...
> >
> > smp_rmb() is on the side that is much hotter - fd_install() vs. up to what, 25 calls
> > of expand_fdtable() per files_struct instance history in the worst possible case?
> > With rather big memcpy() done by those calls, at that...
>
> BTW, an alternative would be to have LSB of ->fdt (or ->fd, if we try to
> eliminate that extra dereference) for ->resize_in_progress.  Then no barrier
> is needed for ordering of those.  Would cost an extra &~1 on ->fdt fetches,
> though...

Note smp_load_acquire still emits a fenced instruction on arm64, but I
have no idea what the cost is.

While my understanding of RCU guarantees in face of synchronize_rcu is
rather limited here, I do suspect the entire thing can be handled with
a consume fence, which does expand to a regular load on everything but
alpha.

So the question to Paul is if given this in expand_fdtable:
         files->resize_in_progress = true;
         ....
        if (atomic_read(&files->count) > 1)
                synchronize_rcu();

does something like this work for fd_install:

        rcu_read_lock_sched();
        files = smp_load_consume(current->files);
        if (unlikely(files->resize_in_progress))
                ....
        fdt = rcu_dereference_sched(files->fdt);
        rcu_assign_pointer(fdt->fd[fd], file);
        rcu_read_unlock_sched();


-- 
Mateusz Guzik <mjguzik gmail.com>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC] why do we need smp_rmb/smp_wmb pair in fd_install()/expand_fdtable()?
  2024-08-08  3:06 ` Linus Torvalds
  2024-08-08  3:35   ` Al Viro
@ 2024-08-08 13:20   ` Christian Brauner
  2024-08-08 16:11     ` Linus Torvalds
  1 sibling, 1 reply; 27+ messages in thread
From: Christian Brauner @ 2024-08-08 13:20 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Al Viro, linux-fsdevel, Mateusz Guzik, Eric Dumazet,
	Paul E. McKenney

On Wed, Aug 07, 2024 at 08:06:31PM GMT, Linus Torvalds wrote:
> On Wed, 7 Aug 2024 at 19:50, Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > What's the problem with droping both barriers and turning that
> > into
> >         expanded = expand_fdtable(files, nr);
> >         smp_store_release(&files->resize_in_progress, false);
> > and
> >         if (unlikely(smp_load_acquire(&files->resize_in_progress))) {
> >                 ....
> >                 return;
> >         }
> 
> That should be fine. smp_store_release()->smp_load_acquire() is the
> more modern model, and the better one. But I think we simply have a
> long history of using the old smp_wmb()->smp_rmb() model, so we have a
> lot of code that does that.
> 
> On x86, there's basically no difference - in all cases it ends up
> being just an instruction scheduling barrier.
> 
> On arm64, store_release->load_acquire is likely better, but obviously
> micro-architectural implementation issues might make it a wash.
> 
> On other architectures, there probably isn't a huge difference, but
> acquire/release can be more expensive if the architecture is
> explicitly designed for the old-style rmb/wmb model.
> 
> So on alpha, for example, store_release->load_acquire ends up being a
> full memory barrier in both cases (rmb is always a full memory barrier
> on alpha), which is hugely more expensive than wmb (well, again, in
> theory this is all obviously dependent on microarchitectures, but wmb
> in particular is very cheap unless the uarch really screwed the pooch
> and just messed up its barriers entirely).
> 
> End result: wmb/rmb is usually never _horrific_, while release/acquire
> can be rather expensive on bad machines.
> 
> But release/acquire is the RightThing(tm), and the fact that alpha
> based its ordering on the bad old model is not really our problem.
> 
> So I'm ok with just saying "screw bad memory orderings, go with the
> modern model"

So that's what confused me in your earlier mail in the other thread
where the question around smp_{r,w}mb() and smp_store_release() and
smp_load_acquire() already came up.

Basically, I had always used smp_load_acquire() and smp_store_release()
based on the assumption that they're equivalent to smp_{r,w}mb().

But then multiple times people brought up that supposedly smp_rmb() and
smp_wmb() are cheaper because they only do load or store ordering
whereas smp_{load,store}_{acquire,release}() do load and store ordering.

And it doesn't help that we seemingly don't have a practical guideline
of the form "Generally prefer smp_load_acquire() and smp_store_release()
over smp_rmb() and smp_wmb()." written down anywhere. That really would
shortcut decisions such as this.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC] why do we need smp_rmb/smp_wmb pair in fd_install()/expand_fdtable()?
  2024-08-08 13:20   ` [RFC] why do we need smp_rmb/smp_wmb pair in fd_install()/expand_fdtable()? Christian Brauner
@ 2024-08-08 16:11     ` Linus Torvalds
  0 siblings, 0 replies; 27+ messages in thread
From: Linus Torvalds @ 2024-08-08 16:11 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Al Viro, linux-fsdevel, Mateusz Guzik, Eric Dumazet,
	Paul E. McKenney

On Thu, 8 Aug 2024 at 06:20, Christian Brauner <brauner@kernel.org> wrote:
>
> But then multiple times people brought up that supposedly smp_rmb() and
> smp_wmb() are cheaper because they only do load or store ordering
> whereas smp_{load,store}_{acquire,release}() do load and store ordering.

It really can go either way.

But I think we've reached a point where release/acquire is "typically
cheaper", and the reason is simply arm64.

As mentioned, on x86 none of this matters. And on older architectures
that were designed around the concept of separate memory barriers, the
rmb/wmb model thus matches that architecture model and tends to be
natural and likely the best impedance match.

But the arm64 memory ordering was created after people had figured out
the rules of good memory ordering, and so we have this:

   https://developer.arm.com/documentation/102336/0100/Load-Acquire-and-Store-Release-instructions

and this particular quote:

 "Weaker ordering requirements that are imposed by Load-Acquire and
  Store-Release instructions allow for micro-architectural
  optimizations, which could reduce some of the performance impacts that
  are otherwise imposed by an explicit memory barrier.

  If the ordering requirement is satisfied using either a Load-Acquire
  or Store-Release, then it would be preferable to use these
  instructions instead of a DMB"

iow we now have a relevant architecture that gets memory ordering
right, and that officially prefers release/acquire ordering.

End result: we *used* to prefer rmb/wmb pairs, because (a) it was how
we did memory ordering originally, (b) relevant architectures didn't
care, and (c) it matched the questionable architectures.

And now, in the last few years, the equation has simply shifted.

So rmb/wmb has gone from "this is the only way to do it" to "this is
the legacy way to do it and it performs ok everywhere" to "this is the
historical way that some people are more used to".

For new code, release/acquire is preferred. And if it's *critical*
code, maybe it's even worth converting from wmb/rmb to
release/acquire.

Partly because of that "it should be better on arm64", but also partly
because I think release/acquire is both a better model conceptually,
_and_ is more self-documenting (ie it's a nice explicit hand-off in
ways that some of our subtler "this wmb pairs with that rmb" code is
very much not at all self-documenting and needs very explicit and
clear comments).

Now, I'm not saying you shouldn't add a comment about a
release/acquire pair, but at the same time, the very fact that you
release a _particular_ variable and acquire that variable elsewhere
*is* a big clue. So when I'm saying it's "more self-documenting", I
want to emphasize that "more". I'm not claiming it's _completely_
self-documenting ;)

        Linus

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [RFC PATCH] fs: elide the smp_rmb fence in fd_install()
  2024-08-08  6:08       ` Mateusz Guzik
@ 2024-12-05 12:03         ` Mateusz Guzik
  2024-12-05 14:18           ` Al Viro
                             ` (2 more replies)
  0 siblings, 3 replies; 27+ messages in thread
From: Mateusz Guzik @ 2024-12-05 12:03 UTC (permalink / raw)
  To: paulmck
  Cc: brauner, viro, jack, linux-fsdevel, torvalds, edumazet,
	linux-kernel, Mateusz Guzik

See the added commentary for reasoning.

->resize_in_progress handling is moved inside of expand_fdtable() for
clarity.

Whacks an actual fence on arm64.

Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
---

To my reading of commentary above synchronize_rcu() this works fine(tm)
and there is even other code relying on the same idea (percpu rwsems
(see percpu_down_read for example), maybe there is more).

However, given that barriers like to be tricky and I know about squat of
RCU internals, I refer to Paul here.

Paul, does this work? If not, any trivial tweaks to make it so?

I mean smp_rmb looks dodgeable, at worst I made a mistake somewhere and
the specific patch does not work.

 fs/file.c | 50 +++++++++++++++++++++++++++++++++++++++-----------
 1 file changed, 39 insertions(+), 11 deletions(-)

diff --git a/fs/file.c b/fs/file.c
index 019fb9acf91b..d065a24980da 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -233,28 +233,54 @@ static int expand_fdtable(struct files_struct *files, unsigned int nr)
 	__acquires(files->file_lock)
 {
 	struct fdtable *new_fdt, *cur_fdt;
+	int err = 0;
 
+	BUG_ON(files->resize_in_progress);
+	files->resize_in_progress = true;
 	spin_unlock(&files->file_lock);
 	new_fdt = alloc_fdtable(nr + 1);
 
-	/* make sure all fd_install() have seen resize_in_progress
-	 * or have finished their rcu_read_lock_sched() section.
+	/*
+	 * Synchronize against the lockless fd_install().
+	 *
+	 * All work in that routine is enclosed with RCU sched section.
+	 *
+	 * We published ->resize_in_progress = true with the unlock above,
+	 * which makes new arrivals bail to locked operation.
+	 *
+	 * Now we only need to wait for CPUs which did not observe the flag to
+	 * leave and make sure their store to the fd table got published.
+	 *
+	 * We do it with synchronize_rcu(), which both waits for all sections to
+	 * finish (taking care of the first part) and guarantees all CPUs issued a
+	 * full fence (taking care of the second part).
+	 *
+	 * Note we know there is nobody to wait for if we are dealing with a
+	 * single-threaded process.
 	 */
 	if (atomic_read(&files->count) > 1)
 		synchronize_rcu();
 
 	spin_lock(&files->file_lock);
-	if (IS_ERR(new_fdt))
-		return PTR_ERR(new_fdt);
+	if (IS_ERR(new_fdt)) {
+		err = PTR_ERR(new_fdt);
+		goto out;
+	}
 	cur_fdt = files_fdtable(files);
 	BUG_ON(nr < cur_fdt->max_fds);
 	copy_fdtable(new_fdt, cur_fdt);
 	rcu_assign_pointer(files->fdt, new_fdt);
 	if (cur_fdt != &files->fdtab)
 		call_rcu(&cur_fdt->rcu, free_fdtable_rcu);
-	/* coupled with smp_rmb() in fd_install() */
+
+	/*
+	 * Publish everything before we unset ->resize_in_progress, see above
+	 * for an explanation.
+	 */
 	smp_wmb();
-	return 0;
+out:
+	files->resize_in_progress = false;
+	return err;
 }
 
 /*
@@ -290,9 +316,7 @@ static int expand_files(struct files_struct *files, unsigned int nr)
 		return -EMFILE;
 
 	/* All good, so we try */
-	files->resize_in_progress = true;
 	error = expand_fdtable(files, nr);
-	files->resize_in_progress = false;
 
 	wake_up_all(&files->resize_wait);
 	return error;
@@ -629,13 +653,18 @@ EXPORT_SYMBOL(put_unused_fd);
 
 void fd_install(unsigned int fd, struct file *file)
 {
-	struct files_struct *files = current->files;
+	struct files_struct *files;
 	struct fdtable *fdt;
 
 	if (WARN_ON_ONCE(unlikely(file->f_mode & FMODE_BACKING)))
 		return;
 
+	/*
+	 * Synchronized with expand_fdtable(), see that routine for an
+	 * explanation.
+	 */
 	rcu_read_lock_sched();
+	files = READ_ONCE(current->files);
 
 	if (unlikely(files->resize_in_progress)) {
 		rcu_read_unlock_sched();
@@ -646,8 +675,7 @@ void fd_install(unsigned int fd, struct file *file)
 		spin_unlock(&files->file_lock);
 		return;
 	}
-	/* coupled with smp_wmb() in expand_fdtable() */
-	smp_rmb();
+
 	fdt = rcu_dereference_sched(files->fdt);
 	BUG_ON(fdt->fd[fd] != NULL);
 	rcu_assign_pointer(fdt->fd[fd], file);
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH] fs: elide the smp_rmb fence in fd_install()
  2024-12-05 12:03         ` [RFC PATCH] fs: elide the smp_rmb fence in fd_install() Mateusz Guzik
@ 2024-12-05 14:18           ` Al Viro
  2024-12-05 14:43             ` Mateusz Guzik
  2024-12-05 14:46           ` Jan Kara
  2024-12-05 14:58           ` Christian Brauner
  2 siblings, 1 reply; 27+ messages in thread
From: Al Viro @ 2024-12-05 14:18 UTC (permalink / raw)
  To: Mateusz Guzik
  Cc: paulmck, brauner, jack, linux-fsdevel, torvalds, edumazet,
	linux-kernel

On Thu, Dec 05, 2024 at 01:03:32PM +0100, Mateusz Guzik wrote:
>  void fd_install(unsigned int fd, struct file *file)
>  {
> -	struct files_struct *files = current->files;
> +	struct files_struct *files;
>  	struct fdtable *fdt;
>  
>  	if (WARN_ON_ONCE(unlikely(file->f_mode & FMODE_BACKING)))
>  		return;
>  
> +	/*
> +	 * Synchronized with expand_fdtable(), see that routine for an
> +	 * explanation.
> +	 */
>  	rcu_read_lock_sched();
> +	files = READ_ONCE(current->files);

What are you trying to do with that READ_ONCE()?  current->files
itself is *not* changed by any of that code; current->files->fdtab is.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH] fs: elide the smp_rmb fence in fd_install()
  2024-12-05 14:18           ` Al Viro
@ 2024-12-05 14:43             ` Mateusz Guzik
  2024-12-05 18:41               ` Paul E. McKenney
  0 siblings, 1 reply; 27+ messages in thread
From: Mateusz Guzik @ 2024-12-05 14:43 UTC (permalink / raw)
  To: Al Viro
  Cc: paulmck, brauner, jack, linux-fsdevel, torvalds, edumazet,
	linux-kernel

On Thu, Dec 5, 2024 at 3:18 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> On Thu, Dec 05, 2024 at 01:03:32PM +0100, Mateusz Guzik wrote:
> >  void fd_install(unsigned int fd, struct file *file)
> >  {
> > -     struct files_struct *files = current->files;
> > +     struct files_struct *files;
> >       struct fdtable *fdt;
> >
> >       if (WARN_ON_ONCE(unlikely(file->f_mode & FMODE_BACKING)))
> >               return;
> >
> > +     /*
> > +      * Synchronized with expand_fdtable(), see that routine for an
> > +      * explanation.
> > +      */
> >       rcu_read_lock_sched();
> > +     files = READ_ONCE(current->files);
>
> What are you trying to do with that READ_ONCE()?  current->files
> itself is *not* changed by any of that code; current->files->fdtab is.

To my understanding this is the idiomatic way of spelling out the
non-existent in Linux smp_consume_load, for the resize_in_progress
flag.

Anyway to elaborate I'm gunning for a setup where the code is
semantically equivalent to having a lock around the work.
Pretend ->resize_lock exists, then:
fd_install:
files = current->files;
read_lock(files->resize_lock);
fdt = rcu_dereference_sched(files->fdt);
rcu_assign_pointer(fdt->fd[fd], file);
read_unlock(files->resize_lock);

expand_fdtable:
write_lock(files->resize_lock);
[snip]
rcu_assign_pointer(files->fdt, new_fdt);
write_unlock(files->resize_lock);

Except rcu_read_lock_sched + appropriately fenced resize_in_progress +
synchronize_rcu do it.

-- 
Mateusz Guzik <mjguzik gmail.com>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH] fs: elide the smp_rmb fence in fd_install()
  2024-12-05 12:03         ` [RFC PATCH] fs: elide the smp_rmb fence in fd_install() Mateusz Guzik
  2024-12-05 14:18           ` Al Viro
@ 2024-12-05 14:46           ` Jan Kara
  2024-12-05 15:01             ` Mateusz Guzik
  2024-12-05 14:58           ` Christian Brauner
  2 siblings, 1 reply; 27+ messages in thread
From: Jan Kara @ 2024-12-05 14:46 UTC (permalink / raw)
  To: Mateusz Guzik
  Cc: paulmck, brauner, viro, jack, linux-fsdevel, torvalds, edumazet,
	linux-kernel

On Thu 05-12-24 13:03:32, Mateusz Guzik wrote:
> See the added commentary for reasoning.
> 
> ->resize_in_progress handling is moved inside of expand_fdtable() for
> clarity.
> 
> Whacks an actual fence on arm64.
> 
> Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>

Hum, I don't think this works. What could happen now is:

CPU1					CPU2
expand_fdtable()			fd_install()
  files->resize_in_progress = true;
  ...
  if (atomic_read(&files->count) > 1)
    synchronize_rcu();
  ...
  rcu_assign_pointer(files->fdt, new_fdt);
  if (cur_fdt != &files->fdtab)
          call_rcu(&cur_fdt->rcu, free_fdtable_rcu);

					rcu_read_lock_sched()

					fdt = rcu_dereference_sched(files->fdt);
					/* Fetched old FD table - without
					 * smp_rmb() the read was reordered */
  rcu_assign_pointer(files->fdt, new_fdt);
  /*
   * Publish everything before we unset ->resize_in_progress, see above
   * for an explanation.
   */
  smp_wmb();
out:
  files->resize_in_progress = false;
					if (unlikely(files->resize_in_progress)) {
					  - false
					rcu_assign_pointer(fdt->fd[fd], file);
					  - store in the old table - boom.

								Honza

> diff --git a/fs/file.c b/fs/file.c
> index 019fb9acf91b..d065a24980da 100644
> --- a/fs/file.c
> +++ b/fs/file.c
> @@ -233,28 +233,54 @@ static int expand_fdtable(struct files_struct *files, unsigned int nr)
>  	__acquires(files->file_lock)
>  {
>  	struct fdtable *new_fdt, *cur_fdt;
> +	int err = 0;
>  
> +	BUG_ON(files->resize_in_progress);
> +	files->resize_in_progress = true;
>  	spin_unlock(&files->file_lock);
>  	new_fdt = alloc_fdtable(nr + 1);
>  
> -	/* make sure all fd_install() have seen resize_in_progress
> -	 * or have finished their rcu_read_lock_sched() section.
> +	/*
> +	 * Synchronize against the lockless fd_install().
> +	 *
> +	 * All work in that routine is enclosed with RCU sched section.
> +	 *
> +	 * We published ->resize_in_progress = true with the unlock above,
> +	 * which makes new arrivals bail to locked operation.
> +	 *
> +	 * Now we only need to wait for CPUs which did not observe the flag to
> +	 * leave and make sure their store to the fd table got published.
> +	 *
> +	 * We do it with synchronize_rcu(), which both waits for all sections to
> +	 * finish (taking care of the first part) and guarantees all CPUs issued a
> +	 * full fence (taking care of the second part).
> +	 *
> +	 * Note we know there is nobody to wait for if we are dealing with a
> +	 * single-threaded process.
>  	 */
>  	if (atomic_read(&files->count) > 1)
>  		synchronize_rcu();
>  
>  	spin_lock(&files->file_lock);
> -	if (IS_ERR(new_fdt))
> -		return PTR_ERR(new_fdt);
> +	if (IS_ERR(new_fdt)) {
> +		err = PTR_ERR(new_fdt);
> +		goto out;
> +	}
>  	cur_fdt = files_fdtable(files);
>  	BUG_ON(nr < cur_fdt->max_fds);
>  	copy_fdtable(new_fdt, cur_fdt);
>  	rcu_assign_pointer(files->fdt, new_fdt);
>  	if (cur_fdt != &files->fdtab)
>  		call_rcu(&cur_fdt->rcu, free_fdtable_rcu);
> -	/* coupled with smp_rmb() in fd_install() */
> +
> +	/*
> +	 * Publish everything before we unset ->resize_in_progress, see above
> +	 * for an explanation.
> +	 */
>  	smp_wmb();
> -	return 0;
> +out:
> +	files->resize_in_progress = false;
> +	return err;
>  }
>  
>  /*
> @@ -290,9 +316,7 @@ static int expand_files(struct files_struct *files, unsigned int nr)
>  		return -EMFILE;
>  
>  	/* All good, so we try */
> -	files->resize_in_progress = true;
>  	error = expand_fdtable(files, nr);
> -	files->resize_in_progress = false;
>  
>  	wake_up_all(&files->resize_wait);
>  	return error;
> @@ -629,13 +653,18 @@ EXPORT_SYMBOL(put_unused_fd);
>  
>  void fd_install(unsigned int fd, struct file *file)
>  {
> -	struct files_struct *files = current->files;
> +	struct files_struct *files;
>  	struct fdtable *fdt;
>  
>  	if (WARN_ON_ONCE(unlikely(file->f_mode & FMODE_BACKING)))
>  		return;
>  
> +	/*
> +	 * Synchronized with expand_fdtable(), see that routine for an
> +	 * explanation.
> +	 */
>  	rcu_read_lock_sched();
> +	files = READ_ONCE(current->files);
>  
>  	if (unlikely(files->resize_in_progress)) {
>  		rcu_read_unlock_sched();
> @@ -646,8 +675,7 @@ void fd_install(unsigned int fd, struct file *file)
>  		spin_unlock(&files->file_lock);
>  		return;
>  	}
> -	/* coupled with smp_wmb() in expand_fdtable() */
> -	smp_rmb();
> +
>  	fdt = rcu_dereference_sched(files->fdt);
>  	BUG_ON(fdt->fd[fd] != NULL);
>  	rcu_assign_pointer(fdt->fd[fd], file);
> -- 
> 2.43.0
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH] fs: elide the smp_rmb fence in fd_install()
  2024-12-05 12:03         ` [RFC PATCH] fs: elide the smp_rmb fence in fd_install() Mateusz Guzik
  2024-12-05 14:18           ` Al Viro
  2024-12-05 14:46           ` Jan Kara
@ 2024-12-05 14:58           ` Christian Brauner
  2024-12-05 15:06             ` Mateusz Guzik
  2 siblings, 1 reply; 27+ messages in thread
From: Christian Brauner @ 2024-12-05 14:58 UTC (permalink / raw)
  To: Mateusz Guzik
  Cc: paulmck, viro, jack, linux-fsdevel, torvalds, edumazet,
	linux-kernel

On Thu, Dec 05, 2024 at 01:03:32PM +0100, Mateusz Guzik wrote:
> See the added commentary for reasoning.
> 
> ->resize_in_progress handling is moved inside of expand_fdtable() for
> clarity.
> 
> Whacks an actual fence on arm64.
> 
> Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
> ---
> 
> To my reading of commentary above synchronize_rcu() this works fine(tm)
> and there is even other code relying on the same idea (percpu rwsems
> (see percpu_down_read for example), maybe there is more).
> 
> However, given that barriers like to be tricky and I know about squat of
> RCU internals, I refer to Paul here.
> 
> Paul, does this work? If not, any trivial tweaks to make it so?
> 
> I mean smp_rmb looks dodgeable, at worst I made a mistake somewhere and
> the specific patch does not work.
> 
>  fs/file.c | 50 +++++++++++++++++++++++++++++++++++++++-----------
>  1 file changed, 39 insertions(+), 11 deletions(-)
> 
> diff --git a/fs/file.c b/fs/file.c
> index 019fb9acf91b..d065a24980da 100644
> --- a/fs/file.c
> +++ b/fs/file.c
> @@ -233,28 +233,54 @@ static int expand_fdtable(struct files_struct *files, unsigned int nr)
>  	__acquires(files->file_lock)
>  {
>  	struct fdtable *new_fdt, *cur_fdt;
> +	int err = 0;
>  
> +	BUG_ON(files->resize_in_progress);

I think this BUG_ON() here is a bit unnecessary.

> +	files->resize_in_progress = true;

Minor: Why move that into expand_fdtable()? Having
files->resize_in_progress in here doesn't add much more clarity than
just having it set around expand_fdtable() in the caller.

Leaving it there also makes the patch smaller and clearer to follow as
you neither need the additional err nor the goto.

>  	spin_unlock(&files->file_lock);
>  	new_fdt = alloc_fdtable(nr + 1);
>  
> -	/* make sure all fd_install() have seen resize_in_progress
> -	 * or have finished their rcu_read_lock_sched() section.
> +	/*
> +	 * Synchronize against the lockless fd_install().
> +	 *
> +	 * All work in that routine is enclosed with RCU sched section.
> +	 *
> +	 * We published ->resize_in_progress = true with the unlock above,
> +	 * which makes new arrivals bail to locked operation.
> +	 *
> +	 * Now we only need to wait for CPUs which did not observe the flag to
> +	 * leave and make sure their store to the fd table got published.
> +	 *
> +	 * We do it with synchronize_rcu(), which both waits for all sections to
> +	 * finish (taking care of the first part) and guarantees all CPUs issued a
> +	 * full fence (taking care of the second part).
> +	 *
> +	 * Note we know there is nobody to wait for if we are dealing with a
> +	 * single-threaded process.
>  	 */
>  	if (atomic_read(&files->count) > 1)
>  		synchronize_rcu();
>  
>  	spin_lock(&files->file_lock);
> -	if (IS_ERR(new_fdt))
> -		return PTR_ERR(new_fdt);
> +	if (IS_ERR(new_fdt)) {
> +		err = PTR_ERR(new_fdt);
> +		goto out;
> +	}
>  	cur_fdt = files_fdtable(files);
>  	BUG_ON(nr < cur_fdt->max_fds);
>  	copy_fdtable(new_fdt, cur_fdt);
>  	rcu_assign_pointer(files->fdt, new_fdt);
>  	if (cur_fdt != &files->fdtab)
>  		call_rcu(&cur_fdt->rcu, free_fdtable_rcu);
> -	/* coupled with smp_rmb() in fd_install() */
> +
> +	/*
> +	 * Publish everything before we unset ->resize_in_progress, see above
> +	 * for an explanation.
> +	 */
>  	smp_wmb();
> -	return 0;
> +out:
> +	files->resize_in_progress = false;
> +	return err;
>  }
>  
>  /*
> @@ -290,9 +316,7 @@ static int expand_files(struct files_struct *files, unsigned int nr)
>  		return -EMFILE;
>  
>  	/* All good, so we try */
> -	files->resize_in_progress = true;
>  	error = expand_fdtable(files, nr);
> -	files->resize_in_progress = false;
>  
>  	wake_up_all(&files->resize_wait);
>  	return error;
> @@ -629,13 +653,18 @@ EXPORT_SYMBOL(put_unused_fd);
>  
>  void fd_install(unsigned int fd, struct file *file)
>  {
> -	struct files_struct *files = current->files;
> +	struct files_struct *files;
>  	struct fdtable *fdt;
>  
>  	if (WARN_ON_ONCE(unlikely(file->f_mode & FMODE_BACKING)))
>  		return;
>  
> +	/*
> +	 * Synchronized with expand_fdtable(), see that routine for an
> +	 * explanation.
> +	 */
>  	rcu_read_lock_sched();
> +	files = READ_ONCE(current->files);
>  
>  	if (unlikely(files->resize_in_progress)) {
>  		rcu_read_unlock_sched();
> @@ -646,8 +675,7 @@ void fd_install(unsigned int fd, struct file *file)
>  		spin_unlock(&files->file_lock);
>  		return;
>  	}
> -	/* coupled with smp_wmb() in expand_fdtable() */
> -	smp_rmb();
> +
>  	fdt = rcu_dereference_sched(files->fdt);
>  	BUG_ON(fdt->fd[fd] != NULL);
>  	rcu_assign_pointer(fdt->fd[fd], file);
> -- 
> 2.43.0
> 

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH] fs: elide the smp_rmb fence in fd_install()
  2024-12-05 14:46           ` Jan Kara
@ 2024-12-05 15:01             ` Mateusz Guzik
  2024-12-05 15:29               ` Jan Kara
  0 siblings, 1 reply; 27+ messages in thread
From: Mateusz Guzik @ 2024-12-05 15:01 UTC (permalink / raw)
  To: Jan Kara
  Cc: paulmck, brauner, viro, linux-fsdevel, torvalds, edumazet,
	linux-kernel

On Thu, Dec 5, 2024 at 3:46 PM Jan Kara <jack@suse.cz> wrote:
>
> On Thu 05-12-24 13:03:32, Mateusz Guzik wrote:
> > See the added commentary for reasoning.
> >
> > ->resize_in_progress handling is moved inside of expand_fdtable() for
> > clarity.
> >
> > Whacks an actual fence on arm64.
> >
> > Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
>
> Hum, I don't think this works. What could happen now is:
>
> CPU1                                    CPU2
> expand_fdtable()                        fd_install()
>   files->resize_in_progress = true;
>   ...
>   if (atomic_read(&files->count) > 1)
>     synchronize_rcu();
>   ...
>   rcu_assign_pointer(files->fdt, new_fdt);
>   if (cur_fdt != &files->fdtab)
>           call_rcu(&cur_fdt->rcu, free_fdtable_rcu);
>
>                                         rcu_read_lock_sched()
>
>                                         fdt = rcu_dereference_sched(files->fdt);
>                                         /* Fetched old FD table - without
>                                          * smp_rmb() the read was reordered */
>   rcu_assign_pointer(files->fdt, new_fdt);
>   /*
>    * Publish everything before we unset ->resize_in_progress, see above
>    * for an explanation.
>    */
>   smp_wmb();
> out:
>   files->resize_in_progress = false;
>                                         if (unlikely(files->resize_in_progress)) {
>                                           - false
>                                         rcu_assign_pointer(fdt->fd[fd], file);
>                                           - store in the old table - boom.
>

I don't believe this ordering is possible because of both
synchronize_rcu and the fence before updating resize_in_progress.

Any CPU which could try racing like that had to come in after the
synchronize_rcu() call, meaning one of the 3 possibilities:
- the flag is true and the fd table is old
- the flag is true and the fd table is new
- the flag is false and the fd table is new

Suppose the CPU reordered loads of the flag and the fd table. There is
no ordering in which it can see both the old table and the unset flag.


--
Mateusz Guzik <mjguzik gmail.com>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH] fs: elide the smp_rmb fence in fd_install()
  2024-12-05 14:58           ` Christian Brauner
@ 2024-12-05 15:06             ` Mateusz Guzik
  0 siblings, 0 replies; 27+ messages in thread
From: Mateusz Guzik @ 2024-12-05 15:06 UTC (permalink / raw)
  To: Christian Brauner
  Cc: paulmck, viro, jack, linux-fsdevel, torvalds, edumazet,
	linux-kernel

On Thu, Dec 5, 2024 at 3:58 PM Christian Brauner <brauner@kernel.org> wrote:
>
> On Thu, Dec 05, 2024 at 01:03:32PM +0100, Mateusz Guzik wrote:
> > +     BUG_ON(files->resize_in_progress);
>
> I think this BUG_ON() here is a bit unnecessary.
>

I'm an assert-friendly guy, but I'm not going to argue about this one.

> > +     files->resize_in_progress = true;
>
> Minor: Why move that into expand_fdtable()? Having
> files->resize_in_progress in here doesn't add much more clarity than
> just having it set around expand_fdtable() in the caller.
>
> Leaving it there also makes the patch smaller and clearer to follow as
> you neither need the additional err nor the goto.

The is indeed not the best looking goto in the world, but per the
commit message I moved the flag in so that it can be easier referred
to when describing synchronisation against fd_install.

-- 
Mateusz Guzik <mjguzik gmail.com>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH] fs: elide the smp_rmb fence in fd_install()
  2024-12-05 15:01             ` Mateusz Guzik
@ 2024-12-05 15:29               ` Jan Kara
  2024-12-05 15:36                 ` Mateusz Guzik
  0 siblings, 1 reply; 27+ messages in thread
From: Jan Kara @ 2024-12-05 15:29 UTC (permalink / raw)
  To: Mateusz Guzik
  Cc: Jan Kara, paulmck, brauner, viro, linux-fsdevel, torvalds,
	edumazet, linux-kernel

On Thu 05-12-24 16:01:07, Mateusz Guzik wrote:
> On Thu, Dec 5, 2024 at 3:46 PM Jan Kara <jack@suse.cz> wrote:
> >
> > On Thu 05-12-24 13:03:32, Mateusz Guzik wrote:
> > > See the added commentary for reasoning.
> > >
> > > ->resize_in_progress handling is moved inside of expand_fdtable() for
> > > clarity.
> > >
> > > Whacks an actual fence on arm64.
> > >
> > > Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
> >
> > Hum, I don't think this works. What could happen now is:
> >
> > CPU1                                    CPU2
> > expand_fdtable()                        fd_install()
> >   files->resize_in_progress = true;
> >   ...
> >   if (atomic_read(&files->count) > 1)
> >     synchronize_rcu();
> >   ...
> >   rcu_assign_pointer(files->fdt, new_fdt);
> >   if (cur_fdt != &files->fdtab)
> >           call_rcu(&cur_fdt->rcu, free_fdtable_rcu);
> >
> >                                         rcu_read_lock_sched()
> >
> >                                         fdt = rcu_dereference_sched(files->fdt);
> >                                         /* Fetched old FD table - without
> >                                          * smp_rmb() the read was reordered */
> >   rcu_assign_pointer(files->fdt, new_fdt);
> >   /*
> >    * Publish everything before we unset ->resize_in_progress, see above
> >    * for an explanation.
> >    */
> >   smp_wmb();
> > out:
> >   files->resize_in_progress = false;
> >                                         if (unlikely(files->resize_in_progress)) {
> >                                           - false
> >                                         rcu_assign_pointer(fdt->fd[fd], file);
> >                                           - store in the old table - boom.
> >
> 
> I don't believe this ordering is possible because of both
> synchronize_rcu and the fence before updating resize_in_progress.
> 
> Any CPU which could try racing like that had to come in after the
> synchronize_rcu() call, meaning one of the 3 possibilities:
> - the flag is true and the fd table is old
> - the flag is true and the fd table is new
> - the flag is false and the fd table is new

I agree here.

> Suppose the CPU reordered loads of the flag and the fd table. There is
> no ordering in which it can see both the old table and the unset flag.

But I disagree here. If the reads are reordered, then the fd table read can
happen during the "flag is true and the fd table is old" state and the flag
read can happen later in "flag is false and the fd table is new" state.
Just as I outlined above...

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH] fs: elide the smp_rmb fence in fd_install()
  2024-12-05 15:29               ` Jan Kara
@ 2024-12-05 15:36                 ` Mateusz Guzik
  2024-12-06 15:32                   ` Jan Kara
  0 siblings, 1 reply; 27+ messages in thread
From: Mateusz Guzik @ 2024-12-05 15:36 UTC (permalink / raw)
  To: Jan Kara
  Cc: paulmck, brauner, viro, linux-fsdevel, torvalds, edumazet,
	linux-kernel

On Thu, Dec 5, 2024 at 4:29 PM Jan Kara <jack@suse.cz> wrote:
> On Thu 05-12-24 16:01:07, Mateusz Guzik wrote:
> > Suppose the CPU reordered loads of the flag and the fd table. There is
> > no ordering in which it can see both the old table and the unset flag.
>
> But I disagree here. If the reads are reordered, then the fd table read can
> happen during the "flag is true and the fd table is old" state and the flag
> read can happen later in "flag is false and the fd table is new" state.
> Just as I outlined above...

In your example all the work happens *after* synchronize_rcu(). The
thread resizing the table already published the result even before
calling into it. Furthermore by the time synchronize_rcu returns
everyone is guaranteed to have issued a full fence. Meaning nobody can
see the flag as unset.

-- 
Mateusz Guzik <mjguzik gmail.com>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH] fs: elide the smp_rmb fence in fd_install()
  2024-12-05 14:43             ` Mateusz Guzik
@ 2024-12-05 18:41               ` Paul E. McKenney
  2024-12-05 19:03                 ` Mateusz Guzik
  2024-12-05 19:26                 ` Linus Torvalds
  0 siblings, 2 replies; 27+ messages in thread
From: Paul E. McKenney @ 2024-12-05 18:41 UTC (permalink / raw)
  To: Mateusz Guzik
  Cc: Al Viro, brauner, jack, linux-fsdevel, torvalds, edumazet,
	linux-kernel

On Thu, Dec 05, 2024 at 03:43:41PM +0100, Mateusz Guzik wrote:
> On Thu, Dec 5, 2024 at 3:18 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > On Thu, Dec 05, 2024 at 01:03:32PM +0100, Mateusz Guzik wrote:
> > >  void fd_install(unsigned int fd, struct file *file)
> > >  {
> > > -     struct files_struct *files = current->files;
> > > +     struct files_struct *files;
> > >       struct fdtable *fdt;
> > >
> > >       if (WARN_ON_ONCE(unlikely(file->f_mode & FMODE_BACKING)))
> > >               return;
> > >
> > > +     /*
> > > +      * Synchronized with expand_fdtable(), see that routine for an
> > > +      * explanation.
> > > +      */
> > >       rcu_read_lock_sched();
> > > +     files = READ_ONCE(current->files);
> >
> > What are you trying to do with that READ_ONCE()?  current->files
> > itself is *not* changed by any of that code; current->files->fdtab is.
> 
> To my understanding this is the idiomatic way of spelling out the
> non-existent in Linux smp_consume_load, for the resize_in_progress
> flag.

In Linus, "smp_consume_load()" is named rcu_dereference().

> Anyway to elaborate I'm gunning for a setup where the code is
> semantically equivalent to having a lock around the work.

Except that rcu_read_lock_sched() provides mutual-exclusion guarantees
only with later RCU grace periods, such as those implemented by
synchronize_rcu().

> Pretend ->resize_lock exists, then:
> fd_install:
> files = current->files;
> read_lock(files->resize_lock);
> fdt = rcu_dereference_sched(files->fdt);
> rcu_assign_pointer(fdt->fd[fd], file);
> read_unlock(files->resize_lock);
> 
> expand_fdtable:
> write_lock(files->resize_lock);
> [snip]
> rcu_assign_pointer(files->fdt, new_fdt);
> write_unlock(files->resize_lock);
> 
> Except rcu_read_lock_sched + appropriately fenced resize_in_progress +
> synchronize_rcu do it.

OK, good, you did get the grace-period part of the puzzle.

Howver, please keep in mind that synchronize_rcu() has significant
latency by design.  There is a tradeoff between CPU consumption and
latency, and synchronize_rcu() therefore has latencies ranging upwards of
several milliseconds (not microseconds or nanoseconds).  I would be very
surprised if expand_fdtable() users would be happy with such a long delay.
Or are you using some trick to hide this delay?

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH] fs: elide the smp_rmb fence in fd_install()
  2024-12-05 18:41               ` Paul E. McKenney
@ 2024-12-05 19:03                 ` Mateusz Guzik
  2024-12-05 20:01                   ` Paul E. McKenney
  2024-12-05 19:26                 ` Linus Torvalds
  1 sibling, 1 reply; 27+ messages in thread
From: Mateusz Guzik @ 2024-12-05 19:03 UTC (permalink / raw)
  To: paulmck
  Cc: Al Viro, brauner, jack, linux-fsdevel, torvalds, edumazet,
	linux-kernel

On Thu, Dec 5, 2024 at 7:41 PM Paul E. McKenney <paulmck@kernel.org> wrote:
>
> On Thu, Dec 05, 2024 at 03:43:41PM +0100, Mateusz Guzik wrote:
> > On Thu, Dec 5, 2024 at 3:18 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
> > >
> > > On Thu, Dec 05, 2024 at 01:03:32PM +0100, Mateusz Guzik wrote:
> > > >  void fd_install(unsigned int fd, struct file *file)
> > > >  {
> > > > -     struct files_struct *files = current->files;
> > > > +     struct files_struct *files;
> > > >       struct fdtable *fdt;
> > > >
> > > >       if (WARN_ON_ONCE(unlikely(file->f_mode & FMODE_BACKING)))
> > > >               return;
> > > >
> > > > +     /*
> > > > +      * Synchronized with expand_fdtable(), see that routine for an
> > > > +      * explanation.
> > > > +      */
> > > >       rcu_read_lock_sched();
> > > > +     files = READ_ONCE(current->files);
> > >
> > > What are you trying to do with that READ_ONCE()?  current->files
> > > itself is *not* changed by any of that code; current->files->fdtab is.
> >
> > To my understanding this is the idiomatic way of spelling out the
> > non-existent in Linux smp_consume_load, for the resize_in_progress
> > flag.
>
> In Linus, "smp_consume_load()" is named rcu_dereference().
>

ok

> > Anyway to elaborate I'm gunning for a setup where the code is
> > semantically equivalent to having a lock around the work.
>
> Except that rcu_read_lock_sched() provides mutual-exclusion guarantees
> only with later RCU grace periods, such as those implemented by
> synchronize_rcu().
>

To my understanding the pre-case is already with the flag set upfront
and waiting for everyone to finish (which is already taking place in
stock code) + looking at it within the section.

> > Pretend ->resize_lock exists, then:
> > fd_install:
> > files = current->files;
> > read_lock(files->resize_lock);
> > fdt = rcu_dereference_sched(files->fdt);
> > rcu_assign_pointer(fdt->fd[fd], file);
> > read_unlock(files->resize_lock);
> >
> > expand_fdtable:
> > write_lock(files->resize_lock);
> > [snip]
> > rcu_assign_pointer(files->fdt, new_fdt);
> > write_unlock(files->resize_lock);
> >
> > Except rcu_read_lock_sched + appropriately fenced resize_in_progress +
> > synchronize_rcu do it.
>
> OK, good, you did get the grace-period part of the puzzle.
>
> Howver, please keep in mind that synchronize_rcu() has significant
> latency by design.  There is a tradeoff between CPU consumption and
> latency, and synchronize_rcu() therefore has latencies ranging upwards of
> several milliseconds (not microseconds or nanoseconds).  I would be very
> surprised if expand_fdtable() users would be happy with such a long delay.

The call is already there since 2015 and I only know of one case where
someone took an issue with it (and it could have been sorted out with
dup2 upfront to grow the table to the desired size). Amusingly I see
you patched it in 2018 from synchronize_sched to synchronize_rcu.
Bottom line though is that I'm not *adding* it. latency here. :)

So assuming the above can be ignored, do you confirm the patch works
(even if it needs some cosmetic changes)?

The entirety of the patch is about removing smp_rmb in fd_install with
small code rearrangement, while relying on the machinery which is
already there.

-- 
Mateusz Guzik <mjguzik gmail.com>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH] fs: elide the smp_rmb fence in fd_install()
  2024-12-05 18:41               ` Paul E. McKenney
  2024-12-05 19:03                 ` Mateusz Guzik
@ 2024-12-05 19:26                 ` Linus Torvalds
  2024-12-05 19:47                   ` Mateusz Guzik
  2024-12-05 20:06                   ` Paul E. McKenney
  1 sibling, 2 replies; 27+ messages in thread
From: Linus Torvalds @ 2024-12-05 19:26 UTC (permalink / raw)
  To: paulmck
  Cc: Mateusz Guzik, Al Viro, brauner, jack, linux-fsdevel, edumazet,
	linux-kernel

On Thu, 5 Dec 2024 at 10:41, Paul E. McKenney <paulmck@kernel.org> wrote:
>
> > To my understanding this is the idiomatic way of spelling out the
> > non-existent in Linux smp_consume_load, for the resize_in_progress
> > flag.
>
> In Linus, "smp_consume_load()" is named rcu_dereference().

Linux.

But yes and no.

It's worth making it really really clear that "rcu_dereference()" is
*not* just a different name for some "smp_consume_load()" operation.

Why? Because a true smp_consume_load() would work with any random kind
of flags etc. And rcu_dereference() works only because it's a pointer,
and there's an inherent data dependency to what the result points to.

Paul obviously knows this, but let's make it very clear in this
discussion, because if somebody decided "I want a smp_consume_load(),
and I'll use rcu_dereference() to do that", the end result would
simply not work for arbitrary data, like a flags field or something,
where comparing it against a value will only result in a control
dependency, not an actual data dependency.

             Linus

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH] fs: elide the smp_rmb fence in fd_install()
  2024-12-05 19:26                 ` Linus Torvalds
@ 2024-12-05 19:47                   ` Mateusz Guzik
  2024-12-05 20:11                     ` Paul E. McKenney
  2024-12-05 20:06                   ` Paul E. McKenney
  1 sibling, 1 reply; 27+ messages in thread
From: Mateusz Guzik @ 2024-12-05 19:47 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: paulmck, Al Viro, brauner, jack, linux-fsdevel, edumazet,
	linux-kernel

On Thu, Dec 5, 2024 at 8:26 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Thu, 5 Dec 2024 at 10:41, Paul E. McKenney <paulmck@kernel.org> wrote:
> >
> > > To my understanding this is the idiomatic way of spelling out the
> > > non-existent in Linux smp_consume_load, for the resize_in_progress
> > > flag.
> >
> > In Linus, "smp_consume_load()" is named rcu_dereference().
>
> Linux.
>
> But yes and no.
>
> It's worth making it really really clear that "rcu_dereference()" is
> *not* just a different name for some "smp_consume_load()" operation.
>
> Why? Because a true smp_consume_load() would work with any random kind
> of flags etc. And rcu_dereference() works only because it's a pointer,
> and there's an inherent data dependency to what the result points to.
>
> Paul obviously knows this, but let's make it very clear in this
> discussion, because if somebody decided "I want a smp_consume_load(),
> and I'll use rcu_dereference() to do that", the end result would
> simply not work for arbitrary data, like a flags field or something,
> where comparing it against a value will only result in a control
> dependency, not an actual data dependency.
>

So I checked for kicks and rcu_dereference comes with type checking,
as in passing something which is not a pointer even fails to compile.

I'll note thought that a smp_load_consume_ptr or similarly named
routine would be nice and I'm rather confused why it was not added
given smp_load_acquire and smp_store_release being there.

One immediate user would be mnt_idmap(), like so:
iff --git a/include/linux/mount.h b/include/linux/mount.h
index 33f17b6e8732..4d3486ff67ed 100644
--- a/include/linux/mount.h
+++ b/include/linux/mount.h
@@ -76,7 +76,7 @@ struct vfsmount {
 static inline struct mnt_idmap *mnt_idmap(const struct vfsmount *mnt)
 {
        /* Pairs with smp_store_release() in do_idmap_mount(). */
-       return READ_ONCE(mnt->mnt_idmap);
+       return smp_load_consume_ptr(mnt->mnt_idmap);
 }

 extern int mnt_want_write(struct vfsmount *mnt);


-- 
Mateusz Guzik <mjguzik gmail.com>

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH] fs: elide the smp_rmb fence in fd_install()
  2024-12-05 19:03                 ` Mateusz Guzik
@ 2024-12-05 20:01                   ` Paul E. McKenney
  2024-12-05 20:15                     ` Mateusz Guzik
  0 siblings, 1 reply; 27+ messages in thread
From: Paul E. McKenney @ 2024-12-05 20:01 UTC (permalink / raw)
  To: Mateusz Guzik
  Cc: Al Viro, brauner, jack, linux-fsdevel, torvalds, edumazet,
	linux-kernel

On Thu, Dec 05, 2024 at 08:03:24PM +0100, Mateusz Guzik wrote:
> On Thu, Dec 5, 2024 at 7:41 PM Paul E. McKenney <paulmck@kernel.org> wrote:
> >
> > On Thu, Dec 05, 2024 at 03:43:41PM +0100, Mateusz Guzik wrote:
> > > On Thu, Dec 5, 2024 at 3:18 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
> > > >
> > > > On Thu, Dec 05, 2024 at 01:03:32PM +0100, Mateusz Guzik wrote:
> > > > >  void fd_install(unsigned int fd, struct file *file)
> > > > >  {
> > > > > -     struct files_struct *files = current->files;
> > > > > +     struct files_struct *files;
> > > > >       struct fdtable *fdt;
> > > > >
> > > > >       if (WARN_ON_ONCE(unlikely(file->f_mode & FMODE_BACKING)))
> > > > >               return;
> > > > >
> > > > > +     /*
> > > > > +      * Synchronized with expand_fdtable(), see that routine for an
> > > > > +      * explanation.
> > > > > +      */
> > > > >       rcu_read_lock_sched();
> > > > > +     files = READ_ONCE(current->files);
> > > >
> > > > What are you trying to do with that READ_ONCE()?  current->files
> > > > itself is *not* changed by any of that code; current->files->fdtab is.
> > >
> > > To my understanding this is the idiomatic way of spelling out the
> > > non-existent in Linux smp_consume_load, for the resize_in_progress
> > > flag.
> >
> > In Linus, "smp_consume_load()" is named rcu_dereference().
> 
> ok

And rcu_dereference(), and for that matter memory_order_consume, only
orders the load of the pointer against subsequent dereferences of that
same pointer against dereferences of that same pointer preceding the
store of that pointer.

	T1				T2
	a: p->a = 1;			d: q = rcu_dereference(gp);
	b: r1 = p->b;			e: r2 = p->a;
	c: rcu_assign_pointer(gp, p);	f: p->b = 42;

Here, if (and only if!) T2's load into q gets the value stored by
T1, then T1's statements e and f are guaranteed to happen after T2's
statements a and b.  In your patch, I do not see this pattern for the
files->resize_in_progress flag.

> > > Anyway to elaborate I'm gunning for a setup where the code is
> > > semantically equivalent to having a lock around the work.
> >
> > Except that rcu_read_lock_sched() provides mutual-exclusion guarantees
> > only with later RCU grace periods, such as those implemented by
> > synchronize_rcu().
> 
> To my understanding the pre-case is already with the flag set upfront
> and waiting for everyone to finish (which is already taking place in
> stock code) + looking at it within the section.

I freely confess that I do not understand the purpose of assigning to
files->resize_in_progress both before (pre-existing) and within (added)
expand_fdtable().  If the assignments before and after the call to
expand_fdtable() and the checks were under that lock, that could work,
but removing that lockless check might have performance and scalability
consequences.

> > > Pretend ->resize_lock exists, then:
> > > fd_install:
> > > files = current->files;
> > > read_lock(files->resize_lock);
> > > fdt = rcu_dereference_sched(files->fdt);
> > > rcu_assign_pointer(fdt->fd[fd], file);
> > > read_unlock(files->resize_lock);
> > >
> > > expand_fdtable:
> > > write_lock(files->resize_lock);
> > > [snip]
> > > rcu_assign_pointer(files->fdt, new_fdt);
> > > write_unlock(files->resize_lock);
> > >
> > > Except rcu_read_lock_sched + appropriately fenced resize_in_progress +
> > > synchronize_rcu do it.
> >
> > OK, good, you did get the grace-period part of the puzzle.
> >
> > Howver, please keep in mind that synchronize_rcu() has significant
> > latency by design.  There is a tradeoff between CPU consumption and
> > latency, and synchronize_rcu() therefore has latencies ranging upwards of
> > several milliseconds (not microseconds or nanoseconds).  I would be very
> > surprised if expand_fdtable() users would be happy with such a long delay.
> 
> The call is already there since 2015 and I only know of one case where
> someone took an issue with it (and it could have been sorted out with
> dup2 upfront to grow the table to the desired size). Amusingly I see
> you patched it in 2018 from synchronize_sched to synchronize_rcu.
> Bottom line though is that I'm not *adding* it. latency here. :)

Are you saying that the smp_rmb() is unnecessary?  It doesn't seem like
you are saying that, because otherwise your patch could simply remove
it without additional code changes.  On the other hand, if it is a key
component of the synchronization, I don't see how that smp_rmb() can be
removed while still preserving that synchronization without adding another
synchronize_rcu() to that function to compensate.

Now, it might be that you are somehow cleverly reusing the pre-existing
synchronize_rcu(), but I am not immediately seeing how this would work.

And no, I do not recall making that particular change back in the
day, only that I did change all the calls to synchronize_sched() to
synchronize_rcu().  Please accept my apologies for my having failed
to meet your expectations.  And do not be too surprised if others have
similar expectations of you at some point in the future.  ;-)

> So assuming the above can be ignored, do you confirm the patch works
> (even if it needs some cosmetic changes)?
> 
> The entirety of the patch is about removing smp_rmb in fd_install with
> small code rearrangement, while relying on the machinery which is
> already there.

The code to be synchronized is fairly small.  So why don't you
create a litmus test and ask herd7?  Please see tools/memory-model for
documentation and other example litmus tests.  This tool does the moral
equivalent of a full state-space search of the litmus tests, telling you
whether your "exists" condition is always, sometimes, or never satisfied.

							Thnax, Paul

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH] fs: elide the smp_rmb fence in fd_install()
  2024-12-05 19:26                 ` Linus Torvalds
  2024-12-05 19:47                   ` Mateusz Guzik
@ 2024-12-05 20:06                   ` Paul E. McKenney
  1 sibling, 0 replies; 27+ messages in thread
From: Paul E. McKenney @ 2024-12-05 20:06 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Mateusz Guzik, Al Viro, brauner, jack, linux-fsdevel, edumazet,
	linux-kernel

On Thu, Dec 05, 2024 at 11:26:35AM -0800, Linus Torvalds wrote:
> On Thu, 5 Dec 2024 at 10:41, Paul E. McKenney <paulmck@kernel.org> wrote:
> >
> > > To my understanding this is the idiomatic way of spelling out the
> > > non-existent in Linux smp_consume_load, for the resize_in_progress
> > > flag.
> >
> > In Linus, "smp_consume_load()" is named rcu_dereference().
> 
> Linux.

One of those days...  ;-)

> But yes and no.
> 
> It's worth making it really really clear that "rcu_dereference()" is
> *not* just a different name for some "smp_consume_load()" operation.
> 
> Why? Because a true smp_consume_load() would work with any random kind
> of flags etc. And rcu_dereference() works only because it's a pointer,
> and there's an inherent data dependency to what the result points to.
> 
> Paul obviously knows this, but let's make it very clear in this
> discussion, because if somebody decided "I want a smp_consume_load(),
> and I'll use rcu_dereference() to do that", the end result would
> simply not work for arbitrary data, like a flags field or something,
> where comparing it against a value will only result in a control
> dependency, not an actual data dependency.

Fair points!

And Linus (and Linux, for that matter) equally obviously already knows
this, but please note also that an smp_load_consume() would still order
only later dereferences of the thing returned from smp_load_consume(),
which means that it pretty much needs to be a pointer.  (Yes, in theory,
it could be an array index, but in practice compilers know way too much
about integer arithmetic for this to be advisable.)

							Thanx, Paul

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH] fs: elide the smp_rmb fence in fd_install()
  2024-12-05 19:47                   ` Mateusz Guzik
@ 2024-12-05 20:11                     ` Paul E. McKenney
  2024-12-06 12:11                       ` Christian Brauner
  0 siblings, 1 reply; 27+ messages in thread
From: Paul E. McKenney @ 2024-12-05 20:11 UTC (permalink / raw)
  To: Mateusz Guzik
  Cc: Linus Torvalds, Al Viro, brauner, jack, linux-fsdevel, edumazet,
	linux-kernel

On Thu, Dec 05, 2024 at 08:47:24PM +0100, Mateusz Guzik wrote:
> On Thu, Dec 5, 2024 at 8:26 PM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > On Thu, 5 Dec 2024 at 10:41, Paul E. McKenney <paulmck@kernel.org> wrote:
> > >
> > > > To my understanding this is the idiomatic way of spelling out the
> > > > non-existent in Linux smp_consume_load, for the resize_in_progress
> > > > flag.
> > >
> > > In Linus, "smp_consume_load()" is named rcu_dereference().
> >
> > Linux.
> >
> > But yes and no.
> >
> > It's worth making it really really clear that "rcu_dereference()" is
> > *not* just a different name for some "smp_consume_load()" operation.
> >
> > Why? Because a true smp_consume_load() would work with any random kind
> > of flags etc. And rcu_dereference() works only because it's a pointer,
> > and there's an inherent data dependency to what the result points to.
> >
> > Paul obviously knows this, but let's make it very clear in this
> > discussion, because if somebody decided "I want a smp_consume_load(),
> > and I'll use rcu_dereference() to do that", the end result would
> > simply not work for arbitrary data, like a flags field or something,
> > where comparing it against a value will only result in a control
> > dependency, not an actual data dependency.
> 
> So I checked for kicks and rcu_dereference comes with type checking,
> as in passing something which is not a pointer even fails to compile.

That is by design, keeping in mind that consume loads order only
later dereferences against the pointer load.

> I'll note thought that a smp_load_consume_ptr or similarly named
> routine would be nice and I'm rather confused why it was not added
> given smp_load_acquire and smp_store_release being there.

In recent kernels, READ_ONCE() is your smp_load_consume_ptr().  Or things
like rcu_dereference_check(p, 1).  But these can be used only when the
pointed-to object is guaranteed to live (as in not be freed) for the
full duration of the read-side use of that pointer.

> One immediate user would be mnt_idmap(), like so:
> iff --git a/include/linux/mount.h b/include/linux/mount.h
> index 33f17b6e8732..4d3486ff67ed 100644
> --- a/include/linux/mount.h
> +++ b/include/linux/mount.h
> @@ -76,7 +76,7 @@ struct vfsmount {
>  static inline struct mnt_idmap *mnt_idmap(const struct vfsmount *mnt)
>  {
>         /* Pairs with smp_store_release() in do_idmap_mount(). */
> -       return READ_ONCE(mnt->mnt_idmap);
> +       return smp_load_consume_ptr(mnt->mnt_idmap);
>  }
> 
>  extern int mnt_want_write(struct vfsmount *mnt);

These would have the same semantics.  And in v6.12, this is instead
smp_load_acquire().

							Thanx, Paul

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH] fs: elide the smp_rmb fence in fd_install()
  2024-12-05 20:01                   ` Paul E. McKenney
@ 2024-12-05 20:15                     ` Mateusz Guzik
  2024-12-05 21:17                       ` Paul E. McKenney
  0 siblings, 1 reply; 27+ messages in thread
From: Mateusz Guzik @ 2024-12-05 20:15 UTC (permalink / raw)
  To: paulmck
  Cc: Al Viro, brauner, jack, linux-fsdevel, torvalds, edumazet,
	linux-kernel

On Thu, Dec 5, 2024 at 9:01 PM Paul E. McKenney <paulmck@kernel.org> wrote:
>
> On Thu, Dec 05, 2024 at 08:03:24PM +0100, Mateusz Guzik wrote:
> > On Thu, Dec 5, 2024 at 7:41 PM Paul E. McKenney <paulmck@kernel.org> wrote:
> > >
> > > On Thu, Dec 05, 2024 at 03:43:41PM +0100, Mateusz Guzik wrote:
> > > > On Thu, Dec 5, 2024 at 3:18 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
> > > > >
> > > > > On Thu, Dec 05, 2024 at 01:03:32PM +0100, Mateusz Guzik wrote:
> > > > > >  void fd_install(unsigned int fd, struct file *file)
> > > > > >  {
> > > > > > -     struct files_struct *files = current->files;
> > > > > > +     struct files_struct *files;
> > > > > >       struct fdtable *fdt;
> > > > > >
> > > > > >       if (WARN_ON_ONCE(unlikely(file->f_mode & FMODE_BACKING)))
> > > > > >               return;
> > > > > >
> > > > > > +     /*
> > > > > > +      * Synchronized with expand_fdtable(), see that routine for an
> > > > > > +      * explanation.
> > > > > > +      */
> > > > > >       rcu_read_lock_sched();
> > > > > > +     files = READ_ONCE(current->files);
> > > > >
> > > > > What are you trying to do with that READ_ONCE()?  current->files
> > > > > itself is *not* changed by any of that code; current->files->fdtab is.
> > > >
> > > > To my understanding this is the idiomatic way of spelling out the
> > > > non-existent in Linux smp_consume_load, for the resize_in_progress
> > > > flag.
> > >
> > > In Linus, "smp_consume_load()" is named rcu_dereference().
> >
> > ok
>
> And rcu_dereference(), and for that matter memory_order_consume, only
> orders the load of the pointer against subsequent dereferences of that
> same pointer against dereferences of that same pointer preceding the
> store of that pointer.
>
>         T1                              T2
>         a: p->a = 1;                    d: q = rcu_dereference(gp);
>         b: r1 = p->b;                   e: r2 = p->a;
>         c: rcu_assign_pointer(gp, p);   f: p->b = 42;
>
> Here, if (and only if!) T2's load into q gets the value stored by
> T1, then T1's statements e and f are guaranteed to happen after T2's
> statements a and b.  In your patch, I do not see this pattern for the
> files->resize_in_progress flag.
>
> > > > Anyway to elaborate I'm gunning for a setup where the code is
> > > > semantically equivalent to having a lock around the work.
> > >
> > > Except that rcu_read_lock_sched() provides mutual-exclusion guarantees
> > > only with later RCU grace periods, such as those implemented by
> > > synchronize_rcu().
> >
> > To my understanding the pre-case is already with the flag set upfront
> > and waiting for everyone to finish (which is already taking place in
> > stock code) + looking at it within the section.
>
> I freely confess that I do not understand the purpose of assigning to
> files->resize_in_progress both before (pre-existing) and within (added)
> expand_fdtable().  If the assignments before and after the call to
> expand_fdtable() and the checks were under that lock, that could work,
> but removing that lockless check might have performance and scalability
> consequences.
>
> > > > Pretend ->resize_lock exists, then:
> > > > fd_install:
> > > > files = current->files;
> > > > read_lock(files->resize_lock);
> > > > fdt = rcu_dereference_sched(files->fdt);
> > > > rcu_assign_pointer(fdt->fd[fd], file);
> > > > read_unlock(files->resize_lock);
> > > >
> > > > expand_fdtable:
> > > > write_lock(files->resize_lock);
> > > > [snip]
> > > > rcu_assign_pointer(files->fdt, new_fdt);
> > > > write_unlock(files->resize_lock);
> > > >
> > > > Except rcu_read_lock_sched + appropriately fenced resize_in_progress +
> > > > synchronize_rcu do it.
> > >
> > > OK, good, you did get the grace-period part of the puzzle.
> > >
> > > Howver, please keep in mind that synchronize_rcu() has significant
> > > latency by design.  There is a tradeoff between CPU consumption and
> > > latency, and synchronize_rcu() therefore has latencies ranging upwards of
> > > several milliseconds (not microseconds or nanoseconds).  I would be very
> > > surprised if expand_fdtable() users would be happy with such a long delay.
> >
> > The call is already there since 2015 and I only know of one case where
> > someone took an issue with it (and it could have been sorted out with
> > dup2 upfront to grow the table to the desired size). Amusingly I see
> > you patched it in 2018 from synchronize_sched to synchronize_rcu.
> > Bottom line though is that I'm not *adding* it. latency here. :)
>
> Are you saying that the smp_rmb() is unnecessary?  It doesn't seem like
> you are saying that, because otherwise your patch could simply remove
> it without additional code changes.  On the other hand, if it is a key
> component of the synchronization, I don't see how that smp_rmb() can be
> removed while still preserving that synchronization without adding another
> synchronize_rcu() to that function to compensate.
>
> Now, it might be that you are somehow cleverly reusing the pre-existing
> synchronize_rcu(), but I am not immediately seeing how this would work.
>
> And no, I do not recall making that particular change back in the
> day, only that I did change all the calls to synchronize_sched() to
> synchronize_rcu().  Please accept my apologies for my having failed
> to meet your expectations.  And do not be too surprised if others have
> similar expectations of you at some point in the future.  ;-)
>
> > So assuming the above can be ignored, do you confirm the patch works
> > (even if it needs some cosmetic changes)?
> >
> > The entirety of the patch is about removing smp_rmb in fd_install with
> > small code rearrangement, while relying on the machinery which is
> > already there.
>
> The code to be synchronized is fairly small.  So why don't you
> create a litmus test and ask herd7?  Please see tools/memory-model for
> documentation and other example litmus tests.  This tool does the moral
> equivalent of a full state-space search of the litmus tests, telling you
> whether your "exists" condition is always, sometimes, or never satisfied.
>

I think there is quite a degree of talking past each other in this thread.

I was not aware of herd7. Testing the thing with it sounds like a plan
to get out of it, so I'm going to do it and get back to you in a day
or two. Worst case the patch is a bust, best case the fence is already
of no use.

-- 
Mateusz Guzik <mjguzik gmail.com>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH] fs: elide the smp_rmb fence in fd_install()
  2024-12-05 20:15                     ` Mateusz Guzik
@ 2024-12-05 21:17                       ` Paul E. McKenney
  0 siblings, 0 replies; 27+ messages in thread
From: Paul E. McKenney @ 2024-12-05 21:17 UTC (permalink / raw)
  To: Mateusz Guzik
  Cc: Al Viro, brauner, jack, linux-fsdevel, torvalds, edumazet,
	linux-kernel

On Thu, Dec 05, 2024 at 09:15:14PM +0100, Mateusz Guzik wrote:
> On Thu, Dec 5, 2024 at 9:01 PM Paul E. McKenney <paulmck@kernel.org> wrote:
> >
> > On Thu, Dec 05, 2024 at 08:03:24PM +0100, Mateusz Guzik wrote:
> > > On Thu, Dec 5, 2024 at 7:41 PM Paul E. McKenney <paulmck@kernel.org> wrote:
> > > >
> > > > On Thu, Dec 05, 2024 at 03:43:41PM +0100, Mateusz Guzik wrote:
> > > > > On Thu, Dec 5, 2024 at 3:18 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
> > > > > >
> > > > > > On Thu, Dec 05, 2024 at 01:03:32PM +0100, Mateusz Guzik wrote:
> > > > > > >  void fd_install(unsigned int fd, struct file *file)
> > > > > > >  {
> > > > > > > -     struct files_struct *files = current->files;
> > > > > > > +     struct files_struct *files;
> > > > > > >       struct fdtable *fdt;
> > > > > > >
> > > > > > >       if (WARN_ON_ONCE(unlikely(file->f_mode & FMODE_BACKING)))
> > > > > > >               return;
> > > > > > >
> > > > > > > +     /*
> > > > > > > +      * Synchronized with expand_fdtable(), see that routine for an
> > > > > > > +      * explanation.
> > > > > > > +      */
> > > > > > >       rcu_read_lock_sched();
> > > > > > > +     files = READ_ONCE(current->files);
> > > > > >
> > > > > > What are you trying to do with that READ_ONCE()?  current->files
> > > > > > itself is *not* changed by any of that code; current->files->fdtab is.
> > > > >
> > > > > To my understanding this is the idiomatic way of spelling out the
> > > > > non-existent in Linux smp_consume_load, for the resize_in_progress
> > > > > flag.
> > > >
> > > > In Linus, "smp_consume_load()" is named rcu_dereference().
> > >
> > > ok
> >
> > And rcu_dereference(), and for that matter memory_order_consume, only
> > orders the load of the pointer against subsequent dereferences of that
> > same pointer against dereferences of that same pointer preceding the
> > store of that pointer.
> >
> >         T1                              T2
> >         a: p->a = 1;                    d: q = rcu_dereference(gp);
> >         b: r1 = p->b;                   e: r2 = p->a;
> >         c: rcu_assign_pointer(gp, p);   f: p->b = 42;
> >
> > Here, if (and only if!) T2's load into q gets the value stored by
> > T1, then T1's statements e and f are guaranteed to happen after T2's
> > statements a and b.  In your patch, I do not see this pattern for the
> > files->resize_in_progress flag.
> >
> > > > > Anyway to elaborate I'm gunning for a setup where the code is
> > > > > semantically equivalent to having a lock around the work.
> > > >
> > > > Except that rcu_read_lock_sched() provides mutual-exclusion guarantees
> > > > only with later RCU grace periods, such as those implemented by
> > > > synchronize_rcu().
> > >
> > > To my understanding the pre-case is already with the flag set upfront
> > > and waiting for everyone to finish (which is already taking place in
> > > stock code) + looking at it within the section.
> >
> > I freely confess that I do not understand the purpose of assigning to
> > files->resize_in_progress both before (pre-existing) and within (added)
> > expand_fdtable().  If the assignments before and after the call to
> > expand_fdtable() and the checks were under that lock, that could work,
> > but removing that lockless check might have performance and scalability
> > consequences.
> >
> > > > > Pretend ->resize_lock exists, then:
> > > > > fd_install:
> > > > > files = current->files;
> > > > > read_lock(files->resize_lock);
> > > > > fdt = rcu_dereference_sched(files->fdt);
> > > > > rcu_assign_pointer(fdt->fd[fd], file);
> > > > > read_unlock(files->resize_lock);
> > > > >
> > > > > expand_fdtable:
> > > > > write_lock(files->resize_lock);
> > > > > [snip]
> > > > > rcu_assign_pointer(files->fdt, new_fdt);
> > > > > write_unlock(files->resize_lock);
> > > > >
> > > > > Except rcu_read_lock_sched + appropriately fenced resize_in_progress +
> > > > > synchronize_rcu do it.
> > > >
> > > > OK, good, you did get the grace-period part of the puzzle.
> > > >
> > > > Howver, please keep in mind that synchronize_rcu() has significant
> > > > latency by design.  There is a tradeoff between CPU consumption and
> > > > latency, and synchronize_rcu() therefore has latencies ranging upwards of
> > > > several milliseconds (not microseconds or nanoseconds).  I would be very
> > > > surprised if expand_fdtable() users would be happy with such a long delay.
> > >
> > > The call is already there since 2015 and I only know of one case where
> > > someone took an issue with it (and it could have been sorted out with
> > > dup2 upfront to grow the table to the desired size). Amusingly I see
> > > you patched it in 2018 from synchronize_sched to synchronize_rcu.
> > > Bottom line though is that I'm not *adding* it. latency here. :)
> >
> > Are you saying that the smp_rmb() is unnecessary?  It doesn't seem like
> > you are saying that, because otherwise your patch could simply remove
> > it without additional code changes.  On the other hand, if it is a key
> > component of the synchronization, I don't see how that smp_rmb() can be
> > removed while still preserving that synchronization without adding another
> > synchronize_rcu() to that function to compensate.
> >
> > Now, it might be that you are somehow cleverly reusing the pre-existing
> > synchronize_rcu(), but I am not immediately seeing how this would work.
> >
> > And no, I do not recall making that particular change back in the
> > day, only that I did change all the calls to synchronize_sched() to
> > synchronize_rcu().  Please accept my apologies for my having failed
> > to meet your expectations.  And do not be too surprised if others have
> > similar expectations of you at some point in the future.  ;-)
> >
> > > So assuming the above can be ignored, do you confirm the patch works
> > > (even if it needs some cosmetic changes)?
> > >
> > > The entirety of the patch is about removing smp_rmb in fd_install with
> > > small code rearrangement, while relying on the machinery which is
> > > already there.
> >
> > The code to be synchronized is fairly small.  So why don't you
> > create a litmus test and ask herd7?  Please see tools/memory-model for
> > documentation and other example litmus tests.  This tool does the moral
> > equivalent of a full state-space search of the litmus tests, telling you
> > whether your "exists" condition is always, sometimes, or never satisfied.
> >
> 
> I think there is quite a degree of talking past each other in this thread.
> 
> I was not aware of herd7. Testing the thing with it sounds like a plan
> to get out of it, so I'm going to do it and get back to you in a day
> or two. Worst case the patch is a bust, best case the fence is already
> of no use.

Very good!  My grouchiness earlier in this thread notwithstanding, I am
happy to review your litmus tests.  (You will likely need more than one.)

And the inevitable unsolicited advice:  Make those litmus tests as small
as you possibly can.  Full-state-space search is extremely powerful,
but it does not scale very well.

							Thanx, Paul

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH] fs: elide the smp_rmb fence in fd_install()
  2024-12-05 20:11                     ` Paul E. McKenney
@ 2024-12-06 12:11                       ` Christian Brauner
  0 siblings, 0 replies; 27+ messages in thread
From: Christian Brauner @ 2024-12-06 12:11 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Mateusz Guzik, Linus Torvalds, Al Viro, jack, linux-fsdevel,
	edumazet, linux-kernel

On Thu, Dec 05, 2024 at 12:11:57PM -0800, Paul E. McKenney wrote:
> On Thu, Dec 05, 2024 at 08:47:24PM +0100, Mateusz Guzik wrote:
> > On Thu, Dec 5, 2024 at 8:26 PM Linus Torvalds
> > <torvalds@linux-foundation.org> wrote:
> > >
> > > On Thu, 5 Dec 2024 at 10:41, Paul E. McKenney <paulmck@kernel.org> wrote:
> > > >
> > > > > To my understanding this is the idiomatic way of spelling out the
> > > > > non-existent in Linux smp_consume_load, for the resize_in_progress
> > > > > flag.
> > > >
> > > > In Linus, "smp_consume_load()" is named rcu_dereference().
> > >
> > > Linux.
> > >
> > > But yes and no.
> > >
> > > It's worth making it really really clear that "rcu_dereference()" is
> > > *not* just a different name for some "smp_consume_load()" operation.
> > >
> > > Why? Because a true smp_consume_load() would work with any random kind
> > > of flags etc. And rcu_dereference() works only because it's a pointer,
> > > and there's an inherent data dependency to what the result points to.
> > >
> > > Paul obviously knows this, but let's make it very clear in this
> > > discussion, because if somebody decided "I want a smp_consume_load(),
> > > and I'll use rcu_dereference() to do that", the end result would
> > > simply not work for arbitrary data, like a flags field or something,
> > > where comparing it against a value will only result in a control
> > > dependency, not an actual data dependency.
> > 
> > So I checked for kicks and rcu_dereference comes with type checking,
> > as in passing something which is not a pointer even fails to compile.
> 
> That is by design, keeping in mind that consume loads order only
> later dereferences against the pointer load.
> 
> > I'll note thought that a smp_load_consume_ptr or similarly named
> > routine would be nice and I'm rather confused why it was not added
> > given smp_load_acquire and smp_store_release being there.
> 
> In recent kernels, READ_ONCE() is your smp_load_consume_ptr().  Or things
> like rcu_dereference_check(p, 1).  But these can be used only when the
> pointed-to object is guaranteed to live (as in not be freed) for the
> full duration of the read-side use of that pointer.

Both true in the case of mnt_idmap(). All mounts start with mnt_idmap set to:
extern struct mnt_idmap nop_mnt_idmap which doesn't go away ever. And we
only allow to change the idmaping of a mount once. 
So the READ_ONCE() will always retrieve an object that is guaranteed to
stay alive for at least as long as the mount stays alive (in the
nop_mnt_idmap case obviously "forever").

I wanted to avoid a) pushing complicated RCU dances all through
filesystems and the VFS and b) taking any reference counts whatsoever on
mnt_idmap (other than sharing the same mnt_idmap between different
mounts at creation time). (Though I do have long-standing ideas how that
would work without changing the mnt_idmap pointer.).

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH] fs: elide the smp_rmb fence in fd_install()
  2024-12-05 15:36                 ` Mateusz Guzik
@ 2024-12-06 15:32                   ` Jan Kara
  0 siblings, 0 replies; 27+ messages in thread
From: Jan Kara @ 2024-12-06 15:32 UTC (permalink / raw)
  To: Mateusz Guzik
  Cc: Jan Kara, paulmck, brauner, viro, linux-fsdevel, torvalds,
	edumazet, linux-kernel

On Thu 05-12-24 16:36:40, Mateusz Guzik wrote:
> On Thu, Dec 5, 2024 at 4:29 PM Jan Kara <jack@suse.cz> wrote:
> > On Thu 05-12-24 16:01:07, Mateusz Guzik wrote:
> > > Suppose the CPU reordered loads of the flag and the fd table. There is
> > > no ordering in which it can see both the old table and the unset flag.
> >
> > But I disagree here. If the reads are reordered, then the fd table read can
> > happen during the "flag is true and the fd table is old" state and the flag
> > read can happen later in "flag is false and the fd table is new" state.
> > Just as I outlined above...

Ugh, I might be missing something obvious so please bear with me.

> In your example all the work happens *after* synchronize_rcu().

Correct.

> The thread resizing the table already published the result even before
> calling into it.

Really? You proposed expand_table() does:

       BUG_ON(files->resize_in_progress);
       files->resize_in_progress = true;
       spin_unlock(&files->file_lock);
       new_fdt = alloc_fdtable(nr + 1);
       if (atomic_read(&files->count) > 1)
               synchronize_rcu();

       spin_lock(&files->file_lock);
       if (IS_ERR(new_fdt)) {
               err = PTR_ERR(new_fdt);
               goto out;
       }
       cur_fdt = files_fdtable(files);
       BUG_ON(nr < cur_fdt->max_fds);
       copy_fdtable(new_fdt, cur_fdt);
       rcu_assign_pointer(files->fdt, new_fdt);
       if (cur_fdt != &files->fdtab)
               call_rcu(&cur_fdt->rcu, free_fdtable_rcu);
       smp_wmb();
out:
       files->resize_in_progress = false;
       return err;

So synchronize_rcu() is called very early AFAICT. At that point we have
allocated the new table but copy & store in files->fdt happens *after*
synchronize_rcu() has finished. So the copy & store can be racing with
fd_install() calling rcu_read_lock_sched() and prefetching files->fdt (but
not files->resize_in_progress!) into a local CPU cache.

> Furthermore by the time synchronize_rcu returns
> everyone is guaranteed to have issued a full fence. Meaning nobody can
> see the flag as unset.

Well, nobody can see the flag unset only until expand_fdtable() reaches:

       smp_wmb();
out:
       files->resize_in_progress = false;
 
And smp_wmb() doesn't give you much unless the reads of
files->resize_in_progress and files->fdt are ordered somehow on the other
side (i.e., in fd_install()).

But I'm looking forward to the Litmus test to resolve our discussion :)

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2024-12-06 15:32 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-08-08  2:50 [RFC] why do we need smp_rmb/smp_wmb pair in fd_install()/expand_fdtable()? Al Viro
2024-08-08  3:06 ` Linus Torvalds
2024-08-08  3:35   ` Al Viro
2024-08-08  3:46     ` Al Viro
2024-08-08  6:08       ` Mateusz Guzik
2024-12-05 12:03         ` [RFC PATCH] fs: elide the smp_rmb fence in fd_install() Mateusz Guzik
2024-12-05 14:18           ` Al Viro
2024-12-05 14:43             ` Mateusz Guzik
2024-12-05 18:41               ` Paul E. McKenney
2024-12-05 19:03                 ` Mateusz Guzik
2024-12-05 20:01                   ` Paul E. McKenney
2024-12-05 20:15                     ` Mateusz Guzik
2024-12-05 21:17                       ` Paul E. McKenney
2024-12-05 19:26                 ` Linus Torvalds
2024-12-05 19:47                   ` Mateusz Guzik
2024-12-05 20:11                     ` Paul E. McKenney
2024-12-06 12:11                       ` Christian Brauner
2024-12-05 20:06                   ` Paul E. McKenney
2024-12-05 14:46           ` Jan Kara
2024-12-05 15:01             ` Mateusz Guzik
2024-12-05 15:29               ` Jan Kara
2024-12-05 15:36                 ` Mateusz Guzik
2024-12-06 15:32                   ` Jan Kara
2024-12-05 14:58           ` Christian Brauner
2024-12-05 15:06             ` Mateusz Guzik
2024-08-08 13:20   ` [RFC] why do we need smp_rmb/smp_wmb pair in fd_install()/expand_fdtable()? Christian Brauner
2024-08-08 16:11     ` Linus Torvalds

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.