* [PATCH v2] vfs: shave work on failed file open @ 2023-09-26 16:22 Mateusz Guzik 2023-09-26 19:00 ` Linus Torvalds 0 siblings, 1 reply; 29+ messages in thread From: Mateusz Guzik @ 2023-09-26 16:22 UTC (permalink / raw) To: brauner; +Cc: viro, linux-kernel, linux-fsdevel, torvalds, Mateusz Guzik Failed opens (mostly ENOENT) legitimately happen a lot, for example here are stats from stracing a kernel build for a few seconds (strace -fc make): % time seconds usecs/call calls errors syscall ------ ----------- ----------- --------- --------- ------------------ 0.76 0.076233 5 15040 3688 openat (this is tons of header files tried in different paths) In the common case of there being nothing to close (only the file object to free) there is a lot of overhead which can be avoided. This is most notably delegation of freeing to task_work, which comes with an enormous cost (see 021a160abf62 ("fs: use __fput_sync in close(2)") for an example). Benchmarked with will-it-scale with a custom testcase based on tests/open1.c, stuffed into tests/openneg.c: [snip] while (1) { int fd = open("/tmp/nonexistent", O_RDONLY); assert(fd == -1); (*iterations)++; } [/snip] Sapphire Rapids, openneg_processes -t 1 (ops/s): before: 1950013 after: 2914973 (+49%) The file refcount is checked with an atomic cmpxchg as a safety belt against buggy consumers. Technically it is not necessary, but it happens to not be measurable due to several other atomics which immediately follow. Optimizing them away to make this atomic into a problem is left as an exercise for the reader. 
v2: - unexport fput_badopen and move to fs/internal.h - handle the refcount with cmpxchg, adjust commentary accordingly - tweak the commit message Signed-off-by: Mateusz Guzik <mjguzik@gmail.com> --- fs/file_table.c | 35 +++++++++++++++++++++++++++++++++++ fs/internal.h | 2 ++ fs/namei.c | 2 +- 3 files changed, 38 insertions(+), 1 deletion(-) diff --git a/fs/file_table.c b/fs/file_table.c index ee21b3da9d08..6cbd5bc551d0 100644 --- a/fs/file_table.c +++ b/fs/file_table.c @@ -82,6 +82,16 @@ static inline void file_free(struct file *f) call_rcu(&f->f_rcuhead, file_free_rcu); } +static inline void file_free_badopen(struct file *f) +{ + BUG_ON(f->f_mode & (FMODE_BACKING | FMODE_OPENED)); + security_file_free(f); + put_cred(f->f_cred); + if (likely(!(f->f_mode & FMODE_NOACCOUNT))) + percpu_counter_dec(&nr_files); + kmem_cache_free(filp_cachep, f); +} + /* * Return the total number of open files in the system */ @@ -468,6 +478,31 @@ void __fput_sync(struct file *file) EXPORT_SYMBOL(fput); EXPORT_SYMBOL(__fput_sync); +/* + * Clean up after failing to open (e.g., open(2) returns with -ENOENT). + * + * This represents opportunities to shave on work in the common case of + * FMODE_OPENED not being set: + * 1. there is nothing to close, just the file object to free and consequently + * no need to delegate to task_work + * 2. 
as nobody else had seen the file then there is no need to delegate + * freeing to RCU + */ +void fput_badopen(struct file *file) +{ + if (unlikely(file->f_mode & (FMODE_BACKING | FMODE_OPENED))) { + fput(file); + return; + } + + if (WARN_ON_ONCE(atomic_long_cmpxchg(&file->f_count, 1, 0) != 1)) { + fput(file); + return; + } + + file_free_badopen(file); +} + void __init files_init(void) { filp_cachep = kmem_cache_create("filp", sizeof(struct file), 0, diff --git a/fs/internal.h b/fs/internal.h index d64ae03998cc..93da6d815e90 100644 --- a/fs/internal.h +++ b/fs/internal.h @@ -95,6 +95,8 @@ struct file *alloc_empty_file(int flags, const struct cred *cred); struct file *alloc_empty_file_noaccount(int flags, const struct cred *cred); struct file *alloc_empty_backing_file(int flags, const struct cred *cred); +void fput_badopen(struct file *); + static inline void put_file_access(struct file *file) { if ((file->f_mode & (FMODE_READ | FMODE_WRITE)) == FMODE_READ) { diff --git a/fs/namei.c b/fs/namei.c index 567ee547492b..67579fe30b28 100644 --- a/fs/namei.c +++ b/fs/namei.c @@ -3802,7 +3802,7 @@ static struct file *path_openat(struct nameidata *nd, WARN_ON(1); error = -EINVAL; } - fput(file); + fput_badopen(file); if (error == -EOPENSTALE) { if (flags & LOOKUP_RCU) error = -ECHILD; -- 2.39.2 ^ permalink raw reply related [flat|nested] 29+ messages in thread
* Re: [PATCH v2] vfs: shave work on failed file open 2023-09-26 16:22 [PATCH v2] vfs: shave work on failed file open Mateusz Guzik @ 2023-09-26 19:00 ` Linus Torvalds 2023-09-26 19:28 ` Mateusz Guzik 0 siblings, 1 reply; 29+ messages in thread From: Linus Torvalds @ 2023-09-26 19:00 UTC (permalink / raw) To: Mateusz Guzik; +Cc: brauner, viro, linux-kernel, linux-fsdevel On Tue, 26 Sept 2023 at 09:22, Mateusz Guzik <mjguzik@gmail.com> wrote: > > +void fput_badopen(struct file *file) > +{ > + if (unlikely(file->f_mode & (FMODE_BACKING | FMODE_OPENED))) { > + fput(file); > + return; > + } I don't understand. Why the FMODE_BACKING test? The only thing that sets FMODE_BACKING is alloc_empty_backing_file(), and we know that isn't involved, because the file that is free'd is file = alloc_empty_file(op->open_flag, current_cred()); so that test makes no sense. It might make sense as another WARN_ON_ONCE(), but honestly, why even that? Why worry about FMODE_BACKING? Now, the FMODE_OPENED check makes sense to me, in that it most definitely can be set, and means we need to call the ->release() callback and a lot more. Although I get the feeling that this test would make more sense in the caller, since path_openat() _already_ checks for FMODE_OPENED in the non-error path too. > + if (WARN_ON_ONCE(atomic_long_cmpxchg(&file->f_count, 1, 0) != 1)) { > + fput(file); > + return; > + } Ok, I kind of see why you'd want this safety check. I don't see how f_count could be validly anything else, but that's what the WARN_ON_ONCE is all about. Anyway, I think I'd be happier about this if it was more of a "just the reverse of alloc_empty_file()", and path_openat() literally did just if (likely(file->f_mode & FMODE_OPENED)) release_empty_file(file); else fput(file); instead of having this fput_badopen() helper that feels like it needs to care about other cases than alloc_empty_file(). Don't take this email as a NAK, though. I don't hate the patch. 
I just feel it could be more targeted, and more clearly "this is explicitly avoiding the cost of 'fput()' in just path_openat() if we never actually filled things in". Linus ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v2] vfs: shave work on failed file open 2023-09-26 19:00 ` Linus Torvalds @ 2023-09-26 19:28 ` Mateusz Guzik 2023-09-27 14:09 ` Christian Brauner 0 siblings, 1 reply; 29+ messages in thread From: Mateusz Guzik @ 2023-09-26 19:28 UTC (permalink / raw) To: Linus Torvalds; +Cc: brauner, viro, linux-kernel, linux-fsdevel On 9/26/23, Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Tue, 26 Sept 2023 at 09:22, Mateusz Guzik <mjguzik@gmail.com> wrote: >> >> +void fput_badopen(struct file *file) >> +{ >> + if (unlikely(file->f_mode & (FMODE_BACKING | FMODE_OPENED))) { >> + fput(file); >> + return; >> + } > > I don't understand. > > Why the FMODE_BACKING test? > > The only thing that sets FMODE_BACKING is alloc_empty_backing_file(), > and we know that isn't involved, because the file that is free'd is > > file = alloc_empty_file(op->open_flag, current_cred()); > > so that test makes no sense. > I tried to future proof by dodging the thing, but I can drop it if you insist. Also see below. > It might make sense as another WARN_ON_ONCE(), but honestly, why even > that? Why worry about FMODE_BACKING? > > Now, the FMODE_OPENED check makes sense to me, in that it most > definitely can be set, and means we need to call the ->release() > callback and a lot more. Although I get the feeling that this test > would make more sense in the caller, since path_openat() _already_ > checks for FMODE_OPENED in the non-error path too. > >> + if (WARN_ON_ONCE(atomic_long_cmpxchg(&file->f_count, 1, 0) != 1)) >> { >> + fput(file); >> + return; >> + } > > Ok, I kind of see why you'd want this safety check. I don't see how > f_count could be validly anything else, but that's what the > WARN_ON_ONCE is all about. > This would be VFSDEBUG or whatever if it was available. But between nobody checking this and production kernels suffering the check when they should not, I take the latter. 
I wanted to propose debug macros for vfs but could not be bothered to type it up and argue about it; maybe I'll get around to it. > Anyway, I think I'd be happier about this if it was more of a "just > the reverse of alloc_empty_file()", and path_openat() literally did > just > > if (likely(file->f_mode & FMODE_OPENED)) > release_empty_file(file); > else > fput(file); > > instead of having this fput_badopen() helper that feels like it needs > to care about other cases than alloc_empty_file(). > I don't have a strong opinion, I think my variant is cleaner and more generic, but this boils down to taste and this is definitely not the hill I'm willing to die on. I am amenable to whatever tidy-ups without a fight as long as the core remains (task work and rcu dodged). All that said, I think it is Christian's call on what it should look like. -- Mateusz Guzik <mjguzik gmail.com> ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v2] vfs: shave work on failed file open 2023-09-26 19:28 ` Mateusz Guzik @ 2023-09-27 14:09 ` Christian Brauner 2023-09-27 14:34 ` Mateusz Guzik 2023-09-27 17:48 ` Linus Torvalds 0 siblings, 2 replies; 29+ messages in thread From: Christian Brauner @ 2023-09-27 14:09 UTC (permalink / raw) To: Mateusz Guzik; +Cc: Linus Torvalds, viro, linux-kernel, linux-fsdevel > I don't have a strong opinion, I think my variant is cleaner and more > generic, but this boils down to taste and this is definitely not the > hill I'm willing to die on. I kinda like the release_empty_file() approach, but we should keep the WARN_ON_ONCE() so we can see whether anyone is taking an extra reference on this thing. It's super unlikely, but I guess zebras exist, and if some (buggy) code were to call get_file() during ->open() and keep that reference for some reason, we'd want to know why. But I don't think anything does that. No need to resend; I can massage this well enough in-tree. ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v2] vfs: shave work on failed file open 2023-09-27 14:09 ` Christian Brauner @ 2023-09-27 14:34 ` Mateusz Guzik 2023-09-27 17:48 ` Linus Torvalds 1 sibling, 0 replies; 29+ messages in thread From: Mateusz Guzik @ 2023-09-27 14:34 UTC (permalink / raw) To: Christian Brauner; +Cc: Linus Torvalds, viro, linux-kernel, linux-fsdevel On 9/27/23, Christian Brauner <brauner@kernel.org> wrote: >> I don't have a strong opinion, I think my variant is cleaner and more >> generic, but this boils down to taste and this is definitely not the >> hill I'm willing to die on. > > I kinda like the release_empty_file() approach but we should keep the > WARN_ON_ONCE() so we can see whether anyone is taking an extra reference > on this thing. It's super unlikely but I guess zebras exist and if some > (buggy) code were to call get_file() during ->open() and keep that > reference for some reason we'd want to know why. But I don't think > anything does that. > > No need to resend I can massage this well enough in-tree. > Ok, I'm buggering off to other patches. Thanks. -- Mateusz Guzik <mjguzik gmail.com> ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v2] vfs: shave work on failed file open 2023-09-27 14:09 ` Christian Brauner 2023-09-27 14:34 ` Mateusz Guzik @ 2023-09-27 17:48 ` Linus Torvalds 2023-09-27 17:56 ` Mateusz Guzik 1 sibling, 1 reply; 29+ messages in thread From: Linus Torvalds @ 2023-09-27 17:48 UTC (permalink / raw) To: Christian Brauner; +Cc: Mateusz Guzik, viro, linux-kernel, linux-fsdevel [-- Attachment #1: Type: text/plain, Size: 1381 bytes --] On Wed, 27 Sept 2023 at 07:10, Christian Brauner <brauner@kernel.org> wrote: > > No need to resend I can massage this well enough in-tree. Hmm. Please do, but here's some more food for thought for at least the commit message. Because there's more than the "__fput_sync()" issue at hand, we have another delayed thing that this patch ends up short-circuiting, which wasn't obvious from the original description. I'm talking about the fact that our existing "file_free()" ends up doing the actual release with call_rcu(&f->f_rcuhead, file_free_rcu); and the patch under discussion avoids that part too. And I actually like that it avoids it, I just think it should be mentioned explicitly, because it wasn't obvious to me until I actually looked at the old __fput() path. Particularly since it means that the f_creds are free'd synchronously now. I do think that's fine, although I forget what path it was that required that rcu-delayed cred freeing. Worth mentioning, and maybe worth thinking about. However, when I *did* look at it, it strikes me that we could do this differently. Something like this (ENTIRELY UNTESTED) patch, which just moves this logic into fput() itself. Again: ENTIRELY UNTESTED, and I might easily have screwed up. But it looks simpler and more straightforward to me. But again: that may be because I missed something. 
Linus [-- Attachment #2: patch.diff --] [-- Type: text/x-patch, Size: 1223 bytes --] fs/file_table.c | 22 ++++++++++++++++++++++ 1 file changed, 22 insertions(+) diff --git a/fs/file_table.c b/fs/file_table.c index ee21b3da9d08..4fb87a0382d9 100644 --- a/fs/file_table.c +++ b/fs/file_table.c @@ -430,11 +430,33 @@ EXPORT_SYMBOL_GPL(flush_delayed_fput); static DECLARE_DELAYED_WORK(delayed_fput_work, delayed_fput); +/* + * Called for files that were never fully opened, and + * don't need the RCU-delayed freeing: they have never + * been accessed in any other context. + */ +static void fput_immediate(struct file *f) +{ + security_file_free(f); + put_cred(f->f_cred); + if (likely(!(f->f_mode & FMODE_NOACCOUNT))) + percpu_counter_dec(&nr_files); + if (unlikely(f->f_mode & FMODE_BACKING)) { + path_put(backing_file_real_path(f)); + kfree(backing_file(f)); + } else { + kmem_cache_free(filp_cachep, f); + } +} + void fput(struct file *file) { if (atomic_long_dec_and_test(&file->f_count)) { struct task_struct *task = current; + if (unlikely(!(file->f_mode & FMODE_OPENED))) + return fput_immediate(file); + if (likely(!in_interrupt() && !(task->flags & PF_KTHREAD))) { init_task_work(&file->f_rcuhead, ____fput); if (!task_work_add(task, &file->f_rcuhead, TWA_RESUME)) ^ permalink raw reply related [flat|nested] 29+ messages in thread
* Re: [PATCH v2] vfs: shave work on failed file open 2023-09-27 17:48 ` Linus Torvalds @ 2023-09-27 17:56 ` Mateusz Guzik 2023-09-27 18:05 ` Linus Torvalds 0 siblings, 1 reply; 29+ messages in thread From: Mateusz Guzik @ 2023-09-27 17:56 UTC (permalink / raw) To: Linus Torvalds; +Cc: Christian Brauner, viro, linux-kernel, linux-fsdevel On 9/27/23, Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Wed, 27 Sept 2023 at 07:10, Christian Brauner <brauner@kernel.org> > wrote: >> >> No need to resend; I can massage this well enough in-tree. > > Hmm. Please do, but here's some more food for thought for at least the > commit message. > > Because there's more than the "__fput_sync()" issue at hand, we have > another delayed thing that this patch ends up short-circuiting, which > wasn't obvious from the original description. > > I'm talking about the fact that our existing "file_free()" ends up > doing the actual release with > > call_rcu(&f->f_rcuhead, file_free_rcu); > > and the patch under discussion avoids that part too. > Comments in the patch explicitly mention dodging RCU for the file object. > And I actually like that it avoids it, I just think it should be > mentioned explicitly, because it wasn't obvious to me until I actually > looked at the old __fput() path. Particularly since it means that the > f_creds are free'd synchronously now. > Well, put_cred is called synchronously, but should this happen to be the last ref on them, they will get call_rcu(&cred->rcu, put_cred_rcu)'ed. > I do think that's fine, although I forget what path it was that > required that rcu-delayed cred freeing. Worth mentioning, and maybe > worth thinking about. > See above. The only spot which plays tricks with it is faccessat; other than that, all creds are explicitly freed with rcu. > However, when I *did* look at it, it strikes me that we could do this > differently. > > Something like this (ENTIRELY UNTESTED) patch, which just moves this > logic into fput() itself. 
> I did not want to do it because failed open is a special case, quite specific to one syscall (and maybe a few others later). As is, you are adding a branch to all final fputs and are preventing whacking that 1 -> 0 unref down the road, unless it gets moved out again like in my patch. -- Mateusz Guzik <mjguzik gmail.com> ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v2] vfs: shave work on failed file open 2023-09-27 17:56 ` Mateusz Guzik @ 2023-09-27 18:05 ` Linus Torvalds 2023-09-27 18:32 ` Mateusz Guzik 0 siblings, 1 reply; 29+ messages in thread From: Linus Torvalds @ 2023-09-27 18:05 UTC (permalink / raw) To: Mateusz Guzik; +Cc: Christian Brauner, viro, linux-kernel, linux-fsdevel On Wed, 27 Sept 2023 at 10:56, Mateusz Guzik <mjguzik@gmail.com> wrote: > > Comments in the patch explicitly mention dodging RCU for the file object. Not the commit message, and the comment is also actually pretty obscure and only talks about the freeing part. The cred part is what actually made me go "why is that even rcu-free'd". I *think* it's bogus, but I didn't go look at the history of it. > Well, put_cred is called synchronously, but should this happen to be > the last ref on them, they will get call_rcu(&cred->rcu, > put_cred_rcu)'ed. Yes. But the way it's done in __fput() you end up potentially RCU-delaying it twice. Odd. The reason we rcu-delay the 'struct file *' is because of the __fget_files_rcu() games. But I don't see why the cred thing is there. Historical mistake? But it all looks a bit odd, and because of that it worries me. Linus ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v2] vfs: shave work on failed file open 2023-09-27 18:05 ` Linus Torvalds @ 2023-09-27 18:32 ` Mateusz Guzik 2023-09-27 20:27 ` Linus Torvalds 0 siblings, 1 reply; 29+ messages in thread From: Mateusz Guzik @ 2023-09-27 18:32 UTC (permalink / raw) To: Linus Torvalds; +Cc: Christian Brauner, viro, linux-kernel, linux-fsdevel On Wed, Sep 27, 2023 at 11:05:37AM -0700, Linus Torvalds wrote: > On Wed, 27 Sept 2023 at 10:56, Mateusz Guzik <mjguzik@gmail.com> wrote: > > > > Comments in the patch explicitly mention dodging RCU for the file object. > > Not the commit message, and the comment is also actually pretty > obscure and only talks about the freeing part. > How about this: ================== cut here ================== vfs: shave work on failed file open Failed opens (mostly ENOENT) legitimately happen a lot, for example here are stats from stracing a kernel build for a few seconds (strace -fc make): % time seconds usecs/call calls errors syscall ------ ----------- ----------- --------- --------- ------------------ 0.76 0.076233 5 15040 3688 openat (this is tons of header files tried in different paths) In the common case of there being nothing to close (only the file object to free) there is a lot of overhead which can be avoided. This boils down to 2 items: 1. avoiding delegation of fput to task_work, see 021a160abf62 ("fs: use __fput_sync in close(2)") for more details on overhead 2. avoiding freeing the file with RCU Benchmarked with will-it-scale with a custom testcase based on tests/open1.c, stuffed into tests/openneg.c: [snip] while (1) { int fd = open("/tmp/nonexistent", O_RDONLY); assert(fd == -1); (*iterations)++; } [/snip] Sapphire Rapids, openneg_processes -t 1 (ops/s): before: 1950013 after: 2914973 (+49%) The file refcount is checked with an atomic cmpxchg as a safety belt against buggy consumers. Technically it is not necessary, but it happens to not be measurable due to several other atomics which immediately follow. 
Optimizing them away to make this atomic into a problem is left as an exercise for the reader. ================== cut here ================== Comment in v2 is: /* * Clean up after failing to open (e.g., open(2) returns with -ENOENT). * * This represents opportunities to shave on work in the common case of * FMODE_OPENED not being set: * 1. there is nothing to close, just the file object to free and consequently * no need to delegate to task_work * 2. as nobody else had seen the file then there is no need to delegate * freeing to RCU */ I don't see anything wrong with it as far as information goes. > > Well, put_cred is called synchronously, but should this happen to be > > the last ref on them, they will get call_rcu(&cred->rcu, > > put_cred_rcu)'ed. > > Yes. But the way it's done in __fput() you end up potentially > RCU-delaying it twice. Odd. > > The reason we rcu-delay the 'struct file *' is because of the > __fget_files_rcu() games. > > But I don't see why the cred thing is there. > > Historical mistake? But it all looks a bit odd, and because of that it > worries me. > put_cred showed up in file_free_rcu in d76b0d9b2d87 ("CRED: Use creds in file structs"). Commit message does not claim any dependency on this being in an rcu callback already and it looks like it was done this way because this was the only spot with kmem_cache_free(filp_cachep, f) -- you ensured put_cred was always called without inspecting any other places. If there is something magic going on here I don't see it; it definitely was not intended, at least. ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v2] vfs: shave work on failed file open 2023-09-27 18:32 ` Mateusz Guzik @ 2023-09-27 20:27 ` Linus Torvalds 2023-09-27 21:06 ` Mateusz Guzik 2023-09-28 13:25 ` Christian Brauner 0 siblings, 2 replies; 29+ messages in thread From: Linus Torvalds @ 2023-09-27 20:27 UTC (permalink / raw) To: Mateusz Guzik; +Cc: Christian Brauner, viro, linux-kernel, linux-fsdevel [-- Attachment #1: Type: text/plain, Size: 1951 bytes --] On Wed, 27 Sept 2023 at 11:32, Mateusz Guzik <mjguzik@gmail.com> wrote: > > put_cred showed up in file_free_rcu in d76b0d9b2d87 ("CRED: Use creds in > file structs"). Commit message does not claim any dependency on this > being in an rcu callback already and it looks like it was done this way > because this was the ony spot with kmem_cache_free(filp_cachep, f) Yes, that looks about right. So the rcu-freeing is almost an accident. Btw, I think we could get rid of the RCU freeing of 'struct file *' entirely. The way to fix it is (a) make sure all f_count accesses are atomic ops (the one special case is the "0 -> X" initialization, which is ok) (b) make filp_cachep be SLAB_TYPESAFE_BY_RCU because then get_file_rcu() can do the atomic_long_inc_not_zero() knowing it's still a 'struct file *' while holding the RCU read lock even if it was just free'd. And __fget_files_rcu() will then re-check that it's the *right* 'struct file *' and do a fput() on it and re-try if it isn't. End result: no need for any RCU freeing. But the difference is that a *new* 'struct file *' might see a temporary atomic increment / decrement of the file pointer because another CPU is going through that __fget_files_rcu() dance. Which is why "0 -> X" is ok to do as a "atomic_long_set()", but everything else would need to be done as "atomic_long_inc()" etc. 
Which all seems to be the case already, so with the put_cred() not needing the RCU delay, I think we really could do this patch (note: independent of other issues, but it means the "atomic_long_cmpxchg()" and the WARN_ON() in your patch should probably go away, because it can actually happen). That should help the normal file open/close case a bit, in that it doesn't cause that extra RCU work. Of course, on some loads it might be advantageous to do a delayed de-allocation in some other RCU context, so .. What do you think? Linus PS. And as always: ENTIRELY UNTESTED. [-- Attachment #2: patch.diff --] [-- Type: text/x-patch, Size: 1364 bytes --] fs/file_table.c | 19 ++++++++++--------- 1 file changed, 10 insertions(+), 9 deletions(-) diff --git a/fs/file_table.c b/fs/file_table.c index ee21b3da9d08..7b38ff7385cc 100644 --- a/fs/file_table.c +++ b/fs/file_table.c @@ -65,21 +65,21 @@ static void file_free_rcu(struct rcu_head *head) { struct file *f = container_of(head, struct file, f_rcuhead); - put_cred(f->f_cred); - if (unlikely(f->f_mode & FMODE_BACKING)) - kfree(backing_file(f)); - else - kmem_cache_free(filp_cachep, f); + kfree(backing_file(f)); } static inline void file_free(struct file *f) { security_file_free(f); - if (unlikely(f->f_mode & FMODE_BACKING)) - path_put(backing_file_real_path(f)); if (likely(!(f->f_mode & FMODE_NOACCOUNT))) percpu_counter_dec(&nr_files); - call_rcu(&f->f_rcuhead, file_free_rcu); + put_cred(f->f_cred); + if (unlikely(f->f_mode & FMODE_BACKING)) { + path_put(backing_file_real_path(f)); + call_rcu(&f->f_rcuhead, file_free_rcu); + } else { + kmem_cache_free(filp_cachep, f); + } } /* @@ -471,7 +471,8 @@ EXPORT_SYMBOL(__fput_sync); void __init files_init(void) { filp_cachep = kmem_cache_create("filp", sizeof(struct file), 0, - SLAB_HWCACHE_ALIGN | SLAB_PANIC | SLAB_ACCOUNT, NULL); + SLAB_TYPESAFE_BY_RCU | SLAB_HWCACHE_ALIGN + | SLAB_PANIC | SLAB_ACCOUNT, NULL); percpu_counter_init(&nr_files, 0, GFP_KERNEL); } ^ permalink raw reply 
related [flat|nested] 29+ messages in thread
* Re: [PATCH v2] vfs: shave work on failed file open 2023-09-27 20:27 ` Linus Torvalds @ 2023-09-27 21:06 ` Mateusz Guzik 2023-09-27 21:18 ` Linus Torvalds 2023-09-28 13:25 ` Christian Brauner 1 sibling, 1 reply; 29+ messages in thread From: Mateusz Guzik @ 2023-09-27 21:06 UTC (permalink / raw) To: Linus Torvalds; +Cc: Christian Brauner, viro, linux-kernel, linux-fsdevel On 9/27/23, Linus Torvalds <torvalds@linux-foundation.org> wrote: > Btw, I think we could get rid of the RCU freeing of 'struct file *' > entirely. > > The way to fix it is > > (a) make sure all f_count accesses are atomic ops (the one special > case is the "0 -> X" initialization, which is ok) > > (b) make filp_cachep be SLAB_TYPESAFE_BY_RCU > > because then get_file_rcu() can do the atomic_long_inc_not_zero() > knowing it's still a 'struct file *' while holding the RCU read lock > even if it was just free'd. > > And __fget_files_rcu() will then re-check that it's the *right* > 'struct file *' and do a fput() on it and re-try if it isn't. End > result: no need for any RCU freeing. > > But the difference is that a *new* 'struct file *' might see a > temporary atomic increment / decrement of the file pointer because > another CPU is going through that __fget_files_rcu() dance. > I think you attached the wrong file, it has next to no changes and in particular nothing for fd lookup. You may find it interesting that both NetBSD and FreeBSD have been doing something to that extent for years now in order to provide lockless fd lookup despite not having an equivalent to RCU (what they did have at the time is "type stable" -- objs can get reused but the memory can *never* get freed. utterly gross, but that's old Unix for you). It does work, but I always found it dodgy because it backpedals in a way which is not free of side effects. Note that validating you got the right file bare minimum requires reloading the fd table pointer because you might have been racing against close *and* resize. 
-- Mateusz Guzik <mjguzik gmail.com> ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v2] vfs: shave work on failed file open 2023-09-27 21:06 ` Mateusz Guzik @ 2023-09-27 21:18 ` Linus Torvalds 2023-09-27 21:30 ` Mateusz Guzik 0 siblings, 1 reply; 29+ messages in thread From: Linus Torvalds @ 2023-09-27 21:18 UTC (permalink / raw) To: Mateusz Guzik; +Cc: Christian Brauner, viro, linux-kernel, linux-fsdevel On Wed, 27 Sept 2023 at 14:06, Mateusz Guzik <mjguzik@gmail.com> wrote: > > I think you attached the wrong file, it has next to no changes and in > particular nothing for fd lookup. The fd lookup is already safe. It already does the whole "double-check the file pointer after doing the increment" for other reasons - namely the whole "oh, the file table can be re-allocated under us" thing. So the fd lookup needs rcu, but it does all the checks to make it all work with SLAB_TYPESAFE_BY_RCU. > You may find it interesting that both NetBSD and FreeBSD have been > doing something to that extent for years now in order to provide > lockless fd lookup despite not having an equivalent to RCU (what they > did have at the time is "type stable" -- objs can get reused but the > memory can *never* get freed. utterly gross, but that's old Unix for > you). That kind of "never free'd" thing is indeed gross, but the type-stability is useful. Our SLAB_TYPESAFE_BY_RCU is somewhat widely used, exactly because it's much cheaper than an *actual* RCU delayed free. Of course, it also requires more care, but it so happens that we already have that for other reasons for 'struct file'. > It does work, but I always found it dodgy because it backpedals in a > way which is not free of side effects. Grep around for SLAB_TYPESAFE_BY_RCU and you'll see that we actually have it in multiple places, most notably the sighand_struct. > Note that validating you got the right file bare minimum requires > reloading the fd table pointer because you might have been racing > against close *and* resize. Exactly. See __fget_files_rcu(). 
Linus ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v2] vfs: shave work on failed file open 2023-09-27 21:18 ` Linus Torvalds @ 2023-09-27 21:30 ` Mateusz Guzik 0 siblings, 0 replies; 29+ messages in thread From: Mateusz Guzik @ 2023-09-27 21:30 UTC (permalink / raw) To: Linus Torvalds; +Cc: Christian Brauner, viro, linux-kernel, linux-fsdevel On 9/27/23, Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Wed, 27 Sept 2023 at 14:06, Mateusz Guzik <mjguzik@gmail.com> wrote: >> >> I think you attached the wrong file, it has next to no changes and in >> particular nothing for fd lookup. > > The fd lookup is already safe. > > It already does the whole "double-check the file pointer after doing > the increment" for other reasons - namely the whole "oh, the file > table can be re-allocated under us" thing. > > So the fd lookup needs rcu, but it does all the checks to make it all > work with SLAB_TYPESAFE_BY_RCU. > Indeed, nice. Sorry, I discounted the patch after not seeing anything for fd and file_free_rcu still being there. Looked like a WIP. I'm going to give it a spin tomorrow along with some benching. -- Mateusz Guzik <mjguzik gmail.com> ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v2] vfs: shave work on failed file open 2023-09-27 20:27 ` Linus Torvalds 2023-09-27 21:06 ` Mateusz Guzik @ 2023-09-28 13:25 ` Christian Brauner 2023-09-28 14:05 ` Christian Brauner 1 sibling, 1 reply; 29+ messages in thread From: Christian Brauner @ 2023-09-28 13:25 UTC (permalink / raw) To: Linus Torvalds; +Cc: Mateusz Guzik, viro, linux-kernel, linux-fsdevel > Which all seems to be the case already, so with the put_cred() not > needing the RCU delay, I thing we really could do this patch (note: So I spent a good chunk of time going through this patch. Before file->f_cred was introduced file->f_{g,u}id would have been accessible just under rcu protection. And file->f_cred->f_fs{g,u}id replaced that access. So I think the intention was that file->f_cred would function the same way, i.e., it would be possible to go from file to cred under rcu without requiring a reference. But basically, file->f_cred is the only field that would give this guarantee. Other pointers such as file->f_security (security_file_free()) don't and are freed outside of the rcu delay already as well. This patch means that if someone wants to access file->f_cred under rcu they now need to call get_file_rcu() first. Nothing has relied on this rcu-only file->f_cred quirk/feature until now so I think it's fine to change it. Does that make sense? Please take a look at: https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git/commit/?h=vfs.misc&id=e3f15ee79197fc8b17d3496b6fa4fa0fc20f5406 for testing. ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v2] vfs: shave work on failed file open
  2023-09-28 13:25 ` Christian Brauner
@ 2023-09-28 14:05   ` Christian Brauner
  2023-09-28 14:43     ` Jann Horn
  0 siblings, 1 reply; 29+ messages in thread
From: Christian Brauner @ 2023-09-28 14:05 UTC (permalink / raw)
To: Linus Torvalds, Jann Horn
Cc: Mateusz Guzik, viro, linux-kernel, linux-fsdevel

> So I spent a good chunk of time going through this patch.

The main thing that makes me go "we shouldn't do this" is that KASAN
isn't able to detect UAF issues, as Jann pointed out, so I'm getting
really nervous about this.

And Jann also pointed out some potential issues with __fget_files_rcu()
as well...
* Re: [PATCH v2] vfs: shave work on failed file open
  2023-09-28 14:05 ` Christian Brauner
@ 2023-09-28 14:43   ` Jann Horn
  2023-09-28 17:21     ` Linus Torvalds
  0 siblings, 1 reply; 29+ messages in thread
From: Jann Horn @ 2023-09-28 14:43 UTC (permalink / raw)
To: Christian Brauner
Cc: Linus Torvalds, Mateusz Guzik, viro, linux-kernel, linux-fsdevel

On Thu, Sep 28, 2023 at 4:05 PM Christian Brauner <brauner@kernel.org> wrote:
> > So I spent a good chunk of time going through this patch.
>
> The main thing that makes me go "we shouldn't do this" is that KASAN
> isn't able to detect UAF issues as Jann pointed out so I'm getting
> really nervous about this.

(FWIW there is an in-progress patch to address this that I sent a few
weeks ago, but it has not landed yet:
<https://lore.kernel.org/linux-mm/20230825211426.3798691-1-jannh@google.com/>.
So currently KASAN can only detect UAF in SLAB_TYPESAFE_BY_RCU slabs
once the slab allocator has given the pages back to the page allocator.)

> And Jann also pointed out some potential issues with
> __fget_files_rcu() as well...

The issue I see with the current __fget_files_rcu() is that the
"file->f_mode & mask" check is no longer effective in its current
position; it would have to be moved down below the get_file_rcu() call.
That's a semantic difference between manually RCU-freeing and
SLAB_TYPESAFE_BY_RCU - we no longer have the guarantee that an object
can't be freed and reallocated within a single RCU grace period.
With the current patch, we could race like this:

```
static inline struct file *__fget_files_rcu(struct files_struct *files,
	unsigned int fd, fmode_t mask)
{
	for (;;) {
		struct file *file;
		struct fdtable *fdt = rcu_dereference_raw(files->fdt);
		struct file __rcu **fdentry;

		if (unlikely(fd >= fdt->max_fds))
			return NULL;

		fdentry = fdt->fd + array_index_nospec(fd, fdt->max_fds);
		file = rcu_dereference_raw(*fdentry);
		if (unlikely(!file))
			return NULL;

		if (unlikely(file->f_mode & mask))
			return NULL;

		[in another thread:]
		[file is removed from fd table and freed]
		[file is reallocated as something like an O_PATH file,
		 which the check above would not permit]
		[reallocated file is inserted in the fd table in the same position]

		/*
		 * Ok, we have a file pointer. However, because we do
		 * this all locklessly under RCU, we may be racing with
		 * that file being closed.
		 *
		 * Such a race can take two forms:
		 *
		 *  (a) the file ref already went down to zero,
		 *      and get_file_rcu() fails. Just try again:
		 */
		if (unlikely(!get_file_rcu(file)))
			[succeeds]
			continue;

		/*
		 *  (b) the file table entry has changed under us.
		 *      Note that we don't need to re-check the 'fdt->fd'
		 *      pointer having changed, because it always goes
		 *      hand-in-hand with 'fdt'.
		 *
		 * If so, we need to put our ref and try again.
		 */
		[recheck succeeds because the new file was inserted in the same position]
		if (unlikely(rcu_dereference_raw(files->fdt) != fdt) ||
		    unlikely(rcu_dereference_raw(*fdentry) != file)) {
			fput(file);
			continue;
		}

		/*
		 * Ok, we have a ref to the file, and checked that it
		 * still exists.
		 */
		[a file incompatible with the supplied mask is returned]
		return file;
	}
}
```

There are also some weird get_file_rcu() users in other places, like
BPF's task_file_seq_get_next() and gfs2_glockfd_next_file(), that do
weird stuff without the recheck; gfs2_glockfd_next_file() even looks at
the inodes of files without taking a reference (which seems a little
dodgy but maybe actually currently works because inodes are also
RCU-freed?).
So I think you'd have to clean all of that up before you can make this
change. Similar thing with get_mm_exe_file(), which relies on
get_file_rcu() success meaning that the file was not reallocated. And
tid_fd_mode() in procfs assumes that task_lookup_fd_rcu() returns a
file* whose mode can be inspected under RCU.

As Linus already mentioned, release_empty_file() is also broken now,
because it assumes that nobody will grab references to unopened files,
but actually that can now happen spuriously when a concurrent fget()
has called get_file_rcu() on a recycled file and not yet hit the
recheck fput(). Kinda like the thing with "struct page" where GUP can
randomly spuriously bump up the refcount of any page, including ones
that are not mapped into userspace. So that would have to go through
the same fput() path as every other file freeing.

We also now rely on the "f_count" initialization in init_file()
happening after the point of no return. That is currently the case, but
it would have to be documented to avoid someone adding a later bailout
in the future - and maybe it could be clarified by actually moving the
count initialization after the bailout?

Heh, I grepped for `__rcu.*file`, and BPF has a thing in
kernel/bpf/verifier.c that seems to imply it would be safe for some
types of BPF programs to follow the mm->exe_file reference solely
protected by RCU, which already seems a little dodgy now but more so
after this change:

```
/* RCU trusted: these fields are trusted in RCU CS and can be NULL */
BTF_TYPE_SAFE_RCU_OR_NULL(struct mm_struct) {
	struct file __rcu *exe_file;
};
```

(To be clear: This is not intended to be an exhaustive list.)

So I think conceptually this is something you can do, but it would
require a bit of cleanup all around the kernel to make sure you really
just have one or two central functions that make use of the limited
RCU-ness of "struct file", and that nothing else relies on that or
makes assumptions about how non-zero refcounts move.
* Re: [PATCH v2] vfs: shave work on failed file open
  2023-09-28 14:43 ` Jann Horn
@ 2023-09-28 17:21   ` Linus Torvalds
  2023-09-29  9:20     ` Christian Brauner
  0 siblings, 1 reply; 29+ messages in thread
From: Linus Torvalds @ 2023-09-28 17:21 UTC (permalink / raw)
To: Jann Horn
Cc: Christian Brauner, Mateusz Guzik, viro, linux-kernel, linux-fsdevel

On Thu, 28 Sept 2023 at 07:44, Jann Horn <jannh@google.com> wrote:
>
> The issue I see with the current __fget_files_rcu() is that the
> "file->f_mode & mask" is no longer effective in its current position,
> it would have to be moved down below the get_file_rcu() call.

Yes, you're right. But moving it down below the "re-check that the fdt
pointer and the file pointer still match" should be easy and sufficient.

> There are also some weird get_file_rcu() users in other places like
> BPF's task_file_seq_get_next and in gfs2_glockfd_next_file that do
> weird stuff without the recheck, especially gfs2_glockfd_next_file
> even looks at the inodes of files without taking a reference (which
> seems a little dodgy but maybe actually currently works because inodes
> are also RCU-freed?).

The inodes are also RCU-free'd, but that is indeed dodgy.

I think it happens to work, and we actually have a somewhat similar
pattern in the RCU lookup code (except with dentry->d_inode, not
file->f_inode), because as you say the inode data structure itself is
rcu-free'd, but more importantly, that code does the "get_file_rcu()"
afterwards.

And yes, right now that works fine, because it will fail if the file
f_count goes down to zero. And f_count will go down to zero before we
really tear down the inode with

	file->f_op->release(inode, file);

and (more importantly) the dput -> dentry_kill -> dentry_unlink_inode
-> release.

So that get_file_rcu() will currently protect against any "oh, the
inode is stale and about to be released".
But yes, that protection would be broken by SLAB_TYPESAFE_BY_RCU, since
then the "f_count is zero" is no longer a final thing. It's fixable by
having the same "double check the file table" that I do think we should
do regardless.

That get_file_rcu() pattern may *work*, but it's very very dodgy.

		Linus
* Re: [PATCH v2] vfs: shave work on failed file open
  2023-09-28 17:21 ` Linus Torvalds
@ 2023-09-29  9:20   ` Christian Brauner
  2023-09-29 13:31     ` Jann Horn
  0 siblings, 1 reply; 29+ messages in thread
From: Christian Brauner @ 2023-09-29 9:20 UTC (permalink / raw)
To: Linus Torvalds
Cc: Jann Horn, Mateusz Guzik, viro, linux-kernel, linux-fsdevel

> But yes, that protection would be broken by SLAB_TYPESAFE_BY_RCU,
> since then the "f_count is zero" is no longer a final thing.

I've tried coming up with a patch that is simple enough so the pattern
is easy to follow, and then converting all places to rely on a pattern
that combines lookup_fd_rcu() or similar with get_file_rcu(). The
obvious thing is that we'll force a few places to now always acquire a
reference when they don't really need one right now, and that already
may cause performance issues.

We also can't fully get rid of plain get_file_rcu() uses itself because
of users such as mm->exe_file. They don't go from one of the rcu fdtable
lookup helpers to the struct file, obviously; they rcu replace the file
pointer in their struct, ofc. So we could change get_file_rcu() to take
a struct file __rcu **f and then compare that the passed-in pointer
hasn't changed before we managed to do atomic_long_inc_not_zero(). Which
afaict should work for such cases.

But overall we would introduce a fairly big and at the same time subtle
semantic change. The idea is pretty neat and it was fun to do, but I'm
just not convinced we should do it, given how ubiquitous struct file use
is and how this makes the semantics even more special by allowing
refcounts.

I've kept your original release_empty_file() proposal in vfs.misc,
which I think is a really nice change.

Let me know if you all passionately disagree. ;)
* Re: [PATCH v2] vfs: shave work on failed file open
  2023-09-29  9:20 ` Christian Brauner
@ 2023-09-29 13:31   ` Jann Horn
  2023-09-29 19:57     ` Christian Brauner
  0 siblings, 1 reply; 29+ messages in thread
From: Jann Horn @ 2023-09-29 13:31 UTC (permalink / raw)
To: Christian Brauner
Cc: Linus Torvalds, Mateusz Guzik, viro, linux-kernel, linux-fsdevel

On Fri, Sep 29, 2023 at 11:20 AM Christian Brauner <brauner@kernel.org> wrote:
> > But yes, that protection would be broken by SLAB_TYPESAFE_BY_RCU,
> > since then the "f_count is zero" is no longer a final thing.
>
> I've tried coming up with a patch that is simple enough so the pattern
> is easy to follow and then converting all places to rely on a pattern
> that combine lookup_fd_rcu() or similar with get_file_rcu(). The obvious
> thing is that we'll force a few places to now always acquire a reference
> when they don't really need one right now and that already may cause
> performance issues.

(Those places are probably used way less often than the hot
open/fget/close paths though.)

> We also can't fully get rid of plain get_file_rcu() uses itself because
> of users such as mm->exe_file. [snip]
>
> But overall we would introduce a fairly big and at the same time subtle
> semantic change. [snip]
>
> I've kept your original release_empty_file() proposal in vfs.misc which
> I think is a really nice change.
>
> Let me know if you all passionately disagree. ;)
* Re: [PATCH v2] vfs: shave work on failed file open
  2023-09-29 13:31 ` Jann Horn
@ 2023-09-29 19:57   ` Christian Brauner
  2023-09-29 21:23     ` Mateusz Guzik
  0 siblings, 1 reply; 29+ messages in thread
From: Christian Brauner @ 2023-09-29 19:57 UTC (permalink / raw)
To: Jann Horn, Linus Torvalds
Cc: Mateusz Guzik, viro, linux-kernel, linux-fsdevel

[-- Attachment #1: Type: text/plain, Size: 2098 bytes --]

On Fri, Sep 29, 2023 at 03:31:29PM +0200, Jann Horn wrote:
> On Fri, Sep 29, 2023 at 11:20 AM Christian Brauner <brauner@kernel.org> wrote:
> > > But yes, that protection would be broken by SLAB_TYPESAFE_BY_RCU,
> > > since then the "f_count is zero" is no longer a final thing.
> >
> > I've tried coming up with a patch that is simple enough so the pattern
> > is easy to follow and then converting all places to rely on a pattern
> > that combine lookup_fd_rcu() or similar with get_file_rcu(). [snip]
>
> (Those places are probably used way less often than the hot
> open/fget/close paths though.)
>
> > We also can't fully get rid of plain get_file_rcu() uses itself because
> > of users such as mm->exe_file. [snip]
> >
> > But overall we would introduce a fairly big and at the same time subtle
> > semantic change.
> >
> > I've kept your original release_empty_file() proposal in vfs.misc which
> > I think is a really nice change.
> >
> > Let me know if you all passionately disagree. ;)

So I'm appending the patch I had played with and a fix from Jann on top.
@Linus, if you have an opinion, let me know what you think.

Also available here:
https://gitlab.com/brauner/linux/-/commits/vfs.file.rcu

Might be interesting if this could be perfed to see if there is any real
gain for workloads with massive numbers of fds.

[-- Attachment #2: 0001-PROBABLY-BROKEN-AS-ABSOLUTE-FSCK-AND-QUICKLY-DRAFTED.patch --]
[-- Type: text/x-diff, Size: 14296 bytes --]

From ad101054772181fa044f7891c4575c0a0b6205fd Mon Sep 17 00:00:00 2001
From: Christian Brauner <brauner@kernel.org>
Date: Fri, 29 Sep 2023 08:45:59 +0200
Subject: [PATCH 1/2] [PROBABLY BROKEN AS ABSOLUTE FSCK AND QUICKLY DRAFTED]
 file: convert to SLAB_TYPESAFE_BY_RCU

In recent discussions around some performance improvements in the file
handling area we discussed switching the file cache to rely on
SLAB_TYPESAFE_BY_RCU, which allows us to get rid of call_rcu() based
freeing for files completely. This is a pretty sensitive change overall
but it might actually be worth doing.

The main downside is the subtlety. The other one is that we should
really wait for Jann's patch that enables KASAN to handle
SLAB_TYPESAFE_BY_RCU UAFs to land. Currently KASAN doesn't handle them,
but a patch for this exists.

With SLAB_TYPESAFE_BY_RCU objects may be freed and reused multiple
times, which requires a few changes. In __fget_files_rcu() the check
for f_mode needs to move down to after we've acquired a reference on
the object.
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 Documentation/filesystems/files.rst          |  7 +++
 arch/powerpc/platforms/cell/spufs/coredump.c |  7 ++-
 drivers/gpu/drm/i915/gem/i915_gem_mman.c     |  2 +-
 fs/file.c                                    | 60 +++++++++++++++-----
 fs/file_table.c                              | 36 ++++++------
 fs/gfs2/glock.c                              |  7 ++-
 fs/notify/dnotify/dnotify.c                  |  4 +-
 fs/proc/fd.c                                 |  7 ++-
 include/linux/fdtable.h                      | 15 +++--
 include/linux/fs.h                           |  3 +-
 kernel/bpf/task_iter.c                       |  2 -
 kernel/fork.c                                |  4 +-
 kernel/kcmp.c                                |  2 +
 13 files changed, 105 insertions(+), 51 deletions(-)

diff --git a/Documentation/filesystems/files.rst b/Documentation/filesystems/files.rst
index bcf84459917f..9e77f46a7389 100644
--- a/Documentation/filesystems/files.rst
+++ b/Documentation/filesystems/files.rst
@@ -126,3 +126,10 @@ the fdtable structure -
    Since locate_fd() can drop ->file_lock (and reacquire ->file_lock),
    the fdtable pointer (fdt) must be loaded after locate_fd().
 
+On newer kernels rcu based file lookup has been switched to rely on
+SLAB_TYPESAFE_BY_RCU. This means it isn't sufficient anymore to just acquire a
+reference to the file in question under rcu using atomic_long_inc_not_zero()
+since the file might have already been recycled and someone else might have
+bumped the reference. In other words, the caller might see reference
+count bumps from newer users. For this reason it is necessary to verify that
+the pointer is the same before and after the reference count increment.
diff --git a/arch/powerpc/platforms/cell/spufs/coredump.c b/arch/powerpc/platforms/cell/spufs/coredump.c
index 1a587618015c..6fe84037bccd 100644
--- a/arch/powerpc/platforms/cell/spufs/coredump.c
+++ b/arch/powerpc/platforms/cell/spufs/coredump.c
@@ -75,9 +75,12 @@ static struct spu_context *coredump_next_context(int *fd)
 
 	rcu_read_lock();
 	file = lookup_fd_rcu(*fd);
-	ctx = SPUFS_I(file_inode(file))->i_ctx;
-	get_spu_context(ctx);
 	rcu_read_unlock();
+	if (file) {
+		ctx = SPUFS_I(file_inode(file))->i_ctx;
+		get_spu_context(ctx);
+		fput(file);
+	}
 	return ctx;
 }

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_mman.c b/drivers/gpu/drm/i915/gem/i915_gem_mman.c
index aa4d842d4c5a..b2f00f54218f 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_mman.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_mman.c
@@ -917,7 +917,7 @@ static struct file *mmap_singleton(struct drm_i915_private *i915)
 
 	rcu_read_lock();
 	file = READ_ONCE(i915->gem.mmap_singleton);
-	if (file && !get_file_rcu(file))
+	if (!get_file_rcu(&file))
 		file = NULL;
 	rcu_read_unlock();
 	if (file)

diff --git a/fs/file.c b/fs/file.c
index 3e4a4dfa38fc..e983cf3b9e01 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -853,8 +853,39 @@ void do_close_on_exec(struct files_struct *files)
 	spin_unlock(&files->file_lock);
 }
 
-static inline struct file *__fget_files_rcu(struct files_struct *files,
-	unsigned int fd, fmode_t mask)
+struct file *get_file_rcu(struct file __rcu **f)
+{
+	for (;;) {
+		struct file __rcu *file;
+
+		file = rcu_dereference_raw(*f);
+		if (!file)
+			return NULL;
+
+		if (unlikely(!atomic_long_inc_not_zero(&file->f_count)))
+			continue;
+
+		/*
+		 * atomic_long_inc_not_zero() serves as a full memory
+		 * barrier when we acquired a reference.
+		 *
+		 * This is paired with the write barrier from assigning
+		 * to the __rcu protected file pointer so that if that
+		 * pointer still matches the current file, we know we
+		 * have successfully acquired a reference to it.
+		 *
+		 * If the pointers don't match the file has been
+		 * reallocated by SLAB_TYPESAFE_BY_RCU. So verify that
+		 * we're holding the right reference.
+		 */
+		if (file == rcu_access_pointer(*f))
+			return rcu_pointer_handoff(file);
+
+		fput(file);
+	}
+}
+
+struct file *__fget_files_rcu(struct files_struct *files, unsigned int fd, fmode_t mask)
 {
 	for (;;) {
 		struct file *file;
@@ -865,12 +896,6 @@ static inline struct file *__fget_files_rcu(struct files_struct *files,
 			return NULL;
 
 		fdentry = fdt->fd + array_index_nospec(fd, fdt->max_fds);
-		file = rcu_dereference_raw(*fdentry);
-		if (unlikely(!file))
-			return NULL;
-
-		if (unlikely(file->f_mode & mask))
-			return NULL;
 
 		/*
 		 * Ok, we have a file pointer. However, because we do
@@ -882,8 +907,9 @@ static inline struct file *__fget_files_rcu(struct files_struct *files,
 		 * (a) the file ref already went down to zero,
 		 * and get_file_rcu() fails. Just try again:
 		 */
-		if (unlikely(!get_file_rcu(file)))
-			continue;
+		file = get_file_rcu(fdentry);
+		if (unlikely(!file))
+			return NULL;
 
 		/*
 		 * (b) the file table entry has changed under us.
@@ -893,12 +919,16 @@ static inline struct file *__fget_files_rcu(struct files_struct *files,
 		 *
 		 * If so, we need to put our ref and try again.
 		 */
-		if (unlikely(rcu_dereference_raw(files->fdt) != fdt) ||
-		    unlikely(rcu_dereference_raw(*fdentry) != file)) {
+		if (unlikely(rcu_dereference_raw(files->fdt) != fdt)) {
 			fput(file);
 			continue;
 		}
 
+		if (unlikely(file->f_mode & mask)) {
+			fput(file);
+			return NULL;
+		}
+
 		/*
 		 * Ok, we have a ref to the file, and checked that it
 		 * still exists.
@@ -1272,12 +1302,16 @@ SYSCALL_DEFINE2(dup2, unsigned int, oldfd, unsigned int, newfd)
 {
 	if (unlikely(newfd == oldfd)) { /* corner case */
 		struct files_struct *files = current->files;
+		struct file *f;
 		int retval = oldfd;
 
 		rcu_read_lock();
-		if (!files_lookup_fd_rcu(files, oldfd))
+		f = files_lookup_fd_rcu(files, oldfd);
+		if (!f)
 			retval = -EBADF;
 		rcu_read_unlock();
+		if (f)
+			fput(f);
 		return retval;
 	}
 	return ksys_dup3(oldfd, newfd, 0);

diff --git a/fs/file_table.c b/fs/file_table.c
index e68e97d4f00a..844c97d21b33 100644
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -65,33 +65,34 @@ static void file_free_rcu(struct rcu_head *head)
 {
 	struct file *f = container_of(head, struct file, f_rcuhead);
 
-	put_cred(f->f_cred);
-	if (unlikely(f->f_mode & FMODE_BACKING))
-		kfree(backing_file(f));
-	else
-		kmem_cache_free(filp_cachep, f);
+	kfree(backing_file(f));
 }
 
 static inline void file_free(struct file *f)
 {
 	security_file_free(f);
-	if (unlikely(f->f_mode & FMODE_BACKING))
-		path_put(backing_file_real_path(f));
 	if (likely(!(f->f_mode & FMODE_NOACCOUNT)))
 		percpu_counter_dec(&nr_files);
-	call_rcu(&f->f_rcuhead, file_free_rcu);
+	put_cred(f->f_cred);
+	if (unlikely(f->f_mode & FMODE_BACKING)) {
+		path_put(backing_file_real_path(f));
+		call_rcu(&f->f_rcuhead, file_free_rcu);
+	} else {
+		kmem_cache_free(filp_cachep, f);
+	}
 }
 
 void release_empty_file(struct file *f)
 {
 	WARN_ON_ONCE(f->f_mode & (FMODE_BACKING | FMODE_OPENED));
-	/* Uhm, we better find out who grabs references to an unopened file.
-	 */
-	WARN_ON_ONCE(atomic_long_cmpxchg(&f->f_count, 1, 0) != 1);
-	security_file_free(f);
-	put_cred(f->f_cred);
-	if (likely(!(f->f_mode & FMODE_NOACCOUNT)))
-		percpu_counter_dec(&nr_files);
-	kmem_cache_free(filp_cachep, f);
+	if (atomic_long_dec_and_test(&f->f_count)) {
+		security_file_free(f);
+		put_cred(f->f_cred);
+		if (likely(!(f->f_mode & FMODE_NOACCOUNT)))
+			percpu_counter_dec(&nr_files);
+		kmem_cache_free(filp_cachep, f);
+		return;
+	}
 }
 
 /*
@@ -176,7 +177,6 @@ static int init_file(struct file *f, int flags, const struct cred *cred)
 		return error;
 	}
 
-	atomic_long_set(&f->f_count, 1);
 	rwlock_init(&f->f_owner.lock);
 	spin_lock_init(&f->f_lock);
 	mutex_init(&f->f_pos_lock);
@@ -184,6 +184,7 @@ static int init_file(struct file *f, int flags, const struct cred *cred)
 	f->f_mode = OPEN_FMODE(flags);
 	/* f->f_version: 0 */
+	atomic_long_set(&f->f_count, 1);
 	return 0;
 }
 
@@ -483,7 +484,8 @@ EXPORT_SYMBOL(__fput_sync);
 void __init files_init(void)
 {
 	filp_cachep = kmem_cache_create("filp", sizeof(struct file), 0,
-			SLAB_HWCACHE_ALIGN | SLAB_PANIC | SLAB_ACCOUNT, NULL);
+			SLAB_TYPESAFE_BY_RCU | SLAB_HWCACHE_ALIGN |
+			SLAB_PANIC | SLAB_ACCOUNT, NULL);
 	percpu_counter_init(&nr_files, 0, GFP_KERNEL);
 }

diff --git a/fs/gfs2/glock.c b/fs/gfs2/glock.c
index 9cbf8d98489a..ced04c49e37c 100644
--- a/fs/gfs2/glock.c
+++ b/fs/gfs2/glock.c
@@ -2723,10 +2723,11 @@ static struct file *gfs2_glockfd_next_file(struct gfs2_glockfd_iter *i)
 			break;
 		}
 		inode = file_inode(i->file);
-		if (inode->i_sb != i->sb)
+		if (inode->i_sb != i->sb) {
+			fput(i->file);
 			continue;
-		if (get_file_rcu(i->file))
-			break;
+		}
+		break;
 	}
 	rcu_read_unlock();
 	return i->file;

diff --git a/fs/notify/dnotify/dnotify.c b/fs/notify/dnotify/dnotify.c
index ebdcc25df0f7..987db4c8bbff 100644
--- a/fs/notify/dnotify/dnotify.c
+++ b/fs/notify/dnotify/dnotify.c
@@ -265,7 +265,7 @@ int fcntl_dirnotify(int fd, struct file *filp, unsigned int arg)
 	struct dnotify_struct *dn;
 	struct inode *inode;
 	fl_owner_t id = current->files;
-	struct file *f;
+	struct file *f = NULL;
 	int destroy = 0, error = 0;
 	__u32 mask;
 
@@ -392,6 +392,8 @@ int fcntl_dirnotify(int fd, struct file *filp, unsigned int arg)
 		fsnotify_put_mark(new_fsn_mark);
 	if (dn)
 		kmem_cache_free(dnotify_struct_cache, dn);
+	if (f)
+		fput(f);
 	return error;
 }

diff --git a/fs/proc/fd.c b/fs/proc/fd.c
index 6276b3938842..47a717142efa 100644
--- a/fs/proc/fd.c
+++ b/fs/proc/fd.c
@@ -114,9 +114,11 @@ static bool tid_fd_mode(struct task_struct *task, unsigned fd, fmode_t *mode)
 
 	rcu_read_lock();
 	file = task_lookup_fd_rcu(task, fd);
-	if (file)
-		*mode = file->f_mode;
 	rcu_read_unlock();
+	if (file) {
+		*mode = file->f_mode;
+		fput(file);
+	}
 	return !!file;
 }
 
@@ -265,6 +267,7 @@ static int proc_readfd_common(struct file *file, struct dir_context *ctx,
 			break;
 		data.mode = f->f_mode;
 		rcu_read_unlock();
+		fput(f);
 		data.fd = fd;
 
 		len = snprintf(name, sizeof(name), "%u", fd);

diff --git a/include/linux/fdtable.h b/include/linux/fdtable.h
index e066816f3519..6d088f069228 100644
--- a/include/linux/fdtable.h
+++ b/include/linux/fdtable.h
@@ -77,6 +77,8 @@ struct dentry;
 #define files_fdtable(files) \
 	rcu_dereference_check_fdtable((files), (files)->fdt)
 
+struct file *__fget_files_rcu(struct files_struct *files, unsigned int fd, fmode_t mask);
+
 /*
  * The caller must ensure that fd table isn't shared or hold rcu or file lock
  */
@@ -98,16 +100,17 @@ static inline struct file *files_lookup_fd_locked(struct files_struct *files, un
 	return files_lookup_fd_raw(files, fd);
 }
 
-static inline struct file *files_lookup_fd_rcu(struct files_struct *files, unsigned int fd)
+static inline struct file *lookup_fd_rcu(unsigned int fd)
 {
-	RCU_LOCKDEP_WARN(!rcu_read_lock_held(),
-			 "suspicious rcu_dereference_check() usage");
-	return files_lookup_fd_raw(files, fd);
+	return __fget_files_rcu(current->files, fd, 0);
+
 }
 
-static inline struct file *lookup_fd_rcu(unsigned int fd)
+static inline struct file *files_lookup_fd_rcu(struct files_struct *files, unsigned int fd)
 {
-	return files_lookup_fd_rcu(current->files, fd);
+	RCU_LOCKDEP_WARN(!rcu_read_lock_held(),
+			 "suspicious rcu_dereference_check() usage");
+	return lookup_fd_rcu(fd);
 }
 
 struct file *task_lookup_fd_rcu(struct task_struct *task, unsigned int fd);

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 58dea591a341..f9a601629517 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1042,7 +1042,8 @@ static inline struct file *get_file(struct file *f)
 	atomic_long_inc(&f->f_count);
 	return f;
 }
-#define get_file_rcu(x) atomic_long_inc_not_zero(&(x)->f_count)
+struct file *get_file_rcu(struct file __rcu **f);
+
 #define file_count(x)	atomic_long_read(&(x)->f_count)
 
 #define MAX_NON_LFS	((1UL<<31) - 1)

diff --git a/kernel/bpf/task_iter.c b/kernel/bpf/task_iter.c
index c4ab9d6cdbe9..ee1d5c0ccf5a 100644
--- a/kernel/bpf/task_iter.c
+++ b/kernel/bpf/task_iter.c
@@ -310,8 +310,6 @@ task_file_seq_get_next(struct bpf_iter_seq_task_file_info *info)
 		struct file *f;
 		f = task_lookup_next_fd_rcu(curr_task, &curr_fd);
 		if (!f)
-			break;
-		if (!get_file_rcu(f))
 			continue;
 
 		/* set info->fd */

diff --git a/kernel/fork.c b/kernel/fork.c
index 3b6d20dfb9a8..640123767726 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1492,9 +1492,7 @@ struct file *get_mm_exe_file(struct mm_struct *mm)
 	struct file *exe_file;
 
 	rcu_read_lock();
-	exe_file = rcu_dereference(mm->exe_file);
-	if (exe_file && !get_file_rcu(exe_file))
-		exe_file = NULL;
+	exe_file = get_file_rcu(&mm->exe_file);
 	rcu_read_unlock();
 	return exe_file;
 }

diff --git a/kernel/kcmp.c b/kernel/kcmp.c
index 5353edfad8e1..e0dfa82606cb 100644
--- a/kernel/kcmp.c
+++ b/kernel/kcmp.c
@@ -66,6 +66,8 @@ get_file_raw_ptr(struct task_struct *task, unsigned int idx)
 	rcu_read_lock();
 	file = task_lookup_fd_rcu(task, idx);
 	rcu_read_unlock();
+	if (file)
+		fput(file);
 
 	return file;
 }
-- 
2.34.1

[-- Attachment #3: 0002-file-ensure-ordering-between-memory-reallocation-and.patch --]
[-- Type: text/x-diff, Size: 1482 bytes --]

From 479d59bdfb5a157a218f8cafb04d1556e175fc80 Mon Sep 17 00:00:00 2001
From: Christian Brauner <brauner@kernel.org>
Date: Fri, 29 Sep 2023 21:49:39 +0200
Subject: [PATCH 2/2] file: ensure ordering between memory reallocation and
 pointer check

Ensure ordering between memory reallocation and the pointer check by
making sure that all subsequent loads have a dependency on the second
load from *f.

Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 fs/file.c | 12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/fs/file.c b/fs/file.c
index e983cf3b9e01..8d3c10dfb98a 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -857,6 +857,8 @@ struct file *get_file_rcu(struct file __rcu **f)
 {
 	for (;;) {
 		struct file __rcu *file;
+		struct file __rcu *file_reloaded;
+		struct file __rcu *file_reloaded_cmp;
 
 		file = rcu_dereference_raw(*f);
 		if (!file)
@@ -877,9 +879,15 @@ struct file *get_file_rcu(struct file __rcu **f)
 		 * If the pointers don't match the file has been
 		 * reallocated by SLAB_TYPESAFE_BY_RCU. So verify that
 		 * we're holding the right reference.
+		 *
+		 * Ensure that all accesses have a dependency on the
+		 * load from rcu_dereference_raw().
 		 */
-		if (file == rcu_access_pointer(*f))
-			return rcu_pointer_handoff(file);
+		file_reloaded = rcu_dereference_raw(*f);
+		file_reloaded_cmp = file_reloaded;
+		OPTIMIZER_HIDE_VAR(file_reloaded_cmp);
+		if (file == file_reloaded_cmp)
+			return file_reloaded;
 
 		fput(file);
 	}
-- 
2.34.1
* Re: [PATCH v2] vfs: shave work on failed file open
  2023-09-29 19:57 ` Christian Brauner
@ 2023-09-29 21:23   ` Mateusz Guzik
  2023-09-29 21:39     ` Mateusz Guzik
  2023-09-29 22:24     ` Matthew Wilcox
  0 siblings, 2 replies; 29+ messages in thread
From: Mateusz Guzik @ 2023-09-29 21:23 UTC (permalink / raw)
To: Christian Brauner
Cc: Jann Horn, Linus Torvalds, viro, linux-kernel, linux-fsdevel

On 9/29/23, Christian Brauner <brauner@kernel.org> wrote:
> On Fri, Sep 29, 2023 at 03:31:29PM +0200, Jann Horn wrote:
>> On Fri, Sep 29, 2023 at 11:20 AM Christian Brauner <brauner@kernel.org>
>> wrote:
>> > > But yes, that protection would be broken by SLAB_TYPESAFE_BY_RCU,
>> > > since then the "f_count is zero" is no longer a final thing.
>> >
>> > I've tried coming up with a patch that is simple enough so the pattern
>> > is easy to follow and then converting all places to rely on a pattern
>> > that combine lookup_fd_rcu() or similar with get_file_rcu(). [snip]
>>
>> (Those places are probably used way less often than the hot
>> open/fget/close paths though.)
>>
>> > We also can't fully get rid of plain get_file_rcu() uses itself because
>> > of users such as mm->exe_file. [snip]
>> >
>> > But overall we would introduce a fairly big and at the same time subtle
>> > semantic change.
>> > The idea is pretty neat and it was fun to do but I'm
>> > just not convinced we should do it given how ubiquitous struct file is
>> > used and now to make the semantics even more special by allowing
>> > refcounts.
>> >
>> > I've kept your original release_empty_file() proposal in vfs.misc which
>> > I think is a really nice change.
>> >
>> > Let me know if you all passionately disagree. ;)
>
> So I'm appending the patch I had played with and a fix from Jann on top.
> @Linus, if you have an opinion, let me know what you think.
>
> Also available here:
> https://gitlab.com/brauner/linux/-/commits/vfs.file.rcu
>
> Might be interesting if this could be perfed to see if there is any real
> gain for workloads with massive numbers of fds.

I would feel safer with a guaranteed way to tell that the file was
reallocated.

I think this could track allocs/frees with a sequence counter embedded
into the object, say odd means deallocated and even means allocated.

Then you would know for a fact whether you raced with the file getting
whacked and would never have to wonder if you double-checked everything
you needed (like that f_mode thing).

This would also mean that consumers which get away with poking around
the file without getting a ref could still do it; this is at least true
for tid_fd_mode(). All of them would need patching though.

Extending struct file is not ideal by any means, but the good news is that:
1. there is a 4 byte hole in there, if one is fine with an int-sized
   counter
2. if one insists on 8 bytes, the struct is 232 bytes on my kernel
   (debian). still some room up to 256, so it may be tolerable?

-- 
Mateusz Guzik <mjguzik gmail.com>
* Re: [PATCH v2] vfs: shave work on failed file open 2023-09-29 21:23 ` Mateusz Guzik @ 2023-09-29 21:39 ` Mateusz Guzik 2023-09-29 23:57 ` Linus Torvalds 2023-09-29 22:24 ` Matthew Wilcox 1 sibling, 1 reply; 29+ messages in thread From: Mateusz Guzik @ 2023-09-29 21:39 UTC (permalink / raw) To: Christian Brauner Cc: Jann Horn, Linus Torvalds, viro, linux-kernel, linux-fsdevel On 9/29/23, Mateusz Guzik <mjguzik@gmail.com> wrote: > On 9/29/23, Christian Brauner <brauner@kernel.org> wrote: >> On Fri, Sep 29, 2023 at 03:31:29PM +0200, Jann Horn wrote: >>> On Fri, Sep 29, 2023 at 11:20 AM Christian Brauner <brauner@kernel.org> >>> wrote: >>> > > But yes, that protection would be broken by SLAB_TYPESAFE_BY_RCU, >>> > > since then the "f_count is zero" is no longer a final thing. >>> > >>> > I've tried coming up with a patch that is simple enough so the pattern >>> > is easy to follow and then converting all places to rely on a pattern >>> > that combine lookup_fd_rcu() or similar with get_file_rcu(). The >>> > obvious >>> > thing is that we'll force a few places to now always acquire a >>> > reference >>> > when they don't really need one right now and that already may cause >>> > performance issues. >>> >>> (Those places are probably used way less often than the hot >>> open/fget/close paths though.) >>> >>> > We also can't fully get rid of plain get_file_rcu() uses itself >>> > because >>> > of users such as mm->exe_file. They don't go from one of the rcu >>> > fdtable >>> > lookup helpers to the struct file obviously. They rcu replace the file >>> > pointer in their struct ofc so we could change get_file_rcu() to take >>> > a >>> > struct file __rcu **f and then comparing that the passed in pointer >>> > hasn't changed before we managed to do atomic_long_inc_not_zero(). >>> > Which >>> > afaict should work for such cases. >>> > >>> > But overall we would introduce a fairly big and at the same time >>> > subtle >>> > semantic change. 
The idea is pretty neat and it was fun to do but I'm >>> > just not convinced we should do it given how ubiquitous struct file is >>> > used and now to make the semanics even more special by allowing >>> > refcounts. >>> > >>> > I've kept your original release_empty_file() proposal in vfs.misc >>> > which >>> > I think is a really nice change. >>> > >>> > Let me know if you all passionately disagree. ;) >> >> So I'm appending the patch I had played with and a fix from Jann on top. >> @Linus, if you have an opinion, let me know what you think. >> >> Also available here: >> https://gitlab.com/brauner/linux/-/commits/vfs.file.rcu >> >> Might be interesting if this could be perfed to see if there is any real >> gain for workloads with massive numbers of fds. >> > > I would feel safer with a guaranteed way to tell that the file was > reallocated. > > I think this could track allocs/frees with a sequence counter embedded > into the object, say odd means deallocated and even means allocated. > > Then you would know for a fact whether you raced with the file getting > whacked and would never have to wonder if you double-checked > everything you needed (like that f_mode) thing. > > This would also mean that consumers which get away with poking around > the file without getting a ref could still do it, this is at least > true for tid_fd_mode. All of them would need patching though. > > Extending struct file is not ideal by any means, but the good news is that: > 1. there is a 4 byte hole in there, if one is fine with an int-sized > counter > 2. if one insists on 8 bytes, the struct is 232 bytes on my kernel > (debian). still some room up to 256, so it may be tolerable? > So to be clear, obtaining the initial count would require a dedicated accessor. First you would find the file obj, wait for the count to reach "allocated" state, validate the source still has the right pointer, validate the count did not change (with acq fences sprinkled in there). 
At the end of it you know that the seq counter you got from the file was there when the file was still "installed". Then you can poke around and validate you poked around the right thing by once more validating the counter. Maybe I missed something, but the idea in general should work. -- Mateusz Guzik <mjguzik gmail.com> ^ permalink raw reply [flat|nested] 29+ messages in thread
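[Editor's sketch] The dedicated accessor described above (find the object, observe an "allocated" sequence, re-validate the source pointer, later re-validate the sequence) could look roughly like this hypothetical userspace model. All names are invented, and the acquire fences are indicative rather than exhaustively audited.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct fake_file {
	atomic_uint f_seq;	/* even: allocated, odd: freed */
};

/* Begin a lockless inspection: succeed only if the slot points at an
 * object in the allocated (even) state and still points at it after
 * the sequence was sampled. */
static int file_seq_begin(struct fake_file *_Atomic *slot,
			  struct fake_file **fp, unsigned int *seqp)
{
	struct fake_file *f = atomic_load_explicit(slot, memory_order_acquire);
	unsigned int seq;

	if (!f)
		return 0;
	seq = atomic_load_explicit(&f->f_seq, memory_order_acquire);
	if (seq & 1)
		return 0;		/* freed or mid-allocation */
	if (atomic_load_explicit(slot, memory_order_acquire) != f)
		return 0;		/* the source moved on */
	*fp = f;
	*seqp = seq;
	return 1;
}

/* After poking around, confirm the object was not freed/reused: any
 * free bumps the sequence, so a changed value means retry. */
static int file_seq_retry(struct fake_file *f, unsigned int seq)
{
	atomic_thread_fence(memory_order_acquire);
	return atomic_load_explicit(&f->f_seq, memory_order_relaxed) != seq;
}
```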
* Re: [PATCH v2] vfs: shave work on failed file open 2023-09-29 21:39 ` Mateusz Guzik @ 2023-09-29 23:57 ` Linus Torvalds 2023-09-30 9:04 ` Christian Brauner 0 siblings, 1 reply; 29+ messages in thread From: Linus Torvalds @ 2023-09-29 23:57 UTC (permalink / raw) To: Mateusz Guzik Cc: Christian Brauner, Jann Horn, viro, linux-kernel, linux-fsdevel On Fri, 29 Sept 2023 at 14:39, Mateusz Guzik <mjguzik@gmail.com> wrote: > > So to be clear, obtaining the initial count would require a dedicated > accessor. Please, no. Sequence numbers here are fundamentally broken, since getting that initial sequence number would involve either (a) making it something outside of 'struct file' itself or (b) require the same re-validation of the file pointer that the non-sequence number code needed in the first place. We already have the right model in the only place that really matters (ie fd lookup). Using that same "validate file pointer after you got the ref to it" for the two or three other cases that didn't do it (and are simpler: the exec pointer in particular doesn't need the fdt re-validation at all). The fact that we had some fd lookup that didn't do the full thing that a *real* fd lookup did is just bad. Let's fix it, not introduce a sequence counter that only adds more complexity. Linus ^ permalink raw reply [flat|nested] 29+ messages in thread
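[Editor's sketch] The "validate the file pointer after you got the ref to it" model Linus refers to can be sketched in userspace C along these lines. Names are illustrative, and the sketch is simplified to give up when the count has already hit zero; the kernel version has to be more careful because, with type-stable memory, a zero count is not final.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct fake_file {
	atomic_long f_count;
};

/* Model of atomic_long_inc_not_zero(): take a reference only if at
 * least one reference already exists. */
static int inc_not_zero(atomic_long *cnt)
{
	long c = atomic_load(cnt);
	while (c > 0)
		if (atomic_compare_exchange_weak(cnt, &c, c + 1))
			return 1;
	return 0;
}

/* *slot plays the role of an __rcu file pointer (an fd table entry or
 * mm->exe_file): grab a ref, then re-read the slot; only a stable
 * pointer proves the ref belongs to the file we looked up. */
static struct fake_file *get_file_checked(struct fake_file *_Atomic *slot)
{
	for (;;) {
		struct fake_file *f = atomic_load(slot);

		if (!f)
			return NULL;
		if (!inc_not_zero(&f->f_count))
			return NULL;	/* ref already gone; simplified */
		if (atomic_load(slot) == f)
			return f;	/* pointer stable: the ref is ours */
		/* Slot changed under us: drop the ref and retry. */
		atomic_fetch_sub(&f->f_count, 1);	/* stand-in for fput() */
	}
}
```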
* Re: [PATCH v2] vfs: shave work on failed file open 2023-09-29 23:57 ` Linus Torvalds @ 2023-09-30 9:04 ` Christian Brauner 2023-10-03 16:45 ` Nathan Chancellor 2023-10-10 3:06 ` Al Viro 1 sibling, 2 replies; 29+ messages in thread From: Christian Brauner @ 2023-09-30 9:04 UTC (permalink / raw) To: Linus Torvalds Cc: Mateusz Guzik, Jann Horn, viro, linux-kernel, linux-fsdevel [-- Attachment #1: Type: text/plain, Size: 1455 bytes --] On Fri, Sep 29, 2023 at 04:57:29PM -0700, Linus Torvalds wrote: > On Fri, 29 Sept 2023 at 14:39, Mateusz Guzik <mjguzik@gmail.com> wrote: > > > > So to be clear, obtaining the initial count would require a dedicated > > accessor. > > Please, no. > > Sequence numbers here are fundamentally broken, since getting that > initial sequence number would involve either (a) making it something > outside of 'struct file' itself or (b) require the same re-validation > of the file pointer that the non-sequence number code needed in the > first place. > > We already have the right model in the only place that really matters > (ie fd lookup). Using that same "validate file pointer after you got > the ref to it" for the two or three other cases that didn't do it (and > are simpler: the exec pointer in particular doesn't need the fdt > re-validation at all). > > The fact that we had some fd lookup that didn't do the full thing that > a *real* fd lookup did is just bad. Let's fix it, not introduce a > sequence counter that only adds more complexity. I agree. So I guess we're trying this. The appended patch now includes documentation and renames *lookup_*_fd_rcu() to *lookup_*_fdget_rcu() to reflect the refcount bump. It's now tentatively in vfs.misc (cf. [1]) and I've merged it into vfs.all to let -next chew on it. Please take a close look and may the rcu gods be with us all... 
[1]: git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git [-- Attachment #2: 0001-file-convert-to-SLAB_TYPESAFE_BY_RCU.patch --] [-- Type: text/x-diff, Size: 21347 bytes --] From d266eee9d9d917f07774e2c2bab0115d2119a311 Mon Sep 17 00:00:00 2001 From: Christian Brauner <brauner@kernel.org> Date: Fri, 29 Sep 2023 08:45:59 +0200 Subject: [PATCH] file: convert to SLAB_TYPESAFE_BY_RCU In recent discussions around some performance improvements in the file handling area we discussed switching the file cache to rely on SLAB_TYPESAFE_BY_RCU which allows us to get rid of call_rcu() based freeing for files completely. This is a pretty sensitive change overall but it might actually be worth doing. The main downside is the subtlety. The other one is that we should really wait for Jann's patch to land that enables KASAN to handle SLAB_TYPESAFE_BY_RCU UAFs. Currently it doesn't but a patch for this exists. With SLAB_TYPESAFE_BY_RCU objects may be freed and reused multiple times which requires a few changes. So it isn't sufficient anymore to just acquire a reference to the file in question under rcu using atomic_long_inc_not_zero() since the file might have already been recycled and someone else might have bumped the reference. In other words, callers might see reference count bumps from newer users. For this is reason it is necessary to verify that the pointer is the same before and after the reference count increment. This pattern can be seen in get_file_rcu() and __files_get_rcu(). In addition, it isn't possible to access or check fields in struct file without first aqcuiring a reference on it. Not doing that was always very dodgy and it was only usable for non-pointer data in struct file. With SLAB_TYPESAFE_BY_RCU it is necessary that callers first acquire a reference under rcu or they must hold the files_lock of the fdtable. Failing to do either one of this is a bug. 
Thanks to Jann for pointing out that we need to ensure memory ordering between reallocations and pointer check by ensuring that all subsequent loads have a dependency on the second load in get_file_rcu() and providing a fixup that was folded into this patch. Cc: Jann Horn <jannh@google.com> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Christian Brauner <brauner@kernel.org> --- Documentation/filesystems/files.rst | 51 +++++------ arch/powerpc/platforms/cell/spufs/coredump.c | 9 +- drivers/gpu/drm/i915/gem/i915_gem_mman.c | 2 +- fs/file.c | 96 ++++++++++++++++---- fs/file_table.c | 41 +++++---- fs/gfs2/glock.c | 11 ++- fs/notify/dnotify/dnotify.c | 6 +- fs/proc/fd.c | 11 ++- include/linux/fdtable.h | 16 ++-- include/linux/fs.h | 4 +- kernel/bpf/task_iter.c | 4 +- kernel/fork.c | 4 +- kernel/kcmp.c | 4 +- 13 files changed, 163 insertions(+), 96 deletions(-) diff --git a/Documentation/filesystems/files.rst b/Documentation/filesystems/files.rst index bcf84459917f..f761bdae961d 100644 --- a/Documentation/filesystems/files.rst +++ b/Documentation/filesystems/files.rst @@ -62,7 +62,7 @@ the fdtable structure - be held. 4. To look up the file structure given an fd, a reader - must use either lookup_fd_rcu() or files_lookup_fd_rcu() APIs. These + must use either lookup_fdget_rcu() or files_lookup_fdget_rcu() APIs. These take care of barrier requirements due to lock-free lookup. An example:: @@ -70,43 +70,22 @@ the fdtable structure - struct file *file; rcu_read_lock(); - file = lookup_fd_rcu(fd); - if (file) { - ... - } - .... + file = lookup_fdget_rcu(fd); rcu_read_unlock(); - -5. Handling of the file structures is special. Since the look-up - of the fd (fget()/fget_light()) are lock-free, it is possible - that look-up may race with the last put() operation on the - file structure. 
This is avoided using atomic_long_inc_not_zero() - on ->f_count:: - - rcu_read_lock(); - file = files_lookup_fd_rcu(files, fd); if (file) { - if (atomic_long_inc_not_zero(&file->f_count)) - *fput_needed = 1; - else - /* Didn't get the reference, someone's freed */ - file = NULL; + ... + fput(file); } - rcu_read_unlock(); .... - return file; - - atomic_long_inc_not_zero() detects if refcounts is already zero or - goes to zero during increment. If it does, we fail - fget()/fget_light(). -6. Since both fdtable and file structures can be looked up +5. Since both fdtable and file structures can be looked up lock-free, they must be installed using rcu_assign_pointer() API. If they are looked up lock-free, rcu_dereference() must be used. However it is advisable to use files_fdtable() - and lookup_fd_rcu()/files_lookup_fd_rcu() which take care of these issues. + and lookup_fdget_rcu()/files_lookup_fdget_rcu() which take care of these + issues. -7. While updating, the fdtable pointer must be looked up while +6. While updating, the fdtable pointer must be looked up while holding files->file_lock. If ->file_lock is dropped, then another thread expand the files thereby creating a new fdtable and making the earlier fdtable pointer stale. @@ -126,3 +105,17 @@ the fdtable structure - Since locate_fd() can drop ->file_lock (and reacquire ->file_lock), the fdtable pointer (fdt) must be loaded after locate_fd(). +On newer kernels rcu based file lookup has been switched to rely on +SLAB_TYPESAFE_BY_RCU instead of call_rcu(). It isn't sufficient anymore to just +acquire a reference to the file in question under rcu using +atomic_long_inc_not_zero() since the file might have already been recycled and +someone else might have bumped the reference. In other words, the caller might +see reference count bumps from newer users. For this is reason it is necessary +to verify that the pointer is the same before and after the reference count +increment. 
This pattern can be seen in get_file_rcu() and __files_get_rcu(). + +In addition, it isn't possible to access or check fields in struct file without +first aqcuiring a reference on it. Not doing that was always very dodgy and it +was only usable for non-pointer data in struct file. With SLAB_TYPESAFE_BY_RCU +it is necessary that callers first acquire a reference under rcu or they must +hold the files_lock of the fdtable. Failing to do either one of this is a bug. diff --git a/arch/powerpc/platforms/cell/spufs/coredump.c b/arch/powerpc/platforms/cell/spufs/coredump.c index 1a587618015c..5e157f48995e 100644 --- a/arch/powerpc/platforms/cell/spufs/coredump.c +++ b/arch/powerpc/platforms/cell/spufs/coredump.c @@ -74,10 +74,13 @@ static struct spu_context *coredump_next_context(int *fd) *fd = n - 1; rcu_read_lock(); - file = lookup_fd_rcu(*fd); - ctx = SPUFS_I(file_inode(file))->i_ctx; - get_spu_context(ctx); + file = lookup_fdget_rcu(*fd); rcu_read_unlock(); + if (file) { + ctx = SPUFS_I(file_inode(file))->i_ctx; + get_spu_context(ctx); + fput(file); + } return ctx; } diff --git a/drivers/gpu/drm/i915/gem/i915_gem_mman.c b/drivers/gpu/drm/i915/gem/i915_gem_mman.c index aa4d842d4c5a..b2f00f54218f 100644 --- a/drivers/gpu/drm/i915/gem/i915_gem_mman.c +++ b/drivers/gpu/drm/i915/gem/i915_gem_mman.c @@ -917,7 +917,7 @@ static struct file *mmap_singleton(struct drm_i915_private *i915) rcu_read_lock(); file = READ_ONCE(i915->gem.mmap_singleton); - if (file && !get_file_rcu(file)) + if (!get_file_rcu(&file)) file = NULL; rcu_read_unlock(); if (file) diff --git a/fs/file.c b/fs/file.c index 3e4a4dfa38fc..dc0ad2ca3faa 100644 --- a/fs/file.c +++ b/fs/file.c @@ -853,8 +853,53 @@ void do_close_on_exec(struct files_struct *files) spin_unlock(&files->file_lock); } -static inline struct file *__fget_files_rcu(struct files_struct *files, - unsigned int fd, fmode_t mask) +struct file *get_file_rcu(struct file __rcu **f) +{ + for (;;) { + struct file __rcu *file; + struct file __rcu 
*file_reloaded; + struct file __rcu *file_reloaded_cmp; + + file = rcu_dereference_raw(*f); + if (!file) + return NULL; + + if (unlikely(!atomic_long_inc_not_zero(&file->f_count))) + continue; + + file_reloaded = rcu_dereference_raw(*f); + + /* + * Ensure that all accesses have a dependency on the + * load from rcu_dereference_raw() above so we get + * correct ordering between reuse/allocation and the + * pointer check below. + */ + file_reloaded_cmp = file_reloaded; + OPTIMIZER_HIDE_VAR(file_reloaded_cmp); + + /* + * atomic_long_inc_not_zero() serves as a full memory + * barrier when we acquired a reference. + * + * This is paired with the write barrier from assigning + * to the __rcu protected file pointer so that if that + * pointer still matches the current file, we know we + * have successfully acquire a reference to it. + * + * If the pointers don't match the file has been + * reallocated by SLAB_TYPESAFE_BY_RCU. So verify that + * we're holding the right reference. + */ + if (file == file_reloaded_cmp) + return file_reloaded; + + fput(file); + } +} + +static struct file *__fget_files_rcu(struct files_struct *files, + unsigned int fd, fmode_t mask) { for (;;) { struct file *file; @@ -865,12 +910,6 @@ static inline struct file *__fget_files_rcu(struct files_struct *files, return NULL; fdentry = fdt->fd + array_index_nospec(fd, fdt->max_fds); - file = rcu_dereference_raw(*fdentry); - if (unlikely(!file)) - return NULL; - - if (unlikely(file->f_mode & mask)) - return NULL; /* * Ok, we have a file pointer. However, because we do @@ -882,8 +921,9 @@ static inline struct file *__fget_files_rcu(struct files_struct *files, * (a) the file ref already went down to zero, * and get_file_rcu() fails. Just try again: */ - if (unlikely(!get_file_rcu(file))) - continue; + file = get_file_rcu(fdentry); + if (unlikely(!file)) + return NULL; /* * (b) the file table entry has changed under us. 
@@ -893,12 +933,16 @@ static inline struct file *__fget_files_rcu(struct files_struct *files, * * If so, we need to put our ref and try again. */ - if (unlikely(rcu_dereference_raw(files->fdt) != fdt) || - unlikely(rcu_dereference_raw(*fdentry) != file)) { + if (unlikely(rcu_dereference_raw(files->fdt) != fdt)) { fput(file); continue; } + if (unlikely(file->f_mode & mask)) { + fput(file); + return NULL; + } + /* * Ok, we have a ref to the file, and checked that it * still exists. @@ -907,6 +951,11 @@ static inline struct file *__fget_files_rcu(struct files_struct *files, } } +struct file *fget_files_rcu(struct files_struct *files, unsigned int fd) +{ + return __fget_files_rcu(files, fd, 0); +} + static struct file *__fget_files(struct files_struct *files, unsigned int fd, fmode_t mask) { @@ -948,7 +997,14 @@ struct file *fget_task(struct task_struct *task, unsigned int fd) return file; } -struct file *task_lookup_fd_rcu(struct task_struct *task, unsigned int fd) +static inline struct file *files_lookup_fdget_rcu(struct files_struct *files, unsigned int fd) +{ + RCU_LOCKDEP_WARN(!rcu_read_lock_held(), + "suspicious rcu_dereference_check() usage"); + return lookup_fdget_rcu(fd); +} + +struct file *task_lookup_fdget_rcu(struct task_struct *task, unsigned int fd) { /* Must be called with rcu_read_lock held */ struct files_struct *files; @@ -957,13 +1013,13 @@ struct file *task_lookup_fd_rcu(struct task_struct *task, unsigned int fd) task_lock(task); files = task->files; if (files) - file = files_lookup_fd_rcu(files, fd); + file = files_lookup_fdget_rcu(files, fd); task_unlock(task); return file; } -struct file *task_lookup_next_fd_rcu(struct task_struct *task, unsigned int *ret_fd) +struct file *task_lookup_next_fdget_rcu(struct task_struct *task, unsigned int *ret_fd) { /* Must be called with rcu_read_lock held */ struct files_struct *files; @@ -974,7 +1030,7 @@ struct file *task_lookup_next_fd_rcu(struct task_struct *task, unsigned int *ret files = task->files; if 
(files) { for (; fd < files_fdtable(files)->max_fds; fd++) { - file = files_lookup_fd_rcu(files, fd); + file = files_lookup_fdget_rcu(files, fd); if (file) break; } @@ -983,7 +1039,7 @@ struct file *task_lookup_next_fd_rcu(struct task_struct *task, unsigned int *ret *ret_fd = fd; return file; } -EXPORT_SYMBOL(task_lookup_next_fd_rcu); +EXPORT_SYMBOL(task_lookup_next_fdget_rcu); /* * Lightweight file lookup - no refcnt increment if fd table isn't shared. @@ -1272,12 +1328,16 @@ SYSCALL_DEFINE2(dup2, unsigned int, oldfd, unsigned int, newfd) { if (unlikely(newfd == oldfd)) { /* corner case */ struct files_struct *files = current->files; + struct file *f; int retval = oldfd; rcu_read_lock(); - if (!files_lookup_fd_rcu(files, oldfd)) + f = files_lookup_fdget_rcu(files, oldfd); + if (!f) retval = -EBADF; rcu_read_unlock(); + if (f) + fput(f); return retval; } return ksys_dup3(oldfd, newfd, 0); diff --git a/fs/file_table.c b/fs/file_table.c index e68e97d4f00a..17b06b32fdee 100644 --- a/fs/file_table.c +++ b/fs/file_table.c @@ -65,33 +65,34 @@ static void file_free_rcu(struct rcu_head *head) { struct file *f = container_of(head, struct file, f_rcuhead); - put_cred(f->f_cred); - if (unlikely(f->f_mode & FMODE_BACKING)) - kfree(backing_file(f)); - else - kmem_cache_free(filp_cachep, f); + kfree(backing_file(f)); } static inline void file_free(struct file *f) { security_file_free(f); - if (unlikely(f->f_mode & FMODE_BACKING)) - path_put(backing_file_real_path(f)); if (likely(!(f->f_mode & FMODE_NOACCOUNT))) percpu_counter_dec(&nr_files); - call_rcu(&f->f_rcuhead, file_free_rcu); + put_cred(f->f_cred); + if (unlikely(f->f_mode & FMODE_BACKING)) { + path_put(backing_file_real_path(f)); + call_rcu(&f->f_rcuhead, file_free_rcu); + } else { + kmem_cache_free(filp_cachep, f); + } } void release_empty_file(struct file *f) { WARN_ON_ONCE(f->f_mode & (FMODE_BACKING | FMODE_OPENED)); - /* Uhm, we better find out who grabs references to an unopened file. 
*/ - WARN_ON_ONCE(atomic_long_cmpxchg(&f->f_count, 1, 0) != 1); - security_file_free(f); - put_cred(f->f_cred); - if (likely(!(f->f_mode & FMODE_NOACCOUNT))) - percpu_counter_dec(&nr_files); - kmem_cache_free(filp_cachep, f); + if (atomic_long_dec_and_test(&f->f_count)) { + security_file_free(f); + put_cred(f->f_cred); + if (likely(!(f->f_mode & FMODE_NOACCOUNT))) + percpu_counter_dec(&nr_files); + kmem_cache_free(filp_cachep, f); + return; + } } /* @@ -176,7 +177,6 @@ static int init_file(struct file *f, int flags, const struct cred *cred) return error; } - atomic_long_set(&f->f_count, 1); rwlock_init(&f->f_owner.lock); spin_lock_init(&f->f_lock); mutex_init(&f->f_pos_lock); @@ -184,6 +184,12 @@ static int init_file(struct file *f, int flags, const struct cred *cred) f->f_mode = OPEN_FMODE(flags); /* f->f_version: 0 */ + /* + * We're SLAB_TYPESAFE_BY_RCU so initialize f_count last. While + * fget-rcu pattern users need to be able to handle spurious + * refcount bumps we should reinitialize the reused file first. 
+ */ + atomic_long_set(&f->f_count, 1); return 0; } @@ -483,7 +489,8 @@ EXPORT_SYMBOL(__fput_sync); void __init files_init(void) { filp_cachep = kmem_cache_create("filp", sizeof(struct file), 0, - SLAB_HWCACHE_ALIGN | SLAB_PANIC | SLAB_ACCOUNT, NULL); + SLAB_TYPESAFE_BY_RCU | SLAB_HWCACHE_ALIGN | + SLAB_PANIC | SLAB_ACCOUNT, NULL); percpu_counter_init(&nr_files, 0, GFP_KERNEL); } diff --git a/fs/gfs2/glock.c b/fs/gfs2/glock.c index 9cbf8d98489a..b4bc873aab7d 100644 --- a/fs/gfs2/glock.c +++ b/fs/gfs2/glock.c @@ -2717,16 +2717,19 @@ static struct file *gfs2_glockfd_next_file(struct gfs2_glockfd_iter *i) for(;; i->fd++) { struct inode *inode; - i->file = task_lookup_next_fd_rcu(i->task, &i->fd); + i->file = task_lookup_next_fdget_rcu(i->task, &i->fd); if (!i->file) { i->fd = 0; break; } + inode = file_inode(i->file); - if (inode->i_sb != i->sb) - continue; - if (get_file_rcu(i->file)) + if (inode->i_sb == i->sb) break; + + rcu_read_unlock(); + fput(i->file); + rcu_read_lock(); } rcu_read_unlock(); return i->file; diff --git a/fs/notify/dnotify/dnotify.c b/fs/notify/dnotify/dnotify.c index ebdcc25df0f7..869b016014d2 100644 --- a/fs/notify/dnotify/dnotify.c +++ b/fs/notify/dnotify/dnotify.c @@ -265,7 +265,7 @@ int fcntl_dirnotify(int fd, struct file *filp, unsigned int arg) struct dnotify_struct *dn; struct inode *inode; fl_owner_t id = current->files; - struct file *f; + struct file *f = NULL; int destroy = 0, error = 0; __u32 mask; @@ -345,7 +345,7 @@ int fcntl_dirnotify(int fd, struct file *filp, unsigned int arg) } rcu_read_lock(); - f = lookup_fd_rcu(fd); + f = lookup_fdget_rcu(fd); rcu_read_unlock(); /* if (f != filp) means that we lost a race and another task/thread @@ -392,6 +392,8 @@ int fcntl_dirnotify(int fd, struct file *filp, unsigned int arg) fsnotify_put_mark(new_fsn_mark); if (dn) kmem_cache_free(dnotify_struct_cache, dn); + if (f) + fput(f); return error; } diff --git a/fs/proc/fd.c b/fs/proc/fd.c index 6276b3938842..6e72e5ad42bc 100644 --- 
a/fs/proc/fd.c +++ b/fs/proc/fd.c @@ -113,10 +113,12 @@ static bool tid_fd_mode(struct task_struct *task, unsigned fd, fmode_t *mode) struct file *file; rcu_read_lock(); - file = task_lookup_fd_rcu(task, fd); - if (file) - *mode = file->f_mode; + file = task_lookup_fdget_rcu(task, fd); rcu_read_unlock(); + if (file) { + *mode = file->f_mode; + fput(file); + } return !!file; } @@ -259,12 +261,13 @@ static int proc_readfd_common(struct file *file, struct dir_context *ctx, char name[10 + 1]; unsigned int len; - f = task_lookup_next_fd_rcu(p, &fd); + f = task_lookup_next_fdget_rcu(p, &fd); ctx->pos = fd + 2LL; if (!f) break; data.mode = f->f_mode; rcu_read_unlock(); + fput(f); data.fd = fd; len = snprintf(name, sizeof(name), "%u", fd); diff --git a/include/linux/fdtable.h b/include/linux/fdtable.h index e066816f3519..805305a1d4fd 100644 --- a/include/linux/fdtable.h +++ b/include/linux/fdtable.h @@ -77,6 +77,8 @@ struct dentry; #define files_fdtable(files) \ rcu_dereference_check_fdtable((files), (files)->fdt) +struct file *fget_files_rcu(struct files_struct *files, unsigned int fd); + /* * The caller must ensure that fd table isn't shared or hold rcu or file lock */ @@ -98,20 +100,14 @@ static inline struct file *files_lookup_fd_locked(struct files_struct *files, un return files_lookup_fd_raw(files, fd); } -static inline struct file *files_lookup_fd_rcu(struct files_struct *files, unsigned int fd) +static inline struct file *lookup_fdget_rcu(unsigned int fd) { - RCU_LOCKDEP_WARN(!rcu_read_lock_held(), - "suspicious rcu_dereference_check() usage"); - return files_lookup_fd_raw(files, fd); -} + return fget_files_rcu(current->files, fd); -static inline struct file *lookup_fd_rcu(unsigned int fd) -{ - return files_lookup_fd_rcu(current->files, fd); } -struct file *task_lookup_fd_rcu(struct task_struct *task, unsigned int fd); -struct file *task_lookup_next_fd_rcu(struct task_struct *task, unsigned int *fd); +struct file *task_lookup_fdget_rcu(struct task_struct *task, 
unsigned int fd); +struct file *task_lookup_next_fdget_rcu(struct task_struct *task, unsigned int *fd); struct task_struct; diff --git a/include/linux/fs.h b/include/linux/fs.h index 58dea591a341..ceafc40cc25f 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -1042,7 +1042,9 @@ static inline struct file *get_file(struct file *f) atomic_long_inc(&f->f_count); return f; } -#define get_file_rcu(x) atomic_long_inc_not_zero(&(x)->f_count) + +struct file *get_file_rcu(struct file __rcu **f); + #define file_count(x) atomic_long_read(&(x)->f_count) #define MAX_NON_LFS ((1UL<<31) - 1) diff --git a/kernel/bpf/task_iter.c b/kernel/bpf/task_iter.c index c4ab9d6cdbe9..d82f0ece42d2 100644 --- a/kernel/bpf/task_iter.c +++ b/kernel/bpf/task_iter.c @@ -308,10 +308,8 @@ task_file_seq_get_next(struct bpf_iter_seq_task_file_info *info) rcu_read_lock(); for (;; curr_fd++) { struct file *f; - f = task_lookup_next_fd_rcu(curr_task, &curr_fd); + f = task_lookup_next_fdget_rcu(curr_task, &curr_fd); if (!f) - break; - if (!get_file_rcu(f)) continue; /* set info->fd */ diff --git a/kernel/fork.c b/kernel/fork.c index 3b6d20dfb9a8..640123767726 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -1492,9 +1492,7 @@ struct file *get_mm_exe_file(struct mm_struct *mm) struct file *exe_file; rcu_read_lock(); - exe_file = rcu_dereference(mm->exe_file); - if (exe_file && !get_file_rcu(exe_file)) - exe_file = NULL; + exe_file = get_file_rcu(&mm->exe_file); rcu_read_unlock(); return exe_file; } diff --git a/kernel/kcmp.c b/kernel/kcmp.c index 5353edfad8e1..b0639f21041f 100644 --- a/kernel/kcmp.c +++ b/kernel/kcmp.c @@ -64,8 +64,10 @@ get_file_raw_ptr(struct task_struct *task, unsigned int idx) struct file *file; rcu_read_lock(); - file = task_lookup_fd_rcu(task, idx); + file = task_lookup_fdget_rcu(task, idx); rcu_read_unlock(); + if (file) + fput(file); return file; } -- 2.34.1 ^ permalink raw reply related [flat|nested] 29+ messages in thread
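[Editor's sketch] To make the subtlety the commit message warns about concrete, here is a hypothetical single-threaded replay of the reuse hazard: with SLAB_TYPESAFE_BY_RCU, atomic_long_inc_not_zero() can happily bump a recycled file's count, and only the pointer re-check catches the race. This is a userspace model with invented names, not kernel code.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct fake_file { atomic_long f_count; };

static struct fake_file storage;		/* one reusable "slab object" */
static struct fake_file *_Atomic fd_slot[2];	/* two fd table entries */

static int inc_not_zero(atomic_long *cnt)
{
	long c = atomic_load(cnt);
	while (c > 0)
		if (atomic_compare_exchange_weak(cnt, &c, c + 1))
			return 1;
	return 0;
}

/* Returns 1 if the stale reader correctly detects the recycled file. */
static int replay_recycle_race(void)
{
	struct fake_file *stale;

	atomic_store(&storage.f_count, 1);
	atomic_store(&fd_slot[0], &storage);	/* open on fd 0 */

	stale = atomic_load(&fd_slot[0]);	/* reader caches the pointer */

	atomic_store(&fd_slot[0], NULL);	/* close(0): entry cleared... */
	atomic_store(&storage.f_count, 0);	/* ...ref dropped, memory stays */

	atomic_store(&storage.f_count, 1);	/* open on fd 1 reuses the object */
	atomic_store(&fd_slot[1], &storage);

	if (!inc_not_zero(&stale->f_count))	/* succeeds: bumps the NEW user */
		return 0;
	if (atomic_load(&fd_slot[0]) == stale)	/* pointer re-check */
		return 0;			/* would be a missed race */
	atomic_fetch_sub(&stale->f_count, 1);	/* "fput" the bogus reference */
	return 1;
}
```

Before the conversion, call_rcu() freeing guaranteed the memory stayed dead for the whole RCU read section, so a successful inc_not_zero was proof enough; after it, the re-check is mandatory.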
* Re: [PATCH v2] vfs: shave work on failed file open 2023-09-30 9:04 ` Christian Brauner @ 2023-10-03 16:45 ` Nathan Chancellor 2023-10-10 3:06 ` Al Viro 1 sibling, 0 replies; 29+ messages in thread From: Nathan Chancellor @ 2023-10-03 16:45 UTC (permalink / raw) To: Christian Brauner Cc: Linus Torvalds, Mateusz Guzik, Jann Horn, viro, linux-kernel, linux-fsdevel, llvm, linuxppc-dev Hi Christian, > >From d266eee9d9d917f07774e2c2bab0115d2119a311 Mon Sep 17 00:00:00 2001 > From: Christian Brauner <brauner@kernel.org> > Date: Fri, 29 Sep 2023 08:45:59 +0200 > Subject: [PATCH] file: convert to SLAB_TYPESAFE_BY_RCU > > In recent discussions around some performance improvements in the file > handling area we discussed switching the file cache to rely on > SLAB_TYPESAFE_BY_RCU which allows us to get rid of call_rcu() based > freeing for files completely. This is a pretty sensitive change overall > but it might actually be worth doing. > > The main downside is the subtlety. The other one is that we should > really wait for Jann's patch to land that enables KASAN to handle > SLAB_TYPESAFE_BY_RCU UAFs. Currently it doesn't but a patch for this > exists. > > With SLAB_TYPESAFE_BY_RCU objects may be freed and reused multiple times > which requires a few changes. So it isn't sufficient anymore to just > acquire a reference to the file in question under rcu using > atomic_long_inc_not_zero() since the file might have already been > recycled and someone else might have bumped the reference. > > In other words, callers might see reference count bumps from newer > users. For this is reason it is necessary to verify that the pointer is > the same before and after the reference count increment. This pattern > can be seen in get_file_rcu() and __files_get_rcu(). > > In addition, it isn't possible to access or check fields in struct file > without first aqcuiring a reference on it. Not doing that was always > very dodgy and it was only usable for non-pointer data in struct file. 
> With SLAB_TYPESAFE_BY_RCU it is necessary that callers first acquire a > reference under rcu or they must hold the files_lock of the fdtable. > Failing to do either one of this is a bug. > > Thanks to Jann for pointing out that we need to ensure memory ordering > between reallocations and pointer check by ensuring that all subsequent > loads have a dependency on the second load in get_file_rcu() and > providing a fixup that was folded into this patch. > > Cc: Jann Horn <jannh@google.com> > Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> > Signed-off-by: Christian Brauner <brauner@kernel.org> > --- <snip> > --- a/arch/powerpc/platforms/cell/spufs/coredump.c > +++ b/arch/powerpc/platforms/cell/spufs/coredump.c > @@ -74,10 +74,13 @@ static struct spu_context *coredump_next_context(int *fd) > *fd = n - 1; > > rcu_read_lock(); > - file = lookup_fd_rcu(*fd); > - ctx = SPUFS_I(file_inode(file))->i_ctx; > - get_spu_context(ctx); > + file = lookup_fdget_rcu(*fd); > rcu_read_unlock(); > + if (file) { > + ctx = SPUFS_I(file_inode(file))->i_ctx; > + get_spu_context(ctx); > + fput(file); > + } > > return ctx; > } This hunk now causes a clang warning (or error, since arch/powerpc builds with -Werror by default) in next-20231003. $ make -skj"$(nproc)" ARCH=powerpc LLVM=1 ppc64_guest_defconfig arch/powerpc/platforms/cell/spufs/coredump.o ... arch/powerpc/platforms/cell/spufs/coredump.c:79:6: error: variable 'ctx' is used uninitialized whenever 'if' condition is false [-Werror,-Wsometimes-uninitialized] 79 | if (file) { | ^~~~ arch/powerpc/platforms/cell/spufs/coredump.c:85:9: note: uninitialized use occurs here 85 | return ctx; | ^~~ arch/powerpc/platforms/cell/spufs/coredump.c:79:2: note: remove the 'if' if its condition is always true 79 | if (file) { | ^~~~~~~~~ arch/powerpc/platforms/cell/spufs/coredump.c:69:25: note: initialize the variable 'ctx' to silence this warning 69 | struct spu_context *ctx; | ^ | = NULL 1 error generated. 
Cheers,
Nathan
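The diagnostic above points at the usual fix: give ctx a NULL initializer so the lookup-failure path returns a well-defined value. Whether that is the fixup ultimately applied upstream is not shown in this thread; the sketch below is a plain userspace model of the patched control flow (all names here are illustrative stand-ins, not the spufs API):

```c
#include <assert.h>
#include <stddef.h>

struct spu_context { int refcount; };
struct file_model { struct spu_context *ctx; };

/* Stand-in for lookup_fdget_rcu(): returns NULL for an out-of-range fd. */
static struct file_model *lookup_fdget_model(struct file_model *table[],
					     int n, int fd)
{
	if (fd < 0 || fd >= n)
		return NULL;
	return table[fd];
}

/* Mirrors the patched coredump_next_context(): without the "= NULL"
 * initializer, the lookup-failure path would return an indeterminate
 * pointer - exactly what clang's -Wsometimes-uninitialized flags. */
static struct spu_context *next_context_model(struct file_model *table[],
					      int n, int fd)
{
	struct spu_context *ctx = NULL;	/* silences the warning, fixes the bug */
	struct file_model *file = lookup_fdget_model(table, n, fd);

	if (file) {
		ctx = file->ctx;
		ctx->refcount++;	/* stands in for get_spu_context() */
	}
	return ctx;
}
```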
* Re: [PATCH v2] vfs: shave work on failed file open
  2023-09-30  9:04 ` Christian Brauner
  2023-10-03 16:45 ` Nathan Chancellor
@ 2023-10-10  3:06 ` Al Viro
  2023-10-10  8:29 ` Christian Brauner
  1 sibling, 1 reply; 29+ messages in thread
From: Al Viro @ 2023-10-10 3:06 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Linus Torvalds, Mateusz Guzik, Jann Horn, linux-kernel, linux-fsdevel

On Sat, Sep 30, 2023 at 11:04:20AM +0200, Christian Brauner wrote:
> +On newer kernels rcu based file lookup has been switched to rely on
> +SLAB_TYPESAFE_BY_RCU instead of call_rcu(). It isn't sufficient anymore to just
> +acquire a reference to the file in question under rcu using
> +atomic_long_inc_not_zero() since the file might have already been recycled and
> +someone else might have bumped the reference. In other words, the caller might
> +see reference count bumps from newer users. For this reason it is necessary
> +to verify that the pointer is the same before and after the reference count
> +increment. This pattern can be seen in get_file_rcu() and __files_get_rcu().
> +
> +In addition, it isn't possible to access or check fields in struct file without
> +first acquiring a reference on it. Not doing that was always very dodgy and it
> +was only usable for non-pointer data in struct file. With SLAB_TYPESAFE_BY_RCU
> +it is necessary that callers first acquire a reference under rcu or they must
> +hold the files_lock of the fdtable. Failing to do either one of these is a bug.

Trivial correction: the last paragraph applies only to rcu lookups - something
like

	spin_lock(&files->file_lock);
	fdt = files_fdtable(files);
	if (close->fd >= fdt->max_fds) {
		spin_unlock(&files->file_lock);
		goto err;
	}
	file = rcu_dereference_protected(fdt->fd[close->fd],
			lockdep_is_held(&files->file_lock));
	if (!file || io_is_uring_fops(file)) {
	             ^^^^^^^^^^^^^^^^^^^^^ fetches file->f_op
		spin_unlock(&files->file_lock);
		goto err;
	}
	...

should be still valid.
As written, the reference to "rcu based file lookup" is buried in the previous
paragraph and it's not obvious that it applies to the last one as well.

Incidentally, I would probably turn that fragment (in
io_uring/openclose.c:io_close()) into

	spin_lock(&files->file_lock);
	file = files_lookup_fd_locked(files, close->fd);
	if (!file || io_is_uring_fops(file)) {
		spin_unlock(&files->file_lock);
		goto err;
	}
	...

> diff --git a/arch/powerpc/platforms/cell/spufs/coredump.c b/arch/powerpc/platforms/cell/spufs/coredump.c
> index 1a587618015c..5e157f48995e 100644
> --- a/arch/powerpc/platforms/cell/spufs/coredump.c
> +++ b/arch/powerpc/platforms/cell/spufs/coredump.c
> @@ -74,10 +74,13 @@ static struct spu_context *coredump_next_context(int *fd)
>  	*fd = n - 1;
>
>  	rcu_read_lock();
> -	file = lookup_fd_rcu(*fd);
> -	ctx = SPUFS_I(file_inode(file))->i_ctx;
> -	get_spu_context(ctx);
> +	file = lookup_fdget_rcu(*fd);
>  	rcu_read_unlock();
> +	if (file) {
> +		ctx = SPUFS_I(file_inode(file))->i_ctx;
> +		get_spu_context(ctx);
> +		fput(file);
> +	}

Well... Here we should have the descriptor table unshared, and we really do
rely upon that - we expect the file we'd found to have been a spufs one *and*
to have stayed that way. So if anyone could change the descriptor table behind
our back, we'd be FUBAR.
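The acquire-then-recheck pattern discussed above (as in get_file_rcu()) can be modeled in self-contained userspace C with C11 atomics. This is a simplified single-slot sketch of the idea, not the kernel's implementation - it omits RCU itself and all the kernel helpers, and every name in it is made up for illustration:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct file_model {
	atomic_long f_count;
};

/* The fdtable slot: with type-safe reuse it may be republished with a
 * recycled object at any time, so a bare refcount bump is not enough. */
static _Atomic(struct file_model *) slot;

/* Stand-in for atomic_long_inc_not_zero(). */
static bool inc_not_zero(atomic_long *cnt)
{
	long old = atomic_load(cnt);
	do {
		if (old == 0)
			return false;	/* already headed for the allocator */
	} while (!atomic_compare_exchange_weak(cnt, &old, old + 1));
	return true;
}

/* Take a reference, then verify the slot still points at the same
 * object; if not, the object was recycled under us and the reference
 * we took belongs to a newer user - drop it and retry. */
static struct file_model *get_file_model(void)
{
	for (;;) {
		struct file_model *file = atomic_load(&slot);
		if (!file)
			return NULL;
		if (!inc_not_zero(&file->f_count))
			continue;
		if (file == atomic_load(&slot))
			return file;	/* same object: the reference is ours */
		atomic_fetch_sub(&file->f_count, 1);	/* recycled: undo */
	}
}
```

The second load of the slot is what call_rcu()-based freeing made unnecessary; with SLAB_TYPESAFE_BY_RCU it is the only thing distinguishing "our" reference from a bump on a recycled object.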
* Re: [PATCH v2] vfs: shave work on failed file open
  2023-10-10  3:06 ` Al Viro
@ 2023-10-10  8:29 ` Christian Brauner
  0 siblings, 0 replies; 29+ messages in thread
From: Christian Brauner @ 2023-10-10 8:29 UTC (permalink / raw)
  To: Al Viro
  Cc: Linus Torvalds, Mateusz Guzik, Jann Horn, linux-kernel, linux-fsdevel

> is buried in the previous paragraph and it's not obvious that it applies to
> the last one as well. Incidentally, I would probably turn that fragment

massaged to clarify

> (in io_uring/openclose.c:io_close()) into
> 	spin_lock(&files->file_lock);
> 	file = files_lookup_fd_locked(files, close->fd);
> 	if (!file || io_is_uring_fops(file)) {
> 		spin_unlock(&files->file_lock);
> 		goto err;
> 	}

done
* Re: [PATCH v2] vfs: shave work on failed file open
  2023-09-29 21:23 ` Mateusz Guzik
  2023-09-29 21:39 ` Mateusz Guzik
@ 2023-09-29 22:24 ` Matthew Wilcox
  2023-09-29 23:02 ` Jann Horn
  1 sibling, 1 reply; 29+ messages in thread
From: Matthew Wilcox @ 2023-09-29 22:24 UTC (permalink / raw)
  To: Mateusz Guzik
  Cc: Christian Brauner, Jann Horn, Linus Torvalds, viro, linux-kernel,
      linux-fsdevel

On Fri, Sep 29, 2023 at 11:23:04PM +0200, Mateusz Guzik wrote:
> Extending struct file is not ideal by any means, but the good news is that:
> 1. there is a 4 byte hole in there, if one is fine with an int-sized counter
> 2. if one insists on 8 bytes, the struct is 232 bytes on my kernel
>    (debian). still some room up to 256, so it may be tolerable?

256 isn't quite the magic number for slabs ... at 256 bytes, we'd get 16
per 4kB page, but at 232 bytes we get 17 objects per 4kB page (or 35 per
8kB pair of pages).

That said, I think a 32-bit counter is almost certainly sufficient.
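The packing numbers above follow from simple integer division, ignoring per-slab metadata and alignment padding (real slab layouts can differ). A back-of-envelope check:

```c
#include <assert.h>

/* Objects that fit in a slab of the given size, ignoring slab
 * metadata and alignment - a rough model, not SLUB's actual math. */
static unsigned int objs_per_slab(unsigned int slab_bytes,
				  unsigned int obj_bytes)
{
	return slab_bytes / obj_bytes;
}
```

With this model, 4096 / 256 = 16 objects, 4096 / 232 = 17, and 8192 / 232 = 35, matching the figures in the mail: shrinking struct file from 256 to 232 bytes buys one extra object per 4kB page.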
* Re: [PATCH v2] vfs: shave work on failed file open
  2023-09-29 22:24 ` Matthew Wilcox
@ 2023-09-29 23:02 ` Jann Horn
  0 siblings, 0 replies; 29+ messages in thread
From: Jann Horn @ 2023-09-29 23:02 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Mateusz Guzik, Christian Brauner, Linus Torvalds, viro, linux-kernel,
      linux-fsdevel

On Sat, Sep 30, 2023 at 12:24 AM Matthew Wilcox <willy@infradead.org> wrote:
> On Fri, Sep 29, 2023 at 11:23:04PM +0200, Mateusz Guzik wrote:
> > Extending struct file is not ideal by any means, but the good news is that:
> > 1. there is a 4 byte hole in there, if one is fine with an int-sized counter
> > 2. if one insists on 8 bytes, the struct is 232 bytes on my kernel
> >    (debian). still some room up to 256, so it may be tolerable?
>
> 256 isn't quite the magic number for slabs ... at 256 bytes, we'd get 16
> per 4kB page, but at 232 bytes we get 17 objects per 4kB page (or 35 per
> 8kB pair of pages).
>
> That said, I think a 32-bit counter is almost certainly sufficient.

I don't like the sequence number proposal because it seems to me like
it's adding one more layer of complication, but if this does happen, I
very much would want that number to be 64-bit. A computer doesn't take
_that_ long to count to 2^32, and especially with preemptible RCU it's
kinda hard to reason about how long a task might stay in the middle of
an RCU grace period.

Like, are we absolutely sure that there is no pessimal case where the
scheduler will not schedule a runnable cpu-pinned idle-priority task for
a few minutes? Either because we hit some pessimal case in the scheduler
or because the task gets preempted by something that's spinning a very
long time with preemption disabled? (And yes, I know, seqlocks...)
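The "doesn't take _that_ long to count to 2^32" worry is easy to quantify. A rough sketch, assuming an illustrative (not measured) increment rate:

```c
#include <assert.h>
#include <stdint.h>

/* Whole seconds until an n-bit counter wraps at a given increment
 * rate; bits must be < 64. Rates used here are assumptions chosen
 * only to illustrate the scale of the problem. */
static uint64_t wrap_seconds(unsigned int bits, uint64_t incs_per_sec)
{
	return ((uint64_t)1 << bits) / incs_per_sec;
}
```

At a hypothetical 100 million increments per second, a 32-bit counter wraps in about 42 seconds; even at 1 million per second it wraps in roughly 71 minutes. Both are well within the window a stalled reader might plausibly occupy, which is why a 64-bit sequence number is the safer choice here.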
end of thread, other threads: [~2023-10-10 8:29 UTC | newest]

Thread overview: 29+ messages
  2023-09-26 16:22 [PATCH v2] vfs: shave work on failed file open Mateusz Guzik
  2023-09-26 19:00 ` Linus Torvalds
  2023-09-26 19:28   ` Mateusz Guzik
  2023-09-27 14:09     ` Christian Brauner
  2023-09-27 14:34       ` Mateusz Guzik
  2023-09-27 17:48         ` Linus Torvalds
  2023-09-27 17:56           ` Mateusz Guzik
  2023-09-27 18:05             ` Linus Torvalds
  2023-09-27 18:32               ` Mateusz Guzik
  2023-09-27 20:27                 ` Linus Torvalds
  2023-09-27 21:06                   ` Mateusz Guzik
  2023-09-27 21:18                     ` Linus Torvalds
  2023-09-27 21:30                       ` Mateusz Guzik
  2023-09-28 13:25                         ` Christian Brauner
  2023-09-28 14:05                           ` Christian Brauner
  2023-09-28 14:43                             ` Jann Horn
  2023-09-28 17:21                               ` Linus Torvalds
  2023-09-29  9:20                                 ` Christian Brauner
  2023-09-29 13:31                                   ` Jann Horn
  2023-09-29 19:57                                     ` Christian Brauner
  2023-09-29 21:23                                       ` Mateusz Guzik
  2023-09-29 21:39                                         ` Mateusz Guzik
  2023-09-29 23:57                                           ` Linus Torvalds
  2023-09-30  9:04                                             ` Christian Brauner
  2023-10-03 16:45                                               ` Nathan Chancellor
  2023-10-10  3:06                                               ` Al Viro
  2023-10-10  8:29                                                 ` Christian Brauner
  2023-09-29 22:24                                         ` Matthew Wilcox
  2023-09-29 23:02                                           ` Jann Horn