Re: [GIT PULL] Block fixes for 6.18-rc3

linux-security-module.vger.kernel.org archive mirror

* Re: [GIT PULL] Block fixes for 6.18-rc3
       [not found] <37fb8720-bee9-43b7-b0ff-0214a8ad33a2@kernel.dk>
@ 2025-10-24 20:31 ` Linus Torvalds
  2025-10-26 21:09   ` Serge E. Hallyn
  2025-10-31 15:43   ` Christian Brauner
  0 siblings, 2 replies; 8+ messages in thread
From: Linus Torvalds @ 2025-10-24 20:31 UTC
  To: Jens Axboe, Paul Moore, Serge Hallyn, Christian Brauner
  Cc: linux-block@vger.kernel.org, LSM List

[ Adding LSM people. Also Christian, because he did the cred refcount
cleanup with override_creds() and friends last year, and I'm
suggesting taking that one step further ]

On Fri, 24 Oct 2025 at 06:58, Jens Axboe <axboe@kernel.dk> wrote:
>
> Ondrej Mosnacek (1):
>       nbd: override creds to kernel when calling sock_{send,recv}msg()

I've pulled this, but looking at the patch, I note that more than half
the patch - 75% to be exact - is just boilerplate for "I need to
allocate the kernel cred and deal with error handling there".

It literally has three lines of new actual useful code (two statements
and one local variable declaration), and then nine lines of the "setup
dance".

Which isn't wrong, but when the infrastructure boilerplate is three
times more than the actual code, it makes me think we should maybe
just get rid of the

    my_kernel_cred = prepare_kernel_cred(&init_task);

pattern for this use-case, and just let people use "init_cred"
directly for things like this.

Because that's essentially what that prepare_kernel_cred() thing
returns, except it allocates a new copy of said thing, so now you have
error handling and you have to free it after-the-fact.

And I'm not seeing that the extra error handling and freeing dance
actually buys us anything at all.

Now, some *other* users actually go on to change the creds: they want
that prepare_kernel_cred() dance because they then actually do
something else like using their own keyring or whatever (eg the NFS
idmap code or some other filesystem stuff).

So it's not like prepare_kernel_cred() is wrong, but in this kind of
case where people just go "I'm a driver with hardware access, I want
to do something with kernel privileges not user privileges", it
actually seems counterproductive to have extra code just to complicate
things.

Now, my gut feel is that if we just let people use 'init_cred'
directly, we should also make sure that it's always exposed as a
'const struct cred', but wouldn't that be a whole lot simpler and
more straightforward?

This is *not* the only use case of that.

We now have at least four use-cases of this "raw kernel cred" pattern:
core-dumping over unix domain socket, nbd, firmware loading and SCSI
target all do this exact thing as far as I can tell.

So they all just want that bare kernel cred, and this interface then
forces them to do extra work instead of just doing

        old_cred = override_creds(&init_cred);
        ...
        revert_creds(old_cred);

and it ends up being extra code for allocating and freeing that copy
of a cred that we already *had* and could just have used directly.
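
As a rough userspace illustration of the difference between the two
patterns (this is not kernel code: every mock_* name below is a
hypothetical stand-in, and malloc() merely plays the role of the cred
allocation), a sketch might look like this:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical userspace stand-in for the kernel's struct cred. */
struct mock_cred {
	unsigned int uid;
	unsigned int caps;
};

/* Stand-in for init_cred: one immutable, statically allocated instance. */
static const struct mock_cred mock_init_cred = { .uid = 0, .caps = ~0u };

/* Stand-in for the calling task's own credentials (current->cred). */
static const struct mock_cred mock_user_cred = { .uid = 1000, .caps = 0 };
static const struct mock_cred *mock_current_cred = &mock_user_cred;

/* Swap in new credentials and hand back the previous ones. */
const struct mock_cred *mock_override_creds(const struct mock_cred *new_cred)
{
	const struct mock_cred *old = mock_current_cred;

	mock_current_cred = new_cred;
	return old;
}

/* Restore the credentials saved by mock_override_creds(). */
void mock_revert_creds(const struct mock_cred *old)
{
	assert(old != NULL);
	mock_current_cred = old;
}

/*
 * Pattern A: allocate a private copy first. This adds a failure path
 * and a free even though the copy is never modified.
 */
int do_work_with_copy(unsigned int *seen_uid)
{
	struct mock_cred *copy = malloc(sizeof(*copy));
	const struct mock_cred *old;

	if (!copy)
		return -1;		/* the extra error handling */
	*copy = mock_init_cred;

	old = mock_override_creds(copy);
	*seen_uid = mock_current_cred->uid;	/* the actual work */
	mock_revert_creds(old);

	free(copy);			/* the extra cleanup */
	return 0;
}

/* Pattern B: use the static instance directly; nothing can fail. */
void do_work_direct(unsigned int *seen_uid)
{
	const struct mock_cred *old = mock_override_creds(&mock_init_cred);

	*seen_uid = mock_current_cred->uid;	/* the actual work */
	mock_revert_creds(old);
}
```

Both variants do the same work under the same credentials; only the
first one needs an allocation, an error return and a matching free.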

I did just check that making 'init_cred' be const

  --- a/include/linux/init_task.h
  +++ b/include/linux/init_task.h
  @@ -28 +28 @@ extern struct nsproxy init_nsproxy;
  -extern struct cred init_cred;
  +extern const struct cred init_cred;
  --- a/kernel/cred.c
  +++ b/kernel/cred.c
  @@ -44 +44 @@ static struct group_info init_groups = { .usage = REFCOUNT_INIT(2) };
  -struct cred init_cred = {
  +const struct cred init_cred = {

seems to build just fine and would seem to be the right thing to do
even if we *don't* expect people to use it. And override_creds() is
perfectly happy with a

Maybe there's some reason for that extra work that I'm not seeing and
thinking of? But it all smells like make-believe work to me that
probably has a historical reason for it, but doesn't seem to make a
lot of sense any more.

Hmm?

               Linus


* Re: [GIT PULL] Block fixes for 6.18-rc3
  2025-10-24 20:31 ` [GIT PULL] Block fixes for 6.18-rc3 Linus Torvalds
@ 2025-10-26 21:09   ` Serge E. Hallyn
  2025-10-26 22:57     ` Linus Torvalds
  2025-10-31 15:43   ` Christian Brauner
  1 sibling, 1 reply; 8+ messages in thread
From: Serge E. Hallyn @ 2025-10-26 21:09 UTC
  To: Linus Torvalds
  Cc: Jens Axboe, Paul Moore, Serge Hallyn, Christian Brauner,
	linux-block@vger.kernel.org, LSM List

On Fri, Oct 24, 2025 at 01:31:11PM -0700, Linus Torvalds wrote:
> [ Adding LSM people. Also Christian, because he did the cred refcount
> cleanup with override_creds() and friends last year, and I'm
> suggesting taking that one step further ]
> 
> On Fri, 24 Oct 2025 at 06:58, Jens Axboe <axboe@kernel.dk> wrote:
> >
> > Ondrej Mosnacek (1):
> >       nbd: override creds to kernel when calling sock_{send,recv}msg()
> 
> I've pulled this, but looking at the patch, I note that more than half
> the patch - 75% to be exact - is just boilerplate for "I need to
> allocate the kernel cred and deal with error handling there".
> 
> It literally has three lines of new actual useful code (two statements
> and one local variable declaration), and then nine lines of the "setup
> dance".
> 
> Which isn't wrong, but when the infrastructure boilerplate is three
> times more than the actual code, it makes me think we should maybe
> just get rid of the
> 
>     my_kernel_cred = prepare_kernel_cred(&init_task);
> 
> pattern for this use-case, and just let people use "init_cred"
> directly for things like this.
> 
> Because that's essentially what that prepare_kernel_cred() thing
> returns, except it allocates a new copy of said thing, so now you have
> error handling and you have to free it after-the-fact.
> 
> And I'm not seeing that the extra error handling and freeing dance
> actually buys us anything at all.
> 
> Now, some *other* users actually go on to change the creds: they want
> that prepare_kernel_cred() dance because they then actually do
> something else like using their own keyring or whatever (eg the NFS
> idmap code or some other filesystem stuff).
> 
> So it's not like prepare_kernel_cred() is wrong, but in this kind of
> case where people just go "I'm a driver with hardware access, I want
> to do something with kernel privileges not user privileges", it
> actually seems counterproductive to have extra code just to complicate
> things.
> 
> Now, my gut feel is that if we just let people use 'init_cred'
> directly, we should also make sure that it's always exposed as a
> 'const struct cred' , but wouldn't that be a whole lot simpler and
> more straightforward?
> 
> This is *not* the only use case of that.
> 
> We now have at least four use-cases of this "raw kernel cred" pattern:
> core-dumping over unix domain socket, nbd, firmware loading and SCSI
> target all do this exact thing as far as I can tell.
> 
> So  they all just want that bare kernel cred, and this interface then
> forces it to do extra work instead of just doing
> 
>         old_cred = override_creds(&init_cred);
>         ...
>         revert_creds(old_cred);
> 
> and it ends up being extra code for allocating and freeing that copy
> of a cred that we already *had* and could just have used directly.
> 
> I did just check that making 'init_cred' be const
> 
>   --- a/include/linux/init_task.h
>   +++ b/include/linux/init_task.h
>   @@ -28 +28 @@ extern struct nsproxy init_nsproxy;
>   -extern struct cred init_cred;
>   +extern const struct cred init_cred;
>   --- a/kernel/cred.c
>   +++ b/kernel/cred.c
>   @@ -44 +44 @@ static struct group_info init_groups = { .usage = REFCOUNT_INIT(2) };
>   -struct cred init_cred = {
>   +const struct cred init_cred = {
> 
> seems to build just fine and would seem to be the right thing to do
> even if we *don't* expect people to use it. And override_creds() is
> perfectly happy with a
> 
> Maybe there's some reason for that extra work that I'm not seeing and
> thinking of? But it all smells like make-believe work to me that

The keychains are all NULL and won't be allocated (by init) without
copying a new cred, right?  And it seems like smack, selinux, and
apparmor at least each set the security field to a copy of the
daemon's.  Now, in theory, some LSM *could* come by and try to merge
current's info with init's, but that would probably be misguided.

So this does seem like it should work.

> probably has a historical reason for it, but doesn't seem to make a
> lot of sense any more.
> 
> Hmm?
> 
>                Linus


* Re: [GIT PULL] Block fixes for 6.18-rc3
  2025-10-26 21:09   ` Serge E. Hallyn
@ 2025-10-26 22:57     ` Linus Torvalds
  2025-10-27 20:24       ` Paul Moore
  0 siblings, 1 reply; 8+ messages in thread
From: Linus Torvalds @ 2025-10-26 22:57 UTC
  To: Serge E. Hallyn
  Cc: Jens Axboe, Paul Moore, Serge Hallyn, Christian Brauner,
	linux-block@vger.kernel.org, LSM List

On Sun, 26 Oct 2025 at 14:10, Serge E. Hallyn <serge@hallyn.com> wrote:
>
> The keychains are all NULL and won't be allocated (by init) without
> copying a new cred, right?

Right. As mentioned, 'init_cred' really should be 'const' -
it's not *technically* really constant, because the reference counting
casts away the const, but creds are designed to be copy-on-write apart
from the reference counting.

So whenever you change it, that's when you are supposed to always copy
things. So that prepare_kernel_cred() thing exists for a good reason.

But the pattern here in nbd (and the other three usage cases I found)
is really just "use the kernel creds as-is".

They don't even need any reference counting as long as they can just
rely on the cred staying around for the duration of the use - which
obviously is the case for init_cred.

> Now, in theory, some LSM *could* come by and try to merge
> current's info with init's, but that would probably be misguided.
>
> So this does seem like it should work.

Yeah, I can't see how any LSM could possibly do anything about
init_cred - it really ends up being the source of all other creds. You
can't really validly mess with it or deny it anything.

          Linus


* Re: [GIT PULL] Block fixes for 6.18-rc3
  2025-10-26 22:57     ` Linus Torvalds
@ 2025-10-27 20:24       ` Paul Moore
  0 siblings, 0 replies; 8+ messages in thread
From: Paul Moore @ 2025-10-27 20:24 UTC
  To: Linus Torvalds
  Cc: Serge E. Hallyn, Jens Axboe, Serge Hallyn, Christian Brauner,
	linux-block@vger.kernel.org, LSM List

On Sun, Oct 26, 2025 at 6:57 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Sun, 26 Oct 2025 at 14:10, Serge E. Hallyn <serge@hallyn.com> wrote:
> >
> > The keychains are all NULL and won't be allocated (by init) without
> > copying a new cred, right?
>
> Right. As mentioned, 'struct init_cred' really should be 'const' -
> it's not *technically* really constant, because the reference counting
> casts away the const, but refs are designed to be copy-on-write apart
> from the reference counting.
>
> So whenever you change it, that's when you are supposed to always copy
> things. So that  prepare_kernel_cred() thing exists for a good reason.
>
> But the pattern here in nbd (and the other three usage cases I found)
> is really just "use the kernel creds as-is".
>
> They don't even need any reference counting as long as they can just
> rely on the cred staying around for the duration of the use - which
> obviously is the case for init_cred.
>
> > Now, in theory, some LSM *could* come by and try to merge
> > current's info with init's, but that would probably be misguided.
> >
> > So this does seem like it should work.
>
> Yeah, I can't see how any LSM could possibly do anything about
> init_cred - it really ends up being the source of all other creds. You
> can't really validly mess with it or deny it anything.

This came in just as I was logging off for the weekend and I've been
kicking it around in my head and I can't think of any *good* LSM
related reason why this should be a problem, however I do have a
somewhat generic concern about potential future issues caused by
someone choosing the wrong access pattern and causing an odd bug.  In
theory, a const attribute should catch a lot of that before it starts,
but that assumes we don't have some casting somewhere doing odd
things.

If we care about this, and I'm not sure we do, or rather I'm not sure
how much I care about this, we could create a new cred instance, say
'kernel_cred', that is purely for things like nbd where no changes are
expected and it can be accessed directly.  This would limit the
direct-access pattern to just kernel_cred, making code
inspection/review easier and leaving the door open for WARN_ON-esque
assertions in things like prepare_creds() and similar*.

* This reminds me that we need to talk some more with the keys folks
and see if we can get rid of the ugliness that is
key_change_session_keyring()/security_transfer_creds().  Jann had some
patches for that, but if I recall correctly there was a concern about
backwards compatibility.

-- 
paul-moore.com


* Re: [GIT PULL] Block fixes for 6.18-rc3
  2025-10-24 20:31 ` [GIT PULL] Block fixes for 6.18-rc3 Linus Torvalds
  2025-10-26 21:09   ` Serge E. Hallyn
@ 2025-10-31 15:43   ` Christian Brauner
  2025-10-31 15:53     ` Christian Brauner
  2025-10-31 16:30     ` Linus Torvalds
  1 sibling, 2 replies; 8+ messages in thread
From: Christian Brauner @ 2025-10-31 15:43 UTC
  To: Linus Torvalds
  Cc: Jens Axboe, Paul Moore, Serge Hallyn, linux-block@vger.kernel.org,
	LSM List

[-- Attachment #1: Type: text/plain, Size: 4471 bytes --]

On Fri, Oct 24, 2025 at 01:31:11PM -0700, Linus Torvalds wrote:
> [ Adding LSM people. Also Christian, because he did the cred refcount

Sorry, late to the party. I was working on other stuff. Let me see...

> cleanup with override_creds() and friends last year, and I'm
> suggesting taking that one step further ]
> 
> On Fri, 24 Oct 2025 at 06:58, Jens Axboe <axboe@kernel.dk> wrote:
> >
> > Ondrej Mosnacek (1):
> >       nbd: override creds to kernel when calling sock_{send,recv}msg()
> 
> I've pulled this, but looking at the patch, I note that more than half
> the patch - 75% to be exact - is just boilerplate for "I need to
> allocate the kernel cred and deal with error handling there".
> 
> It literally has three lines of new actual useful code (two statements
> and one local variable declaration), and then nine lines of the "setup
> dance".
> 
> Which isn't wrong, but when the infrastructure boilerplate is three
> times more than the actual code, it makes me think we should maybe
> just get rid of the
> 
>     my_kernel_cred = prepare_kernel_cred(&init_task);
> 
> pattern for this use-case, and just let people use "init_cred"
> directly for things like this.
> 
> Because that's essentially what that prepare_kernel_cred() thing
> returns, except it allocates a new copy of said thing, so now you have
> error handling and you have to free it after-the-fact.
> 
> And I'm not seeing that the extra error handling and freeing dance
> actually buys us anything at all.
> 
> Now, some *other* users actually go on to change the creds: they want
> that prepare_kernel_cred() dance because they then actually do
> something else like using their own keyring or whatever (eg the NFS
> idmap code or some other filesystem stuff).
> 
> So it's not like prepare_kernel_cred() is wrong, but in this kind of
> case where people just go "I'm a driver with hardware access, I want
> to do something with kernel privileges not user privileges", it
> actually seems counterproductive to have extra code just to complicate
> things.
> 
> Now, my gut feel is that if we just let people use 'init_cred'
> directly, we should also make sure that it's always exposed as a
> 'const struct cred' , but wouldn't that be a whole lot simpler and
> more straightforward?
> 
> This is *not* the only use case of that.
> 
> We now have at least four use-cases of this "raw kernel cred" pattern:
> core-dumping over unix domain socket, nbd, firmware loading and SCSI
> target all do this exact thing as far as I can tell.
> 
> So  they all just want that bare kernel cred, and this interface then
> forces it to do extra work instead of just doing
> 
>         old_cred = override_creds(&init_cred);
>         ...
>         revert_creds(old_cred);
> 
> and it ends up being extra code for allocating and freeing that copy
> of a cred that we already *had* and could just have used directly.
> 
> I did just check that making 'init_cred' be const

Hm, two immediate observations before I go off and write the series.

(1) The thing is that init_cred would have to be exposed to modules via
    EXPORT_SYMBOL() for this to work. It would be easier to just force
    the use of init_task.cred instead.

    That pointer deref won't matter in the face of the allocations and
    refcounts we wipe out with this. Then we should also move init_cred
    to init/init_task.c and make it static const. Nobody really needs it
    currently.

(2) I think the plain override_creds() would work but we can do better.
    I envision we can leverage CLASS() to completely hide any access to
    init_cred and force a scope with kernel creds.

/me goes off to write that up.

Ok, so I have it and it survives the coredump socket tests. They are a
prime example for this sort of thing. Any unprivileged task needs to be
able to connect to the coredump socket when it coredumps so we override
credentials only for the path lookup. With my patchset this becomes:

        if (flags & SOCK_COREDUMP) {
                struct path root;

                task_lock(&init_task);
                get_fs_root(init_task.fs, &root);
                task_unlock(&init_task);

                scoped_with_kernel_creds()
                        err = vfs_path_lookup(root.dentry, root.mnt, sunaddr->sun_path,
                                              LOOKUP_BENEATH | LOOKUP_NO_SYMLINKS |
                                              LOOKUP_NO_MAGICLINKS, &path);
                path_put(&root);
                if (err)
                        goto fail;
        } else {
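
For readers unfamiliar with the scoped-guard idiom, here is a minimal
userspace sketch of the mechanism, assuming GCC or Clang for
__attribute__((cleanup)). It is not the kernel implementation: all
mock_* names are hypothetical, and the real macros in the patches are
built on CLASS() and __UNIQUE_ID() rather than on fixed variable names.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for the kernel's struct cred. */
struct mock_cred {
	unsigned int uid;
};

static const struct mock_cred mock_kernel_cred = { .uid = 0 };
static const struct mock_cred mock_user_cred = { .uid = 1000 };

/* Stand-in for current->cred. */
static const struct mock_cred *mock_current_cred = &mock_user_cred;

/* Switch to the kernel credentials and return the previous ones. */
const struct mock_cred *mock_assume_kernel_creds(void)
{
	const struct mock_cred *old = mock_current_cred;

	mock_current_cred = &mock_kernel_cred;
	return old;
}

/*
 * Cleanup handler: the compiler calls it with a pointer to the guarded
 * variable when that variable goes out of scope.
 */
void mock_yield_kernel_creds(const struct mock_cred **saved)
{
	assert(mock_current_cred == &mock_kernel_cred);
	mock_current_cred = *saved;
}

/*
 * Run the following statement or block exactly once with the kernel
 * credentials installed. The for-loop exists only to give the guard
 * variable a scope that ends right after the body.
 */
#define mock_scoped_with_kernel_creds()					\
	for (const struct mock_cred					\
		*saved_ __attribute__((cleanup(mock_yield_kernel_creds))) = \
			mock_assume_kernel_creds(),			\
		*run_ = &mock_kernel_cred;				\
	     run_; run_ = NULL)

/* The credentials revert when the scoped body finishes. */
unsigned int uid_inside_scope(void)
{
	unsigned int uid = mock_current_cred->uid;

	mock_scoped_with_kernel_creds()
		uid = mock_current_cred->uid;

	return uid;
}

/* The credentials also revert when the body returns early. */
unsigned int uid_with_early_return(void)
{
	mock_scoped_with_kernel_creds()
		return mock_current_cred->uid;

	return mock_current_cred->uid;
}
```

The point of the sketch is that callers never see the saved pointer and
cannot forget the revert, including on an early return from the body.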

Patches appended.

[-- Attachment #2: 0000-creds-add-scoped_-with_kernel_creds.eml --]
[-- Type: message/rfc822, Size: 2435 bytes --]

From: Christian Brauner <brauner@kernel.org>
To: Christian Brauner <brauner@kernel.org>
Subject: [PATCH 0/6] creds: add {scoped_}with_kernel_creds()
Date: Fri, 31 Oct 2025 16:37:35 +0100
Message-ID: <20251031-work-creds-init_cred-v1-0-cbf0400d6e0e@kernel.org>

Don't needlessly duplicate the initial credentials just to briefly use
them. Avoid all that work and add guards that hide all this away.

Note, using init_task in the macros to get at the credentials has the
advantage that we don't actually need to export init_cred to modules
such as nbd.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
Christian Brauner (6):
      cred: add {scoped_}with_kernel_creds
      creds: make init_cred static
      firmware: don't copy kernel creds
      nbd: don't copy kernel creds
      target: don't copy kernel creds
      unix: don't copy creds

 drivers/base/firmware_loader/main.c   | 59 +++++++++++++++--------------------
 drivers/block/nbd.c                   | 17 ++--------
 drivers/target/target_core_configfs.c | 14 ++-------
 include/linux/cred.h                  | 25 +++++++++++++++
 include/linux/init_task.h             |  1 -
 init/init_task.c                      | 24 ++++++++++++++
 kernel/cred.c                         | 24 --------------
 net/unix/af_unix.c                    | 17 +++-------
 security/keys/process_keys.c          |  2 +-
 9 files changed, 83 insertions(+), 100 deletions(-)
---
base-commit: d2818517e3486d11c9bd55aca3e14059e4c69886
change-id: 20251031-work-creds-init_cred-db9556a70d67


[-- Attachment #3: 0001-cred-add-scoped_-with_kernel_creds.eml --]
[-- Type: message/rfc822, Size: 2559 bytes --]

From: Christian Brauner <brauner@kernel.org>
To: Christian Brauner <brauner@kernel.org>
Subject: [PATCH 1/6] cred: add {scoped_}with_kernel_creds
Date: Fri, 31 Oct 2025 16:37:36 +0100
Message-ID: <20251031-work-creds-init_cred-v1-1-cbf0400d6e0e@kernel.org>

Add a new CLASS(with_kernel_creds) allowing code to run with kernel
creds without having to duplicate them first. Only use the CLASS(),
never the plain helpers!

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 include/linux/cred.h | 25 +++++++++++++++++++++++++
 1 file changed, 25 insertions(+)

diff --git a/include/linux/cred.h b/include/linux/cred.h
index 89ae50ad2ace..8a8d6b3fbadb 100644
--- a/include/linux/cred.h
+++ b/include/linux/cred.h
@@ -20,6 +20,8 @@
 struct cred;
 struct inode;
 
+extern struct task_struct init_task;
+
 /*
  * COW Supplementary groups list
  */
@@ -180,6 +182,29 @@ static inline const struct cred *revert_creds(const struct cred *revert_cred)
 	return rcu_replace_pointer(current->cred, revert_cred, 1);
 }
 
+static inline const struct cred *__assume_kernel_creds(void)
+{
+	return rcu_replace_pointer(current->cred, init_task.cred, 1);
+}
+
+static inline void __yield_kernel_creds(const struct cred *revert_cred)
+{
+	WARN_ON_ONCE(current->cred != init_task.cred);
+	rcu_replace_pointer(current->cred, revert_cred, 1);
+}
+
+DEFINE_CLASS(with_kernel_creds,
+	     const struct cred *,
+	     __yield_kernel_creds(_T),
+	     __assume_kernel_creds(), void)
+
+#define with_kernel_creds() \
+	CLASS(with_kernel_creds, __UNIQUE_ID(cred))()
+
+#define scoped_with_kernel_creds() \
+	for (CLASS(with_kernel_creds, __UNIQUE_ID(cred))(), \
+	     *__p = (void *)1; __p; __p = NULL)
+
 /**
  * get_cred_many - Get references on a set of credentials
  * @cred: The credentials to reference

-- 
2.47.3


[-- Attachment #4: 0002-creds-make-init_cred-static.eml --]
[-- Type: message/rfc822, Size: 4417 bytes --]

From: Christian Brauner <brauner@kernel.org>
To: Christian Brauner <brauner@kernel.org>
Subject: [PATCH 2/6] creds: make init_cred static
Date: Fri, 31 Oct 2025 16:37:37 +0100
Message-ID: <20251031-work-creds-init_cred-v1-2-cbf0400d6e0e@kernel.org>

There's zero need to expose struct init_cred.
The very few places that need direct access can just go through
init_task and be done with it.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 include/linux/init_task.h    |  1 -
 init/init_task.c             | 24 ++++++++++++++++++++++++
 kernel/cred.c                | 24 ------------------------
 security/keys/process_keys.c |  2 +-
 4 files changed, 25 insertions(+), 26 deletions(-)

diff --git a/include/linux/init_task.h b/include/linux/init_task.h
index bccb3f1f6262..a6cb241ea00c 100644
--- a/include/linux/init_task.h
+++ b/include/linux/init_task.h
@@ -25,7 +25,6 @@
 extern struct files_struct init_files;
 extern struct fs_struct init_fs;
 extern struct nsproxy init_nsproxy;
-extern struct cred init_cred;
 
 #ifndef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
 #define INIT_PREV_CPUTIME(x)	.prev_cputime = {			\
diff --git a/init/init_task.c b/init/init_task.c
index a55e2189206f..68059eac9a1e 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -62,6 +62,30 @@ unsigned long init_shadow_call_stack[SCS_SIZE / sizeof(long)] = {
 };
 #endif
 
+/*
+ * The initial credentials for the initial task
+ */
+static const struct cred init_cred = {
+	.usage			= ATOMIC_INIT(4),
+	.uid			= GLOBAL_ROOT_UID,
+	.gid			= GLOBAL_ROOT_GID,
+	.suid			= GLOBAL_ROOT_UID,
+	.sgid			= GLOBAL_ROOT_GID,
+	.euid			= GLOBAL_ROOT_UID,
+	.egid			= GLOBAL_ROOT_GID,
+	.fsuid			= GLOBAL_ROOT_UID,
+	.fsgid			= GLOBAL_ROOT_GID,
+	.securebits		= SECUREBITS_DEFAULT,
+	.cap_inheritable	= CAP_EMPTY_SET,
+	.cap_permitted		= CAP_FULL_SET,
+	.cap_effective		= CAP_FULL_SET,
+	.cap_bset		= CAP_FULL_SET,
+	.user			= INIT_USER,
+	.user_ns		= &init_user_ns,
+	.group_info		= &init_groups,
+	.ucounts		= &init_ucounts,
+};
+
 /*
  * Set up the first task table, touch at your own risk!. Base=0,
  * limit=0x1fffff (=2MB)
diff --git a/kernel/cred.c b/kernel/cred.c
index dbf6b687dc5c..9ff0b349b80b 100644
--- a/kernel/cred.c
+++ b/kernel/cred.c
@@ -38,30 +38,6 @@ static struct kmem_cache *cred_jar;
 /* init to 2 - one for init_task, one to ensure it is never freed */
 static struct group_info init_groups = { .usage = REFCOUNT_INIT(2) };
 
-/*
- * The initial credentials for the initial task
- */
-struct cred init_cred = {
-	.usage			= ATOMIC_INIT(4),
-	.uid			= GLOBAL_ROOT_UID,
-	.gid			= GLOBAL_ROOT_GID,
-	.suid			= GLOBAL_ROOT_UID,
-	.sgid			= GLOBAL_ROOT_GID,
-	.euid			= GLOBAL_ROOT_UID,
-	.egid			= GLOBAL_ROOT_GID,
-	.fsuid			= GLOBAL_ROOT_UID,
-	.fsgid			= GLOBAL_ROOT_GID,
-	.securebits		= SECUREBITS_DEFAULT,
-	.cap_inheritable	= CAP_EMPTY_SET,
-	.cap_permitted		= CAP_FULL_SET,
-	.cap_effective		= CAP_FULL_SET,
-	.cap_bset		= CAP_FULL_SET,
-	.user			= INIT_USER,
-	.user_ns		= &init_user_ns,
-	.group_info		= &init_groups,
-	.ucounts		= &init_ucounts,
-};
-
 /*
  * The RCU callback to actually dispose of a set of credentials
  */
diff --git a/security/keys/process_keys.c b/security/keys/process_keys.c
index b5d5333ab330..98ba8a7d3118 100644
--- a/security/keys/process_keys.c
+++ b/security/keys/process_keys.c
@@ -51,7 +51,7 @@ static struct key *get_user_register(struct user_namespace *user_ns)
 	if (!reg_keyring) {
 		reg_keyring = keyring_alloc(".user_reg",
 					    user_ns->owner, INVALID_GID,
-					    &init_cred,
+					    init_task.cred,
 					    KEY_POS_WRITE | KEY_POS_SEARCH |
 					    KEY_USR_VIEW | KEY_USR_READ,
 					    0,

-- 
2.47.3


[-- Attachment #5: 0003-firmware-don-t-copy-kernel-creds.eml --]
[-- Type: message/rfc822, Size: 3963 bytes --]

From: Christian Brauner <brauner@kernel.org>
To: Christian Brauner <brauner@kernel.org>
Subject: [PATCH 3/6] firmware: don't copy kernel creds
Date: Fri, 31 Oct 2025 16:37:38 +0100
Message-ID: <20251031-work-creds-init_cred-v1-3-cbf0400d6e0e@kernel.org>

No need to copy kernel credentials.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 drivers/base/firmware_loader/main.c | 59 ++++++++++++++++---------------------
 1 file changed, 25 insertions(+), 34 deletions(-)

diff --git a/drivers/base/firmware_loader/main.c b/drivers/base/firmware_loader/main.c
index 6942c62fa59d..bee3050a20d9 100644
--- a/drivers/base/firmware_loader/main.c
+++ b/drivers/base/firmware_loader/main.c
@@ -829,8 +829,6 @@ _request_firmware(const struct firmware **firmware_p, const char *name,
 		  size_t offset, u32 opt_flags)
 {
 	struct firmware *fw = NULL;
-	struct cred *kern_cred = NULL;
-	const struct cred *old_cred;
 	bool nondirect = false;
 	int ret;
 
@@ -871,45 +869,38 @@ _request_firmware(const struct firmware **firmware_p, const char *name,
 	 * called by a driver when serving an unrelated request from userland, we use
 	 * the kernel credentials to read the file.
 	 */
-	kern_cred = prepare_kernel_cred(&init_task);
-	if (!kern_cred) {
-		ret = -ENOMEM;
-		goto out;
-	}
-	old_cred = override_creds(kern_cred);
+	scoped_with_kernel_creds() {
+		ret = fw_get_filesystem_firmware(device, fw->priv, "", NULL);
 
-	ret = fw_get_filesystem_firmware(device, fw->priv, "", NULL);
-
-	/* Only full reads can support decompression, platform, and sysfs. */
-	if (!(opt_flags & FW_OPT_PARTIAL))
-		nondirect = true;
+		/* Only full reads can support decompression, platform, and sysfs. */
+		if (!(opt_flags & FW_OPT_PARTIAL))
+			nondirect = true;
 
 #ifdef CONFIG_FW_LOADER_COMPRESS_ZSTD
-	if (ret == -ENOENT && nondirect)
-		ret = fw_get_filesystem_firmware(device, fw->priv, ".zst",
-						 fw_decompress_zstd);
+		if (ret == -ENOENT && nondirect)
+			ret = fw_get_filesystem_firmware(device, fw->priv, ".zst",
+							 fw_decompress_zstd);
 #endif
 #ifdef CONFIG_FW_LOADER_COMPRESS_XZ
-	if (ret == -ENOENT && nondirect)
-		ret = fw_get_filesystem_firmware(device, fw->priv, ".xz",
-						 fw_decompress_xz);
+		if (ret == -ENOENT && nondirect)
+			ret = fw_get_filesystem_firmware(device, fw->priv, ".xz",
+							 fw_decompress_xz);
 #endif
-	if (ret == -ENOENT && nondirect)
-		ret = firmware_fallback_platform(fw->priv);
+		if (ret == -ENOENT && nondirect)
+			ret = firmware_fallback_platform(fw->priv);
 
-	if (ret) {
-		if (!(opt_flags & FW_OPT_NO_WARN))
-			dev_warn(device,
-				 "Direct firmware load for %s failed with error %d\n",
-				 name, ret);
-		if (nondirect)
-			ret = firmware_fallback_sysfs(fw, name, device,
-						      opt_flags, ret);
-	} else
-		ret = assign_fw(fw, device);
-
-	revert_creds(old_cred);
-	put_cred(kern_cred);
+		if (ret) {
+			if (!(opt_flags & FW_OPT_NO_WARN))
+				dev_warn(device,
+					 "Direct firmware load for %s failed with error %d\n",
+					 name, ret);
+			if (nondirect)
+				ret = firmware_fallback_sysfs(fw, name, device,
+							      opt_flags, ret);
+		} else {
+			ret = assign_fw(fw, device);
+		}
+	}
 
 out:
 	if (ret < 0) {

-- 
2.47.3


[-- Attachment #6: 0004-nbd-don-t-copy-kernel-creds.eml --]
[-- Type: message/rfc822, Size: 3029 bytes --]

From: Christian Brauner <brauner@kernel.org>
To: Christian Brauner <brauner@kernel.org>
Subject: [PATCH 4/6] nbd: don't copy kernel creds
Date: Fri, 31 Oct 2025 16:37:39 +0100
Message-ID: <20251031-work-creds-init_cred-v1-4-cbf0400d6e0e@kernel.org>

No need to copy kernel credentials.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 drivers/block/nbd.c | 17 ++---------------
 1 file changed, 2 insertions(+), 15 deletions(-)

diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c
index a853c65ac65d..1f0d89e21ec8 100644
--- a/drivers/block/nbd.c
+++ b/drivers/block/nbd.c
@@ -52,7 +52,6 @@
 static DEFINE_IDR(nbd_index_idr);
 static DEFINE_MUTEX(nbd_index_mutex);
 static struct workqueue_struct *nbd_del_wq;
-static struct cred *nbd_cred;
 static int nbd_total_devices = 0;
 
 struct nbd_sock {
@@ -555,7 +554,6 @@ static int __sock_xmit(struct nbd_device *nbd, struct socket *sock, int send,
 	int result;
 	struct msghdr msg = {} ;
 	unsigned int noreclaim_flag;
-	const struct cred *old_cred;
 
 	if (unlikely(!sock)) {
 		dev_err_ratelimited(disk_to_dev(nbd->disk),
@@ -564,10 +562,10 @@ static int __sock_xmit(struct nbd_device *nbd, struct socket *sock, int send,
 		return -EINVAL;
 	}
 
-	old_cred = override_creds(nbd_cred);
-
 	msg.msg_iter = *iter;
 
+	with_kernel_creds();
+
 	noreclaim_flag = memalloc_noreclaim_save();
 	do {
 		sock->sk->sk_allocation = GFP_NOIO | __GFP_MEMALLOC;
@@ -590,8 +588,6 @@ static int __sock_xmit(struct nbd_device *nbd, struct socket *sock, int send,
 
 	memalloc_noreclaim_restore(noreclaim_flag);
 
-	revert_creds(old_cred);
-
 	return result;
 }
 
@@ -2683,15 +2679,7 @@ static int __init nbd_init(void)
 		return -ENOMEM;
 	}
 
-	nbd_cred = prepare_kernel_cred(&init_task);
-	if (!nbd_cred) {
-		destroy_workqueue(nbd_del_wq);
-		unregister_blkdev(NBD_MAJOR, "nbd");
-		return -ENOMEM;
-	}
-
 	if (genl_register_family(&nbd_genl_family)) {
-		put_cred(nbd_cred);
 		destroy_workqueue(nbd_del_wq);
 		unregister_blkdev(NBD_MAJOR, "nbd");
 		return -EINVAL;
@@ -2746,7 +2734,6 @@ static void __exit nbd_cleanup(void)
 	/* Also wait for nbd_dev_remove_work() completes */
 	destroy_workqueue(nbd_del_wq);
 
-	put_cred(nbd_cred);
 	idr_destroy(&nbd_index_idr);
 	unregister_blkdev(NBD_MAJOR, "nbd");
 }

-- 
2.47.3


[-- Attachment #7: 0005-target-don-t-copy-kernel-creds.eml --]
[-- Type: message/rfc822, Size: 2249 bytes --]

From: Christian Brauner <brauner@kernel.org>
To: Christian Brauner <brauner@kernel.org>
Subject: [PATCH 5/6] target: don't copy kernel creds
Date: Fri, 31 Oct 2025 16:37:40 +0100
Message-ID: <20251031-work-creds-init_cred-v1-5-cbf0400d6e0e@kernel.org>

Get rid of all the boilerplate and tightly scope when the task runs with
kernel creds.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 drivers/target/target_core_configfs.c | 14 ++------------
 1 file changed, 2 insertions(+), 12 deletions(-)

diff --git a/drivers/target/target_core_configfs.c b/drivers/target/target_core_configfs.c
index b19acd662726..9e51c535ba8c 100644
--- a/drivers/target/target_core_configfs.c
+++ b/drivers/target/target_core_configfs.c
@@ -3670,8 +3670,6 @@ static int __init target_core_init_configfs(void)
 {
 	struct configfs_subsystem *subsys = &target_core_fabrics;
 	struct t10_alua_lu_gp *lu_gp;
-	struct cred *kern_cred;
-	const struct cred *old_cred;
 	int ret;
 
 	pr_debug("TARGET_CORE[0]: Loading Generic Kernel Storage"
@@ -3748,16 +3746,8 @@ static int __init target_core_init_configfs(void)
 	if (ret < 0)
 		goto out;
 
-	/* We use the kernel credentials to access the target directory */
-	kern_cred = prepare_kernel_cred(&init_task);
-	if (!kern_cred) {
-		ret = -ENOMEM;
-		goto out;
-	}
-	old_cred = override_creds(kern_cred);
-	target_init_dbroot();
-	revert_creds(old_cred);
-	put_cred(kern_cred);
+	scoped_with_kernel_creds()
+		target_init_dbroot();
 
 	return 0;
 

-- 
2.47.3


[-- Attachment #8: 0006-unix-don-t-copy-creds.eml --]
[-- Type: message/rfc822, Size: 2244 bytes --]

From: Christian Brauner <brauner@kernel.org>
To: Christian Brauner <brauner@kernel.org>
Subject: [PATCH 6/6] unix: don't copy creds
Date: Fri, 31 Oct 2025 16:37:41 +0100
Message-ID: <20251031-work-creds-init_cred-v1-6-cbf0400d6e0e@kernel.org>

No need to copy kernel credentials.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 net/unix/af_unix.c | 17 ++++-------------
 1 file changed, 4 insertions(+), 13 deletions(-)

diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 768098dec231..68c94f49f7b5 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -1210,25 +1210,16 @@ static struct sock *unix_find_bsd(struct sockaddr_un *sunaddr, int addr_len,
 	unix_mkname_bsd(sunaddr, addr_len);
 
 	if (flags & SOCK_COREDUMP) {
-		const struct cred *cred;
-		struct cred *kcred;
 		struct path root;
 
-		kcred = prepare_kernel_cred(&init_task);
-		if (!kcred) {
-			err = -ENOMEM;
-			goto fail;
-		}
-
 		task_lock(&init_task);
 		get_fs_root(init_task.fs, &root);
 		task_unlock(&init_task);
 
-		cred = override_creds(kcred);
-		err = vfs_path_lookup(root.dentry, root.mnt, sunaddr->sun_path,
-				      LOOKUP_BENEATH | LOOKUP_NO_SYMLINKS |
-				      LOOKUP_NO_MAGICLINKS, &path);
-		put_cred(revert_creds(cred));
+		scoped_with_kernel_creds()
+			err = vfs_path_lookup(root.dentry, root.mnt, sunaddr->sun_path,
+					      LOOKUP_BENEATH | LOOKUP_NO_SYMLINKS |
+					      LOOKUP_NO_MAGICLINKS, &path);
 		path_put(&root);
 		if (err)
 			goto fail;

-- 
2.47.3



* Re: [GIT PULL] Block fixes for 6.18-rc3
  2025-10-31 15:43   ` Christian Brauner
@ 2025-10-31 15:53     ` Christian Brauner
  2025-10-31 16:30     ` Linus Torvalds
  1 sibling, 0 replies; 8+ messages in thread
From: Christian Brauner @ 2025-10-31 15:53 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jens Axboe, Paul Moore, Serge Hallyn, linux-block@vger.kernel.org,
	LSM List

On Fri, Oct 31, 2025 at 04:43:54PM +0100, Christian Brauner wrote:
> On Fri, Oct 24, 2025 at 01:31:11PM -0700, Linus Torvalds wrote:
> > [ Adding LSM people. Also Christian, because he did the cred refcount
> 
> Sorry, late to the party. I was working on other stuff. Let me see...
> 
> > cleanup with override_creds() and friends last year, and I'm
> > suggesting taking that one step further ]
> > 
> > On Fri, 24 Oct 2025 at 06:58, Jens Axboe <axboe@kernel.dk> wrote:
> > >
> > > Ondrej Mosnacek (1):
> > >       nbd: override creds to kernel when calling sock_{send,recv}msg()
> > 
> > I've pulled this, but looking at the patch, I note that more than half
> > the patch - 75% to be exact - is just boilerplate for "I need to
> > allocate the kernel cred and deal with error handling there".
> > 
> > It literally has three lines of new actual useful code (two statements
> > and one local variable declaration), and then nine lines of the "setup
> > dance".
> > 
> > Which isn't wrong, but when the infrastructure boilerplate is three
> > times more than the actual code, it makes me think we should maybe
> > just get rid of the
> > 
> >     my_kernel_cred = prepare_kernel_cred(&init_task);
> > 
> > pattern for this use-case, and just let people use "init_cred"
> > directly for things like this.
> > 
> > Because that's essentially what that prepare_kernel_cred() thing
> > returns, except it allocates a new copy of said thing, so now you have
> > error handling and you have to free it after-the-fact.
> > 
> > And I'm not seeing that the extra error handling and freeing dance
> > actually buys us anything at all.
> > 
> > Now, some *other* users actually go on to change the creds: they want
> > that prepare_kernel_cred() dance because they then actually do
> > something else like using their own keyring or whatever (eg the NFS
> > idmap code or some other filesystem stuff).
> > 
> > So it's not like prepare_kernel_cred() is wrong, but in this kind of
> > case where people just go "I'm a driver with hardware access, I want
> > to do something with kernel privileges not user privileges", it
> > actually seems counterproductive to have extra code just to complicate
> > things.
> > 
> > Now, my gut feel is that if we just let people use 'init_cred'
> > directly, we should also make sure that it's always exposed as a
> > 'const struct cred' , but wouldn't that be a whole lot simpler and
> > more straightforward?
> > 
> > This is *not* the only use case of that.
> > 
> > We now have at least four use-cases of this "raw kernel cred" pattern:
> > core-dumping over unix domain socket, nbd, firmware loading and SCSI
> > target all do this exact thing as far as I can tell.
> > 
> > So they all just want that bare kernel cred, and this interface then
> > forces them to do extra work instead of just doing
> > 
> >         old_cred = override_creds(&init_cred);
> >         ...
> >         revert_creds(old_cred);
> > 
> > and it ends up being extra code for allocating and freeing that copy
> > of a cred that we already *had* and could just have used directly.
> > 
> > I did just check that making 'init_cred' be const
> 
> Hm, two immediate observations before I go off and write the series.
> 
> (1) The thing is that init_cred would have to be exposed to modules via
>     EXPORT_SYMBOL() for this to work. It would be easier to just force
>     the use of init_task->cred instead.
> 
>     That pointer deref won't matter in the face of the allocations and
>     refcounts we wipe out with this. Then we should also move init_cred
>     to init/init_task.c and make it static const. Nobody really needs it
>     currently.
> 
> (2) I think the plain override_creds() would work but we can do better.
>     I envision we can leverage CLASS() to completely hide any access to
>     init_cred and force a scope with kernel creds.
> 
> /me goes off to write that up.
> 
> Ok, so I have it and it survives the coredump socket tests. They are a
> prime example for this sort of thing. Any unprivileged task needs to be
> able to connect to the coredump socket when it coredumps so we override
> credentials only for the path lookup. With my patchset this becomes:
> 
>         if (flags & SOCK_COREDUMP) {
>                 struct path root;
> 
>                 task_lock(&init_task);
>                 get_fs_root(init_task.fs, &root);
>                 task_unlock(&init_task);
> 
>                 scoped_with_kernel_creds() 
> 			err = vfs_path_lookup(root.dentry, root.mnt, sunaddr->sun_path,
> 					      LOOKUP_BENEATH | LOOKUP_NO_SYMLINKS |
> 					      LOOKUP_NO_MAGICLINKS, &path);
>                 path_put(&root);
>                 if (err)
>                         goto fail;
>         } else {
> 
> Patches appended.

> Date: Fri, 31 Oct 2025 16:37:35 +0100
> From: Christian Brauner <brauner@kernel.org>
> To: Christian Brauner <brauner@kernel.org>
> Subject: [PATCH 0/6] creds: add {scoped_}with_kernel_creds()
> Message-Id: <20251031-work-creds-init_cred-v1-0-cbf0400d6e0e@kernel.org>

Needs one diff I forgot to fold:

diff --git a/init/init_task.c b/init/init_task.c
index 68059eac9a1e..15288e62334f 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -62,6 +62,9 @@ unsigned long init_shadow_call_stack[SCS_SIZE / sizeof(long)] = {
 };
 #endif

+/* init to 2 - one for init_task, one to ensure it is never freed */
+static struct group_info init_groups = { .usage = REFCOUNT_INIT(2) };
+
 /*
  * The initial credentials for the initial task
  */
diff --git a/kernel/cred.c b/kernel/cred.c
index 9ff0b349b80b..ac87ed9d43b1 100644
--- a/kernel/cred.c
+++ b/kernel/cred.c
@@ -35,9 +35,6 @@ do {                                                                  \

 static struct kmem_cache *cred_jar;

-/* init to 2 - one for init_task, one to ensure it is never freed */
-static struct group_info init_groups = { .usage = REFCOUNT_INIT(2) };
-
 /*
  * The RCU callback to actually dispose of a set of credentials
  */



* Re: [GIT PULL] Block fixes for 6.18-rc3
  2025-10-31 15:43   ` Christian Brauner
  2025-10-31 15:53     ` Christian Brauner
@ 2025-10-31 16:30     ` Linus Torvalds
  2025-11-01 13:33       ` Christian Brauner
  1 sibling, 1 reply; 8+ messages in thread
From: Linus Torvalds @ 2025-10-31 16:30 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Jens Axboe, Paul Moore, Serge Hallyn, linux-block@vger.kernel.org,
	LSM List

On Fri, 31 Oct 2025 at 08:44, Christian Brauner <brauner@kernel.org> wrote:
>
> Hm, two immediate observations before I go off and write the series.
>
> (1) The thing is that init_cred would have to be exposed to modules via
>     EXPORT_SYMBOL() for this to work. It would be easier to just force
>     the use of init_task->cred instead.

Yea, I guess we already export that.

>     That pointer deref won't matter in the face of the allocations and
>     refcounts we wipe out with this. Then we should also move init_cred
>     to init/init_task.c and make it static const. Nobody really needs it
>     currently.

Well, I did the "does it compile ok" with it marked as 'const', but as
mentioned, those 'struct cred' instances aren't *really* const, they
are only pseudo-const things in that they are *marked* const so that
nobody modifies them by mistake, but then the ref-counting will cast
the constness away in order to update references.

So I don't think we can *actually* mark it "static const", because
that will put the data structure in the const data section, and then
the refcounting will trigger kernel page faults.

End result: I think we can indeed move it to init/init_task.c. And
yes, we can and should make it static to that file, but not plain
'const'.

If we expose it to others - but I think you're right that maybe it's
not a good idea - we should *expose* it as a 'const' data structure.

But we should probably put it in some explicitly writable section (I
was going to suggest marking it "__read_mostly", but it turns out some
architectures #define that to be empty, so a "const __read_mostly"
data structure could still end up in a read-only section).
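
The "writable storage, const exposure" arrangement described above can be
modelled in plain userspace C. This is only a sketch under stated
assumptions: the names ending in _model are hypothetical, the cred
structure is reduced to two fields, and a plain int stands in for the
kernel's atomic reference count. The object is defined without const so it
lands in writable data, while callers only ever receive a pointer-to-const.

```c
#include <assert.h>

struct cred {
	int usage;		/* reference count, must stay writable */
	unsigned int uid;
};

/* Deliberately NOT const: the object itself lives in writable data. */
static struct cred init_cred_model = { .usage = 1, .uid = 0 };

/* Consumers only ever see a pointer-to-const (hypothetical accessor). */
const struct cred *kernel_cred_model(void)
{
	return &init_cred_model;
}

/* Taking a reference casts the const away, as the thread describes. */
const struct cred *get_cred_model(const struct cred *cred)
{
	struct cred *nonconst = (struct cred *)cred;

	nonconst->usage++;
	return cred;
}

/* Dropping a reference needs the same cast. */
void put_cred_model(const struct cred *cred)
{
	struct cred *nonconst = (struct cred *)cred;

	nonconst->usage--;
}
```

The casts are well defined here only because init_cred_model was not
defined const. Had it been "static const", the same writes would be
undefined behaviour in C, which is the userspace analogue of the page
fault on read-only data mentioned above.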

> (2) I think the plain override_creds() would work but we can do better.
>     I envision we can leverage CLASS() to completely hide any access to
>     init_cred and force a scope with kernel creds.

Ack.

                  Linus
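
The scoping idea acked above can be modelled in userspace with the
compiler's cleanup attribute. What follows is a sketch under stated
assumptions, not the series' implementation: every name ending in _model
is hypothetical, the cred structure is reduced to a uid, a global pointer
stands in for the current task's creds, and the real
scoped_with_kernel_creds() may well be built differently. It relies on the
GCC/Clang __attribute__((cleanup)) extension.

```c
#include <assert.h>

/* Userspace stand-ins; none of this is the kernel's real code. */
struct cred { unsigned int uid; };

static struct cred kernel_cred_model = { .uid = 0 };
static struct cred user_cred_model = { .uid = 1000 };
static const struct cred *current_cred_model = &user_cred_model;

static const struct cred *override_creds_model(const struct cred *new_cred)
{
	const struct cred *old = current_cred_model;

	current_cred_model = new_cred;
	return old;
}

static void revert_creds_model(const struct cred *old)
{
	current_cred_model = old;
}

struct kcred_scope {
	const struct cred *old;
	int done;
};

/* Runs automatically when the guard variable leaves scope. */
static void kcred_scope_exit(struct kcred_scope *s)
{
	revert_creds_model(s->old);
}

/* One-iteration for loop: the guard lives exactly as long as the body. */
#define scoped_with_kernel_creds_model()				\
	for (struct kcred_scope scope_					\
		__attribute__((cleanup(kcred_scope_exit))) =		\
		{ override_creds_model(&kernel_cred_model), 0 };	\
	     !scope_.done; scope_.done = 1)

unsigned int current_uid_model(void)
{
	return current_cred_model->uid;
}

unsigned int lookup_uid_model(void)
{
	unsigned int seen = 1;

	scoped_with_kernel_creds_model()
		seen = current_uid_model();	/* body runs with uid 0 */

	return seen;
}

int early_return_model(void)
{
	scoped_with_kernel_creds_model()
		return 1;	/* cleanup still reverts the creds */
	return 0;
}
```

The point of the pattern is that there is no allocation, no error path and
no explicit revert call at the use site, and that leaving the scope early
still restores the previous creds.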


* Re: [GIT PULL] Block fixes for 6.18-rc3
  2025-10-31 16:30     ` Linus Torvalds
@ 2025-11-01 13:33       ` Christian Brauner
  0 siblings, 0 replies; 8+ messages in thread
From: Christian Brauner @ 2025-11-01 13:33 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jens Axboe, Paul Moore, Serge Hallyn, linux-block@vger.kernel.org,
	LSM List

On Fri, Oct 31, 2025 at 09:30:11AM -0700, Linus Torvalds wrote:
> On Fri, 31 Oct 2025 at 08:44, Christian Brauner <brauner@kernel.org> wrote:
> >
> > Hm, two immediate observations before I go off and write the series.
> >
> > (1) The thing is that init_cred would have to be exposed to modules via
> >     EXPORT_SYMBOL() for this to work. It would be easier to just force
> >     the use of init_task->cred instead.
> 
> Yea, I guess we already export that.
> 
> >     That pointer deref won't matter in the face of the allocations and
> >     refcounts we wipe out with this. Then we should also move init_cred
> >     to init/init_task.c and make it static const. Nobody really needs it
> >     currently.
> 
> Well, I did the "does it compile ok" with it marked as 'const', but as
> mentioned, those 'struct cred' instances aren't *really* const, they
> are only pseudo-const things in that they are *marked* const so that
> nobody modifies them by mistake, but then the ref-counting will cast
> the constness away in order to update references.
> 
> So I don't think we can *actually* mark it "static const", because
> that will put the data structure in the const data section, and then
> the refcounting will trigger kernel page faults.
> 
> End result: I think we can indeed move it to init/init_task.c. And
> yes, we can and should make it static to that file, but not plain
> 'const'.
> 
> If we expose it to others - but I think you're right that maybe it's
> not a good idea - we should *expose* it as a 'const' data structure.
> 
> But we should probably put it in some explicitly writable section (I
> was going to suggest marking it "__read_mostly", but it turns out some
> architectures #define that to be empty, so a "const __read_mostly"
> data structure could still end up in a read-only section).

For some init data structures that are heavily used such as:

init_pid_ns

it often makes sense to just skip the refcounting completely because we
know they are always around. Take the pid namespace as an example:

static inline struct pid_namespace *get_pid_ns(struct pid_namespace *ns)
{
	if (ns != &init_pid_ns)
		ns_ref_inc(ns);
	return ns;
}

void put_pid_ns(struct pid_namespace *ns)
{
	if (ns && ns != &init_pid_ns && ns_ref_put(ns))
		schedule_work(&ns->work);
}

While it has the obvious disadvantage that it introduces a special-case
into the refcounting and it would obviously be more elegant if we just
did:

void put_pid_ns(struct pid_namespace *ns)
{
	if (ns_ref_put(ns))
		schedule_work(&ns->work);
}

it does elide a ton of refcount increments and decrements during task
creation.

While that's not true for init_cred it would still be easy to just not
refcount it at all if it's worth it.
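
Applied to creds, the same special-casing could look roughly like the
userspace model below. This is a sketch under assumptions: the names
init_cred_model, get_cred_norefs() and put_cred_norefs() are hypothetical,
a plain long stands in for the atomic counter, and nothing here is taken
from the actual cred code.

```c
#include <assert.h>
#include <stddef.h>

struct cred { long usage; };

/* Stand-in for the statically allocated initial creds. */
struct cred init_cred_model = { .usage = 1 };

/* Take a reference, except on the always-present initial creds. */
const struct cred *get_cred_norefs(const struct cred *cred)
{
	if (cred && cred != &init_cred_model)
		((struct cred *)cred)->usage++;
	return cred;
}

/* Returns 1 when the caller dropped the last reference and must free. */
int put_cred_norefs(const struct cred *cred)
{
	if (!cred || cred == &init_cred_model)
		return 0;
	return --((struct cred *)cred)->usage == 0;
}
```

With this shape the initial creds' counter is never written, so an
override/revert pair on them touches no shared cacheline for refcounting.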

Now that I think about it: given that I reworked all the namespace
reference counting completely it should be easy to make all initial
namespaces not get or put reference counts at all, like:

static __always_inline bool is_initial_namespace(struct ns_common *ns)
{
	VFS_WARN_ON_ONCE(ns->ns_id == 0);
	/* initial namespaces have fixed ids and the ids aren't recycled */
	return ns->ns_id <= NS_LAST_INIT_ID;
}

diff --git a/include/linux/ns_common.h b/include/linux/ns_common.h
index 241eb1e98e1d..fe9c81963786 100644
--- a/include/linux/ns_common.h
+++ b/include/linux/ns_common.h
@@ -136,9 +136,8 @@ struct ns_common *__must_check ns_owner(struct ns_common *ns);

 #define to_ns_common(__ns)                                    \
@@ -225,6 +224,8 @@ static __always_inline __must_check int __ns_ref_active_read(const struct ns_com

 static __always_inline __must_check bool __ns_ref_put(struct ns_common *ns)
 {
+       if (is_initial_namespace(ns))
+               return false;
        if (refcount_dec_and_test(&ns->__ns_ref)) {
                VFS_WARN_ON_ONCE(__ns_ref_active_read(ns));
                return true;
@@ -234,6 +235,8 @@ static __always_inline __must_check bool __ns_ref_put(struct ns_common *ns)

 static __always_inline __must_check bool __ns_ref_get(struct ns_common *ns)
 {
+       if (is_initial_namespace(ns))
+               return true;
        if (refcount_inc_not_zero(&ns->__ns_ref))
                return true;
        VFS_WARN_ON_ONCE(__ns_ref_active_read(ns));
@@ -246,7 +249,8 @@ static __always_inline __must_check int __ns_ref_read(const struct ns_common *ns
 }

 #define ns_ref_read(__ns) __ns_ref_read(to_ns_common((__ns)))
-#define ns_ref_inc(__ns) refcount_inc(&to_ns_common((__ns))->__ns_ref)
+#define ns_ref_inc(__ns) \
+       do { if (!is_initial_namespace(to_ns_common(__ns))) refcount_inc(&to_ns_common((__ns))->__ns_ref); } while (0)
 #define ns_ref_get(__ns) __ns_ref_get(to_ns_common((__ns)))
 #define ns_ref_put(__ns) __ns_ref_put(to_ns_common((__ns)))
 #define ns_ref_put_and_lock(__ns, __lock) \

This effectively means we can drop all the special-casing in the
namespace helpers like:

diff --git a/include/linux/pid_namespace.h b/include/linux/pid_namespace.h
index 445517a72ad0..ef06c3d3fb52 100644
--- a/include/linux/pid_namespace.h
+++ b/include/linux/pid_namespace.h
@@ -61,9 +61,8 @@ static inline struct pid_namespace *to_pid_ns(struct ns_common *ns)

 static inline struct pid_namespace *get_pid_ns(struct pid_namespace *ns)
 {
-       if (ns != &init_pid_ns)
-               ns_ref_inc(ns);
+       ns_ref_inc(ns);
        return ns;
 }

 #if defined(CONFIG_SYSCTL) && defined(CONFIG_MEMFD_CREATE)
diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
index 650be58d8d18..e48f5de41361 100644
--- a/kernel/pid_namespace.c
+++ b/kernel/pid_namespace.c
@@ -184,7 +184,7 @@ struct pid_namespace *copy_pid_ns(u64 flags,

 void put_pid_ns(struct pid_namespace *ns)
 {
-       if (ns && ns != &init_pid_ns && ns_ref_put(ns))
+       if (ns && ns_ref_put(ns))
                schedule_work(&ns->work);
 }
 EXPORT_SYMBOL_GPL(put_pid_ns);

And all the other ones - without having looked into any potential
pitfalls - would get the same behavior as the pidns for free. Worth it?

I think especially for the network namespace that might potentially
avoid a bunch of cacheline ping-pong. But idk, it's just a theory. But
it's easy enough to implement.
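
The net effect of the ID check can be illustrated with a small userspace
model. This is a sketch under assumptions: the _model names are
hypothetical, NS_LAST_INIT_ID_MODEL is an arbitrary cutoff chosen for the
example, and a plain unsigned int replaces refcount_t.

```c
#include <assert.h>
#include <stdbool.h>

#define NS_LAST_INIT_ID_MODEL 8ULL	/* hypothetical cutoff */

struct ns_common_model {
	unsigned long long ns_id;
	unsigned int refs;
};

/* Initial namespaces have fixed, low ids that are never recycled. */
static bool is_initial_ns_model(const struct ns_common_model *ns)
{
	return ns->ns_id <= NS_LAST_INIT_ID_MODEL;
}

/* Getting a reference on an initial namespace always succeeds, for free. */
bool ns_ref_get_model(struct ns_common_model *ns)
{
	if (is_initial_ns_model(ns))
		return true;
	if (ns->refs == 0)
		return false;	/* models refcount_inc_not_zero() failing */
	ns->refs++;
	return true;
}

/* Returns true only when the last reference on a non-initial ns is gone. */
bool ns_ref_put_model(struct ns_common_model *ns)
{
	if (is_initial_ns_model(ns))
		return false;
	return --ns->refs == 0;
}
```

Because the check lives in the common helpers, every namespace type
wrapping them would inherit the behaviour without its own comparison
against an init_*_ns symbol.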

> 
> > (2) I think the plain override_creds() would work but we can do better.
> >     I envision we can leverage CLASS() to completely hide any access to
> >     init_cred and force a scope with kernel creds.
> 
> Ack.
> 
>                   Linus


end of thread, other threads:[~2025-11-01 13:33 UTC | newest]

Thread overview: 8+ messages
     [not found] <37fb8720-bee9-43b7-b0ff-0214a8ad33a2@kernel.dk>
2025-10-24 20:31 ` [GIT PULL] Block fixes for 6.18-rc3 Linus Torvalds
2025-10-26 21:09   ` Serge E. Hallyn
2025-10-26 22:57     ` Linus Torvalds
2025-10-27 20:24       ` Paul Moore
2025-10-31 15:43   ` Christian Brauner
2025-10-31 15:53     ` Christian Brauner
2025-10-31 16:30     ` Linus Torvalds
2025-11-01 13:33       ` Christian Brauner
