From: Al Viro <viro@zeniv.linux.org.uk>
To: Waiman Long <llong@redhat.com>
Cc: Paul Moore <paul@paul-moore.com>, Eric Paris <eparis@redhat.com>,
Christian Brauner <brauner@kernel.org>,
linux-kernel@vger.kernel.org, audit@vger.kernel.org,
Richard Guy Briggs <rgb@redhat.com>,
Ricardo Robaina <rrobaina@redhat.com>
Subject: Re: [PATCH v2] audit: Avoid excessive dput/dget in audit_context setup and reset paths
Date: Fri, 6 Feb 2026 06:31:24 +0000 [thread overview]
Message-ID: <20260206063124.GW3183987@ZenIV> (raw)
In-Reply-To: <20260206052218.GV3183987@ZenIV>
On Fri, Feb 06, 2026 at 05:22:18AM +0000, Al Viro wrote:
> On Thu, Feb 05, 2026 at 11:11:51PM -0500, Waiman Long wrote:
>
> > __latent_entropy
> > struct mnt_namespace *copy_mnt_ns(u64 flags, struct mnt_namespace *ns,
> > struct user_namespace *user_ns, struct fs_struct *new_fs)
> > {
> > :
> > if (new_fs) {
> > if (&p->mnt == new_fs->root.mnt) {
> > new_fs->root.mnt = mntget(&q->mnt);
> > rootmnt = &p->mnt;
> > }
> > if (&p->mnt == new_fs->pwd.mnt) {
> > new_fs->pwd.mnt = mntget(&q->mnt);
> > pwdmnt = &p->mnt;
> > }
> > }
> >
> > It is replacing the fs->pwd.mnt with a new one while pwd_refs is 1. I can
> > make this work with the new fs_struct field. I do have one question though.
> > Do we need to acquire write_seqlock(&new_fs->seq) if we are changing root or
> > pwd here or if the new_fs are in such a state that it will never change when
> > this copying operation is in progress?
>
> In all cases when we get to that point, new_fs is always a freshly
> created private copy of current->fs, not reachable from anywhere
> other than stack frames of the callers, but the proof is not pretty.
> copy_mnt_ns() is called only by create_new_namespaces() and it gets to
> copying anything if and only if CLONE_NEWNS is in the flags. So far,
> so good. The call in create_new_namespaces() is
> new_nsp->mnt_ns = copy_mnt_ns(flags, tsk->nsproxy->mnt_ns, user_ns, new_fs);
> and both flags and new_fs come straight from create_new_namespaces()
> callers. There are 3 of those:
> 1) int copy_namespaces(u64 flags, struct task_struct *tsk):
> new_ns = create_new_namespaces(flags, tsk, user_ns, tsk->fs);
> That gets called by copy_process(), with tsk being the child we are
> creating there. tsk->fs is set a bit earlier, by copy_fs() call - either
> to extra ref to current->fs (when CLONE_FS is present in flags) or to
> a private copy thereof. Since in the very beginning of copy_process()
> we have
> if ((clone_flags & (CLONE_NEWNS|CLONE_FS)) == (CLONE_NEWNS|CLONE_FS))
> return ERR_PTR(-EINVAL);
> and we are interested in case when CLONE_NEWNS is set, tsk->fs is going
> to be a private copy.
>
> 2) int unshare_nsproxy_namespaces(unsigned long unshare_flags,
> struct nsproxy **new_nsp, struct cred *new_cred, struct fs_struct *new_fs):
> *new_nsp = create_new_namespaces(unshare_flags, current, user_ns,
> new_fs ? new_fs : current->fs);
> That gets called from ksys_unshare(). Earlier in ksys_unshare() we have
> /*
> * If unsharing namespace, must also unshare filesystem information.
> */
> if (unshare_flags & CLONE_NEWNS)
> unshare_flags |= CLONE_FS;
> so in our case CLONE_FS is going to be set. new_fs is initially set
> to NULL and, since CLONE_FS is there, the call of unshare_fs() has
> replaced it with a reference to private copy of current->fs. Again,
> we get a private copy.
>
> 3) int exec_task_namespaces(void):
> new = create_new_namespaces(0, tsk, current_user_ns(), tsk->fs);
> No CLONE_NEWNS in flags, so we don't get there at all.
>
> Incidentally, one bogosity I've spotted in unshare_fs() today is
> if (!(unshare_flags & CLONE_FS) || !fs)
> ^^^^^^ this
> return 0;
>
> The history is amusing - it had been faithfully preserved since
> cf2e340f4249 ("[PATCH] unshare system call -v5: system call handler
> function") back in 2006, when unshare(2) had been added; initially it
> had been a stub:
> +static int unshare_fs(unsigned long unshare_flags, struct fs_struct **new_fsp)
> +{
> + struct fs_struct *fs = current->fs;
> +
> + if ((unshare_flags & CLONE_FS) &&
> + (fs && atomic_read(&fs->count) > 1))
> + return -EINVAL;
> +
> + return 0;
> +}
>
> The thing is, task->fs does not become NULL until exit_fs() clears it, at
> which point we'd better *not* run into any syscalls, unshare(2) included
> ;-) The same had been true back in 2006; as the matter of fact, I don't
> know if it _ever_ had not been true. I suspect that the logics had been
> copied from exit_fs(), which also has a pointless check that seems to have
> been added there in 1.3.26, when task->fs went kmalloc'ed. The thing
> is, in 1.3.26 copy_fs() a failed allocation aborted do_fork() (today's
> copy_process()) with exit_fs() never called for the child-not-to-be...
Actually, now I remember - until 2.5.<early> we used to have kernel threads
that did exit_fs() and normally followed that by manual switch to extra
reference to init_task.fs. Mostly - by calling daemonize(), some - manually;
there had been some (mtdblock definitely among those, might be more) that
didn't bother with the second part.
The upshot had been that we could get transient task->fs switching to
NULL and then back to &init_fs (that could happen only to kernel threads)
and some of those didn't bother to switch back to non-NULL.
None of that had ever been relevant for unshare(2); the last remnants
of that bogosity went away in 2009 when we got task_lock(current) over
all changes of current->fs, with NULL never being stored there for
as long as the thread remains alive.
Mea culpa - mnt_copy_ns() (all way back to 2001, when it had been
added as copy_namespace()) is in the same boat as unshare(2); none
of those kernel threads would ever spawn a new namespace, let alone
doing that right in the middle between exit_fs() and setting ->fs
to &init_task.fs. Which is to say, that
if (new_fs) {
in there had always been just as bogus - condition is never false ;-/
next prev parent reply other threads:[~2026-02-06 6:29 UTC|newest]
Thread overview: 38+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-02-03 19:44 [PATCH v2] audit: Avoid excessive dput/dget in audit_context setup and reset paths Waiman Long
2026-02-03 19:59 ` Al Viro
2026-02-03 20:18 ` Waiman Long
2026-02-03 20:05 ` Al Viro
2026-02-03 20:32 ` Waiman Long
2026-02-03 21:50 ` Al Viro
2026-02-03 23:26 ` Al Viro
2026-02-04 4:21 ` Waiman Long
2026-02-04 6:26 ` Al Viro
2026-02-04 18:16 ` Waiman Long
2026-02-04 20:18 ` Al Viro
2026-02-05 3:03 ` Waiman Long
2026-02-05 4:45 ` Waiman Long
2026-02-05 23:53 ` Al Viro
2026-02-06 1:20 ` Waiman Long
2026-02-06 4:11 ` Waiman Long
2026-02-06 4:19 ` Waiman Long
2026-02-06 5:22 ` Al Viro
2026-02-06 6:31 ` Al Viro [this message]
2026-02-06 6:38 ` Al Viro
2026-02-06 7:13 ` Al Viro
2026-02-06 19:16 ` Waiman Long
2026-02-06 20:04 ` Waiman Long
2026-02-06 20:38 ` Al Viro
2026-02-07 8:25 ` [PATCH][RFC] bug in unshare(2) failure recovery Al Viro
2026-02-07 23:06 ` Waiman Long
2026-02-17 12:49 ` Christian Brauner
2026-02-17 12:49 ` Christian Brauner
2026-02-06 20:29 ` [PATCH v2] audit: Avoid excessive dput/dget in audit_context setup and reset paths Al Viro
2026-02-06 20:58 ` setns(2) vs. pivot_root(2) (was Re: [PATCH v2] audit: Avoid excessive dput/dget in audit_context setup and reset paths) Al Viro
2026-02-06 21:09 ` Al Viro
2026-02-17 13:12 ` Christian Brauner
2026-02-06 8:15 ` [PATCH v2] audit: Avoid excessive dput/dget in audit_context setup and reset paths Al Viro
2026-02-05 5:22 ` Al Viro
2026-02-05 13:59 ` Waiman Long
2026-02-05 17:53 ` Mateusz Guzik
2026-02-17 13:33 ` Christian Brauner
2026-02-17 13:44 ` Mateusz Guzik
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20260206063124.GW3183987@ZenIV \
--to=viro@zeniv.linux.org.uk \
--cc=audit@vger.kernel.org \
--cc=brauner@kernel.org \
--cc=eparis@redhat.com \
--cc=linux-kernel@vger.kernel.org \
--cc=llong@redhat.com \
--cc=paul@paul-moore.com \
--cc=rgb@redhat.com \
--cc=rrobaina@redhat.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox