From: Al Viro <viro@zeniv.linux.org.uk>
To: Waiman Long <llong@redhat.com>
Cc: Paul Moore <paul@paul-moore.com>, Eric Paris <eparis@redhat.com>,
Christian Brauner <brauner@kernel.org>,
linux-kernel@vger.kernel.org, audit@vger.kernel.org,
Richard Guy Briggs <rgb@redhat.com>,
Ricardo Robaina <rrobaina@redhat.com>
Subject: Re: [PATCH v2] audit: Avoid excessive dput/dget in audit_context setup and reset paths
Date: Fri, 6 Feb 2026 05:22:18 +0000 [thread overview]
Message-ID: <20260206052218.GV3183987@ZenIV> (raw)
In-Reply-To: <3a5f84fc-5c4e-4ce1-b2dd-6e07b109ce78@redhat.com>
On Thu, Feb 05, 2026 at 11:11:51PM -0500, Waiman Long wrote:
> __latent_entropy
> struct mnt_namespace *copy_mnt_ns(u64 flags, struct mnt_namespace *ns,
> struct user_namespace *user_ns, struct fs_struct *new_fs)
> {
> :
> if (new_fs) {
> if (&p->mnt == new_fs->root.mnt) {
> new_fs->root.mnt = mntget(&q->mnt);
> rootmnt = &p->mnt;
> }
> if (&p->mnt == new_fs->pwd.mnt) {
> new_fs->pwd.mnt = mntget(&q->mnt);
> pwdmnt = &p->mnt;
> }
> }
>
> It is replacing the fs->pwd.mnt with a new one while pwd_refs is 1. I can
> make this work with the new fs_struct field. I do have one question though.
> Do we need to acquire write_seqlock(&new_fs->seq) if we are changing root or
> pwd here or if the new_fs are in such a state that it will never change when
> this copying operation is in progress?
In all cases when we get to that point, new_fs is always a freshly
created private copy of current->fs, not reachable from anywhere
other than stack frames of the callers, but the proof is not pretty.
copy_mnt_ns() is called only by create_new_namespaces() and it gets to
copying anything if and only if CLONE_NEWNS is in the flags. So far,
so good. The call in create_new_namespaces() is
new_nsp->mnt_ns = copy_mnt_ns(flags, tsk->nsproxy->mnt_ns, user_ns, new_fs);
and both flags and new_fs come straight from create_new_namespaces()
callers. There are 3 of those:
1) int copy_namespaces(u64 flags, struct task_struct *tsk):
new_ns = create_new_namespaces(flags, tsk, user_ns, tsk->fs);
That gets called by copy_process(), with tsk being the child we are
creating there. tsk->fs is set a bit earlier, by copy_fs() call - either
to extra ref to current->fs (when CLONE_FS is present in flags) or to
a private copy thereof. Since in the very beginning of copy_process()
we have
if ((clone_flags & (CLONE_NEWNS|CLONE_FS)) == (CLONE_NEWNS|CLONE_FS))
return ERR_PTR(-EINVAL);
and we are interested in case when CLONE_NEWNS is set, tsk->fs is going
to be a private copy.
2) int unshare_nsproxy_namespaces(unsigned long unshare_flags,
struct nsproxy **new_nsp, struct cred *new_cred, struct fs_struct *new_fs):
*new_nsp = create_new_namespaces(unshare_flags, current, user_ns,
new_fs ? new_fs : current->fs);
That gets called from ksys_unshare(). Earlier in ksys_unshare() we have
/*
* If unsharing namespace, must also unshare filesystem information.
*/
if (unshare_flags & CLONE_NEWNS)
unshare_flags |= CLONE_FS;
so in our case CLONE_FS is going to be set. new_fs is initially set
to NULL and, since CLONE_FS is there, the call of unshare_fs() has
replaced it with a reference to private copy of current->fs. Again,
we get a private copy.
3) int exec_task_namespaces(void):
new = create_new_namespaces(0, tsk, current_user_ns(), tsk->fs);
No CLONE_NEWNS in flags, so we don't get there at all.
Incidentally, one bogosity I've spotted in unshare_fs() today is
if (!(unshare_flags & CLONE_FS) || !fs)
^^^^^^ this
return 0;
The history is amusing - it had been faithfully preserved since
cf2e340f4249 ("[PATCH] unshare system call -v5: system call handler
function") back in 2006, when unshare(2) had been added; initially it
had been a stub:
+static int unshare_fs(unsigned long unshare_flags, struct fs_struct **new_fsp)
+{
+ struct fs_struct *fs = current->fs;
+
+ if ((unshare_flags & CLONE_FS) &&
+ (fs && atomic_read(&fs->count) > 1))
+ return -EINVAL;
+
+ return 0;
+}
The thing is, task->fs does not become NULL until exit_fs() clears it, at
which point we'd better *not* run into any syscalls, unshare(2) included
;-) The same had been true back in 2006; as the matter of fact, I don't
know if it _ever_ had not been true. I suspect that the logics had been
copied from exit_fs(), which also has a pointless check that seems to have
been added there in 1.3.26, when task->fs went kmalloc'ed. The thing
is, in 1.3.26 copy_fs() a failed allocation aborted do_fork() (today's
copy_process()) with exit_fs() never called for the child-not-to-be...
Anyway, with exit_fs() there might have been some value in making it
idempotent, but for unshare_fs() that never made sense. Nobody noticed
and nobody who'd massaged that function afterwards (myself included)
had ever asked WTF would that exit be about and what would happen if we
ever ended up going there (answer: an oops galore)...
next prev parent reply other threads:[~2026-02-06 5:20 UTC|newest]
Thread overview: 38+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-02-03 19:44 [PATCH v2] audit: Avoid excessive dput/dget in audit_context setup and reset paths Waiman Long
2026-02-03 19:59 ` Al Viro
2026-02-03 20:18 ` Waiman Long
2026-02-03 20:05 ` Al Viro
2026-02-03 20:32 ` Waiman Long
2026-02-03 21:50 ` Al Viro
2026-02-03 23:26 ` Al Viro
2026-02-04 4:21 ` Waiman Long
2026-02-04 6:26 ` Al Viro
2026-02-04 18:16 ` Waiman Long
2026-02-04 20:18 ` Al Viro
2026-02-05 3:03 ` Waiman Long
2026-02-05 4:45 ` Waiman Long
2026-02-05 23:53 ` Al Viro
2026-02-06 1:20 ` Waiman Long
2026-02-06 4:11 ` Waiman Long
2026-02-06 4:19 ` Waiman Long
2026-02-06 5:22 ` Al Viro [this message]
2026-02-06 6:31 ` Al Viro
2026-02-06 6:38 ` Al Viro
2026-02-06 7:13 ` Al Viro
2026-02-06 19:16 ` Waiman Long
2026-02-06 20:04 ` Waiman Long
2026-02-06 20:38 ` Al Viro
2026-02-07 8:25 ` [PATCH][RFC] bug in unshare(2) failure recovery Al Viro
2026-02-07 23:06 ` Waiman Long
2026-02-17 12:49 ` Christian Brauner
2026-02-17 12:49 ` Christian Brauner
2026-02-06 20:29 ` [PATCH v2] audit: Avoid excessive dput/dget in audit_context setup and reset paths Al Viro
2026-02-06 20:58 ` setns(2) vs. pivot_root(2) (was Re: [PATCH v2] audit: Avoid excessive dput/dget in audit_context setup and reset paths) Al Viro
2026-02-06 21:09 ` Al Viro
2026-02-17 13:12 ` Christian Brauner
2026-02-06 8:15 ` [PATCH v2] audit: Avoid excessive dput/dget in audit_context setup and reset paths Al Viro
2026-02-05 5:22 ` Al Viro
2026-02-05 13:59 ` Waiman Long
2026-02-05 17:53 ` Mateusz Guzik
2026-02-17 13:33 ` Christian Brauner
2026-02-17 13:44 ` Mateusz Guzik
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20260206052218.GV3183987@ZenIV \
--to=viro@zeniv.linux.org.uk \
--cc=audit@vger.kernel.org \
--cc=brauner@kernel.org \
--cc=eparis@redhat.com \
--cc=linux-kernel@vger.kernel.org \
--cc=llong@redhat.com \
--cc=paul@paul-moore.com \
--cc=rgb@redhat.com \
--cc=rrobaina@redhat.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox