Re: [PATCH v2] audit: Avoid excessive dput/dget in audit_context setup and reset paths

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Al Viro <viro@zeniv.linux.org.uk>
To: Waiman Long <llong@redhat.com>
Cc: Paul Moore <paul@paul-moore.com>, Eric Paris <eparis@redhat.com>,
	Christian Brauner <brauner@kernel.org>,
	linux-kernel@vger.kernel.org, audit@vger.kernel.org,
	Richard Guy Briggs <rgb@redhat.com>,
	Ricardo Robaina <rrobaina@redhat.com>
Subject: Re: [PATCH v2] audit: Avoid excessive dput/dget in audit_context setup and reset paths
Date: Fri, 6 Feb 2026 06:31:24 +0000	[thread overview]
Message-ID: <20260206063124.GW3183987@ZenIV> (raw)
In-Reply-To: <20260206052218.GV3183987@ZenIV>

On Fri, Feb 06, 2026 at 05:22:18AM +0000, Al Viro wrote:
> On Thu, Feb 05, 2026 at 11:11:51PM -0500, Waiman Long wrote:
> 
> > __latent_entropy
> > struct mnt_namespace *copy_mnt_ns(u64 flags, struct mnt_namespace *ns,
> >                 struct user_namespace *user_ns, struct fs_struct *new_fs)
> > {
> >   :
> >                 if (new_fs) {
> >                         if (&p->mnt == new_fs->root.mnt) {
> >                                 new_fs->root.mnt = mntget(&q->mnt);
> >                                 rootmnt = &p->mnt;
> >                         }
> >                         if (&p->mnt == new_fs->pwd.mnt) {
> >                                 new_fs->pwd.mnt = mntget(&q->mnt);
> >                                 pwdmnt = &p->mnt;
> >                         }
> >                 }
> > 
> > It is replacing the fs->pwd.mnt with a new one while pwd_refs is 1. I can
> > make this work with the new fs_struct field. I do have one question though.
> > Do we need to acquire write_seqlock(&new_fs->seq) if we are changing root or
> > pwd here or if the new_fs are in such a state that it will never change when
> > this copying operation is in progress?
> 
> In all cases when we get to that point, new_fs is always a freshly
> created private copy of current->fs, not reachable from anywhere
> other than stack frames of the callers, but the proof is not pretty.
> copy_mnt_ns() is called only by create_new_namespaces() and it gets to
> copying anything if and only if CLONE_NEWNS is in the flags.  So far,
> so good.  The call in create_new_namespaces() is
> 	new_nsp->mnt_ns = copy_mnt_ns(flags, tsk->nsproxy->mnt_ns, user_ns, new_fs);
> and both flags and new_fs come straight from create_new_namespaces()
> callers.  There are 3 of those:
> 1) int copy_namespaces(u64 flags, struct task_struct *tsk):
> 	new_ns = create_new_namespaces(flags, tsk, user_ns, tsk->fs);
> That gets called by copy_process(), with tsk being the child we are
> creating there.  tsk->fs is set a bit earlier, by copy_fs() call - either
> to extra ref to current->fs (when CLONE_FS is present in flags) or to
> a private copy thereof.  Since in the very beginning of copy_process()
> we have
> 	if ((clone_flags & (CLONE_NEWNS|CLONE_FS)) == (CLONE_NEWNS|CLONE_FS))
> 		return ERR_PTR(-EINVAL);
> and we are interested in case when CLONE_NEWNS is set, tsk->fs is going
> to be a private copy.
> 
> 2) int unshare_nsproxy_namespaces(unsigned long unshare_flags,
> 	struct nsproxy **new_nsp, struct cred *new_cred, struct fs_struct *new_fs):
>         *new_nsp = create_new_namespaces(unshare_flags, current, user_ns,
> 					 new_fs ? new_fs : current->fs);
> That gets called from ksys_unshare().  Earlier in ksys_unshare() we have
>         /*
> 	 * If unsharing namespace, must also unshare filesystem information.
> 	 */
> 	if (unshare_flags & CLONE_NEWNS)
> 		unshare_flags |= CLONE_FS;
> so in our case CLONE_FS is going to be set.  new_fs is initially set
> to NULL and, since CLONE_FS is there, the call of unshare_fs() has
> replaced it with a reference to private copy of current->fs.  Again,
> we get a private copy.
> 
> 3) int exec_task_namespaces(void):
>         new = create_new_namespaces(0, tsk, current_user_ns(), tsk->fs);
> No CLONE_NEWNS in flags, so we don't get there at all.
> 
> Incidentally, one bogosity I've spotted in unshare_fs() today is
>         if (!(unshare_flags & CLONE_FS) || !fs)
> 					^^^^^^ this
> 		return 0;
> 
> The history is amusing - it had been faithfully preserved since
> cf2e340f4249 ("[PATCH] unshare system call -v5: system call handler
> function") back in 2006, when unshare(2) had been added; initially it
> had been a stub:
> +static int unshare_fs(unsigned long unshare_flags, struct fs_struct **new_fsp)
> +{
> +       struct fs_struct *fs = current->fs;
> +
> +       if ((unshare_flags & CLONE_FS) &&
> +           (fs && atomic_read(&fs->count) > 1))
> +               return -EINVAL;
> +
> +       return 0;
> +}
> 
> The thing is, task->fs does not become NULL until exit_fs() clears it, at
> which point we'd better *not* run into any syscalls, unshare(2) included
> ;-) The same had been true back in 2006; as the matter of fact, I don't
> know if it _ever_ had not been true.  I suspect that the logics had been
> copied from exit_fs(), which also has a pointless check that seems to have
> been added there in 1.3.26, when task->fs went kmalloc'ed.  The thing
> is, in 1.3.26 copy_fs() a failed allocation aborted do_fork() (today's
> copy_process()) with exit_fs() never called for the child-not-to-be...

Actually, now I remember - until 2.5.<early> we used to have kernel threads
that did exit_fs() and normally followed that by manual switch to extra
reference to init_task.fs.  Mostly - by calling daemonize(), some - manually;
there had been some (mtdblock definitely among those, might be more) that
didn't bother with the second part.

The upshot had been that we could get transient task->fs switching to
NULL and then back to &init_fs (that could happen only to kernel threads)
and some of those didn't bother to switch back to non-NULL.

None of that had ever been relevant for unshare(2); the last remnants
of that bogosity went away in 2009 when we got task_lock(current) over
all changes of current->fs, with NULL never being stored there for
as long as the thread remains alive.

Mea culpa - mnt_copy_ns() (all way back to 2001, when it had been
added as copy_namespace()) is in the same boat as unshare(2); none
of those kernel threads would ever spawn a new namespace, let alone
doing that right in the middle between exit_fs() and setting ->fs
to &init_task.fs.  Which is to say, that
	if (new_fs) {
in there had always been just as bogus - condition is never false ;-/

next prev parent reply	other threads:[~2026-02-06  6:29 UTC|newest]

Thread overview: 38+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-02-03 19:44 [PATCH v2] audit: Avoid excessive dput/dget in audit_context setup and reset paths Waiman Long
2026-02-03 19:59 ` Al Viro
2026-02-03 20:18   ` Waiman Long
2026-02-03 20:05 ` Al Viro
2026-02-03 20:32   ` Waiman Long
2026-02-03 21:50     ` Al Viro
2026-02-03 23:26       ` Al Viro
2026-02-04  4:21         ` Waiman Long
2026-02-04  6:26           ` Al Viro
2026-02-04 18:16             ` Waiman Long
2026-02-04 20:18               ` Al Viro
2026-02-05  3:03                 ` Waiman Long
2026-02-05  4:45                   ` Waiman Long
2026-02-05 23:53                     ` Al Viro
2026-02-06  1:20                       ` Waiman Long
2026-02-06  4:11                         ` Waiman Long
2026-02-06  4:19                           ` Waiman Long
2026-02-06  5:22                           ` Al Viro
2026-02-06  6:31                             ` Al Viro [this message]
2026-02-06  6:38                               ` Al Viro
2026-02-06  7:13                             ` Al Viro
2026-02-06 19:16                             ` Waiman Long
2026-02-06 20:04                               ` Waiman Long
2026-02-06 20:38                                 ` Al Viro
2026-02-07  8:25                                 ` [PATCH][RFC] bug in unshare(2) failure recovery Al Viro
2026-02-07 23:06                                   ` Waiman Long
2026-02-17 12:49                                   ` Christian Brauner
2026-02-17 12:49                                   ` Christian Brauner
2026-02-06 20:29                               ` [PATCH v2] audit: Avoid excessive dput/dget in audit_context setup and reset paths Al Viro
2026-02-06 20:58                                 ` setns(2) vs. pivot_root(2) (was Re: [PATCH v2] audit: Avoid excessive dput/dget in audit_context setup and reset paths) Al Viro
2026-02-06 21:09                                   ` Al Viro
2026-02-17 13:12                                     ` Christian Brauner
2026-02-06  8:15                       ` [PATCH v2] audit: Avoid excessive dput/dget in audit_context setup and reset paths Al Viro
2026-02-05  5:22                   ` Al Viro
2026-02-05 13:59                     ` Waiman Long
2026-02-05 17:53                     ` Mateusz Guzik
2026-02-17 13:33                       ` Christian Brauner
2026-02-17 13:44                         ` Mateusz Guzik

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20260206063124.GW3183987@ZenIV \
    --to=viro@zeniv.linux.org.uk \
    --cc=audit@vger.kernel.org \
    --cc=brauner@kernel.org \
    --cc=eparis@redhat.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=llong@redhat.com \
    --cc=paul@paul-moore.com \
    --cc=rgb@redhat.com \
    --cc=rrobaina@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.