Linux userland API discussions
 help / color / mirror / Atom feed
From: Christian Brauner <brauner@kernel.org>
To: Aleksa Sarai <cyphar@cyphar.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>, Jan Kara <jack@suse.cz>,
	 Jonathan Corbet <corbet@lwn.net>, Shuah Khan <shuah@kernel.org>,
	linux-kernel@vger.kernel.org,  linux-fsdevel@vger.kernel.org,
	linux-api@vger.kernel.org, linux-doc@vger.kernel.org,
	 linux-kselftest@vger.kernel.org
Subject: Re: [PATCH RFC v2 3/4] procfs: add PROCFS_GET_PID_NAMESPACE ioctl
Date: Thu, 31 Jul 2025 12:31:40 +0200	[thread overview]
Message-ID: <20250731-angliederung-mahlt-9e5811969817@brauner> (raw)
In-Reply-To: <2025-07-25.1753409614-vile-track-icky-epidemic-frail-antidote-d7NYuu@cyphar.com>

On Fri, Jul 25, 2025 at 12:24:28PM +1000, Aleksa Sarai wrote:
> On 2025-07-24, Christian Brauner <brauner@kernel.org> wrote:
> > On Wed, Jul 23, 2025 at 09:18:53AM +1000, Aleksa Sarai wrote:
> > > /proc has historically had very opaque semantics about PID namespaces,
> > > which is a little unfortunate for container runtimes and other programs
> > > that deal with switching namespaces very often. One common issue is that
> > > of converting between PIDs in the process's namespace and PIDs in the
> > > namespace of /proc.
> > > 
> > > In principle, it is possible to do this today by opening a pidfd with
> > > pidfd_open(2) and then looking at /proc/self/fdinfo/$n (which will
> > > contain a PID value translated to the pid namespace associated with that
> > > procfs superblock). However, allocating a new file for each PID to be
> > > converted is less than ideal for programs that may need to scan procfs,
> > > and it is generally useful for userspace to be able to finally get this
> > > information from procfs.
> > > 
> > > So, add a new API for this in the form of an ioctl(2) you can call on
> > > the root directory of procfs. The returned file descriptor will have
> > > O_CLOEXEC set. This acts as a sister feature to the new "pidns" mount
> > > option, finally allowing userspace full control of the pid namespaces
> > > associated with procfs instances.
> > > 
> > > The permission model for this is a bit looser than that of the "pidns"
> > > mount option, but this is mainly because /proc/1/ns/pid provides the
> > > same information, so as long as you have access to that magic-link (or
> > > something equivalently reasonable such as privileges with CAP_SYS_ADMIN
> > > or being in an ancestor pid namespace) it makes sense to allow userspace
> > > to grab a handle. setns(2) will still have their own permission checks,
> > > so being able to open a pidns handle doesn't really provide too many
> > > other capabilities.
> > > 
> > > Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
> > > ---
> > >  Documentation/filesystems/proc.rst |  4 +++
> > >  fs/proc/root.c                     | 54 ++++++++++++++++++++++++++++++++++++--
> > >  include/uapi/linux/fs.h            |  3 +++
> > >  3 files changed, 59 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
> > > index c520b9f8a3fd..506383273c9d 100644
> > > --- a/Documentation/filesystems/proc.rst
> > > +++ b/Documentation/filesystems/proc.rst
> > > @@ -2398,6 +2398,10 @@ pidns= specifies a pid namespace (either as a string path to something like
> > >  will be used by the procfs instance when translating pids. By default, procfs
> > >  will use the calling process's active pid namespace.
> > >  
> > > +Processes can check which pid namespace is used by a procfs instance by using
> > > +the `PROCFS_GET_PID_NAMESPACE` ioctl() on the root directory of the procfs
> > > +instance.
> > > +
> > >  Chapter 5: Filesystem behavior
> > >  ==============================
> > >  
> > > diff --git a/fs/proc/root.c b/fs/proc/root.c
> > > index 057c8a125c6e..548a57ec2152 100644
> > > --- a/fs/proc/root.c
> > > +++ b/fs/proc/root.c
> > > @@ -23,8 +23,10 @@
> > >  #include <linux/cred.h>
> > >  #include <linux/magic.h>
> > >  #include <linux/slab.h>
> > > +#include <linux/ptrace.h>
> > >  
> > >  #include "internal.h"
> > > +#include "../internal.h"
> > >  
> > >  struct proc_fs_context {
> > >  	struct pid_namespace	*pid_ns;
> > > @@ -418,15 +420,63 @@ static int proc_root_readdir(struct file *file, struct dir_context *ctx)
> > >  	return proc_pid_readdir(file, ctx);
> > >  }
> > >  
> > > +static long int proc_root_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
> > > +{
> > > +	switch (cmd) {
> > > +#ifdef CONFIG_PID_NS
> > > +	case PROCFS_GET_PID_NAMESPACE: {
> > > +		struct pid_namespace *active = task_active_pid_ns(current);
> > > +		struct pid_namespace *ns = proc_pid_ns(file_inode(filp)->i_sb);
> > > +		bool can_access_pidns = false;
> > > +
> > > +		/*
> > > +		 * If we are in an ancestors of the pidns, or have join
> > > +		 * privileges (CAP_SYS_ADMIN), then it makes sense that we
> > > +		 * would be able to grab a handle to the pidns.
> > > +		 *
> > > +		 * Otherwise, if there is a root process, then being able to
> > > +		 * access /proc/$pid/ns/pid is equivalent to this ioctl and so
> > > +		 * we should probably match the permission model. For empty
> > > +		 * namespaces it seems unlikely for there to be a downside to
> > > +		 * allowing unprivileged users to open a handle to it (setns
> > > +		 * will fail for unprivileged users anyway).
> > > +		 */
> > > +		can_access_pidns = pidns_is_ancestor(ns, active) ||
> > > +				   ns_capable(ns->user_ns, CAP_SYS_ADMIN);
> > 
> > This seems to imply that if @ns is a descendant of @active that the
> > caller holds privileges over it. Is that actually always true?
> > 
> > IOW, why is the check different from the previous pidns= mount option
> > check. I would've expected:
> > 
> > ns_capable(_no_audit)(ns->user_ns) && pidns_is_ancestor(ns, active)
> > 
> > and then the ptrace check as a fallback.
> 
> That would mirror pidns_install(), and I did think about it. The primary
> (mostly handwave-y) reasoning I had for making it less strict was that:
> 
>  * If you are in an ancestor pidns, then you can already see those
>    processes in your own /proc. In theory that means that you will be
>    able to access /proc/$pid/ns/pid for at least some subprocess there
>    (even if some subprocesses have SUID_DUMP_DISABLE, that flag is
>    cleared on ).
> 
>    Though hypothetically if they are all running as a different user,
>    this does not apply (and you could create scenarios where a child
>    pidns is owned by a userns that you do not have privileges over -- if
>    you deal with setuid binaries). Maybe that risk means we should just
>    combine them, I'm not sure.
> 
>  * If you have CAP_SYS_ADMIN permissions over the pidns, it seems
>    strange to disallow access even if it is not in an ancestor
>    namespace. This is distinct to pidns_install(), where you want to
>    ensure you cannot escape to a parent pid namespace, this is about
>    getting a handle to do other operations (i.e. NS_GET_{P,TG}ID_*_PIDNS).
> 
> Maybe they should be combined to match pidns_install(), but then I would
> expect the ptrace_may_access() check to apply to all processes in the
> pidns to make it less restrictive, which is not something you can
> practically do (and there is a higher chance that pid1 will have
> SUID_DUMP_DISABLE than some random subprocess, which almost certainly
> will not be SUID_DUMP_DISABLE).
> 
> Fundamentally, I guess I'm still trying to see what the risk is of
> allowing a process to get a handle to a pidns that they have some kind
> of privilege over (whether it's CAP_SYS_ADMIN, or by the virtue of being

There shouldn't be. For example, you kinda implicitly do that with a
pidfd, no? Because you can pass the pidfd to setns() instead of a
namespace fd itself. Maybe that's the argument you're lookin for?

  reply	other threads:[~2025-07-31 10:31 UTC|newest]

Thread overview: 13+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-07-22 23:18 [PATCH RFC v2 0/4] procfs: make reference pidns more user-visible Aleksa Sarai
2025-07-22 23:18 ` [PATCH RFC v2 1/4] pidns: move is-ancestor logic to helper Aleksa Sarai
2025-07-24  7:06   ` Christian Brauner
2025-07-22 23:18 ` [PATCH RFC v2 2/4] procfs: add "pidns" mount option Aleksa Sarai
2025-07-24  7:25   ` Christian Brauner
2025-07-25  2:13     ` Aleksa Sarai
2025-07-22 23:18 ` [PATCH RFC v2 3/4] procfs: add PROCFS_GET_PID_NAMESPACE ioctl Aleksa Sarai
2025-07-24  7:34   ` Christian Brauner
2025-07-25  2:24     ` Aleksa Sarai
2025-07-31 10:31       ` Christian Brauner [this message]
2025-07-31 14:21         ` Aleksa Sarai
2025-07-22 23:18 ` [PATCH RFC v2 4/4] selftests/proc: add tests for new pidns APIs Aleksa Sarai
2025-07-24  7:36 ` [PATCH RFC v2 0/4] procfs: make reference pidns more user-visible Christian Brauner

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20250731-angliederung-mahlt-9e5811969817@brauner \
    --to=brauner@kernel.org \
    --cc=corbet@lwn.net \
    --cc=cyphar@cyphar.com \
    --cc=jack@suse.cz \
    --cc=linux-api@vger.kernel.org \
    --cc=linux-doc@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-kselftest@vger.kernel.org \
    --cc=shuah@kernel.org \
    --cc=viro@zeniv.linux.org.uk \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox