[PATCH RFC v2 0/4] procfs: make reference pidns more user-visible

* [PATCH RFC v2 0/4] procfs: make reference pidns more user-visible
@ 2025-07-22 23:18 Aleksa Sarai
  2025-07-22 23:18 ` [PATCH RFC v2 1/4] pidns: move is-ancestor logic to helper Aleksa Sarai
                   ` (4 more replies)
  0 siblings, 5 replies; 13+ messages in thread
From: Aleksa Sarai @ 2025-07-22 23:18 UTC (permalink / raw)
  To: Alexander Viro, Christian Brauner, Jan Kara, Jonathan Corbet,
	Shuah Khan
  Cc: linux-kernel, linux-fsdevel, linux-api, linux-doc,
	linux-kselftest, Aleksa Sarai

Ever since the introduction of pid namespaces, procfs has had very
implicit behaviour surrounding them (the pidns used by a procfs mount is
auto-selected based on the mounting process's active pidns, and the
pidns itself is basically hidden once the mount has been constructed).

/* pidns mount option for procfs */

This implicit behaviour has historically meant that userspace was
required to do some special dances in order to configure the pidns of a
procfs mount as desired. Examples include:

 * In order to bypass the mnt_too_revealing() check, Kubernetes creates
   a procfs mount from an empty pidns so that user namespaced containers
   can be nested (without this, the nested containers would fail to
   mount procfs). But this requires forking off a helper process because
   you cannot just one-shot this using mount(2).

 * Container runtimes in general need to fork into a container before
   configuring its mounts, which can lead to security issues in the case
   of shared-pidns containers (a privileged process in the pidns can
   interact with your container runtime process). While
   SUID_DUMP_DISABLE and user namespaces make this less of an issue, the
   strict need for this due to a minor uAPI wart is kind of unfortunate.

Things would be much easier if there were a way for userspace to just
specify the pidns they want. Patch 2 implements a new "pidns" argument
which can be set using fsconfig(2):

    fsconfig(procfd, FSCONFIG_SET_FD, "pidns", NULL, nsfd);
    fsconfig(procfd, FSCONFIG_SET_STRING, "pidns", "/proc/self/ns/pid", 0);

or classic mount(2) / mount(8):

    // mount -t proc -o pidns=/proc/self/ns/pid proc /tmp/proc
    mount("proc", "/tmp/proc", "proc", MS_..., "pidns=/proc/self/ns/pid");
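
For the classic mount(2) form, the caller has to build the data string
itself. A minimal sketch of such a helper follows; format_pidns_option()
is a hypothetical name, not part of this series, and the comma check
assumes procfs keeps using the generic comma-separated option parser.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (not part of this series): build the mount(2)
 * data string "pidns=<path>" for a pid namespace path such as
 * /proc/self/ns/pid.
 *
 * Returns 0 on success, -1 if the path cannot be expressed as a mount
 * option or if buf is too small.
 */
static int format_pidns_option(char *buf, size_t len, const char *nspath)
{
	int n;

	assert(buf != NULL && nspath != NULL);

	/* Assumption: options are split on commas, so reject them here. */
	if (strchr(nspath, ',') != NULL)
		return -1;

	n = snprintf(buf, len, "pidns=%s", nspath);
	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

The resulting string would be passed as the last argument of mount(2).
A path that cannot be encoded this way can still be handed over with
fsconfig(2) and FSCONFIG_SET_FD.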

The initial security model I have in this RFC is to be as conservative
as possible and just mirror the security model for setns(2) -- which
means that you can only set pidns=... to pid namespaces that your
current pid namespace is a direct ancestor of and you have CAP_SYS_ADMIN
privileges over the pid namespace. This fulfils the requirements of
container runtimes, but I suspect that this may be too strict for some
usecases.
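
As a rough userspace model of that rule (a sketch only: struct
toy_pidns and both toy_* functions are made up for illustration, and
the capability check is reduced to a boolean), the decision looks
roughly like this. Note that the ancestor walk also accepts the
caller's own pid namespace, not only strict descendants.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for struct pid_namespace: only what the walk needs. */
struct toy_pidns {
	unsigned int level;		/* 0 for the initial pidns */
	struct toy_pidns *parent;	/* NULL for the initial pidns */
};

/* True if ancestor is child itself or one of child's ancestors. */
static bool toy_pidns_is_ancestor(const struct toy_pidns *child,
				  const struct toy_pidns *ancestor)
{
	const struct toy_pidns *ns;

	assert(child != NULL && ancestor != NULL);
	if (child->level < ancestor->level)
		return false;
	/* Walk up until we reach the ancestor's depth. */
	for (ns = child; ns->level > ancestor->level; ns = ns->parent)
		;
	return ns == ancestor;
}

/* Model of the pidns= check: capability first, then ancestry. */
static int toy_may_set_pidns(const struct toy_pidns *target,
			     const struct toy_pidns *active,
			     bool has_cap_sys_admin)
{
	if (!has_cap_sys_admin)
		return -EPERM;
	if (!toy_pidns_is_ancestor(target, active))
		return -EINVAL;
	return 0;
}
```

With a tree init -> a -> c and a sibling init -> b, a caller in init
may select c, while a caller in b may not.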

The pidns argument is not displayed in mountinfo -- it's not clear to me
what value it would make sense to show (maybe we could just use ns_dname
to provide an identifier for the namespace, but this number would be
fairly useless to userspace). I'm open to suggestions. Note that
PROCFS_GET_PID_NAMESPACE (see below) does at least let userspace get
information about this outside of mountinfo.

/* ioctl(PROCFS_GET_PID_NAMESPACE) */

In addition, being able to figure out what pid namespace is being used
by a procfs mount is quite useful when you have an administrative
process (such as a container runtime) which wants to figure out the
correct way of mapping PIDs between its own namespace and the namespace
for procfs (using NS_GET_{PID,TGID}_{IN,FROM}_PIDNS). There are
alternative ways to do this, but they all rely on ancillary information
that third-party libraries and tools do not necessarily have access to.

To make this easier, add a new ioctl (PROCFS_GET_PID_NAMESPACE) which
can be used to get a reference to the pidns that a procfs is using.
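
From userspace, calling it could look like the sketch below. The ioctl
number is copied from patch 3 because released headers do not carry it
yet; procfs_get_pidns() is a hypothetical wrapper name, and the fd
passed in is assumed to be a regular (non-O_PATH) fd for the root
directory of a procfs mount.

```c
#include <errno.h>
#include <sys/ioctl.h>

/* Values copied from patch 3 of this series. */
#ifndef PROCFS_IOCTL_MAGIC
#define PROCFS_IOCTL_MAGIC 'f'
#endif
#ifndef PROCFS_GET_PID_NAMESPACE
#define PROCFS_GET_PID_NAMESPACE _IO(PROCFS_IOCTL_MAGIC, 1)
#endif

/*
 * Hypothetical wrapper: returns a new pidns fd on success, or a
 * negative errno value on failure.
 */
static int procfs_get_pidns(int proc_root_fd)
{
	int nsfd = ioctl(proc_root_fd, PROCFS_GET_PID_NAMESPACE);

	return nsfd < 0 ? -errno : nsfd;
}
```

On a kernel without this series the call is expected to fail, so
callers should treat any negative return as "not available".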

It's not quite clear what the correct security model for this API is,
but the current approach I've taken is to:

 * Make the ioctl only valid on the root (meaning that a process without
   access to the procfs root -- such as only having an fd to a procfs
   file or some open_tree(2)-like subset -- cannot use this API).

 * Require that the requesting process either has access to
   /proc/1/ns/pid anyway (i.e. has ptrace-read access to the pidns
   pid1), has CAP_SYS_ADMIN access to the pidns (i.e. has administrative
   access to it and can join it if they had a handle), or is in a pidns
   that is a direct ancestor of the target pidns (i.e. all of the pids
   are already visible in the procfs for the current process's pidns).

The security model for this is a little loose, as it seems to me that
all of the cases mentioned are valid cases to allow access, but I'm open
to suggestions for whether we need to make this stricter or looser.
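
The second condition above can be summarised as a small decision
function. This is only a userspace model: struct toy_access and
toy_can_get_pidns() are invented for illustration, and the "empty
namespace" case follows the code comment in patch 3 rather than the
list above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Inputs to the decision, each reduced to a boolean for the model. */
struct toy_access {
	bool caller_in_ancestor_pidns;	  /* pids already visible to us */
	bool caller_has_cap_sys_admin;	  /* over the pidns's userns */
	bool pidns_has_pid1;		  /* false for an empty pidns */
	bool caller_can_ptrace_read_pid1; /* could open /proc/1/ns/pid */
};

/* True if the caller should be given a handle to the procfs pidns. */
static bool toy_can_get_pidns(const struct toy_access *a)
{
	assert(a != NULL);

	if (a->caller_in_ancestor_pidns || a->caller_has_cap_sys_admin)
		return true;
	/* Empty namespace: there is no pid1 to protect, so allow. */
	if (!a->pidns_has_pid1)
		return true;
	return a->caller_can_ptrace_read_pid1;
}
```

An unprivileged caller in an unrelated pidns is refused only when the
target namespace has a pid1 that the caller cannot ptrace-read.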

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
---
Changes in v2:
- Guard the new pidns code with #ifdef CONFIG_PID_NS.
- Improve cover letter wording to make it clear we're talking about two
  separate features with different permission models. [Andy Lutomirski]
- Fix build warnings in pidns_is_ancestor() patch. [kernel test robot]
- v1: <https://lore.kernel.org/r/20250721-procfs-pidns-api-v1-0-5cd9007e512d@cyphar.com>

---
Aleksa Sarai (4):
      pidns: move is-ancestor logic to helper
      procfs: add "pidns" mount option
      procfs: add PROCFS_GET_PID_NAMESPACE ioctl
      selftests/proc: add tests for new pidns APIs

 Documentation/filesystems/proc.rst        |  10 ++
 fs/proc/root.c                            | 144 ++++++++++++++-
 include/linux/pid_namespace.h             |   9 +
 include/uapi/linux/fs.h                   |   3 +
 kernel/pid_namespace.c                    |  23 ++-
 tools/testing/selftests/proc/.gitignore   |   1 +
 tools/testing/selftests/proc/Makefile     |   1 +
 tools/testing/selftests/proc/proc-pidns.c | 286 ++++++++++++++++++++++++++++++
 8 files changed, 461 insertions(+), 16 deletions(-)
---
base-commit: 4c838c7672c39ec6ec48456c6ce22d14a68f4cda
change-id: 20250717-procfs-pidns-api-8ed1583431f0

Best regards,
-- 
Aleksa Sarai <cyphar@cyphar.com>

* [PATCH RFC v2 1/4] pidns: move is-ancestor logic to helper
  2025-07-22 23:18 [PATCH RFC v2 0/4] procfs: make reference pidns more user-visible Aleksa Sarai
@ 2025-07-22 23:18 ` Aleksa Sarai
  2025-07-24  7:06   ` Christian Brauner
  2025-07-22 23:18 ` [PATCH RFC v2 2/4] procfs: add "pidns" mount option Aleksa Sarai
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 13+ messages in thread
From: Aleksa Sarai @ 2025-07-22 23:18 UTC (permalink / raw)
  To: Alexander Viro, Christian Brauner, Jan Kara, Jonathan Corbet,
	Shuah Khan
  Cc: linux-kernel, linux-fsdevel, linux-api, linux-doc,
	linux-kselftest, Aleksa Sarai

This check will be needed in later patches, and there's no point
open-coding it each time.

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
---
 include/linux/pid_namespace.h |  9 +++++++++
 kernel/pid_namespace.c        | 23 +++++++++++++++--------
 2 files changed, 24 insertions(+), 8 deletions(-)

diff --git a/include/linux/pid_namespace.h b/include/linux/pid_namespace.h
index 7c67a5811199..17fdc059f8da 100644
--- a/include/linux/pid_namespace.h
+++ b/include/linux/pid_namespace.h
@@ -84,6 +84,9 @@ extern void zap_pid_ns_processes(struct pid_namespace *pid_ns);
 extern int reboot_pid_ns(struct pid_namespace *pid_ns, int cmd);
 extern void put_pid_ns(struct pid_namespace *ns);
 
+extern bool pidns_is_ancestor(struct pid_namespace *child,
+			      struct pid_namespace *ancestor);
+
 #else /* !CONFIG_PID_NS */
 #include <linux/err.h>
 
@@ -118,6 +121,12 @@ static inline int reboot_pid_ns(struct pid_namespace *pid_ns, int cmd)
 {
 	return 0;
 }
+
+static inline bool pidns_is_ancestor(struct pid_namespace *child,
+				     struct pid_namespace *ancestor)
+{
+	return false;
+}
 #endif /* CONFIG_PID_NS */
 
 extern struct pid_namespace *task_active_pid_ns(struct task_struct *tsk);
diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
index 7098ed44e717..c2783c5fa90b 100644
--- a/kernel/pid_namespace.c
+++ b/kernel/pid_namespace.c
@@ -390,11 +390,24 @@ static void pidns_put(struct ns_common *ns)
 	put_pid_ns(to_pid_ns(ns));
 }
 
+bool pidns_is_ancestor(struct pid_namespace *child,
+		       struct pid_namespace *ancestor)
+{
+	struct pid_namespace *ns;
+
+	if (child->level < ancestor->level)
+		return false;
+	for (ns = child; ns->level > ancestor->level; ns = ns->parent)
+		;
+	return ns == ancestor;
+}
+EXPORT_SYMBOL_GPL(pidns_is_ancestor);
+
 static int pidns_install(struct nsset *nsset, struct ns_common *ns)
 {
 	struct nsproxy *nsproxy = nsset->nsproxy;
 	struct pid_namespace *active = task_active_pid_ns(current);
-	struct pid_namespace *ancestor, *new = to_pid_ns(ns);
+	struct pid_namespace *new = to_pid_ns(ns);
 
 	if (!ns_capable(new->user_ns, CAP_SYS_ADMIN) ||
 	    !ns_capable(nsset->cred->user_ns, CAP_SYS_ADMIN))
@@ -408,13 +421,7 @@ static int pidns_install(struct nsset *nsset, struct ns_common *ns)
 	 * this maintains the property that processes and their
 	 * children can not escape their current pid namespace.
 	 */
-	if (new->level < active->level)
-		return -EINVAL;
-
-	ancestor = new;
-	while (ancestor->level > active->level)
-		ancestor = ancestor->parent;
-	if (ancestor != active)
+	if (!pidns_is_ancestor(new, active))
 		return -EINVAL;
 
 	put_pid_ns(nsproxy->pid_ns_for_children);

-- 
2.50.0


* [PATCH RFC v2 2/4] procfs: add "pidns" mount option
  2025-07-22 23:18 [PATCH RFC v2 0/4] procfs: make reference pidns more user-visible Aleksa Sarai
  2025-07-22 23:18 ` [PATCH RFC v2 1/4] pidns: move is-ancestor logic to helper Aleksa Sarai
@ 2025-07-22 23:18 ` Aleksa Sarai
  2025-07-24  7:25   ` Christian Brauner
  2025-07-22 23:18 ` [PATCH RFC v2 3/4] procfs: add PROCFS_GET_PID_NAMESPACE ioctl Aleksa Sarai
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 13+ messages in thread
From: Aleksa Sarai @ 2025-07-22 23:18 UTC (permalink / raw)
  To: Alexander Viro, Christian Brauner, Jan Kara, Jonathan Corbet,
	Shuah Khan
  Cc: linux-kernel, linux-fsdevel, linux-api, linux-doc,
	linux-kselftest, Aleksa Sarai

Since the introduction of pid namespaces, their interaction with procfs
has been entirely implicit in ways that require a lot of dancing around
by programs that need to construct sandboxes with different PID
namespaces.

Being able to explicitly specify the pid namespace to use when
constructing a procfs super block will allow programs to no longer need
to fork off a process which then does unshare(2) / setns(2) and
forks again in order to construct a procfs in a pidns.

So, provide a "pidns" mount option which allows such users to just
explicitly state which pid namespace they want that procfs instance to
use. This interface can be used with fsconfig(2) either with a file
descriptor or a path:

  fsconfig(procfd, FSCONFIG_SET_FD, "pidns", NULL, nsfd);
  fsconfig(procfd, FSCONFIG_SET_STRING, "pidns", "/proc/self/ns/pid", 0);

or with classic mount(2) / mount(8):

  // mount -t proc -o pidns=/proc/self/ns/pid proc /tmp/proc
  mount("proc", "/tmp/proc", "proc", MS_..., "pidns=/proc/self/ns/pid");

As this new API is effectively shorthand for setns(2) followed by
mount(2), the permission model for this mirrors pidns_install() to avoid
opening up new attack surfaces by loosening the existing permission
model.

Note that the mount infrastructure also allows userspace to reconfigure
the pidns of an existing procfs mount, which may or may not be useful to
some users.

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
---
 Documentation/filesystems/proc.rst |  6 +++
 fs/proc/root.c                     | 90 +++++++++++++++++++++++++++++++++++---
 2 files changed, 90 insertions(+), 6 deletions(-)

diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
index 5236cb52e357..c520b9f8a3fd 100644
--- a/Documentation/filesystems/proc.rst
+++ b/Documentation/filesystems/proc.rst
@@ -2360,6 +2360,7 @@ The following mount options are supported:
 	hidepid=	Set /proc/<pid>/ access mode.
 	gid=		Set the group authorized to learn processes information.
 	subset=		Show only the specified subset of procfs.
+	pidns=		Specify the pid namespace used by this procfs.
 	=========	========================================================
 
 hidepid=off or hidepid=0 means classic mode - everybody may access all
@@ -2392,6 +2393,11 @@ information about processes information, just add identd to this group.
 subset=pid hides all top level files and directories in the procfs that
 are not related to tasks.
 
+pidns= specifies a pid namespace (either as a string path to something like
+`/proc/$pid/ns/pid`, or a file descriptor when using `FSCONFIG_SET_FD`) that
+will be used by the procfs instance when translating pids. By default, procfs
+will use the calling process's active pid namespace.
+
 Chapter 5: Filesystem behavior
 ==============================
 
diff --git a/fs/proc/root.c b/fs/proc/root.c
index ed86ac710384..057c8a125c6e 100644
--- a/fs/proc/root.c
+++ b/fs/proc/root.c
@@ -38,12 +38,18 @@ enum proc_param {
 	Opt_gid,
 	Opt_hidepid,
 	Opt_subset,
+#ifdef CONFIG_PID_NS
+	Opt_pidns,
+#endif
 };
 
 static const struct fs_parameter_spec proc_fs_parameters[] = {
-	fsparam_u32("gid",	Opt_gid),
+	fsparam_u32("gid",		Opt_gid),
 	fsparam_string("hidepid",	Opt_hidepid),
 	fsparam_string("subset",	Opt_subset),
+#ifdef CONFIG_PID_NS
+	fsparam_file_or_string("pidns",	Opt_pidns),
+#endif
 	{}
 };
 
@@ -109,11 +115,67 @@ static int proc_parse_subset_param(struct fs_context *fc, char *value)
 	return 0;
 }
 
+#ifdef CONFIG_PID_NS
+static int proc_parse_pidns_param(struct fs_context *fc,
+				  struct fs_parameter *param,
+				  struct fs_parse_result *result)
+{
+	struct proc_fs_context *ctx = fc->fs_private;
+	struct pid_namespace *target, *active = task_active_pid_ns(current);
+	struct ns_common *ns;
+	struct file *ns_filp __free(fput) = NULL;
+
+	switch (param->type) {
+	case fs_value_is_file:
+		/* came through fsconfig, steal the file reference */
+		ns_filp = param->file;
+		param->file = NULL;
+		break;
+	case fs_value_is_string:
+		ns_filp = filp_open(param->string, O_RDONLY, 0);
+		break;
+	default:
+		WARN_ON_ONCE(true);
+		break;
+	}
+	if (!ns_filp)
+		ns_filp = ERR_PTR(-EBADF);
+	if (IS_ERR(ns_filp)) {
+		errorfc(fc, "could not get file from pidns argument");
+		return PTR_ERR(ns_filp);
+	}
+
+	if (!proc_ns_file(ns_filp))
+		return invalfc(fc, "pidns argument is not an nsfs file");
+	ns = get_proc_ns(file_inode(ns_filp));
+	if (ns->ops->type != CLONE_NEWPID)
+		return invalfc(fc, "pidns argument is not a pidns file");
+	target = container_of(ns, struct pid_namespace, ns);
+
+	/*
+	 * pidns= is shorthand for joining the pidns to get a fsopen fd, so the
+	 * permission model should be the same as pidns_install().
+	 */
+	if (!ns_capable(target->user_ns, CAP_SYS_ADMIN)) {
+		errorfc(fc, "insufficient permissions to set pidns");
+		return -EPERM;
+	}
+	if (!pidns_is_ancestor(target, active))
+		return invalfc(fc, "cannot set pidns to non-descendant pidns");
+
+	put_pid_ns(ctx->pid_ns);
+	ctx->pid_ns = get_pid_ns(target);
+	put_user_ns(fc->user_ns);
+	fc->user_ns = get_user_ns(ctx->pid_ns->user_ns);
+	return 0;
+}
+#endif /* CONFIG_PID_NS */
+
 static int proc_parse_param(struct fs_context *fc, struct fs_parameter *param)
 {
 	struct proc_fs_context *ctx = fc->fs_private;
 	struct fs_parse_result result;
-	int opt;
+	int opt, err;
 
 	opt = fs_parse(fc, proc_fs_parameters, param, &result);
 	if (opt < 0)
@@ -125,14 +187,24 @@ static int proc_parse_param(struct fs_context *fc, struct fs_parameter *param)
 		break;
 
 	case Opt_hidepid:
-		if (proc_parse_hidepid_param(fc, param))
-			return -EINVAL;
+		err = proc_parse_hidepid_param(fc, param);
+		if (err)
+			return err;
 		break;
 
 	case Opt_subset:
-		if (proc_parse_subset_param(fc, param->string) < 0)
-			return -EINVAL;
+		err = proc_parse_subset_param(fc, param->string);
+		if (err)
+			return err;
+		break;
+
+#ifdef CONFIG_PID_NS
+	case Opt_pidns:
+		err = proc_parse_pidns_param(fc, param, &result);
+		if (err)
+			return err;
 		break;
+#endif
 
 	default:
 		return -EINVAL;
@@ -154,6 +226,12 @@ static void proc_apply_options(struct proc_fs_info *fs_info,
 		fs_info->hide_pid = ctx->hidepid;
 	if (ctx->mask & (1 << Opt_subset))
 		fs_info->pidonly = ctx->pidonly;
+#ifdef CONFIG_PID_NS
+	if (ctx->mask & (1 << Opt_pidns)) {
+		put_pid_ns(fs_info->pid_ns);
+		fs_info->pid_ns = get_pid_ns(ctx->pid_ns);
+	}
+#endif
 }
 
 static int proc_fill_super(struct super_block *s, struct fs_context *fc)

-- 
2.50.0


* [PATCH RFC v2 3/4] procfs: add PROCFS_GET_PID_NAMESPACE ioctl
  2025-07-22 23:18 [PATCH RFC v2 0/4] procfs: make reference pidns more user-visible Aleksa Sarai
  2025-07-22 23:18 ` [PATCH RFC v2 1/4] pidns: move is-ancestor logic to helper Aleksa Sarai
  2025-07-22 23:18 ` [PATCH RFC v2 2/4] procfs: add "pidns" mount option Aleksa Sarai
@ 2025-07-22 23:18 ` Aleksa Sarai
  2025-07-24  7:34   ` Christian Brauner
  2025-07-22 23:18 ` [PATCH RFC v2 4/4] selftests/proc: add tests for new pidns APIs Aleksa Sarai
  2025-07-24  7:36 ` [PATCH RFC v2 0/4] procfs: make reference pidns more user-visible Christian Brauner
  4 siblings, 1 reply; 13+ messages in thread
From: Aleksa Sarai @ 2025-07-22 23:18 UTC (permalink / raw)
  To: Alexander Viro, Christian Brauner, Jan Kara, Jonathan Corbet,
	Shuah Khan
  Cc: linux-kernel, linux-fsdevel, linux-api, linux-doc,
	linux-kselftest, Aleksa Sarai

/proc has historically had very opaque semantics about PID namespaces,
which is a little unfortunate for container runtimes and other programs
that deal with switching namespaces very often. One common issue is that
of converting between PIDs in the process's namespace and PIDs in the
namespace of /proc.

In principle, it is possible to do this today by opening a pidfd with
pidfd_open(2) and then looking at /proc/self/fdinfo/$n (which will
contain a PID value translated to the pid namespace associated with that
procfs superblock). However, allocating a new file for each PID to be
converted is less than ideal for programs that may need to scan procfs,
and it is generally useful for userspace to be able to finally get this
information from procfs.

So, add a new API for this in the form of an ioctl(2) you can call on
the root directory of procfs. The returned file descriptor will have
O_CLOEXEC set. This acts as a sister feature to the new "pidns" mount
option, finally allowing userspace full control of the pid namespaces
associated with procfs instances.

The permission model for this is a bit looser than that of the "pidns"
mount option, but this is mainly because /proc/1/ns/pid provides the
same information, so as long as you have access to that magic-link (or
something equivalently reasonable such as privileges with CAP_SYS_ADMIN
or being in an ancestor pid namespace) it makes sense to allow userspace
to grab a handle. setns(2) will still have its own permission checks,
so being able to open a pidns handle doesn't really provide too many
other capabilities.

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
---
 Documentation/filesystems/proc.rst |  4 +++
 fs/proc/root.c                     | 54 ++++++++++++++++++++++++++++++++++++--
 include/uapi/linux/fs.h            |  3 +++
 3 files changed, 59 insertions(+), 2 deletions(-)

diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
index c520b9f8a3fd..506383273c9d 100644
--- a/Documentation/filesystems/proc.rst
+++ b/Documentation/filesystems/proc.rst
@@ -2398,6 +2398,10 @@ pidns= specifies a pid namespace (either as a string path to something like
 will be used by the procfs instance when translating pids. By default, procfs
 will use the calling process's active pid namespace.
 
+Processes can check which pid namespace is used by a procfs instance by using
+the `PROCFS_GET_PID_NAMESPACE` ioctl() on the root directory of the procfs
+instance.
+
 Chapter 5: Filesystem behavior
 ==============================
 
diff --git a/fs/proc/root.c b/fs/proc/root.c
index 057c8a125c6e..548a57ec2152 100644
--- a/fs/proc/root.c
+++ b/fs/proc/root.c
@@ -23,8 +23,10 @@
 #include <linux/cred.h>
 #include <linux/magic.h>
 #include <linux/slab.h>
+#include <linux/ptrace.h>
 
 #include "internal.h"
+#include "../internal.h"
 
 struct proc_fs_context {
 	struct pid_namespace	*pid_ns;
@@ -418,15 +420,63 @@ static int proc_root_readdir(struct file *file, struct dir_context *ctx)
 	return proc_pid_readdir(file, ctx);
 }
 
+static long int proc_root_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
+{
+	switch (cmd) {
+#ifdef CONFIG_PID_NS
+	case PROCFS_GET_PID_NAMESPACE: {
+		struct pid_namespace *active = task_active_pid_ns(current);
+		struct pid_namespace *ns = proc_pid_ns(file_inode(filp)->i_sb);
+		bool can_access_pidns = false;
+
+		/*
+		 * If we are in an ancestor of the pidns, or have join
+		 * privileges (CAP_SYS_ADMIN), then it makes sense that we
+		 * would be able to grab a handle to the pidns.
+		 *
+		 * Otherwise, if there is a root process, then being able to
+		 * access /proc/$pid/ns/pid is equivalent to this ioctl and so
+		 * we should probably match the permission model. For empty
+		 * namespaces it seems unlikely for there to be a downside to
+		 * allowing unprivileged users to open a handle to it (setns
+		 * will fail for unprivileged users anyway).
+		 */
+		can_access_pidns = pidns_is_ancestor(ns, active) ||
+				   ns_capable(ns->user_ns, CAP_SYS_ADMIN);
+		if (!can_access_pidns) {
+			bool cannot_ptrace_pid1 = false;
+
+			read_lock(&tasklist_lock);
+			if (ns->child_reaper)
+				cannot_ptrace_pid1 = !ptrace_may_access(ns->child_reaper,
+									PTRACE_MODE_READ_FSCREDS);
+			read_unlock(&tasklist_lock);
+			can_access_pidns = !cannot_ptrace_pid1;
+		}
+		if (!can_access_pidns)
+			return -EPERM;
+
+		/* open_namespace() unconditionally consumes the reference. */
+		get_pid_ns(ns);
+		return open_namespace(to_ns_common(ns));
+	}
+#endif /* CONFIG_PID_NS */
+	default:
+		return -ENOIOCTLCMD;
+	}
+}
+
 /*
  * The root /proc directory is special, as it has the
  * <pid> directories. Thus we don't use the generic
  * directory handling functions for that..
  */
 static const struct file_operations proc_root_operations = {
-	.read		 = generic_read_dir,
-	.iterate_shared	 = proc_root_readdir,
+	.read		= generic_read_dir,
+	.iterate_shared	= proc_root_readdir,
 	.llseek		= generic_file_llseek,
+	.unlocked_ioctl = proc_root_ioctl,
+	.compat_ioctl   = compat_ptr_ioctl,
 };
 
 /*
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index 0bd678a4a10e..aa642cb48feb 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -437,6 +437,9 @@ typedef int __bitwise __kernel_rwf_t;
 
 #define PROCFS_IOCTL_MAGIC 'f'
 
+/* procfs root ioctls */
+#define PROCFS_GET_PID_NAMESPACE	_IO(PROCFS_IOCTL_MAGIC, 1)
+
 /* Pagemap ioctl */
 #define PAGEMAP_SCAN	_IOWR(PROCFS_IOCTL_MAGIC, 16, struct pm_scan_arg)
 

-- 
2.50.0


* [PATCH RFC v2 4/4] selftests/proc: add tests for new pidns APIs
  2025-07-22 23:18 [PATCH RFC v2 0/4] procfs: make reference pidns more user-visible Aleksa Sarai
                   ` (2 preceding siblings ...)
  2025-07-22 23:18 ` [PATCH RFC v2 3/4] procfs: add PROCFS_GET_PID_NAMESPACE ioctl Aleksa Sarai
@ 2025-07-22 23:18 ` Aleksa Sarai
  2025-07-24  7:36 ` [PATCH RFC v2 0/4] procfs: make reference pidns more user-visible Christian Brauner
  4 siblings, 0 replies; 13+ messages in thread
From: Aleksa Sarai @ 2025-07-22 23:18 UTC (permalink / raw)
  To: Alexander Viro, Christian Brauner, Jan Kara, Jonathan Corbet,
	Shuah Khan
  Cc: linux-kernel, linux-fsdevel, linux-api, linux-doc,
	linux-kselftest, Aleksa Sarai

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
---
 tools/testing/selftests/proc/.gitignore   |   1 +
 tools/testing/selftests/proc/Makefile     |   1 +
 tools/testing/selftests/proc/proc-pidns.c | 286 ++++++++++++++++++++++++++++++
 3 files changed, 288 insertions(+)
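
The tests decide whether two namespace fds refer to the same pid
namespace by comparing st_dev and st_ino from fstat(2). A standalone
sketch of that comparison is shown here; same_inode() and
fd_matches_ns_path() are hypothetical names, and unlike the selftest
helper they report fstat(2) or open(2) failure as "not the same"
instead of asserting.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdbool.h>
#include <sys/stat.h>
#include <unistd.h>

/* True if both fds refer to the same inode; false on any error. */
static bool same_inode(int fd1, int fd2)
{
	struct stat st1, st2;

	if (fstat(fd1, &st1) < 0 || fstat(fd2, &st2) < 0)
		return false;
	return st1.st_dev == st2.st_dev && st1.st_ino == st2.st_ino;
}

/*
 * True if nsfd refers to the same inode as path, for example
 * /proc/self/ns/pid. The temporary fd is closed before returning.
 */
static bool fd_matches_ns_path(int nsfd, const char *path)
{
	int fd = open(path, O_RDONLY | O_CLOEXEC);
	bool same;

	if (fd < 0)
		return false;
	same = same_inode(nsfd, fd);
	close(fd);
	return same;
}
```

For instance, fd_matches_ns_path(nsfd, "/proc/self/ns/pid") would tell
a caller whether a procfs instance uses the caller's own pid namespace,
given an nsfd obtained from PROCFS_GET_PID_NAMESPACE.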

diff --git a/tools/testing/selftests/proc/.gitignore b/tools/testing/selftests/proc/.gitignore
index 973968f45bba..2dced03e9e0e 100644
--- a/tools/testing/selftests/proc/.gitignore
+++ b/tools/testing/selftests/proc/.gitignore
@@ -17,6 +17,7 @@
 /proc-tid0
 /proc-uptime-001
 /proc-uptime-002
+/proc-pidns
 /read
 /self
 /setns-dcache
diff --git a/tools/testing/selftests/proc/Makefile b/tools/testing/selftests/proc/Makefile
index b12921b9794b..c6f7046b9860 100644
--- a/tools/testing/selftests/proc/Makefile
+++ b/tools/testing/selftests/proc/Makefile
@@ -27,5 +27,6 @@ TEST_GEN_PROGS += setns-sysvipc
 TEST_GEN_PROGS += thread-self
 TEST_GEN_PROGS += proc-multiple-procfs
 TEST_GEN_PROGS += proc-fsconfig-hidepid
+TEST_GEN_PROGS += proc-pidns
 
 include ../lib.mk
diff --git a/tools/testing/selftests/proc/proc-pidns.c b/tools/testing/selftests/proc/proc-pidns.c
new file mode 100644
index 000000000000..e7e34c78d383
--- /dev/null
+++ b/tools/testing/selftests/proc/proc-pidns.c
@@ -0,0 +1,286 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Author: Aleksa Sarai <cyphar@cyphar.com>
+ * Copyright (C) 2025 SUSE LLC.
+ */
+
+#include <assert.h>
+#include <errno.h>
+#include <sched.h>
+#include <stdbool.h>
+#include <stdlib.h>
+#include <string.h>
+#include <unistd.h>
+#include <stdio.h>
+#include <sys/mount.h>
+#include <sys/stat.h>
+#include <sys/ioctl.h>
+#include <sys/prctl.h>
+
+#include "../kselftest_harness.h"
+
+#define bail(fmt, ...)							\
+	do {								\
+		fprintf(stderr, fmt ": %m", __VA_ARGS__);		\
+		exit(1);						\
+	} while (0)
+
+#define ASSERT_SUCCESS	ASSERT_FALSE
+#define ASSERT_FAIL	ASSERT_TRUE
+
+int touch(char *path)
+{
+	int fd = open(path, O_WRONLY|O_CREAT|O_CLOEXEC, 0644);
+	if (fd < 0 || close(fd) < 0)
+		return -errno;
+	return 0;
+}
+
+FIXTURE(ns)
+{
+	int host_mntns, host_pidns;
+	int dummy_pidns;
+};
+
+FIXTURE_SETUP(ns)
+{
+	/* Stash the old mntns. */
+	self->host_mntns = open("/proc/self/ns/mnt", O_RDONLY|O_CLOEXEC);
+	ASSERT_GE(self->host_mntns, 0);
+
+	/* Create a new mount namespace and make it private. */
+	ASSERT_SUCCESS(unshare(CLONE_NEWNS));
+	ASSERT_SUCCESS(mount(NULL, "/", NULL, MS_PRIVATE|MS_REC, NULL));
+
+	/*
+	 * Create a proper tmpfs that we can use and will disappear once we
+	 * leave this mntns.
+	 */
+	ASSERT_SUCCESS(mount("tmpfs", "/tmp", "tmpfs", 0, NULL));
+
+	/*
+	 * Create a pidns we can use for later tests. We need to fork off a
+	 * child so that we get a usable nsfd that we can bind-mount and open.
+	 */
+	ASSERT_SUCCESS(touch("/tmp/dummy-pidns"));
+
+	self->host_pidns = open("/proc/self/ns/pid", O_RDONLY|O_CLOEXEC);
+	ASSERT_GE(self->host_pidns, 0);
+	ASSERT_SUCCESS(unshare(CLONE_NEWPID));
+
+	pid_t pid = fork();
+	ASSERT_GE(pid, 0);
+	if (!pid) {
+		prctl(PR_SET_PDEATHSIG, SIGKILL);
+		ASSERT_SUCCESS(mount("/proc/self/ns/pid", "/tmp/dummy-pidns", NULL, MS_BIND, 0));
+		exit(0);
+	}
+
+	int wstatus;
+	ASSERT_EQ(waitpid(pid, &wstatus, 0), pid);
+	ASSERT_TRUE(WIFEXITED(wstatus));
+	ASSERT_EQ(WEXITSTATUS(wstatus), 0);
+
+	ASSERT_SUCCESS(setns(self->host_pidns, CLONE_NEWPID));
+
+	self->dummy_pidns = open("/tmp/dummy-pidns", O_RDONLY|O_CLOEXEC);
+	ASSERT_GE(self->dummy_pidns, 0);
+}
+
+FIXTURE_TEARDOWN(ns)
+{
+	ASSERT_SUCCESS(setns(self->host_mntns, CLONE_NEWNS));
+	ASSERT_SUCCESS(close(self->host_mntns));
+
+	ASSERT_SUCCESS(close(self->host_pidns));
+	ASSERT_SUCCESS(close(self->dummy_pidns));
+}
+
+TEST_F(ns, pidns_mount_string_path)
+{
+	ASSERT_SUCCESS(mkdir("/tmp/proc-host", 0755));
+	ASSERT_SUCCESS(mount("proc", "/tmp/proc-host", "proc", 0, "pidns=/proc/self/ns/pid"));
+	ASSERT_SUCCESS(access("/tmp/proc-host/self/", X_OK));
+
+	ASSERT_SUCCESS(mkdir("/tmp/proc-dummy", 0755));
+	ASSERT_SUCCESS(mount("proc", "/tmp/proc-dummy", "proc", 0, "pidns=/tmp/dummy-pidns"));
+	ASSERT_FAIL(access("/tmp/proc-dummy/1/", X_OK));
+	ASSERT_FAIL(access("/tmp/proc-dummy/self/", X_OK));
+}
+
+TEST_F(ns, pidns_fsconfig_string_path)
+{
+	int fsfd = fsopen("proc", FSOPEN_CLOEXEC);
+	ASSERT_GE(fsfd, 0);
+
+	ASSERT_SUCCESS(fsconfig(fsfd, FSCONFIG_SET_STRING, "pidns", "/tmp/dummy-pidns", 0));
+	ASSERT_SUCCESS(fsconfig(fsfd, FSCONFIG_CMD_CREATE, NULL, NULL, 0));
+
+	int mountfd = fsmount(fsfd, FSMOUNT_CLOEXEC, 0);
+	ASSERT_GE(mountfd, 0);
+
+	ASSERT_FAIL(faccessat(mountfd, "1/", X_OK, 0));
+	ASSERT_FAIL(faccessat(mountfd, "self/", X_OK, 0));
+
+	ASSERT_SUCCESS(close(fsfd));
+	ASSERT_SUCCESS(close(mountfd));
+}
+
+TEST_F(ns, pidns_fsconfig_fd)
+{
+	int fsfd = fsopen("proc", FSOPEN_CLOEXEC);
+	ASSERT_GE(fsfd, 0);
+
+	ASSERT_SUCCESS(fsconfig(fsfd, FSCONFIG_SET_FD, "pidns", NULL, self->dummy_pidns));
+	ASSERT_SUCCESS(fsconfig(fsfd, FSCONFIG_CMD_CREATE, NULL, NULL, 0));
+
+	int mountfd = fsmount(fsfd, FSMOUNT_CLOEXEC, 0);
+	ASSERT_GE(mountfd, 0);
+
+	ASSERT_FAIL(faccessat(mountfd, "1/", X_OK, 0));
+	ASSERT_FAIL(faccessat(mountfd, "self/", X_OK, 0));
+
+	ASSERT_SUCCESS(close(fsfd));
+	ASSERT_SUCCESS(close(mountfd));
+}
+
+TEST_F(ns, pidns_reconfigure_remount)
+{
+	ASSERT_SUCCESS(mkdir("/tmp/proc", 0755));
+	ASSERT_SUCCESS(mount("proc", "/tmp/proc", "proc", 0, ""));
+	ASSERT_SUCCESS(access("/tmp/proc/self/", X_OK));
+
+	ASSERT_SUCCESS(mount(NULL, "/tmp/proc", NULL, MS_REMOUNT, "pidns=/tmp/dummy-pidns"));
+	ASSERT_FAIL(access("/tmp/proc/self/", X_OK));
+}
+
+TEST_F(ns, pidns_reconfigure_fsconfig_string_path)
+{
+	int fsfd = fsopen("proc", FSOPEN_CLOEXEC);
+	ASSERT_GE(fsfd, 0);
+
+	ASSERT_SUCCESS(fsconfig(fsfd, FSCONFIG_CMD_CREATE, NULL, NULL, 0));
+
+	int mountfd = fsmount(fsfd, FSMOUNT_CLOEXEC, 0);
+	ASSERT_GE(mountfd, 0);
+
+	ASSERT_SUCCESS(faccessat(mountfd, "1/", X_OK, 0));
+	ASSERT_SUCCESS(faccessat(mountfd, "self/", X_OK, 0));
+
+	ASSERT_SUCCESS(fsconfig(fsfd, FSCONFIG_SET_STRING, "pidns", "/tmp/dummy-pidns", 0));
+	ASSERT_SUCCESS(fsconfig(fsfd, FSCONFIG_CMD_RECONFIGURE, NULL, NULL, 0));
+
+	ASSERT_FAIL(faccessat(mountfd, "1/", X_OK, 0));
+	ASSERT_FAIL(faccessat(mountfd, "self/", X_OK, 0));
+
+	ASSERT_SUCCESS(close(fsfd));
+	ASSERT_SUCCESS(close(mountfd));
+}
+
+TEST_F(ns, pidns_reconfigure_fsconfig_fd)
+{
+	int fsfd = fsopen("proc", FSOPEN_CLOEXEC);
+	ASSERT_GE(fsfd, 0);
+
+	ASSERT_SUCCESS(fsconfig(fsfd, FSCONFIG_CMD_CREATE, NULL, NULL, 0));
+
+	int mountfd = fsmount(fsfd, FSMOUNT_CLOEXEC, 0);
+	ASSERT_GE(mountfd, 0);
+
+	ASSERT_SUCCESS(faccessat(mountfd, "1/", X_OK, 0));
+	ASSERT_SUCCESS(faccessat(mountfd, "self/", X_OK, 0));
+
+	ASSERT_SUCCESS(fsconfig(fsfd, FSCONFIG_SET_FD, "pidns", NULL, self->dummy_pidns));
+	ASSERT_SUCCESS(fsconfig(fsfd, FSCONFIG_CMD_RECONFIGURE, NULL, NULL, 0));
+
+	ASSERT_FAIL(faccessat(mountfd, "1/", X_OK, 0));
+	ASSERT_FAIL(faccessat(mountfd, "self/", X_OK, 0));
+
+	ASSERT_SUCCESS(close(fsfd));
+	ASSERT_SUCCESS(close(mountfd));
+}
+
+int is_same_inode(int fd1, int fd2)
+{
+	struct stat stat1, stat2;
+
+	assert(fstat(fd1, &stat1) == 0);
+	assert(fstat(fd2, &stat2) == 0);
+
+	return stat1.st_ino == stat2.st_ino && stat1.st_dev == stat2.st_dev;
+}
+
+#define PROCFS_IOCTL_MAGIC 'f'
+#define PROCFS_GET_PID_NAMESPACE	_IO(PROCFS_IOCTL_MAGIC, 1)
+
+TEST_F(ns, get_pidns_ioctl)
+{
+	int fsfd = fsopen("proc", FSOPEN_CLOEXEC);
+	ASSERT_GE(fsfd, 0);
+
+	ASSERT_SUCCESS(fsconfig(fsfd, FSCONFIG_SET_FD, "pidns", NULL, self->dummy_pidns));
+	ASSERT_SUCCESS(fsconfig(fsfd, FSCONFIG_CMD_CREATE, NULL, NULL, 0));
+
+	int mountfd = fsmount(fsfd, FSMOUNT_CLOEXEC, 0);
+	ASSERT_GE(mountfd, 0);
+
+	/* fsmount returns an O_PATH, which ioctl(2) doesn't accept. */
+	int new_mountfd = openat(mountfd, ".", O_RDONLY|O_DIRECTORY|O_CLOEXEC);
+	ASSERT_GE(new_mountfd, 0);
+
+	ASSERT_SUCCESS(close(mountfd));
+	mountfd = -EBADF;
+
+	int procfs_pidns = ioctl(new_mountfd, PROCFS_GET_PID_NAMESPACE);
+	ASSERT_GE(procfs_pidns, 0);
+
+	ASSERT_NE(self->dummy_pidns, procfs_pidns);
+	ASSERT_FALSE(is_same_inode(self->host_pidns, procfs_pidns));
+	ASSERT_TRUE(is_same_inode(self->dummy_pidns, procfs_pidns));
+
+	ASSERT_SUCCESS(close(fsfd));
+	ASSERT_SUCCESS(close(new_mountfd));
+	ASSERT_SUCCESS(close(procfs_pidns));
+}
+
+TEST_F(ns, reconfigure_get_pidns_ioctl)
+{
+	int fsfd = fsopen("proc", FSOPEN_CLOEXEC);
+	ASSERT_GE(fsfd, 0);
+
+	ASSERT_SUCCESS(fsconfig(fsfd, FSCONFIG_CMD_CREATE, NULL, NULL, 0));
+
+	int mountfd = fsmount(fsfd, FSMOUNT_CLOEXEC, 0);
+	ASSERT_GE(mountfd, 0);
+
+	/* fsmount returns an O_PATH fd, which ioctl(2) doesn't accept. */
+	int new_mountfd = openat(mountfd, ".", O_RDONLY|O_DIRECTORY|O_CLOEXEC);
+	ASSERT_GE(new_mountfd, 0);
+
+	ASSERT_SUCCESS(close(mountfd));
+	mountfd = -EBADF;
+
+	int procfs_pidns1 = ioctl(new_mountfd, PROCFS_GET_PID_NAMESPACE);
+	ASSERT_GE(procfs_pidns1, 0);
+
+	ASSERT_NE(self->dummy_pidns, procfs_pidns1);
+	ASSERT_TRUE(is_same_inode(self->host_pidns, procfs_pidns1));
+	ASSERT_FALSE(is_same_inode(self->dummy_pidns, procfs_pidns1));
+
+	ASSERT_SUCCESS(fsconfig(fsfd, FSCONFIG_SET_STRING, "pidns", "/tmp/dummy-pidns", 0));
+	ASSERT_SUCCESS(fsconfig(fsfd, FSCONFIG_CMD_RECONFIGURE, NULL, NULL, 0));
+
+	int procfs_pidns2 = ioctl(new_mountfd, PROCFS_GET_PID_NAMESPACE);
+	ASSERT_GE(procfs_pidns2, 0);
+
+	ASSERT_NE(self->dummy_pidns, procfs_pidns2);
+	ASSERT_FALSE(is_same_inode(self->host_pidns, procfs_pidns2));
+	ASSERT_TRUE(is_same_inode(self->dummy_pidns, procfs_pidns2));
+
+	ASSERT_SUCCESS(close(fsfd));
+	ASSERT_SUCCESS(close(new_mountfd));
+	ASSERT_SUCCESS(close(procfs_pidns1));
+	ASSERT_SUCCESS(close(procfs_pidns2));
+}
+
+TEST_HARNESS_MAIN

-- 
2.50.0



* Re: [PATCH RFC v2 1/4] pidns: move is-ancestor logic to helper
  2025-07-22 23:18 ` [PATCH RFC v2 1/4] pidns: move is-ancestor logic to helper Aleksa Sarai
@ 2025-07-24  7:06   ` Christian Brauner
  0 siblings, 0 replies; 13+ messages in thread
From: Christian Brauner @ 2025-07-24  7:06 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: Alexander Viro, Jan Kara, Jonathan Corbet, Shuah Khan,
	linux-kernel, linux-fsdevel, linux-api, linux-doc,
	linux-kselftest

On Wed, Jul 23, 2025 at 09:18:51AM +1000, Aleksa Sarai wrote:
> This check will be needed in later patches, and there's no point
> open-coding it each time.
> 
> Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
> ---
>  include/linux/pid_namespace.h |  9 +++++++++
>  kernel/pid_namespace.c        | 23 +++++++++++++++--------
>  2 files changed, 24 insertions(+), 8 deletions(-)
> 
> diff --git a/include/linux/pid_namespace.h b/include/linux/pid_namespace.h
> index 7c67a5811199..17fdc059f8da 100644
> --- a/include/linux/pid_namespace.h
> +++ b/include/linux/pid_namespace.h
> @@ -84,6 +84,9 @@ extern void zap_pid_ns_processes(struct pid_namespace *pid_ns);
>  extern int reboot_pid_ns(struct pid_namespace *pid_ns, int cmd);
>  extern void put_pid_ns(struct pid_namespace *ns);
>  
> +extern bool pidns_is_ancestor(struct pid_namespace *child,
> +			      struct pid_namespace *ancestor);
> +
>  #else /* !CONFIG_PID_NS */
>  #include <linux/err.h>
>  
> @@ -118,6 +121,12 @@ static inline int reboot_pid_ns(struct pid_namespace *pid_ns, int cmd)
>  {
>  	return 0;
>  }
> +
> +static inline bool pidns_is_ancestor(struct pid_namespace *child,
> +				     struct pid_namespace *ancestor)
> +{
> +	return false;
> +}
>  #endif /* CONFIG_PID_NS */
>  
>  extern struct pid_namespace *task_active_pid_ns(struct task_struct *tsk);
> diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
> index 7098ed44e717..c2783c5fa90b 100644
> --- a/kernel/pid_namespace.c
> +++ b/kernel/pid_namespace.c
> @@ -390,11 +390,24 @@ static void pidns_put(struct ns_common *ns)
>  	put_pid_ns(to_pid_ns(ns));
>  }
>  
> +bool pidns_is_ancestor(struct pid_namespace *child,
> +		       struct pid_namespace *ancestor)
> +{
> +	struct pid_namespace *ns;
> +
> +	if (child->level < ancestor->level)
> +		return false;
> +	for (ns = child; ns->level > ancestor->level; ns = ns->parent)
> +		;
> +	return ns == ancestor;
> +}
> +EXPORT_SYMBOL_GPL(pidns_is_ancestor);

Why do you need to export this? Afaict, this is only used from procfs
and iirc procfs cannot be a module. This could also be a static inline
completely in the header? Otherwise this looks good.
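
To illustrate how small the helper is if it were moved into the header,
here is a userspace model of the same walk. struct model_pidns and
model_pidns_is_ancestor() are stand-ins invented for this sketch; only the
loop itself is taken from the quoted patch.

```c
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in for struct pid_namespace: only depth and parent link. */
struct model_pidns {
	unsigned int level;
	struct model_pidns *parent;
};

/*
 * Same logic as the quoted helper: climb from @child until we reach
 * @ancestor's depth, then compare. A namespace counts as its own ancestor.
 */
bool model_pidns_is_ancestor(struct model_pidns *child,
			     struct model_pidns *ancestor)
{
	struct model_pidns *ns;

	if (child->level < ancestor->level)
		return false;
	for (ns = child; ns->level > ancestor->level; ns = ns->parent)
		;
	return ns == ancestor;
}
```

Note that the walk only dereferences ->level and ->parent, so an inline
version would need nothing beyond the struct definition already in the
header.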


* Re: [PATCH RFC v2 2/4] procfs: add "pidns" mount option
  2025-07-22 23:18 ` [PATCH RFC v2 2/4] procfs: add "pidns" mount option Aleksa Sarai
@ 2025-07-24  7:25   ` Christian Brauner
  2025-07-25  2:13     ` Aleksa Sarai
  0 siblings, 1 reply; 13+ messages in thread
From: Christian Brauner @ 2025-07-24  7:25 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: Alexander Viro, Jan Kara, Jonathan Corbet, Shuah Khan,
	linux-kernel, linux-fsdevel, linux-api, linux-doc,
	linux-kselftest

On Wed, Jul 23, 2025 at 09:18:52AM +1000, Aleksa Sarai wrote:
> Since the introduction of pid namespaces, their interaction with procfs
> has been entirely implicit in ways that require a lot of dancing around
> by programs that need to construct sandboxes with different PID
> namespaces.
> 
> Being able to explicitly specify the pid namespace to use when
> constructing a procfs super block will allow programs to no longer need
> to fork off a process which then does unshare(2) / setns(2) and
> forks again in order to construct a procfs in a pidns.
> 
> So, provide a "pidns" mount option which allows such users to just
> explicitly state which pid namespace they want that procfs instance to
> use. This interface can be used with fsconfig(2) either with a file
> descriptor or a path:
> 
>   fsconfig(procfd, FSCONFIG_SET_FD, "pidns", NULL, nsfd);
>   fsconfig(procfd, FSCONFIG_SET_STRING, "pidns", "/proc/self/ns/pid", 0);

Fwiw, namespace mount options could just be VFS generic mount options.
But it's not something that we need to solve right now.

> 
> or with classic mount(2) / mount(8):
> 
>   // mount -t proc -o pidns=/proc/self/ns/pid proc /tmp/proc
>   mount("proc", "/tmp/proc", "proc", MS_..., "pidns=/proc/self/ns/pid");
> 
> As this new API is effectively shorthand for setns(2) followed by
> mount(2), the permission model for this mirrors pidns_install() to avoid
> opening up new attack surfaces by loosening the existing permission
> model.
> 
> Note that the mount infrastructure also allows userspace to reconfigure
> the pidns of an existing procfs mount, which may or may not be useful to
> some users.
> 
> Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
> ---
>  Documentation/filesystems/proc.rst |  6 +++
>  fs/proc/root.c                     | 90 +++++++++++++++++++++++++++++++++++---
>  2 files changed, 90 insertions(+), 6 deletions(-)
> 
> diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
> index 5236cb52e357..c520b9f8a3fd 100644
> --- a/Documentation/filesystems/proc.rst
> +++ b/Documentation/filesystems/proc.rst
> @@ -2360,6 +2360,7 @@ The following mount options are supported:
>  	hidepid=	Set /proc/<pid>/ access mode.
>  	gid=		Set the group authorized to learn processes information.
>  	subset=		Show only the specified subset of procfs.
> +	pidns=		Specify the pid namespace used by this procfs.
>  	=========	========================================================
>  
>  hidepid=off or hidepid=0 means classic mode - everybody may access all
> @@ -2392,6 +2393,11 @@ information about processes information, just add identd to this group.
>  subset=pid hides all top level files and directories in the procfs that
>  are not related to tasks.
>  
> +pidns= specifies a pid namespace (either as a string path to something like
> +`/proc/$pid/ns/pid`, or a file descriptor when using `FSCONFIG_SET_FD`) that
> +will be used by the procfs instance when translating pids. By default, procfs
> +will use the calling process's active pid namespace.
> +
>  Chapter 5: Filesystem behavior
>  ==============================
>  
> diff --git a/fs/proc/root.c b/fs/proc/root.c
> index ed86ac710384..057c8a125c6e 100644
> --- a/fs/proc/root.c
> +++ b/fs/proc/root.c
> @@ -38,12 +38,18 @@ enum proc_param {
>  	Opt_gid,
>  	Opt_hidepid,
>  	Opt_subset,
> +#ifdef CONFIG_PID_NS
> +	Opt_pidns,
> +#endif
>  };
>  
>  static const struct fs_parameter_spec proc_fs_parameters[] = {
> -	fsparam_u32("gid",	Opt_gid),
> +	fsparam_u32("gid",		Opt_gid),
>  	fsparam_string("hidepid",	Opt_hidepid),
>  	fsparam_string("subset",	Opt_subset),
> +#ifdef CONFIG_PID_NS
> +	fsparam_file_or_string("pidns",	Opt_pidns),
> +#endif
>  	{}
>  };
>  
> @@ -109,11 +115,67 @@ static int proc_parse_subset_param(struct fs_context *fc, char *value)
>  	return 0;
>  }
>  
> +#ifdef CONFIG_PID_NS
> +static int proc_parse_pidns_param(struct fs_context *fc,
> +				  struct fs_parameter *param,
> +				  struct fs_parse_result *result)
> +{
> +	struct proc_fs_context *ctx = fc->fs_private;
> +	struct pid_namespace *target, *active = task_active_pid_ns(current);
> +	struct ns_common *ns;
> +	struct file *ns_filp __free(fput) = NULL;
> +
> +	switch (param->type) {
> +	case fs_value_is_file:
> +		/* came through fsconfig, steal the file reference */
> +		ns_filp = param->file;
> +		param->file = NULL;

This can be shortened to:

ns_filp = no_free_ptr(param->file);
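
For readers unfamiliar with the pattern, here is a userspace sketch of
what "read the pointer and NULL the source in one expression" looks like
next to a scope-based cleanup. It assumes GCC or Clang (cleanup attribute
and statement expressions). auto_free, steal_ptr(), free_charp() and
consume() are names made up for this example; they only imitate the
kernel's __free() and no_free_ptr().

```c
#include <assert.h>
#include <stdlib.h>

/* Counts how many times the cleanup handler actually released something. */
int frees_seen;

void free_charp(char **p)
{
	if (*p) {
		free(*p);
		frees_seen++;
	}
}

/* Stand-in for __free(): run free_charp() when the variable leaves scope. */
#define auto_free __attribute__((cleanup(free_charp)))

/*
 * Stand-in for no_free_ptr(): yield the pointer's value and set the source
 * to NULL, so the previous holder no longer releases it.
 */
#define steal_ptr(p) ({ __typeof__(p) tmp_ = (p); (p) = NULL; tmp_; })

/* Takes ownership of *slot and releases it when the function returns. */
void consume(char **slot)
{
	assert(slot != NULL);
	auto_free char *owned = steal_ptr(*slot);
	(void)owned;
}
```

After consume(&buf) the caller's buf is NULL and the allocation has been
released exactly once, which is the same hand-over the two-line version in
the patch performs on param->file.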

> +		break;
> +	case fs_value_is_string:
> +		ns_filp = filp_open(param->string, O_RDONLY, 0);
> +		break;
> +	default:
> +		WARN_ON_ONCE(true);
> +		break;
> +	}
> +	if (!ns_filp)
> +		ns_filp = ERR_PTR(-EBADF);
> +	if (IS_ERR(ns_filp)) {
> +		errorfc(fc, "could not get file from pidns argument");
> +		return PTR_ERR(ns_filp);
> +	}
> +
> +	if (!proc_ns_file(ns_filp))
> +		return invalfc(fc, "pidns argument is not an nsfs file");
> +	ns = get_proc_ns(file_inode(ns_filp));
> +	if (ns->ops->type != CLONE_NEWPID)
> +		return invalfc(fc, "pidns argument is not a pidns file");
> +	target = container_of(ns, struct pid_namespace, ns);
> +
> +	/*
> +	 * pidns= is shorthand for joining the pidns to get a fsopen fd, so the
> +	 * permission model should be the same as pidns_install().
> +	 */
> +	if (!ns_capable(target->user_ns, CAP_SYS_ADMIN)) {
> +		errorfc(fc, "insufficient permissions to set pidns");
> +		return -EPERM;
> +	}
> +	if (!pidns_is_ancestor(target, active))
> +		return invalfc(fc, "cannot set pidns to non-descendant pidns");

This made me think. If one rewrote this as:

if (!ns_capable(task_active_pid_ns(current)->user_ns, CAP_SYS_ADMIN))

if (!pidns_is_ancestor(target, active))

that would also work, right? IOW, you'd be checking whether you're
capable over your current pid namespace owning userns and if the target
pidns is a descendant it's also implied by the first check that you're
capable over it.

The only way this would not be true is if a descendant pidns would be
owned by a userns over which you don't hold privileges and I wondered
whether that's even possible? I don't think it is but maybe you see a
way.

> +
> +	put_pid_ns(ctx->pid_ns);
> +	ctx->pid_ns = get_pid_ns(target);
> +	put_user_ns(fc->user_ns);
> +	fc->user_ns = get_user_ns(ctx->pid_ns->user_ns);
> +	return 0;
> +}
> +#endif /* CONFIG_PID_NS */
> +
>  static int proc_parse_param(struct fs_context *fc, struct fs_parameter *param)
>  {
>  	struct proc_fs_context *ctx = fc->fs_private;
>  	struct fs_parse_result result;
> -	int opt;
> +	int opt, err;
>  
>  	opt = fs_parse(fc, proc_fs_parameters, param, &result);
>  	if (opt < 0)
> @@ -125,14 +187,24 @@ static int proc_parse_param(struct fs_context *fc, struct fs_parameter *param)
>  		break;
>  
>  	case Opt_hidepid:
> -		if (proc_parse_hidepid_param(fc, param))
> -			return -EINVAL;
> +		err = proc_parse_hidepid_param(fc, param);
> +		if (err)
> +			return err;
>  		break;
>  
>  	case Opt_subset:
> -		if (proc_parse_subset_param(fc, param->string) < 0)
> -			return -EINVAL;
> +		err = proc_parse_subset_param(fc, param->string);
> +		if (err)
> +			return err;
> +		break;
> +
> +#ifdef CONFIG_PID_NS
> +	case Opt_pidns:

I think it would be easier if we returned EOPNOTSUPP when !CONFIG_PID_NS
instead of EINVALing this?

> +		err = proc_parse_pidns_param(fc, param, &result);
> +		if (err)
> +			return err;
>  		break;
> +#endif
>  
>  	default:
>  		return -EINVAL;
> @@ -154,6 +226,12 @@ static void proc_apply_options(struct proc_fs_info *fs_info,
>  		fs_info->hide_pid = ctx->hidepid;
>  	if (ctx->mask & (1 << Opt_subset))
>  		fs_info->pidonly = ctx->pidonly;
> +#ifdef CONFIG_PID_NS
> +	if (ctx->mask & (1 << Opt_pidns)) {
> +		put_pid_ns(fs_info->pid_ns);
> +		fs_info->pid_ns = get_pid_ns(ctx->pid_ns);
> +	}
> +#endif
>  }
>  
>  static int proc_fill_super(struct super_block *s, struct fs_context *fc)
> 
> -- 
> 2.50.0
> 


* Re: [PATCH RFC v2 3/4] procfs: add PROCFS_GET_PID_NAMESPACE ioctl
  2025-07-22 23:18 ` [PATCH RFC v2 3/4] procfs: add PROCFS_GET_PID_NAMESPACE ioctl Aleksa Sarai
@ 2025-07-24  7:34   ` Christian Brauner
  2025-07-25  2:24     ` Aleksa Sarai
  0 siblings, 1 reply; 13+ messages in thread
From: Christian Brauner @ 2025-07-24  7:34 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: Alexander Viro, Jan Kara, Jonathan Corbet, Shuah Khan,
	linux-kernel, linux-fsdevel, linux-api, linux-doc,
	linux-kselftest

On Wed, Jul 23, 2025 at 09:18:53AM +1000, Aleksa Sarai wrote:
> /proc has historically had very opaque semantics about PID namespaces,
> which is a little unfortunate for container runtimes and other programs
> that deal with switching namespaces very often. One common issue is that
> of converting between PIDs in the process's namespace and PIDs in the
> namespace of /proc.
> 
> In principle, it is possible to do this today by opening a pidfd with
> pidfd_open(2) and then looking at /proc/self/fdinfo/$n (which will
> contain a PID value translated to the pid namespace associated with that
> procfs superblock). However, allocating a new file for each PID to be
> converted is less than ideal for programs that may need to scan procfs,
> and it is generally useful for userspace to be able to finally get this
> information from procfs.
> 
> So, add a new API for this in the form of an ioctl(2) you can call on
> the root directory of procfs. The returned file descriptor will have
> O_CLOEXEC set. This acts as a sister feature to the new "pidns" mount
> option, finally allowing userspace full control of the pid namespaces
> associated with procfs instances.
> 
> The permission model for this is a bit looser than that of the "pidns"
> mount option, but this is mainly because /proc/1/ns/pid provides the
> same information, so as long as you have access to that magic-link (or
> something equivalently reasonable such as privileges with CAP_SYS_ADMIN
> or being in an ancestor pid namespace) it makes sense to allow userspace
> to grab a handle. setns(2) will still have its own permission checks,
> so being able to open a pidns handle doesn't really provide too many
> other capabilities.
> 
> Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
> ---
>  Documentation/filesystems/proc.rst |  4 +++
>  fs/proc/root.c                     | 54 ++++++++++++++++++++++++++++++++++++--
>  include/uapi/linux/fs.h            |  3 +++
>  3 files changed, 59 insertions(+), 2 deletions(-)
> 
> diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
> index c520b9f8a3fd..506383273c9d 100644
> --- a/Documentation/filesystems/proc.rst
> +++ b/Documentation/filesystems/proc.rst
> @@ -2398,6 +2398,10 @@ pidns= specifies a pid namespace (either as a string path to something like
>  will be used by the procfs instance when translating pids. By default, procfs
>  will use the calling process's active pid namespace.
>  
> +Processes can check which pid namespace is used by a procfs instance by using
> +the `PROCFS_GET_PID_NAMESPACE` ioctl() on the root directory of the procfs
> +instance.
> +
>  Chapter 5: Filesystem behavior
>  ==============================
>  
> diff --git a/fs/proc/root.c b/fs/proc/root.c
> index 057c8a125c6e..548a57ec2152 100644
> --- a/fs/proc/root.c
> +++ b/fs/proc/root.c
> @@ -23,8 +23,10 @@
>  #include <linux/cred.h>
>  #include <linux/magic.h>
>  #include <linux/slab.h>
> +#include <linux/ptrace.h>
>  
>  #include "internal.h"
> +#include "../internal.h"
>  
>  struct proc_fs_context {
>  	struct pid_namespace	*pid_ns;
> @@ -418,15 +420,63 @@ static int proc_root_readdir(struct file *file, struct dir_context *ctx)
>  	return proc_pid_readdir(file, ctx);
>  }
>  
> +static long int proc_root_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
> +{
> +	switch (cmd) {
> +#ifdef CONFIG_PID_NS
> +	case PROCFS_GET_PID_NAMESPACE: {
> +		struct pid_namespace *active = task_active_pid_ns(current);
> +		struct pid_namespace *ns = proc_pid_ns(file_inode(filp)->i_sb);
> +		bool can_access_pidns = false;
> +
> +		/*
> +		 * If we are in an ancestor of the pidns, or have join
> +		 * privileges (CAP_SYS_ADMIN), then it makes sense that we
> +		 * would be able to grab a handle to the pidns.
> +		 *
> +		 * Otherwise, if there is a root process, then being able to
> +		 * access /proc/$pid/ns/pid is equivalent to this ioctl and so
> +		 * we should probably match the permission model. For empty
> +		 * namespaces it seems unlikely for there to be a downside to
> +		 * allowing unprivileged users to open a handle to it (setns
> +		 * will fail for unprivileged users anyway).
> +		 */
> +		can_access_pidns = pidns_is_ancestor(ns, active) ||
> +				   ns_capable(ns->user_ns, CAP_SYS_ADMIN);

This seems to imply that if @ns is a descendant of @active that the
caller holds privileges over it. Is that actually always true?

IOW, why is the check different from the previous pidns= mount option
check. I would've expected:

ns_capable(_no_audit)(ns->user_ns) && pidns_is_ancestor(ns, active)

and then the ptrace check as a fallback.
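
To make the difference concrete, here is a small boolean model of the two
orderings. It is only a sketch: struct access_facts and the function names
are invented here, each field stands in for the kernel check named in its
comment, and the fallback follows the behaviour described in the patch's
comment (allowed for an empty namespace, otherwise gated on ptrace access
to the reaper).

```c
#include <stdbool.h>

/* Inputs to the decision, reduced to booleans for illustration. */
struct access_facts {
	bool target_is_descendant; /* pidns_is_ancestor(ns, active) */
	bool capable;              /* ns_capable(ns->user_ns, CAP_SYS_ADMIN) */
	bool has_reaper;           /* ns->child_reaper != NULL */
	bool may_ptrace_reaper;    /* ptrace_may_access(reaper, ...) */
};

/* Fallback as described in the patch comment: empty namespaces are allowed. */
bool ptrace_fallback(const struct access_facts *f)
{
	return !f->has_reaper || f->may_ptrace_reaper;
}

/* Ordering in the patch: either condition alone is sufficient. */
bool access_either(const struct access_facts *f)
{
	return f->target_is_descendant || f->capable || ptrace_fallback(f);
}

/* Ordering suggested above: both conditions, then the ptrace fallback. */
bool access_both(const struct access_facts *f)
{
	return (f->capable && f->target_is_descendant) || ptrace_fallback(f);
}
```

The two only disagree when exactly one of the first two conditions holds
and the fallback fails, e.g. a descendant pidns over which the caller has
no privileges and whose pid1 it cannot ptrace.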

> +		if (!can_access_pidns) {
> +			bool cannot_ptrace_pid1 = false;
> +
> +			read_lock(&tasklist_lock);
> +			if (ns->child_reaper)
> +				cannot_ptrace_pid1 = !ptrace_may_access(ns->child_reaper,
> +									PTRACE_MODE_READ_FSCREDS);
> +			read_unlock(&tasklist_lock);
> +			can_access_pidns = !cannot_ptrace_pid1;
> +		}
> +		if (!can_access_pidns)
> +			return -EPERM;
> +
> +		/* open_namespace() unconditionally consumes the reference. */
> +		get_pid_ns(ns);
> +		return open_namespace(to_ns_common(ns));
> +	}
> +#endif /* CONFIG_PID_NS */
> +	default:
> +		return -ENOIOCTLCMD;
> +	}
> +}
> +
>  /*
>   * The root /proc directory is special, as it has the
>   * <pid> directories. Thus we don't use the generic
>   * directory handling functions for that..
>   */
>  static const struct file_operations proc_root_operations = {
> -	.read		 = generic_read_dir,
> -	.iterate_shared	 = proc_root_readdir,
> +	.read		= generic_read_dir,
> +	.iterate_shared	= proc_root_readdir,
>  	.llseek		= generic_file_llseek,
> +	.unlocked_ioctl = proc_root_ioctl,
> +	.compat_ioctl   = compat_ptr_ioctl,
>  };
>  
>  /*
> diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
> index 0bd678a4a10e..aa642cb48feb 100644
> --- a/include/uapi/linux/fs.h
> +++ b/include/uapi/linux/fs.h
> @@ -437,6 +437,9 @@ typedef int __bitwise __kernel_rwf_t;
>  
>  #define PROCFS_IOCTL_MAGIC 'f'
>  
> +/* procfs root ioctls */
> +#define PROCFS_GET_PID_NAMESPACE	_IO(PROCFS_IOCTL_MAGIC, 1)
> +
>  /* Pagemap ioctl */
>  #define PAGEMAP_SCAN	_IOWR(PROCFS_IOCTL_MAGIC, 16, struct pm_scan_arg)
>  
> 
> -- 
> 2.50.0
> 


* Re: [PATCH RFC v2 0/4] procfs: make reference pidns more user-visible
  2025-07-22 23:18 [PATCH RFC v2 0/4] procfs: make reference pidns more user-visible Aleksa Sarai
                   ` (3 preceding siblings ...)
  2025-07-22 23:18 ` [PATCH RFC v2 4/4] selftests/proc: add tests for new pidns APIs Aleksa Sarai
@ 2025-07-24  7:36 ` Christian Brauner
  4 siblings, 0 replies; 13+ messages in thread
From: Christian Brauner @ 2025-07-24  7:36 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: Alexander Viro, Jan Kara, Jonathan Corbet, Shuah Khan,
	linux-kernel, linux-fsdevel, linux-api, linux-doc,
	linux-kselftest

On Wed, Jul 23, 2025 at 09:18:50AM +1000, Aleksa Sarai wrote:
> Ever since the introduction of pid namespaces, procfs has had very
> implicit behaviour surrounding them (the pidns used by a procfs mount is
> auto-selected based on the mounting process's active pidns, and the
> pidns itself is basically hidden once the mount has been constructed).
> 
> /* pidns mount option for procfs */

I like it. I think this will be very useful!
Fwiw, I think sysfs could probably use the same treatment.
It should probably gain a pidns & netns mount option and the ioctls to
get those out of sysfs so you know where that sysfs belongs. Thoughts?


* Re: [PATCH RFC v2 2/4] procfs: add "pidns" mount option
  2025-07-24  7:25   ` Christian Brauner
@ 2025-07-25  2:13     ` Aleksa Sarai
  0 siblings, 0 replies; 13+ messages in thread
From: Aleksa Sarai @ 2025-07-25  2:13 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Alexander Viro, Jan Kara, Jonathan Corbet, Shuah Khan,
	linux-kernel, linux-fsdevel, linux-api, linux-doc,
	linux-kselftest


On 2025-07-24, Christian Brauner <brauner@kernel.org> wrote:
> On Wed, Jul 23, 2025 at 09:18:52AM +1000, Aleksa Sarai wrote:
> > Since the introduction of pid namespaces, their interaction with procfs
> > has been entirely implicit in ways that require a lot of dancing around
> > by programs that need to construct sandboxes with different PID
> > namespaces.
> > 
> > Being able to explicitly specify the pid namespace to use when
> > constructing a procfs super block will allow programs to no longer need
> > to fork off a process which then does unshare(2) / setns(2) and
> > forks again in order to construct a procfs in a pidns.
> > 
> > So, provide a "pidns" mount option which allows such users to just
> > explicitly state which pid namespace they want that procfs instance to
> > use. This interface can be used with fsconfig(2) either with a file
> > descriptor or a path:
> > 
> >   fsconfig(procfd, FSCONFIG_SET_FD, "pidns", NULL, nsfd);
> >   fsconfig(procfd, FSCONFIG_SET_STRING, "pidns", "/proc/self/ns/pid", 0);
> 
> Fwiw, namespace mount options could just be VFS generic mount options.
> But it's not something that we need to solve right now.

Yeah if we add this to sysfs it probably should be made generic, but
let's punt this to later. :D

> > or with classic mount(2) / mount(8):
> > 
> >   // mount -t proc -o pidns=/proc/self/ns/pid proc /tmp/proc
> >   mount("proc", "/tmp/proc", "proc", MS_..., "pidns=/proc/self/ns/pid");
> > 
> > As this new API is effectively shorthand for setns(2) followed by
> > mount(2), the permission model for this mirrors pidns_install() to avoid
> > opening up new attack surfaces by loosening the existing permission
> > model.
> > 
> > Note that the mount infrastructure also allows userspace to reconfigure
> > the pidns of an existing procfs mount, which may or may not be useful to
> > some users.
> > 
> > Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
> > ---
> >  Documentation/filesystems/proc.rst |  6 +++
> >  fs/proc/root.c                     | 90 +++++++++++++++++++++++++++++++++++---
> >  2 files changed, 90 insertions(+), 6 deletions(-)
> > 
> > diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
> > index 5236cb52e357..c520b9f8a3fd 100644
> > --- a/Documentation/filesystems/proc.rst
> > +++ b/Documentation/filesystems/proc.rst
> > @@ -2360,6 +2360,7 @@ The following mount options are supported:
> >  	hidepid=	Set /proc/<pid>/ access mode.
> >  	gid=		Set the group authorized to learn processes information.
> >  	subset=		Show only the specified subset of procfs.
> > +	pidns=		Specify the pid namespace used by this procfs.
> >  	=========	========================================================
> >  
> >  hidepid=off or hidepid=0 means classic mode - everybody may access all
> > @@ -2392,6 +2393,11 @@ information about processes information, just add identd to this group.
> >  subset=pid hides all top level files and directories in the procfs that
> >  are not related to tasks.
> >  
> > +pidns= specifies a pid namespace (either as a string path to something like
> > +`/proc/$pid/ns/pid`, or a file descriptor when using `FSCONFIG_SET_FD`) that
> > +will be used by the procfs instance when translating pids. By default, procfs
> > +will use the calling process's active pid namespace.
> > +
> >  Chapter 5: Filesystem behavior
> >  ==============================
> >  
> > diff --git a/fs/proc/root.c b/fs/proc/root.c
> > index ed86ac710384..057c8a125c6e 100644
> > --- a/fs/proc/root.c
> > +++ b/fs/proc/root.c
> > @@ -38,12 +38,18 @@ enum proc_param {
> >  	Opt_gid,
> >  	Opt_hidepid,
> >  	Opt_subset,
> > +#ifdef CONFIG_PID_NS
> > +	Opt_pidns,
> > +#endif
> >  };
> >  
> >  static const struct fs_parameter_spec proc_fs_parameters[] = {
> > -	fsparam_u32("gid",	Opt_gid),
> > +	fsparam_u32("gid",		Opt_gid),
> >  	fsparam_string("hidepid",	Opt_hidepid),
> >  	fsparam_string("subset",	Opt_subset),
> > +#ifdef CONFIG_PID_NS
> > +	fsparam_file_or_string("pidns",	Opt_pidns),
> > +#endif
> >  	{}
> >  };
> >  
> > @@ -109,11 +115,67 @@ static int proc_parse_subset_param(struct fs_context *fc, char *value)
> >  	return 0;
> >  }
> >  
> > +#ifdef CONFIG_PID_NS
> > +static int proc_parse_pidns_param(struct fs_context *fc,
> > +				  struct fs_parameter *param,
> > +				  struct fs_parse_result *result)
> > +{
> > +	struct proc_fs_context *ctx = fc->fs_private;
> > +	struct pid_namespace *target, *active = task_active_pid_ns(current);
> > +	struct ns_common *ns;
> > +	struct file *ns_filp __free(fput) = NULL;
> > +
> > +	switch (param->type) {
> > +	case fs_value_is_file:
> > +		/* came through fsconfig, steal the file reference */
> > +		ns_filp = param->file;
> > +		param->file = NULL;
> 
> This can be shortened to:
> 
> ns_filp = no_free_ptr(param->file);

I really need to take a closer look at <linux/cleanup.h>, each time I
look at it I learn about another handy helper.

> > +		break;
> > +	case fs_value_is_string:
> > +		ns_filp = filp_open(param->string, O_RDONLY, 0);
> > +		break;
> > +	default:
> > +		WARN_ON_ONCE(true);
> > +		break;
> > +	}
> > +	if (!ns_filp)
> > +		ns_filp = ERR_PTR(-EBADF);
> > +	if (IS_ERR(ns_filp)) {
> > +		errorfc(fc, "could not get file from pidns argument");
> > +		return PTR_ERR(ns_filp);
> > +	}
> > +
> > +	if (!proc_ns_file(ns_filp))
> > +		return invalfc(fc, "pidns argument is not an nsfs file");
> > +	ns = get_proc_ns(file_inode(ns_filp));
> > +	if (ns->ops->type != CLONE_NEWPID)
> > +		return invalfc(fc, "pidns argument is not a pidns file");
> > +	target = container_of(ns, struct pid_namespace, ns);
> > +
> > +	/*
> > +	 * pidns= is shorthand for joining the pidns to get a fsopen fd, so the
> > +	 * permission model should be the same as pidns_install().
> > +	 */
> > +	if (!ns_capable(target->user_ns, CAP_SYS_ADMIN)) {
> > +		errorfc(fc, "insufficient permissions to set pidns");
> > +		return -EPERM;
> > +	}
> > +	if (!pidns_is_ancestor(target, active))
> > +		return invalfc(fc, "cannot set pidns to non-descendant pidns");
> 
> This made me think. If one rewrote this as:
> 
> if (!ns_capable(task_active_pid_ns(current)->user_ns, CAP_SYS_ADMIN))
> 
> if (!pidns_is_ancestor(target, active))
> 
> that would also work, right? IOW, you'd be checking whether you're
> capable over your current pid namespace owning userns and if the target
> pidns is a descendant it's also implied by the first check that you're
> capable over it.
> 
> The only way this would not be true is if a descendant pidns would be
> owned by a userns over which you don't hold privileges and I wondered
> whether that's even possible? I don't think it is but maybe you see a
> way.

Well, if you run a setuid binary, it could create a pidns that is a
child but is owned by a more privileged userns than you. My main goal
here was to just mirror pidns_install() exactly, to make sure that the
permission model was identical.

> > +
> > +	put_pid_ns(ctx->pid_ns);
> > +	ctx->pid_ns = get_pid_ns(target);
> > +	put_user_ns(fc->user_ns);
> > +	fc->user_ns = get_user_ns(ctx->pid_ns->user_ns);
> > +	return 0;
> > +}
> > +#endif /* CONFIG_PID_NS */
> > +
> >  static int proc_parse_param(struct fs_context *fc, struct fs_parameter *param)
> >  {
> >  	struct proc_fs_context *ctx = fc->fs_private;
> >  	struct fs_parse_result result;
> > -	int opt;
> > +	int opt, err;
> >  
> >  	opt = fs_parse(fc, proc_fs_parameters, param, &result);
> >  	if (opt < 0)
> > @@ -125,14 +187,24 @@ static int proc_parse_param(struct fs_context *fc, struct fs_parameter *param)
> >  		break;
> >  
> >  	case Opt_hidepid:
> > -		if (proc_parse_hidepid_param(fc, param))
> > -			return -EINVAL;
> > +		err = proc_parse_hidepid_param(fc, param);
> > +		if (err)
> > +			return err;
> >  		break;
> >  
> >  	case Opt_subset:
> > -		if (proc_parse_subset_param(fc, param->string) < 0)
> > -			return -EINVAL;
> > +		err = proc_parse_subset_param(fc, param->string);
> > +		if (err)
> > +			return err;
> > +		break;
> > +
> > +#ifdef CONFIG_PID_NS
> > +	case Opt_pidns:
> 
> I think it would be easier if we returned EOPNOTSUPP when !CONFIG_PID_NS
> instead of EINVALing this?
> 
> > +		err = proc_parse_pidns_param(fc, param, &result);
> > +		if (err)
> > +			return err;
> >  		break;
> > +#endif
> >  
> >  	default:
> >  		return -EINVAL;
> > @@ -154,6 +226,12 @@ static void proc_apply_options(struct proc_fs_info *fs_info,
> >  		fs_info->hide_pid = ctx->hidepid;
> >  	if (ctx->mask & (1 << Opt_subset))
> >  		fs_info->pidonly = ctx->pidonly;
> > +#ifdef CONFIG_PID_NS
> > +	if (ctx->mask & (1 << Opt_pidns)) {
> > +		put_pid_ns(fs_info->pid_ns);
> > +		fs_info->pid_ns = get_pid_ns(ctx->pid_ns);
> > +	}
> > +#endif
> >  }
> >  
> >  static int proc_fill_super(struct super_block *s, struct fs_context *fc)
> > 
> > -- 
> > 2.50.0
> > 

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
https://www.cyphar.com/



* Re: [PATCH RFC v2 3/4] procfs: add PROCFS_GET_PID_NAMESPACE ioctl
  2025-07-24  7:34   ` Christian Brauner
@ 2025-07-25  2:24     ` Aleksa Sarai
  2025-07-31 10:31       ` Christian Brauner
  0 siblings, 1 reply; 13+ messages in thread
From: Aleksa Sarai @ 2025-07-25  2:24 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Alexander Viro, Jan Kara, Jonathan Corbet, Shuah Khan,
	linux-kernel, linux-fsdevel, linux-api, linux-doc,
	linux-kselftest


On 2025-07-24, Christian Brauner <brauner@kernel.org> wrote:
> On Wed, Jul 23, 2025 at 09:18:53AM +1000, Aleksa Sarai wrote:
> > /proc has historically had very opaque semantics about PID namespaces,
> > which is a little unfortunate for container runtimes and other programs
> > that deal with switching namespaces very often. One common issue is that
> > of converting between PIDs in the process's namespace and PIDs in the
> > namespace of /proc.
> > 
> > In principle, it is possible to do this today by opening a pidfd with
> > pidfd_open(2) and then looking at /proc/self/fdinfo/$n (which will
> > contain a PID value translated to the pid namespace associated with that
> > procfs superblock). However, allocating a new file for each PID to be
> > converted is less than ideal for programs that may need to scan procfs,
> > and it is generally useful for userspace to be able to finally get this
> > information from procfs.
> > 
> > So, add a new API for this in the form of an ioctl(2) you can call on
> > the root directory of procfs. The returned file descriptor will have
> > O_CLOEXEC set. This acts as a sister feature to the new "pidns" mount
> > option, finally allowing userspace full control of the pid namespaces
> > associated with procfs instances.
> > 
> > The permission model for this is a bit looser than that of the "pidns"
> > mount option, but this is mainly because /proc/1/ns/pid provides the
> > same information, so as long as you have access to that magic-link (or
> > something equivalently reasonable such as privileges with CAP_SYS_ADMIN
> > or being in an ancestor pid namespace) it makes sense to allow userspace
> > to grab a handle. setns(2) will still have its own permission checks,
> > so being able to open a pidns handle doesn't really provide too many
> > other capabilities.
> > 
> > Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
> > ---
> >  Documentation/filesystems/proc.rst |  4 +++
> >  fs/proc/root.c                     | 54 ++++++++++++++++++++++++++++++++++++--
> >  include/uapi/linux/fs.h            |  3 +++
> >  3 files changed, 59 insertions(+), 2 deletions(-)
> > 
> > diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
> > index c520b9f8a3fd..506383273c9d 100644
> > --- a/Documentation/filesystems/proc.rst
> > +++ b/Documentation/filesystems/proc.rst
> > @@ -2398,6 +2398,10 @@ pidns= specifies a pid namespace (either as a string path to something like
> >  will be used by the procfs instance when translating pids. By default, procfs
> >  will use the calling process's active pid namespace.
> >  
> > +Processes can check which pid namespace is used by a procfs instance by using
> > +the `PROCFS_GET_PID_NAMESPACE` ioctl() on the root directory of the procfs
> > +instance.
> > +
> >  Chapter 5: Filesystem behavior
> >  ==============================
> >  
> > diff --git a/fs/proc/root.c b/fs/proc/root.c
> > index 057c8a125c6e..548a57ec2152 100644
> > --- a/fs/proc/root.c
> > +++ b/fs/proc/root.c
> > @@ -23,8 +23,10 @@
> >  #include <linux/cred.h>
> >  #include <linux/magic.h>
> >  #include <linux/slab.h>
> > +#include <linux/ptrace.h>
> >  
> >  #include "internal.h"
> > +#include "../internal.h"
> >  
> >  struct proc_fs_context {
> >  	struct pid_namespace	*pid_ns;
> > @@ -418,15 +420,63 @@ static int proc_root_readdir(struct file *file, struct dir_context *ctx)
> >  	return proc_pid_readdir(file, ctx);
> >  }
> >  
> > +static long int proc_root_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
> > +{
> > +	switch (cmd) {
> > +#ifdef CONFIG_PID_NS
> > +	case PROCFS_GET_PID_NAMESPACE: {
> > +		struct pid_namespace *active = task_active_pid_ns(current);
> > +		struct pid_namespace *ns = proc_pid_ns(file_inode(filp)->i_sb);
> > +		bool can_access_pidns = false;
> > +
> > +		/*
> > +		 * If we are in an ancestor of the pidns, or have join
> > +		 * privileges (CAP_SYS_ADMIN), then it makes sense that we
> > +		 * would be able to grab a handle to the pidns.
> > +		 *
> > +		 * Otherwise, if there is a root process, then being able to
> > +		 * access /proc/$pid/ns/pid is equivalent to this ioctl and so
> > +		 * we should probably match the permission model. For empty
> > +		 * namespaces it seems unlikely for there to be a downside to
> > +		 * allowing unprivileged users to open a handle to it (setns
> > +		 * will fail for unprivileged users anyway).
> > +		 */
> > +		can_access_pidns = pidns_is_ancestor(ns, active) ||
> > +				   ns_capable(ns->user_ns, CAP_SYS_ADMIN);
> 
> This seems to imply that if @ns is a descendant of @active that the
> caller holds privileges over it. Is that actually always true?
> 
> IOW, why is the check different from the previous pidns= mount option
> check. I would've expected:
> 
> ns_capable(_no_audit)(ns->user_ns) && pidns_is_ancestor(ns, active)
> 
> and then the ptrace check as a fallback.

That would mirror pidns_install(), and I did think about it. The primary
(mostly handwave-y) reasoning I had for making it less strict was that:

 * If you are in an ancestor pidns, then you can already see those
   processes in your own /proc. In theory that means that you will be
   able to access /proc/$pid/ns/pid for at least some subprocess there
   (even if some subprocesses have SUID_DUMP_DISABLE, that flag is
   cleared on ).

   Though hypothetically if they are all running as a different user,
   this does not apply (and you could create scenarios where a child
   pidns is owned by a userns that you do not have privileges over -- if
   you deal with setuid binaries). Maybe that risk means we should just
   combine them, I'm not sure.

 * If you have CAP_SYS_ADMIN permissions over the pidns, it seems
   strange to disallow access even if it is not in an ancestor
   namespace. This is distinct from pidns_install(), where you want to
   ensure you cannot escape to a parent pid namespace; this is about
   getting a handle to do other operations (i.e. NS_GET_{P,TG}ID_*_PIDNS).

Maybe they should be combined to match pidns_install(), but then I would
expect the ptrace_may_access() check to apply to all processes in the
pidns to make it less restrictive, which is not something you can
practically do (and there is a higher chance that pid1 will have
SUID_DUMP_DISABLE than some random subprocess, which almost certainly
will not be SUID_DUMP_DISABLE).
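
To make the difference between the two candidate checks concrete, here
is a small userspace model. It is only a sketch: the struct and function
names are hypothetical, the booleans stand in for the kernel-side
helpers named in the comments, and the ptrace fallback follows the
intent described in the patch's code comment (empty namespaces are
allowed, otherwise pid1 must be readable via ptrace).

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the results of the kernel-side checks. */
struct pidns_access_inputs {
	bool caller_in_ancestor;	/* pidns_is_ancestor(ns, active) */
	bool caller_has_admin;		/* ns_capable(ns->user_ns, CAP_SYS_ADMIN) */
	bool ns_has_pid1;		/* ns->child_reaper != NULL */
	bool caller_may_ptrace_pid1;	/* ptrace_may_access(pid1, READ_FSCREDS) */
};

/* Fallback: an empty pidns is allowed, otherwise pid1 must be readable. */
static bool ptrace_fallback(const struct pidns_access_inputs *in)
{
	return !in->ns_has_pid1 || in->caller_may_ptrace_pid1;
}

/* Looser variant from this patch: ancestor OR CAP_SYS_ADMIN. */
static bool policy_or(const struct pidns_access_inputs *in)
{
	return in->caller_in_ancestor || in->caller_has_admin ||
	       ptrace_fallback(in);
}

/* Variant mirroring pidns_install(): ancestor AND CAP_SYS_ADMIN. */
static bool policy_and(const struct pidns_access_inputs *in)
{
	return (in->caller_in_ancestor && in->caller_has_admin) ||
	       ptrace_fallback(in);
}
```

The two only disagree when exactly one of the ancestor/capability
conditions holds and the ptrace fallback fails, which is the case being
discussed here.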

Fundamentally, I guess I'm still trying to see what the risk is of
allowing a process to get a handle to a pidns that they have some kind
of privilege over (whether it's CAP_SYS_ADMIN, or by virtue of being
able to see and address all processes in the namespace, or by being able
to open /proc/$pidns_pid1/ns/pid anyway) but cannot join.

Then again, maybe the fact that it is kind of strange to explain is
enough of a reason to just make it simpler...

> > +		if (!can_access_pidns) {
> > +			bool cannot_ptrace_pid1 = false;
> > +
> > +			read_lock(&tasklist_lock);
> > +			if (ns->child_reaper)
> > +				cannot_ptrace_pid1 = !ptrace_may_access(ns->child_reaper,
> > +									PTRACE_MODE_READ_FSCREDS);
> > +			read_unlock(&tasklist_lock);
> > +			can_access_pidns = !cannot_ptrace_pid1;
> > +		}
> > +		if (!can_access_pidns)
> > +			return -EPERM;
> > +
> > +		/* open_namespace() unconditionally consumes the reference. */
> > +		get_pid_ns(ns);
> > +		return open_namespace(to_ns_common(ns));
> > +	}
> > +#endif /* CONFIG_PID_NS */
> > +	default:
> > +		return -ENOIOCTLCMD;
> > +	}
> > +}
> > +
> >  /*
> >   * The root /proc directory is special, as it has the
> >   * <pid> directories. Thus we don't use the generic
> >   * directory handling functions for that..
> >   */
> >  static const struct file_operations proc_root_operations = {
> > -	.read		 = generic_read_dir,
> > -	.iterate_shared	 = proc_root_readdir,
> > +	.read		= generic_read_dir,
> > +	.iterate_shared	= proc_root_readdir,
> >  	.llseek		= generic_file_llseek,
> > +	.unlocked_ioctl = proc_root_ioctl,
> > +	.compat_ioctl   = compat_ptr_ioctl,
> >  };
> >  
> >  /*
> > diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
> > index 0bd678a4a10e..aa642cb48feb 100644
> > --- a/include/uapi/linux/fs.h
> > +++ b/include/uapi/linux/fs.h
> > @@ -437,6 +437,9 @@ typedef int __bitwise __kernel_rwf_t;
> >  
> >  #define PROCFS_IOCTL_MAGIC 'f'
> >  
> > +/* procfs root ioctls */
> > +#define PROCFS_GET_PID_NAMESPACE	_IO(PROCFS_IOCTL_MAGIC, 1)
> > +
> >  /* Pagemap ioctl */
> >  #define PAGEMAP_SCAN	_IOWR(PROCFS_IOCTL_MAGIC, 16, struct pm_scan_arg)
> >  
> > 
> > -- 
> > 2.50.0
> > 

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
https://www.cyphar.com/


* Re: [PATCH RFC v2 3/4] procfs: add PROCFS_GET_PID_NAMESPACE ioctl
  2025-07-25  2:24     ` Aleksa Sarai
@ 2025-07-31 10:31       ` Christian Brauner
  2025-07-31 14:21         ` Aleksa Sarai
  0 siblings, 1 reply; 13+ messages in thread
From: Christian Brauner @ 2025-07-31 10:31 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: Alexander Viro, Jan Kara, Jonathan Corbet, Shuah Khan,
	linux-kernel, linux-fsdevel, linux-api, linux-doc,
	linux-kselftest

On Fri, Jul 25, 2025 at 12:24:28PM +1000, Aleksa Sarai wrote:
> On 2025-07-24, Christian Brauner <brauner@kernel.org> wrote:
> > On Wed, Jul 23, 2025 at 09:18:53AM +1000, Aleksa Sarai wrote:
> > > /proc has historically had very opaque semantics about PID namespaces,
> > > which is a little unfortunate for container runtimes and other programs
> > > that deal with switching namespaces very often. One common issue is that
> > > of converting between PIDs in the process's namespace and PIDs in the
> > > namespace of /proc.
> > > 
> > > In principle, it is possible to do this today by opening a pidfd with
> > > pidfd_open(2) and then looking at /proc/self/fdinfo/$n (which will
> > > contain a PID value translated to the pid namespace associated with that
> > > procfs superblock). However, allocating a new file for each PID to be
> > > converted is less than ideal for programs that may need to scan procfs,
> > > and it is generally useful for userspace to be able to finally get this
> > > information from procfs.
> > > 
> > > So, add a new API for this in the form of an ioctl(2) you can call on
> > > the root directory of procfs. The returned file descriptor will have
> > > O_CLOEXEC set. This acts as a sister feature to the new "pidns" mount
> > > option, finally allowing userspace full control of the pid namespaces
> > > associated with procfs instances.
> > > 
> > > The permission model for this is a bit looser than that of the "pidns"
> > > mount option, but this is mainly because /proc/1/ns/pid provides the
> > > same information, so as long as you have access to that magic-link (or
> > > something equivalently reasonable such as privileges with CAP_SYS_ADMIN
> > > or being in an ancestor pid namespace) it makes sense to allow userspace
> > > to grab a handle. setns(2) will still have its own permission checks,
> > > so being able to open a pidns handle doesn't really provide too many
> > > other capabilities.
> > > 
> > > Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
> > > ---
> > >  Documentation/filesystems/proc.rst |  4 +++
> > >  fs/proc/root.c                     | 54 ++++++++++++++++++++++++++++++++++++--
> > >  include/uapi/linux/fs.h            |  3 +++
> > >  3 files changed, 59 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
> > > index c520b9f8a3fd..506383273c9d 100644
> > > --- a/Documentation/filesystems/proc.rst
> > > +++ b/Documentation/filesystems/proc.rst
> > > @@ -2398,6 +2398,10 @@ pidns= specifies a pid namespace (either as a string path to something like
> > >  will be used by the procfs instance when translating pids. By default, procfs
> > >  will use the calling process's active pid namespace.
> > >  
> > > +Processes can check which pid namespace is used by a procfs instance by using
> > > +the `PROCFS_GET_PID_NAMESPACE` ioctl() on the root directory of the procfs
> > > +instance.
> > > +
> > >  Chapter 5: Filesystem behavior
> > >  ==============================
> > >  
> > > diff --git a/fs/proc/root.c b/fs/proc/root.c
> > > index 057c8a125c6e..548a57ec2152 100644
> > > --- a/fs/proc/root.c
> > > +++ b/fs/proc/root.c
> > > @@ -23,8 +23,10 @@
> > >  #include <linux/cred.h>
> > >  #include <linux/magic.h>
> > >  #include <linux/slab.h>
> > > +#include <linux/ptrace.h>
> > >  
> > >  #include "internal.h"
> > > +#include "../internal.h"
> > >  
> > >  struct proc_fs_context {
> > >  	struct pid_namespace	*pid_ns;
> > > @@ -418,15 +420,63 @@ static int proc_root_readdir(struct file *file, struct dir_context *ctx)
> > >  	return proc_pid_readdir(file, ctx);
> > >  }
> > >  
> > > +static long int proc_root_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
> > > +{
> > > +	switch (cmd) {
> > > +#ifdef CONFIG_PID_NS
> > > +	case PROCFS_GET_PID_NAMESPACE: {
> > > +		struct pid_namespace *active = task_active_pid_ns(current);
> > > +		struct pid_namespace *ns = proc_pid_ns(file_inode(filp)->i_sb);
> > > +		bool can_access_pidns = false;
> > > +
> > > +		/*
> > > +		 * If we are in an ancestor of the pidns, or have join
> > > +		 * privileges (CAP_SYS_ADMIN), then it makes sense that we
> > > +		 * would be able to grab a handle to the pidns.
> > > +		 *
> > > +		 * Otherwise, if there is a root process, then being able to
> > > +		 * access /proc/$pid/ns/pid is equivalent to this ioctl and so
> > > +		 * we should probably match the permission model. For empty
> > > +		 * namespaces it seems unlikely for there to be a downside to
> > > +		 * allowing unprivileged users to open a handle to it (setns
> > > +		 * will fail for unprivileged users anyway).
> > > +		 */
> > > +		can_access_pidns = pidns_is_ancestor(ns, active) ||
> > > +				   ns_capable(ns->user_ns, CAP_SYS_ADMIN);
> > 
> > This seems to imply that if @ns is a descendant of @active that the
> > caller holds privileges over it. Is that actually always true?
> > 
> > IOW, why is the check different from the previous pidns= mount option
> > check. I would've expected:
> > 
> > ns_capable(_no_audit)(ns->user_ns) && pidns_is_ancestor(ns, active)
> > 
> > and then the ptrace check as a fallback.
> 
> That would mirror pidns_install(), and I did think about it. The primary
> (mostly handwave-y) reasoning I had for making it less strict was that:
> 
>  * If you are in an ancestor pidns, then you can already see those
>    processes in your own /proc. In theory that means that you will be
>    able to access /proc/$pid/ns/pid for at least some subprocess there
>    (even if some subprocesses have SUID_DUMP_DISABLE, that flag is
>    cleared on ).
> 
>    Though hypothetically if they are all running as a different user,
>    this does not apply (and you could create scenarios where a child
>    pidns is owned by a userns that you do not have privileges over -- if
>    you deal with setuid binaries). Maybe that risk means we should just
>    combine them, I'm not sure.
> 
>  * If you have CAP_SYS_ADMIN permissions over the pidns, it seems
>    strange to disallow access even if it is not in an ancestor
>    namespace. This is distinct from pidns_install(), where you want to
>    ensure you cannot escape to a parent pid namespace; this is about
>    getting a handle to do other operations (i.e. NS_GET_{P,TG}ID_*_PIDNS).
> 
> Maybe they should be combined to match pidns_install(), but then I would
> expect the ptrace_may_access() check to apply to all processes in the
> pidns to make it less restrictive, which is not something you can
> practically do (and there is a higher chance that pid1 will have
> SUID_DUMP_DISABLE than some random subprocess, which almost certainly
> will not be SUID_DUMP_DISABLE).
> 
> Fundamentally, I guess I'm still trying to see what the risk is of
> allowing a process to get a handle to a pidns that they have some kind
> of privilege over (whether it's CAP_SYS_ADMIN, or by virtue of being

There shouldn't be. For example, you kinda implicitly do that with a
pidfd, no? Because you can pass the pidfd to setns() instead of a
namespace fd itself. Maybe that's the argument you're looking for?
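
As a rough sketch of that pidfd route (not code from this series), the
following uses raw syscalls so it does not depend on libc wrappers. The
helper names are hypothetical; passing a pidfd to setns(2) needs Linux
5.8 or later, and joining a pid namespace still requires CAP_SYS_ADMIN
over it, so an unprivileged caller should expect EPERM.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <errno.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Fallback in case <sched.h> was not included with _GNU_SOURCE. */
#ifndef CLONE_NEWPID
#define CLONE_NEWPID 0x20000000
#endif

/* Hypothetical wrapper around the pidfd_open(2) syscall. */
static int pidfd_open_compat(pid_t pid)
{
	return (int)syscall(SYS_pidfd_open, pid, 0);
}

/*
 * Hypothetical helper: join the pid namespace of @pid by handing a pidfd
 * (rather than a namespace fd) to setns(2). Only children of the caller
 * end up in the new pidns. Returns 0 on success, -1 with errno set.
 */
static int join_pidns_of(pid_t pid)
{
	int pidfd, ret, saved_errno;

	pidfd = pidfd_open_compat(pid);
	if (pidfd < 0)
		return -1;

	/* The pidfd names the target task; the flag selects its pidns. */
	ret = (int)syscall(SYS_setns, pidfd, CLONE_NEWPID);

	/* Preserve the setns errno across close(). */
	saved_errno = errno;
	close(pidfd);
	errno = saved_errno;

	return ret;
}
```

Being able to open the pidfd already gives the caller something it can
hand to setns(2), with setns(2) doing its own permission check.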


* Re: [PATCH RFC v2 3/4] procfs: add PROCFS_GET_PID_NAMESPACE ioctl
  2025-07-31 10:31       ` Christian Brauner
@ 2025-07-31 14:21         ` Aleksa Sarai
  0 siblings, 0 replies; 13+ messages in thread
From: Aleksa Sarai @ 2025-07-31 14:21 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Alexander Viro, Jan Kara, Jonathan Corbet, Shuah Khan,
	linux-kernel, linux-fsdevel, linux-api, linux-doc,
	linux-kselftest


On 2025-07-31, Christian Brauner <brauner@kernel.org> wrote:
> On Fri, Jul 25, 2025 at 12:24:28PM +1000, Aleksa Sarai wrote:
> > On 2025-07-24, Christian Brauner <brauner@kernel.org> wrote:
> > > On Wed, Jul 23, 2025 at 09:18:53AM +1000, Aleksa Sarai wrote:
> > > > /proc has historically had very opaque semantics about PID namespaces,
> > > > which is a little unfortunate for container runtimes and other programs
> > > > that deal with switching namespaces very often. One common issue is that
> > > > of converting between PIDs in the process's namespace and PIDs in the
> > > > namespace of /proc.
> > > > 
> > > > In principle, it is possible to do this today by opening a pidfd with
> > > > pidfd_open(2) and then looking at /proc/self/fdinfo/$n (which will
> > > > contain a PID value translated to the pid namespace associated with that
> > > > procfs superblock). However, allocating a new file for each PID to be
> > > > converted is less than ideal for programs that may need to scan procfs,
> > > > and it is generally useful for userspace to be able to finally get this
> > > > information from procfs.
> > > > 
> > > > So, add a new API for this in the form of an ioctl(2) you can call on
> > > > the root directory of procfs. The returned file descriptor will have
> > > > O_CLOEXEC set. This acts as a sister feature to the new "pidns" mount
> > > > option, finally allowing userspace full control of the pid namespaces
> > > > associated with procfs instances.
> > > > 
> > > > The permission model for this is a bit looser than that of the "pidns"
> > > > mount option, but this is mainly because /proc/1/ns/pid provides the
> > > > same information, so as long as you have access to that magic-link (or
> > > > something equivalently reasonable such as privileges with CAP_SYS_ADMIN
> > > > or being in an ancestor pid namespace) it makes sense to allow userspace
> > > > to grab a handle. setns(2) will still have its own permission checks,
> > > > so being able to open a pidns handle doesn't really provide too many
> > > > other capabilities.
> > > > 
> > > > Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
> > > > ---
> > > >  Documentation/filesystems/proc.rst |  4 +++
> > > >  fs/proc/root.c                     | 54 ++++++++++++++++++++++++++++++++++++--
> > > >  include/uapi/linux/fs.h            |  3 +++
> > > >  3 files changed, 59 insertions(+), 2 deletions(-)
> > > > 
> > > > diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
> > > > index c520b9f8a3fd..506383273c9d 100644
> > > > --- a/Documentation/filesystems/proc.rst
> > > > +++ b/Documentation/filesystems/proc.rst
> > > > @@ -2398,6 +2398,10 @@ pidns= specifies a pid namespace (either as a string path to something like
> > > >  will be used by the procfs instance when translating pids. By default, procfs
> > > >  will use the calling process's active pid namespace.
> > > >  
> > > > +Processes can check which pid namespace is used by a procfs instance by using
> > > > +the `PROCFS_GET_PID_NAMESPACE` ioctl() on the root directory of the procfs
> > > > +instance.
> > > > +
> > > >  Chapter 5: Filesystem behavior
> > > >  ==============================
> > > >  
> > > > diff --git a/fs/proc/root.c b/fs/proc/root.c
> > > > index 057c8a125c6e..548a57ec2152 100644
> > > > --- a/fs/proc/root.c
> > > > +++ b/fs/proc/root.c
> > > > @@ -23,8 +23,10 @@
> > > >  #include <linux/cred.h>
> > > >  #include <linux/magic.h>
> > > >  #include <linux/slab.h>
> > > > +#include <linux/ptrace.h>
> > > >  
> > > >  #include "internal.h"
> > > > +#include "../internal.h"
> > > >  
> > > >  struct proc_fs_context {
> > > >  	struct pid_namespace	*pid_ns;
> > > > @@ -418,15 +420,63 @@ static int proc_root_readdir(struct file *file, struct dir_context *ctx)
> > > >  	return proc_pid_readdir(file, ctx);
> > > >  }
> > > >  
> > > > +static long int proc_root_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
> > > > +{
> > > > +	switch (cmd) {
> > > > +#ifdef CONFIG_PID_NS
> > > > +	case PROCFS_GET_PID_NAMESPACE: {
> > > > +		struct pid_namespace *active = task_active_pid_ns(current);
> > > > +		struct pid_namespace *ns = proc_pid_ns(file_inode(filp)->i_sb);
> > > > +		bool can_access_pidns = false;
> > > > +
> > > > +		/*
> > > > +		 * If we are in an ancestor of the pidns, or have join
> > > > +		 * privileges (CAP_SYS_ADMIN), then it makes sense that we
> > > > +		 * would be able to grab a handle to the pidns.
> > > > +		 *
> > > > +		 * Otherwise, if there is a root process, then being able to
> > > > +		 * access /proc/$pid/ns/pid is equivalent to this ioctl and so
> > > > +		 * we should probably match the permission model. For empty
> > > > +		 * namespaces it seems unlikely for there to be a downside to
> > > > +		 * allowing unprivileged users to open a handle to it (setns
> > > > +		 * will fail for unprivileged users anyway).
> > > > +		 */
> > > > +		can_access_pidns = pidns_is_ancestor(ns, active) ||
> > > > +				   ns_capable(ns->user_ns, CAP_SYS_ADMIN);
> > > 
> > > This seems to imply that if @ns is a descendant of @active that the
> > > caller holds privileges over it. Is that actually always true?
> > > 
> > > IOW, why is the check different from the previous pidns= mount option
> > > check. I would've expected:
> > > 
> > > ns_capable(_no_audit)(ns->user_ns) && pidns_is_ancestor(ns, active)
> > > 
> > > and then the ptrace check as a fallback.
> > 
> > That would mirror pidns_install(), and I did think about it. The primary
> > (mostly handwave-y) reasoning I had for making it less strict was that:
> > 
> >  * If you are in an ancestor pidns, then you can already see those
> >    processes in your own /proc. In theory that means that you will be
> >    able to access /proc/$pid/ns/pid for at least some subprocess there
> >    (even if some subprocesses have SUID_DUMP_DISABLE, that flag is
> >    cleared on ).
> > 
> >    Though hypothetically if they are all running as a different user,
> >    this does not apply (and you could create scenarios where a child
> >    pidns is owned by a userns that you do not have privileges over -- if
> >    you deal with setuid binaries). Maybe that risk means we should just
> >    combine them, I'm not sure.
> > 
> >  * If you have CAP_SYS_ADMIN permissions over the pidns, it seems
> >    strange to disallow access even if it is not in an ancestor
> >    namespace. This is distinct from pidns_install(), where you want to
> >    ensure you cannot escape to a parent pid namespace; this is about
> >    getting a handle to do other operations (i.e. NS_GET_{P,TG}ID_*_PIDNS).
> > 
> > Maybe they should be combined to match pidns_install(), but then I would
> > expect the ptrace_may_access() check to apply to all processes in the
> > pidns to make it less restrictive, which is not something you can
> > practically do (and there is a higher chance that pid1 will have
> > SUID_DUMP_DISABLE than some random subprocess, which almost certainly
> > will not be SUID_DUMP_DISABLE).
> > 
> > Fundamentally, I guess I'm still trying to see what the risk is of
> > allowing a process to get a handle to a pidns that they have some kind
> > of privilege over (whether it's CAP_SYS_ADMIN, or by virtue of being
> 
> There shouldn't be. For example, you kinda implicitly do that with a
> pidfd, no? Because you can pass the pidfd to setns() instead of a
> namespace fd itself. Maybe that's the argument you're looking for?

That argument works for me! I'll rewrite the commit message to make sure
it sounds like I came up with it. ;)

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
https://www.cyphar.com/


Thread overview: 13+ messages
2025-07-22 23:18 [PATCH RFC v2 0/4] procfs: make reference pidns more user-visible Aleksa Sarai
2025-07-22 23:18 ` [PATCH RFC v2 1/4] pidns: move is-ancestor logic to helper Aleksa Sarai
2025-07-24  7:06   ` Christian Brauner
2025-07-22 23:18 ` [PATCH RFC v2 2/4] procfs: add "pidns" mount option Aleksa Sarai
2025-07-24  7:25   ` Christian Brauner
2025-07-25  2:13     ` Aleksa Sarai
2025-07-22 23:18 ` [PATCH RFC v2 3/4] procfs: add PROCFS_GET_PID_NAMESPACE ioctl Aleksa Sarai
2025-07-24  7:34   ` Christian Brauner
2025-07-25  2:24     ` Aleksa Sarai
2025-07-31 10:31       ` Christian Brauner
2025-07-31 14:21         ` Aleksa Sarai
2025-07-22 23:18 ` [PATCH RFC v2 4/4] selftests/proc: add tests for new pidns APIs Aleksa Sarai
2025-07-24  7:36 ` [PATCH RFC v2 0/4] procfs: make reference pidns more user-visible Christian Brauner
