[PATCH v4 0/4] procfs: make reference pidns more user-visible

linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH v4 0/4] procfs: make reference pidns more user-visible
@ 2025-08-05  5:45 Aleksa Sarai
  2025-08-05  5:45 ` [PATCH v4 1/4] pidns: move is-ancestor logic to helper Aleksa Sarai
                   ` (5 more replies)
  0 siblings, 6 replies; 18+ messages in thread
From: Aleksa Sarai @ 2025-08-05  5:45 UTC (permalink / raw)
  To: Alexander Viro, Christian Brauner, Jan Kara, Jonathan Corbet,
	Shuah Khan
  Cc: Andy Lutomirski, linux-kernel, linux-fsdevel, linux-api,
	linux-doc, linux-kselftest, Aleksa Sarai

Ever since the introduction of pid namespaces, procfs has had very
implicit behaviour surrounding them (the pidns used by a procfs mount is
auto-selected based on the mounting process's active pidns, and the
pidns itself is basically hidden once the mount has been constructed).

/* pidns mount option for procfs */

This implicit behaviour has historically meant that userspace was
required to do some special dances in order to configure the pidns of a
procfs mount as desired. Examples include:

 * In order to bypass the mnt_too_revealing() check, Kubernetes creates
   a procfs mount from an empty pidns so that user namespaced containers
   can be nested (without this, the nested containers would fail to
   mount procfs). But this requires forking off a helper process because
   you cannot just one-shot this using mount(2).

 * Container runtimes in general need to fork into a container before
   configuring its mounts, which can lead to security issues in the case
   of shared-pidns containers (a privileged process in the pidns can
   interact with your container runtime process). While
   SUID_DUMP_DISABLE and user namespaces make this less of an issue, the
   strict need for this due to a minor uAPI wart is kind of unfortunate.

Things would be much easier if there was a way for userspace to just
specify the pidns they want. Patch 1 implements a new "pidns" argument
which can be set using fsconfig(2):

    fsconfig(procfd, FSCONFIG_SET_FD, "pidns", NULL, nsfd);
    fsconfig(procfd, FSCONFIG_SET_STRING, "pidns", "/proc/self/ns/pid", 0);

or classic mount(2) / mount(8):

    // mount -t proc -o pidns=/proc/self/ns/pid proc /tmp/proc
    mount("proc", "/tmp/proc", "proc", MS_..., "pidns=/proc/self/ns/pid");

The initial security model I have in this RFC is to be as conservative
as possible and just mirror the security model for setns(2) -- which
means that you can only set pidns=... to pid namespaces that your
current pid namespace is a direct ancestor of and you have CAP_SYS_ADMIN
privileges over the pid namespace. This fulfils the requirements of
container runtimes, but I suspect that this may be too strict for some
usecases.

The pidns argument is not displayed in mountinfo -- it's not clear to me
what value it would make sense to show (maybe we could just use ns_dname
to provide an identifier for the namespace, but this number would be
fairly useless to userspace). I'm open to suggestions. Note that
PROCFS_GET_PID_NAMESPACE (see below) does at least let userspace get
information about this outside of mountinfo.

Note that you cannot change the pidns of an already-created procfs
instance. The primary reason is that allowing this to be changed would
require RCU-protecting proc_pid_ns(sb) and thus auditing all of
fs/proc/* and some of the users in fs/* to make sure they wouldn't UAF
the pid namespace. Since creating procfs instances is very cheap, it
seems unnecessary to overcomplicate this upfront. Trying to reconfigure
procfs this way errors out with -EBUSY.

/* ioctl(PROCFS_GET_PID_NAMESPACE) */

In addition, being able to figure out what pid namespace is being used
by a procfs mount is quite useful when you have an administrative
process (such as a container runtime) which wants to figure out the
correct way of mapping PIDs between its own namespace and the namespace
for procfs (using NS_GET_{PID,TGID}_{IN,FROM}_PIDNS). There are
alternative ways to do this, but they all rely on ancillary information
that third-party libraries and tools do not necessarily have access to.

To make this easier, add a new ioctl (PROCFS_GET_PID_NAMESPACE) which
can be used to get a reference to the pidns that a procfs is using.

Rather than copying the (fairly strict) security model for setns(2),
apply a slightly looser model to better match what userspace can already
do:

 * Make the ioctl only valid on the root (meaning that a process without
   access to the procfs root -- such as only having an fd to a procfs
   file or some open_tree(2)-like subset -- cannot use this API). This
   means that the process already has some level of access to the
   /proc/$pid directories.

 * If the calling process is in an ancestor pidns, then they can already
   create pidfd for processes inside the pidns, which is morally
   equivalent to a pidns file descriptor according to setns(2). So it
   seems reasonable to just allow it in this case. (The justification
   for this model was suggested by Christian.)

 * If the process has access to /proc/1/ns/pid already (i.e. has
   ptrace-read access to the pidns pid1), then this ioctl is equivalent
   to just opening a handle to it that way.

   Ideally we would check for ptrace-read access against all processes
   in the pidns (which is very likely to be true for at least one
   process, as SUID_DUMP_DISABLE is cleared on exec(2) and is rarely set
   by most programs), but this would obviously not scale.

I'm open to suggestions for whether we need to make this stricter (or
possibly allow more cases).

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
---
Changes in v4:
- Remove unneeded EXPORT_SYMBOL_GPL. [Christian Brauner]
- Return -EOPNOTSUPP for new APIs for CONFIG_PID_NS=n rather than
  pretending they don't exist entirely. [Christian Brauner]
- PROCFS_IOCTL_MAGIC conflicts with XSDFEC_MAGIC, so we need to allocate
  subvalues more carefully (switch to _IO(PROCFS_IOCTL_MAGIC, 32)).
- Add some more selftests for PROCFS_GET_PID_NAMESPACE.
- Reword argument for PROCFS_GET_PID_NAMESPACE security model based on
  Christian's suggestion, and remove CAP_SYS_ADMIN edge-case (in most
  cases, such a process would also have ptrace-read credentials over the
  pidns pid1).
- v3: <https://lore.kernel.org/r/20250724-procfs-pidns-api-v3-0-4c685c910923@cyphar.com>

Changes in v3:
- Disallow changing pidns for existing procfs instances, as we'd
  probably have to RCU-protect everything that touches the pinned pidns
  reference.
- Improve tests with slightly nicer ASSERT_ERRNO* macros.
- v2: <https://lore.kernel.org/r/20250723-procfs-pidns-api-v2-0-621e7edd8e40@cyphar.com>

Changes in v2:
- #ifdef CONFIG_PID_NS
- Improve cover letter wording to make it clear we're talking about two
  separate features with different permission models. [Andy Lutomirski]
- Fix build warnings in pidns_is_ancestor() patch. [kernel test robot]
- v1: <https://lore.kernel.org/r/20250721-procfs-pidns-api-v1-0-5cd9007e512d@cyphar.com>

---
Aleksa Sarai (4):
      pidns: move is-ancestor logic to helper
      procfs: add "pidns" mount option
      procfs: add PROCFS_GET_PID_NAMESPACE ioctl
      selftests/proc: add tests for new pidns APIs

 Documentation/filesystems/proc.rst        |  12 ++
 fs/proc/root.c                            | 166 +++++++++++++++-
 include/linux/pid_namespace.h             |   9 +
 include/uapi/linux/fs.h                   |   4 +
 kernel/pid_namespace.c                    |  22 ++-
 tools/testing/selftests/proc/.gitignore   |   1 +
 tools/testing/selftests/proc/Makefile     |   1 +
 tools/testing/selftests/proc/proc-pidns.c | 315 ++++++++++++++++++++++++++++++
 8 files changed, 514 insertions(+), 16 deletions(-)
---
base-commit: 66639db858112bf6b0f76677f7517643d586e575
change-id: 20250717-procfs-pidns-api-8ed1583431f0

Best regards,
-- 
Aleksa Sarai <cyphar@cyphar.com>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH v4 1/4] pidns: move is-ancestor logic to helper
  2025-08-05  5:45 [PATCH v4 0/4] procfs: make reference pidns more user-visible Aleksa Sarai
@ 2025-08-05  5:45 ` Aleksa Sarai
  2025-08-05  5:45 ` [PATCH v4 2/4] procfs: add "pidns" mount option Aleksa Sarai
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 18+ messages in thread
From: Aleksa Sarai @ 2025-08-05  5:45 UTC (permalink / raw)
  To: Alexander Viro, Christian Brauner, Jan Kara, Jonathan Corbet,
	Shuah Khan
  Cc: Andy Lutomirski, linux-kernel, linux-fsdevel, linux-api,
	linux-doc, linux-kselftest, Aleksa Sarai

This check will be needed in later patches, and there's no point
open-coding it each time.

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
---
 include/linux/pid_namespace.h |  9 +++++++++
 kernel/pid_namespace.c        | 22 ++++++++++++++--------
 2 files changed, 23 insertions(+), 8 deletions(-)

diff --git a/include/linux/pid_namespace.h b/include/linux/pid_namespace.h
index 7c67a5811199..17fdc059f8da 100644
--- a/include/linux/pid_namespace.h
+++ b/include/linux/pid_namespace.h
@@ -84,6 +84,9 @@ extern void zap_pid_ns_processes(struct pid_namespace *pid_ns);
 extern int reboot_pid_ns(struct pid_namespace *pid_ns, int cmd);
 extern void put_pid_ns(struct pid_namespace *ns);
 
+extern bool pidns_is_ancestor(struct pid_namespace *child,
+			      struct pid_namespace *ancestor);
+
 #else /* !CONFIG_PID_NS */
 #include <linux/err.h>
 
@@ -118,6 +121,12 @@ static inline int reboot_pid_ns(struct pid_namespace *pid_ns, int cmd)
 {
 	return 0;
 }
+
+static inline bool pidns_is_ancestor(struct pid_namespace *child,
+				     struct pid_namespace *ancestor)
+{
+	return false;
+}
 #endif /* CONFIG_PID_NS */
 
 extern struct pid_namespace *task_active_pid_ns(struct task_struct *tsk);
diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
index 7098ed44e717..b7b45c2597ec 100644
--- a/kernel/pid_namespace.c
+++ b/kernel/pid_namespace.c
@@ -390,11 +390,23 @@ static void pidns_put(struct ns_common *ns)
 	put_pid_ns(to_pid_ns(ns));
 }
 
+bool pidns_is_ancestor(struct pid_namespace *child,
+		       struct pid_namespace *ancestor)
+{
+	struct pid_namespace *ns;
+
+	if (child->level < ancestor->level)
+		return false;
+	for (ns = child; ns->level > ancestor->level; ns = ns->parent)
+		;
+	return ns == ancestor;
+}
+
 static int pidns_install(struct nsset *nsset, struct ns_common *ns)
 {
 	struct nsproxy *nsproxy = nsset->nsproxy;
 	struct pid_namespace *active = task_active_pid_ns(current);
-	struct pid_namespace *ancestor, *new = to_pid_ns(ns);
+	struct pid_namespace *new = to_pid_ns(ns);
 
 	if (!ns_capable(new->user_ns, CAP_SYS_ADMIN) ||
 	    !ns_capable(nsset->cred->user_ns, CAP_SYS_ADMIN))
@@ -408,13 +420,7 @@ static int pidns_install(struct nsset *nsset, struct ns_common *ns)
 	 * this maintains the property that processes and their
 	 * children can not escape their current pid namespace.
 	 */
-	if (new->level < active->level)
-		return -EINVAL;
-
-	ancestor = new;
-	while (ancestor->level > active->level)
-		ancestor = ancestor->parent;
-	if (ancestor != active)
+	if (!pidns_is_ancestor(new, active))
 		return -EINVAL;
 
 	put_pid_ns(nsproxy->pid_ns_for_children);

-- 
2.50.1


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH v4 2/4] procfs: add "pidns" mount option
  2025-08-05  5:45 [PATCH v4 0/4] procfs: make reference pidns more user-visible Aleksa Sarai
  2025-08-05  5:45 ` [PATCH v4 1/4] pidns: move is-ancestor logic to helper Aleksa Sarai
@ 2025-08-05  5:45 ` Aleksa Sarai
  2025-08-05  7:29   ` Aleksa Sarai
  2025-08-06  0:19   ` Randy Dunlap
  2025-08-05  5:45 ` [PATCH v4 3/4] procfs: add PROCFS_GET_PID_NAMESPACE ioctl Aleksa Sarai
                   ` (3 subsequent siblings)
  5 siblings, 2 replies; 18+ messages in thread
From: Aleksa Sarai @ 2025-08-05  5:45 UTC (permalink / raw)
  To: Alexander Viro, Christian Brauner, Jan Kara, Jonathan Corbet,
	Shuah Khan
  Cc: Andy Lutomirski, linux-kernel, linux-fsdevel, linux-api,
	linux-doc, linux-kselftest, Aleksa Sarai

Since the introduction of pid namespaces, their interaction with procfs
has been entirely implicit in ways that require a lot of dancing around
by programs that need to construct sandboxes with different PID
namespaces.

Being able to explicitly specify the pid namespace to use when
constructing a procfs super block will allow programs to no longer need
to fork off a process which does then does unshare(2) / setns(2) and
forks again in order to construct a procfs in a pidns.

So, provide a "pidns" mount option which allows such users to just
explicitly state which pid namespace they want that procfs instance to
use. This interface can be used with fsconfig(2) either with a file
descriptor or a path:

  fsconfig(procfd, FSCONFIG_SET_FD, "pidns", NULL, nsfd);
  fsconfig(procfd, FSCONFIG_SET_STRING, "pidns", "/proc/self/ns/pid", 0);

or with classic mount(2) / mount(8):

  // mount -t proc -o pidns=/proc/self/ns/pid proc /tmp/proc
  mount("proc", "/tmp/proc", "proc", MS_..., "pidns=/proc/self/ns/pid");

As this new API is effectively shorthand for setns(2) followed by
mount(2), the permission model for this mirrors pidns_install() to avoid
opening up new attack surfaces by loosening the existing permission
model.

In order to avoid having to RCU-protect all users of proc_pid_ns() (to
avoid UAFs), attempting to reconfigure an existing procfs instance's pid
namespace will error out with -EBUSY. Creating new procfs instances is
quite cheap, so this should not be an impediment to most users, and lets
us avoid a lot of churn in fs/proc/* for a feature that it seems
unlikely userspace would use.

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
---
 Documentation/filesystems/proc.rst |  8 ++++
 fs/proc/root.c                     | 98 +++++++++++++++++++++++++++++++++++---
 2 files changed, 100 insertions(+), 6 deletions(-)

diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
index 5236cb52e357..5a157dadea0b 100644
--- a/Documentation/filesystems/proc.rst
+++ b/Documentation/filesystems/proc.rst
@@ -2360,6 +2360,7 @@ The following mount options are supported:
 	hidepid=	Set /proc/<pid>/ access mode.
 	gid=		Set the group authorized to learn processes information.
 	subset=		Show only the specified subset of procfs.
+	pidns=		Specify a the namespace used by this procfs.
 	=========	========================================================
 
 hidepid=off or hidepid=0 means classic mode - everybody may access all
@@ -2392,6 +2393,13 @@ information about processes information, just add identd to this group.
 subset=pid hides all top level files and directories in the procfs that
 are not related to tasks.
 
+pidns= specifies a pid namespace (either as a string path to something like
+`/proc/$pid/ns/pid`, or a file descriptor when using `FSCONFIG_SET_FD`) that
+will be used by the procfs instance when translating pids. By default, procfs
+will use the calling process's active pid namespace. Note that the pid
+namespace of an existing procfs instance cannot be modified (attempting to do
+so will give an `-EBUSY` error).
+
 Chapter 5: Filesystem behavior
 ==============================
 
diff --git a/fs/proc/root.c b/fs/proc/root.c
index ed86ac710384..fd1f1c8a939a 100644
--- a/fs/proc/root.c
+++ b/fs/proc/root.c
@@ -38,12 +38,14 @@ enum proc_param {
 	Opt_gid,
 	Opt_hidepid,
 	Opt_subset,
+	Opt_pidns,
 };
 
 static const struct fs_parameter_spec proc_fs_parameters[] = {
-	fsparam_u32("gid",	Opt_gid),
+	fsparam_u32("gid",		Opt_gid),
 	fsparam_string("hidepid",	Opt_hidepid),
 	fsparam_string("subset",	Opt_subset),
+	fsparam_file_or_string("pidns",	Opt_pidns),
 	{}
 };
 
@@ -109,11 +111,66 @@ static int proc_parse_subset_param(struct fs_context *fc, char *value)
 	return 0;
 }
 
+#ifdef CONFIG_PID_NS
+static int proc_parse_pidns_param(struct fs_context *fc,
+				  struct fs_parameter *param,
+				  struct fs_parse_result *result)
+{
+	struct proc_fs_context *ctx = fc->fs_private;
+	struct pid_namespace *target, *active = task_active_pid_ns(current);
+	struct ns_common *ns;
+	struct file *ns_filp __free(fput) = NULL;
+
+	switch (param->type) {
+	case fs_value_is_file:
+		/* came through fsconfig, steal the file reference */
+		ns_filp = no_free_ptr(param->file);
+		break;
+	case fs_value_is_string:
+		ns_filp = filp_open(param->string, O_RDONLY, 0);
+		break;
+	default:
+		WARN_ON_ONCE(true);
+		break;
+	}
+	if (!ns_filp)
+		ns_filp = ERR_PTR(-EBADF);
+	if (IS_ERR(ns_filp)) {
+		errorfc(fc, "could not get file from pidns argument");
+		return PTR_ERR(ns_filp);
+	}
+
+	if (!proc_ns_file(ns_filp))
+		return invalfc(fc, "pidns argument is not an nsfs file");
+	ns = get_proc_ns(file_inode(ns_filp));
+	if (ns->ops->type != CLONE_NEWPID)
+		return invalfc(fc, "pidns argument is not a pidns file");
+	target = container_of(ns, struct pid_namespace, ns);
+
+	/*
+	 * pidns= is shorthand for joining the pidns to get a fsopen fd, so the
+	 * permission model should be the same as pidns_install().
+	 */
+	if (!ns_capable(target->user_ns, CAP_SYS_ADMIN)) {
+		errorfc(fc, "insufficient permissions to set pidns");
+		return -EPERM;
+	}
+	if (!pidns_is_ancestor(target, active))
+		return invalfc(fc, "cannot set pidns to non-descendant pidns");
+
+	put_pid_ns(ctx->pid_ns);
+	ctx->pid_ns = get_pid_ns(target);
+	put_user_ns(fc->user_ns);
+	fc->user_ns = get_user_ns(ctx->pid_ns->user_ns);
+	return 0;
+}
+#endif /* CONFIG_PID_NS */
+
 static int proc_parse_param(struct fs_context *fc, struct fs_parameter *param)
 {
 	struct proc_fs_context *ctx = fc->fs_private;
 	struct fs_parse_result result;
-	int opt;
+	int opt, err;
 
 	opt = fs_parse(fc, proc_fs_parameters, param, &result);
 	if (opt < 0)
@@ -125,14 +182,38 @@ static int proc_parse_param(struct fs_context *fc, struct fs_parameter *param)
 		break;
 
 	case Opt_hidepid:
-		if (proc_parse_hidepid_param(fc, param))
-			return -EINVAL;
+		err = proc_parse_hidepid_param(fc, param);
+		if (err)
+			return err;
 		break;
 
 	case Opt_subset:
-		if (proc_parse_subset_param(fc, param->string) < 0)
-			return -EINVAL;
+		err = proc_parse_subset_param(fc, param->string);
+		if (err)
+			return err;
+		break;
+
+	case Opt_pidns:
+#ifdef CONFIG_PID_NS
+		/*
+		 * We would have to RCU-protect every proc_pid_ns() or
+		 * proc_sb_info() access if we allowed this to be reconfigured
+		 * for an existing procfs instance. Luckily, procfs instances
+		 * are cheap to create, and mount-beneath would let you
+		 * atomically replace an instance even with overmounts.
+		 */
+		if (fc->purpose == FS_CONTEXT_FOR_RECONFIGURE) {
+			errorfc(fc, "cannot reconfigure pidns for existing procfs");
+			return -EBUSY;
+		}
+		err = proc_parse_pidns_param(fc, param, &result);
+		if (err)
+			return err;
 		break;
+#else
+		errorfc(fc, "pidns mount flag not supported on this system");
+		return -EOPNOTSUPP;
+#endif
 
 	default:
 		return -EINVAL;
@@ -154,6 +235,11 @@ static void proc_apply_options(struct proc_fs_info *fs_info,
 		fs_info->hide_pid = ctx->hidepid;
 	if (ctx->mask & (1 << Opt_subset))
 		fs_info->pidonly = ctx->pidonly;
+	if (ctx->mask & (1 << Opt_pidns) &&
+	    !WARN_ON_ONCE(fc->purpose == FS_CONTEXT_FOR_RECONFIGURE)) {
+		put_pid_ns(fs_info->pid_ns);
+		fs_info->pid_ns = get_pid_ns(ctx->pid_ns);
+	}
 }
 
 static int proc_fill_super(struct super_block *s, struct fs_context *fc)

-- 
2.50.1


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH v4 3/4] procfs: add PROCFS_GET_PID_NAMESPACE ioctl
  2025-08-05  5:45 [PATCH v4 0/4] procfs: make reference pidns more user-visible Aleksa Sarai
  2025-08-05  5:45 ` [PATCH v4 1/4] pidns: move is-ancestor logic to helper Aleksa Sarai
  2025-08-05  5:45 ` [PATCH v4 2/4] procfs: add "pidns" mount option Aleksa Sarai
@ 2025-08-05  5:45 ` Aleksa Sarai
  2025-08-06  0:25   ` Randy Dunlap
  2025-08-05  5:45 ` [PATCH v4 4/4] selftests/proc: add tests for new pidns APIs Aleksa Sarai
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 18+ messages in thread
From: Aleksa Sarai @ 2025-08-05  5:45 UTC (permalink / raw)
  To: Alexander Viro, Christian Brauner, Jan Kara, Jonathan Corbet,
	Shuah Khan
  Cc: Andy Lutomirski, linux-kernel, linux-fsdevel, linux-api,
	linux-doc, linux-kselftest, Aleksa Sarai

/proc has historically had very opaque semantics about PID namespaces,
which is a little unfortunate for container runtimes and other programs
that deal with switching namespaces very often. One common issue is that
of converting between PIDs in the process's namespace and PIDs in the
namespace of /proc.

In principle, it is possible to do this today by opening a pidfd with
pidfd_open(2) and then looking at /proc/self/fdinfo/$n (which will
contain a PID value translated to the pid namespace associated with that
procfs superblock). However, allocating a new file for each PID to be
converted is less than ideal for programs that may need to scan procfs,
and it is generally useful for userspace to be able to finally get this
information from procfs.

So, add a new API to get the pid namespace of a procfs instance, in the
form of an ioctl(2) you can call on the root directory of said procfs.
The returned file descriptor will have O_CLOEXEC set. This acts as a
sister feature to the new "pidns" mount option, finally allowing
userspace full control of the pid namespaces associated with procfs
instances.

The permission model for this is a bit looser than that of the "pidns"
mount option (and also setns(2)) because /proc/1/ns/pid provides the
same information, so as long as you have access to that magic-link (or
something equivalently reasonable such as being in an ancestor pid
namespace) it makes sense to allow userspace to grab a handle. Ideally
we would check for ptrace-read access against all processes in the pidns
(which is very likely to be true for at least one process, as
SUID_DUMP_DISABLE is cleared on exec(2) and is rarely set by most
programs), but this would obviously not scale.

setns(2) will still have their own permission checks, so being able to
open a pidns handle doesn't really provide too many other capabilities.

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
---
 Documentation/filesystems/proc.rst |  4 +++
 fs/proc/root.c                     | 68 ++++++++++++++++++++++++++++++++++++--
 include/uapi/linux/fs.h            |  4 +++
 3 files changed, 74 insertions(+), 2 deletions(-)

diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
index 5a157dadea0b..840f820fb467 100644
--- a/Documentation/filesystems/proc.rst
+++ b/Documentation/filesystems/proc.rst
@@ -2400,6 +2400,10 @@ will use the calling process's active pid namespace. Note that the pid
 namespace of an existing procfs instance cannot be modified (attempting to do
 so will give an `-EBUSY` error).
 
+Processes can check which pid namespace is used by a procfs instance by using
+the `PROCFS_GET_PID_NAMESPACE` ioctl() on the root directory of the procfs
+instance.
+
 Chapter 5: Filesystem behavior
 ==============================
 
diff --git a/fs/proc/root.c b/fs/proc/root.c
index fd1f1c8a939a..ac9b115fad7b 100644
--- a/fs/proc/root.c
+++ b/fs/proc/root.c
@@ -23,8 +23,10 @@
 #include <linux/cred.h>
 #include <linux/magic.h>
 #include <linux/slab.h>
+#include <linux/ptrace.h>
 
 #include "internal.h"
+#include "../internal.h"
 
 struct proc_fs_context {
 	struct pid_namespace	*pid_ns;
@@ -426,15 +428,77 @@ static int proc_root_readdir(struct file *file, struct dir_context *ctx)
 	return proc_pid_readdir(file, ctx);
 }
 
+static long int proc_root_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
+{
+	switch (cmd) {
+	case PROCFS_GET_PID_NAMESPACE: {
+#ifdef CONFIG_PID_NS
+		struct pid_namespace *active = task_active_pid_ns(current);
+		struct pid_namespace *ns = proc_pid_ns(file_inode(filp)->i_sb);
+		bool can_access_pidns = false;
+
+		/*
+		 * Having a handle to a pidns is not sufficient to do anything
+		 * particularly harmful, as setns(2) has its own separate
+		 * privilege checks. So, we can loosen the privilege
+		 * requirements here a little to make this more ergonomic.
+		 *
+		 * If we are in an ancestor pidns of the pidns, then we can
+		 * already address any process in the pidns. From a setns(2)
+		 * privileges perspective, we can create a pidfd which setns(2)
+		 * would also accept (pending any privilege checks).
+		 *
+		 * If we are not in an ancestor pidns, because this operation
+		 * is being done on the root of the /proc instance, the caller
+		 * can try to access /proc/1/ns/pid which is equivalent to this
+		 * ioctl and so we should copy the PTRACE_MODE_READ_FSCREDS
+		 * permission model use by proc_ns_get_link(). Ideally we would
+		 * check for ptrace-read access against all processes in the
+		 * pidns (which is very likely to be true for at least one
+		 * process, as SUID_DUMP_DISABLE is cleared on exec(2) and is
+		 * rarely set by most programs), but this would obviously not
+		 * scale.
+		 *
+		 * If there is no root process, then there is no real downside
+		 * to unprivileged users to open a handle to it.
+		 */
+		can_access_pidns = pidns_is_ancestor(ns, active);
+		if (!can_access_pidns) {
+			bool cannot_ptrace_pid1 = false;
+
+			read_lock(&tasklist_lock);
+			if (ns->child_reaper)
+				cannot_ptrace_pid1 = ptrace_may_access(ns->child_reaper,
+								       PTRACE_MODE_READ_FSCREDS);
+			read_unlock(&tasklist_lock);
+			can_access_pidns = !cannot_ptrace_pid1;
+		}
+		if (!can_access_pidns)
+			return -EPERM;
+
+		/* open_namespace() unconditionally consumes the reference. */
+		get_pid_ns(ns);
+		return open_namespace(to_ns_common(ns));
+#else
+		return -EOPNOTSUPP;
+#endif
+	}
+	default:
+		return -ENOIOCTLCMD;
+	}
+}
+
 /*
  * The root /proc directory is special, as it has the
  * <pid> directories. Thus we don't use the generic
  * directory handling functions for that..
  */
 static const struct file_operations proc_root_operations = {
-	.read		 = generic_read_dir,
-	.iterate_shared	 = proc_root_readdir,
+	.read		= generic_read_dir,
+	.iterate_shared	= proc_root_readdir,
 	.llseek		= generic_file_llseek,
+	.unlocked_ioctl = proc_root_ioctl,
+	.compat_ioctl   = compat_ptr_ioctl,
 };
 
 /*
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index 0bd678a4a10e..68e65e6d7d6b 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -435,8 +435,12 @@ typedef int __bitwise __kernel_rwf_t;
 			 RWF_APPEND | RWF_NOAPPEND | RWF_ATOMIC |\
 			 RWF_DONTCACHE)
 
+/* This matches XSDFEC_MAGIC, so we need to allocate subvalues carefully. */
 #define PROCFS_IOCTL_MAGIC 'f'
 
+/* procfs root ioctls */
+#define PROCFS_GET_PID_NAMESPACE	_IO(PROCFS_IOCTL_MAGIC, 32)
+
 /* Pagemap ioctl */
 #define PAGEMAP_SCAN	_IOWR(PROCFS_IOCTL_MAGIC, 16, struct pm_scan_arg)
 

-- 
2.50.1


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH v4 4/4] selftests/proc: add tests for new pidns APIs
  2025-08-05  5:45 [PATCH v4 0/4] procfs: make reference pidns more user-visible Aleksa Sarai
                   ` (2 preceding siblings ...)
  2025-08-05  5:45 ` [PATCH v4 3/4] procfs: add PROCFS_GET_PID_NAMESPACE ioctl Aleksa Sarai
@ 2025-08-05  5:45 ` Aleksa Sarai
  2025-09-02  9:54 ` (subset) [PATCH v4 0/4] procfs: make reference pidns more user-visible Christian Brauner
  2025-09-02 10:02 ` Christian Brauner
  5 siblings, 0 replies; 18+ messages in thread
From: Aleksa Sarai @ 2025-08-05  5:45 UTC (permalink / raw)
  To: Alexander Viro, Christian Brauner, Jan Kara, Jonathan Corbet,
	Shuah Khan
  Cc: Andy Lutomirski, linux-kernel, linux-fsdevel, linux-api,
	linux-doc, linux-kselftest, Aleksa Sarai

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
---
 tools/testing/selftests/proc/.gitignore   |   1 +
 tools/testing/selftests/proc/Makefile     |   1 +
 tools/testing/selftests/proc/proc-pidns.c | 315 ++++++++++++++++++++++++++++++
 3 files changed, 317 insertions(+)

diff --git a/tools/testing/selftests/proc/.gitignore b/tools/testing/selftests/proc/.gitignore
index 973968f45bba..2dced03e9e0e 100644
--- a/tools/testing/selftests/proc/.gitignore
+++ b/tools/testing/selftests/proc/.gitignore
@@ -17,6 +17,7 @@
 /proc-tid0
 /proc-uptime-001
 /proc-uptime-002
+/proc-pidns
 /read
 /self
 /setns-dcache
diff --git a/tools/testing/selftests/proc/Makefile b/tools/testing/selftests/proc/Makefile
index b12921b9794b..c6f7046b9860 100644
--- a/tools/testing/selftests/proc/Makefile
+++ b/tools/testing/selftests/proc/Makefile
@@ -27,5 +27,6 @@ TEST_GEN_PROGS += setns-sysvipc
 TEST_GEN_PROGS += thread-self
 TEST_GEN_PROGS += proc-multiple-procfs
 TEST_GEN_PROGS += proc-fsconfig-hidepid
+TEST_GEN_PROGS += proc-pidns
 
 include ../lib.mk
diff --git a/tools/testing/selftests/proc/proc-pidns.c b/tools/testing/selftests/proc/proc-pidns.c
new file mode 100644
index 000000000000..f7dd80a2c150
--- /dev/null
+++ b/tools/testing/selftests/proc/proc-pidns.c
@@ -0,0 +1,315 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Author: Aleksa Sarai <cyphar@cyphar.com>
+ * Copyright (C) 2025 SUSE LLC.
+ */
+
+#include <assert.h>
+#include <errno.h>
+#include <sched.h>
+#include <stdbool.h>
+#include <stdlib.h>
+#include <string.h>
+#include <unistd.h>
+#include <stdio.h>
+#include <sys/mount.h>
+#include <sys/stat.h>
+#include <sys/ioctl.h>
+#include <sys/prctl.h>
+
+#include "../kselftest_harness.h"
+
+#define ASSERT_ERRNO(expected, _t, seen)				\
+	__EXPECT(expected, #expected,					\
+		({__typeof__(seen) _tmp_seen = (seen);			\
+		  _tmp_seen >= 0 ? _tmp_seen : -errno; }), #seen, _t, 1)
+
+#define ASSERT_ERRNO_EQ(expected, seen) \
+	ASSERT_ERRNO(expected, ==, seen)
+
+#define ASSERT_SUCCESS(seen) \
+	ASSERT_ERRNO(0, <=, seen)
+
+static int touch(char *path)
+{
+	int fd = open(path, O_WRONLY|O_CREAT|O_CLOEXEC, 0644);
+	if (fd < 0)
+		return -1;
+	return close(fd);
+}
+
+FIXTURE(ns)
+{
+	int host_mntns, host_pidns;
+	int dummy_pidns;
+};
+
+FIXTURE_SETUP(ns)
+{
+	/* Stash the old mntns. */
+	self->host_mntns = open("/proc/self/ns/mnt", O_RDONLY|O_CLOEXEC);
+	ASSERT_SUCCESS(self->host_mntns);
+
+	/* Create a new mount namespace and make it private. */
+	ASSERT_SUCCESS(unshare(CLONE_NEWNS));
+	ASSERT_SUCCESS(mount(NULL, "/", NULL, MS_PRIVATE|MS_REC, NULL));
+
+	/*
+	 * Create a proper tmpfs that we can use and will disappear once we
+	 * leave this mntns.
+	 */
+	ASSERT_SUCCESS(mount("tmpfs", "/tmp", "tmpfs", 0, NULL));
+
+	/*
+	 * Create a pidns we can use for later tests. We need to fork off a
+	 * child so that we get a usable nsfd that we can bind-mount and open.
+	 */
+	ASSERT_SUCCESS(mkdir("/tmp/dummy", 0755));
+	ASSERT_SUCCESS(touch("/tmp/dummy/pidns"));
+	ASSERT_SUCCESS(mkdir("/tmp/dummy/proc", 0755));
+
+	self->host_pidns = open("/proc/self/ns/pid", O_RDONLY|O_CLOEXEC);
+	ASSERT_SUCCESS(self->host_pidns);
+	ASSERT_SUCCESS(unshare(CLONE_NEWPID));
+
+	pid_t pid = fork();
+	ASSERT_SUCCESS(pid);
+	if (!pid) {
+		prctl(PR_SET_PDEATHSIG, SIGKILL);
+		ASSERT_SUCCESS(mount("/proc/self/ns/pid", "/tmp/dummy/pidns", NULL, MS_BIND, NULL));
+		ASSERT_SUCCESS(mount("proc", "/tmp/dummy/proc", "proc", 0, NULL));
+		exit(0);
+	}
+
+	int wstatus;
+	ASSERT_EQ(waitpid(pid, &wstatus, 0), pid);
+	ASSERT_TRUE(WIFEXITED(wstatus));
+	ASSERT_EQ(WEXITSTATUS(wstatus), 0);
+
+	ASSERT_SUCCESS(setns(self->host_pidns, CLONE_NEWPID));
+
+	self->dummy_pidns = open("/tmp/dummy/pidns", O_RDONLY|O_CLOEXEC);
+	ASSERT_SUCCESS(self->dummy_pidns);
+}
+
+FIXTURE_TEARDOWN(ns)
+{
+	ASSERT_SUCCESS(setns(self->host_mntns, CLONE_NEWNS));
+	ASSERT_SUCCESS(close(self->host_mntns));
+
+	ASSERT_SUCCESS(close(self->host_pidns));
+	ASSERT_SUCCESS(close(self->dummy_pidns));
+}
+
+TEST_F(ns, pidns_mount_string_path)
+{
+	ASSERT_SUCCESS(mkdir("/tmp/proc-host", 0755));
+	ASSERT_SUCCESS(mount("proc", "/tmp/proc-host", "proc", 0, "pidns=/proc/self/ns/pid"));
+	ASSERT_SUCCESS(access("/tmp/proc-host/self/", X_OK));
+
+	ASSERT_SUCCESS(mkdir("/tmp/proc-dummy", 0755));
+	ASSERT_SUCCESS(mount("proc", "/tmp/proc-dummy", "proc", 0, "pidns=/tmp/dummy/pidns"));
+	ASSERT_ERRNO_EQ(-ENOENT, access("/tmp/proc-dummy/1/", X_OK));
+	ASSERT_ERRNO_EQ(-ENOENT, access("/tmp/proc-dummy/self/", X_OK));
+}
+
+TEST_F(ns, pidns_fsconfig_string_path)
+{
+	int fsfd = fsopen("proc", FSOPEN_CLOEXEC);
+	ASSERT_SUCCESS(fsfd);
+
+	ASSERT_SUCCESS(fsconfig(fsfd, FSCONFIG_SET_STRING, "pidns", "/tmp/dummy/pidns", 0));
+	ASSERT_SUCCESS(fsconfig(fsfd, FSCONFIG_CMD_CREATE, NULL, NULL, 0));
+
+	int mountfd = fsmount(fsfd, FSMOUNT_CLOEXEC, 0);
+	ASSERT_SUCCESS(mountfd);
+
+	ASSERT_ERRNO_EQ(-ENOENT, faccessat(mountfd, "1/", X_OK, 0));
+	ASSERT_ERRNO_EQ(-ENOENT, faccessat(mountfd, "self/", X_OK, 0));
+
+	ASSERT_SUCCESS(close(fsfd));
+	ASSERT_SUCCESS(close(mountfd));
+}
+
+TEST_F(ns, pidns_fsconfig_fd)
+{
+	int fsfd = fsopen("proc", FSOPEN_CLOEXEC);
+	ASSERT_SUCCESS(fsfd);
+
+	ASSERT_SUCCESS(fsconfig(fsfd, FSCONFIG_SET_FD, "pidns", NULL, self->dummy_pidns));
+	ASSERT_SUCCESS(fsconfig(fsfd, FSCONFIG_CMD_CREATE, NULL, NULL, 0));
+
+	int mountfd = fsmount(fsfd, FSMOUNT_CLOEXEC, 0);
+	ASSERT_SUCCESS(mountfd);
+
+	ASSERT_ERRNO_EQ(-ENOENT, faccessat(mountfd, "1/", X_OK, 0));
+	ASSERT_ERRNO_EQ(-ENOENT, faccessat(mountfd, "self/", X_OK, 0));
+
+	ASSERT_SUCCESS(close(fsfd));
+	ASSERT_SUCCESS(close(mountfd));
+}
+
+TEST_F(ns, pidns_reconfigure_remount)
+{
+	ASSERT_SUCCESS(mkdir("/tmp/proc", 0755));
+	ASSERT_SUCCESS(mount("proc", "/tmp/proc", "proc", 0, ""));
+
+	ASSERT_SUCCESS(access("/tmp/proc/1/", X_OK));
+	ASSERT_SUCCESS(access("/tmp/proc/self/", X_OK));
+
+	ASSERT_ERRNO_EQ(-EBUSY, mount(NULL, "/tmp/proc", NULL, MS_REMOUNT, "pidns=/tmp/dummy/pidns"));
+
+	ASSERT_SUCCESS(access("/tmp/proc/1/", X_OK));
+	ASSERT_SUCCESS(access("/tmp/proc/self/", X_OK));
+}
+
+TEST_F(ns, pidns_reconfigure_fsconfig_string_path)
+{
+	int fsfd = fsopen("proc", FSOPEN_CLOEXEC);
+	ASSERT_SUCCESS(fsfd);
+
+	ASSERT_SUCCESS(fsconfig(fsfd, FSCONFIG_CMD_CREATE, NULL, NULL, 0));
+
+	int mountfd = fsmount(fsfd, FSMOUNT_CLOEXEC, 0);
+	ASSERT_SUCCESS(mountfd);
+
+	ASSERT_SUCCESS(faccessat(mountfd, "1/", X_OK, 0));
+	ASSERT_SUCCESS(faccessat(mountfd, "self/", X_OK, 0));
+
+	ASSERT_ERRNO_EQ(-EBUSY, fsconfig(fsfd, FSCONFIG_SET_STRING, "pidns", "/tmp/dummy/pidns", 0));
+	ASSERT_SUCCESS(fsconfig(fsfd, FSCONFIG_CMD_RECONFIGURE, NULL, NULL, 0)); /* noop */
+
+	ASSERT_SUCCESS(faccessat(mountfd, "1/", X_OK, 0));
+	ASSERT_SUCCESS(faccessat(mountfd, "self/", X_OK, 0));
+
+	ASSERT_SUCCESS(close(fsfd));
+	ASSERT_SUCCESS(close(mountfd));
+}
+
+TEST_F(ns, pidns_reconfigure_fsconfig_fd)
+{
+	int fsfd = fsopen("proc", FSOPEN_CLOEXEC);
+	ASSERT_SUCCESS(fsfd);
+
+	ASSERT_SUCCESS(fsconfig(fsfd, FSCONFIG_CMD_CREATE, NULL, NULL, 0));
+
+	int mountfd = fsmount(fsfd, FSMOUNT_CLOEXEC, 0);
+	ASSERT_SUCCESS(mountfd);
+
+	ASSERT_SUCCESS(faccessat(mountfd, "1/", X_OK, 0));
+	ASSERT_SUCCESS(faccessat(mountfd, "self/", X_OK, 0));
+
+	ASSERT_ERRNO_EQ(-EBUSY, fsconfig(fsfd, FSCONFIG_SET_FD, "pidns", NULL, self->dummy_pidns));
+	ASSERT_SUCCESS(fsconfig(fsfd, FSCONFIG_CMD_RECONFIGURE, NULL, NULL, 0)); /* noop */
+
+	ASSERT_SUCCESS(faccessat(mountfd, "1/", X_OK, 0));
+	ASSERT_SUCCESS(faccessat(mountfd, "self/", X_OK, 0));
+
+	ASSERT_SUCCESS(close(fsfd));
+	ASSERT_SUCCESS(close(mountfd));
+}
+
+int is_same_inode(int fd1, int fd2)
+{
+	struct stat stat1, stat2;
+
+	assert(fstat(fd1, &stat1) == 0);
+	assert(fstat(fd2, &stat2) == 0);
+
+	return stat1.st_ino == stat2.st_ino && stat1.st_dev == stat2.st_dev;
+}
+
+#define PROCFS_IOCTL_MAGIC 'f'
+#define PROCFS_GET_PID_NAMESPACE	_IO(PROCFS_IOCTL_MAGIC, 32)
+
+TEST_F(ns, host_get_pidns_ioctl)
+{
+	int procfs = open("/proc", O_RDONLY|O_CLOEXEC);
+	ASSERT_SUCCESS(procfs);
+
+	int procfs_pidns = ioctl(procfs, PROCFS_GET_PID_NAMESPACE);
+	ASSERT_SUCCESS(procfs_pidns);
+
+	ASSERT_TRUE(is_same_inode(self->host_pidns, procfs_pidns));
+	ASSERT_FALSE(is_same_inode(self->dummy_pidns, procfs_pidns));
+
+	ASSERT_SUCCESS(close(procfs));
+	ASSERT_SUCCESS(close(procfs_pidns));
+}
+
+TEST_F(ns, mount_implicit_get_pidns_ioctl)
+{
+	int procfs = open("/tmp/dummy/proc", O_RDONLY|O_CLOEXEC);
+	ASSERT_SUCCESS(procfs);
+
+	int procfs_pidns = ioctl(procfs, PROCFS_GET_PID_NAMESPACE);
+	ASSERT_SUCCESS(procfs_pidns);
+
+	ASSERT_FALSE(is_same_inode(self->host_pidns, procfs_pidns));
+	ASSERT_TRUE(is_same_inode(self->dummy_pidns, procfs_pidns));
+
+	ASSERT_SUCCESS(close(procfs));
+	ASSERT_SUCCESS(close(procfs_pidns));
+}
+
+TEST_F(ns, mount_pidns_get_pidns_ioctl)
+{
+	ASSERT_SUCCESS(mkdir("/tmp/proc-host", 0755));
+	ASSERT_SUCCESS(mount("proc", "/tmp/proc-host", "proc", 0, "pidns=/proc/self/ns/pid"));
+
+	int host_procfs = open("/tmp/proc-host", O_RDONLY|O_CLOEXEC);
+	ASSERT_SUCCESS(host_procfs);
+	int host_procfs_pidns = ioctl(host_procfs, PROCFS_GET_PID_NAMESPACE);
+	ASSERT_SUCCESS(host_procfs_pidns);
+
+	ASSERT_TRUE(is_same_inode(self->host_pidns, host_procfs_pidns));
+	ASSERT_FALSE(is_same_inode(self->dummy_pidns, host_procfs_pidns));
+
+	ASSERT_SUCCESS(mkdir("/tmp/proc-dummy", 0755));
+	ASSERT_SUCCESS(mount("proc", "/tmp/proc-dummy", "proc", 0, "pidns=/tmp/dummy/pidns"));
+
+	int dummy_procfs = open("/tmp/proc-dummy", O_RDONLY|O_CLOEXEC);
+	ASSERT_SUCCESS(dummy_procfs);
+	int dummy_procfs_pidns = ioctl(dummy_procfs, PROCFS_GET_PID_NAMESPACE);
+	ASSERT_SUCCESS(dummy_procfs_pidns);
+
+	ASSERT_FALSE(is_same_inode(self->host_pidns, dummy_procfs_pidns));
+	ASSERT_TRUE(is_same_inode(self->dummy_pidns, dummy_procfs_pidns));
+
+	ASSERT_SUCCESS(close(host_procfs));
+	ASSERT_SUCCESS(close(host_procfs_pidns));
+	ASSERT_SUCCESS(close(dummy_procfs));
+	ASSERT_SUCCESS(close(dummy_procfs_pidns));
+}
+
+TEST_F(ns, fsconfig_pidns_get_pidns_ioctl)
+{
+	int fsfd = fsopen("proc", FSOPEN_CLOEXEC);
+	ASSERT_SUCCESS(fsfd);
+
+	ASSERT_SUCCESS(fsconfig(fsfd, FSCONFIG_SET_FD, "pidns", NULL, self->dummy_pidns));
+	ASSERT_SUCCESS(fsconfig(fsfd, FSCONFIG_CMD_CREATE, NULL, NULL, 0));
+
+	int mountfd = fsmount(fsfd, FSMOUNT_CLOEXEC, 0);
+	ASSERT_SUCCESS(mountfd);
+
+	/* fsmount returns an O_PATH, which ioctl(2) doesn't accept. */
+	int new_mountfd = openat(mountfd, ".", O_RDONLY|O_DIRECTORY|O_CLOEXEC);
+	ASSERT_SUCCESS(new_mountfd);
+
+	ASSERT_SUCCESS(close(mountfd));
+	mountfd = -EBADF;
+
+	int procfs_pidns = ioctl(new_mountfd, PROCFS_GET_PID_NAMESPACE);
+	ASSERT_SUCCESS(procfs_pidns);
+
+	ASSERT_NE(self->dummy_pidns, procfs_pidns);
+	ASSERT_FALSE(is_same_inode(self->host_pidns, procfs_pidns));
+	ASSERT_TRUE(is_same_inode(self->dummy_pidns, procfs_pidns));
+
+	ASSERT_SUCCESS(close(fsfd));
+	ASSERT_SUCCESS(close(new_mountfd));
+	ASSERT_SUCCESS(close(procfs_pidns));
+}
+
+TEST_HARNESS_MAIN

-- 
2.50.1


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH v4 2/4] procfs: add "pidns" mount option
  2025-08-05  5:45 ` [PATCH v4 2/4] procfs: add "pidns" mount option Aleksa Sarai
@ 2025-08-05  7:29   ` Aleksa Sarai
  2025-08-06 10:25     ` Askar Safin
  2025-08-06  0:19   ` Randy Dunlap
  1 sibling, 1 reply; 18+ messages in thread
From: Aleksa Sarai @ 2025-08-05  7:29 UTC (permalink / raw)
  To: Alexander Viro, Christian Brauner, Jan Kara, Jonathan Corbet,
	Shuah Khan
  Cc: Andy Lutomirski, linux-kernel, linux-fsdevel, linux-api,
	linux-doc, linux-kselftest, Amir Goldstein

[-- Attachment #1: Type: text/plain, Size: 8905 bytes --]

On 2025-08-05, Aleksa Sarai <cyphar@cyphar.com> wrote:
> Since the introduction of pid namespaces, their interaction with procfs
> has been entirely implicit in ways that require a lot of dancing around
> by programs that need to construct sandboxes with different PID
> namespaces.
> 
> Being able to explicitly specify the pid namespace to use when
> constructing a procfs super block will allow programs to no longer need
> to fork off a process which does then does unshare(2) / setns(2) and
> forks again in order to construct a procfs in a pidns.
> 
> So, provide a "pidns" mount option which allows such users to just
> explicitly state which pid namespace they want that procfs instance to
> use. This interface can be used with fsconfig(2) either with a file
> descriptor or a path:
> 
>   fsconfig(procfd, FSCONFIG_SET_FD, "pidns", NULL, nsfd);
>   fsconfig(procfd, FSCONFIG_SET_STRING, "pidns", "/proc/self/ns/pid", 0);
> 
> or with classic mount(2) / mount(8):
> 
>   // mount -t proc -o pidns=/proc/self/ns/pid proc /tmp/proc
>   mount("proc", "/tmp/proc", "proc", MS_..., "pidns=/proc/self/ns/pid");
> 
> As this new API is effectively shorthand for setns(2) followed by
> mount(2), the permission model for this mirrors pidns_install() to avoid
> opening up new attack surfaces by loosening the existing permission
> model.
> 
> In order to avoid having to RCU-protect all users of proc_pid_ns() (to
> avoid UAFs), attempting to reconfigure an existing procfs instance's pid
> namespace will error out with -EBUSY. Creating new procfs instances is
> quite cheap, so this should not be an impediment to most users, and lets
> us avoid a lot of churn in fs/proc/* for a feature that it seems
> unlikely userspace would use.
> 
> Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
> ---
>  Documentation/filesystems/proc.rst |  8 ++++
>  fs/proc/root.c                     | 98 +++++++++++++++++++++++++++++++++++---
>  2 files changed, 100 insertions(+), 6 deletions(-)
> 
> diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
> index 5236cb52e357..5a157dadea0b 100644
> --- a/Documentation/filesystems/proc.rst
> +++ b/Documentation/filesystems/proc.rst
> @@ -2360,6 +2360,7 @@ The following mount options are supported:
>  	hidepid=	Set /proc/<pid>/ access mode.
>  	gid=		Set the group authorized to learn processes information.
>  	subset=		Show only the specified subset of procfs.
> +	pidns=		Specify a the namespace used by this procfs.
>  	=========	========================================================
>  
>  hidepid=off or hidepid=0 means classic mode - everybody may access all
> @@ -2392,6 +2393,13 @@ information about processes information, just add identd to this group.
>  subset=pid hides all top level files and directories in the procfs that
>  are not related to tasks.
>  
> +pidns= specifies a pid namespace (either as a string path to something like
> +`/proc/$pid/ns/pid`, or a file descriptor when using `FSCONFIG_SET_FD`) that
> +will be used by the procfs instance when translating pids. By default, procfs
> +will use the calling process's active pid namespace. Note that the pid
> +namespace of an existing procfs instance cannot be modified (attempting to do
> +so will give an `-EBUSY` error).
> +
>  Chapter 5: Filesystem behavior
>  ==============================
>  
> diff --git a/fs/proc/root.c b/fs/proc/root.c
> index ed86ac710384..fd1f1c8a939a 100644
> --- a/fs/proc/root.c
> +++ b/fs/proc/root.c
> @@ -38,12 +38,14 @@ enum proc_param {
>  	Opt_gid,
>  	Opt_hidepid,
>  	Opt_subset,
> +	Opt_pidns,
>  };
>  
>  static const struct fs_parameter_spec proc_fs_parameters[] = {
> -	fsparam_u32("gid",	Opt_gid),
> +	fsparam_u32("gid",		Opt_gid),
>  	fsparam_string("hidepid",	Opt_hidepid),
>  	fsparam_string("subset",	Opt_subset),
> +	fsparam_file_or_string("pidns",	Opt_pidns),
>  	{}
>  };
>  
> @@ -109,11 +111,66 @@ static int proc_parse_subset_param(struct fs_context *fc, char *value)
>  	return 0;
>  }
>  
> +#ifdef CONFIG_PID_NS
> +static int proc_parse_pidns_param(struct fs_context *fc,
> +				  struct fs_parameter *param,
> +				  struct fs_parse_result *result)
> +{
> +	struct proc_fs_context *ctx = fc->fs_private;
> +	struct pid_namespace *target, *active = task_active_pid_ns(current);
> +	struct ns_common *ns;
> +	struct file *ns_filp __free(fput) = NULL;
> +
> +	switch (param->type) {
> +	case fs_value_is_file:
> +		/* came through fsconfig, steal the file reference */
> +		ns_filp = no_free_ptr(param->file);
> +		break;
> +	case fs_value_is_string:
> +		ns_filp = filp_open(param->string, O_RDONLY, 0);
> +		break;

I just realised that we probably also want to support FSCONFIG_SET_PATH
here, but fsparam_file_or_string() doesn't handle that at the moment. I
think we probably want to have fsparam_file_or_path() which would act
like:

 1. A path with FSCONFIG_SET_STRING and FSCONFIG_SET_PATH.
 2. A file with FSCONFIG_SET_FD.

These are the semantics I would already expect from these kinds of
flags, but at the moment FSCONFIG_SET_PATH is entirely disallowed.

@Amir:

I wonder if overlayfs (the only other user of fsparam_file_or_string())
would also prefer having these semantics? We could just migrate
fsparam_file_or_string() to fsparam_file_or_path() everwhere, since I'm
pretty sure these are the semantics userspace expects anyway.

> +	default:
> +		WARN_ON_ONCE(true);
> +		break;
> +	}
> +	if (!ns_filp)
> +		ns_filp = ERR_PTR(-EBADF);
> +	if (IS_ERR(ns_filp)) {
> +		errorfc(fc, "could not get file from pidns argument");
> +		return PTR_ERR(ns_filp);
> +	}
> +
> +	if (!proc_ns_file(ns_filp))
> +		return invalfc(fc, "pidns argument is not an nsfs file");
> +	ns = get_proc_ns(file_inode(ns_filp));
> +	if (ns->ops->type != CLONE_NEWPID)
> +		return invalfc(fc, "pidns argument is not a pidns file");
> +	target = container_of(ns, struct pid_namespace, ns);
> +
> +	/*
> +	 * pidns= is shorthand for joining the pidns to get a fsopen fd, so the
> +	 * permission model should be the same as pidns_install().
> +	 */
> +	if (!ns_capable(target->user_ns, CAP_SYS_ADMIN)) {
> +		errorfc(fc, "insufficient permissions to set pidns");
> +		return -EPERM;
> +	}
> +	if (!pidns_is_ancestor(target, active))
> +		return invalfc(fc, "cannot set pidns to non-descendant pidns");
> +
> +	put_pid_ns(ctx->pid_ns);
> +	ctx->pid_ns = get_pid_ns(target);
> +	put_user_ns(fc->user_ns);
> +	fc->user_ns = get_user_ns(ctx->pid_ns->user_ns);
> +	return 0;
> +}
> +#endif /* CONFIG_PID_NS */
> +
>  static int proc_parse_param(struct fs_context *fc, struct fs_parameter *param)
>  {
>  	struct proc_fs_context *ctx = fc->fs_private;
>  	struct fs_parse_result result;
> -	int opt;
> +	int opt, err;
>  
>  	opt = fs_parse(fc, proc_fs_parameters, param, &result);
>  	if (opt < 0)
> @@ -125,14 +182,38 @@ static int proc_parse_param(struct fs_context *fc, struct fs_parameter *param)
>  		break;
>  
>  	case Opt_hidepid:
> -		if (proc_parse_hidepid_param(fc, param))
> -			return -EINVAL;
> +		err = proc_parse_hidepid_param(fc, param);
> +		if (err)
> +			return err;
>  		break;
>  
>  	case Opt_subset:
> -		if (proc_parse_subset_param(fc, param->string) < 0)
> -			return -EINVAL;
> +		err = proc_parse_subset_param(fc, param->string);
> +		if (err)
> +			return err;
> +		break;
> +
> +	case Opt_pidns:
> +#ifdef CONFIG_PID_NS
> +		/*
> +		 * We would have to RCU-protect every proc_pid_ns() or
> +		 * proc_sb_info() access if we allowed this to be reconfigured
> +		 * for an existing procfs instance. Luckily, procfs instances
> +		 * are cheap to create, and mount-beneath would let you
> +		 * atomically replace an instance even with overmounts.
> +		 */
> +		if (fc->purpose == FS_CONTEXT_FOR_RECONFIGURE) {
> +			errorfc(fc, "cannot reconfigure pidns for existing procfs");
> +			return -EBUSY;
> +		}
> +		err = proc_parse_pidns_param(fc, param, &result);
> +		if (err)
> +			return err;
>  		break;
> +#else
> +		errorfc(fc, "pidns mount flag not supported on this system");
> +		return -EOPNOTSUPP;
> +#endif
>  
>  	default:
>  		return -EINVAL;
> @@ -154,6 +235,11 @@ static void proc_apply_options(struct proc_fs_info *fs_info,
>  		fs_info->hide_pid = ctx->hidepid;
>  	if (ctx->mask & (1 << Opt_subset))
>  		fs_info->pidonly = ctx->pidonly;
> +	if (ctx->mask & (1 << Opt_pidns) &&
> +	    !WARN_ON_ONCE(fc->purpose == FS_CONTEXT_FOR_RECONFIGURE)) {
> +		put_pid_ns(fs_info->pid_ns);
> +		fs_info->pid_ns = get_pid_ns(ctx->pid_ns);
> +	}
>  }
>  
>  static int proc_fill_super(struct super_block *s, struct fs_context *fc)
> 
> -- 
> 2.50.1
> 

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
https://www.cyphar.com/

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v4 2/4] procfs: add "pidns" mount option
  2025-08-05  5:45 ` [PATCH v4 2/4] procfs: add "pidns" mount option Aleksa Sarai
  2025-08-05  7:29   ` Aleksa Sarai
@ 2025-08-06  0:19   ` Randy Dunlap
  1 sibling, 0 replies; 18+ messages in thread
From: Randy Dunlap @ 2025-08-06  0:19 UTC (permalink / raw)
  To: Aleksa Sarai, Alexander Viro, Christian Brauner, Jan Kara,
	Jonathan Corbet, Shuah Khan
  Cc: Andy Lutomirski, linux-kernel, linux-fsdevel, linux-api,
	linux-doc, linux-kselftest

Hi,

On 8/4/25 10:45 PM, Aleksa Sarai wrote:
> Since the introduction of pid namespaces, their interaction with procfs
> has been entirely implicit in ways that require a lot of dancing around
> by programs that need to construct sandboxes with different PID
> namespaces.
> 
> Being able to explicitly specify the pid namespace to use when
> constructing a procfs super block will allow programs to no longer need
> to fork off a process which does then does unshare(2) / setns(2) and
> forks again in order to construct a procfs in a pidns.
> 
> So, provide a "pidns" mount option which allows such users to just
> explicitly state which pid namespace they want that procfs instance to
> use. This interface can be used with fsconfig(2) either with a file
> descriptor or a path:
> 
>   fsconfig(procfd, FSCONFIG_SET_FD, "pidns", NULL, nsfd);
>   fsconfig(procfd, FSCONFIG_SET_STRING, "pidns", "/proc/self/ns/pid", 0);
> 
> or with classic mount(2) / mount(8):
> 
>   // mount -t proc -o pidns=/proc/self/ns/pid proc /tmp/proc
>   mount("proc", "/tmp/proc", "proc", MS_..., "pidns=/proc/self/ns/pid");
> 
> As this new API is effectively shorthand for setns(2) followed by
> mount(2), the permission model for this mirrors pidns_install() to avoid
> opening up new attack surfaces by loosening the existing permission
> model.
> 
> In order to avoid having to RCU-protect all users of proc_pid_ns() (to
> avoid UAFs), attempting to reconfigure an existing procfs instance's pid
> namespace will error out with -EBUSY. Creating new procfs instances is
> quite cheap, so this should not be an impediment to most users, and lets
> us avoid a lot of churn in fs/proc/* for a feature that it seems
> unlikely userspace would use.
> 
> Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
> ---
>  Documentation/filesystems/proc.rst |  8 ++++
>  fs/proc/root.c                     | 98 +++++++++++++++++++++++++++++++++++---
>  2 files changed, 100 insertions(+), 6 deletions(-)
> 
> diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
> index 5236cb52e357..5a157dadea0b 100644
> --- a/Documentation/filesystems/proc.rst
> +++ b/Documentation/filesystems/proc.rst
> @@ -2360,6 +2360,7 @@ The following mount options are supported:
>  	hidepid=	Set /proc/<pid>/ access mode.
>  	gid=		Set the group authorized to learn processes information.
>  	subset=		Show only the specified subset of procfs.
> +	pidns=		Specify a the namespace used by this procfs.

			drop ^^ a

>  	=========	========================================================
>  
>  hidepid=off or hidepid=0 means classic mode - everybody may access all
> @@ -2392,6 +2393,13 @@ information about processes information, just add identd to this group.
>  subset=pid hides all top level files and directories in the procfs that
>  are not related to tasks.
>  
> +pidns= specifies a pid namespace (either as a string path to something like
> +`/proc/$pid/ns/pid`, or a file descriptor when using `FSCONFIG_SET_FD`) that
> +will be used by the procfs instance when translating pids. By default, procfs
> +will use the calling process's active pid namespace. Note that the pid
> +namespace of an existing procfs instance cannot be modified (attempting to do
> +so will give an `-EBUSY` error).
> +
>  Chapter 5: Filesystem behavior
>  ==============================
>  
-- 
~Randy


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v4 3/4] procfs: add PROCFS_GET_PID_NAMESPACE ioctl
  2025-08-05  5:45 ` [PATCH v4 3/4] procfs: add PROCFS_GET_PID_NAMESPACE ioctl Aleksa Sarai
@ 2025-08-06  0:25   ` Randy Dunlap
  2025-08-06 18:02     ` Aleksa Sarai
  0 siblings, 1 reply; 18+ messages in thread
From: Randy Dunlap @ 2025-08-06  0:25 UTC (permalink / raw)
  To: Aleksa Sarai, Alexander Viro, Christian Brauner, Jan Kara,
	Jonathan Corbet, Shuah Khan
  Cc: Andy Lutomirski, linux-kernel, linux-fsdevel, linux-api,
	linux-doc, linux-kselftest



On 8/4/25 10:45 PM, Aleksa Sarai wrote:
> /proc has historically had very opaque semantics about PID namespaces,
> which is a little unfortunate for container runtimes and other programs
> that deal with switching namespaces very often. One common issue is that
> of converting between PIDs in the process's namespace and PIDs in the
> namespace of /proc.
> 
> In principle, it is possible to do this today by opening a pidfd with
> pidfd_open(2) and then looking at /proc/self/fdinfo/$n (which will
> contain a PID value translated to the pid namespace associated with that
> procfs superblock). However, allocating a new file for each PID to be
> converted is less than ideal for programs that may need to scan procfs,
> and it is generally useful for userspace to be able to finally get this
> information from procfs.
> 
> So, add a new API to get the pid namespace of a procfs instance, in the
> form of an ioctl(2) you can call on the root directory of said procfs.
> The returned file descriptor will have O_CLOEXEC set. This acts as a
> sister feature to the new "pidns" mount option, finally allowing
> userspace full control of the pid namespaces associated with procfs
> instances.
> 
> The permission model for this is a bit looser than that of the "pidns"
> mount option (and also setns(2)) because /proc/1/ns/pid provides the
> same information, so as long as you have access to that magic-link (or
> something equivalently reasonable such as being in an ancestor pid
> namespace) it makes sense to allow userspace to grab a handle. Ideally
> we would check for ptrace-read access against all processes in the pidns
> (which is very likely to be true for at least one process, as
> SUID_DUMP_DISABLE is cleared on exec(2) and is rarely set by most
> programs), but this would obviously not scale.
> 
> setns(2) will still have their own permission checks, so being able to
> open a pidns handle doesn't really provide too many other capabilities.
> 
> Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
> ---
>  Documentation/filesystems/proc.rst |  4 +++
>  fs/proc/root.c                     | 68 ++++++++++++++++++++++++++++++++++++--
>  include/uapi/linux/fs.h            |  4 +++
>  3 files changed, 74 insertions(+), 2 deletions(-)
> 


> diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
> index 0bd678a4a10e..68e65e6d7d6b 100644
> --- a/include/uapi/linux/fs.h
> +++ b/include/uapi/linux/fs.h
> @@ -435,8 +435,12 @@ typedef int __bitwise __kernel_rwf_t;
>  			 RWF_APPEND | RWF_NOAPPEND | RWF_ATOMIC |\
>  			 RWF_DONTCACHE)
>  
> +/* This matches XSDFEC_MAGIC, so we need to allocate subvalues carefully. */
>  #define PROCFS_IOCTL_MAGIC 'f'
>  
> +/* procfs root ioctls */
> +#define PROCFS_GET_PID_NAMESPACE	_IO(PROCFS_IOCTL_MAGIC, 32)

Since the _IO() nr here is 32, Documentation/userspace-api/ioctl/ioctl-number.rst
should be updated like:

-'f'   00-0F  linux/fs.h                                                conflict!
+'f'   00-1F  linux/fs.h                                                conflict!

(17 is already used for PROCFS_IOCTL_MAGIC somewhere else, so that probably should
have update the Doc/rst file.)

> +
>  /* Pagemap ioctl */
>  #define PAGEMAP_SCAN	_IOWR(PROCFS_IOCTL_MAGIC, 16, struct pm_scan_arg)
>  
> 
Thanks.
-- 
~Randy


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v4 2/4] procfs: add "pidns" mount option
  2025-08-05  7:29   ` Aleksa Sarai
@ 2025-08-06 10:25     ` Askar Safin
  2025-08-06 14:12       ` Aleksa Sarai
  0 siblings, 1 reply; 18+ messages in thread
From: Askar Safin @ 2025-08-06 10:25 UTC (permalink / raw)
  To: cyphar
  Cc: amir73il, brauner, corbet, jack, linux-api, linux-doc,
	linux-fsdevel, linux-kernel, linux-kselftest, luto, shuah, viro

> I just realised that we probably also want to support FSCONFIG_SET_PATH

I just checked kernel code. Indeed nobody uses FSCONFIG_SET_PATH. Moreover, fsparam_path macro is present since 5.1. And for all this time nobody used it. So, let's just remove FSCONFIG_SET_PATH. Nobody used it, so this will not break anything.

If you okay with that, I can submit patch, removing it.

--
Askar Safin

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v4 2/4] procfs: add "pidns" mount option
  2025-08-06 10:25     ` Askar Safin
@ 2025-08-06 14:12       ` Aleksa Sarai
  2025-08-07  7:17         ` Aleksa Sarai
  0 siblings, 1 reply; 18+ messages in thread
From: Aleksa Sarai @ 2025-08-06 14:12 UTC (permalink / raw)
  To: Askar Safin
  Cc: amir73il, brauner, corbet, jack, linux-api, linux-doc,
	linux-fsdevel, linux-kernel, linux-kselftest, luto, shuah, viro

[-- Attachment #1: Type: text/plain, Size: 1539 bytes --]

On 2025-08-06, Askar Safin <safinaskar@zohomail.com> wrote:
> > I just realised that we probably also want to support FSCONFIG_SET_PATH
> 
> I just checked kernel code. Indeed nobody uses FSCONFIG_SET_PATH.
> Moreover, fsparam_path macro is present since 5.1. And for all this
> time nobody used it. So, let's just remove FSCONFIG_SET_PATH. Nobody
> used it, so this will not break anything.
> 
> If you okay with that, I can submit patch, removing it.

I would prefer you didn't -- "*at()" semantics are very useful to a lot
of programs (*especially* AT_EMPTY_PATH). I would like the pidns= stuff
to support it, and probably also overlayfs...

I suspect the primary issue is that when migrating to the new mount API,
filesystem devs just went with the easiest thing to use
(FSCONFIG_SET_STRING) even though FSCONFIG_SET_PATH would be better. I
suspect the lack of documentation around fsconfig(2) played a part too.

My impression is that interest in the minutia about fsconfig(2) is quite
low on the list of priorities for most filesystem devs, and so the neat
aspects of fsconfig(2) haven't been fully utilised. (In LPC last year,
we struggled to come to an agreement on how filesystems should use the
read(2)-based error interface.)

We can very easily move fsparam_string() or fsparam_file_or_string()
parameters to fsparam_path() and a future fsparam_file_or_path(). I
would much prefer that as a user.

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
https://www.cyphar.com/

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v4 3/4] procfs: add PROCFS_GET_PID_NAMESPACE ioctl
  2025-08-06  0:25   ` Randy Dunlap
@ 2025-08-06 18:02     ` Aleksa Sarai
  2025-08-06 18:57       ` Randy Dunlap
  0 siblings, 1 reply; 18+ messages in thread
From: Aleksa Sarai @ 2025-08-06 18:02 UTC (permalink / raw)
  To: Randy Dunlap
  Cc: Alexander Viro, Christian Brauner, Jan Kara, Jonathan Corbet,
	Shuah Khan, Andy Lutomirski, linux-kernel, linux-fsdevel,
	linux-api, linux-doc, linux-kselftest

[-- Attachment #1: Type: text/plain, Size: 3871 bytes --]

On 2025-08-05, Randy Dunlap <rdunlap@infradead.org> wrote:
> 
> 
> On 8/4/25 10:45 PM, Aleksa Sarai wrote:
> > /proc has historically had very opaque semantics about PID namespaces,
> > which is a little unfortunate for container runtimes and other programs
> > that deal with switching namespaces very often. One common issue is that
> > of converting between PIDs in the process's namespace and PIDs in the
> > namespace of /proc.
> > 
> > In principle, it is possible to do this today by opening a pidfd with
> > pidfd_open(2) and then looking at /proc/self/fdinfo/$n (which will
> > contain a PID value translated to the pid namespace associated with that
> > procfs superblock). However, allocating a new file for each PID to be
> > converted is less than ideal for programs that may need to scan procfs,
> > and it is generally useful for userspace to be able to finally get this
> > information from procfs.
> > 
> > So, add a new API to get the pid namespace of a procfs instance, in the
> > form of an ioctl(2) you can call on the root directory of said procfs.
> > The returned file descriptor will have O_CLOEXEC set. This acts as a
> > sister feature to the new "pidns" mount option, finally allowing
> > userspace full control of the pid namespaces associated with procfs
> > instances.
> > 
> > The permission model for this is a bit looser than that of the "pidns"
> > mount option (and also setns(2)) because /proc/1/ns/pid provides the
> > same information, so as long as you have access to that magic-link (or
> > something equivalently reasonable such as being in an ancestor pid
> > namespace) it makes sense to allow userspace to grab a handle. Ideally
> > we would check for ptrace-read access against all processes in the pidns
> > (which is very likely to be true for at least one process, as
> > SUID_DUMP_DISABLE is cleared on exec(2) and is rarely set by most
> > programs), but this would obviously not scale.
> > 
> > setns(2) will still have their own permission checks, so being able to
> > open a pidns handle doesn't really provide too many other capabilities.
> > 
> > Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
> > ---
> >  Documentation/filesystems/proc.rst |  4 +++
> >  fs/proc/root.c                     | 68 ++++++++++++++++++++++++++++++++++++--
> >  include/uapi/linux/fs.h            |  4 +++
> >  3 files changed, 74 insertions(+), 2 deletions(-)
> > 
> 
> 
> > diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
> > index 0bd678a4a10e..68e65e6d7d6b 100644
> > --- a/include/uapi/linux/fs.h
> > +++ b/include/uapi/linux/fs.h
> > @@ -435,8 +435,12 @@ typedef int __bitwise __kernel_rwf_t;
> >  			 RWF_APPEND | RWF_NOAPPEND | RWF_ATOMIC |\
> >  			 RWF_DONTCACHE)
> >  
> > +/* This matches XSDFEC_MAGIC, so we need to allocate subvalues carefully. */
> >  #define PROCFS_IOCTL_MAGIC 'f'
> >  
> > +/* procfs root ioctls */
> > +#define PROCFS_GET_PID_NAMESPACE	_IO(PROCFS_IOCTL_MAGIC, 32)
> 
> Since the _IO() nr here is 32, Documentation/userspace-api/ioctl/ioctl-number.rst
> should be updated like:
> 
> -'f'   00-0F  linux/fs.h                                                conflict!
> +'f'   00-1F  linux/fs.h                                                conflict!

Should this be 00-20 (or 00-2F) instead?

Also, is there a better value to use for this new ioctl? I'm not quite
sure what is the best practice to handle these kinds of conflicts...

> (17 is already used for PROCFS_IOCTL_MAGIC somewhere else, so that probably should
> have update the Doc/rst file.)
> 
> > +
> >  /* Pagemap ioctl */
> >  #define PAGEMAP_SCAN	_IOWR(PROCFS_IOCTL_MAGIC, 16, struct pm_scan_arg)
> >  
> > 
> Thanks.
> -- 
> ~Randy
> 

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
https://www.cyphar.com/

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v4 3/4] procfs: add PROCFS_GET_PID_NAMESPACE ioctl
  2025-08-06 18:02     ` Aleksa Sarai
@ 2025-08-06 18:57       ` Randy Dunlap
  2025-08-08 14:12         ` Christian Brauner
  0 siblings, 1 reply; 18+ messages in thread
From: Randy Dunlap @ 2025-08-06 18:57 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: Alexander Viro, Christian Brauner, Jan Kara, Jonathan Corbet,
	Shuah Khan, Andy Lutomirski, linux-kernel, linux-fsdevel,
	linux-api, linux-doc, linux-kselftest



On 8/6/25 11:02 AM, Aleksa Sarai wrote:
> On 2025-08-05, Randy Dunlap <rdunlap@infradead.org> wrote:
>>
>>
>> On 8/4/25 10:45 PM, Aleksa Sarai wrote:
>>> /proc has historically had very opaque semantics about PID namespaces,
>>> which is a little unfortunate for container runtimes and other programs
>>> that deal with switching namespaces very often. One common issue is that
>>> of converting between PIDs in the process's namespace and PIDs in the
>>> namespace of /proc.
>>>
>>> In principle, it is possible to do this today by opening a pidfd with
>>> pidfd_open(2) and then looking at /proc/self/fdinfo/$n (which will
>>> contain a PID value translated to the pid namespace associated with that
>>> procfs superblock). However, allocating a new file for each PID to be
>>> converted is less than ideal for programs that may need to scan procfs,
>>> and it is generally useful for userspace to be able to finally get this
>>> information from procfs.
>>>
>>> So, add a new API to get the pid namespace of a procfs instance, in the
>>> form of an ioctl(2) you can call on the root directory of said procfs.
>>> The returned file descriptor will have O_CLOEXEC set. This acts as a
>>> sister feature to the new "pidns" mount option, finally allowing
>>> userspace full control of the pid namespaces associated with procfs
>>> instances.
>>>
>>> The permission model for this is a bit looser than that of the "pidns"
>>> mount option (and also setns(2)) because /proc/1/ns/pid provides the
>>> same information, so as long as you have access to that magic-link (or
>>> something equivalently reasonable such as being in an ancestor pid
>>> namespace) it makes sense to allow userspace to grab a handle. Ideally
>>> we would check for ptrace-read access against all processes in the pidns
>>> (which is very likely to be true for at least one process, as
>>> SUID_DUMP_DISABLE is cleared on exec(2) and is rarely set by most
>>> programs), but this would obviously not scale.
>>>
>>> setns(2) will still have their own permission checks, so being able to
>>> open a pidns handle doesn't really provide too many other capabilities.
>>>
>>> Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
>>> ---
>>>  Documentation/filesystems/proc.rst |  4 +++
>>>  fs/proc/root.c                     | 68 ++++++++++++++++++++++++++++++++++++--
>>>  include/uapi/linux/fs.h            |  4 +++
>>>  3 files changed, 74 insertions(+), 2 deletions(-)
>>>
>>
>>
>>> diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
>>> index 0bd678a4a10e..68e65e6d7d6b 100644
>>> --- a/include/uapi/linux/fs.h
>>> +++ b/include/uapi/linux/fs.h
>>> @@ -435,8 +435,12 @@ typedef int __bitwise __kernel_rwf_t;
>>>  			 RWF_APPEND | RWF_NOAPPEND | RWF_ATOMIC |\
>>>  			 RWF_DONTCACHE)
>>>  
>>> +/* This matches XSDFEC_MAGIC, so we need to allocate subvalues carefully. */
>>>  #define PROCFS_IOCTL_MAGIC 'f'
>>>  
>>> +/* procfs root ioctls */
>>> +#define PROCFS_GET_PID_NAMESPACE	_IO(PROCFS_IOCTL_MAGIC, 32)
>>
>> Since the _IO() nr here is 32, Documentation/userspace-api/ioctl/ioctl-number.rst
>> should be updated like:
>>
>> -'f'   00-0F  linux/fs.h                                                conflict!
>> +'f'   00-1F  linux/fs.h                                                conflict!
> 
> Should this be 00-20 (or 00-2F) instead?

Oops, yes, it should be one of those. Thanks.

> Also, is there a better value to use for this new ioctl? I'm not quite
> sure what is the best practice to handle these kinds of conflicts...

I wouldn't worry about it. We have *many* conflicts.
(unless Al or Christian are concerned)

>> (17 is already used for PROCFS_IOCTL_MAGIC somewhere else, so that probably should
>> have update the Doc/rst file.)
>>
>>> +
>>>  /* Pagemap ioctl */
>>>  #define PAGEMAP_SCAN	_IOWR(PROCFS_IOCTL_MAGIC, 16, struct pm_scan_arg)

-- 
~Randy


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v4 2/4] procfs: add "pidns" mount option
  2025-08-06 14:12       ` Aleksa Sarai
@ 2025-08-07  7:17         ` Aleksa Sarai
  2025-08-08 14:09           ` Christian Brauner
  0 siblings, 1 reply; 18+ messages in thread
From: Aleksa Sarai @ 2025-08-07  7:17 UTC (permalink / raw)
  To: Askar Safin
  Cc: amir73il, brauner, corbet, jack, linux-api, linux-doc,
	linux-fsdevel, linux-kernel, linux-kselftest, luto, shuah, viro

[-- Attachment #1: Type: text/plain, Size: 2293 bytes --]

On 2025-08-07, Aleksa Sarai <cyphar@cyphar.com> wrote:
> On 2025-08-06, Askar Safin <safinaskar@zohomail.com> wrote:
> > > I just realised that we probably also want to support FSCONFIG_SET_PATH
> > 
> > I just checked kernel code. Indeed nobody uses FSCONFIG_SET_PATH.
> > Moreover, fsparam_path macro is present since 5.1. And for all this
> > time nobody used it. So, let's just remove FSCONFIG_SET_PATH. Nobody
> > used it, so this will not break anything.
> > 
> > If you okay with that, I can submit patch, removing it.
> 
> I would prefer you didn't -- "*at()" semantics are very useful to a lot
> of programs (*especially* AT_EMPTY_PATH). I would like the pidns= stuff
> to support it, and probably also overlayfs...
> 
> I suspect the primary issue is that when migrating to the new mount API,
> filesystem devs just went with the easiest thing to use
> (FSCONFIG_SET_STRING) even though FSCONFIG_SET_PATH would be better. I
> suspect the lack of documentation around fsconfig(2) played a part too.
> 
> My impression is that interest in the minutia about fsconfig(2) is quite
> low on the list of priorities for most filesystem devs, and so the neat
> aspects of fsconfig(2) haven't been fully utilised. (In LPC last year,
> we struggled to come to an agreement on how filesystems should use the
> read(2)-based error interface.)
> 
> We can very easily move fsparam_string() or fsparam_file_or_string()
> parameters to fsparam_path() and a future fsparam_file_or_path(). I
> would much prefer that as a user.

Actually, fsparam_bdev() accepts FSCONFIG_SET_PATH in a very roundabout
way (and the checker doesn't verify anything...?). So there is at least
one user (ext4's "journal_path"), it's just not well-documented (which
I'm trying to fix ;]).

My plan is to update fs_lookup_param() to be more useful for the (fairly
common) use-case of wanting to support paths and file descriptors, and
going through to clean up some of these unused fsparam_* helpers (or
fsparam_* helpers being abused to implement stuff that the fs_parser
core already supports).

At the very least, overlayfs, ext4, and this procfs patchset can make
use of it.

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
https://www.cyphar.com/

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v4 2/4] procfs: add "pidns" mount option
  2025-08-07  7:17         ` Aleksa Sarai
@ 2025-08-08 14:09           ` Christian Brauner
  2025-08-08 15:51             ` Aleksa Sarai
  0 siblings, 1 reply; 18+ messages in thread
From: Christian Brauner @ 2025-08-08 14:09 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: Askar Safin, amir73il, corbet, jack, linux-api, linux-doc,
	linux-fsdevel, linux-kernel, linux-kselftest, luto, shuah, viro

On Thu, Aug 07, 2025 at 05:17:56PM +1000, Aleksa Sarai wrote:
> On 2025-08-07, Aleksa Sarai <cyphar@cyphar.com> wrote:
> > On 2025-08-06, Askar Safin <safinaskar@zohomail.com> wrote:
> > > > I just realised that we probably also want to support FSCONFIG_SET_PATH
> > > 
> > > I just checked kernel code. Indeed nobody uses FSCONFIG_SET_PATH.
> > > Moreover, fsparam_path macro is present since 5.1. And for all this
> > > time nobody used it. So, let's just remove FSCONFIG_SET_PATH. Nobody
> > > used it, so this will not break anything.
> > > 
> > > If you okay with that, I can submit patch, removing it.
> > 
> > I would prefer you didn't -- "*at()" semantics are very useful to a lot
> > of programs (*especially* AT_EMPTY_PATH). I would like the pidns= stuff
> > to support it, and probably also overlayfs...
> > 
> > I suspect the primary issue is that when migrating to the new mount API,
> > filesystem devs just went with the easiest thing to use
> > (FSCONFIG_SET_STRING) even though FSCONFIG_SET_PATH would be better. I
> > suspect the lack of documentation around fsconfig(2) played a part too.
> > 
> > My impression is that interest in the minutia about fsconfig(2) is quite
> > low on the list of priorities for most filesystem devs, and so the neat
> > aspects of fsconfig(2) haven't been fully utilised. (In LPC last year,
> > we struggled to come to an agreement on how filesystems should use the
> > read(2)-based error interface.)
> > 
> > We can very easily move fsparam_string() or fsparam_file_or_string()
> > parameters to fsparam_path() and a future fsparam_file_or_path(). I
> > would much prefer that as a user.
> 
> Actually, fsparam_bdev() accepts FSCONFIG_SET_PATH in a very roundabout
> way (and the checker doesn't verify anything...?). So there is at least
> one user (ext4's "journal_path"), it's just not well-documented (which
> I'm trying to fix ;]).
> 
> My plan is to update fs_lookup_param() to be more useful for the (fairly
> common) use-case of wanting to support paths and file descriptors, and
> going through to clean up some of these unused fsparam_* helpers (or
> fsparam_* helpers being abused to implement stuff that the fs_parser
> core already supports).
> 
> At the very least, overlayfs, ext4, and this procfs patchset can make
> use of it.

I've never bothered with actually iplementing FSCONFIG_SET_PATH
semantics because I think it's really weird to allow *at semantics when
setting filesystem parameters. I always thought it's better to force
userspace to provide a file descriptor for the final destination instead
of doing some arcane lookup variant for mount configuration. But I'm
happy to be convinced of its usefulness...

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v4 3/4] procfs: add PROCFS_GET_PID_NAMESPACE ioctl
  2025-08-06 18:57       ` Randy Dunlap
@ 2025-08-08 14:12         ` Christian Brauner
  0 siblings, 0 replies; 18+ messages in thread
From: Christian Brauner @ 2025-08-08 14:12 UTC (permalink / raw)
  To: Randy Dunlap, Arnd Bergmann
  Cc: Aleksa Sarai, Alexander Viro, Jan Kara, Jonathan Corbet,
	Shuah Khan, Andy Lutomirski, linux-kernel, linux-fsdevel,
	linux-api, linux-doc, linux-kselftest

On Wed, Aug 06, 2025 at 11:57:42AM -0700, Randy Dunlap wrote:
> 
> 
> On 8/6/25 11:02 AM, Aleksa Sarai wrote:
> > On 2025-08-05, Randy Dunlap <rdunlap@infradead.org> wrote:
> >>
> >>
> >> On 8/4/25 10:45 PM, Aleksa Sarai wrote:
> >>> /proc has historically had very opaque semantics about PID namespaces,
> >>> which is a little unfortunate for container runtimes and other programs
> >>> that deal with switching namespaces very often. One common issue is that
> >>> of converting between PIDs in the process's namespace and PIDs in the
> >>> namespace of /proc.
> >>>
> >>> In principle, it is possible to do this today by opening a pidfd with
> >>> pidfd_open(2) and then looking at /proc/self/fdinfo/$n (which will
> >>> contain a PID value translated to the pid namespace associated with that
> >>> procfs superblock). However, allocating a new file for each PID to be
> >>> converted is less than ideal for programs that may need to scan procfs,
> >>> and it is generally useful for userspace to be able to finally get this
> >>> information from procfs.
> >>>
> >>> So, add a new API to get the pid namespace of a procfs instance, in the
> >>> form of an ioctl(2) you can call on the root directory of said procfs.
> >>> The returned file descriptor will have O_CLOEXEC set. This acts as a
> >>> sister feature to the new "pidns" mount option, finally allowing
> >>> userspace full control of the pid namespaces associated with procfs
> >>> instances.
> >>>
> >>> The permission model for this is a bit looser than that of the "pidns"
> >>> mount option (and also setns(2)) because /proc/1/ns/pid provides the
> >>> same information, so as long as you have access to that magic-link (or
> >>> something equivalently reasonable such as being in an ancestor pid
> >>> namespace) it makes sense to allow userspace to grab a handle. Ideally
> >>> we would check for ptrace-read access against all processes in the pidns
> >>> (which is very likely to be true for at least one process, as
> >>> SUID_DUMP_DISABLE is cleared on exec(2) and is rarely set by most
> >>> programs), but this would obviously not scale.
> >>>
> >>> setns(2) will still have their own permission checks, so being able to
> >>> open a pidns handle doesn't really provide too many other capabilities.
> >>>
> >>> Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
> >>> ---
> >>>  Documentation/filesystems/proc.rst |  4 +++
> >>>  fs/proc/root.c                     | 68 ++++++++++++++++++++++++++++++++++++--
> >>>  include/uapi/linux/fs.h            |  4 +++
> >>>  3 files changed, 74 insertions(+), 2 deletions(-)
> >>>
> >>
> >>
> >>> diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
> >>> index 0bd678a4a10e..68e65e6d7d6b 100644
> >>> --- a/include/uapi/linux/fs.h
> >>> +++ b/include/uapi/linux/fs.h
> >>> @@ -435,8 +435,12 @@ typedef int __bitwise __kernel_rwf_t;
> >>>  			 RWF_APPEND | RWF_NOAPPEND | RWF_ATOMIC |\
> >>>  			 RWF_DONTCACHE)
> >>>  
> >>> +/* This matches XSDFEC_MAGIC, so we need to allocate subvalues carefully. */
> >>>  #define PROCFS_IOCTL_MAGIC 'f'
> >>>  
> >>> +/* procfs root ioctls */
> >>> +#define PROCFS_GET_PID_NAMESPACE	_IO(PROCFS_IOCTL_MAGIC, 32)
> >>
> >> Since the _IO() nr here is 32, Documentation/userspace-api/ioctl/ioctl-number.rst
> >> should be updated like:
> >>
> >> -'f'   00-0F  linux/fs.h                                                conflict!
> >> +'f'   00-1F  linux/fs.h                                                conflict!
> > 
> > Should this be 00-20 (or 00-2F) instead?
> 
> Oops, yes, it should be one of those. Thanks.
> 
> > Also, is there a better value to use for this new ioctl? I'm not quite
> > sure what is the best practice to handle these kinds of conflicts...
> 
> I wouldn't worry about it. We have *many* conflicts.
> (unless Al or Christian are concerned)

We try to minimize conflicts but we unfortunately give no strong
guarantees in any way. I always defer to Arnd in such matters as he's
got a pretty good mental model of what is best to do for ioctls.

> 
> >> (17 is already used for PROCFS_IOCTL_MAGIC somewhere else, so that probably should
> >> have update the Doc/rst file.)

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v4 2/4] procfs: add "pidns" mount option
  2025-08-08 14:09           ` Christian Brauner
@ 2025-08-08 15:51             ` Aleksa Sarai
  0 siblings, 0 replies; 18+ messages in thread
From: Aleksa Sarai @ 2025-08-08 15:51 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Askar Safin, amir73il, corbet, jack, linux-api, linux-doc,
	linux-fsdevel, linux-kernel, linux-kselftest, luto, shuah, viro

[-- Attachment #1: Type: text/plain, Size: 5470 bytes --]

On 2025-08-08, Christian Brauner <brauner@kernel.org> wrote:
> On Thu, Aug 07, 2025 at 05:17:56PM +1000, Aleksa Sarai wrote:
> > On 2025-08-07, Aleksa Sarai <cyphar@cyphar.com> wrote:
> > > On 2025-08-06, Askar Safin <safinaskar@zohomail.com> wrote:
> > > > > I just realised that we probably also want to support FSCONFIG_SET_PATH
> > > > 
> > > > I just checked kernel code. Indeed nobody uses FSCONFIG_SET_PATH.
> > > > Moreover, fsparam_path macro is present since 5.1. And for all this
> > > > time nobody used it. So, let's just remove FSCONFIG_SET_PATH. Nobody
> > > > used it, so this will not break anything.
> > > > 
> > > > If you okay with that, I can submit patch, removing it.
> > > 
> > > I would prefer you didn't -- "*at()" semantics are very useful to a lot
> > > of programs (*especially* AT_EMPTY_PATH). I would like the pidns= stuff
> > > to support it, and probably also overlayfs...
> > > 
> > > I suspect the primary issue is that when migrating to the new mount API,
> > > filesystem devs just went with the easiest thing to use
> > > (FSCONFIG_SET_STRING) even though FSCONFIG_SET_PATH would be better. I
> > > suspect the lack of documentation around fsconfig(2) played a part too.
> > > 
> > > My impression is that interest in the minutia about fsconfig(2) is quite
> > > low on the list of priorities for most filesystem devs, and so the neat
> > > aspects of fsconfig(2) haven't been fully utilised. (In LPC last year,
> > > we struggled to come to an agreement on how filesystems should use the
> > > read(2)-based error interface.)
> > > 
> > > We can very easily move fsparam_string() or fsparam_file_or_string()
> > > parameters to fsparam_path() and a future fsparam_file_or_path(). I
> > > would much prefer that as a user.
> > 
> > Actually, fsparam_bdev() accepts FSCONFIG_SET_PATH in a very roundabout
> > way (and the checker doesn't verify anything...?). So there is at least
> > one user (ext4's "journal_path"), it's just not well-documented (which
> > I'm trying to fix ;]).
> > 
> > My plan is to update fs_lookup_param() to be more useful for the (fairly
> > common) use-case of wanting to support paths and file descriptors, and
> > going through to clean up some of these unused fsparam_* helpers (or
> > fsparam_* helpers being abused to implement stuff that the fs_parser
> > core already supports).
> > 
> > At the very least, overlayfs, ext4, and this procfs patchset can make
> > use of it.
> 
> I've never bothered with actually iplementing FSCONFIG_SET_PATH
> semantics because I think it's really weird to allow *at semantics when
> setting filesystem parameters. I always thought it's better to force
> userspace to provide a file descriptor for the final destination instead
> of doing some arcane lookup variant for mount configuration. But I'm
> happy to be convinced of its usefulness...

I do think it's useful, and here's my thought process...

Most filesystems have to take string path parameters in order to support
mount(2) and work with mount(8). Yes, fsparam_fd() will accept
FSCONFIG_SET_STRING by parsing it as a decimal string, but there are
only two users of fsparam_fd() and honestly I'm not convinced this is a
particularly sane API for anything other than strict backcompat reasons
(the API only makes sense as a file descriptor and you want mount(8) to
be able to use it).

So you end up with most parameters supporting paths set using
FSCONFIG_SET_STRING anyway, meaning in-kernel lookups can't be taken off
the table. And if we accept paths for lookup, then (for the same reason
we have *at(2) syscalls) it is preferable to allow specifying dirfds. So
FSCONFIG_SET_PATH should also be supported.

And as there is no infrastructure to block FSCONFIG_SET_PATH_EMPTY
arguments (yes, you can do it manually, but the *only* user of
fs_lookup_param() doesn't), then anything that accepts FSCONFIG_SET_PATH
currently also accepts FSCONFIG_SET_PATH_EMPTY which is "morally
equivalent" to FSCONFIG_SET_FD. So unless you block
FSCONFIG_SET_PATH_EMPTY then FSCONFIG_SET_FD should probably also be
supported (there is the re-opening distinction, of course, but that is
not relevant if you use filename_lookup() -- which is what filesystems
will do in practice).

So my impression is that most users (if they had an fsconfig(2) man page
to read...) would expect parameters that accept paths to either:

* Work with FSCONFIG_SET_STRING and FSCONFIG_SET_PATH only; or
* Work with FSCONFIG_SET_STRING, FSCONFIG_SET_PATH,
  FSCONFIG_SET_PATH_EMPTY, and FSCONFIG_SET_FD.

Currently, none of our parameters work that way.

 * ext4's journal_path takes FSCONFIG_SET_STRING, FSCONFIG_SET_PATH, and
   FSCONFIG_SET_PATH_EMPTY.
 * overlayfs takes FSCONFIG_SET_FD and FSCONFIG_SET_STRING.

I only fully realised how inconsistent this is while working on the
fsconfig(2) man pages -- at the moment I have a very long paragraph
explaining that there is this distinction in-kernel, but this really
doesn't seem intentional to me. I would be very confused as a user that
FSCONFIG_SET_PATH is useless for most filesystem *path* parameters, even
though the filesystem accepts them as FSCONFIG_SET_STRING.

As for practical uses, it would be nice to not have to open 500 files in
order to create a 500-layer overlayfs.

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
https://www.cyphar.com/

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: (subset) [PATCH v4 0/4] procfs: make reference pidns more user-visible
  2025-08-05  5:45 [PATCH v4 0/4] procfs: make reference pidns more user-visible Aleksa Sarai
                   ` (3 preceding siblings ...)
  2025-08-05  5:45 ` [PATCH v4 4/4] selftests/proc: add tests for new pidns APIs Aleksa Sarai
@ 2025-09-02  9:54 ` Christian Brauner
  2025-09-02 10:02 ` Christian Brauner
  5 siblings, 0 replies; 18+ messages in thread
From: Christian Brauner @ 2025-09-02  9:54 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: Christian Brauner, Andy Lutomirski, linux-kernel, linux-fsdevel,
	linux-api, linux-doc, linux-kselftest, Alexander Viro, Jan Kara,
	Jonathan Corbet, Shuah Khan

On Tue, 05 Aug 2025 15:45:07 +1000, Aleksa Sarai wrote:
> Ever since the introduction of pid namespaces, procfs has had very
> implicit behaviour surrounding them (the pidns used by a procfs mount is
> auto-selected based on the mounting process's active pidns, and the
> pidns itself is basically hidden once the mount has been constructed).
> 
> /* pidns mount option for procfs */
> 
> [...]

Applied to the vfs-6.18.procfs branch of the vfs/vfs.git tree.
Patches in the vfs-6.18.procfs branch should appear in linux-next soon.

Please report any outstanding bugs that were missed during review in a
new review to the original patch series allowing us to drop it.

It's encouraged to provide Acked-bys and Reviewed-bys even though the
patch has now been applied. If possible patch trailers will be updated.

Note that commit hashes shown below are subject to change due to rebase,
trailer updates or similar. If in doubt, please check the listed branch.

tree:   https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git
branch: vfs-6.18.procfs

[1/4] pidns: move is-ancestor logic to helper
      https://git.kernel.org/vfs/vfs/c/60d22c6ef41b
[2/4] procfs: add "pidns" mount option
      https://git.kernel.org/vfs/vfs/c/77e211dd1392
[4/4] selftests/proc: add tests for new pidns APIs
      https://git.kernel.org/vfs/vfs/c/568d4239002c

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v4 0/4] procfs: make reference pidns more user-visible
  2025-08-05  5:45 [PATCH v4 0/4] procfs: make reference pidns more user-visible Aleksa Sarai
                   ` (4 preceding siblings ...)
  2025-09-02  9:54 ` (subset) [PATCH v4 0/4] procfs: make reference pidns more user-visible Christian Brauner
@ 2025-09-02 10:02 ` Christian Brauner
  5 siblings, 0 replies; 18+ messages in thread
From: Christian Brauner @ 2025-09-02 10:02 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: Alexander Viro, Jan Kara, Jonathan Corbet, Shuah Khan,
	Andy Lutomirski, linux-kernel, linux-fsdevel, linux-api,
	linux-doc, linux-kselftest

On Tue, Aug 05, 2025 at 03:45:07PM +1000, Aleksa Sarai wrote:
> Ever since the introduction of pid namespaces, procfs has had very
> implicit behaviour surrounding them (the pidns used by a procfs mount is
> auto-selected based on the mounting process's active pidns, and the
> pidns itself is basically hidden once the mount has been constructed).
> 
> /* pidns mount option for procfs */
> 
> This implicit behaviour has historically meant that userspace was
> required to do some special dances in order to configure the pidns of a
> procfs mount as desired. Examples include:
> 
>  * In order to bypass the mnt_too_revealing() check, Kubernetes creates
>    a procfs mount from an empty pidns so that user namespaced containers
>    can be nested (without this, the nested containers would fail to
>    mount procfs). But this requires forking off a helper process because
>    you cannot just one-shot this using mount(2).
> 
>  * Container runtimes in general need to fork into a container before
>    configuring its mounts, which can lead to security issues in the case
>    of shared-pidns containers (a privileged process in the pidns can
>    interact with your container runtime process). While
>    SUID_DUMP_DISABLE and user namespaces make this less of an issue, the
>    strict need for this due to a minor uAPI wart is kind of unfortunate.
> 
> Things would be much easier if there was a way for userspace to just
> specify the pidns they want. Patch 1 implements a new "pidns" argument
> which can be set using fsconfig(2):
> 
>     fsconfig(procfd, FSCONFIG_SET_FD, "pidns", NULL, nsfd);
>     fsconfig(procfd, FSCONFIG_SET_STRING, "pidns", "/proc/self/ns/pid", 0);
> 
> or classic mount(2) / mount(8):
> 
>     // mount -t proc -o pidns=/proc/self/ns/pid proc /tmp/proc
>     mount("proc", "/tmp/proc", "proc", MS_..., "pidns=/proc/self/ns/pid");
> 
> The initial security model I have in this RFC is to be as conservative
> as possible and just mirror the security model for setns(2) -- which
> means that you can only set pidns=... to pid namespaces that your
> current pid namespace is a direct ancestor of and you have CAP_SYS_ADMIN
> privileges over the pid namespace. This fulfils the requirements of
> container runtimes, but I suspect that this may be too strict for some
> usecases.
> 
> The pidns argument is not displayed in mountinfo -- it's not clear to me
> what value it would make sense to show (maybe we could just use ns_dname
> to provide an identifier for the namespace, but this number would be
> fairly useless to userspace). I'm open to suggestions. Note that
> PROCFS_GET_PID_NAMESPACE (see below) does at least let userspace get
> information about this outside of mountinfo.
> 
> Note that you cannot change the pidns of an already-created procfs
> instance. The primary reason is that allowing this to be changed would
> require RCU-protecting proc_pid_ns(sb) and thus auditing all of
> fs/proc/* and some of the users in fs/* to make sure they wouldn't UAF
> the pid namespace. Since creating procfs instances is very cheap, it
> seems unnecessary to overcomplicate this upfront. Trying to reconfigure
> procfs this way errors out with -EBUSY.
> 
> /* ioctl(PROCFS_GET_PID_NAMESPACE) */
> 
> In addition, being able to figure out what pid namespace is being used
> by a procfs mount is quite useful when you have an administrative
> process (such as a container runtime) which wants to figure out the
> correct way of mapping PIDs between its own namespace and the namespace
> for procfs (using NS_GET_{PID,TGID}_{IN,FROM}_PIDNS). There are
> alternative ways to do this, but they all rely on ancillary information
> that third-party libraries and tools do not necessarily have access to.
> 
> To make this easier, add a new ioctl (PROCFS_GET_PID_NAMESPACE) which
> can be used to get a reference to the pidns that a procfs is using.
> 
> Rather than copying the (fairly strict) security model for setns(2),
> apply a slightly looser model to better match what userspace can already
> do:
> 
>  * Make the ioctl only valid on the root (meaning that a process without
>    access to the procfs root -- such as only having an fd to a procfs
>    file or some open_tree(2)-like subset -- cannot use this API). This
>    means that the process already has some level of access to the
>    /proc/$pid directories.
> 
>  * If the calling process is in an ancestor pidns, then they can already
>    create pidfd for processes inside the pidns, which is morally
>    equivalent to a pidns file descriptor according to setns(2). So it
>    seems reasonable to just allow it in this case. (The justification
>    for this model was suggested by Christian.)
> 
>  * If the process has access to /proc/1/ns/pid already (i.e. has
>    ptrace-read access to the pidns pid1), then this ioctl is equivalent
>    to just opening a handle to it that way.
> 
>    Ideally we would check for ptrace-read access against all processes
>    in the pidns (which is very likely to be true for at least one
>    process, as SUID_DUMP_DISABLE is cleared on exec(2) and is rarely set
>    by most programs), but this would obviously not scale.
> 
> I'm open to suggestions for whether we need to make this stricter (or
> possibly allow more cases).
> 
> Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>

Thanks for the patchset. Being able to specify what pid namespace the
procfs instance is supposed to belong to is super useful and will make
things easier for userspace for sure.

The code you added contains a minor wrinkle that I disliked which I've
changed and you tell me if you can live with this restriction or not.

The way you've implemented it specifying a pid namespace that the caller
holds privilege over would silently also override the user namespace the
filesystem is supposed to belong to.

Specifically, you did something like:

        put_pid_ns(ctx->pid_ns);
        ctx->pid_ns = get_pid_ns(target);
        put_user_ns(fc->user_ns);
        fc->user_ns = get_user_ns(ctx->pid_ns->user_ns);

This silently overrides the user namespace recorded at fsopen() time. I
think that's too subtle and we should just not allow that at all for
now.

Instead I've changed this to:

        if (fc->user_ns != target->user_ns)
                return invalfc(fc, "owning user namespace of pid namespace doesn't match procfs user namespace");

        put_pid_ns(ctx->pid_ns);
        ctx->pid_ns = get_pid_ns(target);

so we just refuse different owernship.

I've also dropped the procfs ioctl because I'm not sure how much value
it will actually add given that you can do this via /proc/1/ns/pid.

If that is something that libpathrs despearately needs I would like to
do it as a separate patch anyways.

Thanks for the excellent cover letter. This was a pleasure merging!

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2025-09-02 10:03 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-08-05  5:45 [PATCH v4 0/4] procfs: make reference pidns more user-visible Aleksa Sarai
2025-08-05  5:45 ` [PATCH v4 1/4] pidns: move is-ancestor logic to helper Aleksa Sarai
2025-08-05  5:45 ` [PATCH v4 2/4] procfs: add "pidns" mount option Aleksa Sarai
2025-08-05  7:29   ` Aleksa Sarai
2025-08-06 10:25     ` Askar Safin
2025-08-06 14:12       ` Aleksa Sarai
2025-08-07  7:17         ` Aleksa Sarai
2025-08-08 14:09           ` Christian Brauner
2025-08-08 15:51             ` Aleksa Sarai
2025-08-06  0:19   ` Randy Dunlap
2025-08-05  5:45 ` [PATCH v4 3/4] procfs: add PROCFS_GET_PID_NAMESPACE ioctl Aleksa Sarai
2025-08-06  0:25   ` Randy Dunlap
2025-08-06 18:02     ` Aleksa Sarai
2025-08-06 18:57       ` Randy Dunlap
2025-08-08 14:12         ` Christian Brauner
2025-08-05  5:45 ` [PATCH v4 4/4] selftests/proc: add tests for new pidns APIs Aleksa Sarai
2025-09-02  9:54 ` (subset) [PATCH v4 0/4] procfs: make reference pidns more user-visible Christian Brauner
2025-09-02 10:02 ` Christian Brauner

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).