[RESEND PATCH v6 0/5] proc: subset=pid: Relax check of mount visibility

public inbox for linux-fsdevel@vger.kernel.org

* [RESEND PATCH v6 0/5] proc: subset=pid: Relax check of mount visibility
@ 2021-07-16 10:45 Alexey Gladkov
  2021-07-16 10:45 ` [RESEND PATCH v6 1/5] docs: proc: add documentation about mount restrictions Alexey Gladkov
                   ` (5 more replies)
  0 siblings, 6 replies; 34+ messages in thread
From: Alexey Gladkov @ 2021-07-16 10:45 UTC (permalink / raw)
  To: LKML, Eric W . Biederman
  Cc: Alexander Viro, Kees Cook, Linux Containers, Linux FS Devel

Allow procfs to be mounted with the subset=pid option even if procfs is
not fully accessible to the mounter.

Changelog
---------
v6:
* Add documentation about procfs mount restrictions.
* Reorder commits for better review.

v4:
* Set SB_I_DYNAMIC only if pidonly is set.
* Add an error message if subset=pid is canceled during remount.

v3:
* Add 'const' to struct cred *mounter_cred (fix kernel test robot warning).

v2:
* Cache the mounter's credentials and make access to the net directories
  contingent on the permissions of the mounter of procfs.

--

Alexey Gladkov (5):
  docs: proc: add documentation about mount restrictions
  proc: subset=pid: Show /proc/self/net only for CAP_NET_ADMIN
  proc: Disable cancellation of subset=pid option
  proc: Relax check of mount visibility
  docs: proc: add documentation about relaxing visibility restrictions

 Documentation/filesystems/proc.rst | 15 +++++++++++++++
 fs/namespace.c                     | 30 ++++++++++++++++++------------
 fs/proc/proc_net.c                 |  8 ++++++++
 fs/proc/root.c                     | 24 +++++++++++++++++++-----
 include/linux/fs.h                 |  1 +
 include/linux/proc_fs.h            |  1 +
 6 files changed, 62 insertions(+), 17 deletions(-)

-- 
2.29.3


* [RESEND PATCH v6 1/5] docs: proc: add documentation about mount restrictions
  2021-07-16 10:45 [RESEND PATCH v6 0/5] proc: subset=pid: Relax check of mount visibility Alexey Gladkov
@ 2021-07-16 10:45 ` Alexey Gladkov
  2021-07-16 10:46 ` [RESEND PATCH v6 2/5] proc: subset=pid: Show /proc/self/net only for CAP_NET_ADMIN Alexey Gladkov
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 34+ messages in thread
From: Alexey Gladkov @ 2021-07-16 10:45 UTC (permalink / raw)
  To: LKML, Eric W . Biederman
  Cc: Alexander Viro, Kees Cook, Linux Containers, Linux FS Devel

Signed-off-by: Alexey Gladkov <legion@kernel.org>
---
 Documentation/filesystems/proc.rst | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
index 2fa69f710e2a..5a1bb0e081fd 100644
--- a/Documentation/filesystems/proc.rst
+++ b/Documentation/filesystems/proc.rst
@@ -50,6 +50,7 @@ fixes/update part 1.1  Stefani Seibold <stefani@seibold.net>    June 9 2009
 
   4	Configuring procfs
   4.1	Mount options
+  4.2	Mount restrictions
 
   5	Filesystem behavior
 
@@ -2175,6 +2176,19 @@ information about processes information, just add identd to this group.
 subset=pid hides all top level files and directories in the procfs that
 are not related to tasks.
 
+4.2	Mount restrictions
+--------------------------
+
+If user namespaces are in use, the kernel additionally checks the instances of
+procfs available to the mounter and will not allow procfs to be mounted if:
+
+  1. This mount is not fully visible.
+
+     a. It's root directory is not the root directory of the filesystem.
+     b. If any file or non-empty procfs directory is hidden by another mount.
+
+  2. A new mount overrides the readonly option or any option from atime familty.
+
 Chapter 5: Filesystem behavior
 ==============================
 
-- 
2.29.3


* [RESEND PATCH v6 2/5] proc: subset=pid: Show /proc/self/net only for CAP_NET_ADMIN
  2021-07-16 10:45 [RESEND PATCH v6 0/5] proc: subset=pid: Relax check of mount visibility Alexey Gladkov
  2021-07-16 10:45 ` [RESEND PATCH v6 1/5] docs: proc: add documentation about mount restrictions Alexey Gladkov
@ 2021-07-16 10:46 ` Alexey Gladkov
  2021-07-16 10:46 ` [RESEND PATCH v6 3/5] proc: Disable cancellation of subset=pid option Alexey Gladkov
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 34+ messages in thread
From: Alexey Gladkov @ 2021-07-16 10:46 UTC (permalink / raw)
  To: LKML, Eric W . Biederman
  Cc: Alexander Viro, Kees Cook, Linux Containers, Linux FS Devel

Cache the mounter's credentials and make access to the net directories
contingent on the permissions of the mounter of proc.

Do not show /proc/self/net when proc is mounted with the subset=pid
option and the mounter does not have CAP_NET_ADMIN.

Signed-off-by: Alexey Gladkov <legion@kernel.org>
---
 fs/proc/proc_net.c      | 8 ++++++++
 fs/proc/root.c          | 5 +++++
 include/linux/proc_fs.h | 1 +
 3 files changed, 14 insertions(+)

diff --git a/fs/proc/proc_net.c b/fs/proc/proc_net.c
index 18601042af99..a198f74cdb3b 100644
--- a/fs/proc/proc_net.c
+++ b/fs/proc/proc_net.c
@@ -26,6 +26,7 @@
 #include <linux/uidgid.h>
 #include <net/net_namespace.h>
 #include <linux/seq_file.h>
+#include <linux/security.h>
 
 #include "internal.h"
 
@@ -259,6 +260,7 @@ static struct net *get_proc_task_net(struct inode *dir)
 	struct task_struct *task;
 	struct nsproxy *ns;
 	struct net *net = NULL;
+	struct proc_fs_info *fs_info = proc_sb_info(dir->i_sb);
 
 	rcu_read_lock();
 	task = pid_task(proc_pid(dir), PIDTYPE_PID);
@@ -271,6 +273,12 @@ static struct net *get_proc_task_net(struct inode *dir)
 	}
 	rcu_read_unlock();
 
+	if (net && (fs_info->pidonly == PROC_PIDONLY_ON) &&
+	    security_capable(fs_info->mounter_cred, net->user_ns, CAP_NET_ADMIN, CAP_OPT_NONE) < 0) {
+		put_net(net);
+		net = NULL;
+	}
+
 	return net;
 }
 
diff --git a/fs/proc/root.c b/fs/proc/root.c
index 5e444d4f9717..6a75ac717455 100644
--- a/fs/proc/root.c
+++ b/fs/proc/root.c
@@ -171,6 +171,7 @@ static int proc_fill_super(struct super_block *s, struct fs_context *fc)
 		return -ENOMEM;
 
 	fs_info->pid_ns = get_pid_ns(ctx->pid_ns);
+	fs_info->mounter_cred = get_cred(fc->cred);
 	proc_apply_options(fs_info, fc, current_user_ns());
 
 	/* User space would break if executables or devices appear on proc */
@@ -220,6 +221,9 @@ static int proc_reconfigure(struct fs_context *fc)
 
 	sync_filesystem(sb);
 
+	put_cred(fs_info->mounter_cred);
+	fs_info->mounter_cred = get_cred(fc->cred);
+
 	proc_apply_options(fs_info, fc, current_user_ns());
 	return 0;
 }
@@ -274,6 +278,7 @@ static void proc_kill_sb(struct super_block *sb)
 
 	kill_anon_super(sb);
 	put_pid_ns(fs_info->pid_ns);
+	put_cred(fs_info->mounter_cred);
 	kfree(fs_info);
 }
 
diff --git a/include/linux/proc_fs.h b/include/linux/proc_fs.h
index 000cc0533c33..ffa871941bd0 100644
--- a/include/linux/proc_fs.h
+++ b/include/linux/proc_fs.h
@@ -64,6 +64,7 @@ struct proc_fs_info {
 	kgid_t pid_gid;
 	enum proc_hidepid hide_pid;
 	enum proc_pidonly pidonly;
+	const struct cred *mounter_cred;
 };
 
 static inline struct proc_fs_info *proc_sb_info(struct super_block *sb)
-- 
2.29.3
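
The check added to get_proc_task_net() can be sketched in userspace
roughly as follows. This is an illustrative model, not kernel code: the
*_model types and visible_task_net() are hypothetical stand-ins, and the
security_capable() call is reduced to a boolean on the cached mounter
credentials.

```c
#include <stdbool.h>
#include <stddef.h>

/* Same values as the kernel enum; everything else here is a stand-in. */
enum proc_pidonly { PROC_PIDONLY_OFF = 0, PROC_PIDONLY_ON = 1 };

/* Hypothetical model of the credentials cached at mount time. */
struct cred_model {
	bool cap_net_admin;	/* CAP_NET_ADMIN over the net namespace's user_ns */
};

/* Hypothetical model of the fields of struct proc_fs_info used here. */
struct proc_fs_info_model {
	enum proc_pidonly pidonly;
	const struct cred_model *mounter_cred;
};

struct net_model {
	int id;
};

/*
 * With subset=pid, the task's net namespace is exposed only when the
 * mounter's cached credentials carry CAP_NET_ADMIN. Returning NULL
 * models /proc/<pid>/net being hidden.
 */
const struct net_model *
visible_task_net(const struct proc_fs_info_model *fs_info,
		 const struct net_model *net)
{
	if (net && fs_info->pidonly == PROC_PIDONLY_ON &&
	    !fs_info->mounter_cred->cap_net_admin)
		return NULL;
	return net;
}
```

Note that the decision depends on who mounted procfs, not on who reads
it, which is why the credentials have to be cached in the superblock
info.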


* [RESEND PATCH v6 3/5] proc: Disable cancellation of subset=pid option
  2021-07-16 10:45 [RESEND PATCH v6 0/5] proc: subset=pid: Relax check of mount visibility Alexey Gladkov
  2021-07-16 10:45 ` [RESEND PATCH v6 1/5] docs: proc: add documentation about mount restrictions Alexey Gladkov
  2021-07-16 10:46 ` [RESEND PATCH v6 2/5] proc: subset=pid: Show /proc/self/net only for CAP_NET_ADMIN Alexey Gladkov
@ 2021-07-16 10:46 ` Alexey Gladkov
  2021-07-16 10:46 ` [RESEND PATCH v6 4/5] proc: Relax check of mount visibility Alexey Gladkov
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 34+ messages in thread
From: Alexey Gladkov @ 2021-07-16 10:46 UTC (permalink / raw)
  To: LKML, Eric W . Biederman
  Cc: Alexander Viro, Kees Cook, Linux Containers, Linux FS Devel

When procfs is mounted with the subset=pid option, there is no way to
remount it with this option removed. This keeps a remount from making
visible whatever was hidden, since some checks occur only during mount.

This patch makes the limitation explicit and prints an error message.

Signed-off-by: Alexey Gladkov <legion@kernel.org>
---
 fs/proc/root.c | 15 ++++++++++-----
 1 file changed, 10 insertions(+), 5 deletions(-)

diff --git a/fs/proc/root.c b/fs/proc/root.c
index 6a75ac717455..0d20bb67e79a 100644
--- a/fs/proc/root.c
+++ b/fs/proc/root.c
@@ -145,7 +145,7 @@ static int proc_parse_param(struct fs_context *fc, struct fs_parameter *param)
 	return 0;
 }
 
-static void proc_apply_options(struct proc_fs_info *fs_info,
+static int proc_apply_options(struct proc_fs_info *fs_info,
 			       struct fs_context *fc,
 			       struct user_namespace *user_ns)
 {
@@ -155,8 +155,12 @@ static void proc_apply_options(struct proc_fs_info *fs_info,
 		fs_info->pid_gid = make_kgid(user_ns, ctx->gid);
 	if (ctx->mask & (1 << Opt_hidepid))
 		fs_info->hide_pid = ctx->hidepid;
-	if (ctx->mask & (1 << Opt_subset))
+	if (ctx->mask & (1 << Opt_subset)) {
+		if (ctx->pidonly != PROC_PIDONLY_ON && fs_info->pidonly == PROC_PIDONLY_ON)
+			return invalf(fc, "proc: subset=pid cannot be unset\n");
 		fs_info->pidonly = ctx->pidonly;
+	}
+	return 0;
 }
 
 static int proc_fill_super(struct super_block *s, struct fs_context *fc)
@@ -172,7 +176,9 @@ static int proc_fill_super(struct super_block *s, struct fs_context *fc)
 
 	fs_info->pid_ns = get_pid_ns(ctx->pid_ns);
 	fs_info->mounter_cred = get_cred(fc->cred);
-	proc_apply_options(fs_info, fc, current_user_ns());
+	ret = proc_apply_options(fs_info, fc, current_user_ns());
+	if (ret)
+		return ret;
 
 	/* User space would break if executables or devices appear on proc */
 	s->s_iflags |= SB_I_USERNS_VISIBLE | SB_I_NOEXEC | SB_I_NODEV;
@@ -224,8 +230,7 @@ static int proc_reconfigure(struct fs_context *fc)
 	put_cred(fs_info->mounter_cred);
 	fs_info->mounter_cred = get_cred(fc->cred);
 
-	proc_apply_options(fs_info, fc, current_user_ns());
-	return 0;
+	return proc_apply_options(fs_info, fc, current_user_ns());
 }
 
 static int proc_get_tree(struct fs_context *fc)
-- 
2.29.3
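
The one-way behaviour of subset=pid can be sketched in userspace as
below. This is a simplified model under stated assumptions: struct
subset_request and apply_subset_option() are hypothetical names, the
option mask is reduced to a single boolean, and the error path returns
-EINVAL directly where the patch calls invalf().

```c
#include <errno.h>
#include <stdbool.h>

/* Same values as the kernel enum. */
enum proc_pidonly { PROC_PIDONLY_OFF = 0, PROC_PIDONLY_ON = 1 };

/* Hypothetical model of the parsed mount options. */
struct subset_request {
	bool subset_given;		/* models ctx->mask & (1 << Opt_subset) */
	enum proc_pidonly pidonly;	/* models ctx->pidonly */
};

/*
 * Apply the request to the current superblock state. Once the state is
 * PROC_PIDONLY_ON, a request that would clear it is rejected and the
 * state is left untouched.
 */
int apply_subset_option(enum proc_pidonly *state,
			const struct subset_request *req)
{
	if (!req->subset_given)
		return 0;
	if (req->pidonly != PROC_PIDONLY_ON && *state == PROC_PIDONLY_ON)
		return -EINVAL;
	*state = req->pidonly;
	return 0;
}
```

A remount that does not mention the subset option leaves the state
unchanged in this model, matching the mask test in the patch.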


* [RESEND PATCH v6 4/5] proc: Relax check of mount visibility
  2021-07-16 10:45 [RESEND PATCH v6 0/5] proc: subset=pid: Relax check of mount visibility Alexey Gladkov
                   ` (2 preceding siblings ...)
  2021-07-16 10:46 ` [RESEND PATCH v6 3/5] proc: Disable cancellation of subset=pid option Alexey Gladkov
@ 2021-07-16 10:46 ` Alexey Gladkov
  2021-07-16 10:46 ` [RESEND PATCH v6 5/5] docs: proc: add documentation about relaxing visibility restrictions Alexey Gladkov
  2025-12-13  5:06 ` [RESEND PATCH v6 0/5] proc: subset=pid: Relax check of mount visibility Dan Klishch
  5 siblings, 0 replies; 34+ messages in thread
From: Alexey Gladkov @ 2021-07-16 10:46 UTC (permalink / raw)
  To: LKML, Eric W . Biederman
  Cc: Alexander Viro, Kees Cook, Linux Containers, Linux FS Devel

Allow procfs to be mounted with the subset=pid option even if procfs is
not fully accessible to the user.

Signed-off-by: Alexey Gladkov <legion@kernel.org>
---
 fs/namespace.c     | 30 ++++++++++++++++++------------
 fs/proc/root.c     | 16 ++++++++++------
 include/linux/fs.h |  1 +
 3 files changed, 29 insertions(+), 18 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index 9d33909d0f9e..f38570fdfc3f 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -3951,7 +3951,8 @@ static bool mnt_already_visible(struct mnt_namespace *ns,
 		/* This mount is not fully visible if it's root directory
 		 * is not the root directory of the filesystem.
 		 */
-		if (mnt->mnt.mnt_root != mnt->mnt.mnt_sb->s_root)
+		if (!(sb->s_iflags & SB_I_DYNAMIC) &&
+		    mnt->mnt.mnt_root != mnt->mnt.mnt_sb->s_root)
 			continue;
 
 		/* A local view of the mount flags */
@@ -3971,18 +3972,23 @@ static bool mnt_already_visible(struct mnt_namespace *ns,
 		    ((mnt_flags & MNT_ATIME_MASK) != (new_flags & MNT_ATIME_MASK)))
 			continue;
 
-		/* This mount is not fully visible if there are any
-		 * locked child mounts that cover anything except for
-		 * empty directories.
+		/* If this filesystem is completely dynamic, then it
+		 * makes no sense to check for any child mounts.
 		 */
-		list_for_each_entry(child, &mnt->mnt_mounts, mnt_child) {
-			struct inode *inode = child->mnt_mountpoint->d_inode;
-			/* Only worry about locked mounts */
-			if (!(child->mnt.mnt_flags & MNT_LOCKED))
-				continue;
-			/* Is the directory permanetly empty? */
-			if (!is_empty_dir_inode(inode))
-				goto next;
+		if (!(sb->s_iflags & SB_I_DYNAMIC)) {
+			/* This mount is not fully visible if there are any
+			 * locked child mounts that cover anything except for
+			 * empty directories.
+			 */
+			list_for_each_entry(child, &mnt->mnt_mounts, mnt_child) {
+				struct inode *inode = child->mnt_mountpoint->d_inode;
+				/* Only worry about locked mounts */
+				if (!(child->mnt.mnt_flags & MNT_LOCKED))
+					continue;
+				/* Is the directory permanetly empty? */
+				if (!is_empty_dir_inode(inode))
+					goto next;
+			}
 		}
 		/* Preserve the locked attributes */
 		*new_mnt_flags |= mnt_flags & (MNT_LOCK_READONLY | \
diff --git a/fs/proc/root.c b/fs/proc/root.c
index 0d20bb67e79a..c739ed94246c 100644
--- a/fs/proc/root.c
+++ b/fs/proc/root.c
@@ -145,18 +145,21 @@ static int proc_parse_param(struct fs_context *fc, struct fs_parameter *param)
 	return 0;
 }
 
-static int proc_apply_options(struct proc_fs_info *fs_info,
+static int proc_apply_options(struct super_block *s,
 			       struct fs_context *fc,
 			       struct user_namespace *user_ns)
 {
 	struct proc_fs_context *ctx = fc->fs_private;
+	struct proc_fs_info *fs_info = proc_sb_info(s);
 
 	if (ctx->mask & (1 << Opt_gid))
 		fs_info->pid_gid = make_kgid(user_ns, ctx->gid);
 	if (ctx->mask & (1 << Opt_hidepid))
 		fs_info->hide_pid = ctx->hidepid;
 	if (ctx->mask & (1 << Opt_subset)) {
-		if (ctx->pidonly != PROC_PIDONLY_ON && fs_info->pidonly == PROC_PIDONLY_ON)
+		if (ctx->pidonly == PROC_PIDONLY_ON)
+			s->s_iflags |= SB_I_DYNAMIC;
+		else if (fs_info->pidonly == PROC_PIDONLY_ON)
 			return invalf(fc, "proc: subset=pid cannot be unset\n");
 		fs_info->pidonly = ctx->pidonly;
 	}
@@ -176,9 +179,6 @@ static int proc_fill_super(struct super_block *s, struct fs_context *fc)
 
 	fs_info->pid_ns = get_pid_ns(ctx->pid_ns);
 	fs_info->mounter_cred = get_cred(fc->cred);
-	ret = proc_apply_options(fs_info, fc, current_user_ns());
-	if (ret)
-		return ret;
 
 	/* User space would break if executables or devices appear on proc */
 	s->s_iflags |= SB_I_USERNS_VISIBLE | SB_I_NOEXEC | SB_I_NODEV;
@@ -190,6 +190,10 @@ static int proc_fill_super(struct super_block *s, struct fs_context *fc)
 	s->s_time_gran = 1;
 	s->s_fs_info = fs_info;
 
+	ret = proc_apply_options(s, fc, current_user_ns());
+	if (ret)
+		return ret;
+
 	/*
 	 * procfs isn't actually a stacking filesystem; however, there is
 	 * too much magic going on inside it to permit stacking things on
@@ -230,7 +234,7 @@ static int proc_reconfigure(struct fs_context *fc)
 	put_cred(fs_info->mounter_cred);
 	fs_info->mounter_cred = get_cred(fc->cred);
 
-	return proc_apply_options(fs_info, fc, current_user_ns());
+	return proc_apply_options(sb, fc, current_user_ns());
 }
 
 static int proc_get_tree(struct fs_context *fc)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index fd47deea7c17..2c9a47bad796 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1390,6 +1390,7 @@ extern int send_sigurg(struct fown_struct *fown);
 #define SB_I_USERNS_VISIBLE		0x00000010 /* fstype already mounted */
 #define SB_I_IMA_UNVERIFIABLE_SIGNATURE	0x00000020
 #define SB_I_UNTRUSTED_MOUNTER		0x00000040
+#define SB_I_DYNAMIC			0x00000080
 
 #define SB_I_SKIP_SYNC	0x00000100	/* Skip superblock at global sync */
 
-- 
2.29.3
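
The effect of SB_I_DYNAMIC on the per-mount test in
mnt_already_visible() can be sketched in userspace as below. This is a
rough model, not the kernel function: the *_model structs and
mount_satisfies_visibility() are hypothetical, the filesystem-type
comparison is omitted, and the read-only/atime comparisons are collapsed
into one boolean.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a child mount stacked on the candidate mount. */
struct child_mount_model {
	bool locked;		/* models MNT_LOCKED */
	bool covers_empty_dir;	/* models is_empty_dir_inode() on the mountpoint */
};

/* Hypothetical model of an existing procfs mount in the namespace. */
struct mount_model {
	bool root_is_fs_root;	/* mnt_root == s_root */
	bool flags_compatible;	/* read-only and atime checks pass */
	const struct child_mount_model *children;
	size_t nr_children;
};

/*
 * Return true if this existing mount is enough to permit the new mount.
 * new_sb_dynamic models SB_I_DYNAMIC on the new superblock, which the
 * patch sets when subset=pid is requested.
 */
bool mount_satisfies_visibility(const struct mount_model *mnt,
				bool new_sb_dynamic)
{
	size_t i;

	/* The root-directory check is skipped for a dynamic superblock. */
	if (!new_sb_dynamic && !mnt->root_is_fs_root)
		return false;

	/* The flag checks still apply in both cases. */
	if (!mnt->flags_compatible)
		return false;

	/* The child-mount scan is skipped for a dynamic superblock. */
	if (new_sb_dynamic)
		return true;

	for (i = 0; i < mnt->nr_children; i++) {
		if (!mnt->children[i].locked)
			continue;
		if (!mnt->children[i].covers_empty_dir)
			return false;
	}
	return true;
}
```

In this model a procfs with a locked overmount on a non-empty entry
fails the check for an ordinary mount but passes it for subset=pid,
while a read-only or atime mismatch is rejected either way.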


* [RESEND PATCH v6 5/5] docs: proc: add documentation about relaxing visibility restrictions
  2021-07-16 10:45 [RESEND PATCH v6 0/5] proc: subset=pid: Relax check of mount visibility Alexey Gladkov
                   ` (3 preceding siblings ...)
  2021-07-16 10:46 ` [RESEND PATCH v6 4/5] proc: Relax check of mount visibility Alexey Gladkov
@ 2021-07-16 10:46 ` Alexey Gladkov
  2025-12-13  5:06 ` [RESEND PATCH v6 0/5] proc: subset=pid: Relax check of mount visibility Dan Klishch
  5 siblings, 0 replies; 34+ messages in thread
From: Alexey Gladkov @ 2021-07-16 10:46 UTC (permalink / raw)
  To: LKML, Eric W . Biederman
  Cc: Alexander Viro, Kees Cook, Linux Containers, Linux FS Devel

Signed-off-by: Alexey Gladkov <legion@kernel.org>
---
 Documentation/filesystems/proc.rst | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
index 5a1bb0e081fd..9d993aef7f1c 100644
--- a/Documentation/filesystems/proc.rst
+++ b/Documentation/filesystems/proc.rst
@@ -2182,7 +2182,8 @@ are not related to tasks.
 If user namespaces are in use, the kernel additionally checks the instances of
 procfs available to the mounter and will not allow procfs to be mounted if:
 
-  1. This mount is not fully visible.
+  1. This mount is not fully visible unless the new procfs is going to be
+     mounted with subset=pid option.
 
      a. It's root directory is not the root directory of the filesystem.
      b. If any file or non-empty procfs directory is hidden by another mount.
-- 
2.29.3


* Re: [RESEND PATCH v6 0/5] proc: subset=pid: Relax check of mount visibility
  2021-07-16 10:45 [RESEND PATCH v6 0/5] proc: subset=pid: Relax check of mount visibility Alexey Gladkov
                   ` (4 preceding siblings ...)
  2021-07-16 10:46 ` [RESEND PATCH v6 5/5] docs: proc: add documentation about relaxing visibility restrictions Alexey Gladkov
@ 2025-12-13  5:06 ` Dan Klishch
  2025-12-13 10:49   ` Alexey Gladkov
  2026-01-13  9:20   ` [PATCH v7 " Alexey Gladkov
  5 siblings, 2 replies; 34+ messages in thread
From: Dan Klishch @ 2025-12-13  5:06 UTC (permalink / raw)
  To: legion, linux-kernel
  Cc: ebiederm, viro, keescook, containers, linux-fsdevel, Dan Klishch

Hello Alexey,

Would it be possible to revive this patch series?

I wanted to add an additional downstream use case that would benefit
from this work. In particular, I am trying to run the sandbox
sunwalker-box [1] without root privileges and/or inside a container.

The sandbox aims to prevent cross-run communication via side channels,
and PID allocation is one such channel. Therefore, it creates a new PID
namespace and mounts the corresponding procfs instance inside of the
sandbox. This currently works without a real root when procfs is fully
accessible, but obviously fails otherwise.

Thanks,
Dan Klishch

[1] https://github.com/purplesyringa/sunwalker-box/
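
For readers unfamiliar with the setup described above, a minimal sketch
of "new PID namespace plus its own procfs" might look like this. It is
not taken from sunwalker-box: both function names are hypothetical,
uid/gid map setup and error reporting are left out, and the mount is
expected to fail wherever the kernel's visibility check rejects it.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <sys/mount.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Mount a procfs instance limited to the PID tree at `target`.
 * Returns 0 on success, -1 with errno set on failure.
 */
int mount_subset_pid_proc(const char *target)
{
	return mount("proc", target, "proc",
		     MS_NOSUID | MS_NODEV | MS_NOEXEC, "subset=pid");
}

/*
 * Hypothetical helper: create new user, mount and PID namespaces, then
 * mount procfs from a child. The child is needed because only processes
 * forked after unshare(CLONE_NEWPID) live in the new PID namespace.
 */
int sandbox_proc(const char *target)
{
	pid_t child;
	int status;

	if (unshare(CLONE_NEWUSER | CLONE_NEWNS | CLONE_NEWPID) < 0)
		return -1;

	child = fork();
	if (child < 0)
		return -1;
	if (child == 0)
		_exit(mount_subset_pid_proc(target) == 0 ? 0 : 1);

	if (waitpid(child, &status, 0) < 0)
		return -1;
	return (WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 0 : -1;
}
```

A real sandbox would keep the child alive as PID 1 of the namespace
instead of exiting right after the mount.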

* Re: [RESEND PATCH v6 0/5] proc: subset=pid: Relax check of mount visibility
  2025-12-13  5:06 ` [RESEND PATCH v6 0/5] proc: subset=pid: Relax check of mount visibility Dan Klishch
@ 2025-12-13 10:49   ` Alexey Gladkov
  2025-12-13 18:00     ` Dan Klishch
  2026-01-13  9:20   ` [PATCH v7 " Alexey Gladkov
  1 sibling, 1 reply; 34+ messages in thread
From: Alexey Gladkov @ 2025-12-13 10:49 UTC (permalink / raw)
  To: Dan Klishch
  Cc: linux-kernel, ebiederm, viro, keescook, containers, linux-fsdevel

On Sat, Dec 13, 2025 at 12:06:38AM -0500, Dan Klishch wrote:
> Hello Alexey,
> 
> Would it be possible to revive this patch series?
> 
> I wanted to add an additional downstream use case that would benefit
> from this work. In particular, I am trying to run the sandbox
> sunwalker-box [1] without root privileges and/or inside a container.
> 
> The sandbox aims to prevent cross-run communication via side channels,
> and PID allocation is one such channel. Therefore, it creates a new PID
> namespace and mounts the corresponding procfs instance inside of the
> sandbox. This currently works without a real root when procfs is fully
> accessible, but obviously fails otherwise.
> 
> Thanks,
> Dan Klishch
> 
> [1] https://github.com/purplesyringa/sunwalker-box/
> 

Overmounting "dangerous" files in procfs is an incorrect and potentially
dangerous practice. I know that many programs (docker, podman, etc.) use
this method, but it is not the correct way to isolate dangerous files in
procfs.

In particular, this is one of the reasons why this patchset was abandoned.

It is quite difficult to implement these checks in procfs correctly and
not break anything. It is much easier to implement file access
restrictions in procfs using an ebpf controller. Some time ago, I tried to
implement such a controller [1], and it seemed to me that it was much
easier than adding complex checks to the kernel.

If I'm wrong and missing a use case, let me know and we can go back to
the patches.

[1] https://github.com/legionus/proc-bpf-controller

-- 
Rgrds, legion

* Re: [RESEND PATCH v6 0/5] proc: subset=pid: Relax check of mount visibility
  2025-12-13 10:49   ` Alexey Gladkov
@ 2025-12-13 18:00     ` Dan Klishch
  2025-12-14 16:40       ` Alexey Gladkov
  0 siblings, 1 reply; 34+ messages in thread
From: Dan Klishch @ 2025-12-13 18:00 UTC (permalink / raw)
  To: legion; +Cc: containers, ebiederm, keescook, linux-fsdevel, linux-kernel, viro

> It is much easier to implement file access
> restrictions in procfs using an ebpf controller.

But if we already have a masked /proc from podman/docker/user who
decided to run `mount --bind /dev/null /proc/smth`, the sandbox will
not have a choice other than to bail out. Also, correct me if I am
wrong, installing ebpf controller requires CAP_BPF in initial
userns, so rootless podman will not be able to mask /proc "properly"
even if someone sends a patch switching it to ebpf.

Thanks,
Dan Klishch

* Re: [RESEND PATCH v6 0/5] proc: subset=pid: Relax check of mount visibility
  2025-12-13 18:00     ` Dan Klishch
@ 2025-12-14 16:40       ` Alexey Gladkov
  2025-12-14 18:02         ` Dan Klishch
  0 siblings, 1 reply; 34+ messages in thread
From: Alexey Gladkov @ 2025-12-14 16:40 UTC (permalink / raw)
  To: Dan Klishch
  Cc: containers, ebiederm, keescook, linux-fsdevel, linux-kernel, viro

On Sat, Dec 13, 2025 at 01:00:38PM -0500, Dan Klishch wrote:
> > It is much easier to implement file access
> > restrictions in procfs using an ebpf controller.
> 
> But if we already have a masked /proc from podman/docker/user who
> decided to run `mount --bind /dev/null /proc/smth`, the sandbox will
> not have a choice other than to bail out.

I misunderstood you. I thought you were writing your own container
implementation.

Yes, if you want a nested container inside docker/podman, then the file
overmount technique is already in use there.

But then, if I understand you correctly, this patch will not be enough
for you. procfs with subset=pid will not allow you to have /proc/meminfo,
/proc/cpuinfo, etc.

> Also, correct me if I am wrong, installing ebpf controller requires
> CAP_BPF in initial userns, so rootless podman will not be able to mask
> /proc "properly" even if someone sends a patch switching it to ebpf.

You can turn on /proc/sys/kernel/unprivileged_bpf_disabled.

-- 
Rgrds, legion


* Re: [RESEND PATCH v6 0/5] proc: subset=pid: Relax check of mount visibility
  2025-12-14 16:40       ` Alexey Gladkov
@ 2025-12-14 18:02         ` Dan Klishch
  2025-12-15 10:10           ` Alexey Gladkov
  2025-12-15 11:30           ` Christian Brauner
  0 siblings, 2 replies; 34+ messages in thread
From: Dan Klishch @ 2025-12-14 18:02 UTC (permalink / raw)
  To: legion; +Cc: containers, ebiederm, keescook, linux-fsdevel, linux-kernel, viro

On 12/14/25 11:40 AM, Alexey Gladkov wrote:
> But then, if I understand you correctly, this patch will not be enough
> for you. procfs with subset=pid will not allow you to have /proc/meminfo,
> /proc/cpuinfo, etc.

Hmm, I didn't think of this. sunwalker-box only exposes cpuinfo and PID
tree to the sandboxed programs (empirically, this is enough for most of
programs you want sandboxing for). With that in mind, this patch and a
FUSE providing an overlay with cpuinfo / seccomp intercepting opens of
/proc/cpuinfo / a small kernel patch with a new mount option for procfs
to expose more static files still look like a clean solution to me.

>> Also, correct me if I am wrong, installing ebpf controller requires
>> CAP_BPF in initial userns, so rootless podman will not be able to mask
>> /proc "properly" even if someone sends a patch switching it to ebpf.
> 
> You can turn on /proc/sys/kernel/unprivileged_bpf_disabled.

$ cat /proc/sys/kernel/unprivileged_bpf_disabled
0
$ unshare -pfr --mount-proc
$ ./proc-controller -p deny /proc/cpuinfo
libbpf: prog 'proc_access_restrict': BPF program load failed: Operation not permitted
libbpf: prog 'proc_access_restrict': failed to load: -1
libbpf: failed to load object './proc-controller.bpf.o'
proc-controller: ERROR: loading BPF object file failed

I think only packet filters are allowed to be installed by non-root.

Thanks,
Dan Klishch

* Re: [RESEND PATCH v6 0/5] proc: subset=pid: Relax check of mount visibility
  2025-12-14 18:02         ` Dan Klishch
@ 2025-12-15 10:10           ` Alexey Gladkov
  2025-12-15 14:46             ` Dan Klishch
  2025-12-15 11:30           ` Christian Brauner
  1 sibling, 1 reply; 34+ messages in thread
From: Alexey Gladkov @ 2025-12-15 10:10 UTC (permalink / raw)
  To: Dan Klishch
  Cc: containers, ebiederm, keescook, linux-fsdevel, linux-kernel, viro

On Sun, Dec 14, 2025 at 01:02:54PM -0500, Dan Klishch wrote:
> On 12/14/25 11:40 AM, Alexey Gladkov wrote:
> > But then, if I understand you correctly, this patch will not be enough
> > for you. procfs with subset=pid will not allow you to have /proc/meminfo,
> > /proc/cpuinfo, etc.
> 
> Hmm, I didn't think of this. sunwalker-box only exposes cpuinfo and PID
> tree to the sandboxed programs (empirically, this is enough for most of
> programs you want sandboxing for). With that in mind, this patch and a
> FUSE providing an overlay with cpuinfo / seccomp intercepting opens of
> /proc/cpuinfo / a small kernel patch with a new mount option for procfs
> to expose more static files still look like a clean solution to me.

I don't think you'll be able to do that. procfs doesn't allow itself to
be overlaid [1]. That should block mounting overlayfs and fuse on top
of procfs.

[1] https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/proc/root.c#n274

> >> Also, correct me if I am wrong, installing ebpf controller requires
> >> CAP_BPF in initial userns, so rootless podman will not be able to mask
> >> /proc "properly" even if someone sends a patch switching it to ebpf.
> > 
> > You can turn on /proc/sys/kernel/unprivileged_bpf_disabled.
> 
> $ cat /proc/sys/kernel/unprivileged_bpf_disabled
> 0
> $ unshare -pfr --mount-proc
> $ ./proc-controller -p deny /proc/cpuinfo
> libbpf: prog 'proc_access_restrict': BPF program load failed: Operation not permitted
> libbpf: prog 'proc_access_restrict': failed to load: -1
> libbpf: failed to load object './proc-controller.bpf.o'
> proc-controller: ERROR: loading BPF object file failed
> 
> I think only packet filters are allowed to be installed by non-root.

I probably forgot about that. I wrote this code a long time ago, and
to be honest, I forgot whether it can be used for rootless.

-- 
Rgrds, legion


* Re: [RESEND PATCH v6 0/5] proc: subset=pid: Relax check of mount visibility
  2025-12-15 10:10           ` Alexey Gladkov
@ 2025-12-15 14:46             ` Dan Klishch
  2025-12-15 14:58               ` Alexey Gladkov
  0 siblings, 1 reply; 34+ messages in thread
From: Dan Klishch @ 2025-12-15 14:46 UTC (permalink / raw)
  To: legion, brauner
  Cc: containers, ebiederm, keescook, linux-fsdevel, linux-kernel, viro

On 12/15/25 5:10 AM, Alexey Gladkov wrote:
> On Sun, Dec 14, 2025 at 01:02:54PM -0500, Dan Klishch wrote:
>> On 12/14/25 11:40 AM, Alexey Gladkov wrote:
>>> But then, if I understand you correctly, this patch will not be enough
>>> for you. procfs with subset=pid will not allow you to have /proc/meminfo,
>>> /proc/cpuinfo, etc.
>>
>> Hmm, I didn't think of this. sunwalker-box only exposes cpuinfo and PID
>> tree to the sandboxed programs (empirically, this is enough for most of
>> programs you want sandboxing for). With that in mind, this patch and a
>> FUSE providing an overlay with cpuinfo / seccomp intercepting opens of
>> /proc/cpuinfo / a small kernel patch with a new mount option for procfs
>> to expose more static files still look like a clean solution to me.
> 
> I don't think you'll be able to do that. procfs doesn't allow itself to
> be overlaid [1]. That should block mounting overlayfs and fuse on top
> of procfs.
> 
> [1] https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/proc/root.c#n274

This is why I have been careful not to say overlayfs. With [2] (warning:
zero-shot ChatGPT output), I can do:

$ ./fuse-overlay target --source=/proc
$ ls target
1   88   194   1374    889840  908552
2   90   195   1375    889987  908619
3   91   196   1379    890031  908658
4   92   203   1412    890063  908756
5   93   205   1590    890085  908804
6   94   233   1644    890139  908951
7   96   237   1802    890246  909848
8   97   239   1850    890271  909914
10  98   240   1852    894665  909924
13  99   243   1865    895854  909926
15  100  244   1888    895864  910005
16  102  246   1889    896030  acpi
17  103  262   1891    896205  asound
18  104  263   1895    896508  bus
19  105  264   1896    896544  driver
20  106  265   1899    896706  dynamic_debug
<...>

[2] https://gist.github.com/DanShaders/547eeb74a90315356b98472feae47474

This requires much more careful thought with respect to magic symlinks
and permission checks. I am highly unlikely to reimplement the checks
and special behavior of procfs 100% correctly, which makes me not want
to proceed with the FUSE route.

On 12/15/25 6:30 AM, Christian Brauner wrote:
> The standard way of making it possible to mount procfs inside of a
> container with a separate mount namespace that has a procfs inside it
> with overmounted entries is to ensure that a fully-visible procfs
> instance is present.

Yes, this is a solution. However, this is only marginally better than
passing --privileged to the outer container (in the sense that we require
the outer sandbox to remove some protections for the inner sandbox to work).

> The container needs to inherit a fully-visible instance somehow if you
> want nesting. Using an unprivileged LSM such as landlock to prevent any
> access to the fully visible procfs instance is usually the better way.
> 
> My hope is that once signed bpf is more widely adopted that distros will
> just start enabling blessed bpf programs that will just take on the
> access protecting instead of the clumsy bind-mount protection mechanism.

These are big changes to container runtimes that are unlikely to happen
soon. In contrast, the patch we are discussing will be available for me
to use on Arch Linux 2 months after the merge, and on Ubuntu a couple of
months after that.

So, is there any way forward with the patch or should I continue trying
to find a userspace solution?

Thanks,
Dan Klishch

* Re: [RESEND PATCH v6 0/5] proc: subset=pid: Relax check of mount visibility
  2025-12-15 14:46             ` Dan Klishch
@ 2025-12-15 14:58               ` Alexey Gladkov
  2025-12-24 12:55                 ` Christian Brauner
  0 siblings, 1 reply; 34+ messages in thread
From: Alexey Gladkov @ 2025-12-15 14:58 UTC (permalink / raw)
  To: Dan Klishch
  Cc: brauner, containers, ebiederm, keescook, linux-fsdevel,
	linux-kernel, viro

On Mon, Dec 15, 2025 at 09:46:00AM -0500, Dan Klishch wrote:
> On 12/15/25 5:10 AM, Alexey Gladkov wrote:
> > On Sun, Dec 14, 2025 at 01:02:54PM -0500, Dan Klishch wrote:
> >> On 12/14/25 11:40 AM, Alexey Gladkov wrote:
> >>> But then, if I understand you correctly, this patch will not be enough
> >>> for you. procfs with subset=pid will not allow you to have /proc/meminfo,
> >>> /proc/cpuinfo, etc.
> >>
> >> Hmm, I didn't think of this. sunwalker-box only exposes cpuinfo and PID
> >> tree to the sandboxed programs (empirically, this is enough for most of
> >> programs you want sandboxing for). With that in mind, this patch and a
> >> FUSE providing an overlay with cpuinfo / seccomp intercepting opens of
> >> /proc/cpuinfo / a small kernel patch with a new mount option for procfs
> >> to expose more static files still look like a clean solution to me.
> > 
> > I don't think you'll be able to do that. procfs doesn't allow itself to
> > be overlayed [1]. What should block mounting overlayfs and fuse on top
> > of procfs.
> > 
> > [1] https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/proc/root.c#n274
> 
> This is why I have been careful not to say overlayfs. With [2] (warning:
> zero-shot ChatGPT output), I can do:
> 
> $ ./fuse-overlay target --source=/proc
> $ ls target
> 1   88   194   1374    889840  908552
> 2   90   195   1375    889987  908619
> 3   91   196   1379    890031  908658
> 4   92   203   1412    890063  908756
> 5   93   205   1590    890085  908804
> 6   94   233   1644    890139  908951
> 7   96   237   1802    890246  909848
> 8   97   239   1850    890271  909914
> 10  98   240   1852    894665  909924
> 13  99   243   1865    895854  909926
> 15  100  244   1888    895864  910005
> 16  102  246   1889    896030  acpi
> 17  103  262   1891    896205  asound
> 18  104  263   1895    896508  bus
> 19  105  264   1896    896544  driver
> 20  106  265   1899    896706  dynamic_debug
> <...>
> 
> [2] https://gist.github.com/DanShaders/547eeb74a90315356b98472feae47474
> 
> This requires a much more careful thought wrt magic symlinks
> and permission checks. The fact that I am highly unlikely to 100%
> correctly reimplement the checks and special behavior of procfs makes me
> not want to proceed with the FUSE route.
> 
> On 12/15/25 6:30 AM, Christian Brauner wrote:
> > The standard way of making it possible to mount procfs inside of a
> > container with a separate mount namespace that has a procfs inside it
> > with overmounted entries is to ensure that a fully-visible procfs
> > instance is present.
> 
> Yes, this is a solution. However, this is only marginally better than
> passing --privileged to the outer container (in a sense that we require
> outer sandbox to remove some protections for the inner sandbox to work).
> 
> > The container needs to inherit a fully-visible instance somehow if you
> > want nesting. Using an unprivileged LSM such as landlock to prevent any
> > access to the fully visible procfs instance is usually the better way.
> > 
> > My hope is that once signed bpf is more widely adopted that distros will
> > just start enabling blessed bpf programs that will just take on the
> > access protecting instead of the clumsy bind-mount protection mechanism.
> 
> These are big changes to container runtimes that are unlikely to happen
> soon. In contrast, the patch we are discussing will be available in 2
> months after the merge for me to use on ArchLinux, and in a couple more
> months on Ubuntu.
> 
> So, is there any way forward with the patch or should I continue trying
> to find a userspace solution?

I still consider these patches useful. I made them precisely to remove
some of the restrictions we have for procfs because of global files in
the root of this filesystem.

I can update and prepare a new version of the patchset if Christian thinks
it's useful too.

-- 
Rgrds, legion


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RESEND PATCH v6 0/5] proc: subset=pid: Relax check of mount visibility
  2025-12-15 14:58               ` Alexey Gladkov
@ 2025-12-24 12:55                 ` Christian Brauner
  2026-01-30 13:34                   ` Alexey Gladkov
  0 siblings, 1 reply; 34+ messages in thread
From: Christian Brauner @ 2025-12-24 12:55 UTC (permalink / raw)
  To: Alexey Gladkov
  Cc: Dan Klishch, containers, ebiederm, keescook, linux-fsdevel,
	linux-kernel, viro

On Mon, Dec 15, 2025 at 03:58:42PM +0100, Alexey Gladkov wrote:
> On Mon, Dec 15, 2025 at 09:46:00AM -0500, Dan Klishch wrote:
> > On 12/15/25 5:10 AM, Alexey Gladkov wrote:
> > > On Sun, Dec 14, 2025 at 01:02:54PM -0500, Dan Klishch wrote:
> > >> On 12/14/25 11:40 AM, Alexey Gladkov wrote:
> > >>> But then, if I understand you correctly, this patch will not be enough
> > >>> for you. procfs with subset=pid will not allow you to have /proc/meminfo,
> > >>> /proc/cpuinfo, etc.
> > >>
> > >> Hmm, I didn't think of this. sunwalker-box only exposes cpuinfo and PID
> > >> tree to the sandboxed programs (empirically, this is enough for most of
> > >> programs you want sandboxing for). With that in mind, this patch and a
> > >> FUSE providing an overlay with cpuinfo / seccomp intercepting opens of
> > >> /proc/cpuinfo / a small kernel patch with a new mount option for procfs
> > >> to expose more static files still look like a clean solution to me.
> > > 
> > > I don't think you'll be able to do that. procfs doesn't allow itself to
> > > be overlayed [1]. What should block mounting overlayfs and fuse on top
> > > of procfs.
> > > 
> > > [1] https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/proc/root.c#n274
> > 
> > This is why I have been careful not to say overlayfs. With [2] (warning:
> > zero-shot ChatGPT output), I can do:
> > 
> > $ ./fuse-overlay target --source=/proc
> > $ ls target
> > 1   88   194   1374    889840  908552
> > 2   90   195   1375    889987  908619
> > 3   91   196   1379    890031  908658
> > 4   92   203   1412    890063  908756
> > 5   93   205   1590    890085  908804
> > 6   94   233   1644    890139  908951
> > 7   96   237   1802    890246  909848
> > 8   97   239   1850    890271  909914
> > 10  98   240   1852    894665  909924
> > 13  99   243   1865    895854  909926
> > 15  100  244   1888    895864  910005
> > 16  102  246   1889    896030  acpi
> > 17  103  262   1891    896205  asound
> > 18  104  263   1895    896508  bus
> > 19  105  264   1896    896544  driver
> > 20  106  265   1899    896706  dynamic_debug
> > <...>
> > 
> > [2] https://gist.github.com/DanShaders/547eeb74a90315356b98472feae47474
> > 
> > This requires a much more careful thought wrt magic symlinks
> > and permission checks. The fact that I am highly unlikely to 100%
> > correctly reimplement the checks and special behavior of procfs makes me
> > not want to proceed with the FUSE route.
> > 
> > On 12/15/25 6:30 AM, Christian Brauner wrote:
> > > The standard way of making it possible to mount procfs inside of a
> > > container with a separate mount namespace that has a procfs inside it
> > > with overmounted entries is to ensure that a fully-visible procfs
> > > instance is present.
> > 
> > Yes, this is a solution. However, this is only marginally better than
> > passing --privileged to the outer container (in a sense that we require
> > outer sandbox to remove some protections for the inner sandbox to work).
> > 
> > > The container needs to inherit a fully-visible instance somehow if you
> > > want nesting. Using an unprivileged LSM such as landlock to prevent any
> > > access to the fully visible procfs instance is usually the better way.
> > > 
> > > My hope is that once signed bpf is more widely adopted that distros will
> > > just start enabling blessed bpf programs that will just take on the
> > > access protecting instead of the clumsy bind-mount protection mechanism.
> > 
> > These are big changes to container runtimes that are unlikely to happen
> > soon. In contrast, the patch we are discussing will be available in 2
> > months after the merge for me to use on ArchLinux, and in a couple more
> > months on Ubuntu.
> > 
> > So, is there any way forward with the patch or should I continue trying
> > to find a userspace solution?
> 
> I still consider these patches useful. I made them precisely to remove
> some of the restrictions we have for procfs because of global files in
> the root of this filesystem.
> 
> I can update and prepare a new version of patchset if Christian thinks
> it's useful too.

Let's see it at least! No need to preemptively dismiss it. :)

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RESEND PATCH v6 0/5] proc: subset=pid: Relax check of mount visibility
  2025-12-24 12:55                 ` Christian Brauner
@ 2026-01-30 13:34                   ` Alexey Gladkov
  0 siblings, 0 replies; 34+ messages in thread
From: Alexey Gladkov @ 2026-01-30 13:34 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Dan Klishch, containers, ebiederm, keescook, linux-fsdevel,
	linux-kernel, viro

On Wed, Dec 24, 2025 at 01:55:20PM +0100, Christian Brauner wrote:
> On Mon, Dec 15, 2025 at 03:58:42PM +0100, Alexey Gladkov wrote:
> > On Mon, Dec 15, 2025 at 09:46:00AM -0500, Dan Klishch wrote:
> > > On 12/15/25 5:10 AM, Alexey Gladkov wrote:
> > > > On Sun, Dec 14, 2025 at 01:02:54PM -0500, Dan Klishch wrote:
> > > >> On 12/14/25 11:40 AM, Alexey Gladkov wrote:
> > > >>> But then, if I understand you correctly, this patch will not be enough
> > > >>> for you. procfs with subset=pid will not allow you to have /proc/meminfo,
> > > >>> /proc/cpuinfo, etc.
> > > >>
> > > >> Hmm, I didn't think of this. sunwalker-box only exposes cpuinfo and PID
> > > >> tree to the sandboxed programs (empirically, this is enough for most of
> > > >> programs you want sandboxing for). With that in mind, this patch and a
> > > >> FUSE providing an overlay with cpuinfo / seccomp intercepting opens of
> > > >> /proc/cpuinfo / a small kernel patch with a new mount option for procfs
> > > >> to expose more static files still look like a clean solution to me.
> > > > 
> > > > I don't think you'll be able to do that. procfs doesn't allow itself to
> > > > be overlayed [1]. What should block mounting overlayfs and fuse on top
> > > > of procfs.
> > > > 
> > > > [1] https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/proc/root.c#n274
> > > 
> > > This is why I have been careful not to say overlayfs. With [2] (warning:
> > > zero-shot ChatGPT output), I can do:
> > > 
> > > $ ./fuse-overlay target --source=/proc
> > > $ ls target
> > > 1   88   194   1374    889840  908552
> > > 2   90   195   1375    889987  908619
> > > 3   91   196   1379    890031  908658
> > > 4   92   203   1412    890063  908756
> > > 5   93   205   1590    890085  908804
> > > 6   94   233   1644    890139  908951
> > > 7   96   237   1802    890246  909848
> > > 8   97   239   1850    890271  909914
> > > 10  98   240   1852    894665  909924
> > > 13  99   243   1865    895854  909926
> > > 15  100  244   1888    895864  910005
> > > 16  102  246   1889    896030  acpi
> > > 17  103  262   1891    896205  asound
> > > 18  104  263   1895    896508  bus
> > > 19  105  264   1896    896544  driver
> > > 20  106  265   1899    896706  dynamic_debug
> > > <...>
> > > 
> > > [2] https://gist.github.com/DanShaders/547eeb74a90315356b98472feae47474
> > > 
> > > This requires a much more careful thought wrt magic symlinks
> > > and permission checks. The fact that I am highly unlikely to 100%
> > > correctly reimplement the checks and special behavior of procfs makes me
> > > not want to proceed with the FUSE route.
> > > 
> > > On 12/15/25 6:30 AM, Christian Brauner wrote:
> > > > The standard way of making it possible to mount procfs inside of a
> > > > container with a separate mount namespace that has a procfs inside it
> > > > with overmounted entries is to ensure that a fully-visible procfs
> > > > instance is present.
> > > 
> > > Yes, this is a solution. However, this is only marginally better than
> > > passing --privileged to the outer container (in a sense that we require
> > > outer sandbox to remove some protections for the inner sandbox to work).
> > > 
> > > > The container needs to inherit a fully-visible instance somehow if you
> > > > want nesting. Using an unprivileged LSM such as landlock to prevent any
> > > > access to the fully visible procfs instance is usually the better way.
> > > > 
> > > > My hope is that once signed bpf is more widely adopted that distros will
> > > > just start enabling blessed bpf programs that will just take on the
> > > > access protecting instead of the clumsy bind-mount protection mechanism.
> > > 
> > > These are big changes to container runtimes that are unlikely to happen
> > > soon. In contrast, the patch we are discussing will be available in 2
> > > months after the merge for me to use on ArchLinux, and in a couple more
> > > months on Ubuntu.
> > > 
> > > So, is there any way forward with the patch or should I continue trying
> > > to find a userspace solution?
> > 
> > I still consider these patches useful. I made them precisely to remove
> > some of the restrictions we have for procfs because of global files in
> > the root of this filesystem.
> > 
> > I can update and prepare a new version of patchset if Christian thinks
> > it's useful too.
> 
> Let's see it at least! No need to preemptively dismiss it. :)
> 

So what do you think about these changes?

https://lore.kernel.org/all/cover.1768295900.git.legion@kernel.org/#t

-- 
Rgrds, legion


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RESEND PATCH v6 0/5] proc: subset=pid: Relax check of mount visibility
  2025-12-14 18:02         ` Dan Klishch
  2025-12-15 10:10           ` Alexey Gladkov
@ 2025-12-15 11:30           ` Christian Brauner
  1 sibling, 0 replies; 34+ messages in thread
From: Christian Brauner @ 2025-12-15 11:30 UTC (permalink / raw)
  To: Dan Klishch
  Cc: legion, containers, ebiederm, keescook, linux-fsdevel,
	linux-kernel, viro

On Sun, Dec 14, 2025 at 01:02:54PM -0500, Dan Klishch wrote:
> On 12/14/25 11:40 AM, Alexey Gladkov wrote:
> > But then, if I understand you correctly, this patch will not be enough
> > for you. procfs with subset=pid will not allow you to have /proc/meminfo,
> > /proc/cpuinfo, etc.
> 
> Hmm, I didn't think of this. sunwalker-box only exposes cpuinfo and PID
> tree to the sandboxed programs (empirically, this is enough for most of
> programs you want sandboxing for). With that in mind, this patch and a
> FUSE providing an overlay with cpuinfo / seccomp intercepting opens of
> /proc/cpuinfo / a small kernel patch with a new mount option for procfs
> to expose more static files still look like a clean solution to me.

The standard way of making it possible to mount procfs inside of a
container with a separate mount namespace that has a procfs inside it
with overmounted entries is to ensure that a fully-visible procfs
instance is present. This is for example what Incus does when nesting
containers is enabled. In systemd I implemented the same logic years
ago:

commit b71a0192c040f585397cfc6fc2ca025bf839733d
Author:     Christian Brauner <brauner@kernel.org>
AuthorDate: Mon Nov 28 12:36:47 2022 +0100
Commit:     Christian Brauner (Microsoft) <brauner@kernel.org>
CommitDate: Mon Dec 5 18:34:25 2022 +0100

    nspawn: mount temporary visible procfs and sysfs instance

    In order to mount procfs and sysfs in an unprivileged container the
    kernel requires that a fully visible instance is already present in the
    target mount namespace. Mount one here so the inner child can mount its
    own instances. Later we umount the temporary instances created here
    before we actually exec the payload. Since the rootfs is shared the
    umount will propagate into the container. Note, the inner child wouldn't
    be able to unmount the instances on its own since it doesn't own the
    originating mount namespace. IOW, the outer child needs to do this.

    So far nspawn didn't run into this issue because it used MS_MOVE which
    meant that the shadow mount tree pinned a procfs and sysfs instance
    which the kernel would find. The shadow mount tree is gone with proper
    pivot_root() semantics.

    Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>

> 
> >> Also, correct me if I am wrong, installing ebpf controller requires
> >> CAP_BPF in initial userns, so rootless podman will not be able to mask
> >> /proc "properly" even if someone sends a patch switching it to ebpf.

The container needs to inherit a fully-visible instance somehow if you
want nesting. Using an unprivileged LSM such as landlock to prevent any
access to the fully visible procfs instance is usually the better way.

My hope is that once signed bpf is more widely adopted, distros will
just start enabling blessed bpf programs that take over the access
protection instead of the clumsy bind-mount protection mechanism.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH v7 0/5] proc: subset=pid: Relax check of mount visibility
  2025-12-13  5:06 ` [RESEND PATCH v6 0/5] proc: subset=pid: Relax check of mount visibility Dan Klishch
  2025-12-13 10:49   ` Alexey Gladkov
@ 2026-01-13  9:20   ` Alexey Gladkov
  2026-01-13  9:20     ` [PATCH v7 1/5] docs: proc: add documentation about mount restrictions Alexey Gladkov
                       ` (5 more replies)
  1 sibling, 6 replies; 34+ messages in thread
From: Alexey Gladkov @ 2026-01-13  9:20 UTC (permalink / raw)
  To: Christian Brauner, Dan Klishch
  Cc: Al Viro, Eric W . Biederman, Kees Cook, containers, linux-fsdevel,
	linux-kernel

When mounting procfs with the subset=pid option, all static files become
unavailable and only the dynamic part with information about pids is accessible.

In this case, there is no point in imposing additional restrictions on the
visibility of the entire filesystem for the mounter. Everything that can be
hidden in procfs is already inaccessible.

Currently, these restrictions prevent a pid-only procfs from being mounted
inside rootless containers, as almost all container implementations overmount
parts of procfs to hide certain directories. Relaxing these restrictions will
allow a pid-only procfs to be used in nested containerization.
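
For readers who have not used the option, a minimal userspace sketch of
requesting such a mount follows. It is not part of the series. The helper
name mount_proc_pid_subset() is hypothetical; only mount(2) and the
"subset=pid" option string come from existing interfaces. Whether the call
succeeds inside a nested user namespace depends on the visibility check
discussed in this thread.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mount.h>

/*
 * Hypothetical helper: mount a pid-only procfs instance at `target`.
 * Returns 0 on success or a negative errno value on failure.
 */
int mount_proc_pid_subset(const char *target)
{
	assert(target != NULL);

	/* "subset=pid" keeps only the per-PID part of procfs. */
	if (mount("proc", target, "proc",
		  MS_NOSUID | MS_NODEV | MS_NOEXEC, "subset=pid") < 0)
		return -errno;

	return 0;
}
```

A caller would typically invoke mount_proc_pid_subset("/proc") after
entering new user, mount and pid namespaces, and treat a negative return
value as the kernel refusing the mount.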

Changelog
---------
v7:
* Rebase on v6.19-rc5.
* Rename SB_I_DYNAMIC to SB_I_USERNS_ALLOW_REVEALING.

v6:
* Add documentation about procfs mount restrictions.
* Reorder commits for better review.

v4:
* Set SB_I_DYNAMIC only if pidonly is set.
* Add an error message if subset=pid is canceled during remount.

v3:
* Add 'const' to struct cred *mounter_cred (fix kernel test robot warning).

v2:
* cache the mounter's credentials and make access to the net directories
  contingent on the permissions of the mounter of procfs.

--

Alexey Gladkov (5):
  docs: proc: add documentation about mount restrictions
  proc: subset=pid: Show /proc/self/net only for CAP_NET_ADMIN
  proc: Disable cancellation of subset=pid option
  proc: Relax check of mount visibility
  docs: proc: add documentation about relaxing visibility restrictions

 Documentation/filesystems/proc.rst | 15 +++++++++++++++
 fs/namespace.c                     | 29 ++++++++++++++++-------------
 fs/proc/proc_net.c                 |  8 ++++++++
 fs/proc/root.c                     | 24 +++++++++++++++++++-----
 include/linux/fs/super_types.h     |  2 ++
 include/linux/proc_fs.h            |  1 +
 6 files changed, 61 insertions(+), 18 deletions(-)

-- 
2.52.0

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH v7 1/5] docs: proc: add documentation about mount restrictions
  2026-01-13  9:20   ` [PATCH v7 " Alexey Gladkov
@ 2026-01-13  9:20     ` Alexey Gladkov
  2026-01-13  9:20     ` [PATCH v7 2/5] proc: subset=pid: Show /proc/self/net only for CAP_NET_ADMIN Alexey Gladkov
                       ` (4 subsequent siblings)
  5 siblings, 0 replies; 34+ messages in thread
From: Alexey Gladkov @ 2026-01-13  9:20 UTC (permalink / raw)
  To: Christian Brauner, Dan Klishch
  Cc: Al Viro, Eric W . Biederman, Kees Cook, containers, linux-fsdevel,
	linux-kernel

procfs has a number of mounting restrictions that are not documented
anywhere.

Signed-off-by: Alexey Gladkov <legion@kernel.org>
---
 Documentation/filesystems/proc.rst | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
index 8256e857e2d7..c8864fcbdec7 100644
--- a/Documentation/filesystems/proc.rst
+++ b/Documentation/filesystems/proc.rst
@@ -52,6 +52,7 @@ fixes/update part 1.1  Stefani Seibold <stefani@seibold.net>    June 9 2009
 
   4	Configuring procfs
   4.1	Mount options
+  4.2	Mount restrictions
 
   5	Filesystem behavior
 
@@ -2410,6 +2411,19 @@ will use the calling process's active pid namespace. Note that the pid
 namespace of an existing procfs instance cannot be modified (attempting to do
 so will give an `-EBUSY` error).
 
+4.2	Mount restrictions
+--------------------------
+
+If user namespaces are in use, the kernel additionally checks the instances of
+procfs available to the mounter and will not allow procfs to be mounted if:
+
+  1. This mount is not fully visible.
+
+     a. Its root directory is not the root directory of the filesystem.
+     b. Any file or non-empty procfs directory is hidden by another mount.
+
+  2. A new mount overrides the readonly option or any option from the atime family.
+
 Chapter 5: Filesystem behavior
 ==============================
 
-- 
2.52.0
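
The restrictions documented in the patch above can be read as a predicate
over the procfs instances the mounter can already see. The sketch below is
a simplified userspace model of that reading, not the kernel
implementation: all struct and function names are hypothetical, the atime
family is reduced to a single integer, and it assumes that one qualifying
existing instance is enough to allow the new mount.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified view of one procfs instance the mounter can see. */
struct proc_view {
	bool root_is_fs_root;  /* 1a: mount root is the root of the filesystem */
	bool entries_hidden;   /* 1b: a file or non-empty directory is overmounted */
	bool read_only;        /* the existing instance is mounted read-only */
	int  atime_mode;       /* stand-in for the atime family of options */
};

/* Options requested for the new procfs mount. */
struct proc_request {
	bool read_only;
	int  atime_mode;
};

/* True if this existing instance satisfies rules 1 and 2 for the request. */
bool proc_view_permits(const struct proc_view *v, const struct proc_request *req)
{
	if (!v->root_is_fs_root || v->entries_hidden)
		return false;	/* rule 1: not fully visible */
	if (v->read_only && !req->read_only)
		return false;	/* rule 2: would override readonly */
	if (v->atime_mode != req->atime_mode)
		return false;	/* rule 2: would override an atime option */
	return true;
}

/* Assumption: one qualifying instance is enough to allow the mount. */
bool proc_mount_allowed(const struct proc_view *views, size_t n,
			const struct proc_request *req)
{
	for (size_t i = 0; i < n; i++)
		if (proc_view_permits(&views[i], req))
			return true;
	return false;
}
```

In this model, a container whose only procfs instance has overmounted
entries gets "false" for any request, which is the situation the series
aims to relax for subset=pid.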


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH v7 2/5] proc: subset=pid: Show /proc/self/net only for CAP_NET_ADMIN
  2026-01-13  9:20   ` [PATCH v7 " Alexey Gladkov
  2026-01-13  9:20     ` [PATCH v7 1/5] docs: proc: add documentation about mount restrictions Alexey Gladkov
@ 2026-01-13  9:20     ` Alexey Gladkov
  2026-02-04 14:39       ` Christian Brauner
  2026-01-13  9:20     ` [PATCH v7 3/5] proc: Disable cancellation of subset=pid option Alexey Gladkov
                       ` (3 subsequent siblings)
  5 siblings, 1 reply; 34+ messages in thread
From: Alexey Gladkov @ 2026-01-13  9:20 UTC (permalink / raw)
  To: Christian Brauner, Dan Klishch
  Cc: Al Viro, Eric W . Biederman, Kees Cook, containers, linux-fsdevel,
	linux-kernel

Cache the mounter's credentials and make access to the net directories
contingent on the permissions of the mounter of proc.

Do not show /proc/self/net when proc is mounted with subset=pid option
and the mounter does not have CAP_NET_ADMIN.

Signed-off-by: Alexey Gladkov <legion@kernel.org>
---
 fs/proc/proc_net.c      | 8 ++++++++
 fs/proc/root.c          | 5 +++++
 include/linux/proc_fs.h | 1 +
 3 files changed, 14 insertions(+)

diff --git a/fs/proc/proc_net.c b/fs/proc/proc_net.c
index 52f0b75cbce2..6e0ccef0169f 100644
--- a/fs/proc/proc_net.c
+++ b/fs/proc/proc_net.c
@@ -23,6 +23,7 @@
 #include <linux/uidgid.h>
 #include <net/net_namespace.h>
 #include <linux/seq_file.h>
+#include <linux/security.h>
 
 #include "internal.h"
 
@@ -270,6 +271,7 @@ static struct net *get_proc_task_net(struct inode *dir)
 	struct task_struct *task;
 	struct nsproxy *ns;
 	struct net *net = NULL;
+	struct proc_fs_info *fs_info = proc_sb_info(dir->i_sb);
 
 	rcu_read_lock();
 	task = pid_task(proc_pid(dir), PIDTYPE_PID);
@@ -282,6 +284,12 @@ static struct net *get_proc_task_net(struct inode *dir)
 	}
 	rcu_read_unlock();
 
+	if (net && (fs_info->pidonly == PROC_PIDONLY_ON) &&
+	    security_capable(fs_info->mounter_cred, net->user_ns, CAP_NET_ADMIN, CAP_OPT_NONE) < 0) {
+		put_net(net);
+		net = NULL;
+	}
+
 	return net;
 }
 
diff --git a/fs/proc/root.c b/fs/proc/root.c
index d8ca41d823e4..ed8a101d09d3 100644
--- a/fs/proc/root.c
+++ b/fs/proc/root.c
@@ -254,6 +254,7 @@ static int proc_fill_super(struct super_block *s, struct fs_context *fc)
 		return -ENOMEM;
 
 	fs_info->pid_ns = get_pid_ns(ctx->pid_ns);
+	fs_info->mounter_cred = get_cred(fc->cred);
 	proc_apply_options(fs_info, fc, current_user_ns());
 
 	/* User space would break if executables or devices appear on proc */
@@ -303,6 +304,9 @@ static int proc_reconfigure(struct fs_context *fc)
 
 	sync_filesystem(sb);
 
+	put_cred(fs_info->mounter_cred);
+	fs_info->mounter_cred = get_cred(fc->cred);
+
 	proc_apply_options(fs_info, fc, current_user_ns());
 	return 0;
 }
@@ -350,6 +354,7 @@ static void proc_kill_sb(struct super_block *sb)
 	kill_anon_super(sb);
 	if (fs_info) {
 		put_pid_ns(fs_info->pid_ns);
+		put_cred(fs_info->mounter_cred);
 		kfree_rcu(fs_info, rcu);
 	}
 }
diff --git a/include/linux/proc_fs.h b/include/linux/proc_fs.h
index 19d1c5e5f335..ec123c277d49 100644
--- a/include/linux/proc_fs.h
+++ b/include/linux/proc_fs.h
@@ -67,6 +67,7 @@ enum proc_pidonly {
 struct proc_fs_info {
 	struct pid_namespace *pid_ns;
 	kgid_t pid_gid;
+	const struct cred *mounter_cred;
 	enum proc_hidepid hide_pid;
 	enum proc_pidonly pidonly;
 	struct rcu_head rcu;
-- 
2.52.0
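
The decision this patch adds to get_proc_task_net() can be summarised with
a small userspace model. This is only a sketch under stated assumptions:
the struct and function names are hypothetical, and the capability check
that the patch performs with security_capable() against the cached mounter
credentials is reduced to a single boolean.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the credentials cached at mount time. */
struct mounter_model {
	bool cap_net_admin;	/* mounter holds CAP_NET_ADMIN over the netns */
};

/* Hypothetical stand-in for the relevant parts of struct proc_fs_info. */
struct proc_model {
	bool pidonly;			/* mounted with subset=pid */
	struct mounter_model mounter;	/* captured when the instance was mounted */
};

/* Decide whether /proc/<pid>/net is shown on this instance. */
bool proc_net_visible(const struct proc_model *p)
{
	if (!p->pidonly)
		return true;		/* full procfs: behaviour unchanged */
	return p->mounter.cap_net_admin;	/* subset=pid: depends on the mounter */
}
```

Note that the result depends on who mounted the instance, not on the
process reading it.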


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: [PATCH v7 2/5] proc: subset=pid: Show /proc/self/net only for CAP_NET_ADMIN
  2026-01-13  9:20     ` [PATCH v7 2/5] proc: subset=pid: Show /proc/self/net only for CAP_NET_ADMIN Alexey Gladkov
@ 2026-02-04 14:39       ` Christian Brauner
  2026-02-11 19:35         ` Alexey Gladkov
  0 siblings, 1 reply; 34+ messages in thread
From: Christian Brauner @ 2026-02-04 14:39 UTC (permalink / raw)
  To: Alexey Gladkov
  Cc: Dan Klishch, Al Viro, Eric W . Biederman, Kees Cook, containers,
	linux-fsdevel, linux-kernel

On Tue, Jan 13, 2026 at 10:20:34AM +0100, Alexey Gladkov wrote:
> Cache the mounters credentials and allow access to the net directories
> contingent of the permissions of the mounter of proc.
> 
> Do not show /proc/self/net when proc is mounted with subset=pid option
> and the mounter does not have CAP_NET_ADMIN.
> 
> Signed-off-by: Alexey Gladkov <legion@kernel.org>
> ---
>  fs/proc/proc_net.c      | 8 ++++++++
>  fs/proc/root.c          | 5 +++++
>  include/linux/proc_fs.h | 1 +
>  3 files changed, 14 insertions(+)
> 
> diff --git a/fs/proc/proc_net.c b/fs/proc/proc_net.c
> index 52f0b75cbce2..6e0ccef0169f 100644
> --- a/fs/proc/proc_net.c
> +++ b/fs/proc/proc_net.c
> @@ -23,6 +23,7 @@
>  #include <linux/uidgid.h>
>  #include <net/net_namespace.h>
>  #include <linux/seq_file.h>
> +#include <linux/security.h>
>  
>  #include "internal.h"
>  
> @@ -270,6 +271,7 @@ static struct net *get_proc_task_net(struct inode *dir)
>  	struct task_struct *task;
>  	struct nsproxy *ns;
>  	struct net *net = NULL;
> +	struct proc_fs_info *fs_info = proc_sb_info(dir->i_sb);
>  
>  	rcu_read_lock();
>  	task = pid_task(proc_pid(dir), PIDTYPE_PID);
> @@ -282,6 +284,12 @@ static struct net *get_proc_task_net(struct inode *dir)
>  	}
>  	rcu_read_unlock();
>  
> +	if (net && (fs_info->pidonly == PROC_PIDONLY_ON) &&
> +	    security_capable(fs_info->mounter_cred, net->user_ns, CAP_NET_ADMIN, CAP_OPT_NONE) < 0) {
> +		put_net(net);
> +		net = NULL;
> +	}
> +
>  	return net;
>  }
>  
> diff --git a/fs/proc/root.c b/fs/proc/root.c
> index d8ca41d823e4..ed8a101d09d3 100644
> --- a/fs/proc/root.c
> +++ b/fs/proc/root.c
> @@ -254,6 +254,7 @@ static int proc_fill_super(struct super_block *s, struct fs_context *fc)
>  		return -ENOMEM;
>  
>  	fs_info->pid_ns = get_pid_ns(ctx->pid_ns);
> +	fs_info->mounter_cred = get_cred(fc->cred);
>  	proc_apply_options(fs_info, fc, current_user_ns());
>  
>  	/* User space would break if executables or devices appear on proc */
> @@ -303,6 +304,9 @@ static int proc_reconfigure(struct fs_context *fc)
>  
>  	sync_filesystem(sb);
>  
> +	put_cred(fs_info->mounter_cred);
> +	fs_info->mounter_cred = get_cred(fc->cred);

Afaict, this races with get_proc_task_net(). You need a synchronization
mechanism here so that get_proc_task_net() doesn't risk accessing
invalid mounter creds while someone concurrently updates the creds.
Proposal how to fix that below.

But I'm kinda torn here anyway whether we want that credential change on
remount. The problem is that someone might inadvertently allow access to
/proc/<pid>/net as a side-effect simply because they remounted procfs.
But they never had a chance to prevent this.

I think it's best if the mounter creds stay fixed, just as they do for
overlayfs, so we don't allow them to change on reconfigure. That also
makes all of the code I hinted at below pointless.

If we ever want to change the credentials it's easier to add a mount
option to procfs like I did for overlayfs.

_Untested_ patches:

First, the preparatory patch diff (no functional changes intended):

diff --git a/fs/proc/proc_net.c b/fs/proc/proc_net.c
index 52f0b75cbce2..81825e5819b8 100644
--- a/fs/proc/proc_net.c
+++ b/fs/proc/proc_net.c
@@ -268,19 +268,19 @@ EXPORT_SYMBOL_GPL(proc_create_net_single_write);
 static struct net *get_proc_task_net(struct inode *dir)
 {
        struct task_struct *task;
-       struct nsproxy *ns;
-       struct net *net = NULL;
+       struct net *net;

-       rcu_read_lock();
+       guard(rcu)();
        task = pid_task(proc_pid(dir), PIDTYPE_PID);
-       if (task != NULL) {
-               task_lock(task);
-               ns = task->nsproxy;
-               if (ns != NULL)
-                       net = get_net(ns->net_ns);
-               task_unlock(task);
+       if (!task)
+               return NULL;
+
+       scoped_guard(task_lock, task) {
+               struct nsproxy *ns = task->nsproxy;
+               if (!ns)
+                       return NULL;
+               net = get_net(ns->net_ns);
        }
-       rcu_read_unlock();

        return net;
 }

And then on top of it something like:

diff --git a/fs/proc/proc_net.c b/fs/proc/proc_net.c
index 81825e5819b8..47dc9806395c 100644
--- a/fs/proc/proc_net.c
+++ b/fs/proc/proc_net.c
@@ -269,6 +269,8 @@ static struct net *get_proc_task_net(struct inode *dir)
 {
        struct task_struct *task;
        struct net *net;
+       struct proc_fs_info *fs_info;
+       const struct cred *cred;

        guard(rcu)();
        task = pid_task(proc_pid(dir), PIDTYPE_PID);
@@ -282,6 +284,15 @@ static struct net *get_proc_task_net(struct inode *dir)
                net = get_net(ns->net_ns);
        }

+       fs_info = proc_sb_info(dir->i_sb);
+       if (fs_info->pidonly != PROC_PIDONLY_ON)
+               return net;
+
+       cred = rcu_dereference(fs_info->mounter_cred);
+       if (security_capable(cred, net->user_ns, CAP_NET_ADMIN, CAP_OPT_NONE) != 0) {
+               put_net(net);
+               return NULL;
+       }
        return net;
 }

diff --git a/fs/proc/root.c b/fs/proc/root.c
index d8ca41d823e4..68397900dab7 100644
--- a/fs/proc/root.c
+++ b/fs/proc/root.c
@@ -300,11 +300,15 @@ static int proc_reconfigure(struct fs_context *fc)
 {
        struct super_block *sb = fc->root->d_sb;
        struct proc_fs_info *fs_info = proc_sb_info(sb);
+       const struct cred *cred;

        sync_filesystem(sb);

-       proc_apply_options(fs_info, fc, current_user_ns());
-       return 0;
+       cred = rcu_replace_pointer(fs_info->mounter_cred, get_cred(fc->cred),
+                                  lockdep_is_held(&sb->s_umount));
+       put_cred(cred);
+
+       return proc_apply_options(sb, fc, current_user_ns());
 }

 static int proc_get_tree(struct fs_context *fc)
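
The alternative suggested in the review text, keeping the mounter
credentials fixed across remounts, can be sketched as a userspace model.
It corresponds to neither of the untested diffs above. All names are
hypothetical, and hide_pid merely stands in for any option that a remount
is allowed to change.

```c
#include <stdbool.h>

/* Hypothetical stand-in for a set of credentials. */
struct cred_model {
	bool cap_net_admin;
};

/* Hypothetical stand-in for the per-superblock procfs state. */
struct proc_sb_model {
	bool pidonly;
	int hide_pid;			/* an option that remount may change */
	struct cred_model mounter;	/* set once, at mount time */
};

/* Initial mount: options and mounter credentials are both recorded. */
void proc_sb_init(struct proc_sb_model *sb, struct cred_model mounter,
		  bool pidonly, int hide_pid)
{
	sb->pidonly = pidonly;
	sb->hide_pid = hide_pid;
	sb->mounter = mounter;
}

/* Remount: options change, the caller's credentials are deliberately ignored. */
void proc_sb_reconfigure(struct proc_sb_model *sb, struct cred_model caller,
			 int hide_pid)
{
	(void)caller;
	sb->hide_pid = hide_pid;
}

/* Visibility of /proc/<pid>/net still follows the original mounter. */
bool proc_sb_net_visible(const struct proc_sb_model *sb)
{
	return !sb->pidonly || sb->mounter.cap_net_admin;
}
```

With fixed credentials there is no pointer to swap at reconfigure time, so
the race with get_proc_task_net() does not arise in the first place, and a
remount by a more privileged caller cannot widen access as a side effect.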

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: [PATCH v7 2/5] proc: subset=pid: Show /proc/self/net only for CAP_NET_ADMIN
  2026-02-04 14:39       ` Christian Brauner
@ 2026-02-11 19:35         ` Alexey Gladkov
  0 siblings, 0 replies; 34+ messages in thread
From: Alexey Gladkov @ 2026-02-11 19:35 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Dan Klishch, Al Viro, Eric W . Biederman, Kees Cook, containers,
	linux-fsdevel, linux-kernel

On Wed, Feb 04, 2026 at 03:39:53PM +0100, Christian Brauner wrote:
> On Tue, Jan 13, 2026 at 10:20:34AM +0100, Alexey Gladkov wrote:
> > Cache the mounters credentials and allow access to the net directories
> > contingent of the permissions of the mounter of proc.
> > 
> > Do not show /proc/self/net when proc is mounted with subset=pid option
> > and the mounter does not have CAP_NET_ADMIN.
> > 
> > Signed-off-by: Alexey Gladkov <legion@kernel.org>
> > ---
> >  fs/proc/proc_net.c      | 8 ++++++++
> >  fs/proc/root.c          | 5 +++++
> >  include/linux/proc_fs.h | 1 +
> >  3 files changed, 14 insertions(+)
> > 
> > diff --git a/fs/proc/proc_net.c b/fs/proc/proc_net.c
> > index 52f0b75cbce2..6e0ccef0169f 100644
> > --- a/fs/proc/proc_net.c
> > +++ b/fs/proc/proc_net.c
> > @@ -23,6 +23,7 @@
> >  #include <linux/uidgid.h>
> >  #include <net/net_namespace.h>
> >  #include <linux/seq_file.h>
> > +#include <linux/security.h>
> >  
> >  #include "internal.h"
> >  
> > @@ -270,6 +271,7 @@ static struct net *get_proc_task_net(struct inode *dir)
> >  	struct task_struct *task;
> >  	struct nsproxy *ns;
> >  	struct net *net = NULL;
> > +	struct proc_fs_info *fs_info = proc_sb_info(dir->i_sb);
> >  
> >  	rcu_read_lock();
> >  	task = pid_task(proc_pid(dir), PIDTYPE_PID);
> > @@ -282,6 +284,12 @@ static struct net *get_proc_task_net(struct inode *dir)
> >  	}
> >  	rcu_read_unlock();
> >  
> > +	if (net && (fs_info->pidonly == PROC_PIDONLY_ON) &&
> > +	    security_capable(fs_info->mounter_cred, net->user_ns, CAP_NET_ADMIN, CAP_OPT_NONE) < 0) {
> > +		put_net(net);
> > +		net = NULL;
> > +	}
> > +
> >  	return net;
> >  }
> >  
> > diff --git a/fs/proc/root.c b/fs/proc/root.c
> > index d8ca41d823e4..ed8a101d09d3 100644
> > --- a/fs/proc/root.c
> > +++ b/fs/proc/root.c
> > @@ -254,6 +254,7 @@ static int proc_fill_super(struct super_block *s, struct fs_context *fc)
> >  		return -ENOMEM;
> >  
> >  	fs_info->pid_ns = get_pid_ns(ctx->pid_ns);
> > +	fs_info->mounter_cred = get_cred(fc->cred);
> >  	proc_apply_options(fs_info, fc, current_user_ns());
> >  
> >  	/* User space would break if executables or devices appear on proc */
> > @@ -303,6 +304,9 @@ static int proc_reconfigure(struct fs_context *fc)
> >  
> >  	sync_filesystem(sb);
> >  
> > +	put_cred(fs_info->mounter_cred);
> > +	fs_info->mounter_cred = get_cred(fc->cred);
> 
> Afaict, this races with get_proc_task_net(). You need a synchronization
> mechanism here so that get_proc_task_net() doesn't risk accessing
> invalid mounter creds while someone concurrently updates the creds.
> Proposal how to fix that below.
> 
> But I'm kinda torn here anyway whether we want that credential change on
> remount. The problem is that someone might inadvertently allow access to
> /proc/<pid>/net as a side-effect simply because they remounted procfs.
> But they never had a chance to prevent this.

I think you're right, and there's no need to change credentials on
remount. At least not now.

> I think it's best if mounter_creds stays fixed just as they do for
> overlayfs. So we don't allow them to change on reconfigure. That also
> makes all of the code I hinted at below pointless.

I'll just remove the mounter_cred update from proc_reconfigure.

> If we ever want to change the credentials it's easier to add a mount
> option to procfs like I did for overlayfs.
> 
> _Untested_ patches:
> 
> First, the preparatory patch diff (no functional changes intended):
> 
> diff --git a/fs/proc/proc_net.c b/fs/proc/proc_net.c
> index 52f0b75cbce2..81825e5819b8 100644
> --- a/fs/proc/proc_net.c
> +++ b/fs/proc/proc_net.c
> @@ -268,19 +268,19 @@ EXPORT_SYMBOL_GPL(proc_create_net_single_write);
>  static struct net *get_proc_task_net(struct inode *dir)
>  {
>         struct task_struct *task;
> -       struct nsproxy *ns;
> -       struct net *net = NULL;
> +       struct net *net;
> 
> -       rcu_read_lock();
> +       guard(rcu)();
>         task = pid_task(proc_pid(dir), PIDTYPE_PID);
> -       if (task != NULL) {
> -               task_lock(task);
> -               ns = task->nsproxy;
> -               if (ns != NULL)
> -                       net = get_net(ns->net_ns);
> -               task_unlock(task);
> +       if (!task)
> +               return NULL;
> +
> +       scoped_guard(task_lock, task) {
> +               struct nsproxy *ns = task->nsproxy;
> +               if (!ns)
> +                       return NULL;
> +               net = get_net(ns->net_ns);
>         }
> -       rcu_read_unlock();
> 
>         return net;
>  }
> 
> And then on top of it something like:
> 
> diff --git a/fs/proc/proc_net.c b/fs/proc/proc_net.c
> index 81825e5819b8..47dc9806395c 100644
> --- a/fs/proc/proc_net.c
> +++ b/fs/proc/proc_net.c
> @@ -269,6 +269,8 @@ static struct net *get_proc_task_net(struct inode *dir)
>  {
>         struct task_struct *task;
>         struct net *net;
> +       struct proc_fs_info *fs_info;
> +       const struct cred *cred;
> 
>         guard(rcu)();
>         task = pid_task(proc_pid(dir), PIDTYPE_PID);
> @@ -282,6 +284,15 @@ static struct net *get_proc_task_net(struct inode *dir)
>                 net = get_net(ns->net_ns);
>         }
> 
> +       fs_info = proc_sb_info(dir->i_sb);
> +       if (fs_info->pidonly != PROC_PIDONLY_ON)
> +               return net;
> +
> +       cred = rcu_dereference(fs_info->mounter_cred);
> +       if (security_capable(cred, net->user_ns, CAP_NET_ADMIN, CAP_OPT_NONE) != 0) {
> +               put_net(net);
> +               return NULL;
> +       }
>         return net;
>  }
> 
> diff --git a/fs/proc/root.c b/fs/proc/root.c
> index d8ca41d823e4..68397900dab7 100644
> --- a/fs/proc/root.c
> +++ b/fs/proc/root.c
> @@ -300,11 +300,15 @@ static int proc_reconfigure(struct fs_context *fc)
>  {
>         struct super_block *sb = fc->root->d_sb;
>         struct proc_fs_info *fs_info = proc_sb_info(sb);
> +       const struct cred *cred;
> 
>         sync_filesystem(sb);
> 
> -       proc_apply_options(fs_info, fc, current_user_ns());
> -       return 0;
> +       cred = rcu_replace_pointer(fs_info->mounter_cred, get_cred(fc->cred),
> +                                  lockdep_is_held(&sb->s_umount));
> +       put_cred(cred);
> +
> +       return proc_apply_options(sb, fc, current_user_ns());
>  }
> 
>  static int proc_get_tree(struct fs_context *fc)
> 

-- 
Rgrds, legion


* [PATCH v7 3/5] proc: Disable cancellation of subset=pid option
  2026-01-13  9:20   ` [PATCH v7 " Alexey Gladkov
  2026-01-13  9:20     ` [PATCH v7 1/5] docs: proc: add documentation about mount restrictions Alexey Gladkov
  2026-01-13  9:20     ` [PATCH v7 2/5] proc: subset=pid: Show /proc/self/net only for CAP_NET_ADMIN Alexey Gladkov
@ 2026-01-13  9:20     ` Alexey Gladkov
  2026-01-13  9:20     ` [PATCH v7 4/5] proc: Relax check of mount visibility Alexey Gladkov
                       ` (2 subsequent siblings)
  5 siblings, 0 replies; 34+ messages in thread
From: Alexey Gladkov @ 2026-01-13  9:20 UTC (permalink / raw)
  To: Christian Brauner, Dan Klishch
  Cc: Al Viro, Eric W . Biederman, Kees Cook, containers, linux-fsdevel,
	linux-kernel

When procfs is mounted with the subset=pid option, there is no way to
remount it with this option removed. This prevents anything that was
hidden from becoming visible, since some checks occur during mount.

This patch makes the limitation explicit and prints an error message.

Signed-off-by: Alexey Gladkov <legion@kernel.org>
---
 fs/proc/root.c | 15 ++++++++++-----
 1 file changed, 10 insertions(+), 5 deletions(-)

diff --git a/fs/proc/root.c b/fs/proc/root.c
index ed8a101d09d3..b9f33b67cdd6 100644
--- a/fs/proc/root.c
+++ b/fs/proc/root.c
@@ -223,7 +223,7 @@ static int proc_parse_param(struct fs_context *fc, struct fs_parameter *param)
 	return 0;
 }
 
-static void proc_apply_options(struct proc_fs_info *fs_info,
+static int proc_apply_options(struct proc_fs_info *fs_info,
 			       struct fs_context *fc,
 			       struct user_namespace *user_ns)
 {
@@ -233,13 +233,17 @@ static void proc_apply_options(struct proc_fs_info *fs_info,
 		fs_info->pid_gid = make_kgid(user_ns, ctx->gid);
 	if (ctx->mask & (1 << Opt_hidepid))
 		fs_info->hide_pid = ctx->hidepid;
-	if (ctx->mask & (1 << Opt_subset))
+	if (ctx->mask & (1 << Opt_subset)) {
+		if (ctx->pidonly != PROC_PIDONLY_ON && fs_info->pidonly == PROC_PIDONLY_ON)
+			return invalf(fc, "proc: subset=pid cannot be unset\n");
 		fs_info->pidonly = ctx->pidonly;
+	}
 	if (ctx->mask & (1 << Opt_pidns) &&
 	    !WARN_ON_ONCE(fc->purpose == FS_CONTEXT_FOR_RECONFIGURE)) {
 		put_pid_ns(fs_info->pid_ns);
 		fs_info->pid_ns = get_pid_ns(ctx->pid_ns);
 	}
+	return 0;
 }
 
 static int proc_fill_super(struct super_block *s, struct fs_context *fc)
@@ -255,7 +259,9 @@ static int proc_fill_super(struct super_block *s, struct fs_context *fc)
 
 	fs_info->pid_ns = get_pid_ns(ctx->pid_ns);
 	fs_info->mounter_cred = get_cred(fc->cred);
-	proc_apply_options(fs_info, fc, current_user_ns());
+	ret = proc_apply_options(fs_info, fc, current_user_ns());
+	if (ret)
+		return ret;
 
 	/* User space would break if executables or devices appear on proc */
 	s->s_iflags |= SB_I_USERNS_VISIBLE | SB_I_NOEXEC | SB_I_NODEV;
@@ -307,8 +313,7 @@ static int proc_reconfigure(struct fs_context *fc)
 	put_cred(fs_info->mounter_cred);
 	fs_info->mounter_cred = get_cred(fc->cred);
 
-	proc_apply_options(fs_info, fc, current_user_ns());
-	return 0;
+	return proc_apply_options(fs_info, fc, current_user_ns());
 }
 
 static int proc_get_tree(struct fs_context *fc)
-- 
2.52.0


* [PATCH v7 4/5] proc: Relax check of mount visibility
  2026-01-13  9:20   ` [PATCH v7 " Alexey Gladkov
                       ` (2 preceding siblings ...)
  2026-01-13  9:20     ` [PATCH v7 3/5] proc: Disable cancellation of subset=pid option Alexey Gladkov
@ 2026-01-13  9:20     ` Alexey Gladkov
  2026-01-13  9:20     ` [PATCH v7 5/5] docs: proc: add documentation about relaxing visibility restrictions Alexey Gladkov
  2026-02-13 10:44     ` [PATCH v8 0/5] proc: subset=pid: Relax check of mount visibility Alexey Gladkov
  5 siblings, 0 replies; 34+ messages in thread
From: Alexey Gladkov @ 2026-01-13  9:20 UTC (permalink / raw)
  To: Christian Brauner, Dan Klishch
  Cc: Al Viro, Eric W . Biederman, Kees Cook, containers, linux-fsdevel,
	linux-kernel

When /proc is mounted with the subset=pid option, none of the system
files in the root of the filesystem are accessible from userspace. Only
dynamic information about processes is available, and that cannot be
hidden with an overmount.

For this reason, checking for full visibility is not relevant if
mounting is performed with the subset=pid option.

Signed-off-by: Alexey Gladkov <legion@kernel.org>
---
 fs/namespace.c                 | 29 ++++++++++++++++-------------
 fs/proc/root.c                 | 16 ++++++++++------
 include/linux/fs/super_types.h |  2 ++
 3 files changed, 28 insertions(+), 19 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index c58674a20cad..7daa86315c05 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -6116,7 +6116,8 @@ static bool mnt_already_visible(struct mnt_namespace *ns,
 		/* This mount is not fully visible if it's root directory
 		 * is not the root directory of the filesystem.
 		 */
-		if (mnt->mnt.mnt_root != mnt->mnt.mnt_sb->s_root)
+		if (!(sb->s_iflags & SB_I_USERNS_ALLOW_REVEALING) &&
+		    mnt->mnt.mnt_root != mnt->mnt.mnt_sb->s_root)
 			continue;
 
 		/* A local view of the mount flags */
@@ -6136,18 +6137,20 @@ static bool mnt_already_visible(struct mnt_namespace *ns,
 		    ((mnt_flags & MNT_ATIME_MASK) != (new_flags & MNT_ATIME_MASK)))
 			continue;
 
-		/* This mount is not fully visible if there are any
-		 * locked child mounts that cover anything except for
-		 * empty directories.
-		 */
-		list_for_each_entry(child, &mnt->mnt_mounts, mnt_child) {
-			struct inode *inode = child->mnt_mountpoint->d_inode;
-			/* Only worry about locked mounts */
-			if (!(child->mnt.mnt_flags & MNT_LOCKED))
-				continue;
-			/* Is the directory permanently empty? */
-			if (!is_empty_dir_inode(inode))
-				goto next;
+		if (!(sb->s_iflags & SB_I_USERNS_ALLOW_REVEALING)) {
+			/* This mount is not fully visible if there are any
+			 * locked child mounts that cover anything except for
+			 * empty directories.
+			 */
+			list_for_each_entry(child, &mnt->mnt_mounts, mnt_child) {
+				struct inode *inode = child->mnt_mountpoint->d_inode;
+				/* Only worry about locked mounts */
+				if (!IS_MNT_LOCKED(child))
+					continue;
+				/* Is the directory permanently empty? */
+				if (!is_empty_dir_inode(inode))
+					goto next;
+			}
 		}
 		/* Preserve the locked attributes */
 		*new_mnt_flags |= mnt_flags & (MNT_LOCK_READONLY | \
diff --git a/fs/proc/root.c b/fs/proc/root.c
index b9f33b67cdd6..354dc13417e3 100644
--- a/fs/proc/root.c
+++ b/fs/proc/root.c
@@ -223,18 +223,21 @@ static int proc_parse_param(struct fs_context *fc, struct fs_parameter *param)
 	return 0;
 }
 
-static int proc_apply_options(struct proc_fs_info *fs_info,
+static int proc_apply_options(struct super_block *s,
 			       struct fs_context *fc,
 			       struct user_namespace *user_ns)
 {
 	struct proc_fs_context *ctx = fc->fs_private;
+	struct proc_fs_info *fs_info = proc_sb_info(s);
 
 	if (ctx->mask & (1 << Opt_gid))
 		fs_info->pid_gid = make_kgid(user_ns, ctx->gid);
 	if (ctx->mask & (1 << Opt_hidepid))
 		fs_info->hide_pid = ctx->hidepid;
 	if (ctx->mask & (1 << Opt_subset)) {
-		if (ctx->pidonly != PROC_PIDONLY_ON && fs_info->pidonly == PROC_PIDONLY_ON)
+		if (ctx->pidonly == PROC_PIDONLY_ON)
+			s->s_iflags |= SB_I_USERNS_ALLOW_REVEALING;
+		else if (fs_info->pidonly == PROC_PIDONLY_ON)
 			return invalf(fc, "proc: subset=pid cannot be unset\n");
 		fs_info->pidonly = ctx->pidonly;
 	}
@@ -259,9 +262,6 @@ static int proc_fill_super(struct super_block *s, struct fs_context *fc)
 
 	fs_info->pid_ns = get_pid_ns(ctx->pid_ns);
 	fs_info->mounter_cred = get_cred(fc->cred);
-	ret = proc_apply_options(fs_info, fc, current_user_ns());
-	if (ret)
-		return ret;
 
 	/* User space would break if executables or devices appear on proc */
 	s->s_iflags |= SB_I_USERNS_VISIBLE | SB_I_NOEXEC | SB_I_NODEV;
@@ -273,6 +273,10 @@ static int proc_fill_super(struct super_block *s, struct fs_context *fc)
 	s->s_time_gran = 1;
 	s->s_fs_info = fs_info;
 
+	ret = proc_apply_options(s, fc, current_user_ns());
+	if (ret)
+		return ret;
+
 	/*
 	 * procfs isn't actually a stacking filesystem; however, there is
 	 * too much magic going on inside it to permit stacking things on
@@ -313,7 +317,7 @@ static int proc_reconfigure(struct fs_context *fc)
 	put_cred(fs_info->mounter_cred);
 	fs_info->mounter_cred = get_cred(fc->cred);
 
-	return proc_apply_options(fs_info, fc, current_user_ns());
+	return proc_apply_options(sb, fc, current_user_ns());
 }
 
 static int proc_get_tree(struct fs_context *fc)
diff --git a/include/linux/fs/super_types.h b/include/linux/fs/super_types.h
index 6bd3009e09b3..5e640b9140df 100644
--- a/include/linux/fs/super_types.h
+++ b/include/linux/fs/super_types.h
@@ -333,4 +333,6 @@ struct super_block {
 #define SB_I_NOIDMAP	0x00002000	/* No idmapped mounts on this superblock */
 #define SB_I_ALLOW_HSM	0x00004000	/* Allow HSM events on this superblock */
 
+#define SB_I_USERNS_ALLOW_REVEALING	0x00008000 /* Skip full visibility check */
+
 #endif /* _LINUX_FS_SUPER_TYPES_H */
-- 
2.52.0


* [PATCH v7 5/5] docs: proc: add documentation about relaxing visibility restrictions
  2026-01-13  9:20   ` [PATCH v7 " Alexey Gladkov
                       ` (3 preceding siblings ...)
  2026-01-13  9:20     ` [PATCH v7 4/5] proc: Relax check of mount visibility Alexey Gladkov
@ 2026-01-13  9:20     ` Alexey Gladkov
  2026-02-13 10:44     ` [PATCH v8 0/5] proc: subset=pid: Relax check of mount visibility Alexey Gladkov
  5 siblings, 0 replies; 34+ messages in thread
From: Alexey Gladkov @ 2026-01-13  9:20 UTC (permalink / raw)
  To: Christian Brauner, Dan Klishch
  Cc: Al Viro, Eric W . Biederman, Kees Cook, containers, linux-fsdevel,
	linux-kernel

Signed-off-by: Alexey Gladkov <legion@kernel.org>
---
 Documentation/filesystems/proc.rst | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
index c8864fcbdec7..3acf178c1202 100644
--- a/Documentation/filesystems/proc.rst
+++ b/Documentation/filesystems/proc.rst
@@ -2417,7 +2417,8 @@ so will give an `-EBUSY` error).
 If user namespaces are in use, the kernel additionally checks the instances of
 procfs available to the mounter and will not allow procfs to be mounted if:
 
-  1. This mount is not fully visible.
+  1. This mount is not fully visible unless the new procfs is going to be
+     mounted with subset=pid option.
 
      a. It's root directory is not the root directory of the filesystem.
      b. If any file or non-empty procfs directory is hidden by another mount.
-- 
2.52.0


* [PATCH v8 0/5] proc: subset=pid: Relax check of mount visibility
  2026-01-13  9:20   ` [PATCH v7 " Alexey Gladkov
                       ` (4 preceding siblings ...)
  2026-01-13  9:20     ` [PATCH v7 5/5] docs: proc: add documentation about relaxing visibility restrictions Alexey Gladkov
@ 2026-02-13 10:44     ` Alexey Gladkov
  2026-02-13 10:44       ` [PATCH v8 1/5] docs: proc: add documentation about mount restrictions Alexey Gladkov
                         ` (4 more replies)
  5 siblings, 5 replies; 34+ messages in thread
From: Alexey Gladkov @ 2026-02-13 10:44 UTC (permalink / raw)
  To: Christian Brauner, Dan Klishch
  Cc: Al Viro, Eric W . Biederman, Kees Cook, linux-fsdevel,
	linux-kernel

When mounting procfs with the subset=pid option, all static files become
unavailable and only the dynamic part with information about pids is accessible.

In this case, there is no point in imposing additional restrictions on the
visibility of the entire filesystem for the mounter. Everything that can be
hidden in procfs is already inaccessible.

Currently, these restrictions prevent procfs from being mounted inside
rootless containers, as almost all container implementations overmount
parts of procfs to hide certain directories. Relaxing these restrictions
will allow procfs with subset=pid to be used in nested containerization.

---
Changelog
---------
v8:
* Remove mounter credential change on remount as suggested by Christian Brauner.

v7:
* Rebase on v6.19-rc5.
* Rename SB_I_DYNAMIC to SB_I_USERNS_ALLOW_REVEALING.

v6:
* Add documentation about procfs mount restrictions.
* Reorder commits for better review.

v4:
* Set SB_I_DYNAMIC only if pidonly is set.
* Add an error message if subset=pid is canceled during remount.

v3:
* Add 'const' to struct cred *mounter_cred (fix kernel test robot warning).

v2:
* Cache the mounter's credentials and make access to the net directories
  contingent on the permissions of the mounter of procfs.

Alexey Gladkov (5):
  docs: proc: add documentation about mount restrictions
  proc: subset=pid: Show /proc/self/net only for CAP_NET_ADMIN
  proc: Disable cancellation of subset=pid option
  proc: Relax check of mount visibility
  docs: proc: add documentation about relaxing visibility restrictions

 Documentation/filesystems/proc.rst | 15 +++++++++++++++
 fs/namespace.c                     | 29 ++++++++++++++++-------------
 fs/proc/proc_net.c                 |  8 ++++++++
 fs/proc/root.c                     | 22 ++++++++++++++++------
 include/linux/fs/super_types.h     |  2 ++
 include/linux/proc_fs.h            |  1 +
 6 files changed, 58 insertions(+), 19 deletions(-)

-- 
2.53.0

* [PATCH v8 1/5] docs: proc: add documentation about mount restrictions
  2026-02-13 10:44     ` [PATCH v8 0/5] proc: subset=pid: Relax check of mount visibility Alexey Gladkov
@ 2026-02-13 10:44       ` Alexey Gladkov
  2026-02-13 10:44       ` [PATCH v8 2/5] proc: subset=pid: Show /proc/self/net only for CAP_NET_ADMIN Alexey Gladkov
                         ` (3 subsequent siblings)
  4 siblings, 0 replies; 34+ messages in thread
From: Alexey Gladkov @ 2026-02-13 10:44 UTC (permalink / raw)
  To: Christian Brauner, Dan Klishch
  Cc: Al Viro, Eric W . Biederman, Kees Cook, linux-fsdevel,
	linux-kernel

procfs has a number of mounting restrictions that are not documented
anywhere.

Signed-off-by: Alexey Gladkov <legion@kernel.org>
---
 Documentation/filesystems/proc.rst | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
index 8256e857e2d7..c8864fcbdec7 100644
--- a/Documentation/filesystems/proc.rst
+++ b/Documentation/filesystems/proc.rst
@@ -52,6 +52,7 @@ fixes/update part 1.1  Stefani Seibold <stefani@seibold.net>    June 9 2009
 
   4	Configuring procfs
   4.1	Mount options
+  4.2	Mount restrictions
 
   5	Filesystem behavior
 
@@ -2410,6 +2411,19 @@ will use the calling process's active pid namespace. Note that the pid
 namespace of an existing procfs instance cannot be modified (attempting to do
 so will give an `-EBUSY` error).
 
+4.2	Mount restrictions
+--------------------------
+
+If user namespaces are in use, the kernel additionally checks the instances of
+procfs available to the mounter and will not allow procfs to be mounted if:
+
+  1. This mount is not fully visible.
+
+     a. Its root directory is not the root directory of the filesystem.
+     b. If any file or non-empty procfs directory is hidden by another mount.
+
+  2. A new mount overrides the readonly option or any option from the atime family.
+
 Chapter 5: Filesystem behavior
 ==============================
 
-- 
2.53.0


* [PATCH v8 2/5] proc: subset=pid: Show /proc/self/net only for CAP_NET_ADMIN
  2026-02-13 10:44     ` [PATCH v8 0/5] proc: subset=pid: Relax check of mount visibility Alexey Gladkov
  2026-02-13 10:44       ` [PATCH v8 1/5] docs: proc: add documentation about mount restrictions Alexey Gladkov
@ 2026-02-13 10:44       ` Alexey Gladkov
  2026-02-13 10:44       ` [PATCH v8 3/5] proc: Disable cancellation of subset=pid option Alexey Gladkov
                         ` (2 subsequent siblings)
  4 siblings, 0 replies; 34+ messages in thread
From: Alexey Gladkov @ 2026-02-13 10:44 UTC (permalink / raw)
  To: Christian Brauner, Dan Klishch
  Cc: Al Viro, Eric W . Biederman, Kees Cook, linux-fsdevel,
	linux-kernel

Cache the mounter's credentials and make access to the net directories
contingent on the permissions of the mounter of proc.

Do not show /proc/self/net when proc is mounted with the subset=pid
option and the mounter does not have CAP_NET_ADMIN. To avoid
inadvertently allowing access to /proc/<pid>/net, updating the mounter
credentials is not supported.

Signed-off-by: Alexey Gladkov <legion@kernel.org>
---
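Illustration only, not part of the patch: a minimal userspace sketch of
the rule described above, assuming simplified stand-in types. Every
model_* name is hypothetical and the single boolean stands in for the
real security_capable() check against the net namespace's user_ns.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins; none of these are kernel types. */
struct model_cred { bool cap_net_admin; };
struct model_net { int id; };

struct model_proc_fs_info {
	bool pidonly;                          /* mounted with subset=pid */
	const struct model_cred *mounter_cred; /* captured once, at mount time */
};

/* Capture the mounter's credentials when the superblock is first set up. */
static void model_fill_super(struct model_proc_fs_info *fs_info,
			     const struct model_cred *mounter, bool pidonly)
{
	assert(fs_info && mounter);
	fs_info->pidonly = pidonly;
	fs_info->mounter_cred = mounter;
}

/* Return the task's net namespace, or NULL when it must stay hidden. */
static struct model_net *
model_get_task_net(const struct model_proc_fs_info *fs_info,
		   struct model_net *task_net)
{
	assert(fs_info);
	if (task_net && fs_info->pidonly &&
	    !fs_info->mounter_cred->cap_net_admin)
		return NULL;
	return task_net;
}
```

In this model the net directory is hidden only when both conditions
hold: the instance was mounted with subset=pid and the mounter lacked
the capability. There is deliberately no function that replaces
mounter_cred after model_fill_super().
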
 fs/proc/proc_net.c      | 8 ++++++++
 fs/proc/root.c          | 2 ++
 include/linux/proc_fs.h | 1 +
 3 files changed, 11 insertions(+)

diff --git a/fs/proc/proc_net.c b/fs/proc/proc_net.c
index 52f0b75cbce2..6e0ccef0169f 100644
--- a/fs/proc/proc_net.c
+++ b/fs/proc/proc_net.c
@@ -23,6 +23,7 @@
 #include <linux/uidgid.h>
 #include <net/net_namespace.h>
 #include <linux/seq_file.h>
+#include <linux/security.h>
 
 #include "internal.h"
 
@@ -270,6 +271,7 @@ static struct net *get_proc_task_net(struct inode *dir)
 	struct task_struct *task;
 	struct nsproxy *ns;
 	struct net *net = NULL;
+	struct proc_fs_info *fs_info = proc_sb_info(dir->i_sb);
 
 	rcu_read_lock();
 	task = pid_task(proc_pid(dir), PIDTYPE_PID);
@@ -282,6 +284,12 @@ static struct net *get_proc_task_net(struct inode *dir)
 	}
 	rcu_read_unlock();
 
+	if (net && (fs_info->pidonly == PROC_PIDONLY_ON) &&
+	    security_capable(fs_info->mounter_cred, net->user_ns, CAP_NET_ADMIN, CAP_OPT_NONE) < 0) {
+		put_net(net);
+		net = NULL;
+	}
+
 	return net;
 }
 
diff --git a/fs/proc/root.c b/fs/proc/root.c
index d8ca41d823e4..c4af3a9b1a44 100644
--- a/fs/proc/root.c
+++ b/fs/proc/root.c
@@ -254,6 +254,7 @@ static int proc_fill_super(struct super_block *s, struct fs_context *fc)
 		return -ENOMEM;
 
 	fs_info->pid_ns = get_pid_ns(ctx->pid_ns);
+	fs_info->mounter_cred = get_cred(fc->cred);
 	proc_apply_options(fs_info, fc, current_user_ns());
 
 	/* User space would break if executables or devices appear on proc */
@@ -350,6 +351,7 @@ static void proc_kill_sb(struct super_block *sb)
 	kill_anon_super(sb);
 	if (fs_info) {
 		put_pid_ns(fs_info->pid_ns);
+		put_cred(fs_info->mounter_cred);
 		kfree_rcu(fs_info, rcu);
 	}
 }
diff --git a/include/linux/proc_fs.h b/include/linux/proc_fs.h
index 19d1c5e5f335..ec123c277d49 100644
--- a/include/linux/proc_fs.h
+++ b/include/linux/proc_fs.h
@@ -67,6 +67,7 @@ enum proc_pidonly {
 struct proc_fs_info {
 	struct pid_namespace *pid_ns;
 	kgid_t pid_gid;
+	const struct cred *mounter_cred;
 	enum proc_hidepid hide_pid;
 	enum proc_pidonly pidonly;
 	struct rcu_head rcu;
-- 
2.53.0


* [PATCH v8 3/5] proc: Disable cancellation of subset=pid option
  2026-02-13 10:44     ` [PATCH v8 0/5] proc: subset=pid: Relax check of mount visibility Alexey Gladkov
  2026-02-13 10:44       ` [PATCH v8 1/5] docs: proc: add documentation about mount restrictions Alexey Gladkov
  2026-02-13 10:44       ` [PATCH v8 2/5] proc: subset=pid: Show /proc/self/net only for CAP_NET_ADMIN Alexey Gladkov
@ 2026-02-13 10:44       ` Alexey Gladkov
  2026-02-13 10:44       ` [PATCH v8 4/5] proc: Relax check of mount visibility Alexey Gladkov
  2026-02-13 10:44       ` [PATCH v8 5/5] docs: proc: add documentation about relaxing visibility restrictions Alexey Gladkov
  4 siblings, 0 replies; 34+ messages in thread
From: Alexey Gladkov @ 2026-02-13 10:44 UTC (permalink / raw)
  To: Christian Brauner, Dan Klishch
  Cc: Al Viro, Eric W . Biederman, Kees Cook, linux-fsdevel,
	linux-kernel

When procfs is mounted with the subset=pid option, there is no way to
remount it with this option removed. This prevents anything that was
hidden from becoming visible, since some checks occur during mount.

This patch makes the limitation explicit and prints an error message.

Signed-off-by: Alexey Gladkov <legion@kernel.org>
---
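Illustration only, not part of the patch: a minimal userspace sketch of
the one-way behaviour, assuming simplified stand-in types. The model_*
names are hypothetical; the -EINVAL return mirrors what invalf() yields
in the kernel.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

enum model_pidonly { MODEL_PIDONLY_OFF, MODEL_PIDONLY_ON };

/* Per-superblock state, loosely modelled on struct proc_fs_info. */
struct model_fs_info { enum model_pidonly pidonly; };

/* Options parsed for one mount or remount request. */
struct model_ctx {
	bool subset_given;          /* was subset= specified at all? */
	enum model_pidonly pidonly; /* requested value when it was */
};

/* Apply parsed options; refuse to clear subset=pid once it is set. */
static int model_apply_options(struct model_fs_info *fs_info,
			       const struct model_ctx *ctx)
{
	assert(fs_info && ctx);
	if (ctx->subset_given) {
		if (ctx->pidonly != MODEL_PIDONLY_ON &&
		    fs_info->pidonly == MODEL_PIDONLY_ON)
			return -EINVAL;
		fs_info->pidonly = ctx->pidonly;
	}
	return 0;
}
```

A remount that does not mention subset= leaves the stored value alone,
and turning the option on later is still accepted; only the ON to OFF
transition is rejected.
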
 fs/proc/root.c | 15 ++++++++++-----
 1 file changed, 10 insertions(+), 5 deletions(-)

diff --git a/fs/proc/root.c b/fs/proc/root.c
index c4af3a9b1a44..535a168046e3 100644
--- a/fs/proc/root.c
+++ b/fs/proc/root.c
@@ -223,7 +223,7 @@ static int proc_parse_param(struct fs_context *fc, struct fs_parameter *param)
 	return 0;
 }
 
-static void proc_apply_options(struct proc_fs_info *fs_info,
+static int proc_apply_options(struct proc_fs_info *fs_info,
 			       struct fs_context *fc,
 			       struct user_namespace *user_ns)
 {
@@ -233,13 +233,17 @@ static void proc_apply_options(struct proc_fs_info *fs_info,
 		fs_info->pid_gid = make_kgid(user_ns, ctx->gid);
 	if (ctx->mask & (1 << Opt_hidepid))
 		fs_info->hide_pid = ctx->hidepid;
-	if (ctx->mask & (1 << Opt_subset))
+	if (ctx->mask & (1 << Opt_subset)) {
+		if (ctx->pidonly != PROC_PIDONLY_ON && fs_info->pidonly == PROC_PIDONLY_ON)
+			return invalf(fc, "proc: subset=pid cannot be unset\n");
 		fs_info->pidonly = ctx->pidonly;
+	}
 	if (ctx->mask & (1 << Opt_pidns) &&
 	    !WARN_ON_ONCE(fc->purpose == FS_CONTEXT_FOR_RECONFIGURE)) {
 		put_pid_ns(fs_info->pid_ns);
 		fs_info->pid_ns = get_pid_ns(ctx->pid_ns);
 	}
+	return 0;
 }
 
 static int proc_fill_super(struct super_block *s, struct fs_context *fc)
@@ -255,7 +259,9 @@ static int proc_fill_super(struct super_block *s, struct fs_context *fc)
 
 	fs_info->pid_ns = get_pid_ns(ctx->pid_ns);
 	fs_info->mounter_cred = get_cred(fc->cred);
-	proc_apply_options(fs_info, fc, current_user_ns());
+	ret = proc_apply_options(fs_info, fc, current_user_ns());
+	if (ret)
+		return ret;
 
 	/* User space would break if executables or devices appear on proc */
 	s->s_iflags |= SB_I_USERNS_VISIBLE | SB_I_NOEXEC | SB_I_NODEV;
@@ -304,8 +310,7 @@ static int proc_reconfigure(struct fs_context *fc)
 
 	sync_filesystem(sb);
 
-	proc_apply_options(fs_info, fc, current_user_ns());
-	return 0;
+	return proc_apply_options(fs_info, fc, current_user_ns());
 }
 
 static int proc_get_tree(struct fs_context *fc)
-- 
2.53.0


* [PATCH v8 4/5] proc: Relax check of mount visibility
  2026-02-13 10:44     ` [PATCH v8 0/5] proc: subset=pid: Relax check of mount visibility Alexey Gladkov
                         ` (2 preceding siblings ...)
  2026-02-13 10:44       ` [PATCH v8 3/5] proc: Disable cancellation of subset=pid option Alexey Gladkov
@ 2026-02-13 10:44       ` Alexey Gladkov
  2026-02-17 11:59         ` Christian Brauner
  2026-02-13 10:44       ` [PATCH v8 5/5] docs: proc: add documentation about relaxing visibility restrictions Alexey Gladkov
  4 siblings, 1 reply; 34+ messages in thread
From: Alexey Gladkov @ 2026-02-13 10:44 UTC (permalink / raw)
  To: Christian Brauner, Dan Klishch
  Cc: Al Viro, Eric W . Biederman, Kees Cook, linux-fsdevel,
	linux-kernel

When /proc is mounted with the subset=pid option, none of the system
files in the root of the filesystem are accessible from userspace. Only
dynamic information about processes is available, and that cannot be
hidden with an overmount.

For this reason, checking for full visibility is not relevant if
mounting is performed with the subset=pid option.

Signed-off-by: Alexey Gladkov <legion@kernel.org>
---
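Illustration only, not part of the patch: a minimal userspace sketch of
which visibility checks are skipped, assuming simplified stand-in
types. The model_* names are hypothetical, and the readonly/atime flag
comparison, which this patch does not relax, is left out of the model.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* A mount stacked on top of an existing procfs instance. */
struct model_child {
	bool locked;           /* the user cannot remove this child mount */
	bool covers_empty_dir; /* mountpoint is a permanently empty directory */
};

/* An existing procfs mount that is visible to the mounter. */
struct model_mount {
	bool root_is_fs_root;  /* exposes the real root of the filesystem */
	const struct model_child *children;
	size_t nr_children;
};

/*
 * Decide whether an existing mount passes the "fully visible" part of
 * the check. With allow_revealing set (the new superblock uses
 * subset=pid) both sub-checks are skipped.
 */
static bool model_mount_qualifies(const struct model_mount *mnt,
				  bool allow_revealing)
{
	assert(mnt);
	if (allow_revealing)
		return true;
	if (!mnt->root_is_fs_root)
		return false;
	for (size_t i = 0; i < mnt->nr_children; i++) {
		const struct model_child *c = &mnt->children[i];

		/* Only locked mounts over non-empty directories hide data. */
		if (c->locked && !c->covers_empty_dir)
			return false;
	}
	return true;
}
```

This matches the typical container case: a procfs instance with a
locked overmount on a non-empty directory fails the check today, but
would no longer block a new subset=pid mount.
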
 fs/namespace.c                 | 29 ++++++++++++++++-------------
 fs/proc/root.c                 | 17 ++++++++++-------
 include/linux/fs/super_types.h |  2 ++
 3 files changed, 28 insertions(+), 20 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index c58674a20cad..7daa86315c05 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -6116,7 +6116,8 @@ static bool mnt_already_visible(struct mnt_namespace *ns,
 		/* This mount is not fully visible if it's root directory
 		 * is not the root directory of the filesystem.
 		 */
-		if (mnt->mnt.mnt_root != mnt->mnt.mnt_sb->s_root)
+		if (!(sb->s_iflags & SB_I_USERNS_ALLOW_REVEALING) &&
+		    mnt->mnt.mnt_root != mnt->mnt.mnt_sb->s_root)
 			continue;
 
 		/* A local view of the mount flags */
@@ -6136,18 +6137,20 @@ static bool mnt_already_visible(struct mnt_namespace *ns,
 		    ((mnt_flags & MNT_ATIME_MASK) != (new_flags & MNT_ATIME_MASK)))
 			continue;
 
-		/* This mount is not fully visible if there are any
-		 * locked child mounts that cover anything except for
-		 * empty directories.
-		 */
-		list_for_each_entry(child, &mnt->mnt_mounts, mnt_child) {
-			struct inode *inode = child->mnt_mountpoint->d_inode;
-			/* Only worry about locked mounts */
-			if (!(child->mnt.mnt_flags & MNT_LOCKED))
-				continue;
-			/* Is the directory permanently empty? */
-			if (!is_empty_dir_inode(inode))
-				goto next;
+		if (!(sb->s_iflags & SB_I_USERNS_ALLOW_REVEALING)) {
+			/* This mount is not fully visible if there are any
+			 * locked child mounts that cover anything except for
+			 * empty directories.
+			 */
+			list_for_each_entry(child, &mnt->mnt_mounts, mnt_child) {
+				struct inode *inode = child->mnt_mountpoint->d_inode;
+				/* Only worry about locked mounts */
+				if (!IS_MNT_LOCKED(child))
+					continue;
+				/* Is the directory permanently empty? */
+				if (!is_empty_dir_inode(inode))
+					goto next;
+			}
 		}
 		/* Preserve the locked attributes */
 		*new_mnt_flags |= mnt_flags & (MNT_LOCK_READONLY | \
diff --git a/fs/proc/root.c b/fs/proc/root.c
index 535a168046e3..e029d3587494 100644
--- a/fs/proc/root.c
+++ b/fs/proc/root.c
@@ -223,18 +223,21 @@ static int proc_parse_param(struct fs_context *fc, struct fs_parameter *param)
 	return 0;
 }
 
-static int proc_apply_options(struct proc_fs_info *fs_info,
+static int proc_apply_options(struct super_block *s,
 			       struct fs_context *fc,
 			       struct user_namespace *user_ns)
 {
 	struct proc_fs_context *ctx = fc->fs_private;
+	struct proc_fs_info *fs_info = proc_sb_info(s);
 
 	if (ctx->mask & (1 << Opt_gid))
 		fs_info->pid_gid = make_kgid(user_ns, ctx->gid);
 	if (ctx->mask & (1 << Opt_hidepid))
 		fs_info->hide_pid = ctx->hidepid;
 	if (ctx->mask & (1 << Opt_subset)) {
-		if (ctx->pidonly != PROC_PIDONLY_ON && fs_info->pidonly == PROC_PIDONLY_ON)
+		if (ctx->pidonly == PROC_PIDONLY_ON)
+			s->s_iflags |= SB_I_USERNS_ALLOW_REVEALING;
+		else if (fs_info->pidonly == PROC_PIDONLY_ON)
 			return invalf(fc, "proc: subset=pid cannot be unset\n");
 		fs_info->pidonly = ctx->pidonly;
 	}
@@ -259,9 +262,6 @@ static int proc_fill_super(struct super_block *s, struct fs_context *fc)
 
 	fs_info->pid_ns = get_pid_ns(ctx->pid_ns);
 	fs_info->mounter_cred = get_cred(fc->cred);
-	ret = proc_apply_options(fs_info, fc, current_user_ns());
-	if (ret)
-		return ret;
 
 	/* User space would break if executables or devices appear on proc */
 	s->s_iflags |= SB_I_USERNS_VISIBLE | SB_I_NOEXEC | SB_I_NODEV;
@@ -273,6 +273,10 @@ static int proc_fill_super(struct super_block *s, struct fs_context *fc)
 	s->s_time_gran = 1;
 	s->s_fs_info = fs_info;
 
+	ret = proc_apply_options(s, fc, current_user_ns());
+	if (ret)
+		return ret;
+
 	/*
 	 * procfs isn't actually a stacking filesystem; however, there is
 	 * too much magic going on inside it to permit stacking things on
@@ -306,11 +310,10 @@ static int proc_fill_super(struct super_block *s, struct fs_context *fc)
 static int proc_reconfigure(struct fs_context *fc)
 {
 	struct super_block *sb = fc->root->d_sb;
-	struct proc_fs_info *fs_info = proc_sb_info(sb);
 
 	sync_filesystem(sb);
 
-	return proc_apply_options(fs_info, fc, current_user_ns());
+	return proc_apply_options(sb, fc, current_user_ns());
 }
 
 static int proc_get_tree(struct fs_context *fc)
diff --git a/include/linux/fs/super_types.h b/include/linux/fs/super_types.h
index 6bd3009e09b3..5e640b9140df 100644
--- a/include/linux/fs/super_types.h
+++ b/include/linux/fs/super_types.h
@@ -333,4 +333,6 @@ struct super_block {
 #define SB_I_NOIDMAP	0x00002000	/* No idmapped mounts on this superblock */
 #define SB_I_ALLOW_HSM	0x00004000	/* Allow HSM events on this superblock */
 
+#define SB_I_USERNS_ALLOW_REVEALING	0x00008000 /* Skip full visibility check */
+
 #endif /* _LINUX_FS_SUPER_TYPES_H */
-- 
2.53.0



* Re: [PATCH v8 4/5] proc: Relax check of mount visibility
  2026-02-13 10:44       ` [PATCH v8 4/5] proc: Relax check of mount visibility Alexey Gladkov
@ 2026-02-17 11:59         ` Christian Brauner
  2026-04-10 11:12           ` Christian Brauner
  0 siblings, 1 reply; 34+ messages in thread
From: Christian Brauner @ 2026-02-17 11:59 UTC (permalink / raw)
  To: Alexey Gladkov
  Cc: Dan Klishch, Al Viro, Eric W . Biederman, Kees Cook,
	linux-fsdevel, linux-kernel

On Fri, Feb 13, 2026 at 11:44:29AM +0100, Alexey Gladkov wrote:
> When /proc is mounted with the subset=pid option, all system files from
> the root of the file system are not accessible in userspace. Only
> dynamic information about processes is available, which cannot be
> hidden with overmount.
> 
> For this reason, checking for full visibility is not relevant if
> mounting is performed with the subset=pid option.
> 
> Signed-off-by: Alexey Gladkov <legion@kernel.org>
> ---
>  fs/namespace.c                 | 29 ++++++++++++++++-------------
>  fs/proc/root.c                 | 17 ++++++++++-------
>  include/linux/fs/super_types.h |  2 ++
>  3 files changed, 28 insertions(+), 20 deletions(-)
> 
> diff --git a/fs/namespace.c b/fs/namespace.c
> index c58674a20cad..7daa86315c05 100644
> --- a/fs/namespace.c
> +++ b/fs/namespace.c
> @@ -6116,7 +6116,8 @@ static bool mnt_already_visible(struct mnt_namespace *ns,
>  		/* This mount is not fully visible if it's root directory
>  		 * is not the root directory of the filesystem.
>  		 */
> -		if (mnt->mnt.mnt_root != mnt->mnt.mnt_sb->s_root)
> +		if (!(sb->s_iflags & SB_I_USERNS_ALLOW_REVEALING) &&
> +		    mnt->mnt.mnt_root != mnt->mnt.mnt_sb->s_root)
>  			continue;
>  
>  		/* A local view of the mount flags */
> @@ -6136,18 +6137,20 @@ static bool mnt_already_visible(struct mnt_namespace *ns,
>  		    ((mnt_flags & MNT_ATIME_MASK) != (new_flags & MNT_ATIME_MASK)))
>  			continue;

There are a few things that I find problematic here.

Even before your change the mount flags of the first fully visible
procfs mount would be picked up. If the caller was unlucky they could
stumble upon the most restricted procfs mount in the mount namespace
rbtree, leading to weird scenarios where a user cannot write to the
procfs instance they just mounted but could write to another one that
is also in their namespace.

The other thing is that with this change specifically:

    if (!(sb->s_iflags & SB_I_USERNS_ALLOW_REVEALING) &&
        mnt->mnt.mnt_root != mnt->mnt.mnt_sb->s_root)

we start caring about mount options of even partially exposed procfs
mounts. IOW, if someone had a bind-mount of e.g., /proc/pressure
somewhere that got inherited via CLONE_NEWNS then we suddenly take the
mount options of that into account for a new /proc/<pid>/* only instance.
I think we should continue caring only about procfs mounts that are
visible from their root.
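
To make the concern concrete, here is a small userspace model of just
that predicate. It is a sketch, not kernel code: struct model_mount,
skipped_before() and skipped_after() are made-up names, and a bind-mount
of a procfs subdirectory is assumed to be nothing more than a mount
whose root differs from the superblock root.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the two pointers the check compares. */
struct model_mount {
	const void *mnt_root;	/* dentry the mount exposes */
	const void *s_root;	/* root dentry of its superblock */
};

/* Old check: a mount not rooted at s_root is always skipped. */
static bool skipped_before(const struct model_mount *mnt)
{
	assert(mnt != NULL);
	return mnt->mnt_root != mnt->s_root;
}

/* v8 check: the skip only happens when the new sb lacks the flag. */
static bool skipped_after(const struct model_mount *mnt, bool allow_revealing)
{
	assert(mnt != NULL);
	return !allow_revealing && mnt->mnt_root != mnt->s_root;
}
```

In this model a bind-mount such as the /proc/pressure example is skipped
by the old check but is no longer skipped once allow_revealing is true,
so its mount flags would be compared against the new instance.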

The other problem is that it is really annoying that we walk all
mounts in a mount namespace just to find procfs and sysfs mounts in
there. Currently a lot of workloads still do the CLONE_NEWNS dance
meaning they inherit all the crap from the host and then proceed to
set up their new rootfs. For busy container workloads that can be a lot.

So let's just be honest about it and treat procfs and sysfs as the
snowflakes that they have become and record their instances in a
separate per mount namespace hlist as in the (untested) patch below [1].
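
The idea can be illustrated with a userspace model, assuming a plain
singly linked list in place of the kernel hlist. All names here
(model_ns, model_mount, model_add_to_ns(), model_candidates()) are
hypothetical and exist only for this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct model_mount {
	bool userns_visible;		/* models SB_I_USERNS_VISIBLE */
	bool rooted_at_sb_root;		/* models mnt_root == s_root */
	struct model_mount *next_visible; /* link in the side list */
};

struct model_ns {
	struct model_mount *visible;	/* head of the side list */
	size_t nr_mounts;		/* every mount in the namespace */
};

/* Called for every mount added; only candidates join the side list. */
static void model_add_to_ns(struct model_ns *ns, struct model_mount *mnt)
{
	assert(ns != NULL && mnt != NULL);
	ns->nr_mounts++;
	if (mnt->userns_visible && mnt->rooted_at_sb_root) {
		mnt->next_visible = ns->visible;
		ns->visible = mnt;
	}
}

/* Number of mounts a visibility check would have to look at. */
static size_t model_candidates(const struct model_ns *ns)
{
	size_t n = 0;

	assert(ns != NULL);
	for (const struct model_mount *m = ns->visible; m; m = m->next_visible)
		n++;
	return n;
}
```

With one ext4-like mount, one procfs mount and one procfs bind-mount
added, the side list holds a single entry, so the check inspects one
mount instead of three.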

Also SB_I_USERNS_ALLOW_REVEALING seems unnecessary. The only time we
care about that flag is when we set up a new superblock. So this could
easily be a struct fs_context bitfield that just exists for the duration
of the creation of the new superblock and mount. So maybe pass that down
to mount_too_revealing() and further down into the actual helper.
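
A rough userspace sketch of that shape follows. The bitfield name
relax_visibility and every model_* identifier are assumptions made for
illustration; which checks the bit should relax is still under
discussion, and this sketch relaxes only the locked-child check while
keeping the root check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical one-bit flag that lives only in the mount-time context. */
struct model_fs_context {
	unsigned int relax_visibility:1;
};

struct model_mount {
	bool rooted_at_sb_root;		/* mnt_root == s_root */
	bool hidden_by_locked_child;	/* locked mount over non-empty dir */
};

/* The "actual helper": receives the bit as a plain argument. */
static bool model_already_visible(const struct model_mount *mnts, size_t n,
				  bool relax)
{
	for (size_t i = 0; i < n; i++) {
		if (!mnts[i].rooted_at_sb_root)
			continue;
		if (!relax && mnts[i].hidden_by_locked_child)
			continue;
		return true;
	}
	return false;
}

/* Reads the bit from the context and passes it down. */
static bool model_mount_too_revealing(const struct model_fs_context *fc,
				      const struct model_mount *mnts, size_t n)
{
	assert(fc != NULL);
	return !model_already_visible(mnts, n, fc->relax_visibility);
}
```

Nothing is stored on the superblock in this model; the bit disappears
together with the context once the mount has been created.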

[1]:
From 4bbd41e7a3ef91667dd334f31b1b6bf8caec0599 Mon Sep 17 00:00:00 2001
From: Christian Brauner <brauner@kernel.org>
Date: Tue, 17 Feb 2026 12:02:34 +0100
Subject: [PATCH] namespace: record fully visible mounts in list

Instead of wading through all the mounts in the mount namespace rbtree
to find fully visible procfs and sysfs mounts, be honest about them
being special cruft and record them in a separate per-mount namespace
list.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 fs/mount.h     |  4 ++++
 fs/namespace.c | 19 +++++++++++--------
 2 files changed, 15 insertions(+), 8 deletions(-)

diff --git a/fs/mount.h b/fs/mount.h
index e0816c11a198..5df134d56d47 100644
--- a/fs/mount.h
+++ b/fs/mount.h
@@ -25,6 +25,7 @@ struct mnt_namespace {
 	__u32			n_fsnotify_mask;
 	struct fsnotify_mark_connector __rcu *n_fsnotify_marks;
 #endif
+	struct hlist_head	mnt_visible_mounts; /* SB_I_USERNS_VISIBLE mounts */
 	unsigned int		nr_mounts; /* # of mounts in the namespace */
 	unsigned int		pending_mounts;
 	refcount_t		passive; /* number references not pinning @mounts */
@@ -90,6 +91,7 @@ struct mount {
 	int mnt_expiry_mark;		/* true if marked for expiry */
 	struct hlist_head mnt_pins;
 	struct hlist_head mnt_stuck_children;
+	struct hlist_node mnt_ns_visible; /* link in ns->mnt_visible_mounts */
 	struct mount *overmount;	/* mounted on ->mnt_root */
 } __randomize_layout;
 
@@ -207,6 +209,8 @@ static inline void move_from_ns(struct mount *mnt)
 		ns->mnt_first_node = rb_next(&mnt->mnt_node);
 	rb_erase(&mnt->mnt_node, &ns->mounts);
 	RB_CLEAR_NODE(&mnt->mnt_node);
+	if (!hlist_unhashed(&mnt->mnt_ns_visible))
+		hlist_del_init(&mnt->mnt_ns_visible);
 }
 
 bool has_locked_children(struct mount *mnt, struct dentry *dentry);
diff --git a/fs/namespace.c b/fs/namespace.c
index a67cbe42746d..764081c690d5 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -321,6 +321,7 @@ static struct mount *alloc_vfsmnt(const char *name)
 		INIT_HLIST_NODE(&mnt->mnt_slave);
 		INIT_HLIST_NODE(&mnt->mnt_mp_list);
 		INIT_HLIST_HEAD(&mnt->mnt_stuck_children);
+		INIT_HLIST_NODE(&mnt->mnt_ns_visible);
 		RB_CLEAR_NODE(&mnt->mnt_node);
 		mnt->mnt.mnt_idmap = &nop_mnt_idmap;
 	}
@@ -1098,6 +1099,10 @@ static void mnt_add_to_ns(struct mnt_namespace *ns, struct mount *mnt)
 	rb_link_node(&mnt->mnt_node, parent, link);
 	rb_insert_color(&mnt->mnt_node, &ns->mounts);
 
+	if ((mnt->mnt.mnt_sb->s_iflags & SB_I_USERNS_VISIBLE) &&
+	    mnt->mnt.mnt_root == mnt->mnt.mnt_sb->s_root)
+		hlist_add_head(&mnt->mnt_ns_visible, &ns->mnt_visible_mounts);
+
 	mnt_notify_add(mnt);
 }
 
@@ -6295,22 +6300,20 @@ static bool mnt_already_visible(struct mnt_namespace *ns,
 				int *new_mnt_flags)
 {
 	int new_flags = *new_mnt_flags;
-	struct mount *mnt, *n;
+	struct mount *mnt;
+
+	/* Don't acquire namespace semaphore without a good reason. */
+	if (hlist_empty(&ns->mnt_visible_mounts))
+		return false;
 
 	guard(namespace_shared)();
-	rbtree_postorder_for_each_entry_safe(mnt, n, &ns->mounts, mnt_node) {
+	hlist_for_each_entry(mnt, &ns->mnt_visible_mounts, mnt_ns_visible) {
 		struct mount *child;
 		int mnt_flags;
 
 		if (mnt->mnt.mnt_sb->s_type != sb->s_type)
 			continue;
 
-		/* This mount is not fully visible if it's root directory
-		 * is not the root directory of the filesystem.
-		 */
-		if (mnt->mnt.mnt_root != mnt->mnt.mnt_sb->s_root)
-			continue;
-
 		/* A local view of the mount flags */
 		mnt_flags = mnt->mnt.mnt_flags;
 
-- 
2.47.3



* Re: [PATCH v8 4/5] proc: Relax check of mount visibility
  2026-02-17 11:59         ` Christian Brauner
@ 2026-04-10 11:12           ` Christian Brauner
  2026-04-10 11:31             ` Alexey Gladkov
  0 siblings, 1 reply; 34+ messages in thread
From: Christian Brauner @ 2026-04-10 11:12 UTC (permalink / raw)
  To: Alexey Gladkov
  Cc: Dan Klishch, Al Viro, Eric W . Biederman, Kees Cook,
	linux-fsdevel, linux-kernel

On Tue, Feb 17, 2026 at 12:59:54PM +0100, Christian Brauner wrote:
> On Fri, Feb 13, 2026 at 11:44:29AM +0100, Alexey Gladkov wrote:
> > When /proc is mounted with the subset=pid option, all system files from
> > the root of the file system are not accessible in userspace. Only
> > dynamic information about processes is available, which cannot be
> > hidden with overmount.
> > 
> > For this reason, checking for full visibility is not relevant if
> > mounting is performed with the subset=pid option.
> > 
> > Signed-off-by: Alexey Gladkov <legion@kernel.org>
> > ---
> >  fs/namespace.c                 | 29 ++++++++++++++++-------------
> >  fs/proc/root.c                 | 17 ++++++++++-------
> >  include/linux/fs/super_types.h |  2 ++
> >  3 files changed, 28 insertions(+), 20 deletions(-)
> > 
> > diff --git a/fs/namespace.c b/fs/namespace.c
> > index c58674a20cad..7daa86315c05 100644
> > --- a/fs/namespace.c
> > +++ b/fs/namespace.c
> > @@ -6116,7 +6116,8 @@ static bool mnt_already_visible(struct mnt_namespace *ns,
> >  		/* This mount is not fully visible if it's root directory
> >  		 * is not the root directory of the filesystem.
> >  		 */
> > -		if (mnt->mnt.mnt_root != mnt->mnt.mnt_sb->s_root)
> > +		if (!(sb->s_iflags & SB_I_USERNS_ALLOW_REVEALING) &&
> > +		    mnt->mnt.mnt_root != mnt->mnt.mnt_sb->s_root)
> >  			continue;
> >  
> >  		/* A local view of the mount flags */
> > @@ -6136,18 +6137,20 @@ static bool mnt_already_visible(struct mnt_namespace *ns,
> >  		    ((mnt_flags & MNT_ATIME_MASK) != (new_flags & MNT_ATIME_MASK)))
> >  			continue;
> 
> There are a few things that I find problematic here.
> 
> Even before your change the mount flags of the first fully visible
> procfs mount would be picked up. If the caller was unlucky they could
> stumble upon the most restricted procfs mount in the mount namespace
> rbtree. Leading to weird scenarios where a user cannot write to the
> procfs instance they just mounted but could to another one that is also
> in their namespace.
> 
> The other thing is that with this change specifically:
> 
>     if (!(sb->s_iflags & SB_I_USERNS_ALLOW_REVEALING) &&
>         mnt->mnt.mnt_root != mnt->mnt.mnt_sb->s_root)
> 
> we start caring about mount options of even partially exposed procfs
> mounts. IOW, if someone had a bind-mount of e.g., /proc/pressure
> somewhere that got inherited via CLONE_NEWNS then we suddenly take the
> mount options of that into account for a new /proc/<pid>/* only instance.
> I think we should continue caring only about procfs mounts that are
> visible from their root.
> 
> The other problem is that it is really annoying that we walk all
> mounts in a mount namespace just to find procfs and sysfs mounts in
> there. Currently a lot of workloads still do the CLONE_NEWNS dance
> meaning they inherit all the crap from the host and then proceed to
> setup their new rootfs. Busy container workloads that can be a lot.
> 
> So let's just be honest about it and treat procfs and sysfs as the
> snowflakes that they have become and record their instances in a
> separate per mount namespace hlist as in the (untested) patch below [1].
> 
> Also SB_I_USERNS_ALLOW_REVEALING seems unnecessary. The only time we
> care about that flag is when we setup a new superblock. So this could
> easily be a struct fs_context bitfield that just exists for the duration
> of the creation of the new superblock and mount. So maybe pass that down
> to mount_too_revealing() and further down into the actual helper.
> 
> [1]:
> From 4bbd41e7a3ef91667dd334f31b1b6bf8caec0599 Mon Sep 17 00:00:00 2001
> From: Christian Brauner <brauner@kernel.org>
> Date: Tue, 17 Feb 2026 12:02:34 +0100
> Subject: [PATCH] namespace: record fully visible mounts in list
> 
> Instead of wading through all the mounts in the mount namespace rbtree
> to find fully visible procfs and sysfs mounts, be honest about them
> being special cruft and record them in a separate per-mount namespace
> list.

If you rework this I would expect to take it for v7.3. It's a bit late
now...


* Re: [PATCH v8 4/5] proc: Relax check of mount visibility
  2026-04-10 11:12           ` Christian Brauner
@ 2026-04-10 11:31             ` Alexey Gladkov
  0 siblings, 0 replies; 34+ messages in thread
From: Alexey Gladkov @ 2026-04-10 11:31 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Dan Klishch, Al Viro, Eric W . Biederman, Kees Cook,
	linux-fsdevel, linux-kernel

On Fri, Apr 10, 2026 at 01:12:36PM +0200, Christian Brauner wrote:
> On Tue, Feb 17, 2026 at 12:59:54PM +0100, Christian Brauner wrote:
> > On Fri, Feb 13, 2026 at 11:44:29AM +0100, Alexey Gladkov wrote:
> > > When /proc is mounted with the subset=pid option, all system files from
> > > the root of the file system are not accessible in userspace. Only
> > > dynamic information about processes is available, which cannot be
> > > hidden with overmount.
> > > 
> > > For this reason, checking for full visibility is not relevant if
> > > mounting is performed with the subset=pid option.
> > > 
> > > Signed-off-by: Alexey Gladkov <legion@kernel.org>
> > > ---
> > >  fs/namespace.c                 | 29 ++++++++++++++++-------------
> > >  fs/proc/root.c                 | 17 ++++++++++-------
> > >  include/linux/fs/super_types.h |  2 ++
> > >  3 files changed, 28 insertions(+), 20 deletions(-)
> > > 
> > > diff --git a/fs/namespace.c b/fs/namespace.c
> > > index c58674a20cad..7daa86315c05 100644
> > > --- a/fs/namespace.c
> > > +++ b/fs/namespace.c
> > > @@ -6116,7 +6116,8 @@ static bool mnt_already_visible(struct mnt_namespace *ns,
> > >  		/* This mount is not fully visible if it's root directory
> > >  		 * is not the root directory of the filesystem.
> > >  		 */
> > > -		if (mnt->mnt.mnt_root != mnt->mnt.mnt_sb->s_root)
> > > +		if (!(sb->s_iflags & SB_I_USERNS_ALLOW_REVEALING) &&
> > > +		    mnt->mnt.mnt_root != mnt->mnt.mnt_sb->s_root)
> > >  			continue;
> > >  
> > >  		/* A local view of the mount flags */
> > > @@ -6136,18 +6137,20 @@ static bool mnt_already_visible(struct mnt_namespace *ns,
> > >  		    ((mnt_flags & MNT_ATIME_MASK) != (new_flags & MNT_ATIME_MASK)))
> > >  			continue;
> > 
> > There are a few things that I find problematic here.
> > 
> > Even before your change the mount flags of the first fully visible
> > procfs mount would be picked up. If the caller was unlucky they could
> > stumble upon the most restricted procfs mount in the mount namespace
> > rbtree. Leading to weird scenarios where a user cannot write to the
> > procfs instance they just mounted but could to another one that is also
> > in their namespace.
> > 
> > The other thing is that with this change specifically:
> > 
> >     if (!(sb->s_iflags & SB_I_USERNS_ALLOW_REVEALING) &&
> >         mnt->mnt.mnt_root != mnt->mnt.mnt_sb->s_root)
> > 
> > we start caring about mount options of even partially exposed procfs
> > mounts. IOW, if someone had a bind-mount of e.g., /proc/pressure
> > somewhere that got inherited via CLONE_NEWNS then we suddenly take the
> > mount options of that into account for a new /proc/<pid>/* only instance.
> > I think we should continue caring only about procfs mounts that are
> > visible from their root.
> > 
> > The other problem is that it is really annoying that we walk all
> > mounts in a mount namespace just to find procfs and sysfs mounts in
> > there. Currently a lot of workloads still do the CLONE_NEWNS dance
> > meaning they inherit all the crap from the host and then proceed to
> > setup their new rootfs. Busy container workloads that can be a lot.
> > 
> > So let's just be honest about it and treat procfs and sysfs as the
> > snowflakes that they have become and record their instances in a
> > separate per mount namespace hlist as in the (untested) patch below [1].
> > 
> > Also SB_I_USERNS_ALLOW_REVEALING seems unnecessary. The only time we
> > care about that flag is when we setup a new superblock. So this could
> > easily be a struct fs_context bitfield that just exists for the duration
> > of the creation of the new superblock and mount. So maybe pass that down
> > to mount_too_revealing() and further down into the actual helper.
> > 
> > [1]:
> > From 4bbd41e7a3ef91667dd334f31b1b6bf8caec0599 Mon Sep 17 00:00:00 2001
> > From: Christian Brauner <brauner@kernel.org>
> > Date: Tue, 17 Feb 2026 12:02:34 +0100
> > Subject: [PATCH] namespace: record fully visible mounts in list
> > 
> > Instead of wading through all the mounts in the mount namespace rbtree
> > to find fully visible procfs and sysfs mounts, be honest about them
> > being special cruft and record them in a separate per-mount namespace
> > list.
> 
> If you rework this I would expect to take it for v7.3. It's a bit late
> now...

No problem. I understand. Sorry it took me so long to get back to you.
I was laid off from my job and had to look for a new one quickly.

I’ll be back soon to update this patch.

-- 
Rgrds, legion



* [PATCH v8 5/5] docs: proc: add documentation about relaxing visibility restrictions
  2026-02-13 10:44     ` [PATCH v8 0/5] proc: subset=pid: Relax check of mount visibility Alexey Gladkov
                         ` (3 preceding siblings ...)
  2026-02-13 10:44       ` [PATCH v8 4/5] proc: Relax check of mount visibility Alexey Gladkov
@ 2026-02-13 10:44       ` Alexey Gladkov
  4 siblings, 0 replies; 34+ messages in thread
From: Alexey Gladkov @ 2026-02-13 10:44 UTC (permalink / raw)
  To: Christian Brauner, Dan Klishch
  Cc: Al Viro, Eric W . Biederman, Kees Cook, linux-fsdevel,
	linux-kernel

Signed-off-by: Alexey Gladkov <legion@kernel.org>
---
 Documentation/filesystems/proc.rst | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
index c8864fcbdec7..3acf178c1202 100644
--- a/Documentation/filesystems/proc.rst
+++ b/Documentation/filesystems/proc.rst
@@ -2417,7 +2417,8 @@ so will give an `-EBUSY` error).
 If user namespaces are in use, the kernel additionally checks the instances of
 procfs available to the mounter and will not allow procfs to be mounted if:
 
-  1. This mount is not fully visible.
+  1. This mount is not fully visible unless the new procfs is going to be
+     mounted with the subset=pid option.
 
      a. It's root directory is not the root directory of the filesystem.
      b. If any file or non-empty procfs directory is hidden by another mount.
-- 
2.53.0



end of thread, other threads:[~2026-04-10 11:31 UTC | newest]

Thread overview: 34+ messages
2021-07-16 10:45 [RESEND PATCH v6 0/5] proc: subset=pid: Relax check of mount visibility Alexey Gladkov
2021-07-16 10:45 ` [RESEND PATCH v6 1/5] docs: proc: add documentation about mount restrictions Alexey Gladkov
2021-07-16 10:46 ` [RESEND PATCH v6 2/5] proc: subset=pid: Show /proc/self/net only for CAP_NET_ADMIN Alexey Gladkov
2021-07-16 10:46 ` [RESEND PATCH v6 3/5] proc: Disable cancellation of subset=pid option Alexey Gladkov
2021-07-16 10:46 ` [RESEND PATCH v6 4/5] proc: Relax check of mount visibility Alexey Gladkov
2021-07-16 10:46 ` [RESEND PATCH v6 5/5] docs: proc: add documentation about relaxing visibility restrictions Alexey Gladkov
2025-12-13  5:06 ` [RESEND PATCH v6 0/5] proc: subset=pid: Relax check of mount visibility Dan Klishch
2025-12-13 10:49   ` Alexey Gladkov
2025-12-13 18:00     ` Dan Klishch
2025-12-14 16:40       ` Alexey Gladkov
2025-12-14 18:02         ` Dan Klishch
2025-12-15 10:10           ` Alexey Gladkov
2025-12-15 14:46             ` Dan Klishch
2025-12-15 14:58               ` Alexey Gladkov
2025-12-24 12:55                 ` Christian Brauner
2026-01-30 13:34                   ` Alexey Gladkov
2025-12-15 11:30           ` Christian Brauner
2026-01-13  9:20   ` [PATCH v7 " Alexey Gladkov
2026-01-13  9:20     ` [PATCH v7 1/5] docs: proc: add documentation about mount restrictions Alexey Gladkov
2026-01-13  9:20     ` [PATCH v7 2/5] proc: subset=pid: Show /proc/self/net only for CAP_NET_ADMIN Alexey Gladkov
2026-02-04 14:39       ` Christian Brauner
2026-02-11 19:35         ` Alexey Gladkov
2026-01-13  9:20     ` [PATCH v7 3/5] proc: Disable cancellation of subset=pid option Alexey Gladkov
2026-01-13  9:20     ` [PATCH v7 4/5] proc: Relax check of mount visibility Alexey Gladkov
2026-01-13  9:20     ` [PATCH v7 5/5] docs: proc: add documentation about relaxing visibility restrictions Alexey Gladkov
2026-02-13 10:44     ` [PATCH v8 0/5] proc: subset=pid: Relax check of mount visibility Alexey Gladkov
2026-02-13 10:44       ` [PATCH v8 1/5] docs: proc: add documentation about mount restrictions Alexey Gladkov
2026-02-13 10:44       ` [PATCH v8 2/5] proc: subset=pid: Show /proc/self/net only for CAP_NET_ADMIN Alexey Gladkov
2026-02-13 10:44       ` [PATCH v8 3/5] proc: Disable cancellation of subset=pid option Alexey Gladkov
2026-02-13 10:44       ` [PATCH v8 4/5] proc: Relax check of mount visibility Alexey Gladkov
2026-02-17 11:59         ` Christian Brauner
2026-04-10 11:12           ` Christian Brauner
2026-04-10 11:31             ` Alexey Gladkov
2026-02-13 10:44       ` [PATCH v8 5/5] docs: proc: add documentation about relaxing visibility restrictions Alexey Gladkov
