[RESEND PATCH v6 0/5] proc: subset=pid: Relax check of mount visibility

linux-kernel.vger.kernel.org archive mirror

* [RESEND PATCH v6 0/5] proc: subset=pid: Relax check of mount visibility
@ 2021-07-16 10:45 Alexey Gladkov
  2021-07-16 10:45 ` [RESEND PATCH v6 1/5] docs: proc: add documentation about mount restrictions Alexey Gladkov
                   ` (5 more replies)
  0 siblings, 6 replies; 15+ messages in thread
From: Alexey Gladkov @ 2021-07-16 10:45 UTC (permalink / raw)
  To: LKML, Eric W . Biederman
  Cc: Alexander Viro, Kees Cook, Linux Containers, Linux FS Devel

Allow procfs to be mounted with the subset=pid option even if the entire
procfs is not fully accessible to the mounter.

Changelog
---------
v6:
* Add documentation about procfs mount restrictions.
* Reorder commits for better review.

v4:
* Set SB_I_DYNAMIC only if pidonly is set.
* Add an error message if subset=pid is canceled during remount.

v3:
* Add 'const' to struct cred *mounter_cred (fix kernel test robot warning).

v2:
* Cache the mounter's credentials and make access to the net directories
  contingent on the permissions of the mounter of procfs.

--

Alexey Gladkov (5):
  docs: proc: add documentation about mount restrictions
  proc: subset=pid: Show /proc/self/net only for CAP_NET_ADMIN
  proc: Disable cancellation of subset=pid option
  proc: Relax check of mount visibility
  docs: proc: add documentation about relaxing visibility restrictions

 Documentation/filesystems/proc.rst | 15 +++++++++++++++
 fs/namespace.c                     | 30 ++++++++++++++++++------------
 fs/proc/proc_net.c                 |  8 ++++++++
 fs/proc/root.c                     | 24 +++++++++++++++++++-----
 include/linux/fs.h                 |  1 +
 include/linux/proc_fs.h            |  1 +
 6 files changed, 62 insertions(+), 17 deletions(-)

-- 
2.29.3



* [RESEND PATCH v6 1/5] docs: proc: add documentation about mount restrictions
  2021-07-16 10:45 [RESEND PATCH v6 0/5] proc: subset=pid: Relax check of mount visibility Alexey Gladkov
@ 2021-07-16 10:45 ` Alexey Gladkov
  2021-07-16 10:46 ` [RESEND PATCH v6 2/5] proc: subset=pid: Show /proc/self/net only for CAP_NET_ADMIN Alexey Gladkov
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 15+ messages in thread
From: Alexey Gladkov @ 2021-07-16 10:45 UTC (permalink / raw)
  To: LKML, Eric W . Biederman
  Cc: Alexander Viro, Kees Cook, Linux Containers, Linux FS Devel

Signed-off-by: Alexey Gladkov <legion@kernel.org>
---
 Documentation/filesystems/proc.rst | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
index 2fa69f710e2a..5a1bb0e081fd 100644
--- a/Documentation/filesystems/proc.rst
+++ b/Documentation/filesystems/proc.rst
@@ -50,6 +50,7 @@ fixes/update part 1.1  Stefani Seibold <stefani@seibold.net>    June 9 2009
 
   4	Configuring procfs
   4.1	Mount options
+  4.2	Mount restrictions
 
   5	Filesystem behavior
 
@@ -2175,6 +2176,19 @@ information about processes information, just add identd to this group.
 subset=pid hides all top level files and directories in the procfs that
 are not related to tasks.
 
+4.2	Mount restrictions
+--------------------------
+
+If user namespaces are in use, the kernel additionally checks the instances of
+procfs available to the mounter and will not allow procfs to be mounted if:
+
+  1. This mount is not fully visible.
+
+     a. It's root directory is not the root directory of the filesystem.
+     b. If any file or non-empty procfs directory is hidden by another mount.
+
+  2. A new mount overrides the readonly option or any option from atime family.
+
 Chapter 5: Filesystem behavior
 ==============================
 
-- 
2.29.3



* [RESEND PATCH v6 2/5] proc: subset=pid: Show /proc/self/net only for CAP_NET_ADMIN
  2021-07-16 10:45 [RESEND PATCH v6 0/5] proc: subset=pid: Relax check of mount visibility Alexey Gladkov
  2021-07-16 10:45 ` [RESEND PATCH v6 1/5] docs: proc: add documentation about mount restrictions Alexey Gladkov
@ 2021-07-16 10:46 ` Alexey Gladkov
  2021-07-16 10:46 ` [RESEND PATCH v6 3/5] proc: Disable cancellation of subset=pid option Alexey Gladkov
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 15+ messages in thread
From: Alexey Gladkov @ 2021-07-16 10:46 UTC (permalink / raw)
  To: LKML, Eric W . Biederman
  Cc: Alexander Viro, Kees Cook, Linux Containers, Linux FS Devel

Cache the mounter's credentials and make access to the net directories
contingent on the permissions of the mounter of proc.

Do not show /proc/self/net when proc is mounted with the subset=pid option
and the mounter does not have CAP_NET_ADMIN.
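
As a rough userspace illustration of the rule above (not the kernel code
itself), the decision can be modeled as below. The struct and function
names are hypothetical stand-ins, and the boolean cap_net_admin stands in
for the security_capable() check made against the cached mounter
credentials.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the cached mounter credentials. */
struct model_cred {
	bool cap_net_admin;	/* models security_capable(..., CAP_NET_ADMIN, ...) == 0 */
};

/* Hypothetical reduced form of struct proc_fs_info. */
struct model_proc_fs_info {
	bool pidonly;				/* models pidonly == PROC_PIDONLY_ON */
	const struct model_cred *mounter_cred;
};

/*
 * Returns true if the per-task net directory should resolve to a
 * network namespace, false if it should appear empty.
 */
static bool model_net_visible(const struct model_proc_fs_info *fs_info,
			      bool task_has_net)
{
	if (!task_has_net)
		return false;
	/* Only subset=pid mounts are gated on the mounter's capability. */
	if (fs_info->pidonly && !fs_info->mounter_cred->cap_net_admin)
		return false;
	return true;
}
```

In this model the capability of the process reading the file plays no
role; only the credentials captured at mount (or remount) time matter.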

Signed-off-by: Alexey Gladkov <legion@kernel.org>
---
 fs/proc/proc_net.c      | 8 ++++++++
 fs/proc/root.c          | 5 +++++
 include/linux/proc_fs.h | 1 +
 3 files changed, 14 insertions(+)

diff --git a/fs/proc/proc_net.c b/fs/proc/proc_net.c
index 18601042af99..a198f74cdb3b 100644
--- a/fs/proc/proc_net.c
+++ b/fs/proc/proc_net.c
@@ -26,6 +26,7 @@
 #include <linux/uidgid.h>
 #include <net/net_namespace.h>
 #include <linux/seq_file.h>
+#include <linux/security.h>
 
 #include "internal.h"
 
@@ -259,6 +260,7 @@ static struct net *get_proc_task_net(struct inode *dir)
 	struct task_struct *task;
 	struct nsproxy *ns;
 	struct net *net = NULL;
+	struct proc_fs_info *fs_info = proc_sb_info(dir->i_sb);
 
 	rcu_read_lock();
 	task = pid_task(proc_pid(dir), PIDTYPE_PID);
@@ -271,6 +273,12 @@ static struct net *get_proc_task_net(struct inode *dir)
 	}
 	rcu_read_unlock();
 
+	if (net && (fs_info->pidonly == PROC_PIDONLY_ON) &&
+	    security_capable(fs_info->mounter_cred, net->user_ns, CAP_NET_ADMIN, CAP_OPT_NONE) < 0) {
+		put_net(net);
+		net = NULL;
+	}
+
 	return net;
 }
 
diff --git a/fs/proc/root.c b/fs/proc/root.c
index 5e444d4f9717..6a75ac717455 100644
--- a/fs/proc/root.c
+++ b/fs/proc/root.c
@@ -171,6 +171,7 @@ static int proc_fill_super(struct super_block *s, struct fs_context *fc)
 		return -ENOMEM;
 
 	fs_info->pid_ns = get_pid_ns(ctx->pid_ns);
+	fs_info->mounter_cred = get_cred(fc->cred);
 	proc_apply_options(fs_info, fc, current_user_ns());
 
 	/* User space would break if executables or devices appear on proc */
@@ -220,6 +221,9 @@ static int proc_reconfigure(struct fs_context *fc)
 
 	sync_filesystem(sb);
 
+	put_cred(fs_info->mounter_cred);
+	fs_info->mounter_cred = get_cred(fc->cred);
+
 	proc_apply_options(fs_info, fc, current_user_ns());
 	return 0;
 }
@@ -274,6 +278,7 @@ static void proc_kill_sb(struct super_block *sb)
 
 	kill_anon_super(sb);
 	put_pid_ns(fs_info->pid_ns);
+	put_cred(fs_info->mounter_cred);
 	kfree(fs_info);
 }
 
diff --git a/include/linux/proc_fs.h b/include/linux/proc_fs.h
index 000cc0533c33..ffa871941bd0 100644
--- a/include/linux/proc_fs.h
+++ b/include/linux/proc_fs.h
@@ -64,6 +64,7 @@ struct proc_fs_info {
 	kgid_t pid_gid;
 	enum proc_hidepid hide_pid;
 	enum proc_pidonly pidonly;
+	const struct cred *mounter_cred;
 };
 
 static inline struct proc_fs_info *proc_sb_info(struct super_block *sb)
-- 
2.29.3



* [RESEND PATCH v6 3/5] proc: Disable cancellation of subset=pid option
  2021-07-16 10:45 [RESEND PATCH v6 0/5] proc: subset=pid: Relax check of mount visibility Alexey Gladkov
  2021-07-16 10:45 ` [RESEND PATCH v6 1/5] docs: proc: add documentation about mount restrictions Alexey Gladkov
  2021-07-16 10:46 ` [RESEND PATCH v6 2/5] proc: subset=pid: Show /proc/self/net only for CAP_NET_ADMIN Alexey Gladkov
@ 2021-07-16 10:46 ` Alexey Gladkov
  2021-07-16 10:46 ` [RESEND PATCH v6 4/5] proc: Relax check of mount visibility Alexey Gladkov
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 15+ messages in thread
From: Alexey Gladkov @ 2021-07-16 10:46 UTC (permalink / raw)
  To: LKML, Eric W . Biederman
  Cc: Alexander Viro, Kees Cook, Linux Containers, Linux FS Devel

When procfs is mounted with the subset=pid option, there is no way to
remount it with this option removed. This is intentional: some checks
occur only at mount time, so removing the option would make visible
whatever was hidden.

This patch makes the limitation explicit and prints an error message.
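
A minimal userspace sketch of that rule, assuming reduced, hypothetical
versions of the option context and fs info structures (model_ctx and
model_fs_info are not kernel types). The real code reports the error
through invalf(), which yields -EINVAL.

```c
#include <errno.h>
#include <stdbool.h>

enum proc_pidonly { PROC_PIDONLY_OFF = 0, PROC_PIDONLY_ON = 1 };

/* Hypothetical reduced form of the parsed mount options. */
struct model_ctx {
	bool subset_given;		/* models ctx->mask & (1 << Opt_subset) */
	enum proc_pidonly pidonly;
};

/* Hypothetical reduced form of the per-superblock state. */
struct model_fs_info {
	enum proc_pidonly pidonly;
};

/* Returns 0 on success or -EINVAL if subset=pid would be cancelled. */
static int model_apply_options(struct model_fs_info *fs_info,
			       const struct model_ctx *ctx)
{
	if (ctx->subset_given) {
		if (ctx->pidonly != PROC_PIDONLY_ON &&
		    fs_info->pidonly == PROC_PIDONLY_ON)
			return -EINVAL;
		fs_info->pidonly = ctx->pidonly;
	}
	return 0;
}
```

A remount that does not mention the subset option leaves the stored
value untouched, so the restriction only triggers on an explicit attempt
to change it.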

Signed-off-by: Alexey Gladkov <legion@kernel.org>
---
 fs/proc/root.c | 15 ++++++++++-----
 1 file changed, 10 insertions(+), 5 deletions(-)

diff --git a/fs/proc/root.c b/fs/proc/root.c
index 6a75ac717455..0d20bb67e79a 100644
--- a/fs/proc/root.c
+++ b/fs/proc/root.c
@@ -145,7 +145,7 @@ static int proc_parse_param(struct fs_context *fc, struct fs_parameter *param)
 	return 0;
 }
 
-static void proc_apply_options(struct proc_fs_info *fs_info,
+static int proc_apply_options(struct proc_fs_info *fs_info,
 			       struct fs_context *fc,
 			       struct user_namespace *user_ns)
 {
@@ -155,8 +155,12 @@ static void proc_apply_options(struct proc_fs_info *fs_info,
 		fs_info->pid_gid = make_kgid(user_ns, ctx->gid);
 	if (ctx->mask & (1 << Opt_hidepid))
 		fs_info->hide_pid = ctx->hidepid;
-	if (ctx->mask & (1 << Opt_subset))
+	if (ctx->mask & (1 << Opt_subset)) {
+		if (ctx->pidonly != PROC_PIDONLY_ON && fs_info->pidonly == PROC_PIDONLY_ON)
+			return invalf(fc, "proc: subset=pid cannot be unset\n");
 		fs_info->pidonly = ctx->pidonly;
+	}
+	return 0;
 }
 
 static int proc_fill_super(struct super_block *s, struct fs_context *fc)
@@ -172,7 +176,9 @@ static int proc_fill_super(struct super_block *s, struct fs_context *fc)
 
 	fs_info->pid_ns = get_pid_ns(ctx->pid_ns);
 	fs_info->mounter_cred = get_cred(fc->cred);
-	proc_apply_options(fs_info, fc, current_user_ns());
+	ret = proc_apply_options(fs_info, fc, current_user_ns());
+	if (ret)
+		return ret;
 
 	/* User space would break if executables or devices appear on proc */
 	s->s_iflags |= SB_I_USERNS_VISIBLE | SB_I_NOEXEC | SB_I_NODEV;
@@ -224,8 +230,7 @@ static int proc_reconfigure(struct fs_context *fc)
 	put_cred(fs_info->mounter_cred);
 	fs_info->mounter_cred = get_cred(fc->cred);
 
-	proc_apply_options(fs_info, fc, current_user_ns());
-	return 0;
+	return proc_apply_options(fs_info, fc, current_user_ns());
 }
 
 static int proc_get_tree(struct fs_context *fc)
-- 
2.29.3



* [RESEND PATCH v6 4/5] proc: Relax check of mount visibility
  2021-07-16 10:45 [RESEND PATCH v6 0/5] proc: subset=pid: Relax check of mount visibility Alexey Gladkov
                   ` (2 preceding siblings ...)
  2021-07-16 10:46 ` [RESEND PATCH v6 3/5] proc: Disable cancellation of subset=pid option Alexey Gladkov
@ 2021-07-16 10:46 ` Alexey Gladkov
  2021-07-16 10:46 ` [RESEND PATCH v6 5/5] docs: proc: add documentation about relaxing visibility restrictions Alexey Gladkov
  2025-12-13  5:06 ` [RESEND PATCH v6 0/5] proc: subset=pid: Relax check of mount visibility Dan Klishch
  5 siblings, 0 replies; 15+ messages in thread
From: Alexey Gladkov @ 2021-07-16 10:46 UTC (permalink / raw)
  To: LKML, Eric W . Biederman
  Cc: Alexander Viro, Kees Cook, Linux Containers, Linux FS Devel

Allow procfs to be mounted with the subset=pid option even if the entire
procfs is not fully accessible to the user.
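
The relaxed check can be sketched in userspace as follows. This is a
simplified model, not the kernel function: the types and names are
hypothetical, the readonly/atime flag comparison is left out, and only
the flag value 0x00000080 is taken from the patch.

```c
#include <stdbool.h>
#include <stddef.h>

#define MODEL_SB_I_DYNAMIC 0x00000080UL	/* value of SB_I_DYNAMIC in this patch */

/* Hypothetical model of a child mount sitting on top of procfs. */
struct model_child {
	bool locked;		/* models MNT_LOCKED */
	bool covers_empty_dir;	/* models is_empty_dir_inode() on the mountpoint */
};

/* Hypothetical model of an existing procfs mount in the namespace. */
struct model_mount {
	bool root_is_fs_root;	/* mnt_root == s_root */
	const struct model_child *children;
	size_t nr_children;
};

/*
 * Returns true if the existing mount lets a new superblock with the
 * given s_iflags pass the visibility check.
 */
static bool model_mount_counts_as_visible(unsigned long new_sb_iflags,
					  const struct model_mount *mnt)
{
	bool dynamic = (new_sb_iflags & MODEL_SB_I_DYNAMIC) != 0;
	size_t i;

	/* A bind mount of a subdirectory only disqualifies non-dynamic mounts. */
	if (!dynamic && !mnt->root_is_fs_root)
		return false;

	/* Locked overmounts are only examined for non-dynamic mounts. */
	if (!dynamic) {
		for (i = 0; i < mnt->nr_children; i++) {
			if (!mnt->children[i].locked)
				continue;
			if (!mnt->children[i].covers_empty_dir)
				return false;
		}
	}
	return true;
}
```

In this model a procfs with a locked overmount on a regular file fails
the check for an ordinary mount but passes it once the new superblock
carries the dynamic flag, which is what subset=pid sets.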

Signed-off-by: Alexey Gladkov <legion@kernel.org>
---
 fs/namespace.c     | 30 ++++++++++++++++++------------
 fs/proc/root.c     | 16 ++++++++++------
 include/linux/fs.h |  1 +
 3 files changed, 29 insertions(+), 18 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index 9d33909d0f9e..f38570fdfc3f 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -3951,7 +3951,8 @@ static bool mnt_already_visible(struct mnt_namespace *ns,
 		/* This mount is not fully visible if it's root directory
 		 * is not the root directory of the filesystem.
 		 */
-		if (mnt->mnt.mnt_root != mnt->mnt.mnt_sb->s_root)
+		if (!(sb->s_iflags & SB_I_DYNAMIC) &&
+		    mnt->mnt.mnt_root != mnt->mnt.mnt_sb->s_root)
 			continue;
 
 		/* A local view of the mount flags */
@@ -3971,18 +3972,23 @@ static bool mnt_already_visible(struct mnt_namespace *ns,
 		    ((mnt_flags & MNT_ATIME_MASK) != (new_flags & MNT_ATIME_MASK)))
 			continue;
 
-		/* This mount is not fully visible if there are any
-		 * locked child mounts that cover anything except for
-		 * empty directories.
+		/* If this filesystem is completely dynamic, then it
+		 * makes no sense to check for any child mounts.
 		 */
-		list_for_each_entry(child, &mnt->mnt_mounts, mnt_child) {
-			struct inode *inode = child->mnt_mountpoint->d_inode;
-			/* Only worry about locked mounts */
-			if (!(child->mnt.mnt_flags & MNT_LOCKED))
-				continue;
-			/* Is the directory permanetly empty? */
-			if (!is_empty_dir_inode(inode))
-				goto next;
+		if (!(sb->s_iflags & SB_I_DYNAMIC)) {
+			/* This mount is not fully visible if there are any
+			 * locked child mounts that cover anything except for
+			 * empty directories.
+			 */
+			list_for_each_entry(child, &mnt->mnt_mounts, mnt_child) {
+				struct inode *inode = child->mnt_mountpoint->d_inode;
+				/* Only worry about locked mounts */
+				if (!(child->mnt.mnt_flags & MNT_LOCKED))
+					continue;
+				/* Is the directory permanetly empty? */
+				if (!is_empty_dir_inode(inode))
+					goto next;
+			}
 		}
 		/* Preserve the locked attributes */
 		*new_mnt_flags |= mnt_flags & (MNT_LOCK_READONLY | \
diff --git a/fs/proc/root.c b/fs/proc/root.c
index 0d20bb67e79a..c739ed94246c 100644
--- a/fs/proc/root.c
+++ b/fs/proc/root.c
@@ -145,18 +145,21 @@ static int proc_parse_param(struct fs_context *fc, struct fs_parameter *param)
 	return 0;
 }
 
-static int proc_apply_options(struct proc_fs_info *fs_info,
+static int proc_apply_options(struct super_block *s,
 			       struct fs_context *fc,
 			       struct user_namespace *user_ns)
 {
 	struct proc_fs_context *ctx = fc->fs_private;
+	struct proc_fs_info *fs_info = proc_sb_info(s);
 
 	if (ctx->mask & (1 << Opt_gid))
 		fs_info->pid_gid = make_kgid(user_ns, ctx->gid);
 	if (ctx->mask & (1 << Opt_hidepid))
 		fs_info->hide_pid = ctx->hidepid;
 	if (ctx->mask & (1 << Opt_subset)) {
-		if (ctx->pidonly != PROC_PIDONLY_ON && fs_info->pidonly == PROC_PIDONLY_ON)
+		if (ctx->pidonly == PROC_PIDONLY_ON)
+			s->s_iflags |= SB_I_DYNAMIC;
+		else if (fs_info->pidonly == PROC_PIDONLY_ON)
 			return invalf(fc, "proc: subset=pid cannot be unset\n");
 		fs_info->pidonly = ctx->pidonly;
 	}
@@ -176,9 +179,6 @@ static int proc_fill_super(struct super_block *s, struct fs_context *fc)
 
 	fs_info->pid_ns = get_pid_ns(ctx->pid_ns);
 	fs_info->mounter_cred = get_cred(fc->cred);
-	ret = proc_apply_options(fs_info, fc, current_user_ns());
-	if (ret)
-		return ret;
 
 	/* User space would break if executables or devices appear on proc */
 	s->s_iflags |= SB_I_USERNS_VISIBLE | SB_I_NOEXEC | SB_I_NODEV;
@@ -190,6 +190,10 @@ static int proc_fill_super(struct super_block *s, struct fs_context *fc)
 	s->s_time_gran = 1;
 	s->s_fs_info = fs_info;
 
+	ret = proc_apply_options(s, fc, current_user_ns());
+	if (ret)
+		return ret;
+
 	/*
 	 * procfs isn't actually a stacking filesystem; however, there is
 	 * too much magic going on inside it to permit stacking things on
@@ -230,7 +234,7 @@ static int proc_reconfigure(struct fs_context *fc)
 	put_cred(fs_info->mounter_cred);
 	fs_info->mounter_cred = get_cred(fc->cred);
 
-	return proc_apply_options(fs_info, fc, current_user_ns());
+	return proc_apply_options(sb, fc, current_user_ns());
 }
 
 static int proc_get_tree(struct fs_context *fc)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index fd47deea7c17..2c9a47bad796 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1390,6 +1390,7 @@ extern int send_sigurg(struct fown_struct *fown);
 #define SB_I_USERNS_VISIBLE		0x00000010 /* fstype already mounted */
 #define SB_I_IMA_UNVERIFIABLE_SIGNATURE	0x00000020
 #define SB_I_UNTRUSTED_MOUNTER		0x00000040
+#define SB_I_DYNAMIC			0x00000080
 
 #define SB_I_SKIP_SYNC	0x00000100	/* Skip superblock at global sync */
 
-- 
2.29.3



* [RESEND PATCH v6 5/5] docs: proc: add documentation about relaxing visibility restrictions
  2021-07-16 10:45 [RESEND PATCH v6 0/5] proc: subset=pid: Relax check of mount visibility Alexey Gladkov
                   ` (3 preceding siblings ...)
  2021-07-16 10:46 ` [RESEND PATCH v6 4/5] proc: Relax check of mount visibility Alexey Gladkov
@ 2021-07-16 10:46 ` Alexey Gladkov
  2025-12-13  5:06 ` [RESEND PATCH v6 0/5] proc: subset=pid: Relax check of mount visibility Dan Klishch
  5 siblings, 0 replies; 15+ messages in thread
From: Alexey Gladkov @ 2021-07-16 10:46 UTC (permalink / raw)
  To: LKML, Eric W . Biederman
  Cc: Alexander Viro, Kees Cook, Linux Containers, Linux FS Devel

Signed-off-by: Alexey Gladkov <legion@kernel.org>
---
 Documentation/filesystems/proc.rst | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
index 5a1bb0e081fd..9d993aef7f1c 100644
--- a/Documentation/filesystems/proc.rst
+++ b/Documentation/filesystems/proc.rst
@@ -2182,7 +2182,8 @@ are not related to tasks.
 If user namespaces are in use, the kernel additionally checks the instances of
 procfs available to the mounter and will not allow procfs to be mounted if:
 
-  1. This mount is not fully visible.
+  1. This mount is not fully visible unless the new procfs is going to be
+     mounted with subset=pid option.
 
      a. It's root directory is not the root directory of the filesystem.
      b. If any file or non-empty procfs directory is hidden by another mount.
-- 
2.29.3



* Re: [RESEND PATCH v6 0/5] proc: subset=pid: Relax check of mount visibility
  2021-07-16 10:45 [RESEND PATCH v6 0/5] proc: subset=pid: Relax check of mount visibility Alexey Gladkov
                   ` (4 preceding siblings ...)
  2021-07-16 10:46 ` [RESEND PATCH v6 5/5] docs: proc: add documentation about relaxing visibility restrictions Alexey Gladkov
@ 2025-12-13  5:06 ` Dan Klishch
  2025-12-13 10:49   ` Alexey Gladkov
  5 siblings, 1 reply; 15+ messages in thread
From: Dan Klishch @ 2025-12-13  5:06 UTC (permalink / raw)
  To: legion, linux-kernel
  Cc: ebiederm, viro, keescook, containers, linux-fsdevel, Dan Klishch

Hello Alexey,

Would it be possible to revive this patch series?

I wanted to add an additional downstream use case that would benefit
from this work. In particular, I am trying to run the sandbox
sunwalker-box [1] without root privileges and/or inside a container.

The sandbox aims to prevent cross-run communication via side channels,
and PID allocation is one such channel. Therefore, it creates a new PID
namespace and mounts the corresponding procfs instance inside of the
sandbox. This currently works without a real root when procfs is fully
accessible, but obviously fails otherwise.
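
For illustration only, the mount step described above could look roughly
like this from userspace. It is a sketch, not the sandbox's actual code:
the helper name is hypothetical, and the caller is assumed to have
already entered new user, PID and mount namespaces (for example via
unshare(2)), which is not shown here.

```c
#include <sys/mount.h>

/*
 * Mount a procfs instance limited to the PID tree at `target`.
 * Returns 0 on success, or -1 with errno set by mount(2) on failure.
 */
static int mount_pid_only_proc(const char *target)
{
	return mount("proc", target, "proc",
		     MS_NOSUID | MS_NODEV | MS_NOEXEC, "subset=pid");
}
```

Without this series, the call is expected to fail in a user namespace
whenever the procfs visible to the caller is partially overmounted.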

Thanks,
Dan Klishch

[1] https://github.com/purplesyringa/sunwalker-box/


* Re: [RESEND PATCH v6 0/5] proc: subset=pid: Relax check of mount visibility
  2025-12-13  5:06 ` [RESEND PATCH v6 0/5] proc: subset=pid: Relax check of mount visibility Dan Klishch
@ 2025-12-13 10:49   ` Alexey Gladkov
  2025-12-13 18:00     ` Dan Klishch
  0 siblings, 1 reply; 15+ messages in thread
From: Alexey Gladkov @ 2025-12-13 10:49 UTC (permalink / raw)
  To: Dan Klishch
  Cc: linux-kernel, ebiederm, viro, keescook, containers, linux-fsdevel

On Sat, Dec 13, 2025 at 12:06:38AM -0500, Dan Klishch wrote:
> Hello Alexey,
> 
> Would it be possible to revive this patch series?
> 
> I wanted to add an additional downstream use case that would benefit
> from this work. In particular, I am trying to run the sandbox
> sunwalker-box [1] without root privileges and/or inside a container.
> 
> The sandbox aims to prevent cross-run communication via side channels,
> and PID allocation is one such channel. Therefore, it creates a new PID
> namespace and mounts the corresponding procfs instance inside of the
> sandbox. This currently works without a real root when procfs is fully
> accessible, but obviously fails otherwise.
> 
> Thanks,
> Dan Klishch
> 
> [1] https://github.com/purplesyringa/sunwalker-box/
> 

Overmounting "dangerous" files in procfs is an incorrect and potentially
dangerous practice. I know that many programs (docker, podman, etc.) use
this method, but it is not the correct way to isolate dangerous files in
procfs.

In particular, this is one of the reasons why this patchset was abandoned.

It is quite difficult to implement these checks in procfs correctly and
not break anything. It is much easier to implement file access
restrictions in procfs using an ebpf controller. Some time ago, I tried to
implement such a controller [1], and it seemed to me that it was much
easier than adding complex checks to the kernel.

If I'm wrong and missing a use case, let me know and we can go back to
the patches.

[1] https://github.com/legionus/proc-bpf-controller

-- 
Rgrds, legion


* Re: [RESEND PATCH v6 0/5] proc: subset=pid: Relax check of mount visibility
  2025-12-13 10:49   ` Alexey Gladkov
@ 2025-12-13 18:00     ` Dan Klishch
  2025-12-14 16:40       ` Alexey Gladkov
  0 siblings, 1 reply; 15+ messages in thread
From: Dan Klishch @ 2025-12-13 18:00 UTC (permalink / raw)
  To: legion; +Cc: containers, ebiederm, keescook, linux-fsdevel, linux-kernel, viro

> It is much easier to implement file access
> restrictions in procfs using an ebpf controller.

But if we already have a masked /proc from podman/docker/user who
decided to run `mount --bind /dev/null /proc/smth`, the sandbox will
not have a choice other than to bail out. Also, correct me if I am
wrong, but installing an eBPF controller requires CAP_BPF in the initial
userns, so rootless podman will not be able to mask /proc "properly"
even if someone sends a patch switching it to eBPF.

Thanks,
Dan Klishch


* Re: [RESEND PATCH v6 0/5] proc: subset=pid: Relax check of mount visibility
  2025-12-13 18:00     ` Dan Klishch
@ 2025-12-14 16:40       ` Alexey Gladkov
  2025-12-14 18:02         ` Dan Klishch
  0 siblings, 1 reply; 15+ messages in thread
From: Alexey Gladkov @ 2025-12-14 16:40 UTC (permalink / raw)
  To: Dan Klishch
  Cc: containers, ebiederm, keescook, linux-fsdevel, linux-kernel, viro

On Sat, Dec 13, 2025 at 01:00:38PM -0500, Dan Klishch wrote:
> > It is much easier to implement file access
> > restrictions in procfs using an ebpf controller.
> 
> But if we already have a masked /proc from podman/docker/user who
> decided to run `mount --bind /dev/null /proc/smth`, the sandbox will
> not have a choice other than to bail out.

I misunderstood you. I thought you were writing your own container
implementation.

Yes, if you want a nested container inside docker/podman, then the file
overmount technique is already used there.

But then, if I understand you correctly, this patch will not be enough
for you. procfs with subset=pid will not allow you to have /proc/meminfo,
/proc/cpuinfo, etc.

> Also, correct me if I am wrong, installing ebpf controller requires
> CAP_BPF in initial userns, so rootless podman will not be able to mask
> /proc "properly" even if someone sends a patch switching it to ebpf.

You can turn on /proc/sys/kernel/unprivileged_bpf_disabled.

-- 
Rgrds, legion



* Re: [RESEND PATCH v6 0/5] proc: subset=pid: Relax check of mount visibility
  2025-12-14 16:40       ` Alexey Gladkov
@ 2025-12-14 18:02         ` Dan Klishch
  2025-12-15 10:10           ` Alexey Gladkov
  2025-12-15 11:30           ` Christian Brauner
  0 siblings, 2 replies; 15+ messages in thread
From: Dan Klishch @ 2025-12-14 18:02 UTC (permalink / raw)
  To: legion; +Cc: containers, ebiederm, keescook, linux-fsdevel, linux-kernel, viro

On 12/14/25 11:40 AM, Alexey Gladkov wrote:
> But then, if I understand you correctly, this patch will not be enough
> for you. procfs with subset=pid will not allow you to have /proc/meminfo,
> /proc/cpuinfo, etc.

Hmm, I didn't think of this. sunwalker-box only exposes cpuinfo and the
PID tree to the sandboxed programs (empirically, this is enough for most
of the programs you want sandboxing for). With that in mind, this patch
combined with any one of the following still looks like a clean solution
to me: a FUSE filesystem providing an overlay with cpuinfo, seccomp
intercepting opens of /proc/cpuinfo, or a small kernel patch adding a
procfs mount option that exposes more static files.

>> Also, correct me if I am wrong, installing ebpf controller requires
>> CAP_BPF in initial userns, so rootless podman will not be able to mask
>> /proc "properly" even if someone sends a patch switching it to ebpf.
> 
> You can turn on /proc/sys/kernel/unprivileged_bpf_disabled.

$ cat /proc/sys/kernel/unprivileged_bpf_disabled
0
$ unshare -pfr --mount-proc
$ ./proc-controller -p deny /proc/cpuinfo
libbpf: prog 'proc_access_restrict': BPF program load failed: Operation not permitted
libbpf: prog 'proc_access_restrict': failed to load: -1
libbpf: failed to load object './proc-controller.bpf.o'
proc-controller: ERROR: loading BPF object file failed

I think only packet filters are allowed to be installed by non-root.

Thanks,
Dan Klishch


* Re: [RESEND PATCH v6 0/5] proc: subset=pid: Relax check of mount visibility
  2025-12-14 18:02         ` Dan Klishch
@ 2025-12-15 10:10           ` Alexey Gladkov
  2025-12-15 14:46             ` Dan Klishch
  2025-12-15 11:30           ` Christian Brauner
  1 sibling, 1 reply; 15+ messages in thread
From: Alexey Gladkov @ 2025-12-15 10:10 UTC (permalink / raw)
  To: Dan Klishch
  Cc: containers, ebiederm, keescook, linux-fsdevel, linux-kernel, viro

On Sun, Dec 14, 2025 at 01:02:54PM -0500, Dan Klishch wrote:
> On 12/14/25 11:40 AM, Alexey Gladkov wrote:
> > But then, if I understand you correctly, this patch will not be enough
> > for you. procfs with subset=pid will not allow you to have /proc/meminfo,
> > /proc/cpuinfo, etc.
> 
> Hmm, I didn't think of this. sunwalker-box only exposes cpuinfo and PID
> tree to the sandboxed programs (empirically, this is enough for most of
> programs you want sandboxing for). With that in mind, this patch and a
> FUSE providing an overlay with cpuinfo / seccomp intercepting opens of
> /proc/cpuinfo / a small kernel patch with a new mount option for procfs
> to expose more static files still look like a clean solution to me.

I don't think you'll be able to do that. procfs doesn't allow itself to
be overlaid [1], which should block mounting overlayfs and fuse on top
of procfs.

[1] https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/proc/root.c#n274

> >> Also, correct me if I am wrong, installing ebpf controller requires
> >> CAP_BPF in initial userns, so rootless podman will not be able to mask
> >> /proc "properly" even if someone sends a patch switching it to ebpf.
> > 
> > You can turn on /proc/sys/kernel/unprivileged_bpf_disabled.
> 
> $ cat /proc/sys/kernel/unprivileged_bpf_disabled
> 0
> $ unshare -pfr --mount-proc
> $ ./proc-controller -p deny /proc/cpuinfo
> libbpf: prog 'proc_access_restrict': BPF program load failed: Operation not permitted
> libbpf: prog 'proc_access_restrict': failed to load: -1
> libbpf: failed to load object './proc-controller.bpf.o'
> proc-controller: ERROR: loading BPF object file failed
> 
> I think only packet filters are allowed to be installed by non-root.

I probably forgot about that. I wrote this code a long time ago, and
to be honest, I forgot whether it can be used for rootless.

-- 
Rgrds, legion



* Re: [RESEND PATCH v6 0/5] proc: subset=pid: Relax check of mount visibility
  2025-12-15 10:10           ` Alexey Gladkov
@ 2025-12-15 14:46             ` Dan Klishch
  2025-12-15 14:58               ` Alexey Gladkov
  0 siblings, 1 reply; 15+ messages in thread
From: Dan Klishch @ 2025-12-15 14:46 UTC (permalink / raw)
  To: legion, brauner
  Cc: containers, ebiederm, keescook, linux-fsdevel, linux-kernel, viro

On 12/15/25 5:10 AM, Alexey Gladkov wrote:
> On Sun, Dec 14, 2025 at 01:02:54PM -0500, Dan Klishch wrote:
>> On 12/14/25 11:40 AM, Alexey Gladkov wrote:
>>> But then, if I understand you correctly, this patch will not be enough
>>> for you. procfs with subset=pid will not allow you to have /proc/meminfo,
>>> /proc/cpuinfo, etc.
>>
>> Hmm, I didn't think of this. sunwalker-box only exposes cpuinfo and PID
>> tree to the sandboxed programs (empirically, this is enough for most of
>> programs you want sandboxing for). With that in mind, this patch and a
>> FUSE providing an overlay with cpuinfo / seccomp intercepting opens of
>> /proc/cpuinfo / a small kernel patch with a new mount option for procfs
>> to expose more static files still look like a clean solution to me.
> 
> I don't think you'll be able to do that. procfs doesn't allow itself to
> be overlaid [1], which should block mounting overlayfs and fuse on top
> of procfs.
> 
> [1] https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/proc/root.c#n274

This is why I have been careful not to say overlayfs. With [2] (warning:
zero-shot ChatGPT output), I can do:

$ ./fuse-overlay target --source=/proc
$ ls target
1   88   194   1374    889840  908552
2   90   195   1375    889987  908619
3   91   196   1379    890031  908658
4   92   203   1412    890063  908756
5   93   205   1590    890085  908804
6   94   233   1644    890139  908951
7   96   237   1802    890246  909848
8   97   239   1850    890271  909914
10  98   240   1852    894665  909924
13  99   243   1865    895854  909926
15  100  244   1888    895864  910005
16  102  246   1889    896030  acpi
17  103  262   1891    896205  asound
18  104  263   1895    896508  bus
19  105  264   1896    896544  driver
20  106  265   1899    896706  dynamic_debug
<...>

[2] https://gist.github.com/DanShaders/547eeb74a90315356b98472feae47474

This requires much more careful thought wrt magic symlinks
and permission checks. The fact that I am highly unlikely to reimplement
the checks and special behavior of procfs 100% correctly makes me
not want to proceed with the FUSE route.

On 12/15/25 6:30 AM, Christian Brauner wrote:
> The standard way of making it possible to mount procfs inside of a
> container with a separate mount namespace that has a procfs inside it
> with overmounted entries is to ensure that a fully-visible procfs
> instance is present.

Yes, this is a solution. However, this is only marginally better than
passing --privileged to the outer container (in the sense that we require
the outer sandbox to remove some protections for the inner sandbox to work).

> The container needs to inherit a fully-visible instance somehow if you
> want nesting. Using an unprivileged LSM such as landlock to prevent any
> access to the fully visible procfs instance is usually the better way.
> 
> My hope is that once signed bpf is more widely adopted that distros will
> just start enabling blessed bpf programs that will just take on the
> access protecting instead of the clumsy bind-mount protection mechanism.

These are big changes to container runtimes that are unlikely to happen
soon. In contrast, the patch we are discussing will be available in 2
months after the merge for me to use on ArchLinux, and in a couple more
months on Ubuntu.

So, is there any way forward with the patch or should I continue trying
to find a userspace solution?

Thanks,
Dan Klishch


* Re: [RESEND PATCH v6 0/5] proc: subset=pid: Relax check of mount visibility
  2025-12-15 14:46             ` Dan Klishch
@ 2025-12-15 14:58               ` Alexey Gladkov
  0 siblings, 0 replies; 15+ messages in thread
From: Alexey Gladkov @ 2025-12-15 14:58 UTC (permalink / raw)
  To: Dan Klishch
  Cc: brauner, containers, ebiederm, keescook, linux-fsdevel,
	linux-kernel, viro

On Mon, Dec 15, 2025 at 09:46:00AM -0500, Dan Klishch wrote:
> On 12/15/25 5:10 AM, Alexey Gladkov wrote:
> > On Sun, Dec 14, 2025 at 01:02:54PM -0500, Dan Klishch wrote:
> >> On 12/14/25 11:40 AM, Alexey Gladkov wrote:
> >>> But then, if I understand you correctly, this patch will not be enough
> >>> for you. procfs with subset=pid will not allow you to have /proc/meminfo,
> >>> /proc/cpuinfo, etc.
> >>
> >> Hmm, I didn't think of this. sunwalker-box only exposes cpuinfo and PID
> >> tree to the sandboxed programs (empirically, this is enough for most
> >> programs you want to sandbox). With that in mind, this patch and a
> >> FUSE providing an overlay with cpuinfo / seccomp intercepting opens of
> >> /proc/cpuinfo / a small kernel patch with a new mount option for procfs
> >> to expose more static files still look like a clean solution to me.
> > 
> > I don't think you'll be able to do that. procfs doesn't allow itself to
> > be overlaid [1], which should block mounting overlayfs and fuse on top
> > of procfs.
> > 
> > [1] https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/proc/root.c#n274
> 
> This is why I have been careful not to say overlayfs. With [2] (warning:
> zero-shot ChatGPT output), I can do:
> 
> $ ./fuse-overlay target --source=/proc
> $ ls target
> 1   88   194   1374    889840  908552
> 2   90   195   1375    889987  908619
> 3   91   196   1379    890031  908658
> 4   92   203   1412    890063  908756
> 5   93   205   1590    890085  908804
> 6   94   233   1644    890139  908951
> 7   96   237   1802    890246  909848
> 8   97   239   1850    890271  909914
> 10  98   240   1852    894665  909924
> 13  99   243   1865    895854  909926
> 15  100  244   1888    895864  910005
> 16  102  246   1889    896030  acpi
> 17  103  262   1891    896205  asound
> 18  104  263   1895    896508  bus
> 19  105  264   1896    896544  driver
> 20  106  265   1899    896706  dynamic_debug
> <...>
> 
> [2] https://gist.github.com/DanShaders/547eeb74a90315356b98472feae47474
> 
> This requires much more careful thought wrt magic symlinks and
> permission checks. The fact that I am highly unlikely to reimplement the
> checks and special behavior of procfs 100% correctly makes me not want
> to proceed with the FUSE route.
> 
> On 12/15/25 6:30 AM, Christian Brauner wrote:
> > The standard way of making it possible to mount procfs inside of a
> > container with a separate mount namespace that has a procfs inside it
> > with overmounted entries is to ensure that a fully-visible procfs
> > instance is present.
> 
> Yes, this is a solution. However, this is only marginally better than
> passing --privileged to the outer container (in the sense that we require
> the outer sandbox to remove some protections for the inner sandbox to
> work).
> 
> > The container needs to inherit a fully-visible instance somehow if you
> > want nesting. Using an unprivileged LSM such as landlock to prevent any
> > access to the fully visible procfs instance is usually the better way.
> > 
> > My hope is that once signed bpf is more widely adopted, distros will
> > just start enabling blessed bpf programs that take over the access
> > protection instead of the clumsy bind-mount protection mechanism.
> 
> These are big changes to container runtimes that are unlikely to happen
> soon. In contrast, the patch we are discussing will be available for me
> to use on ArchLinux 2 months after the merge, and on Ubuntu a couple of
> months after that.
> 
> So, is there any way forward with the patch or should I continue trying
> to find a userspace solution?

I still consider these patches useful. I made them precisely to remove
some of the restrictions we have for procfs because of global files in
the root of this filesystem.

I can update and prepare a new version of the patchset if Christian
thinks it's useful too.

-- 
Rgrds, legion



* Re: [RESEND PATCH v6 0/5] proc: subset=pid: Relax check of mount visibility
  2025-12-14 18:02         ` Dan Klishch
  2025-12-15 10:10           ` Alexey Gladkov
@ 2025-12-15 11:30           ` Christian Brauner
  1 sibling, 0 replies; 15+ messages in thread
From: Christian Brauner @ 2025-12-15 11:30 UTC (permalink / raw)
  To: Dan Klishch
  Cc: legion, containers, ebiederm, keescook, linux-fsdevel,
	linux-kernel, viro

On Sun, Dec 14, 2025 at 01:02:54PM -0500, Dan Klishch wrote:
> On 12/14/25 11:40 AM, Alexey Gladkov wrote:
> > But then, if I understand you correctly, this patch will not be enough
> > for you. procfs with subset=pid will not allow you to have /proc/meminfo,
> > /proc/cpuinfo, etc.
> 
> Hmm, I didn't think of this. sunwalker-box only exposes cpuinfo and PID
> tree to the sandboxed programs (empirically, this is enough for most
> programs you want to sandbox). With that in mind, this patch and a
> FUSE providing an overlay with cpuinfo / seccomp intercepting opens of
> /proc/cpuinfo / a small kernel patch with a new mount option for procfs
> to expose more static files still look like a clean solution to me.

The standard way of making it possible to mount procfs inside of a
container with a separate mount namespace that has a procfs inside it
with overmounted entries is to ensure that a fully-visible procfs
instance is present. This is for example what Incus does when nesting
containers is enabled. In systemd I implemented the same logic years
ago:

commit b71a0192c040f585397cfc6fc2ca025bf839733d
Author:     Christian Brauner <brauner@kernel.org>
AuthorDate: Mon Nov 28 12:36:47 2022 +0100
Commit:     Christian Brauner (Microsoft) <brauner@kernel.org>
CommitDate: Mon Dec 5 18:34:25 2022 +0100

    nspawn: mount temporary visible procfs and sysfs instance

    In order to mount procfs and sysfs in an unprivileged container the
    kernel requires that a fully visible instance is already present in the
    target mount namespace. Mount one here so the inner child can mount its
    own instances. Later we umount the temporary instances created here
    before we actually exec the payload. Since the rootfs is shared the
    umount will propagate into the container. Note, the inner child wouldn't
    be able to unmount the instances on its own since it doesn't own the
    originating mount namespace. IOW, the outer child needs to do this.

    So far nspawn didn't run into this issue because it used MS_MOVE which
    meant that the shadow mount tree pinned a procfs and sysfs instance
    which the kernel would find. The shadow mount tree is gone with proper
    pivot_root() semantics.

    Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
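
As a rough illustration of the requirement described in the commit
message above, here is a minimal Python sketch (not taken from nspawn or
the kernel) that scans mountinfo-formatted text for a procfs instance
with nothing mounted on top of it. The helper names parse_mountinfo and
fully_visible_procfs are made up for this example, and the check is a
deliberate simplification: it treats any mount below a procfs mount point
as hiding it, whereas the kernel's visibility check looks at further
details such as mount flags.

```python
def parse_mountinfo(text):
    """Return (mount_point, root, fstype) tuples from mountinfo-style text."""
    mounts = []
    for line in text.splitlines():
        if not line.strip():
            continue
        # Fields before " - " are per-mount; fields after it start with fstype.
        left, _, right = line.partition(" - ")
        fields = left.split()
        root, mount_point = fields[3], fields[4]
        fstype = right.split()[0]
        mounts.append((mount_point, root, fstype))
    return mounts


def fully_visible_procfs(mounts):
    """List procfs mount points that have no other mount underneath them.

    Assumption: "fully visible" is approximated as "mounted from the
    filesystem root and with no overmounted entries at all".
    """
    visible = []
    for mount_point, root, fstype in mounts:
        if fstype != "proc" or root != "/":
            continue
        prefix = mount_point.rstrip("/") + "/"
        covered = any(
            other != mount_point and other.startswith(prefix)
            for other, _, _ in mounts
        )
        if not covered:
            visible.append(mount_point)
    return visible
```

On a live system the input could be the contents of
/proc/self/mountinfo; an empty result would suggest that mounting a new
procfs from an unprivileged user namespace is likely to be refused.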

> 
> >> Also, correct me if I am wrong, installing ebpf controller requires
> >> CAP_BPF in initial userns, so rootless podman will not be able to mask
> >> /proc "properly" even if someone sends a patch switching it to ebpf.

The container needs to inherit a fully-visible instance somehow if you
want nesting. Using an unprivileged LSM such as landlock to prevent any
access to the fully visible procfs instance is usually the better way.

My hope is that once signed bpf is more widely adopted, distros will
just start enabling blessed bpf programs that take over the access
protection instead of the clumsy bind-mount protection mechanism.


end of thread, other threads:[~2025-12-15 14:58 UTC | newest]

Thread overview: 15+ messages
2021-07-16 10:45 [RESEND PATCH v6 0/5] proc: subset=pid: Relax check of mount visibility Alexey Gladkov
2021-07-16 10:45 ` [RESEND PATCH v6 1/5] docs: proc: add documentation about mount restrictions Alexey Gladkov
2021-07-16 10:46 ` [RESEND PATCH v6 2/5] proc: subset=pid: Show /proc/self/net only for CAP_NET_ADMIN Alexey Gladkov
2021-07-16 10:46 ` [RESEND PATCH v6 3/5] proc: Disable cancellation of subset=pid option Alexey Gladkov
2021-07-16 10:46 ` [RESEND PATCH v6 4/5] proc: Relax check of mount visibility Alexey Gladkov
2021-07-16 10:46 ` [RESEND PATCH v6 5/5] docs: proc: add documentation about relaxing visibility restrictions Alexey Gladkov
2025-12-13  5:06 ` [RESEND PATCH v6 0/5] proc: subset=pid: Relax check of mount visibility Dan Klishch
2025-12-13 10:49   ` Alexey Gladkov
2025-12-13 18:00     ` Dan Klishch
2025-12-14 16:40       ` Alexey Gladkov
2025-12-14 18:02         ` Dan Klishch
2025-12-15 10:10           ` Alexey Gladkov
2025-12-15 14:46             ` Dan Klishch
2025-12-15 14:58               ` Alexey Gladkov
2025-12-15 11:30           ` Christian Brauner
