* [RESEND PATCH v6 1/5] docs: proc: add documentation about mount restrictions
2021-07-16 10:45 [RESEND PATCH v6 0/5] proc: subset=pid: Relax check of mount visibility Alexey Gladkov
@ 2021-07-16 10:45 ` Alexey Gladkov
2021-07-16 10:46 ` [RESEND PATCH v6 2/5] proc: subset=pid: Show /proc/self/net only for CAP_NET_ADMIN Alexey Gladkov
` (4 subsequent siblings)
5 siblings, 0 replies; 15+ messages in thread
From: Alexey Gladkov @ 2021-07-16 10:45 UTC (permalink / raw)
To: LKML, Eric W . Biederman
Cc: Alexander Viro, Kees Cook, Linux Containers, Linux FS Devel
Signed-off-by: Alexey Gladkov <legion@kernel.org>
---
Documentation/filesystems/proc.rst | 14 ++++++++++++++
1 file changed, 14 insertions(+)
diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
index 2fa69f710e2a..5a1bb0e081fd 100644
--- a/Documentation/filesystems/proc.rst
+++ b/Documentation/filesystems/proc.rst
@@ -50,6 +50,7 @@ fixes/update part 1.1 Stefani Seibold <stefani@seibold.net> June 9 2009
4 Configuring procfs
4.1 Mount options
+ 4.2 Mount restrictions
5 Filesystem behavior
@@ -2175,6 +2176,19 @@ information about processes information, just add identd to this group.
subset=pid hides all top level files and directories in the procfs that
are not related to tasks.
+4.2 Mount restrictions
+--------------------------
+
+If user namespaces are in use, the kernel additionally checks the instances of
+procfs available to the mounter and will not allow procfs to be mounted if:
+
+ 1. This mount is not fully visible.
+
+ a. Its root directory is not the root directory of the filesystem.
+ b. If any file or non-empty procfs directory is hidden by another mount.
+
+ 2. A new mount overrides the readonly option or any option from the atime family.
+
Chapter 5: Filesystem behavior
==============================
--
2.29.3
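
For illustration, the rules documented in this patch can be modeled in
plain userspace C. This is a minimal sketch, not kernel code: the names
struct proc_mount_view and procfs_mount_allowed() are hypothetical, the
atime option family is reduced to a single integer, and the sketch
ignores the mount lock flags that the kernel consults. It evaluates a
new mount request against one existing procfs instance only.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of one procfs instance already visible to the mounter. */
struct proc_mount_view {
	bool root_is_fs_root;    /* rule 1a: mount root is the filesystem root */
	bool has_hidden_entries; /* rule 1b: a file or non-empty dir is overmounted */
	bool readonly;           /* the existing instance is mounted read-only */
	int atime_mode;          /* stand-in for the atime option family */
};

/*
 * Return true if a new procfs mount with the requested options would pass
 * the documented checks against this existing instance.
 */
bool procfs_mount_allowed(const struct proc_mount_view *old,
			  bool new_readonly, int new_atime_mode)
{
	assert(old);

	/* Rule 1: the existing instance must be fully visible. */
	if (!old->root_is_fs_root || old->has_hidden_entries)
		return false;

	/* Rule 2: the new mount must not drop readonly or change atime. */
	if (old->readonly && !new_readonly)
		return false;
	if (old->atime_mode != new_atime_mode)
		return false;

	return true;
}
```

In this model, an instance with even one overmounted file fails the
check, whatever options the new mount asks for.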
* [RESEND PATCH v6 2/5] proc: subset=pid: Show /proc/self/net only for CAP_NET_ADMIN
2021-07-16 10:45 [RESEND PATCH v6 0/5] proc: subset=pid: Relax check of mount visibility Alexey Gladkov
2021-07-16 10:45 ` [RESEND PATCH v6 1/5] docs: proc: add documentation about mount restrictions Alexey Gladkov
@ 2021-07-16 10:46 ` Alexey Gladkov
2021-07-16 10:46 ` [RESEND PATCH v6 3/5] proc: Disable cancellation of subset=pid option Alexey Gladkov
` (3 subsequent siblings)
5 siblings, 0 replies; 15+ messages in thread
From: Alexey Gladkov @ 2021-07-16 10:46 UTC (permalink / raw)
To: LKML, Eric W . Biederman
Cc: Alexander Viro, Kees Cook, Linux Containers, Linux FS Devel
Cache the mounter's credentials and make access to the net directories
contingent on the permissions of the mounter of proc.
Do not show /proc/self/net when proc is mounted with the subset=pid option
and the mounter does not have CAP_NET_ADMIN.
Signed-off-by: Alexey Gladkov <legion@kernel.org>
---
fs/proc/proc_net.c | 8 ++++++++
fs/proc/root.c | 5 +++++
include/linux/proc_fs.h | 1 +
3 files changed, 14 insertions(+)
diff --git a/fs/proc/proc_net.c b/fs/proc/proc_net.c
index 18601042af99..a198f74cdb3b 100644
--- a/fs/proc/proc_net.c
+++ b/fs/proc/proc_net.c
@@ -26,6 +26,7 @@
#include <linux/uidgid.h>
#include <net/net_namespace.h>
#include <linux/seq_file.h>
+#include <linux/security.h>
#include "internal.h"
@@ -259,6 +260,7 @@ static struct net *get_proc_task_net(struct inode *dir)
struct task_struct *task;
struct nsproxy *ns;
struct net *net = NULL;
+ struct proc_fs_info *fs_info = proc_sb_info(dir->i_sb);
rcu_read_lock();
task = pid_task(proc_pid(dir), PIDTYPE_PID);
@@ -271,6 +273,12 @@ static struct net *get_proc_task_net(struct inode *dir)
}
rcu_read_unlock();
+ if (net && (fs_info->pidonly == PROC_PIDONLY_ON) &&
+ security_capable(fs_info->mounter_cred, net->user_ns, CAP_NET_ADMIN, CAP_OPT_NONE) < 0) {
+ put_net(net);
+ net = NULL;
+ }
+
return net;
}
diff --git a/fs/proc/root.c b/fs/proc/root.c
index 5e444d4f9717..6a75ac717455 100644
--- a/fs/proc/root.c
+++ b/fs/proc/root.c
@@ -171,6 +171,7 @@ static int proc_fill_super(struct super_block *s, struct fs_context *fc)
return -ENOMEM;
fs_info->pid_ns = get_pid_ns(ctx->pid_ns);
+ fs_info->mounter_cred = get_cred(fc->cred);
proc_apply_options(fs_info, fc, current_user_ns());
/* User space would break if executables or devices appear on proc */
@@ -220,6 +221,9 @@ static int proc_reconfigure(struct fs_context *fc)
sync_filesystem(sb);
+ put_cred(fs_info->mounter_cred);
+ fs_info->mounter_cred = get_cred(fc->cred);
+
proc_apply_options(fs_info, fc, current_user_ns());
return 0;
}
@@ -274,6 +278,7 @@ static void proc_kill_sb(struct super_block *sb)
kill_anon_super(sb);
put_pid_ns(fs_info->pid_ns);
+ put_cred(fs_info->mounter_cred);
kfree(fs_info);
}
diff --git a/include/linux/proc_fs.h b/include/linux/proc_fs.h
index 000cc0533c33..ffa871941bd0 100644
--- a/include/linux/proc_fs.h
+++ b/include/linux/proc_fs.h
@@ -64,6 +64,7 @@ struct proc_fs_info {
kgid_t pid_gid;
enum proc_hidepid hide_pid;
enum proc_pidonly pidonly;
+ const struct cred *mounter_cred;
};
static inline struct proc_fs_info *proc_sb_info(struct super_block *sb)
--
2.29.3
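
The decision added to get_proc_task_net() can be sketched as a small
userspace predicate. This is only a model under stated assumptions:
struct proc_fs_info_model and proc_net_visible() are hypothetical
names, and the boolean mounter_has_net_admin stands in for the result
of the security_capable() call on the cached mounter credentials.

```c
#include <assert.h>
#include <stdbool.h>

enum proc_pidonly {
	PROC_PIDONLY_OFF = 0,
	PROC_PIDONLY_ON = 1,
};

/* Hypothetical stand-in for the fields of struct proc_fs_info used here. */
struct proc_fs_info_model {
	enum proc_pidonly pidonly;
	bool mounter_has_net_admin; /* models security_capable(...) == 0 */
};

/*
 * Return true if the net directory of a task should be shown.
 * task_has_net models a successful lookup of the task's net namespace.
 */
bool proc_net_visible(const struct proc_fs_info_model *fs_info,
		      bool task_has_net)
{
	assert(fs_info);

	if (!task_has_net)
		return false;

	/* With subset=pid, require CAP_NET_ADMIN from the mounter. */
	if (fs_info->pidonly == PROC_PIDONLY_ON &&
	    !fs_info->mounter_has_net_admin)
		return false;

	return true;
}
```

Note that the check uses the credentials of whoever mounted procfs, not
those of the process reading it.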
* [RESEND PATCH v6 3/5] proc: Disable cancellation of subset=pid option
2021-07-16 10:45 [RESEND PATCH v6 0/5] proc: subset=pid: Relax check of mount visibility Alexey Gladkov
2021-07-16 10:45 ` [RESEND PATCH v6 1/5] docs: proc: add documentation about mount restrictions Alexey Gladkov
2021-07-16 10:46 ` [RESEND PATCH v6 2/5] proc: subset=pid: Show /proc/self/net only for CAP_NET_ADMIN Alexey Gladkov
@ 2021-07-16 10:46 ` Alexey Gladkov
2021-07-16 10:46 ` [RESEND PATCH v6 4/5] proc: Relax check of mount visibility Alexey Gladkov
` (2 subsequent siblings)
5 siblings, 0 replies; 15+ messages in thread
From: Alexey Gladkov @ 2021-07-16 10:46 UTC (permalink / raw)
To: LKML, Eric W . Biederman
Cc: Alexander Viro, Kees Cook, Linux Containers, Linux FS Devel
When procfs is mounted with the subset=pid option, there is no way to
remount it with this option removed. This is done so that whatever was
hidden does not become visible, since some checks occur during mount.
This patch makes the limitation explicit and prints an error message.
Signed-off-by: Alexey Gladkov <legion@kernel.org>
---
fs/proc/root.c | 15 ++++++++++-----
1 file changed, 10 insertions(+), 5 deletions(-)
diff --git a/fs/proc/root.c b/fs/proc/root.c
index 6a75ac717455..0d20bb67e79a 100644
--- a/fs/proc/root.c
+++ b/fs/proc/root.c
@@ -145,7 +145,7 @@ static int proc_parse_param(struct fs_context *fc, struct fs_parameter *param)
return 0;
}
-static void proc_apply_options(struct proc_fs_info *fs_info,
+static int proc_apply_options(struct proc_fs_info *fs_info,
struct fs_context *fc,
struct user_namespace *user_ns)
{
@@ -155,8 +155,12 @@ static void proc_apply_options(struct proc_fs_info *fs_info,
fs_info->pid_gid = make_kgid(user_ns, ctx->gid);
if (ctx->mask & (1 << Opt_hidepid))
fs_info->hide_pid = ctx->hidepid;
- if (ctx->mask & (1 << Opt_subset))
+ if (ctx->mask & (1 << Opt_subset)) {
+ if (ctx->pidonly != PROC_PIDONLY_ON && fs_info->pidonly == PROC_PIDONLY_ON)
+ return invalf(fc, "proc: subset=pid cannot be unset\n");
fs_info->pidonly = ctx->pidonly;
+ }
+ return 0;
}
static int proc_fill_super(struct super_block *s, struct fs_context *fc)
@@ -172,7 +176,9 @@ static int proc_fill_super(struct super_block *s, struct fs_context *fc)
fs_info->pid_ns = get_pid_ns(ctx->pid_ns);
fs_info->mounter_cred = get_cred(fc->cred);
- proc_apply_options(fs_info, fc, current_user_ns());
+ ret = proc_apply_options(fs_info, fc, current_user_ns());
+ if (ret)
+ return ret;
/* User space would break if executables or devices appear on proc */
s->s_iflags |= SB_I_USERNS_VISIBLE | SB_I_NOEXEC | SB_I_NODEV;
@@ -224,8 +230,7 @@ static int proc_reconfigure(struct fs_context *fc)
put_cred(fs_info->mounter_cred);
fs_info->mounter_cred = get_cred(fc->cred);
- proc_apply_options(fs_info, fc, current_user_ns());
- return 0;
+ return proc_apply_options(fs_info, fc, current_user_ns());
}
static int proc_get_tree(struct fs_context *fc)
--
2.29.3
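
The new return-value handling in proc_apply_options() can be modeled in
userspace as follows. This is a sketch, not the kernel function:
struct subset_request and apply_subset_option() are hypothetical names,
and the -EINVAL result mirrors what invalf() returns in the kernel.

```c
#include <assert.h>
#include <errno.h>

enum proc_pidonly {
	PROC_PIDONLY_OFF = 0,
	PROC_PIDONLY_ON = 1,
};

/* Hypothetical model of the parsed mount parameters. */
struct subset_request {
	int subset_given;          /* models ctx->mask & (1 << Opt_subset) */
	enum proc_pidonly pidonly; /* models ctx->pidonly */
};

/*
 * Apply a subset request to the current superblock state.
 * Returns 0 on success or -EINVAL if the request would unset subset=pid.
 */
int apply_subset_option(enum proc_pidonly *state,
			const struct subset_request *req)
{
	assert(state && req);

	if (!req->subset_given)
		return 0; /* option absent: keep the current state */

	if (req->pidonly != PROC_PIDONLY_ON && *state == PROC_PIDONLY_ON)
		return -EINVAL; /* subset=pid cannot be unset */

	*state = req->pidonly;
	return 0;
}
```

Once the state is PROC_PIDONLY_ON, every later request in this model
either keeps it on or fails.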
* [RESEND PATCH v6 4/5] proc: Relax check of mount visibility
2021-07-16 10:45 [RESEND PATCH v6 0/5] proc: subset=pid: Relax check of mount visibility Alexey Gladkov
` (2 preceding siblings ...)
2021-07-16 10:46 ` [RESEND PATCH v6 3/5] proc: Disable cancellation of subset=pid option Alexey Gladkov
@ 2021-07-16 10:46 ` Alexey Gladkov
2021-07-16 10:46 ` [RESEND PATCH v6 5/5] docs: proc: add documentation about relaxing visibility restrictions Alexey Gladkov
2025-12-13 5:06 ` [RESEND PATCH v6 0/5] proc: subset=pid: Relax check of mount visibility Dan Klishch
5 siblings, 0 replies; 15+ messages in thread
From: Alexey Gladkov @ 2021-07-16 10:46 UTC (permalink / raw)
To: LKML, Eric W . Biederman
Cc: Alexander Viro, Kees Cook, Linux Containers, Linux FS Devel
Allow mounting procfs with the subset=pid option even if the entire procfs
is not fully accessible to the user.
Signed-off-by: Alexey Gladkov <legion@kernel.org>
---
fs/namespace.c | 30 ++++++++++++++++++------------
fs/proc/root.c | 16 ++++++++++------
include/linux/fs.h | 1 +
3 files changed, 29 insertions(+), 18 deletions(-)
diff --git a/fs/namespace.c b/fs/namespace.c
index 9d33909d0f9e..f38570fdfc3f 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -3951,7 +3951,8 @@ static bool mnt_already_visible(struct mnt_namespace *ns,
/* This mount is not fully visible if it's root directory
* is not the root directory of the filesystem.
*/
- if (mnt->mnt.mnt_root != mnt->mnt.mnt_sb->s_root)
+ if (!(sb->s_iflags & SB_I_DYNAMIC) &&
+ mnt->mnt.mnt_root != mnt->mnt.mnt_sb->s_root)
continue;
/* A local view of the mount flags */
@@ -3971,18 +3972,23 @@ static bool mnt_already_visible(struct mnt_namespace *ns,
((mnt_flags & MNT_ATIME_MASK) != (new_flags & MNT_ATIME_MASK)))
continue;
- /* This mount is not fully visible if there are any
- * locked child mounts that cover anything except for
- * empty directories.
+ /* If this filesystem is completely dynamic, then it
+ * makes no sense to check for any child mounts.
*/
- list_for_each_entry(child, &mnt->mnt_mounts, mnt_child) {
- struct inode *inode = child->mnt_mountpoint->d_inode;
- /* Only worry about locked mounts */
- if (!(child->mnt.mnt_flags & MNT_LOCKED))
- continue;
- /* Is the directory permanetly empty? */
- if (!is_empty_dir_inode(inode))
- goto next;
+ if (!(sb->s_iflags & SB_I_DYNAMIC)) {
+ /* This mount is not fully visible if there are any
+ * locked child mounts that cover anything except for
+ * empty directories.
+ */
+ list_for_each_entry(child, &mnt->mnt_mounts, mnt_child) {
+ struct inode *inode = child->mnt_mountpoint->d_inode;
+ /* Only worry about locked mounts */
+ if (!(child->mnt.mnt_flags & MNT_LOCKED))
+ continue;
+ /* Is the directory permanetly empty? */
+ if (!is_empty_dir_inode(inode))
+ goto next;
+ }
}
/* Preserve the locked attributes */
*new_mnt_flags |= mnt_flags & (MNT_LOCK_READONLY | \
diff --git a/fs/proc/root.c b/fs/proc/root.c
index 0d20bb67e79a..c739ed94246c 100644
--- a/fs/proc/root.c
+++ b/fs/proc/root.c
@@ -145,18 +145,21 @@ static int proc_parse_param(struct fs_context *fc, struct fs_parameter *param)
return 0;
}
-static int proc_apply_options(struct proc_fs_info *fs_info,
+static int proc_apply_options(struct super_block *s,
struct fs_context *fc,
struct user_namespace *user_ns)
{
struct proc_fs_context *ctx = fc->fs_private;
+ struct proc_fs_info *fs_info = proc_sb_info(s);
if (ctx->mask & (1 << Opt_gid))
fs_info->pid_gid = make_kgid(user_ns, ctx->gid);
if (ctx->mask & (1 << Opt_hidepid))
fs_info->hide_pid = ctx->hidepid;
if (ctx->mask & (1 << Opt_subset)) {
- if (ctx->pidonly != PROC_PIDONLY_ON && fs_info->pidonly == PROC_PIDONLY_ON)
+ if (ctx->pidonly == PROC_PIDONLY_ON)
+ s->s_iflags |= SB_I_DYNAMIC;
+ else if (fs_info->pidonly == PROC_PIDONLY_ON)
return invalf(fc, "proc: subset=pid cannot be unset\n");
fs_info->pidonly = ctx->pidonly;
}
@@ -176,9 +179,6 @@ static int proc_fill_super(struct super_block *s, struct fs_context *fc)
fs_info->pid_ns = get_pid_ns(ctx->pid_ns);
fs_info->mounter_cred = get_cred(fc->cred);
- ret = proc_apply_options(fs_info, fc, current_user_ns());
- if (ret)
- return ret;
/* User space would break if executables or devices appear on proc */
s->s_iflags |= SB_I_USERNS_VISIBLE | SB_I_NOEXEC | SB_I_NODEV;
@@ -190,6 +190,10 @@ static int proc_fill_super(struct super_block *s, struct fs_context *fc)
s->s_time_gran = 1;
s->s_fs_info = fs_info;
+ ret = proc_apply_options(s, fc, current_user_ns());
+ if (ret)
+ return ret;
+
/*
* procfs isn't actually a stacking filesystem; however, there is
* too much magic going on inside it to permit stacking things on
@@ -230,7 +234,7 @@ static int proc_reconfigure(struct fs_context *fc)
put_cred(fs_info->mounter_cred);
fs_info->mounter_cred = get_cred(fc->cred);
- return proc_apply_options(fs_info, fc, current_user_ns());
+ return proc_apply_options(sb, fc, current_user_ns());
}
static int proc_get_tree(struct fs_context *fc)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index fd47deea7c17..2c9a47bad796 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1390,6 +1390,7 @@ extern int send_sigurg(struct fown_struct *fown);
#define SB_I_USERNS_VISIBLE 0x00000010 /* fstype already mounted */
#define SB_I_IMA_UNVERIFIABLE_SIGNATURE 0x00000020
#define SB_I_UNTRUSTED_MOUNTER 0x00000040
+#define SB_I_DYNAMIC 0x00000080
#define SB_I_SKIP_SYNC 0x00000100 /* Skip superblock at global sync */
--
2.29.3
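
The effect of SB_I_DYNAMIC on mnt_already_visible() can be sketched in
userspace. This is a simplified model: struct existing_mount and
existing_mount_satisfies() are hypothetical names, and the readonly and
atime flag comparisons from the real function are left out. The flag is
read from the superblock of the new mount, while the two structural
checks apply to a mount that already exists.

```c
#include <assert.h>
#include <stdbool.h>

#define SB_I_DYNAMIC 0x00000080 /* value taken from the patch */

/* Hypothetical model of one existing mount of the same filesystem type. */
struct existing_mount {
	bool root_is_fs_root;              /* mnt_root == s_root */
	bool locked_child_covers_nonempty; /* locked overmount on non-empty dir */
};

/*
 * Return true if this existing mount lets the new mount through.
 * new_sb_iflags models sb->s_iflags of the superblock being mounted.
 */
bool existing_mount_satisfies(const struct existing_mount *m,
			      unsigned long new_sb_iflags)
{
	bool dynamic = (new_sb_iflags & SB_I_DYNAMIC) != 0;

	assert(m);

	/* A dynamic filesystem skips the root-directory check. */
	if (!dynamic && !m->root_is_fs_root)
		return false;

	/* A dynamic filesystem skips the child-mount check as well. */
	if (!dynamic && m->locked_child_covers_nonempty)
		return false;

	return true;
}
```

With the flag set, a masked or partially bind-mounted instance no longer
blocks the new mount in this model.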
* [RESEND PATCH v6 5/5] docs: proc: add documentation about relaxing visibility restrictions
2021-07-16 10:45 [RESEND PATCH v6 0/5] proc: subset=pid: Relax check of mount visibility Alexey Gladkov
` (3 preceding siblings ...)
2021-07-16 10:46 ` [RESEND PATCH v6 4/5] proc: Relax check of mount visibility Alexey Gladkov
@ 2021-07-16 10:46 ` Alexey Gladkov
2025-12-13 5:06 ` [RESEND PATCH v6 0/5] proc: subset=pid: Relax check of mount visibility Dan Klishch
5 siblings, 0 replies; 15+ messages in thread
From: Alexey Gladkov @ 2021-07-16 10:46 UTC (permalink / raw)
To: LKML, Eric W . Biederman
Cc: Alexander Viro, Kees Cook, Linux Containers, Linux FS Devel
Signed-off-by: Alexey Gladkov <legion@kernel.org>
---
Documentation/filesystems/proc.rst | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
index 5a1bb0e081fd..9d993aef7f1c 100644
--- a/Documentation/filesystems/proc.rst
+++ b/Documentation/filesystems/proc.rst
@@ -2182,7 +2182,8 @@ are not related to tasks.
If user namespaces are in use, the kernel additionally checks the instances of
procfs available to the mounter and will not allow procfs to be mounted if:
- 1. This mount is not fully visible.
+ 1. This mount is not fully visible unless the new procfs is going to be
+ mounted with subset=pid option.
a. Its root directory is not the root directory of the filesystem.
b. If any file or non-empty procfs directory is hidden by another mount.
--
2.29.3
* Re: [RESEND PATCH v6 0/5] proc: subset=pid: Relax check of mount visibility
2021-07-16 10:45 [RESEND PATCH v6 0/5] proc: subset=pid: Relax check of mount visibility Alexey Gladkov
` (4 preceding siblings ...)
2021-07-16 10:46 ` [RESEND PATCH v6 5/5] docs: proc: add documentation about relaxing visibility restrictions Alexey Gladkov
@ 2025-12-13 5:06 ` Dan Klishch
2025-12-13 10:49 ` Alexey Gladkov
5 siblings, 1 reply; 15+ messages in thread
From: Dan Klishch @ 2025-12-13 5:06 UTC (permalink / raw)
To: legion, linux-kernel
Cc: ebiederm, viro, keescook, containers, linux-fsdevel, Dan Klishch
Hello Alexey,
Would it be possible to revive this patch series?
I wanted to add an additional downstream use case that would benefit
from this work. In particular, I am trying to run the sandbox
sunwalker-box [1] without root privileges and/or inside a container.
The sandbox aims to prevent cross-run communication via side channels,
and PID allocation is one such channel. Therefore, it creates a new PID
namespace and mounts the corresponding procfs instance inside of the
sandbox. This currently works without a real root when procfs is fully
accessible, but obviously fails otherwise.
Thanks,
Dan Klishch
[1] https://github.com/purplesyringa/sunwalker-box/
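
The mount step described above might look roughly like the following in
C. This is a sketch and not code from sunwalker-box: the helper names
proc_mount_options() and mount_private_proc() are hypothetical, and the
caller is assumed to already be inside the new PID and mount namespaces.
Only mount(2) and its flags are real interfaces here.

```c
#include <errno.h>
#include <string.h>
#include <sys/mount.h>

/* Return the procfs option string for the requested mode. */
const char *proc_mount_options(int subset_pid)
{
	return subset_pid ? "subset=pid" : "";
}

/*
 * Mount a procfs instance for the current PID namespace at target.
 * Returns 0 on success, or -1 with errno set.
 */
int mount_private_proc(const char *target, int subset_pid)
{
	if (target == NULL || strlen(target) == 0) {
		errno = EINVAL;
		return -1;
	}

	return mount("proc", target, "proc",
		     MS_NOSUID | MS_NODEV | MS_NOEXEC,
		     proc_mount_options(subset_pid));
}
```

Without the series, a call such as mount_private_proc("/proc", 1) is
expected to fail in a user namespace whenever the procfs instance
visible to the mounter is partially overmounted.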
* Re: [RESEND PATCH v6 0/5] proc: subset=pid: Relax check of mount visibility
2025-12-13 5:06 ` [RESEND PATCH v6 0/5] proc: subset=pid: Relax check of mount visibility Dan Klishch
@ 2025-12-13 10:49 ` Alexey Gladkov
2025-12-13 18:00 ` Dan Klishch
0 siblings, 1 reply; 15+ messages in thread
From: Alexey Gladkov @ 2025-12-13 10:49 UTC (permalink / raw)
To: Dan Klishch
Cc: linux-kernel, ebiederm, viro, keescook, containers, linux-fsdevel
On Sat, Dec 13, 2025 at 12:06:38AM -0500, Dan Klishch wrote:
> Hello Alexey,
>
> Would it be possible to revive this patch series?
>
> I wanted to add an additional downstream use case that would benefit
> from this work. In particular, I am trying to run the sandbox
> sunwalker-box [1] without root privileges and/or inside a container.
>
> The sandbox aims to prevent cross-run communication via side channels,
> and PID allocation is one such channel. Therefore, it creates a new PID
> namespace and mounts the corresponding procfs instance inside of the
> sandbox. This currently works without a real root when procfs is fully
> accessible, but obviously fails otherwise.
>
> Thanks,
> Dan Klishch
>
> [1] https://github.com/purplesyringa/sunwalker-box/
>
Overmounting "dangerous" files in procfs is an incorrect and potentially
dangerous practice. I know that many programs (docker, podman, etc.) use
this method, but it is not the correct way to isolate dangerous files in
procfs.
In particular, this is one of the reasons why this patchset was abandoned.
It is quite difficult to implement these checks in procfs correctly and
not break anything. It is much easier to implement file access
restrictions in procfs using an ebpf controller. Some time ago, I tried to
implement such a controller [1], and it seemed to me that it was much
easier than adding complex checks to the kernel.
If I'm wrong and missing a use case, let me know and we can go back to
the patches.
[1] https://github.com/legionus/proc-bpf-controller
--
Rgrds, legion
* Re: [RESEND PATCH v6 0/5] proc: subset=pid: Relax check of mount visibility
2025-12-13 10:49 ` Alexey Gladkov
@ 2025-12-13 18:00 ` Dan Klishch
2025-12-14 16:40 ` Alexey Gladkov
0 siblings, 1 reply; 15+ messages in thread
From: Dan Klishch @ 2025-12-13 18:00 UTC (permalink / raw)
To: legion; +Cc: containers, ebiederm, keescook, linux-fsdevel, linux-kernel, viro
> It is much easier to implement file access
> restrictions in procfs using an ebpf controller.
But if we already have a masked /proc from podman/docker/user who
decided to run `mount --bind /dev/null /proc/smth`, the sandbox will
not have a choice other than to bail out. Also, correct me if I am
wrong, installing ebpf controller requires CAP_BPF in initial
userns, so rootless podman will not be able to mask /proc "properly"
even if someone sends a patch switching it to ebpf.
Thanks,
Dan Klishch
* Re: [RESEND PATCH v6 0/5] proc: subset=pid: Relax check of mount visibility
2025-12-13 18:00 ` Dan Klishch
@ 2025-12-14 16:40 ` Alexey Gladkov
2025-12-14 18:02 ` Dan Klishch
0 siblings, 1 reply; 15+ messages in thread
From: Alexey Gladkov @ 2025-12-14 16:40 UTC (permalink / raw)
To: Dan Klishch
Cc: containers, ebiederm, keescook, linux-fsdevel, linux-kernel, viro
On Sat, Dec 13, 2025 at 01:00:38PM -0500, Dan Klishch wrote:
> > It is much easier to implement file access
> > restrictions in procfs using an ebpf controller.
>
> But if we already have a masked /proc from podman/docker/user who
> decided to run `mount --bind /dev/null /proc/smth`, the sandbox will
> not have a choice other than to bail out.
I misunderstood you. I thought you were writing your own container
implementation.
Yes, if you want a nested container inside docker/podman, then the file
overmount technique is already used there.
But then, if I understand you correctly, this patch will not be enough
for you. procfs with subset=pid will not allow you to have /proc/meminfo,
/proc/cpuinfo, etc.
> Also, correct me if I am wrong, installing ebpf controller requires
> CAP_BPF in initial userns, so rootless podman will not be able to mask
> /proc "properly" even if someone sends a patch switching it to ebpf.
You can turn on /proc/sys/kernel/unprivileged_bpf_disabled.
--
Rgrds, legion
* Re: [RESEND PATCH v6 0/5] proc: subset=pid: Relax check of mount visibility
2025-12-14 16:40 ` Alexey Gladkov
@ 2025-12-14 18:02 ` Dan Klishch
2025-12-15 10:10 ` Alexey Gladkov
2025-12-15 11:30 ` Christian Brauner
0 siblings, 2 replies; 15+ messages in thread
From: Dan Klishch @ 2025-12-14 18:02 UTC (permalink / raw)
To: legion; +Cc: containers, ebiederm, keescook, linux-fsdevel, linux-kernel, viro
On 12/14/25 11:40 AM, Alexey Gladkov wrote:
> But then, if I understand you correctly, this patch will not be enough
> for you. procfs with subset=pid will not allow you to have /proc/meminfo,
> /proc/cpuinfo, etc.
Hmm, I didn't think of this. sunwalker-box only exposes cpuinfo and PID
tree to the sandboxed programs (empirically, this is enough for most of
programs you want sandboxing for). With that in mind, this patch and a
FUSE providing an overlay with cpuinfo / seccomp intercepting opens of
/proc/cpuinfo / a small kernel patch with a new mount option for procfs
to expose more static files still look like a clean solution to me.
>> Also, correct me if I am wrong, installing ebpf controller requires
>> CAP_BPF in initial userns, so rootless podman will not be able to mask
>> /proc "properly" even if someone sends a patch switching it to ebpf.
>
> You can turn on /proc/sys/kernel/unprivileged_bpf_disabled.
$ cat /proc/sys/kernel/unprivileged_bpf_disabled
0
$ unshare -pfr --mount-proc
$ ./proc-controller -p deny /proc/cpuinfo
libbpf: prog 'proc_access_restrict': BPF program load failed: Operation not permitted
libbpf: prog 'proc_access_restrict': failed to load: -1
libbpf: failed to load object './proc-controller.bpf.o'
proc-controller: ERROR: loading BPF object file failed
I think only packet filters are allowed to be installed by non-root.
Thanks,
Dan Klishch
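
The failure above matches the following model of the load check. It is a
sketch under the assumption that current mainline behaves this way: a
caller without CAP_BPF in the initial user namespace may only load
socket filter and cgroup skb programs, and only while
unprivileged_bpf_disabled is 0. The enum and function names are
hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical coarse classification of BPF program types. */
enum prog_kind {
	PROG_SOCKET_FILTER,
	PROG_CGROUP_SKB,
	PROG_OTHER, /* everything else, including access-control hooks */
};

/*
 * Return true if a caller without CAP_BPF may load a program of this
 * kind, given the value of kernel.unprivileged_bpf_disabled (0, 1 or 2).
 */
bool unprivileged_load_permitted(int unprivileged_bpf_disabled,
				 enum prog_kind kind)
{
	assert(unprivileged_bpf_disabled >= 0 &&
	       unprivileged_bpf_disabled <= 2);

	/* Any non-zero value blocks unprivileged loads entirely. */
	if (unprivileged_bpf_disabled != 0)
		return false;

	/* With the sysctl at 0, only two program types are allowed. */
	return kind == PROG_SOCKET_FILTER || kind == PROG_CGROUP_SKB;
}
```

Under this model, setting the sysctl to 0 is not enough for a rootless
runtime to install a procfs access controller.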
* Re: [RESEND PATCH v6 0/5] proc: subset=pid: Relax check of mount visibility
2025-12-14 18:02 ` Dan Klishch
@ 2025-12-15 10:10 ` Alexey Gladkov
2025-12-15 14:46 ` Dan Klishch
2025-12-15 11:30 ` Christian Brauner
1 sibling, 1 reply; 15+ messages in thread
From: Alexey Gladkov @ 2025-12-15 10:10 UTC (permalink / raw)
To: Dan Klishch
Cc: containers, ebiederm, keescook, linux-fsdevel, linux-kernel, viro
On Sun, Dec 14, 2025 at 01:02:54PM -0500, Dan Klishch wrote:
> On 12/14/25 11:40 AM, Alexey Gladkov wrote:
> > But then, if I understand you correctly, this patch will not be enough
> > for you. procfs with subset=pid will not allow you to have /proc/meminfo,
> > /proc/cpuinfo, etc.
>
> Hmm, I didn't think of this. sunwalker-box only exposes cpuinfo and PID
> tree to the sandboxed programs (empirically, this is enough for most of
> programs you want sandboxing for). With that in mind, this patch and a
> FUSE providing an overlay with cpuinfo / seccomp intercepting opens of
> /proc/cpuinfo / a small kernel patch with a new mount option for procfs
> to expose more static files still look like a clean solution to me.
I don't think you'll be able to do that. procfs doesn't allow itself to
be overlaid [1]. That should block mounting overlayfs and fuse on top
of procfs.
[1] https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/proc/root.c#n274
> >> Also, correct me if I am wrong, installing ebpf controller requires
> >> CAP_BPF in initial userns, so rootless podman will not be able to mask
> >> /proc "properly" even if someone sends a patch switching it to ebpf.
> >
> > You can turn on /proc/sys/kernel/unprivileged_bpf_disabled.
>
> $ cat /proc/sys/kernel/unprivileged_bpf_disabled
> 0
> $ unshare -pfr --mount-proc
> $ ./proc-controller -p deny /proc/cpuinfo
> libbpf: prog 'proc_access_restrict': BPF program load failed: Operation not permitted
> libbpf: prog 'proc_access_restrict': failed to load: -1
> libbpf: failed to load object './proc-controller.bpf.o'
> proc-controller: ERROR: loading BPF object file failed
>
> I think only packet filters are allowed to be installed by non-root.
I probably forgot about that. I wrote this code a long time ago, and
to be honest, I forgot whether it can be used for rootless.
--
Rgrds, legion
* Re: [RESEND PATCH v6 0/5] proc: subset=pid: Relax check of mount visibility
2025-12-15 10:10 ` Alexey Gladkov
@ 2025-12-15 14:46 ` Dan Klishch
2025-12-15 14:58 ` Alexey Gladkov
0 siblings, 1 reply; 15+ messages in thread
From: Dan Klishch @ 2025-12-15 14:46 UTC (permalink / raw)
To: legion, brauner
Cc: containers, ebiederm, keescook, linux-fsdevel, linux-kernel, viro
On 12/15/25 5:10 AM, Alexey Gladkov wrote:
> On Sun, Dec 14, 2025 at 01:02:54PM -0500, Dan Klishch wrote:
>> On 12/14/25 11:40 AM, Alexey Gladkov wrote:
>>> But then, if I understand you correctly, this patch will not be enough
>>> for you. procfs with subset=pid will not allow you to have /proc/meminfo,
>>> /proc/cpuinfo, etc.
>>
>> Hmm, I didn't think of this. sunwalker-box only exposes cpuinfo and PID
>> tree to the sandboxed programs (empirically, this is enough for most of
>> programs you want sandboxing for). With that in mind, this patch and a
>> FUSE providing an overlay with cpuinfo / seccomp intercepting opens of
>> /proc/cpuinfo / a small kernel patch with a new mount option for procfs
>> to expose more static files still look like a clean solution to me.
>
> I don't think you'll be able to do that. procfs doesn't allow itself to
> be overlaid [1]. That should block mounting overlayfs and fuse on top
> of procfs.
>
> [1] https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/proc/root.c#n274
This is why I have been careful not to say overlayfs. With [2] (warning:
zero-shot ChatGPT output), I can do:
$ ./fuse-overlay target --source=/proc
$ ls target
1 88 194 1374 889840 908552
2 90 195 1375 889987 908619
3 91 196 1379 890031 908658
4 92 203 1412 890063 908756
5 93 205 1590 890085 908804
6 94 233 1644 890139 908951
7 96 237 1802 890246 909848
8 97 239 1850 890271 909914
10 98 240 1852 894665 909924
13 99 243 1865 895854 909926
15 100 244 1888 895864 910005
16 102 246 1889 896030 acpi
17 103 262 1891 896205 asound
18 104 263 1895 896508 bus
19 105 264 1896 896544 driver
20 106 265 1899 896706 dynamic_debug
<...>
[2] https://gist.github.com/DanShaders/547eeb74a90315356b98472feae47474
This requires much more careful thought wrt magic symlinks
and permission checks. The fact that I am highly unlikely to 100%
correctly reimplement the checks and special behavior of procfs makes me
not want to proceed with the FUSE route.
On 12/15/25 6:30 AM, Christian Brauner wrote:
> The standard way of making it possible to mount procfs inside of a
> container with a separate mount namespace that has a procfs inside it
> with overmounted entries is to ensure that a fully-visible procfs
> instance is present.
Yes, this is a solution. However, this is only marginally better than
passing --privileged to the outer container (in the sense that we require
the outer sandbox to remove some protections for the inner sandbox to work).
> The container needs to inherit a fully-visible instance somehow if you
> want nesting. Using an unprivileged LSM such as landlock to prevent any
> access to the fully visible procfs instance is usually the better way.
>
> My hope is that once signed bpf is more widely adopted that distros will
> just start enabling blessed bpf programs that will just take on the
> access protecting instead of the clumsy bind-mount protection mechanism.
These are big changes to container runtimes that are unlikely to happen
soon. In contrast, the patch we are discussing will be available in 2
months after the merge for me to use on ArchLinux, and in a couple more
months on Ubuntu.
So, is there any way forward with the patch or should I continue trying
to find a userspace solution?
Thanks,
Dan Klishch
* Re: [RESEND PATCH v6 0/5] proc: subset=pid: Relax check of mount visibility
2025-12-15 14:46 ` Dan Klishch
@ 2025-12-15 14:58 ` Alexey Gladkov
0 siblings, 0 replies; 15+ messages in thread
From: Alexey Gladkov @ 2025-12-15 14:58 UTC (permalink / raw)
To: Dan Klishch
Cc: brauner, containers, ebiederm, keescook, linux-fsdevel,
linux-kernel, viro
On Mon, Dec 15, 2025 at 09:46:00AM -0500, Dan Klishch wrote:
> On 12/15/25 5:10 AM, Alexey Gladkov wrote:
> > On Sun, Dec 14, 2025 at 01:02:54PM -0500, Dan Klishch wrote:
> >> On 12/14/25 11:40 AM, Alexey Gladkov wrote:
> >>> But then, if I understand you correctly, this patch will not be enough
> >>> for you. procfs with subset=pid will not allow you to have /proc/meminfo,
> >>> /proc/cpuinfo, etc.
> >>
> >> Hmm, I didn't think of this. sunwalker-box only exposes cpuinfo and PID
> >> tree to the sandboxed programs (empirically, this is enough for most of
> >> programs you want sandboxing for). With that in mind, this patch and a
> >> FUSE providing an overlay with cpuinfo / seccomp intercepting opens of
> >> /proc/cpuinfo / a small kernel patch with a new mount option for procfs
> >> to expose more static files still look like a clean solution to me.
> >
> > I don't think you'll be able to do that. procfs doesn't allow itself to
> > be overlaid [1]. That should block mounting overlayfs and fuse on top
> > of procfs.
> >
> > [1] https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/proc/root.c#n274
>
> This is why I have been careful not to say overlayfs. With [2] (warning:
> zero-shot ChatGPT output), I can do:
>
> $ ./fuse-overlay target --source=/proc
> $ ls target
> 1 88 194 1374 889840 908552
> 2 90 195 1375 889987 908619
> 3 91 196 1379 890031 908658
> 4 92 203 1412 890063 908756
> 5 93 205 1590 890085 908804
> 6 94 233 1644 890139 908951
> 7 96 237 1802 890246 909848
> 8 97 239 1850 890271 909914
> 10 98 240 1852 894665 909924
> 13 99 243 1865 895854 909926
> 15 100 244 1888 895864 910005
> 16 102 246 1889 896030 acpi
> 17 103 262 1891 896205 asound
> 18 104 263 1895 896508 bus
> 19 105 264 1896 896544 driver
> 20 106 265 1899 896706 dynamic_debug
> <...>
>
> [2] https://gist.github.com/DanShaders/547eeb74a90315356b98472feae47474
>
> This requires much more careful thought wrt magic symlinks
> and permission checks. The fact that I am highly unlikely to 100%
> correctly reimplement the checks and special behavior of procfs makes me
> not want to proceed with the FUSE route.
>
> On 12/15/25 6:30 AM, Christian Brauner wrote:
> > The standard way of making it possible to mount procfs inside of a
> > container with a separate mount namespace that has a procfs inside it
> > with overmounted entries is to ensure that a fully-visible procfs
> > instance is present.
>
> Yes, this is a solution. However, this is only marginally better than
> passing --privileged to the outer container (in the sense that we require
> the outer sandbox to remove some protections for the inner sandbox to work).
>
> > The container needs to inherit a fully-visible instance somehow if you
> > want nesting. Using an unprivileged LSM such as landlock to prevent any
> > access to the fully visible procfs instance is usually the better way.
> >
> > My hope is that once signed bpf is more widely adopted, distros will
> > just start enabling blessed bpf programs that take over the access
> > protection instead of the clumsy bind-mount protection mechanism.
>
> These are big changes to container runtimes that are unlikely to happen
> soon. In contrast, the patch we are discussing will be available for me
> to use on Arch Linux about 2 months after the merge, and on Ubuntu a
> couple of months after that.
>
> So, is there any way forward with the patch or should I continue trying
> to find a userspace solution?
I still consider these patches useful. I made them precisely to remove
some of the restrictions we have for procfs because of global files in
the root of this filesystem.

I can update the patchset and prepare a new version if Christian thinks
it's useful too.
--
Rgrds, legion
* Re: [RESEND PATCH v6 0/5] proc: subset=pid: Relax check of mount visibility
2025-12-14 18:02 ` Dan Klishch
2025-12-15 10:10 ` Alexey Gladkov
@ 2025-12-15 11:30 ` Christian Brauner
1 sibling, 0 replies; 15+ messages in thread
From: Christian Brauner @ 2025-12-15 11:30 UTC (permalink / raw)
To: Dan Klishch
Cc: legion, containers, ebiederm, keescook, linux-fsdevel,
linux-kernel, viro
On Sun, Dec 14, 2025 at 01:02:54PM -0500, Dan Klishch wrote:
> On 12/14/25 11:40 AM, Alexey Gladkov wrote:
> > But then, if I understand you correctly, this patch will not be enough
> > for you. procfs with subset=pid will not allow you to have /proc/meminfo,
> > /proc/cpuinfo, etc.
>
> Hmm, I didn't think of this. sunwalker-box only exposes cpuinfo and the
> PID tree to the sandboxed programs (empirically, this is enough for most
> of the programs you would want to sandbox). With that in mind, this patch
> combined with one of the following still looks like a clean solution to
> me: a FUSE overlay providing cpuinfo, seccomp intercepting opens of
> /proc/cpuinfo, or a small kernel patch adding a procfs mount option that
> exposes more static files.
The standard way of making it possible to mount procfs inside of a
container with a separate mount namespace that has a procfs inside it
with overmounted entries is to ensure that a fully-visible procfs
instance is present. This is for example what Incus does when nesting
containers is enabled. In systemd I implemented the same logic years
ago:
commit b71a0192c040f585397cfc6fc2ca025bf839733d
Author:     Christian Brauner <brauner@kernel.org>
AuthorDate: Mon Nov 28 12:36:47 2022 +0100
Commit:     Christian Brauner (Microsoft) <brauner@kernel.org>
CommitDate: Mon Dec 5 18:34:25 2022 +0100

    nspawn: mount temporary visible procfs and sysfs instance

    In order to mount procfs and sysfs in an unprivileged container the
    kernel requires that a fully visible instance is already present in the
    target mount namespace. Mount one here so the inner child can mount its
    own instances. Later we umount the temporary instances created here
    before we actually exec the payload. Since the rootfs is shared the
    umount will propagate into the container. Note, the inner child wouldn't
    be able to unmount the instances on its own since it doesn't own the
    originating mount namespace. IOW, the outer child needs to do this.

    So far nspawn didn't run into this issue because it used MS_MOVE which
    meant that the shadow mount tree pinned a procfs and sysfs instance
    which the kernel would find. The shadow mount tree is gone with proper
    pivot_root() semantics.

    Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
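
For illustration only (this is not part of the series or of the nspawn
commit): a rough userspace approximation of the "fully visible instance"
requirement described above. It parses text in the /proc/self/mountinfo
format and lists procfs mounts whose root is the filesystem root and that
have no child mounts at all. That is an assumption-laden simplification and
is stricter than the kernel: per the documentation patch in this series,
only hidden files and non-empty directories count, and mount flags matter
too, neither of which this sketch looks at. The names Mount,
parse_mountinfo and candidate_visible_procfs are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Mount:
    mount_id: int
    parent_id: int
    root: str         # path inside the filesystem that forms the mount's root
    mount_point: str  # where the mount is attached in this namespace
    fstype: str


def parse_mountinfo(text):
    """Parse mountinfo-formatted text into a list of Mount records."""
    mounts = []
    for line in text.splitlines():
        if not line.strip():
            continue
        # Fields before " - " are per-mount; the first field after it is
        # the filesystem type. Spaces inside paths are escaped as \040,
        # so splitting on whitespace is safe.
        left, _, right = line.partition(" - ")
        fields = left.split()
        mounts.append(
            Mount(
                mount_id=int(fields[0]),
                parent_id=int(fields[1]),
                root=fields[3],
                mount_point=fields[4],
                fstype=right.split()[0],
            )
        )
    return mounts


def candidate_visible_procfs(mounts):
    """Return mount points of procfs mounts that look fully visible.

    Conservative approximation: the mount must expose the filesystem root
    and must not have any child mounts stacked on top of it.
    """
    result = []
    for m in mounts:
        if m.fstype != "proc" or m.root != "/":
            continue
        if any(c.parent_id == m.mount_id for c in mounts):
            continue
        result.append(m.mount_point)
    return result
```

On a Linux system one could feed it the contents of /proc/self/mountinfo.
An empty result would suggest that a container runtime has masked parts of
every procfs instance in the namespace, which is the situation this thread
is about.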
>
> >> Also, correct me if I am wrong, installing an eBPF controller requires
> >> CAP_BPF in the initial userns, so rootless podman will not be able to
> >> mask /proc "properly" even if someone sends a patch switching it to eBPF.
The container needs to inherit a fully-visible instance somehow if you
want nesting. Using an unprivileged LSM such as landlock to prevent any
access to the fully visible procfs instance is usually the better way.
My hope is that once signed bpf is more widely adopted, distros will
just start enabling blessed bpf programs that take over the access
protection instead of the clumsy bind-mount protection mechanism.