* [PATCH v7 1/5] docs: proc: add documentation about mount restrictions
2026-01-13 9:20 ` [PATCH v7 " Alexey Gladkov
@ 2026-01-13 9:20 ` Alexey Gladkov
2026-01-13 9:20 ` [PATCH v7 2/5] proc: subset=pid: Show /proc/self/net only for CAP_NET_ADMIN Alexey Gladkov
` (4 subsequent siblings)
5 siblings, 0 replies; 34+ messages in thread
From: Alexey Gladkov @ 2026-01-13 9:20 UTC (permalink / raw)
To: Christian Brauner, Dan Klishch
Cc: Al Viro, Eric W . Biederman, Kees Cook, containers, linux-fsdevel,
linux-kernel
procfs has a number of mounting restrictions that are not documented
anywhere.
Signed-off-by: Alexey Gladkov <legion@kernel.org>
---
Documentation/filesystems/proc.rst | 14 ++++++++++++++
1 file changed, 14 insertions(+)
diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
index 8256e857e2d7..c8864fcbdec7 100644
--- a/Documentation/filesystems/proc.rst
+++ b/Documentation/filesystems/proc.rst
@@ -52,6 +52,7 @@ fixes/update part 1.1 Stefani Seibold <stefani@seibold.net> June 9 2009
4 Configuring procfs
4.1 Mount options
+ 4.2 Mount restrictions
5 Filesystem behavior
@@ -2410,6 +2411,19 @@ will use the calling process's active pid namespace. Note that the pid
namespace of an existing procfs instance cannot be modified (attempting to do
so will give an `-EBUSY` error).
+4.2 Mount restrictions
+--------------------------
+
+If user namespaces are in use, the kernel additionally checks the instances of
+procfs available to the mounter and will not allow procfs to be mounted if:
+
+ 1. This mount is not fully visible.
+
+ a. Its root directory is not the root directory of the filesystem.
+ b. Any file or non-empty procfs directory is hidden by another mount.
+
+ 2. A new mount overrides the readonly option or any option from the atime family.
+
Chapter 5: Filesystem behavior
==============================
--
2.52.0
* [PATCH v7 2/5] proc: subset=pid: Show /proc/self/net only for CAP_NET_ADMIN
2026-01-13 9:20 ` [PATCH v7 " Alexey Gladkov
2026-01-13 9:20 ` [PATCH v7 1/5] docs: proc: add documentation about mount restrictions Alexey Gladkov
@ 2026-01-13 9:20 ` Alexey Gladkov
2026-02-04 14:39 ` Christian Brauner
2026-01-13 9:20 ` [PATCH v7 3/5] proc: Disable cancellation of subset=pid option Alexey Gladkov
` (3 subsequent siblings)
5 siblings, 1 reply; 34+ messages in thread
From: Alexey Gladkov @ 2026-01-13 9:20 UTC (permalink / raw)
To: Christian Brauner, Dan Klishch
Cc: Al Viro, Eric W . Biederman, Kees Cook, containers, linux-fsdevel,
linux-kernel
Cache the mounter's credentials and make access to the net directories
contingent on the permissions of the mounter of proc.
Do not show /proc/self/net when proc is mounted with the subset=pid option
and the mounter does not have CAP_NET_ADMIN.
Signed-off-by: Alexey Gladkov <legion@kernel.org>
---
fs/proc/proc_net.c | 8 ++++++++
fs/proc/root.c | 5 +++++
include/linux/proc_fs.h | 1 +
3 files changed, 14 insertions(+)
diff --git a/fs/proc/proc_net.c b/fs/proc/proc_net.c
index 52f0b75cbce2..6e0ccef0169f 100644
--- a/fs/proc/proc_net.c
+++ b/fs/proc/proc_net.c
@@ -23,6 +23,7 @@
#include <linux/uidgid.h>
#include <net/net_namespace.h>
#include <linux/seq_file.h>
+#include <linux/security.h>
#include "internal.h"
@@ -270,6 +271,7 @@ static struct net *get_proc_task_net(struct inode *dir)
struct task_struct *task;
struct nsproxy *ns;
struct net *net = NULL;
+ struct proc_fs_info *fs_info = proc_sb_info(dir->i_sb);
rcu_read_lock();
task = pid_task(proc_pid(dir), PIDTYPE_PID);
@@ -282,6 +284,12 @@ static struct net *get_proc_task_net(struct inode *dir)
}
rcu_read_unlock();
+ if (net && (fs_info->pidonly == PROC_PIDONLY_ON) &&
+ security_capable(fs_info->mounter_cred, net->user_ns, CAP_NET_ADMIN, CAP_OPT_NONE) < 0) {
+ put_net(net);
+ net = NULL;
+ }
+
return net;
}
diff --git a/fs/proc/root.c b/fs/proc/root.c
index d8ca41d823e4..ed8a101d09d3 100644
--- a/fs/proc/root.c
+++ b/fs/proc/root.c
@@ -254,6 +254,7 @@ static int proc_fill_super(struct super_block *s, struct fs_context *fc)
return -ENOMEM;
fs_info->pid_ns = get_pid_ns(ctx->pid_ns);
+ fs_info->mounter_cred = get_cred(fc->cred);
proc_apply_options(fs_info, fc, current_user_ns());
/* User space would break if executables or devices appear on proc */
@@ -303,6 +304,9 @@ static int proc_reconfigure(struct fs_context *fc)
sync_filesystem(sb);
+ put_cred(fs_info->mounter_cred);
+ fs_info->mounter_cred = get_cred(fc->cred);
+
proc_apply_options(fs_info, fc, current_user_ns());
return 0;
}
@@ -350,6 +354,7 @@ static void proc_kill_sb(struct super_block *sb)
kill_anon_super(sb);
if (fs_info) {
put_pid_ns(fs_info->pid_ns);
+ put_cred(fs_info->mounter_cred);
kfree_rcu(fs_info, rcu);
}
}
diff --git a/include/linux/proc_fs.h b/include/linux/proc_fs.h
index 19d1c5e5f335..ec123c277d49 100644
--- a/include/linux/proc_fs.h
+++ b/include/linux/proc_fs.h
@@ -67,6 +67,7 @@ enum proc_pidonly {
struct proc_fs_info {
struct pid_namespace *pid_ns;
kgid_t pid_gid;
+ const struct cred *mounter_cred;
enum proc_hidepid hide_pid;
enum proc_pidonly pidonly;
struct rcu_head rcu;
--
2.52.0
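The gate added to get_proc_task_net() can be sketched as a small userspace model. This is only an illustration, not kernel code: the struct and function names below are hypothetical, and the boolean `mounter_has_net_admin` stands in for the result of security_capable() on the cached mounter credentials.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for kernel types; not the kernel's definitions. */
enum pidonly_model { PIDONLY_OFF, PIDONLY_ON };

struct net_model {
	int id;
};

struct proc_mount_model {
	enum pidonly_model pidonly;
	/* Models security_capable(mounter_cred, net->user_ns, CAP_NET_ADMIN, ...) == 0 */
	bool mounter_has_net_admin;
};

/*
 * Returns the network namespace to expose through /proc/<pid>/net,
 * or NULL when the directory must stay hidden.
 */
static const struct net_model *
visible_task_net(const struct proc_mount_model *m, const struct net_model *task_net)
{
	assert(m != NULL);

	/* The task has no namespace to show (it may be exiting). */
	if (task_net == NULL)
		return NULL;

	/* With subset=pid, the mounter's capability decides visibility. */
	if (m->pidonly == PIDONLY_ON && !m->mounter_has_net_admin)
		return NULL;

	return task_net;
}
```

In this model a mount without subset=pid always exposes the namespace, which matches the patch leaving the existing behaviour untouched in that case.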
* Re: [PATCH v7 2/5] proc: subset=pid: Show /proc/self/net only for CAP_NET_ADMIN
2026-01-13 9:20 ` [PATCH v7 2/5] proc: subset=pid: Show /proc/self/net only for CAP_NET_ADMIN Alexey Gladkov
@ 2026-02-04 14:39 ` Christian Brauner
2026-02-11 19:35 ` Alexey Gladkov
0 siblings, 1 reply; 34+ messages in thread
From: Christian Brauner @ 2026-02-04 14:39 UTC (permalink / raw)
To: Alexey Gladkov
Cc: Dan Klishch, Al Viro, Eric W . Biederman, Kees Cook, containers,
linux-fsdevel, linux-kernel
On Tue, Jan 13, 2026 at 10:20:34AM +0100, Alexey Gladkov wrote:
> Cache the mounters credentials and allow access to the net directories
> contingent of the permissions of the mounter of proc.
>
> Do not show /proc/self/net when proc is mounted with subset=pid option
> and the mounter does not have CAP_NET_ADMIN.
>
> Signed-off-by: Alexey Gladkov <legion@kernel.org>
> ---
> fs/proc/proc_net.c | 8 ++++++++
> fs/proc/root.c | 5 +++++
> include/linux/proc_fs.h | 1 +
> 3 files changed, 14 insertions(+)
>
> diff --git a/fs/proc/proc_net.c b/fs/proc/proc_net.c
> index 52f0b75cbce2..6e0ccef0169f 100644
> --- a/fs/proc/proc_net.c
> +++ b/fs/proc/proc_net.c
> @@ -23,6 +23,7 @@
> #include <linux/uidgid.h>
> #include <net/net_namespace.h>
> #include <linux/seq_file.h>
> +#include <linux/security.h>
>
> #include "internal.h"
>
> @@ -270,6 +271,7 @@ static struct net *get_proc_task_net(struct inode *dir)
> struct task_struct *task;
> struct nsproxy *ns;
> struct net *net = NULL;
> + struct proc_fs_info *fs_info = proc_sb_info(dir->i_sb);
>
> rcu_read_lock();
> task = pid_task(proc_pid(dir), PIDTYPE_PID);
> @@ -282,6 +284,12 @@ static struct net *get_proc_task_net(struct inode *dir)
> }
> rcu_read_unlock();
>
> + if (net && (fs_info->pidonly == PROC_PIDONLY_ON) &&
> + security_capable(fs_info->mounter_cred, net->user_ns, CAP_NET_ADMIN, CAP_OPT_NONE) < 0) {
> + put_net(net);
> + net = NULL;
> + }
> +
> return net;
> }
>
> diff --git a/fs/proc/root.c b/fs/proc/root.c
> index d8ca41d823e4..ed8a101d09d3 100644
> --- a/fs/proc/root.c
> +++ b/fs/proc/root.c
> @@ -254,6 +254,7 @@ static int proc_fill_super(struct super_block *s, struct fs_context *fc)
> return -ENOMEM;
>
> fs_info->pid_ns = get_pid_ns(ctx->pid_ns);
> + fs_info->mounter_cred = get_cred(fc->cred);
> proc_apply_options(fs_info, fc, current_user_ns());
>
> /* User space would break if executables or devices appear on proc */
> @@ -303,6 +304,9 @@ static int proc_reconfigure(struct fs_context *fc)
>
> sync_filesystem(sb);
>
> + put_cred(fs_info->mounter_cred);
> + fs_info->mounter_cred = get_cred(fc->cred);
Afaict, this races with get_proc_task_net(). You need a synchronization
mechanism here so that get_proc_task_net() doesn't risk accessing
invalid mounter creds while someone concurrently updates the creds.
Proposal how to fix that below.
But I'm kinda torn here anyway whether we want that credential change on
remount. The problem is that someone might inadvertently allow access to
/proc/<pid>/net as a side-effect simply because they remounted procfs.
But they never had a chance to prevent this.
I think it's best if mounter_creds stays fixed just as they do for
overlayfs. So we don't allow them to change on reconfigure. That also
makes all of the code I hinted at below pointless.
If we ever want to change the credentials it's easier to add a mount
option to procfs like I did for overlayfs.
_Untested_ patches:
First, the preparatory patch diff (no functional changes intended):
diff --git a/fs/proc/proc_net.c b/fs/proc/proc_net.c
index 52f0b75cbce2..81825e5819b8 100644
--- a/fs/proc/proc_net.c
+++ b/fs/proc/proc_net.c
@@ -268,19 +268,19 @@ EXPORT_SYMBOL_GPL(proc_create_net_single_write);
static struct net *get_proc_task_net(struct inode *dir)
{
struct task_struct *task;
- struct nsproxy *ns;
- struct net *net = NULL;
+ struct net *net;
- rcu_read_lock();
+ guard(rcu)();
task = pid_task(proc_pid(dir), PIDTYPE_PID);
- if (task != NULL) {
- task_lock(task);
- ns = task->nsproxy;
- if (ns != NULL)
- net = get_net(ns->net_ns);
- task_unlock(task);
+ if (!task)
+ return NULL;
+
+ scoped_guard(task_lock, task) {
+ struct nsproxy *ns = task->nsproxy;
+ if (!ns)
+ return NULL;
+ net = get_net(ns->net_ns);
}
- rcu_read_unlock();
return net;
}
And then on top of it something like:
diff --git a/fs/proc/proc_net.c b/fs/proc/proc_net.c
index 81825e5819b8..47dc9806395c 100644
--- a/fs/proc/proc_net.c
+++ b/fs/proc/proc_net.c
@@ -269,6 +269,8 @@ static struct net *get_proc_task_net(struct inode *dir)
{
struct task_struct *task;
struct net *net;
+ struct proc_fs_info *fs_info;
+ const struct cred *cred;
guard(rcu)();
task = pid_task(proc_pid(dir), PIDTYPE_PID);
@@ -282,6 +284,15 @@ static struct net *get_proc_task_net(struct inode *dir)
net = get_net(ns->net_ns);
}
+ fs_info = proc_sb_info(dir->i_sb);
+ if (fs_info->pidonly != PROC_PIDONLY_ON)
+ return net;
+
+ cred = rcu_dereference(fs_info->mounter_cred);
+ if (security_capable(cred, net->user_ns, CAP_NET_ADMIN, CAP_OPT_NONE) != 0) {
+ put_net(net);
+ return NULL;
+ }
return net;
}
diff --git a/fs/proc/root.c b/fs/proc/root.c
index d8ca41d823e4..68397900dab7 100644
--- a/fs/proc/root.c
+++ b/fs/proc/root.c
@@ -300,11 +300,15 @@ static int proc_reconfigure(struct fs_context *fc)
{
struct super_block *sb = fc->root->d_sb;
struct proc_fs_info *fs_info = proc_sb_info(sb);
+ const struct cred *cred;
sync_filesystem(sb);
- proc_apply_options(fs_info, fc, current_user_ns());
- return 0;
+ cred = rcu_replace_pointer(fs_info->mounter_cred, get_cred(fc->cred),
+ lockdep_is_held(&sb->s_umount));
+ put_cred(cred);
+
+ return proc_apply_options(sb, fc, current_user_ns());
}
static int proc_get_tree(struct fs_context *fc)
* Re: [PATCH v7 2/5] proc: subset=pid: Show /proc/self/net only for CAP_NET_ADMIN
2026-02-04 14:39 ` Christian Brauner
@ 2026-02-11 19:35 ` Alexey Gladkov
0 siblings, 0 replies; 34+ messages in thread
From: Alexey Gladkov @ 2026-02-11 19:35 UTC (permalink / raw)
To: Christian Brauner
Cc: Dan Klishch, Al Viro, Eric W . Biederman, Kees Cook, containers,
linux-fsdevel, linux-kernel
On Wed, Feb 04, 2026 at 03:39:53PM +0100, Christian Brauner wrote:
> On Tue, Jan 13, 2026 at 10:20:34AM +0100, Alexey Gladkov wrote:
> > Cache the mounters credentials and allow access to the net directories
> > contingent of the permissions of the mounter of proc.
> >
> > Do not show /proc/self/net when proc is mounted with subset=pid option
> > and the mounter does not have CAP_NET_ADMIN.
> >
> > Signed-off-by: Alexey Gladkov <legion@kernel.org>
> > ---
> > fs/proc/proc_net.c | 8 ++++++++
> > fs/proc/root.c | 5 +++++
> > include/linux/proc_fs.h | 1 +
> > 3 files changed, 14 insertions(+)
> >
> > diff --git a/fs/proc/proc_net.c b/fs/proc/proc_net.c
> > index 52f0b75cbce2..6e0ccef0169f 100644
> > --- a/fs/proc/proc_net.c
> > +++ b/fs/proc/proc_net.c
> > @@ -23,6 +23,7 @@
> > #include <linux/uidgid.h>
> > #include <net/net_namespace.h>
> > #include <linux/seq_file.h>
> > +#include <linux/security.h>
> >
> > #include "internal.h"
> >
> > @@ -270,6 +271,7 @@ static struct net *get_proc_task_net(struct inode *dir)
> > struct task_struct *task;
> > struct nsproxy *ns;
> > struct net *net = NULL;
> > + struct proc_fs_info *fs_info = proc_sb_info(dir->i_sb);
> >
> > rcu_read_lock();
> > task = pid_task(proc_pid(dir), PIDTYPE_PID);
> > @@ -282,6 +284,12 @@ static struct net *get_proc_task_net(struct inode *dir)
> > }
> > rcu_read_unlock();
> >
> > + if (net && (fs_info->pidonly == PROC_PIDONLY_ON) &&
> > + security_capable(fs_info->mounter_cred, net->user_ns, CAP_NET_ADMIN, CAP_OPT_NONE) < 0) {
> > + put_net(net);
> > + net = NULL;
> > + }
> > +
> > return net;
> > }
> >
> > diff --git a/fs/proc/root.c b/fs/proc/root.c
> > index d8ca41d823e4..ed8a101d09d3 100644
> > --- a/fs/proc/root.c
> > +++ b/fs/proc/root.c
> > @@ -254,6 +254,7 @@ static int proc_fill_super(struct super_block *s, struct fs_context *fc)
> > return -ENOMEM;
> >
> > fs_info->pid_ns = get_pid_ns(ctx->pid_ns);
> > + fs_info->mounter_cred = get_cred(fc->cred);
> > proc_apply_options(fs_info, fc, current_user_ns());
> >
> > /* User space would break if executables or devices appear on proc */
> > @@ -303,6 +304,9 @@ static int proc_reconfigure(struct fs_context *fc)
> >
> > sync_filesystem(sb);
> >
> > + put_cred(fs_info->mounter_cred);
> > + fs_info->mounter_cred = get_cred(fc->cred);
>
> Afaict, this races with get_proc_task_net(). You need a synchronization
> mechanism here so that get_proc_task_net() doesn't risk accessing
> invalid mounter creds while someone concurrently updates the creds.
> Proposal how to fix that below.
>
> But I'm kinda torn here anyway whether we want that credential change on
> remount. The problem is that someone might inadvertently allow access to
> /proc/<pid>/net as a side-effect simply because they remounted procfs.
> But they never had a chance to prevent this.
I think you're right, and there's no need to change credentials on
remount. At least not now.
> I think it's best if mounter_creds stays fixed just as they do for
> overlayfs. So we don't allow them to change on reconfigure. That also
> makes all of the code I hinted at below pointless.
I'll just remove the mounter_cred update from proc_reconfigure.
> If we ever want to change the credentials it's easier to add a mount
> option to procfs like I did for overlayfs.
>
> _Untested_ patches:
>
> First, the preparatory patch diff (no functional changes intended):
>
> diff --git a/fs/proc/proc_net.c b/fs/proc/proc_net.c
> index 52f0b75cbce2..81825e5819b8 100644
> --- a/fs/proc/proc_net.c
> +++ b/fs/proc/proc_net.c
> @@ -268,19 +268,19 @@ EXPORT_SYMBOL_GPL(proc_create_net_single_write);
> static struct net *get_proc_task_net(struct inode *dir)
> {
> struct task_struct *task;
> - struct nsproxy *ns;
> - struct net *net = NULL;
> + struct net *net;
>
> - rcu_read_lock();
> + guard(rcu)();
> task = pid_task(proc_pid(dir), PIDTYPE_PID);
> - if (task != NULL) {
> - task_lock(task);
> - ns = task->nsproxy;
> - if (ns != NULL)
> - net = get_net(ns->net_ns);
> - task_unlock(task);
> + if (!task)
> + return NULL;
> +
> + scoped_guard(task_lock, task) {
> + struct nsproxy *ns = task->nsproxy;
> + if (!ns)
> + return NULL;
> + net = get_net(ns->net_ns);
> }
> - rcu_read_unlock();
>
> return net;
> }
>
> And then on top of it something like:
>
> diff --git a/fs/proc/proc_net.c b/fs/proc/proc_net.c
> index 81825e5819b8..47dc9806395c 100644
> --- a/fs/proc/proc_net.c
> +++ b/fs/proc/proc_net.c
> @@ -269,6 +269,8 @@ static struct net *get_proc_task_net(struct inode *dir)
> {
> struct task_struct *task;
> struct net *net;
> + struct proc_fs_info *fs_info;
> + const struct cred *cred;
>
> guard(rcu)();
> task = pid_task(proc_pid(dir), PIDTYPE_PID);
> @@ -282,6 +284,15 @@ static struct net *get_proc_task_net(struct inode *dir)
> net = get_net(ns->net_ns);
> }
>
> + fs_info = proc_sb_info(dir->i_sb);
> + if (fs_info->pidonly != PROC_PIDONLY_ON)
> + return net;
> +
> + cred = rcu_dereference(fs_info->mounter_cred);
> + if (security_capable(cred, net->user_ns, CAP_NET_ADMIN, CAP_OPT_NONE) != 0) {
> + put_net(net);
> + return NULL;
> + }
> return net;
> }
>
> diff --git a/fs/proc/root.c b/fs/proc/root.c
> index d8ca41d823e4..68397900dab7 100644
> --- a/fs/proc/root.c
> +++ b/fs/proc/root.c
> @@ -300,11 +300,15 @@ static int proc_reconfigure(struct fs_context *fc)
> {
> struct super_block *sb = fc->root->d_sb;
> struct proc_fs_info *fs_info = proc_sb_info(sb);
> + const struct cred *cred;
>
> sync_filesystem(sb);
>
> - proc_apply_options(fs_info, fc, current_user_ns());
> - return 0;
> + cred = rcu_replace_pointer(fs_info->mounter_cred, get_cred(fc->cred),
> + lockdep_is_held(&sb->s_umount));
> + put_cred(cred);
> +
> + return proc_apply_options(sb, fc, current_user_ns());
> }
>
> static int proc_get_tree(struct fs_context *fc)
>
--
Rgrds, legion
* [PATCH v7 3/5] proc: Disable cancellation of subset=pid option
2026-01-13 9:20 ` [PATCH v7 " Alexey Gladkov
2026-01-13 9:20 ` [PATCH v7 1/5] docs: proc: add documentation about mount restrictions Alexey Gladkov
2026-01-13 9:20 ` [PATCH v7 2/5] proc: subset=pid: Show /proc/self/net only for CAP_NET_ADMIN Alexey Gladkov
@ 2026-01-13 9:20 ` Alexey Gladkov
2026-01-13 9:20 ` [PATCH v7 4/5] proc: Relax check of mount visibility Alexey Gladkov
` (2 subsequent siblings)
5 siblings, 0 replies; 34+ messages in thread
From: Alexey Gladkov @ 2026-01-13 9:20 UTC (permalink / raw)
To: Christian Brauner, Dan Klishch
Cc: Al Viro, Eric W . Biederman, Kees Cook, containers, linux-fsdevel,
linux-kernel
When procfs is mounted with the subset=pid option, there is no way to
remount it with this option removed. This prevents a remount from
revealing whatever was hidden, since some checks occur only at mount time.
This patch makes the limitation explicit and prints an error message.
Signed-off-by: Alexey Gladkov <legion@kernel.org>
---
fs/proc/root.c | 15 ++++++++++-----
1 file changed, 10 insertions(+), 5 deletions(-)
diff --git a/fs/proc/root.c b/fs/proc/root.c
index ed8a101d09d3..b9f33b67cdd6 100644
--- a/fs/proc/root.c
+++ b/fs/proc/root.c
@@ -223,7 +223,7 @@ static int proc_parse_param(struct fs_context *fc, struct fs_parameter *param)
return 0;
}
-static void proc_apply_options(struct proc_fs_info *fs_info,
+static int proc_apply_options(struct proc_fs_info *fs_info,
struct fs_context *fc,
struct user_namespace *user_ns)
{
@@ -233,13 +233,17 @@ static void proc_apply_options(struct proc_fs_info *fs_info,
fs_info->pid_gid = make_kgid(user_ns, ctx->gid);
if (ctx->mask & (1 << Opt_hidepid))
fs_info->hide_pid = ctx->hidepid;
- if (ctx->mask & (1 << Opt_subset))
+ if (ctx->mask & (1 << Opt_subset)) {
+ if (ctx->pidonly != PROC_PIDONLY_ON && fs_info->pidonly == PROC_PIDONLY_ON)
+ return invalf(fc, "proc: subset=pid cannot be unset\n");
fs_info->pidonly = ctx->pidonly;
+ }
if (ctx->mask & (1 << Opt_pidns) &&
!WARN_ON_ONCE(fc->purpose == FS_CONTEXT_FOR_RECONFIGURE)) {
put_pid_ns(fs_info->pid_ns);
fs_info->pid_ns = get_pid_ns(ctx->pid_ns);
}
+ return 0;
}
static int proc_fill_super(struct super_block *s, struct fs_context *fc)
@@ -255,7 +259,9 @@ static int proc_fill_super(struct super_block *s, struct fs_context *fc)
fs_info->pid_ns = get_pid_ns(ctx->pid_ns);
fs_info->mounter_cred = get_cred(fc->cred);
- proc_apply_options(fs_info, fc, current_user_ns());
+ ret = proc_apply_options(fs_info, fc, current_user_ns());
+ if (ret)
+ return ret;
/* User space would break if executables or devices appear on proc */
s->s_iflags |= SB_I_USERNS_VISIBLE | SB_I_NOEXEC | SB_I_NODEV;
@@ -307,8 +313,7 @@ static int proc_reconfigure(struct fs_context *fc)
put_cred(fs_info->mounter_cred);
fs_info->mounter_cred = get_cred(fc->cred);
- proc_apply_options(fs_info, fc, current_user_ns());
- return 0;
+ return proc_apply_options(fs_info, fc, current_user_ns());
}
static int proc_get_tree(struct fs_context *fc)
--
2.52.0
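The rule enforced by the new return path in proc_apply_options() can be modelled in userspace as follows. This is a sketch under stated assumptions: the function and type names are hypothetical, `subset_given` stands in for the `ctx->mask & (1 << Opt_subset)` test, and -EINVAL is used because invalf() evaluates to -EINVAL.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for enum proc_pidonly. */
enum pidonly_model { PIDONLY_OFF, PIDONLY_ON };

/*
 * current:      value already stored for the mounted instance
 * subset_given: whether a subset= option was passed on this (re)mount
 * requested:    value parsed from the option
 *
 * Returns 0 on success, or -EINVAL when the request would clear subset=pid.
 */
static int apply_subset_model(enum pidonly_model *current, bool subset_given,
			      enum pidonly_model requested)
{
	assert(current != NULL);

	/* No subset= option: the stored value is left alone. */
	if (!subset_given)
		return 0;

	/* Once set, subset=pid cannot be removed. */
	if (requested != PIDONLY_ON && *current == PIDONLY_ON)
		return -EINVAL;

	*current = requested;
	return 0;
}
```

Note that a remount which simply omits subset= succeeds and keeps the restriction in place; only an explicit attempt to clear it is rejected.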
* [PATCH v7 4/5] proc: Relax check of mount visibility
2026-01-13 9:20 ` [PATCH v7 " Alexey Gladkov
` (2 preceding siblings ...)
2026-01-13 9:20 ` [PATCH v7 3/5] proc: Disable cancellation of subset=pid option Alexey Gladkov
@ 2026-01-13 9:20 ` Alexey Gladkov
2026-01-13 9:20 ` [PATCH v7 5/5] docs: proc: add documentation about relaxing visibility restrictions Alexey Gladkov
2026-02-13 10:44 ` [PATCH v8 0/5] proc: subset=pid: Relax check of mount visibility Alexey Gladkov
5 siblings, 0 replies; 34+ messages in thread
From: Alexey Gladkov @ 2026-01-13 9:20 UTC (permalink / raw)
To: Christian Brauner, Dan Klishch
Cc: Al Viro, Eric W . Biederman, Kees Cook, containers, linux-fsdevel,
linux-kernel
When /proc is mounted with the subset=pid option, none of the system files
in the root of the filesystem are accessible from userspace. Only
dynamic information about processes is available, and it cannot be
hidden by an overmount.
For this reason, the full-visibility check is not relevant when
mounting with the subset=pid option.
Signed-off-by: Alexey Gladkov <legion@kernel.org>
---
fs/namespace.c | 29 ++++++++++++++++-------------
fs/proc/root.c | 16 ++++++++++------
include/linux/fs/super_types.h | 2 ++
3 files changed, 28 insertions(+), 19 deletions(-)
diff --git a/fs/namespace.c b/fs/namespace.c
index c58674a20cad..7daa86315c05 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -6116,7 +6116,8 @@ static bool mnt_already_visible(struct mnt_namespace *ns,
/* This mount is not fully visible if it's root directory
* is not the root directory of the filesystem.
*/
- if (mnt->mnt.mnt_root != mnt->mnt.mnt_sb->s_root)
+ if (!(sb->s_iflags & SB_I_USERNS_ALLOW_REVEALING) &&
+ mnt->mnt.mnt_root != mnt->mnt.mnt_sb->s_root)
continue;
/* A local view of the mount flags */
@@ -6136,18 +6137,20 @@ static bool mnt_already_visible(struct mnt_namespace *ns,
((mnt_flags & MNT_ATIME_MASK) != (new_flags & MNT_ATIME_MASK)))
continue;
- /* This mount is not fully visible if there are any
- * locked child mounts that cover anything except for
- * empty directories.
- */
- list_for_each_entry(child, &mnt->mnt_mounts, mnt_child) {
- struct inode *inode = child->mnt_mountpoint->d_inode;
- /* Only worry about locked mounts */
- if (!(child->mnt.mnt_flags & MNT_LOCKED))
- continue;
- /* Is the directory permanently empty? */
- if (!is_empty_dir_inode(inode))
- goto next;
+ if (!(sb->s_iflags & SB_I_USERNS_ALLOW_REVEALING)) {
+ /* This mount is not fully visible if there are any
+ * locked child mounts that cover anything except for
+ * empty directories.
+ */
+ list_for_each_entry(child, &mnt->mnt_mounts, mnt_child) {
+ struct inode *inode = child->mnt_mountpoint->d_inode;
+ /* Only worry about locked mounts */
+ if (!IS_MNT_LOCKED(child))
+ continue;
+ /* Is the directory permanently empty? */
+ if (!is_empty_dir_inode(inode))
+ goto next;
+ }
}
/* Preserve the locked attributes */
*new_mnt_flags |= mnt_flags & (MNT_LOCK_READONLY | \
diff --git a/fs/proc/root.c b/fs/proc/root.c
index b9f33b67cdd6..354dc13417e3 100644
--- a/fs/proc/root.c
+++ b/fs/proc/root.c
@@ -223,18 +223,21 @@ static int proc_parse_param(struct fs_context *fc, struct fs_parameter *param)
return 0;
}
-static int proc_apply_options(struct proc_fs_info *fs_info,
+static int proc_apply_options(struct super_block *s,
struct fs_context *fc,
struct user_namespace *user_ns)
{
struct proc_fs_context *ctx = fc->fs_private;
+ struct proc_fs_info *fs_info = proc_sb_info(s);
if (ctx->mask & (1 << Opt_gid))
fs_info->pid_gid = make_kgid(user_ns, ctx->gid);
if (ctx->mask & (1 << Opt_hidepid))
fs_info->hide_pid = ctx->hidepid;
if (ctx->mask & (1 << Opt_subset)) {
- if (ctx->pidonly != PROC_PIDONLY_ON && fs_info->pidonly == PROC_PIDONLY_ON)
+ if (ctx->pidonly == PROC_PIDONLY_ON)
+ s->s_iflags |= SB_I_USERNS_ALLOW_REVEALING;
+ else if (fs_info->pidonly == PROC_PIDONLY_ON)
return invalf(fc, "proc: subset=pid cannot be unset\n");
fs_info->pidonly = ctx->pidonly;
}
@@ -259,9 +262,6 @@ static int proc_fill_super(struct super_block *s, struct fs_context *fc)
fs_info->pid_ns = get_pid_ns(ctx->pid_ns);
fs_info->mounter_cred = get_cred(fc->cred);
- ret = proc_apply_options(fs_info, fc, current_user_ns());
- if (ret)
- return ret;
/* User space would break if executables or devices appear on proc */
s->s_iflags |= SB_I_USERNS_VISIBLE | SB_I_NOEXEC | SB_I_NODEV;
@@ -273,6 +273,10 @@ static int proc_fill_super(struct super_block *s, struct fs_context *fc)
s->s_time_gran = 1;
s->s_fs_info = fs_info;
+ ret = proc_apply_options(s, fc, current_user_ns());
+ if (ret)
+ return ret;
+
/*
* procfs isn't actually a stacking filesystem; however, there is
* too much magic going on inside it to permit stacking things on
@@ -313,7 +317,7 @@ static int proc_reconfigure(struct fs_context *fc)
put_cred(fs_info->mounter_cred);
fs_info->mounter_cred = get_cred(fc->cred);
- return proc_apply_options(fs_info, fc, current_user_ns());
+ return proc_apply_options(sb, fc, current_user_ns());
}
static int proc_get_tree(struct fs_context *fc)
diff --git a/include/linux/fs/super_types.h b/include/linux/fs/super_types.h
index 6bd3009e09b3..5e640b9140df 100644
--- a/include/linux/fs/super_types.h
+++ b/include/linux/fs/super_types.h
@@ -333,4 +333,6 @@ struct super_block {
#define SB_I_NOIDMAP 0x00002000 /* No idmapped mounts on this superblock */
#define SB_I_ALLOW_HSM 0x00004000 /* Allow HSM events on this superblock */
+#define SB_I_USERNS_ALLOW_REVEALING 0x00008000 /* Skip full visibility check */
+
#endif /* _LINUX_FS_SUPER_TYPES_H */
--
2.52.0
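The effect of SB_I_USERNS_ALLOW_REVEALING on the two "fully visible" tests in mnt_already_visible() can be sketched as a userspace model. The types and names here are hypothetical simplifications, and the model deliberately leaves out the readonly and atime flag comparison, which the patch keeps in force regardless of the new flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a child mount stacked on an existing procfs mount. */
struct child_mount_model {
	bool locked;           /* models MNT_LOCKED */
	bool covers_empty_dir; /* mountpoint is a permanently empty directory */
};

/* Hypothetical model of an existing procfs mount in the namespace. */
struct existing_mount_model {
	bool root_is_fs_root;  /* models mnt_root == sb->s_root */
	const struct child_mount_model *children;
	size_t nr_children;
};

/*
 * Returns true if the existing mount passes the visibility tests.
 * allow_revealing models SB_I_USERNS_ALLOW_REVEALING on the new superblock.
 */
static bool mount_counts_as_visible(const struct existing_mount_model *mnt,
				    bool allow_revealing)
{
	assert(mnt != NULL);

	/* subset=pid: both visibility tests are skipped. */
	if (allow_revealing)
		return true;

	/* A bind mount of a subdirectory is not a full view. */
	if (!mnt->root_is_fs_root)
		return false;

	/* A locked overmount may only cover a permanently empty directory. */
	for (size_t i = 0; i < mnt->nr_children; i++) {
		const struct child_mount_model *c = &mnt->children[i];

		if (c->locked && !c->covers_empty_dir)
			return false;
	}
	return true;
}
```

A container runtime that masks part of /proc with a locked overmount corresponds to the `locked && !covers_empty_dir` case, which is the situation this patch is meant to unblock for subset=pid mounts.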
* [PATCH v7 5/5] docs: proc: add documentation about relaxing visibility restrictions
2026-01-13 9:20 ` [PATCH v7 " Alexey Gladkov
` (3 preceding siblings ...)
2026-01-13 9:20 ` [PATCH v7 4/5] proc: Relax check of mount visibility Alexey Gladkov
@ 2026-01-13 9:20 ` Alexey Gladkov
2026-02-13 10:44 ` [PATCH v8 0/5] proc: subset=pid: Relax check of mount visibility Alexey Gladkov
5 siblings, 0 replies; 34+ messages in thread
From: Alexey Gladkov @ 2026-01-13 9:20 UTC (permalink / raw)
To: Christian Brauner, Dan Klishch
Cc: Al Viro, Eric W . Biederman, Kees Cook, containers, linux-fsdevel,
linux-kernel
Signed-off-by: Alexey Gladkov <legion@kernel.org>
---
Documentation/filesystems/proc.rst | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
index c8864fcbdec7..3acf178c1202 100644
--- a/Documentation/filesystems/proc.rst
+++ b/Documentation/filesystems/proc.rst
@@ -2417,7 +2417,8 @@ so will give an `-EBUSY` error).
If user namespaces are in use, the kernel additionally checks the instances of
procfs available to the mounter and will not allow procfs to be mounted if:
- 1. This mount is not fully visible.
+ 1. This mount is not fully visible unless the new procfs is going to be
+ mounted with the subset=pid option.
a. Its root directory is not the root directory of the filesystem.
b. Any file or non-empty procfs directory is hidden by another mount.
--
2.52.0
* [PATCH v8 0/5] proc: subset=pid: Relax check of mount visibility
2026-01-13 9:20 ` [PATCH v7 " Alexey Gladkov
` (4 preceding siblings ...)
2026-01-13 9:20 ` [PATCH v7 5/5] docs: proc: add documentation about relaxing visibility restrictions Alexey Gladkov
@ 2026-02-13 10:44 ` Alexey Gladkov
2026-02-13 10:44 ` [PATCH v8 1/5] docs: proc: add documentation about mount restrictions Alexey Gladkov
` (4 more replies)
5 siblings, 5 replies; 34+ messages in thread
From: Alexey Gladkov @ 2026-02-13 10:44 UTC (permalink / raw)
To: Christian Brauner, Dan Klishch
Cc: Al Viro, Eric W . Biederman, Kees Cook, linux-fsdevel,
linux-kernel
When mounting procfs with the subset=pid option, all static files become
unavailable and only the dynamic part with information about pids is accessible.
In this case, there is no point in imposing additional restrictions on the
visibility of the entire filesystem for the mounter. Everything that can be
hidden in procfs is already inaccessible.
Currently, these restrictions prevent pidfs from being mounted inside rootless
containers, as almost all container implementations override part of procfs to
hide certain directories. Relaxing these restrictions will allow pidfs to be
used in nested containerization.
---
Changelog
---------
v8:
* Remove mounter credential change on remount as suggested by Christian Brauner.
v7:
* Rebase on v6.19-rc5.
* Rename SB_I_DYNAMIC to SB_I_USERNS_ALLOW_REVEALING.
v6:
* Add documentation about procfs mount restrictions.
* Reorder commits for better review.
v4:
* Set SB_I_DYNAMIC only if pidonly is set.
* Add an error message if subset=pid is canceled during remount.
v3:
* Add 'const' to struct cred *mounter_cred (fix kernel test robot warning).
v2:
* Cache the mounter's credentials and make access to the net directories
contingent on the permissions of the mounter of procfs.
Alexey Gladkov (5):
docs: proc: add documentation about mount restrictions
proc: subset=pid: Show /proc/self/net only for CAP_NET_ADMIN
proc: Disable cancellation of subset=pid option
proc: Relax check of mount visibility
docs: proc: add documentation about relaxing visibility restrictions
Documentation/filesystems/proc.rst | 15 +++++++++++++++
fs/namespace.c | 29 ++++++++++++++++-------------
fs/proc/proc_net.c | 8 ++++++++
fs/proc/root.c | 22 ++++++++++++++++------
include/linux/fs/super_types.h | 2 ++
include/linux/proc_fs.h | 1 +
6 files changed, 58 insertions(+), 19 deletions(-)
--
2.53.0
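For reference, the kind of mount this series is about can be requested from userspace roughly as below. This is a minimal sketch, not part of the series: the helper name is hypothetical, the mount flags are a common choice rather than a requirement, and whether the call succeeds inside a user namespace depends on the visibility checks discussed in the patches.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mount.h>

/*
 * Try to mount a procfs instance restricted to the pid subset at target.
 * Returns 0 on success or a negative errno value on failure.
 */
static int mount_proc_pid_subset(const char *target)
{
	assert(target != NULL);

	/* "subset=pid" is passed as filesystem-specific mount data. */
	if (mount("proc", target, "proc",
		  MS_NOSUID | MS_NODEV | MS_NOEXEC, "subset=pid") != 0)
		return -errno;

	return 0;
}
```

A caller would typically invoke this after entering new user, mount and pid namespaces, and treat a negative return such as -EPERM as the visibility check refusing the mount.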
* [PATCH v8 1/5] docs: proc: add documentation about mount restrictions
2026-02-13 10:44 ` [PATCH v8 0/5] proc: subset=pid: Relax check of mount visibility Alexey Gladkov
@ 2026-02-13 10:44 ` Alexey Gladkov
2026-02-13 10:44 ` [PATCH v8 2/5] proc: subset=pid: Show /proc/self/net only for CAP_NET_ADMIN Alexey Gladkov
` (3 subsequent siblings)
4 siblings, 0 replies; 34+ messages in thread
From: Alexey Gladkov @ 2026-02-13 10:44 UTC (permalink / raw)
To: Christian Brauner, Dan Klishch
Cc: Al Viro, Eric W . Biederman, Kees Cook, linux-fsdevel,
linux-kernel
procfs has a number of mounting restrictions that are not documented
anywhere.
Signed-off-by: Alexey Gladkov <legion@kernel.org>
---
Documentation/filesystems/proc.rst | 14 ++++++++++++++
1 file changed, 14 insertions(+)
diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
index 8256e857e2d7..c8864fcbdec7 100644
--- a/Documentation/filesystems/proc.rst
+++ b/Documentation/filesystems/proc.rst
@@ -52,6 +52,7 @@ fixes/update part 1.1 Stefani Seibold <stefani@seibold.net> June 9 2009
4 Configuring procfs
4.1 Mount options
+ 4.2 Mount restrictions
5 Filesystem behavior
@@ -2410,6 +2411,19 @@ will use the calling process's active pid namespace. Note that the pid
namespace of an existing procfs instance cannot be modified (attempting to do
so will give an `-EBUSY` error).
+4.2 Mount restrictions
+--------------------------
+
+If user namespaces are in use, the kernel additionally checks the instances of
+procfs available to the mounter and will not allow procfs to be mounted if:
+
+ 1. This mount is not fully visible.
+
+ a. Its root directory is not the root directory of the filesystem.
+ b. Any file or non-empty procfs directory is hidden by another mount.
+
+ 2. A new mount overrides the readonly option or any option from the atime family.
+
Chapter 5: Filesystem behavior
==============================
--
2.53.0
* [PATCH v8 2/5] proc: subset=pid: Show /proc/self/net only for CAP_NET_ADMIN
2026-02-13 10:44 ` [PATCH v8 0/5] proc: subset=pid: Relax check of mount visibility Alexey Gladkov
2026-02-13 10:44 ` [PATCH v8 1/5] docs: proc: add documentation about mount restrictions Alexey Gladkov
@ 2026-02-13 10:44 ` Alexey Gladkov
2026-02-13 10:44 ` [PATCH v8 3/5] proc: Disable cancellation of subset=pid option Alexey Gladkov
` (2 subsequent siblings)
4 siblings, 0 replies; 34+ messages in thread
From: Alexey Gladkov @ 2026-02-13 10:44 UTC (permalink / raw)
To: Christian Brauner, Dan Klishch
Cc: Al Viro, Eric W . Biederman, Kees Cook, linux-fsdevel,
linux-kernel
Cache the mounter's credentials and make access to the net directories
contingent on the permissions of the mounter of proc.
Do not show /proc/self/net when proc is mounted with subset=pid option
and the mounter does not have CAP_NET_ADMIN. To avoid inadvertently
allowing access to /proc/<pid>/net, updating mounter credentials is not
supported.
Signed-off-by: Alexey Gladkov <legion@kernel.org>
---
fs/proc/proc_net.c | 8 ++++++++
fs/proc/root.c | 2 ++
include/linux/proc_fs.h | 1 +
3 files changed, 11 insertions(+)
diff --git a/fs/proc/proc_net.c b/fs/proc/proc_net.c
index 52f0b75cbce2..6e0ccef0169f 100644
--- a/fs/proc/proc_net.c
+++ b/fs/proc/proc_net.c
@@ -23,6 +23,7 @@
#include <linux/uidgid.h>
#include <net/net_namespace.h>
#include <linux/seq_file.h>
+#include <linux/security.h>
#include "internal.h"
@@ -270,6 +271,7 @@ static struct net *get_proc_task_net(struct inode *dir)
struct task_struct *task;
struct nsproxy *ns;
struct net *net = NULL;
+ struct proc_fs_info *fs_info = proc_sb_info(dir->i_sb);
rcu_read_lock();
task = pid_task(proc_pid(dir), PIDTYPE_PID);
@@ -282,6 +284,12 @@ static struct net *get_proc_task_net(struct inode *dir)
}
rcu_read_unlock();
+ if (net && (fs_info->pidonly == PROC_PIDONLY_ON) &&
+ security_capable(fs_info->mounter_cred, net->user_ns, CAP_NET_ADMIN, CAP_OPT_NONE) < 0) {
+ put_net(net);
+ net = NULL;
+ }
+
return net;
}
diff --git a/fs/proc/root.c b/fs/proc/root.c
index d8ca41d823e4..c4af3a9b1a44 100644
--- a/fs/proc/root.c
+++ b/fs/proc/root.c
@@ -254,6 +254,7 @@ static int proc_fill_super(struct super_block *s, struct fs_context *fc)
return -ENOMEM;
fs_info->pid_ns = get_pid_ns(ctx->pid_ns);
+ fs_info->mounter_cred = get_cred(fc->cred);
proc_apply_options(fs_info, fc, current_user_ns());
/* User space would break if executables or devices appear on proc */
@@ -350,6 +351,7 @@ static void proc_kill_sb(struct super_block *sb)
kill_anon_super(sb);
if (fs_info) {
put_pid_ns(fs_info->pid_ns);
+ put_cred(fs_info->mounter_cred);
kfree_rcu(fs_info, rcu);
}
}
diff --git a/include/linux/proc_fs.h b/include/linux/proc_fs.h
index 19d1c5e5f335..ec123c277d49 100644
--- a/include/linux/proc_fs.h
+++ b/include/linux/proc_fs.h
@@ -67,6 +67,7 @@ enum proc_pidonly {
struct proc_fs_info {
struct pid_namespace *pid_ns;
kgid_t pid_gid;
+ const struct cred *mounter_cred;
enum proc_hidepid hide_pid;
enum proc_pidonly pidonly;
struct rcu_head rcu;
--
2.53.0
^ permalink raw reply related [flat|nested] 34+ messages in thread
* [PATCH v8 3/5] proc: Disable cancellation of subset=pid option
2026-02-13 10:44 ` [PATCH v8 0/5] proc: subset=pid: Relax check of mount visibility Alexey Gladkov
2026-02-13 10:44 ` [PATCH v8 1/5] docs: proc: add documentation about mount restrictions Alexey Gladkov
2026-02-13 10:44 ` [PATCH v8 2/5] proc: subset=pid: Show /proc/self/net only for CAP_NET_ADMIN Alexey Gladkov
@ 2026-02-13 10:44 ` Alexey Gladkov
2026-02-13 10:44 ` [PATCH v8 4/5] proc: Relax check of mount visibility Alexey Gladkov
2026-02-13 10:44 ` [PATCH v8 5/5] docs: proc: add documentation about relaxing visibility restrictions Alexey Gladkov
4 siblings, 0 replies; 34+ messages in thread
From: Alexey Gladkov @ 2026-02-13 10:44 UTC (permalink / raw)
To: Christian Brauner, Dan Klishch
Cc: Al Viro, Eric W . Biederman, Kees Cook, linux-fsdevel,
linux-kernel
When procfs is mounted with the subset=pid option, there is no way to
remount it with this option removed. This is done so that whatever was
hidden does not become visible, since some checks occur during mount.
This patch makes the limitation explicit and prints an error message.
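The rule can be modelled outside the kernel. The following is a minimal
user-space sketch, not the kernel code: model_fs_info, model_pidonly and
model_apply_subset() are hypothetical names that mirror only the subset
handling in proc_apply_options(), and -EINVAL stands in for the value that
invalf() returns.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel's enum proc_pidonly values. */
enum model_pidonly { MODEL_PIDONLY_OFF, MODEL_PIDONLY_ON };

/* Hypothetical, reduced model of struct proc_fs_info. */
struct model_fs_info {
	enum model_pidonly pidonly;
};

/*
 * Apply a "subset=" request to an existing instance.
 * Returns 0 on success and -EINVAL when the request would drop
 * subset=pid from an instance that already has it.
 */
static int model_apply_subset(struct model_fs_info *info, bool subset_given,
			      enum model_pidonly requested)
{
	assert(info != NULL);

	/* Option not present in this (re)mount request: nothing changes. */
	if (!subset_given)
		return 0;

	/* Once subset=pid is set it cannot be removed again. */
	if (requested != MODEL_PIDONLY_ON && info->pidonly == MODEL_PIDONLY_ON)
		return -EINVAL;

	info->pidonly = requested;
	return 0;
}
```

In this model, turning the option on always succeeds, while a later request
to turn it off is rejected and leaves the stored state untouched.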
Signed-off-by: Alexey Gladkov <legion@kernel.org>
---
fs/proc/root.c | 15 ++++++++++-----
1 file changed, 10 insertions(+), 5 deletions(-)
diff --git a/fs/proc/root.c b/fs/proc/root.c
index c4af3a9b1a44..535a168046e3 100644
--- a/fs/proc/root.c
+++ b/fs/proc/root.c
@@ -223,7 +223,7 @@ static int proc_parse_param(struct fs_context *fc, struct fs_parameter *param)
return 0;
}
-static void proc_apply_options(struct proc_fs_info *fs_info,
+static int proc_apply_options(struct proc_fs_info *fs_info,
struct fs_context *fc,
struct user_namespace *user_ns)
{
@@ -233,13 +233,17 @@ static void proc_apply_options(struct proc_fs_info *fs_info,
fs_info->pid_gid = make_kgid(user_ns, ctx->gid);
if (ctx->mask & (1 << Opt_hidepid))
fs_info->hide_pid = ctx->hidepid;
- if (ctx->mask & (1 << Opt_subset))
+ if (ctx->mask & (1 << Opt_subset)) {
+ if (ctx->pidonly != PROC_PIDONLY_ON && fs_info->pidonly == PROC_PIDONLY_ON)
+ return invalf(fc, "proc: subset=pid cannot be unset\n");
fs_info->pidonly = ctx->pidonly;
+ }
if (ctx->mask & (1 << Opt_pidns) &&
!WARN_ON_ONCE(fc->purpose == FS_CONTEXT_FOR_RECONFIGURE)) {
put_pid_ns(fs_info->pid_ns);
fs_info->pid_ns = get_pid_ns(ctx->pid_ns);
}
+ return 0;
}
static int proc_fill_super(struct super_block *s, struct fs_context *fc)
@@ -255,7 +259,9 @@ static int proc_fill_super(struct super_block *s, struct fs_context *fc)
fs_info->pid_ns = get_pid_ns(ctx->pid_ns);
fs_info->mounter_cred = get_cred(fc->cred);
- proc_apply_options(fs_info, fc, current_user_ns());
+ ret = proc_apply_options(fs_info, fc, current_user_ns());
+ if (ret)
+ return ret;
/* User space would break if executables or devices appear on proc */
s->s_iflags |= SB_I_USERNS_VISIBLE | SB_I_NOEXEC | SB_I_NODEV;
@@ -304,8 +310,7 @@ static int proc_reconfigure(struct fs_context *fc)
sync_filesystem(sb);
- proc_apply_options(fs_info, fc, current_user_ns());
- return 0;
+ return proc_apply_options(fs_info, fc, current_user_ns());
}
static int proc_get_tree(struct fs_context *fc)
--
2.53.0
^ permalink raw reply related [flat|nested] 34+ messages in thread
* [PATCH v8 4/5] proc: Relax check of mount visibility
2026-02-13 10:44 ` [PATCH v8 0/5] proc: subset=pid: Relax check of mount visibility Alexey Gladkov
` (2 preceding siblings ...)
2026-02-13 10:44 ` [PATCH v8 3/5] proc: Disable cancellation of subset=pid option Alexey Gladkov
@ 2026-02-13 10:44 ` Alexey Gladkov
2026-02-17 11:59 ` Christian Brauner
2026-02-13 10:44 ` [PATCH v8 5/5] docs: proc: add documentation about relaxing visibility restrictions Alexey Gladkov
4 siblings, 1 reply; 34+ messages in thread
From: Alexey Gladkov @ 2026-02-13 10:44 UTC (permalink / raw)
To: Christian Brauner, Dan Klishch
Cc: Al Viro, Eric W . Biederman, Kees Cook, linux-fsdevel,
linux-kernel
When /proc is mounted with the subset=pid option, none of the system
files in the root of the filesystem are accessible from userspace. Only
dynamic information about processes is available, which cannot be
hidden by an overmount.
For this reason, checking for full visibility is not relevant if
mounting is performed with the subset=pid option.
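For reference, a pid-only instance is requested from userspace by passing
"subset=pid" as the data string of mount(2). The sketch below is only an
illustration: mount_proc_pidonly() is a hypothetical helper name, the mount
flags are an assumption about a typical setup, and the call needs the usual
mount privileges in the caller's user namespace.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mount.h>

/*
 * Hypothetical helper: mount a procfs instance restricted to the
 * per-process directories at `target`.
 * Returns 0 on success or a negative errno value on failure.
 */
static int mount_proc_pidonly(const char *target)
{
	assert(target != NULL);

	/* The flags are illustrative; only "subset=pid" matters here. */
	if (mount("proc", target, "proc",
		  MS_NOSUID | MS_NODEV | MS_NOEXEC, "subset=pid") < 0)
		return -errno;

	return 0;
}
```

A caller would pass an existing directory as `target` and treat any negative
return value as the reason the kernel refused the mount.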
Signed-off-by: Alexey Gladkov <legion@kernel.org>
---
fs/namespace.c | 29 ++++++++++++++++-------------
fs/proc/root.c | 17 ++++++++++-------
include/linux/fs/super_types.h | 2 ++
3 files changed, 28 insertions(+), 20 deletions(-)
diff --git a/fs/namespace.c b/fs/namespace.c
index c58674a20cad..7daa86315c05 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -6116,7 +6116,8 @@ static bool mnt_already_visible(struct mnt_namespace *ns,
/* This mount is not fully visible if it's root directory
* is not the root directory of the filesystem.
*/
- if (mnt->mnt.mnt_root != mnt->mnt.mnt_sb->s_root)
+ if (!(sb->s_iflags & SB_I_USERNS_ALLOW_REVEALING) &&
+ mnt->mnt.mnt_root != mnt->mnt.mnt_sb->s_root)
continue;
/* A local view of the mount flags */
@@ -6136,18 +6137,20 @@ static bool mnt_already_visible(struct mnt_namespace *ns,
((mnt_flags & MNT_ATIME_MASK) != (new_flags & MNT_ATIME_MASK)))
continue;
- /* This mount is not fully visible if there are any
- * locked child mounts that cover anything except for
- * empty directories.
- */
- list_for_each_entry(child, &mnt->mnt_mounts, mnt_child) {
- struct inode *inode = child->mnt_mountpoint->d_inode;
- /* Only worry about locked mounts */
- if (!(child->mnt.mnt_flags & MNT_LOCKED))
- continue;
- /* Is the directory permanently empty? */
- if (!is_empty_dir_inode(inode))
- goto next;
+ if (!(sb->s_iflags & SB_I_USERNS_ALLOW_REVEALING)) {
+ /* This mount is not fully visible if there are any
+ * locked child mounts that cover anything except for
+ * empty directories.
+ */
+ list_for_each_entry(child, &mnt->mnt_mounts, mnt_child) {
+ struct inode *inode = child->mnt_mountpoint->d_inode;
+ /* Only worry about locked mounts */
+ if (!IS_MNT_LOCKED(child))
+ continue;
+ /* Is the directory permanently empty? */
+ if (!is_empty_dir_inode(inode))
+ goto next;
+ }
}
/* Preserve the locked attributes */
*new_mnt_flags |= mnt_flags & (MNT_LOCK_READONLY | \
diff --git a/fs/proc/root.c b/fs/proc/root.c
index 535a168046e3..e029d3587494 100644
--- a/fs/proc/root.c
+++ b/fs/proc/root.c
@@ -223,18 +223,21 @@ static int proc_parse_param(struct fs_context *fc, struct fs_parameter *param)
return 0;
}
-static int proc_apply_options(struct proc_fs_info *fs_info,
+static int proc_apply_options(struct super_block *s,
struct fs_context *fc,
struct user_namespace *user_ns)
{
struct proc_fs_context *ctx = fc->fs_private;
+ struct proc_fs_info *fs_info = proc_sb_info(s);
if (ctx->mask & (1 << Opt_gid))
fs_info->pid_gid = make_kgid(user_ns, ctx->gid);
if (ctx->mask & (1 << Opt_hidepid))
fs_info->hide_pid = ctx->hidepid;
if (ctx->mask & (1 << Opt_subset)) {
- if (ctx->pidonly != PROC_PIDONLY_ON && fs_info->pidonly == PROC_PIDONLY_ON)
+ if (ctx->pidonly == PROC_PIDONLY_ON)
+ s->s_iflags |= SB_I_USERNS_ALLOW_REVEALING;
+ else if (fs_info->pidonly == PROC_PIDONLY_ON)
return invalf(fc, "proc: subset=pid cannot be unset\n");
fs_info->pidonly = ctx->pidonly;
}
@@ -259,9 +262,6 @@ static int proc_fill_super(struct super_block *s, struct fs_context *fc)
fs_info->pid_ns = get_pid_ns(ctx->pid_ns);
fs_info->mounter_cred = get_cred(fc->cred);
- ret = proc_apply_options(fs_info, fc, current_user_ns());
- if (ret)
- return ret;
/* User space would break if executables or devices appear on proc */
s->s_iflags |= SB_I_USERNS_VISIBLE | SB_I_NOEXEC | SB_I_NODEV;
@@ -273,6 +273,10 @@ static int proc_fill_super(struct super_block *s, struct fs_context *fc)
s->s_time_gran = 1;
s->s_fs_info = fs_info;
+ ret = proc_apply_options(s, fc, current_user_ns());
+ if (ret)
+ return ret;
+
/*
* procfs isn't actually a stacking filesystem; however, there is
* too much magic going on inside it to permit stacking things on
@@ -306,11 +310,10 @@ static int proc_fill_super(struct super_block *s, struct fs_context *fc)
static int proc_reconfigure(struct fs_context *fc)
{
struct super_block *sb = fc->root->d_sb;
- struct proc_fs_info *fs_info = proc_sb_info(sb);
sync_filesystem(sb);
- return proc_apply_options(fs_info, fc, current_user_ns());
+ return proc_apply_options(sb, fc, current_user_ns());
}
static int proc_get_tree(struct fs_context *fc)
diff --git a/include/linux/fs/super_types.h b/include/linux/fs/super_types.h
index 6bd3009e09b3..5e640b9140df 100644
--- a/include/linux/fs/super_types.h
+++ b/include/linux/fs/super_types.h
@@ -333,4 +333,6 @@ struct super_block {
#define SB_I_NOIDMAP 0x00002000 /* No idmapped mounts on this superblock */
#define SB_I_ALLOW_HSM 0x00004000 /* Allow HSM events on this superblock */
+#define SB_I_USERNS_ALLOW_REVEALING 0x00008000 /* Skip full visibility check */
+
#endif /* _LINUX_FS_SUPER_TYPES_H */
--
2.53.0
^ permalink raw reply related [flat|nested] 34+ messages in thread
* Re: [PATCH v8 4/5] proc: Relax check of mount visibility
2026-02-13 10:44 ` [PATCH v8 4/5] proc: Relax check of mount visibility Alexey Gladkov
@ 2026-02-17 11:59 ` Christian Brauner
2026-04-10 11:12 ` Christian Brauner
0 siblings, 1 reply; 34+ messages in thread
From: Christian Brauner @ 2026-02-17 11:59 UTC (permalink / raw)
To: Alexey Gladkov
Cc: Dan Klishch, Al Viro, Eric W . Biederman, Kees Cook,
linux-fsdevel, linux-kernel
On Fri, Feb 13, 2026 at 11:44:29AM +0100, Alexey Gladkov wrote:
> When /proc is mounted with the subset=pid option, all system files from
> the root of the file system are not accessible in userspace. Only
> dynamic information about processes is available, which cannot be
> hidden with overmount.
>
> For this reason, checking for full visibility is not relevant if
> mounting is performed with the subset=pid option.
>
> Signed-off-by: Alexey Gladkov <legion@kernel.org>
> ---
> fs/namespace.c | 29 ++++++++++++++++-------------
> fs/proc/root.c | 17 ++++++++++-------
> include/linux/fs/super_types.h | 2 ++
> 3 files changed, 28 insertions(+), 20 deletions(-)
>
> diff --git a/fs/namespace.c b/fs/namespace.c
> index c58674a20cad..7daa86315c05 100644
> --- a/fs/namespace.c
> +++ b/fs/namespace.c
> @@ -6116,7 +6116,8 @@ static bool mnt_already_visible(struct mnt_namespace *ns,
> /* This mount is not fully visible if it's root directory
> * is not the root directory of the filesystem.
> */
> - if (mnt->mnt.mnt_root != mnt->mnt.mnt_sb->s_root)
> + if (!(sb->s_iflags & SB_I_USERNS_ALLOW_REVEALING) &&
> + mnt->mnt.mnt_root != mnt->mnt.mnt_sb->s_root)
> continue;
>
> /* A local view of the mount flags */
> @@ -6136,18 +6137,20 @@ static bool mnt_already_visible(struct mnt_namespace *ns,
> ((mnt_flags & MNT_ATIME_MASK) != (new_flags & MNT_ATIME_MASK)))
> continue;
There are a few things that I find problematic here.
Even before your change the mount flags of the first fully visible
procfs mount would be picked up. If the caller was unlucky they could
stumble upon the most restricted procfs mount in the mount namespace
rbtree, leading to weird scenarios where a user cannot write to the
procfs instance they just mounted but could write to another one that is
also in their namespace.
The other thing is that with this change specifically:
if (!(sb->s_iflags & SB_I_USERNS_ALLOW_REVEALING) &&
mnt->mnt.mnt_root != mnt->mnt.mnt_sb->s_root)
we start caring about mount options of even partially exposed procfs
mounts. IOW, if someone had a bind-mount of e.g., /proc/pressure
somewhere that got inherited via CLONE_NEWNS then we suddenly take the
mount options of that into account for a new /proc/<pid>/* only instance.
I think we should continue caring only about procfs mounts that are
visible from their root.
The other problem is that it is really annoying that we walk all
mounts in a mount namespace just to find procfs and sysfs mounts in
there. Currently a lot of workloads still do the CLONE_NEWNS dance
meaning they inherit all the crap from the host and then proceed to
set up their new rootfs. For busy container workloads that can be a lot.
So let's just be honest about it and treat procfs and sysfs as the
snowflakes that they have become and record their instances in a
separate per mount namespace hlist as in the (untested) patch below [1].
Also SB_I_USERNS_ALLOW_REVEALING seems unnecessary. The only time we
care about that flag is when we set up a new superblock. So this could
easily be a struct fs_context bitfield that just exists for the duration
of the creation of the new superblock and mount. So maybe pass that down
to mount_too_revealing() and further down into the actual helper.
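The bookkeeping idea can be illustrated with a small user-space model. This
is a sketch under stated assumptions, not the kernel implementation:
model_mount, model_ns, model_add_to_ns() and model_count_candidates() are
hypothetical names, and a plain singly linked list stands in for the
kernel's hlist.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, reduced model of a mount. */
struct model_mount {
	bool sb_userns_visible;           /* models SB_I_USERNS_VISIBLE on the superblock */
	bool root_is_sb_root;             /* models mnt_root == s_root */
	struct model_mount *next_visible; /* link in the namespace's visible list */
};

/* Hypothetical, reduced model of a mount namespace. */
struct model_ns {
	unsigned int nr_mounts;           /* all mounts in the namespace */
	struct model_mount *visible;      /* only fully visible procfs/sysfs-like mounts */
};

/* Add a mount; record it separately only if it is visible from its root. */
static void model_add_to_ns(struct model_ns *ns, struct model_mount *mnt)
{
	assert(ns != NULL && mnt != NULL);

	ns->nr_mounts++;
	if (mnt->sb_userns_visible && mnt->root_is_sb_root) {
		mnt->next_visible = ns->visible;
		ns->visible = mnt;
	}
}

/* Number of mounts a visibility check would have to look at. */
static unsigned int model_count_candidates(const struct model_ns *ns)
{
	unsigned int n = 0;

	for (const struct model_mount *m = ns->visible; m; m = m->next_visible)
		n++;
	return n;
}
```

With this layout the visibility check only walks the short list, and a
bind mount of a procfs subdirectory never enters it because its root is not
the filesystem root.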
[1]:
From 4bbd41e7a3ef91667dd334f31b1b6bf8caec0599 Mon Sep 17 00:00:00 2001
From: Christian Brauner <brauner@kernel.org>
Date: Tue, 17 Feb 2026 12:02:34 +0100
Subject: [PATCH] namespace: record fully visible mounts in list
Instead of wading through all the mounts in the mount namespace rbtree
to find fully visible procfs and sysfs mounts, be honest about them
being special cruft and record them in a separate per-mount namespace
list.
Signed-off-by: Christian Brauner <brauner@kernel.org>
---
fs/mount.h | 4 ++++
fs/namespace.c | 19 +++++++++++--------
2 files changed, 15 insertions(+), 8 deletions(-)
diff --git a/fs/mount.h b/fs/mount.h
index e0816c11a198..5df134d56d47 100644
--- a/fs/mount.h
+++ b/fs/mount.h
@@ -25,6 +25,7 @@ struct mnt_namespace {
__u32 n_fsnotify_mask;
struct fsnotify_mark_connector __rcu *n_fsnotify_marks;
#endif
+ struct hlist_head mnt_visible_mounts; /* SB_I_USERNS_VISIBLE mounts */
unsigned int nr_mounts; /* # of mounts in the namespace */
unsigned int pending_mounts;
refcount_t passive; /* number references not pinning @mounts */
@@ -90,6 +91,7 @@ struct mount {
int mnt_expiry_mark; /* true if marked for expiry */
struct hlist_head mnt_pins;
struct hlist_head mnt_stuck_children;
+ struct hlist_node mnt_ns_visible; /* link in ns->mnt_visible_mounts */
struct mount *overmount; /* mounted on ->mnt_root */
} __randomize_layout;
@@ -207,6 +209,8 @@ static inline void move_from_ns(struct mount *mnt)
ns->mnt_first_node = rb_next(&mnt->mnt_node);
rb_erase(&mnt->mnt_node, &ns->mounts);
RB_CLEAR_NODE(&mnt->mnt_node);
+ if (!hlist_unhashed(&mnt->mnt_ns_visible))
+ hlist_del_init(&mnt->mnt_ns_visible);
}
bool has_locked_children(struct mount *mnt, struct dentry *dentry);
diff --git a/fs/namespace.c b/fs/namespace.c
index a67cbe42746d..764081c690d5 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -321,6 +321,7 @@ static struct mount *alloc_vfsmnt(const char *name)
INIT_HLIST_NODE(&mnt->mnt_slave);
INIT_HLIST_NODE(&mnt->mnt_mp_list);
INIT_HLIST_HEAD(&mnt->mnt_stuck_children);
+ INIT_HLIST_NODE(&mnt->mnt_ns_visible);
RB_CLEAR_NODE(&mnt->mnt_node);
mnt->mnt.mnt_idmap = &nop_mnt_idmap;
}
@@ -1098,6 +1099,10 @@ static void mnt_add_to_ns(struct mnt_namespace *ns, struct mount *mnt)
rb_link_node(&mnt->mnt_node, parent, link);
rb_insert_color(&mnt->mnt_node, &ns->mounts);
+ if ((mnt->mnt.mnt_sb->s_iflags & SB_I_USERNS_VISIBLE) &&
+ mnt->mnt.mnt_root == mnt->mnt.mnt_sb->s_root)
+ hlist_add_head(&mnt->mnt_ns_visible, &ns->mnt_visible_mounts);
+
mnt_notify_add(mnt);
}
@@ -6295,22 +6300,20 @@ static bool mnt_already_visible(struct mnt_namespace *ns,
int *new_mnt_flags)
{
int new_flags = *new_mnt_flags;
- struct mount *mnt, *n;
+ struct mount *mnt;
+
+ /* Don't acquire namespace semaphore without a good reason. */
+ if (hlist_empty(&ns->mnt_visible_mounts))
+ return false;
guard(namespace_shared)();
- rbtree_postorder_for_each_entry_safe(mnt, n, &ns->mounts, mnt_node) {
+ hlist_for_each_entry(mnt, &ns->mnt_visible_mounts, mnt_ns_visible) {
struct mount *child;
int mnt_flags;
if (mnt->mnt.mnt_sb->s_type != sb->s_type)
continue;
- /* This mount is not fully visible if it's root directory
- * is not the root directory of the filesystem.
- */
- if (mnt->mnt.mnt_root != mnt->mnt.mnt_sb->s_root)
- continue;
-
/* A local view of the mount flags */
mnt_flags = mnt->mnt.mnt_flags;
--
2.47.3
^ permalink raw reply related [flat|nested] 34+ messages in thread
* Re: [PATCH v8 4/5] proc: Relax check of mount visibility
2026-02-17 11:59 ` Christian Brauner
@ 2026-04-10 11:12 ` Christian Brauner
2026-04-10 11:31 ` Alexey Gladkov
0 siblings, 1 reply; 34+ messages in thread
From: Christian Brauner @ 2026-04-10 11:12 UTC (permalink / raw)
To: Alexey Gladkov
Cc: Dan Klishch, Al Viro, Eric W . Biederman, Kees Cook,
linux-fsdevel, linux-kernel
On Tue, Feb 17, 2026 at 12:59:54PM +0100, Christian Brauner wrote:
> On Fri, Feb 13, 2026 at 11:44:29AM +0100, Alexey Gladkov wrote:
> > When /proc is mounted with the subset=pid option, all system files from
> > the root of the file system are not accessible in userspace. Only
> > dynamic information about processes is available, which cannot be
> > hidden with overmount.
> >
> > For this reason, checking for full visibility is not relevant if
> > mounting is performed with the subset=pid option.
> >
> > Signed-off-by: Alexey Gladkov <legion@kernel.org>
> > ---
> > fs/namespace.c | 29 ++++++++++++++++-------------
> > fs/proc/root.c | 17 ++++++++++-------
> > include/linux/fs/super_types.h | 2 ++
> > 3 files changed, 28 insertions(+), 20 deletions(-)
> >
> > diff --git a/fs/namespace.c b/fs/namespace.c
> > index c58674a20cad..7daa86315c05 100644
> > --- a/fs/namespace.c
> > +++ b/fs/namespace.c
> > @@ -6116,7 +6116,8 @@ static bool mnt_already_visible(struct mnt_namespace *ns,
> > /* This mount is not fully visible if it's root directory
> > * is not the root directory of the filesystem.
> > */
> > - if (mnt->mnt.mnt_root != mnt->mnt.mnt_sb->s_root)
> > + if (!(sb->s_iflags & SB_I_USERNS_ALLOW_REVEALING) &&
> > + mnt->mnt.mnt_root != mnt->mnt.mnt_sb->s_root)
> > continue;
> >
> > /* A local view of the mount flags */
> > @@ -6136,18 +6137,20 @@ static bool mnt_already_visible(struct mnt_namespace *ns,
> > ((mnt_flags & MNT_ATIME_MASK) != (new_flags & MNT_ATIME_MASK)))
> > continue;
>
> There are a few things that I find problematic here.
>
> Even before your change the mount flags of the first fully visible
> procfs mount would be picked up. If the caller was unlucky they could
> stumble upon the most restricted procfs mount in the mount namespace
> rbtree. Leading to weird scenarios where a user cannot write to the
> procfs instance they just mounted but could to another one that is also
> in their namespace.
>
> The other thing is that with this change specifically:
>
> if (!(sb->s_iflags & SB_I_USERNS_ALLOW_REVEALING) &&
> mnt->mnt.mnt_root != mnt->mnt.mnt_sb->s_root)
>
> we start caring about mount options of even partially exposed procfs
> mounts. IOW, if someone had a bind-mount of e.g., /proc/pressure
> somewhere that got inherited via CLONE_NEWNS then we suddenly take the
> mount options of that into account for a new /proc/<pid>/* only instance.
> I think we should continue caring only about procfs mounts that are
> visible from their root.
>
> The other problem is that it is really annoying that we walk all
> mounts in a mount namespace just to find procfs and sysfs mounts in
> there. Currently a lot of workloads still do the CLONE_NEWNS dance
> meaning they inherit all the crap from the host and then proceed to
> set up their new rootfs. For busy container workloads that can be a lot.
>
> So let's just be honest about it and treat procfs and sysfs as the
> snowflakes that they have become and record their instances in a
> separate per mount namespace hlist as in the (untested) patch below [1].
>
> Also SB_I_USERNS_ALLOW_REVEALING seems unnecessary. The only time we
> care about that flag is when we setup a new superblock. So this could
> easily be a struct fs_context bitfield that just exists for the duration
> of the creation of the new superblock and mount. So maybe pass that down
> to mount_too_revealing() and further down into the actual helper.
>
> [1]:
> From 4bbd41e7a3ef91667dd334f31b1b6bf8caec0599 Mon Sep 17 00:00:00 2001
> From: Christian Brauner <brauner@kernel.org>
> Date: Tue, 17 Feb 2026 12:02:34 +0100
> Subject: [PATCH] namespace: record fully visible mounts in list
>
> Instead of wading through all the mounts in the mount namespace rbtree
> to find fully visible procfs and sysfs mounts, be honest about them
> being special cruft and record them in a separate per-mount namespace
> list.
If you rework this I would expect to take it for v7.3. It's a bit late
now...
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: [PATCH v8 4/5] proc: Relax check of mount visibility
2026-04-10 11:12 ` Christian Brauner
@ 2026-04-10 11:31 ` Alexey Gladkov
0 siblings, 0 replies; 34+ messages in thread
From: Alexey Gladkov @ 2026-04-10 11:31 UTC (permalink / raw)
To: Christian Brauner
Cc: Dan Klishch, Al Viro, Eric W . Biederman, Kees Cook,
linux-fsdevel, linux-kernel
On Fri, Apr 10, 2026 at 01:12:36PM +0200, Christian Brauner wrote:
> On Tue, Feb 17, 2026 at 12:59:54PM +0100, Christian Brauner wrote:
> > On Fri, Feb 13, 2026 at 11:44:29AM +0100, Alexey Gladkov wrote:
> > > When /proc is mounted with the subset=pid option, all system files from
> > > the root of the file system are not accessible in userspace. Only
> > > dynamic information about processes is available, which cannot be
> > > hidden with overmount.
> > >
> > > For this reason, checking for full visibility is not relevant if
> > > mounting is performed with the subset=pid option.
> > >
> > > Signed-off-by: Alexey Gladkov <legion@kernel.org>
> > > ---
> > > fs/namespace.c | 29 ++++++++++++++++-------------
> > > fs/proc/root.c | 17 ++++++++++-------
> > > include/linux/fs/super_types.h | 2 ++
> > > 3 files changed, 28 insertions(+), 20 deletions(-)
> > >
> > > diff --git a/fs/namespace.c b/fs/namespace.c
> > > index c58674a20cad..7daa86315c05 100644
> > > --- a/fs/namespace.c
> > > +++ b/fs/namespace.c
> > > @@ -6116,7 +6116,8 @@ static bool mnt_already_visible(struct mnt_namespace *ns,
> > > /* This mount is not fully visible if it's root directory
> > > * is not the root directory of the filesystem.
> > > */
> > > - if (mnt->mnt.mnt_root != mnt->mnt.mnt_sb->s_root)
> > > + if (!(sb->s_iflags & SB_I_USERNS_ALLOW_REVEALING) &&
> > > + mnt->mnt.mnt_root != mnt->mnt.mnt_sb->s_root)
> > > continue;
> > >
> > > /* A local view of the mount flags */
> > > @@ -6136,18 +6137,20 @@ static bool mnt_already_visible(struct mnt_namespace *ns,
> > > ((mnt_flags & MNT_ATIME_MASK) != (new_flags & MNT_ATIME_MASK)))
> > > continue;
> >
> > There are a few things that I find problematic here.
> >
> > Even before your change the mount flags of the first fully visible
> > procfs mount would be picked up. If the caller was unlucky they could
> > stumble upon the most restricted procfs mount in the mount namespace
> > rbtree. Leading to weird scenarios where a user cannot write to the
> > procfs instance they just mounted but could to another one that is also
> > in their namespace.
> >
> > The other thing is that with this change specifically:
> >
> > if (!(sb->s_iflags & SB_I_USERNS_ALLOW_REVEALING) &&
> > mnt->mnt.mnt_root != mnt->mnt.mnt_sb->s_root)
> >
> > we start caring about mount options of even partially exposed procfs
> > mounts. IOW, if someone had a bind-mount of e.g., /proc/pressure
> > somewhere that got inherited via CLONE_NEWNS then we suddenly take the
> > mount options of that into account for a new /proc/<pid>/* only instance.
> > I think we should continue caring only about procfs mounts that are
> > visible from their root.
> >
> > The other problem is that it is really annoying that we walk all
> > mounts in a mount namespace just to find procfs and sysfs mounts in
> > there. Currently a lot of workloads still do the CLONE_NEWNS dance
> > meaning they inherit all the crap from the host and then proceed to
> > set up their new rootfs. For busy container workloads that can be a lot.
> >
> > So let's just be honest about it and treat procfs and sysfs as the
> > snowflakes that they have become and record their instances in a
> > separate per mount namespace hlist as in the (untested) patch below [1].
> >
> > Also SB_I_USERNS_ALLOW_REVEALING seems unnecessary. The only time we
> > care about that flag is when we setup a new superblock. So this could
> > easily be a struct fs_context bitfield that just exists for the duration
> > of the creation of the new superblock and mount. So maybe pass that down
> > to mount_too_revealing() and further down into the actual helper.
> >
> > [1]:
> > From 4bbd41e7a3ef91667dd334f31b1b6bf8caec0599 Mon Sep 17 00:00:00 2001
> > From: Christian Brauner <brauner@kernel.org>
> > Date: Tue, 17 Feb 2026 12:02:34 +0100
> > Subject: [PATCH] namespace: record fully visible mounts in list
> >
> > Instead of wading through all the mounts in the mount namespace rbtree
> > to find fully visible procfs and sysfs mounts, be honest about them
> > being special cruft and record them in a separate per-mount namespace
> > list.
>
> If you rework this I would expect to take it for v7.3. It's a bit late
> now...
No problem. I understand. Sorry it took me so long to get back to you.
I was laid off from my job and had to look for a new one quickly.
I’ll be back soon to update this patch.
--
Rgrds, legion
^ permalink raw reply [flat|nested] 34+ messages in thread
* [PATCH v8 5/5] docs: proc: add documentation about relaxing visibility restrictions
2026-02-13 10:44 ` [PATCH v8 0/5] proc: subset=pid: Relax check of mount visibility Alexey Gladkov
` (3 preceding siblings ...)
2026-02-13 10:44 ` [PATCH v8 4/5] proc: Relax check of mount visibility Alexey Gladkov
@ 2026-02-13 10:44 ` Alexey Gladkov
4 siblings, 0 replies; 34+ messages in thread
From: Alexey Gladkov @ 2026-02-13 10:44 UTC (permalink / raw)
To: Christian Brauner, Dan Klishch
Cc: Al Viro, Eric W . Biederman, Kees Cook, linux-fsdevel,
linux-kernel
Signed-off-by: Alexey Gladkov <legion@kernel.org>
---
Documentation/filesystems/proc.rst | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
index c8864fcbdec7..3acf178c1202 100644
--- a/Documentation/filesystems/proc.rst
+++ b/Documentation/filesystems/proc.rst
@@ -2417,7 +2417,8 @@ so will give an `-EBUSY` error).
If user namespaces are in use, the kernel additionally checks the instances of
procfs available to the mounter and will not allow procfs to be mounted if:
- 1. This mount is not fully visible.
+ 1. This mount is not fully visible, unless the new procfs is going to be
+ mounted with the subset=pid option.
a. Its root directory is not the root directory of the filesystem.
b. Any file or non-empty procfs directory is hidden by another mount.
--
2.53.0
^ permalink raw reply related [flat|nested] 34+ messages in thread