[PATCH RFC v2 00/23] fs,kthread: start all kthreads in nullfs

public inbox for linux-fsdevel@vger.kernel.org

* [PATCH RFC v2 00/23] fs,kthread: start all kthreads in nullfs
@ 2026-03-05 23:30 Christian Brauner
  2026-03-05 23:30 ` [PATCH RFC v2 01/23] fs: notice when init abandons fs sharing Christian Brauner
                   ` (24 more replies)
  0 siblings, 25 replies; 40+ messages in thread
From: Christian Brauner @ 2026-03-05 23:30 UTC (permalink / raw)
  To: linux-fsdevel, Linus Torvalds
  Cc: linux-kernel, Alexander Viro, Jens Axboe, Jan Kara, Tejun Heo,
	Jann Horn, Christian Brauner

Summary:

* all kthreads are isolated in a separate SB_KERNMOUNT of nullfs.
  -> no lookup of anything else, no mounting on top of it, completely
  isolated.
* init has a separate fs_struct from all kthreads
* scoped_with_init_fs() allows a kthread to temporarily assume init's
  fs_struct for filesystem operations.

So this is a bit of a crazy series. When the kernel is started it
roughly goes like this:

init_task
==> create pid 1 (systemd etc.)
==> pid 2 (kthreadd)

After this point all kthreads and PID 1 share the same filesystem state.
That obviously already came up when we discussed pivot_root() as this
allows pivot_root() to rewrite the fs_struct of all kthreads.

This rewriting is really weird and mostly done so kthreads can use init's
filesystem state when they would like to. But this really should be
discouraged. The rewriting should also stop completely. I worked a bit
to get rid of it in a more fundamental way. Is it crazy? Yes. Is it
likely broken? Yes. Does it at least boot? Yes.

Instead of sharing fs_struct between kernel threads and pid 1, pid 1
gets a completely separate fs_struct. All kthreads continue sharing
init_fs as before and pid 1's fs_struct is isolated from the kthreads'
filesystem state. IOW, userspace init cannot affect the kthreads'
filesystem state anymore and kthreads cannot affect userspace's
filesystem state anymore - without explicit opt-in.

All kthreads are anchored in a kernel internal mount of nullfs that
cannot be mounted on and that cannot be used to follow other mounts.
It's a completely private mount that insulates kthreads.

This series makes performing mountains of filesystem work such as path
lookup and file opening and so on from kthreads hard - painfully so. I
think this is a benefit because it takes the idea of just offloading
_security sensitive_ operations to kthreads, such as running random
binaries or opening and creating files in init's filesystem state, and
makes it difficult... And imho it should be.

The only remaining kernel tasks that actually share init's filesystem
state are usermodehelpers - as they execute random binaries in the root
filesystem. Another concept we should really show the back of the shed.

This gives much stronger guarantees than what we have now. This also
makes path lookup from kthreads fail by default. IOW, it won't be
possible anymore to just look up random stuff in init's filesystem state
without explicitly opting in to that.

The places that need to perform lookup in init's filesystem state may
use scoped_with_init_fs() which will temporarily override the caller's
fs_struct with init's fs_struct.
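
As a rough userspace illustration of the scoped override described
above (not the kernel implementation): the sketch below models
current->fs as a plain global and relies on the GCC/Clang cleanup
attribute to restore it at scope exit. The names current_fs,
kthread_fs, struct fs_guard, and scoped_with_init_fs_sketch() are
hypothetical stand-ins chosen for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in; the kernel's struct fs_struct looks different. */
struct fs_struct {
	const char *root;	/* label for the root this context resolves against */
};

static struct fs_struct kthread_fs = { .root = "nullfs" };
static struct fs_struct userspace_init_fs = { .root = "/" };

/* Models current->fs of a kthread anchored in nullfs. */
static struct fs_struct *current_fs = &kthread_fs;

struct fs_guard {
	struct fs_struct *saved;	/* fs_struct to restore at scope exit */
	int done;			/* lets the for-loop body run exactly once */
};

/* Install init's fs_struct and remember the previous one. */
static struct fs_guard override_init_fs(void)
{
	struct fs_guard guard = { .saved = current_fs, .done = 0 };

	current_fs = &userspace_init_fs;
	return guard;
}

/* Cleanup handler: runs automatically when the guard leaves scope. */
static void revert_init_fs(struct fs_guard *guard)
{
	assert(current_fs == &userspace_init_fs);
	current_fs = guard->saved;
}

/* Run the following statement or block with init's fs_struct installed. */
#define scoped_with_init_fs_sketch()					\
	for (struct fs_guard guard_					\
		__attribute__((cleanup(revert_init_fs))) =		\
		override_init_fs(); !guard_.done; guard_.done = 1)

/* What a lookup would see without opting in. */
static const char *unscoped_root(void)
{
	return current_fs->root;
}

/* What a lookup sees after explicitly opting in. */
static const char *resolve_root(void)
{
	const char *root = NULL;

	scoped_with_init_fs_sketch()
		root = current_fs->root;
	return root;
}
```

In this model unscoped_root() sees the nullfs label, resolve_root()
sees init's root, and current_fs points at kthread_fs again once
resolve_root() has returned.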

We now also warn and notice when pid 1 simply stops sharing filesystem
state with us, i.e., abandons its userspace_init_fs.

On older kernels, if PID 1 unshared its filesystem state with us, the
kernel simply kept using the stale fs_struct state, implicitly pinning
anything that PID 1 had last used, even if PID 1 had moved on to some
completely different fs_struct state and had even unmounted the old
root.

This has hilarious consequences: Think continuing to dump coredump
state into an implicitly pinned directory somewhere. Calling random
binaries in the old rootfs via usermodehelpers.

Be aggressive about this: We simply reject operating on stale
fs_struct state by reverting userspace_init_fs to nullfs. Every kworker
that does lookups after this point will fail. Every usermodehelper call
will fail. This is a lot stronger but I wouldn't know what it means for
pid 1 to simply stop sharing its fs state with the kernel. Clearly it
wanted to separate so cut all ties.

I've gone through the kernel and looked at hopefully everything that
does path lookup from kthreads (workqueues, ...).

TL;DR:

==== PID 1 (systemd) ====

  root@localhost:~# stat --file-system /proc/1/root
    File: "/proc/1/root"
      ID: e3cb00dd533cd3d7 Namelen: 255     Type: ext2/ext3

  root@localhost:~# cat /proc/1/mountinfo | wc -l
  30

==== PID 2 (kthreadd) ====

  root@localhost:~# stat --file-system /proc/2/root
    File: "/proc/2/root"
      ID: 200000000 Namelen: 255     Type: nullfs

  root@localhost:~# cat /proc/2/mountinfo | wc -l
  0

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
Changes in v2:
- Remove LOOKUP_IN_INIT in favor of scoped_with_init_fs().
- Link to v1: https://patch.msgid.link/20260303-work-kthread-nullfs-v1-0-87e559b94375@kernel.org

---
Christian Brauner (23):
      fs: notice when init abandons fs sharing
      fs: add scoped_with_init_fs()
      rnbd: use scoped_with_init_fs() for block device open
      crypto: ccp: use scoped_with_init_fs() for SEV file access
      scsi: target: use scoped_with_init_fs() for ALUA metadata
      scsi: target: use scoped_with_init_fs() for APTPL metadata
      btrfs: use scoped_with_init_fs() for update_dev_time()
      coredump: use scoped_with_init_fs() for coredump path resolution
      fs: use scoped_with_init_fs() for kernel_read_file_from_path_initns()
      ksmbd: use scoped_with_init_fs() for share path resolution
      ksmbd: use scoped_with_init_fs() for filesystem info path lookup
      ksmbd: use scoped_with_init_fs() for VFS path operations
      initramfs: use scoped_with_init_fs() for rootfs unpacking
      af_unix: use scoped_with_init_fs() for coredump socket lookup
      fs: add real_fs to track task's actual fs_struct
      fs: make userspace_init_fs a dynamically-initialized pointer
      fs: stop sharing fs_struct between init_task and pid 1
      fs: add umh argument to struct kernel_clone_args
      fs: add kthread_mntns()
      devtmpfs: create private mount namespace
      nullfs: make nullfs multi-instance
      fs: start all kthreads in nullfs
      fs: stop rewriting kthread fs structs

 drivers/base/devtmpfs.c           |  2 +-
 drivers/block/rnbd/rnbd-srv.c     |  4 +-
 drivers/crypto/ccp/sev-dev.c      | 12 ++---
 drivers/target/target_core_alua.c |  6 ++-
 drivers/target/target_core_pr.c   |  4 +-
 fs/btrfs/volumes.c                | 11 ++++-
 fs/coredump.c                     | 11 ++---
 fs/fs_struct.c                    | 96 ++++++++++++++++++++++++++++++++++++++-
 fs/kernel_read_file.c             |  9 +---
 fs/namespace.c                    | 40 ++++++++++++++--
 fs/nullfs.c                       |  7 +--
 fs/smb/server/mgmt/share_config.c |  4 +-
 fs/smb/server/smb2pdu.c           |  4 +-
 fs/smb/server/vfs.c               | 14 ++++--
 include/linux/fs_struct.h         | 34 ++++++++++++++
 include/linux/init_task.h         |  1 +
 include/linux/mount.h             |  1 +
 include/linux/sched.h             |  1 +
 include/linux/sched/task.h        |  1 +
 init/init_task.c                  |  1 +
 init/initramfs.c                  | 12 +++--
 init/main.c                       | 10 +++-
 kernel/fork.c                     | 41 +++++++++++------
 net/unix/af_unix.c                | 17 +++----
 24 files changed, 266 insertions(+), 77 deletions(-)
---
base-commit: c107785c7e8dbabd1c18301a1c362544b5786282
change-id: 20260303-work-kthread-nullfs-875a837f4198

* [PATCH RFC v2 01/23] fs: notice when init abandons fs sharing
  2026-03-05 23:30 [PATCH RFC v2 00/23] fs,kthread: start all kthreads in nullfs Christian Brauner
@ 2026-03-05 23:30 ` Christian Brauner
  2026-03-10 16:03   ` Christian Brauner
  2026-03-05 23:30 ` [PATCH RFC v2 02/23] fs: add scoped_with_init_fs() Christian Brauner
                   ` (23 subsequent siblings)
  24 siblings, 1 reply; 40+ messages in thread
From: Christian Brauner @ 2026-03-05 23:30 UTC (permalink / raw)
  To: linux-fsdevel, Linus Torvalds
  Cc: linux-kernel, Alexander Viro, Jens Axboe, Jan Kara, Tejun Heo,
	Jann Horn, Christian Brauner

PID 1 may choose to stop sharing fs_struct state with us. Either via
unshare(CLONE_FS) or unshare(CLONE_NEWNS). Of course, PID 1 could have
chosen to create arbitrary process trees that all share fs_struct state
via CLONE_FS. This is a strong statement: We only care about PID 1 aka
the thread-group leader so subthread's fs_struct state doesn't matter.

PID 1 unsharing fs_struct state is a bug. PID 1 relies on various
kthreads to be able to perform work based on its fs_struct state.
Breaking that contract sucks for both sides. So just don't bother with
extra work for this. No sane init system should ever do this.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 fs/fs_struct.c            | 41 +++++++++++++++++++++++++++++++++++++++++
 include/linux/fs_struct.h |  2 ++
 kernel/fork.c             | 14 +++-----------
 3 files changed, 46 insertions(+), 11 deletions(-)

diff --git a/fs/fs_struct.c b/fs/fs_struct.c
index 394875d06fd6..3ff79fb894c1 100644
--- a/fs/fs_struct.c
+++ b/fs/fs_struct.c
@@ -147,6 +147,47 @@ int unshare_fs_struct(void)
 }
 EXPORT_SYMBOL_GPL(unshare_fs_struct);
 
+/*
+ * PID 1 may choose to stop sharing fs_struct state with us.
+ * Either via unshare(CLONE_FS) or unshare(CLONE_NEWNS). Of
+ * course, PID 1 could have chosen to create arbitrary process
+ * trees that all share fs_struct state via CLONE_FS. This is a
+ * strong statement: We only care about PID 1 aka the thread-group
+ * leader so subthread's fs_struct state doesn't matter.
+ *
+ * PID 1 unsharing fs_struct state is a bug. PID 1 relies on
+ * various kthreads to be able to perform work based on its
+ * fs_struct state. Breaking that contract sucks for both sides.
+ * So just don't bother with extra work for this. No sane init
+ * system should ever do this.
+ */
+static inline void nullfs_userspace_init(struct fs_struct *old_fs)
+{
+	if (likely(current->pid != 1))
+		return;
+	/* @old_fs may be dangling but for comparison it's fine */
+	if (old_fs != &init_fs)
+		return;
+	pr_warn("VFS: Pid 1 stopped sharing filesystem state\n");
+}
+
+struct fs_struct *switch_fs_struct(struct fs_struct *new_fs)
+{
+	struct fs_struct *fs;
+
+	fs = current->fs;
+	read_seqlock_excl(&fs->seq);
+	current->fs = new_fs;
+	if (--fs->users)
+		new_fs = NULL;
+	else
+		new_fs = fs;
+	read_sequnlock_excl(&fs->seq);
+
+	nullfs_userspace_init(fs);
+	return new_fs;
+}
+
 /* to be mentioned only in INIT_TASK */
 struct fs_struct init_fs = {
 	.users		= 1,
diff --git a/include/linux/fs_struct.h b/include/linux/fs_struct.h
index 0070764b790a..ade459383f92 100644
--- a/include/linux/fs_struct.h
+++ b/include/linux/fs_struct.h
@@ -40,6 +40,8 @@ static inline void get_fs_pwd(struct fs_struct *fs, struct path *pwd)
 	read_sequnlock_excl(&fs->seq);
 }
 
+struct fs_struct *switch_fs_struct(struct fs_struct *new_fs);
+
 extern bool current_chrooted(void);
 
 static inline int current_umask(void)
diff --git a/kernel/fork.c b/kernel/fork.c
index 65113a304518..583078c69bbd 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -3123,7 +3123,7 @@ static int unshare_fd(unsigned long unshare_flags, struct files_struct **new_fdp
  */
 int ksys_unshare(unsigned long unshare_flags)
 {
-	struct fs_struct *fs, *new_fs = NULL;
+	struct fs_struct *new_fs = NULL;
 	struct files_struct *new_fd = NULL;
 	struct cred *new_cred = NULL;
 	struct nsproxy *new_nsproxy = NULL;
@@ -3200,16 +3200,8 @@ int ksys_unshare(unsigned long unshare_flags)
 
 		task_lock(current);
 
-		if (new_fs) {
-			fs = current->fs;
-			read_seqlock_excl(&fs->seq);
-			current->fs = new_fs;
-			if (--fs->users)
-				new_fs = NULL;
-			else
-				new_fs = fs;
-			read_sequnlock_excl(&fs->seq);
-		}
+		if (new_fs)
+			new_fs = switch_fs_struct(new_fs);
 
 		if (new_fd)
 			swap(current->files, new_fd);

-- 
2.47.3


* Re: [PATCH RFC v2 01/23] fs: notice when init abandons fs sharing
  2026-03-05 23:30 ` [PATCH RFC v2 01/23] fs: notice when init abandons fs sharing Christian Brauner
@ 2026-03-10 16:03   ` Christian Brauner
  0 siblings, 0 replies; 40+ messages in thread
From: Christian Brauner @ 2026-03-10 16:03 UTC (permalink / raw)
  To: linux-fsdevel, Linus Torvalds
  Cc: linux-kernel, Alexander Viro, Jens Axboe, Jan Kara, Tejun Heo,
	Jann Horn

> +struct fs_struct *switch_fs_struct(struct fs_struct *new_fs)
> +{
> +	struct fs_struct *fs;
> +
> +	fs = current->fs;
> +	read_seqlock_excl(&fs->seq);
> +	current->fs = new_fs;
> +	if (--fs->users)
> +		new_fs = NULL;
> +	else
> +		new_fs = fs;
> +	read_sequnlock_excl(&fs->seq);
> +
> +	nullfs_userspace_init(fs);

This is called under task_lock() and nullfs_userspace_init() may sleep.
Oversight on my part.
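
For reference, the reference-count hand-off in the quoted hunk can be
modeled in userspace as below. This is only a sketch: locking
(task_lock(), fs->seq) and the PID 1 notification are left out, and
current_fs, init_fs_model, and private_fs are hypothetical fixtures for
the example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in carrying only the user count. */
struct fs_struct {
	int users;
};

/* Shared by two tasks in this model, like init_fs before unsharing. */
static struct fs_struct init_fs_model = { .users = 2 };
/* Freshly copied fs_struct owned by a single task. */
static struct fs_struct private_fs = { .users = 1 };

/* Models current->fs. */
static struct fs_struct *current_fs = &init_fs_model;

/*
 * Install @new_fs and drop one reference on the old fs_struct.
 * Returns the old fs_struct if that was the last reference (the caller
 * is then responsible for freeing it), otherwise NULL.
 */
static struct fs_struct *switch_fs_struct_sketch(struct fs_struct *new_fs)
{
	struct fs_struct *old = current_fs;

	current_fs = new_fs;
	if (--old->users)
		return NULL;
	return old;
}
```

Switching away from a shared fs_struct returns NULL because other users
remain; switching away from a private one hands it back to the caller.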

* [PATCH RFC v2 02/23] fs: add scoped_with_init_fs()
  2026-03-05 23:30 [PATCH RFC v2 00/23] fs,kthread: start all kthreads in nullfs Christian Brauner
  2026-03-05 23:30 ` [PATCH RFC v2 01/23] fs: notice when init abandons fs sharing Christian Brauner
@ 2026-03-05 23:30 ` Christian Brauner
  2026-03-09 15:19   ` Jann Horn
  2026-03-05 23:30 ` [PATCH RFC v2 03/23] rnbd: use scoped_with_init_fs() for block device open Christian Brauner
                   ` (22 subsequent siblings)
  24 siblings, 1 reply; 40+ messages in thread
From: Christian Brauner @ 2026-03-05 23:30 UTC (permalink / raw)
  To: linux-fsdevel, Linus Torvalds
  Cc: linux-kernel, Alexander Viro, Jens Axboe, Jan Kara, Tejun Heo,
	Jann Horn, Christian Brauner

Similar to scoped_with_kernel_creds(), allow a temporary override of
current->fs to serve the few places where lookup is performed from
kthread context or needs init's filesystem state.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 include/linux/fs_struct.h | 31 +++++++++++++++++++++++++++++++
 1 file changed, 31 insertions(+)

diff --git a/include/linux/fs_struct.h b/include/linux/fs_struct.h
index ade459383f92..ff525a1e45d4 100644
--- a/include/linux/fs_struct.h
+++ b/include/linux/fs_struct.h
@@ -6,6 +6,7 @@
 #include <linux/path.h>
 #include <linux/spinlock.h>
 #include <linux/seqlock.h>
+#include <linux/vfsdebug.h>
 
 struct fs_struct {
 	int users;
@@ -49,4 +50,34 @@ static inline int current_umask(void)
 	return current->fs->umask;
 }
 
+/*
+ * Temporarily use userspace_init_fs for path resolution in kthreads.
+ * Callers should use scoped_with_init_fs() which automatically
+ * restores the original fs_struct at scope exit.
+ */
+static inline struct fs_struct *__override_init_fs(void)
+{
+	struct fs_struct *fs;
+
+	fs = current->fs;
+	smp_store_release(&current->fs, current->fs);
+	return fs;
+}
+
+static inline void __revert_init_fs(struct fs_struct *revert_fs)
+{
+	VFS_WARN_ON_ONCE(current->fs != current->fs);
+	smp_store_release(&current->fs, revert_fs);
+}
+
+DEFINE_CLASS(__override_init_fs,
+	     struct fs_struct *,
+	     __revert_init_fs(_T),
+	     __override_init_fs(), void)
+
+#define scoped_with_init_fs() \
+	scoped_class(__override_init_fs, __UNIQUE_ID(label))
+
+void __init init_userspace_fs(void);
+
 #endif /* _LINUX_FS_STRUCT_H */

-- 
2.47.3


* Re: [PATCH RFC v2 02/23] fs: add scoped_with_init_fs()
  2026-03-05 23:30 ` [PATCH RFC v2 02/23] fs: add scoped_with_init_fs() Christian Brauner
@ 2026-03-09 15:19   ` Jann Horn
  2026-03-10 11:30     ` Christian Brauner
  0 siblings, 1 reply; 40+ messages in thread
From: Jann Horn @ 2026-03-09 15:19 UTC (permalink / raw)
  To: Christian Brauner
  Cc: linux-fsdevel, Linus Torvalds, linux-kernel, Alexander Viro,
	Jens Axboe, Jan Kara, Tejun Heo

On Fri, Mar 6, 2026 at 12:30 AM Christian Brauner <brauner@kernel.org> wrote:
> Similar to scoped_with_kernel_creds() allow a temporary override of
> current->fs to serve the few places where lookup is performed from
> kthread context or needs init's filesystem state.
>
> Signed-off-by: Christian Brauner <brauner@kernel.org>
[...]
> +static inline struct fs_struct *__override_init_fs(void)
> +{
> +       struct fs_struct *fs;
> +
> +       fs = current->fs;
> +       smp_store_release(&current->fs, current->fs);
> +       return fs;
> +}

See my comments on patch 15 - I think you'll have to reorder this
patch after the introduction of task_struct::real_fs and changing
procfs to access task_struct::real_fs.

I'm not sure what the smp_store_release() is supposed to pair with; if
you get rid of remote access to task_struct::fs as I suggest on patch
15, it could be a plain store, otherwise it would have to happen under
task_lock().
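
To illustrate what "pairing" means here: a release store only provides
ordering if some reader performs a matching acquire load of the same
location. A userspace C11 analogue, offered as a sketch rather than
kernel code (smp_store_release() and smp_load_acquire() are the kernel
counterparts); published_fs, example_fs, and the helper names are
hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in with a single payload field. */
struct fs_struct {
	int umask;
};

static struct fs_struct example_fs;

/* Pointer that another thread could read concurrently. */
static _Atomic(struct fs_struct *) published_fs;

/* Writer: initialise the object, then publish it with release semantics. */
static void publish_fs(struct fs_struct *fs, int umask)
{
	fs->umask = umask;	/* plain write, ordered before the store below */
	atomic_store_explicit(&published_fs, fs, memory_order_release);
}

/*
 * Reader: the acquire load is the matching half of the pair. It
 * guarantees that a reader which observes the pointer also observes the
 * umask written before the release store. Returns -1 if nothing has
 * been published yet.
 */
static int read_published_umask(void)
{
	struct fs_struct *fs;

	fs = atomic_load_explicit(&published_fs, memory_order_acquire);
	return fs ? fs->umask : -1;
}
```

If no reader ever loads the pointer from another thread, the release
ordering buys nothing and a plain store is enough.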

* Re: [PATCH RFC v2 02/23] fs: add scoped_with_init_fs()
  2026-03-09 15:19   ` Jann Horn
@ 2026-03-10 11:30     ` Christian Brauner
  0 siblings, 0 replies; 40+ messages in thread
From: Christian Brauner @ 2026-03-10 11:30 UTC (permalink / raw)
  To: Jann Horn
  Cc: linux-fsdevel, Linus Torvalds, linux-kernel, Alexander Viro,
	Jens Axboe, Jan Kara, Tejun Heo

On Mon, Mar 09, 2026 at 04:19:36PM +0100, Jann Horn wrote:
> On Fri, Mar 6, 2026 at 12:30 AM Christian Brauner <brauner@kernel.org> wrote:
> > Similar to scoped_with_kernel_creds() allow a temporary override of
> > current->fs to serve the few places where lookup is performed from
> > kthread context or needs init's filesystem state.
> >
> > Signed-off-by: Christian Brauner <brauner@kernel.org>
> [...]
> > +static inline struct fs_struct *__override_init_fs(void)
> > +{
> > +       struct fs_struct *fs;
> > +
> > +       fs = current->fs;
> > +       smp_store_release(&current->fs, current->fs);
> > +       return fs;
> > +}
> 
> See my comments on patch 15 - I think you'll have to reorder this
> patch after the introduction of task_struct::real_fs and changing
> procfs to access task_struct::real_fs.
> 
> I'm not sure what the smp_store_release() is supposed to pair with; if
> you get rid of remote access to task_struct::fs as I suggest on patch
> 15, it could be a plain store, otherwise it would have to happen under
> task_lock().

Yes, that's right. I'll remove that.

* [PATCH RFC v2 03/23] rnbd: use scoped_with_init_fs() for block device open
  2026-03-05 23:30 [PATCH RFC v2 00/23] fs,kthread: start all kthreads in nullfs Christian Brauner
  2026-03-05 23:30 ` [PATCH RFC v2 01/23] fs: notice when init abandons fs sharing Christian Brauner
  2026-03-05 23:30 ` [PATCH RFC v2 02/23] fs: add scoped_with_init_fs() Christian Brauner
@ 2026-03-05 23:30 ` Christian Brauner
  2026-03-05 23:30 ` [PATCH RFC v2 04/23] crypto: ccp: use scoped_with_init_fs() for SEV file access Christian Brauner
                   ` (21 subsequent siblings)
  24 siblings, 0 replies; 40+ messages in thread
From: Christian Brauner @ 2026-03-05 23:30 UTC (permalink / raw)
  To: linux-fsdevel, Linus Torvalds
  Cc: linux-kernel, Alexander Viro, Jens Axboe, Jan Kara, Tejun Heo,
	Jann Horn, Christian Brauner

Use scoped_with_init_fs() to temporarily override current->fs for
the bdev_file_open_by_path() call so the path lookup happens in
init's filesystem context.

process_msg_open() ← rnbd_srv_rdma_ev() ← RDMA completion callback ←
ib_cq_poll_work() ← kworker (InfiniBand completion workqueue)

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 drivers/block/rnbd/rnbd-srv.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/block/rnbd/rnbd-srv.c b/drivers/block/rnbd/rnbd-srv.c
index 10e8c438bb43..79c9a5fb418f 100644
--- a/drivers/block/rnbd/rnbd-srv.c
+++ b/drivers/block/rnbd/rnbd-srv.c
@@ -11,6 +11,7 @@
 
 #include <linux/module.h>
 #include <linux/blkdev.h>
+#include <linux/fs_struct.h>
 
 #include "rnbd-srv.h"
 #include "rnbd-srv-trace.h"
@@ -734,7 +735,8 @@ static int process_msg_open(struct rnbd_srv_session *srv_sess,
 		goto reject;
 	}
 
-	bdev_file = bdev_file_open_by_path(full_path, open_flags, NULL, NULL);
+	scoped_with_init_fs()
+		bdev_file = bdev_file_open_by_path(full_path, open_flags, NULL, NULL);
 	if (IS_ERR(bdev_file)) {
 		ret = PTR_ERR(bdev_file);
 		pr_err("Opening device '%s' on session %s failed, failed to open the block device, err: %pe\n",

-- 
2.47.3


* [PATCH RFC v2 04/23] crypto: ccp: use scoped_with_init_fs() for SEV file access
  2026-03-05 23:30 [PATCH RFC v2 00/23] fs,kthread: start all kthreads in nullfs Christian Brauner
                   ` (2 preceding siblings ...)
  2026-03-05 23:30 ` [PATCH RFC v2 03/23] rnbd: use scoped_with_init_fs() for block device open Christian Brauner
@ 2026-03-05 23:30 ` Christian Brauner
  2026-03-09 15:37   ` Jann Horn
  2026-03-05 23:30 ` [PATCH RFC v2 05/23] scsi: target: use scoped_with_init_fs() for ALUA metadata Christian Brauner
                   ` (20 subsequent siblings)
  24 siblings, 1 reply; 40+ messages in thread
From: Christian Brauner @ 2026-03-05 23:30 UTC (permalink / raw)
  To: linux-fsdevel, Linus Torvalds
  Cc: linux-kernel, Alexander Viro, Jens Axboe, Jan Kara, Tejun Heo,
	Jann Horn, Christian Brauner

Replace the manual init_task root retrieval with scoped_with_init_fs()
to temporarily override current->fs. This allows using the simpler
filp_open() instead of the get_fs_root() + file_open_root() pattern.

open_file_as_root() ← sev_read_init_ex_file() / sev_write_init_ex_file()
← sev_platform_init() ← __sev_guest_init() ← KVM ioctl — user process context

Needs init's root because the SEV init_ex file path should resolve
against the real root, not a KVM user's chroot.
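
The function returns from inside two nested scopes, so both overrides
must be undone on the way out, innermost first. A userspace sketch of
that behaviour, assuming the GCC/Clang cleanup attribute; struct guard,
scoped(), and the GUARD_* ids are hypothetical names and nothing here
touches real credentials or fs state.

```c
#include <assert.h>

enum { GUARD_FS = 1, GUARD_CREDS = 2 };

static int active;		/* overrides currently in effect */
static int revert_order[2];	/* guard ids in the order they were reverted */
static int reverted;		/* number of entries in revert_order */

struct guard {
	int id;
	int done;	/* lets the for-loop body run exactly once */
};

static struct guard enter(int id)
{
	struct guard g = { .id = id, .done = 0 };

	active++;
	return g;
}

/* Cleanup handler: records which override is being undone. */
static void leave(struct guard *g)
{
	revert_order[reverted++] = g->id;
	active--;
}

#define scoped(which)							\
	for (struct guard g_ __attribute__((cleanup(leave))) =		\
		enter(which); !g_.done; g_.done = 1)

/* Returns how many overrides were active when the value was computed. */
static int open_as_root_sketch(void)
{
	reverted = 0;
	scoped(GUARD_FS) {
		scoped(GUARD_CREDS)
			return active;	/* evaluated first, then both guards unwind */
	}
	return -1;	/* not reached; this simplified loop hides that from the compiler */
}
```

The return value is computed while both overrides are in effect, then
the creds guard is reverted, then the fs guard.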

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 drivers/crypto/ccp/sev-dev.c | 12 ++++--------
 1 file changed, 4 insertions(+), 8 deletions(-)

diff --git a/drivers/crypto/ccp/sev-dev.c b/drivers/crypto/ccp/sev-dev.c
index 096f993974d1..4320054da0f6 100644
--- a/drivers/crypto/ccp/sev-dev.c
+++ b/drivers/crypto/ccp/sev-dev.c
@@ -260,20 +260,16 @@ static int sev_cmd_buffer_len(int cmd)
 
 static struct file *open_file_as_root(const char *filename, int flags, umode_t mode)
 {
-	struct path root __free(path_put) = {};
-
-	task_lock(&init_task);
-	get_fs_root(init_task.fs, &root);
-	task_unlock(&init_task);
-
 	CLASS(prepare_creds, cred)();
 	if (!cred)
 		return ERR_PTR(-ENOMEM);
 
 	cred->fsuid = GLOBAL_ROOT_UID;
 
-	scoped_with_creds(cred)
-		return file_open_root(&root, filename, flags, mode);
+	scoped_with_init_fs() {
+		scoped_with_creds(cred)
+			return filp_open(filename, flags, mode);
+	}
 }
 
 static int sev_read_init_ex_file(void)

-- 
2.47.3


* Re: [PATCH RFC v2 04/23] crypto: ccp: use scoped_with_init_fs() for SEV file access
  2026-03-05 23:30 ` [PATCH RFC v2 04/23] crypto: ccp: use scoped_with_init_fs() for SEV file access Christian Brauner
@ 2026-03-09 15:37   ` Jann Horn
  2026-03-10 11:33     ` Christian Brauner
  0 siblings, 1 reply; 40+ messages in thread
From: Jann Horn @ 2026-03-09 15:37 UTC (permalink / raw)
  To: Christian Brauner
  Cc: linux-fsdevel, Linus Torvalds, linux-kernel, Alexander Viro,
	Jens Axboe, Jan Kara, Tejun Heo

On Fri, Mar 6, 2026 at 12:30 AM Christian Brauner <brauner@kernel.org> wrote:
> Replace the manual init_task root retrieval with scoped_with_init_fs()
> to temporarily override current->fs. This allows using the simpler
> filp_open() instead of the init_root() + file_open_root() pattern.
>
> open_file_as_root() ← sev_read_init_ex_file() / sev_write_init_ex_file()
> ← sev_platform_init() ← __sev_guest_init() ← KVM ioctl — user process context
>
> Needs init's root because the SEV init_ex file path should resolve
> against the real root, not a KVM user's chroot.
>
> Signed-off-by: Christian Brauner <brauner@kernel.org>
> ---
>  drivers/crypto/ccp/sev-dev.c | 12 ++++--------
>  1 file changed, 4 insertions(+), 8 deletions(-)
>
> diff --git a/drivers/crypto/ccp/sev-dev.c b/drivers/crypto/ccp/sev-dev.c
> index 096f993974d1..4320054da0f6 100644
> --- a/drivers/crypto/ccp/sev-dev.c
> +++ b/drivers/crypto/ccp/sev-dev.c
> @@ -260,20 +260,16 @@ static int sev_cmd_buffer_len(int cmd)
>
>  static struct file *open_file_as_root(const char *filename, int flags, umode_t mode)
>  {
> -       struct path root __free(path_put) = {};
> -
> -       task_lock(&init_task);
> -       get_fs_root(init_task.fs, &root);
> -       task_unlock(&init_task);
> -
>         CLASS(prepare_creds, cred)();
>         if (!cred)
>                 return ERR_PTR(-ENOMEM);
>
>         cred->fsuid = GLOBAL_ROOT_UID;
>
> -       scoped_with_creds(cred)
> -               return file_open_root(&root, filename, flags, mode);
> +       scoped_with_init_fs() {
> +               scoped_with_creds(cred)
> +                       return filp_open(filename, flags, mode);
> +       }

This patch, along with the others that start using
scoped_with_init_fs, should probably go closer to the end of the
series? As-is, if someone bisects to just after this patch, SEV will
be in a broken state where it wrongly looks up a file from the process
root.

* Re: [PATCH RFC v2 04/23] crypto: ccp: use scoped_with_init_fs() for SEV file access
  2026-03-09 15:37   ` Jann Horn
@ 2026-03-10 11:33     ` Christian Brauner
  0 siblings, 0 replies; 40+ messages in thread
From: Christian Brauner @ 2026-03-10 11:33 UTC (permalink / raw)
  To: Jann Horn
  Cc: linux-fsdevel, Linus Torvalds, linux-kernel, Alexander Viro,
	Jens Axboe, Jan Kara, Tejun Heo

On Mon, Mar 09, 2026 at 04:37:44PM +0100, Jann Horn wrote:
> On Fri, Mar 6, 2026 at 12:30 AM Christian Brauner <brauner@kernel.org> wrote:
> > Replace the manual init_task root retrieval with scoped_with_init_fs()
> > to temporarily override current->fs. This allows using the simpler
> > filp_open() instead of the init_root() + file_open_root() pattern.
> >
> > open_file_as_root() ← sev_read_init_ex_file() / sev_write_init_ex_file()
> > ← sev_platform_init() ← __sev_guest_init() ← KVM ioctl — user process context
> >
> > Needs init's root because the SEV init_ex file path should resolve
> > against the real root, not a KVM user's chroot.
> >
> > Signed-off-by: Christian Brauner <brauner@kernel.org>
> > ---
> >  drivers/crypto/ccp/sev-dev.c | 12 ++++--------
> >  1 file changed, 4 insertions(+), 8 deletions(-)
> >
> > diff --git a/drivers/crypto/ccp/sev-dev.c b/drivers/crypto/ccp/sev-dev.c
> > index 096f993974d1..4320054da0f6 100644
> > --- a/drivers/crypto/ccp/sev-dev.c
> > +++ b/drivers/crypto/ccp/sev-dev.c
> > @@ -260,20 +260,16 @@ static int sev_cmd_buffer_len(int cmd)
> >
> >  static struct file *open_file_as_root(const char *filename, int flags, umode_t mode)
> >  {
> > -       struct path root __free(path_put) = {};
> > -
> > -       task_lock(&init_task);
> > -       get_fs_root(init_task.fs, &root);
> > -       task_unlock(&init_task);
> > -
> >         CLASS(prepare_creds, cred)();
> >         if (!cred)
> >                 return ERR_PTR(-ENOMEM);
> >
> >         cred->fsuid = GLOBAL_ROOT_UID;
> >
> > -       scoped_with_creds(cred)
> > -               return file_open_root(&root, filename, flags, mode);
> > +       scoped_with_init_fs() {
> > +               scoped_with_creds(cred)
> > +                       return filp_open(filename, flags, mode);
> > +       }
> 
> This patch, along with the others that start using
> scoped_with_init_fs, should probably go closer to the end of the
> series? As-is, if someone bisects to just after this patch, SEV will
> be in a broken state where it wrongly looks up a file from the process
> root.

Oh yeah, that's dumb. I think the fix is to simply grab init_task.fs and
use that until the series is done. Then it's 1:1 for all callers. Thanks
for spotting that!

* [PATCH RFC v2 05/23] scsi: target: use scoped_with_init_fs() for ALUA metadata
  2026-03-05 23:30 [PATCH RFC v2 00/23] fs,kthread: start all kthreads in nullfs Christian Brauner
                   ` (3 preceding siblings ...)
  2026-03-05 23:30 ` [PATCH RFC v2 04/23] crypto: ccp: use scoped_with_init_fs() for SEV file access Christian Brauner
@ 2026-03-05 23:30 ` Christian Brauner
  2026-03-05 23:30 ` [PATCH RFC v2 06/23] scsi: target: use scoped_with_init_fs() for APTPL metadata Christian Brauner
                   ` (19 subsequent siblings)
  24 siblings, 0 replies; 40+ messages in thread
From: Christian Brauner @ 2026-03-05 23:30 UTC (permalink / raw)
  To: linux-fsdevel, Linus Torvalds
  Cc: linux-kernel, Alexander Viro, Jens Axboe, Jan Kara, Tejun Heo,
	Jann Horn, Christian Brauner

Use scoped_with_init_fs() to temporarily override current->fs for
the filp_open() call in core_alua_write_tpg_metadata() so the
path lookup happens in init's filesystem context.

core_alua_write_tpg_metadata() ← core_alua_update_tpg_primary_metadata()
← core_alua_do_transition_tg_pt() ← target_queued_submit_work() ←
kworker (target submission workqueue)

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 drivers/target/target_core_alua.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/drivers/target/target_core_alua.c b/drivers/target/target_core_alua.c
index 10250aca5a81..fde88642a43a 100644
--- a/drivers/target/target_core_alua.c
+++ b/drivers/target/target_core_alua.c
@@ -18,6 +18,7 @@
 #include <linux/fcntl.h>
 #include <linux/file.h>
 #include <linux/fs.h>
+#include <linux/fs_struct.h>
 #include <scsi/scsi_proto.h>
 #include <linux/unaligned.h>
 
@@ -856,10 +857,13 @@ static int core_alua_write_tpg_metadata(
 	unsigned char *md_buf,
 	u32 md_buf_len)
 {
-	struct file *file = filp_open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
+	struct file *file;
 	loff_t pos = 0;
 	int ret;
 
+	scoped_with_init_fs()
+		file = filp_open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
+
 	if (IS_ERR(file)) {
 		pr_err("filp_open(%s) for ALUA metadata failed\n", path);
 		return -ENODEV;

-- 
2.47.3


* [PATCH RFC v2 06/23] scsi: target: use scoped_with_init_fs() for APTPL metadata
  2026-03-05 23:30 [PATCH RFC v2 00/23] fs,kthread: start all kthreads in nullfs Christian Brauner
                   ` (4 preceding siblings ...)
  2026-03-05 23:30 ` [PATCH RFC v2 05/23] scsi: target: use scoped_with_init_fs() for ALUA metadata Christian Brauner
@ 2026-03-05 23:30 ` Christian Brauner
  2026-03-05 23:30 ` [PATCH RFC v2 07/23] btrfs: use scoped_with_init_fs() for update_dev_time() Christian Brauner
                   ` (18 subsequent siblings)
  24 siblings, 0 replies; 40+ messages in thread
From: Christian Brauner @ 2026-03-05 23:30 UTC (permalink / raw)
  To: linux-fsdevel, Linus Torvalds
  Cc: linux-kernel, Alexander Viro, Jens Axboe, Jan Kara, Tejun Heo,
	Jann Horn, Christian Brauner

Use scoped_with_init_fs() to temporarily override current->fs for
the filp_open() call in __core_scsi3_write_aptpl_to_file() so the
path lookup happens in init's filesystem context.

__core_scsi3_write_aptpl_to_file() ← core_scsi3_update_and_write_aptpl()
← PR command handlers ← target_queued_submit_work() ← kworker

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 drivers/target/target_core_pr.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/target/target_core_pr.c b/drivers/target/target_core_pr.c
index f88e63aefcd8..2a030f119b24 100644
--- a/drivers/target/target_core_pr.c
+++ b/drivers/target/target_core_pr.c
@@ -18,6 +18,7 @@
 #include <linux/file.h>
 #include <linux/fcntl.h>
 #include <linux/fs.h>
+#include <linux/fs_struct.h>
 #include <scsi/scsi_proto.h>
 #include <linux/unaligned.h>
 
@@ -1969,7 +1970,8 @@ static int __core_scsi3_write_aptpl_to_file(
 	if (!path)
 		return -ENOMEM;
 
-	file = filp_open(path, flags, 0600);
+	scoped_with_init_fs()
+		file = filp_open(path, flags, 0600);
 	if (IS_ERR(file)) {
 		pr_err("filp_open(%s) for APTPL metadata"
 			" failed\n", path);

-- 
2.47.3


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH RFC v2 07/23] btrfs: use scoped_with_init_fs() for update_dev_time()
  2026-03-05 23:30 [PATCH RFC v2 00/23] fs,kthread: start all kthreads in nullfs Christian Brauner
                   ` (5 preceding siblings ...)
  2026-03-05 23:30 ` [PATCH RFC v2 06/23] scsi: target: use scoped_with_init_fs() for APTPL metadata Christian Brauner
@ 2026-03-05 23:30 ` Christian Brauner
  2026-03-05 23:30 ` [PATCH RFC v2 08/23] coredump: use scoped_with_init_fs() for coredump path resolution Christian Brauner
                   ` (17 subsequent siblings)
  24 siblings, 0 replies; 40+ messages in thread
From: Christian Brauner @ 2026-03-05 23:30 UTC (permalink / raw)
  To: linux-fsdevel, Linus Torvalds
  Cc: linux-kernel, Alexander Viro, Jens Axboe, Jan Kara, Tejun Heo,
	Jann Horn, Christian Brauner

update_dev_time() can be called from both kthread and process context.
Use scoped_with_init_fs() to temporarily override current->fs for
the kern_path() call when running in kthread context so the path
lookup happens in init's filesystem context.

update_dev_time() ← btrfs_scratch_superblocks() ←
btrfs_dev_replace_finishing() ← btrfs_dev_replace_kthread()
← kthread (kthread_run)

Also called from ioctl (user process).

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 fs/btrfs/volumes.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 648bb09fc416..b42e93c8e5b1 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -12,6 +12,7 @@
 #include <linux/uuid.h>
 #include <linux/list_sort.h>
 #include <linux/namei.h>
+#include <linux/fs_struct.h>
 #include "misc.h"
 #include "disk-io.h"
 #include "extent-tree.h"
@@ -2119,8 +2120,16 @@ static int btrfs_add_dev_item(struct btrfs_trans_handle *trans,
 static void update_dev_time(const char *device_path)
 {
 	struct path path;
+	int err;
 
-	if (!kern_path(device_path, LOOKUP_FOLLOW, &path)) {
+	if (tsk_is_kthread(current)) {
+		scoped_with_init_fs()
+			err = kern_path(device_path, LOOKUP_FOLLOW, &path);
+	} else {
+		err = kern_path(device_path, LOOKUP_FOLLOW, &path);
+	}
+
+	if (!err) {
 		vfs_utimes(&path, NULL);
 		path_put(&path);
 	}

-- 
2.47.3


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH RFC v2 08/23] coredump: use scoped_with_init_fs() for coredump path resolution
  2026-03-05 23:30 [PATCH RFC v2 00/23] fs,kthread: start all kthreads in nullfs Christian Brauner
                   ` (6 preceding siblings ...)
  2026-03-05 23:30 ` [PATCH RFC v2 07/23] btrfs: use scoped_with_init_fs() for update_dev_time() Christian Brauner
@ 2026-03-05 23:30 ` Christian Brauner
  2026-03-05 23:30 ` [PATCH RFC v2 09/23] fs: use scoped_with_init_fs() for kernel_read_file_from_path_initns() Christian Brauner
                   ` (16 subsequent siblings)
  24 siblings, 0 replies; 40+ messages in thread
From: Christian Brauner @ 2026-03-05 23:30 UTC (permalink / raw)
  To: linux-fsdevel, Linus Torvalds
  Cc: linux-kernel, Alexander Viro, Jens Axboe, Jan Kara, Tejun Heo,
	Jann Horn, Christian Brauner

Use scoped_with_init_fs() to temporarily override current->fs for
the filp_open() call so the coredump path lookup happens in init's
filesystem context. This replaces the init_root() + file_open_root()
pattern with the simpler scoped override.

coredump_file() ← do_coredump() ← vfs_coredump() ← get_signal() — runs
as the crashing userspace process

Uses init's root to prevent a chrooted/user-namespaced process from
controlling where suid coredumps land. Not a kthread, but intentionally
needs init's fs for security.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 fs/coredump.c | 11 +++--------
 1 file changed, 3 insertions(+), 8 deletions(-)

diff --git a/fs/coredump.c b/fs/coredump.c
index 29df8aa19e2e..7428349f10bf 100644
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -919,15 +919,10 @@ static bool coredump_file(struct core_name *cn, struct coredump_params *cprm,
 		 * with a fully qualified path" rule is to control where
 		 * coredumps may be placed using root privileges,
 		 * current->fs->root must not be used. Instead, use the
-		 * root directory of init_task.
+		 * root directory of PID 1.
 		 */
-		struct path root;
-
-		task_lock(&init_task);
-		get_fs_root(init_task.fs, &root);
-		task_unlock(&init_task);
-		file = file_open_root(&root, cn->corename, open_flags, 0600);
-		path_put(&root);
+		scoped_with_init_fs()
+			file = filp_open(cn->corename, open_flags, 0600);
 	} else {
 		file = filp_open(cn->corename, open_flags, 0600);
 	}

-- 
2.47.3


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH RFC v2 09/23] fs: use scoped_with_init_fs() for kernel_read_file_from_path_initns()
  2026-03-05 23:30 [PATCH RFC v2 00/23] fs,kthread: start all kthreads in nullfs Christian Brauner
                   ` (7 preceding siblings ...)
  2026-03-05 23:30 ` [PATCH RFC v2 08/23] coredump: use scoped_with_init_fs() for coredump path resolution Christian Brauner
@ 2026-03-05 23:30 ` Christian Brauner
  2026-03-05 23:30 ` [PATCH RFC v2 10/23] ksmbd: use scoped_with_init_fs() for share path resolution Christian Brauner
                   ` (15 subsequent siblings)
  24 siblings, 0 replies; 40+ messages in thread
From: Christian Brauner @ 2026-03-05 23:30 UTC (permalink / raw)
  To: linux-fsdevel, Linus Torvalds
  Cc: linux-kernel, Alexander Viro, Jens Axboe, Jan Kara, Tejun Heo,
	Jann Horn, Christian Brauner

Replace the manual init_task root retrieval with scoped_with_init_fs()
to temporarily override current->fs. This allows using the simpler
filp_open() instead of the init_root() + file_open_root() pattern.

kernel_read_file_from_path_initns() ← fw_get_filesystem_firmware() ←
_request_firmware() ← request_firmware_work_func() ← kworker (async
firmware loading)

Also called synchronously from request_firmware() which can be user or
kthread context.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 fs/kernel_read_file.c | 9 ++-------
 1 file changed, 2 insertions(+), 7 deletions(-)

diff --git a/fs/kernel_read_file.c b/fs/kernel_read_file.c
index de32c95d823d..9c2ba9240083 100644
--- a/fs/kernel_read_file.c
+++ b/fs/kernel_read_file.c
@@ -150,18 +150,13 @@ ssize_t kernel_read_file_from_path_initns(const char *path, loff_t offset,
 					  enum kernel_read_file_id id)
 {
 	struct file *file;
-	struct path root;
 	ssize_t ret;
 
 	if (!path || !*path)
 		return -EINVAL;
 
-	task_lock(&init_task);
-	get_fs_root(init_task.fs, &root);
-	task_unlock(&init_task);
-
-	file = file_open_root(&root, path, O_RDONLY, 0);
-	path_put(&root);
+	scoped_with_init_fs()
+		file = filp_open(path, O_RDONLY, 0);
 	if (IS_ERR(file))
 		return PTR_ERR(file);
 

-- 
2.47.3


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH RFC v2 10/23] ksmbd: use scoped_with_init_fs() for share path resolution
  2026-03-05 23:30 [PATCH RFC v2 00/23] fs,kthread: start all kthreads in nullfs Christian Brauner
                   ` (8 preceding siblings ...)
  2026-03-05 23:30 ` [PATCH RFC v2 09/23] fs: use scoped_with_init_fs() for kernel_read_file_from_path_initns() Christian Brauner
@ 2026-03-05 23:30 ` Christian Brauner
  2026-03-05 23:30 ` [PATCH RFC v2 11/23] ksmbd: use scoped_with_init_fs() for filesystem info path lookup Christian Brauner
                   ` (14 subsequent siblings)
  24 siblings, 0 replies; 40+ messages in thread
From: Christian Brauner @ 2026-03-05 23:30 UTC (permalink / raw)
  To: linux-fsdevel, Linus Torvalds
  Cc: linux-kernel, Alexander Viro, Jens Axboe, Jan Kara, Tejun Heo,
	Jann Horn, Christian Brauner

Use scoped_with_init_fs() to temporarily override current->fs for
the kern_path() call in share_config_request() so the share path
lookup happens in init's filesystem context.

All ksmbd paths ← SMB command handlers ← handle_ksmbd_work() ← workqueue
← ksmbd_conn_handler_loop() ← kthread

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 fs/smb/server/mgmt/share_config.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/fs/smb/server/mgmt/share_config.c b/fs/smb/server/mgmt/share_config.c
index 53f44ff4d376..4535566abef2 100644
--- a/fs/smb/server/mgmt/share_config.c
+++ b/fs/smb/server/mgmt/share_config.c
@@ -9,6 +9,7 @@
 #include <linux/rwsem.h>
 #include <linux/parser.h>
 #include <linux/namei.h>
+#include <linux/fs_struct.h>
 #include <linux/sched.h>
 #include <linux/mm.h>
 
@@ -189,7 +190,8 @@ static struct ksmbd_share_config *share_config_request(struct ksmbd_work *work,
 				goto out;
 			}
 
-			ret = kern_path(share->path, 0, &share->vfs_path);
+			scoped_with_init_fs()
+				ret = kern_path(share->path, 0, &share->vfs_path);
 			ksmbd_revert_fsids(work);
 			if (ret) {
 				ksmbd_debug(SMB, "failed to access '%s'\n",

-- 
2.47.3


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH RFC v2 11/23] ksmbd: use scoped_with_init_fs() for filesystem info path lookup
  2026-03-05 23:30 [PATCH RFC v2 00/23] fs,kthread: start all kthreads in nullfs Christian Brauner
                   ` (9 preceding siblings ...)
  2026-03-05 23:30 ` [PATCH RFC v2 10/23] ksmbd: use scoped_with_init_fs() for share path resolution Christian Brauner
@ 2026-03-05 23:30 ` Christian Brauner
  2026-03-05 23:30 ` [PATCH RFC v2 12/23] ksmbd: use scoped_with_init_fs() for VFS path operations Christian Brauner
                   ` (13 subsequent siblings)
  24 siblings, 0 replies; 40+ messages in thread
From: Christian Brauner @ 2026-03-05 23:30 UTC (permalink / raw)
  To: linux-fsdevel, Linus Torvalds
  Cc: linux-kernel, Alexander Viro, Jens Axboe, Jan Kara, Tejun Heo,
	Jann Horn, Christian Brauner

Use scoped_with_init_fs() to temporarily override current->fs for
the kern_path() call in smb2_get_info_filesystem() so the share
path lookup happens in init's filesystem context.

All ksmbd paths ← SMB command handlers ← handle_ksmbd_work() ← workqueue
← ksmbd_conn_handler_loop() ← kthread

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 fs/smb/server/smb2pdu.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/fs/smb/server/smb2pdu.c b/fs/smb/server/smb2pdu.c
index 743c629fe7ec..0667b0b663cd 100644
--- a/fs/smb/server/smb2pdu.c
+++ b/fs/smb/server/smb2pdu.c
@@ -9,6 +9,7 @@
 #include <net/addrconf.h>
 #include <linux/syscalls.h>
 #include <linux/namei.h>
+#include <linux/fs_struct.h>
 #include <linux/statfs.h>
 #include <linux/ethtool.h>
 #include <linux/falloc.h>
@@ -5463,7 +5464,8 @@ static int smb2_get_info_filesystem(struct ksmbd_work *work,
 	if (!share->path)
 		return -EIO;
 
-	rc = kern_path(share->path, LOOKUP_NO_SYMLINKS, &path);
+	scoped_with_init_fs()
+		rc = kern_path(share->path, LOOKUP_NO_SYMLINKS, &path);
 	if (rc) {
 		pr_err("cannot create vfs path\n");
 		return -EIO;

-- 
2.47.3


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH RFC v2 12/23] ksmbd: use scoped_with_init_fs() for VFS path operations
  2026-03-05 23:30 [PATCH RFC v2 00/23] fs,kthread: start all kthreads in nullfs Christian Brauner
                   ` (10 preceding siblings ...)
  2026-03-05 23:30 ` [PATCH RFC v2 11/23] ksmbd: use scoped_with_init_fs() for filesystem info path lookup Christian Brauner
@ 2026-03-05 23:30 ` Christian Brauner
  2026-03-05 23:30 ` [PATCH RFC v2 13/23] initramfs: use scoped_with_init_fs() for rootfs unpacking Christian Brauner
                   ` (12 subsequent siblings)
  24 siblings, 0 replies; 40+ messages in thread
From: Christian Brauner @ 2026-03-05 23:30 UTC (permalink / raw)
  To: linux-fsdevel, Linus Torvalds
  Cc: linux-kernel, Alexander Viro, Jens Axboe, Jan Kara, Tejun Heo,
	Jann Horn, Christian Brauner

Use scoped_with_init_fs() to temporarily override current->fs for
path lookups in ksmbd VFS helpers:
- ksmbd_vfs_path_lookup(): wrap vfs_path_parent_lookup()
- ksmbd_vfs_link(): wrap kern_path() for old path resolution
- ksmbd_vfs_kern_path_create(): wrap start_creating_path()

This ensures path lookups happen in init's filesystem context.

All ksmbd paths ← SMB command handlers ← handle_ksmbd_work() ← workqueue
← ksmbd_conn_handler_loop() ← kthread

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 fs/smb/server/vfs.c | 14 +++++++++-----
 1 file changed, 9 insertions(+), 5 deletions(-)

diff --git a/fs/smb/server/vfs.c b/fs/smb/server/vfs.c
index d08973b288e5..4b537e169160 100644
--- a/fs/smb/server/vfs.c
+++ b/fs/smb/server/vfs.c
@@ -7,6 +7,7 @@
 #include <crypto/sha2.h>
 #include <linux/kernel.h>
 #include <linux/fs.h>
+#include <linux/fs_struct.h>
 #include <linux/filelock.h>
 #include <linux/uaccess.h>
 #include <linux/backing-dev.h>
@@ -67,9 +68,10 @@ static int ksmbd_vfs_path_lookup(struct ksmbd_share_config *share_conf,
 	}
 
 	CLASS(filename_kernel, filename)(pathname);
-	err = vfs_path_parent_lookup(filename, flags,
-				     path, &last, &type,
-				     root_share_path);
+	scoped_with_init_fs()
+		err = vfs_path_parent_lookup(filename, flags,
+					     path, &last, &type,
+					     root_share_path);
 	if (err)
 		return err;
 
@@ -622,7 +624,8 @@ int ksmbd_vfs_link(struct ksmbd_work *work, const char *oldname,
 	if (ksmbd_override_fsids(work))
 		return -ENOMEM;
 
-	err = kern_path(oldname, LOOKUP_NO_SYMLINKS, &oldpath);
+	scoped_with_init_fs()
+		err = kern_path(oldname, LOOKUP_NO_SYMLINKS, &oldpath);
 	if (err) {
 		pr_err("cannot get linux path for %s, err = %d\n",
 		       oldname, err);
@@ -1258,7 +1261,8 @@ struct dentry *ksmbd_vfs_kern_path_create(struct ksmbd_work *work,
 	if (!abs_name)
 		return ERR_PTR(-ENOMEM);
 
-	dent = start_creating_path(AT_FDCWD, abs_name, path, flags);
+	scoped_with_init_fs()
+		dent = start_creating_path(AT_FDCWD, abs_name, path, flags);
 	kfree(abs_name);
 	return dent;
 }

-- 
2.47.3


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH RFC v2 13/23] initramfs: use scoped_with_init_fs() for rootfs unpacking
  2026-03-05 23:30 [PATCH RFC v2 00/23] fs,kthread: start all kthreads in nullfs Christian Brauner
                   ` (11 preceding siblings ...)
  2026-03-05 23:30 ` [PATCH RFC v2 12/23] ksmbd: use scoped_with_init_fs() for VFS path operations Christian Brauner
@ 2026-03-05 23:30 ` Christian Brauner
  2026-03-05 23:30 ` [PATCH RFC v2 14/23] af_unix: use scoped_with_init_fs() for coredump socket lookup Christian Brauner
                   ` (11 subsequent siblings)
  24 siblings, 0 replies; 40+ messages in thread
From: Christian Brauner @ 2026-03-05 23:30 UTC (permalink / raw)
  To: linux-fsdevel, Linus Torvalds
  Cc: linux-kernel, Alexander Viro, Jens Axboe, Jan Kara, Tejun Heo,
	Jann Horn, Christian Brauner

Extract the initramfs unpacking code into a separate
unpack_initramfs() function and wrap its invocation from
do_populate_rootfs() with scoped_with_init_fs(). This ensures all
file operations during initramfs unpacking (including filp_open()
calls in do_name() and populate_initrd_image()) happen in init's
filesystem context.

do_populate_rootfs() ← async_schedule_domain() ← kworker (async
workqueue)

May also run synchronously from PID 1 if the async workqueue is
considered full. Overriding in that case is fine as well.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 init/initramfs.c | 12 +++++++++---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/init/initramfs.c b/init/initramfs.c
index 139baed06589..045e2b9d6716 100644
--- a/init/initramfs.c
+++ b/init/initramfs.c
@@ -3,6 +3,7 @@
 #include <linux/async.h>
 #include <linux/export.h>
 #include <linux/fs.h>
+#include <linux/fs_struct.h>
 #include <linux/slab.h>
 #include <linux/types.h>
 #include <linux/fcntl.h>
@@ -715,7 +716,7 @@ static void __init populate_initrd_image(char *err)
 }
 #endif /* CONFIG_BLK_DEV_RAM */
 
-static void __init do_populate_rootfs(void *unused, async_cookie_t cookie)
+static void __init unpack_initramfs(async_cookie_t cookie)
 {
 	/* Load the built in initramfs */
 	char *err = unpack_to_rootfs(__initramfs_start, __initramfs_size);
@@ -723,7 +724,7 @@ static void __init do_populate_rootfs(void *unused, async_cookie_t cookie)
 		panic_show_mem("%s", err); /* Failed to decompress INTERNAL initramfs */
 
 	if (!initrd_start || IS_ENABLED(CONFIG_INITRAMFS_FORCE))
-		goto done;
+		return;
 
 	if (IS_ENABLED(CONFIG_BLK_DEV_RAM))
 		printk(KERN_INFO "Trying to unpack rootfs image as initramfs...\n");
@@ -738,8 +739,13 @@ static void __init do_populate_rootfs(void *unused, async_cookie_t cookie)
 		printk(KERN_EMERG "Initramfs unpacking failed: %s\n", err);
 #endif
 	}
+}
+
+static void __init do_populate_rootfs(void *unused, async_cookie_t cookie)
+{
+	scoped_with_init_fs()
+		unpack_initramfs(cookie);
 
-done:
 	security_initramfs_populated();
 
 	/*

-- 
2.47.3


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH RFC v2 14/23] af_unix: use scoped_with_init_fs() for coredump socket lookup
  2026-03-05 23:30 [PATCH RFC v2 00/23] fs,kthread: start all kthreads in nullfs Christian Brauner
                   ` (12 preceding siblings ...)
  2026-03-05 23:30 ` [PATCH RFC v2 13/23] initramfs: use scoped_with_init_fs() for rootfs unpacking Christian Brauner
@ 2026-03-05 23:30 ` Christian Brauner
  2026-03-05 23:30 ` [PATCH RFC v2 15/23] fs: add real_fs to track task's actual fs_struct Christian Brauner
                   ` (10 subsequent siblings)
  24 siblings, 0 replies; 40+ messages in thread
From: Christian Brauner @ 2026-03-05 23:30 UTC (permalink / raw)
  To: linux-fsdevel, Linus Torvalds
  Cc: linux-kernel, Alexander Viro, Jens Axboe, Jan Kara, Tejun Heo,
	Jann Horn, Christian Brauner

Use scoped_with_init_fs() to temporarily override current->fs for the
coredump unix socket path resolution. This replaces the init_root() +
vfs_path_lookup() pattern with scoped_with_init_fs() + kern_path().

The old code used LOOKUP_BENEATH to confine the lookup beneath init's
root. This is dropped because the coredump socket path is absolute and
resolved from root (where ".." is a no-op), and LOOKUP_NO_SYMLINKS
already blocks any symlink-based escape. LOOKUP_BENEATH was redundant
in this context.
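
The ".." argument above can be illustrated with a small userspace model.
This is only a sketch, not kernel code: depth_below_root() is a
hypothetical helper that walks an absolute path lexically and ignores
symlinks and mount points, which is exactly the part that
LOOKUP_NO_SYMLINKS still has to cover in the real lookup.

```c
#include <assert.h>
#include <string.h>

/*
 * Toy model only: walk an absolute path component by component and
 * return how many levels below "/" the walk ends up. Symlinks and
 * mounts are not modelled.
 */
int depth_below_root(const char *path)
{
	int depth = 0;

	assert(path[0] == '/');	/* the model only covers absolute paths */

	while (*path) {
		size_t len;

		while (*path == '/')	/* skip separators */
			path++;
		len = strcspn(path, "/");
		if (len == 0)
			break;
		if (len == 2 && path[0] == '.' && path[1] == '.') {
			if (depth > 0)	/* ".." at the root stays at the root */
				depth--;
		} else if (!(len == 1 && path[0] == '.')) {
			depth++;	/* ordinary component: one level down */
		}
		path += len;
	}
	return depth;
}
```

In this model the depth can never become negative, so a purely lexical
walk that starts at the root cannot end up above it.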

unix_find_bsd(SOCK_COREDUMP) ← coredump_sock_connect() ← do_coredump() —
same crashing userspace process

Same security rationale as coredump.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 net/unix/af_unix.c | 17 ++++++-----------
 1 file changed, 6 insertions(+), 11 deletions(-)

diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 3756a93dc63a..64b56b3d0aee 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -1198,17 +1198,12 @@ static struct sock *unix_find_bsd(struct sockaddr_un *sunaddr, int addr_len,
 	unix_mkname_bsd(sunaddr, addr_len);
 
 	if (flags & SOCK_COREDUMP) {
-		struct path root;
-
-		task_lock(&init_task);
-		get_fs_root(init_task.fs, &root);
-		task_unlock(&init_task);
-
-		scoped_with_kernel_creds()
-			err = vfs_path_lookup(root.dentry, root.mnt, sunaddr->sun_path,
-					      LOOKUP_BENEATH | LOOKUP_NO_SYMLINKS |
-					      LOOKUP_NO_MAGICLINKS, &path);
-		path_put(&root);
+		scoped_with_init_fs() {
+			scoped_with_kernel_creds()
+				err = kern_path(sunaddr->sun_path,
+						LOOKUP_NO_SYMLINKS |
+						LOOKUP_NO_MAGICLINKS, &path);
+		}
 		if (err)
 			goto fail;
 	} else {

-- 
2.47.3


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH RFC v2 15/23] fs: add real_fs to track task's actual fs_struct
  2026-03-05 23:30 [PATCH RFC v2 00/23] fs,kthread: start all kthreads in nullfs Christian Brauner
                   ` (13 preceding siblings ...)
  2026-03-05 23:30 ` [PATCH RFC v2 14/23] af_unix: use scoped_with_init_fs() for coredump socket lookup Christian Brauner
@ 2026-03-05 23:30 ` Christian Brauner
  2026-03-07  0:51   ` Askar Safin
  2026-03-09 15:14   ` Jann Horn
  2026-03-05 23:30 ` [PATCH RFC v2 16/23] fs: make userspace_init_fs a dynamically-initialized pointer Christian Brauner
                   ` (9 subsequent siblings)
  24 siblings, 2 replies; 40+ messages in thread
From: Christian Brauner @ 2026-03-05 23:30 UTC (permalink / raw)
  To: linux-fsdevel, Linus Torvalds
  Cc: linux-kernel, Alexander Viro, Jens Axboe, Jan Kara, Tejun Heo,
	Jann Horn, Christian Brauner

Add a real_fs field to task_struct that always mirrors the fs field.
This lays the groundwork for distinguishing between a task's permanent
fs_struct and one that is temporarily overridden via scoped_with_init_fs().

When a kthread temporarily overrides current->fs for path lookup, we
need to know the original fs_struct for operations like exit_fs() and
unshare_fs_struct() that must operate on the real, permanent fs.

For now real_fs is always equal to fs. It is maintained alongside fs in
all the relevant paths: exit_fs(), unshare_fs_struct(),
switch_fs_struct(), and copy_fs().
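
A minimal userspace sketch of the intended split, assuming the override
later in the series swaps only fs and leaves real_fs alone. The names
fs_model, task_model, override_fs(), revert_fs() and exit_fs_model() are
made up for illustration and are not the kernel's types or helpers.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for fs_struct and task_struct. */
struct fs_model {
	int users;	/* reference count, like fs_struct::users */
};

struct task_model {
	struct fs_model *real_fs;	/* permanent fs, owns a reference */
	struct fs_model *fs;		/* what path lookup uses right now */
};

/* Borrow another fs for lookups; real_fs and refcounts are untouched. */
struct fs_model *override_fs(struct task_model *t, struct fs_model *tmp)
{
	struct fs_model *old = t->fs;

	assert(t->fs == t->real_fs);	/* this model has no nested overrides */
	t->fs = tmp;
	return old;
}

void revert_fs(struct task_model *t, struct fs_model *old)
{
	t->fs = old;
}

/* Modelled on exit_fs(): the reference is dropped through real_fs. */
int exit_fs_model(struct task_model *t)
{
	struct fs_model *fs = t->real_fs;
	int kill = 0;

	if (fs) {
		t->real_fs = NULL;
		t->fs = NULL;
		kill = !--fs->users;
	}
	return kill;	/* nonzero when the last user went away */
}
```

The point of the model is that the borrowed fs_struct never has its
user count touched by the borrowing task.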

Also fix the argument passed to nullfs_userspace_init() in
switch_fs_struct(): pass the old fs_struct itself rather than the
conditional return value which is NULL when other users still hold
a reference, ensuring the PID 1 unshare detection actually works.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 fs/fs_struct.c        | 10 +++++++---
 include/linux/sched.h |  1 +
 init/init_task.c      |  1 +
 kernel/fork.c         |  4 +++-
 4 files changed, 12 insertions(+), 4 deletions(-)

diff --git a/fs/fs_struct.c b/fs/fs_struct.c
index 3ff79fb894c1..b9b9a327f299 100644
--- a/fs/fs_struct.c
+++ b/fs/fs_struct.c
@@ -89,12 +89,13 @@ void free_fs_struct(struct fs_struct *fs)
 
 void exit_fs(struct task_struct *tsk)
 {
-	struct fs_struct *fs = tsk->fs;
+	struct fs_struct *fs = tsk->real_fs;
 
 	if (fs) {
 		int kill;
 		task_lock(tsk);
 		read_seqlock_excl(&fs->seq);
+		tsk->real_fs = NULL;
 		tsk->fs = NULL;
 		kill = !--fs->users;
 		read_sequnlock_excl(&fs->seq);
@@ -126,7 +127,7 @@ struct fs_struct *copy_fs_struct(struct fs_struct *old)
 
 int unshare_fs_struct(void)
 {
-	struct fs_struct *fs = current->fs;
+	struct fs_struct *fs = current->real_fs;
 	struct fs_struct *new_fs = copy_fs_struct(fs);
 	int kill;
 
@@ -135,8 +136,10 @@ int unshare_fs_struct(void)
 
 	task_lock(current);
 	read_seqlock_excl(&fs->seq);
+	VFS_WARN_ON_ONCE(fs != current->fs);
 	kill = !--fs->users;
 	current->fs = new_fs;
+	current->real_fs = new_fs;
 	read_sequnlock_excl(&fs->seq);
 	task_unlock(current);
 
@@ -177,13 +180,14 @@ struct fs_struct *switch_fs_struct(struct fs_struct *new_fs)
 
 	fs = current->fs;
 	read_seqlock_excl(&fs->seq);
+	VFS_WARN_ON_ONCE(current->fs != current->real_fs);
 	current->fs = new_fs;
+	current->real_fs = new_fs;
 	if (--fs->users)
 		new_fs = NULL;
 	else
 		new_fs = fs;
 	read_sequnlock_excl(&fs->seq);
-
 	nullfs_userspace_init(fs);
 	return new_fs;
 }
diff --git a/include/linux/sched.h b/include/linux/sched.h
index a7b4a980eb2f..5c7b9df92ebb 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1179,6 +1179,7 @@ struct task_struct {
 	unsigned long			last_switch_time;
 #endif
 	/* Filesystem information: */
+	struct fs_struct		*real_fs;
 	struct fs_struct		*fs;
 
 	/* Open file information: */
diff --git a/init/init_task.c b/init/init_task.c
index 5c838757fc10..7d0b4a5927eb 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -152,6 +152,7 @@ struct task_struct init_task __aligned(L1_CACHE_BYTES) = {
 	RCU_POINTER_INITIALIZER(cred, &init_cred),
 	.comm		= INIT_TASK_COMM,
 	.thread		= INIT_THREAD,
+	.real_fs	= &init_fs,
 	.fs		= &init_fs,
 	.files		= &init_files,
 #ifdef CONFIG_IO_URING
diff --git a/kernel/fork.c b/kernel/fork.c
index 583078c69bbd..73f4ed82f656 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1593,6 +1593,8 @@ static int copy_mm(u64 clone_flags, struct task_struct *tsk)
 static int copy_fs(u64 clone_flags, struct task_struct *tsk)
 {
 	struct fs_struct *fs = current->fs;
+
+	VFS_WARN_ON_ONCE(current->fs != current->real_fs);
 	if (clone_flags & CLONE_FS) {
 		/* tsk->fs is already what we want */
 		read_seqlock_excl(&fs->seq);
@@ -1605,7 +1607,7 @@ static int copy_fs(u64 clone_flags, struct task_struct *tsk)
 		read_sequnlock_excl(&fs->seq);
 		return 0;
 	}
-	tsk->fs = copy_fs_struct(fs);
+	tsk->real_fs = tsk->fs = copy_fs_struct(fs);
 	if (!tsk->fs)
 		return -ENOMEM;
 	return 0;

-- 
2.47.3


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* Re: [PATCH RFC v2 15/23] fs: add real_fs to track task's actual fs_struct
  2026-03-05 23:30 ` [PATCH RFC v2 15/23] fs: add real_fs to track task's actual fs_struct Christian Brauner
@ 2026-03-07  0:51   ` Askar Safin
  2026-03-09 15:14   ` Jann Horn
  1 sibling, 0 replies; 40+ messages in thread
From: Askar Safin @ 2026-03-07  0:51 UTC (permalink / raw)
  To: brauner; +Cc: axboe, jack, jannh, linux-fsdevel, linux-kernel, tj, torvalds,
	viro

Christian Brauner <brauner@kernel.org>:
> Also fix the argument passed to nullfs_userspace_init() in
> switch_fs_struct(): pass the old fs_struct itself rather than the
> conditional return value which is NULL when other users still hold
> a reference, ensuring the PID 1 unshare detection actually works.

This description doesn't match the actual patch.

Your patch doesn't change what is passed to nullfs_userspace_init() in
switch_fs_struct(). The code before this patch already passed fs.

-- 
Askar Safin

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH RFC v2 15/23] fs: add real_fs to track task's actual fs_struct
  2026-03-05 23:30 ` [PATCH RFC v2 15/23] fs: add real_fs to track task's actual fs_struct Christian Brauner
  2026-03-07  0:51   ` Askar Safin
@ 2026-03-09 15:14   ` Jann Horn
  2026-03-10 11:29     ` Christian Brauner
  1 sibling, 1 reply; 40+ messages in thread
From: Jann Horn @ 2026-03-09 15:14 UTC (permalink / raw)
  To: Christian Brauner
  Cc: linux-fsdevel, Linus Torvalds, linux-kernel, Alexander Viro,
	Jens Axboe, Jan Kara, Tejun Heo

On Fri, Mar 6, 2026 at 12:31 AM Christian Brauner <brauner@kernel.org> wrote:
> Add a real_fs field to task_struct that always mirrors the fs field.
> This lays the groundwork for distinguishing between a task's permanent
> fs_struct and one that is temporarily overridden via scoped_with_init_fs().
>
> When a kthread temporarily overrides current->fs for path lookup, we
> need to know the original fs_struct for operations like exit_fs() and
> unshare_fs_struct() that must operate on the real, permanent fs.

Note that there are remote accesses to ->fs from procfs, including
(idk if there are more, I didn't look closely):

 - mounts_open_common
 - get_task_root
 - proc_cwd_link

These expect that task_lock() keeps the task_struct::fs pointer
stable, and I don't see anything that prevents them operating on
kthreads.

You should probably ensure that remote accesses to task_struct::fs all
use task_struct::real_fs, just like how there are no remote accesses
to task_struct::cred - that makes logical sense since when userspace
queries a task's file system root/cwd/mounts, what the task is
currently doing probably shouldn't affect the result.

Then you could also change the locking rules such that task_struct::fs
has no locking while task_lock() protects modifications of, and remote
access to, task_struct::real_fs.
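
A rough userspace sketch of that locking rule, with a pthread mutex
standing in for task_lock(). All names here (fs_model, task_model,
remote_read_umask(), local_override()) are hypothetical and only model
the idea; they are not the procfs code.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

struct fs_model {
	int umask;
};

struct task_model {
	pthread_mutex_t lock;		/* stand-in for task_lock() */
	struct fs_model *real_fs;	/* remote readers use this, under lock */
	struct fs_model *fs;		/* only the task itself touches this */
};

void task_model_init(struct task_model *t, struct fs_model *fs)
{
	int rc = pthread_mutex_init(&t->lock, NULL);

	assert(rc == 0);
	(void)rc;
	t->real_fs = fs;
	t->fs = fs;
}

/* What a procfs-style remote reader would do in this model. */
int remote_read_umask(struct task_model *t)
{
	int umask = -1;

	pthread_mutex_lock(&t->lock);
	if (t->real_fs)
		umask = t->real_fs->umask;
	pthread_mutex_unlock(&t->lock);
	return umask;
}

/* Task-local override: no lock, because nobody else reads t->fs. */
void local_override(struct task_model *t, struct fs_model *tmp)
{
	t->fs = tmp;
}
```

With this split a remote reader sees the same answer regardless of what
the task has temporarily installed in fs.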

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH RFC v2 15/23] fs: add real_fs to track task's actual fs_struct
  2026-03-09 15:14   ` Jann Horn
@ 2026-03-10 11:29     ` Christian Brauner
  2026-03-10 16:05       ` Christian Brauner
  0 siblings, 1 reply; 40+ messages in thread
From: Christian Brauner @ 2026-03-10 11:29 UTC (permalink / raw)
  To: Jann Horn
  Cc: linux-fsdevel, Linus Torvalds, linux-kernel, Alexander Viro,
	Jens Axboe, Jan Kara, Tejun Heo

On Mon, Mar 09, 2026 at 04:14:57PM +0100, Jann Horn wrote:
> On Fri, Mar 6, 2026 at 12:31 AM Christian Brauner <brauner@kernel.org> wrote:
> > Add a real_fs field to task_struct that always mirrors the fs field.
> > This lays the groundwork for distinguishing between a task's permanent
> > fs_struct and one that is temporarily overridden via scoped_with_init_fs().
> >
> > When a kthread temporarily overrides current->fs for path lookup, we
> > need to know the original fs_struct for operations like exit_fs() and
> > unshare_fs_struct() that must operate on the real, permanent fs.
> 
> Note that there are remote accesses to ->fs from procfs, including
> (idk if there are more, I didn't look closely):
> 
>  - mounts_open_common
>  - get_task_root
>  - proc_cwd_link
> 
> These expect that task_lock() keeps the task_struct::fs pointer
> stable, and I don't see anything that prevents them operating on
> kthreads.
> 
> You should probably ensure that remote accesses to task_struct::fs all
> use task_struct::real_fs, just like how there are no remote accesses
> to task_struct::cred - that makes logical sense since when userspace
> queries a task's file system root/cwd/mounts, what the task is
> currently doing probably shouldn't affect the result.
> 
> Then you could also change the locking rules such that task_struct::fs
> has no locking while task_lock() protects modifications of, and remote
> access to, task_struct::real_fs.

Yes, I have included those changes in the new revision I haven't yet sent out. :)

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH RFC v2 15/23] fs: add real_fs to track task's actual fs_struct
  2026-03-10 11:29     ` Christian Brauner
@ 2026-03-10 16:05       ` Christian Brauner
  0 siblings, 0 replies; 40+ messages in thread
From: Christian Brauner @ 2026-03-10 16:05 UTC (permalink / raw)
  To: Jann Horn
  Cc: linux-fsdevel, Linus Torvalds, linux-kernel, Alexander Viro,
	Jens Axboe, Jan Kara, Tejun Heo

On Tue, Mar 10, 2026 at 12:29:28PM +0100, Christian Brauner wrote:
> On Mon, Mar 09, 2026 at 04:14:57PM +0100, Jann Horn wrote:
> > On Fri, Mar 6, 2026 at 12:31 AM Christian Brauner <brauner@kernel.org> wrote:
> > > Add a real_fs field to task_struct that always mirrors the fs field.
> > > This lays the groundwork for distinguishing between a task's permanent
> > > fs_struct and one that is temporarily overridden via scoped_with_init_fs().
> > >
> > > When a kthread temporarily overrides current->fs for path lookup, we
> > > need to know the original fs_struct for operations like exit_fs() and
> > > unshare_fs_struct() that must operate on the real, permanent fs.
> > 
> > Note that there are remote accesses to ->fs from procfs, including
> > (idk if there are more, I didn't look closely):
> > 
> >  - mounts_open_common
> >  - get_task_root
> >  - proc_cwd_link
> > 
> > These expect that task_lock() keeps the task_struct::fs pointer
> > stable, and I don't see anything that prevents them operating on
> > kthreads.
> > 
> > You should probably ensure that remote accesses to task_struct::fs all
> > use task_struct::real_fs, just like how there are no remote accesses
> > to task_struct::cred - that makes logical sense since when userspace
> > queries a task's file system root/cwd/mounts, what the task is
> > currently doing probably shouldn't affect the result.
> > 
> > Then you could also change the locking rules such that task_struct::fs
> > has no locking while task_lock() protects modifications of, and remote
> > access to, task_struct::real_fs.
> 
> Yes, I have included those changes in the new revision I haven't yet sent out. :)

One more remote access to add to that list:

 - umask access in task_state()

^ permalink raw reply	[flat|nested] 40+ messages in thread

* [PATCH RFC v2 16/23] fs: make userspace_init_fs a dynamically-initialized pointer
  2026-03-05 23:30 [PATCH RFC v2 00/23] fs,kthread: start all kthreads in nullfs Christian Brauner
                   ` (14 preceding siblings ...)
  2026-03-05 23:30 ` [PATCH RFC v2 15/23] fs: add real_fs to track task's actual fs_struct Christian Brauner
@ 2026-03-05 23:30 ` Christian Brauner
  2026-03-05 23:30 ` [PATCH RFC v2 17/23] fs: stop sharing fs_struct between init_task and pid 1 Christian Brauner
                   ` (8 subsequent siblings)
  24 siblings, 0 replies; 40+ messages in thread
From: Christian Brauner @ 2026-03-05 23:30 UTC (permalink / raw)
  To: linux-fsdevel, Linus Torvalds
  Cc: linux-kernel, Alexander Viro, Jens Axboe, Jan Kara, Tejun Heo,
	Jann Horn, Christian Brauner

Change userspace_init_fs from a declared-but-unused extern struct to
a dynamically initialized pointer. Add init_userspace_fs() which is
called early in kernel_init() (PID 1) to record PID 1's fs_struct
as the canonical userspace filesystem state.

Wire up __override_init_fs() and __revert_init_fs() to actually swap
current->fs to/from userspace_init_fs. Previously these were no-ops
that stored current->fs back to itself.

Fix nullfs_userspace_init() to compare against userspace_init_fs
instead of &init_fs. When PID 1 unshares its filesystem state, revert
userspace_init_fs to init_fs's root (nullfs) so that stale filesystem
state is not silently inherited by kworkers and usermodehelpers.

At this stage PID 1's fs still points to rootfs (set by
init_mount_tree), so userspace_init_fs points to rootfs and
scoped_with_init_fs() is functionally equivalent to its previous no-op
behavior.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 fs/fs_struct.c            | 46 +++++++++++++++++++++++++++++++++++++++++++++-
 include/linux/fs_struct.h |  5 +++--
 include/linux/init_task.h |  1 +
 init/main.c               |  3 +++
 4 files changed, 52 insertions(+), 3 deletions(-)

diff --git a/fs/fs_struct.c b/fs/fs_struct.c
index b9b9a327f299..c1afa7513e34 100644
--- a/fs/fs_struct.c
+++ b/fs/fs_struct.c
@@ -8,6 +8,7 @@
 #include <linux/fs_struct.h>
 #include <linux/init_task.h>
 #include "internal.h"
+#include "mount.h"
 
 /*
  * Replace the fs->{rootmnt,root} with {mnt,dentry}. Put the old values.
@@ -163,15 +164,32 @@ EXPORT_SYMBOL_GPL(unshare_fs_struct);
  * fs_struct state. Breaking that contract sucks for both sides.
  * So just don't bother with extra work for this. No sane init
  * system should ever do this.
+ *
+ * On older kernels, if PID 1 unshared its filesystem state with us, the
+ * kernel simply kept using the stale fs_struct state, implicitly pinning
+ * anything that PID 1 had last used, even if PID 1 had moved on to
+ * some completely different fs_struct state and had even unmounted
+ * the old root.
+ *
+ * This has hilarious consequences: Think continuing to dump coredump
+ * state into an implicitly pinned directory somewhere. Calling random
+ * binaries in the old rootfs via usermodehelpers.
+ *
+ * Be aggressive about this: We simply reject operating on stale
+ * fs_struct state by reverting to nullfs. Every kworker that does
+ * lookups after this point will fail. Every usermodehelper call will
+ * fail. Tough luck but let's be kind and emit a warning to userspace.
  */
 static inline void nullfs_userspace_init(struct fs_struct *old_fs)
 {
 	if (likely(current->pid != 1))
 		return;
 	/* @old_fs may be dangling but for comparison it's fine */
-	if (old_fs != &init_fs)
+	if (old_fs != userspace_init_fs)
 		return;
 	pr_warn("VFS: Pid 1 stopped sharing filesystem state\n");
+	set_fs_root(userspace_init_fs, &init_fs.root);
+	set_fs_pwd(userspace_init_fs, &init_fs.root);
 }
 
 struct fs_struct *switch_fs_struct(struct fs_struct *new_fs)
@@ -198,3 +216,29 @@ struct fs_struct init_fs = {
 	.seq		= __SEQLOCK_UNLOCKED(init_fs.seq),
 	.umask		= 0022,
 };
+
+struct fs_struct *userspace_init_fs __ro_after_init;
+EXPORT_SYMBOL_GPL(userspace_init_fs);
+
+void __init init_userspace_fs(void)
+{
+	struct mount *m;
+	struct path root;
+
+	/* Move PID 1 from nullfs into the initramfs. */
+	m = topmost_overmount(current->nsproxy->mnt_ns->root);
+	root.mnt = &m->mnt;
+	root.dentry = root.mnt->mnt_root;
+
+	VFS_WARN_ON_ONCE(current->pid != 1);
+
+	set_fs_root(current->fs, &root);
+	set_fs_pwd(current->fs, &root);
+
+	/* Hold a reference for the global pointer. */
+	read_seqlock_excl(&current->fs->seq);
+	current->fs->users++;
+	read_sequnlock_excl(&current->fs->seq);
+
+	userspace_init_fs = current->fs;
+}
diff --git a/include/linux/fs_struct.h b/include/linux/fs_struct.h
index ff525a1e45d4..51d335924029 100644
--- a/include/linux/fs_struct.h
+++ b/include/linux/fs_struct.h
@@ -17,6 +17,7 @@ struct fs_struct {
 } __randomize_layout;
 
 extern struct kmem_cache *fs_cachep;
+extern struct fs_struct *userspace_init_fs;
 
 extern void exit_fs(struct task_struct *);
 extern void set_fs_root(struct fs_struct *, const struct path *);
@@ -60,13 +61,13 @@ static inline struct fs_struct *__override_init_fs(void)
 	struct fs_struct *fs;
 
 	fs = current->fs;
-	smp_store_release(&current->fs, current->fs);
+	smp_store_release(&current->fs, userspace_init_fs);
 	return fs;
 }
 
 static inline void __revert_init_fs(struct fs_struct *revert_fs)
 {
-	VFS_WARN_ON_ONCE(current->fs != current->fs);
+	VFS_WARN_ON_ONCE(current->fs != userspace_init_fs);
 	smp_store_release(&current->fs, revert_fs);
 }
 
diff --git a/include/linux/init_task.h b/include/linux/init_task.h
index a6cb241ea00c..61536be773f5 100644
--- a/include/linux/init_task.h
+++ b/include/linux/init_task.h
@@ -24,6 +24,7 @@
 
 extern struct files_struct init_files;
 extern struct fs_struct init_fs;
+extern struct fs_struct *userspace_init_fs;
 extern struct nsproxy init_nsproxy;
 
 #ifndef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
diff --git a/init/main.c b/init/main.c
index 1cb395dd94e4..5ccc642a5aa7 100644
--- a/init/main.c
+++ b/init/main.c
@@ -102,6 +102,7 @@
 #include <linux/stackdepot.h>
 #include <linux/randomize_kstack.h>
 #include <linux/pidfs.h>
+#include <linux/fs_struct.h>
 #include <linux/ptdump.h>
 #include <linux/time_namespace.h>
 #include <linux/unaligned.h>
@@ -1574,6 +1575,8 @@ static int __ref kernel_init(void *unused)
 {
 	int ret;
 
+	init_userspace_fs();
+
 	/*
 	 * Wait until kthreadd is all set-up.
 	 */

-- 
2.47.3


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH RFC v2 17/23] fs: stop sharing fs_struct between init_task and pid 1
  2026-03-05 23:30 [PATCH RFC v2 00/23] fs,kthread: start all kthreads in nullfs Christian Brauner
                   ` (15 preceding siblings ...)
  2026-03-05 23:30 ` [PATCH RFC v2 16/23] fs: make userspace_init_fs a dynamically-initialized pointer Christian Brauner
@ 2026-03-05 23:30 ` Christian Brauner
  2026-03-05 23:30 ` [PATCH RFC v2 18/23] fs: add umh argument to struct kernel_clone_args Christian Brauner
                   ` (7 subsequent siblings)
  24 siblings, 0 replies; 40+ messages in thread
From: Christian Brauner @ 2026-03-05 23:30 UTC (permalink / raw)
  To: linux-fsdevel, Linus Torvalds
  Cc: linux-kernel, Alexander Viro, Jens Axboe, Jan Kara, Tejun Heo,
	Jann Horn, Christian Brauner

Spawn kernel_init (PID 1) via kernel_clone() directly instead of
user_mode_thread(), without CLONE_FS. This gives PID 1 its own private
copy of init_task's fs_struct rather than sharing it.

This is a prerequisite for isolating kthreads in nullfs: when
init_task's fs is later pointed at nullfs, PID 1 must not share it
or init_userspace_fs() would modify init_task's fs as well, defeating
the isolation.

At this stage PID 1 still gets rootfs (a private copy rather than a
shared reference), so there is no functional change.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 init/main.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/init/main.c b/init/main.c
index 5ccc642a5aa7..6633d4bea52b 100644
--- a/init/main.c
+++ b/init/main.c
@@ -714,6 +714,11 @@ static __initdata DECLARE_COMPLETION(kthreadd_done);
 
 static noinline void __ref __noreturn rest_init(void)
 {
+	struct kernel_clone_args init_args = {
+		.flags		= (CLONE_VM | CLONE_UNTRACED),
+		.fn		= kernel_init,
+		.fn_arg		= NULL,
+	};
 	struct task_struct *tsk;
 	int pid;
 
@@ -723,7 +728,7 @@ static noinline void __ref __noreturn rest_init(void)
 	 * the init task will end up wanting to create kthreads, which, if
 	 * we schedule it before we create kthreadd, will OOPS.
 	 */
-	pid = user_mode_thread(kernel_init, NULL, CLONE_FS);
+	pid = kernel_clone(&init_args);
 	/*
 	 * Pin init on the boot CPU. Task migration is not properly working
 	 * until sched_init_smp() has been run. It will set the allowed

-- 
2.47.3


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH RFC v2 18/23] fs: add umh argument to struct kernel_clone_args
  2026-03-05 23:30 [PATCH RFC v2 00/23] fs,kthread: start all kthreads in nullfs Christian Brauner
                   ` (16 preceding siblings ...)
  2026-03-05 23:30 ` [PATCH RFC v2 17/23] fs: stop sharing fs_struct between init_task and pid 1 Christian Brauner
@ 2026-03-05 23:30 ` Christian Brauner
  2026-03-09 16:06   ` Jann Horn
  2026-03-05 23:30 ` [PATCH RFC v2 19/23] fs: add kthread_mntns() Christian Brauner
                   ` (6 subsequent siblings)
  24 siblings, 1 reply; 40+ messages in thread
From: Christian Brauner @ 2026-03-05 23:30 UTC (permalink / raw)
  To: linux-fsdevel, Linus Torvalds
  Cc: linux-kernel, Alexander Viro, Jens Axboe, Jan Kara, Tejun Heo,
	Jann Horn, Christian Brauner

Add a umh field to struct kernel_clone_args. When set, copy_fs() copies
from pid 1's fs_struct instead of the kthread's fs_struct. This ensures
usermodehelper threads always get init's filesystem state regardless of
their parent's (kthreadd's) fs.

Usermodehelper threads are not allowed to create mount namespaces
(CLONE_NEWNS), share filesystem state (CLONE_FS), or be started from
a non-initial mount namespace. No usermodehelper currently does this so
we don't need to worry about this restriction.

Set .umh = 1 in user_mode_thread(). At this stage pid 1's fs points to
rootfs which is the same as kthreadd's fs, so this is functionally
equivalent.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 include/linux/sched/task.h |  1 +
 kernel/fork.c              | 23 ++++++++++++++++++++---
 2 files changed, 21 insertions(+), 3 deletions(-)

diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index 41ed884cffc9..e0c1ca8c6a18 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -31,6 +31,7 @@ struct kernel_clone_args {
 	u32 io_thread:1;
 	u32 user_worker:1;
 	u32 no_files:1;
+	u32 umh:1;
 	unsigned long stack;
 	unsigned long stack_size;
 	unsigned long tls;
diff --git a/kernel/fork.c b/kernel/fork.c
index 73f4ed82f656..c740fe2ad1ef 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1590,9 +1590,25 @@ static int copy_mm(u64 clone_flags, struct task_struct *tsk)
 	return 0;
 }
 
-static int copy_fs(u64 clone_flags, struct task_struct *tsk)
+static int copy_fs(u64 clone_flags, struct task_struct *tsk, bool umh)
 {
-	struct fs_struct *fs = current->fs;
+	struct fs_struct *fs;
+
+	/*
+	 * Usermodehelper may use userspace_init_fs filesystem state but
+	 * they don't get to create mount namespaces, share the
+	 * filesystem state, or be started from a non-initial mount
+	 * namespace.
+	 */
+	if (umh) {
+		if (clone_flags & (CLONE_NEWNS | CLONE_FS))
+			return -EINVAL;
+		if (current->nsproxy->mnt_ns != &init_mnt_ns)
+			return -EINVAL;
+		fs = userspace_init_fs;
+	} else {
+		fs = current->fs;
+	}
 
 	VFS_WARN_ON_ONCE(current->fs != current->real_fs);
 	if (clone_flags & CLONE_FS) {
@@ -2213,7 +2229,7 @@ __latent_entropy struct task_struct *copy_process(
 	retval = copy_files(clone_flags, p, args->no_files);
 	if (retval)
 		goto bad_fork_cleanup_semundo;
-	retval = copy_fs(clone_flags, p);
+	retval = copy_fs(clone_flags, p, args->umh);
 	if (retval)
 		goto bad_fork_cleanup_files;
 	retval = copy_sighand(clone_flags, p);
@@ -2727,6 +2743,7 @@ pid_t user_mode_thread(int (*fn)(void *), void *arg, unsigned long flags)
 		.exit_signal	= (flags & CSIGNAL),
 		.fn		= fn,
 		.fn_arg		= arg,
+		.umh		= 1,
 	};
 
 	return kernel_clone(&args);

-- 
2.47.3


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* Re: [PATCH RFC v2 18/23] fs: add umh argument to struct kernel_clone_args
  2026-03-05 23:30 ` [PATCH RFC v2 18/23] fs: add umh argument to struct kernel_clone_args Christian Brauner
@ 2026-03-09 16:06   ` Jann Horn
  2026-03-10 11:58     ` Christian Brauner
  0 siblings, 1 reply; 40+ messages in thread
From: Jann Horn @ 2026-03-09 16:06 UTC (permalink / raw)
  To: Christian Brauner
  Cc: linux-fsdevel, Linus Torvalds, linux-kernel, Alexander Viro,
	Jens Axboe, Jan Kara, Tejun Heo

On Fri, Mar 6, 2026 at 12:31 AM Christian Brauner <brauner@kernel.org> wrote:
> Add a umh field to struct kernel_clone_args. When set, copy_fs() copies
> from pid 1's fs_struct instead of the kthread's fs_struct. This ensures
> usermodehelper threads always get init's filesystem state regardless of
> their parent's (kthreadd's) fs.
>
> Usermodehelper threads are not allowed to create mount namespaces
> (CLONE_NEWNS), share filesystem state (CLONE_FS), or be started from
> a non-initial mount namespace. No usermodehelper currently does this so
> we don't need to worry about this restriction.
>
> Set .umh = 1 in user_mode_thread(). At this stage pid 1's fs points to
> rootfs which is the same as kthreadd's fs, so this is functionally
> equivalent.
[...]
> -static int copy_fs(u64 clone_flags, struct task_struct *tsk)
> +static int copy_fs(u64 clone_flags, struct task_struct *tsk, bool umh)
>  {
> -       struct fs_struct *fs = current->fs;
> +       struct fs_struct *fs;
> +
> +       /*
> +        * Usermodehelper may use userspace_init_fs filesystem state but
> +        * they don't get to create mount namespaces, share the
> +        * filesystem state, or be started from a non-initial mount
> +        * namespace.
> +        */
> +       if (umh) {
> +               if (clone_flags & (CLONE_NEWNS | CLONE_FS))
> +                       return -EINVAL;
> +               if (current->nsproxy->mnt_ns != &init_mnt_ns)
> +                       return -EINVAL;
> +               fs = userspace_init_fs;
> +       } else {
> +               fs = current->fs;
> +       }
>
>         VFS_WARN_ON_ONCE(current->fs != current->real_fs);

Should that VFS_WARN_ON_ONCE() be in the else {} block?

I don't know if it could happen that a VFS operation that happens with
overwritten current->fs calls back into firmware loading or such, or
if that is anyway impossible for locking reasons or such...
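
To make the placement question concrete, here is a userspace sketch of the
source selection in copy_fs() with the sanity check moved into the else
branch. This is only a model under stated assumptions: pick_fs_source(),
the MODEL_CLONE_* values, the in_init_mntns flag and the model_warnings
counter (standing in for VFS_WARN_ON_ONCE()) are invented for the example.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag values, used only inside this model. */
#define MODEL_CLONE_FS		0x1u
#define MODEL_CLONE_NEWNS	0x2u

struct fs_model {
	int id;
};

struct task_model {
	struct fs_model *fs;
	struct fs_model *real_fs;
	bool in_init_mntns;	/* models mnt_ns == &init_mnt_ns */
};

/* Counts how often the model's stand-in for VFS_WARN_ON_ONCE() fired. */
int model_warnings;

/* Pick the fs to copy from; returns 0 or a negative errno value. */
int pick_fs_source(const struct task_model *cur, unsigned int flags, bool umh,
		   struct fs_model *userspace_init_fs, struct fs_model **src)
{
	assert(src != NULL);

	if (umh) {
		if (flags & (MODEL_CLONE_NEWNS | MODEL_CLONE_FS))
			return -EINVAL;
		if (!cur->in_init_mntns)
			return -EINVAL;
		*src = userspace_init_fs;
		return 0;
	}

	/* Only this branch reads cur->fs, so the sanity check lives here. */
	if (cur->fs != cur->real_fs)
		model_warnings++;
	*src = cur->fs;
	return 0;
}
```

With this placement the umh path never complains about an overridden
current->fs, because it never reads it.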

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH RFC v2 18/23] fs: add umh argument to struct kernel_clone_args
  2026-03-09 16:06   ` Jann Horn
@ 2026-03-10 11:58     ` Christian Brauner
  0 siblings, 0 replies; 40+ messages in thread
From: Christian Brauner @ 2026-03-10 11:58 UTC (permalink / raw)
  To: Jann Horn
  Cc: linux-fsdevel, Linus Torvalds, linux-kernel, Alexander Viro,
	Jens Axboe, Jan Kara, Tejun Heo

On Mon, Mar 09, 2026 at 05:06:24PM +0100, Jann Horn wrote:
> On Fri, Mar 6, 2026 at 12:31 AM Christian Brauner <brauner@kernel.org> wrote:
> > Add a umh field to struct kernel_clone_args. When set, copy_fs() copies
> > from pid 1's fs_struct instead of the kthread's fs_struct. This ensures
> > usermodehelper threads always get init's filesystem state regardless of
> > their parent's (kthreadd's) fs.
> >
> > Usermodehelper threads are not allowed to create mount namespaces
> > (CLONE_NEWNS), share filesystem state (CLONE_FS), or be started from
> > a non-initial mount namespace. No usermodehelper currently does this so
> > we don't need to worry about this restriction.
> >
> > Set .umh = 1 in user_mode_thread(). At this stage pid 1's fs points to
> > rootfs which is the same as kthreadd's fs, so this is functionally
> > equivalent.
> [...]
> > -static int copy_fs(u64 clone_flags, struct task_struct *tsk)
> > +static int copy_fs(u64 clone_flags, struct task_struct *tsk, bool umh)
> >  {
> > -       struct fs_struct *fs = current->fs;
> > +       struct fs_struct *fs;
> > +
> > +       /*
> > +        * Usermodehelper may use userspace_init_fs filesystem state but
> > +        * they don't get to create mount namespaces, share the
> > +        * filesystem state, or be started from a non-initial mount
> > +        * namespace.
> > +        */
> > +       if (umh) {
> > +               if (clone_flags & (CLONE_NEWNS | CLONE_FS))
> > +                       return -EINVAL;
> > +               if (current->nsproxy->mnt_ns != &init_mnt_ns)
> > +                       return -EINVAL;
> > +               fs = userspace_init_fs;
> > +       } else {
> > +               fs = current->fs;
> > +       }
> >
> >         VFS_WARN_ON_ONCE(current->fs != current->real_fs);
> 
> Should that VFS_WARN_ON_ONCE() be in the else {} block?

I think in spirit the placement you suggest makes more sense.

> I don't know if it could happen that a VFS operation that happens with
> overwritten current->fs calls back into firmware loading or such, or
> if that is anyway impossible for locking reasons or such...

Usermodehelpers are terrible hybrids that always go through workqueue
dispatch. So let's say somehow someone ends up triggering a
usermodehelper under scoped_with_init_fs() - no matter if regular task
or kthread - what would actually happen is:

INIT_WORK(&sub_info->work, call_usermodehelper_exec_work)

which means all usermodehelper creations are done from some other
kthread and never by the caller. IOW, even if the caller has overridden
current->fs the usermodehelper would be created from a pristine kthread
context.
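
A toy userspace model of that argument, assuming nothing about the real
workqueue implementation: the work item is executed by a worker task, so
what it observes is the worker's fs, not the fs of whoever queued it. All
names here (work_model, record_runner_fs(), run_work_on()) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

struct fs_model {
	const char *name;
};

struct task_model {
	struct fs_model *fs;
};

/* A work item that remembers which fs it saw when it ran. */
struct work_model {
	void (*fn)(struct work_model *work, const struct task_model *runner);
	const struct fs_model *seen_fs;
};

/* Work function: record the fs of the task that actually executes it. */
void record_runner_fs(struct work_model *work, const struct task_model *runner)
{
	work->seen_fs = runner->fs;
}

/* Models workqueue dispatch: the worker, not the submitter, runs the work. */
void run_work_on(struct work_model *work, const struct task_model *worker)
{
	assert(work->fn != NULL);
	work->fn(work, worker);
}
```

The submitter's task never appears in run_work_on(), which is the whole
point: its overridden fs cannot leak into the work function.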

^ permalink raw reply	[flat|nested] 40+ messages in thread

* [PATCH RFC v2 19/23] fs: add kthread_mntns()
  2026-03-05 23:30 [PATCH RFC v2 00/23] fs,kthread: start all kthreads in nullfs Christian Brauner
                   ` (17 preceding siblings ...)
  2026-03-05 23:30 ` [PATCH RFC v2 18/23] fs: add umh argument to struct kernel_clone_args Christian Brauner
@ 2026-03-05 23:30 ` Christian Brauner
  2026-03-07  2:04   ` Askar Safin
  2026-03-05 23:30 ` [PATCH RFC v2 20/23] devtmpfs: create private mount namespace Christian Brauner
                   ` (5 subsequent siblings)
  24 siblings, 1 reply; 40+ messages in thread
From: Christian Brauner @ 2026-03-05 23:30 UTC (permalink / raw)
  To: linux-fsdevel, Linus Torvalds
  Cc: linux-kernel, Alexander Viro, Jens Axboe, Jan Kara, Tejun Heo,
	Jann Horn, Christian Brauner

Allow kthreads to create a private mount namespace.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 fs/namespace.c        | 30 ++++++++++++++++++++++++++++++
 include/linux/mount.h |  1 +
 2 files changed, 31 insertions(+)

diff --git a/fs/namespace.c b/fs/namespace.c
index 854f4fc66469..668131aa5de1 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -6200,6 +6200,36 @@ static void __init init_mount_tree(void)
 	ns_tree_add(&init_mnt_ns);
 }
 
+/*
+ * Allow to give a specific kthread a private mount namespace that is
+ * anchored in nullfs so it can mount.
+ */
+int __init kthread_mntns(void)
+{
+	struct mount *m;
+	struct path root;
+	int ret;
+
+	/* Only allowed for kthreads in the initial mount namespace. */
+	VFS_WARN_ON_ONCE(!(current->flags & PF_KTHREAD));
+	VFS_WARN_ON_ONCE(current->nsproxy->mnt_ns != &init_mnt_ns);
+
+	/*
+	 * TODO: switch to creating a completely empty mount namespace
+	 * once that series lands.
+	 */
+	ret = ksys_unshare(CLONE_NEWNS);
+	if (ret)
+		return ret;
+
+	m = current->nsproxy->mnt_ns->root;
+	root.mnt = &m->mnt;
+	root.dentry = root.mnt->mnt_root;
+	set_fs_pwd(current->fs, &root);
+	set_fs_root(current->fs, &root);
+	return 0;
+}
+
 void __init mnt_init(void)
 {
 	int err;
diff --git a/include/linux/mount.h b/include/linux/mount.h
index acfe7ef86a1b..69d61f21b548 100644
--- a/include/linux/mount.h
+++ b/include/linux/mount.h
@@ -106,6 +106,7 @@ int do_mount(const char *, const char __user *,
 extern const struct path *collect_paths(const struct path *, struct path *, unsigned);
 extern void drop_collected_paths(const struct path *, const struct path *);
 extern void kern_unmount_array(struct vfsmount *mnt[], unsigned int num);
+int __init kthread_mntns(void);
 
 extern int cifs_root_data(char **dev, char **opts);
 

-- 
2.47.3


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* Re: [PATCH RFC v2 19/23] fs: add kthread_mntns()
  2026-03-05 23:30 ` [PATCH RFC v2 19/23] fs: add kthread_mntns() Christian Brauner
@ 2026-03-07  2:04   ` Askar Safin
  0 siblings, 0 replies; 40+ messages in thread
From: Askar Safin @ 2026-03-07  2:04 UTC (permalink / raw)
  To: brauner; +Cc: axboe, jack, jannh, linux-fsdevel, linux-kernel, tj, torvalds,
	viro

Christian Brauner <brauner@kernel.org>:
> + * Allow to give a specific kthread a private mount namespace that is
> + * anchored in nullfs so it can mount.

Which nullfs? By the end of this patchset we have two: the one at the root
of the namespace of every userspace task, and the one used by kernel threads.

You meant the userspace one here. (A similar problem may apply to other
patches in this patchset.)

-- 
Askar Safin

^ permalink raw reply	[flat|nested] 40+ messages in thread

* [PATCH RFC v2 20/23] devtmpfs: create private mount namespace
  2026-03-05 23:30 [PATCH RFC v2 00/23] fs,kthread: start all kthreads in nullfs Christian Brauner
                   ` (18 preceding siblings ...)
  2026-03-05 23:30 ` [PATCH RFC v2 19/23] fs: add kthread_mntns() Christian Brauner
@ 2026-03-05 23:30 ` Christian Brauner
  2026-03-05 23:30 ` [PATCH RFC v2 21/23] nullfs: make nullfs multi-instance Christian Brauner
                   ` (4 subsequent siblings)
  24 siblings, 0 replies; 40+ messages in thread
From: Christian Brauner @ 2026-03-05 23:30 UTC (permalink / raw)
  To: linux-fsdevel, Linus Torvalds
  Cc: linux-kernel, Alexander Viro, Jens Axboe, Jan Kara, Tejun Heo,
	Jann Horn, Christian Brauner

Kernel threads are located in a completely isolated nullfs mount.
Make it possible for a kthread to create a private mount namespace so it
can mount private filesystem instances. This is only used by devtmpfs.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 drivers/base/devtmpfs.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/base/devtmpfs.c b/drivers/base/devtmpfs.c
index b1c4ceb65026..246ac0b331fe 100644
--- a/drivers/base/devtmpfs.c
+++ b/drivers/base/devtmpfs.c
@@ -413,7 +413,7 @@ static noinline int __init devtmpfs_setup(void *p)
 {
 	int err;
 
-	err = ksys_unshare(CLONE_NEWNS);
+	err = kthread_mntns();
 	if (err)
 		goto out;
 	err = init_mount("devtmpfs", "/", "devtmpfs", DEVTMPFS_MFLAGS, NULL);

-- 
2.47.3


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH RFC v2 21/23] nullfs: make nullfs multi-instance
  2026-03-05 23:30 [PATCH RFC v2 00/23] fs,kthread: start all kthreads in nullfs Christian Brauner
                   ` (19 preceding siblings ...)
  2026-03-05 23:30 ` [PATCH RFC v2 20/23] devtmpfs: create private mount namespace Christian Brauner
@ 2026-03-05 23:30 ` Christian Brauner
  2026-03-05 23:30 ` [PATCH RFC v2 22/23] fs: start all kthreads in nullfs Christian Brauner
                   ` (3 subsequent siblings)
  24 siblings, 0 replies; 40+ messages in thread
From: Christian Brauner @ 2026-03-05 23:30 UTC (permalink / raw)
  To: linux-fsdevel, Linus Torvalds
  Cc: linux-kernel, Alexander Viro, Jens Axboe, Jan Kara, Tejun Heo,
	Jann Horn, Christian Brauner

Allow multiple instances of nullfs to be created. Right now we're only
going to use it for kernel-internal purposes but ultimately we can allow
userspace to use it too, e.g., to safely overmount stuff.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 fs/nullfs.c | 7 +------
 1 file changed, 1 insertion(+), 6 deletions(-)

diff --git a/fs/nullfs.c b/fs/nullfs.c
index fdbd3e5d3d71..88ba4f3fc3a2 100644
--- a/fs/nullfs.c
+++ b/fs/nullfs.c
@@ -40,14 +40,9 @@ static int nullfs_fs_fill_super(struct super_block *s, struct fs_context *fc)
 	return 0;
 }
 
-/*
- * For now this is a single global instance. If needed we can make it
- * mountable by userspace at which point we will need to make it
- * multi-instance.
- */
 static int nullfs_fs_get_tree(struct fs_context *fc)
 {
-	return get_tree_single(fc, nullfs_fs_fill_super);
+	return get_tree_nodev(fc, nullfs_fs_fill_super);
 }
 
 static const struct fs_context_operations nullfs_fs_context_ops = {

-- 
2.47.3


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH RFC v2 22/23] fs: start all kthreads in nullfs
  2026-03-05 23:30 [PATCH RFC v2 00/23] fs,kthread: start all kthreads in nullfs Christian Brauner
                   ` (20 preceding siblings ...)
  2026-03-05 23:30 ` [PATCH RFC v2 21/23] nullfs: make nullfs multi-instance Christian Brauner
@ 2026-03-05 23:30 ` Christian Brauner
  2026-03-07 22:17   ` Askar Safin
  2026-03-05 23:30 ` [PATCH RFC v2 23/23] fs: stop rewriting kthread fs structs Christian Brauner
                   ` (2 subsequent siblings)
  24 siblings, 1 reply; 40+ messages in thread
From: Christian Brauner @ 2026-03-05 23:30 UTC (permalink / raw)
  To: linux-fsdevel, Linus Torvalds
  Cc: linux-kernel, Alexander Viro, Jens Axboe, Jan Kara, Tejun Heo,
	Jann Horn, Christian Brauner

Point init_task's fs_struct (root and pwd) at a private nullfs instance
instead of the mutable rootfs. All kthreads now start isolated in nullfs
and must use scoped_with_init_fs() for any path resolution.

PID 1 is moved from nullfs into the initramfs by init_userspace_fs().
Usermodehelper threads use userspace_init_fs via the umh flag in
copy_fs(). All subsystems that need init's filesystem state for path
resolution already use scoped_with_init_fs() from earlier commits in
this series.

This isolates kthreads from userspace filesystem state and makes it
hard to perform filesystem operations from kthread context.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 fs/namespace.c | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index 668131aa5de1..2a530109eb36 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -6188,12 +6188,14 @@ static void __init init_mount_tree(void)
 		init_mnt_ns.nr_mounts++;
 	}
 
+	nullfs_mnt = kern_mount(&nullfs_fs_type);
+	if (IS_ERR(nullfs_mnt))
+		panic("VFS: Failed to create private nullfs instance");
+	root.mnt	= nullfs_mnt;
+	root.dentry	= nullfs_mnt->mnt_root;
+
 	init_task.nsproxy->mnt_ns = &init_mnt_ns;
 	get_mnt_ns(&init_mnt_ns);
-
-	/* The root and pwd always point to the mutable rootfs. */
-	root.mnt	= mnt;
-	root.dentry	= mnt->mnt_root;
 	set_fs_pwd(current->fs, &root);
 	set_fs_root(current->fs, &root);
 

-- 
2.47.3


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* Re: [PATCH RFC v2 22/23] fs: start all kthreads in nullfs
  2026-03-05 23:30 ` [PATCH RFC v2 22/23] fs: start all kthreads in nullfs Christian Brauner
@ 2026-03-07 22:17   ` Askar Safin
  0 siblings, 0 replies; 40+ messages in thread
From: Askar Safin @ 2026-03-07 22:17 UTC (permalink / raw)
  To: brauner; +Cc: axboe, jack, jannh, linux-fsdevel, linux-kernel, tj, torvalds,
	viro

Christian Brauner <brauner@kernel.org>:
> +	nullfs_mnt = kern_mount(&nullfs_fs_type);

There is a comment "We create two mounts" above. It is now wrong, because
this function now creates 3 mounts: 2 nullfs instances and 1 tmpfs/ramfs.

-- 
Askar Safin

^ permalink raw reply	[flat|nested] 40+ messages in thread

* [PATCH RFC v2 23/23] fs: stop rewriting kthread fs structs
  2026-03-05 23:30 [PATCH RFC v2 00/23] fs,kthread: start all kthreads in nullfs Christian Brauner
                   ` (21 preceding siblings ...)
  2026-03-05 23:30 ` [PATCH RFC v2 22/23] fs: start all kthreads in nullfs Christian Brauner
@ 2026-03-05 23:30 ` Christian Brauner
  2026-03-07  2:19 ` [PATCH RFC v2 00/23] fs,kthread: start all kthreads in nullfs Askar Safin
  2026-03-09 16:50 ` Jann Horn
  24 siblings, 0 replies; 40+ messages in thread
From: Christian Brauner @ 2026-03-05 23:30 UTC (permalink / raw)
  To: linux-fsdevel, Linus Torvalds
  Cc: linux-kernel, Alexander Viro, Jens Axboe, Jan Kara, Tejun Heo,
	Jann Horn, Christian Brauner

Now that we have isolated kthreads' filesystem state completely from
userspace, stop rewriting their state.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 fs/fs_struct.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/fs/fs_struct.c b/fs/fs_struct.c
index c1afa7513e34..74fcee814be3 100644
--- a/fs/fs_struct.c
+++ b/fs/fs_struct.c
@@ -61,6 +61,9 @@ void chroot_fs_refs(const struct path *old_root, const struct path *new_root)
 
 	read_lock(&tasklist_lock);
 	for_each_process_thread(g, p) {
+		if (p->flags & PF_KTHREAD)
+			continue;
+
 		task_lock(p);
 		fs = p->fs;
 		if (fs) {

-- 
2.47.3


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* Re: [PATCH RFC v2 00/23] fs,kthread: start all kthreads in nullfs
  2026-03-05 23:30 [PATCH RFC v2 00/23] fs,kthread: start all kthreads in nullfs Christian Brauner
                   ` (22 preceding siblings ...)
  2026-03-05 23:30 ` [PATCH RFC v2 23/23] fs: stop rewriting kthread fs structs Christian Brauner
@ 2026-03-07  2:19 ` Askar Safin
  2026-03-09 16:50 ` Jann Horn
  24 siblings, 0 replies; 40+ messages in thread
From: Askar Safin @ 2026-03-07  2:19 UTC (permalink / raw)
  To: brauner; +Cc: axboe, jack, jannh, linux-fsdevel, linux-kernel, tj, torvalds,
	viro

Christian Brauner <brauner@kernel.org>:
> Summary:

The comment in "call_usermodehelper_exec_async" contains this:

> Initial kernel threads share ther FS with init

This is now wrong and should be changed.

-- 
Askar Safin

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH RFC v2 00/23] fs,kthread: start all kthreads in nullfs
  2026-03-05 23:30 [PATCH RFC v2 00/23] fs,kthread: start all kthreads in nullfs Christian Brauner
                   ` (23 preceding siblings ...)
  2026-03-07  2:19 ` [PATCH RFC v2 00/23] fs,kthread: start all kthreads in nullfs Askar Safin
@ 2026-03-09 16:50 ` Jann Horn
  2026-03-10 12:54   ` Christian Brauner
  24 siblings, 1 reply; 40+ messages in thread
From: Jann Horn @ 2026-03-09 16:50 UTC (permalink / raw)
  To: Christian Brauner
  Cc: linux-fsdevel, Linus Torvalds, linux-kernel, Alexander Viro,
	Jens Axboe, Jan Kara, Tejun Heo

On Fri, Mar 6, 2026 at 12:30 AM Christian Brauner <brauner@kernel.org> wrote:
> The places that need to perform lookup in init's filesystem state may
> use scoped_with_init_fs() which will temporarily override the caller's
> fs_struct with init's fs_struct.

One small concern I have about the overall approach is that the use of
scoped_with_init_fs() in non-kernel tasks reminds me a _little_ bit of
the set_fs(KERNEL_DS) mechanism that was removed a few years ago:
There is state in the task that controls whether some argument is
interpreted as a user-supplied, untrusted value or a kernel-supplied
value that is interpreted in some more privileged scope. I think there
were occasionally security issues where userspace-supplied pointers
were accidentally accessed under KERNEL_DS, allowing userspace to
cause accesses to arbitrary kernel addresses - in particular,
performance interrupts could occur in KERNEL_DS sections and attempt
to access userspace stack memory, see
<https://project-zero.issues.chromium.org/42452355>.

I think switching task_struct::fs is much less problematic - path
walks shouldn't happen in IRQ context or such, scoped_with_init_fs()
will likely only be used when accessing paths that unprivileged
userspace has no influence over, and VFS operations normally don't
operate on multiple logically unrelated file paths; but it means we'll
have to keep in mind that filesystem handlers for some operations like
lookup/open can run with weird task_struct::fs.

To be clear, I think what you're doing is fine; it's just something to
keep in mind.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH RFC v2 00/23] fs,kthread: start all kthreads in nullfs
  2026-03-09 16:50 ` Jann Horn
@ 2026-03-10 12:54   ` Christian Brauner
  0 siblings, 0 replies; 40+ messages in thread
From: Christian Brauner @ 2026-03-10 12:54 UTC (permalink / raw)
  To: Jann Horn
  Cc: linux-fsdevel, Linus Torvalds, linux-kernel, Alexander Viro,
	Jens Axboe, Jan Kara, Tejun Heo

On Mon, Mar 09, 2026 at 05:50:36PM +0100, Jann Horn wrote:
> On Fri, Mar 6, 2026 at 12:30 AM Christian Brauner <brauner@kernel.org> wrote:
> > The places that need to perform lookup in init's filesystem state may
> > use scoped_with_init_fs() which will temporarily override the caller's
> > fs_struct with init's fs_struct.
> 
> One small concern I have about the overall approach is that the use of
> scoped_with_init_fs() in non-kernel tasks reminds me a _little_ bit of
> the set_fs(KERNEL_DS) mechanism that was removed a few years ago:
> There is state in the task that controls whether some argument is
> interpreted as a user-supplied, untrusted value or a kernel-supplied
> value that is interpreted in some more privileged scope. I think there
> were occasionally security issues where userspace-supplied pointers
> were accidentally accessed under KERNEL_DS, allowing userspace to
> cause accesses to arbitrary kernel addresses - in particular,
> performance interrupts could occur in KERNEL_DS sections and attempt
> to access userspace stack memory, see
> <https://project-zero.issues.chromium.org/42452355>.
> 
> I think switching task_struct::fs is much less problematic - path
> walks shouldn't happen in IRQ context or such, scoped_with_init_fs()
> will likely only be used when accessing paths that unprivileged
> userspace has no influence over, and VFS operations normally don't
> operate on multiple logically unrelated file paths; but it means we'll
> have to keep in mind that filesystem handlers for some operations like
> lookup/open can run with weird task_struct::fs.
> 
> To be clear, I think what you're doing is fine; it's just something to
> keep in mind.

Just for some background: as it currently stands we have a 1:1 sharing
of filesystem state between all kthreads and pid 1. So effectively a
kthread is in a permanent scoped_with_init_fs() block. Any driver can
just do:

const char *cmd = "/usr/bin/systemctl poweroff";
loff_t pos = 0;
struct file *file = filp_open("/proc/sys/kernel/core_pattern", O_WRONLY, 0);
kernel_write(file, cmd, strlen(cmd), &pos);

which is ofc nonsense but still.

But my wider point is that very few people probably have this implicit
lookup context in mind.

Some people who are aware of this then end up with brilliant ideas such
as writing kernel modules that perform mountains of actual path lookup
work from kthread context because it's just so easy to do and lets them
avoid having to do any real conceptual work to come up with a better
solution.

Offloading fs work to kthreads is really nasty... And we relearned that
lesson not too long ago, when io_uring was still based on kthreads with
custom credential overrides. It's a broken concept.

scoped_with_init_fs() forces the users that do this to acknowledge that
they are now performing lookup work within PID 1's filesystem state. We
have few of those and this will make it harder to gain more.
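
To illustrate the opt-in, here is a minimal userspace model of a scoped
fs_struct override. It is only a sketch and not the implementation from
patch 02/23: every name ending in _model, struct fs_override and
fs_scope_demo() are made up for illustration, and the scope is built on
the GCC/Clang cleanup attribute on the assumption that the real helper
follows the kernel's usual scope-based cleanup style.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Userspace stand-ins; these are NOT the kernel's definitions. */
struct fs_struct_model { const char *root; };
struct task_model { struct fs_struct_model *fs; };

static struct fs_struct_model init_fs_model   = { .root = "/" };
static struct fs_struct_model nullfs_fs_model = { .root = "nullfs:/" };

/* A "kthread" that starts out isolated in nullfs. */
static struct task_model current_model = { .fs = &nullfs_fs_model };

/* Remembers which fs_struct to restore when the scope ends. */
struct fs_override {
	struct task_model *task;
	struct fs_struct_model *saved;
};

static struct fs_override fs_override_begin(struct task_model *t,
					    struct fs_struct_model *fs)
{
	struct fs_override o = { .task = t, .saved = t->fs };

	t->fs = fs;
	return o;
}

/* Cleanup handler: runs on every exit from the scope, including return. */
static void fs_override_end(struct fs_override *o)
{
	o->task->fs = o->saved;
}

/* Hypothetical model of the macro; the loop body executes exactly once. */
#define scoped_with_init_fs_model()					\
	for (struct fs_override _o					\
		__attribute__((cleanup(fs_override_end))) =		\
		fs_override_begin(&current_model, &init_fs_model),	\
	     *_once = &_o; _once; _once = NULL)

/* Stand-in for a path walk: absolute lookups start at current->fs->root. */
static const char *lookup_root_model(void)
{
	return current_model.fs->root;
}

/* Returns 0 if the override is visible only inside the scope. */
static int fs_scope_demo(void)
{
	/* Outside the scope the kthread only sees nullfs. */
	if (strcmp(lookup_root_model(), "nullfs:/") != 0)
		return 1;

	scoped_with_init_fs_model() {
		/* Explicit opt-in: lookups now resolve against init's root. */
		if (strcmp(lookup_root_model(), "/") != 0)
			return 2;
	}

	/* The override was undone when the scope ended. */
	return strcmp(lookup_root_model(), "nullfs:/") != 0 ? 3 : 0;
}
```

The model also shows Jann's caveat: lookup_root_model() has no way of
telling why current_model.fs points at init's state, so everything
called inside the scope runs with the overridden fs_struct.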


Thread overview: 40+ messages
2026-03-05 23:30 [PATCH RFC v2 00/23] fs,kthread: start all kthreads in nullfs Christian Brauner
2026-03-05 23:30 ` [PATCH RFC v2 01/23] fs: notice when init abandons fs sharing Christian Brauner
2026-03-10 16:03   ` Christian Brauner
2026-03-05 23:30 ` [PATCH RFC v2 02/23] fs: add scoped_with_init_fs() Christian Brauner
2026-03-09 15:19   ` Jann Horn
2026-03-10 11:30     ` Christian Brauner
2026-03-05 23:30 ` [PATCH RFC v2 03/23] rnbd: use scoped_with_init_fs() for block device open Christian Brauner
2026-03-05 23:30 ` [PATCH RFC v2 04/23] crypto: ccp: use scoped_with_init_fs() for SEV file access Christian Brauner
2026-03-09 15:37   ` Jann Horn
2026-03-10 11:33     ` Christian Brauner
2026-03-05 23:30 ` [PATCH RFC v2 05/23] scsi: target: use scoped_with_init_fs() for ALUA metadata Christian Brauner
2026-03-05 23:30 ` [PATCH RFC v2 06/23] scsi: target: use scoped_with_init_fs() for APTPL metadata Christian Brauner
2026-03-05 23:30 ` [PATCH RFC v2 07/23] btrfs: use scoped_with_init_fs() for update_dev_time() Christian Brauner
2026-03-05 23:30 ` [PATCH RFC v2 08/23] coredump: use scoped_with_init_fs() for coredump path resolution Christian Brauner
2026-03-05 23:30 ` [PATCH RFC v2 09/23] fs: use scoped_with_init_fs() for kernel_read_file_from_path_initns() Christian Brauner
2026-03-05 23:30 ` [PATCH RFC v2 10/23] ksmbd: use scoped_with_init_fs() for share path resolution Christian Brauner
2026-03-05 23:30 ` [PATCH RFC v2 11/23] ksmbd: use scoped_with_init_fs() for filesystem info path lookup Christian Brauner
2026-03-05 23:30 ` [PATCH RFC v2 12/23] ksmbd: use scoped_with_init_fs() for VFS path operations Christian Brauner
2026-03-05 23:30 ` [PATCH RFC v2 13/23] initramfs: use scoped_with_init_fs() for rootfs unpacking Christian Brauner
2026-03-05 23:30 ` [PATCH RFC v2 14/23] af_unix: use scoped_with_init_fs() for coredump socket lookup Christian Brauner
2026-03-05 23:30 ` [PATCH RFC v2 15/23] fs: add real_fs to track task's actual fs_struct Christian Brauner
2026-03-07  0:51   ` Askar Safin
2026-03-09 15:14   ` Jann Horn
2026-03-10 11:29     ` Christian Brauner
2026-03-10 16:05       ` Christian Brauner
2026-03-05 23:30 ` [PATCH RFC v2 16/23] fs: make userspace_init_fs a dynamically-initialized pointer Christian Brauner
2026-03-05 23:30 ` [PATCH RFC v2 17/23] fs: stop sharing fs_struct between init_task and pid 1 Christian Brauner
2026-03-05 23:30 ` [PATCH RFC v2 18/23] fs: add umh argument to struct kernel_clone_args Christian Brauner
2026-03-09 16:06   ` Jann Horn
2026-03-10 11:58     ` Christian Brauner
2026-03-05 23:30 ` [PATCH RFC v2 19/23] fs: add kthread_mntns() Christian Brauner
2026-03-07  2:04   ` Askar Safin
2026-03-05 23:30 ` [PATCH RFC v2 20/23] devtmpfs: create private mount namespace Christian Brauner
2026-03-05 23:30 ` [PATCH RFC v2 21/23] nullfs: make nullfs multi-instance Christian Brauner
2026-03-05 23:30 ` [PATCH RFC v2 22/23] fs: start all kthreads in nullfs Christian Brauner
2026-03-07 22:17   ` Askar Safin
2026-03-05 23:30 ` [PATCH RFC v2 23/23] fs: stop rewriting kthread fs structs Christian Brauner
2026-03-07  2:19 ` [PATCH RFC v2 00/23] fs,kthread: start all kthreads in nullfs Askar Safin
2026-03-09 16:50 ` Jann Horn
2026-03-10 12:54   ` Christian Brauner
