* [PATCH RFC v3 00/26] fs,kthread: start all kthreads in nullfs
@ 2026-03-11 21:43 Christian Brauner
From: Christian Brauner @ 2026-03-11 21:43 UTC
  To: linux-fsdevel
  Cc: Linus Torvalds, linux-kernel, Alexander Viro, Jens Axboe,
	Jan Kara, Tejun Heo, Jann Horn, Christian Brauner

Summary:

* all kthreads are isolated in a separate SB_KERNMOUNT of nullfs.
  -> no lookup of anything else, no mounting on top of it, completely
  isolated.
* init has a separate fs_struct from all kthreads
* scoped_with_init_fs() allows a kthread to temporarily assume init's
  fs_struct for filesystem operations.

So this is a bit of a crazy series. When the kernel is started it
roughly goes like this:

init_task
==> create pid 1 (systemd etc.)
==> pid 2 (kthreadd)

After this point all kthreads and PID 1 share the same filesystem state.
That obviously already came up when we discussed pivot_root() as this
allows pivot_root() to rewrite the fs_struct of all kthreads.

This rewriting is really weird and mostly done so kthreads can use init's
filesystem state when they would like to. But this really should be
discouraged. The rewriting should also stop completely. I worked a bit
to get rid of it in a more fundamental way. Is it crazy? Yes. Is it
likely broken? Yes. Does it at least boot? Yes.

Instead of sharing fs_struct between kernel threads and pid 1, pid 1
gets a completely separate fs_struct. All kthreads continue sharing
init_fs as before and pid 1's fs_struct is isolated from the kthreads'
filesystem state. IOW, userspace init cannot affect the kthreads'
filesystem state anymore and kthreads cannot affect userspace's
filesystem state anymore - without explicit opt-in.

All kthreads are anchored in a kernel internal mount of nullfs that
cannot be mounted on and that cannot be used to follow other mounts.
It's a completely private mount that insulates kthreads.

This series makes performing mountains of filesystem work such as path
lookup and file opening and so on from kthreads hard - painfully so. I
think this is a benefit because it takes the idea of just offloading
_security sensitive_ operations in init's filesystem state to kthreads
(running random binaries, opening and creating files) behind the
shed... And imho it should.

The only remaining kernel tasks that actually share init's filesystem
state are usermodehelpers - as they execute random binaries in the root
filesystem. Another concept we should really show the back of the shed.

This gives much stronger guarantees than what we have now. This also
makes path lookup from kthreads fail by default. IOW, it won't be
possible anymore to just look up random stuff in init's filesystem state
without explicitly opting in to that.

The places that need to perform lookup in init's filesystem state may
use scoped_with_init_fs() which will temporarily override the caller's
fs_struct with init's fs_struct.
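
To make the save/override/restore pattern concrete, here is a minimal
userspace model of it. This is not the code from the series: the struct
layouts, the globals and the helper names below are hypothetical
stand-ins, and the sketch relies on the GCC/Clang cleanup attribute to
restore the caller's fs when the scope ends. The one property it is
meant to illustrate is that the override hands back the *original* fs so
that the revert actually restores it.

```c
#include <assert.h>
#include <string.h>

/* Userspace stand-ins; the kernel's fs_struct and task_struct look different. */
struct fs_struct {
	const char *root;
};

struct task {
	struct fs_struct *fs;
};

/* Hypothetical model state: a kthread whose fs is anchored in nullfs. */
static struct fs_struct kthread_fs = { .root = "nullfs" };
static struct fs_struct init_fs_model = { .root = "/" };
static struct task current_task = { .fs = &kthread_fs };

/* Install init's fs and return the caller's original fs, not the override. */
static struct fs_struct *override_init_fs(void)
{
	struct fs_struct *old = current_task.fs;

	current_task.fs = &init_fs_model;
	return old;
}

/* Cleanup handler: receives a pointer to the saved fs and puts it back. */
static void revert_init_fs(struct fs_struct **old)
{
	current_task.fs = *old;
}

/* Everything in this function body runs with init's fs installed. */
static const char *root_seen_in_init_scope(void)
{
	struct fs_struct *saved __attribute__((cleanup(revert_init_fs))) =
		override_init_fs();

	/* The return value is computed before the cleanup handler runs. */
	return current_task.fs->root;
}

/* Small self-check showing the state before, inside and after the scope. */
static void scoped_override_selfcheck(void)
{
	assert(strcmp(current_task.fs->root, "nullfs") == 0);
	assert(strcmp(root_seen_in_init_scope(), "/") == 0);
	assert(current_task.fs == &kthread_fs);
}
```

Calling scoped_override_selfcheck() from a test harness exercises all
three states. The point of tying the revert to scope exit is that an
early return cannot leave the task stuck with init's fs.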

We now also warn and notice when pid 1 simply stops sharing filesystem
state with us, i.e., abandons its userspace_init_fs.

On older kernels, if PID 1 unshared its filesystem state with us, the
kernel simply kept using the stale fs_struct state, implicitly pinning
anything that PID 1 had last used, even if PID 1 might've moved on to
some completely different fs_struct state and might've even unmounted
the old root.

This has hilarious consequences: think continuing to dump coredump
state into an implicitly pinned directory somewhere, or calling random
binaries in the old rootfs via usermodehelpers.

Be aggressive about this: We simply reject operating on stale
fs_struct state by reverting userspace_init_fs to nullfs. Every kworker
that does lookups after this point will fail. Every usermodehelper call
will fail. This is a lot stronger but I wouldn't know what it means for
pid 1 to simply stop sharing its fs state with the kernel. Clearly it
wanted to separate so cut all ties.

I've gone through the kernel and looked at hopefully everything that
does path lookup from kthreads (workqueues, ...).

TL;DR:

==== PID 1 (systemd) ====

  root@localhost:~# stat --file-system /proc/1/root
    File: "/proc/1/root"
      ID: e3cb00dd533cd3d7 Namelen: 255     Type: ext2/ext3

  root@localhost:~# cat /proc/1/mountinfo | wc -l
  30

==== PID 2 (kthreadd) ====

  root@localhost:~# stat --file-system /proc/2/root
    File: "/proc/2/root"
      ID: 200000000 Namelen: 255     Type: nullfs

  root@localhost:~# cat /proc/2/mountinfo | wc -l
  0
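
For anyone who wants to do the same check from a program instead of
stat(1), a small sketch using statfs(2) could look like this. The
helper name is made up for illustration, and the sketch deliberately
only reports the raw f_type value rather than comparing it against a
nullfs magic number, since that constant is not given in this cover
letter.

```c
#include <stdio.h>
#include <sys/vfs.h>

/*
 * Query the filesystem type behind @path (e.g. "/proc/2/root").
 * Returns 0 and stores the raw f_type in @type on success, -1 on failure.
 */
static int root_fs_type(const char *path, unsigned long *type)
{
	struct statfs st;

	if (statfs(path, &st) != 0)
		return -1;

	*type = (unsigned long)st.f_type;
	return 0;
}

/* Print the filesystem type for @path in hex, or a short error. */
static void print_root_fs_type(const char *path)
{
	unsigned long type;

	if (root_fs_type(path, &type) == 0)
		printf("%s: f_type=0x%lx\n", path, type);
	else
		printf("%s: statfs failed\n", path);
}
```

Calling print_root_fs_type("/proc/1/root") and
print_root_fs_type("/proc/2/root") as root should show two different
f_type values on a kernel with this series applied.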

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
Changes in v3:
- Fix __override_init_fs() to save and return the original fs instead of
  the override, so __revert_init_fs() actually restores the caller's fs.
- Replace smp_store_release() with WRITE_ONCE() in fs override/revert.
- Move userspace_init_fs wiring commit before conversion patches to fix
  bisectability for user-process-context callers.
- Switch all remote procfs accesses (mounts_open_common, get_task_root,
  proc_cwd_link, task_state umask, kcmp) to use task_struct::real_fs
  instead of task_struct::fs.
- Move VFS_WARN_ON_ONCE in copy_fs() into else block so it doesn't fire
  for UMH threads.
- Fix sleeping under task_lock: validate_fs_switch() now runs outside
  task_lock() with might_sleep() annotation.
- Add pnfs/blocklayout scoped_with_init_fs() conversion.
- Wrap security_initramfs_populated() inside scoped_with_init_fs() in
  initramfs unpacking since IPE accesses current->fs->root.
- Fix stale comments: "two mounts" -> "three mounts", UMH comment,
  kthread_mntns() nullfs ambiguity.
- Fix commit message mismatches.
- Link to v2: https://patch.msgid.link/20260306-work-kthread-nullfs-v2-0-ad1b4bed7d3e@kernel.org

Changes in v2:
- Remove LOOKUP_IN_INIT in favor of scoped_with_init_fs().
- Link to v1: https://patch.msgid.link/20260303-work-kthread-nullfs-v1-0-87e559b94375@kernel.org

---
Christian Brauner (26):
      fs: add switch_fs_struct()
      fs: notice when init abandons fs sharing
      fs: add scoped_with_init_fs()
      fs: add real_fs to track task's actual fs_struct
      fs: make userspace_init_fs a dynamically-initialized pointer
      rnbd: use scoped_with_init_fs() for block device open
      crypto: ccp: use scoped_with_init_fs() for SEV file access
      scsi: target: use scoped_with_init_fs() for ALUA metadata
      scsi: target: use scoped_with_init_fs() for APTPL metadata
      btrfs: use scoped_with_init_fs() for update_dev_time()
      coredump: use scoped_with_init_fs() for coredump path resolution
      fs: use scoped_with_init_fs() for kernel_read_file_from_path_initns()
      ksmbd: use scoped_with_init_fs() for share path resolution
      ksmbd: use scoped_with_init_fs() for filesystem info path lookup
      ksmbd: use scoped_with_init_fs() for VFS path operations
      pnfs/blocklayout: use scoped_with_init_fs() for SCSI device lookup
      initramfs: use scoped_with_init_fs() for rootfs unpacking
      af_unix: use scoped_with_init_fs() for coredump socket lookup
      fs: stop sharing fs_struct between init_task and pid 1
      fs: add umh argument to struct kernel_clone_args
      fs: add kthread_mntns()
      devtmpfs: create private mount namespace
      nullfs: make nullfs multi-instance
      fs: start all kthreads in nullfs
      fs: stop rewriting kthread fs structs
      fs: stop rewriting paths for PF_EXITING | PF_DUMPCORE

 drivers/base/devtmpfs.c           |   2 +-
 drivers/block/rnbd/rnbd-srv.c     |   4 +-
 drivers/crypto/ccp/sev-dev.c      |  12 ++---
 drivers/target/target_core_alua.c |   6 ++-
 drivers/target/target_core_pr.c   |   4 +-
 fs/btrfs/volumes.c                |  11 +++-
 fs/coredump.c                     |  11 ++--
 fs/fs_struct.c                    | 103 ++++++++++++++++++++++++++++++++++++--
 fs/kernel_read_file.c             |   9 +---
 fs/namespace.c                    |  46 ++++++++++++++---
 fs/nfs/blocklayout/dev.c          |  13 +++--
 fs/nullfs.c                       |  12 ++---
 fs/proc/array.c                   |   4 +-
 fs/proc/base.c                    |   8 +--
 fs/proc_namespace.c               |   4 +-
 fs/smb/server/mgmt/share_config.c |   4 +-
 fs/smb/server/smb2pdu.c           |   4 +-
 fs/smb/server/vfs.c               |  14 ++++--
 include/linux/fs_struct.h         |  34 +++++++++++++
 include/linux/init_task.h         |   1 +
 include/linux/mount.h             |   1 +
 include/linux/sched.h             |   1 +
 include/linux/sched/task.h        |   1 +
 init/init_task.c                  |   1 +
 init/initramfs.c                  |  14 ++++--
 init/main.c                       |  10 +++-
 kernel/fork.c                     |  53 ++++++++++++--------
 kernel/kcmp.c                     |   2 +-
 kernel/umh.c                      |   6 +--
 net/unix/af_unix.c                |  17 +++----
 30 files changed, 307 insertions(+), 105 deletions(-)
---
base-commit: c107785c7e8dbabd1c18301a1c362544b5786282
change-id: 20260303-work-kthread-nullfs-875a837f4198

