* [PATCH RFC DRAFT POC 00/11] fs,kthread: isolate all kthreads in nullfs
@ 2026-03-03 13:49 Christian Brauner
2026-03-03 13:49 ` [PATCH RFC DRAFT POC 01/11] kthread: refactor __kthread_create_on_node() to take a struct argument Christian Brauner
` (11 more replies)
0 siblings, 12 replies; 14+ messages in thread
From: Christian Brauner @ 2026-03-03 13:49 UTC (permalink / raw)
To: linux-fsdevel, Linus Torvalds
Cc: linux-kernel, Alexander Viro, Jens Axboe, Jan Kara, Tejun Heo,
Jann Horn, Christian Brauner
So this is a bit of a crazy series and I've played around with it for
some time and I kinda need to move on to other stuff so I'm sending out
where I've left this as it's overall in a shape where the approach and
idea can be grasped. There are some kthread cleanups at the beginning as
well that are mostly unrelated but fell out of this work, since the
whole approach of adding ever more special helper functions is not very
sustainable. But anyway...
... When the kernel is started it roughly goes like this:
init_task
==> create pid 1 (systemd etc.)
==> pid 2 (kthreadd)
After this point all kthreads and PID 1 share the same filesystem state.
That obviously already came up when we discussed pivot_root() as this
allows pivot_root() to rewrite the fs_struct of all kthreads.
I kinda hate this rewriting and the implicit sharing behind it, which is
abused left and right - but who knows maybe others really like it - so I worked a
bit to get rid of it in a more fundamental way. Is it crazy? Yes. Is it
likely broken? Yes. Does it at least boot? Yes.
Instead of sharing fs_struct between kernel threads and pid 1 we give
pid 1 a separate userspace_init_fs struct. All kthreads continue sharing
init_fs as before and userspace_init_fs is isolated from the kthreads'
filesystem state. IOW, userspace init cannot affect kthreads' filesystem
state anymore and kthreads cannot affect userspace's filesystem state
anymore - without explicit opt-in.
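To make the split concrete, here is a small userspace toy model of it. Nothing in it is kernel code: `struct toy_fs`, `struct toy_task` and the helpers are made-up stand-ins, and the only names taken from the series are init_fs and userspace_init_fs. It just shows that a root change made through PID 1 is invisible to tasks that hold a different struct.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for the kernel's struct fs_struct: a root name and a user count. */
struct toy_fs {
	const char *root;	/* which filesystem "/" resolves to */
	int users;		/* how many tasks share this struct */
};

/* Toy stand-in for a task; only the fs pointer matters here. */
struct toy_task {
	const char *comm;
	struct toy_fs *fs;
};

/* Share @fs with @t, the way CLONE_FS children share their parent's fs state. */
void toy_attach(struct toy_task *t, struct toy_fs *fs)
{
	t->fs = fs;
	fs->users++;
}

/* A chroot/pivot_root-like change: visible to every task sharing @t->fs. */
void toy_set_root(struct toy_task *t, const char *new_root)
{
	t->fs->root = new_root;
}

/* Two tasks share filesystem state only if they point at the same struct. */
bool toy_shares_fs(const struct toy_task *a, const struct toy_task *b)
{
	return a->fs == b->fs;
}
```

With PID 1 attached to one struct and kthreadd plus a kworker attached to another, `toy_set_root()` on PID 1 leaves the kthreads' root untouched, which is the isolation described above.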
This series makes performing mountains of filesystem work such as path
lookup and file opening and so on from kthreads hard - painfully so. I
think this is a benefit because it takes the idea of just offloading
_security sensitive_ operations in init's filesystem state to kthreads -
running random binaries or opening and creating files - behind the
shed... And imho it should.
The only remaining kernel tasks that actually share init's filesystem
state are usermodehelpers - as they execute random binaries in the root
filesystem. Another concept we should really show the back of the shed.
This gives a lot stronger guarantees than what we have now. This also
makes path lookup from kthreads fail by default. IOW, it won't be
possible anymore to just look up random stuff in init's filesystem state
without explicitly opting in to that.
The places that need to perform lookup in init's filesystem state may
use LOOKUP_IN_INIT which will grab userspace_init_fs and use that for
root or pwd. Note that we can't just walk up to the topmost mount,
otherwise someone in userspace can do mount -t tmpfs tmpfs / and mess
with a kthread's lookup state. We also sometimes might need the working
directory.
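A rough userspace sketch of the opt-in, to show the shape of the decision rather than the actual namei.c change. The bit value of `LK_LOOKUP_IN_INIT`, `struct lk_fs` and both helpers are invented for illustration; only the flag name LOOKUP_IN_INIT and userspace_init_fs come from the series.

```c
#include <stddef.h>

/* Hypothetical bit value; stands in for the series' LOOKUP_IN_INIT flag. */
#define LK_LOOKUP_IN_INIT 0x1u

/* Toy filesystem state: where "/" and "." point. */
struct lk_fs {
	const char *root;
	const char *pwd;
};

/* Pick the fs state that supplies root and pwd for this lookup. */
const struct lk_fs *lk_pick_fs(const struct lk_fs *current_fs,
			       const struct lk_fs *userspace_init_fs,
			       unsigned int flags)
{
	if (flags & LK_LOOKUP_IN_INIT)
		return userspace_init_fs;	/* explicit opt-in */
	return current_fs;			/* default: the caller's own state */
}

/* Absolute paths start at root, relative ones at the working directory. */
const char *lk_start_dir(const struct lk_fs *fs, const char *path)
{
	return path[0] == '/' ? fs->root : fs->pwd;
}
```

Because the starting point is taken from a specific struct instead of being derived by walking up the mount tree, an overmount of / in userspace does not change what the kthread sees.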
We now also warn and notice when pid 1 simply stops sharing filesystem
state with us, i.e., abandons its userspace_init_fs.
On older kernels if PID 1 unshared its filesystem state with us the
kernel simply used the stale fs_struct state implicitly pinning
anything that PID 1 had last used. Even if PID 1 might've moved on to
some completely different fs_struct state and might've even unmounted
the old root.
This has hilarious consequences: Think continuing to dump coredump
state into an implicitly pinned directory somewhere. Calling random
binaries in the old rootfs via usermodehelpers.
Be aggressive about this: We simply reject operating on stale
fs_struct state by reverting userspace_init_fs to nullfs. Every kworker
that does lookups after this point will fail. Every usermodehelper call
will fail. This is a lot stronger but I wouldn't know what it means for
pid 1 to simply stop sharing its fs state with the kernel. Clearly it
wanted to separate so cut all ties.
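As a toy illustration of the "cut all ties" behaviour, under the assumption that a lookup against nullfs simply returns -ENOENT: `struct ab_state` and both functions are hypothetical and do not correspond to helpers in the series.

```c
#include <errno.h>
#include <stdbool.h>
#include <string.h>

/* Toy view of what the kernel tracks about init's filesystem state. */
struct ab_state {
	const char *userspace_init_root;	/* "rootfs" while shared, "nullfs" after */
	bool warned;				/* set once the abandonment is noticed */
};

/* Hypothetical hook: PID 1 unshared its fs state, so revert to nullfs. */
void ab_init_abandons_fs(struct ab_state *s)
{
	s->userspace_init_root = "nullfs";
	s->warned = true;
}

/* An opt-in lookup succeeds only while a real root is still attached. */
int ab_lookup_in_init(const struct ab_state *s)
{
	return strcmp(s->userspace_init_root, "nullfs") == 0 ? -ENOENT : 0;
}
```

The point is that nothing keeps working on the stale state: after the revert every opt-in lookup fails instead of silently using whatever PID 1 last had.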
I've gone through the kernel and looked at hopefully everything that
does path lookup from kthreads (workqueues, ...).
The only really unfortunate place is initramfs unpacking because it runs
mostly from a workqueue but if there's "too much work" pending it will
fall back to synchronous in-task execution. Ideally it would just always
go async instead of this weird fallback.
TL;DR:
root@localhost:~# stat --file-system /proc/1/root
File: "/proc/1/root"
ID: e3cb00dd533cd3d7 Namelen: 255 Type: ext2/ext3
root@localhost:~# stat --file-system /proc/2/root
File: "/proc/2/root"
ID: 200000000 Namelen: 255 Type: nullfs
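For anyone who wants to check this programmatically: as far as I know `stat --file-system` is a thin wrapper around statfs(2), so a minimal sketch looks like the following. `fs_magic_of()` is a made-up helper name, and reading /proc/<pid>/root of another task usually needs privileges.

```c
#include <errno.h>
#include <sys/vfs.h>

/* Store the filesystem magic number of @path in *@magic; return 0 or -errno. */
int fs_magic_of(const char *path, long *magic)
{
	struct statfs st;

	if (statfs(path, &st) != 0)
		return -errno;
	*magic = (long)st.f_type;
	return 0;
}
```

Comparing the values returned for "/proc/1/root" and "/proc/2/root" should show two different filesystem types on a kernel with this series applied.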
=========================================================================
Here's my review. It's long and ugly, I might have missed stuff:
=========================================================================
==== 1. devtmpfs -- kdevtmpfs kthread ====
Dedicated kthread sharing init_fs (nullfs).
```
kernel_init_freeable() # PID 1
-> do_basic_setup()
-> driver_init()
-> devtmpfs_init()
-> kthread_run(devtmpfsd, &err, "kdevtmpfs")
-> devtmpfsd() # kdevtmpfs kthread context
-> devtmpfs_setup() # runs IN the kthread
-> devtmpfs_work_loop() # runtime loop IN the kthread
```
`devtmpfs_setup()` runs inside the kdevtmpfs kthread, NOT PID 1. However, it is
safe because:
- `ksys_unshare(CLONE_NEWNS)` implies `CLONE_FS` giving the kthread a
**private** copy of init_fs.
- `init_mount("devtmpfs", "/", ...)` mounts devtmpfs over the nullfs root
- `init_chdir("/.."); init_chroot(".")` chroots into the devtmpfs mount
All runtime paths (`handle_create`, `handle_remove`, `create_path`,
`delete_path`) operate within this private chroot via
`devtmpfs_work_loop()`.
**No conversion needed**
==== 2. ksmbd -- `ksmbd-io` workqueue ====
Let's ignore for a second that this basically does all I/O from
kthread context and the security implications of this...
Heaviest subsystem user. Every SMB file operation goes through workqueue
path lookups. Per-connection kthreads (`ksmbd_conn_handler_loop`) read
requests and dispatch to the `ksmbd-io` workqueue via
`handle_ksmbd_work()`.
**Converted to LOOKUP_IN_INIT**
==== 3. nfsd -- kthreads + laundromat workqueue ====
nfsd service threads are kthreads spawned via `kthread_create_on_node` in
`svc_new_thread()`. The `nfsd()` threadfn is passed through
`svc_create_pooled()` -> `serv->sv_threadfn`.
**Service kthreads (`nfsd()` threadfn):**
The nfsd kthreads call `unshare_fs_struct()` on startup for umask control
(`current->fs->umask = 0`), not for path lookups. NFS request handling
dispatches through `svc_recv()` -> NFS procedure handlers which use
**filehandle-based resolution** (`fh_verify()` etc.) relative to export
mount points. They never resolve paths from `current->fs->root`.
**No conversion needed**
==== 4. kernel_init (PID 1 before execve) ====
All `init_*()` wrappers in `fs/init.c` do `kern_path()` or
`filename_create()`/`filename_parentat()`. The lookup API table is
listed once here; the callchains below show every path that reaches them
from PID 1.
**Callchain 1: kernel_init() direct**
```
kernel_init()
-> do_sysctl_args()
-> process_sysctl_arg()
-> file_open_root_mnt() # uses kern_mount'd procfs, not fs_struct
```
**Callchain 2: kernel_init_freeable() direct**
```
kernel_init() # PID 1
-> kernel_init_freeable()
-> console_on_rootfs()
-> filp_open("/dev/console", ...)
-> init_eaccess(ramdisk_execute_command)
-> kern_path()
```
**Callchain 3: prepare_namespace() -> mount_root()**
```
kernel_init() # PID 1
-> kernel_init_freeable()
-> prepare_namespace()
-> mount_root()
-> mount_root_generic()
-> do_mount_root()
-> init_mount()
-> init_chdir("/")
-> mount_nodev_root()
-> do_mount_root()
-> init_mount()
-> init_chdir("/")
-> mount_nfs_root()
-> do_mount_root()
-> init_mount()
-> init_chdir("/")
-> mount_cifs_root()
-> do_mount_root()
-> init_mount()
-> init_chdir("/")
-> mount_block_root()
-> create_dev()
-> init_unlink()
-> init_mknod()
-> mount_root_generic()
-> do_mount_root()
-> init_mount()
-> init_chdir("/")
```
**Callchain 4: prepare_namespace() -> initrd_load()**
```
kernel_init() # PID 1
-> kernel_init_freeable()
-> prepare_namespace()
-> initrd_load()
-> create_dev()
-> init_unlink()
-> init_mknod()
-> rd_load_image()
-> filp_open() (x2)
-> init_unlink()
```
**Callchain 5: prepare_namespace() -> devtmpfs_mount()**
```
kernel_init() # PID 1
-> kernel_init_freeable()
-> prepare_namespace()
-> devtmpfs_mount()
-> init_mount("devtmpfs", "dev", ...)
```
Note: this is `devtmpfs_mount()` called from PID 1 context (mounts
devtmpfs at /dev after the real root is mounted). Distinct from
`devtmpfs_setup()` which runs in the kdevtmpfs kthread (section 1).
**Callchain 6: prepare_namespace() -> pivot + umount**
```
kernel_init() # PID 1
-> kernel_init_freeable()
-> prepare_namespace()
-> init_pivot_root(".", ".") # kern_path() x2
-> init_umount(".", MNT_DETACH) # kern_path()
```
**Callchain 7: prepare_namespace() -> md_run_setup()**
```
kernel_init() # PID 1
-> kernel_init_freeable()
-> prepare_namespace()
-> md_run_setup()
-> md_setup_drive()
-> init_stat()
```
**Callchain 8: do_basic_setup() -> do_initcalls() (rootfs_initcall)**
```
kernel_init() # PID 1
-> kernel_init_freeable()
-> do_basic_setup()
-> do_initcalls()
-> rootfs_initcall(default_rootfs)
-> default_rootfs()
-> init_mkdir("/dev", 0755)
-> init_mknod("/dev/console", ...)
-> init_mkdir("/root", 0700)
```
Only used when `CONFIG_BLK_DEV_INITRD` is not set (no initramfs).
PID 1 uses pid1_fs which points to the initramfs (set by
`init_chroot_to_overmount()` at the start of `kernel_init()`). The
correct context is available so nothing to worry about.
**No conversion needed**
==== 5. Initramfs/initrd unpacking -- async kworker ====
`do_populate_rootfs()` runs as `async_schedule_domain()` callback
(kworker). When `initramfs_async=0` it runs synchronously in
`kernel_init` context instead.
**Async workqueue creation:**
```
async_init()
-> alloc_workqueue("async", WQ_UNBOUND, 0) # line 359
```
**Async scheduling chain (how do_populate_rootfs ends up in kworker):**
```
kernel_init() # PID 1
-> init_fs() # init/main.c -- switches PID 1 to pid1_fs (rootfs)
-> kernel_init_freeable()
-> do_basic_setup()
-> do_initcalls()
-> rootfs_initcall(populate_rootfs) # init/initramfs.c:791
-> populate_rootfs() # init/initramfs.c:782
-> async_schedule_domain(do_populate_rootfs, NULL, &initramfs_domain) # line 784
-> async_schedule_node_domain() # include/linux/async.h:69
-> __async_schedule_node_domain() # kernel/async.c:150
-> INIT_WORK(&entry->work, async_run_entry_fn) # line 162
-> entry->func = do_populate_rootfs # line 163
-> queue_work_node(node, async_wq, &entry->work) # line 180
-> kworker picks up work item # async_wq = "async" WQ_UNBOUND workqueue
-> async_run_entry_fn() # kernel/async.c:122
-> entry->func(entry->data, entry->cookie) # line 139
-> do_populate_rootfs(NULL, cookie) # RUNS IN KWORKER CONTEXT
```
Note: `async_schedule_node_domain()` has an OOM fallback that runs
`func(data, newcookie)` synchronously in the caller's context (PID 1)
if `kzalloc` fails or `entry_count > MAX_WORK` (kernel/async.c:215-221).
In that case the function runs safely in PID 1. The async kworker case
is the one that needs conversion.
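The fallback decision can be sketched in userspace like this. `AS_MAX_WORK` is assumed to mirror `MAX_WORK` from kernel/async.c, and the enum and function names are invented for the example.

```c
#include <stdbool.h>

/* Assumed to mirror MAX_WORK in kernel/async.c. */
#define AS_MAX_WORK 32768

/* Where the async callback ends up running. */
enum as_ctx { AS_CTX_CALLER, AS_CTX_KWORKER };

/* Decide the execution context from allocation outcome and pending count. */
enum as_ctx as_pick_context(bool alloc_ok, int entry_count)
{
	if (!alloc_ok || entry_count > AS_MAX_WORK)
		return AS_CTX_CALLER;	/* synchronous fallback in the caller (PID 1 here) */
	return AS_CTX_KWORKER;		/* queued to the async workqueue */
}

/* Only the kworker case sees init_fs (nullfs) and needs the opt-in. */
bool as_needs_lookup_in_init(enum as_ctx ctx)
{
	return ctx == AS_CTX_KWORKER;
}
```

So the same function can run with two different filesystem states depending on memory pressure, which is why the fallback is awkward.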
Work items execute in kworker kthreads (children of kthreadd, share init_fs).
The kworker's `current->fs` is `init_fs` which now points to **nullfs**.
**Callchain 1: do_name() regular file creation (S_ISREG)**
```
do_populate_rootfs() # kworker context (async_wq)
-> unpack_to_rootfs(__initramfs_start, __initramfs_size) # init/initramfs.c:721
-> write_buffer() # init/initramfs.c:465
-> actions[GotName] = do_name() # init/initramfs.c:361
-> clean_path(collected, mode) # init/initramfs.c:378
-> init_stat(path, &st, AT_SYMLINK_NOFOLLOW) # init/initramfs.c:337
-> kern_path() # fs/init.c:150
-> filename_lookup(AT_FDCWD, ...) # fs/namei.c:2836
-> path_lookupat() # fs/namei.c:2813
-> path_init() # fs/namei.c:2673
-> nd_jump_root() # absolute paths
-> set_root() # uses current->fs = init_fs (NULLFS)
-> init_rmdir(path) # init/initramfs.c:340 (if S_ISDIR)
-> filename_rmdir(AT_FDCWD, name) # fs/init.c:194
-> filename_parentat() -> path_parentat() -> path_init() -> current->fs (NULLFS)
-> init_unlink(path) # init/initramfs.c:342 (if not S_ISDIR)
-> filename_unlinkat(AT_FDCWD, name) # fs/init.c:182
-> filename_parentat() -> path_parentat() -> path_init() -> current->fs (NULLFS)
-> maybe_link() # init/initramfs.c:380
-> find_link() # init/initramfs.c:90 (hardlink hash lookup)
[if hardlink found:]
-> clean_path(collected, 0) # same as above (init_stat/init_rmdir/init_unlink)
-> init_link(old, collected) # init/initramfs.c:352
-> filename_linkat(AT_FDCWD, old, AT_FDCWD, new, 0) # fs/init.c:169
-> filename_lookup(olddfd, old, ...) # -> path_lookupat() -> path_init() -> NULLFS
-> filename_create(newdfd, new, ...) # -> filename_parentat() -> path_init() -> NULLFS
[if not hardlink:]
-> filp_open(collected, O_WRONLY|O_CREAT|O_LARGEFILE, mode) # init/initramfs.c:385
-> file_open_name() # fs/open.c:1338
-> do_file_open(AT_FDCWD, name, &op) # fs/open.c:1322
-> path_openat() # fs/namei.c:4821
-> path_init() # -> nd_jump_root() -> set_root() -> NULLFS
-> vfs_fchown(wfile, uid, gid) # init/initramfs.c:391 (on already-open file, SAFE)
-> vfs_fchmod(wfile, mode) # init/initramfs.c:392 (on already-open file, SAFE)
-> vfs_truncate(&wfile->f_path, body_len) # init/initramfs.c:394 (on already-open path, SAFE)
```
**Callchain 2: do_name() directory creation (S_ISDIR)**
```
do_populate_rootfs() # kworker context
-> unpack_to_rootfs()
-> write_buffer() -> do_name()
-> clean_path(collected, mode) # init/initramfs.c:378 (same as callchain 1)
-> init_mkdir(collected, mode) # init/initramfs.c:398
-> filename_mkdirat(AT_FDCWD, name, mode) # fs/init.c:188
-> filename_create(AT_FDCWD, name, ...) # fs/namei.c:4903
-> filename_parentat(AT_FDCWD, name, ...) # fs/namei.c:2900
-> __filename_parentat() # fs/namei.c:2875
-> path_parentat() # fs/namei.c:2858
-> path_init() # -> nd_jump_root() -> set_root() -> NULLFS
-> init_chown(collected, uid, gid, 0) # init/initramfs.c:399
-> kern_path(filename, LOOKUP_FOLLOW, &path) # fs/init.c:106
-> filename_lookup(AT_FDCWD, ...) # -> path_lookupat() -> path_init() -> NULLFS
-> init_chmod(collected, mode) # init/initramfs.c:400
-> kern_path(filename, LOOKUP_FOLLOW, &path) # fs/init.c:123
-> filename_lookup(AT_FDCWD, ...) # -> path_lookupat() -> path_init() -> NULLFS
-> dir_add(collected, name_len, mtime) # init/initramfs.c:401 (saves for later dir_utime)
```
**Callchain 3: do_name() device/pipe/socket creation (S_ISBLK/S_ISCHR/S_ISFIFO/S_ISSOCK)**
```
do_populate_rootfs() # kworker context
-> unpack_to_rootfs()
-> write_buffer() -> do_name()
-> clean_path(collected, mode) # init/initramfs.c:378 (same as callchain 1)
-> maybe_link() # init/initramfs.c:404
[if not hardlink:]
-> init_mknod(collected, mode, rdev) # init/initramfs.c:405
-> filename_mknodat(AT_FDCWD, name, mode, dev) # fs/init.c:162
-> filename_create(AT_FDCWD, name, ...) # fs/namei.c:4903
-> filename_parentat() # -> path_parentat() -> path_init() -> NULLFS
-> init_chown(collected, uid, gid, 0) # init/initramfs.c:406
-> kern_path() # -> filename_lookup() -> path_init() -> NULLFS
-> init_chmod(collected, mode) # init/initramfs.c:407
-> kern_path() # -> filename_lookup() -> path_init() -> NULLFS
-> do_utime(collected, mtime) # init/initramfs.c:408
-> init_utimes(filename, t) # init/initramfs.c:136
-> kern_path(filename, 0, &path) # fs/init.c:202
-> filename_lookup(AT_FDCWD, ...) # -> path_lookupat() -> path_init() -> NULLFS
```
**Callchain 4: do_symlink() symlink creation (S_ISLNK)**
```
do_populate_rootfs() # kworker context
-> unpack_to_rootfs()
-> write_buffer()
-> actions[GotSymlink] = do_symlink() # init/initramfs.c:436
-> clean_path(collected, 0) # init/initramfs.c:445
-> init_stat() -> kern_path() # -> path_init() -> NULLFS
-> init_rmdir() or init_unlink() # -> filename_parentat() -> path_init() -> NULLFS
-> init_symlink(collected + N_ALIGN(name_len), collected) # init/initramfs.c:446
-> filename_symlinkat(old, AT_FDCWD, new) # fs/init.c:176
-> filename_create(AT_FDCWD, new, ...) # fs/namei.c:4903
-> filename_parentat() # -> path_parentat() -> path_init() -> NULLFS
-> init_chown(collected, uid, gid, AT_SYMLINK_NOFOLLOW) # init/initramfs.c:447
-> kern_path(filename, 0, &path) # fs/init.c:106 (lookup_flags = 0, no LOOKUP_FOLLOW)
-> filename_lookup(AT_FDCWD, ...) # -> path_lookupat() -> path_init() -> NULLFS
-> do_utime(collected, mtime) # init/initramfs.c:448
-> init_utimes(filename, t) # init/initramfs.c:136
-> kern_path(filename, 0, &path) # fs/init.c:202
-> filename_lookup(AT_FDCWD, ...) # -> path_lookupat() -> path_init() -> NULLFS
```
**Callchain 5: dir_utime() directory timestamp fixup (CONFIG_INITRAMFS_PRESERVE_MTIME)**
```
do_populate_rootfs() # kworker context
-> unpack_to_rootfs()
[at end of unpack_to_rootfs, after all cpio entries processed:]
-> dir_utime() # init/initramfs.c:567
-> list_for_each_entry_safe(de, ...) # init/initramfs.c:168
-> do_utime(de->name, de->mtime) # init/initramfs.c:170
-> init_utimes(filename, t) # init/initramfs.c:136
-> kern_path(filename, 0, &path) # fs/init.c:202
-> filename_lookup(AT_FDCWD, ...) # fs/namei.c:2836
-> path_lookupat() # -> path_init() -> NULLFS
```
**Callchain 6: populate_initrd_image() non-cpio initrd (CONFIG_BLK_DEV_RAM)**
```
do_populate_rootfs() # kworker context
-> unpack_to_rootfs((char *)initrd_start, ...) # init/initramfs.c:733 (returns error for non-cpio)
[err != NULL && CONFIG_BLK_DEV_RAM:]
-> populate_initrd_image(err) # init/initramfs.c:736
-> filp_open("/initrd.image", O_WRONLY|O_CREAT|O_LARGEFILE, 0700) # init/initramfs.c:705
-> file_open_name(name, flags, mode) # fs/open.c:1338
-> do_file_open(AT_FDCWD, name, &op) # fs/open.c:1322
-> path_openat(&nd, op, flags) # fs/namei.c:4821
-> path_init(&nd, flags) # fs/namei.c:2673
-> nd_jump_root() # absolute path "/"
-> set_root() # uses current->fs = init_fs (NULLFS)
-> xwrite(file, ...) # init/initramfs.c:709 (write to already-open file, SAFE)
-> fput(file) # init/initramfs.c:714
```
**Callchain 7: do_name() hardlink via maybe_link()**
```
do_populate_rootfs() # kworker context
-> unpack_to_rootfs()
-> write_buffer() -> do_name()
-> clean_path(collected, mode) # init/initramfs.c:378 (same as callchain 1)
[S_ISREG(mode):]
-> maybe_link() # init/initramfs.c:380
-> find_link(major, minor, ino, mode, collected) # init/initramfs.c:90
[returns non-NULL old name for nlink >= 2 and matching hash entry:]
-> clean_path(collected, 0) # init/initramfs.c:351
-> init_stat() -> kern_path() # -> path_init() -> NULLFS
-> init_rmdir() or init_unlink() # -> path_init() -> NULLFS
-> init_link(old, collected) # init/initramfs.c:352
-> filename_linkat(AT_FDCWD, old, AT_FDCWD, new, 0) # fs/init.c:169
-> filename_lookup(AT_FDCWD, old, 0, &old_path, NULL) # fs/namei.c:5816
-> path_lookupat() -> path_init() # -> NULLFS
-> filename_create(AT_FDCWD, new, &new_path, 0) # fs/namei.c:5822
-> filename_parentat() -> path_parentat() -> path_init() # -> NULLFS
```
When `initramfs_async=1` (the default), `do_populate_rootfs()` runs
in an async kworker. The kworker's `current->fs` is `init_fs` which
now points to **nullfs**. All path lookups resolve "/" against the
nullfs root.
The rootfs (initramfs) is overmounted on top of nullfs's root dentry.
However, `path_init()` does **not** follow overmounts when establishing
the starting point — it sets `nd->path` to the raw `current->fs->root`
(nullfs root dentry on nullfs vfsmount). Mount following only occurs
during component-by-component traversal in `link_path_walk()` via
`step_into()` -> `handle_mounts()`. Since the starting dentry is the
nullfs root (below the overmount), component lookups call nullfs's
`->lookup` which returns -ENOENT (nullfs has no directory entries).
**Result: all `init_*()` and `filp_open()` calls will fail with -ENOENT
in async kworker context.**
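A toy model of why the overmount does not help. `struct om_dir` and both functions are hypothetical; the `follow_start` switch exists only to contrast the behaviour described above with what would happen if the starting point did follow overmounts.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Toy directory: a filesystem name, its children, and an optional overmount. */
struct om_dir {
	const char *fs_name;
	const char *const *entries;	/* NULL-terminated child names, or NULL if empty */
	const struct om_dir *overmount;	/* filesystem mounted on top of this dir, if any */
};

/* ->lookup stand-in: is @name a child of @dir? */
static bool om_has_entry(const struct om_dir *dir, const char *name)
{
	if (!dir->entries)
		return false;
	for (size_t i = 0; dir->entries[i]; i++)
		if (strcmp(dir->entries[i], name) == 0)
			return true;
	return false;
}

/*
 * Resolve one component below @root. With @follow_start false the walk
 * begins at the raw root, as described for path_init() above.
 */
int om_lookup(const struct om_dir *root, const char *name, bool follow_start)
{
	const struct om_dir *start = root;

	while (follow_start && start->overmount)
		start = start->overmount;
	return om_has_entry(start, name) ? 0 : -ENOENT;
}
```

With rootfs overmounted on an empty nullfs root, `om_lookup(&nullfs, "dev", false)` fails with -ENOENT even though "dev" exists in rootfs.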
When `initramfs_async=0`, `populate_rootfs()` calls
`wait_for_initramfs()` which calls `async_synchronize_cookie_domain()`
to wait for the async work to complete. But the work was already queued
to the async_wq workqueue — `wait_for_initramfs` does not change which
context runs the work. The work still runs in a kworker.
However, there is the OOM fallback: if `kzalloc` fails in
`async_schedule_node_domain()`, the function runs synchronously in PID 1
context (safe).
**Converted to LOOKUP_IN_INIT**
==== 6. Firmware loader -- system workqueue ====
Reached via `request_firmware_nowait()` -> workqueue ->
`request_firmware_work_func()`, and also via synchronous
`request_firmware()` from any kthread caller (428+ callers across
drivers).
Already uses `kernel_read_file_from_path_initns()` which calls
`init_root()`.
**Converted**
==== 7. IMA/EVM integrity -- kernel_init kthread ====
kernel_init() # init/main.c
-> kernel_init_freeable()
-> integrity_load_keys() # hook, called when rootfs is ready
+- ima_load_x509()
| -> integrity_load_x509()
| -> kernel_read_file_from_path() # NOT _initns
+- evm_load_x509() # if !CONFIG_IMA_LOAD_X509
-> integrity_load_x509()
-> kernel_read_file_from_path() # NOT _initns
This is called from PID 1 before init is exec'd where we are chrooted into
the initramfs. The correct context will be available so nothing to worry about.
**No conversion needed**
==== 8. Btrfs -- `btrfs-devrepl` kthread ====
**Kthread creation:**
```
open_ctree() / btrfs_remount_rw() # mount/remount context
-> btrfs_start_pre_rw_mount() # fs/btrfs/disk-io.c:3038
-> btrfs_resume_dev_replace_async() # fs/btrfs/dev-replace.c:1188
-> kthread_run(btrfs_dev_replace_kthread, ..., "btrfs-devrepl") # line 1237
-> kthread_create() -> kthreadd -> kernel_thread(CLONE_FS|CLONE_FILES|SIGCHLD)
```
Dedicated kthread sharing init_fs (nullfs).
**Callchain 1 (kthread -- NEEDS CONVERSION):**
```
btrfs_dev_replace_kthread() # fs/btrfs/dev-replace.c:1239 [kthread context]
-> btrfs_scrub_dev()
-> btrfs_dev_replace_finishing() # fs/btrfs/dev-replace.c:856
-> btrfs_scratch_superblocks() # fs/btrfs/volumes.c:2266
-> update_dev_time() # fs/btrfs/volumes.c:2119
-> kern_path()
```
**Callchain 2 (kthread, error path -- NEEDS CONVERSION):**
```
btrfs_dev_replace_kthread() # fs/btrfs/dev-replace.c:1239 [kthread context]
-> btrfs_dev_replace_finishing() # fs/btrfs/dev-replace.c:856
-> btrfs_destroy_dev_replace_tgtdev() # fs/btrfs/volumes.c:2512 (error/cleanup)
-> btrfs_scratch_superblocks() # fs/btrfs/volumes.c:2266
-> update_dev_time() # fs/btrfs/volumes.c:2119
-> kern_path()
```
**Callchain 3 (ioctl DEV_REPLACE_CMD_START -- SAFE: user context):**
```
btrfs_ioctl() # user syscall context
-> btrfs_ioctl_dev_replace() # fs/btrfs/ioctl.c:3112
-> btrfs_dev_replace_by_ioctl() # fs/btrfs/dev-replace.c:730
-> btrfs_dev_replace_start() # fs/btrfs/dev-replace.c:584
-> btrfs_dev_replace_finishing() # fs/btrfs/dev-replace.c:856
-> btrfs_scratch_superblocks() # fs/btrfs/volumes.c:2266
-> update_dev_time() # fs/btrfs/volumes.c:2119
-> kern_path()
```
**Callchain 4 (ioctl DEV_REPLACE_CMD_START, error -- SAFE: user context):**
```
btrfs_ioctl()
-> btrfs_ioctl_dev_replace() # fs/btrfs/ioctl.c:3112
-> btrfs_dev_replace_by_ioctl() # fs/btrfs/dev-replace.c:730
-> btrfs_dev_replace_start() # fs/btrfs/dev-replace.c:584
-> btrfs_destroy_dev_replace_tgtdev() # error/leave path, line 711
-> btrfs_scratch_superblocks()
-> update_dev_time()
-> kern_path()
```
**Callchain 5 (ioctl DEV_REPLACE_CMD_START, nested error -- SAFE: user context):**
```
btrfs_ioctl()
-> btrfs_ioctl_dev_replace() # fs/btrfs/ioctl.c:3112
-> btrfs_dev_replace_by_ioctl()
-> btrfs_dev_replace_start()
-> btrfs_dev_replace_finishing()
-> btrfs_destroy_dev_replace_tgtdev() # error within finishing
-> btrfs_scratch_superblocks()
-> update_dev_time()
-> kern_path()
```
**Callchain 6 (ioctl DEV_REPLACE_CMD_CANCEL -- SAFE: user context):**
```
btrfs_ioctl()
-> btrfs_ioctl_dev_replace() # fs/btrfs/ioctl.c:3112
-> btrfs_dev_replace_cancel() # fs/btrfs/dev-replace.c:1075
-> btrfs_destroy_dev_replace_tgtdev() # fs/btrfs/volumes.c:2512
-> btrfs_scratch_superblocks()
-> update_dev_time()
-> kern_path()
```
**Callchain 7 (ioctl BTRFS_IOC_RM_DEV -- SAFE: user context):**
```
btrfs_ioctl()
-> btrfs_ioctl_rm_dev() # fs/btrfs/ioctl.c:2582
-> btrfs_rm_device() # fs/btrfs/volumes.c:2288
-> btrfs_scratch_superblocks() # fs/btrfs/volumes.c:2266
-> update_dev_time()
-> kern_path()
```
**Callchain 8 (ioctl BTRFS_IOC_RM_DEV_V2 -- SAFE: user context):**
```
btrfs_ioctl()
-> btrfs_ioctl_rm_dev_v2() # fs/btrfs/ioctl.c:2514
-> btrfs_rm_device() # fs/btrfs/volumes.c:2288
-> btrfs_scratch_superblocks()
-> update_dev_time()
-> kern_path()
```
**Callchain 9 (ioctl BTRFS_IOC_ADD_DEV -- SAFE: user context):**
```
btrfs_ioctl()
-> btrfs_ioctl_add_dev() # fs/btrfs/ioctl.c:2455
-> btrfs_init_new_device() # fs/btrfs/volumes.c:2802
-> update_dev_time() # fs/btrfs/volumes.c:2119
-> kern_path()
```
Only callchains 1 and 2 (the `btrfs-devrepl` kthread and its error
path) need conversion. All other paths are ioctl/user syscall context.
**Converted to LOOKUP_IN_INIT**
==== 9. SCSI Target (LIO) -- target workqueues ====
Via `target_queued_submit_work` / `target_complete_ok_work` workqueues.
**Workqueue creation:**
```
module_init(target_core_init_configfs) # drivers/target/target_core_configfs.c:3852
-> init_se_kmem_caches() # drivers/target/target_core_transport.c:60
-> alloc_workqueue("target_completion", WQ_MEM_RECLAIM|WQ_PERCPU, 0) # line 128
-> alloc_workqueue("target_submission", WQ_MEM_RECLAIM|WQ_PERCPU, 0) # line 133
```
Work items execute in kworker kthreads (children of kthreadd, share init_fs).
**Converted to LOOKUP_IN_INIT**
==== 10. RNBD server -- RDMA CQ workqueue (IB_POLL_WORKQUEUE) ====
**Workqueue creation:**
The underlying workqueue is `ib_comp_wq`:
```
module_init(ib_core_init) # drivers/infiniband/core/device.c:2994
-> alloc_workqueue("ib-comp-wq",
WQ_HIGHPRI|WQ_MEM_RECLAIM|WQ_SYSFS|WQ_PERCPU, 0) # line 3007
```
At connection time, CQ completion work is bound to `ib_comp_wq`:
```
rtrs_srv_rdma_cm_handler()
-> create_con() # drivers/infiniband/ulp/rtrs/rtrs-srv.c:1704
-> rtrs_cq_qp_create(..., IB_POLL_WORKQUEUE) # line 1759
-> ib_alloc_cq() -> __ib_alloc_cq() # drivers/infiniband/core/cq.c:212
-> cq->comp_wq = ib_comp_wq # line 276
```
Work items execute in kworker kthreads (children of kthreadd, share init_fs).
**Converted to LOOKUP_IN_INIT**
==== 11. NFS client pNFS block layout -- rpciod/nfsiod workqueue (potentially) ====
**Workqueue creation:**
```
module_init(init_nfs_fs) # fs/nfs/inode.c:2809
-> nfsiod_start() # fs/nfs/inode.c:2620
-> alloc_workqueue("nfsiod", WQ_MEM_RECLAIM|WQ_UNBOUND, 0) # line 2627
```
Work items execute in kworker kthreads (children of kthreadd, share init_fs).
**Converted to LOOKUP_IN_INIT**
==== 12. NFS4 referral -- automount ====
`nfs4_submount()` is an automount callback triggered during path walk.
Always user process context.
**No conversion needed**
==== 13. Cachefiles -- fscache cookie workers ====
The fscache cookie worker path is workqueue context:
```
fscache_cookie_worker() [work_struct]
-> fscache_cookie_state_machine()
-> fscache_perform_lookup() -> cachefiles_lookup_cookie()
-> cachefiles_look_up_object() -> lookup_one_positive_unlocked()
-> fscache_perform_invalidation() -> cachefiles_invalidate_cookie()
-> cachefiles_bury_object() -> lookup_one()
```
However, `lookup_one()` and `lookup_one_positive_unlocked()` are **dentry-level
lookups** relative to a parent dentry. They do NOT use
`current->fs->root` for path resolution.
The `cachefiles_add_cache()` -> `kern_path()` path is daemon ioctl context
(user process).
**No conversion needed**
==== 14. Audit subsystem -- netlink handler ====
The `kern_path()` calls are all reached via:
```
audit_receive() -> audit_receive_msg() -> audit_trim_trees() / audit_tag_tree()
```
`audit_receive()` is a netlink callback running in the **context of the
userspace process** (auditctl) that sent the netlink message. The
`prune_tree_thread` kthread (launched by `audit_launch_prune()`) calls
`prune_one()` which does NOT do path lookups. **No conversion needed.**
==== 15. AMD SEV -- `__init` path ====
Uses `init_root()` via `open_file_as_root()` -> `file_open_root()`
(drivers/crypto/ccp/sev-dev.c:265).
**Converted**
==== 16. Overlayfs -- VFS operation context ====
Triggered from `ovl_open()` / `ovl_d_real()` -- inherits caller's
context.
Uses `vfs_path_lookup(layer->mnt->mnt_root, layer->mnt, ...)` with an
explicit root/vfsmount. Does not go through fs_struct root at all.
**No conversion needed.**
==== 17. Module init (kthread context when built-in) ====
Note: `early_boot_devpath()` uses `early_lookup_bdev()` (not `kern_path`)
for the device lookup, but then calls `init_unlink()` and `init_mknod()`
which perform path lookups via `filename_unlinkat()` and
`filename_mknodat()`.
Built-in `module_init` runs in PID 1 context. Modules loaded at runtime
run in modprobe context (user process, safe).
**No conversion needed.**
==== 18. EROFS -- mount operation ====
`erofs_fc_get_tree()` is the `.get_tree` callback in `erofs_context_ops`
(`fs/erofs/super.c:884`), invoked via `vfs_get_tree()`.
The `filp_open()` call happens only on the `CONFIG_EROFS_FS_BACKED_BY_FILE`
path, when `get_tree_bdev_flags()` returns `-ENOTBLK` and the source is a
regular file.
**kthread path (boot-time root mount):**
```
kernel_init_freeable()
-> prepare_namespace()
-> mount_root()
-> mount_root_generic() / mount_nodev_root()
-> do_mount_root()
-> init_mount()
-> path_mount()
-> do_new_mount()
-> vfs_get_tree()
-> erofs_fc_get_tree()
```
This is reachable when erofs is used as the root filesystem
(`rootfstype=erofs`). At this point PID 1 is spawned via
`user_mode_thread(kernel_init, ...)` but has not yet exec'd the
userspace init binary. It will have the correct lookup context as
we chrooted into initramfs.
**No conversion needed.**
Signed-off-by: Christian Brauner <brauner@kernel.org>
---
Christian Brauner (11):
kthread: refactor __kthread_create_on_node() to take a struct argument
kthread: remove unused flags argument from kthread worker creation API
kthread: add extensible kthread_create()/kthread_run() pattern
fs: notice when init abandons fs sharing
fs: add LOOKUP_IN_INIT
fs: add file_open_init()
block: add bdev_file_open_init()
fs: allow to pass lookup flags to filename_*()
fs: add init_root()
tree-wide: make all kthread path lookups to use LOOKUP_IN_INIT
fs: isolate all kthreads in nullfs
arch/x86/kvm/i8254.c | 2 +-
block/bdev.c | 60 ++++++--
crypto/crypto_engine.c | 2 +-
drivers/block/rnbd/rnbd-srv.c | 2 +-
drivers/char/misc_minor_kunit.c | 2 +-
drivers/cpufreq/cppc_cpufreq.c | 2 +-
drivers/crypto/ccp/sev-dev.c | 4 +-
drivers/dpll/zl3073x/core.c | 2 +-
drivers/gpu/drm/drm_vblank_work.c | 6 +-
.../gpu/drm/i915/gem/selftests/i915_gem_context.c | 4 +-
drivers/gpu/drm/i915/gt/selftest_execlists.c | 2 +-
drivers/gpu/drm/i915/gt/selftest_hangcheck.c | 4 +-
drivers/gpu/drm/i915/gt/selftest_slpc.c | 2 +-
drivers/gpu/drm/i915/selftests/i915_request.c | 12 +-
drivers/gpu/drm/msm/disp/msm_disp_snapshot.c | 2 +-
drivers/gpu/drm/msm/msm_atomic.c | 2 +-
drivers/gpu/drm/msm/msm_gpu.c | 2 +-
drivers/gpu/drm/msm/msm_kms.c | 2 +-
.../media/platform/chips-media/wave5/wave5-vpu.c | 2 +-
drivers/net/dsa/mv88e6xxx/chip.c | 2 +-
drivers/net/ethernet/intel/ice/ice_dpll.c | 4 +-
drivers/net/ethernet/intel/ice/ice_gnss.c | 2 +-
drivers/net/ethernet/intel/ice/ice_ptp.c | 4 +-
drivers/platform/chrome/cros_ec_spi.c | 2 +-
drivers/ptp/ptp_clock.c | 2 +-
drivers/spi/spi.c | 2 +-
drivers/target/target_core_alua.c | 2 +-
drivers/target/target_core_pr.c | 2 +-
drivers/usb/gadget/function/uvc_video.c | 2 +-
drivers/usb/typec/tcpm/tcpm.c | 2 +-
drivers/vdpa/vdpa_sim/vdpa_sim.c | 4 +-
drivers/watchdog/watchdog_dev.c | 2 +-
fs/btrfs/volumes.c | 6 +-
fs/coredump.c | 8 +-
fs/erofs/zdata.c | 2 +-
fs/fs_struct.c | 92 ++++++++++++
fs/init.c | 23 +--
fs/internal.h | 18 ++-
fs/kernel_read_file.c | 4 +-
fs/namei.c | 71 +++++----
fs/namespace.c | 4 -
fs/nfs/blocklayout/dev.c | 4 +-
fs/open.c | 25 +++
fs/smb/server/mgmt/share_config.c | 3 +-
fs/smb/server/smb2pdu.c | 2 +-
fs/smb/server/vfs.c | 6 +-
include/linux/blkdev.h | 2 +
include/linux/fs.h | 1 +
include/linux/fs_struct.h | 5 +
include/linux/init_task.h | 1 +
include/linux/kthread.h | 97 +++++++-----
include/linux/namei.h | 3 +-
include/linux/sched/task.h | 1 +
init/initramfs.c | 4 +-
init/initramfs_test.c | 4 +-
init/main.c | 10 +-
io_uring/fs.c | 10 +-
kernel/fork.c | 40 +++--
kernel/kthread.c | 167 ++++++++++++++-------
kernel/rcu/tree.c | 4 +-
kernel/sched/ext.c | 2 +-
kernel/workqueue.c | 2 +-
net/dsa/tag_ksz.c | 4 +-
net/dsa/tag_ocelot_8021q.c | 2 +-
net/dsa/tag_sja1105.c | 4 +-
net/unix/af_unix.c | 4 +-
66 files changed, 526 insertions(+), 257 deletions(-)
---
base-commit: 10047142d6ce3b8562546c61f3cf57f852b9b950
change-id: 20260303-work-kthread-nullfs-875a837f4198
* [PATCH RFC DRAFT POC 01/11] kthread: refactor __kthread_create_on_node() to take a struct argument
2026-03-03 13:49 [PATCH RFC DRAFT POC 00/11] fs,kthread: isolate all kthreads in nullfs Christian Brauner
@ 2026-03-03 13:49 ` Christian Brauner
2026-03-03 13:49 ` [PATCH RFC DRAFT POC 02/11] kthread: remove unused flags argument from kthread worker creation API Christian Brauner
` (10 subsequent siblings)
11 siblings, 0 replies; 14+ messages in thread
From: Christian Brauner @ 2026-03-03 13:49 UTC (permalink / raw)
To: linux-fsdevel, Linus Torvalds
Cc: linux-kernel, Alexander Viro, Jens Axboe, Jan Kara, Tejun Heo,
Jann Horn, Christian Brauner
Refactor __kthread_create_on_node() to take a const struct
kthread_create_info pointer instead of individual parameters. The
caller fills in the relevant fields in a stack-local struct and the
helper heap-copies it, making it trivial to add new kthread creation
options without changing the function signature.
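The struct-argument pattern can be modeled in plain userspace C. This is
only a sketch under assumptions: struct create_info, create_on_node(),
and demo_threadfn() are hypothetical stand-ins rather than the kernel
symbols, and malloc() substitutes for kmalloc_obj().

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical userspace stand-in for struct kthread_create_info. */
struct create_info {
	int (*threadfn)(void *data);
	void *data;
	int node;
	unsigned int worker:1;	/* a new option costs one field, not a new parameter */
};

/*
 * Heap-copy the caller's template so the copy can outlive the caller's
 * stack frame. Returns NULL on allocation failure; the caller frees it.
 */
static struct create_info *create_on_node(const struct create_info *info)
{
	struct create_info *create;

	assert(info != NULL);
	create = malloc(sizeof(*create));
	if (!create)
		return NULL;
	*create = *info;	/* struct assignment copies every field */
	return create;
}

/* Hypothetical thread function, used only for illustration. */
static int demo_threadfn(void *data)
{
	return *(int *)data + 1;
}
```

A caller fills only the fields it cares about with designated
initializers; everything it leaves out is zero-initialized, so adding a
field later does not touch existing call sites.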
As part of this, collapse __kthread_create_worker_on_node() into
__kthread_create_on_node() by adding a kthread_worker:1 bitfield to
struct kthread_create_info. When set, the unified helper allocates and
initializes the kthread_worker internally, removing the need for a
separate helper.
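A rough userspace model of the collapsed helper, assuming hypothetical
names throughout (struct spawn_info, struct spawn_worker,
spawn_thread()); the fail_spawn parameter only exists to simulate thread
creation failing after the worker was allocated:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct kthread_worker. */
struct spawn_worker {
	int ready;
};

/* Hypothetical stand-in for struct kthread_create_info. */
struct spawn_info {
	int (*threadfn)(void *data);
	void *data;
	unsigned int want_worker:1;
};

/* Plays the role of the worker loop function in this model. */
static int spawn_worker_fn(void *data)
{
	struct spawn_worker *worker = data;

	return worker->ready;
}

/*
 * Unified helper: copies the template and, if want_worker is set,
 * allocates the worker itself and points threadfn/data at it.
 * Returns 0 and fills *out on success, -1 on failure.
 */
static int spawn_thread(const struct spawn_info *info, bool fail_spawn,
			struct spawn_info *out)
{
	struct spawn_info create = *info;
	struct spawn_worker *worker = NULL;

	assert(out != NULL);
	if (create.want_worker) {
		worker = calloc(1, sizeof(*worker));
		if (!worker)
			return -1;
		worker->ready = 1;
		create.threadfn = spawn_worker_fn;
		create.data = worker;
	}

	if (fail_spawn) {
		free(worker);	/* error path owns the worker; free(NULL) is a no-op */
		return -1;
	}

	*out = create;
	return 0;
}
```

The point of the sketch is ownership: because the helper allocates the
worker, the helper also has to free it on every failure path, which is
what the IS_ERR(task) check at free_create does in the patch.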
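Also switch create_kthread() from the kernel_thread() wrapper to
constructing struct kernel_clone_args directly and calling
kernel_clone(). This makes the clone flags explicit: CLONE_VM and
CLONE_UNTRACED, which kernel_thread() added implicitly, are now spelled
out, and SIGCHLD moves from the flags word into .exit_signal. It also
prepares for passing richer per-kthread arguments through
kernel_clone_args in subsequent patches.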
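Why the two call styles end up with the same arguments can be checked
with a small userspace model. This is a sketch under assumptions: the
X_-prefixed constants are local copies of the usual Linux UAPI values
(X_SIGCHLD uses the x86 number), struct clone_args_model is
hypothetical, and via_wrapper() models how I understand the old wrapper
to behave, namely ORing in CLONE_VM | CLONE_UNTRACED and splitting the
exit signal out of the low byte.

```c
#include <assert.h>
#include <stdint.h>

/* Local copies of the UAPI values so the sketch builds anywhere. */
#define X_CSIGNAL		0x000000ffu
#define X_CLONE_VM		0x00000100u
#define X_CLONE_FS		0x00000200u
#define X_CLONE_FILES		0x00000400u
#define X_CLONE_UNTRACED	0x00800000u
#define X_SIGCHLD		17u	/* x86 value, an assumption of this sketch */

/* Hypothetical two-field model of struct kernel_clone_args. */
struct clone_args_model {
	uint64_t flags;
	int exit_signal;
};

/* Old style: signal packed into the low byte, wrapper adds VM | UNTRACED. */
static struct clone_args_model via_wrapper(uint32_t flags)
{
	struct clone_args_model args = {
		.flags = (flags | X_CLONE_VM | X_CLONE_UNTRACED) & ~X_CSIGNAL,
		.exit_signal = (int)(flags & X_CSIGNAL),
	};

	return args;
}

/* New style: every flag and the exit signal are written out by the caller. */
static struct clone_args_model explicit_args(void)
{
	struct clone_args_model args = {
		.flags = X_CLONE_VM | X_CLONE_FS | X_CLONE_FILES | X_CLONE_UNTRACED,
		.exit_signal = (int)X_SIGCHLD,
	};

	assert((args.flags & X_CSIGNAL) == 0);	/* no signal bits left in flags */
	return args;
}
```

Comparing via_wrapper(X_CLONE_FS | X_CLONE_FILES | X_SIGCHLD) with
explicit_args() field by field is the "no functional change" argument
for this hunk in miniature.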
No functional change.
Signed-off-by: Christian Brauner <brauner@kernel.org>
---
kernel/kthread.c | 87 +++++++++++++++++++++++++++++++-------------------------
1 file changed, 48 insertions(+), 39 deletions(-)
diff --git a/kernel/kthread.c b/kernel/kthread.c
index 791210daf8b4..84d535c7a635 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -45,6 +45,7 @@ struct kthread_create_info
int (*threadfn)(void *data);
void *data;
int node;
+ u32 kthread_worker:1;
/* Result passed back to kthread_create() from kthreadd. */
struct task_struct *result;
@@ -451,13 +452,20 @@ int tsk_fork_get_node(struct task_struct *tsk)
static void create_kthread(struct kthread_create_info *create)
{
int pid;
+ struct kernel_clone_args args = {
+ .flags = CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_UNTRACED,
+ .exit_signal = SIGCHLD,
+ .fn = kthread,
+ .fn_arg = create,
+ .name = create->full_name,
+ .kthread = 1,
+ };
#ifdef CONFIG_NUMA
current->pref_node_fork = create->node;
#endif
/* We want our own signal handler (we take no signals by default). */
- pid = kernel_thread(kthread, create, create->full_name,
- CLONE_FS | CLONE_FILES | SIGCHLD);
+ pid = kernel_clone(&args);
if (pid < 0) {
/* Release the structure when caller killed by a fatal signal. */
struct completion *done = xchg(&create->done, NULL);
@@ -472,21 +480,32 @@ static void create_kthread(struct kthread_create_info *create)
}
}
-static __printf(4, 0)
-struct task_struct *__kthread_create_on_node(int (*threadfn)(void *data),
- void *data, int node,
+static struct task_struct *__kthread_create_on_node(const struct kthread_create_info *info,
const char namefmt[],
va_list args)
{
DECLARE_COMPLETION_ONSTACK(done);
+ struct kthread_worker *worker = NULL;
struct task_struct *task;
- struct kthread_create_info *create = kmalloc_obj(*create);
+ struct kthread_create_info *create;
+ create = kmalloc_obj(*create);
if (!create)
return ERR_PTR(-ENOMEM);
- create->threadfn = threadfn;
- create->data = data;
- create->node = node;
+
+ *create = *info;
+
+ if (create->kthread_worker) {
+ worker = kzalloc_obj(*worker);
+ if (!worker) {
+ kfree(create);
+ return ERR_PTR(-ENOMEM);
+ }
+ kthread_init_worker(worker);
+ create->threadfn = kthread_worker_fn;
+ create->data = worker;
+ }
+
create->done = &done;
create->full_name = kvasprintf(GFP_KERNEL, namefmt, args);
if (!create->full_name) {
@@ -520,6 +539,8 @@ struct task_struct *__kthread_create_on_node(int (*threadfn)(void *data),
}
task = create->result;
free_create:
+ if (IS_ERR(task))
+ kfree(worker);
kfree(create);
return task;
}
@@ -552,11 +573,16 @@ struct task_struct *kthread_create_on_node(int (*threadfn)(void *data),
const char namefmt[],
...)
{
+ struct kthread_create_info info = {
+ .threadfn = threadfn,
+ .data = data,
+ .node = node,
+ };
struct task_struct *task;
va_list args;
va_start(args, namefmt);
- task = __kthread_create_on_node(threadfn, data, node, namefmt, args);
+ task = __kthread_create_on_node(&info, namefmt, args);
va_end(args);
return task;
@@ -1045,34 +1071,6 @@ int kthread_worker_fn(void *worker_ptr)
}
EXPORT_SYMBOL_GPL(kthread_worker_fn);
-static __printf(3, 0) struct kthread_worker *
-__kthread_create_worker_on_node(unsigned int flags, int node,
- const char namefmt[], va_list args)
-{
- struct kthread_worker *worker;
- struct task_struct *task;
-
- worker = kzalloc_obj(*worker);
- if (!worker)
- return ERR_PTR(-ENOMEM);
-
- kthread_init_worker(worker);
-
- task = __kthread_create_on_node(kthread_worker_fn, worker,
- node, namefmt, args);
- if (IS_ERR(task))
- goto fail_task;
-
- worker->flags = flags;
- worker->task = task;
-
- return worker;
-
-fail_task:
- kfree(worker);
- return ERR_CAST(task);
-}
-
/**
* kthread_create_worker_on_node - create a kthread worker
* @flags: flags modifying the default behavior of the worker
@@ -1086,13 +1084,24 @@ __kthread_create_worker_on_node(unsigned int flags, int node,
struct kthread_worker *
kthread_create_worker_on_node(unsigned int flags, int node, const char namefmt[], ...)
{
+ struct kthread_create_info info = {
+ .node = node,
+ .kthread_worker = 1,
+ };
struct kthread_worker *worker;
+ struct task_struct *task;
va_list args;
va_start(args, namefmt);
- worker = __kthread_create_worker_on_node(flags, node, namefmt, args);
+ task = __kthread_create_on_node(&info, namefmt, args);
va_end(args);
+ if (IS_ERR(task))
+ return ERR_CAST(task);
+
+ worker = kthread_data(task);
+ worker->flags = flags;
+ worker->task = task;
return worker;
}
EXPORT_SYMBOL(kthread_create_worker_on_node);
--
2.47.3
* [PATCH RFC DRAFT POC 02/11] kthread: remove unused flags argument from kthread worker creation API
2026-03-03 13:49 [PATCH RFC DRAFT POC 00/11] fs,kthread: isolate all kthreads in nullfs Christian Brauner
2026-03-03 13:49 ` [PATCH RFC DRAFT POC 01/11] kthread: refactor __kthread_create_on_node() to take a struct argument Christian Brauner
@ 2026-03-03 13:49 ` Christian Brauner
2026-03-03 13:49 ` [PATCH RFC DRAFT POC 03/11] kthread: add extensible kthread_create()/kthread_run() pattern Christian Brauner
` (9 subsequent siblings)
11 siblings, 0 replies; 14+ messages in thread
From: Christian Brauner @ 2026-03-03 13:49 UTC (permalink / raw)
To: linux-fsdevel, Linus Torvalds
Cc: linux-kernel, Alexander Viro, Jens Axboe, Jan Kara, Tejun Heo,
Jann Horn, Christian Brauner
Every caller of kthread_create_worker(), kthread_run_worker(),
kthread_create_worker_on_cpu(), and kthread_run_worker_on_cpu() passes
0 for the flags argument. The only defined flag, KTW_FREEZABLE, has no
users anywhere in the tree.
Remove the flags parameter from the entire kthread worker creation API,
the KTW_FREEZABLE enum, the flags field from struct kthread_worker, and
the dead set_freezable() call in kthread_worker_fn(). While at it, drop
the stray trailing semicolon from the kthread_create_worker() macro
definition.
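The shape of the resulting API can be modeled in userspace. This is a
sketch, not the kernel implementation: struct named_worker,
model_create_worker_on_node(), model_create_worker(), and MODEL_NO_NODE
are hypothetical names, and the macro relies on the GNU C
", ## __VA_ARGS__" extension just like the kernel header does.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MODEL_NO_NODE (-1)	/* stands in for NUMA_NO_NODE */

/* Hypothetical stand-in for struct kthread_worker: no flags field. */
struct named_worker {
	int node;
	char name[32];
};

/* With flags gone, the format string is argument 2 and varargs start at 3. */
__attribute__((format(printf, 2, 3)))
static struct named_worker *model_create_worker_on_node(int node,
							const char *namefmt, ...)
{
	struct named_worker *w;
	va_list args;

	assert(namefmt != NULL);
	w = calloc(1, sizeof(*w));
	if (!w)
		return NULL;
	w->node = node;
	va_start(args, namefmt);
	vsnprintf(w->name, sizeof(w->name), namefmt, args);
	va_end(args);
	return w;
}

/* No flags argument and no trailing semicolon, so it is a plain expression. */
#define model_create_worker(namefmt, ...) \
	model_create_worker_on_node(MODEL_NO_NODE, namefmt, ## __VA_ARGS__)
```

Because the macro body no longer ends in a semicolon, a call such as
model_create_worker("kvm-pit/%d", 7) can be used anywhere an expression
is allowed, for example as an initializer or inside a conditional.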
No functional change.
Signed-off-by: Christian Brauner <brauner@kernel.org>
---
arch/x86/kvm/i8254.c | 2 +-
crypto/crypto_engine.c | 2 +-
drivers/cpufreq/cppc_cpufreq.c | 2 +-
drivers/dpll/zl3073x/core.c | 2 +-
drivers/gpu/drm/drm_vblank_work.c | 6 ++---
.../gpu/drm/i915/gem/selftests/i915_gem_context.c | 4 ++--
drivers/gpu/drm/i915/gt/selftest_execlists.c | 2 +-
drivers/gpu/drm/i915/gt/selftest_hangcheck.c | 4 ++--
drivers/gpu/drm/i915/gt/selftest_slpc.c | 2 +-
drivers/gpu/drm/i915/selftests/i915_request.c | 12 +++++-----
drivers/gpu/drm/msm/disp/msm_disp_snapshot.c | 2 +-
drivers/gpu/drm/msm/msm_atomic.c | 2 +-
drivers/gpu/drm/msm/msm_gpu.c | 2 +-
drivers/gpu/drm/msm/msm_kms.c | 2 +-
.../media/platform/chips-media/wave5/wave5-vpu.c | 2 +-
drivers/net/dsa/mv88e6xxx/chip.c | 2 +-
drivers/net/ethernet/intel/ice/ice_dpll.c | 4 ++--
drivers/net/ethernet/intel/ice/ice_gnss.c | 2 +-
drivers/net/ethernet/intel/ice/ice_ptp.c | 4 ++--
drivers/platform/chrome/cros_ec_spi.c | 2 +-
drivers/ptp/ptp_clock.c | 2 +-
drivers/spi/spi.c | 2 +-
drivers/usb/gadget/function/uvc_video.c | 2 +-
drivers/usb/typec/tcpm/tcpm.c | 2 +-
drivers/vdpa/vdpa_sim/vdpa_sim.c | 4 ++--
drivers/watchdog/watchdog_dev.c | 2 +-
fs/erofs/zdata.c | 2 +-
include/linux/kthread.h | 28 +++++++---------------
kernel/kthread.c | 13 +++-------
kernel/rcu/tree.c | 4 ++--
kernel/sched/ext.c | 2 +-
kernel/workqueue.c | 2 +-
net/dsa/tag_ksz.c | 4 ++--
net/dsa/tag_ocelot_8021q.c | 2 +-
net/dsa/tag_sja1105.c | 4 ++--
35 files changed, 60 insertions(+), 77 deletions(-)
diff --git a/arch/x86/kvm/i8254.c b/arch/x86/kvm/i8254.c
index 1982b0077ddd..4f1065c96e78 100644
--- a/arch/x86/kvm/i8254.c
+++ b/arch/x86/kvm/i8254.c
@@ -750,7 +750,7 @@ struct kvm_pit *kvm_create_pit(struct kvm *kvm, u32 flags)
pid_nr = pid_vnr(pid);
put_pid(pid);
- pit->worker = kthread_run_worker(0, "kvm-pit/%d", pid_nr);
+ pit->worker = kthread_run_worker("kvm-pit/%d", pid_nr);
if (IS_ERR(pit->worker))
goto fail_kthread;
diff --git a/crypto/crypto_engine.c b/crypto/crypto_engine.c
index 3d07dd5de4fa..60023f485c7f 100644
--- a/crypto/crypto_engine.c
+++ b/crypto/crypto_engine.c
@@ -456,7 +456,7 @@ struct crypto_engine *crypto_engine_alloc_init_and_set(struct device *dev,
guard(spinlock_init)(&engine->queue_lock);
crypto_init_queue(&engine->queue, qlen);
- engine->kworker = kthread_run_worker(0, "%s", engine->name);
+ engine->kworker = kthread_run_worker("%s", engine->name);
if (IS_ERR(engine->kworker)) {
dev_err(dev, "failed to create crypto request pump task\n");
return NULL;
diff --git a/drivers/cpufreq/cppc_cpufreq.c b/drivers/cpufreq/cppc_cpufreq.c
index 011f35cb47b9..1cdd3ed9e7a3 100644
--- a/drivers/cpufreq/cppc_cpufreq.c
+++ b/drivers/cpufreq/cppc_cpufreq.c
@@ -225,7 +225,7 @@ static void cppc_fie_kworker_init(void)
};
int ret;
- kworker_fie = kthread_run_worker(0, "cppc_fie");
+ kworker_fie = kthread_run_worker("cppc_fie");
if (IS_ERR(kworker_fie)) {
pr_warn("%s: failed to create kworker_fie: %ld\n", __func__,
PTR_ERR(kworker_fie));
diff --git a/drivers/dpll/zl3073x/core.c b/drivers/dpll/zl3073x/core.c
index 63bd97181b9e..55d0ee934246 100644
--- a/drivers/dpll/zl3073x/core.c
+++ b/drivers/dpll/zl3073x/core.c
@@ -966,7 +966,7 @@ zl3073x_devm_dpll_init(struct zl3073x_dev *zldev, u8 num_dplls)
/* Initialize monitoring thread */
kthread_init_delayed_work(&zldev->work, zl3073x_dev_periodic_work);
- kworker = kthread_run_worker(0, "zl3073x-%s", dev_name(zldev->dev));
+ kworker = kthread_run_worker("zl3073x-%s", dev_name(zldev->dev));
if (IS_ERR(kworker)) {
rc = PTR_ERR(kworker);
goto error;
diff --git a/drivers/gpu/drm/drm_vblank_work.c b/drivers/gpu/drm/drm_vblank_work.c
index 70f0199251ea..f5a95dc5bb05 100644
--- a/drivers/gpu/drm/drm_vblank_work.c
+++ b/drivers/gpu/drm/drm_vblank_work.c
@@ -279,9 +279,9 @@ int drm_vblank_worker_init(struct drm_vblank_crtc *vblank)
INIT_LIST_HEAD(&vblank->pending_work);
init_waitqueue_head(&vblank->work_wait_queue);
- worker = kthread_run_worker(0, "card%d-crtc%d",
- vblank->dev->primary->index,
- vblank->pipe);
+ worker = kthread_run_worker("card%d-crtc%d",
+ vblank->dev->primary->index,
+ vblank->pipe);
if (IS_ERR(worker))
return PTR_ERR(worker);
diff --git a/drivers/gpu/drm/i915/gem/selftests/i915_gem_context.c b/drivers/gpu/drm/i915/gem/selftests/i915_gem_context.c
index 9d405098f9e7..8b55eeeabe8c 100644
--- a/drivers/gpu/drm/i915/gem/selftests/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/gem/selftests/i915_gem_context.c
@@ -369,8 +369,8 @@ static int live_parallel_switch(void *arg)
if (!data[n].ce[0])
continue;
- worker = kthread_run_worker(0, "igt/parallel:%s",
- data[n].ce[0]->engine->name);
+ worker = kthread_run_worker("igt/parallel:%s",
+ data[n].ce[0]->engine->name);
if (IS_ERR(worker)) {
err = PTR_ERR(worker);
goto out;
diff --git a/drivers/gpu/drm/i915/gt/selftest_execlists.c b/drivers/gpu/drm/i915/gt/selftest_execlists.c
index 21e5ed9f72a3..a6edb922b7e2 100644
--- a/drivers/gpu/drm/i915/gt/selftest_execlists.c
+++ b/drivers/gpu/drm/i915/gt/selftest_execlists.c
@@ -3577,7 +3577,7 @@ static int smoke_crescendo(struct preempt_smoke *smoke, unsigned int flags)
arg[id].batch = NULL;
arg[id].count = 0;
- worker[id] = kthread_run_worker(0, "igt/smoke:%d", id);
+ worker[id] = kthread_run_worker("igt/smoke:%d", id);
if (IS_ERR(worker[id])) {
err = PTR_ERR(worker[id]);
break;
diff --git a/drivers/gpu/drm/i915/gt/selftest_hangcheck.c b/drivers/gpu/drm/i915/gt/selftest_hangcheck.c
index 00dfc37221fa..91a0ab9d6158 100644
--- a/drivers/gpu/drm/i915/gt/selftest_hangcheck.c
+++ b/drivers/gpu/drm/i915/gt/selftest_hangcheck.c
@@ -1025,8 +1025,8 @@ static int __igt_reset_engines(struct intel_gt *gt,
threads[tmp].engine = other;
threads[tmp].flags = flags;
- worker = kthread_run_worker(0, "igt/%s",
- other->name);
+ worker = kthread_run_worker("igt/%s",
+ other->name);
if (IS_ERR(worker)) {
err = PTR_ERR(worker);
pr_err("[%s] Worker create failed: %d!\n",
diff --git a/drivers/gpu/drm/i915/gt/selftest_slpc.c b/drivers/gpu/drm/i915/gt/selftest_slpc.c
index c3c918248989..fb69773e89d4 100644
--- a/drivers/gpu/drm/i915/gt/selftest_slpc.c
+++ b/drivers/gpu/drm/i915/gt/selftest_slpc.c
@@ -504,7 +504,7 @@ static int live_slpc_tile_interaction(void *arg)
return -ENOMEM;
for_each_gt(gt, i915, i) {
- threads[i].worker = kthread_run_worker(0, "igt/slpc_parallel:%d", gt->info.id);
+ threads[i].worker = kthread_run_worker("igt/slpc_parallel:%d", gt->info.id);
if (IS_ERR(threads[i].worker)) {
ret = PTR_ERR(threads[i].worker);
diff --git a/drivers/gpu/drm/i915/selftests/i915_request.c b/drivers/gpu/drm/i915/selftests/i915_request.c
index e1a7c454a0a9..54b8f7be0bdd 100644
--- a/drivers/gpu/drm/i915/selftests/i915_request.c
+++ b/drivers/gpu/drm/i915/selftests/i915_request.c
@@ -493,7 +493,7 @@ static int mock_breadcrumbs_smoketest(void *arg)
for (n = 0; n < ncpus; n++) {
struct kthread_worker *worker;
- worker = kthread_run_worker(0, "igt/%d", n);
+ worker = kthread_run_worker("igt/%d", n);
if (IS_ERR(worker)) {
ret = PTR_ERR(worker);
ncpus = n;
@@ -1646,8 +1646,8 @@ static int live_parallel_engines(void *arg)
for_each_uabi_engine(engine, i915) {
struct kthread_worker *worker;
- worker = kthread_run_worker(0, "igt/parallel:%s",
- engine->name);
+ worker = kthread_run_worker("igt/parallel:%s",
+ engine->name);
if (IS_ERR(worker)) {
err = PTR_ERR(worker);
break;
@@ -1805,7 +1805,7 @@ static int live_breadcrumbs_smoketest(void *arg)
unsigned int i = idx * ncpus + n;
struct kthread_worker *worker;
- worker = kthread_run_worker(0, "igt/%d.%d", idx, n);
+ worker = kthread_run_worker("igt/%d.%d", idx, n);
if (IS_ERR(worker)) {
ret = PTR_ERR(worker);
goto out_flush;
@@ -3218,8 +3218,8 @@ static int perf_parallel_engines(void *arg)
memset(&engines[idx].p, 0, sizeof(engines[idx].p));
- worker = kthread_run_worker(0, "igt:%s",
- engine->name);
+ worker = kthread_run_worker("igt:%s",
+ engine->name);
if (IS_ERR(worker)) {
err = PTR_ERR(worker);
intel_engine_pm_put(engine);
diff --git a/drivers/gpu/drm/msm/disp/msm_disp_snapshot.c b/drivers/gpu/drm/msm/disp/msm_disp_snapshot.c
index d99771684728..87f8063b7390 100644
--- a/drivers/gpu/drm/msm/disp/msm_disp_snapshot.c
+++ b/drivers/gpu/drm/msm/disp/msm_disp_snapshot.c
@@ -109,7 +109,7 @@ int msm_disp_snapshot_init(struct drm_device *drm_dev)
mutex_init(&kms->dump_mutex);
- kms->dump_worker = kthread_run_worker(0, "%s", "disp_snapshot");
+ kms->dump_worker = kthread_run_worker("%s", "disp_snapshot");
if (IS_ERR(kms->dump_worker))
DRM_ERROR("failed to create disp state task\n");
diff --git a/drivers/gpu/drm/msm/msm_atomic.c b/drivers/gpu/drm/msm/msm_atomic.c
index 87a91148a731..4c7d5fb0d914 100644
--- a/drivers/gpu/drm/msm/msm_atomic.c
+++ b/drivers/gpu/drm/msm/msm_atomic.c
@@ -115,7 +115,7 @@ int msm_atomic_init_pending_timer(struct msm_pending_timer *timer,
timer->kms = kms;
timer->crtc_idx = crtc_idx;
- timer->worker = kthread_run_worker(0, "atomic-worker-%d", crtc_idx);
+ timer->worker = kthread_run_worker("atomic-worker-%d", crtc_idx);
if (IS_ERR(timer->worker)) {
int ret = PTR_ERR(timer->worker);
timer->worker = NULL;
diff --git a/drivers/gpu/drm/msm/msm_gpu.c b/drivers/gpu/drm/msm/msm_gpu.c
index 84d6c7f50c8d..7b5cf071d0f3 100644
--- a/drivers/gpu/drm/msm/msm_gpu.c
+++ b/drivers/gpu/drm/msm/msm_gpu.c
@@ -989,7 +989,7 @@ int msm_gpu_init(struct drm_device *drm, struct platform_device *pdev,
gpu->funcs = funcs;
gpu->name = name;
- gpu->worker = kthread_run_worker(0, "gpu-worker");
+ gpu->worker = kthread_run_worker("gpu-worker");
if (IS_ERR(gpu->worker)) {
ret = PTR_ERR(gpu->worker);
gpu->worker = NULL;
diff --git a/drivers/gpu/drm/msm/msm_kms.c b/drivers/gpu/drm/msm/msm_kms.c
index e5d0ea629448..69df2b46402d 100644
--- a/drivers/gpu/drm/msm/msm_kms.c
+++ b/drivers/gpu/drm/msm/msm_kms.c
@@ -306,7 +306,7 @@ int msm_drm_kms_init(struct device *dev, const struct drm_driver *drv)
/* initialize event thread */
ev_thread = &kms->event_thread[drm_crtc_index(crtc)];
ev_thread->dev = ddev;
- ev_thread->worker = kthread_run_worker(0, "crtc_event:%d", crtc->base.id);
+ ev_thread->worker = kthread_run_worker("crtc_event:%d", crtc->base.id);
if (IS_ERR(ev_thread->worker)) {
ret = PTR_ERR(ev_thread->worker);
DRM_DEV_ERROR(dev, "failed to create crtc_event kthread\n");
diff --git a/drivers/media/platform/chips-media/wave5/wave5-vpu.c b/drivers/media/platform/chips-media/wave5/wave5-vpu.c
index 76d57c6b636a..fea52a23b8c2 100644
--- a/drivers/media/platform/chips-media/wave5/wave5-vpu.c
+++ b/drivers/media/platform/chips-media/wave5/wave5-vpu.c
@@ -342,7 +342,7 @@ static int wave5_vpu_probe(struct platform_device *pdev)
dev->irq_thread = kthread_run(irq_thread, dev, "irq thread");
hrtimer_setup(&dev->hrtimer, &wave5_vpu_timer_callback, CLOCK_MONOTONIC,
HRTIMER_MODE_REL_PINNED);
- dev->worker = kthread_run_worker(0, "vpu_irq_thread");
+ dev->worker = kthread_run_worker("vpu_irq_thread");
if (IS_ERR(dev->worker)) {
dev_err(&pdev->dev, "failed to create vpu irq worker\n");
ret = PTR_ERR(dev->worker);
diff --git a/drivers/net/dsa/mv88e6xxx/chip.c b/drivers/net/dsa/mv88e6xxx/chip.c
index 6fcd7181116a..a7a59e5e99a2 100644
--- a/drivers/net/dsa/mv88e6xxx/chip.c
+++ b/drivers/net/dsa/mv88e6xxx/chip.c
@@ -394,7 +394,7 @@ static int mv88e6xxx_irq_poll_setup(struct mv88e6xxx_chip *chip)
kthread_init_delayed_work(&chip->irq_poll_work,
mv88e6xxx_irq_poll);
- chip->kworker = kthread_run_worker(0, "%s", dev_name(chip->dev));
+ chip->kworker = kthread_run_worker("%s", dev_name(chip->dev));
if (IS_ERR(chip->kworker))
return PTR_ERR(chip->kworker);
diff --git a/drivers/net/ethernet/intel/ice/ice_dpll.c b/drivers/net/ethernet/intel/ice/ice_dpll.c
index 62f75701d652..8c03d14d8f83 100644
--- a/drivers/net/ethernet/intel/ice/ice_dpll.c
+++ b/drivers/net/ethernet/intel/ice/ice_dpll.c
@@ -3776,8 +3776,8 @@ static int ice_dpll_init_worker(struct ice_pf *pf)
struct kthread_worker *kworker;
kthread_init_delayed_work(&d->work, ice_dpll_periodic_work);
- kworker = kthread_run_worker(0, "ice-dplls-%s",
- dev_name(ice_pf_to_dev(pf)));
+ kworker = kthread_run_worker("ice-dplls-%s",
+ dev_name(ice_pf_to_dev(pf)));
if (IS_ERR(kworker))
return PTR_ERR(kworker);
d->kworker = kworker;
diff --git a/drivers/net/ethernet/intel/ice/ice_gnss.c b/drivers/net/ethernet/intel/ice/ice_gnss.c
index 8fd954f1ebd6..b85a96d7cac8 100644
--- a/drivers/net/ethernet/intel/ice/ice_gnss.c
+++ b/drivers/net/ethernet/intel/ice/ice_gnss.c
@@ -182,7 +182,7 @@ static struct gnss_serial *ice_gnss_struct_init(struct ice_pf *pf)
pf->gnss_serial = gnss;
kthread_init_delayed_work(&gnss->read_work, ice_gnss_read);
- kworker = kthread_run_worker(0, "ice-gnss-%s", dev_name(dev));
+ kworker = kthread_run_worker("ice-gnss-%s", dev_name(dev));
if (IS_ERR(kworker)) {
kfree(gnss);
return NULL;
diff --git a/drivers/net/ethernet/intel/ice/ice_ptp.c b/drivers/net/ethernet/intel/ice/ice_ptp.c
index 094e96219f45..cfc8daec3d50 100644
--- a/drivers/net/ethernet/intel/ice/ice_ptp.c
+++ b/drivers/net/ethernet/intel/ice/ice_ptp.c
@@ -3207,8 +3207,8 @@ static int ice_ptp_init_work(struct ice_pf *pf, struct ice_ptp *ptp)
/* Allocate a kworker for handling work required for the ports
* connected to the PTP hardware clock.
*/
- kworker = kthread_run_worker(0, "ice-ptp-%s",
- dev_name(ice_pf_to_dev(pf)));
+ kworker = kthread_run_worker("ice-ptp-%s",
+ dev_name(ice_pf_to_dev(pf)));
if (IS_ERR(kworker))
return PTR_ERR(kworker);
diff --git a/drivers/platform/chrome/cros_ec_spi.c b/drivers/platform/chrome/cros_ec_spi.c
index 28fa82f8cb07..0009659712ca 100644
--- a/drivers/platform/chrome/cros_ec_spi.c
+++ b/drivers/platform/chrome/cros_ec_spi.c
@@ -715,7 +715,7 @@ static int cros_ec_spi_devm_high_pri_alloc(struct device *dev,
int err;
ec_spi->high_pri_worker =
- kthread_run_worker(0, "cros_ec_spi_high_pri");
+ kthread_run_worker("cros_ec_spi_high_pri");
if (IS_ERR(ec_spi->high_pri_worker)) {
err = PTR_ERR(ec_spi->high_pri_worker);
diff --git a/drivers/ptp/ptp_clock.c b/drivers/ptp/ptp_clock.c
index d6f54ccaf93b..b9811ccc9147 100644
--- a/drivers/ptp/ptp_clock.c
+++ b/drivers/ptp/ptp_clock.c
@@ -382,7 +382,7 @@ struct ptp_clock *ptp_clock_register(struct ptp_clock_info *info,
if (ptp->info->do_aux_work) {
kthread_init_delayed_work(&ptp->aux_work, ptp_aux_kworker);
- ptp->kworker = kthread_run_worker(0, "ptp%d", ptp->index);
+ ptp->kworker = kthread_run_worker("ptp%d", ptp->index);
if (IS_ERR(ptp->kworker)) {
err = PTR_ERR(ptp->kworker);
pr_err("failed to create ptp aux_worker %d\n", err);
diff --git a/drivers/spi/spi.c b/drivers/spi/spi.c
index 61f7bde8c7fb..c0a742290207 100644
--- a/drivers/spi/spi.c
+++ b/drivers/spi/spi.c
@@ -2046,7 +2046,7 @@ static int spi_init_queue(struct spi_controller *ctlr)
ctlr->busy = false;
ctlr->queue_empty = true;
- ctlr->kworker = kthread_run_worker(0, dev_name(&ctlr->dev));
+ ctlr->kworker = kthread_run_worker(dev_name(&ctlr->dev));
if (IS_ERR(ctlr->kworker)) {
dev_err(&ctlr->dev, "failed to create message pump kworker\n");
return PTR_ERR(ctlr->kworker);
diff --git a/drivers/usb/gadget/function/uvc_video.c b/drivers/usb/gadget/function/uvc_video.c
index 7cea641b06b4..83a745e9b820 100644
--- a/drivers/usb/gadget/function/uvc_video.c
+++ b/drivers/usb/gadget/function/uvc_video.c
@@ -819,7 +819,7 @@ int uvcg_video_init(struct uvc_video *video, struct uvc_device *uvc)
return -EINVAL;
/* Allocate a kthread for asynchronous hw submit handler. */
- video->kworker = kthread_run_worker(0, "UVCG");
+ video->kworker = kthread_run_worker("UVCG");
if (IS_ERR(video->kworker)) {
uvcg_err(&video->uvc->func, "failed to create UVCG kworker\n");
return PTR_ERR(video->kworker);
diff --git a/drivers/usb/typec/tcpm/tcpm.c b/drivers/usb/typec/tcpm/tcpm.c
index 1d2f3af034c5..9d9b8c202ffb 100644
--- a/drivers/usb/typec/tcpm/tcpm.c
+++ b/drivers/usb/typec/tcpm/tcpm.c
@@ -7836,7 +7836,7 @@ struct tcpm_port *tcpm_register_port(struct device *dev, struct tcpc_dev *tcpc)
mutex_init(&port->lock);
mutex_init(&port->swap_lock);
- port->wq = kthread_run_worker(0, dev_name(dev));
+ port->wq = kthread_run_worker(dev_name(dev));
if (IS_ERR(port->wq))
return ERR_CAST(port->wq);
sched_set_fifo(port->wq->task);
diff --git a/drivers/vdpa/vdpa_sim/vdpa_sim.c b/drivers/vdpa/vdpa_sim/vdpa_sim.c
index 8cb1cc2ea139..78434262bb49 100644
--- a/drivers/vdpa/vdpa_sim/vdpa_sim.c
+++ b/drivers/vdpa/vdpa_sim/vdpa_sim.c
@@ -229,8 +229,8 @@ struct vdpasim *vdpasim_create(struct vdpasim_dev_attr *dev_attr,
dev = &vdpasim->vdpa.dev;
kthread_init_work(&vdpasim->work, vdpasim_work_fn);
- vdpasim->worker = kthread_run_worker(0, "vDPA sim worker: %s",
- dev_attr->name);
+ vdpasim->worker = kthread_run_worker("vDPA sim worker: %s",
+ dev_attr->name);
if (IS_ERR(vdpasim->worker))
goto err_iommu;
diff --git a/drivers/watchdog/watchdog_dev.c b/drivers/watchdog/watchdog_dev.c
index 834f65f4b59a..13fb68728022 100644
--- a/drivers/watchdog/watchdog_dev.c
+++ b/drivers/watchdog/watchdog_dev.c
@@ -1224,7 +1224,7 @@ int __init watchdog_dev_init(void)
{
int err;
- watchdog_kworker = kthread_run_worker(0, "watchdogd");
+ watchdog_kworker = kthread_run_worker("watchdogd");
if (IS_ERR(watchdog_kworker)) {
pr_err("Failed to create watchdog kworker\n");
return PTR_ERR(watchdog_kworker);
diff --git a/fs/erofs/zdata.c b/fs/erofs/zdata.c
index 3977e42b9516..2f68e2cf393a 100644
--- a/fs/erofs/zdata.c
+++ b/fs/erofs/zdata.c
@@ -309,7 +309,7 @@ static void erofs_destroy_percpu_workers(void)
static struct kthread_worker *erofs_init_percpu_worker(int cpu)
{
struct kthread_worker *worker =
- kthread_run_worker_on_cpu(cpu, 0, "erofs_worker/%u");
+ kthread_run_worker_on_cpu(cpu, "erofs_worker/%u");
if (IS_ERR(worker))
return worker;
diff --git a/include/linux/kthread.h b/include/linux/kthread.h
index a01a474719a7..2630791295ac 100644
--- a/include/linux/kthread.h
+++ b/include/linux/kthread.h
@@ -137,12 +137,7 @@ struct kthread_work;
typedef void (*kthread_work_func_t)(struct kthread_work *work);
void kthread_delayed_work_timer_fn(struct timer_list *t);
-enum {
- KTW_FREEZABLE = 1 << 0, /* freeze during suspend */
-};
-
struct kthread_worker {
- unsigned int flags;
raw_spinlock_t lock;
struct list_head work_list;
struct list_head delayed_work_list;
@@ -207,39 +202,35 @@ extern void __kthread_init_worker(struct kthread_worker *worker,
int kthread_worker_fn(void *worker_ptr);
-__printf(3, 4)
-struct kthread_worker *kthread_create_worker_on_node(unsigned int flags,
- int node,
+__printf(2, 3)
+struct kthread_worker *kthread_create_worker_on_node(int node,
const char namefmt[], ...);
-#define kthread_create_worker(flags, namefmt, ...) \
- kthread_create_worker_on_node(flags, NUMA_NO_NODE, namefmt, ## __VA_ARGS__);
+#define kthread_create_worker(namefmt, ...) \
+ kthread_create_worker_on_node(NUMA_NO_NODE, namefmt, ## __VA_ARGS__)
/**
* kthread_run_worker - create and wake a kthread worker.
- * @flags: flags modifying the default behavior of the worker
* @namefmt: printf-style name for the thread.
*
* Description: Convenient wrapper for kthread_create_worker() followed by
* wake_up_process(). Returns the kthread_worker or ERR_PTR(-ENOMEM).
*/
-#define kthread_run_worker(flags, namefmt, ...) \
+#define kthread_run_worker(namefmt, ...) \
({ \
struct kthread_worker *__kw \
- = kthread_create_worker(flags, namefmt, ## __VA_ARGS__); \
+ = kthread_create_worker(namefmt, ## __VA_ARGS__); \
if (!IS_ERR(__kw)) \
wake_up_process(__kw->task); \
__kw; \
})
struct kthread_worker *
-kthread_create_worker_on_cpu(int cpu, unsigned int flags,
- const char namefmt[]);
+kthread_create_worker_on_cpu(int cpu, const char namefmt[]);
/**
* kthread_run_worker_on_cpu - create and wake a cpu bound kthread worker.
* @cpu: CPU number
- * @flags: flags modifying the default behavior of the worker
* @namefmt: printf-style name for the thread. Format is restricted
* to "name.*%u". Code fills in cpu number.
*
@@ -248,12 +239,11 @@ kthread_create_worker_on_cpu(int cpu, unsigned int flags,
* ERR_PTR(-ENOMEM).
*/
static inline struct kthread_worker *
-kthread_run_worker_on_cpu(int cpu, unsigned int flags,
- const char namefmt[])
+kthread_run_worker_on_cpu(int cpu, const char namefmt[])
{
struct kthread_worker *kw;
- kw = kthread_create_worker_on_cpu(cpu, flags, namefmt);
+ kw = kthread_create_worker_on_cpu(cpu, namefmt);
if (!IS_ERR(kw))
wake_up_process(kw->task);
diff --git a/kernel/kthread.c b/kernel/kthread.c
index 84d535c7a635..4c60c8082126 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -1020,9 +1020,6 @@ int kthread_worker_fn(void *worker_ptr)
WARN_ON(worker->task && worker->task != current);
worker->task = current;
- if (worker->flags & KTW_FREEZABLE)
- set_freezable();
-
repeat:
set_current_state(TASK_INTERRUPTIBLE); /* mb paired w/ kthread_stop */
@@ -1073,7 +1070,6 @@ EXPORT_SYMBOL_GPL(kthread_worker_fn);
/**
* kthread_create_worker_on_node - create a kthread worker
- * @flags: flags modifying the default behavior of the worker
* @node: task structure for the thread is allocated on this node
* @namefmt: printf-style name for the kthread worker (task).
*
@@ -1082,7 +1078,7 @@ EXPORT_SYMBOL_GPL(kthread_worker_fn);
* when the caller was killed by a fatal signal.
*/
struct kthread_worker *
-kthread_create_worker_on_node(unsigned int flags, int node, const char namefmt[], ...)
+kthread_create_worker_on_node(int node, const char namefmt[], ...)
{
struct kthread_create_info info = {
.node = node,
@@ -1100,7 +1096,6 @@ kthread_create_worker_on_node(unsigned int flags, int node, const char namefmt[]
return ERR_CAST(task);
worker = kthread_data(task);
- worker->flags = flags;
worker->task = task;
return worker;
}
@@ -1110,7 +1105,6 @@ EXPORT_SYMBOL(kthread_create_worker_on_node);
* kthread_create_worker_on_cpu - create a kthread worker and bind it
* to a given CPU and the associated NUMA node.
* @cpu: CPU number
- * @flags: flags modifying the default behavior of the worker
* @namefmt: printf-style name for the thread. Format is restricted
* to "name.*%u". Code fills in cpu number.
*
@@ -1143,12 +1137,11 @@ EXPORT_SYMBOL(kthread_create_worker_on_node);
* when the caller was killed by a fatal signal.
*/
struct kthread_worker *
-kthread_create_worker_on_cpu(int cpu, unsigned int flags,
- const char namefmt[])
+kthread_create_worker_on_cpu(int cpu, const char namefmt[])
{
struct kthread_worker *worker;
- worker = kthread_create_worker_on_node(flags, cpu_to_node(cpu), namefmt, cpu);
+ worker = kthread_create_worker_on_node(cpu_to_node(cpu), namefmt, cpu);
if (!IS_ERR(worker))
kthread_bind(worker->task, cpu);
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 55df6d37145e..7d8c6de2a232 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -4186,7 +4186,7 @@ static void rcu_spawn_exp_par_gp_kworker(struct rcu_node *rnp)
if (rnp->exp_kworker)
return;
- kworker = kthread_create_worker(0, name, rnp_index);
+ kworker = kthread_create_worker(name, rnp_index);
if (IS_ERR_OR_NULL(kworker)) {
pr_err("Failed to create par gp kworker on %d/%d\n",
rnp->grplo, rnp->grphi);
@@ -4206,7 +4206,7 @@ static void __init rcu_start_exp_gp_kworker(void)
const char *name = "rcu_exp_gp_kthread_worker";
struct sched_param param = { .sched_priority = kthread_prio };
- rcu_exp_gp_kworker = kthread_run_worker(0, name);
+ rcu_exp_gp_kworker = kthread_run_worker(name);
if (IS_ERR_OR_NULL(rcu_exp_gp_kworker)) {
pr_err("Failed to create %s!\n", name);
rcu_exp_gp_kworker = NULL;
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 62b1f3ac5630..4d2fd73de353 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -4863,7 +4863,7 @@ static struct scx_sched *scx_alloc_and_add_sched(struct sched_ext_ops *ops)
goto err_free_gdsqs;
}
- sch->helper = kthread_run_worker(0, "sched_ext_helper");
+ sch->helper = kthread_run_worker("sched_ext_helper");
if (IS_ERR(sch->helper)) {
ret = PTR_ERR(sch->helper);
goto err_free_pcpu;
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index aeaec79bc09c..3670ea197327 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -7954,7 +7954,7 @@ static void __init wq_cpu_intensive_thresh_init(void)
unsigned long thresh;
unsigned long bogo;
- pwq_release_worker = kthread_run_worker(0, "pool_workqueue_release");
+ pwq_release_worker = kthread_run_worker("pool_workqueue_release");
BUG_ON(IS_ERR(pwq_release_worker));
/* if the user set it to a specific value, keep it */
diff --git a/net/dsa/tag_ksz.c b/net/dsa/tag_ksz.c
index d2475c3bbb7d..5285a076476c 100644
--- a/net/dsa/tag_ksz.c
+++ b/net/dsa/tag_ksz.c
@@ -66,8 +66,8 @@ static int ksz_connect(struct dsa_switch *ds)
if (!priv)
return -ENOMEM;
- xmit_worker = kthread_run_worker(0, "dsa%d:%d_xmit",
- ds->dst->index, ds->index);
+ xmit_worker = kthread_run_worker("dsa%d:%d_xmit",
+ ds->dst->index, ds->index);
if (IS_ERR(xmit_worker)) {
ret = PTR_ERR(xmit_worker);
kfree(priv);
diff --git a/net/dsa/tag_ocelot_8021q.c b/net/dsa/tag_ocelot_8021q.c
index e89d9254e90a..c3d294a5149e 100644
--- a/net/dsa/tag_ocelot_8021q.c
+++ b/net/dsa/tag_ocelot_8021q.c
@@ -110,7 +110,7 @@ static int ocelot_connect(struct dsa_switch *ds)
if (!priv)
return -ENOMEM;
- priv->xmit_worker = kthread_run_worker(0, "felix_xmit");
+ priv->xmit_worker = kthread_run_worker("felix_xmit");
if (IS_ERR(priv->xmit_worker)) {
err = PTR_ERR(priv->xmit_worker);
kfree(priv);
diff --git a/net/dsa/tag_sja1105.c b/net/dsa/tag_sja1105.c
index de6d4ce8668b..50c7f8fe7a5e 100644
--- a/net/dsa/tag_sja1105.c
+++ b/net/dsa/tag_sja1105.c
@@ -707,8 +707,8 @@ static int sja1105_connect(struct dsa_switch *ds)
spin_lock_init(&priv->meta_lock);
- xmit_worker = kthread_run_worker(0, "dsa%d:%d_xmit",
- ds->dst->index, ds->index);
+ xmit_worker = kthread_run_worker("dsa%d:%d_xmit",
+ ds->dst->index, ds->index);
if (IS_ERR(xmit_worker)) {
err = PTR_ERR(xmit_worker);
kfree(priv);
--
2.47.3
* [PATCH RFC DRAFT POC 03/11] kthread: add extensible kthread_create()/kthread_run() pattern
2026-03-03 13:49 [PATCH RFC DRAFT POC 00/11] fs,kthread: isolate all kthreads in nullfs Christian Brauner
2026-03-03 13:49 ` [PATCH RFC DRAFT POC 01/11] kthread: refactor __kthread_create_on_node() to take a struct argument Christian Brauner
2026-03-03 13:49 ` [PATCH RFC DRAFT POC 02/11] kthread: remove unused flags argument from kthread worker creation API Christian Brauner
@ 2026-03-03 13:49 ` Christian Brauner
2026-03-03 13:49 ` [PATCH RFC DRAFT POC 04/11] fs: notice when init abandons fs sharing Christian Brauner
` (8 subsequent siblings)
11 siblings, 0 replies; 14+ messages in thread
From: Christian Brauner @ 2026-03-03 13:49 UTC (permalink / raw)
To: linux-fsdevel, Linus Torvalds
Cc: linux-kernel, Alexander Viro, Jens Axboe, Jan Kara, Tejun Heo,
Jann Horn, Christian Brauner
This is similar to what I did for kmem_cache_create() in
b2e7456b5c25 ("slab: create kmem_cache_create() compatibility layer").
Instead of piling on new variants of the functions, add a struct
kthread_args variant that just passes the relevant parameters.
Signed-off-by: Christian Brauner <brauner@kernel.org>
---
include/linux/kthread.h | 69 +++++++++++++++++++++++++++-------------
kernel/kthread.c | 83 +++++++++++++++++++++++++++++++++++++++++--------
2 files changed, 118 insertions(+), 34 deletions(-)
diff --git a/include/linux/kthread.h b/include/linux/kthread.h
index 2630791295ac..972cb2960b61 100644
--- a/include/linux/kthread.h
+++ b/include/linux/kthread.h
@@ -25,26 +25,53 @@ static inline struct kthread *tsk_is_kthread(struct task_struct *p)
return NULL;
}
+/**
+ * struct kthread_args - kthread creation parameters.
+ * @threadfn: the function to run in the kthread.
+ * @data: data pointer passed to @threadfn.
+ * @node: NUMA node for stack/task allocation (NUMA_NO_NODE for any).
+ * @kthread_worker: set to 1 to create a kthread worker.
+ *
+ * Pass a pointer to this struct as the first argument of kthread_create()
+ * or kthread_run() to use the struct-based creation path. Legacy callers
+ * that pass a function pointer as the first argument continue to work
+ * unchanged via _Generic dispatch.
+ */
+struct kthread_args {
+ int (*threadfn)(void *data);
+ void *data;
+ int node;
+ u32 kthread_worker:1;
+};
+
__printf(4, 5)
struct task_struct *kthread_create_on_node(int (*threadfn)(void *data),
void *data,
int node,
const char namefmt[], ...);
+__printf(2, 3)
+struct task_struct *kthread_create_on_info(struct kthread_args *kargs,
+ const char namefmt[], ...);
+
+__printf(3, 4)
+struct task_struct *__kthread_create(int (*threadfn)(void *data),
+ void *data,
+ const char namefmt[], ...);
+
/**
- * kthread_create - create a kthread on the current node
- * @threadfn: the function to run in the thread
- * @data: data pointer for @threadfn()
- * @namefmt: printf-style format string for the thread name
- * @arg: arguments for @namefmt.
+ * kthread_create - create a kthread on the current node.
+ * @first: either a function pointer (legacy) or a &struct kthread_args
+ * pointer (struct-based).
*
- * This macro will create a kthread on the current node, leaving it in
- * the stopped state. This is just a helper for kthread_create_on_node();
- * see the documentation there for more details.
+ * _Generic dispatch: when @first is a &struct kthread_args pointer the
+ * call is forwarded to kthread_create_on_info(); otherwise it goes through
+ * __kthread_create() which wraps kthread_create_on_node() with NUMA_NO_NODE.
*/
-#define kthread_create(threadfn, data, namefmt, arg...) \
- kthread_create_on_node(threadfn, data, NUMA_NO_NODE, namefmt, ##arg)
-
+#define kthread_create(__first, ...) \
+ _Generic((__first), \
+ struct kthread_args *: kthread_create_on_info, \
+ default: __kthread_create)(__first, __VA_ARGS__)
struct task_struct *kthread_create_on_cpu(int (*threadfn)(void *data),
void *data,
@@ -59,20 +86,20 @@ bool kthread_is_per_cpu(struct task_struct *k);
/**
* kthread_run - create and wake a thread.
- * @threadfn: the function to run until signal_pending(current).
- * @data: data ptr for @threadfn.
- * @namefmt: printf-style name for the thread.
+ * @first: either a function pointer (legacy) or a &struct kthread_args
+ * pointer (struct-based). Remaining arguments are forwarded to
+ * kthread_create().
*
* Description: Convenient wrapper for kthread_create() followed by
* wake_up_process(). Returns the kthread or ERR_PTR(-ENOMEM).
*/
-#define kthread_run(threadfn, data, namefmt, ...) \
-({ \
- struct task_struct *__k \
- = kthread_create(threadfn, data, namefmt, ## __VA_ARGS__); \
- if (!IS_ERR(__k)) \
- wake_up_process(__k); \
- __k; \
+#define kthread_run(__first, ...) \
+({ \
+ struct task_struct *__k \
+ = kthread_create(__first, __VA_ARGS__); \
+ if (!IS_ERR(__k)) \
+ wake_up_process(__k); \
+ __k; \
})
/**
diff --git a/kernel/kthread.c b/kernel/kthread.c
index 4c60c8082126..20ec96142ce6 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -38,8 +38,7 @@ struct task_struct *kthreadd_task;
static LIST_HEAD(kthread_affinity_list);
static DEFINE_MUTEX(kthread_affinity_lock);
-struct kthread_create_info
-{
+struct kthread_create_req {
/* Information passed to kthread() from kthreadd. */
char *full_name;
int (*threadfn)(void *data);
@@ -382,7 +381,7 @@ static int kthread(void *_create)
{
static const struct sched_param param = { .sched_priority = 0 };
/* Copy data: it's on kthread's stack */
- struct kthread_create_info *create = _create;
+ struct kthread_create_req *create = _create;
int (*threadfn)(void *data) = create->threadfn;
void *data = create->data;
struct completion *done;
@@ -449,7 +448,7 @@ int tsk_fork_get_node(struct task_struct *tsk)
return NUMA_NO_NODE;
}
-static void create_kthread(struct kthread_create_info *create)
+static void create_kthread(struct kthread_create_req *create)
{
int pid;
struct kernel_clone_args args = {
@@ -480,20 +479,23 @@ static void create_kthread(struct kthread_create_info *create)
}
}
-static struct task_struct *__kthread_create_on_node(const struct kthread_create_info *info,
+static struct task_struct *__kthread_create_on_node(const struct kthread_args *kargs,
const char namefmt[],
va_list args)
{
DECLARE_COMPLETION_ONSTACK(done);
struct kthread_worker *worker = NULL;
struct task_struct *task;
- struct kthread_create_info *create;
+ struct kthread_create_req *create;
create = kmalloc_obj(*create);
if (!create)
return ERR_PTR(-ENOMEM);
- *create = *info;
+ create->threadfn = kargs->threadfn;
+ create->data = kargs->data;
+ create->node = kargs->node;
+ create->kthread_worker = kargs->kthread_worker;
if (create->kthread_worker) {
worker = kzalloc_obj(*worker);
@@ -573,7 +575,7 @@ struct task_struct *kthread_create_on_node(int (*threadfn)(void *data),
const char namefmt[],
...)
{
- struct kthread_create_info info = {
+ struct kthread_args kargs = {
.threadfn = threadfn,
.data = data,
.node = node,
@@ -582,13 +584,68 @@ struct task_struct *kthread_create_on_node(int (*threadfn)(void *data),
va_list args;
va_start(args, namefmt);
- task = __kthread_create_on_node(&info, namefmt, args);
+ task = __kthread_create_on_node(&kargs, namefmt, args);
va_end(args);
return task;
}
EXPORT_SYMBOL(kthread_create_on_node);
+/**
+ * kthread_create_on_info - create a kthread from a struct kthread_args.
+ * @kargs: kthread creation parameters.
+ * @namefmt: printf-style name for the thread.
+ *
+ * This is the struct-based kthread creation path, dispatched via the
+ * kthread_create() _Generic macro when the first argument is a
+ * &struct kthread_args pointer.
+ *
+ * Returns a task_struct or ERR_PTR(-ENOMEM) or ERR_PTR(-EINTR).
+ */
+struct task_struct *kthread_create_on_info(struct kthread_args *kargs,
+ const char namefmt[], ...)
+{
+ struct task_struct *task;
+ va_list args;
+
+ va_start(args, namefmt);
+ task = __kthread_create_on_node(kargs, namefmt, args);
+ va_end(args);
+
+ return task;
+}
+EXPORT_SYMBOL(kthread_create_on_info);
+
+/**
+ * __kthread_create - create a kthread (legacy positional-argument path).
+ * @threadfn: the function to run until signal_pending(current).
+ * @data: data ptr for @threadfn.
+ * @namefmt: printf-style name for the thread.
+ *
+ * _Generic dispatch target for kthread_create() when the first argument
+ * is a function pointer rather than a &struct kthread_args.
+ *
+ * Returns a task_struct or ERR_PTR(-ENOMEM) or ERR_PTR(-EINTR).
+ */
+struct task_struct *__kthread_create(int (*threadfn)(void *data),
+ void *data, const char namefmt[], ...)
+{
+ struct kthread_args kargs = {
+ .threadfn = threadfn,
+ .data = data,
+ .node = NUMA_NO_NODE,
+ };
+ struct task_struct *task;
+ va_list args;
+
+ va_start(args, namefmt);
+ task = __kthread_create_on_node(&kargs, namefmt, args);
+ va_end(args);
+
+ return task;
+}
+EXPORT_SYMBOL(__kthread_create);
+
static void __kthread_bind_mask(struct task_struct *p, const struct cpumask *mask, unsigned int state)
{
if (!wait_task_inactive(p, state)) {
@@ -833,10 +890,10 @@ int kthreadd(void *unused)
spin_lock(&kthread_create_lock);
while (!list_empty(&kthread_create_list)) {
- struct kthread_create_info *create;
+ struct kthread_create_req *create;
create = list_entry(kthread_create_list.next,
- struct kthread_create_info, list);
+ struct kthread_create_req, list);
list_del_init(&create->list);
spin_unlock(&kthread_create_lock);
@@ -1080,7 +1137,7 @@ EXPORT_SYMBOL_GPL(kthread_worker_fn);
struct kthread_worker *
kthread_create_worker_on_node(int node, const char namefmt[], ...)
{
- struct kthread_create_info info = {
+ struct kthread_args kargs = {
.node = node,
.kthread_worker = 1,
};
@@ -1089,7 +1146,7 @@ kthread_create_worker_on_node(int node, const char namefmt[], ...)
va_list args;
va_start(args, namefmt);
- task = __kthread_create_on_node(&info, namefmt, args);
+ task = __kthread_create_on_node(&kargs, namefmt, args);
va_end(args);
if (IS_ERR(task))
--
2.47.3
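For readers unfamiliar with the dispatch trick used in patch 03, here is a minimal userspace sketch of the same idea. It is not kernel code: struct demo_args, demo_create(), demo_create_on_info() and demo_create_legacy() are hypothetical stand-ins for struct kthread_args, kthread_create(), kthread_create_on_info() and __kthread_create(), and the "thread" is simply run inline so the result can be inspected.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for struct kthread_args; not the kernel type. */
struct demo_args {
	int (*fn)(void *data);
	void *data;
	int node;
};

/* Bookkeeping so a caller can see which path ran and which name was built. */
static char demo_name[32];
static int demo_path;	/* 1 = struct-based path, 2 = legacy path */

/* Struct-based path: all creation parameters travel in one struct. */
static int demo_create_on_info(struct demo_args *args, const char *namefmt, ...)
{
	va_list ap;

	va_start(ap, namefmt);
	vsnprintf(demo_name, sizeof(demo_name), namefmt, ap);
	va_end(ap);

	demo_path = 1;
	return args->fn(args->data);	/* run inline instead of spawning */
}

/* Legacy path: positional arguments are packed into the struct here. */
static int demo_create_legacy(int (*fn)(void *data), void *data,
			      const char *namefmt, ...)
{
	struct demo_args args = { .fn = fn, .data = data, .node = -1 };
	va_list ap;

	va_start(ap, namefmt);
	vsnprintf(demo_name, sizeof(demo_name), namefmt, ap);
	va_end(ap);

	demo_path = 2;
	return args.fn(args.data);
}

/* Select the callee from the type of the first argument, then call it. */
#define demo_create(first, ...)					\
	_Generic((first),					\
		struct demo_args *: demo_create_on_info,	\
		default: demo_create_legacy)(first, __VA_ARGS__)

/* Example thread function: returns the int that @data points at. */
static int demo_fn(void *data)
{
	return *(int *)data;
}
```

With this in place, demo_create(&args, "worker/%d", 3) takes the struct-based path and demo_create(demo_fn, &v, "legacy/%d", 4) takes the legacy one. One property of the pattern worth keeping in mind: _Generic matches on the exact pointer type, so a const struct demo_args * would fall through to the default branch and fail to compile there.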
* [PATCH RFC DRAFT POC 04/11] fs: notice when init abandons fs sharing
2026-03-03 13:49 [PATCH RFC DRAFT POC 00/11] fs,kthread: isolate all kthreads in nullfs Christian Brauner
` (2 preceding siblings ...)
2026-03-03 13:49 ` [PATCH RFC DRAFT POC 03/11] kthread: add extensible kthread_create()/kthread_run() pattern Christian Brauner
@ 2026-03-03 13:49 ` Christian Brauner
2026-03-03 13:49 ` [PATCH RFC DRAFT POC 05/11] fs: add LOOKUP_IN_INIT Christian Brauner
` (7 subsequent siblings)
11 siblings, 0 replies; 14+ messages in thread
From: Christian Brauner @ 2026-03-03 13:49 UTC (permalink / raw)
To: linux-fsdevel, Linus Torvalds
Cc: linux-kernel, Alexander Viro, Jens Axboe, Jan Kara, Tejun Heo,
Jann Horn, Christian Brauner
PID 1 may choose to stop sharing fs_struct state with us. Either via
unshare(CLONE_FS) or unshare(CLONE_NEWNS). Of course, PID 1 could have
chosen to create arbitrary process trees that all share fs_struct state
via CLONE_FS. This is a strong statement: We only care about PID 1 aka
the thread-group leader so subthreads' fs_struct state doesn't matter.
PID 1 unsharing fs_struct state is a bug. PID 1 relies on various
kthreads to be able to perform work based on its fs_struct state.
Breaking that contract sucks for both sides. So just don't bother with
extra work for this. No sane init system should ever do this.
Signed-off-by: Christian Brauner <brauner@kernel.org>
---
fs/fs_struct.c | 43 +++++++++++++++++++++++++++++++++++++++++++
include/linux/fs_struct.h | 2 ++
kernel/fork.c | 14 +++-----------
3 files changed, 48 insertions(+), 11 deletions(-)
diff --git a/fs/fs_struct.c b/fs/fs_struct.c
index 394875d06fd6..ab6826d7a6a9 100644
--- a/fs/fs_struct.c
+++ b/fs/fs_struct.c
@@ -147,6 +147,49 @@ int unshare_fs_struct(void)
}
EXPORT_SYMBOL_GPL(unshare_fs_struct);
+/*
+ * PID 1 may choose to stop sharing fs_struct state with us.
+ * Either via unshare(CLONE_FS) or unshare(CLONE_NEWNS). Of
+ * course, PID 1 could have chosen to create arbitrary process
+ * trees that all share fs_struct state via CLONE_FS. This is a
+ * strong statement: We only care about PID 1 aka the thread-group
+ * leader so subthreads' fs_struct state doesn't matter.
+ *
+ * PID 1 unsharing fs_struct state is a bug. PID 1 relies on
+ * various kthreads to be able to perform work based on its
+ * fs_struct state. Breaking that contract sucks for both sides.
+ * So just don't bother with extra work for this. No sane init
+ * system should ever do this.
+ */
+static inline bool nullfs_userspace_init(void)
+{
+ struct fs_struct *fs = current->fs;
+
+ if (unlikely(current->pid == 1) && fs != &init_fs) {
+ pr_warn("VFS: Pid 1 stopped sharing filesystem state\n");
+ return true;
+ }
+
+ return false;
+}
+
+struct fs_struct *switch_fs_struct(struct fs_struct *new_fs)
+{
+ struct fs_struct *fs;
+
+ fs = current->fs;
+ read_seqlock_excl(&fs->seq);
+ current->fs = new_fs;
+ if (--fs->users)
+ new_fs = NULL;
+ else
+ new_fs = fs;
+ read_sequnlock_excl(&fs->seq);
+
+ nullfs_userspace_init();
+ return new_fs;
+}
+
/* to be mentioned only in INIT_TASK */
struct fs_struct init_fs = {
.users = 1,
diff --git a/include/linux/fs_struct.h b/include/linux/fs_struct.h
index 0070764b790a..ade459383f92 100644
--- a/include/linux/fs_struct.h
+++ b/include/linux/fs_struct.h
@@ -40,6 +40,8 @@ static inline void get_fs_pwd(struct fs_struct *fs, struct path *pwd)
read_sequnlock_excl(&fs->seq);
}
+struct fs_struct *switch_fs_struct(struct fs_struct *new_fs);
+
extern bool current_chrooted(void);
static inline int current_umask(void)
diff --git a/kernel/fork.c b/kernel/fork.c
index 65113a304518..583078c69bbd 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -3123,7 +3123,7 @@ static int unshare_fd(unsigned long unshare_flags, struct files_struct **new_fdp
*/
int ksys_unshare(unsigned long unshare_flags)
{
- struct fs_struct *fs, *new_fs = NULL;
+ struct fs_struct *new_fs = NULL;
struct files_struct *new_fd = NULL;
struct cred *new_cred = NULL;
struct nsproxy *new_nsproxy = NULL;
@@ -3200,16 +3200,8 @@ int ksys_unshare(unsigned long unshare_flags)
task_lock(current);
- if (new_fs) {
- fs = current->fs;
- read_seqlock_excl(&fs->seq);
- current->fs = new_fs;
- if (--fs->users)
- new_fs = NULL;
- else
- new_fs = fs;
- read_sequnlock_excl(&fs->seq);
- }
+ if (new_fs)
+ new_fs = switch_fs_struct(new_fs);
if (new_fd)
swap(current->files, new_fd);
--
2.47.3
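The return convention of switch_fs_struct() in patch 04 is easy to misread, so here is a small userspace model of just the reference-count handoff. It is a sketch under assumptions: struct demo_fs, struct demo_task and demo_switch_fs() are hypothetical names, and the seqlock and task_lock() that the real code runs under are left out entirely.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, stripped-down stand-ins for fs_struct and task_struct. */
struct demo_fs {
	int users;		/* number of tasks sharing this state */
};

struct demo_task {
	struct demo_fs *fs;
};

/*
 * Install @new_fs on @task and drop the task's reference on the old one.
 * Returns the old struct if that was the last reference (the caller is
 * then responsible for freeing it), or NULL if it is still shared.
 */
static struct demo_fs *demo_switch_fs(struct demo_task *task,
				      struct demo_fs *new_fs)
{
	struct demo_fs *old = task->fs;

	task->fs = new_fs;
	if (--old->users)
		return NULL;	/* still shared, nothing to free */
	return old;		/* last user gone, hand it back */
}
```

The caller reuses one variable for both directions, as ksys_unshare() does with new_fs in the diff above: it passes in the replacement and gets back either NULL or the struct that is now unreferenced.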
* [PATCH RFC DRAFT POC 05/11] fs: add LOOKUP_IN_INIT
2026-03-03 13:49 [PATCH RFC DRAFT POC 00/11] fs,kthread: isolate all kthreads in nullfs Christian Brauner
` (3 preceding siblings ...)
2026-03-03 13:49 ` [PATCH RFC DRAFT POC 04/11] fs: notice when init abandons fs sharing Christian Brauner
@ 2026-03-03 13:49 ` Christian Brauner
2026-03-03 13:49 ` [PATCH RFC DRAFT POC 06/11] fs: add file_open_init() Christian Brauner
` (6 subsequent siblings)
11 siblings, 0 replies; 14+ messages in thread
From: Christian Brauner @ 2026-03-03 13:49 UTC (permalink / raw)
To: linux-fsdevel, Linus Torvalds
Cc: linux-kernel, Alexander Viro, Jens Axboe, Jan Kara, Tejun Heo,
Jann Horn, Christian Brauner
Add a new LOOKUP_IN_INIT flag that causes the lookup to be performed
relative to userspace init's root or working directory. This will be
used to force kthreads to be isolated in nullfs and to make them
explicitly opt in to looking things up in init's filesystem state.
Signed-off-by: Christian Brauner <brauner@kernel.org>
---
fs/namei.c | 17 ++++++++++++++---
include/linux/namei.h | 3 ++-
2 files changed, 16 insertions(+), 4 deletions(-)
diff --git a/fs/namei.c b/fs/namei.c
index 58f715f7657e..dd2710d5f5df 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1099,7 +1099,12 @@ static int complete_walk(struct nameidata *nd)
static int set_root(struct nameidata *nd)
{
- struct fs_struct *fs = current->fs;
+ struct fs_struct *fs;
+
+ if (nd->flags & LOOKUP_IN_INIT)
+ fs = &init_fs;
+ else
+ fs = current->fs;
/*
* Jumping to the real root in a scoped-lookup is a BUG in namei, but we
@@ -2716,8 +2721,14 @@ static const char *path_init(struct nameidata *nd, unsigned flags)
/* Relative pathname -- get the starting-point it is relative to. */
if (nd->dfd == AT_FDCWD) {
+ struct fs_struct *fs;
+
+ if (nd->flags & LOOKUP_IN_INIT)
+ fs = &init_fs;
+ else
+ fs = current->fs;
+
if (flags & LOOKUP_RCU) {
- struct fs_struct *fs = current->fs;
unsigned seq;
do {
@@ -2727,7 +2738,7 @@ static const char *path_init(struct nameidata *nd, unsigned flags)
nd->seq = __read_seqcount_begin(&nd->path.dentry->d_seq);
} while (read_seqretry(&fs->seq, seq));
} else {
- get_fs_pwd(current->fs, &nd->path);
+ get_fs_pwd(fs, &nd->path);
nd->inode = nd->path.dentry->d_inode;
}
} else {
diff --git a/include/linux/namei.h b/include/linux/namei.h
index 58600cf234bc..072533ec367b 100644
--- a/include/linux/namei.h
+++ b/include/linux/namei.h
@@ -46,9 +46,10 @@ enum {LAST_NORM, LAST_ROOT, LAST_DOT, LAST_DOTDOT};
#define LOOKUP_NO_XDEV BIT(26) /* No mountpoint crossing. */
#define LOOKUP_BENEATH BIT(27) /* No escaping from starting point. */
#define LOOKUP_IN_ROOT BIT(28) /* Treat dirfd as fs root. */
+#define LOOKUP_IN_INIT BIT(29) /* Lookup in init's namespace. */
/* LOOKUP_* flags which do scope-related checks based on the dirfd. */
#define LOOKUP_IS_SCOPED (LOOKUP_BENEATH | LOOKUP_IN_ROOT)
-/* 3 spare bits for scoping */
+/* 2 spare bits for scoping */
extern int path_pts(struct path *path);
--
2.47.3
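The selection logic that patch 05 adds to set_root() and path_init() boils down to one decision, modelled below in userspace. This is only a sketch: DEMO_LOOKUP_IN_INIT, struct demo_fs, demo_init_fs and demo_pick_fs() are hypothetical names, and only the bit position (29) is taken from the diff.

```c
#include <assert.h>

/* Mirrors the bit position used for LOOKUP_IN_INIT in the patch. */
#define DEMO_LOOKUP_IN_INIT	(1u << 29)

/* Hypothetical stand-in for fs_struct: just a root and a working dir. */
struct demo_fs {
	const char *root;
	const char *pwd;
};

/* Stand-in for the filesystem state that the flag redirects lookups to. */
static struct demo_fs demo_init_fs = { .root = "/", .pwd = "/" };

/*
 * Pick the filesystem state a lookup starts from: init's when the caller
 * opted in through the flag, the calling task's own otherwise.
 */
static const struct demo_fs *demo_pick_fs(const struct demo_fs *current_fs,
					  unsigned int flags)
{
	if (flags & DEMO_LOOKUP_IN_INIT)
		return &demo_init_fs;
	return current_fs;
}
```

A task rooted elsewhere (say in nullfs) keeps resolving against its own state unless it passes the flag, which is the explicit opt-in the commit message describes.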
* [PATCH RFC DRAFT POC 06/11] fs: add file_open_init()
2026-03-03 13:49 [PATCH RFC DRAFT POC 00/11] fs,kthread: isolate all kthreads in nullfs Christian Brauner
` (4 preceding siblings ...)
2026-03-03 13:49 ` [PATCH RFC DRAFT POC 05/11] fs: add LOOKUP_IN_INIT Christian Brauner
@ 2026-03-03 13:49 ` Christian Brauner
2026-03-03 13:49 ` [PATCH RFC DRAFT POC 07/11] block: add bdev_file_open_init() Christian Brauner
` (5 subsequent siblings)
11 siblings, 0 replies; 14+ messages in thread
From: Christian Brauner @ 2026-03-03 13:49 UTC (permalink / raw)
To: linux-fsdevel, Linus Torvalds
Cc: linux-kernel, Alexander Viro, Jens Axboe, Jan Kara, Tejun Heo,
Jann Horn, Christian Brauner
Add a helper to allow the few users that need it to open a file in
init's fs_struct from a kernel thread.
Signed-off-by: Christian Brauner <brauner@kernel.org>
---
fs/open.c | 25 +++++++++++++++++++++++++
include/linux/fs.h | 1 +
2 files changed, 26 insertions(+)
diff --git a/fs/open.c b/fs/open.c
index 91f1139591ab..bc97d66b6348 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -1342,6 +1342,31 @@ struct file *filp_open(const char *filename, int flags, umode_t mode)
}
EXPORT_SYMBOL(filp_open);
+/**
+ * filp_open_init - open file resolving paths against init's root
+ *
+ * @filename: path to open
+ * @flags: open flags as per the open(2) second argument
+ * @mode: mode for the new file if O_CREAT is set, else ignored
+ *
+ * Same as filp_open() but path resolution is done relative to init's
+ * root (using pid1_fs) instead of current->fs. Intended for kernel
+ * threads that need to open files by absolute path after being rooted
+ * in nullfs.
+ */
+struct file *filp_open_init(const char *filename, int flags, umode_t mode)
+{
+ struct open_flags op;
+ struct open_how how = build_open_how(flags, mode);
+ int err = build_open_flags(&how, &op);
+ if (err)
+ return ERR_PTR(err);
+ op.lookup_flags |= LOOKUP_IN_INIT;
+ CLASS(filename_kernel, name)(filename);
+ return do_file_open(AT_FDCWD, name, &op);
+}
+EXPORT_SYMBOL(filp_open_init);
+
struct file *file_open_root(const struct path *root,
const char *filename, int flags, umode_t mode)
{
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 8b3dd145b25e..bc0430e72c74 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2459,6 +2459,7 @@ int do_sys_open(int dfd, const char __user *filename, int flags,
umode_t mode);
extern struct file *file_open_name(struct filename *, int, umode_t);
extern struct file *filp_open(const char *, int, umode_t);
+extern struct file *filp_open_init(const char *, int, umode_t);
extern struct file *file_open_root(const struct path *,
const char *, int, umode_t);
static inline struct file *file_open_root_mnt(struct vfsmount *mnt,
--
2.47.3
* [PATCH RFC DRAFT POC 07/11] block: add bdev_file_open_init()
2026-03-03 13:49 [PATCH RFC DRAFT POC 00/11] fs,kthread: isolate all kthreads in nullfs Christian Brauner
` (5 preceding siblings ...)
2026-03-03 13:49 ` [PATCH RFC DRAFT POC 06/11] fs: add file_open_init() Christian Brauner
@ 2026-03-03 13:49 ` Christian Brauner
2026-03-03 13:49 ` [PATCH RFC DRAFT POC 08/11] fs: allow to pass lookup flags to filename_*() Christian Brauner
` (4 subsequent siblings)
11 siblings, 0 replies; 14+ messages in thread
From: Christian Brauner @ 2026-03-03 13:49 UTC (permalink / raw)
To: linux-fsdevel, Linus Torvalds
Cc: linux-kernel, Alexander Viro, Jens Axboe, Jan Kara, Tejun Heo,
Jann Horn, Christian Brauner
Add a helper to open a block device from a kthread.
Signed-off-by: Christian Brauner <brauner@kernel.org>
---
block/bdev.c | 60 +++++++++++++++++++++++++++++++++++++-------------
include/linux/blkdev.h | 2 ++
2 files changed, 47 insertions(+), 15 deletions(-)
diff --git a/block/bdev.c b/block/bdev.c
index ed022f8c48c7..79152c3ffa76 100644
--- a/block/bdev.c
+++ b/block/bdev.c
@@ -1083,6 +1083,20 @@ struct file *bdev_file_open_by_dev(dev_t dev, blk_mode_t mode, void *holder,
}
EXPORT_SYMBOL(bdev_file_open_by_dev);
+static int validate_bdev(const struct path *path, dev_t *dev)
+{
+ struct inode *inode;
+
+ inode = d_backing_inode(path->dentry);
+ if (!S_ISBLK(inode->i_mode))
+ return -ENOTBLK;
+ if (!may_open_dev(path))
+ return -EACCES;
+
+ *dev = inode->i_rdev;
+ return 0;
+}
+
struct file *bdev_file_open_by_path(const char *path, blk_mode_t mode,
void *holder,
const struct blk_holder_ops *hops)
@@ -1107,6 +1121,35 @@ struct file *bdev_file_open_by_path(const char *path, blk_mode_t mode,
}
EXPORT_SYMBOL(bdev_file_open_by_path);
+struct file *bdev_file_open_init(const char *path, blk_mode_t mode,
+ void *holder,
+ const struct blk_holder_ops *hops)
+{
+ struct path p __free(path_put) = {};
+ struct file *file;
+ dev_t dev;
+ int error;
+
+ error = kern_path(path, LOOKUP_FOLLOW | LOOKUP_IN_INIT, &p);
+ if (error)
+ return ERR_PTR(error);
+
+ error = validate_bdev(&p, &dev);
+ if (error)
+ return ERR_PTR(error);
+
+ file = bdev_file_open_by_dev(dev, mode, holder, hops);
+ if (!IS_ERR(file) && (mode & BLK_OPEN_WRITE)) {
+ if (bdev_read_only(file_bdev(file))) {
+ fput(file);
+ file = ERR_PTR(-EACCES);
+ }
+ }
+
+ return file;
+}
+EXPORT_SYMBOL(bdev_file_open_init);
+
static inline void bd_yield_claim(struct file *bdev_file)
{
struct block_device *bdev = file_bdev(bdev_file);
@@ -1211,8 +1254,7 @@ EXPORT_SYMBOL(bdev_fput);
*/
int lookup_bdev(const char *pathname, dev_t *dev)
{
- struct inode *inode;
- struct path path;
+ struct path path __free(path_put) = {};
int error;
if (!pathname || !*pathname)
@@ -1222,19 +1264,7 @@ int lookup_bdev(const char *pathname, dev_t *dev)
if (error)
return error;
- inode = d_backing_inode(path.dentry);
- error = -ENOTBLK;
- if (!S_ISBLK(inode->i_mode))
- goto out_path_put;
- error = -EACCES;
- if (!may_open_dev(&path))
- goto out_path_put;
-
- *dev = inode->i_rdev;
- error = 0;
-out_path_put:
- path_put(&path);
- return error;
+ return validate_bdev(&path, dev);
}
EXPORT_SYMBOL(lookup_bdev);
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index d463b9b5a0a5..9070979b6616 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1773,6 +1773,8 @@ struct file *bdev_file_open_by_dev(dev_t dev, blk_mode_t mode, void *holder,
const struct blk_holder_ops *hops);
struct file *bdev_file_open_by_path(const char *path, blk_mode_t mode,
void *holder, const struct blk_holder_ops *hops);
+struct file *bdev_file_open_init(const char *path, blk_mode_t mode,
+ void *holder, const struct blk_holder_ops *hops);
int bd_prepare_to_claim(struct block_device *bdev, void *holder,
const struct blk_holder_ops *hops);
void bd_abort_claiming(struct block_device *bdev, void *holder);
--
2.47.3
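Patch 07 factors the "is this path a block device, and which one" check out of lookup_bdev() into validate_bdev() so that two lookups can share it. A rough userspace analogue of that split, using stat(2) in place of kern_path(), might look like the sketch below. The demo_* names are hypothetical, the may_open_dev() check has no counterpart here, and the -EINVAL return for an empty path is an assumption about the part of lookup_bdev() that the hunk does not show.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>

/*
 * Shared validation step: succeed only for block devices and report the
 * device number. Errors are negative errno values, kernel style.
 */
static int demo_validate_bdev(const struct stat *st, dev_t *dev)
{
	if (!S_ISBLK(st->st_mode))
		return -ENOTBLK;

	*dev = st->st_rdev;
	return 0;
}

/* Path-based entry point: resolve the path, then reuse the shared check. */
static int demo_lookup_bdev(const char *pathname, dev_t *dev)
{
	struct stat st;

	if (!pathname || !*pathname)
		return -EINVAL;

	if (stat(pathname, &st))
		return -errno;

	return demo_validate_bdev(&st, dev);
}
```

A second entry point that resolves the path differently, which is what bdev_file_open_init() does with LOOKUP_IN_INIT, can call the same validation helper without duplicating the checks.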
* [PATCH RFC DRAFT POC 08/11] fs: allow to pass lookup flags to filename_*()
2026-03-03 13:49 [PATCH RFC DRAFT POC 00/11] fs,kthread: isolate all kthreads in nullfs Christian Brauner
` (6 preceding siblings ...)
2026-03-03 13:49 ` [PATCH RFC DRAFT POC 07/11] block: add bdev_file_open_init() Christian Brauner
@ 2026-03-03 13:49 ` Christian Brauner
2026-03-03 13:49 ` [PATCH RFC DRAFT POC 09/11] fs: add init_root() Christian Brauner
` (3 subsequent siblings)
11 siblings, 0 replies; 14+ messages in thread
From: Christian Brauner @ 2026-03-03 13:49 UTC (permalink / raw)
To: linux-fsdevel, Linus Torvalds
Cc: linux-kernel, Alexander Viro, Jens Axboe, Jan Kara, Tejun Heo,
Jann Horn, Christian Brauner
Allow lookup flags to be passed to filename_*() so callers can pass
LOOKUP_IN_INIT to explicitly opt in to performing lookups in init's
filesystem state.
Signed-off-by: Christian Brauner <brauner@kernel.org>
---
fs/coredump.c | 2 +-
fs/init.c | 12 ++++++------
fs/internal.h | 18 ++++++++++++------
fs/namei.c | 52 +++++++++++++++++++++++++++-------------------------
io_uring/fs.c | 10 +++++-----
5 files changed, 51 insertions(+), 43 deletions(-)
diff --git a/fs/coredump.c b/fs/coredump.c
index 29df8aa19e2e..550a1553f6cb 100644
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -900,7 +900,7 @@ static bool coredump_file(struct core_name *cn, struct coredump_params *cprm,
* If it doesn't exist, that's fine. If there's some
* other problem, we'll catch it at the filp_open().
*/
- filename_unlinkat(AT_FDCWD, name);
+ filename_unlinkat(AT_FDCWD, name, 0);
}
/*
diff --git a/fs/init.c b/fs/init.c
index 33e312d74f58..a79872d5af3b 100644
--- a/fs/init.c
+++ b/fs/init.c
@@ -158,39 +158,39 @@ int __init init_stat(const char *filename, struct kstat *stat, int flags)
int __init init_mknod(const char *filename, umode_t mode, unsigned int dev)
{
CLASS(filename_kernel, name)(filename);
- return filename_mknodat(AT_FDCWD, name, mode, dev);
+ return filename_mknodat(AT_FDCWD, name, mode, dev, 0);
}
int __init init_link(const char *oldname, const char *newname)
{
CLASS(filename_kernel, old)(oldname);
CLASS(filename_kernel, new)(newname);
- return filename_linkat(AT_FDCWD, old, AT_FDCWD, new, 0);
+ return filename_linkat(AT_FDCWD, old, AT_FDCWD, new, 0, 0);
}
int __init init_symlink(const char *oldname, const char *newname)
{
CLASS(filename_kernel, old)(oldname);
CLASS(filename_kernel, new)(newname);
- return filename_symlinkat(old, AT_FDCWD, new);
+ return filename_symlinkat(old, AT_FDCWD, new, 0);
}
int __init init_unlink(const char *pathname)
{
CLASS(filename_kernel, name)(pathname);
- return filename_unlinkat(AT_FDCWD, name);
+ return filename_unlinkat(AT_FDCWD, name, 0);
}
int __init init_mkdir(const char *pathname, umode_t mode)
{
CLASS(filename_kernel, name)(pathname);
- return filename_mkdirat(AT_FDCWD, name, mode);
+ return filename_mkdirat(AT_FDCWD, name, mode, 0);
}
int __init init_rmdir(const char *pathname)
{
CLASS(filename_kernel, name)(pathname);
- return filename_rmdir(AT_FDCWD, name);
+ return filename_rmdir(AT_FDCWD, name, 0);
}
int __init init_utimes(char *filename, struct timespec64 *ts)
diff --git a/fs/internal.h b/fs/internal.h
index cbc384a1aa09..7302badcae69 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -53,16 +53,22 @@ extern int finish_clean_context(struct fs_context *fc);
*/
extern int filename_lookup(int dfd, struct filename *name, unsigned flags,
struct path *path, const struct path *root);
-int filename_rmdir(int dfd, struct filename *name);
-int filename_unlinkat(int dfd, struct filename *name);
+int filename_rmdir(int dfd, struct filename *name,
+ unsigned int lookup_flags);
+int filename_unlinkat(int dfd, struct filename *name,
+ unsigned int lookup_flags);
int may_linkat(struct mnt_idmap *idmap, const struct path *link);
int filename_renameat2(int olddfd, struct filename *oldname, int newdfd,
struct filename *newname, unsigned int flags);
-int filename_mkdirat(int dfd, struct filename *name, umode_t mode);
-int filename_mknodat(int dfd, struct filename *name, umode_t mode, unsigned int dev);
-int filename_symlinkat(struct filename *from, int newdfd, struct filename *to);
+int filename_mkdirat(int dfd, struct filename *name, umode_t mode,
+ unsigned int lookup_flags);
+int filename_mknodat(int dfd, struct filename *name, umode_t mode,
+ unsigned int dev, unsigned int lookup_flags);
+int filename_symlinkat(struct filename *from, int newdfd, struct filename *to,
+ unsigned int lookup_flags);
int filename_linkat(int olddfd, struct filename *old, int newdfd,
- struct filename *new, int flags);
+ struct filename *new, int flags,
+ unsigned int lookup_flags);
int vfs_tmpfile(struct mnt_idmap *idmap,
const struct path *parentpath,
struct file *file, umode_t mode);
diff --git a/fs/namei.c b/fs/namei.c
index dd2710d5f5df..5cf407aad5b3 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -5125,14 +5125,13 @@ static int may_mknod(umode_t mode)
}
int filename_mknodat(int dfd, struct filename *name, umode_t mode,
- unsigned int dev)
+ unsigned int dev, unsigned int lookup_flags)
{
struct delegated_inode di = { };
struct mnt_idmap *idmap;
struct dentry *dentry;
struct path path;
int error;
- unsigned int lookup_flags = 0;
error = may_mknod(mode);
if (error)
@@ -5181,13 +5180,13 @@ SYSCALL_DEFINE4(mknodat, int, dfd, const char __user *, filename, umode_t, mode,
unsigned int, dev)
{
CLASS(filename, name)(filename);
- return filename_mknodat(dfd, name, mode, dev);
+ return filename_mknodat(dfd, name, mode, dev, 0);
}
SYSCALL_DEFINE3(mknod, const char __user *, filename, umode_t, mode, unsigned, dev)
{
CLASS(filename, name)(filename);
- return filename_mknodat(AT_FDCWD, name, mode, dev);
+ return filename_mknodat(AT_FDCWD, name, mode, dev, 0);
}
/**
@@ -5258,14 +5257,16 @@ struct dentry *vfs_mkdir(struct mnt_idmap *idmap, struct inode *dir,
}
EXPORT_SYMBOL(vfs_mkdir);
-int filename_mkdirat(int dfd, struct filename *name, umode_t mode)
+int filename_mkdirat(int dfd, struct filename *name, umode_t mode,
+ unsigned int lookup_flags)
{
struct dentry *dentry;
struct path path;
int error;
- unsigned int lookup_flags = LOOKUP_DIRECTORY;
struct delegated_inode delegated_inode = { };
+ lookup_flags |= LOOKUP_DIRECTORY;
+
retry:
dentry = filename_create(dfd, name, &path, lookup_flags);
if (IS_ERR(dentry))
@@ -5295,13 +5296,13 @@ int filename_mkdirat(int dfd, struct filename *name, umode_t mode)
SYSCALL_DEFINE3(mkdirat, int, dfd, const char __user *, pathname, umode_t, mode)
{
CLASS(filename, name)(pathname);
- return filename_mkdirat(dfd, name, mode);
+ return filename_mkdirat(dfd, name, mode, 0);
}
SYSCALL_DEFINE2(mkdir, const char __user *, pathname, umode_t, mode)
{
CLASS(filename, name)(pathname);
- return filename_mkdirat(AT_FDCWD, name, mode);
+ return filename_mkdirat(AT_FDCWD, name, mode, 0);
}
/**
@@ -5364,14 +5365,14 @@ int vfs_rmdir(struct mnt_idmap *idmap, struct inode *dir,
}
EXPORT_SYMBOL(vfs_rmdir);
-int filename_rmdir(int dfd, struct filename *name)
+int filename_rmdir(int dfd, struct filename *name,
+ unsigned int lookup_flags)
{
int error;
struct dentry *dentry;
struct path path;
struct qstr last;
int type;
- unsigned int lookup_flags = 0;
struct delegated_inode delegated_inode = { };
retry:
error = filename_parentat(dfd, name, lookup_flags, &path, &last, &type);
@@ -5424,7 +5425,7 @@ int filename_rmdir(int dfd, struct filename *name)
SYSCALL_DEFINE1(rmdir, const char __user *, pathname)
{
CLASS(filename, name)(pathname);
- return filename_rmdir(AT_FDCWD, name);
+ return filename_rmdir(AT_FDCWD, name, 0);
}
/**
@@ -5506,7 +5507,8 @@ EXPORT_SYMBOL(vfs_unlink);
* writeout happening, and we don't want to prevent access to the directory
* while waiting on the I/O.
*/
-int filename_unlinkat(int dfd, struct filename *name)
+int filename_unlinkat(int dfd, struct filename *name,
+ unsigned int lookup_flags)
{
int error;
struct dentry *dentry;
@@ -5515,7 +5517,6 @@ int filename_unlinkat(int dfd, struct filename *name)
int type;
struct inode *inode;
struct delegated_inode delegated_inode = { };
- unsigned int lookup_flags = 0;
retry:
error = filename_parentat(dfd, name, lookup_flags, &path, &last, &type);
if (error)
@@ -5576,14 +5577,14 @@ SYSCALL_DEFINE3(unlinkat, int, dfd, const char __user *, pathname, int, flag)
CLASS(filename, name)(pathname);
if (flag & AT_REMOVEDIR)
- return filename_rmdir(dfd, name);
- return filename_unlinkat(dfd, name);
+ return filename_rmdir(dfd, name, 0);
+ return filename_unlinkat(dfd, name, 0);
}
SYSCALL_DEFINE1(unlink, const char __user *, pathname)
{
CLASS(filename, name)(pathname);
- return filename_unlinkat(AT_FDCWD, name);
+ return filename_unlinkat(AT_FDCWD, name, 0);
}
/**
@@ -5630,12 +5631,12 @@ int vfs_symlink(struct mnt_idmap *idmap, struct inode *dir,
}
EXPORT_SYMBOL(vfs_symlink);
-int filename_symlinkat(struct filename *from, int newdfd, struct filename *to)
+int filename_symlinkat(struct filename *from, int newdfd, struct filename *to,
+ unsigned int lookup_flags)
{
int error;
struct dentry *dentry;
struct path path;
- unsigned int lookup_flags = 0;
struct delegated_inode delegated_inode = { };
if (IS_ERR(from))
@@ -5668,14 +5669,14 @@ SYSCALL_DEFINE3(symlinkat, const char __user *, oldname,
{
CLASS(filename, old)(oldname);
CLASS(filename, new)(newname);
- return filename_symlinkat(old, newdfd, new);
+ return filename_symlinkat(old, newdfd, new, 0);
}
SYSCALL_DEFINE2(symlink, const char __user *, oldname, const char __user *, newname)
{
CLASS(filename, old)(oldname);
CLASS(filename, new)(newname);
- return filename_symlinkat(old, AT_FDCWD, new);
+ return filename_symlinkat(old, AT_FDCWD, new, 0);
}
/**
@@ -5779,13 +5780,14 @@ EXPORT_SYMBOL(vfs_link);
* and other special files. --ADM
*/
int filename_linkat(int olddfd, struct filename *old,
- int newdfd, struct filename *new, int flags)
+ int newdfd, struct filename *new, int flags,
+ unsigned int lookup_flags)
{
struct mnt_idmap *idmap;
struct dentry *new_dentry;
struct path old_path, new_path;
struct delegated_inode delegated_inode = { };
- int how = 0;
+ int how = lookup_flags;
int error;
if ((flags & ~(AT_SYMLINK_FOLLOW | AT_EMPTY_PATH)) != 0)
@@ -5807,7 +5809,7 @@ int filename_linkat(int olddfd, struct filename *old,
return error;
new_dentry = filename_create(newdfd, new, &new_path,
- (how & LOOKUP_REVAL));
+ (how & (LOOKUP_REVAL | LOOKUP_IN_INIT)));
error = PTR_ERR(new_dentry);
if (IS_ERR(new_dentry))
goto out_putpath;
@@ -5848,14 +5850,14 @@ SYSCALL_DEFINE5(linkat, int, olddfd, const char __user *, oldname,
{
CLASS(filename_uflags, old)(oldname, flags);
CLASS(filename, new)(newname);
- return filename_linkat(olddfd, old, newdfd, new, flags);
+ return filename_linkat(olddfd, old, newdfd, new, flags, 0);
}
SYSCALL_DEFINE2(link, const char __user *, oldname, const char __user *, newname)
{
CLASS(filename, old)(oldname);
CLASS(filename, new)(newname);
- return filename_linkat(AT_FDCWD, old, AT_FDCWD, new, 0);
+ return filename_linkat(AT_FDCWD, old, AT_FDCWD, new, 0, 0);
}
/**
diff --git a/io_uring/fs.c b/io_uring/fs.c
index d0580c754bf8..1d9b2939f5ae 100644
--- a/io_uring/fs.c
+++ b/io_uring/fs.c
@@ -140,9 +140,9 @@ int io_unlinkat(struct io_kiocb *req, unsigned int issue_flags)
WARN_ON_ONCE(issue_flags & IO_URING_F_NONBLOCK);
if (un->flags & AT_REMOVEDIR)
- ret = filename_rmdir(un->dfd, name);
+ ret = filename_rmdir(un->dfd, name, 0);
else
- ret = filename_unlinkat(un->dfd, name);
+ ret = filename_unlinkat(un->dfd, name, 0);
req->flags &= ~REQ_F_NEED_CLEANUP;
io_req_set_res(req, ret, 0);
@@ -188,7 +188,7 @@ int io_mkdirat(struct io_kiocb *req, unsigned int issue_flags)
WARN_ON_ONCE(issue_flags & IO_URING_F_NONBLOCK);
- ret = filename_mkdirat(mkd->dfd, name, mkd->mode);
+ ret = filename_mkdirat(mkd->dfd, name, mkd->mode, 0);
req->flags &= ~REQ_F_NEED_CLEANUP;
io_req_set_res(req, ret, 0);
@@ -241,7 +241,7 @@ int io_symlinkat(struct io_kiocb *req, unsigned int issue_flags)
WARN_ON_ONCE(issue_flags & IO_URING_F_NONBLOCK);
- ret = filename_symlinkat(old, sl->new_dfd, new);
+ ret = filename_symlinkat(old, sl->new_dfd, new, 0);
req->flags &= ~REQ_F_NEED_CLEANUP;
io_req_set_res(req, ret, 0);
@@ -289,7 +289,7 @@ int io_linkat(struct io_kiocb *req, unsigned int issue_flags)
WARN_ON_ONCE(issue_flags & IO_URING_F_NONBLOCK);
- ret = filename_linkat(lnk->old_dfd, old, lnk->new_dfd, new, lnk->flags);
+ ret = filename_linkat(lnk->old_dfd, old, lnk->new_dfd, new, lnk->flags, 0);
req->flags &= ~REQ_F_NEED_CLEANUP;
io_req_set_res(req, ret, 0);
--
2.47.3
* [PATCH RFC DRAFT POC 09/11] fs: add init_root()
2026-03-03 13:49 [PATCH RFC DRAFT POC 00/11] fs,kthread: isolate all kthreads in nullfs Christian Brauner
` (7 preceding siblings ...)
2026-03-03 13:49 ` [PATCH RFC DRAFT POC 08/11] fs: allow to pass lookup flags to filename_*() Christian Brauner
@ 2026-03-03 13:49 ` Christian Brauner
2026-03-03 13:49 ` [PATCH RFC DRAFT POC 10/11] tree-wide: make all kthread path lookups to use LOOKUP_IN_INIT Christian Brauner
` (2 subsequent siblings)
11 siblings, 0 replies; 14+ messages in thread
From: Christian Brauner @ 2026-03-03 13:49 UTC (permalink / raw)
To: linux-fsdevel, Linus Torvalds
Cc: linux-kernel, Alexander Viro, Jens Axboe, Jan Kara, Tejun Heo,
Jann Horn, Christian Brauner
Add an init_root() helper that lets callers grab init's current
filesystem root and perform operations relative to it.
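
For illustration only, here is a rough userspace model of the caller
pattern this helper enables: grab a referenced copy of the root, use it,
drop the reference again. All model_* names are hypothetical stand-ins,
not kernel types, and the model skips the locking that get_fs_root()
does on the fs_struct.

```c
#include <assert.h>

/* Hypothetical userspace stand-ins for struct path and struct fs_struct. */
struct model_path {
	int *refs;		/* stands in for the mount/dentry refcounts */
};

struct model_fs {
	struct model_path root;	/* the kernel protects this with a lock */
};

int model_root_refs = 1;	/* the reference held by the fs state itself */
struct model_fs model_init_fs = { .root = { .refs = &model_root_refs } };

/* Copy the root out and take a reference on it. */
void model_init_root(struct model_path *root)
{
	assert(root);
	*root = model_init_fs.root;
	++*root->refs;
}

/* Drop the reference taken by model_init_root(). */
void model_path_put(struct model_path *path)
{
	assert(*path->refs > 0);
	--*path->refs;
}

/* Caller pattern: grab the root, use it, drop it again. */
int model_with_init_root(int (*fn)(const struct model_path *root))
{
	struct model_path root;
	int ret;

	model_init_root(&root);
	ret = fn(&root);
	model_path_put(&root);
	return ret;
}
```

The point of the model is only that every successful grab has to be
paired with a put, which is what the converted callers do with
path_put().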
Signed-off-by: Christian Brauner <brauner@kernel.org>
---
fs/fs_struct.c | 6 ++++++
include/linux/fs_struct.h | 2 ++
2 files changed, 8 insertions(+)
diff --git a/fs/fs_struct.c b/fs/fs_struct.c
index ab6826d7a6a9..64b5840131cb 100644
--- a/fs/fs_struct.c
+++ b/fs/fs_struct.c
@@ -196,3 +196,9 @@ struct fs_struct init_fs = {
.seq = __SEQLOCK_UNLOCKED(init_fs.seq),
.umask = 0022,
};
+
+void init_root(struct path *root)
+{
+ get_fs_root(&init_fs, root);
+}
+EXPORT_SYMBOL_GPL(init_root);
diff --git a/include/linux/fs_struct.h b/include/linux/fs_struct.h
index ade459383f92..8ff1acd8389d 100644
--- a/include/linux/fs_struct.h
+++ b/include/linux/fs_struct.h
@@ -49,4 +49,6 @@ static inline int current_umask(void)
return current->fs->umask;
}
+void init_root(struct path *root);
+
#endif /* _LINUX_FS_STRUCT_H */
--
2.47.3
* [PATCH RFC DRAFT POC 10/11] tree-wide: make all kthread path lookups to use LOOKUP_IN_INIT
2026-03-03 13:49 [PATCH RFC DRAFT POC 00/11] fs,kthread: isolate all kthreads in nullfs Christian Brauner
` (8 preceding siblings ...)
2026-03-03 13:49 ` [PATCH RFC DRAFT POC 09/11] fs: add init_root() Christian Brauner
@ 2026-03-03 13:49 ` Christian Brauner
2026-03-03 15:03 ` Christoph Hellwig
2026-03-03 13:49 ` [PATCH RFC DRAFT POC 11/11] fs: isolate all kthreads in nullfs Christian Brauner
2026-03-06 7:26 ` [PATCH RFC DRAFT POC 00/11] fs,kthread: " Askar Safin
11 siblings, 1 reply; 14+ messages in thread
From: Christian Brauner @ 2026-03-03 13:49 UTC (permalink / raw)
To: linux-fsdevel, Linus Torvalds
Cc: linux-kernel, Alexander Viro, Jens Axboe, Jan Kara, Tejun Heo,
Jann Horn, Christian Brauner
In preparation for isolating all kthreads in nullfs, convert all lookups
performed from kthread context to use LOOKUP_IN_INIT. This makes them
all perform the relevant lookup operation in init's filesystem state.
This should be split into individual commits for easy bisectability, but
right now it serves to illustrate the idea without creating a massive
patchbomb.
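
As a rough sketch of the idea, assuming a userspace model rather than
the real VFS: a lookup flag decides whether path resolution starts from
the caller's own filesystem state or from init's. The MODEL_* flag
values and model_* names below are made up for illustration; only the
shape of the check follows the series.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag values; the real LOOKUP_* bits live in the kernel. */
#define MODEL_LOOKUP_FOLLOW	0x1u
#define MODEL_LOOKUP_IN_INIT	0x2u

struct model_fs {
	const char *root;
};

struct model_fs model_init_fs = { .root = "init's root" };

/* Pick the filesystem state a lookup starts from. */
const struct model_fs *model_pick_fs(const struct model_fs *current_fs,
				     unsigned int lookup_flags)
{
	assert(current_fs);
	if (lookup_flags & MODEL_LOOKUP_IN_INIT)
		return &model_init_fs;
	return current_fs;
}

/* Opt in only when called from a kthread, as update_dev_time() does. */
unsigned int model_dev_time_flags(bool is_kthread)
{
	unsigned int flags = MODEL_LOOKUP_FOLLOW;

	if (is_kthread)
		flags |= MODEL_LOOKUP_IN_INIT;
	return flags;
}
```

Callers that may run in either context, like the btrfs one, add the flag
conditionally, while callers that only ever run from kthreads or early
init can pass it unconditionally.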
Signed-off-by: Christian Brauner <brauner@kernel.org>
---
drivers/block/rnbd/rnbd-srv.c | 2 +-
drivers/char/misc_minor_kunit.c | 2 +-
drivers/crypto/ccp/sev-dev.c | 4 +---
drivers/target/target_core_alua.c | 2 +-
drivers/target/target_core_pr.c | 2 +-
fs/btrfs/volumes.c | 6 +++++-
fs/coredump.c | 6 ++----
fs/init.c | 23 ++++++++++++-----------
fs/kernel_read_file.c | 4 +---
fs/namei.c | 2 +-
fs/nfs/blocklayout/dev.c | 4 ++--
fs/smb/server/mgmt/share_config.c | 3 ++-
fs/smb/server/smb2pdu.c | 2 +-
fs/smb/server/vfs.c | 6 ++++--
init/initramfs.c | 4 ++--
init/initramfs_test.c | 4 ++--
net/unix/af_unix.c | 4 +---
17 files changed, 40 insertions(+), 40 deletions(-)
diff --git a/drivers/block/rnbd/rnbd-srv.c b/drivers/block/rnbd/rnbd-srv.c
index 10e8c438bb43..6796aee9a2f0 100644
--- a/drivers/block/rnbd/rnbd-srv.c
+++ b/drivers/block/rnbd/rnbd-srv.c
@@ -734,7 +734,7 @@ static int process_msg_open(struct rnbd_srv_session *srv_sess,
goto reject;
}
- bdev_file = bdev_file_open_by_path(full_path, open_flags, NULL, NULL);
+ bdev_file = bdev_file_open_init(full_path, open_flags, NULL, NULL);
if (IS_ERR(bdev_file)) {
ret = PTR_ERR(bdev_file);
pr_err("Opening device '%s' on session %s failed, failed to open the block device, err: %pe\n",
diff --git a/drivers/char/misc_minor_kunit.c b/drivers/char/misc_minor_kunit.c
index e930c78e1ef9..8af1377c42f9 100644
--- a/drivers/char/misc_minor_kunit.c
+++ b/drivers/char/misc_minor_kunit.c
@@ -165,7 +165,7 @@ static void __init miscdev_test_can_open(struct kunit *test, struct miscdevice *
if (ret != 0)
KUNIT_FAIL(test, "failed to create node\n");
- filp = filp_open(devname, O_RDONLY, 0);
+ filp = filp_open_init(devname, O_RDONLY, 0);
if (IS_ERR(filp))
KUNIT_FAIL(test, "failed to open misc device: %ld\n", PTR_ERR(filp));
else
diff --git a/drivers/crypto/ccp/sev-dev.c b/drivers/crypto/ccp/sev-dev.c
index 096f993974d1..92971671fa9d 100644
--- a/drivers/crypto/ccp/sev-dev.c
+++ b/drivers/crypto/ccp/sev-dev.c
@@ -262,9 +262,7 @@ static struct file *open_file_as_root(const char *filename, int flags, umode_t m
{
struct path root __free(path_put) = {};
- task_lock(&init_task);
- get_fs_root(init_task.fs, &root);
- task_unlock(&init_task);
+ init_root(&root);
CLASS(prepare_creds, cred)();
if (!cred)
diff --git a/drivers/target/target_core_alua.c b/drivers/target/target_core_alua.c
index 10250aca5a81..d23390d1b6ab 100644
--- a/drivers/target/target_core_alua.c
+++ b/drivers/target/target_core_alua.c
@@ -856,7 +856,7 @@ static int core_alua_write_tpg_metadata(
unsigned char *md_buf,
u32 md_buf_len)
{
- struct file *file = filp_open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
+ struct file *file = filp_open_init(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
loff_t pos = 0;
int ret;
diff --git a/drivers/target/target_core_pr.c b/drivers/target/target_core_pr.c
index f88e63aefcd8..7ad6b534ccc6 100644
--- a/drivers/target/target_core_pr.c
+++ b/drivers/target/target_core_pr.c
@@ -1969,7 +1969,7 @@ static int __core_scsi3_write_aptpl_to_file(
if (!path)
return -ENOMEM;
- file = filp_open(path, flags, 0600);
+ file = filp_open_init(path, flags, 0600);
if (IS_ERR(file)) {
pr_err("filp_open(%s) for APTPL metadata"
" failed\n", path);
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 6fb0c4cd50ff..8baeacca01da 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -2119,8 +2119,12 @@ static int btrfs_add_dev_item(struct btrfs_trans_handle *trans,
static void update_dev_time(const char *device_path)
{
struct path path;
+ unsigned int flags = LOOKUP_FOLLOW;
- if (!kern_path(device_path, LOOKUP_FOLLOW, &path)) {
+ if (tsk_is_kthread(current))
+ flags |= LOOKUP_IN_INIT;
+
+ if (!kern_path(device_path, flags, &path)) {
vfs_utimes(&path, NULL);
path_put(&path);
}
diff --git a/fs/coredump.c b/fs/coredump.c
index 550a1553f6cb..1e631c5d2076 100644
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -919,13 +919,11 @@ static bool coredump_file(struct core_name *cn, struct coredump_params *cprm,
* with a fully qualified path" rule is to control where
* coredumps may be placed using root privileges,
* current->fs->root must not be used. Instead, use the
- * root directory of init_task.
+ * root directory of PID 1.
*/
struct path root;
- task_lock(&init_task);
- get_fs_root(init_task.fs, &root);
- task_unlock(&init_task);
+ init_root(&root);
file = file_open_root(&root, cn->corename, open_flags, 0600);
path_put(&root);
} else {
diff --git a/fs/init.c b/fs/init.c
index a79872d5af3b..eb224e945328 100644
--- a/fs/init.c
+++ b/fs/init.c
@@ -12,6 +12,7 @@
#include <linux/init_syscalls.h>
#include <linux/security.h>
#include "internal.h"
+#include "mount.h"
int __init init_pivot_root(const char *new_root, const char *put_old)
{
@@ -102,7 +103,7 @@ int __init init_chown(const char *filename, uid_t user, gid_t group, int flags)
struct path path;
int error;
- error = kern_path(filename, lookup_flags, &path);
+ error = kern_path(filename, lookup_flags | LOOKUP_IN_INIT, &path);
if (error)
return error;
error = mnt_want_write(path.mnt);
@@ -119,7 +120,7 @@ int __init init_chmod(const char *filename, umode_t mode)
struct path path;
int error;
- error = kern_path(filename, LOOKUP_FOLLOW, &path);
+ error = kern_path(filename, LOOKUP_FOLLOW | LOOKUP_IN_INIT, &path);
if (error)
return error;
error = chmod_common(&path, mode);
@@ -132,7 +133,7 @@ int __init init_eaccess(const char *filename)
struct path path;
int error;
- error = kern_path(filename, LOOKUP_FOLLOW, &path);
+ error = kern_path(filename, LOOKUP_FOLLOW | LOOKUP_IN_INIT, &path);
if (error)
return error;
error = path_permission(&path, MAY_ACCESS);
@@ -146,7 +147,7 @@ int __init init_stat(const char *filename, struct kstat *stat, int flags)
struct path path;
int error;
- error = kern_path(filename, lookup_flags, &path);
+ error = kern_path(filename, lookup_flags | LOOKUP_IN_INIT, &path);
if (error)
return error;
error = vfs_getattr(&path, stat, STATX_BASIC_STATS,
@@ -158,39 +159,39 @@ int __init init_stat(const char *filename, struct kstat *stat, int flags)
int __init init_mknod(const char *filename, umode_t mode, unsigned int dev)
{
CLASS(filename_kernel, name)(filename);
- return filename_mknodat(AT_FDCWD, name, mode, dev, 0);
+ return filename_mknodat(AT_FDCWD, name, mode, dev, LOOKUP_IN_INIT);
}
int __init init_link(const char *oldname, const char *newname)
{
CLASS(filename_kernel, old)(oldname);
CLASS(filename_kernel, new)(newname);
- return filename_linkat(AT_FDCWD, old, AT_FDCWD, new, 0, 0);
+ return filename_linkat(AT_FDCWD, old, AT_FDCWD, new, 0, LOOKUP_IN_INIT);
}
int __init init_symlink(const char *oldname, const char *newname)
{
CLASS(filename_kernel, old)(oldname);
CLASS(filename_kernel, new)(newname);
- return filename_symlinkat(old, AT_FDCWD, new, 0);
+ return filename_symlinkat(old, AT_FDCWD, new, LOOKUP_IN_INIT);
}
int __init init_unlink(const char *pathname)
{
CLASS(filename_kernel, name)(pathname);
- return filename_unlinkat(AT_FDCWD, name, 0);
+ return filename_unlinkat(AT_FDCWD, name, LOOKUP_IN_INIT);
}
int __init init_mkdir(const char *pathname, umode_t mode)
{
CLASS(filename_kernel, name)(pathname);
- return filename_mkdirat(AT_FDCWD, name, mode, 0);
+ return filename_mkdirat(AT_FDCWD, name, mode, LOOKUP_IN_INIT);
}
int __init init_rmdir(const char *pathname)
{
CLASS(filename_kernel, name)(pathname);
- return filename_rmdir(AT_FDCWD, name, 0);
+ return filename_rmdir(AT_FDCWD, name, LOOKUP_IN_INIT);
}
int __init init_utimes(char *filename, struct timespec64 *ts)
@@ -198,7 +199,7 @@ int __init init_utimes(char *filename, struct timespec64 *ts)
struct path path;
int error;
- error = kern_path(filename, 0, &path);
+ error = kern_path(filename, LOOKUP_IN_INIT, &path);
if (error)
return error;
error = vfs_utimes(&path, ts);
diff --git a/fs/kernel_read_file.c b/fs/kernel_read_file.c
index de32c95d823d..00bbe0757ad3 100644
--- a/fs/kernel_read_file.c
+++ b/fs/kernel_read_file.c
@@ -156,9 +156,7 @@ ssize_t kernel_read_file_from_path_initns(const char *path, loff_t offset,
if (!path || !*path)
return -EINVAL;
- task_lock(&init_task);
- get_fs_root(init_task.fs, &root);
- task_unlock(&init_task);
+ init_root(&root);
file = file_open_root(&root, path, O_RDONLY, 0);
path_put(&root);
diff --git a/fs/namei.c b/fs/namei.c
index 5cf407aad5b3..976b1e9f7032 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -4906,7 +4906,7 @@ static struct dentry *filename_create(int dfd, struct filename *name,
struct dentry *dentry = ERR_PTR(-EEXIST);
struct qstr last;
bool want_dir = lookup_flags & LOOKUP_DIRECTORY;
- unsigned int reval_flag = lookup_flags & LOOKUP_REVAL;
+ unsigned int reval_flag = lookup_flags & (LOOKUP_REVAL | LOOKUP_IN_INIT);
unsigned int create_flags = LOOKUP_CREATE | LOOKUP_EXCL;
int type;
int error;
diff --git a/fs/nfs/blocklayout/dev.c b/fs/nfs/blocklayout/dev.c
index cc6327d97a91..32dee716237a 100644
--- a/fs/nfs/blocklayout/dev.c
+++ b/fs/nfs/blocklayout/dev.c
@@ -370,8 +370,8 @@ bl_open_path(struct pnfs_block_volume *v, const char *prefix)
if (!devname)
return ERR_PTR(-ENOMEM);
- bdev_file = bdev_file_open_by_path(devname, BLK_OPEN_READ | BLK_OPEN_WRITE,
- NULL, NULL);
+ bdev_file = bdev_file_open_init(devname, BLK_OPEN_READ | BLK_OPEN_WRITE,
+ NULL, NULL);
if (IS_ERR(bdev_file)) {
dprintk("failed to open device %s (%ld)\n",
devname, PTR_ERR(bdev_file));
diff --git a/fs/smb/server/mgmt/share_config.c b/fs/smb/server/mgmt/share_config.c
index 53f44ff4d376..2deefdc242a8 100644
--- a/fs/smb/server/mgmt/share_config.c
+++ b/fs/smb/server/mgmt/share_config.c
@@ -189,7 +189,8 @@ static struct ksmbd_share_config *share_config_request(struct ksmbd_work *work,
goto out;
}
- ret = kern_path(share->path, 0, &share->vfs_path);
+ ret = kern_path(share->path, LOOKUP_IN_INIT,
+ &share->vfs_path);
ksmbd_revert_fsids(work);
if (ret) {
ksmbd_debug(SMB, "failed to access '%s'\n",
diff --git a/fs/smb/server/smb2pdu.c b/fs/smb/server/smb2pdu.c
index 95901a78951c..8e89fb9a8c35 100644
--- a/fs/smb/server/smb2pdu.c
+++ b/fs/smb/server/smb2pdu.c
@@ -5462,7 +5462,7 @@ static int smb2_get_info_filesystem(struct ksmbd_work *work,
if (!share->path)
return -EIO;
- rc = kern_path(share->path, LOOKUP_NO_SYMLINKS, &path);
+ rc = kern_path(share->path, LOOKUP_NO_SYMLINKS | LOOKUP_IN_INIT, &path);
if (rc) {
pr_err("cannot create vfs path\n");
return -EIO;
diff --git a/fs/smb/server/vfs.c b/fs/smb/server/vfs.c
index d08973b288e5..2e64ed65dcca 100644
--- a/fs/smb/server/vfs.c
+++ b/fs/smb/server/vfs.c
@@ -62,6 +62,7 @@ static int ksmbd_vfs_path_lookup(struct ksmbd_share_config *share_conf,
if (pathname[0] == '\0') {
pathname = share_conf->path;
root_share_path = NULL;
+ flags |= LOOKUP_IN_INIT;
} else {
flags |= LOOKUP_BENEATH;
}
@@ -622,7 +623,7 @@ int ksmbd_vfs_link(struct ksmbd_work *work, const char *oldname,
if (ksmbd_override_fsids(work))
return -ENOMEM;
- err = kern_path(oldname, LOOKUP_NO_SYMLINKS, &oldpath);
+ err = kern_path(oldname, LOOKUP_NO_SYMLINKS | LOOKUP_IN_INIT, &oldpath);
if (err) {
pr_err("cannot get linux path for %s, err = %d\n",
oldname, err);
@@ -1258,7 +1259,8 @@ struct dentry *ksmbd_vfs_kern_path_create(struct ksmbd_work *work,
if (!abs_name)
return ERR_PTR(-ENOMEM);
- dent = start_creating_path(AT_FDCWD, abs_name, path, flags);
+ dent = start_creating_path(AT_FDCWD, abs_name, path,
+ flags | LOOKUP_IN_INIT);
kfree(abs_name);
return dent;
}
diff --git a/init/initramfs.c b/init/initramfs.c
index 139baed06589..f44d772f960b 100644
--- a/init/initramfs.c
+++ b/init/initramfs.c
@@ -382,7 +382,7 @@ static int __init do_name(void)
int openflags = O_WRONLY|O_CREAT|O_LARGEFILE;
if (ml != 1)
openflags |= O_TRUNC;
- wfile = filp_open(collected, openflags, mode);
+ wfile = filp_open_init(collected, openflags, mode);
if (IS_ERR(wfile))
return 0;
wfile_pos = 0;
@@ -702,7 +702,7 @@ static void __init populate_initrd_image(char *err)
printk(KERN_INFO "rootfs image is not initramfs (%s); looks like an initrd\n",
err);
- file = filp_open("/initrd.image", O_WRONLY|O_CREAT|O_LARGEFILE, 0700);
+ file = filp_open_init("/initrd.image", O_WRONLY|O_CREAT|O_LARGEFILE, 0700);
if (IS_ERR(file))
return;
diff --git a/init/initramfs_test.c b/init/initramfs_test.c
index 2ce38d9a8fd0..9415b9cfb9d3 100644
--- a/init/initramfs_test.c
+++ b/init/initramfs_test.c
@@ -224,7 +224,7 @@ static void __init initramfs_test_data(struct kunit *test)
err = unpack_to_rootfs(cpio_srcbuf, len);
KUNIT_EXPECT_NULL(test, err);
- file = filp_open(c[0].fname, O_RDONLY, 0);
+ file = filp_open_init(c[0].fname, O_RDONLY, 0);
if (IS_ERR(file)) {
KUNIT_FAIL(test, "open failed");
goto out;
@@ -430,7 +430,7 @@ static void __init initramfs_test_fname_pad(struct kunit *test)
err = unpack_to_rootfs(tbufs->cpio_srcbuf, len);
KUNIT_EXPECT_NULL(test, err);
- file = filp_open(c[0].fname, O_RDONLY, 0);
+ file = filp_open_init(c[0].fname, O_RDONLY, 0);
if (IS_ERR(file)) {
KUNIT_FAIL(test, "open failed");
goto out;
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 3756a93dc63a..6f370cb44afe 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -1200,9 +1200,7 @@ static struct sock *unix_find_bsd(struct sockaddr_un *sunaddr, int addr_len,
if (flags & SOCK_COREDUMP) {
struct path root;
- task_lock(&init_task);
- get_fs_root(init_task.fs, &root);
- task_unlock(&init_task);
+ init_root(&root);
scoped_with_kernel_creds()
err = vfs_path_lookup(root.dentry, root.mnt, sunaddr->sun_path,
--
2.47.3
* [PATCH RFC DRAFT POC 11/11] fs: isolate all kthreads in nullfs
2026-03-03 13:49 [PATCH RFC DRAFT POC 00/11] fs,kthread: isolate all kthreads in nullfs Christian Brauner
` (9 preceding siblings ...)
2026-03-03 13:49 ` [PATCH RFC DRAFT POC 10/11] tree-wide: make all kthread path lookups to use LOOKUP_IN_INIT Christian Brauner
@ 2026-03-03 13:49 ` Christian Brauner
2026-03-06 7:26 ` [PATCH RFC DRAFT POC 00/11] fs,kthread: " Askar Safin
11 siblings, 0 replies; 14+ messages in thread
From: Christian Brauner @ 2026-03-03 13:49 UTC (permalink / raw)
To: linux-fsdevel, Linus Torvalds
Cc: linux-kernel, Alexander Viro, Jens Axboe, Jan Kara, Tejun Heo,
Jann Horn, Christian Brauner
Leave all kthreads isolated in nullfs and move userspace init into its
own separate fs_struct that any kthread can grab on demand to perform
lookups. This isolates kthreads from userspace filesystem state quite a
bit and makes it hard for anyone to mess up when performing filesystem
operations from kthreads. Without LOOKUP_IN_INIT they will not be able
to do anything at all: no lookup and no creation.
Add a new struct kernel_clone_args extension that allows creating a
task that shares init's filesystem state. This is only going to be used
by user_mode_thread(), which executes stuff in init's filesystem state.
That concept should go away.
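
A minimal userspace sketch of the usermodehelper rule, assuming made-up
MODEL_CLONE_* values and model_* types: a umh task starts from init's
filesystem state, but only if it does not ask to share it or to create
a mount namespace and is started from the initial mount namespace. This
models the check only, not the actual fs_struct copy.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag values; the real CLONE_* bits differ. */
#define MODEL_CLONE_FS		0x1u
#define MODEL_CLONE_NEWNS	0x2u

struct model_fs {
	int id;
};

struct model_fs model_userspace_init_fs = { .id = 1 };

/*
 * Choose the fs state a new task starts from. Returns 0 and sets *src,
 * or -EINVAL when a usermodehelper asks for something it may not have.
 */
int model_copy_fs_source(unsigned int clone_flags, bool umh,
			 bool in_init_mnt_ns, struct model_fs *current_fs,
			 struct model_fs **src)
{
	assert(src);
	if (umh) {
		if (clone_flags & (MODEL_CLONE_NEWNS | MODEL_CLONE_FS))
			return -EINVAL;
		if (!in_init_mnt_ns)
			return -EINVAL;
		*src = &model_userspace_init_fs;
		return 0;
	}
	*src = current_fs;
	return 0;
}
```

Everything that is not a usermodehelper keeps starting from the
creator's own state, which for kthreads means nullfs.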
Signed-off-by: Christian Brauner <brauner@kernel.org>
---
fs/fs_struct.c | 49 +++++++++++++++++++++++++++++++++++++++++++---
fs/namei.c | 4 ++--
fs/namespace.c | 4 ----
include/linux/fs_struct.h | 1 +
include/linux/init_task.h | 1 +
include/linux/sched/task.h | 1 +
init/main.c | 10 +++++++++-
kernel/fork.c | 26 +++++++++++++++++++++---
8 files changed, 83 insertions(+), 13 deletions(-)
diff --git a/fs/fs_struct.c b/fs/fs_struct.c
index 64b5840131cb..164139c27380 100644
--- a/fs/fs_struct.c
+++ b/fs/fs_struct.c
@@ -8,6 +8,7 @@
#include <linux/fs_struct.h>
#include <linux/init_task.h>
#include "internal.h"
+#include "mount.h"
/*
* Replace the fs->{rootmnt,root} with {mnt,dentry}. Put the old values.
@@ -160,13 +161,30 @@ EXPORT_SYMBOL_GPL(unshare_fs_struct);
* fs_struct state. Breaking that contract sucks for both sides.
* So just don't bother with extra work for this. No sane init
* system should ever do this.
+ *
+ * On older kernels if PID 1 unshared its filesystem state with us the
+ * kernel simply used the stale fs_struct state implicitly pinning
+ * anything that PID 1 had last used. Even if PID 1 might've moved on to
+ * some completely different fs_struct state and might've even unmounted
+ * the old root.
+ *
+ * This has hilarious consequences: Think continuing to dump coredump
+ * state into an implicitly pinned directory somewhere. Calling random
+ * binaries in the old rootfs via usermodehelpers.
+ *
+ * Be aggressive about this: We simply reject operating on stale
+ * fs_struct state by reverting to nullfs. Every kworker that does
+ * lookups after this point will fail. Every usermodehelper call will
+ * fail. Tough luck but let's be kind and emit a warning to userspace.
*/
static inline bool nullfs_userspace_init(void)
{
struct fs_struct *fs = current->fs;
- if (unlikely(current->pid == 1) && fs != &init_fs) {
+ if (unlikely(current->pid == 1) && fs != &userspace_init_fs) {
pr_warn("VFS: Pid 1 stopped sharing filesystem state\n");
+ set_fs_root(&userspace_init_fs, &init_fs.root);
+ set_fs_pwd(&userspace_init_fs, &init_fs.root);
return true;
}
@@ -186,7 +204,9 @@ struct fs_struct *switch_fs_struct(struct fs_struct *new_fs)
new_fs = fs;
read_sequnlock_excl(&fs->seq);
- nullfs_userspace_init();
+ /* one reference belongs to us */
+ if (nullfs_userspace_init())
+ return NULL;
return new_fs;
}
@@ -197,8 +217,31 @@ struct fs_struct init_fs = {
.umask = 0022,
};
+struct fs_struct userspace_init_fs = {
+ .users = 1,
+ .seq = __SEQLOCK_UNLOCKED(userspace_init_fs.seq),
+ .umask = 0022,
+};
+
void init_root(struct path *root)
{
- get_fs_root(&init_fs, root);
+ get_fs_root(&userspace_init_fs, root);
}
EXPORT_SYMBOL_GPL(init_root);
+
+void __init init_userspace_fs(void)
+{
+ struct mount *m;
+ struct path root;
+
+ /* Move PID 1 from nullfs into the initramfs. */
+ m = topmost_overmount(current->nsproxy->mnt_ns->root);
+ root.mnt = &m->mnt;
+ root.dentry = root.mnt->mnt_root;
+
+ VFS_WARN_ON_ONCE(current->fs != &init_fs);
+ VFS_WARN_ON_ONCE(current->pid != 1);
+ set_fs_root(&userspace_init_fs, &root);
+ set_fs_pwd(&userspace_init_fs, &root);
+ switch_fs_struct(&userspace_init_fs);
+}
diff --git a/fs/namei.c b/fs/namei.c
index 976b1e9f7032..6cc53040e9eb 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1102,7 +1102,7 @@ static int set_root(struct nameidata *nd)
struct fs_struct *fs;
if (nd->flags & LOOKUP_IN_INIT)
- fs = &init_fs;
+ fs = &userspace_init_fs;
else
fs = current->fs;
@@ -2724,7 +2724,7 @@ static const char *path_init(struct nameidata *nd, unsigned flags)
struct fs_struct *fs;
if (nd->flags & LOOKUP_IN_INIT)
- fs = &init_fs;
+ fs = &userspace_init_fs;
else
fs = current->fs;
diff --git a/fs/namespace.c b/fs/namespace.c
index 854f4fc66469..10056ac1dcd2 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -6190,10 +6190,6 @@ static void __init init_mount_tree(void)
init_task.nsproxy->mnt_ns = &init_mnt_ns;
get_mnt_ns(&init_mnt_ns);
-
- /* The root and pwd always point to the mutable rootfs. */
- root.mnt = mnt;
- root.dentry = mnt->mnt_root;
set_fs_pwd(current->fs, &root);
set_fs_root(current->fs, &root);
diff --git a/include/linux/fs_struct.h b/include/linux/fs_struct.h
index 8ff1acd8389d..5c40fdc39550 100644
--- a/include/linux/fs_struct.h
+++ b/include/linux/fs_struct.h
@@ -50,5 +50,6 @@ static inline int current_umask(void)
}
void init_root(struct path *root);
+void __init init_userspace_fs(void);
#endif /* _LINUX_FS_STRUCT_H */
diff --git a/include/linux/init_task.h b/include/linux/init_task.h
index a6cb241ea00c..f27f88598394 100644
--- a/include/linux/init_task.h
+++ b/include/linux/init_task.h
@@ -24,6 +24,7 @@
extern struct files_struct init_files;
extern struct fs_struct init_fs;
+extern struct fs_struct userspace_init_fs;
extern struct nsproxy init_nsproxy;
#ifndef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index 41ed884cffc9..e0c1ca8c6a18 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -31,6 +31,7 @@ struct kernel_clone_args {
u32 io_thread:1;
u32 user_worker:1;
u32 no_files:1;
+ u32 umh:1;
unsigned long stack;
unsigned long stack_size;
unsigned long tls;
diff --git a/init/main.c b/init/main.c
index 1cb395dd94e4..ca0d0914c63e 100644
--- a/init/main.c
+++ b/init/main.c
@@ -102,6 +102,7 @@
#include <linux/stackdepot.h>
#include <linux/randomize_kstack.h>
#include <linux/pidfs.h>
+#include <linux/fs_struct.h>
#include <linux/ptdump.h>
#include <linux/time_namespace.h>
#include <linux/unaligned.h>
@@ -713,6 +714,11 @@ static __initdata DECLARE_COMPLETION(kthreadd_done);
static noinline void __ref __noreturn rest_init(void)
{
+ struct kernel_clone_args init_args = {
+ .flags = (CLONE_FS | CLONE_VM | CLONE_UNTRACED),
+ .fn = kernel_init,
+ .fn_arg = NULL,
+ };
struct task_struct *tsk;
int pid;
@@ -722,7 +728,7 @@ static noinline void __ref __noreturn rest_init(void)
* the init task will end up wanting to create kthreads, which, if
* we schedule it before we create kthreadd, will OOPS.
*/
- pid = user_mode_thread(kernel_init, NULL, CLONE_FS);
+ pid = kernel_clone(&init_args);
/*
* Pin init on the boot CPU. Task migration is not properly working
* until sched_init_smp() has been run. It will set the allowed
@@ -1574,6 +1580,8 @@ static int __ref kernel_init(void *unused)
{
int ret;
+ init_userspace_fs();
+
/*
* Wait until kthreadd is all set-up.
*/
diff --git a/kernel/fork.c b/kernel/fork.c
index 583078c69bbd..121538f58272 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1590,9 +1590,28 @@ static int copy_mm(u64 clone_flags, struct task_struct *tsk)
return 0;
}
-static int copy_fs(u64 clone_flags, struct task_struct *tsk)
+static int copy_fs(u64 clone_flags, struct task_struct *tsk, bool umh)
{
- struct fs_struct *fs = current->fs;
+ struct fs_struct *fs;
+
+ /*
+ * Usermodehelper may use userspace_init_fs filesystem state but
+ * they don't get to create mount namespaces, share the
+ * filesystem state, or be started from a non-initial mount
+ * namespace.
+ */
+ if (umh) {
+ if (clone_flags & (CLONE_NEWNS | CLONE_FS))
+ return -EINVAL;
+ if (current->nsproxy->mnt_ns != &init_mnt_ns)
+ return -EINVAL;
+ }
+
+ if (umh)
+ fs = &userspace_init_fs;
+ else
+ fs = current->fs;
+
if (clone_flags & CLONE_FS) {
/* tsk->fs is already what we want */
read_seqlock_excl(&fs->seq);
@@ -2211,7 +2230,7 @@ __latent_entropy struct task_struct *copy_process(
retval = copy_files(clone_flags, p, args->no_files);
if (retval)
goto bad_fork_cleanup_semundo;
- retval = copy_fs(clone_flags, p);
+ retval = copy_fs(clone_flags, p, args->umh);
if (retval)
goto bad_fork_cleanup_files;
retval = copy_sighand(clone_flags, p);
@@ -2725,6 +2744,7 @@ pid_t user_mode_thread(int (*fn)(void *), void *arg, unsigned long flags)
.exit_signal = (flags & CSIGNAL),
.fn = fn,
.fn_arg = arg,
+ .umh = 1,
};
return kernel_clone(&args);
--
2.47.3
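
The rule that the copy_fs() hunk above adds can be modelled in userspace
roughly as follows. This is a minimal sketch and not kernel code: struct
model_fs, struct model_task, model_userspace_init_fs and model_pick_fs()
are hypothetical names made up for illustration, and only the choice of
the source fs_struct is modelled, not the copying or refcounting that
follows it in the real function.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <sched.h>	/* CLONE_FS, CLONE_NEWNS */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct fs_struct. */
struct model_fs {
	const char *name;
};

/* Hypothetical stand-in for the bits of "current" the check looks at. */
struct model_task {
	struct model_fs *fs;	/* models current->fs */
	bool in_init_mnt_ns;	/* models nsproxy->mnt_ns == &init_mnt_ns */
};

/* Models the separate filesystem state handed to PID 1. */
static struct model_fs model_userspace_init_fs = { "userspace_init_fs" };

/*
 * Decide which filesystem state a new task starts from.
 * Returns 0 and stores the result in *out, or -EINVAL if a
 * usermodehelper asks for something it is not allowed to have.
 */
static int model_pick_fs(uint64_t clone_flags, const struct model_task *cur,
			 bool umh, struct model_fs **out)
{
	assert(cur && out);

	if (umh) {
		/* No new mount namespace and no sharing of fs state. */
		if (clone_flags & (CLONE_NEWNS | CLONE_FS))
			return -EINVAL;
		/* Must be started from the initial mount namespace. */
		if (!cur->in_init_mnt_ns)
			return -EINVAL;
		*out = &model_userspace_init_fs;
		return 0;
	}

	/* Everything else starts from the creating task's state. */
	*out = cur->fs;
	return 0;
}
```

In this model a usermodehelper always starts from
model_userspace_init_fs, whatever the creating task's own state is,
while every other task starts from the creator's state.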
* Re: [PATCH RFC DRAFT POC 10/11] tree-wide: make all kthread path lookups to use LOOKUP_IN_INIT
2026-03-03 13:49 ` [PATCH RFC DRAFT POC 10/11] tree-wide: make all kthread path lookups to use LOOKUP_IN_INIT Christian Brauner
@ 2026-03-03 15:03 ` Christoph Hellwig
0 siblings, 0 replies; 14+ messages in thread
From: Christoph Hellwig @ 2026-03-03 15:03 UTC (permalink / raw)
To: Christian Brauner
Cc: linux-fsdevel, Linus Torvalds, linux-kernel, Alexander Viro,
Jens Axboe, Jan Kara, Tejun Heo, Jann Horn
On Tue, Mar 03, 2026 at 02:49:21PM +0100, Christian Brauner wrote:
> In preparation to isolate all kthreads in nullfs convert all lookups
> performed from kthread context to use LOOKUP_IN_INIT. This will make
> them all perform the relevant lookup operation in init's filesystem
> state.
>
> This should be switched to individual commits for easy bisectability but
Not just for bisectability, but also to explain how we end up calling
these from thread context. I suspect that in many cases, just by
understanding that, we could get rid of it entirely with a bit of work.
* Re: [PATCH RFC DRAFT POC 00/11] fs,kthread: isolate all kthreads in nullfs
2026-03-03 13:49 [PATCH RFC DRAFT POC 00/11] fs,kthread: isolate all kthreads in nullfs Christian Brauner
` (10 preceding siblings ...)
2026-03-03 13:49 ` [PATCH RFC DRAFT POC 11/11] fs: isolate all kthreads in nullfs Christian Brauner
@ 2026-03-06 7:26 ` Askar Safin
11 siblings, 0 replies; 14+ messages in thread
From: Askar Safin @ 2026-03-06 7:26 UTC (permalink / raw)
To: brauner; +Cc: axboe, jack, jannh, linux-fsdevel, linux-kernel, tj, torvalds,
viro
Christian Brauner <brauner@kernel.org>:
> Instead of sharing fs_struct between kernel threads and pid 1 we give
> pid a separate userspace_init_fs struct.
You meant "we give pid 1 a separate"
> The only remaining kernel tasks that actually share init's filesystem
> state are usermodhelpers
This sounds like usermodhelpers actually *share* init's fs_struct.
This is false, otherwise usermodhelpers could just do "chdir" and change
cwd of pid 1. So, please, rephrase this sentence.
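
The difference between sharing and copying an fs_struct can be observed
from userspace with clone(2). The following is a minimal sketch, assuming
Linux with glibc and an existing /tmp directory. The helper name
child_chdir_visible_in_parent() is made up for illustration and is not
part of the series.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <limits.h>
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

#define CHILD_STACK_SIZE (256 * 1024)

/* Child body: move to "/" and report success through the exit status. */
static int child_chdir_root(void *unused)
{
	(void)unused;
	return chdir("/") == 0 ? 0 : 1;
}

/*
 * Hypothetical helper: returns 1 if the child's chdir("/") changed the
 * caller's cwd, 0 if it did not, and -1 on error. The caller's original
 * cwd is restored before returning.
 */
static int child_chdir_visible_in_parent(int fs_flag)
{
	char orig[PATH_MAX], after[PATH_MAX];
	char *stack;
	int status, ret = -1;
	pid_t pid;

	/* This sketch only covers "share fs_struct" vs. "copy fs_struct". */
	assert(fs_flag == 0 || fs_flag == CLONE_FS);

	if (!getcwd(orig, sizeof(orig)))
		return -1;
	/* Start somewhere other than "/" so that a change is observable. */
	if (chdir("/tmp") != 0)
		return -1;

	stack = malloc(CHILD_STACK_SIZE);
	if (!stack)
		goto out;

	/* clone() takes the top of the stack on downward-growing stacks. */
	pid = clone(child_chdir_root, stack + CHILD_STACK_SIZE,
		    fs_flag | SIGCHLD, NULL);
	if (pid < 0)
		goto out_free;
	if (waitpid(pid, &status, 0) != pid ||
	    !WIFEXITED(status) || WEXITSTATUS(status) != 0)
		goto out_free;

	/* With a shared fs_struct the caller now sits in "/" as well. */
	if (getcwd(after, sizeof(after)))
		ret = strcmp(after, "/") == 0;
out_free:
	free(stack);
out:
	if (chdir(orig) != 0)
		ret = -1;
	return ret;
}
```

Called with CLONE_FS the helper reports that the child's chdir() is
visible in the caller. Called with 0 the child gets its own copy and the
caller's cwd stays where it was, which is the behaviour described above
for usermodehelpers.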
> Be aggressive about this: We simply reject operating on stale
> fs_struct state by reverting userspace_init_fs to nullfs.
I think in this case unshare should simply fail and return an error to
userspace.
> PID 1 uses pid1_fs which points to the initramfs
You meant userspace_init_fs, not pid1_fs. (pid1_fs is mentioned twice in
the cover letter, and also in comments in patch 06/11.)
> `init_chroot_to_overmount()` at the start of `kernel_init()`)
There is no function named "init_chroot_to_overmount".
I hope you are not offended. I just did some proof-reading.
--
Askar Safin
Thread overview: 14+ messages
2026-03-03 13:49 [PATCH RFC DRAFT POC 00/11] fs,kthread: isolate all kthreads in nullfs Christian Brauner
2026-03-03 13:49 ` [PATCH RFC DRAFT POC 01/11] kthread: refactor __kthread_create_on_node() to take a struct argument Christian Brauner
2026-03-03 13:49 ` [PATCH RFC DRAFT POC 02/11] kthread: remove unused flags argument from kthread worker creation API Christian Brauner
2026-03-03 13:49 ` [PATCH RFC DRAFT POC 03/11] kthread: add extensible kthread_create()/kthread_run() pattern Christian Brauner
2026-03-03 13:49 ` [PATCH RFC DRAFT POC 04/11] fs: notice when init abandons fs sharing Christian Brauner
2026-03-03 13:49 ` [PATCH RFC DRAFT POC 05/11] fs: add LOOKUP_IN_INIT Christian Brauner
2026-03-03 13:49 ` [PATCH RFC DRAFT POC 06/11] fs: add file_open_init() Christian Brauner
2026-03-03 13:49 ` [PATCH RFC DRAFT POC 07/11] block: add bdev_file_open_init() Christian Brauner
2026-03-03 13:49 ` [PATCH RFC DRAFT POC 08/11] fs: allow to pass lookup flags to filename_*() Christian Brauner
2026-03-03 13:49 ` [PATCH RFC DRAFT POC 09/11] fs: add init_root() Christian Brauner
2026-03-03 13:49 ` [PATCH RFC DRAFT POC 10/11] tree-wide: make all kthread path lookups to use LOOKUP_IN_INIT Christian Brauner
2026-03-03 15:03 ` Christoph Hellwig
2026-03-03 13:49 ` [PATCH RFC DRAFT POC 11/11] fs: isolate all kthreads in nullfs Christian Brauner
2026-03-06 7:26 ` [PATCH RFC DRAFT POC 00/11] fs,kthread: " Askar Safin