* [PATCH v3 0/3] initrd: remove half of classic initrd support
From: Askar Safin @ 2025-10-17 6:09 UTC
To: linux-fsdevel, linux-kernel
Cc: Linus Torvalds, Greg Kroah-Hartman, Christian Brauner, Al Viro,
Jan Kara, Christoph Hellwig, Jens Axboe, Andy Shevchenko,
Aleksa Sarai, Thomas Weißschuh, Julian Stecklina, Gao Xiang,
Art Nikpal, Andrew Morton, Alexander Graf, Rob Landley,
Lennart Poettering, linux-arch, linux-block, initramfs, linux-api,
linux-doc, Michal Simek, Luis Chamberlain, Kees Cook,
Thorsten Blum, Heiko Carstens, Arnd Bergmann, Dave Young,
Christophe Leroy, Krzysztof Kozlowski, Borislav Petkov,
Jessica Clarke, Nicolas Schichan, David Disseldorp, patches
Intro
====
This patchset removes half of classic initrd (initial RAM disk) support,
namely the linuxrc code path, which was deprecated in 2020.
Initramfs stays, and the RAM disk driver itself (brd) stays.
The other half of initrd stays, too.
init/do_mounts* are listed under the VFS entry in
MAINTAINERS, so I think this patchset should go through the VFS tree.
I tested the patchset on 8 (!!!) archs in QEMU (see details below).
If you still use initrd, see below for a workaround.
In 2020 a deprecation notice was added to the linuxrc initrd code path.
In v1 I tried to remove initrd
fully, but Nicolas Schichan reported that he still uses
the other code path (the root=/dev/ram0 one) on a million devices [4].
The root=/dev/ram0 code path did not contain a deprecation notice.
So, in this version of the patchset I remove the deprecated code path,
i.e. the linuxrc one, while keeping the other, i.e. the root=/dev/ram0 one.
I also add a deprecation notice to the remaining code path, i.e. to
the root=/dev/ram0 one. I plan to send patches for full removal
of initrd after one year, i.e. in September 2026 (of course,
initramfs will still work).
Also, I tried to keep this patchset small so that it
can be reverted easily. I plan to send cleanups later.
Details
====
Other user-visible changes:
- Removed kernel command line parameters "load_ramdisk" and
"prompt_ramdisk", which did nothing and were deprecated
- Removed /proc/sys/kernel/real-root-dev, which was used
for initrd only
- Command line parameters "noinitrd" and "ramdisk_start=" are deprecated
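
To see whether a given boot configuration is touched by the parameter changes listed above, a small userspace check along these lines may help. This is only a sketch: the names `classify_param` and `scan_cmdline` are hypothetical, and the matching is deliberately simplified (exact name, optionally followed by `=`), so it does not reproduce every detail of how the kernel itself matches command line parameters.

```c
#include <stddef.h>
#include <string.h>

/* How this patchset treats a kernel command line parameter. */
enum param_status { PARAM_UNAFFECTED, PARAM_REMOVED, PARAM_DEPRECATED };

/* Return 1 if token is exactly `name`, or `name` followed by '='. */
static int is_param(const char *token, const char *name)
{
	size_t n = strlen(name);

	return strncmp(token, name, n) == 0 &&
	       (token[n] == '\0' || token[n] == '=');
}

/* Classify one whitespace-delimited token of a kernel command line. */
enum param_status classify_param(const char *token)
{
	if (is_param(token, "load_ramdisk") || is_param(token, "prompt_ramdisk"))
		return PARAM_REMOVED;
	if (is_param(token, "noinitrd") || is_param(token, "ramdisk_start"))
		return PARAM_DEPRECATED;
	return PARAM_UNAFFECTED;
}

/*
 * Count removed and deprecated parameters in a whole command line.
 * Simplification: tokens are split on whitespace only, so quoted
 * values containing spaces are not handled.
 */
void scan_cmdline(const char *cmdline, int *removed, int *deprecated)
{
	char buf[4096];
	char *tok;

	*removed = 0;
	*deprecated = 0;
	strncpy(buf, cmdline, sizeof(buf) - 1);
	buf[sizeof(buf) - 1] = '\0';

	for (tok = strtok(buf, " \t\n"); tok; tok = strtok(NULL, " \t\n")) {
		switch (classify_param(tok)) {
		case PARAM_REMOVED:
			(*removed)++;
			break;
		case PARAM_DEPRECATED:
			(*deprecated)++;
			break;
		default:
			break;
		}
	}
}
```

A caller would typically read /proc/cmdline into a buffer and pass it to `scan_cmdline`. Note that `initrd=` is not flagged: it belongs to the bootloader mechanism, not to the parameters changed here.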
This patchset is based on v6.18-rc1.
Testing
====
I tested my patchset on many architectures in QEMU using my Rust
program, which is heavily based on mkroot [1].
I used the following cross-compilers:
aarch64-linux-musleabi
armv4l-linux-musleabihf
armv5l-linux-musleabihf
armv7l-linux-musleabihf
i486-linux-musl
i686-linux-musl
mips-linux-musl
mips64-linux-musl
mipsel-linux-musl
powerpc-linux-musl
powerpc64-linux-musl
powerpc64le-linux-musl
riscv32-linux-musl
riscv64-linux-musl
s390x-linux-musl
sh4-linux-musl
sh4eb-linux-musl
x86_64-linux-musl
taken from this directory [2].
So, as you can see, there are 18 triplets, which correspond to 8 subdirs in arch/.
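
The count of 8 can be reproduced by mapping each triplet's CPU prefix to its arch/ subdirectory and counting the distinct results. The sketch below does that in C; the function names and the prefix table are illustrative and not part of the patchset or of the test program mentioned above.

```c
#include <stddef.h>
#include <string.h>

/* The 18 toolchain triplets listed above. */
const char *const tested_triplets[] = {
	"aarch64-linux-musleabi", "armv4l-linux-musleabihf",
	"armv5l-linux-musleabihf", "armv7l-linux-musleabihf",
	"i486-linux-musl", "i686-linux-musl", "mips-linux-musl",
	"mips64-linux-musl", "mipsel-linux-musl", "powerpc-linux-musl",
	"powerpc64-linux-musl", "powerpc64le-linux-musl",
	"riscv32-linux-musl", "riscv64-linux-musl", "s390x-linux-musl",
	"sh4-linux-musl", "sh4eb-linux-musl", "x86_64-linux-musl",
};

/* Map a triplet to its arch/ subdirectory by CPU-name prefix. */
const char *arch_dir_for_triplet(const char *triplet)
{
	static const struct { const char *prefix, *dir; } map[] = {
		{ "aarch64", "arm64" }, { "arm", "arm" },
		{ "i486", "x86" }, { "i686", "x86" }, { "x86_64", "x86" },
		{ "mips", "mips" }, { "powerpc", "powerpc" },
		{ "riscv", "riscv" }, { "s390", "s390" }, { "sh4", "sh" },
	};
	size_t i;

	for (i = 0; i < sizeof(map) / sizeof(map[0]); i++)
		if (strncmp(triplet, map[i].prefix, strlen(map[i].prefix)) == 0)
			return map[i].dir;
	return NULL;
}

/* Count distinct arch/ subdirectories covered by n triplets. */
size_t count_arch_dirs(const char *const *triplets, size_t n)
{
	const char *seen[32];
	size_t nseen = 0, i, j;

	for (i = 0; i < n; i++) {
		const char *dir = arch_dir_for_triplet(triplets[i]);

		if (!dir)
			continue;
		for (j = 0; j < nseen; j++)
			if (strcmp(seen[j], dir) == 0)
				break;
		if (j == nseen && nseen < sizeof(seen) / sizeof(seen[0]))
			seen[nseen++] = dir;
	}
	return nseen;
}
```

The distinct directories are arm64, arm, x86, mips, powerpc, riscv, s390 and sh.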
For every triplet I tested that:
- Initramfs still works (both builtin and external)
- Direct boot from disk still works
- Remaining initrd code path (root=/dev/ram0) still works
Workaround
====
If "retain_initrd" is passed to the kernel, then the initramfs/initrd
passed by the bootloader is retained and becomes available after boot
as the read-only magic file /sys/firmware/initrd [3].
No copies are involved, i.e. /sys/firmware/initrd is simply
a reference to the original blob passed by the bootloader.
This works even if the initrd/initramfs is not recognized by the kernel
in any way, i.e. even if it is neither a valid cpio archive nor
a fs image supported by classic initrd.
This works both with my patchset and without it.
This means that you can emulate classic initrd as follows:
link a builtin initramfs into the kernel; in /init in this initramfs,
copy /sys/firmware/initrd to some file in / and loop-mount it.
This is even better than classic initrd, because:
- You can use a fs not supported by classic initrd, for example erofs
- One copy is involved (from /sys/firmware/initrd to some file in /)
as opposed to two when using classic initrd
Still, I don't recommend using this workaround, because
I want everyone to migrate to proper modern initramfs.
But you can use it if you want.
Also: it is not possible to loop-mount
/sys/firmware/initrd directly. Theoretically the kernel could be changed
to allow this (and/or to make it writable), but I think nobody needs this,
and I don't want to implement it.
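
The emulation described in this section could look roughly like the following C fragment, meant to be called from /init in the builtin initramfs. It is a sketch under several assumptions that are not stated in the cover letter: the kernel was booted with "retain_initrd", sysfs is already mounted on /sys, /dev is populated (for example by devtmpfs), the loop driver is built in, and the paths /initrd.img and /mnt as well as the "ext2" filesystem type are placeholders. The helper names are hypothetical.

```c
#include <fcntl.h>
#include <linux/loop.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <sys/mount.h>
#include <sys/types.h>
#include <unistd.h>

/* Copy src to dst with read()/write(); return 0 on success, -1 on error. */
int copy_file(const char *src, const char *dst)
{
	char buf[65536];
	ssize_t n;
	int in, out, ret = -1;

	in = open(src, O_RDONLY);
	if (in < 0)
		return -1;
	out = open(dst, O_WRONLY | O_CREAT | O_TRUNC, 0600);
	if (out < 0) {
		close(in);
		return -1;
	}
	while ((n = read(in, buf, sizeof(buf))) > 0) {
		char *p = buf;

		while (n > 0) {		/* handle short writes */
			ssize_t w = write(out, p, n);

			if (w < 0)
				goto done;
			p += w;
			n -= w;
		}
	}
	if (n == 0)			/* clean EOF, not a read error */
		ret = 0;
done:
	close(in);
	if (close(out) < 0)
		ret = -1;
	return ret;
}

/* Attach image to a free loop device and mount it on target. */
int loop_mount(const char *image, const char *target, const char *fstype)
{
	char loopdev[32];
	int ctl, ifd, lfd, nr, ret = -1;

	ctl = open("/dev/loop-control", O_RDWR);
	if (ctl < 0)
		return -1;
	nr = ioctl(ctl, LOOP_CTL_GET_FREE);
	close(ctl);
	if (nr < 0)
		return -1;
	snprintf(loopdev, sizeof(loopdev), "/dev/loop%d", nr);

	ifd = open(image, O_RDWR);
	if (ifd < 0)
		return -1;
	lfd = open(loopdev, O_RDWR);
	if (lfd < 0) {
		close(ifd);
		return -1;
	}
	if (ioctl(lfd, LOOP_SET_FD, ifd) == 0) {
		if (mount(loopdev, target, fstype, 0, NULL) == 0)
			ret = 0;
		else
			ioctl(lfd, LOOP_CLR_FD, 0);	/* detach on failure */
	}
	close(ifd);
	close(lfd);
	return ret;
}

/* Placeholder paths and fs type; adjust to the actual image. */
int emulate_classic_initrd(void)
{
	if (copy_file("/sys/firmware/initrd", "/initrd.img") < 0)
		return -1;
	return loop_mount("/initrd.img", "/mnt", "ext2");
}
```

The single copy mentioned above is the `copy_file` call; after `loop_mount` succeeds, /init can switch root into the mounted image in the usual way.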
On QEMU's -initrd and GRUB's initrd
====
Don't panic: this patchset doesn't remove initramfs
(which is used by nearly all Linux distros), and I have
no plans to remove it.
QEMU's -initrd option and GRUB's initrd command refer
to the initrd bootloader mechanism, which is used to
load both initrd and (external) initramfs.
So, if you use QEMU's -initrd or GRUB's initrd,
then you likely use them to pass an initramfs, and thus
you are safe.
v1: https://lore.kernel.org/lkml/20250913003842.41944-1-safinaskar@gmail.com/
v1 -> v2 changes:
- A lot. I removed most patches, see cover letter for details
v2: https://lore.kernel.org/lkml/20251010094047.3111495-1-safinaskar@gmail.com/
v2 -> v3 changes:
- Commit messages
- Expanded docs for "noinitrd"
- Added link to /sys/firmware/initrd workaround to pr_warn
[1] https://github.com/landley/toybox/tree/master/mkroot
[2] https://landley.net/toybox/downloads/binaries/toolchains/latest
[3] https://lore.kernel.org/all/20231207235654.16622-1-graf@amazon.com/
[4] https://lore.kernel.org/lkml/20250918152830.438554-1-nschichan@freebox.fr/
Askar Safin (3):
init: remove deprecated "load_ramdisk" and "prompt_ramdisk" command
line parameters
initrd: remove deprecated code path (linuxrc)
init: remove /proc/sys/kernel/real-root-dev
.../admin-guide/kernel-parameters.txt | 12 +-
Documentation/admin-guide/sysctl/kernel.rst | 6 -
arch/arm/configs/neponset_defconfig | 2 +-
fs/init.c | 14 ---
include/linux/init_syscalls.h | 1 -
include/linux/initrd.h | 2 -
include/uapi/linux/sysctl.h | 1 -
init/do_mounts.c | 11 +-
init/do_mounts.h | 18 +--
init/do_mounts_initrd.c | 107 ++----------------
init/do_mounts_rd.c | 24 +---
11 files changed, 23 insertions(+), 175 deletions(-)
base-commit: 3a8660878839faadb4f1a6dd72c3179c1df56787
--
2.47.3
* [PATCH v3 1/3] init: remove deprecated "load_ramdisk" and "prompt_ramdisk" command line parameters
From: Askar Safin @ 2025-10-17 6:09 UTC
To: linux-fsdevel, linux-kernel
Cc: Linus Torvalds, Greg Kroah-Hartman, Christian Brauner, Al Viro,
Jan Kara, Christoph Hellwig, Jens Axboe, Andy Shevchenko,
Aleksa Sarai, Thomas Weißschuh, Julian Stecklina, Gao Xiang,
Art Nikpal, Andrew Morton, Alexander Graf, Rob Landley,
Lennart Poettering, linux-arch, linux-block, initramfs, linux-api,
linux-doc, Michal Simek, Luis Chamberlain, Kees Cook,
Thorsten Blum, Heiko Carstens, Arnd Bergmann, Dave Young,
Christophe Leroy, Krzysztof Kozlowski, Borislav Petkov,
Jessica Clarke, Nicolas Schichan, David Disseldorp, patches
In-Reply-To: <20251017060956.1151347-1-safinaskar@gmail.com>
Both parameters do nothing. They were deprecated in the documentation by
6b99e6e6aa62 ("Documentation/admin-guide: blockdev/ramdisk: remove use of
"rdev"") in 2020 and in kernel messages by c8376994c86c ("initrd: remove
support for multiple floppies") in 2020.
Signed-off-by: Askar Safin <safinaskar@gmail.com>
---
Documentation/admin-guide/kernel-parameters.txt | 4 ----
arch/arm/configs/neponset_defconfig | 2 +-
init/do_mounts.c | 7 -------
init/do_mounts_rd.c | 7 -------
4 files changed, 1 insertion(+), 19 deletions(-)
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 6c42061ca20e..15af6933eab4 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -3319,8 +3319,6 @@
If there are multiple matching configurations changing
the same attribute, the last one is used.
- load_ramdisk= [RAM] [Deprecated]
-
lockd.nlm_grace_period=P [NFS] Assign grace period.
Format: <integer>
@@ -5284,8 +5282,6 @@
Param: <number> - step/bucket size as a power of 2 for
statistical time based profiling.
- prompt_ramdisk= [RAM] [Deprecated]
-
prot_virt= [S390] enable hosting protected virtual machines
isolated from the hypervisor (if hardware supports
that). If enabled, the default kernel base address
diff --git a/arch/arm/configs/neponset_defconfig b/arch/arm/configs/neponset_defconfig
index 2227f86100ad..4d720001c12e 100644
--- a/arch/arm/configs/neponset_defconfig
+++ b/arch/arm/configs/neponset_defconfig
@@ -9,7 +9,7 @@ CONFIG_ASSABET_NEPONSET=y
CONFIG_ZBOOT_ROM_TEXT=0x80000
CONFIG_ZBOOT_ROM_BSS=0xc1000000
CONFIG_ZBOOT_ROM=y
-CONFIG_CMDLINE="console=ttySA0,38400n8 cpufreq=221200 rw root=/dev/mtdblock2 mtdparts=sa1100:512K(boot),1M(kernel),2560K(initrd),4M(root) load_ramdisk=1 prompt_ramdisk=0 mem=32M noinitrd initrd=0xc0800000,3M"
+CONFIG_CMDLINE="console=ttySA0,38400n8 cpufreq=221200 rw root=/dev/mtdblock2 mtdparts=sa1100:512K(boot),1M(kernel),2560K(initrd),4M(root) mem=32M noinitrd initrd=0xc0800000,3M"
CONFIG_FPE_NWFPE=y
CONFIG_PM=y
CONFIG_MODULES=y
diff --git a/init/do_mounts.c b/init/do_mounts.c
index 6af29da8889e..0f2f44e6250c 100644
--- a/init/do_mounts.c
+++ b/init/do_mounts.c
@@ -34,13 +34,6 @@ static int root_wait;
dev_t ROOT_DEV;
-static int __init load_ramdisk(char *str)
-{
- pr_warn("ignoring the deprecated load_ramdisk= option\n");
- return 1;
-}
-__setup("load_ramdisk=", load_ramdisk);
-
static int __init readonly(char *str)
{
if (*str)
diff --git a/init/do_mounts_rd.c b/init/do_mounts_rd.c
index 19d9f33dcacf..5311f2d7edc8 100644
--- a/init/do_mounts_rd.c
+++ b/init/do_mounts_rd.c
@@ -18,13 +18,6 @@
static struct file *in_file, *out_file;
static loff_t in_pos, out_pos;
-static int __init prompt_ramdisk(char *str)
-{
- pr_warn("ignoring the deprecated prompt_ramdisk= option\n");
- return 1;
-}
-__setup("prompt_ramdisk=", prompt_ramdisk);
-
int __initdata rd_image_start; /* starting block # of image */
static int __init ramdisk_start_setup(char *str)
--
2.47.3
* [PATCH v3 2/3] initrd: remove deprecated code path (linuxrc)
From: Askar Safin @ 2025-10-17 6:09 UTC
To: linux-fsdevel, linux-kernel
Cc: Linus Torvalds, Greg Kroah-Hartman, Christian Brauner, Al Viro,
Jan Kara, Christoph Hellwig, Jens Axboe, Andy Shevchenko,
Aleksa Sarai, Thomas Weißschuh, Julian Stecklina, Gao Xiang,
Art Nikpal, Andrew Morton, Alexander Graf, Rob Landley,
Lennart Poettering, linux-arch, linux-block, initramfs, linux-api,
linux-doc, Michal Simek, Luis Chamberlain, Kees Cook,
Thorsten Blum, Heiko Carstens, Arnd Bergmann, Dave Young,
Christophe Leroy, Krzysztof Kozlowski, Borislav Petkov,
Jessica Clarke, Nicolas Schichan, David Disseldorp, patches
In-Reply-To: <20251017060956.1151347-1-safinaskar@gmail.com>
Remove the linuxrc initrd code path, which was deprecated in 2020.
Initramfs and (non-initial) RAM disks (i.e. brd) still work.
Both built-in and bootloader-supplied initramfs still work.
The non-linuxrc initrd code path (i.e. using /dev/ram as the final root
filesystem) still works, but now prints a deprecation message.
Also deprecate the command line parameters "noinitrd" and "ramdisk_start=".
Signed-off-by: Askar Safin <safinaskar@gmail.com>
---
.../admin-guide/kernel-parameters.txt | 8 +-
fs/init.c | 14 ---
include/linux/init_syscalls.h | 1 -
include/linux/initrd.h | 2 -
init/do_mounts.c | 4 +-
init/do_mounts.h | 18 +---
init/do_mounts_initrd.c | 87 ++-----------------
init/do_mounts_rd.c | 17 +---
8 files changed, 22 insertions(+), 129 deletions(-)
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 15af6933eab4..df441d1a9555 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -4324,8 +4324,10 @@
Note that this argument takes precedence over
the CONFIG_RCU_NOCB_CPU_DEFAULT_ALL option.
- noinitrd [RAM] Tells the kernel not to load any configured
- initial RAM disk.
+ noinitrd [Deprecated,RAM] Tells the kernel not to load any configured
+ initial RAM disk. Currently this parameter applies to
+ initrd only, not to initramfs. But it applies to both
+ in EFI mode.
nointremap [X86-64,Intel-IOMMU,EARLY] Do not enable interrupt
remapping.
@@ -5338,7 +5340,7 @@
ramdisk_size= [RAM] Sizes of RAM disks in kilobytes
See Documentation/admin-guide/blockdev/ramdisk.rst.
- ramdisk_start= [RAM] RAM disk image start address
+ ramdisk_start= [Deprecated,RAM] RAM disk image start address
random.trust_cpu=off
[KNL,EARLY] Disable trusting the use of the CPU's
diff --git a/fs/init.c b/fs/init.c
index 07f592ccdba8..60719494d9a0 100644
--- a/fs/init.c
+++ b/fs/init.c
@@ -27,20 +27,6 @@ int __init init_mount(const char *dev_name, const char *dir_name,
return ret;
}
-int __init init_umount(const char *name, int flags)
-{
- int lookup_flags = LOOKUP_MOUNTPOINT;
- struct path path;
- int ret;
-
- if (!(flags & UMOUNT_NOFOLLOW))
- lookup_flags |= LOOKUP_FOLLOW;
- ret = kern_path(name, lookup_flags, &path);
- if (ret)
- return ret;
- return path_umount(&path, flags);
-}
-
int __init init_chdir(const char *filename)
{
struct path path;
diff --git a/include/linux/init_syscalls.h b/include/linux/init_syscalls.h
index 92045d18cbfc..0bdbc458a881 100644
--- a/include/linux/init_syscalls.h
+++ b/include/linux/init_syscalls.h
@@ -2,7 +2,6 @@
int __init init_mount(const char *dev_name, const char *dir_name,
const char *type_page, unsigned long flags, void *data_page);
-int __init init_umount(const char *name, int flags);
int __init init_chdir(const char *filename);
int __init init_chroot(const char *filename);
int __init init_chown(const char *filename, uid_t user, gid_t group, int flags);
diff --git a/include/linux/initrd.h b/include/linux/initrd.h
index f1a1f4c92ded..7e5d26c8136f 100644
--- a/include/linux/initrd.h
+++ b/include/linux/initrd.h
@@ -3,8 +3,6 @@
#ifndef __LINUX_INITRD_H
#define __LINUX_INITRD_H
-#define INITRD_MINOR 250 /* shouldn't collide with /dev/ram* too soon ... */
-
/* starting block # of image */
extern int rd_image_start;
diff --git a/init/do_mounts.c b/init/do_mounts.c
index 0f2f44e6250c..1054ad3c905a 100644
--- a/init/do_mounts.c
+++ b/init/do_mounts.c
@@ -476,13 +476,11 @@ void __init prepare_namespace(void)
if (saved_root_name[0])
ROOT_DEV = parse_root_device(saved_root_name);
- if (initrd_load(saved_root_name))
- goto out;
+ initrd_load();
if (root_wait)
wait_for_root(saved_root_name);
mount_root(saved_root_name);
-out:
devtmpfs_mount();
init_mount(".", "/", NULL, MS_MOVE, NULL);
init_chroot(".");
diff --git a/init/do_mounts.h b/init/do_mounts.h
index 6069ea3eb80d..a386ee5314c9 100644
--- a/init/do_mounts.h
+++ b/init/do_mounts.h
@@ -23,25 +23,15 @@ static inline __init int create_dev(char *name, dev_t dev)
}
#ifdef CONFIG_BLK_DEV_RAM
-
-int __init rd_load_disk(int n);
-int __init rd_load_image(char *from);
-
+int __init rd_load_image(void);
#else
-
-static inline int rd_load_disk(int n) { return 0; }
-static inline int rd_load_image(char *from) { return 0; }
-
+static inline int rd_load_image(void) { return 0; }
#endif
#ifdef CONFIG_BLK_DEV_INITRD
-bool __init initrd_load(char *root_device_name);
+void __init initrd_load(void);
#else
-static inline bool initrd_load(char *root_device_name)
-{
- return false;
- }
-
+static inline void initrd_load(void) { }
#endif
/* Ensure that async file closing finished to prevent spurious errors. */
diff --git a/init/do_mounts_initrd.c b/init/do_mounts_initrd.c
index f6867bad0d78..bf381aa0400f 100644
--- a/init/do_mounts_initrd.c
+++ b/init/do_mounts_initrd.c
@@ -2,13 +2,7 @@
#include <linux/unistd.h>
#include <linux/kernel.h>
#include <linux/fs.h>
-#include <linux/minix_fs.h>
-#include <linux/romfs_fs.h>
#include <linux/initrd.h>
-#include <linux/sched.h>
-#include <linux/freezer.h>
-#include <linux/kmod.h>
-#include <uapi/linux/mount.h>
#include "do_mounts.h"
@@ -41,6 +35,7 @@ late_initcall(kernel_do_mounts_initrd_sysctls_init);
static int __init no_initrd(char *str)
{
+ pr_warn("noinitrd option is deprecated and will be removed soon\n");
mount_initrd = 0;
return 1;
}
@@ -70,85 +65,19 @@ static int __init early_initrd(char *p)
}
early_param("initrd", early_initrd);
-static int __init init_linuxrc(struct subprocess_info *info, struct cred *new)
-{
- ksys_unshare(CLONE_FS | CLONE_FILES);
- console_on_rootfs();
- /* move initrd over / and chdir/chroot in initrd root */
- init_chdir("/root");
- init_mount(".", "/", NULL, MS_MOVE, NULL);
- init_chroot(".");
- ksys_setsid();
- return 0;
-}
-
-static void __init handle_initrd(char *root_device_name)
-{
- struct subprocess_info *info;
- static char *argv[] = { "linuxrc", NULL, };
- extern char *envp_init[];
- int error;
-
- pr_warn("using deprecated initrd support, will be removed soon.\n");
-
- real_root_dev = new_encode_dev(ROOT_DEV);
- create_dev("/dev/root.old", Root_RAM0);
- /* mount initrd on rootfs' /root */
- mount_root_generic("/dev/root.old", root_device_name,
- root_mountflags & ~MS_RDONLY);
- init_mkdir("/old", 0700);
- init_chdir("/old");
-
- info = call_usermodehelper_setup("/linuxrc", argv, envp_init,
- GFP_KERNEL, init_linuxrc, NULL, NULL);
- if (!info)
- return;
- call_usermodehelper_exec(info, UMH_WAIT_PROC|UMH_FREEZABLE);
-
- /* move initrd to rootfs' /old */
- init_mount("..", ".", NULL, MS_MOVE, NULL);
- /* switch root and cwd back to / of rootfs */
- init_chroot("..");
-
- if (new_decode_dev(real_root_dev) == Root_RAM0) {
- init_chdir("/old");
- return;
- }
-
- init_chdir("/");
- ROOT_DEV = new_decode_dev(real_root_dev);
- mount_root(root_device_name);
-
- printk(KERN_NOTICE "Trying to move old root to /initrd ... ");
- error = init_mount("/old", "/root/initrd", NULL, MS_MOVE, NULL);
- if (!error)
- printk("okay\n");
- else {
- if (error == -ENOENT)
- printk("/initrd does not exist. Ignored.\n");
- else
- printk("failed\n");
- printk(KERN_NOTICE "Unmounting old root\n");
- init_umount("/old", MNT_DETACH);
- }
-}
-
-bool __init initrd_load(char *root_device_name)
+void __init initrd_load(void)
{
if (mount_initrd) {
create_dev("/dev/ram", Root_RAM0);
/*
- * Load the initrd data into /dev/ram0. Execute it as initrd
- * unless /dev/ram0 is supposed to be our actual root device,
- * in that case the ram disk is just set up here, and gets
- * mounted in the normal path.
+ * Load the initrd data into /dev/ram0.
*/
- if (rd_load_image("/initrd.image") && ROOT_DEV != Root_RAM0) {
- init_unlink("/initrd.image");
- handle_initrd(root_device_name);
- return true;
+ if (rd_load_image()) {
+ pr_warn("using deprecated initrd support, will be removed in September 2026; "
+ "use initramfs instead or (as a last resort) /sys/firmware/initrd; "
+ "see section \"Workaround\" in "
+ "https://lore.kernel.org/lkml/20251010094047.3111495-1-safinaskar@gmail.com\n");
}
}
init_unlink("/initrd.image");
- return false;
}
diff --git a/init/do_mounts_rd.c b/init/do_mounts_rd.c
index 5311f2d7edc8..0a021bbcd501 100644
--- a/init/do_mounts_rd.c
+++ b/init/do_mounts_rd.c
@@ -22,6 +22,7 @@ int __initdata rd_image_start; /* starting block # of image */
static int __init ramdisk_start_setup(char *str)
{
+ pr_warn("ramdisk_start= option is deprecated and will be removed soon\n");
rd_image_start = simple_strtol(str,NULL,0);
return 1;
}
@@ -177,7 +178,7 @@ static unsigned long nr_blocks(struct file *file)
return i_size_read(inode) >> 10;
}
-int __init rd_load_image(char *from)
+int __init rd_load_image(void)
{
int res = 0;
unsigned long rd_blocks, devblocks, nr_disks;
@@ -191,7 +192,7 @@ int __init rd_load_image(char *from)
if (IS_ERR(out_file))
goto out;
- in_file = filp_open(from, O_RDONLY, 0);
+ in_file = filp_open("/initrd.image", O_RDONLY, 0);
if (IS_ERR(in_file))
goto noclose_input;
@@ -220,10 +221,7 @@ int __init rd_load_image(char *from)
/*
* OK, time to copy in the data
*/
- if (strcmp(from, "/initrd.image") == 0)
- devblocks = nblocks;
- else
- devblocks = nr_blocks(in_file);
+ devblocks = nblocks;
if (devblocks == 0) {
printk(KERN_ERR "RAMDISK: could not determine device size\n");
@@ -267,13 +265,6 @@ int __init rd_load_image(char *from)
return res;
}
-int __init rd_load_disk(int n)
-{
- create_dev("/dev/root", ROOT_DEV);
- create_dev("/dev/ram", MKDEV(RAMDISK_MAJOR, n));
- return rd_load_image("/dev/root");
-}
-
static int exit_code;
static int decompress_error;
--
2.47.3
* [PATCH v3 3/3] init: remove /proc/sys/kernel/real-root-dev
From: Askar Safin @ 2025-10-17 6:09 UTC
To: linux-fsdevel, linux-kernel
Cc: Linus Torvalds, Greg Kroah-Hartman, Christian Brauner, Al Viro,
Jan Kara, Christoph Hellwig, Jens Axboe, Andy Shevchenko,
Aleksa Sarai, Thomas Weißschuh, Julian Stecklina, Gao Xiang,
Art Nikpal, Andrew Morton, Alexander Graf, Rob Landley,
Lennart Poettering, linux-arch, linux-block, initramfs, linux-api,
linux-doc, Michal Simek, Luis Chamberlain, Kees Cook,
Thorsten Blum, Heiko Carstens, Arnd Bergmann, Dave Young,
Christophe Leroy, Krzysztof Kozlowski, Borislav Petkov,
Jessica Clarke, Nicolas Schichan, David Disseldorp, patches
In-Reply-To: <20251017060956.1151347-1-safinaskar@gmail.com>
It is not used anymore.
Signed-off-by: Askar Safin <safinaskar@gmail.com>
---
Documentation/admin-guide/sysctl/kernel.rst | 6 ------
include/uapi/linux/sysctl.h | 1 -
init/do_mounts_initrd.c | 20 --------------------
3 files changed, 27 deletions(-)
diff --git a/Documentation/admin-guide/sysctl/kernel.rst b/Documentation/admin-guide/sysctl/kernel.rst
index f3ee807b5d8b..218265babaf9 100644
--- a/Documentation/admin-guide/sysctl/kernel.rst
+++ b/Documentation/admin-guide/sysctl/kernel.rst
@@ -1215,12 +1215,6 @@ that support this feature.
== ===========================================================================
-real-root-dev
-=============
-
-See Documentation/admin-guide/initrd.rst.
-
-
reboot-cmd (SPARC only)
=======================
diff --git a/include/uapi/linux/sysctl.h b/include/uapi/linux/sysctl.h
index 63d1464cb71c..1c7fe0f4dca4 100644
--- a/include/uapi/linux/sysctl.h
+++ b/include/uapi/linux/sysctl.h
@@ -92,7 +92,6 @@ enum
KERN_DOMAINNAME=8, /* string: domainname */
KERN_PANIC=15, /* int: panic timeout */
- KERN_REALROOTDEV=16, /* real root device to mount after initrd */
KERN_SPARC_REBOOT=21, /* reboot command on Sparc */
KERN_CTLALTDEL=22, /* int: allow ctl-alt-del to reboot */
diff --git a/init/do_mounts_initrd.c b/init/do_mounts_initrd.c
index bf381aa0400f..82613a3be756 100644
--- a/init/do_mounts_initrd.c
+++ b/init/do_mounts_initrd.c
@@ -8,31 +8,11 @@
unsigned long initrd_start, initrd_end;
int initrd_below_start_ok;
-static unsigned int real_root_dev; /* do_proc_dointvec cannot handle kdev_t */
static int __initdata mount_initrd = 1;
phys_addr_t phys_initrd_start __initdata;
unsigned long phys_initrd_size __initdata;
-#ifdef CONFIG_SYSCTL
-static const struct ctl_table kern_do_mounts_initrd_table[] = {
- {
- .procname = "real-root-dev",
- .data = &real_root_dev,
- .maxlen = sizeof(int),
- .mode = 0644,
- .proc_handler = proc_dointvec,
- },
-};
-
-static __init int kernel_do_mounts_initrd_sysctls_init(void)
-{
- register_sysctl_init("kernel", kern_do_mounts_initrd_table);
- return 0;
-}
-late_initcall(kernel_do_mounts_initrd_sysctls_init);
-#endif /* CONFIG_SYSCTL */
-
static int __init no_initrd(char *str)
{
pr_warn("noinitrd option is deprecated and will be removed soon\n");
--
2.47.3
* Re: [PATCH v3 20/20] mm: stop maintaining the per-page mapcount of large folios (CONFIG_NO_PAGE_MAPCOUNT)
From: David Hildenbrand @ 2025-10-20 13:53 UTC
To: Wei Yang
Cc: Matthew Wilcox, linux-kernel, linux-doc, cgroups, linux-mm,
linux-fsdevel, linux-api, Andrew Morton, Tejun Heo, Zefan Li,
Johannes Weiner, Michal Koutný, Jonathan Corbet,
Andy Lutomirski, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
Dave Hansen, Muchun Song, Liam R. Howlett, Lorenzo Stoakes,
Vlastimil Babka, Jann Horn
In-Reply-To: <20251015004543.md5x4cjtkyjzpf4b@master>
On 15.10.25 02:45, Wei Yang wrote:
> On Tue, Oct 14, 2025 at 04:38:38PM +0200, David Hildenbrand wrote:
>> On 14.10.25 16:32, Matthew Wilcox wrote:
>>> On Tue, Oct 14, 2025 at 02:59:30PM +0200, David Hildenbrand wrote:
>>>>> As commit 349994cf61e6 mentioned, we don't support partially mapped PUD-sized
>>>>> folio yet.
>>>>
>>>> We do support partially mapped PUD-sized folios I think, but not anonymous
>>>> PUD-sized folios.
>>>
>>> I don't think so? The only mechanism I know of to allocate PUD-sized
>>> chunks of memory is hugetlb, and that doesn't permit partial mappings.
>>
>> Greetings from the latest DAX rework :)
>
> After a re-think, do you think it's better to align the behavior between
> CONFIG_NO_PAGE_MAPCOUNT and CONFIG_PAGE_MAPCOUNT?
>
> It looks we treat a PUD-sized folio partially_mapped if CONFIG_NO_PAGE_MAPCOUNT,
> but !partially_mapped if CONFIG_PAGE_MAPCOUNT, if my understanding is correct.
I'd just leave it alone unless there is a problem right now.
--
Cheers
David / dhildenb
* Re: [PATCH v4 00/30] Live Update Orchestrator
From: Jason Gunthorpe @ 2025-10-20 14:29 UTC
To: Pratyush Yadav
Cc: Pasha Tatashin, jasonmiu, graf, changyuanl, rppt, dmatlack,
rientjes, corbet, rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl,
masahiroy, akpm, tj, yoann.congal, mmaurer, roman.gushchin,
chenridong, axboe, mark.rutland, jannh, vincent.guittot, hannes,
dan.j.williams, david, joel.granados, rostedt, anna.schumaker,
song, zhangguopeng, linux, linux-kernel, linux-doc, linux-mm,
gregkh, tglx, mingo, bp, dave.hansen, x86, hpa, rafael, dakr,
bartosz.golaszewski, cw00.choi, myungjoo.ham, yesanishhere,
Jonathan.Cameron, quic_zijuhu, aleksander.lobakin, ira.weiny,
andriy.shevchenko, leon, lukas, bhelgaas, wagi, djeffery,
stuart.w.hayes, lennart, brauner, linux-api, linux-fsdevel,
saeedm, ajayachandra, parav, leonro, witu, hughd, skhawaja,
chrisl, steven.sistare
In-Reply-To: <mafs0bjm9lig8.fsf@kernel.org>
On Tue, Oct 14, 2025 at 03:29:59PM +0200, Pratyush Yadav wrote:
> > 1) Use a vmalloc and store a list of the PFNs in the pool. Pool becomes
> > frozen, can't add/remove PFNs.
>
> Doesn't that circumvent LUO's state machine? The idea with the state
> machine was to have clear points in time when the system goes into the
> "limited capacity"/"frozen" state, which is the LIVEUPDATE_PREPARE
> event.
I wouldn't get too invested in the FSM, it is there but it doesn't
mean every luo client has to be focused on it.
> With what you propose, the first FD being preserved implicitly
> triggers the prepare event. Same thing for unprepare/cancel operations.
Yes, this is easy to write and simple to manage.
> I am wondering if it is better to do it the other way round: prepare all
> files first, and then prepare the hugetlb subsystem at
> LIVEUPDATE_PREPARE event. At that point it already knows which pages to
> mark preserved so the serialization can be done in one go.
I think this would be slower and more complex?
> > 2) Require the users of hugetlb memory, like memfd, to
> > preserve/restore the folios they are using (using their hugetlb order)
> > 3) Just before kexec run over the PFN list and mark a bit if the folio
> > was preserved by KHO or not. Make sure everything gets KHO
> > preserved.
>
> "just before kexec" would need a callback from LUO. I suppose a
> subsystem is the place for that callback. I wrote my email under the
> (wrong) impression that we were replacing subsystems.
The file descriptors path should have luo client ops that have all
the required callbacks. This is probably an existing op.
> That makes me wonder: how is the subsystem-level callback supposed to
> access the global data? I suppose it can use the liveupdate_file_handler
> directly, but it is kind of strange since technically the subsystem and
> file handler are two different entities.
If we need such things we would need a way to link these together, but
I'm wondering if we really don't..
> Also as Pasha mentioned, 1G pages for guest_memfd will use hugetlb, and
> I'm not sure how that would map with this shared global data. memfd and
> guest_memfd will likely have different liveupdate_file_handler but would
> share data from the same subsystem. Maybe that's a problem to solve for
> later...
On preserve memfd should call into hugetlb to activate it as a hugetlb
page provider and preserve it too.
Jason
* Re: [PATCH v3 0/3] initrd: remove half of classic initrd support
From: Christian Brauner @ 2025-10-21 13:05 UTC
To: Askar Safin, Christoph Hellwig
Cc: linux-fsdevel, linux-kernel, Linus Torvalds, Greg Kroah-Hartman,
Al Viro, Jan Kara, Christoph Hellwig, Jens Axboe, Andy Shevchenko,
Aleksa Sarai, Thomas Weißschuh, Julian Stecklina, Gao Xiang,
Art Nikpal, Andrew Morton, Alexander Graf, Rob Landley,
Lennart Poettering, linux-arch, linux-block, initramfs, linux-api,
linux-doc, Michal Simek, Luis Chamberlain, Kees Cook,
Thorsten Blum, Heiko Carstens, Arnd Bergmann, Dave Young,
Christophe Leroy, Krzysztof Kozlowski, Borislav Petkov,
Jessica Clarke, Nicolas Schichan, David Disseldorp, patches
In-Reply-To: <20251017060956.1151347-1-safinaskar@gmail.com>
On Fri, Oct 17, 2025 at 06:09:53AM +0000, Askar Safin wrote:
> Intro
> ====
> This patchset removes half of classic initrd (initial RAM disk) support,
> i. e. linuxrc code path, which was deprecated in 2020.
> Initramfs still stays, RAM disk itself (brd) still stays.
> And other half of initrd stays, too.
> init/do_mounts* are listed in VFS entry in
> MAINTAINERS, so I think this patchset should go through VFS tree.
> I tested the patchset on 8 (!!!) archs in Qemu (see details below).
> If you still use initrd, see below for workaround.
>
> In 2020 deprecation notice was put to linuxrc initrd code path.
> In v1 I tried to remove initrd
> fully, but Nicolas Schichan reported that he still uses
> other code path (root=/dev/ram0 one) on million devices [4].
> root=/dev/ram0 code path did not contain deprecation notice.
Without Acks or buy-in from other maintainers this is not a change we
can just do given that a few people already piped up and expressed
reservations that this would be doable for them.
@Christoph, you marked this as deprecated years ago.
What's your take on this?
>
> So, in this version of patchset I remove deprecated code path,
> i. e. linuxrc one, while keeping other, i. e. root=/dev/ram0 one.
>
> Also I put deprecation notice to remaining code path, i. e. to
> root=/dev/ram0 one. I plan to send patches for full removal
> of initrd after one year, i. e. in September 2026 (of course,
> initramfs will still work).
>
> Also, I tried to make this patchset small to make sure it
> can be reverted easily. I plan to send cleanups later.
>
> Details
> ====
> Other user-visible changes:
>
> - Removed kernel command line parameters "load_ramdisk" and
> "prompt_ramdisk", which did nothing and were deprecated
> - Removed /proc/sys/kernel/real-root-dev . It was used
> for initrd only
> - Command line parameters "noinitrd" and "ramdisk_start=" are deprecated
>
> This patchset is based on v6.18-rc1.
>
> Testing
> ====
> I tested my patchset on many architectures in Qemu using my Rust
> program, heavily based on mkroot [1].
>
> I used the following cross-compilers:
>
> aarch64-linux-musleabi
> armv4l-linux-musleabihf
> armv5l-linux-musleabihf
> armv7l-linux-musleabihf
> i486-linux-musl
> i686-linux-musl
> mips-linux-musl
> mips64-linux-musl
> mipsel-linux-musl
> powerpc-linux-musl
> powerpc64-linux-musl
> powerpc64le-linux-musl
> riscv32-linux-musl
> riscv64-linux-musl
> s390x-linux-musl
> sh4-linux-musl
> sh4eb-linux-musl
> x86_64-linux-musl
>
> taken from this directory [2].
>
> So, as you can see, there are 18 triplets, which correspond to 8 subdirs in arch/.
>
> For every triplet I tested that:
> - Initramfs still works (both builtin and external)
> - Direct boot from disk still works
> - Remaining initrd code path (root=/dev/ram0) still works
>
> Workaround
> ====
> If "retain_initrd" is passed to kernel, then initramfs/initrd,
> passed by bootloader, is retained and becomes available after boot
> as read-only magic file /sys/firmware/initrd [3].
>
> No copies are involved. I. e. /sys/firmware/initrd is simply
> a reference to original blob passed by bootloader.
>
> This works even if initrd/initramfs is not recognized by kernel
> in any way, i. e. even if it is not valid cpio archive, nor
> a fs image supported by classic initrd.
>
> This works both with my patchset and without it.
>
> This means that you can emulate classic initrd so:
> link builtin initramfs to kernel; in /init in this initramfs
> copy /sys/firmware/initrd to some file in / and loop-mount it.
>
> This is even better than classic initrd, because:
> - You can use fs not supported by classic initrd, for example erofs
> - One copy is involved (from /sys/firmware/initrd to some file in /)
> as opposed to two when using classic initrd
>
> Still, I don't recommend using this workaround, because
> I want everyone to migrate to proper modern initramfs.
> But still you can use this workaround if you want.
>
> Also: it is not possible to directly loop-mount
> /sys/firmware/initrd . Theoretically kernel can be changed
> to allow this (and/or to make it writable), but I think nobody needs this.
> And I don't want to implement this.
>
> On Qemu's -initrd and GRUB's initrd
> ====
> Don't panic, this patchset doesn't remove initramfs
> (which is used by nearly all Linux distros). And I don't
> have plans to remove it.
>
> Qemu's -initrd option and GRUB's initrd command refer
> to initrd bootloader mechanism, which is used to
> load both initrd and (external) initramfs.
>
> So, if you use Qemu's -initrd or GRUB's initrd,
> then you likely use them to pass initramfs, and thus
> you are safe.
>
> v1: https://lore.kernel.org/lkml/20250913003842.41944-1-safinaskar@gmail.com/
>
> v1 -> v2 changes:
> - A lot. I removed most patches, see cover letter for details
>
> v2: https://lore.kernel.org/lkml/20251010094047.3111495-1-safinaskar@gmail.com/
>
> v2 -> v3 changes:
> - Commit messages
> - Expanded docs for "noinitrd"
> - Added link to /sys/firmware/initrd workaround to pr_warn
>
> [1] https://github.com/landley/toybox/tree/master/mkroot
> [2] https://landley.net/toybox/downloads/binaries/toolchains/latest
> [3] https://lore.kernel.org/all/20231207235654.16622-1-graf@amazon.com/
> [4] https://lore.kernel.org/lkml/20250918152830.438554-1-nschichan@freebox.fr/
>
> Askar Safin (3):
> init: remove deprecated "load_ramdisk" and "prompt_ramdisk" command
> line parameters
> initrd: remove deprecated code path (linuxrc)
> init: remove /proc/sys/kernel/real-root-dev
>
> .../admin-guide/kernel-parameters.txt | 12 +-
> Documentation/admin-guide/sysctl/kernel.rst | 6 -
> arch/arm/configs/neponset_defconfig | 2 +-
> fs/init.c | 14 ---
> include/linux/init_syscalls.h | 1 -
> include/linux/initrd.h | 2 -
> include/uapi/linux/sysctl.h | 1 -
> init/do_mounts.c | 11 +-
> init/do_mounts.h | 18 +--
> init/do_mounts_initrd.c | 107 ++----------------
> init/do_mounts_rd.c | 24 +---
> 11 files changed, 23 insertions(+), 175 deletions(-)
>
>
> base-commit: 3a8660878839faadb4f1a6dd72c3179c1df56787
> --
> 2.47.3
>
* Re: [PATCH v3 2/3] initrd: remove deprecated code path (linuxrc)
From: Bagas Sanjaya @ 2025-10-22 2:16 UTC (permalink / raw)
To: Askar Safin, linux-fsdevel, linux-kernel
Cc: Linus Torvalds, Greg Kroah-Hartman, Christian Brauner, Al Viro,
Jan Kara, Christoph Hellwig, Jens Axboe, Andy Shevchenko,
Aleksa Sarai, Thomas Weißschuh, Julian Stecklina, Gao Xiang,
Art Nikpal, Andrew Morton, Alexander Graf, Rob Landley,
Lennart Poettering, linux-arch, linux-block, initramfs, linux-api,
linux-doc, Michal Simek, Luis Chamberlain, Kees Cook,
Thorsten Blum, Heiko Carstens, Arnd Bergmann, Dave Young,
Christophe Leroy, Krzysztof Kozlowski, Borislav Petkov,
Jessica Clarke, Nicolas Schichan, David Disseldorp, patches
In-Reply-To: <20251017060956.1151347-3-safinaskar@gmail.com>
On Fri, Oct 17, 2025 at 06:09:55AM +0000, Askar Safin wrote:
> + if (rd_load_image()) {
> + pr_warn("using deprecated initrd support, will be removed in September 2026; "
> + "use initramfs instead or (as a last resort) /sys/firmware/initrd; "
> + "see section \"Workaround\" in "
> + "https://lore.kernel.org/lkml/20251010094047.3111495-1-safinaskar@gmail.com\n");
> }
Do you mean that initrd support will be removed in LTS kernel release of 2026?
Thanks.
--
An old man doll... just what I always wanted! - Clara
* Re: [PATCH v3 0/3] initrd: remove half of classic initrd support
From: Christoph Hellwig @ 2025-10-22 6:44 UTC (permalink / raw)
To: Christian Brauner
Cc: Askar Safin, linux-fsdevel, linux-kernel, Linus Torvalds,
Greg Kroah-Hartman, Al Viro, Jan Kara, Jens Axboe,
Andy Shevchenko, Aleksa Sarai, Thomas Weißschuh,
Julian Stecklina, Gao Xiang, Art Nikpal, Andrew Morton,
Alexander Graf, Rob Landley, Lennart Poettering, linux-arch,
linux-block, initramfs, linux-api, linux-doc, Michal Simek,
Luis Chamberlain, Kees Cook, Thorsten Blum, Heiko Carstens,
Arnd Bergmann, Dave Young, Christophe Leroy, Krzysztof Kozlowski,
Borislav Petkov, Jessica Clarke, Nicolas Schichan,
David Disseldorp, patches
In-Reply-To: <20251021-bannmeile-arkaden-ae2ea9264b85@brauner>
On Tue, Oct 21, 2025 at 03:05:35PM +0200, Christian Brauner wrote:
> Without Acks or buy-in from other maintainers this is not a change we
> can just do given that a few people already piped up and expressed
> reservations that this would be doable for them.
>
> @Christoph, you marked this as deprecated years ago.
> What's your take on this?
I'd love to see it go obviously. But IIRC we had various users show
up, which speaks against removing it. Maybe the first step would be
a separate config option just for block-based initrd?
* Re: [PATCH v3 2/3] initrd: remove deprecated code path (linuxrc)
From: Askar Safin @ 2025-10-22 8:06 UTC (permalink / raw)
To: bagasdotme
Cc: akpm, andy.shevchenko, arnd, axboe, bp, brauner, christophe.leroy,
cyphar, ddiss, dyoung, email2tema, graf, gregkh, hca, hch,
hsiangkao, initramfs, jack, jrtc27, julian.stecklina, kees, krzk,
linux-api, linux-arch, linux-block, linux-doc, linux-fsdevel,
linux-kernel, mcgrof, monstr, mzxreary, nschichan, patches, rob,
safinaskar, thomas.weissschuh, thorsten.blum, torvalds, viro
In-Reply-To: <aPg-YF2pcyI-HusN@archie.me>
Bagas Sanjaya <bagasdotme@gmail.com>:
> Do you mean that initrd support will be removed in LTS kernel release of 2026?
I meant September 2026. But okay, if there is v4, then I will change this to
"after LTS release in the end of 2026".
--
Askar Safin
* Re: [PATCH v3 0/3] initrd: remove half of classic initrd support
From: Askar Safin @ 2025-10-22 8:26 UTC (permalink / raw)
To: hch
Cc: akpm, andy.shevchenko, arnd, axboe, bp, brauner, christophe.leroy,
cyphar, ddiss, dyoung, email2tema, graf, gregkh, hca, hsiangkao,
initramfs, jack, jrtc27, julian.stecklina, kees, krzk, linux-api,
linux-arch, linux-block, linux-doc, linux-fsdevel, linux-kernel,
mcgrof, monstr, mzxreary, nschichan, patches, rob, safinaskar,
thomas.weissschuh, thorsten.blum, torvalds, viro
In-Reply-To: <aPh9Tx95Yhm_EkLN@infradead.org>
Christoph Hellwig <hch@infradead.org>:
> On Tue, Oct 21, 2025 at 03:05:35PM +0200, Christian Brauner wrote:
> > Without Acks or buy-in from other maintainers this is not a change we
> > can just do given that a few people already piped up and expressed
> > reservations that this would be doable for them.
> >
> > @Christoph, you marked this as deprecated years ago.
> > What's your take on this?
>
> I'd love to see it go obviously. But IIRC we had various users show
> up, which speaks against removing it. Maybe the first step would be
> a separate config option just for block-based initrd?
In recent months, 3 people have spoken against initrd removal. All of them are in Cc. They are:
- Julian Stecklina. He planned to use initrd with erofs, which is currently
not supported anyway. Also, he replied to v1:
"You have all my support for nuking so much legacy code!"
"Acked-by: Julian Stecklina <julian.stecklina@cyberus-technology.de>"
( https://lore.kernel.org/lkml/1f9aee6090716db537e9911685904786b030111f.camel@cyberus-technology.de/ )
- Gao Xiang, maintainer of erofs. He also planned to use initrd with erofs,
which is currently not supported anyway. Also, he said to me:
> Again, I don't have any strong opinion to kill initrd entirely because
> I think initdax may be more efficient and I don't have any time to work
> on this part -- it's unrelated to my job.
( https://lore.kernel.org/all/79315382-5ba8-42c1-ad03-5cb448b23b72@linux.alibaba.com/ )
- Nicolas Schichan. He has a million devices that use initrd. But they use
root=/dev/ram code path, not linuxrc code path, which I'm removing. He
explained this here:
https://lore.kernel.org/lkml/20250918152830.438554-1-nschichan@freebox.fr/
So, this patchset will not impact these people. So, I think it is okay
to remove linuxrc now. We can revert this patchset if needed.
--
Askar Safin
* Re: [PATCH v3 2/3] initrd: remove deprecated code path (linuxrc)
From: Andy Shevchenko @ 2025-10-22 16:41 UTC (permalink / raw)
To: Askar Safin
Cc: bagasdotme, akpm, arnd, axboe, bp, brauner, christophe.leroy,
cyphar, ddiss, dyoung, email2tema, graf, gregkh, hca, hch,
hsiangkao, initramfs, jack, jrtc27, julian.stecklina, kees, krzk,
linux-api, linux-arch, linux-block, linux-doc, linux-fsdevel,
linux-kernel, mcgrof, monstr, mzxreary, nschichan, patches, rob,
thomas.weissschuh, thorsten.blum, torvalds, viro
In-Reply-To: <20251022080626.24446-1-safinaskar@gmail.com>
On Wed, Oct 22, 2025 at 11:06 AM Askar Safin <safinaskar@gmail.com> wrote:
> Bagas Sanjaya <bagasdotme@gmail.com>:
...
> > Do you mean that initrd support will be removed in LTS kernel release of 2026?
>
> I meant September 2026. But okay, if there is v4, then I will change this to
> "after LTS release in the end of 2026".
No need to mention "after LTS release", we all know that this is the
last release that made the year in question.
--
With Best Regards,
Andy Shevchenko
* Re: [PATCH v5 0/8] man2: document "new" mount API
From: Askar Safin @ 2025-10-26 12:27 UTC (permalink / raw)
To: alx
Cc: brauner, cyphar, dhowells, g.branden.robinson, jack, linux-api,
linux-fsdevel, linux-kernel, linux-man, mtk.manpages, safinaskar,
viro
In-Reply-To: <hk5kr2fbrpalyggobuz3zpqeekzqv7qlhfh6sjfifb6p5n5bjs@gjowkgi776ey>
Alejandro Colomar <alx@kernel.org>:
> The full patch set has been merged now. I've done a merge commit where
Alejandro, I still don't see manpages for "new" mount API here:
https://man7.org/linux/man-pages/dir_section_2.html
Please, publish.
--
Askar Safin
* Re: [PATCH v5 0/8] man2: document "new" mount API
From: Alejandro Colomar @ 2025-10-26 17:27 UTC (permalink / raw)
To: Askar Safin
Cc: brauner, cyphar, dhowells, g.branden.robinson, jack, linux-api,
linux-fsdevel, linux-kernel, linux-man, mtk.manpages, safinaskar,
viro
In-Reply-To: <20251026122742.960661-1-safinaskar@gmail.com>
Hi Askar,
On Sun, Oct 26, 2025 at 03:27:42PM +0300, Askar Safin wrote:
> Alejandro Colomar <alx@kernel.org>:
> > The full patch set has been merged now. I've done a merge commit where
>
> Alejandro, I still don't see manpages for "new" mount API here:
> https://man7.org/linux/man-pages/dir_section_2.html
<man7.org> is not official. It's Michael Kerrisk's (previous
maintainer) website. He usually publishes new pages shortly-ish after
each new release, and I haven't issued a new release yet.
I have plans to release soon-ish, but have internet issues at home (the
cable in the street is broken, so I'm connecting on cell internet from
the laptop). Hopefully, I'll be able to release this month.
Have a lovely day!
Alex
>
> Please, publish.
>
> --
> Askar Safin
>
--
<https://www.alejandro-colomar.es>
Use port 80 (that is, <...:80/>).
* Re: [PATCH v4 00/30] Live Update Orchestrator
From: Pratyush Yadav @ 2025-10-27 11:37 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Pratyush Yadav, Pasha Tatashin, jasonmiu, graf, changyuanl, rppt,
dmatlack, rientjes, corbet, rdunlap, ilpo.jarvinen, kanie, ojeda,
aliceryhl, masahiroy, akpm, tj, yoann.congal, mmaurer,
roman.gushchin, chenridong, axboe, mark.rutland, jannh,
vincent.guittot, hannes, dan.j.williams, david, joel.granados,
rostedt, anna.schumaker, song, zhangguopeng, linux, linux-kernel,
linux-doc, linux-mm, gregkh, tglx, mingo, bp, dave.hansen, x86,
hpa, rafael, dakr, bartosz.golaszewski, cw00.choi, myungjoo.ham,
yesanishhere, Jonathan.Cameron, quic_zijuhu, aleksander.lobakin,
ira.weiny, andriy.shevchenko, leon, lukas, bhelgaas, wagi,
djeffery, stuart.w.hayes, lennart, brauner, linux-api,
linux-fsdevel, saeedm, ajayachandra, parav, leonro, witu, hughd,
skhawaja, chrisl, steven.sistare
In-Reply-To: <20251020142924.GS316284@nvidia.com>
On Mon, Oct 20 2025, Jason Gunthorpe wrote:
> On Tue, Oct 14, 2025 at 03:29:59PM +0200, Pratyush Yadav wrote:
>> > 1) Use a vmalloc and store a list of the PFNs in the pool. Pool becomes
>> > frozen, can't add/remove PFNs.
>>
>> Doesn't that circumvent LUO's state machine? The idea with the state
>> machine was to have clear points in time when the system goes into the
>> "limited capacity"/"frozen" state, which is the LIVEUPDATE_PREPARE
>> event.
>
> I wouldn't get too invested in the FSM, it is there but it doesn't
> mean every luo client has to be focused on it.
Having each subsystem have its own state machine sounds like a bad idea
to me. It can get tricky to manage both for us and our users.
>
>> With what you propose, the first FD being preserved implicitly
>> triggers the prepare event. Same thing for unprepare/cancel operations.
>
> Yes, this is easy to write and simple to manage.
>
>> I am wondering if it is better to do it the other way round: prepare all
>> files first, and then prepare the hugetlb subsystem at
>> LIVEUPDATE_PREPARE event. At that point it already knows which pages to
>> mark preserved so the serialization can be done in one go.
>
> I think this would be slower and more complex?
>
>> > 2) Require the users of hugetlb memory, like memfd, to
>> > preserve/restore the folios they are using (using their hugetlb order)
>> > 3) Just before kexec run over the PFN list and mark a bit if the folio
>> > was preserved by KHO or not. Make sure everything gets KHO
>> > preserved.
>>
>> "just before kexec" would need a callback from LUO. I suppose a
>> subsystem is the place for that callback. I wrote my email under the
>> (wrong) impression that we were replacing subsystems.
>
> The file descriptors path should have luo client ops that have all
> the required callbacks. This is probably an existing op.
>
>> That makes me wonder: how is the subsystem-level callback supposed to
>> access the global data? I suppose it can use the liveupdate_file_handler
>> directly, but it is kind of strange since technically the subsystem and
>> file handler are two different entities.
>
> If we need such things we would need a way to link these together, but
> I wonder if we really don't..
>
>> Also as Pasha mentioned, 1G pages for guest_memfd will use hugetlb, and
>> I'm not sure how that would map with this shared global data. memfd and
>> guest_memfd will likely have different liveupdate_file_handler but would
>> share data from the same subsystem. Maybe that's a problem to solve for
>> later...
>
> On preserve memfd should call into hugetlb to activate it as a hugetlb
> page provider and preserve it too.
From what I understand, the main problem you want to solve is that the
life cycle of the global data should be tied to the file descriptors.
And since everything should have a FD anyway, can't we directly tie the
subsystems to file handlers? The subsystem gets a "preserve" callback
when the first FD that uses it gets preserved. It gets an "unpreserve"
callback when the last FD goes away. And the rest of the state machine
like prepare, cancel, etc. stay the same.
I think this gives us a clean abstraction that has LUO-managed lifetime.
It also works with the guest_memfd and memfd case since both can have
hugetlb as their underlying subsystem. For example,
    static const struct liveupdate_file_ops memfd_luo_file_ops = {
        .preserve = memfd_luo_preserve,
        .unpreserve = memfd_luo_unpreserve,
        [...]
        .subsystem = &luo_hugetlb_subsys,
    };
And then luo_{un,}preserve_file() can keep a refcount for the subsystem
and preserve or unpreserve the subsystem as needed. LUO can manage the
locking for these callbacks too.
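To make the refcount idea concrete, here is a minimal userspace sketch. It is not LUO code: struct luo_subsys_sketch, subsys_get() and subsys_put() are hypothetical names invented for illustration, the "callbacks" are just counters, and locking is left out on the assumption that LUO would provide it.

```c
/*
 * Hypothetical stand-in for a LUO subsystem such as hugetlb.
 * None of these names exist in the posted series.
 */
struct luo_subsys_sketch {
    const char *name;
    unsigned int users;     /* preserved files currently using the subsystem */
    int preserve_calls;     /* how often the "preserve" callback would have run */
    int unpreserve_calls;   /* how often the "unpreserve" callback would have run */
};

/* Would be called from the preserve path of a file that uses @s. */
static int subsys_get(struct luo_subsys_sketch *s)
{
    if (s->users++ == 0)
        s->preserve_calls++;    /* first user: preserve the subsystem */
    return 0;
}

/* Would be called from the unpreserve path of a file that uses @s. */
static void subsys_put(struct luo_subsys_sketch *s)
{
    if (s->users == 0)
        return;                 /* unbalanced put: nothing to drop */
    if (--s->users == 0)
        s->unpreserve_calls++;  /* last user gone: unpreserve the subsystem */
}
```

With two memfds backed by the same subsystem, the first subsys_get() triggers the preserve callback once, and only the second subsys_put() triggers unpreserve.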
--
Regards,
Pratyush Yadav
* Re: [PATCH] man/man2/clone.2: Document CLONE_NEWPID and CLONE_NEWUSER flag
From: hoodit dev @ 2025-10-29 9:00 UTC (permalink / raw)
To: Alejandro Colomar, Carlos O'Donell
Cc: linux-man, linux-api, Andrew Morton
In-Reply-To: <e2wxznnsnew5vrlhbvvpc5gbjlfd5nimnlwhsgnh6qanyjhpjo@2hxdsmag3rsk>
Hi, Alejandro Colomar and Carlos
Just a friendly ping to check if you had a chance to review this patch.
Thanks
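For reviewers, the rule that the patch quoted below documents can be sketched as a small check. This is only an illustration and not the kernel's code: check_ns_thread_flags() is a hypothetical helper, and it models just this one rule, not the other flag checks done at fork time.

```c
#include <errno.h>
#include <linux/sched.h>    /* CLONE_* flag values */

/*
 * Hypothetical helper: CLONE_THREAD cannot be combined with CLONE_NEWPID
 * or CLONE_NEWUSER, while CLONE_PARENT can be combined with either.
 */
static int check_ns_thread_flags(unsigned long flags)
{
    if ((flags & CLONE_THREAD) &&
        (flags & (CLONE_NEWPID | CLONE_NEWUSER)))
        return -EINVAL;
    return 0;
}
```

So CLONE_NEWPID | CLONE_PARENT passes this check, while CLONE_NEWPID | CLONE_THREAD does not.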
On Fri, May 2, 2025 at 6:30 AM, Alejandro Colomar <alx@kernel.org> wrote:
>
> Hi Carlos,
>
> On Mon, Apr 21, 2025 at 04:16:03AM +0900, devhoodit wrote:
> > CLONE_NEWPID and CLONE_PARENT can be used together, but not CLONE_THREAD. Similarly, CLONE_NEWUSER and CLONE_PARENT can be used together, but not CLONE_THREAD.
> > This was discussed here: <https://lore.kernel.org/linux-man/06febfb3-e2e2-4363-bc34-83a07692144f@redhat.com/T/>
> > Relevant code: <https://github.com/torvalds/linux/blob/219d54332a09e8d8741c1e1982f5eae56099de85/kernel/fork.c#L1815>
> >
> > Cc: Carlos O'Donell <carlos@redhat.com>
> > Cc: Andrew Morton <akpm@linux-foundation.org>
> > Signed-off-by: devhoodit <devhoodit@gmail.com>
>
> Could you please review this patch?
>
>
> Have a lovely night!
> Alex
>
> > ---
> > man/man2/clone.2 | 9 +++------
> > 1 file changed, 3 insertions(+), 6 deletions(-)
> >
> > diff --git a/man/man2/clone.2 b/man/man2/clone.2
> > index 1b74e4c92..b9561125a 100644
> > --- a/man/man2/clone.2
> > +++ b/man/man2/clone.2
> > @@ -776,9 +776,7 @@ .SS The flags mask
> > no privileges are needed to create a user namespace.
> > .IP
> > This flag can't be specified in conjunction with
> > -.B CLONE_THREAD
> > -or
> > -.BR CLONE_PARENT .
> > +.BR CLONE_THREAD .
> > For security reasons,
> > .\" commit e66eded8309ebf679d3d3c1f5820d1f2ca332c71
> > .\" https://lwn.net/Articles/543273/
> > @@ -1319,11 +1317,10 @@ .SH ERRORS
> > mask.
> > .TP
> > .B EINVAL
> > +Both
> > .B CLONE_NEWPID
> > -and one (or both) of
> > +and
> > .B CLONE_THREAD
> > -or
> > -.B CLONE_PARENT
> > were specified in the
> > .I flags
> > mask.
> > --
> > 2.49.0
> >
>
> --
> <https://www.alejandro-colomar.es/>
* Re: [PATCH v4 14/30] liveupdate: luo_session: Add ioctls for file preservation and state management
From: Pratyush Yadav @ 2025-10-29 19:07 UTC (permalink / raw)
To: Pasha Tatashin
Cc: pratyush, jasonmiu, graf, changyuanl, rppt, dmatlack, rientjes,
corbet, rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl,
masahiroy, akpm, tj, yoann.congal, mmaurer, roman.gushchin,
chenridong, axboe, mark.rutland, jannh, vincent.guittot, hannes,
dan.j.williams, david, joel.granados, rostedt, anna.schumaker,
song, zhangguopeng, linux, linux-kernel, linux-doc, linux-mm,
gregkh, tglx, mingo, bp, dave.hansen, x86, hpa, rafael, dakr,
bartosz.golaszewski, cw00.choi, myungjoo.ham, yesanishhere,
Jonathan.Cameron, quic_zijuhu, aleksander.lobakin, ira.weiny,
andriy.shevchenko, leon, lukas, bhelgaas, wagi, djeffery,
stuart.w.hayes, lennart, brauner, linux-api, linux-fsdevel,
saeedm, ajayachandra, jgg, parav, leonro, witu, hughd, skhawaja,
chrisl, steven.sistare
In-Reply-To: <20250929010321.3462457-15-pasha.tatashin@soleen.com>
Hi Pasha,
On Mon, Sep 29 2025, Pasha Tatashin wrote:
> Introducing the userspace interface and internal logic required to
> manage the lifecycle of file descriptors within a session. Previously, a
> session was merely a container; this change makes it a functional
> management unit.
>
> The following capabilities are added:
>
> A new set of ioctl commands are added, which operate on the file
> descriptor returned by CREATE_SESSION. This allows userspace to:
> - LIVEUPDATE_SESSION_PRESERVE_FD: Add a file descriptor to a session
> to be preserved across the live update.
> - LIVEUPDATE_SESSION_UNPRESERVE_FD: Remove a previously added file
> descriptor from the session.
> - LIVEUPDATE_SESSION_RESTORE_FD: Retrieve a preserved file in the
> new kernel using its unique token.
>
> A state machine for each individual session, distinct from the global
> LUO state. This enables more granular control, allowing userspace to
> prepare or freeze specific sessions independently. This is managed via:
> - LIVEUPDATE_SESSION_SET_EVENT: An ioctl to send PREPARE, FREEZE,
> CANCEL, or FINISH events to a single session.
> - LIVEUPDATE_SESSION_GET_STATE: An ioctl to query the current state
> of a single session.
>
> The global subsystem callbacks (luo_session_prepare, luo_session_freeze)
> are updated to iterate through all existing sessions. They now trigger
> the appropriate per-session state transitions for any sessions that
> haven't already been transitioned individually by userspace.
>
> The session's .release handler is enhanced to be state-aware. When a
> session's file descriptor is closed, it now correctly cancels or
> finishes the session based on its current state before freeing all
> associated file resources, preventing resource leaks.
>
> Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
[...]
> +/**
> + * struct liveupdate_session_get_state - ioctl(LIVEUPDATE_SESSION_GET_STATE)
> + * @size: Input; sizeof(struct liveupdate_session_get_state)
> + * @incoming: Input; If 1, query the state of a restored file from the incoming
> + * (previous kernel's) set. If 0, query a file being prepared for
> + * preservation in the current set.
Spotted this when working on updating my test suite for LUO. This seems
to be a leftover from a previous version. I don't see it being used
anywhere in the code.
Also, I think the model we should have is to only allow new sessions in
normal state. Currently luo_session_create() allows creating a new
session in updated state. This would end up mixing sessions from a
previous boot and sessions from current boot. I don't really see a
reason for that and I think the userspace should first call finish
before starting new serialization. Keeps things simpler.
> + * @reserved: Must be zero.
> + * @state: Output; The live update state of this FD.
> + *
> + * Query the current live update state of a specific preserved file descriptor.
> + *
> + * - %LIVEUPDATE_STATE_NORMAL: Default state
> + * - %LIVEUPDATE_STATE_PREPARED: Prepare callback has been performed on this FD.
> + * - %LIVEUPDATE_STATE_FROZEN: Freeze callback has been performed on this FD.
> + * - %LIVEUPDATE_STATE_UPDATED: The system has successfully rebooted into the
> + * new kernel.
> + *
> + * See the definition of &enum liveupdate_state for more details on each state.
> + *
> + * Return: 0 on success, negative error code on failure.
> + */
> +struct liveupdate_session_get_state {
> + __u32 size;
> + __u8 incoming;
> + __u8 reserved[3];
> + __u32 state;
> +};
> +
> +#define LIVEUPDATE_SESSION_GET_STATE \
> + _IO(LIVEUPDATE_IOCTL_TYPE, LIVEUPDATE_CMD_SESSION_GET_STATE)
[...]
--
Regards,
Pratyush Yadav
* Re: [PATCH v4 14/30] liveupdate: luo_session: Add ioctls for file preservation and state management
From: Pasha Tatashin @ 2025-10-29 20:13 UTC (permalink / raw)
To: Pratyush Yadav
Cc: jasonmiu, graf, changyuanl, rppt, dmatlack, rientjes, corbet,
rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl, masahiroy, akpm,
tj, yoann.congal, mmaurer, roman.gushchin, chenridong, axboe,
mark.rutland, jannh, vincent.guittot, hannes, dan.j.williams,
david, joel.granados, rostedt, anna.schumaker, song, zhangguopeng,
linux, linux-kernel, linux-doc, linux-mm, gregkh, tglx, mingo, bp,
dave.hansen, x86, hpa, rafael, dakr, bartosz.golaszewski,
cw00.choi, myungjoo.ham, yesanishhere, Jonathan.Cameron,
quic_zijuhu, aleksander.lobakin, ira.weiny, andriy.shevchenko,
leon, lukas, bhelgaas, wagi, djeffery, stuart.w.hayes, lennart,
brauner, linux-api, linux-fsdevel, saeedm, ajayachandra, jgg,
parav, leonro, witu, hughd, skhawaja, chrisl, steven.sistare
In-Reply-To: <mafs0tszhcyrw.fsf@kernel.org>
On Wed, Oct 29, 2025 at 3:07 PM Pratyush Yadav <pratyush@kernel.org> wrote:
>
> Hi Pasha,
>
> On Mon, Sep 29 2025, Pasha Tatashin wrote:
>
> > Introducing the userspace interface and internal logic required to
> > manage the lifecycle of file descriptors within a session. Previously, a
> > session was merely a container; this change makes it a functional
> > management unit.
> >
> > The following capabilities are added:
> >
> > A new set of ioctl commands are added, which operate on the file
> > descriptor returned by CREATE_SESSION. This allows userspace to:
> > - LIVEUPDATE_SESSION_PRESERVE_FD: Add a file descriptor to a session
> > to be preserved across the live update.
> > - LIVEUPDATE_SESSION_UNPRESERVE_FD: Remove a previously added file
> > descriptor from the session.
> > - LIVEUPDATE_SESSION_RESTORE_FD: Retrieve a preserved file in the
> > new kernel using its unique token.
> >
> > A state machine for each individual session, distinct from the global
> > LUO state. This enables more granular control, allowing userspace to
> > prepare or freeze specific sessions independently. This is managed via:
> > - LIVEUPDATE_SESSION_SET_EVENT: An ioctl to send PREPARE, FREEZE,
> > CANCEL, or FINISH events to a single session.
> > - LIVEUPDATE_SESSION_GET_STATE: An ioctl to query the current state
> > of a single session.
> >
> > The global subsystem callbacks (luo_session_prepare, luo_session_freeze)
> > are updated to iterate through all existing sessions. They now trigger
> > the appropriate per-session state transitions for any sessions that
> > haven't already been transitioned individually by userspace.
> >
> > The session's .release handler is enhanced to be state-aware. When a
> > session's file descriptor is closed, it now correctly cancels or
> > finishes the session based on its current state before freeing all
> > associated file resources, preventing resource leaks.
> >
> > Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
> [...]
> > +/**
> > + * struct liveupdate_session_get_state - ioctl(LIVEUPDATE_SESSION_GET_STATE)
> > + * @size: Input; sizeof(struct liveupdate_session_get_state)
> > + * @incoming: Input; If 1, query the state of a restored file from the incoming
> > + * (previous kernel's) set. If 0, query a file being prepared for
> > + * preservation in the current set.
>
> Spotted this when working on updating my test suite for LUO. This seems
> to be a leftover from a previous version. I don't see it being used
> anywhere in the code.
Thank you, I will remove this.
> Also, I think the model we should have is to only allow new sessions in
> normal state. Currently luo_session_create() allows creating a new
> session in updated state. This would end up mixing sessions from a
> previous boot and sessions from current boot. I don't really see a
> reason for that and I think the userspace should first call finish
> before starting new serialization. Keeps things simpler.
It does. However, yesterday Jason Gunthorpe suggested that we simplify
the uapi, at least for the initial landing, by removing the state
machine during boot and allowing new sessions to be created at any
time. This would also mean separating the incoming and outgoing
sessions and removing the ioctl() call used to bring the machine into
a normal state; instead, only individual sessions could be brought
into a 'normal' state.
Simplified uAPI Proposal
The simplest uAPI would look like this:
IOCTLs on /dev/liveupdate (to create and retrieve session FDs):
LIVEUPDATE_IOCTL_CREATE_SESSION
LIVEUPDATE_IOCTL_RETRIEVE_SESSION
IOCTLs on session FDs:
LIVEUPDATE_CMD_SESSION_PRESERVE_FD
LIVEUPDATE_CMD_SESSION_RETRIEVE_FD
LIVEUPDATE_CMD_SESSION_FINISH
Happy Path
The happy path would look like this:
- luod creates a session with a specific name and passes it to the vmm.
- The vmm preserves FDs in a specific order: memfd, iommufd, vfiofd.
(If the order is wrong, the preserve callbacks will fail.)
- A reboot(KEXEC) is performed.
- Each session receives a freeze() callback to notify it that
mutations are no longer possible.
- During boot, liveupdate_fh_global_state_get(&h, &obj) can be used to
retrieve the global state.
- Once the machine has booted, luod retrieves the incoming sessions
and passes them to the vmms.
- The vmm retrieves the FDs from the session and performs the
necessary IOCTLs on them.
- The vmm calls LIVEUPDATE_CMD_SESSION_FINISH on the session. Each FD
receives a finish() callback in LIFO order.
- If everything succeeds, the session becomes an empty "outgoing"
session. It can then be closed and discarded or reused for the next
live update by preserving new FDs into it.
- Once the last FD for a file-handler is finished,
h->ops->global_state_finish(h, h->global_state_obj) is called to
finish the incoming global state.
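As a rough illustration of the finish step in the happy path above, here is a userspace sketch. Everything in it is an assumption made for the example: handler_sketch, session_sketch, session_add() and session_finish() are hypothetical names, and the per-FD finish() and global_state_finish() callbacks are modelled as simple bookkeeping.

```c
#include <stddef.h>

#define SKETCH_MAX_FDS 8

/* Hypothetical per-file-handler state (e.g. one for memfd, one for iommufd). */
struct handler_sketch {
    unsigned int live_fds;      /* FDs of this type not yet finished */
    int global_finish_calls;    /* times global_state_finish() would have run */
};

/* Hypothetical session: FDs are stored in the order they were preserved. */
struct session_sketch {
    struct handler_sketch *fds[SKETCH_MAX_FDS];
    size_t nr_fds;
    int finish_order[SKETCH_MAX_FDS];   /* indices in the order finish() ran */
    size_t nr_finished;
};

static int session_add(struct session_sketch *s, struct handler_sketch *h)
{
    if (s->nr_fds == SKETCH_MAX_FDS)
        return -1;
    s->fds[s->nr_fds++] = h;
    h->live_fds++;
    return 0;
}

/* Finish every FD in LIFO order; finish the global state on the last FD. */
static void session_finish(struct session_sketch *s)
{
    s->nr_finished = 0;
    while (s->nr_fds > 0) {
        size_t i = --s->nr_fds;
        struct handler_sketch *h = s->fds[i];

        s->finish_order[s->nr_finished++] = (int)i;  /* per-FD finish() */
        if (--h->live_fds == 0)
            h->global_finish_calls++;   /* last FD for this handler */
    }
}
```

After session_finish() the session is empty again, which matches the idea that it can be reused as an outgoing session for the next live update.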
Unhappy Paths
- If an outgoing session FD is closed, each FD in that session
receives an unpreserve callback in LIFO order.
- If the last FD for a global state is unpreserved,
h->ops->global_state_unpreserve(h, h->global_state_obj) is called.
- If freeze() fails, a cancel() is performed on each FD that received
freeze() cb, and reboot(KEXEC) returns a failure.
- If an incoming session FD is closed, the resources are considered
"leaked." They are discarded only during the next live-update; this is
intended to prevent implementing rare and untested clean-up code.
- If a user tries to finish a session and it fails, it is considered
the user's problem. This might happen because some IOCTLs still need
to be run on the retrieved FDs to bring them to a state where finish
is possible.
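The freeze() failure case above can be sketched as follows. This is a userspace illustration under stated assumptions, not the proposed implementation: fd_sketch, fd_freeze(), fd_cancel() and session_freeze() are hypothetical names, the sketch cancels only the FDs whose freeze() succeeded, and it cancels them in reverse order, which the list above does not specify.

```c
#include <stddef.h>

/* Hypothetical preserved FD with an injectable freeze failure. */
struct fd_sketch {
    int fail_freeze;    /* set to make freeze() fail for this FD */
    int frozen;
    int cancel_calls;
};

static int fd_freeze(struct fd_sketch *f)
{
    if (f->fail_freeze)
        return -1;
    f->frozen = 1;
    return 0;
}

static void fd_cancel(struct fd_sketch *f)
{
    f->frozen = 0;
    f->cancel_calls++;
}

/*
 * Freeze all FDs in order. On the first failure, cancel the FDs that were
 * already frozen and return the error, so the caller (reboot(KEXEC) in the
 * proposal) can report a failure.
 */
static int session_freeze(struct fd_sketch *fds, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        int err = fd_freeze(&fds[i]);

        if (err) {
            while (i-- > 0)
                fd_cancel(&fds[i]);
            return err;
        }
    }
    return 0;
}
```

If the third of three FDs fails to freeze, the first two each get one cancel and the call returns an error; a later retry with the failure cleared freezes all three.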
This would also mean that subsystems would not be needed, leaving only
FLB (File-Lifecycle-Bound Global State) to use as a handle for global
state. The API I am proposing for FLB keeps the same global state for
a single file-handler type. However, HugeTLB might have multiple file
handlers, so the API would need to be extended slightly to support
this case. Multiple file handlers will share the same global resource
with the same callbacks.
Pasha
> > + * @reserved: Must be zero.
> > + * @state: Output; The live update state of this FD.
> > + *
> > + * Query the current live update state of a specific preserved file descriptor.
> > + *
> > + * - %LIVEUPDATE_STATE_NORMAL: Default state
> > + * - %LIVEUPDATE_STATE_PREPARED: Prepare callback has been performed on this FD.
> > + * - %LIVEUPDATE_STATE_FROZEN: Freeze callback has been performed on this FD.
> > + * - %LIVEUPDATE_STATE_UPDATED: The system has successfully rebooted into the
> > + * new kernel.
> > + *
> > + * See the definition of &enum liveupdate_state for more details on each state.
> > + *
> > + * Return: 0 on success, negative error code on failure.
> > + */
> > +struct liveupdate_session_get_state {
> > + __u32 size;
> > + __u8 incoming;
> > + __u8 reserved[3];
> > + __u32 state;
> > +};
> > +
> > +#define LIVEUPDATE_SESSION_GET_STATE \
> > + _IO(LIVEUPDATE_IOCTL_TYPE, LIVEUPDATE_CMD_SESSION_GET_STATE)
> [...]
>
> --
> Regards,
> Pratyush Yadav
* Re: [PATCH v4 14/30] liveupdate: luo_session: Add ioctls for file preservation and state management
From: Pratyush Yadav @ 2025-10-29 20:37 UTC (permalink / raw)
To: Pasha Tatashin
Cc: pratyush, jasonmiu, graf, changyuanl, rppt, dmatlack, rientjes,
corbet, rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl,
masahiroy, akpm, tj, yoann.congal, mmaurer, roman.gushchin,
chenridong, axboe, mark.rutland, jannh, vincent.guittot, hannes,
dan.j.williams, david, joel.granados, rostedt, anna.schumaker,
song, zhangguopeng, linux, linux-kernel, linux-doc, linux-mm,
gregkh, tglx, mingo, bp, dave.hansen, x86, hpa, rafael, dakr,
bartosz.golaszewski, cw00.choi, myungjoo.ham, yesanishhere,
Jonathan.Cameron, quic_zijuhu, aleksander.lobakin, ira.weiny,
andriy.shevchenko, leon, lukas, bhelgaas, wagi, djeffery,
stuart.w.hayes, lennart, brauner, linux-api, linux-fsdevel,
saeedm, ajayachandra, jgg, parav, leonro, witu, hughd, skhawaja,
chrisl, steven.sistare
In-Reply-To: <20250929010321.3462457-15-pasha.tatashin@soleen.com>
Hi Pasha,
On Mon, Sep 29 2025, Pasha Tatashin wrote:
> Introducing the userspace interface and internal logic required to
> manage the lifecycle of file descriptors within a session. Previously, a
> session was merely a container; this change makes it a functional
> management unit.
>
> The following capabilities are added:
>
> A new set of ioctl commands are added, which operate on the file
> descriptor returned by CREATE_SESSION. This allows userspace to:
> - LIVEUPDATE_SESSION_PRESERVE_FD: Add a file descriptor to a session
> to be preserved across the live update.
> - LIVEUPDATE_SESSION_UNPRESERVE_FD: Remove a previously added file
> descriptor from the session.
> - LIVEUPDATE_SESSION_RESTORE_FD: Retrieve a preserved file in the
> new kernel using its unique token.
>
> A state machine for each individual session, distinct from the global
> LUO state. This enables more granular control, allowing userspace to
> prepare or freeze specific sessions independently. This is managed via:
> - LIVEUPDATE_SESSION_SET_EVENT: An ioctl to send PREPARE, FREEZE,
> CANCEL, or FINISH events to a single session.
> - LIVEUPDATE_SESSION_GET_STATE: An ioctl to query the current state
> of a single session.
>
> The global subsystem callbacks (luo_session_prepare, luo_session_freeze)
> are updated to iterate through all existing sessions. They now trigger
> the appropriate per-session state transitions for any sessions that
> haven't already been transitioned individually by userspace.
>
> The session's .release handler is enhanced to be state-aware. When a
> session's file descriptor is closed, it now correctly cancels or
> finishes the session based on its current state before freeing all
> associated file resources, preventing resource leaks.
>
> Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
> ---
[...]
> +static int luo_session_restore_fd(struct luo_session *session,
> + struct luo_ucmd *ucmd)
> +{
> + struct liveupdate_session_restore_fd *argp = ucmd->cmd;
> + struct file *file;
> + int ret;
> +
> + guard(rwsem_read)(&luo_state_rwsem);
> + if (!liveupdate_state_updated())
> + return -EBUSY;
> +
> + argp->fd = get_unused_fd_flags(O_CLOEXEC);
> + if (argp->fd < 0)
> + return argp->fd;
> +
> + guard(mutex)(&session->mutex);
> +
> + /* Session might have already finished independatly from global state */
> + if (session->state != LIVEUPDATE_STATE_UPDATED)
> + return -EBUSY;
> +
> + ret = luo_retrieve_file(session, argp->token, &file);
The retrieve behaviour here causes some nastiness.
When the session is deserialized by luo_session_deserialize(), all the
files get added to the session's files_list. Now when a process
retrieves the session after kexec and restores a file, the file
handler's retrieve callback is invoked, deserializing and restoring the
file. Once deserialization is done, the callback usually frees up the
metadata. All this is fine.
The problem is that the file stays on the files_list. When the
process closes the session FD, the unpreserve callback is invoked for
all files.
The unpreserve callback should undo what preserve did. That is, free up
serialization data. After a file is restored post-kexec, the things to
free up are different. For example, on a memfd, the folios won't be
pinned anymore. So invoking unpreserve on a retrieved file doesn't work
and causes UAF or other invalid behaviour.
I think you should treat retrieve as an unpreserve as well, and remove
the file from the session's list.
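A minimal userspace sketch of that suggestion, not the LUO code itself.
All names here (sketch_file, sketch_session, sketch_retrieve,
sketch_release) are hypothetical stand-ins, and a plain singly linked
list is assumed in place of the real files_list. The point is only that
retrieve unlinks the file, so a later release never runs unpreserve on
it:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a preserved file; not the real LUO struct. */
struct sketch_file {
	unsigned long token;
	int unpreserve_calls;		/* times unpreserve ran on this file */
	struct sketch_file *next;
};

/* Hypothetical stand-in for a session and its files_list. */
struct sketch_session {
	struct sketch_file *files;
};

/* Add a file at the head of the session's list. */
static void sketch_add(struct sketch_session *s, struct sketch_file *f)
{
	assert(s && f);
	f->next = s->files;
	s->files = f;
}

/*
 * Retrieve by token and unlink the file from the session, so that
 * retrieve also acts as an unpreserve. Returns NULL if the token is
 * not on the list (for example, because it was already retrieved).
 */
static struct sketch_file *sketch_retrieve(struct sketch_session *s,
					   unsigned long token)
{
	struct sketch_file **pp;

	for (pp = &s->files; *pp; pp = &(*pp)->next) {
		if ((*pp)->token == token) {
			struct sketch_file *f = *pp;

			*pp = f->next;
			f->next = NULL;
			return f;
		}
	}
	return NULL;
}

/*
 * Release the session: "unpreserve" only the files still on the list.
 * Returns how many files were unpreserved.
 */
static int sketch_release(struct sketch_session *s)
{
	int n = 0;

	while (s->files) {
		struct sketch_file *f = s->files;

		s->files = f->next;
		f->next = NULL;
		f->unpreserve_calls++;
		n++;
	}
	return n;
}
```

With two files added and one retrieved, sketch_release() touches only
the file that was never retrieved.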
Side note: I see that a lot of code in luo_file.c works with the session
data structures directly. For example, luo_file_deserialize() adds the
file to session->files_list. I think the code would be a lot cleaner and
more maintainable if the concerns were clearly separated.
luo_file_deserialize() should focus on deserializing a file given a
compatible string and data, and all handling of the session's state
should be done by luo_session_deserialize().
luo_file_deserialize() is just an example, but I think the idea can be
applied in more places.
[...]
--
Regards,
Pratyush Yadav
* Re: [PATCH v4 14/30] liveupdate: luo_session: Add ioctls for file preservation and state management
From: David Matlack @ 2025-10-29 20:43 UTC (permalink / raw)
To: Pasha Tatashin
Cc: Pratyush Yadav, jasonmiu, graf, changyuanl, rppt, rientjes,
corbet, rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl,
masahiroy, akpm, tj, yoann.congal, mmaurer, roman.gushchin,
chenridong, axboe, mark.rutland, jannh, vincent.guittot, hannes,
dan.j.williams, david, joel.granados, rostedt, anna.schumaker,
song, zhangguopeng, linux, linux-kernel, linux-doc, linux-mm,
gregkh, tglx, mingo, bp, dave.hansen, x86, hpa, rafael, dakr,
bartosz.golaszewski, cw00.choi, myungjoo.ham, yesanishhere,
Jonathan.Cameron, quic_zijuhu, aleksander.lobakin, ira.weiny,
andriy.shevchenko, leon, lukas, bhelgaas, wagi, djeffery,
stuart.w.hayes, lennart, brauner, linux-api, linux-fsdevel,
saeedm, ajayachandra, jgg, parav, leonro, witu, hughd, skhawaja,
chrisl, steven.sistare
In-Reply-To: <CA+CK2bBVSX26TKwgLkXCDop5u3e9McH3sQMascT47ZwwrwraOw@mail.gmail.com>
On Wed, Oct 29, 2025 at 1:13 PM Pasha Tatashin
<pasha.tatashin@soleen.com> wrote:
> On Wed, Oct 29, 2025 at 3:07 PM Pratyush Yadav <pratyush@kernel.org> wrote:
> > Also, I think the model we should have is to only allow new sessions in
> > normal state. Currently luo_session_create() allows creating a new
> > session in updated state. This would end up mixing sessions from a
> > previous boot and sessions from current boot. I don't really see a
> > reason for that and I think the userspace should first call finish
> > before starting new serialization. Keeps things simpler.
>
> It does. However, yesterday Jason Gunthorpe suggested that we simplify
> the uapi, at least for the initial landing, by removing the state
> machine during boot and allowing new sessions to be created at any
> time. This would also mean separating the incoming and outgoing
> sessions and removing the ioctl() call used to bring the machine into
> a normal state; instead, only individual sessions could be brought
> into a 'normal' state.
>
> Simplified uAPI Proposal
> The simplest uAPI would look like this:
> IOCTLs on /dev/liveupdate (to create and retrieve session FDs):
> LIVEUPDATE_IOCTL_CREATE_SESSION
> LIVEUPDATE_IOCTL_RETRIEVE_SESSION
>
> IOCTLs on session FDs:
> LIVEUPDATE_CMD_SESSION_PRESERVE_FD
> LIVEUPDATE_CMD_SESSION_RETRIEVE_FD
> LIVEUPDATE_CMD_SESSION_FINISH
Should we drop LIVEUPDATE_CMD_SESSION_FINISH and do this work in
close(session_fd)? close() can return an error.
I think this cleans up a few parts of the uAPI:
- One less ioctl.
- The only way to get an outgoing session would be through
LIVEUPDATE_IOCTL_CREATE_SESSION. The kernel does not have to deal with
an empty incoming session "becoming" an outgoing session (as described
below).
- The kernel can properly leak the session and its resources by
refusing to close the session file.
* Re: [PATCH v4 14/30] liveupdate: luo_session: Add ioctls for file preservation and state management
From: Pasha Tatashin @ 2025-10-29 20:57 UTC (permalink / raw)
To: David Matlack
Cc: Pratyush Yadav, jasonmiu, graf, changyuanl, rppt, rientjes,
corbet, rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl,
masahiroy, akpm, tj, yoann.congal, mmaurer, roman.gushchin,
chenridong, axboe, mark.rutland, jannh, vincent.guittot, hannes,
dan.j.williams, david, joel.granados, rostedt, anna.schumaker,
song, zhangguopeng, linux, linux-kernel, linux-doc, linux-mm,
gregkh, tglx, mingo, bp, dave.hansen, x86, hpa, rafael, dakr,
bartosz.golaszewski, cw00.choi, myungjoo.ham, yesanishhere,
Jonathan.Cameron, quic_zijuhu, aleksander.lobakin, ira.weiny,
andriy.shevchenko, leon, lukas, bhelgaas, wagi, djeffery,
stuart.w.hayes, lennart, brauner, linux-api, linux-fsdevel,
saeedm, ajayachandra, jgg, parav, leonro, witu, hughd, skhawaja,
chrisl, steven.sistare
In-Reply-To: <CALzav=d_Gmb8xKCwWCGsQQrdxHJrnk5VP-8hvO6FugUP7_ukAw@mail.gmail.com>
On Wed, Oct 29, 2025 at 4:44 PM David Matlack <dmatlack@google.com> wrote:
>
> On Wed, Oct 29, 2025 at 1:13 PM Pasha Tatashin
> <pasha.tatashin@soleen.com> wrote:
> > On Wed, Oct 29, 2025 at 3:07 PM Pratyush Yadav <pratyush@kernel.org> wrote:
> > > Also, I think the model we should have is to only allow new sessions in
> > > normal state. Currently luo_session_create() allows creating a new
> > > session in updated state. This would end up mixing sessions from a
> > > previous boot and sessions from current boot. I don't really see a
> > > reason for that and I think the userspace should first call finish
> > > before starting new serialization. Keeps things simpler.
> >
> > It does. However, yesterday Jason Gunthorpe suggested that we simplify
> > the uapi, at least for the initial landing, by removing the state
> > machine during boot and allowing new sessions to be created at any
> > time. This would also mean separating the incoming and outgoing
> > sessions and removing the ioctl() call used to bring the machine into
> > a normal state; instead, only individual sessions could be brought
> > into a 'normal' state.
> >
> > Simplified uAPI Proposal
> > The simplest uAPI would look like this:
> > IOCTLs on /dev/liveupdate (to create and retrieve session FDs):
> > LIVEUPDATE_IOCTL_CREATE_SESSION
> > LIVEUPDATE_IOCTL_RETRIEVE_SESSION
> >
> > IOCTLs on session FDs:
> > LIVEUPDATE_CMD_SESSION_PRESERVE_FD
> > LIVEUPDATE_CMD_SESSION_RETRIEVE_FD
> > LIVEUPDATE_CMD_SESSION_FINISH
>
> Should we drop LIVEUPDATE_CMD_SESSION_FINISH and do this work in
> close(session_fd)? close() can return an error.
>
> I think this cleans up a few parts of the uAPI:
>
> - One less ioctl.
> - The only way to get an outgoing session would be through
> LIVEUPDATE_IOCTL_CREATE_SESSION. The kernel does not have to deal with
> an empty incoming session "becoming" an outgoing session (as described
> below).
> - The kernel can properly leak the session and its resources by
> refusing to close the session file.
I was considering this. But AFAIK, even if close() fails, the FD is
still closed, so I am not aware of any existing API that relies on
close() failing. Finish (or set-event, if we decide to expand the
events in the future) should be a separate ioctl(), and close() should
release the FD unconditionally, as it would anyway even if release()
returned a failure.
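
A rough userspace model of that split, under stated assumptions: the
struct and the helpers fin_finish() and fin_release() are hypothetical
and are not the LUO uAPI. It only illustrates why a fallible finish
step is easier to use as an explicit call that can be retried, while
release drops the handle no matter what it returns:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical session model; the fields are illustrative only. */
struct fin_session {
	int pending;	/* retrieved FDs that still need work before finish */
	int finished;	/* set once finish has succeeded */
	int open;	/* 1 while the caller still holds the handle */
};

/*
 * Explicit finish: may fail, but the caller keeps the handle and can
 * fix things up and try again.
 */
static int fin_finish(struct fin_session *s)
{
	assert(s != NULL);
	if (s->pending > 0)
		return -EBUSY;
	s->finished = 1;
	return 0;
}

/*
 * Release: the handle is gone afterwards regardless of the return
 * value, so an error here cannot be acted upon by retrying.
 */
static int fin_release(struct fin_session *s)
{
	int ret;

	assert(s != NULL);
	ret = s->finished ? 0 : -EBUSY;
	s->open = 0;
	return ret;
}
```

A failed fin_finish() leaves open == 1, so the caller can retry; a
failed fin_release() still leaves open == 0.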
* Re: [PATCH v4 14/30] liveupdate: luo_session: Add ioctls for file preservation and state management
From: Pasha Tatashin @ 2025-10-29 20:58 UTC (permalink / raw)
To: Pratyush Yadav
Cc: jasonmiu, graf, changyuanl, rppt, dmatlack, rientjes, corbet,
rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl, masahiroy, akpm,
tj, yoann.congal, mmaurer, roman.gushchin, chenridong, axboe,
mark.rutland, jannh, vincent.guittot, hannes, dan.j.williams,
david, joel.granados, rostedt, anna.schumaker, song, zhangguopeng,
linux, linux-kernel, linux-doc, linux-mm, gregkh, tglx, mingo, bp,
dave.hansen, x86, hpa, rafael, dakr, bartosz.golaszewski,
cw00.choi, myungjoo.ham, yesanishhere, Jonathan.Cameron,
quic_zijuhu, aleksander.lobakin, ira.weiny, andriy.shevchenko,
leon, lukas, bhelgaas, wagi, djeffery, stuart.w.hayes, lennart,
brauner, linux-api, linux-fsdevel, saeedm, ajayachandra, jgg,
parav, leonro, witu, hughd, skhawaja, chrisl, steven.sistare
In-Reply-To: <mafs0pla5cuml.fsf@kernel.org>
On Wed, Oct 29, 2025 at 4:37 PM Pratyush Yadav <pratyush@kernel.org> wrote:
>
> Hi Pasha,
>
> On Mon, Sep 29 2025, Pasha Tatashin wrote:
>
> > Introducing the userspace interface and internal logic required to
> > manage the lifecycle of file descriptors within a session. Previously, a
> > session was merely a container; this change makes it a functional
> > management unit.
> >
> > The following capabilities are added:
> >
> > A new set of ioctl commands are added, which operate on the file
> > descriptor returned by CREATE_SESSION. This allows userspace to:
> > - LIVEUPDATE_SESSION_PRESERVE_FD: Add a file descriptor to a session
> > to be preserved across the live update.
> > - LIVEUPDATE_SESSION_UNPRESERVE_FD: Remove a previously added file
> > descriptor from the session.
> > - LIVEUPDATE_SESSION_RESTORE_FD: Retrieve a preserved file in the
> > new kernel using its unique token.
> >
> > A state machine for each individual session, distinct from the global
> > LUO state. This enables more granular control, allowing userspace to
> > prepare or freeze specific sessions independently. This is managed via:
> > - LIVEUPDATE_SESSION_SET_EVENT: An ioctl to send PREPARE, FREEZE,
> > CANCEL, or FINISH events to a single session.
> > - LIVEUPDATE_SESSION_GET_STATE: An ioctl to query the current state
> > of a single session.
> >
> > The global subsystem callbacks (luo_session_prepare, luo_session_freeze)
> > are updated to iterate through all existing sessions. They now trigger
> > the appropriate per-session state transitions for any sessions that
> > haven't already been transitioned individually by userspace.
> >
> > The session's .release handler is enhanced to be state-aware. When a
> > session's file descriptor is closed, it now correctly cancels or
> > finishes the session based on its current state before freeing all
> > associated file resources, preventing resource leaks.
> >
> > Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
> > ---
> [...]
> > +static int luo_session_restore_fd(struct luo_session *session,
> > + struct luo_ucmd *ucmd)
> > +{
> > + struct liveupdate_session_restore_fd *argp = ucmd->cmd;
> > + struct file *file;
> > + int ret;
> > +
> > + guard(rwsem_read)(&luo_state_rwsem);
> > + if (!liveupdate_state_updated())
> > + return -EBUSY;
> > +
> > + argp->fd = get_unused_fd_flags(O_CLOEXEC);
> > + if (argp->fd < 0)
> > + return argp->fd;
> > +
> > + guard(mutex)(&session->mutex);
> > +
> > + /* Session might have already finished independatly from global state */
> > + if (session->state != LIVEUPDATE_STATE_UPDATED)
> > + return -EBUSY;
> > +
> > + ret = luo_retrieve_file(session, argp->token, &file);
>
> The retrieve behaviour here causes some nastiness.
>
> When the session is deserialized by luo_session_deserialize(), all the
> files get added to the session's files_list. Now when a process
> retrieves the session after kexec and restores a file, the file
> handler's retrieve callback is invoked, deserializing and restoring the
> file. Once deserialization is done, the callback usually frees up the
> metadata. All this is fine.
>
> The problem is that the file stays on on the files_list. When the
> process closes the session FD, the unpreserve callback is invoked for
> all files.
> The unpreserve callback should undo what preserve did. That is, free up
Right, we discussed that continuous preservation is not going to be
possible. So, this bug is not going to be present in the next version.
* Re: [PATCH v4 14/30] liveupdate: luo_session: Add ioctls for file preservation and state management
From: David Matlack @ 2025-10-29 21:13 UTC (permalink / raw)
To: Pasha Tatashin
Cc: Pratyush Yadav, jasonmiu, graf, changyuanl, rppt, rientjes,
corbet, rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl,
masahiroy, akpm, tj, yoann.congal, mmaurer, roman.gushchin,
chenridong, axboe, mark.rutland, jannh, vincent.guittot, hannes,
dan.j.williams, david, joel.granados, rostedt, anna.schumaker,
song, zhangguopeng, linux, linux-kernel, linux-doc, linux-mm,
gregkh, tglx, mingo, bp, dave.hansen, x86, hpa, rafael, dakr,
bartosz.golaszewski, cw00.choi, myungjoo.ham, yesanishhere,
Jonathan.Cameron, quic_zijuhu, aleksander.lobakin, ira.weiny,
andriy.shevchenko, leon, lukas, bhelgaas, wagi, djeffery,
stuart.w.hayes, lennart, brauner, linux-api, linux-fsdevel,
saeedm, ajayachandra, jgg, parav, leonro, witu, hughd, skhawaja,
chrisl, steven.sistare
In-Reply-To: <CA+CK2bBVSX26TKwgLkXCDop5u3e9McH3sQMascT47ZwwrwraOw@mail.gmail.com>
On Wed, Oct 29, 2025 at 1:13 PM Pasha Tatashin
<pasha.tatashin@soleen.com> wrote:
> Simplified uAPI Proposal
> The simplest uAPI would look like this:
> IOCTLs on /dev/liveupdate (to create and retrieve session FDs):
> LIVEUPDATE_IOCTL_CREATE_SESSION
> LIVEUPDATE_IOCTL_RETRIEVE_SESSION
> - If everything succeeds, the session becomes an empty "outgoing"
> session. It can then be closed and discarded or reused for the next
> live update by preserving new FDs into it.
I think it would be useful to cleanly separate incoming and outgoing
sessions. The only way to get an outgoing session is with
LIVEUPDATE_IOCTL_CREATE_SESSION. Incoming sessions can be retrieved
with LIVEUPDATE_IOCTL_RETRIEVE_SESSION.
It is fine and expected for incoming and outgoing sessions to have the
same name. But they are different sessions. This way, the kernel can
easily keep track of incoming and outgoing sessions separately, and
there is no need to "transition" a session from incoming to
outgoing.
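
As a small illustration of that separation (a userspace sketch only;
sess_table, sess_registry, sess_create() and sess_retrieve() are made-up
names, and fixed-size name tables are assumed for brevity), create
touches only the outgoing table and retrieve looks only at the incoming
one, so the same name can exist in both without any transition:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define SESS_MAX 8

/* Hypothetical table of session names. */
struct sess_table {
	const char *names[SESS_MAX];
	int count;
};

/* Incoming and outgoing sessions are tracked separately. */
struct sess_registry {
	struct sess_table incoming;	/* from the previous kernel */
	struct sess_table outgoing;	/* created in the current kernel */
};

/* Return the index of name in the table, or -1 if absent. */
static int table_find(const struct sess_table *t, const char *name)
{
	int i;

	assert(t && name);
	for (i = 0; i < t->count; i++)
		if (strcmp(t->names[i], name) == 0)
			return i;
	return -1;
}

/* Add a name; names must be unique within one table. */
static int table_add(struct sess_table *t, const char *name)
{
	if (table_find(t, name) >= 0)
		return -EEXIST;
	if (t->count == SESS_MAX)
		return -ENOSPC;
	t->names[t->count++] = name;
	return 0;
}

/* Create: the only way to get an outgoing session. */
static int sess_create(struct sess_registry *r, const char *name)
{
	return table_add(&r->outgoing, name);
}

/* Retrieve: only ever consults the incoming table. */
static int sess_retrieve(const struct sess_registry *r, const char *name)
{
	return table_find(&r->incoming, name) >= 0 ? 0 : -ENOENT;
}
```

An outgoing session created under a name never becomes retrievable, and
an incoming session never blocks creation of an outgoing one with the
same name.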
* Re: [PATCH v4 14/30] liveupdate: luo_session: Add ioctls for file preservation and state management
From: Pasha Tatashin @ 2025-10-29 21:17 UTC (permalink / raw)
To: David Matlack
Cc: Pratyush Yadav, jasonmiu, graf, changyuanl, rppt, rientjes,
corbet, rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl,
masahiroy, akpm, tj, yoann.congal, mmaurer, roman.gushchin,
chenridong, axboe, mark.rutland, jannh, vincent.guittot, hannes,
dan.j.williams, david, joel.granados, rostedt, anna.schumaker,
song, zhangguopeng, linux, linux-kernel, linux-doc, linux-mm,
gregkh, tglx, mingo, bp, dave.hansen, x86, hpa, rafael, dakr,
bartosz.golaszewski, cw00.choi, myungjoo.ham, yesanishhere,
Jonathan.Cameron, quic_zijuhu, aleksander.lobakin, ira.weiny,
andriy.shevchenko, leon, lukas, bhelgaas, wagi, djeffery,
stuart.w.hayes, lennart, brauner, linux-api, linux-fsdevel,
saeedm, ajayachandra, jgg, parav, leonro, witu, hughd, skhawaja,
chrisl, steven.sistare
In-Reply-To: <CALzav=frK48c1=nsbVJ4EvqqOqr33pUArP4G17su0hxOYveALw@mail.gmail.com>
On Wed, Oct 29, 2025 at 5:13 PM David Matlack <dmatlack@google.com> wrote:
>
> On Wed, Oct 29, 2025 at 1:13 PM Pasha Tatashin
> <pasha.tatashin@soleen.com> wrote:
>
> > Simplified uAPI Proposal
> > The simplest uAPI would look like this:
> > IOCTLs on /dev/liveupdate (to create and retrieve session FDs):
> > LIVEUPDATE_IOCTL_CREATE_SESSION
> > LIVEUPDATE_IOCTL_RETRIEVE_SESSION
>
> > - If everything succeeds, the session becomes an empty "outgoing"
> > session. It can then be closed and discarded or reused for the next
> > live update by preserving new FDs into it.
>
> I think it would be useful to cleanly separate incoming and outgoing
> sessions. The only way to get an outgoing session is with
> LIVEUPDATE_IOCTL_CREATE_SESSION. Incoming sessions can be retrieved
> with LIVEUPDATE_IOCTL_RETRIEVE_SESSION.
>
> It is fine and expected for incoming and outgoing sessions to have the
> same name. But they are different sessions. This way, the kernel can
> easily keep track of incoming and outgoing sessions separately, and
> there is not need to "transition" and session from incoming to
> outgoing.
Yes, good idea. I was thinking of recycling finished and empty
sessions, but that would only add complications.
Pasha
* Re: [PATCH v4 14/30] liveupdate: luo_session: Add ioctls for file preservation and state management
From: Samiullah Khawaja @ 2025-10-29 22:00 UTC (permalink / raw)
To: Pasha Tatashin
Cc: Pratyush Yadav, jasonmiu, graf, changyuanl, rppt, dmatlack,
rientjes, corbet, rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl,
masahiroy, akpm, tj, yoann.congal, mmaurer, roman.gushchin,
chenridong, axboe, mark.rutland, jannh, vincent.guittot, hannes,
dan.j.williams, david, joel.granados, rostedt, anna.schumaker,
song, zhangguopeng, linux, linux-kernel, linux-doc, linux-mm,
gregkh, tglx, mingo, bp, dave.hansen, x86, hpa, rafael, dakr,
bartosz.golaszewski, cw00.choi, myungjoo.ham, yesanishhere,
Jonathan.Cameron, quic_zijuhu, aleksander.lobakin, ira.weiny,
andriy.shevchenko, leon, lukas, bhelgaas, wagi, djeffery,
stuart.w.hayes, lennart, brauner, linux-api, linux-fsdevel,
saeedm, ajayachandra, jgg, parav, leonro, witu, hughd, chrisl,
steven.sistare
In-Reply-To: <CA+CK2bBVSX26TKwgLkXCDop5u3e9McH3sQMascT47ZwwrwraOw@mail.gmail.com>
On Wed, Oct 29, 2025 at 1:13 PM Pasha Tatashin
<pasha.tatashin@soleen.com> wrote:
>
> On Wed, Oct 29, 2025 at 3:07 PM Pratyush Yadav <pratyush@kernel.org> wrote:
> >
> > Hi Pasha,
> >
> > On Mon, Sep 29 2025, Pasha Tatashin wrote:
> >
> > > Introducing the userspace interface and internal logic required to
> > > manage the lifecycle of file descriptors within a session. Previously, a
> > > session was merely a container; this change makes it a functional
> > > management unit.
> > >
> > > The following capabilities are added:
> > >
> > > A new set of ioctl commands are added, which operate on the file
> > > descriptor returned by CREATE_SESSION. This allows userspace to:
> > > - LIVEUPDATE_SESSION_PRESERVE_FD: Add a file descriptor to a session
> > > to be preserved across the live update.
> > > - LIVEUPDATE_SESSION_UNPRESERVE_FD: Remove a previously added file
> > > descriptor from the session.
> > > - LIVEUPDATE_SESSION_RESTORE_FD: Retrieve a preserved file in the
> > > new kernel using its unique token.
> > >
> > > A state machine for each individual session, distinct from the global
> > > LUO state. This enables more granular control, allowing userspace to
> > > prepare or freeze specific sessions independently. This is managed via:
> > > - LIVEUPDATE_SESSION_SET_EVENT: An ioctl to send PREPARE, FREEZE,
> > > CANCEL, or FINISH events to a single session.
> > > - LIVEUPDATE_SESSION_GET_STATE: An ioctl to query the current state
> > > of a single session.
> > >
> > > The global subsystem callbacks (luo_session_prepare, luo_session_freeze)
> > > are updated to iterate through all existing sessions. They now trigger
> > > the appropriate per-session state transitions for any sessions that
> > > haven't already been transitioned individually by userspace.
> > >
> > > The session's .release handler is enhanced to be state-aware. When a
> > > session's file descriptor is closed, it now correctly cancels or
> > > finishes the session based on its current state before freeing all
> > > associated file resources, preventing resource leaks.
> > >
> > > Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
> > [...]
> > > +/**
> > > + * struct liveupdate_session_get_state - ioctl(LIVEUPDATE_SESSION_GET_STATE)
> > > + * @size: Input; sizeof(struct liveupdate_session_get_state)
> > > + * @incoming: Input; If 1, query the state of a restored file from the incoming
> > > + * (previous kernel's) set. If 0, query a file being prepared for
> > > + * preservation in the current set.
> >
> > Spotted this when working on updating my test suite for LUO. This seems
> > to be a leftover from a previous version. I don't see it being used
> > anywhere in the code.
>
> thank you will remove this.
>
> > Also, I think the model we should have is to only allow new sessions in
> > normal state. Currently luo_session_create() allows creating a new
> > session in updated state. This would end up mixing sessions from a
> > previous boot and sessions from current boot. I don't really see a
> > reason for that and I think the userspace should first call finish
> > before starting new serialization. Keeps things simpler.
>
> It does. However, yesterday Jason Gunthorpe suggested that we simplify
> the uapi, at least for the initial landing, by removing the state
> machine during boot and allowing new sessions to be created at any
> time. This would also mean separating the incoming and outgoing
> sessions and removing the ioctl() call used to bring the machine into
> a normal state; instead, only individual sessions could be brought
> into a 'normal' state.
>
> Simplified uAPI Proposal
> The simplest uAPI would look like this:
> IOCTLs on /dev/liveupdate (to create and retrieve session FDs):
> LIVEUPDATE_IOCTL_CREATE_SESSION
> LIVEUPDATE_IOCTL_RETRIEVE_SESSION
>
> IOCTLs on session FDs:
> LIVEUPDATE_CMD_SESSION_PRESERVE_FD
> LIVEUPDATE_CMD_SESSION_RETRIEVE_FD
> LIVEUPDATE_CMD_SESSION_FINISH
>
> Happy Path
> The happy path would look like this:
> - luod creates a session with a specific name and passes it to the vmm.
> - The vmm preserves FDs in a specific order: memfd, iommufd, vfiofd.
> (If the order is wrong, the preserve callbacks will fail.)
> - A reboot(KEXEC) is performed.
> - Each session receives a freeze() callback to notify it that
> mutations are no longer possible.
> - During boot, liveupdate_fh_global_state_get(&h, &obj) can be used to
> retrieve the global state.
> - Once the machine has booted, luod retrieves the incoming sessions
> and passes them to the vmms.
> - The vmm retrieves the FDs from the session and performs the
> necessary IOCTLs on them.
> - The vmm calls LIVEUPDATE_CMD_SESSION_FINISH on the session. Each FD
> receives a finish() callback in LIFO order.
> - If everything succeeds, the session becomes an empty "outgoing"
> session. It can then be closed and discarded or reused for the next
> live update by preserving new FDs into it.
> - Once the last FD for a file-handler is finished,
> h->ops->global_state_finish(h, h->global_state_obj) is called to
> finish the incoming global state.
>
> Unhappy Paths
> - If an outgoing session FD is closed, each FD in that session
> receives an unpreserve callback in LIFO order.
> - If the last FD for a global state is unpreserved,
> h->ops->global_state_unpreserve(h, h->global_state_obj) is called.
> - If freeze() fails, a cancel() is performed on each FD that received
> freeze() cb, and reboot(KEXEC) returns a failure.
nit: Maybe we can rename cancel to unfreeze, so that it matches preserve/unpreserve?
> - If an incoming session FD is closed, the resources are considered
> "leaked." They are discarded only during the next live-update; this is
> intended to prevent implementing rare and untested clean-up code.
I am assuming the preserved folios will become unpreserved during
shutdown and in the next kernel those folios are free.
> - If a user tries to finish a session and it fails, it is considered
> the user's problem. This might happen because some IOCTLs still need
> to be run on the retrieved FDs to bring them to a state where finish
> is possible.
Sounds great.
>
> This would also mean that subsystems would not be needed, leaving only
> FLB (File-Lifecycle-Bound Global State) to use as a handle for global
> state. The API I am proposing for FLB keeps the same global state for
> a single file-handler type. However, HugeTLB might have multiple file
> handlers, so the API would need to be extended slightly to support
> this case. Multiple file handlers will share the same global resource
> with the same callbacks.
>
> Pasha
>
> > > + * @reserved: Must be zero.
> > > + * @state: Output; The live update state of this FD.
> > > + *
> > > + * Query the current live update state of a specific preserved file descriptor.
> > > + *
> > > + * - %LIVEUPDATE_STATE_NORMAL: Default state
> > > + * - %LIVEUPDATE_STATE_PREPARED: Prepare callback has been performed on this FD.
> > > + * - %LIVEUPDATE_STATE_FROZEN: Freeze callback ahs been performed on this FD.
> > > + * - %LIVEUPDATE_STATE_UPDATED: The system has successfully rebooted into the
> > > + * new kernel.
> > > + *
> > > + * See the definition of &enum liveupdate_state for more details on each state.
> > > + *
> > > + * Return: 0 on success, negative error code on failure.
> > > + */
> > > +struct liveupdate_session_get_state {
> > > + __u32 size;
> > > + __u8 incoming;
> > > + __u8 reserved[3];
> > > + __u32 state;
> > > +};
> > > +
> > > +#define LIVEUPDATE_SESSION_GET_STATE \
> > > + _IO(LIVEUPDATE_IOCTL_TYPE, LIVEUPDATE_CMD_SESSION_GET_STATE)
> > [...]
> >
> > --
> > Regards,
> > Pratyush Yadav