Linux userland API discussions

* Re: Safety of resolving untrusted paths with detached mount dirfd
From: Aleksa Sarai @ 2025-11-20  9:24 UTC
  To: Demi Marie Obenour
  Cc: Alyssa Ross, linux-fsdevel, Jann Horn, Eric W. Biederman, jlayton,
	Bruce Fields, Al Viro, Arnd Bergmann, shuah, David Howells,
	Andy Lutomirski, Christian Brauner, Tycho Andersen, linux-kernel,
	linux-api
In-Reply-To: <cdf9deb2-7a09-48c5-97e2-2ea6d5901882@gmail.com>


On 2025-11-19, Demi Marie Obenour <demiobenour@gmail.com> wrote:
> On 11/19/25 21:18, Aleksa Sarai wrote:
> > On 2025-11-19, Alyssa Ross <hi@alyssa.is> wrote:
> >> Hello,
> >>
> >> As we know, it's not safe to use chroot() for resolving untrusted paths
> >> within some root, as a subdirectory could be moved outside of the
> >> process root while walking the path[1].  On the other hand,
> >> LOOKUP_BENEATH is supposed to be robust against this, and going by [2],
> >> it sounds like resolving with the mount namespace root as dirfd should
> >> also be.
> >>
> >> My question is: would resolving an untrusted path against a detached
> >> mount root dirfd opened with OPEN_TREE_CLONE (not necessarily a
> >> filesystem root) also be expected to be robust against traversal issues?
> >> i.e. can I rely on an untrusted path never resolving to a path that
> >> isn't under the mount root?
> > 
> > No, if you hit an absolute symlink or use an absolute path it will
> > resolve to your current->fs->root (mount namespace root or chroot).
> > However, OPEN_TREE_CLONE will stop ".." from naively stepping out of the
> > detached bind-mount. If you are dealing with procfs then magic-links can
> > also jump out.
> 
> Is using open_tree_attr() with MOUNT_ATTR_NOSYMFOLLOW enough to prevent
> these?  Will it still provide protection even if someone concurrently
> renames one of the files out from under the root?  I know that can
> escape a chroot, but I wonder if this provides more guarantees.

That will block symlinks (in a similar manner to RESOLVE_NO_SYMLINKS),
so those particular problems would not be an issue. Of course, a lot of
symlink usages are valid and so this will block those as well (back when
I wrote openat2 I did a cursory scan and something like 15% of system
paths contained symlinks on my system).

I think that ".." will not be a problem even with renames because the
detached mount is associated with the directory (just like how moving a
bind-mount source doesn't suddenly expose more information).

It also goes without saying that you need to make sure an absolute path
*never* gets passed to any of the helper functions you write to do this
-- in my view this is usually going to be quite a fragile setup. Who is
providing the paths to your program?

> https://github.com/QubesOS/qubes-secpack/blob/main/QSBs/qsb-014-2015.txt
> was the chroot breakout.
> 
> > You can always use RESOLVE_BENEATH or RESOLVE_IN_ROOT in combination
> > with OPEN_TREE_CLONE.
>
> Unfortunately not everything supports that.  For instance, mkdirat()
> doesn't.

You can openat2(RESOLVE_BENEATH) the parent directory and then mkdirat()
the final component (because mkdirat does not follow trailing symlinks).
This is what libpathrs[1] does, and it works for most *at() syscalls
(those that support AT_EMPTY_PATH are even easier).

[1]: https://github.com/cyphar/libpathrs
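
A minimal sketch of the parent-then-mkdirat() approach described above.
This is not libpathrs's actual code. The helper names split_parent() and
mkdir_beneath() are hypothetical, the fallback syscall number is an
assumption for headers that lack SYS_openat2, and the sketch refuses
absolute paths up front so they never reach the kernel.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <limits.h>
#include <linux/openat2.h>	/* struct open_how, RESOLVE_BENEATH */
#include <string.h>
#include <sys/stat.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef SYS_openat2
#define SYS_openat2 437		/* assumption: unified syscall number */
#endif

/*
 * Hypothetical helper: split "a/b/c" into parent "a/b" and leaf "c".
 * A path without a slash gets parent ".". Returns -1 for inputs this
 * sketch refuses: empty or absolute paths, trailing slashes, and "." or
 * ".." as the final component.
 */
static int split_parent(const char *path, char *parent, size_t len,
			const char **leaf)
{
	const char *slash = strrchr(path, '/');
	const char *name = slash ? slash + 1 : path;
	size_t plen = slash ? (size_t)(slash - path) : 1;

	if (path[0] == '\0' || path[0] == '/' || name[0] == '\0' ||
	    !strcmp(name, ".") || !strcmp(name, "..") || plen >= len)
		return -1;
	memcpy(parent, slash ? path : ".", plen);
	parent[plen] = '\0';
	*leaf = name;
	return 0;
}

/* Hypothetical helper: create untrusted relative `path` beneath `rootfd`. */
static int mkdir_beneath(int rootfd, const char *path, mode_t mode)
{
	struct open_how how = {
		.flags = O_PATH | O_DIRECTORY | O_CLOEXEC,
		.resolve = RESOLVE_BENEATH,
	};
	char parent[PATH_MAX];
	const char *leaf;
	int parentfd, ret, saved;

	if (split_parent(path, parent, sizeof(parent), &leaf) < 0) {
		errno = EINVAL;
		return -1;
	}
	/* Resolve only the parent; the kernel rejects any escape from rootfd. */
	parentfd = syscall(SYS_openat2, rootfd, parent, &how, sizeof(how));
	if (parentfd < 0)
		return -1;
	/* The final component is created directly in the resolved parent. */
	ret = mkdirat(parentfd, leaf, mode);
	saved = errno;
	close(parentfd);
	errno = saved;
	return ret;
}
```

A caller would use it as mkdir_beneath(rootfd, "a/b/newdir", 0700), where
rootfd is the directory (or detached mount) fd that acts as the root.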

-- 
Aleksa Sarai
https://www.cyphar.com/


* Re: [PATCH v4 1/3] init: remove deprecated "load_ramdisk" and "prompt_ramdisk" command line parameters
From: Arnd Bergmann @ 2025-11-20  8:20 UTC
  To: Askar Safin, linux-fsdevel, linux-kernel
  Cc: Linus Torvalds, Greg Kroah-Hartman, Christian Brauner,
	Alexander Viro, Jan Kara, Christoph Hellwig, Jens Axboe,
	Andy Shevchenko, Aleksa Sarai, Thomas Weißschuh,
	Julian Stecklina, Gao Xiang, Art Nikpal, Andrew Morton,
	Alexander Graf, Rob Landley, Lennart Poettering, Linux-Arch,
	linux-block, initramfs, linux-api, linux-doc, Michal Simek,
	Luis Chamberlain, Kees Cook, Thorsten Blum, Heiko Carstens,
	Dave Young, Christophe Leroy, Krzysztof Kozlowski,
	Borislav Petkov, Jessica Clarke, Nicolas Schichan,
	David Disseldorp, patches
In-Reply-To: <20251119222407.3333257-2-safinaskar@gmail.com>

On Wed, Nov 19, 2025, at 23:24, Askar Safin wrote:
> ...which do nothing. They were deprecated (in documentation) in
> 6b99e6e6aa62 ("Documentation/admin-guide: blockdev/ramdisk: remove use of
> "rdev"") in 2020 and in kernel messages in c8376994c86c ("initrd: remove
> support for multiple floppies") in 2020.
>
> Signed-off-by: Askar Safin <safinaskar@gmail.com>
> ---
>  Documentation/admin-guide/kernel-parameters.txt | 4 ----
>  arch/arm/configs/neponset_defconfig             | 2 +-

For the arm defconfig:

Acked-by: Arnd Bergmann <arnd@arndb.de>


* Re: Safety of resolving untrusted paths with detached mount dirfd
From: Demi Marie Obenour @ 2025-11-20  2:39 UTC
  To: Aleksa Sarai, Alyssa Ross
  Cc: linux-fsdevel, Jann Horn, Eric W. Biederman, jlayton,
	Bruce Fields, Al Viro, Arnd Bergmann, shuah, David Howells,
	Andy Lutomirski, Christian Brauner, Tycho Andersen, linux-kernel,
	linux-api
In-Reply-To: <2025-11-20-limber-salted-luncheon-scads-7AT044@cyphar.com>



On 11/19/25 21:18, Aleksa Sarai wrote:
> On 2025-11-19, Alyssa Ross <hi@alyssa.is> wrote:
>> Hello,
>>
>> As we know, it's not safe to use chroot() for resolving untrusted paths
>> within some root, as a subdirectory could be moved outside of the
>> process root while walking the path[1].  On the other hand,
>> LOOKUP_BENEATH is supposed to be robust against this, and going by [2],
>> it sounds like resolving with the mount namespace root as dirfd should
>> also be.
>>
>> My question is: would resolving an untrusted path against a detached
>> mount root dirfd opened with OPEN_TREE_CLONE (not necessarily a
>> filesystem root) also be expected to be robust against traversal issues?
>> i.e. can I rely on an untrusted path never resolving to a path that
>> isn't under the mount root?
> 
> No, if you hit an absolute symlink or use an absolute path it will
> resolve to your current->fs->root (mount namespace root or chroot).
> However, OPEN_TREE_CLONE will stop ".." from naively stepping out of the
> detached bind-mount. If you are dealing with procfs then magic-links can
> also jump out.

Is using open_tree_attr() with MOUNT_ATTR_NOSYMFOLLOW enough to prevent
these?  Will it still provide protection even if someone concurrently
renames one of the files out from under the root?  I know that can
escape a chroot, but I wonder if this provides more guarantees.
https://github.com/QubesOS/qubes-secpack/blob/main/QSBs/qsb-014-2015.txt
was the chroot breakout.

> You can always use RESOLVE_BENEATH or RESOLVE_IN_ROOT in combination
> with OPEN_TREE_CLONE.
Unfortunately not everything supports that.  For instance, mkdirat()
doesn't.
-- 
Sincerely,
Demi Marie Obenour (she/her/hers)


* Re: Safety of resolving untrusted paths with detached mount dirfd
From: Aleksa Sarai @ 2025-11-20  2:18 UTC
  To: Alyssa Ross
  Cc: linux-fsdevel, Demi Marie Obenour, Jann Horn, Eric W. Biederman,
	jlayton, Bruce Fields, Al Viro, Arnd Bergmann, shuah,
	David Howells, Andy Lutomirski, Christian Brauner, Tycho Andersen,
	linux-kernel, linux-api
In-Reply-To: <87cy5eqgn8.fsf@alyssa.is>


On 2025-11-19, Alyssa Ross <hi@alyssa.is> wrote:
> Hello,
> 
> As we know, it's not safe to use chroot() for resolving untrusted paths
> within some root, as a subdirectory could be moved outside of the
> process root while walking the path[1].  On the other hand,
> LOOKUP_BENEATH is supposed to be robust against this, and going by [2],
> it sounds like resolving with the mount namespace root as dirfd should
> also be.
> 
> My question is: would resolving an untrusted path against a detached
> mount root dirfd opened with OPEN_TREE_CLONE (not necessarily a
> filesystem root) also be expected to be robust against traversal issues?
> i.e. can I rely on an untrusted path never resolving to a path that
> isn't under the mount root?

No, if you hit an absolute symlink or use an absolute path it will
resolve to your current->fs->root (mount namespace root or chroot).
However, OPEN_TREE_CLONE will stop ".." from naively stepping out of the
detached bind-mount. If you are dealing with procfs then magic-links can
also jump out.

You can always use RESOLVE_BENEATH or RESOLVE_IN_ROOT in combination
with OPEN_TREE_CLONE.
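
A rough sketch of that combination, under stated assumptions: the helper
names clone_tree() and open_in_root() are hypothetical, the raw syscall
numbers are fallbacks for older headers, and OPEN_TREE_CLONE needs
CAP_SYS_ADMIN. RESOLVE_NO_MAGICLINKS is spelled out only to make the
procfs magic-link restriction explicit.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/mount.h>	/* OPEN_TREE_CLONE */
#include <linux/openat2.h>	/* struct open_how, RESOLVE_IN_ROOT */
#include <sys/syscall.h>
#include <unistd.h>

#ifndef SYS_open_tree
#define SYS_open_tree 428	/* assumption: unified syscall number */
#endif
#ifndef SYS_openat2
#define SYS_openat2 437		/* assumption: unified syscall number */
#endif
#ifndef OPEN_TREE_CLOEXEC
#define OPEN_TREE_CLOEXEC O_CLOEXEC
#endif

/* Hypothetical helper: detached bind-mount clone of `dir`. */
static int clone_tree(const char *dir)
{
	return syscall(SYS_open_tree, AT_FDCWD, dir,
		       OPEN_TREE_CLONE | OPEN_TREE_CLOEXEC);
}

/*
 * Hypothetical helper: open untrusted `path` read-only, with `rootfd`
 * acting as "/" for absolute paths, absolute symlinks and "..".
 */
static int open_in_root(int rootfd, const char *path)
{
	struct open_how how = {
		.flags = O_RDONLY | O_CLOEXEC,
		.resolve = RESOLVE_IN_ROOT | RESOLVE_NO_MAGICLINKS,
	};

	return syscall(SYS_openat2, rootfd, path, &how, sizeof(how));
}
```

With this, open_in_root(clone_tree("/srv/untrusted"), "/etc/passwd")
would look up etc/passwd inside the cloned tree rather than in the
caller's root; the directory name is only an example.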

-- 
Aleksa Sarai
https://www.cyphar.com/


* [PATCH v4 3/3] init: remove /proc/sys/kernel/real-root-dev
From: Askar Safin @ 2025-11-19 22:24 UTC
  To: linux-fsdevel, linux-kernel
  Cc: Linus Torvalds, Greg Kroah-Hartman, Christian Brauner, Al Viro,
	Jan Kara, Christoph Hellwig, Jens Axboe, Andy Shevchenko,
	Aleksa Sarai, Thomas Weißschuh, Julian Stecklina, Gao Xiang,
	Art Nikpal, Andrew Morton, Alexander Graf, Rob Landley,
	Lennart Poettering, linux-arch, linux-block, initramfs, linux-api,
	linux-doc, Michal Simek, Luis Chamberlain, Kees Cook,
	Thorsten Blum, Heiko Carstens, Arnd Bergmann, Dave Young,
	Christophe Leroy, Krzysztof Kozlowski, Borislav Petkov,
	Jessica Clarke, Nicolas Schichan, David Disseldorp, patches
In-Reply-To: <20251119222407.3333257-1-safinaskar@gmail.com>

It is not used anymore.

Signed-off-by: Askar Safin <safinaskar@gmail.com>
---
 Documentation/admin-guide/sysctl/kernel.rst |  6 ------
 include/uapi/linux/sysctl.h                 |  1 -
 init/do_mounts_initrd.c                     | 20 --------------------
 3 files changed, 27 deletions(-)

diff --git a/Documentation/admin-guide/sysctl/kernel.rst b/Documentation/admin-guide/sysctl/kernel.rst
index f3ee807b5d8b..218265babaf9 100644
--- a/Documentation/admin-guide/sysctl/kernel.rst
+++ b/Documentation/admin-guide/sysctl/kernel.rst
@@ -1215,12 +1215,6 @@ that support this feature.
 ==  ===========================================================================
 
 
-real-root-dev
-=============
-
-See Documentation/admin-guide/initrd.rst.
-
-
 reboot-cmd (SPARC only)
 =======================
 
diff --git a/include/uapi/linux/sysctl.h b/include/uapi/linux/sysctl.h
index 63d1464cb71c..1c7fe0f4dca4 100644
--- a/include/uapi/linux/sysctl.h
+++ b/include/uapi/linux/sysctl.h
@@ -92,7 +92,6 @@ enum
 	KERN_DOMAINNAME=8,	/* string: domainname */
 
 	KERN_PANIC=15,		/* int: panic timeout */
-	KERN_REALROOTDEV=16,	/* real root device to mount after initrd */
 
 	KERN_SPARC_REBOOT=21,	/* reboot command on Sparc */
 	KERN_CTLALTDEL=22,	/* int: allow ctl-alt-del to reboot */
diff --git a/init/do_mounts_initrd.c b/init/do_mounts_initrd.c
index fe335dbc95e0..892e69ab41c4 100644
--- a/init/do_mounts_initrd.c
+++ b/init/do_mounts_initrd.c
@@ -8,31 +8,11 @@
 
 unsigned long initrd_start, initrd_end;
 int initrd_below_start_ok;
-static unsigned int real_root_dev;	/* do_proc_dointvec cannot handle kdev_t */
 static int __initdata mount_initrd = 1;
 
 phys_addr_t phys_initrd_start __initdata;
 unsigned long phys_initrd_size __initdata;
 
-#ifdef CONFIG_SYSCTL
-static const struct ctl_table kern_do_mounts_initrd_table[] = {
-	{
-		.procname       = "real-root-dev",
-		.data           = &real_root_dev,
-		.maxlen         = sizeof(int),
-		.mode           = 0644,
-		.proc_handler   = proc_dointvec,
-	},
-};
-
-static __init int kernel_do_mounts_initrd_sysctls_init(void)
-{
-	register_sysctl_init("kernel", kern_do_mounts_initrd_table);
-	return 0;
-}
-late_initcall(kernel_do_mounts_initrd_sysctls_init);
-#endif /* CONFIG_SYSCTL */
-
 static int __init no_initrd(char *str)
 {
 	pr_warn("noinitrd option is deprecated and will be removed soon\n");
-- 
2.47.3



* [PATCH v4 2/3] initrd: remove deprecated code path (linuxrc)
From: Askar Safin @ 2025-11-19 22:24 UTC
  To: linux-fsdevel, linux-kernel
  Cc: Linus Torvalds, Greg Kroah-Hartman, Christian Brauner, Al Viro,
	Jan Kara, Christoph Hellwig, Jens Axboe, Andy Shevchenko,
	Aleksa Sarai, Thomas Weißschuh, Julian Stecklina, Gao Xiang,
	Art Nikpal, Andrew Morton, Alexander Graf, Rob Landley,
	Lennart Poettering, linux-arch, linux-block, initramfs, linux-api,
	linux-doc, Michal Simek, Luis Chamberlain, Kees Cook,
	Thorsten Blum, Heiko Carstens, Arnd Bergmann, Dave Young,
	Christophe Leroy, Krzysztof Kozlowski, Borislav Petkov,
	Jessica Clarke, Nicolas Schichan, David Disseldorp, patches
In-Reply-To: <20251119222407.3333257-1-safinaskar@gmail.com>

Remove the linuxrc initrd code path, which was deprecated in 2020.

Initramfs and (non-initial) RAM disks (i.e. brd) still work.

Both built-in and bootloader-supplied initramfs still work.

The non-linuxrc initrd code path (i.e. using /dev/ram as the final root
filesystem) still works, but I added a deprecation message to it.

I also deprecate the command line parameters "noinitrd" and "ramdisk_start=".

Signed-off-by: Askar Safin <safinaskar@gmail.com>
---
 .../admin-guide/kernel-parameters.txt         |  8 +-
 fs/init.c                                     | 14 ---
 include/linux/init_syscalls.h                 |  1 -
 include/linux/initrd.h                        |  2 -
 init/do_mounts.c                              |  4 +-
 init/do_mounts.h                              | 18 +---
 init/do_mounts_initrd.c                       | 87 ++-----------------
 init/do_mounts_rd.c                           | 17 +---
 8 files changed, 22 insertions(+), 129 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 15af6933eab4..df441d1a9555 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -4324,8 +4324,10 @@
 			Note that this argument takes precedence over
 			the CONFIG_RCU_NOCB_CPU_DEFAULT_ALL option.
 
-	noinitrd	[RAM] Tells the kernel not to load any configured
-			initial RAM disk.
+	noinitrd	[Deprecated,RAM] Tells the kernel not to load any configured
+			initial RAM disk. Currently this parameter applies to
+			initrd only, not to initramfs. But it applies to both
+			in EFI mode.
 
 	nointremap	[X86-64,Intel-IOMMU,EARLY] Do not enable interrupt
 			remapping.
@@ -5338,7 +5340,7 @@
 	ramdisk_size=	[RAM] Sizes of RAM disks in kilobytes
 			See Documentation/admin-guide/blockdev/ramdisk.rst.
 
-	ramdisk_start=	[RAM] RAM disk image start address
+	ramdisk_start=	[Deprecated,RAM] RAM disk image start address
 
 	random.trust_cpu=off
 			[KNL,EARLY] Disable trusting the use of the CPU's
diff --git a/fs/init.c b/fs/init.c
index 07f592ccdba8..60719494d9a0 100644
--- a/fs/init.c
+++ b/fs/init.c
@@ -27,20 +27,6 @@ int __init init_mount(const char *dev_name, const char *dir_name,
 	return ret;
 }
 
-int __init init_umount(const char *name, int flags)
-{
-	int lookup_flags = LOOKUP_MOUNTPOINT;
-	struct path path;
-	int ret;
-
-	if (!(flags & UMOUNT_NOFOLLOW))
-		lookup_flags |= LOOKUP_FOLLOW;
-	ret = kern_path(name, lookup_flags, &path);
-	if (ret)
-		return ret;
-	return path_umount(&path, flags);
-}
-
 int __init init_chdir(const char *filename)
 {
 	struct path path;
diff --git a/include/linux/init_syscalls.h b/include/linux/init_syscalls.h
index 92045d18cbfc..0bdbc458a881 100644
--- a/include/linux/init_syscalls.h
+++ b/include/linux/init_syscalls.h
@@ -2,7 +2,6 @@
 
 int __init init_mount(const char *dev_name, const char *dir_name,
 		const char *type_page, unsigned long flags, void *data_page);
-int __init init_umount(const char *name, int flags);
 int __init init_chdir(const char *filename);
 int __init init_chroot(const char *filename);
 int __init init_chown(const char *filename, uid_t user, gid_t group, int flags);
diff --git a/include/linux/initrd.h b/include/linux/initrd.h
index f1a1f4c92ded..7e5d26c8136f 100644
--- a/include/linux/initrd.h
+++ b/include/linux/initrd.h
@@ -3,8 +3,6 @@
 #ifndef __LINUX_INITRD_H
 #define __LINUX_INITRD_H
 
-#define INITRD_MINOR 250 /* shouldn't collide with /dev/ram* too soon ... */
-
 /* starting block # of image */
 extern int rd_image_start;
 
diff --git a/init/do_mounts.c b/init/do_mounts.c
index 0f2f44e6250c..1054ad3c905a 100644
--- a/init/do_mounts.c
+++ b/init/do_mounts.c
@@ -476,13 +476,11 @@ void __init prepare_namespace(void)
 	if (saved_root_name[0])
 		ROOT_DEV = parse_root_device(saved_root_name);
 
-	if (initrd_load(saved_root_name))
-		goto out;
+	initrd_load();
 
 	if (root_wait)
 		wait_for_root(saved_root_name);
 	mount_root(saved_root_name);
-out:
 	devtmpfs_mount();
 	init_mount(".", "/", NULL, MS_MOVE, NULL);
 	init_chroot(".");
diff --git a/init/do_mounts.h b/init/do_mounts.h
index 6069ea3eb80d..a386ee5314c9 100644
--- a/init/do_mounts.h
+++ b/init/do_mounts.h
@@ -23,25 +23,15 @@ static inline __init int create_dev(char *name, dev_t dev)
 }
 
 #ifdef CONFIG_BLK_DEV_RAM
-
-int __init rd_load_disk(int n);
-int __init rd_load_image(char *from);
-
+int __init rd_load_image(void);
 #else
-
-static inline int rd_load_disk(int n) { return 0; }
-static inline int rd_load_image(char *from) { return 0; }
-
+static inline int rd_load_image(void) { return 0; }
 #endif
 
 #ifdef CONFIG_BLK_DEV_INITRD
-bool __init initrd_load(char *root_device_name);
+void __init initrd_load(void);
 #else
-static inline bool initrd_load(char *root_device_name)
-{
-	return false;
-	}
-
+static inline void initrd_load(void) { }
 #endif
 
 /* Ensure that async file closing finished to prevent spurious errors. */
diff --git a/init/do_mounts_initrd.c b/init/do_mounts_initrd.c
index f6867bad0d78..fe335dbc95e0 100644
--- a/init/do_mounts_initrd.c
+++ b/init/do_mounts_initrd.c
@@ -2,13 +2,7 @@
 #include <linux/unistd.h>
 #include <linux/kernel.h>
 #include <linux/fs.h>
-#include <linux/minix_fs.h>
-#include <linux/romfs_fs.h>
 #include <linux/initrd.h>
-#include <linux/sched.h>
-#include <linux/freezer.h>
-#include <linux/kmod.h>
-#include <uapi/linux/mount.h>
 
 #include "do_mounts.h"
 
@@ -41,6 +35,7 @@ late_initcall(kernel_do_mounts_initrd_sysctls_init);
 
 static int __init no_initrd(char *str)
 {
+	pr_warn("noinitrd option is deprecated and will be removed soon\n");
 	mount_initrd = 0;
 	return 1;
 }
@@ -70,85 +65,19 @@ static int __init early_initrd(char *p)
 }
 early_param("initrd", early_initrd);
 
-static int __init init_linuxrc(struct subprocess_info *info, struct cred *new)
-{
-	ksys_unshare(CLONE_FS | CLONE_FILES);
-	console_on_rootfs();
-	/* move initrd over / and chdir/chroot in initrd root */
-	init_chdir("/root");
-	init_mount(".", "/", NULL, MS_MOVE, NULL);
-	init_chroot(".");
-	ksys_setsid();
-	return 0;
-}
-
-static void __init handle_initrd(char *root_device_name)
-{
-	struct subprocess_info *info;
-	static char *argv[] = { "linuxrc", NULL, };
-	extern char *envp_init[];
-	int error;
-
-	pr_warn("using deprecated initrd support, will be removed soon.\n");
-
-	real_root_dev = new_encode_dev(ROOT_DEV);
-	create_dev("/dev/root.old", Root_RAM0);
-	/* mount initrd on rootfs' /root */
-	mount_root_generic("/dev/root.old", root_device_name,
-			   root_mountflags & ~MS_RDONLY);
-	init_mkdir("/old", 0700);
-	init_chdir("/old");
-
-	info = call_usermodehelper_setup("/linuxrc", argv, envp_init,
-					 GFP_KERNEL, init_linuxrc, NULL, NULL);
-	if (!info)
-		return;
-	call_usermodehelper_exec(info, UMH_WAIT_PROC|UMH_FREEZABLE);
-
-	/* move initrd to rootfs' /old */
-	init_mount("..", ".", NULL, MS_MOVE, NULL);
-	/* switch root and cwd back to / of rootfs */
-	init_chroot("..");
-
-	if (new_decode_dev(real_root_dev) == Root_RAM0) {
-		init_chdir("/old");
-		return;
-	}
-
-	init_chdir("/");
-	ROOT_DEV = new_decode_dev(real_root_dev);
-	mount_root(root_device_name);
-
-	printk(KERN_NOTICE "Trying to move old root to /initrd ... ");
-	error = init_mount("/old", "/root/initrd", NULL, MS_MOVE, NULL);
-	if (!error)
-		printk("okay\n");
-	else {
-		if (error == -ENOENT)
-			printk("/initrd does not exist. Ignored.\n");
-		else
-			printk("failed\n");
-		printk(KERN_NOTICE "Unmounting old root\n");
-		init_umount("/old", MNT_DETACH);
-	}
-}
-
-bool __init initrd_load(char *root_device_name)
+void __init initrd_load(void)
 {
 	if (mount_initrd) {
 		create_dev("/dev/ram", Root_RAM0);
 		/*
-		 * Load the initrd data into /dev/ram0. Execute it as initrd
-		 * unless /dev/ram0 is supposed to be our actual root device,
-		 * in that case the ram disk is just set up here, and gets
-		 * mounted in the normal path.
+		 * Load the initrd data into /dev/ram0.
 		 */
-		if (rd_load_image("/initrd.image") && ROOT_DEV != Root_RAM0) {
-			init_unlink("/initrd.image");
-			handle_initrd(root_device_name);
-			return true;
+		if (rd_load_image()) {
+			pr_warn("using deprecated initrd support, will be removed in January 2027; "
+				"use initramfs instead or (as a last resort) /sys/firmware/initrd; "
+				"see section \"Workaround\" in "
+				"https://lore.kernel.org/lkml/20251010094047.3111495-1-safinaskar@gmail.com\n");
 		}
 	}
 	init_unlink("/initrd.image");
-	return false;
 }
diff --git a/init/do_mounts_rd.c b/init/do_mounts_rd.c
index 5311f2d7edc8..0a021bbcd501 100644
--- a/init/do_mounts_rd.c
+++ b/init/do_mounts_rd.c
@@ -22,6 +22,7 @@ int __initdata rd_image_start;		/* starting block # of image */
 
 static int __init ramdisk_start_setup(char *str)
 {
+	pr_warn("ramdisk_start= option is deprecated and will be removed soon\n");
 	rd_image_start = simple_strtol(str,NULL,0);
 	return 1;
 }
@@ -177,7 +178,7 @@ static unsigned long nr_blocks(struct file *file)
 	return i_size_read(inode) >> 10;
 }
 
-int __init rd_load_image(char *from)
+int __init rd_load_image(void)
 {
 	int res = 0;
 	unsigned long rd_blocks, devblocks, nr_disks;
@@ -191,7 +192,7 @@ int __init rd_load_image(char *from)
 	if (IS_ERR(out_file))
 		goto out;
 
-	in_file = filp_open(from, O_RDONLY, 0);
+	in_file = filp_open("/initrd.image", O_RDONLY, 0);
 	if (IS_ERR(in_file))
 		goto noclose_input;
 
@@ -220,10 +221,7 @@ int __init rd_load_image(char *from)
 	/*
 	 * OK, time to copy in the data
 	 */
-	if (strcmp(from, "/initrd.image") == 0)
-		devblocks = nblocks;
-	else
-		devblocks = nr_blocks(in_file);
+	devblocks = nblocks;
 
 	if (devblocks == 0) {
 		printk(KERN_ERR "RAMDISK: could not determine device size\n");
@@ -267,13 +265,6 @@ int __init rd_load_image(char *from)
 	return res;
 }
 
-int __init rd_load_disk(int n)
-{
-	create_dev("/dev/root", ROOT_DEV);
-	create_dev("/dev/ram", MKDEV(RAMDISK_MAJOR, n));
-	return rd_load_image("/dev/root");
-}
-
 static int exit_code;
 static int decompress_error;
 
-- 
2.47.3



* [PATCH v4 1/3] init: remove deprecated "load_ramdisk" and "prompt_ramdisk" command line parameters
From: Askar Safin @ 2025-11-19 22:24 UTC
  To: linux-fsdevel, linux-kernel
  Cc: Linus Torvalds, Greg Kroah-Hartman, Christian Brauner, Al Viro,
	Jan Kara, Christoph Hellwig, Jens Axboe, Andy Shevchenko,
	Aleksa Sarai, Thomas Weißschuh, Julian Stecklina, Gao Xiang,
	Art Nikpal, Andrew Morton, Alexander Graf, Rob Landley,
	Lennart Poettering, linux-arch, linux-block, initramfs, linux-api,
	linux-doc, Michal Simek, Luis Chamberlain, Kees Cook,
	Thorsten Blum, Heiko Carstens, Arnd Bergmann, Dave Young,
	Christophe Leroy, Krzysztof Kozlowski, Borislav Petkov,
	Jessica Clarke, Nicolas Schichan, David Disseldorp, patches
In-Reply-To: <20251119222407.3333257-1-safinaskar@gmail.com>

...which do nothing. They were deprecated (in documentation) in
6b99e6e6aa62 ("Documentation/admin-guide: blockdev/ramdisk: remove use of
"rdev"") in 2020 and in kernel messages in c8376994c86c ("initrd: remove
support for multiple floppies") in 2020.

Signed-off-by: Askar Safin <safinaskar@gmail.com>
---
 Documentation/admin-guide/kernel-parameters.txt | 4 ----
 arch/arm/configs/neponset_defconfig             | 2 +-
 init/do_mounts.c                                | 7 -------
 init/do_mounts_rd.c                             | 7 -------
 4 files changed, 1 insertion(+), 19 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 6c42061ca20e..15af6933eab4 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -3319,8 +3319,6 @@
 			If there are multiple matching configurations changing
 			the same attribute, the last one is used.
 
-	load_ramdisk=	[RAM] [Deprecated]
-
 	lockd.nlm_grace_period=P  [NFS] Assign grace period.
 			Format: <integer>
 
@@ -5284,8 +5282,6 @@
 			Param: <number> - step/bucket size as a power of 2 for
 				statistical time based profiling.
 
-	prompt_ramdisk=	[RAM] [Deprecated]
-
 	prot_virt=	[S390] enable hosting protected virtual machines
 			isolated from the hypervisor (if hardware supports
 			that). If enabled, the default kernel base address
diff --git a/arch/arm/configs/neponset_defconfig b/arch/arm/configs/neponset_defconfig
index 2227f86100ad..4d720001c12e 100644
--- a/arch/arm/configs/neponset_defconfig
+++ b/arch/arm/configs/neponset_defconfig
@@ -9,7 +9,7 @@ CONFIG_ASSABET_NEPONSET=y
 CONFIG_ZBOOT_ROM_TEXT=0x80000
 CONFIG_ZBOOT_ROM_BSS=0xc1000000
 CONFIG_ZBOOT_ROM=y
-CONFIG_CMDLINE="console=ttySA0,38400n8 cpufreq=221200 rw root=/dev/mtdblock2 mtdparts=sa1100:512K(boot),1M(kernel),2560K(initrd),4M(root) load_ramdisk=1 prompt_ramdisk=0 mem=32M noinitrd initrd=0xc0800000,3M"
+CONFIG_CMDLINE="console=ttySA0,38400n8 cpufreq=221200 rw root=/dev/mtdblock2 mtdparts=sa1100:512K(boot),1M(kernel),2560K(initrd),4M(root) mem=32M noinitrd initrd=0xc0800000,3M"
 CONFIG_FPE_NWFPE=y
 CONFIG_PM=y
 CONFIG_MODULES=y
diff --git a/init/do_mounts.c b/init/do_mounts.c
index 6af29da8889e..0f2f44e6250c 100644
--- a/init/do_mounts.c
+++ b/init/do_mounts.c
@@ -34,13 +34,6 @@ static int root_wait;
 
 dev_t ROOT_DEV;
 
-static int __init load_ramdisk(char *str)
-{
-	pr_warn("ignoring the deprecated load_ramdisk= option\n");
-	return 1;
-}
-__setup("load_ramdisk=", load_ramdisk);
-
 static int __init readonly(char *str)
 {
 	if (*str)
diff --git a/init/do_mounts_rd.c b/init/do_mounts_rd.c
index 19d9f33dcacf..5311f2d7edc8 100644
--- a/init/do_mounts_rd.c
+++ b/init/do_mounts_rd.c
@@ -18,13 +18,6 @@
 static struct file *in_file, *out_file;
 static loff_t in_pos, out_pos;
 
-static int __init prompt_ramdisk(char *str)
-{
-	pr_warn("ignoring the deprecated prompt_ramdisk= option\n");
-	return 1;
-}
-__setup("prompt_ramdisk=", prompt_ramdisk);
-
 int __initdata rd_image_start;		/* starting block # of image */
 
 static int __init ramdisk_start_setup(char *str)
-- 
2.47.3



* [PATCH v4 0/3] initrd: remove half of classic initrd support
From: Askar Safin @ 2025-11-19 22:24 UTC
  To: linux-fsdevel, linux-kernel
  Cc: Linus Torvalds, Greg Kroah-Hartman, Christian Brauner, Al Viro,
	Jan Kara, Christoph Hellwig, Jens Axboe, Andy Shevchenko,
	Aleksa Sarai, Thomas Weißschuh, Julian Stecklina, Gao Xiang,
	Art Nikpal, Andrew Morton, Alexander Graf, Rob Landley,
	Lennart Poettering, linux-arch, linux-block, initramfs, linux-api,
	linux-doc, Michal Simek, Luis Chamberlain, Kees Cook,
	Thorsten Blum, Heiko Carstens, Arnd Bergmann, Dave Young,
	Christophe Leroy, Krzysztof Kozlowski, Borislav Petkov,
	Jessica Clarke, Nicolas Schichan, David Disseldorp, patches

This patchset will not affect anyone who showed up on these lists.
See [5] for details.

Intro
====
This patchset removes half of classic initrd (initial RAM disk) support,
i.e. the linuxrc code path, which was deprecated in 2020.
Initramfs stays, and the RAM disk itself (brd) stays.
The other half of initrd stays, too.
init/do_mounts* are listed in the VFS entry in
MAINTAINERS, so I think this patchset should go through the VFS tree.
I tested the patchset on 8 (!!!) archs in Qemu (see details below).
If you still use initrd, see below for a workaround.

In 2020 a deprecation notice was added to the linuxrc initrd code path.
In v1 I tried to remove initrd
fully, but Nicolas Schichan reported that he still uses
the other code path (the root=/dev/ram0 one) on a million devices [4].
The root=/dev/ram0 code path did not contain a deprecation notice.

So, in this version of the patchset I remove the deprecated code path,
i.e. the linuxrc one, while keeping the other, i.e. the root=/dev/ram0 one.

I also add a deprecation notice to the remaining code path, i.e. the
root=/dev/ram0 one. I plan to send patches for full removal
of initrd after one year, i.e. in January 2027 (of course,
initramfs will still work).

Also, I tried to keep this patchset small so that it
can be reverted easily. I plan to send cleanups later.

Details
====
Other user-visible changes:

- Removed kernel command line parameters "load_ramdisk" and
"prompt_ramdisk", which did nothing and were deprecated
- Removed /proc/sys/kernel/real-root-dev . It was used
for initrd only
- Command line parameters "noinitrd" and "ramdisk_start=" are deprecated

Testing
====
I tested my patchset on many architectures in Qemu using my Rust
program, heavily based on mkroot [1].

I used the following cross-compilers:

aarch64-linux-musleabi
armv4l-linux-musleabihf
armv5l-linux-musleabihf
armv7l-linux-musleabihf
i486-linux-musl
i686-linux-musl
mips-linux-musl
mips64-linux-musl
mipsel-linux-musl
powerpc-linux-musl
powerpc64-linux-musl
powerpc64le-linux-musl
riscv32-linux-musl
riscv64-linux-musl
s390x-linux-musl
sh4-linux-musl
sh4eb-linux-musl
x86_64-linux-musl

taken from this directory [2].

So, as you can see, there are 18 triplets, which correspond to 8 subdirs in arch/.

For every triplet I tested that:
- Initramfs still works (both builtin and external)
- Direct boot from disk still works
- Remaining initrd code path (root=/dev/ram0) still works

Workaround
====
If "retain_initrd" is passed to the kernel, then the initramfs/initrd
passed by the bootloader is retained and becomes available after boot
as the read-only magic file /sys/firmware/initrd [3].

No copies are involved, i.e. /sys/firmware/initrd is simply
a reference to the original blob passed by the bootloader.

This works even if the initrd/initramfs is not recognized by the kernel
in any way, i.e. even if it is neither a valid cpio archive nor
a fs image supported by classic initrd.

This works both with my patchset and without it.

This means that you can emulate classic initrd as follows:
link a builtin initramfs into the kernel; in /init in this initramfs,
copy /sys/firmware/initrd to some file in / and loop-mount it.
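
A minimal sketch of what such an /init could do, written in C. The helper
names copy_file() and loop_mount() are hypothetical and not part of this
patchset. The sketch assumes that sysfs is mounted on /sys, that devtmpfs
is mounted on /dev (for /dev/loop-control and /dev/loopN), and that
"retain_initrd" was passed on the kernel command line. It leaves the loop
device attached if mount() fails.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/loop.h>		/* LOOP_CTL_GET_FREE, LOOP_SET_FD */
#include <stdio.h>
#include <sys/ioctl.h>
#include <sys/mount.h>
#include <unistd.h>

/* Copy src to dst with plain read()/write(); returns 0 on success. */
static int copy_file(const char *src, const char *dst)
{
	char buf[65536];
	ssize_t n;
	int in, out, ret = -1;

	in = open(src, O_RDONLY | O_CLOEXEC);
	if (in < 0)
		return -1;
	out = open(dst, O_WRONLY | O_CREAT | O_TRUNC | O_CLOEXEC, 0600);
	if (out < 0)
		goto close_in;
	while ((n = read(in, buf, sizeof(buf))) > 0) {
		char *p = buf;

		while (n > 0) {		/* handle short writes */
			ssize_t w = write(out, p, n);

			if (w <= 0)
				goto close_out;
			p += w;
			n -= w;
		}
	}
	if (n == 0)			/* clean EOF, not a read error */
		ret = 0;
close_out:
	close(out);
close_in:
	close(in);
	return ret;
}

/* Attach image to a free loop device and mount it read-only on target. */
static int loop_mount(const char *image, const char *target,
		      const char *fstype)
{
	char dev[32];
	int ctl, img, loop, nr, ret = -1;

	ctl = open("/dev/loop-control", O_RDWR | O_CLOEXEC);
	if (ctl < 0)
		return -1;
	nr = ioctl(ctl, LOOP_CTL_GET_FREE);
	close(ctl);
	if (nr < 0)
		return -1;
	snprintf(dev, sizeof(dev), "/dev/loop%d", nr);

	img = open(image, O_RDONLY | O_CLOEXEC);
	if (img < 0)
		return -1;
	loop = open(dev, O_RDWR | O_CLOEXEC);
	if (loop < 0)
		goto close_img;
	if (ioctl(loop, LOOP_SET_FD, img) == 0)
		ret = mount(dev, target, fstype, MS_RDONLY, NULL);
	close(loop);
close_img:
	close(img);
	return ret;
}
```

An /init built on these helpers might call
copy_file("/sys/firmware/initrd", "/initrd.img") followed by
loop_mount("/initrd.img", "/mnt", "erofs"); the file name, mount point and
filesystem type here are only examples.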

This is even better than classic initrd, because:
- You can use fs not supported by classic initrd, for example erofs
- One copy is involved (from /sys/firmware/initrd to some file in /)
as opposed to two when using classic initrd

Still, I don't recommend this workaround, because
I want everyone to migrate to a proper modern initramfs.
It remains available if you need it.

Also, it is not possible to loop-mount /sys/firmware/initrd
directly. In theory the kernel could be changed to allow this
(and/or to make the file writable), but I think nobody needs this,
and I don't want to implement it.

On Qemu's -initrd and GRUB's initrd
====
Don't panic, this patchset doesn't remove initramfs
(which is used by nearly all Linux distros). And I don't
have plans to remove it.

Qemu's -initrd option and GRUB's initrd command refer
to the initrd bootloader mechanism, which is used to
load both initrd and (external) initramfs.

So, if you use Qemu's -initrd or GRUB's initrd,
then you are most likely using them to pass an initramfs, and thus
you are safe.

v1: https://lore.kernel.org/lkml/20250913003842.41944-1-safinaskar@gmail.com/

v1 -> v2 changes:
- A lot. I removed most patches, see cover letter for details

v2: https://lore.kernel.org/lkml/20251010094047.3111495-1-safinaskar@gmail.com/

v2 -> v3 changes:
- Commit messages
- Expanded docs for "noinitrd"
- Added link to /sys/firmware/initrd workaround to pr_warn

v3: https://lore.kernel.org/lkml/20251017060956.1151347-1-safinaskar@gmail.com/

v3 -> v4 changes:
- Changed "September 2026" to "January 2027" (i. e. after 2026 LTS release)

[1] https://github.com/landley/toybox/tree/master/mkroot
[2] https://landley.net/toybox/downloads/binaries/toolchains/latest
[3] https://lore.kernel.org/all/20231207235654.16622-1-graf@amazon.com/
[4] https://lore.kernel.org/lkml/20250918152830.438554-1-nschichan@freebox.fr/
[5] https://lore.kernel.org/lkml/20251022082604.25437-1-safinaskar@gmail.com/

Askar Safin (3):
  init: remove deprecated "load_ramdisk" and "prompt_ramdisk" command
    line parameters
  initrd: remove deprecated code path (linuxrc)
  init: remove /proc/sys/kernel/real-root-dev

 .../admin-guide/kernel-parameters.txt         |  12 +-
 Documentation/admin-guide/sysctl/kernel.rst   |   6 -
 arch/arm/configs/neponset_defconfig           |   2 +-
 fs/init.c                                     |  14 ---
 include/linux/init_syscalls.h                 |   1 -
 include/linux/initrd.h                        |   2 -
 include/uapi/linux/sysctl.h                   |   1 -
 init/do_mounts.c                              |  11 +-
 init/do_mounts.h                              |  18 +--
 init/do_mounts_initrd.c                       | 107 ++----------------
 init/do_mounts_rd.c                           |  24 +---
 11 files changed, 23 insertions(+), 175 deletions(-)

base-commit: 6a23ae0a96a600d1d12557add110e0bb6e32730c (v6.18-rc6)
-- 
2.47.3

^ permalink raw reply

* Re: [PATCH v6 18/20] selftests/liveupdate: Add kexec-based selftest for session lifecycle
From: Pasha Tatashin @ 2025-11-19 22:12 UTC (permalink / raw)
  To: David Matlack
  Cc: pratyush, jasonmiu, graf, rppt, rientjes, corbet, rdunlap,
	ilpo.jarvinen, kanie, ojeda, aliceryhl, masahiroy, akpm, tj,
	yoann.congal, mmaurer, roman.gushchin, chenridong, axboe,
	mark.rutland, jannh, vincent.guittot, hannes, dan.j.williams,
	david, joel.granados, rostedt, anna.schumaker, song, linux,
	linux-kernel, linux-doc, linux-mm, gregkh, tglx, mingo, bp,
	dave.hansen, x86, hpa, rafael, dakr, bartosz.golaszewski,
	cw00.choi, myungjoo.ham, yesanishhere, Jonathan.Cameron,
	quic_zijuhu, aleksander.lobakin, ira.weiny, andriy.shevchenko,
	leon, lukas, bhelgaas, wagi, djeffery, stuart.w.hayes, ptyadav,
	lennart, brauner, linux-api, linux-fsdevel, saeedm, ajayachandra,
	jgg, parav, leonro, witu, hughd, skhawaja, chrisl
In-Reply-To: <aR40oVOxZ-dezpy0@google.com>

On Wed, Nov 19, 2025 at 4:20 PM David Matlack <dmatlack@google.com> wrote:
>
> On 2025-11-15 06:34 PM, Pasha Tatashin wrote:
>
> > diff --git a/tools/testing/selftests/liveupdate/do_kexec.sh b/tools/testing/selftests/liveupdate/do_kexec.sh
> > new file mode 100755
> > index 000000000000..3c7c6cafbef8
> > --- /dev/null
> > +++ b/tools/testing/selftests/liveupdate/do_kexec.sh
> > @@ -0,0 +1,16 @@
> > +#!/bin/sh
> > +# SPDX-License-Identifier: GPL-2.0
> > +set -e
> > +
> > +# Use $KERNEL and $INITRAMFS to pass custom Kernel and optional initramfs
>
> It'd be nice to use proper command line options for KERNEL and INITRAMFS
> instead of relying on environment variables.

Now that tests and do_kexec are separate, I do not think we should
complicate do_kexec.sh to support every possible environment. On most
modern distros kexec is managed via systemd, and the load and reboot
commands are going to be handled through systemd. do_kexec.sh is meant
for a very simple environment, such as a busybox rootfs, to
perform selftests.

> e.g.
>
>   ./do_kexec.sh -k <kernel> -i <initramfs>
>
> > +
> > +KERNEL="${KERNEL:-/boot/bzImage}"
> > +set -- -l -s --reuse-cmdline "$KERNEL"
>
> I've observed --reuse-cmdline causing overload of the kernel command
> line when doing repeated kexecs, since it includes the built-in command
> line (CONFIG_CMDLINE) which then also gets added by the next kernel
> during boot.

There is a problem with CONFIG_CMDLINE + KEXEC; ideally, it should be
addressed in the kernel.

>
> Should we have something like this instead?
>
> diff --git a/tools/testing/selftests/liveupdate/do_kexec.sh b/tools/testing/selftests/liveupdate/do_kexec.sh
> index 3c7c6cafbef8..2590a870993d 100755
> --- a/tools/testing/selftests/liveupdate/do_kexec.sh
> +++ b/tools/testing/selftests/liveupdate/do_kexec.sh
> @@ -4,8 +4,16 @@ set -e
>
>  # Use $KERNEL and $INITRAMFS to pass custom Kernel and optional initramfs
>
> +# Determine the boot command line we need to pass to the kexec kernel.  Note
> +# that the kernel will append to it its builtin command line, so make sure we
> +# subtract the builtin command to avoid accumulating kernel parameters and
> +# eventually overflowing the command line.
> +full_cmdline=$(cat /proc/cmdline)
> +builtin_cmdline=$(zcat /proc/config.gz|grep CONFIG_CMDLINE=|cut -f2 -d\")

This also implies we have /proc/config.gz or CONFIG_IKCONFIG_PROC ...
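
To illustrate that dependency, here is a sketch of how the builtin command
line could be subtracted only when /proc/config.gz is readable, leaving the
command line untouched otherwise. It uses POSIX prefix/suffix stripping
instead of the ${var/pattern/} form, which is a bash extension and not
guaranteed under #!/bin/sh. The function names are hypothetical, and the
sketch assumes the builtin part shows up at the very start or very end of
/proc/cmdline.

```sh
# builtin_cmdline: print CONFIG_CMDLINE from /proc/config.gz, or nothing
# when the file is missing (i.e. CONFIG_IKCONFIG_PROC is not enabled).
builtin_cmdline() {
    [ -r /proc/config.gz ] || return 0
    zcat /proc/config.gz | sed -n 's/^CONFIG_CMDLINE="\(.*\)"$/\1/p'
}

# strip_builtin FULL BUILTIN: print FULL with BUILTIN removed when it is
# a leading or trailing part of FULL; print FULL unchanged otherwise.
strip_builtin() {
    sb_full=$1
    sb_built=$2
    if [ -n "$sb_built" ]; then
        case $sb_full in
            "$sb_built")    sb_full= ;;
            "$sb_built "*)  sb_full=${sb_full#"$sb_built "} ;;
            *" $sb_built")  sb_full=${sb_full%" $sb_built"} ;;
        esac
    fi
    printf '%s\n' "$sb_full"
}

# Usage in do_kexec.sh (sketch):
#   cmdline=$(strip_builtin "$(cat /proc/cmdline)" "$(builtin_cmdline)")
#   set -- -l -s --command-line="$cmdline" "$KERNEL"
```

Without /proc/config.gz this degrades to passing the current command line
unchanged, so the accumulation problem would remain on such kernels.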

> +cmdline=${full_cmdline/$builtin_cmdline /}
> +
>  KERNEL="${KERNEL:-/boot/bzImage}"
> -set -- -l -s --reuse-cmdline "$KERNEL"
> +set -- -l -s --command-line="${cmdline}" "$KERNEL"
>
>  INITRAMFS="${INITRAMFS:-/boot/initramfs}"
>  if [ -f "$INITRAMFS" ]; then
>
> > +
> > +INITRAMFS="${INITRAMFS:-/boot/initramfs}"
> > +if [ -f "$INITRAMFS" ]; then
> > +    set -- "$@" --initrd="$INITRAMFS"
> > +fi
> > +
> > +kexec "$@"
> > +kexec -e
>
> Consider separating the kexec load into its own script, in case systems have
> their own ways of shutting down for kexec.

I think that if do_kexec.sh does not work (load + reboot), the user should
use whatever the standard way to do kexec is on their distro.

>
> e.g. a kexec_load.sh script that does everything that do_kexec.sh does except
> the `kexec -e`. Then do_kexec.sh just calls kexec_load.sh and kexec -e.

^ permalink raw reply

* Re: [PATCH v6 15/20] mm: memfd_luo: allow preserving memfd
From: Pasha Tatashin @ 2025-11-19 21:56 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: pratyush, jasonmiu, graf, dmatlack, rientjes, corbet, rdunlap,
	ilpo.jarvinen, kanie, ojeda, aliceryhl, masahiroy, akpm, tj,
	yoann.congal, mmaurer, roman.gushchin, chenridong, axboe,
	mark.rutland, jannh, vincent.guittot, hannes, dan.j.williams,
	david, joel.granados, rostedt, anna.schumaker, song, linux,
	linux-kernel, linux-doc, linux-mm, gregkh, tglx, mingo, bp,
	dave.hansen, x86, hpa, rafael, dakr, bartosz.golaszewski,
	cw00.choi, myungjoo.ham, yesanishhere, Jonathan.Cameron,
	quic_zijuhu, aleksander.lobakin, ira.weiny, andriy.shevchenko,
	leon, lukas, bhelgaas, wagi, djeffery, stuart.w.hayes, ptyadav,
	lennart, brauner, linux-api, linux-fsdevel, saeedm, ajayachandra,
	jgg, parav, leonro, witu, hughd, skhawaja, chrisl
In-Reply-To: <aRsBHy5aQ_Ypyy9r@kernel.org>

On Mon, Nov 17, 2025 at 6:04 AM Mike Rapoport <rppt@kernel.org> wrote:
>
> On Sat, Nov 15, 2025 at 06:34:01PM -0500, Pasha Tatashin wrote:
> > From: Pratyush Yadav <ptyadav@amazon.de>
> >
> > The ability to preserve a memfd allows userspace to use KHO and LUO to
> > transfer its memory contents to the next kernel. This is useful in many
> > ways. For one, it can be used with IOMMUFD as the backing store for
> > IOMMU page tables. Preserving IOMMUFD is essential for performing a
> > hypervisor live update with passthrough devices. memfd support provides
> > the first building block for making that possible.
> >
> > For another, applications with a large amount of memory that takes time
> > to reconstruct, reboots to consume kernel upgrades can be very
> > expensive. memfd with LUO gives those applications reboot-persistent
> > memory that they can use to quickly save and reconstruct that state.
> >
> > While memfd is backed by either hugetlbfs or shmem, currently only
> > support on shmem is added. To be more precise, support for anonymous
> > shmem files is added.
> >
> > The handover to the next kernel is not transparent. All the properties
> > of the file are not preserved; only its memory contents, position, and
> > size. The recreated file gets the UID and GID of the task doing the
> > restore, and the task's cgroup gets charged with the memory.
> >
> > Once preserved, the file cannot grow or shrink, and all its pages are
> > pinned to avoid migrations and swapping. The file can still be read from
> > or written to.
> >
> > Use vmalloc to get the buffer to hold the folios, and preserve
> > it using kho_preserve_vmalloc(). This doesn't have the size limit.
> >
> > Co-developed-by: Pasha Tatashin <pasha.tatashin@soleen.com>
> > Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
> > Signed-off-by: Pratyush Yadav <ptyadav@amazon.de>
>
> The order of signed-offs seems wrong, Pasha's should be the last one.

Updated.


> > + * This interface is a contract. Any modification to the FDT structure,
> > + * node properties, compatible string, or the layout of the serialization
> > + * structures defined here constitutes a breaking change. Such changes require
> > + * incrementing the version number in the MEMFD_LUO_FH_COMPATIBLE string.
>
> The same comment about contract as for the generic LUO documentation
> applies here (https://lore.kernel.org/all/aRnG8wDSSAtkEI_z@kernel.org/)

Added.

>
> > + *
> > + * FDT Structure Overview:
> > + *   The memfd state is contained within a single FDT with the following layout:
>
> ...
>
> > +static struct memfd_luo_folio_ser *memfd_luo_preserve_folios(struct file *file, void *fdt,
> > +                                                          u64 *nr_foliosp)
> > +{
>
> If we are already returning nr_folios by reference, we might do it for
> memfd_luo_folio_ser as well and make the function return int.

Done

>
> > +     struct inode *inode = file_inode(file);
> > +     struct memfd_luo_folio_ser *pfolios;
> > +     struct kho_vmalloc *kho_vmalloc;
> > +     unsigned int max_folios;
> > +     long i, size, nr_pinned;
> > +     struct folio **folios;
>
> pfolios and folios read like the former is a pointer to latter.
> I'd s/pfolios/folios_ser/

Done

> > +     int err = -EINVAL;
> > +     pgoff_t offset;
> > +     u64 nr_folios;
>
> ...
>
> > +     kvfree(folios);
> > +     *nr_foliosp = nr_folios;
> > +     return pfolios;
> > +
> > +err_unpreserve:
> > +     i--;
> > +     for (; i >= 0; i--)
>
> Maybe a single line
>
>         for (--i; i >= 0; --i)

Done, but wrote it as:

        for (i = i - 1; i >= 0; i--)

which looks a little cleaner to me.

>
> > +             kho_unpreserve_folio(folios[i]);
> > +     vfree(pfolios);
> > +err_unpin:
> > +     unpin_folios(folios, nr_folios);
> > +err_free_folios:
> > +     kvfree(folios);
> > +     return ERR_PTR(err);
> > +}
> > +
> > +static void memfd_luo_unpreserve_folios(void *fdt, struct memfd_luo_folio_ser *pfolios,
> > +                                     u64 nr_folios)
> > +{
> > +     struct kho_vmalloc *kho_vmalloc;
> > +     long i;
> > +
> > +     if (!nr_folios)
> > +             return;
> > +
> > +     kho_vmalloc = (struct kho_vmalloc *)fdt_getprop(fdt, 0, MEMFD_FDT_FOLIOS, NULL);
> > +     /* The FDT was created by this kernel so expect it to be sane. */
> > +     WARN_ON_ONCE(!kho_vmalloc);
>
> The FDT won't have FOLIOS property if size was zero, will it?
> I think that if we add kho_vmalloc handle to struct memfd_luo_private and
> pass that around it will make things easier and simpler.

I am actually thinking of removing FDTs and using a versioned struct directly.

>
> > +     kho_unpreserve_vmalloc(kho_vmalloc);
> > +
> > +     for (i = 0; i < nr_folios; i++) {
> > +             const struct memfd_luo_folio_ser *pfolio = &pfolios[i];
> > +             struct folio *folio;
> > +
> > +             if (!pfolio->foliodesc)
> > +                     continue;
>
> How can this happen? Can pfolios be a sparse array?

With the current implementation of memfd_pin_folios, which populates
holes, this array will be dense. This check is defensive coding in
case we switch to a sparse preservation mechanism in the future. I
will add a comment and a WARN_ON_ONCE().

>
> > +             folio = pfn_folio(PRESERVED_FOLIO_PFN(pfolio->foliodesc));
> > +
> > +             kho_unpreserve_folio(folio);
> > +             unpin_folio(folio);
> > +     }
> > +
> > +     vfree(pfolios);
> > +}
>
> ...
>
> > +static void memfd_luo_finish(struct liveupdate_file_op_args *args)
> > +{
> > +     const struct memfd_luo_folio_ser *pfolios;
> > +     struct folio *fdt_folio;
> > +     const void *fdt;
> > +     u64 nr_folios;
> > +
> > +     if (args->retrieved)
> > +             return;
> > +
> > +     fdt_folio = memfd_luo_get_fdt(args->serialized_data);
> > +     if (!fdt_folio) {
> > +             pr_err("failed to restore memfd FDT\n");
> > +             return;
> > +     }
> > +
> > +     fdt = folio_address(fdt_folio);
> > +
> > +     pfolios = memfd_luo_fdt_folios(fdt, &nr_folios);
> > +     if (!pfolios)
> > +             goto out;
> > +
> > +     memfd_luo_discard_folios(pfolios, nr_folios);
>
> Does not this free the actual folios that were supposed to be preserved?

It does, when memfd was not reclaimed.

>
> > +     vfree(pfolios);
> > +
> > +out:
> > +     folio_put(fdt_folio);
> > +}
>
> ...
>
> > +static int memfd_luo_retrieve(struct liveupdate_file_op_args *args)
> > +{
> > +     struct folio *fdt_folio;
> > +     const u64 *pos, *size;
> > +     struct file *file;
> > +     int len, ret = 0;
> > +     const void *fdt;
> > +
> > +     fdt_folio = memfd_luo_get_fdt(args->serialized_data);
>
> Why do we need to kho_restore_folio() twice? Here and in
> memfd_luo_finish()?

Here we retrieve the memfd and give it to userspace. In finish, we discard
whatever was not reclaimed.

>
> > +     if (!fdt_folio)
> > +             return -ENOENT;
> > +
> > +     fdt = page_to_virt(folio_page(fdt_folio, 0));
>
> folio_address()

Done

>

>
> --
> Sincerely yours,
> Mike.

^ permalink raw reply

* Re: [PATCH v6 18/20] selftests/liveupdate: Add kexec-based selftest for session lifecycle
From: David Matlack @ 2025-11-19 21:20 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: pratyush, jasonmiu, graf, rppt, rientjes, corbet, rdunlap,
	ilpo.jarvinen, kanie, ojeda, aliceryhl, masahiroy, akpm, tj,
	yoann.congal, mmaurer, roman.gushchin, chenridong, axboe,
	mark.rutland, jannh, vincent.guittot, hannes, dan.j.williams,
	david, joel.granados, rostedt, anna.schumaker, song, linux,
	linux-kernel, linux-doc, linux-mm, gregkh, tglx, mingo, bp,
	dave.hansen, x86, hpa, rafael, dakr, bartosz.golaszewski,
	cw00.choi, myungjoo.ham, yesanishhere, Jonathan.Cameron,
	quic_zijuhu, aleksander.lobakin, ira.weiny, andriy.shevchenko,
	leon, lukas, bhelgaas, wagi, djeffery, stuart.w.hayes, ptyadav,
	lennart, brauner, linux-api, linux-fsdevel, saeedm, ajayachandra,
	jgg, parav, leonro, witu, hughd, skhawaja, chrisl
In-Reply-To: <20251115233409.768044-19-pasha.tatashin@soleen.com>

On 2025-11-15 06:34 PM, Pasha Tatashin wrote:

> diff --git a/tools/testing/selftests/liveupdate/do_kexec.sh b/tools/testing/selftests/liveupdate/do_kexec.sh
> new file mode 100755
> index 000000000000..3c7c6cafbef8
> --- /dev/null
> +++ b/tools/testing/selftests/liveupdate/do_kexec.sh
> @@ -0,0 +1,16 @@
> +#!/bin/sh
> +# SPDX-License-Identifier: GPL-2.0
> +set -e
> +
> +# Use $KERNEL and $INITRAMFS to pass custom Kernel and optional initramfs

It'd be nice to use proper command line options for KERNEL and INITRAMFS
instead of relying on environment variables.

e.g.

  ./do_kexec.sh -k <kernel> -i <initramfs>
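
A minimal sketch of what such option parsing could look like with POSIX
getopts, keeping the environment variables as fallbacks. The -k and -i
letters follow the example above; the parse_args name and the usage text
are assumptions for illustration only.

```sh
# parse_args: set KERNEL and INITRAMFS from -k/-i, falling back to the
# environment and then to the defaults used by do_kexec.sh (sketch only).
parse_args() {
    OPTIND=1
    while getopts "k:i:" opt; do
        case $opt in
            k) KERNEL=$OPTARG ;;
            i) INITRAMFS=$OPTARG ;;
            *) echo "usage: do_kexec.sh [-k kernel] [-i initramfs]" >&2
               return 2 ;;
        esac
    done
    KERNEL="${KERNEL:-/boot/bzImage}"
    INITRAMFS="${INITRAMFS:-/boot/initramfs}"
}

# Usage at the top of the script:
#   parse_args "$@" || exit 2
```

Existing callers that export KERNEL and INITRAMFS would keep working, since
the options only override the environment when given.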

> +
> +KERNEL="${KERNEL:-/boot/bzImage}"
> +set -- -l -s --reuse-cmdline "$KERNEL"

I've observed --reuse-cmdline causing overload of the kernel command
line when doing repeated kexecs, since it includes the built-in command
line (CONFIG_CMDLINE) which then also gets added by the next kernel
during boot.

Should we have something like this instead?

diff --git a/tools/testing/selftests/liveupdate/do_kexec.sh b/tools/testing/selftests/liveupdate/do_kexec.sh
index 3c7c6cafbef8..2590a870993d 100755
--- a/tools/testing/selftests/liveupdate/do_kexec.sh
+++ b/tools/testing/selftests/liveupdate/do_kexec.sh
@@ -4,8 +4,16 @@ set -e

 # Use $KERNEL and $INITRAMFS to pass custom Kernel and optional initramfs

+# Determine the boot command line we need to pass to the kexec kernel.  Note
+# that the kernel will append to it its builtin command line, so make sure we
+# subtract the builtin command to avoid accumulating kernel parameters and
+# eventually overflowing the command line.
+full_cmdline=$(cat /proc/cmdline)
+builtin_cmdline=$(zcat /proc/config.gz|grep CONFIG_CMDLINE=|cut -f2 -d\")
+cmdline=${full_cmdline/$builtin_cmdline /}
+
 KERNEL="${KERNEL:-/boot/bzImage}"
-set -- -l -s --reuse-cmdline "$KERNEL"
+set -- -l -s --command-line="${cmdline}" "$KERNEL"

 INITRAMFS="${INITRAMFS:-/boot/initramfs}"
 if [ -f "$INITRAMFS" ]; then

> +
> +INITRAMFS="${INITRAMFS:-/boot/initramfs}"
> +if [ -f "$INITRAMFS" ]; then
> +    set -- "$@" --initrd="$INITRAMFS"
> +fi
> +
> +kexec "$@"
> +kexec -e

Consider separating the kexec load into its own script, in case systems have
their own ways of shutting down for kexec.

e.g. a kexec_load.sh script that does everything that do_kexec.sh does except
the `kexec -e`. Then do_kexec.sh just calls kexec_load.sh and kexec -e.
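
One way the split could look, sketched as two shell functions rather than
two files so that the example stays self-contained. kexec_load and do_kexec
are hypothetical names here; the kexec flags are the ones already used in
the patch above.

```sh
# kexec_load: stage the kernel (and optional initramfs) without rebooting.
kexec_load() {
    KERNEL="${KERNEL:-/boot/bzImage}"
    INITRAMFS="${INITRAMFS:-/boot/initramfs}"
    set -- -l -s --reuse-cmdline "$KERNEL"
    if [ -f "$INITRAMFS" ]; then
        set -- "$@" --initrd="$INITRAMFS"
    fi
    kexec "$@"
}

# do_kexec: load, then jump into the new kernel. Systems with their own
# shutdown path would call kexec_load only and reboot their usual way.
do_kexec() {
    kexec_load && kexec -e
}
```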

^ permalink raw reply related

* Re: Safety of resolving untrusted paths with detached mount dirfd
From: David Laight @ 2025-11-19 18:34 UTC (permalink / raw)
  To: Alyssa Ross
  Cc: linux-fsdevel, Demi Marie Obenour, Aleksa Sarai, Jann Horn,
	Eric W. Biederman, jlayton, Bruce Fields, Al Viro, Arnd Bergmann,
	shuah, David Howells, Andy Lutomirski, Christian Brauner,
	Tycho Andersen, linux-kernel, linux-api
In-Reply-To: <87cy5eqgn8.fsf@alyssa.is>

On Wed, 19 Nov 2025 14:46:35 +0100
Alyssa Ross <hi@alyssa.is> wrote:

> Hello,
> 
> As we know, it's not safe to use chroot() for resolving untrusted paths
> within some root, as a subdirectory could be moved outside of the
> process root while walking the path[1].  On the other hand,
> LOOKUP_BENEATH is supposed to be robust against this, and going by [2],
> it sounds like resolving with the mount namespace root as dirfd should
> also be.
> 
> My question is: would resolving an untrusted path against a detached
> mount root dirfd opened with OPEN_TREE_CLONE (not necessarily a
> filesystem root) also be expected to be robust against traversal issues?
> i.e. can I rely on an untrusted path never resolving to a path that
> isn't under the mount root?
> 
> [1]: https://lore.kernel.org/lkml/CAG48ez30WJhbsro2HOc_DR7V91M+hNFzBP5ogRMZaxbAORvqzg@mail.gmail.com/
> [2]: https://lore.kernel.org/lkml/C89D720F-3CC4-4FA9-9CBB-E41A67360A6B@amacapital.net/

May not be directly relevant, but I found 'pwd' giving the wrong answer
when done inside a chroot (that isn't a filesystem mount point) after
'faffing' [1] with network namespaces.

The basic problem was that two kernel 'inode' structures end up referencing
the base of the chroot - so the pointer equality test fails.

So you could find the path of the chroot without any help from outside.

[1] Brain thinks it might have been an 'unshare' to leave a network namespace
that caused the problem.

	David

^ permalink raw reply

* Safety of resolving untrusted paths with detached mount dirfd
From: Alyssa Ross @ 2025-11-19 13:46 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Demi Marie Obenour, Aleksa Sarai, Jann Horn, Eric W. Biederman,
	jlayton, Bruce Fields, Al Viro, Arnd Bergmann, shuah,
	David Howells, Andy Lutomirski, Christian Brauner, Tycho Andersen,
	linux-kernel, linux-api

[-- Attachment #1: Type: text/plain, Size: 851 bytes --]

Hello,

As we know, it's not safe to use chroot() for resolving untrusted paths
within some root, as a subdirectory could be moved outside of the
process root while walking the path[1].  On the other hand,
LOOKUP_BENEATH is supposed to be robust against this, and going by [2],
it sounds like resolving with the mount namespace root as dirfd should
also be.

My question is: would resolving an untrusted path against a detached
mount root dirfd opened with OPEN_TREE_CLONE (not necessarily a
filesystem root) also be expected to be robust against traversal issues?
i.e. can I rely on an untrusted path never resolving to a path that
isn't under the mount root?

[1]: https://lore.kernel.org/lkml/CAG48ez30WJhbsro2HOc_DR7V91M+hNFzBP5ogRMZaxbAORvqzg@mail.gmail.com/
[2]: https://lore.kernel.org/lkml/C89D720F-3CC4-4FA9-9CBB-E41A67360A6B@amacapital.net/

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 227 bytes --]

^ permalink raw reply

* Message from accounting
From: Marek Poradecki @ 2025-11-19  8:55 UTC (permalink / raw)
  To: linux-api

Good morning,

we help entrepreneurs introduce a currency exchange model that minimizes cost fluctuations in international settlements.

When could we schedule a 15-minute call to show how such a model could work in your company, with guaranteed individual exchange rates and fully simplified payments? Please suggest a convenient time.

Best regards,
Marek Poradecki

^ permalink raw reply

* Re: [PATCH v6 02/20] liveupdate: luo_core: integrate with KHO
From: Pasha Tatashin @ 2025-11-19  3:03 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Mike Rapoport, pratyush, jasonmiu, graf, dmatlack, rientjes,
	corbet, rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl,
	masahiroy, akpm, tj, yoann.congal, mmaurer, roman.gushchin,
	chenridong, axboe, mark.rutland, jannh, vincent.guittot, hannes,
	dan.j.williams, david, joel.granados, rostedt, anna.schumaker,
	song, linux, linux-kernel, linux-doc, linux-mm, gregkh, tglx,
	mingo, bp, dave.hansen, x86, hpa, rafael, dakr,
	bartosz.golaszewski, cw00.choi, myungjoo.ham, yesanishhere,
	Jonathan.Cameron, quic_zijuhu, aleksander.lobakin, ira.weiny,
	andriy.shevchenko, leon, lukas, bhelgaas, wagi, djeffery,
	stuart.w.hayes, ptyadav, lennart, brauner, linux-api,
	linux-fsdevel, saeedm, ajayachandra, parav, leonro, witu, hughd,
	skhawaja, chrisl
In-Reply-To: <20251118232517.GD120075@nvidia.com>

On Tue, Nov 18, 2025 at 6:25 PM Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> On Tue, Nov 18, 2025 at 05:07:15PM -0500, Pasha Tatashin wrote:
>
> > In this case, we cannot even rely on having "safe" memory, i.e. this
> > scratch only boot to preserve dmesg/core etc, this is unfortunate. Is
> > there a way to avoid defaulting to identify mode when we are booting
> > into the "maintenance" mode?
>
> Maybe one could be created?
>
> It's tricky though because you also really want to block drivers from
> using the iommu if you don't know they are quieted and you can't do
> that without parsing the KHO data, which you can't do because it
> doesn't understand it..
>
> IDK, I think the "maintenance" mode is something that is probably best
> effort and shouldn't be relied on. It will work if the iommu data is
> restored or other lucky conditions hit, so it is not useless, but it
> is certainly not robust or guaranteed.

Right, even kdump has always been best-effort; many types of crashes
do not make it to the crash kernel.

> You are better to squirt a panic message out of the serial port and

For early boot LUO mismatches, or if FLB data is inaccessible for any
reason, devices might go rogue, so triggering a panic during boot is
appropriate.

However, session and file data structures are deserialized later, when
/dev/liveupdate is first opened by userspace. If deserialization fails
at that stage, I think we should simply fail the open(/dev/liveupdate)
call with an error such as -EIO.

Pasha

^ permalink raw reply

* Re: [PATCH v6 02/20] liveupdate: luo_core: integrate with KHO
From: Jason Gunthorpe @ 2025-11-18 23:25 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: Mike Rapoport, pratyush, jasonmiu, graf, dmatlack, rientjes,
	corbet, rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl,
	masahiroy, akpm, tj, yoann.congal, mmaurer, roman.gushchin,
	chenridong, axboe, mark.rutland, jannh, vincent.guittot, hannes,
	dan.j.williams, david, joel.granados, rostedt, anna.schumaker,
	song, linux, linux-kernel, linux-doc, linux-mm, gregkh, tglx,
	mingo, bp, dave.hansen, x86, hpa, rafael, dakr,
	bartosz.golaszewski, cw00.choi, myungjoo.ham, yesanishhere,
	Jonathan.Cameron, quic_zijuhu, aleksander.lobakin, ira.weiny,
	andriy.shevchenko, leon, lukas, bhelgaas, wagi, djeffery,
	stuart.w.hayes, ptyadav, lennart, brauner, linux-api,
	linux-fsdevel, saeedm, ajayachandra, parav, leonro, witu, hughd,
	skhawaja, chrisl
In-Reply-To: <CA+CK2bCguutAdsXETdDSEPCPT_=OQupgyTfGKQuxi924mOfhTQ@mail.gmail.com>

On Tue, Nov 18, 2025 at 05:07:15PM -0500, Pasha Tatashin wrote:

> In this case, we cannot even rely on having "safe" memory, i.e. this
> scratch only boot to preserve dmesg/core etc, this is unfortunate. Is
> there a way to avoid defaulting to identify mode when we are booting
> into the "maintenance" mode?

Maybe one could be created?

It's tricky though because you also really want to block drivers from
using the iommu if you don't know they are quieted and you can't do
that without parsing the KHO data, which you can't do because it
doesn't understand it..

IDK, I think the "maintenance" mode is something that is probably best
effort and shouldn't be relied on. It will work if the iommu data is
restored or other lucky conditions hit, so it is not useless, but it
is certainly not robust or guaranteed.

You are better to squirt a panic message out of the serial port and
hope for the best I guess.

Jason

^ permalink raw reply

* Re: [PATCH v6 02/20] liveupdate: luo_core: integrate with KHO
From: Pasha Tatashin @ 2025-11-18 22:07 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Mike Rapoport, pratyush, jasonmiu, graf, dmatlack, rientjes,
	corbet, rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl,
	masahiroy, akpm, tj, yoann.congal, mmaurer, roman.gushchin,
	chenridong, axboe, mark.rutland, jannh, vincent.guittot, hannes,
	dan.j.williams, david, joel.granados, rostedt, anna.schumaker,
	song, linux, linux-kernel, linux-doc, linux-mm, gregkh, tglx,
	mingo, bp, dave.hansen, x86, hpa, rafael, dakr,
	bartosz.golaszewski, cw00.choi, myungjoo.ham, yesanishhere,
	Jonathan.Cameron, quic_zijuhu, aleksander.lobakin, ira.weiny,
	andriy.shevchenko, leon, lukas, bhelgaas, wagi, djeffery,
	stuart.w.hayes, ptyadav, lennart, brauner, linux-api,
	linux-fsdevel, saeedm, ajayachandra, parav, leonro, witu, hughd,
	skhawaja, chrisl
In-Reply-To: <20251118161526.GD90703@nvidia.com>

On Tue, Nov 18, 2025 at 11:15 AM Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> On Tue, Nov 18, 2025 at 10:46:35AM -0500, Pasha Tatashin wrote:
> > > > This won't leak data, as /dev/liveupdate is completely disabled, so
> > > > nothing preserved in memory will be recoverable.
> > >
> > > This seems reasonable, but it is still dangerous.
> > >
> > > At the minimum the KHO startup either needs to succeed, panic, or fail
> > > to online most of the memory (ie run from the safe region only)
> >
> > Allowing degrade booting using only scratch memory sounds like a very
> > good compromise. This allows the live-update boot to stay alive as a
> > sort of "crash kernel," particularly since kdump functionality is not
> > available here. However, it would require some work in KHO to enable
> > such a feature.
> >
> > > The above approach works better for things like VFIO or memfd where
> > > you can boot significantly safely. Not sure about iommu though, if
> > > iommu doesn't deserialize properly then it probably corrupts all
> > > memory too.
> >
> > Yes, DMA may corrupt memory if KHO is broken, *but* we are discussing
> > broken LUO recovering, the KHO preserved memory should still stay as
> > preserved but unretriable, so DMA activity should only happen to those
> > regions...
>
> If the iommu is not preserved then normal iommu boot will possibly set
> the translation to identity and it will scribble over random memory.
>
> You can't rely on the translation being present and only reaching kho
> preserved memory if the iommu can't restore itself.

In this case, we cannot even rely on having "safe" memory, i.e. a
scratch-only boot to preserve dmesg/core etc., which is unfortunate. Is
there a way to avoid defaulting to identity mode when we are booting
into the "maintenance" mode?

Thanks,
Pasha

>
> Jason

^ permalink raw reply

* Re: [PATCH v6 06/20] liveupdate: luo_file: implement file systems callbacks
From: Pasha Tatashin @ 2025-11-18 19:31 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Pratyush Yadav, David Matlack, jasonmiu, graf, rppt, rientjes,
	corbet, rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl,
	masahiroy, akpm, tj, yoann.congal, mmaurer, roman.gushchin,
	chenridong, axboe, mark.rutland, jannh, vincent.guittot, hannes,
	dan.j.williams, david, joel.granados, rostedt, anna.schumaker,
	song, linux, linux-kernel, linux-doc, linux-mm, gregkh, tglx,
	mingo, bp, dave.hansen, x86, hpa, rafael, dakr,
	bartosz.golaszewski, cw00.choi, myungjoo.ham, yesanishhere,
	Jonathan.Cameron, quic_zijuhu, aleksander.lobakin, ira.weiny,
	andriy.shevchenko, leon, lukas, bhelgaas, wagi, djeffery,
	stuart.w.hayes, lennart, brauner, linux-api, linux-fsdevel,
	saeedm, ajayachandra, parav, leonro, witu, hughd, skhawaja,
	chrisl
In-Reply-To: <20251118190901.GS10864@nvidia.com>

On Tue, Nov 18, 2025 at 2:09 PM Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> On Tue, Nov 18, 2025 at 12:58:20PM -0500, Pasha Tatashin wrote:
> > I actually had full unregister functionality in v4 and earlier, but I
> > dropped it from this series to minimize the footprint and get the core
> > infrastructure landed first.
>
> I don't think this will make sense, there are enough error paths we
> can't have registers without unregisters to unwind them.

I will add them back in LUOv7.

>
> Jason

^ permalink raw reply

* Re: [PATCH v6 06/20] liveupdate: luo_file: implement file systems callbacks
From: Jason Gunthorpe @ 2025-11-18 19:09 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: Pratyush Yadav, David Matlack, jasonmiu, graf, rppt, rientjes,
	corbet, rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl,
	masahiroy, akpm, tj, yoann.congal, mmaurer, roman.gushchin,
	chenridong, axboe, mark.rutland, jannh, vincent.guittot, hannes,
	dan.j.williams, david, joel.granados, rostedt, anna.schumaker,
	song, linux, linux-kernel, linux-doc, linux-mm, gregkh, tglx,
	mingo, bp, dave.hansen, x86, hpa, rafael, dakr,
	bartosz.golaszewski, cw00.choi, myungjoo.ham, yesanishhere,
	Jonathan.Cameron, quic_zijuhu, aleksander.lobakin, ira.weiny,
	andriy.shevchenko, leon, lukas, bhelgaas, wagi, djeffery,
	stuart.w.hayes, lennart, brauner, linux-api, linux-fsdevel,
	saeedm, ajayachandra, parav, leonro, witu, hughd, skhawaja,
	chrisl
In-Reply-To: <CA+CK2bAqisSdZ7gSBd7=hGd1VbLHX5WXfBazR=rO8BOVCRx3pg@mail.gmail.com>

On Tue, Nov 18, 2025 at 12:58:20PM -0500, Pasha Tatashin wrote:
> I actually had full unregister functionality in v4 and earlier, but I
> dropped it from this series to minimize the footprint and get the core
> infrastructure landed first.

I don't think this will make sense; there are enough error paths that we
can't have registers without unregisters to unwind them.

Jason

^ permalink raw reply

* Re: [PATCH v6 20/20] tests/liveupdate: Add in-kernel liveupdate test
From: Pasha Tatashin @ 2025-11-18 18:56 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: pratyush, jasonmiu, graf, dmatlack, rientjes, corbet, rdunlap,
	ilpo.jarvinen, kanie, ojeda, aliceryhl, masahiroy, akpm, tj,
	yoann.congal, mmaurer, roman.gushchin, chenridong, axboe,
	mark.rutland, jannh, vincent.guittot, hannes, dan.j.williams,
	david, joel.granados, rostedt, anna.schumaker, song, linux,
	linux-kernel, linux-doc, linux-mm, gregkh, tglx, mingo, bp,
	dave.hansen, x86, hpa, rafael, dakr, bartosz.golaszewski,
	cw00.choi, myungjoo.ham, yesanishhere, Jonathan.Cameron,
	quic_zijuhu, aleksander.lobakin, ira.weiny, andriy.shevchenko,
	leon, lukas, bhelgaas, wagi, djeffery, stuart.w.hayes, ptyadav,
	lennart, brauner, linux-api, linux-fsdevel, saeedm, ajayachandra,
	jgg, parav, leonro, witu, hughd, skhawaja, chrisl
In-Reply-To: <aRxY53gBbeH-6L0Y@kernel.org>

On Tue, Nov 18, 2025 at 6:31 AM Mike Rapoport <rppt@kernel.org> wrote:
>
> On Mon, Nov 17, 2025 at 02:00:15PM -0500, Pasha Tatashin wrote:
> > > >  #endif /* _LINUX_LIVEUPDATE_ABI_LUO_H */
> > > > diff --git a/kernel/liveupdate/luo_file.c b/kernel/liveupdate/luo_file.c
> > > > index df337c9c4f21..9a531096bdb5 100644
> > > > --- a/kernel/liveupdate/luo_file.c
> > > > +++ b/kernel/liveupdate/luo_file.c
> > > > @@ -834,6 +834,8 @@ int liveupdate_register_file_handler(struct liveupdate_file_handler *fh)
> > > >       INIT_LIST_HEAD(&fh->flb_list);
> > > >       list_add_tail(&fh->list, &luo_file_handler_list);
> > > >
> > > > +     liveupdate_test_register(fh);
> > > > +
> > >
> > > Why this cannot be called from the test?
> >
> > Because test does not have access to all file_handlers that are being
> > registered with LUO.
>
> Unless I'm missing something, an FLB users registers a file handlers and
> let's LUO know that it will need FLB. Why the test can't do the same?

The test needs to attach to every registered file handler because we
want to ensure that FLB scales and works correctly with any file
handler. For this in-kernel test, there is no need to create our own
file type or to drive it from userspace (where a user would create a
file of that type and preserve it with LUO, so that FLB can be allocated
and checked). This in-kernel test is self-sufficient.

> > Pasha
>
> --
> Sincerely yours,
> Mike.

^ permalink raw reply

* Re: [PATCH v6 06/20] liveupdate: luo_file: implement file systems callbacks
From: Pratyush Yadav @ 2025-11-18 18:17 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: Pratyush Yadav, David Matlack, jasonmiu, graf, rppt, rientjes,
	corbet, rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl,
	masahiroy, akpm, tj, yoann.congal, mmaurer, roman.gushchin,
	chenridong, axboe, mark.rutland, jannh, vincent.guittot, hannes,
	dan.j.williams, david, joel.granados, rostedt, anna.schumaker,
	song, linux, linux-kernel, linux-doc, linux-mm, gregkh, tglx,
	mingo, bp, dave.hansen, x86, hpa, rafael, dakr,
	bartosz.golaszewski, cw00.choi, myungjoo.ham, yesanishhere,
	Jonathan.Cameron, quic_zijuhu, aleksander.lobakin, ira.weiny,
	andriy.shevchenko, leon, lukas, bhelgaas, wagi, djeffery,
	stuart.w.hayes, lennart, brauner, linux-api, linux-fsdevel,
	saeedm, ajayachandra, jgg, parav, leonro, witu, hughd, skhawaja,
	chrisl
In-Reply-To: <CA+CK2bAqisSdZ7gSBd7=hGd1VbLHX5WXfBazR=rO8BOVCRx3pg@mail.gmail.com>

On Tue, Nov 18 2025, Pasha Tatashin wrote:

> On Tue, Nov 18, 2025 at 12:43 PM Pratyush Yadav <pratyush@kernel.org> wrote:
>>
>> On Tue, Nov 18 2025, David Matlack wrote:
>>
>> > On 2025-11-15 06:33 PM, Pasha Tatashin wrote:
>> >> This patch implements the core mechanism for managing preserved
>> >> files throughout the live update lifecycle. It provides the logic to
>> >> invoke the file handler callbacks (preserve, unpreserve, freeze,
>> >> unfreeze, retrieve, and finish) at the appropriate stages.
>> >>
>> >> During the reboot phase, luo_file_freeze() serializes the final
>> >> metadata for each file (handler compatible string, token, and data
>> >> handle) into a memory region preserved by KHO. In the new kernel,
>> >> luo_file_deserialize() reconstructs the in-memory file list from this
>> >> data, preparing the session for retrieval.
>> >>
>> >> Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
>> >
>> >> +int liveupdate_register_file_handler(struct liveupdate_file_handler *h);
>> >
>> > Should there be a way to unregister a file handler?
>> >
>> > If VFIO is built as module then I think it  would need to be able to
>> > unregister its file handler when the module is unloaded to avoid leaking
>> > pointers to its text in LUO.
>
> I actually had full unregister functionality in v4 and earlier, but I
> dropped it from this series to minimize the footprint and get the core
> infrastructure landed first.
>
> For now, safety is guaranteed because
> liveupdate_register_file_handler() and liveupdate_register_flb() take
> a module reference. This effectively pins any module that registers
> with LUO, meaning those driver modules cannot be unloaded or upgraded
> dynamically, they can only be updated via Live Update or full reboot.

What if liveupdate_register_flb() fails? It would need to unregister its
file handler too, since the file handler can't really work without its
FLB. Shouldn't happen in practice, but still LUO clients need a way to
handle this failure.

[...]

-- 
Regards,
Pratyush Yadav

^ permalink raw reply

* Re: RFC: Serial port DTR/RTS - O_<something>
From: H. Peter Anvin @ 2025-11-18 18:05 UTC (permalink / raw)
  To: Ned Ulbricht, Maciej W. Rozycki
  Cc: Greg KH, Theodore Ts'o, Maarten Brock,
	linux-serial@vger.kernel.org, linux-api@vger.kernel.org, LKML
In-Reply-To: <06279d25-73d6-01f5-dcf8-8667415048d2@netscape.net>

On 2025-11-18 08:33, Ned Ulbricht wrote:
>>
>> O_NOCLOBBER looks like an odd in-between between O_EXCL and
>> (O_EXCL|O_NOFOLLOW); stated to be specifically to implement the shell
>> "noclobber" semantic.
> 
> "(O_EXCL|O_NOFOLLOW)" provokes a thought...
> 
> As essential context, fs/open.c build_open_flags() has:
> 
> if (flags & O_CREAT) {
>     op->intent |= LOOKUP_CREATE;
>     if (flags & O_EXCL) {
>         op->intent |= LOOKUP_EXCL;
>         flags |= O_NOFOLLOW;
>     }
> }
> 
> if (!(flags & O_NOFOLLOW))
>     lookup_flags |= LOOKUP_FOLLOW;
> 

Interesting. As far as O_NOCLOBBER is concerned, that is an "O_EXCL unless the
output is a special file (device node, FIFO, etc)"; presumably to allow the
shell to not flip out when doing, say "foo > /dev/ttyS0" when in noclobber mode.

I had missed the bit in the spec that says that O_CREAT|O_EXCL is required to
imply O_NOFOLLOW (as Linux indeed does as seen above.)

O_NOCLOBBER emulation in user space would seem to be possible with a loop;
first try to open O_CREAT|O_EXCL and if that fails with EEXIST then open
without either; if that succeeds test with fstat() to see if it is a regular
file, and if it is, close it and error. However, it is hardly ideal, and I
might have overlooked some mechanism by which this may fail.
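
A minimal user-space sketch of the loop described above might look like
the following. The name open_noclobber() and the retry bound are
assumptions made for illustration and are not an existing libc or kernel
interface. O_TRUNC is masked off so that the fallback open cannot
truncate a regular file before fstat() has inspected it.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (not a real API): emulate "O_EXCL unless the
 * target is a special file" from user space.
 * Returns a file descriptor, or -1 with errno set.
 */
int open_noclobber(const char *path, int flags, mode_t mode)
{
    /* Never pass O_TRUNC: the fallback open must not modify a regular file. */
    flags &= ~(O_CREAT | O_EXCL | O_TRUNC);

    /* Bounded retry; the limit of 16 is an arbitrary choice for this sketch. */
    for (int tries = 0; tries < 16; tries++) {
        struct stat st;
        int fd;

        /* Step 1: exclusive create. Success means the file did not exist. */
        fd = open(path, flags | O_CREAT | O_EXCL, mode);
        if (fd >= 0 || errno != EEXIST)
            return fd;

        /* Step 2: something exists; open it without O_CREAT or O_EXCL. */
        fd = open(path, flags);
        if (fd < 0) {
            if (errno == ENOENT)
                continue;   /* removed in between: start over */
            return -1;
        }

        /* Step 3: refuse regular files, accept anything else. */
        if (fstat(fd, &st) < 0) {
            int saved = errno;

            close(fd);
            errno = saved;
            return -1;
        }
        if (S_ISREG(st.st_mode)) {
            close(fd);
            errno = EEXIST;
            return -1;
        }
        return fd;
    }

    errno = EAGAIN;
    return -1;
}
```

Note that in this sketch the fallback open() in step 2 has already
happened by the time the file type is checked, so any side effect of
opening the existing object occurs regardless of the result.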

	-hpa

^ permalink raw reply

* Re: [PATCH v6 06/20] liveupdate: luo_file: implement file systems callbacks
From: Pasha Tatashin @ 2025-11-18 17:58 UTC (permalink / raw)
  To: Pratyush Yadav
  Cc: David Matlack, jasonmiu, graf, rppt, rientjes, corbet, rdunlap,
	ilpo.jarvinen, kanie, ojeda, aliceryhl, masahiroy, akpm, tj,
	yoann.congal, mmaurer, roman.gushchin, chenridong, axboe,
	mark.rutland, jannh, vincent.guittot, hannes, dan.j.williams,
	david, joel.granados, rostedt, anna.schumaker, song, linux,
	linux-kernel, linux-doc, linux-mm, gregkh, tglx, mingo, bp,
	dave.hansen, x86, hpa, rafael, dakr, bartosz.golaszewski,
	cw00.choi, myungjoo.ham, yesanishhere, Jonathan.Cameron,
	quic_zijuhu, aleksander.lobakin, ira.weiny, andriy.shevchenko,
	leon, lukas, bhelgaas, wagi, djeffery, stuart.w.hayes, lennart,
	brauner, linux-api, linux-fsdevel, saeedm, ajayachandra, jgg,
	parav, leonro, witu, hughd, skhawaja, chrisl
In-Reply-To: <mafs05xb744pb.fsf@kernel.org>

On Tue, Nov 18, 2025 at 12:43 PM Pratyush Yadav <pratyush@kernel.org> wrote:
>
> On Tue, Nov 18 2025, David Matlack wrote:
>
> > On 2025-11-15 06:33 PM, Pasha Tatashin wrote:
> >> This patch implements the core mechanism for managing preserved
> >> files throughout the live update lifecycle. It provides the logic to
> >> invoke the file handler callbacks (preserve, unpreserve, freeze,
> >> unfreeze, retrieve, and finish) at the appropriate stages.
> >>
> >> During the reboot phase, luo_file_freeze() serializes the final
> >> metadata for each file (handler compatible string, token, and data
> >> handle) into a memory region preserved by KHO. In the new kernel,
> >> luo_file_deserialize() reconstructs the in-memory file list from this
> >> data, preparing the session for retrieval.
> >>
> >> Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
> >
> >> +int liveupdate_register_file_handler(struct liveupdate_file_handler *h);
> >
> > Should there be a way to unregister a file handler?
> >
> > If VFIO is built as module then I think it  would need to be able to
> > unregister its file handler when the module is unloaded to avoid leaking
> > pointers to its text in LUO.

I actually had full unregister functionality in v4 and earlier, but I
dropped it from this series to minimize the footprint and get the core
infrastructure landed first.

For now, safety is guaranteed because
liveupdate_register_file_handler() and liveupdate_register_flb() take
a module reference. This effectively pins any module that registers
with LUO, meaning those driver modules cannot be unloaded or upgraded
dynamically, they can only be updated via Live Update or full reboot.

I plan to introduce unregister support in a future improvement to
relax this constraint. The design I have in mind is:
1. Unregistration will acquire the singleton lock on /dev/liveupdate
to ensure no new sessions can be created during teardown.
2. Verify that there are no incoming/outgoing sessions.
3. A file handler can only be unregistered if there are no FLBs
currently registered against it.

Pasha

> Good point. We also need when using FLB. You would first do
> liveupdate_register_file_handler(), and then do
> liveupdate_register_flb(). If the latter fails, you would want to
> unregister the file handler too.
>
> --
> Regards,
> Pratyush Yadav

^ permalink raw reply

* Re: [PATCH v6 06/20] liveupdate: luo_file: implement file systems callbacks
From: Pratyush Yadav @ 2025-11-18 17:43 UTC (permalink / raw)
  To: David Matlack
  Cc: Pasha Tatashin, pratyush, jasonmiu, graf, rppt, rientjes, corbet,
	rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl, masahiroy, akpm,
	tj, yoann.congal, mmaurer, roman.gushchin, chenridong, axboe,
	mark.rutland, jannh, vincent.guittot, hannes, dan.j.williams,
	david, joel.granados, rostedt, anna.schumaker, song, linux,
	linux-kernel, linux-doc, linux-mm, gregkh, tglx, mingo, bp,
	dave.hansen, x86, hpa, rafael, dakr, bartosz.golaszewski,
	cw00.choi, myungjoo.ham, yesanishhere, Jonathan.Cameron,
	quic_zijuhu, aleksander.lobakin, ira.weiny, andriy.shevchenko,
	leon, lukas, bhelgaas, wagi, djeffery, stuart.w.hayes, lennart,
	brauner, linux-api, linux-fsdevel, saeedm, ajayachandra, jgg,
	parav, leonro, witu, hughd, skhawaja, chrisl
In-Reply-To: <aRyvG308oNRVzuN7@google.com>

On Tue, Nov 18 2025, David Matlack wrote:

> On 2025-11-15 06:33 PM, Pasha Tatashin wrote:
>> This patch implements the core mechanism for managing preserved
>> files throughout the live update lifecycle. It provides the logic to
>> invoke the file handler callbacks (preserve, unpreserve, freeze,
>> unfreeze, retrieve, and finish) at the appropriate stages.
>> 
>> During the reboot phase, luo_file_freeze() serializes the final
>> metadata for each file (handler compatible string, token, and data
>> handle) into a memory region preserved by KHO. In the new kernel,
>> luo_file_deserialize() reconstructs the in-memory file list from this
>> data, preparing the session for retrieval.
>> 
>> Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
>
>> +int liveupdate_register_file_handler(struct liveupdate_file_handler *h);
>
> Should there be a way to unregister a file handler?
>
> If VFIO is built as module then I think it  would need to be able to
> unregister its file handler when the module is unloaded to avoid leaking
> pointers to its text in LUO.

Good point. We also need this when using FLB. You would first do
liveupdate_register_file_handler(), and then do
liveupdate_register_flb(). If the latter fails, you would want to
unregister the file handler too.
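
The unwind order described above can be sketched with a self-contained
mock. Everything prefixed demo_ is a hypothetical stand-in written for
this illustration; none of it is the LUO API from the series, and the
inject_flb_failure parameter exists only to exercise the error path.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-ins for a file handler and an FLB (not LUO types). */
struct demo_handler { bool registered; };
struct demo_flb { bool registered; };

static int demo_register_file_handler(struct demo_handler *h)
{
    h->registered = true;
    return 0;
}

static void demo_unregister_file_handler(struct demo_handler *h)
{
    h->registered = false;
}

/* 'fail' simulates the FLB registration failing, e.g. on allocation. */
static int demo_register_flb(struct demo_flb *flb, bool fail)
{
    if (fail)
        return -ENOMEM;
    flb->registered = true;
    return 0;
}

/* Register the handler first, then the FLB; undo the handler on failure. */
int demo_client_init(struct demo_handler *h, struct demo_flb *flb,
                     bool inject_flb_failure)
{
    int err;

    err = demo_register_file_handler(h);
    if (err)
        return err;

    err = demo_register_flb(flb, inject_flb_failure);
    if (err)
        goto err_unregister_handler;

    return 0;

err_unregister_handler:
    /* Without an unregister call, the handler would stay registered here. */
    demo_unregister_file_handler(h);
    return err;
}
```

The point of the mock is only the goto-based unwind: if the second
registration fails, the first must be reversible, or the client is left
with a file handler that cannot work.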

-- 
Regards,
Pratyush Yadav

^ permalink raw reply

* Re: [PATCH v6 06/20] liveupdate: luo_file: implement file systems callbacks
From: David Matlack @ 2025-11-18 17:38 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: pratyush, jasonmiu, graf, rppt, rientjes, corbet, rdunlap,
	ilpo.jarvinen, kanie, ojeda, aliceryhl, masahiroy, akpm, tj,
	yoann.congal, mmaurer, roman.gushchin, chenridong, axboe,
	mark.rutland, jannh, vincent.guittot, hannes, dan.j.williams,
	david, joel.granados, rostedt, anna.schumaker, song, linux,
	linux-kernel, linux-doc, linux-mm, gregkh, tglx, mingo, bp,
	dave.hansen, x86, hpa, rafael, dakr, bartosz.golaszewski,
	cw00.choi, myungjoo.ham, yesanishhere, Jonathan.Cameron,
	quic_zijuhu, aleksander.lobakin, ira.weiny, andriy.shevchenko,
	leon, lukas, bhelgaas, wagi, djeffery, stuart.w.hayes, ptyadav,
	lennart, brauner, linux-api, linux-fsdevel, saeedm, ajayachandra,
	jgg, parav, leonro, witu, hughd, skhawaja, chrisl
In-Reply-To: <20251115233409.768044-7-pasha.tatashin@soleen.com>

On 2025-11-15 06:33 PM, Pasha Tatashin wrote:
> This patch implements the core mechanism for managing preserved
> files throughout the live update lifecycle. It provides the logic to
> invoke the file handler callbacks (preserve, unpreserve, freeze,
> unfreeze, retrieve, and finish) at the appropriate stages.
> 
> During the reboot phase, luo_file_freeze() serializes the final
> metadata for each file (handler compatible string, token, and data
> handle) into a memory region preserved by KHO. In the new kernel,
> luo_file_deserialize() reconstructs the in-memory file list from this
> data, preparing the session for retrieval.
> 
> Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>

> +int liveupdate_register_file_handler(struct liveupdate_file_handler *h);

Should there be a way to unregister a file handler?

If VFIO is built as a module then I think it would need to be able to
unregister its file handler when the module is unloaded to avoid leaking
pointers to its text in LUO.

^ permalink raw reply
