Linux userland API discussions
* Re: [PATCH v4 00/30] Live Update Orchestrator
From: Pasha Tatashin @ 2025-10-10 12:45 UTC
  To: Pratyush Yadav
  Cc: jasonmiu, graf, changyuanl, rppt, dmatlack, rientjes, corbet,
	rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl, masahiroy, akpm,
	tj, yoann.congal, mmaurer, roman.gushchin, chenridong, axboe,
	mark.rutland, jannh, vincent.guittot, hannes, dan.j.williams,
	david, joel.granados, rostedt, anna.schumaker, song, zhangguopeng,
	linux, linux-kernel, linux-doc, linux-mm, gregkh, tglx, mingo, bp,
	dave.hansen, x86, hpa, rafael, dakr, bartosz.golaszewski,
	cw00.choi, myungjoo.ham, yesanishhere, Jonathan.Cameron,
	quic_zijuhu, aleksander.lobakin, ira.weiny, andriy.shevchenko,
	leon, lukas, bhelgaas, wagi, djeffery, stuart.w.hayes, lennart,
	brauner, linux-api, linux-fsdevel, saeedm, ajayachandra, jgg,
	parav, leonro, witu, hughd, skhawaja, chrisl, steven.sistare
In-Reply-To: <mafs0ms5zn0nm.fsf@kernel.org>

On Thu, Oct 9, 2025 at 6:58 PM Pratyush Yadav <pratyush@kernel.org> wrote:
>
> On Tue, Oct 07 2025, Pasha Tatashin wrote:
>
> > On Sun, Sep 28, 2025 at 9:03 PM Pasha Tatashin
> > <pasha.tatashin@soleen.com> wrote:
> >>
> [...]
> > 4. New File-Lifecycle-Bound Global State
> > ----------------------------------------
> > A new mechanism for managing global state was proposed, designed to be
> > tied to the lifecycle of the preserved files themselves. This would
> > allow a file owner (e.g., the IOMMU subsystem) to save and retrieve
> > global state that is only relevant when one or more of its FDs are
> > being managed by LUO.
>
> Is this going to replace LUO subsystems? If yes, then why? The global
> state will likely need to have its own lifecycle just like the FDs, and
> subsystems are a simple and clean abstraction to control that. I get the
> idea of only "activating" a subsystem when one or more of its FDs are
> participating in LUO, but we can do that while keeping subsystems
> around.
>
> >
> > The key characteristics of this new mechanism are:
> > The global state is optionally created on the first preserve() call
> > for a given file handler.
> > The state can be updated on subsequent preserve() calls.
> > The state is destroyed when the last corresponding file is unpreserved
> > or finished.
> > The data can be accessed during boot.
> >
> > I am thinking of an API like this.
> >
> > 1. Add three more callbacks to liveupdate_file_ops:
> > /*
> >  * Optional. Called by LUO during first get global state call.
> >  * The handler should allocate/KHO preserve its global state object and return a
> >  * pointer to it via 'obj'. It must also provide a u64 handle (e.g., a physical
> >  * address of preserved memory) via 'data_handle' that LUO will save.
> >  * Return: 0 on success.
> >  */
> > int (*global_state_create)(struct liveupdate_file_handler *h,
> >                            void **obj, u64 *data_handle);
> >
> > /*
> >  * Optional. Called by LUO in the new kernel
> >  * before the first access to the global state. The handler receives
> >  * the preserved u64 data_handle and should use it to reconstruct its
> >  * global state object, returning a pointer to it via 'obj'.
> >  * Return: 0 on success.
> >  */
> > int (*global_state_restore)(struct liveupdate_file_handler *h,
> >                             u64 data_handle, void **obj);
> >
> > /*
> >  * Optional. Called by LUO after the last
> >  * file for this handler is unpreserved or finished. The handler
> >  * must free its global state object and any associated resources.
> >  */
> > void (*global_state_destroy)(struct liveupdate_file_handler *h, void *obj);
> >
> > The get/put global state data:
> >
> > /* Get and lock the data with file_handler scoped lock */
> > int liveupdate_fh_global_state_get(struct liveupdate_file_handler *h,
> >                                    void **obj);
> >
> > /* Unlock the data */
> > void liveupdate_fh_global_state_put(struct liveupdate_file_handler *h);
>
> IMHO this looks clunky and overcomplicated. Each LUO FD type knows what
> its subsystem is. It should talk to it directly. I don't get why we are
> adding this intermediate step.
>
> Here is how I imagine the proposed API would compare against subsystems
> with hugetlb as an example (hugetlb support is still WIP, so I'm still
> not clear on specifics, but this is how I imagine it will work):
>
> - Hugetlb subsystem needs to track its huge page pools and which pages
>   are allocated and free. This is its global state. The pools get
>   reconstructed after kexec. Post-kexec, the free pages are ready for
>   allocation from other "regular" files and the pages used in LUO files
>   are reserved.

Thinking more about this, HugeTLB is different from iommufd/iommu-core
and vfiofd/pci because it supports many types of FDs, such as memfd and
guest_memfd (1G support is coming soon!). Also, since not all memfd
or guest_memfd instances require HugeTLB, binding their lifecycles to
HugeTLB doesn't make sense here. I agree that a subsystem is more
appropriate for this use case.

Pasha
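
The lifecycle in the quoted proposal (global state created on the first
preserve(), kept across later ones, destroyed when the last file is
unpreserved or finished) can be modeled in plain userspace C. This is
only a sketch under assumptions: struct fh_model, fh_model_preserve(),
fh_model_unpreserve() and the demo_* callbacks are hypothetical names,
the callbacks merely mirror the shape of the proposed
global_state_create/global_state_destroy hooks, and no locking, KHO
preservation or restore path is modeled.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical userspace stand-in for a file handler; not the kernel struct. */
struct fh_model {
	int (*global_state_create)(struct fh_model *h, void **obj,
				   uint64_t *data_handle);
	void (*global_state_destroy)(struct fh_model *h, void *obj);
	void *obj;			/* live global state, NULL when absent */
	uint64_t data_handle;		/* value the orchestrator would save */
	unsigned int nr_preserved;	/* files currently preserved */
};

/* Create the global state on the first preserve, then only count. */
static int fh_model_preserve(struct fh_model *h)
{
	assert(h);
	if (h->nr_preserved == 0 && h->global_state_create) {
		int err = h->global_state_create(h, &h->obj, &h->data_handle);

		if (err)
			return err;
	}
	h->nr_preserved++;
	return 0;
}

/* Destroy the global state when the last preserved file goes away. */
static void fh_model_unpreserve(struct fh_model *h)
{
	assert(h);
	if (h->nr_preserved == 0)
		return;
	if (--h->nr_preserved == 0 && h->global_state_destroy) {
		h->global_state_destroy(h, h->obj);
		h->obj = NULL;
		h->data_handle = 0;
	}
}

/* Example callbacks: the "global state" is a single heap-allocated int. */
static int demo_create(struct fh_model *h, void **obj, uint64_t *data_handle)
{
	int *state = calloc(1, sizeof(*state));

	(void)h;
	if (!state)
		return -1;
	*obj = state;
	/* A real handler would hand back e.g. a preserved physical address. */
	*data_handle = (uint64_t)(uintptr_t)state;
	return 0;
}

static void demo_destroy(struct fh_model *h, void *obj)
{
	(void)h;
	free(obj);
}
```

With two preserved files, the state object appears on the first
fh_model_preserve() call and is freed only by the second
fh_model_unpreserve() call. This coupling of lifetimes is what the reply
above argues does not fit HugeTLB, whose state has to outlive any one
FD type.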


* Re: [PATCH 0/2] Fix to EOPNOTSUPP double conversion in ioctl_setflags()
From: Christian Brauner @ 2025-10-10 11:47 UTC
  To: linux-api, linux-fsdevel, linux-kernel, linux-xfs,
	Andrey Albershteyn
  Cc: Christian Brauner, Jan Kara, Jiri Slaby, Arnd Bergmann,
	Andrey Albershteyn
In-Reply-To: <20251008-eopnosupp-fix-v1-0-5990de009c9f@kernel.org>

On Wed, 08 Oct 2025 14:44:16 +0200, Andrey Albershteyn wrote:
> Revert original double conversion patch from ENOIOCTLCMD to EOPNOTSUPP for
> vfs_fileattr_get and vfs_fileattr_set. Instead, convert ENOIOCTLCMD only
> where necessary.
> 
> To: linux-api@vger.kernel.org
> To: linux-fsdevel@vger.kernel.org
> To: linux-kernel@vger.kernel.org
> To: linux-xfs@vger.kernel.org,
> Cc: "Jan Kara" <jack@suse.cz>
> Cc: "Jiri Slaby" <jirislaby@kernel.org>
> Cc: "Christian Brauner" <brauner@kernel.org>
> Cc: "Arnd Bergmann" <arnd@arndb.de>
> 
> [...]

Applied to the vfs.fixes branch of the vfs/vfs.git tree.
Patches in the vfs.fixes branch should appear in linux-next soon.

Please report any outstanding bugs that were missed during review in a
new review to the original patch series allowing us to drop it.

It's encouraged to provide Acked-bys and Reviewed-bys even though the
patch has now been applied. If possible patch trailers will be updated.

Note that commit hashes shown below are subject to change due to rebase,
trailer updates or similar. If in doubt, please check the listed branch.

tree:   https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git
branch: vfs.fixes

[1/2] Revert "fs: make vfs_fileattr_[get|set] return -EOPNOTSUPP"
      https://git.kernel.org/vfs/vfs/c/4dd5b5ac089b
[2/2] fs: return EOPNOTSUPP from file_setattr/file_getattr syscalls
      https://git.kernel.org/vfs/vfs/c/d90ad28e8aa4


* Re: [PATCH 2/2] fs: return EOPNOTSUPP from file_setattr/file_getattr syscalls
From: Christian Brauner @ 2025-10-10 11:45 UTC
  To: Andrey Albershteyn
  Cc: Darrick J. Wong, linux-api, linux-fsdevel, linux-kernel,
	linux-xfs, Jan Kara, Jiri Slaby, Arnd Bergmann,
	Andrey Albershteyn
In-Reply-To: <q6phvrrl2fumjwwd66d5glauch76uca4rr5pkvl2dwaxzx62bm@sjcixwa7r6r5>

On Fri, Oct 10, 2025 at 12:05:04PM +0200, Andrey Albershteyn wrote:
> On 2025-10-09 10:20:41, Darrick J. Wong wrote:
> > On Wed, Oct 08, 2025 at 02:44:18PM +0200, Andrey Albershteyn wrote:
> > > These syscalls call to vfs_fileattr_get/set functions which return
> > > ENOIOCTLCMD if filesystem doesn't support setting file attribute on an
> > > inode. For syscalls EOPNOTSUPP would be more appropriate return error.
> > > 
> > > Signed-off-by: Andrey Albershteyn <aalbersh@kernel.org>
> > > ---
> > >  fs/file_attr.c | 4 ++++
> > >  1 file changed, 4 insertions(+)
> > > 
> > > diff --git a/fs/file_attr.c b/fs/file_attr.c
> > > index 460b2dd21a85..5e3e2aba97b5 100644
> > > --- a/fs/file_attr.c
> > > +++ b/fs/file_attr.c
> > > @@ -416,6 +416,8 @@ SYSCALL_DEFINE5(file_getattr, int, dfd, const char __user *, filename,
> > >  	}
> > >  
> > >  	error = vfs_fileattr_get(filepath.dentry, &fa);
> > > +	if (error == -ENOIOCTLCMD)
> > 
> > Hrm.  Back in 6.17, XFS would return ENOTTY if you called ->fileattr_get
> > on a special file:
> > 
> > int
> > xfs_fileattr_get(
> > 	struct dentry		*dentry,
> > 	struct file_kattr	*fa)
> > {
> > 	struct xfs_inode	*ip = XFS_I(d_inode(dentry));
> > 
> > 	if (d_is_special(dentry))
> > 		return -ENOTTY;
> > 	...
> > }
> > 
> > Given that there are other fileattr_[gs]et implementations out there
> > that might return ENOTTY (e.g. fuse servers and other externally
> > maintained filesystems), I think both syscall functions need to check
> > for that as well:
> > 
> > 	if (error == -ENOIOCTLCMD || error == -ENOTTY)
> > 		return -EOPNOTSUPP;
> 
> Makes sense (looks like ubifs, jfs and gfs2 also return ENOTTY for
> special files). I haven't found ENOTTY being used for anything else
> there.

I'm folding this in.
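
For reference, the folded-in check amounts to the error translation
below. This is a standalone userspace sketch, not the code in
fs/file_attr.c: the helper name fileattr_syscall_error() is
hypothetical, and ENOIOCTLCMD is defined by hand because it is a
kernel-internal code (515 in include/linux/errno.h) that userspace
<errno.h> does not export.

```c
#include <assert.h>
#include <errno.h>

/* Kernel-internal errno; defined here only so the sketch builds in userspace. */
#ifndef ENOIOCTLCMD
#define ENOIOCTLCMD 515
#endif

/*
 * Hypothetical helper: translate the "not supported" answers a
 * ->fileattr_get/->fileattr_set implementation may give into the error
 * the file_getattr/file_setattr syscalls should return.
 */
static int fileattr_syscall_error(int error)
{
	if (error == -ENOIOCTLCMD || error == -ENOTTY)
		return -EOPNOTSUPP;
	return error;
}

/* Expected behaviour, written as assertions. */
static void fileattr_syscall_error_examples(void)
{
	assert(fileattr_syscall_error(-ENOIOCTLCMD) == -EOPNOTSUPP);
	assert(fileattr_syscall_error(-ENOTTY) == -EOPNOTSUPP);
	assert(fileattr_syscall_error(-EPERM) == -EPERM);	/* passed through */
	assert(fileattr_syscall_error(0) == 0);			/* success untouched */
}
```

Translating only at the syscall boundary leaves the ioctl path unchanged.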


* Re: [PATCH 2/2] fs: return EOPNOTSUPP from file_setattr/file_getattr syscalls
From: Andrey Albershteyn @ 2025-10-10 10:05 UTC
  To: Darrick J. Wong
  Cc: linux-api, linux-fsdevel, linux-kernel, linux-xfs, Jan Kara,
	Jiri Slaby, Christian Brauner, Arnd Bergmann, Andrey Albershteyn
In-Reply-To: <20251009172041.GA6174@frogsfrogsfrogs>

On 2025-10-09 10:20:41, Darrick J. Wong wrote:
> On Wed, Oct 08, 2025 at 02:44:18PM +0200, Andrey Albershteyn wrote:
> > These syscalls call to vfs_fileattr_get/set functions which return
> > ENOIOCTLCMD if filesystem doesn't support setting file attribute on an
> > inode. For syscalls EOPNOTSUPP would be more appropriate return error.
> > 
> > Signed-off-by: Andrey Albershteyn <aalbersh@kernel.org>
> > ---
> >  fs/file_attr.c | 4 ++++
> >  1 file changed, 4 insertions(+)
> > 
> > diff --git a/fs/file_attr.c b/fs/file_attr.c
> > index 460b2dd21a85..5e3e2aba97b5 100644
> > --- a/fs/file_attr.c
> > +++ b/fs/file_attr.c
> > @@ -416,6 +416,8 @@ SYSCALL_DEFINE5(file_getattr, int, dfd, const char __user *, filename,
> >  	}
> >  
> >  	error = vfs_fileattr_get(filepath.dentry, &fa);
> > +	if (error == -ENOIOCTLCMD)
> 
> Hrm.  Back in 6.17, XFS would return ENOTTY if you called ->fileattr_get
> on a special file:
> 
> int
> xfs_fileattr_get(
> 	struct dentry		*dentry,
> 	struct file_kattr	*fa)
> {
> 	struct xfs_inode	*ip = XFS_I(d_inode(dentry));
> 
> 	if (d_is_special(dentry))
> 		return -ENOTTY;
> 	...
> }
> 
> Given that there are other fileattr_[gs]et implementations out there
> that might return ENOTTY (e.g. fuse servers and other externally
> maintained filesystems), I think both syscall functions need to check
> for that as well:
> 
> 	if (error == -ENOIOCTLCMD || error == -ENOTTY)
> 		return -EOPNOTSUPP;

Makes sense (looks like ubifs, jfs and gfs2 also return ENOTTY for
special files). I haven't found ENOTTY being used for anything else
there.

-- 
- Andrey



* [PATCH v2 3/3] init: remove /proc/sys/kernel/real-root-dev
From: Askar Safin @ 2025-10-10  9:40 UTC
  To: linux-fsdevel, linux-kernel
  Cc: Linus Torvalds, Greg Kroah-Hartman, Christian Brauner, Al Viro,
	Jan Kara, Christoph Hellwig, Jens Axboe, Andy Shevchenko,
	Aleksa Sarai, Thomas Weißschuh, Julian Stecklina, Gao Xiang,
	Art Nikpal, Andrew Morton, Alexander Graf, Rob Landley,
	Lennart Poettering, linux-arch, linux-block, initramfs, linux-api,
	linux-doc, Michal Simek, Luis Chamberlain, Kees Cook,
	Thorsten Blum, Heiko Carstens, Arnd Bergmann, Dave Young,
	Christophe Leroy, Krzysztof Kozlowski, Borislav Petkov,
	Jessica Clarke, Nicolas Schichan, David Disseldorp, patches
In-Reply-To: <20251010094047.3111495-1-safinaskar@gmail.com>

It is not used anymore.

Signed-off-by: Askar Safin <safinaskar@gmail.com>
---
 Documentation/admin-guide/sysctl/kernel.rst |  6 ------
 include/uapi/linux/sysctl.h                 |  1 -
 init/do_mounts_initrd.c                     | 20 --------------------
 3 files changed, 27 deletions(-)

diff --git a/Documentation/admin-guide/sysctl/kernel.rst b/Documentation/admin-guide/sysctl/kernel.rst
index 8b49eab937d0..cc958c228bc2 100644
--- a/Documentation/admin-guide/sysctl/kernel.rst
+++ b/Documentation/admin-guide/sysctl/kernel.rst
@@ -1215,12 +1215,6 @@ that support this feature.
 ==  ===========================================================================
 
 
-real-root-dev
-=============
-
-See Documentation/admin-guide/initrd.rst.
-
-
 reboot-cmd (SPARC only)
 =======================
 
diff --git a/include/uapi/linux/sysctl.h b/include/uapi/linux/sysctl.h
index 63d1464cb71c..1c7fe0f4dca4 100644
--- a/include/uapi/linux/sysctl.h
+++ b/include/uapi/linux/sysctl.h
@@ -92,7 +92,6 @@ enum
 	KERN_DOMAINNAME=8,	/* string: domainname */
 
 	KERN_PANIC=15,		/* int: panic timeout */
-	KERN_REALROOTDEV=16,	/* real root device to mount after initrd */
 
 	KERN_SPARC_REBOOT=21,	/* reboot command on Sparc */
 	KERN_CTLALTDEL=22,	/* int: allow ctl-alt-del to reboot */
diff --git a/init/do_mounts_initrd.c b/init/do_mounts_initrd.c
index d4f5f4c60a22..fb0c9d3b722f 100644
--- a/init/do_mounts_initrd.c
+++ b/init/do_mounts_initrd.c
@@ -8,31 +8,11 @@
 
 unsigned long initrd_start, initrd_end;
 int initrd_below_start_ok;
-static unsigned int real_root_dev;	/* do_proc_dointvec cannot handle kdev_t */
 static int __initdata mount_initrd = 1;
 
 phys_addr_t phys_initrd_start __initdata;
 unsigned long phys_initrd_size __initdata;
 
-#ifdef CONFIG_SYSCTL
-static const struct ctl_table kern_do_mounts_initrd_table[] = {
-	{
-		.procname       = "real-root-dev",
-		.data           = &real_root_dev,
-		.maxlen         = sizeof(int),
-		.mode           = 0644,
-		.proc_handler   = proc_dointvec,
-	},
-};
-
-static __init int kernel_do_mounts_initrd_sysctls_init(void)
-{
-	register_sysctl_init("kernel", kern_do_mounts_initrd_table);
-	return 0;
-}
-late_initcall(kernel_do_mounts_initrd_sysctls_init);
-#endif /* CONFIG_SYSCTL */
-
 static int __init no_initrd(char *str)
 {
 	pr_warn("noinitrd option is deprecated and will be removed soon\n");
-- 
2.47.3



* [PATCH v2 2/3] initrd: remove deprecated code path (linuxrc)
From: Askar Safin @ 2025-10-10  9:40 UTC
  To: linux-fsdevel, linux-kernel
  Cc: Linus Torvalds, Greg Kroah-Hartman, Christian Brauner, Al Viro,
	Jan Kara, Christoph Hellwig, Jens Axboe, Andy Shevchenko,
	Aleksa Sarai, Thomas Weißschuh, Julian Stecklina, Gao Xiang,
	Art Nikpal, Andrew Morton, Alexander Graf, Rob Landley,
	Lennart Poettering, linux-arch, linux-block, initramfs, linux-api,
	linux-doc, Michal Simek, Luis Chamberlain, Kees Cook,
	Thorsten Blum, Heiko Carstens, Arnd Bergmann, Dave Young,
	Christophe Leroy, Krzysztof Kozlowski, Borislav Petkov,
	Jessica Clarke, Nicolas Schichan, David Disseldorp, patches
In-Reply-To: <20251010094047.3111495-1-safinaskar@gmail.com>

Remove the linuxrc initrd code path, which was deprecated in 2020.

Initramfs and (non-initial) RAM disks (i.e. brd) still work.

Both built-in and bootloader-supplied initramfs still work.

The non-linuxrc initrd code path (i.e. using /dev/ram as the final root
filesystem) still works, but I added a deprecation message to it.

Signed-off-by: Askar Safin <safinaskar@gmail.com>
---
 .../admin-guide/kernel-parameters.txt         |  4 +-
 fs/init.c                                     | 14 ---
 include/linux/init_syscalls.h                 |  1 -
 include/linux/initrd.h                        |  2 -
 init/do_mounts.c                              |  4 +-
 init/do_mounts.h                              | 18 +---
 init/do_mounts_initrd.c                       | 85 ++-----------------
 init/do_mounts_rd.c                           | 17 +---
 8 files changed, 17 insertions(+), 128 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 521ab3425504..24d8899d8a39 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -4285,7 +4285,7 @@
 			Note that this argument takes precedence over
 			the CONFIG_RCU_NOCB_CPU_DEFAULT_ALL option.
 
-	noinitrd	[RAM] Tells the kernel not to load any configured
+	noinitrd	[Deprecated,RAM] Tells the kernel not to load any configured
 			initial RAM disk.
 
 	nointremap	[X86-64,Intel-IOMMU,EARLY] Do not enable interrupt
@@ -5299,7 +5299,7 @@
 	ramdisk_size=	[RAM] Sizes of RAM disks in kilobytes
 			See Documentation/admin-guide/blockdev/ramdisk.rst.
 
-	ramdisk_start=	[RAM] RAM disk image start address
+	ramdisk_start=	[Deprecated,RAM] RAM disk image start address
 
 	random.trust_cpu=off
 			[KNL,EARLY] Disable trusting the use of the CPU's
diff --git a/fs/init.c b/fs/init.c
index 07f592ccdba8..60719494d9a0 100644
--- a/fs/init.c
+++ b/fs/init.c
@@ -27,20 +27,6 @@ int __init init_mount(const char *dev_name, const char *dir_name,
 	return ret;
 }
 
-int __init init_umount(const char *name, int flags)
-{
-	int lookup_flags = LOOKUP_MOUNTPOINT;
-	struct path path;
-	int ret;
-
-	if (!(flags & UMOUNT_NOFOLLOW))
-		lookup_flags |= LOOKUP_FOLLOW;
-	ret = kern_path(name, lookup_flags, &path);
-	if (ret)
-		return ret;
-	return path_umount(&path, flags);
-}
-
 int __init init_chdir(const char *filename)
 {
 	struct path path;
diff --git a/include/linux/init_syscalls.h b/include/linux/init_syscalls.h
index 92045d18cbfc..0bdbc458a881 100644
--- a/include/linux/init_syscalls.h
+++ b/include/linux/init_syscalls.h
@@ -2,7 +2,6 @@
 
 int __init init_mount(const char *dev_name, const char *dir_name,
 		const char *type_page, unsigned long flags, void *data_page);
-int __init init_umount(const char *name, int flags);
 int __init init_chdir(const char *filename);
 int __init init_chroot(const char *filename);
 int __init init_chown(const char *filename, uid_t user, gid_t group, int flags);
diff --git a/include/linux/initrd.h b/include/linux/initrd.h
index f1a1f4c92ded..7e5d26c8136f 100644
--- a/include/linux/initrd.h
+++ b/include/linux/initrd.h
@@ -3,8 +3,6 @@
 #ifndef __LINUX_INITRD_H
 #define __LINUX_INITRD_H
 
-#define INITRD_MINOR 250 /* shouldn't collide with /dev/ram* too soon ... */
-
 /* starting block # of image */
 extern int rd_image_start;
 
diff --git a/init/do_mounts.c b/init/do_mounts.c
index 0f2f44e6250c..1054ad3c905a 100644
--- a/init/do_mounts.c
+++ b/init/do_mounts.c
@@ -476,13 +476,11 @@ void __init prepare_namespace(void)
 	if (saved_root_name[0])
 		ROOT_DEV = parse_root_device(saved_root_name);
 
-	if (initrd_load(saved_root_name))
-		goto out;
+	initrd_load();
 
 	if (root_wait)
 		wait_for_root(saved_root_name);
 	mount_root(saved_root_name);
-out:
 	devtmpfs_mount();
 	init_mount(".", "/", NULL, MS_MOVE, NULL);
 	init_chroot(".");
diff --git a/init/do_mounts.h b/init/do_mounts.h
index 6069ea3eb80d..a386ee5314c9 100644
--- a/init/do_mounts.h
+++ b/init/do_mounts.h
@@ -23,25 +23,15 @@ static inline __init int create_dev(char *name, dev_t dev)
 }
 
 #ifdef CONFIG_BLK_DEV_RAM
-
-int __init rd_load_disk(int n);
-int __init rd_load_image(char *from);
-
+int __init rd_load_image(void);
 #else
-
-static inline int rd_load_disk(int n) { return 0; }
-static inline int rd_load_image(char *from) { return 0; }
-
+static inline int rd_load_image(void) { return 0; }
 #endif
 
 #ifdef CONFIG_BLK_DEV_INITRD
-bool __init initrd_load(char *root_device_name);
+void __init initrd_load(void);
 #else
-static inline bool initrd_load(char *root_device_name)
-{
-	return false;
-	}
-
+static inline void initrd_load(void) { }
 #endif
 
 /* Ensure that async file closing finished to prevent spurious errors. */
diff --git a/init/do_mounts_initrd.c b/init/do_mounts_initrd.c
index f6867bad0d78..d4f5f4c60a22 100644
--- a/init/do_mounts_initrd.c
+++ b/init/do_mounts_initrd.c
@@ -2,13 +2,7 @@
 #include <linux/unistd.h>
 #include <linux/kernel.h>
 #include <linux/fs.h>
-#include <linux/minix_fs.h>
-#include <linux/romfs_fs.h>
 #include <linux/initrd.h>
-#include <linux/sched.h>
-#include <linux/freezer.h>
-#include <linux/kmod.h>
-#include <uapi/linux/mount.h>
 
 #include "do_mounts.h"
 
@@ -41,6 +35,7 @@ late_initcall(kernel_do_mounts_initrd_sysctls_init);
 
 static int __init no_initrd(char *str)
 {
+	pr_warn("noinitrd option is deprecated and will be removed soon\n");
 	mount_initrd = 0;
 	return 1;
 }
@@ -70,85 +65,17 @@ static int __init early_initrd(char *p)
 }
 early_param("initrd", early_initrd);
 
-static int __init init_linuxrc(struct subprocess_info *info, struct cred *new)
-{
-	ksys_unshare(CLONE_FS | CLONE_FILES);
-	console_on_rootfs();
-	/* move initrd over / and chdir/chroot in initrd root */
-	init_chdir("/root");
-	init_mount(".", "/", NULL, MS_MOVE, NULL);
-	init_chroot(".");
-	ksys_setsid();
-	return 0;
-}
-
-static void __init handle_initrd(char *root_device_name)
-{
-	struct subprocess_info *info;
-	static char *argv[] = { "linuxrc", NULL, };
-	extern char *envp_init[];
-	int error;
-
-	pr_warn("using deprecated initrd support, will be removed soon.\n");
-
-	real_root_dev = new_encode_dev(ROOT_DEV);
-	create_dev("/dev/root.old", Root_RAM0);
-	/* mount initrd on rootfs' /root */
-	mount_root_generic("/dev/root.old", root_device_name,
-			   root_mountflags & ~MS_RDONLY);
-	init_mkdir("/old", 0700);
-	init_chdir("/old");
-
-	info = call_usermodehelper_setup("/linuxrc", argv, envp_init,
-					 GFP_KERNEL, init_linuxrc, NULL, NULL);
-	if (!info)
-		return;
-	call_usermodehelper_exec(info, UMH_WAIT_PROC|UMH_FREEZABLE);
-
-	/* move initrd to rootfs' /old */
-	init_mount("..", ".", NULL, MS_MOVE, NULL);
-	/* switch root and cwd back to / of rootfs */
-	init_chroot("..");
-
-	if (new_decode_dev(real_root_dev) == Root_RAM0) {
-		init_chdir("/old");
-		return;
-	}
-
-	init_chdir("/");
-	ROOT_DEV = new_decode_dev(real_root_dev);
-	mount_root(root_device_name);
-
-	printk(KERN_NOTICE "Trying to move old root to /initrd ... ");
-	error = init_mount("/old", "/root/initrd", NULL, MS_MOVE, NULL);
-	if (!error)
-		printk("okay\n");
-	else {
-		if (error == -ENOENT)
-			printk("/initrd does not exist. Ignored.\n");
-		else
-			printk("failed\n");
-		printk(KERN_NOTICE "Unmounting old root\n");
-		init_umount("/old", MNT_DETACH);
-	}
-}
-
-bool __init initrd_load(char *root_device_name)
+void __init initrd_load(void)
 {
 	if (mount_initrd) {
 		create_dev("/dev/ram", Root_RAM0);
 		/*
-		 * Load the initrd data into /dev/ram0. Execute it as initrd
-		 * unless /dev/ram0 is supposed to be our actual root device,
-		 * in that case the ram disk is just set up here, and gets
-		 * mounted in the normal path.
+		 * Load the initrd data into /dev/ram0.
 		 */
-		if (rd_load_image("/initrd.image") && ROOT_DEV != Root_RAM0) {
-			init_unlink("/initrd.image");
-			handle_initrd(root_device_name);
-			return true;
+		if (rd_load_image()) {
+			pr_warn("using deprecated initrd support, will be removed in September 2026; "
+				"use initramfs instead or (as a last resort) /sys/firmware/initrd.\n");
 		}
 	}
 	init_unlink("/initrd.image");
-	return false;
 }
diff --git a/init/do_mounts_rd.c b/init/do_mounts_rd.c
index 5311f2d7edc8..0a021bbcd501 100644
--- a/init/do_mounts_rd.c
+++ b/init/do_mounts_rd.c
@@ -22,6 +22,7 @@ int __initdata rd_image_start;		/* starting block # of image */
 
 static int __init ramdisk_start_setup(char *str)
 {
+	pr_warn("ramdisk_start= option is deprecated and will be removed soon\n");
 	rd_image_start = simple_strtol(str,NULL,0);
 	return 1;
 }
@@ -177,7 +178,7 @@ static unsigned long nr_blocks(struct file *file)
 	return i_size_read(inode) >> 10;
 }
 
-int __init rd_load_image(char *from)
+int __init rd_load_image(void)
 {
 	int res = 0;
 	unsigned long rd_blocks, devblocks, nr_disks;
@@ -191,7 +192,7 @@ int __init rd_load_image(char *from)
 	if (IS_ERR(out_file))
 		goto out;
 
-	in_file = filp_open(from, O_RDONLY, 0);
+	in_file = filp_open("/initrd.image", O_RDONLY, 0);
 	if (IS_ERR(in_file))
 		goto noclose_input;
 
@@ -220,10 +221,7 @@ int __init rd_load_image(char *from)
 	/*
 	 * OK, time to copy in the data
 	 */
-	if (strcmp(from, "/initrd.image") == 0)
-		devblocks = nblocks;
-	else
-		devblocks = nr_blocks(in_file);
+	devblocks = nblocks;
 
 	if (devblocks == 0) {
 		printk(KERN_ERR "RAMDISK: could not determine device size\n");
@@ -267,13 +265,6 @@ int __init rd_load_image(char *from)
 	return res;
 }
 
-int __init rd_load_disk(int n)
-{
-	create_dev("/dev/root", ROOT_DEV);
-	create_dev("/dev/ram", MKDEV(RAMDISK_MAJOR, n));
-	return rd_load_image("/dev/root");
-}
-
 static int exit_code;
 static int decompress_error;
 
-- 
2.47.3



* [PATCH v2 1/3] init: remove deprecated "load_ramdisk" and "prompt_ramdisk" command line parameters
From: Askar Safin @ 2025-10-10  9:40 UTC
  To: linux-fsdevel, linux-kernel
  Cc: Linus Torvalds, Greg Kroah-Hartman, Christian Brauner, Al Viro,
	Jan Kara, Christoph Hellwig, Jens Axboe, Andy Shevchenko,
	Aleksa Sarai, Thomas Weißschuh, Julian Stecklina, Gao Xiang,
	Art Nikpal, Andrew Morton, Alexander Graf, Rob Landley,
	Lennart Poettering, linux-arch, linux-block, initramfs, linux-api,
	linux-doc, Michal Simek, Luis Chamberlain, Kees Cook,
	Thorsten Blum, Heiko Carstens, Arnd Bergmann, Dave Young,
	Christophe Leroy, Krzysztof Kozlowski, Borislav Petkov,
	Jessica Clarke, Nicolas Schichan, David Disseldorp, patches
In-Reply-To: <20251010094047.3111495-1-safinaskar@gmail.com>

...which do nothing. They were deprecated (in documentation) in
6b99e6e6aa62 ("Documentation/admin-guide: blockdev/ramdisk: remove use of
"rdev"") and in kernel messages in c8376994c86c ("initrd: remove support
for multiple floppies")

Signed-off-by: Askar Safin <safinaskar@gmail.com>
---
 Documentation/admin-guide/kernel-parameters.txt | 4 ----
 arch/arm/configs/neponset_defconfig             | 2 +-
 init/do_mounts.c                                | 7 -------
 init/do_mounts_rd.c                             | 7 -------
 4 files changed, 1 insertion(+), 19 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index e019db1633fd..521ab3425504 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -3280,8 +3280,6 @@
 			If there are multiple matching configurations changing
 			the same attribute, the last one is used.
 
-	load_ramdisk=	[RAM] [Deprecated]
-
 	lockd.nlm_grace_period=P  [NFS] Assign grace period.
 			Format: <integer>
 
@@ -5245,8 +5243,6 @@
 			Param: <number> - step/bucket size as a power of 2 for
 				statistical time based profiling.
 
-	prompt_ramdisk=	[RAM] [Deprecated]
-
 	prot_virt=	[S390] enable hosting protected virtual machines
 			isolated from the hypervisor (if hardware supports
 			that). If enabled, the default kernel base address
diff --git a/arch/arm/configs/neponset_defconfig b/arch/arm/configs/neponset_defconfig
index 2227f86100ad..4d720001c12e 100644
--- a/arch/arm/configs/neponset_defconfig
+++ b/arch/arm/configs/neponset_defconfig
@@ -9,7 +9,7 @@ CONFIG_ASSABET_NEPONSET=y
 CONFIG_ZBOOT_ROM_TEXT=0x80000
 CONFIG_ZBOOT_ROM_BSS=0xc1000000
 CONFIG_ZBOOT_ROM=y
-CONFIG_CMDLINE="console=ttySA0,38400n8 cpufreq=221200 rw root=/dev/mtdblock2 mtdparts=sa1100:512K(boot),1M(kernel),2560K(initrd),4M(root) load_ramdisk=1 prompt_ramdisk=0 mem=32M noinitrd initrd=0xc0800000,3M"
+CONFIG_CMDLINE="console=ttySA0,38400n8 cpufreq=221200 rw root=/dev/mtdblock2 mtdparts=sa1100:512K(boot),1M(kernel),2560K(initrd),4M(root) mem=32M noinitrd initrd=0xc0800000,3M"
 CONFIG_FPE_NWFPE=y
 CONFIG_PM=y
 CONFIG_MODULES=y
diff --git a/init/do_mounts.c b/init/do_mounts.c
index 6af29da8889e..0f2f44e6250c 100644
--- a/init/do_mounts.c
+++ b/init/do_mounts.c
@@ -34,13 +34,6 @@ static int root_wait;
 
 dev_t ROOT_DEV;
 
-static int __init load_ramdisk(char *str)
-{
-	pr_warn("ignoring the deprecated load_ramdisk= option\n");
-	return 1;
-}
-__setup("load_ramdisk=", load_ramdisk);
-
 static int __init readonly(char *str)
 {
 	if (*str)
diff --git a/init/do_mounts_rd.c b/init/do_mounts_rd.c
index 19d9f33dcacf..5311f2d7edc8 100644
--- a/init/do_mounts_rd.c
+++ b/init/do_mounts_rd.c
@@ -18,13 +18,6 @@
 static struct file *in_file, *out_file;
 static loff_t in_pos, out_pos;
 
-static int __init prompt_ramdisk(char *str)
-{
-	pr_warn("ignoring the deprecated prompt_ramdisk= option\n");
-	return 1;
-}
-__setup("prompt_ramdisk=", prompt_ramdisk);
-
 int __initdata rd_image_start;		/* starting block # of image */
 
 static int __init ramdisk_start_setup(char *str)
-- 
2.47.3



* [PATCH v2 0/3] initrd: remove half of classic initrd support
From: Askar Safin @ 2025-10-10  9:40 UTC
  To: linux-fsdevel, linux-kernel
  Cc: Linus Torvalds, Greg Kroah-Hartman, Christian Brauner, Al Viro,
	Jan Kara, Christoph Hellwig, Jens Axboe, Andy Shevchenko,
	Aleksa Sarai, Thomas Weißschuh, Julian Stecklina, Gao Xiang,
	Art Nikpal, Andrew Morton, Alexander Graf, Rob Landley,
	Lennart Poettering, linux-arch, linux-block, initramfs, linux-api,
	linux-doc, Michal Simek, Luis Chamberlain, Kees Cook,
	Thorsten Blum, Heiko Carstens, Arnd Bergmann, Dave Young,
	Christophe Leroy, Krzysztof Kozlowski, Borislav Petkov,
	Jessica Clarke, Nicolas Schichan, David Disseldorp, patches

Intro
====
This patchset removes half of classic initrd (initial RAM disk) support,
i.e. the linuxrc code path, which was deprecated in 2020.
Initramfs stays, and the RAM disk itself (brd) stays.
The other half of initrd stays, too.
init/do_mounts* are listed in the VFS entry in
MAINTAINERS, so I think this patchset should go through the VFS tree.
I tested the patchset on 8 (!!!) archs in Qemu (see details below).
If you still use initrd, see below for a workaround.

In 2020 a deprecation notice was added to the linuxrc initrd code path.
In the previous version of this patchset I tried to remove initrd
fully, but Nicolas Schichan reported that he still uses the
other code path (the root=/dev/ram0 one) on a million devices [4].
The root=/dev/ram0 code path did not contain a deprecation notice.

So, in this version of the patchset I remove the deprecated code path,
i.e. the linuxrc one, while keeping the other, i.e. the root=/dev/ram0 one.

I also added a deprecation notice to the remaining code path, i.e. the
root=/dev/ram0 one. I plan to send patches for full removal
of initrd after one year, i.e. in September 2026 (of course,
initramfs will still work).

Also, I tried to keep this patchset small so that it
can be reverted easily. I plan to send cleanups later.

Details
====
Other user-visible changes:

- Removed kernel command line parameters "load_ramdisk" and
"prompt_ramdisk", which did nothing and were deprecated
- Removed /proc/sys/kernel/real-root-dev. It was used
for initrd only
- Command line parameters "noinitrd" and "ramdisk_start=" are deprecated

This patchset is based on current mainline (7f7072574127).

Testing
====
I tested my patchset on many architectures in Qemu using my Rust
program, heavily based on mkroot [1].

I used the following cross-compilers:

aarch64-linux-musleabi
armv4l-linux-musleabihf
armv5l-linux-musleabihf
armv7l-linux-musleabihf
i486-linux-musl
i686-linux-musl
mips-linux-musl
mips64-linux-musl
mipsel-linux-musl
powerpc-linux-musl
powerpc64-linux-musl
powerpc64le-linux-musl
riscv32-linux-musl
riscv64-linux-musl
s390x-linux-musl
sh4-linux-musl
sh4eb-linux-musl
x86_64-linux-musl

taken from this directory [2].

So, as you can see, there are 18 triplets, which correspond to 8 subdirs in arch/.

For every triplet I tested that:
- Initramfs still works (both builtin and external)
- Direct boot from disk still works
- Remaining initrd code path (root=/dev/ram0) still works

Workaround
====
If "retain_initrd" is passed to the kernel, then the initramfs/initrd
passed by the bootloader is retained and becomes available after boot
as the read-only magic file /sys/firmware/initrd [3].

No copies are involved, i.e. /sys/firmware/initrd is simply
a reference to the original blob passed by the bootloader.

This works even if the initrd/initramfs is not recognized by the kernel
in any way, i.e. even if it is neither a valid cpio archive nor
a filesystem image supported by classic initrd.

This works both with my patchset and without it.

This means that you can emulate classic initrd as follows:
link a builtin initramfs into the kernel; in /init in this initramfs,
copy /sys/firmware/initrd to some file in / and loop-mount it.

This is even better than classic initrd, because:
- You can use a filesystem not supported by classic initrd, for example erofs
- Only one copy is involved (from /sys/firmware/initrd to some file in /)
as opposed to two when using classic initrd

Still, I don't recommend using this workaround, because
I want everyone to migrate to proper modern initramfs.
But still you can use this workaround if you want.
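
The emulation described above could look roughly like the following /init
script. This is only a sketch under assumptions: it presumes that
"retain_initrd" is on the kernel command line, that the builtin initramfs
ships busybox-style mount and cp, and that loop device support is available.
The names /initrd.img and /newroot and both function names are made up for
illustration.

```shell
#!/bin/sh
# Hypothetical /init for a builtin initramfs (illustrative names throughout).

# copy_retained_initrd SRC DST: copy the retained blob to a regular file,
# since the sysfs file itself cannot be loop-mounted directly.
copy_retained_initrd() {
    [ -r "$1" ] || return 1
    cp "$1" "$2"
}

emulate_classic_initrd() {
    mkdir -p /proc /sys /dev /newroot
    mount -t proc proc /proc
    mount -t sysfs sysfs /sys
    mount -t devtmpfs devtmpfs /dev        # provides the loop devices
    copy_retained_initrd /sys/firmware/initrd /initrd.img || return 1
    mount -o loop,ro /initrd.img /newroot  # filesystem type is autodetected
    # From here on, continue as the setup requires, for example by
    # switching root into /newroot.
}

# Only act when actually started by the kernel as /init.
case "$0" in
    /init) emulate_classic_initrd ;;
esac
```

The copy step is the single copy mentioned above; everything after the
loop-mount depends on what the image contains.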

Also: it is not possible to loop-mount /sys/firmware/initrd
directly. In theory the kernel could be changed to allow this
(and/or to make the file writable), but I think nobody needs this,
and I don't want to implement it.

On Qemu's -initrd and GRUB's initrd
====
Don't panic, this patchset doesn't remove initramfs
(which is used by nearly all Linux distros). And I don't
have plans to remove it.

Qemu's -initrd option and GRUB's initrd command refer
to the initrd bootloader mechanism, which is used to
load both initrd and (external) initramfs.

So, if you use Qemu's -initrd or GRUB's initrd,
then you likely use them to pass initramfs, and thus
you are safe.
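
If in doubt, one rough way to check what is being passed is to look at the
first bytes of the file. The sketch below only recognizes an uncompressed
"newc" cpio archive, which is the format initramfs uses. A compressed
initramfs (gzip, xz, zstd and so on) will not match and has to be
decompressed first, so a negative result proves nothing on its own. The
function name is made up for illustration.

```shell
#!/bin/sh
# looks_like_plain_cpio FILE: succeed if FILE begins with the ASCII magic of
# an uncompressed "newc" cpio archive (070701, or 070702 for the CRC variant).
looks_like_plain_cpio() {
    # Read the first six bytes; drop NUL bytes so binary input is harmless.
    magic=$(dd if="$1" bs=6 count=1 2>/dev/null | tr -d '\000')
    case "$magic" in
        070701|070702) return 0 ;;
        *) return 1 ;;
    esac
}
```

For example, "looks_like_plain_cpio my-initramfs.cpio && echo initramfs",
where my-initramfs.cpio stands for whatever file is given to -initrd.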

v1: https://lore.kernel.org/lkml/20250913003842.41944-1-safinaskar@gmail.com/

v1 -> v2 changes:
- A lot. I removed most patches, see cover letter for details

[1] https://github.com/landley/toybox/tree/master/mkroot
[2] https://landley.net/toybox/downloads/binaries/toolchains/latest
[3] https://lore.kernel.org/all/20231207235654.16622-1-graf@amazon.com/
[4] https://lore.kernel.org/lkml/20250918152830.438554-1-nschichan@freebox.fr/

Askar Safin (3):
  init: remove deprecated "load_ramdisk" and "prompt_ramdisk" command
    line parameters
  initrd: remove deprecated code path (linuxrc)
  init: remove /proc/sys/kernel/real-root-dev

 .../admin-guide/kernel-parameters.txt         |   8 +-
 Documentation/admin-guide/sysctl/kernel.rst   |   6 -
 arch/arm/configs/neponset_defconfig           |   2 +-
 fs/init.c                                     |  14 ---
 include/linux/init_syscalls.h                 |   1 -
 include/linux/initrd.h                        |   2 -
 include/uapi/linux/sysctl.h                   |   1 -
 init/do_mounts.c                              |  11 +-
 init/do_mounts.h                              |  18 +--
 init/do_mounts_initrd.c                       | 105 +-----------------
 init/do_mounts_rd.c                           |  24 +---
 11 files changed, 18 insertions(+), 174 deletions(-)


base-commit: 7f7072574127c9e971cad83a0274e86f6275c0d5
-- 
2.47.3



* Re: [PATCH v3 19/30] liveupdate: luo_sysfs: add sysfs state monitoring
From: Greg KH @ 2025-10-10  6:39 UTC (permalink / raw)
  To: Pratyush Yadav
  Cc: Yanjun.Zhu, Pasha Tatashin, jasonmiu, graf, changyuanl, rppt,
	dmatlack, rientjes, corbet, rdunlap, ilpo.jarvinen, kanie, ojeda,
	aliceryhl, masahiroy, akpm, tj, yoann.congal, mmaurer,
	roman.gushchin, chenridong, axboe, mark.rutland, jannh,
	vincent.guittot, hannes, dan.j.williams, david, joel.granados,
	rostedt, anna.schumaker, song, zhangguopeng, linux, linux-kernel,
	linux-doc, linux-mm, tglx, mingo, bp, dave.hansen, x86, hpa,
	rafael, dakr, bartosz.golaszewski, cw00.choi, myungjoo.ham,
	yesanishhere, Jonathan.Cameron, quic_zijuhu, aleksander.lobakin,
	ira.weiny, andriy.shevchenko, leon, lukas, bhelgaas, wagi,
	djeffery, stuart.w.hayes, lennart, brauner, linux-api,
	linux-fsdevel, saeedm, ajayachandra, jgg, parav, leonro, witu
In-Reply-To: <mafs0ecrbmzzh.fsf@kernel.org>

On Fri, Oct 10, 2025 at 01:12:18AM +0200, Pratyush Yadav wrote:
> On Thu, Oct 09 2025, Yanjun.Zhu wrote:
> 
> > On 10/9/25 10:04 AM, Pasha Tatashin wrote:
> >> On Thu, Oct 9, 2025 at 11:35 AM Zhu Yanjun <yanjun.zhu@linux.dev> wrote:
> >>>
> >>> On 2025/10/9 5:01, Pasha Tatashin wrote:
> >>>>>> Because the window of a kernel live update is short, it is difficult to count
> >>>>>> how many times the kernel has been live updated.
> >>>>>>
> >>>>>> Is it possible to add a variable that counts the number of times the kernel has
> >>>>>> been live updated?
> >>>>> The kernel doesn't do the live update on its own. The process is driven
> >>>>> and sequenced by userspace. So if you want to keep statistics, you
> >>>>> should do it from your userspace (luod maybe?). I don't see any need for
> >>>>> this in the kernel.
> >>>>>
> >>>> One use case I can think of is including information in kdump or the
> >>>> backtrace warning/panic messages about how many times this machine has
> >>>> been live-updated. In the past, I've seen bugs (related to memory
> >>>> corruption) that occurred only after several kexecs, not on the first
> >>>> one. With live updates, especially while the code is being stabilized,
> >>>> I imagine we might have a similar situation. For that reason, it could
> >>>> be useful to have a count in the dmesg logs showing how many times
> >>>> this machine has been live-updated. While this information is also
> >>>> available in userspace, it would be simpler for kernel developers
> >>>> triaging these issues if everything were in one place.
> 
> Hmm, good point.
> 
> >>> I’m considering this issue from a system security perspective. After the
> >>> kernel is automatically updated, user-space applications are usually
> >>> unaware of the change. In one possible scenario, an attacker could
> >>> replace the kernel with a compromised version, while user-space
> >>> applications remain unaware of it — which poses a potential security risk.
> 
> Wouldn't signing be the way to avoid that? Because if the kernel is
> compromised then it can very well fake the reboot count as well.
> 
> >>>
> >>> To mitigate this, it would be useful to expose the number of kernel
> >>> updates through a sysfs interface, so that we can detect whether the
> >>> kernel has been updated and then collect information about the new
> >>> kernel to check for possible security issues.
> >>>
> >>> Of course, there are other ways to detect kernel updates — for example,
> >>> by using ftrace to monitor functions involved in live kernel updates —
> >>> but such approaches tend to have a higher performance overhead. In
> >>> contrast, adding a simple update counter to track live kernel updates
> >>> would provide similar monitoring capability with minimal overhead.
> >> Would a print during boot work for you, i.e. including the number when we
> >> print that this kernel is live updating? Otherwise, we could export this
> >> number in debugfs.
> > Since I received a notification that my previous message was not sent
> > successfully, I am resending it.
> >
> > IMO, it would be better to export this number via debugfs. This approach reduces
> > the overhead involved in detecting a kernel live update.
> > If the number is printed in logs instead, the overhead would be higher compared
> > to using debugfs.
> 
> Yeah, debugfs sounds fine. No ABI at least.

Do not provide any functionality in debugfs that userspace relies on at
all, as odds are, it will not be able to be accessed by most/all of
userspace on many systems.  It is for debugging only.

thanks,

greg k-h


* Re: [PATCH] fs: Propagate FMODE_NOCMTIME flag to user-facing O_NOCMTIME
From: Christoph Hellwig @ 2025-10-10  5:27 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Christoph Hellwig, Pavel Emelyanov, linux-fsdevel,
	Raphael S . Carvalho, linux-api, linux-xfs
In-Reply-To: <CALCETrW3iQWQTdMbB52R4=GztfuFYvN_8p52H1fopdS8uExQWg@mail.gmail.com>

On Wed, Oct 08, 2025 at 08:22:35AM -0700, Andy Lutomirski wrote:
> On Mon, Oct 6, 2025 at 10:08 PM Christoph Hellwig <hch@infradead.org> wrote:
> >
> > On Sat, Oct 04, 2025 at 09:08:05AM -0700, Andy Lutomirski wrote:
> > > > Well, we'll need to look into that, including maybe non-blocking
> > > > timestamp updates.
> > > >
> > >
> > > It's been 12 years (!), but maybe it's time to reconsider this:
> > >
> > > https://lore.kernel.org/all/cover.1377193658.git.luto@amacapital.net/
> >
> > I don't see how that is relevant here.  Also writes through shared
> > mmaps are problematic for so many reasons that I'm not sure we want
> > to encourage people to use that more.
> >
> 
> Because the same exact issue exists in the normal non-mmap write path,
> and I can even quote you upthread :)

The thread that started this is about io_uring nonblock writes, aka
O_DIRECT.  So there isn't any writeback to defer to. 


* Re: [PATCH-RFC] init: simplify initrd code (was Re: [PATCH RESEND 00/62] initrd: remove classic initrd support).
From: Askar Safin @ 2025-10-10  4:57 UTC (permalink / raw)
  To: nschichan
  Cc: akpm, andy.shevchenko, axboe, brauner, cyphar, devicetree,
	ecurtin, email2tema, graf, gregkh, hca, hch, hsiangkao, initramfs,
	jack, julian.stecklina, kees, linux-acpi, linux-alpha, linux-api,
	linux-arch, linux-arm-kernel, linux-block, linux-csky, linux-doc,
	linux-efi, linux-ext4, linux-fsdevel, linux-hexagon, linux-kernel,
	linux-m68k, linux-mips, linux-openrisc, linux-parisc, linux-riscv,
	linux-s390, linux-sh, linux-snps-arc, linux-um, linuxppc-dev,
	loongarch, mcgrof, mingo, monstr, mzxreary, patches, rob,
	sparclinux, thomas.weissschuh, thorsten.blum, torvalds, tytso,
	viro, x86
In-Reply-To: <20250925131055.3933381-1-nschichan@freebox.fr>

On Thu, Sep 25, 2025 at 4:12 PM <nschichan@freebox.fr> wrote:
> - drop prompt_ramdisk and ramdisk_start kernel parameters
> - drop compression support
> - drop image autodetection, the whole /initrd.image content is now
>   copied into /dev/ram0
> - remove rd_load_disk() which doesn't seem to be used anywhere.

I welcome any initrd simplification!

> Hopefully my email config is now better and reaches gmail users
> correctly.

Yes, I got this email.

--
Askar Safin


* Re: [PATCH RESEND 00/62] initrd: remove classic initrd support
From: Askar Safin @ 2025-10-10  4:09 UTC (permalink / raw)
  To: Jessica Clarke
  Cc: linux-fsdevel, linux-kernel, Linus Torvalds, Greg Kroah-Hartman,
	Christian Brauner, Al Viro, Jan Kara, Christoph Hellwig,
	Jens Axboe, Andy Shevchenko, Aleksa Sarai, Thomas Weißschuh,
	Julian Stecklina, Gao Xiang, Art Nikpal, Andrew Morton,
	Eric Curtin, Alexander Graf, Rob Landley, Lennart Poettering,
	linux-arch, linux-alpha, linux-snps-arc, linux-arm-kernel,
	linux-csky, linux-hexagon, loongarch, linux-m68k, linux-mips,
	linux-openrisc, linux-parisc, linuxppc-dev, linux-riscv,
	linux-s390, linux-sh, sparclinux, linux-um, x86, Ingo Molnar,
	linux-block, initramfs, linux-api, linux-doc, linux-efi,
	linux-ext4, Theodore Y . Ts'o, linux-acpi, Michal Simek,
	devicetree, Luis Chamberlain, Kees Cook, Thorsten Blum,
	Heiko Carstens, patches
In-Reply-To: <A08066E1-A57E-4980-B15A-8FB00AC747CC@jrtc27.com>

On Tue, Sep 16, 2025 at 8:08 PM Jessica Clarke <jrtc27@jrtc27.com> wrote:
> I strongly suggest picking different names given __builtin_foo is the
> naming scheme used for GNU C builtins/intrinsics. I leave you and
> others to bikeshed that one.

Thank you! I will fix this.

-- 
Askar Safin


* Re: [PATCH RESEND 28/62] init: alpha, arc, arm, arm64, csky, m68k, microblaze, mips, nios2, openrisc, parisc, powerpc, s390, sh, sparc, um, x86, xtensa: rename initrd_{start,end} to virt_external_initramfs_{start,end}
From: Askar Safin @ 2025-10-10  4:07 UTC (permalink / raw)
  To: Rob Herring
  Cc: linux-fsdevel, linux-kernel, Linus Torvalds, Greg Kroah-Hartman,
	Christian Brauner, Al Viro, Jan Kara, Christoph Hellwig,
	Jens Axboe, Andy Shevchenko, Aleksa Sarai, Thomas Weißschuh,
	Julian Stecklina, Gao Xiang, Art Nikpal, Andrew Morton,
	Eric Curtin, Alexander Graf, Rob Landley, Lennart Poettering,
	linux-arch, linux-alpha, linux-snps-arc, linux-arm-kernel,
	linux-csky, linux-hexagon, loongarch, linux-m68k, linux-mips,
	linux-openrisc, linux-parisc, linuxppc-dev, linux-riscv,
	linux-s390, linux-sh, sparclinux, linux-um, x86, Ingo Molnar,
	linux-block, initramfs, linux-api, linux-doc, linux-efi,
	linux-ext4, Theodore Y . Ts'o, linux-acpi, Michal Simek,
	devicetree, Luis Chamberlain, Kees Cook, Thorsten Blum,
	Heiko Carstens, patches
In-Reply-To: <20250916030903.GA3598798-robh@kernel.org>

On Tue, Sep 16, 2025 at 6:09 AM Rob Herring <robh@kernel.org> wrote:
> There's not really any point in listing every arch in the subject.

Ok, I will fix this.


-- 
Askar Safin


* Re: [PATCH RESEND 02/62] init: remove deprecated "prompt_ramdisk" command line parameter, which does nothing
From: Askar Safin @ 2025-10-10  3:17 UTC (permalink / raw)
  To: Christophe Leroy
  Cc: linux-fsdevel, linux-kernel, Linus Torvalds, Greg Kroah-Hartman,
	Christian Brauner, Al Viro, Jan Kara, Christoph Hellwig,
	Jens Axboe, Andy Shevchenko, Aleksa Sarai, Thomas Weißschuh,
	Julian Stecklina, Gao Xiang, Art Nikpal, Andrew Morton,
	Eric Curtin, Alexander Graf, Rob Landley, Lennart Poettering,
	linux-arch, linux-alpha, linux-snps-arc, linux-arm-kernel,
	linux-csky, linux-hexagon, loongarch, linux-m68k, linux-mips,
	linux-openrisc, linux-parisc, linuxppc-dev, linux-riscv,
	linux-s390, linux-sh, sparclinux, linux-um, x86, Ingo Molnar,
	linux-block, initramfs, linux-api, linux-doc, linux-efi,
	linux-ext4, Theodore Y . Ts'o, linux-acpi, Michal Simek,
	devicetree, Luis Chamberlain, Kees Cook, Thorsten Blum,
	Heiko Carstens, patches
In-Reply-To: <053f39a9-06dc-4fbd-ad1b-325f9d3f3f66@csgroup.eu>

On Mon, Sep 15, 2025 at 2:16 PM Christophe Leroy
<christophe.leroy@csgroup.eu> wrote:
> Squash patch 1 and patch 2 together and say this is cleanup of two
> options deprecated by commit c8376994c86c ("initrd: remove support for
> multiple floppies") with the documentation by commit 6b99e6e6aa62
> ("Documentation/admin-guide: blockdev/ramdisk: remove use of "rdev"")

Will do in v2.

-- 
Askar Safin


* Re: [PATCH RESEND 21/62] init: remove all mentions of root=/dev/ram*
From: Askar Safin @ 2025-10-10  2:48 UTC (permalink / raw)
  To: Krzysztof Kozlowski
  Cc: linux-fsdevel, linux-kernel, Linus Torvalds, Greg Kroah-Hartman,
	Christian Brauner, Al Viro, Jan Kara, Christoph Hellwig,
	Jens Axboe, Andy Shevchenko, Aleksa Sarai, Thomas Weißschuh,
	Julian Stecklina, Gao Xiang, Art Nikpal, Andrew Morton,
	Eric Curtin, Alexander Graf, Rob Landley, Lennart Poettering,
	linux-arch, linux-alpha, linux-snps-arc, linux-arm-kernel,
	linux-csky, linux-hexagon, loongarch, linux-m68k, linux-mips,
	linux-openrisc, linux-parisc, linuxppc-dev, linux-riscv,
	linux-s390, linux-sh, sparclinux, linux-um, x86, Ingo Molnar,
	linux-block, initramfs, linux-api, linux-doc, linux-efi,
	linux-ext4, Theodore Y . Ts'o, linux-acpi, Michal Simek,
	devicetree, Luis Chamberlain, Kees Cook, Thorsten Blum,
	Heiko Carstens, patches
In-Reply-To: <a079375f-38c2-4f38-b2be-57737084fde8@kernel.org>

On Sun, Sep 14, 2025 at 1:06 PM Krzysztof Kozlowski <krzk@kernel.org> wrote:
> Please wrap commit message according to Linux coding style / submission
I will do this for v2

> To me your patchset is way too big bomb, too difficult to review. You
v2 will be small.

--
Askar Safin


* Re: [PATCH v4 00/30] Live Update Orchestrator
From: Pasha Tatashin @ 2025-10-09 23:50 UTC (permalink / raw)
  To: Pratyush Yadav
  Cc: jasonmiu, graf, changyuanl, rppt, dmatlack, rientjes, corbet,
	rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl, masahiroy, akpm,
	tj, yoann.congal, mmaurer, roman.gushchin, chenridong, axboe,
	mark.rutland, jannh, vincent.guittot, hannes, dan.j.williams,
	david, joel.granados, rostedt, anna.schumaker, song, zhangguopeng,
	linux, linux-kernel, linux-doc, linux-mm, gregkh, tglx, mingo, bp,
	dave.hansen, x86, hpa, rafael, dakr, bartosz.golaszewski,
	cw00.choi, myungjoo.ham, yesanishhere, Jonathan.Cameron,
	quic_zijuhu, aleksander.lobakin, ira.weiny, andriy.shevchenko,
	leon, lukas, bhelgaas, wagi, djeffery, stuart.w.hayes, lennart,
	brauner, linux-api, linux-fsdevel, saeedm, ajayachandra, jgg,
	parav, leonro, witu, hughd, skhawaja, chrisl, steven.sistare
In-Reply-To: <mafs0ms5zn0nm.fsf@kernel.org>

On Thu, Oct 9, 2025 at 6:58 PM Pratyush Yadav <pratyush@kernel.org> wrote:
>
> On Tue, Oct 07 2025, Pasha Tatashin wrote:
>
> > On Sun, Sep 28, 2025 at 9:03 PM Pasha Tatashin
> > <pasha.tatashin@soleen.com> wrote:
> >>
> [...]
> > 4. New File-Lifecycle-Bound Global State
> > ----------------------------------------
> > A new mechanism for managing global state was proposed, designed to be
> > tied to the lifecycle of the preserved files themselves. This would
> > allow a file owner (e.g., the IOMMU subsystem) to save and retrieve
> > global state that is only relevant when one or more of its FDs are
> > being managed by LUO.
>
> Is this going to replace LUO subsystems? If yes, then why? The global
> state will likely need to have its own lifecycle just like the FDs, and
> subsystems are a simple and clean abstraction to control that. I get the
> idea of only "activating" a subsystem when one or more of its FDs are
> participating in LUO, but we can do that while keeping subsystems
> around.

Thanks for the feedback. The FLB Global State is not replacing the LUO
subsystems. On the contrary, it's a higher-level abstraction that is
itself implemented as a LUO subsystem. The goal is to provide a
solution for a pattern that emerged during the PCI and IOMMU
discussions.

You can see the WIP implementation here, which shows it registering as
a subsystem named "luo-fh-states-v1-struct":
https://github.com/soleen/linux/commit/94e191aab6b355d83633718bc4a1d27dda390001

The existing subsystem API is a low-level tool that provides for the
preservation of a raw 8-byte handle. It doesn't provide locking, nor
is it explicitly tied to the lifecycle of any higher-level object like
a file handler. The new API is designed to solve a more specific
problem: allowing global components (like IOMMU or PCI) to
automatically track when resources relevant to them are added to or
removed from preservation. If HugeTLB requires a subsystem, it can
still use it, but I suspect it might benefit from FLB Global State as
well.

> Here is how I imagine the proposed API would compare against subsystems
> with hugetlb as an example (hugetlb support is still WIP, so I'm still
> not clear on specifics, but this is how I imagine it will work):
>
> - Hugetlb subsystem needs to track its huge page pools and which pages
>   are allocated and free. This is its global state. The pools get
>   reconstructed after kexec. Post-kexec, the free pages are ready for
>   allocation from other "regular" files and the pages used in LUO files
>   are reserved.
>
> - Pre-kexec, when a hugetlb FD is preserved, it marks that as preserved
>   in hugetlb's global data structure tracking this. This is runtime data
>   (say xarray), and _not_ serialized data. Reason being, there are
>   likely more FDs to come so no point in wasting time serializing just
>   yet.
>
>   This can look something like:
>
>   hugetlb_luo_preserve_folio(folio, ...);
>
>   Nice and simple.
>
>   Compare this with the new proposed API:
>
>   liveupdate_fh_global_state_get(h, &hugetlb_data);
>   // This will have update serialized state now.
>   hugetlb_luo_preserve_folio(hugetlb_data, folio, ...);
>   liveupdate_fh_global_state_put(h);
>
>   We do the same thing but in a very complicated way.
>
> - When the system-wide preserve happens, the hugetlb subsystem gets a
>   callback to serialize. It converts its runtime global state to
>   serialized state since now it knows no more FDs will be added.
>
>   With the new API, this doesn't need to be done since each FD prepare
>   already updates serialized state.
>
> - If there are no hugetlb FDs, then the hugetlb subsystem doesn't put
>   anything in LUO. This is same as new API.
>
> - If some hugetlb FDs are not restored after liveupdate and the finish
>   event is triggered, the subsystem gets its finish() handler called and
>   it can free things up.
>
>   I don't get how that would work with the new API.

The new API isn't more complicated; it codifies the common pattern of
"create on first use, destroy on last use" into a reusable helper,
saving each file handler from having to reinvent the same reference
counting and locking scheme. But, as you point out, subsystems provide
more control, specifically they handle full creation/free instead of
relying on file-handlers for that.

> My point is, I see subsystems working perfectly fine here and I don't
> get how the proposed API is any better.
>
> Am I missing something?

No, I don't think you are. Your analysis is correct that this is
achievable with subsystems. The goal of the new API is to make that
specific, common use case simpler.

Pasha


* Re: [PATCH v3 19/30] liveupdate: luo_sysfs: add sysfs state monitoring
From: Pratyush Yadav @ 2025-10-09 23:12 UTC (permalink / raw)
  To: Yanjun.Zhu
  Cc: Pasha Tatashin, Pratyush Yadav, jasonmiu, graf, changyuanl, rppt,
	dmatlack, rientjes, corbet, rdunlap, ilpo.jarvinen, kanie, ojeda,
	aliceryhl, masahiroy, akpm, tj, yoann.congal, mmaurer,
	roman.gushchin, chenridong, axboe, mark.rutland, jannh,
	vincent.guittot, hannes, dan.j.williams, david, joel.granados,
	rostedt, anna.schumaker, song, zhangguopeng, linux, linux-kernel,
	linux-doc, linux-mm, gregkh, tglx, mingo, bp, dave.hansen, x86,
	hpa, rafael, dakr, bartosz.golaszewski, cw00.choi, myungjoo.ham,
	yesanishhere, Jonathan.Cameron, quic_zijuhu, aleksander.lobakin,
	ira.weiny, andriy.shevchenko, leon, lukas, bhelgaas, wagi,
	djeffery, stuart.w.hayes, lennart, brauner, linux-api,
	linux-fsdevel, saeedm, ajayachandra, jgg, parav, leonro, witu
In-Reply-To: <d09881f5-0e0b-4795-99bf-cd3711ee48ab@linux.dev>

On Thu, Oct 09 2025, Yanjun.Zhu wrote:

> On 10/9/25 10:04 AM, Pasha Tatashin wrote:
>> On Thu, Oct 9, 2025 at 11:35 AM Zhu Yanjun <yanjun.zhu@linux.dev> wrote:
>>>
>>> On 2025/10/9 5:01, Pasha Tatashin wrote:
>>>>>> Because the window of a kernel live update is short, it is difficult to count
>>>>>> how many times the kernel has been live updated.
>>>>>>
>>>>>> Is it possible to add a variable that counts the number of times the kernel has
>>>>>> been live updated?
>>>>> The kernel doesn't do the live update on its own. The process is driven
>>>>> and sequenced by userspace. So if you want to keep statistics, you
>>>>> should do it from your userspace (luod maybe?). I don't see any need for
>>>>> this in the kernel.
>>>>>
>>>> One use case I can think of is including information in kdump or the
>>>> backtrace warning/panic messages about how many times this machine has
>>>> been live-updated. In the past, I've seen bugs (related to memory
>>>> corruption) that occurred only after several kexecs, not on the first
>>>> one. With live updates, especially while the code is being stabilized,
>>>> I imagine we might have a similar situation. For that reason, it could
>>>> be useful to have a count in the dmesg logs showing how many times
>>>> this machine has been live-updated. While this information is also
>>>> available in userspace, it would be simpler for kernel developers
>>>> triaging these issues if everything were in one place.

Hmm, good point.

>>> I’m considering this issue from a system security perspective. After the
>>> kernel is automatically updated, user-space applications are usually
>>> unaware of the change. In one possible scenario, an attacker could
>>> replace the kernel with a compromised version, while user-space
>>> applications remain unaware of it — which poses a potential security risk.

Wouldn't signing be the way to avoid that? Because if the kernel is
compromised then it can very well fake the reboot count as well.

>>>
>>> To mitigate this, it would be useful to expose the number of kernel
>>> updates through a sysfs interface, so that we can detect whether the
>>> kernel has been updated and then collect information about the new
>>> kernel to check for possible security issues.
>>>
>>> Of course, there are other ways to detect kernel updates — for example,
>>> by using ftrace to monitor functions involved in live kernel updates —
>>> but such approaches tend to have a higher performance overhead. In
>>> contrast, adding a simple update counter to track live kernel updates
>>> would provide similar monitoring capability with minimal overhead.
>> Would a print during boot work for you, i.e. including the number when we
>> print that this kernel is live updating? Otherwise, we could export this
>> number in debugfs.
> Since I received a notification that my previous message was not sent
> successfully, I am resending it.
>
> IMO, it would be better to export this number via debugfs. This approach reduces
> the overhead involved in detecting a kernel live update.
> If the number is printed in logs instead, the overhead would be higher compared
> to using debugfs.

Yeah, debugfs sounds fine. No ABI at least.

-- 
Regards,
Pratyush Yadav


* Re: [PATCH v4 26/30] selftests/liveupdate: Add multi-kexec session lifecycle test
From: Vipin Sharma @ 2025-10-09 22:57 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: pratyush, jasonmiu, graf, changyuanl, rppt, dmatlack, rientjes,
	corbet, rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl,
	masahiroy, akpm, tj, yoann.congal, mmaurer, roman.gushchin,
	chenridong, axboe, mark.rutland, jannh, vincent.guittot, hannes,
	dan.j.williams, david, joel.granados, rostedt, anna.schumaker,
	song, zhangguopeng, linux, linux-kernel, linux-doc, linux-mm,
	gregkh, tglx, mingo, bp, dave.hansen, x86, hpa, rafael, dakr,
	bartosz.golaszewski, cw00.choi, myungjoo.ham, yesanishhere,
	Jonathan.Cameron, quic_zijuhu, aleksander.lobakin, ira.weiny,
	andriy.shevchenko, leon, lukas, bhelgaas, wagi, djeffery,
	stuart.w.hayes, ptyadav, lennart, brauner, linux-api,
	linux-fsdevel, saeedm, ajayachandra, jgg, parav, leonro, witu,
	hughd, skhawaja, chrisl, steven.sistare
In-Reply-To: <CA+CK2bBSObHG=9Rj623mahyhE81DhhKbN09aHS96p==8y_mCGw@mail.gmail.com>

On 2025-10-03 22:37:10, Pasha Tatashin wrote:
> > > > --- a/tools/testing/selftests/liveupdate/Makefile
> > > > +++ b/tools/testing/selftests/liveupdate/Makefile
> > > > @@ -1,7 +1,38 @@
> > > >  # SPDX-License-Identifier: GPL-2.0-only
> > > > +
> > > > +KHDR_INCLUDES ?= -I../../../usr/include
> > >
> > > If make is run from the tools/testing/selftests/liveupdate directory, this
> > > will not work because it needs one more "..".

This causes a build issue, see my response at the bottom.

> > >
> > > If this is built using selftest Makefile from root directory
> > >
> > >   make -C tools/testing/selftests TARGETS=liveupdate
> > >
> > > there will not be build errors because tools/testing/selftests/Makefile
> > > defines KHDR_INCLUDES, so above definition will never happen.
> > >

If one just builds the tests using the above make command (without
install), we don't see the other liveupdate test binaries.

> > > > +# --- Test Configuration (Edit this section when adding new tests) ---
> > > > +LUO_SHARED_SRCS := luo_test_utils.c
> > > > +LUO_SHARED_HDRS += luo_test_utils.h
> > > > +
> > > > +LUO_MANUAL_TESTS += luo_multi_kexec
> > > > +
> > > > +TEST_FILES += do_kexec.sh
> > > >
> > > >  TEST_GEN_PROGS += liveupdate
> > > >
> > > > +# --- Automatic Rule Generation (Do not edit below) ---
> > > > +
> > > > +TEST_GEN_PROGS_EXTENDED += $(LUO_MANUAL_TESTS)
> > > > +
> > > > +# Define the full list of sources for each manual test.
> > > > +$(foreach test,$(LUO_MANUAL_TESTS), \
> > > > +     $(eval $(test)_SOURCES := $(test).c $(LUO_SHARED_SRCS)))
> > > > +
> > > > +# This loop automatically generates an explicit build rule for each manual test.
> > > > +# It includes dependencies on the shared headers and makes the output
> > > > +# executable.
> > > > +# Note the use of '$$' to escape automatic variables for the 'eval' command.
> > > > +$(foreach test,$(LUO_MANUAL_TESTS), \
> > > > +     $(eval $(OUTPUT)/$(test): $($(test)_SOURCES) $(LUO_SHARED_HDRS) \
> > > > +             $(call msg,LINK,,$$@) ; \
> > > > +             $(Q)$(LINK.c) $$^ $(LDLIBS) -o $$@ ; \
> > > > +             $(Q)chmod +x $$@ \
> > > > +     ) \
> > > > +)
> > > > +
> > > >  include ../lib.mk
> > >
> > > make is not building LUO_MANUAL_TESTS, it is only building liveupdate.
> > > How to build them?
> >
> > I am building them out of tree:
> > make O=x86_64 -s -C tools/testing/selftests TARGETS=liveupdate install
> > make O=x86_64 -s -C tools/testing/selftests TARGETS=kho install
> 
> Actually, I just tested in-tree and everything works for me, could you
> please verify:
> 
> make mrproper  # Clean the tree
> cat tools/testing/selftests/liveupdate/config > .config # Copy LUO depends.
> make olddefconfig  # make a def config with LUO
> make kvm_guest.config # Build minimal KVM guest with LUO
> make headers # Make uAPI headers
> make -C tools/testing/selftests TARGETS=liveupdate install # make and
> install liveupdate selftests

Yes, this one builds the tests.

However, if instead of using the above make command, we do

  cd tools/testing/selftests/liveupdate
  make

This will error out

    LINK     liveupdate
  liveupdate.c:19:10: fatal error: linux/liveupdate.h: No such file or directory
     19 | #include <linux/liveupdate.h>
        |          ^~~~~~~~~~~~~~~~~~~~
  compilation terminated.
  In file included from luo_test_utils.c:21:
  luo_test_utils.h:13:10: fatal error: linux/liveupdate.h: No such file or directory
     13 | #include <linux/liveupdate.h>
        |          ^~~~~~~~~~~~~~~~~~~~
  compilation terminated.
  In file included from <command-line>:
  /usr/include/stdc-predef.h:1: fatal error: cannot create precompiled header /liveupdate: Permission denied
      1 | /* Copyright (C) 1991-2025 Free Software Foundation, Inc.
  compilation terminated.
  make: *** [Makefile:30: /liveupdate] Error 1

The reason for this build error is KHDR_INCLUDES in selftests/liveupdate/Makefile.

The following fix resolves the two "No such file or directory" errors above.

diff --git a/tools/testing/selftests/liveupdate/Makefile b/tools/testing/selftests/liveupdate/Makefile
index 25a6dec790bb..6507682addac 100644
--- a/tools/testing/selftests/liveupdate/Makefile
+++ b/tools/testing/selftests/liveupdate/Makefile
@@ -1,6 +1,6 @@
 # SPDX-License-Identifier: GPL-2.0-only

-KHDR_INCLUDES ?= -I../../../usr/include
+KHDR_INCLUDES ?= -I../../../../usr/include
 CFLAGS += -Wall -O2 -Wno-unused-function
 CFLAGS += $(KHDR_INCLUDES)
 LDFLAGS += -static

The git diff in my first response fixes the build issue and generates the tests.
https://lore.kernel.org/linux-mm/20251003225120.GA2035091.vipinsh@google.com/

I am used to kvm and vfio selftests. They both build all their binaries
by running 'make' from their directories. That's why I found it odd that
liveupdate is behaving differently.
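
The off-by-one in the relative path can be reproduced without a kernel tree.
The sketch below builds an empty scratch directory with the same layout (a
stand-in, not a real checkout) and resolves both variants of the path from
the selftest directory. The helper name is made up for illustration.

```shell
#!/bin/sh
# Scratch tree that mimics only the relevant directory layout.
tree=$(cd "$(mktemp -d)" && pwd -P)
mkdir -p "$tree/usr/include" "$tree/tools/testing/selftests/liveupdate"

# resolve_from_selftest_dir REL: print the absolute directory that REL names
# when the working directory is tools/testing/selftests/liveupdate, or
# "missing" if no such directory exists.
resolve_from_selftest_dir() {
    (cd "$tree/tools/testing/selftests/liveupdate" 2>/dev/null &&
     cd "$1" 2>/dev/null && pwd -P) || echo missing
}

three=$(resolve_from_selftest_dir ../../../usr/include)
four=$(resolve_from_selftest_dir ../../../../usr/include)

echo "three levels up: $three"   # points into tools/, which has no usr/include
echo "four levels up:  $four"    # the usr/include at the top of the tree
```

Three levels up from the selftest directory is tools/, so the headers
installed under usr/include at the top of the tree are only reached with
the fourth "..".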



* Re: [PATCH v4 00/30] Live Update Orchestrator
From: Pratyush Yadav @ 2025-10-09 22:57 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: pratyush, jasonmiu, graf, changyuanl, rppt, dmatlack, rientjes,
	corbet, rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl,
	masahiroy, akpm, tj, yoann.congal, mmaurer, roman.gushchin,
	chenridong, axboe, mark.rutland, jannh, vincent.guittot, hannes,
	dan.j.williams, david, joel.granados, rostedt, anna.schumaker,
	song, zhangguopeng, linux, linux-kernel, linux-doc, linux-mm,
	gregkh, tglx, mingo, bp, dave.hansen, x86, hpa, rafael, dakr,
	bartosz.golaszewski, cw00.choi, myungjoo.ham, yesanishhere,
	Jonathan.Cameron, quic_zijuhu, aleksander.lobakin, ira.weiny,
	andriy.shevchenko, leon, lukas, bhelgaas, wagi, djeffery,
	stuart.w.hayes, lennart, brauner, linux-api, linux-fsdevel,
	saeedm, ajayachandra, jgg, parav, leonro, witu, hughd, skhawaja,
	chrisl, steven.sistare
In-Reply-To: <CA+CK2bB+RdapsozPHe84MP4NVSPLo6vje5hji5MKSg8L6ViAbw@mail.gmail.com>

On Tue, Oct 07 2025, Pasha Tatashin wrote:

> On Sun, Sep 28, 2025 at 9:03 PM Pasha Tatashin
> <pasha.tatashin@soleen.com> wrote:
>>
[...]
> 4. New File-Lifecycle-Bound Global State
> ----------------------------------------
> A new mechanism for managing global state was proposed, designed to be
> tied to the lifecycle of the preserved files themselves. This would
> allow a file owner (e.g., the IOMMU subsystem) to save and retrieve
> global state that is only relevant when one or more of its FDs are
> being managed by LUO.

Is this going to replace LUO subsystems? If yes, then why? The global
state will likely need to have its own lifecycle just like the FDs, and
subsystems are a simple and clean abstraction to control that. I get the
idea of only "activating" a subsystem when one or more of its FDs are
participating in LUO, but we can do that while keeping subsystems
around.

>
> The key characteristics of this new mechanism are:
> The global state is optionally created on the first preserve() call
> for a given file handler.
> The state can be updated on subsequent preserve() calls.
> The state is destroyed when the last corresponding file is unpreserved
> or finished.
> The data can be accessed during boot.
>
> I am thinking of an API like this.
>
> 1. Add three more callbacks to liveupdate_file_ops:
> /*
>  * Optional. Called by LUO during first get global state call.
>  * The handler should allocate/KHO preserve its global state object and return a
>  * pointer to it via 'obj'. It must also provide a u64 handle (e.g., a physical
>  * address of preserved memory) via 'data_handle' that LUO will save.
>  * Return: 0 on success.
>  */
> int (*global_state_create)(struct liveupdate_file_handler *h,
>                            void **obj, u64 *data_handle);
>
> /*
>  * Optional. Called by LUO in the new kernel
>  * before the first access to the global state. The handler receives
>  * the preserved u64 data_handle and should use it to reconstruct its
>  * global state object, returning a pointer to it via 'obj'.
>  * Return: 0 on success.
>  */
> int (*global_state_restore)(struct liveupdate_file_handler *h,
>                             u64 data_handle, void **obj);
>
> /*
>  * Optional. Called by LUO after the last
>  * file for this handler is unpreserved or finished. The handler
>  * must free its global state object and any associated resources.
>  */
> void (*global_state_destroy)(struct liveupdate_file_handler *h, void *obj);
>
> The get/put global state data:
>
> /* Get and lock the data with file_handler scoped lock */
> int liveupdate_fh_global_state_get(struct liveupdate_file_handler *h,
>                                    void **obj);
>
> /* Unlock the data */
> void liveupdate_fh_global_state_put(struct liveupdate_file_handler *h);

IMHO this looks clunky and overcomplicated. Each LUO FD type knows what
its subsystem is. It should talk to it directly. I don't get why we are
adding this intermediate step.

Here is how I imagine the proposed API would compare against subsystems
with hugetlb as an example (hugetlb support is still WIP, so I'm still
not clear on specifics, but this is how I imagine it will work):

- Hugetlb subsystem needs to track its huge page pools and which pages
  are allocated and free. This is its global state. The pools get
  reconstructed after kexec. Post-kexec, the free pages are ready for
  allocation from other "regular" files and the pages used in LUO files
  are reserved.

- Pre-kexec, when a hugetlb FD is preserved, it marks that as preserved
  in hugetlb's global data structure tracking this. This is runtime data
  (say xarray), and _not_ serialized data. Reason being, there are
  likely more FDs to come so no point in wasting time serializing just
  yet.

  This can look something like:

  hugetlb_luo_preserve_folio(folio, ...);

  Nice and simple.

  Compare this with the new proposed API:

  liveupdate_fh_global_state_get(h, &hugetlb_data);
  // This will have to update the serialized state now.
  hugetlb_luo_preserve_folio(hugetlb_data, folio, ...);
  liveupdate_fh_global_state_put(h);

  We do the same thing but in a very complicated way.

- When the system-wide preserve happens, the hugetlb subsystem gets a
  callback to serialize. It converts its runtime global state to
  serialized state since now it knows no more FDs will be added.

  With the new API, this doesn't need to be done since each FD prepare
  already updates serialized state.

- If there are no hugetlb FDs, then the hugetlb subsystem doesn't put
  anything in LUO. This is same as new API.

- If some hugetlb FDs are not restored after liveupdate and the finish
  event is triggered, the subsystem gets its finish() handler called and
  it can free things up.

  I don't get how that would work with the new API.

My point is, I see subsystems working perfectly fine here and I don't
get how the proposed API is any better.
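
To make the subsystem flow in the list above concrete, here is a rough
userspace model of it: folios are only recorded in runtime state when an
FD is preserved, and serialization happens once, from the subsystem's
prepare callback. Everything here is an assumption for illustration:
struct hugetlb_luo, the fixed-size array standing in for an xarray, the
explicit state argument, and the flat "count followed by PFNs" layout
are not taken from any real hugetlb or LUO code.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define MAX_FOLIOS 16	/* hypothetical cap; a kernel would use an xarray */

/* Runtime-only tracking of preserved folios. Nothing is serialized here. */
struct hugetlb_luo {
	uint64_t pfns[MAX_FOLIOS];
	size_t nr;
};

/* Called from an FD's preserve(): just record the folio, no serialization. */
int hugetlb_luo_preserve_folio(struct hugetlb_luo *st, uint64_t pfn)
{
	if (st->nr == MAX_FOLIOS)
		return -1;
	st->pfns[st->nr++] = pfn;
	return 0;
}

/*
 * Called once from the subsystem's prepare callback, when no more FDs can
 * be added. Returns a heap buffer laid out as [count, pfn0, pfn1, ...], or
 * NULL (with *out_words == 0) if no hugetlb FD participates.
 */
uint64_t *hugetlb_luo_prepare(const struct hugetlb_luo *st, size_t *out_words)
{
	uint64_t *buf;

	assert(st && out_words);
	*out_words = 0;
	if (st->nr == 0)
		return NULL;

	buf = malloc((st->nr + 1) * sizeof(*buf));
	if (!buf)
		return NULL;

	buf[0] = st->nr;
	memcpy(&buf[1], st->pfns, st->nr * sizeof(*buf));
	*out_words = st->nr + 1;
	return buf;
}
```

The point of the split is that hugetlb_luo_preserve_folio() stays cheap
no matter how many FDs are added, and the empty case naturally puts
nothing into LUO.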

Am I missing something?

>
> Execution Flow:
> 1. Outgoing Kernel (First preserve() call):
> 2. Handler's preserve() is called. It needs the global state, so it calls
>    liveupdate_fh_global_state_get(&h, &obj). LUO acquires h->global_state_lock.
>    It sees h->global_state_obj is NULL.
>    LUO calls h->ops->global_state_create(h, &h->global_state_obj, &handle).
>    The handler allocates its state, preserves it with KHO, and returns its live
>    pointer and a u64 handle.
> 3. LUO stores the handle internally for later serialization.
> 4. LUO sets *obj = h->global_state_obj and returns 0 with the lock still held.
> 5. The preserve() callback does its work using the obj.
> 6. It calls liveupdate_fh_global_state_put(h), which releases the lock.
>
> Global PREPARE:
> 1. LUO iterates handlers. If h->count > 0, it writes the stored data_handle into
>    the LUO FDT.
>
> Incoming Kernel (First access):
> 1. When liveupdate_fh_global_state_get(&h, &obj) is called the first time, LUO
>    acquires h->global_state_lock.

The huge page pools are allocated early-ish in boot. On x86, the 1 GiB
pages are allocated from setup_arch(). Other sizes are allocated later
in boot from a subsys_initcall. This is way before the first FD gets
restored, and in 1 GiB case even before LUO gets initialized.

At that point, it would be great if the hugetlb preserved data can be
retrieved. If not, then there needs to at least be some indication that
LUO brings huge pages with it, so that the kernel can trust that it will
be able to successfully get the pages later in boot.

This flow is tricky to implement in the proposed model. With subsystems,
it might just end up working with some early boot tricks to fetch LUO
data.

> 2. It sees h->global_state_obj is NULL, but it knows it has a preserved u64
>    handle from the FDT. LUO calls h->ops->global_state_restore()
> 3. Reconstructs its state object, and returns the live pointer.
> 4. LUO sets *obj = h->global_state_obj and returns 0 with the lock held.
> 5. The caller does its work.
> 6. It calls liveupdate_fh_global_state_put(h) to release the lock.
>
> Last File Cleanup (in unpreserve or finish):
> 1. LUO decrements h->count to 0.
> 2. This triggers the cleanup logic.
> 3. LUO calls h->ops->global_state_destroy(h, h->global_state_obj).
> 4. The handler frees its memory and resources.
> 5. LUO sets h->global_state_obj = NULL, resetting it for a future live update
>    cycle.

-- 
Regards,
Pratyush Yadav


* Re: [PATCH v4 00/30] Live Update Orchestrator
From: Pasha Tatashin @ 2025-10-09 22:42 UTC (permalink / raw)
  To: Samiullah Khawaja
  Cc: pratyush, jasonmiu, graf, changyuanl, rppt, dmatlack, rientjes,
	corbet, rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl,
	masahiroy, akpm, tj, yoann.congal, mmaurer, roman.gushchin,
	chenridong, axboe, mark.rutland, jannh, vincent.guittot, hannes,
	dan.j.williams, david, joel.granados, rostedt, anna.schumaker,
	song, zhangguopeng, linux, linux-kernel, linux-doc, linux-mm,
	gregkh, tglx, mingo, bp, dave.hansen, x86, hpa, rafael, dakr,
	bartosz.golaszewski, cw00.choi, myungjoo.ham, yesanishhere,
	Jonathan.Cameron, quic_zijuhu, aleksander.lobakin, ira.weiny,
	andriy.shevchenko, leon, lukas, bhelgaas, wagi, djeffery,
	stuart.w.hayes, ptyadav, lennart, brauner, linux-api,
	linux-fsdevel, saeedm, ajayachandra, jgg, parav, leonro, witu,
	hughd, chrisl, steven.sistare
In-Reply-To: <CAAywjhT_9vV-V+BBs1_=QqhCGQqHo89qWy7r5zW1ej51yHPGJA@mail.gmail.com>

On Thu, Oct 9, 2025 at 5:58 PM Samiullah Khawaja <skhawaja@google.com> wrote:
>
> On Tue, Oct 7, 2025 at 10:11 AM Pasha Tatashin
> <pasha.tatashin@soleen.com> wrote:
> >
> > On Sun, Sep 28, 2025 at 9:03 PM Pasha Tatashin
> > <pasha.tatashin@soleen.com> wrote:
> > >
> > > This series introduces the Live Update Orchestrator (LUO), a kernel
> > > subsystem designed to facilitate live kernel updates. LUO enables
> > > kexec-based reboots with minimal downtime, a critical capability for
> > > cloud environments where hypervisors must be updated without disrupting
> > > running virtual machines. By preserving the state of selected resources,
> > > such as file descriptors and memory, LUO allows workloads to resume
> > > seamlessly in the new kernel.
> > >
> > > The git branch for this series can be found at:
> > > https://github.com/googleprodkernel/linux-liveupdate/tree/luo/v4
> > >
> > > The patch series applies against linux-next tag: next-20250926
> > >
> > > > While this series is showcased using memfd preservation, there is
> > > > ongoing work to preserve devices:
> > > 1. IOMMU: https://lore.kernel.org/all/20250928190624.3735830-16-skhawaja@google.com
> > > 2. PCI: https://lore.kernel.org/all/20250916-luo-pci-v2-0-c494053c3c08@kernel.org
> > >
> > > =======================================================================
> > > Changelog since v3:
> > > (https://lore.kernel.org/all/20250807014442.3829950-1-pasha.tatashin@soleen.com):
> > >
> > > - The main architectural change in this version is introduction of
> > >   "sessions" to manage the lifecycle of preserved file descriptors.
> > >   In v3, session management was left to a single userspace agent. This
> > >   approach has been revised to improve robustness. Now, each session is
> > >   represented by a file descriptor (/dev/liveupdate). The lifecycle of
> > >   all preserved resources within a session is tied to this FD, ensuring
> > >   automatic cleanup by the kernel if the controlling userspace agent
> > >   crashes or exits unexpectedly.
> > >
> > > - The first three KHO fixes from the previous series have been merged
> > >   into Linus' tree.
> > >
> > > - Various bug fixes and refactorings, including correcting memory
> > >   unpreservation logic during a kho_abort() sequence.
> > >
> > > - Addressing all comments from reviewers.
> > >
> > > - Removing sysfs interface (/sys/kernel/liveupdate/state), the state
> > > >   can now be queried only via the ioctl() API.
> > >
> > > =======================================================================
> >
> > Hi all,
> >
> > Following up on yesterday's Hypervisor Live Update meeting, we
> > discussed the requirements for the LUO to track dependencies,
> > particularly for IOMMU preservation and other stateful file
> > descriptors. This email summarizes the main design decisions and
> > outcomes from that discussion.
> >
> > For context, the notes from the previous meeting can be found here:
> > https://lore.kernel.org/all/365acb25-4b25-86a2-10b0-1df98703e287@google.com
> > The notes for yesterday's meeting are not yet available.
> >
> > The key outcomes are as follows:
> >
> > 1. User-Enforced Ordering
> > -------------------------
> > The responsibility for enforcing the correct order of operations will
> > lie with the userspace agent. If fd_A is a dependency for fd_B,
> > userspace must ensure that fd_A is preserved before fd_B. This same
> > ordering must be honored during the restoration phase after the reboot
> > (fd_A must be restored before fd_B). The kernel preserves the ordering.
> >
> > 2. Serialization in PRESERVE_FD
> > -------------------------------
> > To keep the global prepare() phase lightweight and predictable, the
> > consensus was to shift the heavy serialization work into the
> > PRESERVE_FD ioctl handler. This means that when userspace requests to
> > preserve a file, the file handler should perform the bulk of the
> > state-saving work immediately.
> >
> > The proposed sequence of operations reflects this shift:
> >
> > Shutdown Flow:
> > fd_preserve() (heavy serialization) -> prepare() (lightweight final
> > checks) -> Suspend VM -> reboot(KEXEC) -> freeze() (lightweight)
> >
> > Boot & Restore Flow:
> > fd_restore() (lightweight object creation) -> Resume VM -> Heavy
> > post-restore IOCTLs (e.g., hardware page table re-creation) ->
> > finish() (lightweight cleanup)
> >
> > This decision primarily serves as a guideline for file handler
> > implementations. For the LUO core, this implies minor API changes,
> > such as renaming can_preserve() to a more active preserve() and adding
> > a corresponding unpreserve() callback to be called during
> > UNPRESERVE_FD.
> >
> > 3. FD Data Query API
> > --------------------
> > We identified the need for a kernel API to allow subsystems to query
> > preserved FD data during the boot process, before userspace has
> > initiated the restore.
> >
> > The proposed API would allow a file handler to retrieve a list of all
> > its preserved FDs, including their session names, tokens, and the
> > private data payload.
> >
> > Proposed Data Structure:
> >
> > struct liveupdate_fd {
> >         char *session; /* session name */
> >         u64 token; /* Preserved FD token */
> >         u64 data; /* Private preserved data */
> > };
> >
> > Proposed Function:
> > liveupdate_fd_data_query(struct liveupdate_file_handler *h,
> >                          struct liveupdate_fd *fds, long *count);
>
> Now that you are adding the "File-Lifecycle-Bound Global State", I was
> wondering if this session data query mechanism is still necessary. It
> seems that any preserved state a file handler needs to restore during
> boot could be fetched using the Global data support instead. For
> example, I don't think session information will be needed to restore
> iommu domains during boot (iommu init), but even if some other file
> handler needs it then it can keep this info in global data. I
> discussed this briefly with Pasha today, but wanted to raise it here
> as well.

I agree, the query API is ugly and indeed not needed with the FLB
Global State. The biggest problem with the query API is that the
caller must somehow know how to interpret the preserved file-handler
data before the struct file is reconstructed. This is problematic;
there should only be one place that knows how to store and interpret
the data, not multiple.

It looks like the combination of an enforced ordering:
Preservation: A->B->C->D
Un-preservation: D->C->B->A
Retrieval: A->B->C->D

and the FLB Global State (where data is automatically created and
destroyed when a particular file type participates in a live update)
solves the need for this query mechanism. For example, the IOMMU
driver/core can add its data only when an iommufd is preserved and add
more data as more iommufds are added. The preserved data is also
automatically removed once the live update is finished or canceled.
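
As a rough illustration of that lifecycle (state created lazily on the
first get, destroyed when the last file goes away), here is a small
userspace model. It is only a sketch under assumptions: struct fh, the
fh_* helpers, the demo_* handler, and the integer flag standing in for
h->global_state_lock are all made up for this example and are not the
proposed kernel API. The handle is just the pointer value, standing in
for the physical address of KHO-preserved memory.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

struct fh;

/* Hypothetical subset of the proposed callbacks. */
struct fh_ops {
	int (*global_state_create)(struct fh *h, void **obj, uint64_t *handle);
	void (*global_state_destroy)(struct fh *h, void *obj);
};

struct fh {
	const struct fh_ops *ops;
	void *global_state_obj;	/* NULL until the first get */
	uint64_t data_handle;	/* what would be serialized at PREPARE */
	long count;		/* number of preserved files */
	int locked;		/* stand-in for h->global_state_lock */
};

/* Get the state, creating it on first use. Returns with the lock held. */
int fh_global_state_get(struct fh *h, void **obj)
{
	assert(!h->locked);
	h->locked = 1;
	if (!h->global_state_obj) {
		int err = h->ops->global_state_create(h, &h->global_state_obj,
						      &h->data_handle);
		if (err) {
			h->locked = 0;
			return err;
		}
	}
	*obj = h->global_state_obj;
	return 0;
}

void fh_global_state_put(struct fh *h)
{
	assert(h->locked);
	h->locked = 0;
}

/* Unpreserve or finish of one file; the last one destroys the state. */
void fh_file_released(struct fh *h)
{
	assert(h->count > 0);
	if (--h->count == 0 && h->global_state_obj) {
		h->ops->global_state_destroy(h, h->global_state_obj);
		h->global_state_obj = NULL;
		h->data_handle = 0;
	}
}

/* Demo handler: the global state is a count of preserved files. */
static int demo_create(struct fh *h, void **obj, uint64_t *handle)
{
	long *state = calloc(1, sizeof(*state));

	(void)h;
	if (!state)
		return -1;
	*obj = state;
	*handle = (uint64_t)(uintptr_t)state;
	return 0;
}

static void demo_destroy(struct fh *h, void *obj)
{
	(void)h;
	free(obj);
}

static const struct fh_ops demo_ops = {
	.global_state_create = demo_create,
	.global_state_destroy = demo_destroy,
};

/* preserve() for one file: update the shared state under the lock. */
int demo_preserve(struct fh *h)
{
	void *obj;
	int err = fh_global_state_get(h, &obj);

	if (err)
		return err;
	++*(long *)obj;
	fh_global_state_put(h);
	h->count++;
	return 0;
}
```

Calling demo_preserve() twice and fh_file_released() twice on a
zero-initialized handler that points at demo_ops creates the state once
and frees it on the second release, leaving the handler ready for
another live update cycle.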

Pasha


* Re: [PATCH v4 00/30] Live Update Orchestrator
From: Samiullah Khawaja @ 2025-10-09 21:58 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: pratyush, jasonmiu, graf, changyuanl, rppt, dmatlack, rientjes,
	corbet, rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl,
	masahiroy, akpm, tj, yoann.congal, mmaurer, roman.gushchin,
	chenridong, axboe, mark.rutland, jannh, vincent.guittot, hannes,
	dan.j.williams, david, joel.granados, rostedt, anna.schumaker,
	song, zhangguopeng, linux, linux-kernel, linux-doc, linux-mm,
	gregkh, tglx, mingo, bp, dave.hansen, x86, hpa, rafael, dakr,
	bartosz.golaszewski, cw00.choi, myungjoo.ham, yesanishhere,
	Jonathan.Cameron, quic_zijuhu, aleksander.lobakin, ira.weiny,
	andriy.shevchenko, leon, lukas, bhelgaas, wagi, djeffery,
	stuart.w.hayes, ptyadav, lennart, brauner, linux-api,
	linux-fsdevel, saeedm, ajayachandra, jgg, parav, leonro, witu,
	hughd, chrisl, steven.sistare
In-Reply-To: <CA+CK2bB+RdapsozPHe84MP4NVSPLo6vje5hji5MKSg8L6ViAbw@mail.gmail.com>

On Tue, Oct 7, 2025 at 10:11 AM Pasha Tatashin
<pasha.tatashin@soleen.com> wrote:
>
> On Sun, Sep 28, 2025 at 9:03 PM Pasha Tatashin
> <pasha.tatashin@soleen.com> wrote:
> >
> > This series introduces the Live Update Orchestrator (LUO), a kernel
> > subsystem designed to facilitate live kernel updates. LUO enables
> > kexec-based reboots with minimal downtime, a critical capability for
> > cloud environments where hypervisors must be updated without disrupting
> > running virtual machines. By preserving the state of selected resources,
> > such as file descriptors and memory, LUO allows workloads to resume
> > seamlessly in the new kernel.
> >
> > The git branch for this series can be found at:
> > https://github.com/googleprodkernel/linux-liveupdate/tree/luo/v4
> >
> > The patch series applies against linux-next tag: next-20250926
> >
> > > While this series is showcased using memfd preservation, there is
> > > ongoing work to preserve devices:
> > 1. IOMMU: https://lore.kernel.org/all/20250928190624.3735830-16-skhawaja@google.com
> > 2. PCI: https://lore.kernel.org/all/20250916-luo-pci-v2-0-c494053c3c08@kernel.org
> >
> > =======================================================================
> > Changelog since v3:
> > (https://lore.kernel.org/all/20250807014442.3829950-1-pasha.tatashin@soleen.com):
> >
> > - The main architectural change in this version is introduction of
> >   "sessions" to manage the lifecycle of preserved file descriptors.
> >   In v3, session management was left to a single userspace agent. This
> >   approach has been revised to improve robustness. Now, each session is
> >   represented by a file descriptor (/dev/liveupdate). The lifecycle of
> >   all preserved resources within a session is tied to this FD, ensuring
> >   automatic cleanup by the kernel if the controlling userspace agent
> >   crashes or exits unexpectedly.
> >
> > - The first three KHO fixes from the previous series have been merged
> >   into Linus' tree.
> >
> > - Various bug fixes and refactorings, including correcting memory
> >   unpreservation logic during a kho_abort() sequence.
> >
> > - Addressing all comments from reviewers.
> >
> > - Removing sysfs interface (/sys/kernel/liveupdate/state), the state
> > >   can now be queried only via the ioctl() API.
> >
> > =======================================================================
>
> Hi all,
>
> Following up on yesterday's Hypervisor Live Update meeting, we
> discussed the requirements for the LUO to track dependencies,
> particularly for IOMMU preservation and other stateful file
> descriptors. This email summarizes the main design decisions and
> outcomes from that discussion.
>
> For context, the notes from the previous meeting can be found here:
> https://lore.kernel.org/all/365acb25-4b25-86a2-10b0-1df98703e287@google.com
> The notes for yesterday's meeting are not yet available.
>
> The key outcomes are as follows:
>
> 1. User-Enforced Ordering
> -------------------------
> The responsibility for enforcing the correct order of operations will
> lie with the userspace agent. If fd_A is a dependency for fd_B,
> userspace must ensure that fd_A is preserved before fd_B. This same
> ordering must be honored during the restoration phase after the reboot
> (fd_A must be restored before fd_B). The kernel preserves the ordering.
>
> 2. Serialization in PRESERVE_FD
> -------------------------------
> To keep the global prepare() phase lightweight and predictable, the
> consensus was to shift the heavy serialization work into the
> PRESERVE_FD ioctl handler. This means that when userspace requests to
> preserve a file, the file handler should perform the bulk of the
> state-saving work immediately.
>
> The proposed sequence of operations reflects this shift:
>
> Shutdown Flow:
> fd_preserve() (heavy serialization) -> prepare() (lightweight final
> checks) -> Suspend VM -> reboot(KEXEC) -> freeze() (lightweight)
>
> Boot & Restore Flow:
> fd_restore() (lightweight object creation) -> Resume VM -> Heavy
> post-restore IOCTLs (e.g., hardware page table re-creation) ->
> finish() (lightweight cleanup)
>
> This decision primarily serves as a guideline for file handler
> implementations. For the LUO core, this implies minor API changes,
> such as renaming can_preserve() to a more active preserve() and adding
> a corresponding unpreserve() callback to be called during
> UNPRESERVE_FD.
>
> 3. FD Data Query API
> --------------------
> We identified the need for a kernel API to allow subsystems to query
> preserved FD data during the boot process, before userspace has
> initiated the restore.
>
> The proposed API would allow a file handler to retrieve a list of all
> its preserved FDs, including their session names, tokens, and the
> private data payload.
>
> Proposed Data Structure:
>
> struct liveupdate_fd {
>         char *session; /* session name */
>         u64 token; /* Preserved FD token */
>         u64 data; /* Private preserved data */
> };
>
> Proposed Function:
> liveupdate_fd_data_query(struct liveupdate_file_handler *h,
>                          struct liveupdate_fd *fds, long *count);

Now that you are adding the "File-Lifecycle-Bound Global State", I was
wondering if this session data query mechanism is still necessary. It
seems that any preserved state a file handler needs to restore during
boot could be fetched using the Global data support instead. For
example, I don't think session information will be needed to restore
iommu domains during boot (iommu init), but even if some other file
handler needs it then it can keep this info in global data. I
discussed this briefly with Pasha today, but wanted to raise it here
as well.
>
> 4. New File-Lifecycle-Bound Global State
> ----------------------------------------
> A new mechanism for managing global state was proposed, designed to be
> tied to the lifecycle of the preserved files themselves. This would
> allow a file owner (e.g., the IOMMU subsystem) to save and retrieve
> global state that is only relevant when one or more of its FDs are
> being managed by LUO.
>
> The key characteristics of this new mechanism are:
> The global state is optionally created on the first preserve() call
> for a given file handler.
> The state can be updated on subsequent preserve() calls.
> The state is destroyed when the last corresponding file is unpreserved
> or finished.
> The data can be accessed during boot.
>
> I am thinking of an API like this.
>
> 1. Add three more callbacks to liveupdate_file_ops:
> /*
>  * Optional. Called by LUO during first get global state call.
>  * The handler should allocate/KHO preserve its global state object and return a
>  * pointer to it via 'obj'. It must also provide a u64 handle (e.g., a physical
>  * address of preserved memory) via 'data_handle' that LUO will save.
>  * Return: 0 on success.
>  */
> int (*global_state_create)(struct liveupdate_file_handler *h,
>                            void **obj, u64 *data_handle);
>
> /*
>  * Optional. Called by LUO in the new kernel
>  * before the first access to the global state. The handler receives
>  * the preserved u64 data_handle and should use it to reconstruct its
>  * global state object, returning a pointer to it via 'obj'.
>  * Return: 0 on success.
>  */
> int (*global_state_restore)(struct liveupdate_file_handler *h,
>                             u64 data_handle, void **obj);
>
> /*
>  * Optional. Called by LUO after the last
>  * file for this handler is unpreserved or finished. The handler
>  * must free its global state object and any associated resources.
>  */
> void (*global_state_destroy)(struct liveupdate_file_handler *h, void *obj);
>
> The get/put global state data:
>
> /* Get and lock the data with file_handler scoped lock */
> int liveupdate_fh_global_state_get(struct liveupdate_file_handler *h,
>                                    void **obj);
>
> /* Unlock the data */
> void liveupdate_fh_global_state_put(struct liveupdate_file_handler *h);
>
> Execution Flow:
> 1. Outgoing Kernel (First preserve() call):
> 2. Handler's preserve() is called. It needs the global state, so it calls
>    liveupdate_fh_global_state_get(&h, &obj). LUO acquires h->global_state_lock.
>    It sees h->global_state_obj is NULL.
>    LUO calls h->ops->global_state_create(h, &h->global_state_obj, &handle).
>    The handler allocates its state, preserves it with KHO, and returns its live
>    pointer and a u64 handle.
> 3. LUO stores the handle internally for later serialization.
> 4. LUO sets *obj = h->global_state_obj and returns 0 with the lock still held.
> 5. The preserve() callback does its work using the obj.
> 6. It calls liveupdate_fh_global_state_put(h), which releases the lock.
>
> Global PREPARE:
> 1. LUO iterates handlers. If h->count > 0, it writes the stored data_handle into
>    the LUO FDT.
>
> Incoming Kernel (First access):
> 1. When liveupdate_fh_global_state_get(&h, &obj) is called the first time, LUO
>    acquires h->global_state_lock.
> 2. It sees h->global_state_obj is NULL, but it knows it has a preserved u64
>    handle from the FDT. LUO calls h->ops->global_state_restore()
> 3. Reconstructs its state object, and returns the live pointer.
> 4. LUO sets *obj = h->global_state_obj and returns 0 with the lock held.
> 5. The caller does its work.
> 6. It calls liveupdate_fh_global_state_put(h) to release the lock.
>
> Last File Cleanup (in unpreserve or finish):
> 1. LUO decrements h->count to 0.
> 2. This triggers the cleanup logic.
> 3. LUO calls h->ops->global_state_destroy(h, h->global_state_obj).
> 4. The handler frees its memory and resources.
> 5. LUO sets h->global_state_obj = NULL, resetting it for a future live update
>    cycle.
>
> Pasha
>
>
> Pasha


* Re: [PATCH v4 00/30] Live Update Orchestrator
From: Pasha Tatashin @ 2025-10-09 18:37 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Samiullah Khawaja, pratyush, jasonmiu, graf, changyuanl, rppt,
	dmatlack, rientjes, corbet, rdunlap, ilpo.jarvinen, kanie, ojeda,
	aliceryhl, masahiroy, akpm, tj, yoann.congal, mmaurer,
	roman.gushchin, chenridong, axboe, mark.rutland, jannh,
	vincent.guittot, hannes, dan.j.williams, david, joel.granados,
	rostedt, anna.schumaker, song, zhangguopeng, linux, linux-kernel,
	linux-doc, linux-mm, gregkh, tglx, mingo, bp, dave.hansen, x86,
	hpa, rafael, dakr, bartosz.golaszewski, cw00.choi, myungjoo.ham,
	yesanishhere, Jonathan.Cameron, quic_zijuhu, aleksander.lobakin,
	ira.weiny, andriy.shevchenko, leon, lukas, bhelgaas, wagi,
	djeffery, stuart.w.hayes, ptyadav, lennart, brauner, linux-api,
	linux-fsdevel, saeedm, ajayachandra, parav, leonro, witu, hughd,
	chrisl, steven.sistare
In-Reply-To: <20251009173914.GA3899236@nvidia.com>

On Thu, Oct 9, 2025 at 1:39 PM Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> On Thu, Oct 09, 2025 at 11:01:25AM -0400, Pasha Tatashin wrote:
> > In this case we can enforce strict
> > ordering during retrieval. If "struct file" can be retrieved by
> > anything within the kernel, then that could be any kernel process
> > during boot, meaning that charging is not going to be properly applied
> > when kernel allocations are performed.
>
> Ugh, yeah, OK that's irritating and might burn us, but we did decide
> on that strategy.
>
> > > I would argue it should always cause a preservation...
> > >
> > > But this is still backwards, what we need is something like
> > >
> > > liveupdate_preserve_file(session, file, &token);
> > > my_preserve_blob.file_token = token
> >
> > We cannot do that, the user should have already preserved that file
> > and provided us with a token to use, if that file was not preserved by
> > the user it is a bug. With this proposal, we would have to generate a
> > token, and it was argued that the kernel should not do that.
>
> The token is the label used as ABI across the kexec. Each entity doing
> a serialization can operate its labels however it needs.
>
> Here I am suggesting that when a kernel entity goes to record a struct
> file in a kernel ABI structure it can get a kernel generated token for
> it.

Sure, we can consider allowing the kernel to preserve dependent FDs
automatically in the future, but is there a compelling use case that
requires it right now?

For the initial implementation, I think we should stick to the
simpler, agreed-upon plan: preservation order is explicitly defined by
userspace. If a preserve() call fails due to an unmet dependency, the
error is returned to the user, who is then responsible for correcting
the order. This keeps the kernel logic straightforward and places the
preservation responsibility squarely in userspace, where it belongs.
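
A minimal sketch of what that looks like from the kernel side, as a
userspace model: preserve() only checks that the dependency's token is
already in the session and otherwise returns an error to the caller.
The names (struct session, session_preserve), the fixed-size token
array, and the specific errno values are assumptions for illustration,
not the LUO implementation.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define SESSION_MAX_FDS 8	/* hypothetical limit for the model */

/* Tokens of preserved FDs, kept in the order userspace preserved them. */
struct session {
	uint64_t tokens[SESSION_MAX_FDS];
	size_t nr;
};

static int session_has(const struct session *s, uint64_t token)
{
	for (size_t i = 0; i < s->nr; i++)
		if (s->tokens[i] == token)
			return 1;
	return 0;
}

/*
 * Preserve 'token'. 'dep' points at the token this FD depends on, or is
 * NULL if there is no dependency. The kernel never preserves the
 * dependency itself; an unmet dependency is reported back to userspace.
 */
int session_preserve(struct session *s, uint64_t token, const uint64_t *dep)
{
	assert(s);
	if (s->nr == SESSION_MAX_FDS)
		return -ENOSPC;
	if (session_has(s, token))
		return -EEXIST;
	if (dep && !session_has(s, *dep))
		return -ENOENT;	/* userspace must preserve *dep first */
	s->tokens[s->nr++] = token;
	return 0;
}
```

In this model, preserving fd_B before fd_A fails with -ENOENT and leaves
the session untouched; userspace retries after preserving fd_A, and the
recorded order is the one the kernel keeps for restore.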

Pasha


* Re: [PATCH v3 19/30] liveupdate: luo_sysfs: add sysfs state monitoring
From: Yanjun.Zhu @ 2025-10-09 17:56 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: Pratyush Yadav, jasonmiu, graf, changyuanl, rppt, dmatlack,
	rientjes, corbet, rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl,
	masahiroy, akpm, tj, yoann.congal, mmaurer, roman.gushchin,
	chenridong, axboe, mark.rutland, jannh, vincent.guittot, hannes,
	dan.j.williams, david, joel.granados, rostedt, anna.schumaker,
	song, zhangguopeng, linux, linux-kernel, linux-doc, linux-mm,
	gregkh, tglx, mingo, bp, dave.hansen, x86, hpa, rafael, dakr,
	bartosz.golaszewski, cw00.choi, myungjoo.ham, yesanishhere,
	Jonathan.Cameron, quic_zijuhu, aleksander.lobakin, ira.weiny,
	andriy.shevchenko, leon, lukas, bhelgaas, wagi, djeffery,
	stuart.w.hayes, lennart, brauner, linux-api, linux-fsdevel,
	saeedm, ajayachandra, jgg, parav, leonro, witu
In-Reply-To: <CA+CK2bBz3NvDmwUjCPiyTPH9yL6YpZ+vX=o2TkC2C7aViXO-pQ@mail.gmail.com>


On 10/9/25 10:04 AM, Pasha Tatashin wrote:
> On Thu, Oct 9, 2025 at 11:35 AM Zhu Yanjun <yanjun.zhu@linux.dev> wrote:
>>
>> On 2025/10/9 5:01, Pasha Tatashin wrote:
>>>>> Because the window of a kernel live update is short, it is difficult to count
>>>>> how many times the kernel has been live updated.
>>>>>
>>>>> Is it possible to add a variable that counts the number of times the kernel
>>>>> has been live updated?
>>>> The kernel doesn't do the live update on its own. The process is driven
>>>> and sequenced by userspace. So if you want to keep statistics, you
>>>> should do it from your userspace (luod maybe?). I don't see any need for
>>>> this in the kernel.
>>>>
>>> One use case I can think of is including information in kdump or the
>>> backtrace warning/panic messages about how many times this machine has
>>> been live-updated. In the past, I've seen bugs (related to memory
>>> corruption) that occurred only after several kexecs, not on the first
>>> one. With live updates, especially while the code is being stabilized,
>>> I imagine we might have a similar situation. For that reason, it could
>>> be useful to have a count in the dmesg logs showing how many times
>>> this machine has been live-updated. While this information is also
>>> available in userspace, it would be simpler for kernel developers
>>> triaging these issues if everything were in one place.
>> I’m considering this issue from a system security perspective. After the
>> kernel is automatically updated, user-space applications are usually
>> unaware of the change. In one possible scenario, an attacker could
>> replace the kernel with a compromised version, while user-space
>> applications remain unaware of it — which poses a potential security risk.
>>
>> To mitigate this, it would be useful to expose the number of kernel
>> updates through a sysfs interface, so that we can detect whether the
>> kernel has been updated and then collect information about the new
>> kernel to check for possible security issues.
>>
>> Of course, there are other ways to detect kernel updates — for example,
>> by using ftrace to monitor functions involved in live kernel updates —
>> but such approaches tend to have a higher performance overhead. In
>> contrast, adding a simple update counter to track live kernel updates
>> would provide similar monitoring capability with minimal overhead.
> Would a print during boot work for you? That is, when we print that
> this kernel is live updating, we could include the number. Otherwise,
> we could export this number in debugfs.
Since I received a notification that my previous message was not sent 
successfully, I am resending it.

IMO, it would be better to export this number via debugfs. This approach 
reduces the overhead involved in detecting a kernel live update.
If the number is printed in logs instead, the overhead would be higher 
compared to using debugfs.

Thanks a lot.

Yanjun.Zhu

>
> Pasha
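
For illustration, the userspace side of the debugfs option could look
roughly like the sketch below. It is a minimal sketch under stated
assumptions, not part of the series: the thread only proposes exporting
the number "in debugfs" and names no file, so LUO_COUNT_PATH, the decimal
text format, and both function names are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical path: the thread proposes a debugfs file but names none. */
#define LUO_COUNT_PATH "/sys/kernel/debug/liveupdate/update_count"

/*
 * Read a decimal counter from 'path' into '*count'.
 * Returns 0 on success or a negative errno value on failure.
 */
int read_update_count(const char *path, unsigned long *count)
{
	char buf[32];
	char *end;
	unsigned long val;
	FILE *f;

	assert(path && count);

	f = fopen(path, "r");
	if (!f)
		return -errno;
	if (!fgets(buf, sizeof(buf), f)) {
		fclose(f);
		return -EIO;
	}
	fclose(f);

	errno = 0;
	val = strtoul(buf, &end, 10);
	if (errno || end == buf)
		return -EINVAL;

	*count = val;
	return 0;
}

/* Return 1 if the counter differs from 'last_seen', 0 if not, <0 on error. */
int kernel_was_live_updated(const char *path, unsigned long last_seen)
{
	unsigned long now;
	int ret = read_update_count(path, &now);

	if (ret)
		return ret;
	return now != last_seen;
}
```

A monitoring agent would remember the last value it saw and call
kernel_was_live_updated(LUO_COUNT_PATH, last_seen) periodically. Reading
one small file avoids parsing the kernel log, which is the overhead
argument made in the message.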


* Re: [PATCH v4 00/30] Live Update Orchestrator
From: Jason Gunthorpe @ 2025-10-09 17:39 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: Samiullah Khawaja, pratyush, jasonmiu, graf, changyuanl, rppt,
	dmatlack, rientjes, corbet, rdunlap, ilpo.jarvinen, kanie, ojeda,
	aliceryhl, masahiroy, akpm, tj, yoann.congal, mmaurer,
	roman.gushchin, chenridong, axboe, mark.rutland, jannh,
	vincent.guittot, hannes, dan.j.williams, david, joel.granados,
	rostedt, anna.schumaker, song, zhangguopeng, linux, linux-kernel,
	linux-doc, linux-mm, gregkh, tglx, mingo, bp, dave.hansen, x86,
	hpa, rafael, dakr, bartosz.golaszewski, cw00.choi, myungjoo.ham,
	yesanishhere, Jonathan.Cameron, quic_zijuhu, aleksander.lobakin,
	ira.weiny, andriy.shevchenko, leon, lukas, bhelgaas, wagi,
	djeffery, stuart.w.hayes, ptyadav, lennart, brauner, linux-api,
	linux-fsdevel, saeedm, ajayachandra, parav, leonro, witu, hughd,
	chrisl, steven.sistare
In-Reply-To: <CA+CK2bC_m5GRxCa1szw1v24Ssq8EnCWp4e985RJ5RRCdhztQWg@mail.gmail.com>

On Thu, Oct 09, 2025 at 11:01:25AM -0400, Pasha Tatashin wrote:
> In this case we can enforce strict
> ordering during retrieval. If "struct file" can be retrieved by
> anything within the kernel, then that could be any kernel process
> during boot, meaning that charging is not going to be properly applied
> when kernel allocations are performed.

Ugh, yeah, OK that's irritating and might burn us, but we did decide
on that strategy.

> > I would argue it should always cause a preservation...
> >
> > But this is still backwards, what we need is something like
> >
> > liveupdate_preserve_file(session, file, &token);
> > my_preserve_blob.file_token = token
> 
> We cannot do that, the user should have already preserved that file
> and provided us with a token to use, if that file was not preserved by
> the user it is a bug. With this proposal, we would have to generate a
> token, and it was argued that the kernel should not do that.

The token is the label used as ABI across the kexec. Each entity doing
a serialization can operate its labels however it needs.

Here I am suggesting that when a kernel entity goes to record a struct
file in a kernel ABI structure it can get a kernel generated token for
it.

This is a different token name space than the user provided tokens
through the ioctl. A single struct file may have many entities
serializing it and possibly many tokens.

Jason
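
The two token name spaces described in this message can be modelled with
a small userspace toy. This is only a sketch of the idea, not the LUO
API: every type and function name here is hypothetical, a plain pointer
stands in for struct file, and a fixed-size array stands in for whatever
the real session would use.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Tokens are only unique within their own name space. */
enum token_ns { TOKEN_NS_USER, TOKEN_NS_KERNEL };

struct token_entry {
	enum token_ns ns;
	uint64_t token;
	const void *file;	/* stand-in for struct file * */
};

#define MAX_TOKENS 16

struct session {
	struct token_entry entries[MAX_TOKENS];
	size_t nr;
	uint64_t next_kernel_token;
};

/* Record (ns, token) -> file; reject a duplicate token in the same ns. */
static int session_add(struct session *s, enum token_ns ns, uint64_t token,
		       const void *file)
{
	size_t i;

	if (s->nr == MAX_TOKENS)
		return -1;
	for (i = 0; i < s->nr; i++)
		if (s->entries[i].ns == ns && s->entries[i].token == token)
			return -1;
	s->entries[s->nr++] = (struct token_entry){ ns, token, file };
	return 0;
}

/* User path: the caller chooses the token (the ioctl analogue). */
int preserve_file_user(struct session *s, const void *file, uint64_t token)
{
	assert(s && file);
	return session_add(s, TOKEN_NS_USER, token, file);
}

/* Kernel path: the token is generated here and handed back to the caller. */
int preserve_file_kernel(struct session *s, const void *file, uint64_t *token)
{
	int ret;

	assert(s && file && token);
	ret = session_add(s, TOKEN_NS_KERNEL, s->next_kernel_token, file);
	if (ret)
		return ret;
	*token = s->next_kernel_token++;
	return 0;
}

/* Look a file up by (name space, token); NULL if nothing matches. */
const void *lookup_file(const struct session *s, enum token_ns ns,
			uint64_t token)
{
	size_t i;

	assert(s);
	for (i = 0; i < s->nr; i++)
		if (s->entries[i].ns == ns && s->entries[i].token == token)
			return s->entries[i].file;
	return NULL;
}
```

In this model one file can be recorded once under a user token and any
number of times under kernel-generated tokens, and the same numeric
value in the two name spaces does not collide.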


* Re: [PATCH 2/2] fs: return EOPNOTSUPP from file_setattr/file_getattr syscalls
From: Darrick J. Wong @ 2025-10-09 17:20 UTC (permalink / raw)
  To: Andrey Albershteyn
  Cc: linux-api, linux-fsdevel, linux-kernel, linux-xfs, Jan Kara,
	Jiri Slaby, Christian Brauner, Arnd Bergmann, Andrey Albershteyn
In-Reply-To: <20251008-eopnosupp-fix-v1-2-5990de009c9f@kernel.org>

On Wed, Oct 08, 2025 at 02:44:18PM +0200, Andrey Albershteyn wrote:
> These syscalls call the vfs_fileattr_get/set functions, which return
> ENOIOCTLCMD if the filesystem doesn't support setting file attributes on
> an inode. For syscalls, EOPNOTSUPP would be a more appropriate error.
> 
> Signed-off-by: Andrey Albershteyn <aalbersh@kernel.org>
> ---
>  fs/file_attr.c | 4 ++++
>  1 file changed, 4 insertions(+)
> 
> diff --git a/fs/file_attr.c b/fs/file_attr.c
> index 460b2dd21a85..5e3e2aba97b5 100644
> --- a/fs/file_attr.c
> +++ b/fs/file_attr.c
> @@ -416,6 +416,8 @@ SYSCALL_DEFINE5(file_getattr, int, dfd, const char __user *, filename,
>  	}
>  
>  	error = vfs_fileattr_get(filepath.dentry, &fa);
> +	if (error == -ENOIOCTLCMD)

Hrm.  Back in 6.17, XFS would return ENOTTY if you called ->fileattr_get
on a special file:

int
xfs_fileattr_get(
	struct dentry		*dentry,
	struct file_kattr	*fa)
{
	struct xfs_inode	*ip = XFS_I(d_inode(dentry));

	if (d_is_special(dentry))
		return -ENOTTY;
	...
}

Given that there are other fileattr_[gs]et implementations out there
that might return ENOTTY (e.g. fuse servers and other externally
maintained filesystems), I think both syscall functions need to check
for that as well:

	if (error == -ENOIOCTLCMD || error == -ENOTTY)
		return -EOPNOTSUPP;

--D

> +		error = -EOPNOTSUPP;
>  	if (error)
>  		return error;
>  
> @@ -483,6 +485,8 @@ SYSCALL_DEFINE5(file_setattr, int, dfd, const char __user *, filename,
>  	if (!error) {
>  		error = vfs_fileattr_set(mnt_idmap(filepath.mnt),
>  					 filepath.dentry, &fa);
> +		if (error == -ENOIOCTLCMD)
> +			error = -EOPNOTSUPP;
>  		mnt_drop_write(filepath.mnt);
>  	}
>  
> 
> -- 
> 2.51.0
> 
> 
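
The review comment in this message amounts to one translation step at
the syscall boundary. A minimal standalone sketch of that step follows,
assuming a helper named fileattr_syscall_error() (hypothetical, not in
the patch). ENOIOCTLCMD is kernel-internal and not exported to userspace
headers, so it is defined locally with the value from
include/linux/errno.h.

```c
#include <assert.h>
#include <errno.h>

/* Kernel-internal error code; value mirrors include/linux/errno.h. */
#ifndef ENOIOCTLCMD
#define ENOIOCTLCMD 515
#endif

/*
 * Hypothetical helper: map the result of a ->fileattr_get/->fileattr_set
 * call to what the file_getattr/file_setattr syscalls would return.
 * Both "not supported" spellings collapse to -EOPNOTSUPP; every other
 * value, including 0 for success, passes through unchanged.
 */
int fileattr_syscall_error(int error)
{
	/* Kernel convention: 0 on success, negative errno on failure. */
	assert(error <= 0);

	if (error == -ENOIOCTLCMD || error == -ENOTTY)
		return -EOPNOTSUPP;
	return error;
}
```

Keeping the mapping in one helper would let both syscalls share it,
rather than repeating the two-way comparison at each call site.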


