[RFC v2 00/16] Live Update Orchestrator

* [RFC v2 00/16] Live Update Orchestrator
From: Pasha Tatashin @ 2025-05-15 18:23 UTC
  To: pratyush, jasonmiu, graf, changyuanl, pasha.tatashin, rppt,
	dmatlack, rientjes, corbet, rdunlap, ilpo.jarvinen, kanie, ojeda,
	aliceryhl, masahiroy, akpm, tj, yoann.congal, mmaurer,
	roman.gushchin, chenridong, axboe, mark.rutland, jannh,
	vincent.guittot, hannes, dan.j.williams, david, joel.granados,
	rostedt, anna.schumaker, song, zhangguopeng, linux, linux-kernel,
	linux-doc, linux-mm, gregkh, tglx, mingo, bp, dave.hansen, x86,
	hpa, rafael, dakr, bartosz.golaszewski, cw00.choi, myungjoo.ham,
	yesanishhere, Jonathan.Cameron, quic_zijuhu, aleksander.lobakin,
	ira.weiny, andriy.shevchenko, leon, lukas, bhelgaas, wagi,
	djeffery, stuart.w.hayes, ptyadav

This v2 series introduces the Live Update Orchestrator (LUO), a kernel
subsystem designed to facilitate live kernel updates with minimal
downtime, particularly in cloud deployments that aim to update the
kernel without fully disrupting running virtual machines.

This series builds upon the Kexec HandOver (KHO) framework [1] by adding
programmatic control over KHO's lifecycle and by using KHO to persist
LUO's own metadata across the kexec boundary. The git branch for this
series can be found at:
https://github.com/googleprodkernel/linux-liveupdate/tree/luo/rfc-v2

Changelog from v1:
- Control Interface: Shifted from sysfs-based control
  (/sys/kernel/liveupdate/{prepare,finish}) to an ioctl interface
  (/dev/liveupdate). Sysfs is now primarily for monitoring the state.
- Event/State Renaming: LIVEUPDATE_REBOOT event/phase is now
  LIVEUPDATE_FREEZE.
- FD Preservation: A new component for preserving file descriptors.
- Subsystem Registration: A formal mechanism for kernel subsystems
  to participate.
- Device Layer: Removed device list handling from this series; it
  will be added separately.
- Selftests: Kernel-side selftest hooks and userspace selftests are
  now included.
KHO Enhancements:
- KHO debugfs became optional, and kernel APIs for finalize/abort
  were added (driven by LUO's needs).
- KHO unpreserve functions were also added.

What is Live Update?
Live Update is a specialized reboot process where selected kernel
resources (memory, file descriptors, and eventually devices) are kept
operational or their state preserved across a kernel transition (e.g.,
via kexec). For certain resources, DMA and interrupt activity might
continue with minimal interruption during the kernel reboot.

LUO v2 Overview:
LUO v2 provides a framework for coordinating live updates. It features:
State Machine: Manages the live update process through states:
NORMAL, PREPARED, FROZEN, UPDATED.

KHO Integration:

LUO programmatically drives KHO's finalization and abort sequences.
KHO's debugfs interface is now optional and is enabled via
CONFIG_KEXEC_HANDOVER_DEBUG.

LUO preserves its own metadata via KHO's kho_add_subtree() and
kho_preserve_phys() mechanisms.

Subsystem Participation: A callback API liveupdate_register_subsystem()
allows kernel subsystems (e.g., KVM, IOMMU, VFIO, PCI) to register
handlers for LUO events (PREPARE, FREEZE, FINISH, CANCEL) and persist a
u64 payload via the LUO FDT.
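
The callback flow described above can be modelled in plain userspace C.
This is only a sketch under stated assumptions: every name in it
(model_subsystem, model_register_subsystem, model_notify,
example_handler) is hypothetical and does not reflect the actual
liveupdate_register_subsystem() signature, and the stop-at-first-error
policy is an assumption as well.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model: none of these names come from the LUO headers. */
enum model_event { MODEL_PREPARE, MODEL_FREEZE, MODEL_FINISH, MODEL_CANCEL };

struct model_subsystem {
	const char *name;
	/* Called for every event; may update the u64 payload to persist. */
	int (*handler)(enum model_event ev, uint64_t *payload);
	uint64_t payload;
};

#define MODEL_MAX_SUBSYSTEMS 8

static struct model_subsystem *model_subsystems[MODEL_MAX_SUBSYSTEMS];
static size_t model_nr_subsystems;

/* Add a subsystem to the registry; reject bad arguments or a full table. */
int model_register_subsystem(struct model_subsystem *s)
{
	if (!s || !s->handler)
		return -EINVAL;
	if (model_nr_subsystems == MODEL_MAX_SUBSYSTEMS)
		return -ENOSPC;
	model_subsystems[model_nr_subsystems++] = s;
	return 0;
}

/* Deliver one event to each registered subsystem; stop at the first error. */
int model_notify(enum model_event ev)
{
	for (size_t i = 0; i < model_nr_subsystems; i++) {
		struct model_subsystem *s = model_subsystems[i];
		int err = s->handler(ev, &s->payload);

		if (err)
			return err;
	}
	return 0;
}

/* Example handler: record a made-up value on PREPARE, clear it on CANCEL. */
int example_handler(enum model_event ev, uint64_t *payload)
{
	if (ev == MODEL_PREPARE)
		*payload = 0x1000;
	else if (ev == MODEL_CANCEL)
		*payload = 0;
	return 0;
}
```

A caller would register a struct model_subsystem whose handler is
example_handler and then call model_notify(MODEL_PREPARE); the single
u64 stands in for the payload that the real framework stores in the LUO
FDT.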

File Descriptor Preservation: Infrastructure
(liveupdate_register_filesystem, luo_register_file, luo_retrieve_file)
that allows specific types of file descriptors (e.g., memfd, vfio) to
be preserved and restored.

Handlers for specific file types can be registered to manage their
preservation and restoration, storing a u64 payload in the LUO FDT.

A work-in-progress example of memfd preservation can be found at [2].

User-space Interface:

ioctl (/dev/liveupdate): The primary control interface for
triggering LUO state transitions (prepare, freeze, finish, cancel)
and managing the preservation/restoration of file descriptors.
Access requires CAP_SYS_ADMIN.

sysfs (/sys/kernel/liveupdate/state): A read-only interface for
monitoring the current LUO state. This allows userspace services to
track progress and coordinate actions.
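
A userspace service could poll the state file with something like the
sketch below. The path is the one named above; the helper name
luo_read_state() is hypothetical, and the assumption that the file
holds a single line of text is mine, not something this cover letter
specifies.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Path taken from the cover letter. */
#define LUO_STATE_PATH "/sys/kernel/liveupdate/state"

/*
 * Hypothetical helper: read the first line of @path into @buf and strip
 * the trailing newline. Returns 0 on success or a negative errno value.
 * Assumes the file contains one line of text.
 */
int luo_read_state(const char *path, char *buf, size_t len)
{
	FILE *f;
	int err = 0;

	if (!path || !buf || len == 0)
		return -EINVAL;

	f = fopen(path, "r");
	if (!f)
		return -errno;

	if (!fgets(buf, (int)len, f))
		err = -EIO;	/* read error or empty file */
	else
		buf[strcspn(buf, "\n")] = '\0';

	fclose(f);
	return err;
}
```

Calling luo_read_state(LUO_STATE_PATH, buf, sizeof(buf)) on a kernel
without LUO simply returns -ENOENT, so a monitoring service can treat
that as "live update not available".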

Selftests: Includes kernel-side hooks and userspace selftests to
verify core LUO functionality, particularly subsystem registration and
basic state transitions.

LUO State Machine and Events:

NORMAL:   Default operational state.
PREPARED: Initial preparation complete after LIVEUPDATE_PREPARE
          event. Subsystems have saved initial state.
FROZEN:   Final "blackout window" state after LIVEUPDATE_FREEZE
          event, just before kexec. Workloads must be suspended.
UPDATED:  Next kernel has booted via live update. Awaiting restoration
          and LIVEUPDATE_FINISH.

Events:
LIVEUPDATE_PREPARE: Prepare for reboot, serialize state.
LIVEUPDATE_FREEZE:  Final opportunity to save state before kexec.
LIVEUPDATE_FINISH:  Post-reboot cleanup in the next kernel.
LIVEUPDATE_CANCEL:  Abort prepare or freeze, revert changes.
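
The states and events above can be summarised as a small transition
function. This is a userspace model, not the kernel code: the enum and
function names are hypothetical, and it assumes that CANCEL and FINISH
both return to NORMAL and that invalid requests fail with -EINVAL. The
FROZEN to UPDATED step is not an event here because it happens across
kexec, in the next kernel.

```c
#include <errno.h>

/* Hypothetical names that mirror the cover letter, not the uAPI header. */
enum luo_model_state { LUO_NORMAL, LUO_PREPARED, LUO_FROZEN, LUO_UPDATED };
enum luo_model_event { LUO_EV_PREPARE, LUO_EV_FREEZE, LUO_EV_FINISH,
		       LUO_EV_CANCEL };

/*
 * Apply @ev to *@state. Returns 0 and updates *@state on a valid
 * transition, or -EINVAL (an assumed error code) and leaves *@state
 * unchanged otherwise.
 */
int luo_model_apply(enum luo_model_state *state, enum luo_model_event ev)
{
	if (!state)
		return -EINVAL;

	switch (ev) {
	case LUO_EV_PREPARE:	/* NORMAL -> PREPARED */
		if (*state != LUO_NORMAL)
			return -EINVAL;
		*state = LUO_PREPARED;
		return 0;
	case LUO_EV_FREEZE:	/* PREPARED -> FROZEN */
		if (*state != LUO_PREPARED)
			return -EINVAL;
		*state = LUO_FROZEN;
		return 0;
	case LUO_EV_FINISH:	/* UPDATED -> NORMAL (assumed target state) */
		if (*state != LUO_UPDATED)
			return -EINVAL;
		*state = LUO_NORMAL;
		return 0;
	case LUO_EV_CANCEL:	/* PREPARED or FROZEN -> NORMAL (assumed) */
		if (*state != LUO_PREPARED && *state != LUO_FROZEN)
			return -EINVAL;
		*state = LUO_NORMAL;
		return 0;
	}
	return -EINVAL;
}
```

Starting from LUO_NORMAL, a FREEZE request is rejected until PREPARE
has succeeded, which matches the ordering of the events listed above.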

[1] https://lore.kernel.org/all/20250509074635.3187114-1-changyuanl@google.com
    https://github.com/googleprodkernel/linux-liveupdate/tree/luo/kho-v8
[2] https://github.com/googleprodkernel/linux-liveupdate/tree/luo/memfd-v0.1

RFC v1: https://lore.kernel.org/all/20250320024011.2995837-1-pasha.tatashin@soleen.com

Changyuan Lyu (1):
  kho: add kho_unpreserve_folio/phys

Pasha Tatashin (15):
  kho: make debugfs interface optional
  kho: allow to drive kho from within kernel
  luo: luo_core: Live Update Orchestrator
  luo: luo_core: integrate with KHO
  luo: luo_subsystems: add subsystem registration
  luo: luo_subsystems: implement subsystem callbacks
  luo: luo_files: add infrastructure for FDs
  luo: luo_files: implement file systems callbacks
  luo: luo_ioctl: add ioctl interface
  luo: luo_sysfs: add sysfs state monitoring
  reboot: call liveupdate_reboot() before kexec
  luo: add selftests for subsystems un/registration
  selftests/liveupdate: add subsystem/state tests
  docs: add luo documentation
  MAINTAINERS: add liveupdate entry

 .../ABI/testing/sysfs-kernel-liveupdate       |  51 ++
 Documentation/admin-guide/index.rst           |   1 +
 Documentation/admin-guide/liveupdate.rst      |  62 ++
 .../userspace-api/ioctl/ioctl-number.rst      |   1 +
 MAINTAINERS                                   |  14 +-
 drivers/misc/Kconfig                          |   1 +
 drivers/misc/Makefile                         |   1 +
 drivers/misc/liveupdate/Kconfig               |  60 ++
 drivers/misc/liveupdate/Makefile              |   7 +
 drivers/misc/liveupdate/luo_core.c            | 547 +++++++++++++++
 drivers/misc/liveupdate/luo_files.c           | 664 ++++++++++++++++++
 drivers/misc/liveupdate/luo_internal.h        |  59 ++
 drivers/misc/liveupdate/luo_ioctl.c           | 203 ++++++
 drivers/misc/liveupdate/luo_selftests.c       | 283 ++++++++
 drivers/misc/liveupdate/luo_selftests.h       |  23 +
 drivers/misc/liveupdate/luo_subsystems.c      | 413 +++++++++++
 drivers/misc/liveupdate/luo_sysfs.c           |  92 +++
 include/linux/kexec_handover.h                |  27 +
 include/linux/liveupdate.h                    | 214 ++++++
 include/uapi/linux/liveupdate.h               | 324 +++++++++
 kernel/Kconfig.kexec                          |  10 +
 kernel/Makefile                               |   1 +
 kernel/kexec_handover.c                       | 343 +++------
 kernel/kexec_handover_debug.c                 | 237 +++++++
 kernel/kexec_handover_internal.h              |  74 ++
 kernel/reboot.c                               |   4 +
 tools/testing/selftests/Makefile              |   1 +
 tools/testing/selftests/liveupdate/.gitignore |   1 +
 tools/testing/selftests/liveupdate/Makefile   |   7 +
 tools/testing/selftests/liveupdate/config     |   6 +
 .../testing/selftests/liveupdate/liveupdate.c | 440 ++++++++++++
 31 files changed, 3933 insertions(+), 238 deletions(-)
 create mode 100644 Documentation/ABI/testing/sysfs-kernel-liveupdate
 create mode 100644 Documentation/admin-guide/liveupdate.rst
 create mode 100644 drivers/misc/liveupdate/Kconfig
 create mode 100644 drivers/misc/liveupdate/Makefile
 create mode 100644 drivers/misc/liveupdate/luo_core.c
 create mode 100644 drivers/misc/liveupdate/luo_files.c
 create mode 100644 drivers/misc/liveupdate/luo_internal.h
 create mode 100644 drivers/misc/liveupdate/luo_ioctl.c
 create mode 100644 drivers/misc/liveupdate/luo_selftests.c
 create mode 100644 drivers/misc/liveupdate/luo_selftests.h
 create mode 100644 drivers/misc/liveupdate/luo_subsystems.c
 create mode 100644 drivers/misc/liveupdate/luo_sysfs.c
 create mode 100644 include/linux/liveupdate.h
 create mode 100644 include/uapi/linux/liveupdate.h
 create mode 100644 kernel/kexec_handover_debug.c
 create mode 100644 kernel/kexec_handover_internal.h
 create mode 100644 tools/testing/selftests/liveupdate/.gitignore
 create mode 100644 tools/testing/selftests/liveupdate/Makefile
 create mode 100644 tools/testing/selftests/liveupdate/config
 create mode 100644 tools/testing/selftests/liveupdate/liveupdate.c

-- 
2.49.0.1101.gccaa498523-goog

* [RFC v2 01/16] kho: make debugfs interface optional
From: Pasha Tatashin @ 2025-05-15 18:23 UTC

Currently, KHO is controlled via a debugfs interface, but once LUO is
introduced, it can control KHO, and the debugfs interface becomes
optional.

Add a separate config option, CONFIG_KEXEC_HANDOVER_DEBUG, that enables
the debugfs interface and allows inspecting the tree.

Move all debugfs-related code to a new file to keep the .c files
clear of ifdefs.

Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
---
 MAINTAINERS                      |   3 +-
 kernel/Kconfig.kexec             |  10 ++
 kernel/Makefile                  |   1 +
 kernel/kexec_handover.c          | 271 ++-----------------------------
 kernel/kexec_handover_debug.c    | 237 +++++++++++++++++++++++++++
 kernel/kexec_handover_internal.h |  72 ++++++++
 6 files changed, 336 insertions(+), 258 deletions(-)
 create mode 100644 kernel/kexec_handover_debug.c
 create mode 100644 kernel/kexec_handover_internal.h

diff --git a/MAINTAINERS b/MAINTAINERS
index bdea634d63a9..4fc28b6674bd 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -13143,12 +13143,13 @@ KEXEC HANDOVER (KHO)
 M:	Alexander Graf <graf@amazon.com>
 M:	Mike Rapoport <rppt@kernel.org>
 M:	Changyuan Lyu <changyuanl@google.com>
+M:	Pasha Tatashin <pasha.tatashin@soleen.com>
 L:	kexec@lists.infradead.org
 S:	Maintained
 F:	Documentation/admin-guide/mm/kho.rst
 F:	Documentation/core-api/kho/*
 F:	include/linux/kexec_handover.h
-F:	kernel/kexec_handover.c
+F:	kernel/kexec_handover*
 
 KEYS-ENCRYPTED
 M:	Mimi Zohar <zohar@linux.ibm.com>
diff --git a/kernel/Kconfig.kexec b/kernel/Kconfig.kexec
index 4fa212909d69..44f9ac67ecbc 100644
--- a/kernel/Kconfig.kexec
+++ b/kernel/Kconfig.kexec
@@ -109,6 +109,16 @@ config KEXEC_HANDOVER
 	  to keep data or state alive across the kexec. For this to work,
 	  both source and target kernels need to have this option enabled.
 
+config KEXEC_HANDOVER_DEBUG
+	bool "kexec handover debug interface"
+	depends on KEXEC_HANDOVER
+	select DEBUG_FS
+	help
+	  Allow controlling the kexec handover device tree via the debugfs
+	  interface, i.e. finalizing the state or aborting the finalization.
+	  Also enables inspecting the KHO FDT trees via the debugfs binary
+	  blobs.
+
 config CRASH_DUMP
 	bool "kernel crash dumps"
 	default ARCH_DEFAULT_CRASH_DUMP
diff --git a/kernel/Makefile b/kernel/Makefile
index 97c09847db42..ae44877c0300 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -81,6 +81,7 @@ obj-$(CONFIG_KEXEC) += kexec.o
 obj-$(CONFIG_KEXEC_FILE) += kexec_file.o
 obj-$(CONFIG_KEXEC_ELF) += kexec_elf.o
 obj-$(CONFIG_KEXEC_HANDOVER) += kexec_handover.o
+obj-$(CONFIG_KEXEC_HANDOVER_DEBUG) += kexec_handover_debug.o
 obj-$(CONFIG_BACKTRACE_SELF_TEST) += backtracetest.o
 obj-$(CONFIG_COMPAT) += compat.o
 obj-$(CONFIG_CGROUPS) += cgroup/
diff --git a/kernel/kexec_handover.c b/kernel/kexec_handover.c
index 69b953551677..5b65970e9746 100644
--- a/kernel/kexec_handover.c
+++ b/kernel/kexec_handover.c
@@ -10,7 +10,6 @@
 
 #include <linux/cma.h>
 #include <linux/count_zeros.h>
-#include <linux/debugfs.h>
 #include <linux/kexec.h>
 #include <linux/kexec_handover.h>
 #include <linux/libfdt.h>
@@ -27,6 +26,7 @@
  */
 #include "../mm/internal.h"
 #include "kexec_internal.h"
+#include "kexec_handover_internal.h"
 
 #define KHO_FDT_COMPATIBLE "kho-v1"
 #define PROP_PRESERVED_MEMORY_MAP "preserved-memory-map"
@@ -75,22 +75,8 @@ struct kho_mem_phys {
 	struct xarray phys_bits;
 };
 
-struct kho_mem_track {
-	/* Points to kho_mem_phys, each order gets its own bitmap tree */
-	struct xarray orders;
-};
-
 struct khoser_mem_chunk;
 
-struct kho_serialization {
-	struct page *fdt;
-	struct list_head fdt_list;
-	struct dentry *sub_fdt_dir;
-	struct kho_mem_track track;
-	/* First chunk of serialized preserved memory map */
-	struct khoser_mem_chunk *preserved_mem_map;
-};
-
 static void *xa_load_or_alloc(struct xarray *xa, unsigned long index, size_t sz)
 {
 	void *elm, *res;
@@ -355,8 +341,8 @@ static void __init kho_mem_deserialize(const void *fdt)
  * area for early allocations that happen before page allocator is
  * initialized.
  */
-static struct kho_scratch *kho_scratch;
-static unsigned int kho_scratch_cnt;
+struct kho_scratch *kho_scratch;
+unsigned int kho_scratch_cnt;
 
 /*
  * The scratch areas are scaled by default as percent of memory allocated from
@@ -542,37 +528,6 @@ static void __init kho_reserve_scratch(void)
 	kho_enable = false;
 }
 
-struct fdt_debugfs {
-	struct list_head list;
-	struct debugfs_blob_wrapper wrapper;
-	struct dentry *file;
-};
-
-static int kho_debugfs_fdt_add(struct list_head *list, struct dentry *dir,
-			       const char *name, const void *fdt)
-{
-	struct fdt_debugfs *f;
-	struct dentry *file;
-
-	f = kmalloc(sizeof(*f), GFP_KERNEL);
-	if (!f)
-		return -ENOMEM;
-
-	f->wrapper.data = (void *)fdt;
-	f->wrapper.size = fdt_totalsize(fdt);
-
-	file = debugfs_create_blob(name, 0400, dir, &f->wrapper);
-	if (IS_ERR(file)) {
-		kfree(f);
-		return PTR_ERR(file);
-	}
-
-	f->file = file;
-	list_add(&f->list, list);
-
-	return 0;
-}
-
 /**
  * kho_add_subtree - record the physical address of a sub FDT in KHO root tree.
  * @ser: serialization control object passed by KHO notifiers.
@@ -584,7 +539,8 @@ static int kho_debugfs_fdt_add(struct list_head *list, struct dentry *dir,
  * by KHO for the new kernel to retrieve it after kexec.
  *
  * A debugfs blob entry is also created at
- * ``/sys/kernel/debug/kho/out/sub_fdts/@name``.
+ * ``/sys/kernel/debug/kho/out/sub_fdts/@name`` when the kernel is configured
+ * with CONFIG_KEXEC_HANDOVER_DEBUG.
  *
  * Return: 0 on success, error code on failure
  */
@@ -601,22 +557,11 @@ int kho_add_subtree(struct kho_serialization *ser, const char *name, void *fdt)
 	if (err)
 		return err;
 
-	return kho_debugfs_fdt_add(&ser->fdt_list, ser->sub_fdt_dir, name, fdt);
+	return kho_debugfs_fdt_add(ser, name, fdt);
 }
 EXPORT_SYMBOL_GPL(kho_add_subtree);
 
-struct kho_out {
-	struct blocking_notifier_head chain_head;
-
-	struct dentry *dir;
-
-	struct mutex lock; /* protects KHO FDT finalization */
-
-	struct kho_serialization ser;
-	bool finalized;
-};
-
-static struct kho_out kho_out = {
+struct kho_out kho_out = {
 	.chain_head = BLOCKING_NOTIFIER_INIT(kho_out.chain_head),
 	.lock = __MUTEX_INITIALIZER(kho_out.lock),
 	.ser = {
@@ -707,30 +652,7 @@ int kho_preserve_phys(phys_addr_t phys, size_t size)
 }
 EXPORT_SYMBOL_GPL(kho_preserve_phys);
 
-/* Handling for debug/kho/out */
-
-static struct dentry *debugfs_root;
-
-static int kho_out_update_debugfs_fdt(void)
-{
-	int err = 0;
-	struct fdt_debugfs *ff, *tmp;
-
-	if (kho_out.finalized) {
-		err = kho_debugfs_fdt_add(&kho_out.ser.fdt_list, kho_out.dir,
-					  "fdt", page_to_virt(kho_out.ser.fdt));
-	} else {
-		list_for_each_entry_safe(ff, tmp, &kho_out.ser.fdt_list, list) {
-			debugfs_remove(ff->file);
-			list_del(&ff->list);
-			kfree(ff);
-		}
-	}
-
-	return err;
-}
-
-static int kho_abort(void)
+int __kho_abort(void)
 {
 	int err;
 	unsigned long order;
@@ -763,7 +685,7 @@ static int kho_abort(void)
 	return err;
 }
 
-static int kho_finalize(void)
+int __kho_finalize(void)
 {
 	int err = 0;
 	u64 *preserved_mem_map;
@@ -806,117 +728,13 @@ static int kho_finalize(void)
 abort:
 	if (err) {
 		pr_err("Failed to convert KHO state tree: %d\n", err);
-		kho_abort();
+		__kho_abort();
 	}
 
 	return err;
 }
 
-static int kho_out_finalize_get(void *data, u64 *val)
-{
-	mutex_lock(&kho_out.lock);
-	*val = kho_out.finalized;
-	mutex_unlock(&kho_out.lock);
-
-	return 0;
-}
-
-static int kho_out_finalize_set(void *data, u64 _val)
-{
-	int ret = 0;
-	bool val = !!_val;
-
-	mutex_lock(&kho_out.lock);
-
-	if (val == kho_out.finalized) {
-		if (kho_out.finalized)
-			ret = -EEXIST;
-		else
-			ret = -ENOENT;
-		goto unlock;
-	}
-
-	if (val)
-		ret = kho_finalize();
-	else
-		ret = kho_abort();
-
-	if (ret)
-		goto unlock;
-
-	kho_out.finalized = val;
-	ret = kho_out_update_debugfs_fdt();
-
-unlock:
-	mutex_unlock(&kho_out.lock);
-	return ret;
-}
-
-DEFINE_DEBUGFS_ATTRIBUTE(fops_kho_out_finalize, kho_out_finalize_get,
-			 kho_out_finalize_set, "%llu\n");
-
-static int scratch_phys_show(struct seq_file *m, void *v)
-{
-	for (int i = 0; i < kho_scratch_cnt; i++)
-		seq_printf(m, "0x%llx\n", kho_scratch[i].addr);
-
-	return 0;
-}
-DEFINE_SHOW_ATTRIBUTE(scratch_phys);
-
-static int scratch_len_show(struct seq_file *m, void *v)
-{
-	for (int i = 0; i < kho_scratch_cnt; i++)
-		seq_printf(m, "0x%llx\n", kho_scratch[i].size);
-
-	return 0;
-}
-DEFINE_SHOW_ATTRIBUTE(scratch_len);
-
-static __init int kho_out_debugfs_init(void)
-{
-	struct dentry *dir, *f, *sub_fdt_dir;
-
-	dir = debugfs_create_dir("out", debugfs_root);
-	if (IS_ERR(dir))
-		return -ENOMEM;
-
-	sub_fdt_dir = debugfs_create_dir("sub_fdts", dir);
-	if (IS_ERR(sub_fdt_dir))
-		goto err_rmdir;
-
-	f = debugfs_create_file("scratch_phys", 0400, dir, NULL,
-				&scratch_phys_fops);
-	if (IS_ERR(f))
-		goto err_rmdir;
-
-	f = debugfs_create_file("scratch_len", 0400, dir, NULL,
-				&scratch_len_fops);
-	if (IS_ERR(f))
-		goto err_rmdir;
-
-	f = debugfs_create_file("finalize", 0600, dir, NULL,
-				&fops_kho_out_finalize);
-	if (IS_ERR(f))
-		goto err_rmdir;
-
-	kho_out.dir = dir;
-	kho_out.ser.sub_fdt_dir = sub_fdt_dir;
-	return 0;
-
-err_rmdir:
-	debugfs_remove_recursive(dir);
-	return -ENOENT;
-}
-
-struct kho_in {
-	struct dentry *dir;
-	phys_addr_t fdt_phys;
-	phys_addr_t scratch_phys;
-	struct list_head fdt_list;
-};
-
-static struct kho_in kho_in = {
+struct kho_in kho_in = {
 	.fdt_list = LIST_HEAD_INIT(kho_in.fdt_list),
 };
 
@@ -961,56 +779,6 @@ int kho_retrieve_subtree(const char *name, phys_addr_t *phys)
 }
 EXPORT_SYMBOL_GPL(kho_retrieve_subtree);
 
-/* Handling for debugfs/kho/in */
-
-static __init int kho_in_debugfs_init(const void *fdt)
-{
-	struct dentry *sub_fdt_dir;
-	int err, child;
-
-	kho_in.dir = debugfs_create_dir("in", debugfs_root);
-	if (IS_ERR(kho_in.dir))
-		return PTR_ERR(kho_in.dir);
-
-	sub_fdt_dir = debugfs_create_dir("sub_fdts", kho_in.dir);
-	if (IS_ERR(sub_fdt_dir)) {
-		err = PTR_ERR(sub_fdt_dir);
-		goto err_rmdir;
-	}
-
-	err = kho_debugfs_fdt_add(&kho_in.fdt_list, kho_in.dir, "fdt", fdt);
-	if (err)
-		goto err_rmdir;
-
-	fdt_for_each_subnode(child, fdt, 0) {
-		int len = 0;
-		const char *name = fdt_get_name(fdt, child, NULL);
-		const u64 *fdt_phys;
-
-		fdt_phys = fdt_getprop(fdt, child, "fdt", &len);
-		if (!fdt_phys)
-			continue;
-		if (len != sizeof(*fdt_phys)) {
-			pr_warn("node `%s`'s prop `fdt` has invalid length: %d\n",
-				name, len);
-			continue;
-		}
-		err = kho_debugfs_fdt_add(&kho_in.fdt_list, sub_fdt_dir, name,
-					  phys_to_virt(*fdt_phys));
-		if (err) {
-			pr_warn("failed to add fdt `%s` to debugfs: %d\n", name,
-				err);
-			continue;
-		}
-	}
-
-	return 0;
-
-err_rmdir:
-	debugfs_remove_recursive(kho_in.dir);
-	return err;
-}
-
 static __init int kho_init(void)
 {
 	int err = 0;
@@ -1025,27 +793,16 @@ static __init int kho_init(void)
 		goto err_free_scratch;
 	}
 
-	debugfs_root = debugfs_create_dir("kho", NULL);
-	if (IS_ERR(debugfs_root)) {
-		err = -ENOENT;
+	err = kho_debugfs_init();
+	if (err)
 		goto err_free_fdt;
-	}
 
 	err = kho_out_debugfs_init();
 	if (err)
 		goto err_free_fdt;
 
 	if (fdt) {
-		err = kho_in_debugfs_init(fdt);
-		/*
-		 * Failure to create /sys/kernel/debug/kho/in does not prevent
-		 * reviving state from KHO and setting up KHO for the next
-		 * kexec.
-		 */
-		if (err)
-			pr_err("failed exposing handover FDT in debugfs: %d\n",
-			       err);
-
+		kho_in_debugfs_init(fdt);
 		return 0;
 	}
 
diff --git a/kernel/kexec_handover_debug.c b/kernel/kexec_handover_debug.c
new file mode 100644
index 000000000000..696131a3480f
--- /dev/null
+++ b/kernel/kexec_handover_debug.c
@@ -0,0 +1,237 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * kexec_handover_debug.c - kexec handover debugfs interface
+ * Copyright (C) 2023 Alexander Graf <graf@amazon.com>
+ * Copyright (C) 2025 Microsoft Corporation, Mike Rapoport <rppt@kernel.org>
+ * Copyright (C) 2025 Google LLC, Changyuan Lyu <changyuanl@google.com>
+ * Copyright (C) 2025 Google LLC, Pasha Tatashin <pasha.tatashin@soleen.com>
+ */
+
+#define pr_fmt(fmt) "KHO: " fmt
+
+#include <linux/init.h>
+#include <linux/io.h>
+#include <linux/libfdt.h>
+#include <linux/mm.h>
+#include "kexec_handover_internal.h"
+
+static struct dentry *debugfs_root;
+
+struct fdt_debugfs {
+	struct list_head list;
+	struct debugfs_blob_wrapper wrapper;
+	struct dentry *file;
+};
+
+static int __kho_debugfs_fdt_add(struct list_head *list, struct dentry *dir,
+				 const char *name, const void *fdt)
+{
+	struct fdt_debugfs *f;
+	struct dentry *file;
+
+	f = kmalloc(sizeof(*f), GFP_KERNEL);
+	if (!f)
+		return -ENOMEM;
+
+	f->wrapper.data = (void *)fdt;
+	f->wrapper.size = fdt_totalsize(fdt);
+
+	file = debugfs_create_blob(name, 0400, dir, &f->wrapper);
+	if (IS_ERR(file)) {
+		kfree(f);
+		return PTR_ERR(file);
+	}
+
+	f->file = file;
+	list_add(&f->list, list);
+
+	return 0;
+}
+
+int kho_debugfs_fdt_add(struct kho_serialization *ser, const char *name,
+			const void *fdt)
+{
+	return __kho_debugfs_fdt_add(&ser->fdt_list, ser->sub_fdt_dir, name,
+				     fdt);
+}
+
+static int kho_out_update_debugfs_fdt(void)
+{
+	int err = 0;
+	struct fdt_debugfs *ff, *tmp;
+
+	if (kho_out.finalized) {
+		err = __kho_debugfs_fdt_add(&kho_out.ser.fdt_list, kho_out.dir,
+					    "fdt",
+					    page_to_virt(kho_out.ser.fdt));
+	} else {
+		list_for_each_entry_safe(ff, tmp, &kho_out.ser.fdt_list, list) {
+			debugfs_remove(ff->file);
+			list_del(&ff->list);
+			kfree(ff);
+		}
+	}
+
+	return err;
+}
+
+static int kho_out_finalize_get(void *data, u64 *val)
+{
+	mutex_lock(&kho_out.lock);
+	*val = kho_out.finalized;
+	mutex_unlock(&kho_out.lock);
+
+	return 0;
+}
+
+static int kho_out_finalize_set(void *data, u64 _val)
+{
+	int ret = 0;
+	bool val = !!_val;
+
+	mutex_lock(&kho_out.lock);
+
+	if (val == kho_out.finalized) {
+		if (kho_out.finalized)
+			ret = -EEXIST;
+		else
+			ret = -ENOENT;
+		goto unlock;
+	}
+
+	if (val)
+		ret = __kho_finalize();
+	else
+		ret = __kho_abort();
+
+	if (ret)
+		goto unlock;
+
+	kho_out.finalized = val;
+	ret = kho_out_update_debugfs_fdt();
+
+unlock:
+	mutex_unlock(&kho_out.lock);
+	return ret;
+}
+
+DEFINE_DEBUGFS_ATTRIBUTE(fops_kho_out_finalize, kho_out_finalize_get,
+			 kho_out_finalize_set, "%llu\n");
+
+static int scratch_phys_show(struct seq_file *m, void *v)
+{
+	for (int i = 0; i < kho_scratch_cnt; i++)
+		seq_printf(m, "0x%llx\n", kho_scratch[i].addr);
+
+	return 0;
+}
+DEFINE_SHOW_ATTRIBUTE(scratch_phys);
+
+static int scratch_len_show(struct seq_file *m, void *v)
+{
+	for (int i = 0; i < kho_scratch_cnt; i++)
+		seq_printf(m, "0x%llx\n", kho_scratch[i].size);
+
+	return 0;
+}
+DEFINE_SHOW_ATTRIBUTE(scratch_len);
+
+__init void kho_in_debugfs_init(const void *fdt)
+{
+	struct dentry *sub_fdt_dir;
+	int err, child;
+
+	kho_in.dir = debugfs_create_dir("in", debugfs_root);
+	if (IS_ERR(kho_in.dir)) {
+		err = PTR_ERR(kho_in.dir);
+		goto err_out;
+	}
+
+	sub_fdt_dir = debugfs_create_dir("sub_fdts", kho_in.dir);
+	if (IS_ERR(sub_fdt_dir)) {
+		err = PTR_ERR(sub_fdt_dir);
+		goto err_rmdir;
+	}
+
+	err = __kho_debugfs_fdt_add(&kho_in.fdt_list, kho_in.dir, "fdt", fdt);
+	if (err)
+		goto err_rmdir;
+
+	fdt_for_each_subnode(child, fdt, 0) {
+		int len = 0;
+		const char *name = fdt_get_name(fdt, child, NULL);
+		const u64 *fdt_phys;
+
+		fdt_phys = fdt_getprop(fdt, child, "fdt", &len);
+		if (!fdt_phys)
+			continue;
+		if (len != sizeof(*fdt_phys)) {
+			pr_warn("node `%s`'s prop `fdt` has invalid length: %d\n",
+				name, len);
+			continue;
+		}
+		err = __kho_debugfs_fdt_add(&kho_in.fdt_list, sub_fdt_dir, name,
+					    phys_to_virt(*fdt_phys));
+		if (err) {
+			pr_warn("failed to add fdt `%s` to debugfs: %d\n", name,
+				err);
+			continue;
+		}
+	}
+
+	return;
+err_rmdir:
+	debugfs_remove_recursive(kho_in.dir);
+err_out:
+	/*
+	 * Failure to create /sys/kernel/debug/kho/in does not prevent
+	 * reviving state from KHO and setting up KHO for the next
+	 * kexec.
+	 */
+	if (err)
+		pr_err("failed exposing handover FDT in debugfs: %d\n", err);
+}
+
+__init int kho_out_debugfs_init(void)
+{
+	struct dentry *dir, *f, *sub_fdt_dir;
+
+	dir = debugfs_create_dir("out", debugfs_root);
+	if (IS_ERR(dir))
+		return -ENOMEM;
+
+	sub_fdt_dir = debugfs_create_dir("sub_fdts", dir);
+	if (IS_ERR(sub_fdt_dir))
+		goto err_rmdir;
+
+	f = debugfs_create_file("scratch_phys", 0400, dir, NULL,
+				&scratch_phys_fops);
+	if (IS_ERR(f))
+		goto err_rmdir;
+
+	f = debugfs_create_file("scratch_len", 0400, dir, NULL,
+				&scratch_len_fops);
+	if (IS_ERR(f))
+		goto err_rmdir;
+
+	f = debugfs_create_file("finalize", 0600, dir, NULL,
+				&fops_kho_out_finalize);
+	if (IS_ERR(f))
+		goto err_rmdir;
+
+	kho_out.dir = dir;
+	kho_out.ser.sub_fdt_dir = sub_fdt_dir;
+	return 0;
+
+err_rmdir:
+	debugfs_remove_recursive(dir);
+	return -ENOENT;
+}
+
+__init int kho_debugfs_init(void)
+{
+	debugfs_root = debugfs_create_dir("kho", NULL);
+	if (IS_ERR(debugfs_root))
+		return -ENOENT;
+	return 0;
+}
diff --git a/kernel/kexec_handover_internal.h b/kernel/kexec_handover_internal.h
new file mode 100644
index 000000000000..65ff0f651192
--- /dev/null
+++ b/kernel/kexec_handover_internal.h
@@ -0,0 +1,72 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef LINUX_KEXEC_HANDOVER_INTERNAL_H
+#define LINUX_KEXEC_HANDOVER_INTERNAL_H
+
+#include <linux/kexec_handover.h>
+#include <linux/list.h>
+#include <linux/types.h>
+
+#ifdef CONFIG_KEXEC_HANDOVER_DEBUG
+#include <linux/debugfs.h>
+#endif
+
+struct kho_mem_track {
+	/* Points to kho_mem_phys, each order gets its own bitmap tree */
+	struct xarray orders;
+};
+
+struct kho_serialization {
+	struct page *fdt;
+	struct list_head fdt_list;
+	struct kho_mem_track track;
+	/* First chunk of serialized preserved memory map */
+	struct khoser_mem_chunk *preserved_mem_map;
+#ifdef CONFIG_KEXEC_HANDOVER_DEBUG
+	struct dentry *sub_fdt_dir;
+#endif
+};
+
+struct kho_in {
+	phys_addr_t fdt_phys;
+	phys_addr_t scratch_phys;
+	struct list_head fdt_list;
+#ifdef CONFIG_KEXEC_HANDOVER_DEBUG
+	struct dentry *dir;
+#endif
+};
+
+struct kho_out {
+	struct blocking_notifier_head chain_head;
+	struct mutex lock; /* protects KHO FDT finalization */
+	struct kho_serialization ser;
+	bool finalized;
+#ifdef CONFIG_KEXEC_HANDOVER_DEBUG
+	struct dentry *dir;
+#endif
+};
+
+extern struct kho_in kho_in;
+extern struct kho_out kho_out;
+
+extern struct kho_scratch *kho_scratch;
+extern unsigned int kho_scratch_cnt;
+
+int __kho_finalize(void);
+int __kho_abort(void);
+
+#ifdef CONFIG_KEXEC_HANDOVER_DEBUG
+int kho_debugfs_init(void);
+void kho_in_debugfs_init(const void *fdt);
+int kho_out_debugfs_init(void);
+int kho_debugfs_fdt_add(struct kho_serialization *ser, const char *name,
+			const void *fdt);
+#else
+static inline int kho_debugfs_init(void) { return 0; }
+static inline void kho_in_debugfs_init(const void *fdt) { }
+static inline int kho_out_debugfs_init(void) { return 0; }
+static inline int kho_debugfs_fdt_add(struct kho_serialization *ser,
+				      const char *name,
+				      const void *fdt) { return 0; }
+#endif /* CONFIG_KEXEC_HANDOVER_DEBUG */
+
+#endif /* LINUX_KEXEC_HANDOVER_INTERNAL_H */
-- 
2.49.0.1101.gccaa498523-goog


* [RFC v2 02/16] kho: allow to drive kho from within kernel
From: Pasha Tatashin @ 2025-05-15 18:23 UTC

Allow finalize and abort to be invoked from kernel modules, so that
LUO can drive the KHO sequence via its own state machine.
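
The calling contract of the new wrappers can be modelled in userspace
as follows. This is a simplified sketch: the model_* names are
hypothetical, a pthread mutex stands in for kho_out.lock, the stand-ins
for __kho_finalize() and __kho_abort() always succeed, and the
kho_enable check and the debugfs update step are left out.

```c
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>

/* Stand-ins for kho_out.lock and kho_out.finalized. */
static pthread_mutex_t model_lock = PTHREAD_MUTEX_INITIALIZER;
static bool model_finalized;

/* Stand-ins for __kho_finalize() and __kho_abort(); always succeed here. */
static int model_do_finalize(void) { return 0; }
static int model_do_abort(void) { return 0; }

/* Finalize once; a second call without an abort in between is -EEXIST. */
int model_finalize(void)
{
	int ret;

	pthread_mutex_lock(&model_lock);
	if (model_finalized) {
		ret = -EEXIST;
		goto unlock;
	}
	ret = model_do_finalize();
	if (!ret)
		model_finalized = true;
unlock:
	pthread_mutex_unlock(&model_lock);
	return ret;
}

/* Abort a previous finalize; -ENOENT if there is nothing to abort. */
int model_abort(void)
{
	int ret;

	pthread_mutex_lock(&model_lock);
	if (!model_finalized) {
		ret = -ENOENT;
		goto unlock;
	}
	ret = model_do_abort();
	if (!ret)
		model_finalized = false;
unlock:
	pthread_mutex_unlock(&model_lock);
	return ret;
}
```

The point of the sketch is that a caller such as LUO has to pair each
successful finalize with at most one abort: a repeated finalize reports
-EEXIST and an abort without a prior finalize reports -ENOENT.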

Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
---
 include/linux/kexec_handover.h   | 15 +++++++++
 kernel/kexec_handover.c          | 54 ++++++++++++++++++++++++++++++++
 kernel/kexec_handover_debug.c    |  2 +-
 kernel/kexec_handover_internal.h |  2 ++
 4 files changed, 72 insertions(+), 1 deletion(-)

diff --git a/include/linux/kexec_handover.h b/include/linux/kexec_handover.h
index 348844cffb13..f98565def593 100644
--- a/include/linux/kexec_handover.h
+++ b/include/linux/kexec_handover.h
@@ -54,6 +54,10 @@ void kho_memory_init(void);
 
 void kho_populate(phys_addr_t fdt_phys, u64 fdt_len, phys_addr_t scratch_phys,
 		  u64 scratch_len);
+
+int kho_finalize(void);
+int kho_abort(void);
+
 #else
 static inline bool kho_is_enabled(void)
 {
@@ -104,6 +108,17 @@ static inline void kho_populate(phys_addr_t fdt_phys, u64 fdt_len,
 				phys_addr_t scratch_phys, u64 scratch_len)
 {
 }
+
+static inline int kho_finalize(void)
+{
+	return -EOPNOTSUPP;
+}
+
+static inline int kho_abort(void)
+{
+	return -EOPNOTSUPP;
+}
+
 #endif /* CONFIG_KEXEC_HANDOVER */
 
 #endif /* LINUX_KEXEC_HANDOVER_H */
diff --git a/kernel/kexec_handover.c b/kernel/kexec_handover.c
index 5b65970e9746..8ff561e36a87 100644
--- a/kernel/kexec_handover.c
+++ b/kernel/kexec_handover.c
@@ -734,6 +734,60 @@ int __kho_finalize(void)
 	return err;
 }
 
+int kho_finalize(void)
+{
+	int ret = 0;
+
+	if (!kho_enable)
+		return -EOPNOTSUPP;
+
+	mutex_lock(&kho_out.lock);
+
+	if (kho_out.finalized) {
+		ret = -EEXIST;
+		goto unlock;
+	}
+
+	ret = __kho_finalize();
+	if (ret)
+		goto unlock;
+
+	kho_out.finalized = true;
+	ret = kho_out_update_debugfs_fdt();
+
+unlock:
+	mutex_unlock(&kho_out.lock);
+	return ret;
+}
+EXPORT_SYMBOL_GPL(kho_finalize);
+
+int kho_abort(void)
+{
+	int ret = 0;
+
+	if (!kho_enable)
+		return -EOPNOTSUPP;
+
+	mutex_lock(&kho_out.lock);
+
+	if (!kho_out.finalized) {
+		ret = -ENOENT;
+		goto unlock;
+	}
+
+	ret = __kho_abort();
+	if (ret)
+		goto unlock;
+
+	kho_out.finalized = false;
+	ret = kho_out_update_debugfs_fdt();
+
+unlock:
+	mutex_unlock(&kho_out.lock);
+	return ret;
+}
+EXPORT_SYMBOL_GPL(kho_abort);
+
 struct kho_in kho_in = {
 	.fdt_list = LIST_HEAD_INIT(kho_in.fdt_list),
 };
diff --git a/kernel/kexec_handover_debug.c b/kernel/kexec_handover_debug.c
index 696131a3480f..a15c238ec98e 100644
--- a/kernel/kexec_handover_debug.c
+++ b/kernel/kexec_handover_debug.c
@@ -55,7 +55,7 @@ int kho_debugfs_fdt_add(struct kho_serialization *ser, const char *name,
 				     fdt);
 }
 
-static int kho_out_update_debugfs_fdt(void)
+int kho_out_update_debugfs_fdt(void)
 {
 	int err = 0;
 	struct fdt_debugfs *ff, *tmp;
diff --git a/kernel/kexec_handover_internal.h b/kernel/kexec_handover_internal.h
index 65ff0f651192..0b534758d39d 100644
--- a/kernel/kexec_handover_internal.h
+++ b/kernel/kexec_handover_internal.h
@@ -60,6 +60,7 @@ void kho_in_debugfs_init(const void *fdt);
 int kho_out_debugfs_init(void);
 int kho_debugfs_fdt_add(struct kho_serialization *ser, const char *name,
 			const void *fdt);
+int kho_out_update_debugfs_fdt(void);
 #else
 static inline int kho_debugfs_init(void) { return 0; }
 static inline void kho_in_debugfs_init(const void *fdt) { }
@@ -67,6 +68,7 @@ static inline int kho_out_debugfs_init(void) { return 0; }
 static inline int kho_debugfs_fdt_add(struct kho_serialization *ser,
 				      const char *name,
 				      const void *fdt) { return 0; }
+static inline int kho_out_update_debugfs_fdt(void) { return 0; }
 #endif /* CONFIG_KEXEC_HANDOVER_DEBUG */
 
 #endif /* LINUX_KEXEC_HANDOVER_INTERNAL_H */
-- 
2.49.0.1101.gccaa498523-goog


* [RFC v2 03/16] kho: add kho_unpreserve_folio/phys
  2025-05-15 18:23 [RFC v2 00/16] Live Update Orchestrator Pasha Tatashin
  2025-05-15 18:23 ` [RFC v2 01/16] kho: make debugfs interface optional Pasha Tatashin
  2025-05-15 18:23 ` [RFC v2 02/16] kho: allow to drive kho from within kernel Pasha Tatashin
@ 2025-05-15 18:23 ` Pasha Tatashin
  2025-06-04 15:00   ` Pratyush Yadav
  2025-05-15 18:23 ` [RFC v2 04/16] luo: luo_core: Live Update Orchestrator Pasha Tatashin
                   ` (14 subsequent siblings)
  17 siblings, 1 reply; 102+ messages in thread
From: Pasha Tatashin @ 2025-05-15 18:23 UTC (permalink / raw)

From: Changyuan Lyu <changyuanl@google.com>

Allow users of KHO to cancel a previous preservation by adding the
interfaces needed to unpreserve a folio or a physical range.
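
For readers unfamiliar with the range walk in __kho_unpreserve(), the
following userspace sketch shows how a pfn range is split into
naturally aligned power-of-two blocks, each of which maps to one bit in
the per-order tracking structure. The names split_range(), model_ctz(),
model_ilog2() and struct block are hypothetical; model_ctz() treating
pfn 0 as aligned to every order is a choice of this model, not a
statement about count_trailing_zeros().

```c
#include <stddef.h>

struct block {
	unsigned long pfn;
	unsigned int order;
};

/* floor(log2(v)) for v > 0; stands in for the kernel's ilog2(). */
static unsigned int model_ilog2(unsigned long v)
{
	unsigned int r = 0;

	while (v >>= 1)
		r++;
	return r;
}

/* Number of trailing zero bits; 0 is treated as aligned to any order. */
static unsigned int model_ctz(unsigned long v)
{
	unsigned int r = 0;

	if (!v)
		return sizeof(v) * 8;
	while (!(v & 1)) {
		v >>= 1;
		r++;
	}
	return r;
}

/*
 * Split [pfn, end_pfn) into blocks. Each block is limited by the
 * alignment of its start and by the number of pages that remain.
 * Returns the number of blocks written to @out (at most @max).
 */
static size_t split_range(unsigned long pfn, unsigned long end_pfn,
			  struct block *out, size_t max)
{
	size_t n = 0;

	while (pfn < end_pfn && n < max) {
		unsigned int align = model_ctz(pfn);
		unsigned int fit = model_ilog2(end_pfn - pfn);
		unsigned int order = align < fit ? align : fit;

		out[n].pfn = pfn;
		out[n].order = order;
		n++;
		pfn += 1UL << order;
	}
	return n;
}
```

With this model, the range [5, 16) is split into an order-0 block at
pfn 5, an order-1 block at pfn 6 and an order-3 block at pfn 8. Because
only the bits for exactly these blocks are cleared, an unpreserve call
presumably needs to cover the same range that was preserved.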

Signed-off-by: Changyuan Lyu <changyuanl@google.com>
Co-developed-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
---
 include/linux/kexec_handover.h | 12 +++++
 kernel/kexec_handover.c        | 84 ++++++++++++++++++++++++++++------
 2 files changed, 83 insertions(+), 13 deletions(-)

diff --git a/include/linux/kexec_handover.h b/include/linux/kexec_handover.h
index f98565def593..3d209f9e9d3a 100644
--- a/include/linux/kexec_handover.h
+++ b/include/linux/kexec_handover.h
@@ -42,7 +42,9 @@ struct kho_serialization;
 bool kho_is_enabled(void);
 
 int kho_preserve_folio(struct folio *folio);
+int kho_unpreserve_folio(struct folio *folio);
 int kho_preserve_phys(phys_addr_t phys, size_t size);
+int kho_unpreserve_phys(phys_addr_t phys, size_t size);
 struct folio *kho_restore_folio(phys_addr_t phys);
 int kho_add_subtree(struct kho_serialization *ser, const char *name, void *fdt);
 int kho_retrieve_subtree(const char *name, phys_addr_t *phys);
@@ -69,11 +71,21 @@ static inline int kho_preserve_folio(struct folio *folio)
 	return -EOPNOTSUPP;
 }
 
+static inline int kho_unpreserve_folio(struct folio *folio)
+{
+	return -EOPNOTSUPP;
+}
+
 static inline int kho_preserve_phys(phys_addr_t phys, size_t size)
 {
 	return -EOPNOTSUPP;
 }
 
+static inline int kho_unpreserve_phys(phys_addr_t phys, size_t size)
+{
+	return -EOPNOTSUPP;
+}
+
 static inline struct folio *kho_restore_folio(phys_addr_t phys)
 {
 	return NULL;
diff --git a/kernel/kexec_handover.c b/kernel/kexec_handover.c
index 8ff561e36a87..eb305e7e6129 100644
--- a/kernel/kexec_handover.c
+++ b/kernel/kexec_handover.c
@@ -101,26 +101,33 @@ static void *xa_load_or_alloc(struct xarray *xa, unsigned long index, size_t sz)
 	return elm;
 }
 
-static void __kho_unpreserve(struct kho_mem_track *track, unsigned long pfn,
-			     unsigned long end_pfn)
+static void __kho_unpreserve_order(struct kho_mem_track *track, unsigned long pfn,
+				   unsigned int order)
 {
 	struct kho_mem_phys_bits *bits;
 	struct kho_mem_phys *physxa;
+	const unsigned long pfn_high = pfn >> order;
 
-	while (pfn < end_pfn) {
-		const unsigned int order =
-			min(count_trailing_zeros(pfn), ilog2(end_pfn - pfn));
-		const unsigned long pfn_high = pfn >> order;
+	physxa = xa_load(&track->orders, order);
+	if (!physxa)
+		return;
 
-		physxa = xa_load(&track->orders, order);
-		if (!physxa)
-			continue;
+	bits = xa_load(&physxa->phys_bits, pfn_high / PRESERVE_BITS);
+	if (!bits)
+		return;
 
-		bits = xa_load(&physxa->phys_bits, pfn_high / PRESERVE_BITS);
-		if (!bits)
-			continue;
+	clear_bit(pfn_high % PRESERVE_BITS, bits->preserve);
+}
 
-		clear_bit(pfn_high % PRESERVE_BITS, bits->preserve);
+static void __kho_unpreserve(struct kho_mem_track *track, unsigned long pfn,
+			     unsigned long end_pfn)
+{
+	unsigned int order;
+
+	while (pfn < end_pfn) {
+		order = min(count_trailing_zeros(pfn), ilog2(end_pfn - pfn));
+
+		__kho_unpreserve_order(track, pfn, order);
 
 		pfn += 1 << order;
 	}
@@ -607,6 +614,29 @@ int kho_preserve_folio(struct folio *folio)
 }
 EXPORT_SYMBOL_GPL(kho_preserve_folio);
 
+/**
+ * kho_unpreserve_folio - unpreserve a folio.
+ * @folio: folio to unpreserve.
+ *
+ * Instructs KHO to unpreserve a folio that was preserved by
+ * kho_preserve_folio() before.
+ *
+ * Return: 0 on success, error code on failure
+ */
+int kho_unpreserve_folio(struct folio *folio)
+{
+	const unsigned long pfn = folio_pfn(folio);
+	const unsigned int order = folio_order(folio);
+	struct kho_mem_track *track = &kho_out.ser.track;
+
+	if (kho_out.finalized)
+		return -EBUSY;
+
+	__kho_unpreserve_order(track, pfn, order);
+	return 0;
+}
+EXPORT_SYMBOL_GPL(kho_unpreserve_folio);
+
 /**
  * kho_preserve_phys - preserve a physically contiguous range across kexec.
  * @phys: physical address of the range.
@@ -652,6 +682,34 @@ int kho_preserve_phys(phys_addr_t phys, size_t size)
 }
 EXPORT_SYMBOL_GPL(kho_preserve_phys);
 
+/**
+ * kho_unpreserve_phys - unpreserve a physically contiguous range.
+ * @phys: physical address of the range.
+ * @size: size of the range.
+ *
+ * Instructs KHO to unpreserve the memory range from @phys to @phys + @size
+ * that was preserved by kho_preserve_phys() before.
+ *
+ * Return: 0 on success, error code on failure
+ */
+int kho_unpreserve_phys(phys_addr_t phys, size_t size)
+{
+	struct kho_mem_track *track = &kho_out.ser.track;
+	unsigned long pfn = PHYS_PFN(phys);
+	unsigned long end_pfn = PHYS_PFN(phys + size);
+
+	if (kho_out.finalized)
+		return -EBUSY;
+
+	if (!PAGE_ALIGNED(phys) || !PAGE_ALIGNED(size))
+		return -EINVAL;
+
+	__kho_unpreserve(track, pfn, end_pfn);
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(kho_unpreserve_phys);
+
 int __kho_abort(void)
 {
 	int err;
-- 
2.49.0.1101.gccaa498523-goog


* [RFC v2 04/16] luo: luo_core: Live Update Orchestrator
  2025-05-15 18:23 [RFC v2 00/16] Live Update Orchestrator Pasha Tatashin
                   ` (2 preceding siblings ...)
  2025-05-15 18:23 ` [RFC v2 03/16] kho: add kho_unpreserve_folio/phys Pasha Tatashin
@ 2025-05-15 18:23 ` Pasha Tatashin
  2025-05-26  6:31   ` Mike Rapoport
  2025-06-04 15:17   ` Pratyush Yadav
  2025-05-15 18:23 ` [RFC v2 05/16] luo: luo_core: integrate with KHO Pasha Tatashin
                   ` (13 subsequent siblings)
  17 siblings, 2 replies; 102+ messages in thread
From: Pasha Tatashin @ 2025-05-15 18:23 UTC (permalink / raw)

Introduce LUO, a mechanism intended to facilitate kernel updates while
keeping designated devices operational across the transition (e.g., via
kexec). The primary use case is updating hypervisors with minimal
disruption to running virtual machines. The userspace side of a
hypervisor update is already covered by copyless migration; LUO covers
updating the kernel.

This initial patch lays the groundwork for the LUO subsystem.

Further functionality, including the implementation of state transition
logic, integration with KHO, and hooks for subsystems and file
descriptors, will be added in subsequent patches.
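
The transitions that this patch already implements can be summarised
with a small userspace model. It is a sketch under assumptions: the
model_* names and the `cur` variable are hypothetical, the rwsem is
omitted, and the callback result is passed in as a parameter because
the real callbacks are still stubs at this point in the series.

```c
#include <errno.h>

enum model_state {
	MODEL_NORMAL,
	MODEL_PREPARED,
	MODEL_FROZEN,
	MODEL_UPDATED,
};

/* Stands in for luo_state; starts in the normal state. */
static enum model_state cur = MODEL_NORMAL;

/*
 * Like luo_freeze(): only legal from PREPARED. A callback failure
 * cancels the update and returns the machine to NORMAL.
 */
static int model_freeze(int callbacks_err)
{
	if (cur != MODEL_PREPARED)
		return -EINVAL;
	cur = callbacks_err ? MODEL_NORMAL : MODEL_FROZEN;
	return callbacks_err;
}

/* Like luo_finish(): only legal from UPDATED, always ends in NORMAL. */
static int model_finish(void)
{
	if (cur != MODEL_UPDATED)
		return -EINVAL;
	cur = MODEL_NORMAL;
	return 0;
}

/*
 * Like liveupdate_reboot(): an implicit freeze if the machine is still
 * PREPARED, and a no-op in every other state.
 */
static int model_reboot(void)
{
	if (cur != MODEL_PREPARED)
		return 0;
	return model_freeze(0);
}
```

The model has no way to enter PREPARED or UPDATED on its own, which
matches this patch: luo_prepare() is a stub here, and the UPDATED state
is only expected to be set once KHO integration lands later in the
series.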

Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
---
 drivers/misc/Kconfig                   |   1 +
 drivers/misc/Makefile                  |   1 +
 drivers/misc/liveupdate/Kconfig        |  27 +++
 drivers/misc/liveupdate/Makefile       |   2 +
 drivers/misc/liveupdate/luo_core.c     | 296 +++++++++++++++++++++++++
 drivers/misc/liveupdate/luo_internal.h |  26 +++
 include/linux/liveupdate.h             | 131 +++++++++++
 7 files changed, 484 insertions(+)
 create mode 100644 drivers/misc/liveupdate/Kconfig
 create mode 100644 drivers/misc/liveupdate/Makefile
 create mode 100644 drivers/misc/liveupdate/luo_core.c
 create mode 100644 drivers/misc/liveupdate/luo_internal.h
 create mode 100644 include/linux/liveupdate.h

diff --git a/drivers/misc/Kconfig b/drivers/misc/Kconfig
index 6b37d61150ee..851fd9c33b36 100644
--- a/drivers/misc/Kconfig
+++ b/drivers/misc/Kconfig
@@ -636,6 +636,7 @@ source "drivers/misc/c2port/Kconfig"
 source "drivers/misc/eeprom/Kconfig"
 source "drivers/misc/cb710/Kconfig"
 source "drivers/misc/lis3lv02d/Kconfig"
+source "drivers/misc/liveupdate/Kconfig"
 source "drivers/misc/altera-stapl/Kconfig"
 source "drivers/misc/mei/Kconfig"
 source "drivers/misc/vmw_vmci/Kconfig"
diff --git a/drivers/misc/Makefile b/drivers/misc/Makefile
index d6c917229c45..ed5b5bc71b85 100644
--- a/drivers/misc/Makefile
+++ b/drivers/misc/Makefile
@@ -41,6 +41,7 @@ obj-y				+= eeprom/
 obj-y				+= cb710/
 obj-$(CONFIG_VMWARE_BALLOON)	+= vmw_balloon.o
 obj-$(CONFIG_PCH_PHUB)		+= pch_phub.o
+obj-$(CONFIG_LIVEUPDATE)	+= liveupdate/
 obj-y				+= lis3lv02d/
 obj-$(CONFIG_ALTERA_STAPL)	+=altera-stapl/
 obj-$(CONFIG_INTEL_MEI)		+= mei/
diff --git a/drivers/misc/liveupdate/Kconfig b/drivers/misc/liveupdate/Kconfig
new file mode 100644
index 000000000000..a7424ceeba0b
--- /dev/null
+++ b/drivers/misc/liveupdate/Kconfig
@@ -0,0 +1,27 @@
+# SPDX-License-Identifier: GPL-2.0-only
+#
+# Copyright (c) 2025, Google LLC.
+# Pasha Tatashin <pasha.tatashin@soleen.com>
+#
+# Live Update Orchestrator
+#
+
+config LIVEUPDATE
+	bool "Live Update Orchestrator"
+	depends on KEXEC_HANDOVER
+	help
+	  Enable the Live Update Orchestrator. Live Update is a mechanism,
+	  typically based on kexec, that allows the kernel to be updated
+	  while keeping selected devices operational across the transition.
+	  These devices are intended to be reclaimed by the new kernel and
+	  re-attached to their original workload without requiring a device
+	  reset.
+
+	  This functionality depends on specific support within device drivers
+	  and related kernel subsystems.
+
+	  This feature is primarily used in cloud environments to quickly
+	  update the kernel hypervisor with minimal disruption to the
+	  running virtual machines.
+
+	  If unsure, say N.
diff --git a/drivers/misc/liveupdate/Makefile b/drivers/misc/liveupdate/Makefile
new file mode 100644
index 000000000000..3bfb4b9fed11
--- /dev/null
+++ b/drivers/misc/liveupdate/Makefile
@@ -0,0 +1,2 @@
+# SPDX-License-Identifier: GPL-2.0
+obj-y					+= luo_core.o
diff --git a/drivers/misc/liveupdate/luo_core.c b/drivers/misc/liveupdate/luo_core.c
new file mode 100644
index 000000000000..919c37b0b4d1
--- /dev/null
+++ b/drivers/misc/liveupdate/luo_core.c
@@ -0,0 +1,296 @@
+// SPDX-License-Identifier: GPL-2.0
+
+/*
+ * Copyright (c) 2025, Google LLC.
+ * Pasha Tatashin <pasha.tatashin@soleen.com>
+ */
+
+/**
+ * DOC: Live Update Orchestrator (LUO)
+ *
+ * Live Update is a specialized reboot process where selected devices are
+ * kept operational across a kernel transition. For these devices, DMA activity
+ * may continue during the kernel reboot.
+ *
+ * The primary use case is in cloud environments, allowing hypervisor updates
+ * without disrupting running virtual machines. During a live update, VMs can be
+ * suspended (with their state preserved in memory), while the hypervisor kernel
+ * reboots. Devices attached to these VMs (e.g., NICs, block devices) are kept
+ * operational by the LUO during the hypervisor reboot, allowing the VMs to be
+ * quickly resumed on the new kernel.
+ *
+ * The core of LUO is a state machine that tracks the progress of a live update,
+ * along with a callback API that allows other kernel subsystems to participate
+ * in the process. Example subsystems that can hook into LUO include: kvm,
+ * iommu, interrupts, vfio, participating filesystems, and mm.
+ *
+ * LUO uses KHO to transfer memory state from the current kernel to the next
+ * kernel.
+ *
+ * The LUO state machine ensures that operations are performed in the correct
+ * sequence, provides a mechanism to track and recover from potential
+ * failures, and selects the devices and subsystems that participate in the
+ * live update sequence.
+ */
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include <linux/err.h>
+#include <linux/kobject.h>
+#include <linux/liveupdate.h>
+#include <linux/rwsem.h>
+#include <linux/string.h>
+#include "luo_internal.h"
+
+static DECLARE_RWSEM(luo_state_rwsem);
+
+enum liveupdate_state luo_state;
+
+const char *const luo_state_str[] = {
+	[LIVEUPDATE_STATE_NORMAL]	= "normal",
+	[LIVEUPDATE_STATE_PREPARED]	= "prepared",
+	[LIVEUPDATE_STATE_FROZEN]	= "frozen",
+	[LIVEUPDATE_STATE_UPDATED]	= "updated",
+};
+
+bool luo_enabled;
+
+static int __init early_liveupdate_param(char *buf)
+{
+	return kstrtobool(buf, &luo_enabled);
+}
+early_param("liveupdate", early_liveupdate_param);
+
+/* Return true if the current state is equal to the provided state */
+static inline bool is_current_luo_state(enum liveupdate_state expected_state)
+{
+	return READ_ONCE(luo_state) == expected_state;
+}
+
+static void __luo_set_state(enum liveupdate_state state)
+{
+	WRITE_ONCE(luo_state, state);
+}
+
+static inline void luo_set_state(enum liveupdate_state state)
+{
+	pr_info("Switched from [%s] to [%s] state\n",
+		LUO_STATE_STR, luo_state_str[state]);
+	__luo_set_state(state);
+}
+
+static int luo_do_freeze_calls(void)
+{
+	return 0;
+}
+
+static void luo_do_finish_calls(void)
+{
+}
+
+int luo_prepare(void)
+{
+	return 0;
+}
+
+/**
+ * luo_freeze() - Initiate the final freeze notification phase for live update.
+ *
+ * Attempts to transition the live update orchestrator state from
+ * %LIVEUPDATE_STATE_PREPARED to %LIVEUPDATE_STATE_FROZEN. This function is
+ * typically called just before the actual reboot system call (e.g., kexec)
+ * is invoked, either directly by the orchestration tool or potentially from
+ * within the reboot syscall path itself.
+ *
+ * Based on the outcome of the notification process:
+ * - If luo_do_freeze_calls() returns 0 (all callbacks succeeded), the state
+ * is set to %LIVEUPDATE_STATE_FROZEN using luo_set_state(), indicating
+ * readiness for the imminent kexec.
+ * - If luo_do_freeze_calls() returns a negative error code (a callback
+ * failed), the state is reverted to %LIVEUPDATE_STATE_NORMAL using
+ * luo_set_state() to cancel the live update attempt.
+ *
+ * @return  0: Success. Negative error otherwise. State is reverted to
+ * %LIVEUPDATE_STATE_NORMAL in case of an error during callbacks.
+ */
+int luo_freeze(void)
+{
+	int ret;
+
+	if (down_write_killable(&luo_state_rwsem)) {
+		pr_warn("[freeze] event canceled by user\n");
+		return -EAGAIN;
+	}
+
+	if (!is_current_luo_state(LIVEUPDATE_STATE_PREPARED)) {
+		pr_warn("Can't switch to [%s] from [%s] state\n",
+			luo_state_str[LIVEUPDATE_STATE_FROZEN],
+			LUO_STATE_STR);
+		up_write(&luo_state_rwsem);
+
+		return -EINVAL;
+	}
+
+	ret = luo_do_freeze_calls();
+	if (!ret)
+		luo_set_state(LIVEUPDATE_STATE_FROZEN);
+	else
+		luo_set_state(LIVEUPDATE_STATE_NORMAL);
+
+	up_write(&luo_state_rwsem);
+
+	return ret;
+}
+
+/**
+ * luo_finish - Finalize the live update process in the new kernel.
+ *
+ * This function is called after a successful live update reboot into a new
+ * kernel, once the new kernel is ready to transition to the normal operational
+ * state. It signals the completion of the live update sequence to subsystems.
+ *
+ * It first attempts to acquire the write lock for the orchestrator state.
+ *
+ * Then, it checks if the system is in the ``LIVEUPDATE_STATE_UPDATED`` state.
+ * If not, it logs a warning and returns ``-EINVAL``.
+ *
+ * If the state is correct, it triggers the ``LIVEUPDATE_FINISH`` notifier
+ * chain. Note that the return value of the notifier is intentionally ignored as
+ * finish callbacks must not fail. Finally, the orchestrator state is
+ * transitioned back to ``LIVEUPDATE_STATE_NORMAL``, indicating the end of the
+ * live update process.
+ *
+ * @return 0 on success, ``-EAGAIN`` if the state change was cancelled by the
+ * user while waiting for the lock, or ``-EINVAL`` if the orchestrator is not in
+ * the updated state.
+ */
+int luo_finish(void)
+{
+	if (down_write_killable(&luo_state_rwsem)) {
+		pr_warn("[finish] event canceled by user\n");
+		return -EAGAIN;
+	}
+
+	if (!is_current_luo_state(LIVEUPDATE_STATE_UPDATED)) {
+		pr_warn("Can't switch to [%s] from [%s] state\n",
+			luo_state_str[LIVEUPDATE_STATE_NORMAL],
+			LUO_STATE_STR);
+		up_write(&luo_state_rwsem);
+
+		return -EINVAL;
+	}
+
+	luo_do_finish_calls();
+	luo_set_state(LIVEUPDATE_STATE_NORMAL);
+
+	up_write(&luo_state_rwsem);
+
+	return 0;
+}
+
+int luo_cancel(void)
+{
+	return 0;
+}
+
+void luo_state_read_enter(void)
+{
+	down_read(&luo_state_rwsem);
+}
+
+void luo_state_read_exit(void)
+{
+	up_read(&luo_state_rwsem);
+}
+
+static int __init luo_startup(void)
+{
+	__luo_set_state(LIVEUPDATE_STATE_NORMAL);
+
+	return 0;
+}
+early_initcall(luo_startup);
+
+/* Public Functions */
+
+/**
+ * liveupdate_reboot() - Kernel reboot notifier for live update final
+ * serialization.
+ *
+ * This function is invoked directly from the reboot() syscall pathway if a
+ * reboot is initiated while the live update state is %LIVEUPDATE_STATE_PREPARED
+ * (i.e., if the user did not explicitly trigger the frozen state). It handles
+ * the implicit transition into the final frozen state.
+ *
+ * It triggers the %LIVEUPDATE_FREEZE event callbacks for participating
+ * subsystems. These callbacks must perform final state saving very quickly as
+ * they execute during the blackout period just before kexec.
+ *
+ * If any %LIVEUPDATE_FREEZE callback fails, this function triggers the
+ * %LIVEUPDATE_CANCEL event for all participants to revert their state, aborts
+ * the live update, and returns an error.
+ */
+int liveupdate_reboot(void)
+{
+	if (!is_current_luo_state(LIVEUPDATE_STATE_PREPARED))
+		return 0;
+
+	return luo_freeze();
+}
+EXPORT_SYMBOL_GPL(liveupdate_reboot);
+
+/**
+ * liveupdate_state_updated - Check if the system is in the live update
+ * 'updated' state.
+ *
+ * This function checks if the live update orchestrator is in the
+ * ``LIVEUPDATE_STATE_UPDATED`` state. This state indicates that the system has
+ * successfully rebooted into a new kernel as part of a live update, and the
+ * preserved devices are expected to be in the process of being reclaimed.
+ *
+ * This is typically used by subsystems during early boot of the new kernel
+ * to determine if they need to attempt to restore state from a previous
+ * live update.
+ *
+ * @return true if the system is in the ``LIVEUPDATE_STATE_UPDATED`` state,
+ * false otherwise.
+ */
+bool liveupdate_state_updated(void)
+{
+	return is_current_luo_state(LIVEUPDATE_STATE_UPDATED);
+}
+EXPORT_SYMBOL_GPL(liveupdate_state_updated);
+
+/**
+ * liveupdate_state_normal - Check if the system is in the live update 'normal'
+ * state.
+ *
+ * This function checks if the live update orchestrator is in the
+ * ``LIVEUPDATE_STATE_NORMAL`` state. This state indicates that no live update
+ * is in progress. It represents the default operational state of the system.
+ *
+ * This can be used to gate actions that should only be performed when no
+ * live update activity is occurring.
+ *
+ * @return true if the system is in the ``LIVEUPDATE_STATE_NORMAL`` state,
+ * false otherwise.
+ */
+bool liveupdate_state_normal(void)
+{
+	return is_current_luo_state(LIVEUPDATE_STATE_NORMAL);
+}
+EXPORT_SYMBOL_GPL(liveupdate_state_normal);
+
+/**
+ * liveupdate_enabled - Check if the live update feature is enabled.
+ *
+ * This function returns the state of the live update feature flag, which
+ * can be controlled via the ``liveupdate`` kernel command-line parameter.
+ *
+ * @return true if live update is enabled, false otherwise.
+ */
+bool liveupdate_enabled(void)
+{
+	return luo_enabled;
+}
+EXPORT_SYMBOL_GPL(liveupdate_enabled);
diff --git a/drivers/misc/liveupdate/luo_internal.h b/drivers/misc/liveupdate/luo_internal.h
new file mode 100644
index 000000000000..34e73fb0318c
--- /dev/null
+++ b/drivers/misc/liveupdate/luo_internal.h
@@ -0,0 +1,26 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+/*
+ * Copyright (c) 2025, Google LLC.
+ * Pasha Tatashin <pasha.tatashin@soleen.com>
+ */
+
+#ifndef _LINUX_LUO_INTERNAL_H
+#define _LINUX_LUO_INTERNAL_H
+
+int luo_cancel(void);
+int luo_prepare(void);
+int luo_freeze(void);
+int luo_finish(void);
+
+void luo_state_read_enter(void);
+void luo_state_read_exit(void);
+
+extern const char *const luo_state_str[];
+
+/* Get the current state as a string */
+#define LUO_STATE_STR luo_state_str[READ_ONCE(luo_state)]
+
+extern enum liveupdate_state luo_state;
+
+#endif /* _LINUX_LUO_INTERNAL_H */
diff --git a/include/linux/liveupdate.h b/include/linux/liveupdate.h
new file mode 100644
index 000000000000..c2740da70958
--- /dev/null
+++ b/include/linux/liveupdate.h
@@ -0,0 +1,131 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+/*
+ * Copyright (c) 2025, Google LLC.
+ * Pasha Tatashin <pasha.tatashin@soleen.com>
+ */
+#ifndef _LINUX_LIVEUPDATE_H
+#define _LINUX_LIVEUPDATE_H
+
+#include <linux/bug.h>
+#include <linux/types.h>
+#include <linux/list.h>
+
+/**
+ * enum liveupdate_event - Events that trigger live update callbacks.
+ * @LIVEUPDATE_PREPARE: PREPARE should happen *before* the blackout window.
+ *                      Subsystems should prepare for an upcoming reboot by
+ *                      serializing their state. Note, however, that user
+ *                      applications (e.g. virtual machines) are still
+ *                      running during this phase.
+ * @LIVEUPDATE_FREEZE:  FREEZE sent from the reboot() syscall, when the current
+ *                      kernel is on its way out. This is the final opportunity
+ *                      for subsystems to save any state that must persist
+ *                      across the reboot. Callbacks for this event should be as
+ *                      fast as possible since they are on the critical path of
+ *                      rebooting into the next kernel.
+ * @LIVEUPDATE_FINISH:  FINISH is sent in the newly booted kernel after a
+ *                      successful live update and normally *after* the blackout
+ *                      window. Subsystems should perform any final cleanup
+ *                      during this phase. This phase also provides an
+ *                      opportunity to clean up devices that were preserved but
+ *                      never explicitly reclaimed during the live update
+ *                      process. State restoration should have already occurred
+ *                      before this event. Callbacks for this event must not
+ *                      fail. The completion of this call transitions the
+ *                      machine from ``updated`` to ``normal`` state.
+ * @LIVEUPDATE_CANCEL:  CANCEL the live update and go back to normal state. This
+ *                      event is user initiated, or is done automatically when
+ *                      LIVEUPDATE_PREPARE or LIVEUPDATE_FREEZE stage fails.
+ *                      Subsystems should revert any actions taken during the
+ *                      corresponding prepare event. Callbacks for this event
+ *                      must not fail.
+ *
+ * These events represent the different stages and actions within the live
+ * update process that subsystems (like device drivers and bus drivers)
+ * need to be aware of to correctly serialize and restore their state.
+ *
+ */
+enum liveupdate_event {
+	LIVEUPDATE_PREPARE,
+	LIVEUPDATE_FREEZE,
+	LIVEUPDATE_FINISH,
+	LIVEUPDATE_CANCEL,
+};
+
+/**
+ * enum liveupdate_state - Defines the possible states of the live update
+ * orchestrator.
+ * @LIVEUPDATE_STATE_NORMAL:         Default state, no live update in progress.
+ * @LIVEUPDATE_STATE_PREPARED:       Live update is prepared for reboot; the
+ *                                   LIVEUPDATE_PREPARE callbacks have completed
+ *                                   successfully.
+ *                                   Devices might operate in a limited state:
+ *                                   for example, participating devices might
+ *                                   not be allowed to unbind, and setting up
+ *                                   new DMA mappings might be disabled in
+ *                                   this state.
+ * @LIVEUPDATE_STATE_FROZEN:         The final reboot event
+ *                                   (%LIVEUPDATE_FREEZE) has been sent, and the
+ *                                   system is performing its final state saving
+ *                                   within the "blackout window". User
+ *                                   workloads must be suspended. The actual
+ *                                   reboot (kexec) into the next kernel is
+ *                                   imminent.
+ * @LIVEUPDATE_STATE_UPDATED:        The system has rebooted into the next
+ *                                   kernel via live update. It is now
+ *                                   running the next kernel and awaiting the
+ *                                   finish event.
+ *
+ * These states track the progress and outcome of a live update operation.
+ */
+enum liveupdate_state  {
+	LIVEUPDATE_STATE_NORMAL = 0,
+	LIVEUPDATE_STATE_PREPARED = 1,
+	LIVEUPDATE_STATE_FROZEN = 2,
+	LIVEUPDATE_STATE_UPDATED = 3,
+};
+
+#ifdef CONFIG_LIVEUPDATE
+
+/* Return true if live update orchestrator is enabled */
+bool liveupdate_enabled(void);
+
+/* Called during reboot to tell participants to complete serialization */
+int liveupdate_reboot(void);
+
+/*
+ * Return true if machine is in updated state (i.e. live update boot in
+ * progress)
+ */
+bool liveupdate_state_updated(void);
+
+/*
+ * Return true if machine is in normal state (i.e. no live update in progress).
+ */
+bool liveupdate_state_normal(void);
+
+#else /* CONFIG_LIVEUPDATE */
+
+static inline int liveupdate_reboot(void)
+{
+	return 0;
+}
+
+static inline bool liveupdate_enabled(void)
+{
+	return false;
+}
+
+static inline bool liveupdate_state_updated(void)
+{
+	return false;
+}
+
+static inline bool liveupdate_state_normal(void)
+{
+	return true;
+}
+
+#endif /* CONFIG_LIVEUPDATE */
+#endif /* _LINUX_LIVEUPDATE_H */
-- 
2.49.0.1101.gccaa498523-goog


* [RFC v2 05/16] luo: luo_core: integrate with KHO
  2025-05-15 18:23 [RFC v2 00/16] Live Update Orchestrator Pasha Tatashin
                   ` (3 preceding siblings ...)
  2025-05-15 18:23 ` [RFC v2 04/16] luo: luo_core: Live Update Orchestrator Pasha Tatashin
@ 2025-05-15 18:23 ` Pasha Tatashin
  2025-05-26  7:18   ` Mike Rapoport
  2025-06-04 16:00   ` Pratyush Yadav
  2025-05-15 18:23 ` [RFC v2 06/16] luo: luo_subsystems: add subsystem registration Pasha Tatashin
                   ` (12 subsequent siblings)
  17 siblings, 2 replies; 102+ messages in thread
From: Pasha Tatashin @ 2025-05-15 18:23 UTC (permalink / raw)

Integrate the LUO with the KHO framework to enable passing LUO state
across a kexec reboot.

This patch introduces the following changes:
- During the KHO finalization phase, allocate an FDT blob.
- Populate this FDT with a LUO compatibility string ("luo-v1") and the
  current LUO state (`luo_state`).
- Implement a KHO notifier.

LUO now depends on `CONFIG_KEXEC_HANDOVER`. The core state transition
logic (`luo_do_*_calls`) remains unimplemented in this patch.

Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
---
 drivers/misc/liveupdate/luo_core.c | 222 ++++++++++++++++++++++++++++-
 1 file changed, 219 insertions(+), 3 deletions(-)

diff --git a/drivers/misc/liveupdate/luo_core.c b/drivers/misc/liveupdate/luo_core.c
index 919c37b0b4d1..a76e886bc3b1 100644
--- a/drivers/misc/liveupdate/luo_core.c
+++ b/drivers/misc/liveupdate/luo_core.c
@@ -36,9 +36,12 @@
 #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
 
 #include <linux/err.h>
+#include <linux/kexec_handover.h>
 #include <linux/kobject.h>
+#include <linux/libfdt.h>
 #include <linux/liveupdate.h>
 #include <linux/rwsem.h>
+#include <linux/sizes.h>
 #include <linux/string.h>
 #include "luo_internal.h"
 
@@ -55,6 +58,12 @@ const char *const luo_state_str[] = {
 
 bool luo_enabled;
 
+static void *luo_fdt_out;
+static void *luo_fdt_in;
+#define LUO_FDT_SIZE		SZ_1M
+#define LUO_KHO_ENTRY_NAME	"LUO"
+#define LUO_COMPATIBLE		"luo-v1"
+
 static int __init early_liveupdate_param(char *buf)
 {
 	return kstrtobool(buf, &luo_enabled);
@@ -79,6 +88,60 @@ static inline void luo_set_state(enum liveupdate_state state)
 	__luo_set_state(state);
 }
 
+/* Called during the prepare phase to create the LUO FDT */
+static int luo_fdt_setup(struct kho_serialization *ser)
+{
+	void *fdt_out;
+	int ret;
+
+	fdt_out = (void *)__get_free_pages(GFP_KERNEL | __GFP_ZERO,
+					   get_order(LUO_FDT_SIZE));
+	if (!fdt_out) {
+		pr_err("failed to allocate FDT memory\n");
+		return -ENOMEM;
+	}
+
+	ret = fdt_create_empty_tree(fdt_out, LUO_FDT_SIZE);
+	if (ret)
+		goto exit_free;
+
+	ret = fdt_setprop(fdt_out, 0, "compatible", LUO_COMPATIBLE,
+			  strlen(LUO_COMPATIBLE) + 1);
+	if (ret)
+		goto exit_free;
+
+	ret = kho_preserve_phys(__pa(fdt_out), LUO_FDT_SIZE);
+	if (ret)
+		goto exit_free;
+
+	ret = kho_add_subtree(ser, LUO_KHO_ENTRY_NAME, fdt_out);
+	if (ret)
+		goto exit_unpreserve;
+	luo_fdt_out = fdt_out;
+
+	return 0;
+
+exit_unpreserve:
+	kho_unpreserve_phys(__pa(fdt_out), LUO_FDT_SIZE);
+exit_free:
+	free_pages((unsigned long)fdt_out, get_order(LUO_FDT_SIZE));
+	pr_err("failed to prepare LUO FDT: %d\n", ret);
+
+	return ret;
+}
+
+static void luo_fdt_destroy(void)
+{
+	kho_unpreserve_phys(__pa(luo_fdt_out), LUO_FDT_SIZE);
+	free_pages((unsigned long)luo_fdt_out, get_order(LUO_FDT_SIZE));
+	luo_fdt_out = NULL;
+}
+
+static int luo_do_prepare_calls(void)
+{
+	return 0;
+}
+
 static int luo_do_freeze_calls(void)
 {
 	return 0;
@@ -88,11 +151,111 @@ static void luo_do_finish_calls(void)
 {
 }
 
-int luo_prepare(void)
+static void luo_do_cancel_calls(void)
+{
+}
+
+static int __luo_prepare(struct kho_serialization *ser)
 {
+	int ret;
+
+	if (down_write_killable(&luo_state_rwsem)) {
+		pr_warn("[prepare] event canceled by user\n");
+		return -EAGAIN;
+	}
+
+	if (!is_current_luo_state(LIVEUPDATE_STATE_NORMAL)) {
+		pr_warn("Can't switch to [%s] from [%s] state\n",
+			luo_state_str[LIVEUPDATE_STATE_PREPARED],
+			LUO_STATE_STR);
+		ret = -EINVAL;
+		goto exit_unlock;
+	}
+
+	ret = luo_fdt_setup(ser);
+	if (ret)
+		goto exit_unlock;
+
+	ret = luo_do_prepare_calls();
+	if (ret)
+		goto exit_unlock;
+
+	luo_set_state(LIVEUPDATE_STATE_PREPARED);
+
+exit_unlock:
+	up_write(&luo_state_rwsem);
+
+	return ret;
+}
+
+static int __luo_cancel(void)
+{
+	if (down_write_killable(&luo_state_rwsem)) {
+		pr_warn("[cancel] event canceled by user\n");
+		return -EAGAIN;
+	}
+
+	if (!is_current_luo_state(LIVEUPDATE_STATE_PREPARED) &&
+	    !is_current_luo_state(LIVEUPDATE_STATE_FROZEN)) {
+		pr_warn("Can't switch to [%s] from [%s] state\n",
+			luo_state_str[LIVEUPDATE_STATE_NORMAL],
+			LUO_STATE_STR);
+		up_write(&luo_state_rwsem);
+
+		return -EINVAL;
+	}
+
+	luo_do_cancel_calls();
+	luo_fdt_destroy();
+	luo_set_state(LIVEUPDATE_STATE_NORMAL);
+
+	up_write(&luo_state_rwsem);
+
 	return 0;
 }
 
+static int luo_kho_notifier(struct notifier_block *self,
+			    unsigned long cmd, void *v)
+{
+	int ret;
+
+	switch (cmd) {
+	case KEXEC_KHO_FINALIZE:
+		ret = __luo_prepare((struct kho_serialization *)v);
+		break;
+	case KEXEC_KHO_ABORT:
+		ret = __luo_cancel();
+		break;
+	default:
+		return NOTIFY_BAD;
+	}
+
+	return notifier_from_errno(ret);
+}
+
+static struct notifier_block luo_kho_notifier_nb = {
+	.notifier_call = luo_kho_notifier,
+};
+
+/**
+ * luo_prepare - Initiate the live update preparation phase.
+ *
+ * This function is called to begin the live update process. It attempts to
+ * transition LUO to the ``LIVEUPDATE_STATE_PREPARED`` state.
+ *
+ * If the calls complete successfully, the orchestrator state is set
+ * to ``LIVEUPDATE_STATE_PREPARED``. If any call fails, a
+ * ``LIVEUPDATE_CANCEL`` is sent to roll back any actions.
+ *
+ * @return 0 on success, ``-EAGAIN`` if the state change was cancelled by the
+ * user while waiting for the lock, ``-EINVAL`` if the orchestrator is not in
+ * the normal state, or a negative error code returned by the calls.
+ */
+int luo_prepare(void)
+{
+	return kho_finalize();
+}
+
 /**
  * luo_freeze() - Initiate the final freeze notification phase for live update.
  *
@@ -188,9 +351,23 @@ int luo_finish(void)
 	return 0;
 }
 
+/**
+ * luo_cancel - Cancel the ongoing live update from prepared or frozen states.
+ *
+ * This function is called to abort a live update that is currently in the
+ * ``LIVEUPDATE_STATE_PREPARED`` or ``LIVEUPDATE_STATE_FROZEN`` state.
+ *
+ * If the state is correct, it triggers the ``LIVEUPDATE_CANCEL`` notifier chain
+ * to allow subsystems to undo any actions performed during the prepare or
+ * freeze events. Finally, the orchestrator state is transitioned back to
+ * ``LIVEUPDATE_STATE_NORMAL``.
+ *
+ * @return 0 on success, ``-EAGAIN`` if interrupted while waiting for the lock,
+ * or ``-EINVAL`` if the orchestrator is not in the prepared or frozen state.
+ */
 int luo_cancel(void)
 {
-	return 0;
+	return kho_abort();
 }
 
 void luo_state_read_enter(void)
@@ -205,7 +382,46 @@ void luo_state_read_exit(void)
 
 static int __init luo_startup(void)
 {
-	__luo_set_state(LIVEUPDATE_STATE_NORMAL);
+	phys_addr_t fdt_phys;
+	int ret;
+
+	if (!kho_is_enabled()) {
+		if (luo_enabled)
+			pr_warn("Disabling liveupdate because KHO is disabled\n");
+		luo_enabled = false;
+		return 0;
+	}
+
+	ret = register_kho_notifier(&luo_kho_notifier_nb);
+	if (ret) {
+		luo_enabled = false;
+		pr_warn("Failed to register with KHO [%d]\n", ret);
+	}
+
+	/*
+	 * Retrieve the LUO subtree and verify its format. Panic on any error,
+	 * since the machine's devices and memory are in an unpredictable
+	 * state.
+	 */
+	ret = kho_retrieve_subtree(LUO_KHO_ENTRY_NAME, &fdt_phys);
+	if (ret) {
+		if (ret != -ENOENT) {
+			panic("failed to retrieve FDT '%s' from KHO: %d\n",
+			      LUO_KHO_ENTRY_NAME, ret);
+		}
+		__luo_set_state(LIVEUPDATE_STATE_NORMAL);
+
+		return 0;
+	}
+
+	luo_fdt_in = __va(fdt_phys);
+	ret = fdt_node_check_compatible(luo_fdt_in, 0, LUO_COMPATIBLE);
+	if (ret) {
+		panic("FDT '%s' is incompatible with '%s' [%d]\n",
+		      LUO_KHO_ENTRY_NAME, LUO_COMPATIBLE, ret);
+	}
+
+	__luo_set_state(LIVEUPDATE_STATE_UPDATED);
 
 	return 0;
 }
-- 
2.49.0.1101.gccaa498523-goog



* [RFC v2 06/16] luo: luo_subsystems: add subsystem registration
  2025-05-15 18:23 [RFC v2 00/16] Live Update Orchestrator Pasha Tatashin
                   ` (4 preceding siblings ...)
  2025-05-15 18:23 ` [RFC v2 05/16] luo: luo_core: integrate with KHO Pasha Tatashin
@ 2025-05-15 18:23 ` Pasha Tatashin
  2025-05-26  7:31   ` Mike Rapoport
                     ` (2 more replies)
  2025-05-15 18:23 ` [RFC v2 07/16] luo: luo_subsystems: implement subsystem callbacks Pasha Tatashin
                   ` (11 subsequent siblings)
  17 siblings, 3 replies; 102+ messages in thread
From: Pasha Tatashin @ 2025-05-15 18:23 UTC (permalink / raw)
  To: pratyush, jasonmiu, graf, changyuanl, pasha.tatashin, rppt,
	dmatlack, rientjes, corbet, rdunlap, ilpo.jarvinen, kanie, ojeda,
	aliceryhl, masahiroy, akpm, tj, yoann.congal, mmaurer,
	roman.gushchin, chenridong, axboe, mark.rutland, jannh,
	vincent.guittot, hannes, dan.j.williams, david, joel.granados,
	rostedt, anna.schumaker, song, zhangguopeng, linux, linux-kernel,
	linux-doc, linux-mm, gregkh, tglx, mingo, bp, dave.hansen, x86,
	hpa, rafael, dakr, bartosz.golaszewski, cw00.choi, myungjoo.ham,
	yesanishhere, Jonathan.Cameron, quic_zijuhu, aleksander.lobakin,
	ira.weiny, andriy.shevchenko, leon, lukas, bhelgaas, wagi,
	djeffery, stuart.w.hayes, ptyadav

Introduce the framework for kernel subsystems (e.g., KVM, IOMMU, device
drivers) to register with LUO and participate in the live update process
via callbacks.

Subsystem Registration:
- Defines struct liveupdate_subsystem in linux/liveupdate.h,
  which subsystems use to provide their name and optional callbacks
  (prepare, freeze, cancel, finish). The callbacks accept
  a u64 *data intended for passing state/handles.
- Exports liveupdate_register_subsystem() and
  liveupdate_unregister_subsystem() API functions.
- Adds drivers/misc/liveupdate/luo_subsystems.c to manage a list
  of registered subsystems. Registration/unregistration is restricted
  to specific LUO states (NORMAL/UPDATED).

Callback Framework:
- The main luo_core.c state transition functions now delegate to new
  luo_do_subsystems_*_calls() functions defined in luo_subsystems.c.
- These new functions are intended to iterate through the registered
  subsystems and invoke their corresponding callbacks. In this patch
  they are stubs.

FDT Integration:
- Adds a /subsystems subnode within the main LUO FDT created in
  luo_core.c. This node has its own compatibility string
  (subsystems-v1).
- luo_subsystems_fdt_setup() populates this node by adding a
  property for each registered subsystem, using the subsystem's
  name. Currently, these properties are initialized with a
  placeholder u64 value (0).
- luo_subsystems_startup() is called from luo_core.c on boot to
  find and validate the /subsystems node in the FDT received via
  KHO. It panics if the node is missing or incompatible.
- Adds a stub API function liveupdate_get_subsystem_data() intended
  for subsystems to retrieve their persisted u64 data from the FDT
  in the new kernel.

Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
---
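As an illustration of the registration rules described above (reject a
handler that is already registered, reject a duplicate name, report
-ENOENT when unregistering an unknown handler), here is a small
userspace sketch. The reg_* names and the singly linked list are
assumptions made for the example; the patch itself uses struct
list_head, a mutex, and the LUO state check, none of which are modelled.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical userspace stand-in for struct liveupdate_subsystem. */
struct reg_subsystem {
	const char *name;
	struct reg_subsystem *next;
};

static struct reg_subsystem *reg_list;

/* Reject a handler that is already listed or whose name is taken. */
static int reg_register(struct reg_subsystem *h)
{
	struct reg_subsystem **link = &reg_list;

	assert(h && h->name);
	for (; *link; link = &(*link)->next) {
		if (*link == h || !strcmp((*link)->name, h->name))
			return -EEXIST;
	}

	h->next = NULL;
	*link = h;	/* append at the tail, like list_add_tail() */
	return 0;
}

/* Remove a handler; -ENOENT if it was never registered. */
static int reg_unregister(struct reg_subsystem *h)
{
	struct reg_subsystem **link = &reg_list;

	for (; *link; link = &(*link)->next) {
		if (*link == h) {
			*link = h->next;
			h->next = NULL;
			return 0;
		}
	}
	return -ENOENT;
}
```

Registering two distinct structures that both use the name "kvm" fails
on the second call with -EEXIST, which matches the uniqueness
requirement on @h->name.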
 drivers/misc/liveupdate/Makefile         |   1 +
 drivers/misc/liveupdate/luo_core.c       |  19 +-
 drivers/misc/liveupdate/luo_internal.h   |   7 +
 drivers/misc/liveupdate/luo_subsystems.c | 284 +++++++++++++++++++++++
 include/linux/liveupdate.h               |  53 +++++
 5 files changed, 362 insertions(+), 2 deletions(-)
 create mode 100644 drivers/misc/liveupdate/luo_subsystems.c

diff --git a/drivers/misc/liveupdate/Makefile b/drivers/misc/liveupdate/Makefile
index 3bfb4b9fed11..df1c9709ba4f 100644
--- a/drivers/misc/liveupdate/Makefile
+++ b/drivers/misc/liveupdate/Makefile
@@ -1,2 +1,3 @@
 # SPDX-License-Identifier: GPL-2.0
 obj-y					+= luo_core.o
+obj-y					+= luo_subsystems.o
diff --git a/drivers/misc/liveupdate/luo_core.c b/drivers/misc/liveupdate/luo_core.c
index a76e886bc3b1..417e7f6bf36c 100644
--- a/drivers/misc/liveupdate/luo_core.c
+++ b/drivers/misc/liveupdate/luo_core.c
@@ -110,6 +110,10 @@ static int luo_fdt_setup(struct kho_serialization *ser)
 	if (ret)
 		goto exit_free;
 
+	ret = luo_subsystems_fdt_setup(fdt_out);
+	if (ret)
+		goto exit_free;
+
 	ret = kho_preserve_phys(__pa(fdt_out), LUO_FDT_SIZE);
 	if (ret)
 		goto exit_free;
@@ -139,20 +143,30 @@ static void luo_fdt_destroy(void)
 
 static int luo_do_prepare_calls(void)
 {
-	return 0;
+	int ret;
+
+	ret = luo_do_subsystems_prepare_calls();
+
+	return ret;
 }
 
 static int luo_do_freeze_calls(void)
 {
-	return 0;
+	int ret;
+
+	ret = luo_do_subsystems_freeze_calls();
+
+	return ret;
 }
 
 static void luo_do_finish_calls(void)
 {
+	luo_do_subsystems_finish_calls();
 }
 
 static void luo_do_cancel_calls(void)
 {
+	luo_do_subsystems_cancel_calls();
 }
 
 static int __luo_prepare(struct kho_serialization *ser)
@@ -422,6 +436,7 @@ static int __init luo_startup(void)
 	}
 
 	__luo_set_state(LIVEUPDATE_STATE_UPDATED);
+	luo_subsystems_startup(luo_fdt_in);
 
 	return 0;
 }
diff --git a/drivers/misc/liveupdate/luo_internal.h b/drivers/misc/liveupdate/luo_internal.h
index 34e73fb0318c..63a8b93254a6 100644
--- a/drivers/misc/liveupdate/luo_internal.h
+++ b/drivers/misc/liveupdate/luo_internal.h
@@ -16,6 +16,13 @@ int luo_finish(void);
 void luo_state_read_enter(void);
 void luo_state_read_exit(void);
 
+void luo_subsystems_startup(void *fdt);
+int luo_subsystems_fdt_setup(void *fdt);
+int luo_do_subsystems_prepare_calls(void);
+int luo_do_subsystems_freeze_calls(void);
+void luo_do_subsystems_finish_calls(void);
+void luo_do_subsystems_cancel_calls(void);
+
 extern const char *const luo_state_str[];
 
 /* Get the current state as a string */
diff --git a/drivers/misc/liveupdate/luo_subsystems.c b/drivers/misc/liveupdate/luo_subsystems.c
new file mode 100644
index 000000000000..436929a17de0
--- /dev/null
+++ b/drivers/misc/liveupdate/luo_subsystems.c
@@ -0,0 +1,284 @@
+// SPDX-License-Identifier: GPL-2.0
+
+/*
+ * Copyright (c) 2025, Google LLC.
+ * Pasha Tatashin <pasha.tatashin@soleen.com>
+ */
+
+/**
+ * DOC: LUO Subsystems support
+ *
+ * Various kernel subsystems register with the Live Update Orchestrator to
+ * participate in the live update process. These subsystems are notified at
+ * different stages of the live update sequence, allowing them to serialize
+ * device state before the reboot and restore it afterwards. Examples include
+ * the device layer, interrupt controllers, KVM, IOMMU, and specific device
+ * drivers.
+ */
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include <linux/err.h>
+#include <linux/libfdt.h>
+#include <linux/liveupdate.h>
+#include <linux/mutex.h>
+#include <linux/string.h>
+#include "luo_internal.h"
+
+#define LUO_SUBSYSTEMS_NODE_NAME	"subsystems"
+#define LUO_SUBSYSTEMS_COMPATIBLE	"subsystems-v1"
+
+static DEFINE_MUTEX(luo_subsystem_list_mutex);
+static LIST_HEAD(luo_subsystems_list);
+static void *luo_fdt_out;
+static void *luo_fdt_in;
+
+/**
+ * luo_subsystems_fdt_setup - Adds and populates the 'subsystems' node in the
+ * FDT.
+ * @fdt: Pointer to the LUO FDT blob.
+ *
+ * Add subsystems node and each subsystem to the LUO FDT blob.
+ *
+ * Returns: 0 on success, negative errno on failure.
+ */
+int luo_subsystems_fdt_setup(void *fdt)
+{
+	struct liveupdate_subsystem *subsystem;
+	const u64 zero_data = 0;
+	int ret, node_offset;
+
+	ret = fdt_add_subnode(fdt, 0, LUO_SUBSYSTEMS_NODE_NAME);
+	if (ret < 0)
+		goto exit_error;
+
+	node_offset = ret;
+	ret = fdt_setprop_string(fdt, node_offset, "compatible",
+				 LUO_SUBSYSTEMS_COMPATIBLE);
+	if (ret < 0)
+		goto exit_error;
+
+	list_for_each_entry(subsystem, &luo_subsystems_list, list) {
+		ret = fdt_setprop(fdt, node_offset, subsystem->name,
+				  &zero_data, sizeof(zero_data));
+		if (ret < 0)
+			goto exit_error;
+	}
+
+	luo_fdt_out = fdt;
+	return 0;
+exit_error:
+	pr_err("Failed to set up 'subsystems' node in FDT: %s\n",
+	       fdt_strerror(ret));
+	return -ENOSPC;
+}
+
+/**
+ * luo_subsystems_startup - Validates the LUO subsystems FDT node at startup.
+ * @fdt: Pointer to the LUO FDT blob passed from the previous kernel.
+ *
+ * This __init function checks the existence and validity of the '/subsystems'
+ * node in the FDT. This node is considered mandatory. It calls panic() if
+ * the node is missing, inaccessible, or invalid (e.g., missing compatible,
+ * wrong compatible string), indicating a critical configuration error for LUO.
+ */
+void __init luo_subsystems_startup(void *fdt)
+{
+	int ret, node_offset;
+
+	node_offset = fdt_subnode_offset(fdt, 0, LUO_SUBSYSTEMS_NODE_NAME);
+	if (node_offset < 0)
+		panic("Failed to find /subsystems node\n");
+
+	ret = fdt_node_check_compatible(fdt, node_offset,
+					LUO_SUBSYSTEMS_COMPATIBLE);
+	if (ret) {
+		panic("FDT '%s' is incompatible with '%s' [%d]\n",
+		      LUO_SUBSYSTEMS_NODE_NAME, LUO_SUBSYSTEMS_COMPATIBLE, ret);
+	}
+	luo_fdt_in = fdt;
+}
+
+/**
+ * luo_do_subsystems_prepare_calls - Calls prepare callbacks and updates FDT
+ * if all prepares succeed. Handles cancellation on failure.
+ *
+ * Phase 1: Calls 'prepare' for all subsystems and stores results temporarily.
+ * If any 'prepare' fails, calls 'cancel' on previously prepared subsystems
+ * and returns the error.
+ * Phase 2: If all 'prepare' calls succeeded, writes the stored data to the FDT.
+ * If any FDT write fails, calls 'cancel' on *all* prepared subsystems and
+ * returns the FDT error.
+ *
+ * Returns: 0 on success. Negative errno on failure.
+ */
+int luo_do_subsystems_prepare_calls(void)
+{
+	return 0;
+}
+
+/**
+ * luo_do_subsystems_freeze_calls - Calls freeze callbacks and updates FDT
+ * if all freezes succeed. Handles cancellation on failure.
+ *
+ * Phase 1: Calls 'freeze' for all subsystems and stores results temporarily.
+ * If any 'freeze' fails, calls 'cancel' on previously called subsystems
+ * and returns the error.
+ * Phase 2: If all 'freeze' calls succeeded, writes the stored data to the FDT.
+ * If any FDT write fails, calls 'cancel' on *all* subsystems and
+ * returns the FDT error.
+ *
+ * Returns: 0 on success. Negative errno on failure.
+ */
+int luo_do_subsystems_freeze_calls(void)
+{
+	return 0;
+}
+
+/**
+ * luo_do_subsystems_finish_calls - Calls finish callbacks for all subsystems.
+ *
+ * This function is called at the end of live update cycle to do the final
+ * clean-up or housekeeping of the post-live update states.
+ */
+void luo_do_subsystems_finish_calls(void)
+{
+}
+
+/**
+ * luo_do_subsystems_cancel_calls - Calls cancel callbacks for all subsystems.
+ *
+ * This function is typically called when the live update process needs to be
+ * aborted externally, for example, after the prepare phase may have run but
+ * before actual reboot. It iterates through all registered subsystems and calls
+ * the 'cancel' callback for those that implement it and likely completed
+ * prepare.
+ */
+void luo_do_subsystems_cancel_calls(void)
+{
+}
+
+/**
+ * liveupdate_register_subsystem - Register a kernel subsystem handler with LUO
+ * @h: Pointer to the liveupdate_subsystem structure allocated and populated
+ * by the calling subsystem.
+ *
+ * Registers a subsystem handler that provides callbacks for different events
+ * of the live update cycle. Registration is typically done during the
+ * subsystem's module init or core initialization.
+ *
+ * Can only be called when LUO is in the NORMAL or UPDATED states.
+ * The provided name (@h->name) must be unique among registered subsystems.
+ *
+ * Return: 0 on success, negative error code otherwise.
+ */
+int liveupdate_register_subsystem(struct liveupdate_subsystem *h)
+{
+	struct liveupdate_subsystem *iter;
+	int ret = 0;
+
+	luo_state_read_enter();
+	if (!liveupdate_state_normal() && !liveupdate_state_updated()) {
+		luo_state_read_exit();
+		return -EBUSY;
+	}
+
+	mutex_lock(&luo_subsystem_list_mutex);
+	list_for_each_entry(iter, &luo_subsystems_list, list) {
+		if (iter == h) {
+			pr_warn("Subsystem '%s' (%p) already registered.\n",
+				h->name, h);
+			ret = -EEXIST;
+			goto out_unlock;
+		}
+
+		if (!strcmp(iter->name, h->name)) {
+			pr_err("Subsystem with name '%s' already registered.\n",
+			       h->name);
+			ret = -EEXIST;
+			goto out_unlock;
+		}
+	}
+
+	INIT_LIST_HEAD(&h->list);
+	list_add_tail(&h->list, &luo_subsystems_list);
+
+out_unlock:
+	mutex_unlock(&luo_subsystem_list_mutex);
+	luo_state_read_exit();
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(liveupdate_register_subsystem);
+
+/**
+ * liveupdate_unregister_subsystem - Unregister a kernel subsystem handler from
+ * LUO
+ * @h: Pointer to the same liveupdate_subsystem structure that was used during
+ * registration.
+ *
+ * Unregisters a previously registered subsystem handler. Typically called
+ * during module exit or subsystem teardown. LUO removes the structure from its
+ * internal list; the caller is responsible for any necessary memory cleanup
+ * of the structure itself.
+ *
+ * Return: 0 on success, negative error code otherwise.
+ * -EINVAL if h is NULL.
+ * -ENOENT if the specified handler @h is not found in the registration list.
+ * -EBUSY if LUO is not in the NORMAL or UPDATED state.
+ */
+int liveupdate_unregister_subsystem(struct liveupdate_subsystem *h)
+{
+	struct liveupdate_subsystem *iter;
+	bool found = false;
+	int ret = 0;
+
+	luo_state_read_enter();
+	if (!liveupdate_state_normal() && !liveupdate_state_updated()) {
+		luo_state_read_exit();
+		return -EBUSY;
+	}
+
+	mutex_lock(&luo_subsystem_list_mutex);
+	list_for_each_entry(iter, &luo_subsystems_list, list) {
+		if (iter == h) {
+			found = true;
+			break;
+		}
+	}
+
+	if (found) {
+		list_del_init(&h->list);
+	} else {
+		pr_warn("Subsystem handler '%s' not found for unregistration.\n",
+			h->name);
+		ret = -ENOENT;
+	}
+
+	mutex_unlock(&luo_subsystem_list_mutex);
+	luo_state_read_exit();
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(liveupdate_unregister_subsystem);
+
+/**
+ * liveupdate_get_subsystem_data - Retrieve raw private data for a subsystem
+ * from FDT.
+ * @h:      Pointer to the liveupdate_subsystem structure representing the
+ * subsystem instance. The 'name' field is used to find the property.
+ * @data:   Output pointer where the subsystem's raw private u64 data will be
+ * stored via memcpy.
+ *
+ * Reads the 8-byte data property associated with the subsystem @h->name
+ * directly from the '/subsystems' node within the globally accessible
+ * 'luo_fdt_in' blob. Returns appropriate error codes if inputs are invalid, or
+ * nodes/properties are missing or invalid.
+ *
+ * Return:  0 on success. -ENOENT on error.
+ */
+int liveupdate_get_subsystem_data(struct liveupdate_subsystem *h, u64 *data)
+{
+	return 0;
+}
+EXPORT_SYMBOL_GPL(liveupdate_get_subsystem_data);
diff --git a/include/linux/liveupdate.h b/include/linux/liveupdate.h
index c2740da70958..7a130680b5f2 100644
--- a/include/linux/liveupdate.h
+++ b/include/linux/liveupdate.h
@@ -86,6 +86,39 @@ enum liveupdate_state  {
 	LIVEUPDATE_STATE_UPDATED = 3,
 };
 
+/**
+ * struct liveupdate_subsystem - Represents a subsystem participating in LUO
+ * @prepare:      Optional. Called during LUO prepare phase. Should perform
+ *                preparatory actions and can store a u64 handle/state
+ *                via the 'data' pointer for use in later callbacks.
+ *                Return 0 on success, negative error code on failure.
+ * @freeze:       Optional. Called during LUO freeze event (before actual jump
+ *                to new kernel). Should perform final state saving actions and
+ *                can update the u64 handle/state via the 'data' pointer. Return
+ *                0 on success, negative error code on failure.
+ * @cancel:       Optional. Called if the live update process is canceled after
+ *                prepare (or freeze) was called. Receives the u64 data
+ *                set by prepare/freeze. Used for cleanup.
+ * @finish:       Optional. Called after the live update is finished in the new
+ *                kernel.
+ *                Receives the u64 data set by prepare/freeze. Used for cleanup.
+ * @name:         Mandatory. Unique name identifying the subsystem.
+ * @arg:          Opaque argument passed to the callback functions.
+ * @list:         List head used internally by LUO. Should not be modified by
+ *                caller after registration.
+ * @private_data: For LUO internal use, cached value of data field.
+ */
+struct liveupdate_subsystem {
+	int (*prepare)(void *arg, u64 *data);
+	int (*freeze)(void *arg, u64 *data);
+	void (*cancel)(void *arg, u64 data);
+	void (*finish)(void *arg, u64 data);
+	const char *name;
+	void *arg;
+	struct list_head list;
+	u64 private_data;
+};
+
 #ifdef CONFIG_LIVEUPDATE
 
 /* Return true if live update orchestrator is enabled */
@@ -105,6 +138,10 @@ bool liveupdate_state_updated(void);
  */
 bool liveupdate_state_normal(void);
 
+int liveupdate_register_subsystem(struct liveupdate_subsystem *h);
+int liveupdate_unregister_subsystem(struct liveupdate_subsystem *h);
+int liveupdate_get_subsystem_data(struct liveupdate_subsystem *h, u64 *data);
+
 #else /* CONFIG_LIVEUPDATE */
 
 static inline int liveupdate_reboot(void)
@@ -127,5 +164,21 @@ static inline bool liveupdate_state_normal(void)
 	return true;
 }
 
+static inline int liveupdate_register_subsystem(struct liveupdate_subsystem *h)
+{
+	return 0;
+}
+
+static inline int liveupdate_unregister_subsystem(struct liveupdate_subsystem *h)
+{
+	return 0;
+}
+
+static inline int liveupdate_get_subsystem_data(struct liveupdate_subsystem *h,
+						u64 *data)
+{
+	return -ENODATA;
+}
+
 #endif /* CONFIG_LIVEUPDATE */
 #endif /* _LINUX_LIVEUPDATE_H */
-- 
2.49.0.1101.gccaa498523-goog



* [RFC v2 07/16] luo: luo_subsystems: implement subsystem callbacks
  2025-05-15 18:23 [RFC v2 00/16] Live Update Orchestrator Pasha Tatashin
                   ` (5 preceding siblings ...)
  2025-05-15 18:23 ` [RFC v2 06/16] luo: luo_subsystems: add subsystem registration Pasha Tatashin
@ 2025-05-15 18:23 ` Pasha Tatashin
  2025-05-15 18:23 ` [RFC v2 08/16] luo: luo_files: add infrastructure for FDs Pasha Tatashin
                   ` (10 subsequent siblings)
  17 siblings, 0 replies; 102+ messages in thread
From: Pasha Tatashin @ 2025-05-15 18:23 UTC (permalink / raw)
  To: pratyush, jasonmiu, graf, changyuanl, pasha.tatashin, rppt,
	dmatlack, rientjes, corbet, rdunlap, ilpo.jarvinen, kanie, ojeda,
	aliceryhl, masahiroy, akpm, tj, yoann.congal, mmaurer,
	roman.gushchin, chenridong, axboe, mark.rutland, jannh,
	vincent.guittot, hannes, dan.j.williams, david, joel.granados,
	rostedt, anna.schumaker, song, zhangguopeng, linux, linux-kernel,
	linux-doc, linux-mm, gregkh, tglx, mingo, bp, dave.hansen, x86,
	hpa, rafael, dakr, bartosz.golaszewski, cw00.choi, myungjoo.ham,
	yesanishhere, Jonathan.Cameron, quic_zijuhu, aleksander.lobakin,
	ira.weiny, andriy.shevchenko, leon, lukas, bhelgaas, wagi,
	djeffery, stuart.w.hayes, ptyadav

Implement the core logic within luo_subsystems.c to handle the
invocation of registered subsystem callbacks and manage the persistence
of their state via the LUO FDT. This replaces the stub implementations
from the previous patch.

This completes the core mechanism enabling subsystems to actively
participate in the LUO state machine, execute phase-specific logic, and
persist/restore a u64 state across the live update transition
using the FDT.

Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
---
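The prepare path with rollback can be sketched in userspace as follows.
This is a simplified model under stated assumptions: the cb_* names, the
array in place of the kernel list, and the sample callbacks are
hypothetical, and the FDT commit step that follows a successful prepare
pass is omitted.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical userspace stand-in for struct liveupdate_subsystem. */
struct cb_subsystem {
	int (*prepare)(void *arg, uint64_t *data);
	void (*cancel)(void *arg, uint64_t data);
	void *arg;
	uint64_t private_data;
};

/* Cancel entries [0, boundary), like __luo_do_subsystems_cancel_calls(). */
static void cb_cancel_upto(struct cb_subsystem *subs, size_t boundary)
{
	for (size_t i = 0; i < boundary; i++) {
		if (subs[i].cancel)
			subs[i].cancel(subs[i].arg, subs[i].private_data);
		subs[i].private_data = 0;
	}
}

/* Call every prepare(); on the first failure roll back the earlier ones. */
static int cb_prepare_all(struct cb_subsystem *subs, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		int ret;

		if (!subs[i].prepare)
			continue;

		ret = subs[i].prepare(subs[i].arg, &subs[i].private_data);
		if (ret < 0) {
			cb_cancel_upto(subs, i);
			return ret;
		}
	}
	return 0;
}

/* Sample callbacks used only to exercise the model. */
static int prepare_ok(void *arg, uint64_t *data)
{
	(void)arg;
	*data = 0x1234;
	return 0;
}

static int prepare_fail(void *arg, uint64_t *data)
{
	(void)arg;
	(void)data;
	return -ENOMEM;
}

static void count_cancel(void *arg, uint64_t data)
{
	assert(data == 0x1234);	/* cancel sees what prepare stored */
	(*(int *)arg)++;
}
```

With two succeeding entries followed by a failing one, cb_prepare_all()
returns the failing entry's error and count_cancel() runs once for each
of the two earlier entries. The failing entry itself is not cancelled,
which matches the boundary check in the patch.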
 drivers/misc/liveupdate/luo_subsystems.c | 133 ++++++++++++++++++++++-
 1 file changed, 131 insertions(+), 2 deletions(-)

diff --git a/drivers/misc/liveupdate/luo_subsystems.c b/drivers/misc/liveupdate/luo_subsystems.c
index 436929a17de0..71f5f0468b0d 100644
--- a/drivers/misc/liveupdate/luo_subsystems.c
+++ b/drivers/misc/liveupdate/luo_subsystems.c
@@ -99,6 +99,66 @@ void __init luo_subsystems_startup(void *fdt)
 	luo_fdt_in = fdt;
 }
 
+static void __luo_do_subsystems_cancel_calls(struct liveupdate_subsystem *boundary_subsystem)
+{
+	struct liveupdate_subsystem *subsystem;
+
+	list_for_each_entry(subsystem, &luo_subsystems_list, list) {
+		if (subsystem == boundary_subsystem)
+			break;
+
+		if (subsystem->cancel) {
+			subsystem->cancel(subsystem->arg,
+					  subsystem->private_data);
+		}
+		subsystem->private_data = 0;
+	}
+}
+
+static void luo_subsystems_retrieve_data_from_fdt(void)
+{
+	struct liveupdate_subsystem *subsystem;
+	int node_offset, prop_len;
+	const void *prop;
+
+	if (!luo_fdt_in)
+		return;
+
+	node_offset = fdt_subnode_offset(luo_fdt_in, 0,
+					 LUO_SUBSYSTEMS_NODE_NAME);
+	list_for_each_entry(subsystem, &luo_subsystems_list, list) {
+		prop = fdt_getprop(luo_fdt_in, node_offset,
+				   subsystem->name, &prop_len);
+
+		if (!prop || prop_len != sizeof(u64)) {
+			panic("In FDT node '/%s' can't find property '%s': %s\n",
+			      LUO_SUBSYSTEMS_NODE_NAME, subsystem->name,
+			      fdt_strerror(prop_len));
+		}
+		memcpy(&subsystem->private_data, prop, sizeof(u64));
+	}
+}
+
+static int luo_subsystems_commit_data_to_fdt(void)
+{
+	struct liveupdate_subsystem *subsystem;
+	int ret, node_offset;
+
+	node_offset = fdt_subnode_offset(luo_fdt_out, 0,
+					 LUO_SUBSYSTEMS_NODE_NAME);
+	list_for_each_entry(subsystem, &luo_subsystems_list, list) {
+		ret = fdt_setprop(luo_fdt_out, node_offset, subsystem->name,
+				  &subsystem->private_data, sizeof(u64));
+		if (ret < 0) {
+			pr_err("Failed to set FDT property for subsystem '%s' %s\n",
+			       subsystem->name, fdt_strerror(ret));
+			return -ENOENT;
+		}
+	}
+
+	return 0;
+}
+
 /**
  * luo_do_subsystems_prepare_calls - Calls prepare callbacks and updates FDT
  * if all prepares succeed. Handles cancellation on failure.
@@ -114,7 +174,29 @@ void __init luo_subsystems_startup(void *fdt)
  */
 int luo_do_subsystems_prepare_calls(void)
 {
-	return 0;
+	struct liveupdate_subsystem *subsystem;
+	int ret;
+
+	list_for_each_entry(subsystem, &luo_subsystems_list, list) {
+		if (!subsystem->prepare)
+			continue;
+
+		ret = subsystem->prepare(subsystem->arg,
+					 &subsystem->private_data);
+		if (ret < 0) {
+			pr_err("Subsystem '%s' prepare callback failed [%d]\n",
+			       subsystem->name, ret);
+			__luo_do_subsystems_cancel_calls(subsystem);
+
+			return ret;
+		}
+	}
+
+	ret = luo_subsystems_commit_data_to_fdt();
+	if (ret)
+		__luo_do_subsystems_cancel_calls(NULL);
+
+	return ret;
 }
 
 /**
@@ -132,7 +214,29 @@ int luo_do_subsystems_prepare_calls(void)
  */
 int luo_do_subsystems_freeze_calls(void)
 {
-	return 0;
+	struct liveupdate_subsystem *subsystem;
+	int ret;
+
+	list_for_each_entry(subsystem, &luo_subsystems_list, list) {
+		if (!subsystem->freeze)
+			continue;
+
+		ret = subsystem->freeze(subsystem->arg,
+					&subsystem->private_data);
+		if (ret < 0) {
+			pr_err("Subsystem '%s' freeze callback failed [%d]\n",
+			       subsystem->name, ret);
+			__luo_do_subsystems_cancel_calls(subsystem);
+
+			return ret;
+		}
+	}
+
+	ret = luo_subsystems_commit_data_to_fdt();
+	if (ret)
+		__luo_do_subsystems_cancel_calls(NULL);
+
+	return ret;
 }
 
 /**
@@ -143,6 +247,16 @@ int luo_do_subsystems_freeze_calls(void)
  */
 void luo_do_subsystems_finish_calls(void)
 {
+	struct liveupdate_subsystem *subsystem;
+
+	luo_subsystems_retrieve_data_from_fdt();
+
+	list_for_each_entry(subsystem, &luo_subsystems_list, list) {
+		if (subsystem->finish) {
+			subsystem->finish(subsystem->arg,
+					  subsystem->private_data);
+		}
+	}
 }
 
 /**
@@ -156,6 +270,8 @@ void luo_do_subsystems_finish_calls(void)
  */
 void luo_do_subsystems_cancel_calls(void)
 {
+	__luo_do_subsystems_cancel_calls(NULL);
+	luo_subsystems_commit_data_to_fdt();
 }
 
 /**
@@ -279,6 +395,19 @@ EXPORT_SYMBOL_GPL(liveupdate_unregister_subsystem);
  */
 int liveupdate_get_subsystem_data(struct liveupdate_subsystem *h, u64 *data)
 {
+	int node_offset, prop_len;
+	const void *prop;
+
+	if (!luo_fdt_in || !liveupdate_state_updated())
+		return -ENOENT;
+
+	node_offset = fdt_subnode_offset(luo_fdt_in, 0,
+					 LUO_SUBSYSTEMS_NODE_NAME);
+	prop = fdt_getprop(luo_fdt_in, node_offset, h->name, &prop_len);
+	if (!prop || prop_len != sizeof(u64))
+		return -ENOENT;
+	memcpy(data, prop, sizeof(u64));
+
 	return 0;
 }
 EXPORT_SYMBOL_GPL(liveupdate_get_subsystem_data);
-- 
2.49.0.1101.gccaa498523-goog



* [RFC v2 08/16] luo: luo_files: add infrastructure for FDs
  2025-05-15 18:23 [RFC v2 00/16] Live Update Orchestrator Pasha Tatashin
                   ` (6 preceding siblings ...)
  2025-05-15 18:23 ` [RFC v2 07/16] luo: luo_subsystems: implement subsystem callbacks Pasha Tatashin
@ 2025-05-15 18:23 ` Pasha Tatashin
  2025-05-15 23:15   ` James Houghton
                     ` (2 more replies)
  2025-05-15 18:23 ` [RFC v2 09/16] luo: luo_files: implement file systems callbacks Pasha Tatashin
                   ` (9 subsequent siblings)
  17 siblings, 3 replies; 102+ messages in thread
From: Pasha Tatashin @ 2025-05-15 18:23 UTC (permalink / raw)
  To: pratyush, jasonmiu, graf, changyuanl, pasha.tatashin, rppt,
	dmatlack, rientjes, corbet, rdunlap, ilpo.jarvinen, kanie, ojeda,
	aliceryhl, masahiroy, akpm, tj, yoann.congal, mmaurer,
	roman.gushchin, chenridong, axboe, mark.rutland, jannh,
	vincent.guittot, hannes, dan.j.williams, david, joel.granados,
	rostedt, anna.schumaker, song, zhangguopeng, linux, linux-kernel,
	linux-doc, linux-mm, gregkh, tglx, mingo, bp, dave.hansen, x86,
	hpa, rafael, dakr, bartosz.golaszewski, cw00.choi, myungjoo.ham,
	yesanishhere, Jonathan.Cameron, quic_zijuhu, aleksander.lobakin,
	ira.weiny, andriy.shevchenko, leon, lukas, bhelgaas, wagi,
	djeffery, stuart.w.hayes, ptyadav

Introduce the framework within LUO to support preserving specific types
of file descriptors across a live update transition. This allows
stateful FDs (like memfds or vfio FDs used by VMs) to be recreated in
the new kernel.
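
As a rough illustration of the per-type handler model, the following
userspace sketch mimics how a handler is keyed by its 'compatible'
string and rejected when that string is already registered. All demo_*
names are hypothetical stand-ins and are not part of this series; the
kernel code uses a list_head protected by an rwsem rather than the
plain singly linked list assumed here.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Userspace stand-in for struct liveupdate_filesystem (illustrative only). */
struct demo_fs_handler {
	const char *compatible;
	struct demo_fs_handler *next;
};

static struct demo_fs_handler *demo_handlers;

/* Register a handler; only one handler per compatible string is allowed. */
static int demo_register(struct demo_fs_handler *fs)
{
	struct demo_fs_handler **pp;

	assert(fs && fs->compatible);

	for (pp = &demo_handlers; *pp; pp = &(*pp)->next) {
		if (!strcmp((*pp)->compatible, fs->compatible))
			return -EEXIST;
	}

	fs->next = NULL;
	*pp = fs;	/* append at the tail, similar to list_add_tail() */
	return 0;
}

/* Find the handler whose compatible string matches the one read back. */
static struct demo_fs_handler *demo_find(const char *compatible)
{
	struct demo_fs_handler *fs;

	for (fs = demo_handlers; fs; fs = fs->next) {
		if (!strcmp(fs->compatible, compatible))
			return fs;
	}
	return NULL;
}
```

In this sketch a second registration of "memfd-v1" returns -EEXIST, and
a lookup for a string with no handler returns NULL, which is the case
the restore path in this patch treats as fatal.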

Note: The core logic for iterating over the preserved files and
invoking the handler callbacks (prepare, freeze, cancel, finish)
within luo_do_files_*_calls, as well as managing the u64 data
persistence via the FDT for individual files, is currently implemented
as stubs in this patch. This patch sets up the registration, FDT layout,
and retrieval framework.
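
For the FDT layout, each preserved file gets a subnode named after its
token. The userspace sketch below shows why a 19-byte buffer is enough
for that name and how the name parses back with base auto-detection.
It is only an approximation: snprintf()/strtoull() stand in for the
kernel's snprintf()/kstrtou64(), and the demo_* names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define DEMO_TOKEN_STR_LEN 19	/* "0x" + 16 hex digits + NUL */

/* Format a token as a hex node name; returns the number of characters. */
static int demo_token_to_name(uint64_t token, char *buf, size_t len)
{
	assert(buf && len);

	return snprintf(buf, len, "%#llx", (unsigned long long)token);
}

/* Parse a node name back; base 0 accepts the "0x" prefix. */
static uint64_t demo_name_to_token(const char *name)
{
	return strtoull(name, NULL, 0);
}
```

The largest u64 formats as "0xffffffffffffffff" (18 characters), so it
fits exactly with the terminating NUL. Note that with the '#' flag a
token of zero is printed as "0" without the prefix, which base 0
parsing still accepts.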

Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
---
 drivers/misc/liveupdate/Makefile       |   1 +
 drivers/misc/liveupdate/luo_core.c     |  19 +
 drivers/misc/liveupdate/luo_files.c    | 563 +++++++++++++++++++++++++
 drivers/misc/liveupdate/luo_internal.h |  11 +
 include/linux/liveupdate.h             |  62 +++
 5 files changed, 656 insertions(+)
 create mode 100644 drivers/misc/liveupdate/luo_files.c

diff --git a/drivers/misc/liveupdate/Makefile b/drivers/misc/liveupdate/Makefile
index df1c9709ba4f..b4cdd162574f 100644
--- a/drivers/misc/liveupdate/Makefile
+++ b/drivers/misc/liveupdate/Makefile
@@ -1,3 +1,4 @@
 # SPDX-License-Identifier: GPL-2.0
 obj-y					+= luo_core.o
+obj-y					+= luo_files.o
 obj-y					+= luo_subsystems.o
diff --git a/drivers/misc/liveupdate/luo_core.c b/drivers/misc/liveupdate/luo_core.c
index 417e7f6bf36c..ab1d76221fe2 100644
--- a/drivers/misc/liveupdate/luo_core.c
+++ b/drivers/misc/liveupdate/luo_core.c
@@ -110,6 +110,10 @@ static int luo_fdt_setup(struct kho_serialization *ser)
 	if (ret)
 		goto exit_free;
 
+	ret = luo_files_fdt_setup(fdt_out);
+	if (ret)
+		goto exit_free;
+
 	ret = luo_subsystems_fdt_setup(fdt_out);
 	if (ret)
 		goto exit_free;
@@ -145,7 +149,13 @@ static int luo_do_prepare_calls(void)
 {
 	int ret;
 
+	ret = luo_do_files_prepare_calls();
+	if (ret)
+		return ret;
+
 	ret = luo_do_subsystems_prepare_calls();
+	if (ret)
+		luo_do_files_cancel_calls();
 
 	return ret;
 }
@@ -154,18 +164,26 @@ static int luo_do_freeze_calls(void)
 {
 	int ret;
 
+	ret = luo_do_files_freeze_calls();
+	if (ret)
+		return ret;
+
 	ret = luo_do_subsystems_freeze_calls();
+	if (ret)
+		luo_do_files_cancel_calls();
 
 	return ret;
 }
 
 static void luo_do_finish_calls(void)
 {
+	luo_do_files_finish_calls();
 	luo_do_subsystems_finish_calls();
 }
 
 static void luo_do_cancel_calls(void)
 {
+	luo_do_files_cancel_calls();
 	luo_do_subsystems_cancel_calls();
 }
 
@@ -436,6 +454,7 @@ static int __init luo_startup(void)
 	}
 
 	__luo_set_state(LIVEUPDATE_STATE_UPDATED);
+	luo_files_startup(luo_fdt_in);
 	luo_subsystems_startup(luo_fdt_in);
 
 	return 0;
diff --git a/drivers/misc/liveupdate/luo_files.c b/drivers/misc/liveupdate/luo_files.c
new file mode 100644
index 000000000000..953fc40db3d7
--- /dev/null
+++ b/drivers/misc/liveupdate/luo_files.c
@@ -0,0 +1,563 @@
+// SPDX-License-Identifier: GPL-2.0
+
+/*
+ * Copyright (c) 2025, Google LLC.
+ * Pasha Tatashin <pasha.tatashin@soleen.com>
+ */
+
+/**
+ * DOC: LUO file descriptors
+ *
+ * LUO provides the infrastructure necessary to preserve
+ * specific types of stateful file descriptors across a kernel live
+ * update transition. The primary goal is to allow workloads, such as virtual
+ * machines using vfio, memfd, or iommufd to retain access to their essential
+ * resources without interruption after the underlying kernel is updated.
+ *
+ * The framework operates based on handler registration and instance tracking:
+ *
+ * 1. Handler Registration: Kernel modules responsible for specific file
+ * types (e.g., memfd, vfio) register a &struct liveupdate_filesystem
+ * handler. This handler contains callbacks (&liveupdate_filesystem.prepare,
+ * &liveupdate_filesystem.freeze, &liveupdate_filesystem.finish, etc.)
+ * and a unique 'compatible' string identifying the file type.
+ * Registration occurs via liveupdate_register_filesystem().
+ *
+ * 2. File Instance Tracking: When a potentially preservable file needs to be
+ * managed for live update, the core LUO logic (luo_register_file()) finds a
+ * compatible registered handler using its &liveupdate_filesystem.can_preserve
+ * callback. If found, an internal &struct luo_file instance is created,
+ * assigned a unique u64 'token', and added to a list.
+ *
+ * 3. State Persistence (FDT): During the LUO prepare/freeze phases, the
+ * registered handler callbacks are invoked for each tracked file instance.
+ * These callbacks can generate a u64 data payload representing the minimal
+ * state needed for restoration. This payload, along with the handler's
+ * compatible string and the unique token, is stored in a dedicated
+ * '/file-descriptors' node within the main LUO FDT blob passed via
+ * Kexec Handover (KHO).
+ *
+ * 4. Restoration: In the new kernel, the LUO framework parses the incoming
+ * FDT to reconstruct the list of &struct luo_file instances. When the
+ * original owner requests the file, luo_retrieve_file() uses the corresponding
+ * handler's &liveupdate_filesystem.retrieve callback, passing the persisted
+ * u64 data, to recreate or find the appropriate &struct file object.
+ */
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include <linux/err.h>
+#include <linux/libfdt.h>
+#include <linux/liveupdate.h>
+#include <linux/mutex.h>
+#include <linux/rwsem.h>
+#include <linux/slab.h>
+#include <linux/string.h>
+#include <linux/xarray.h>
+#include "luo_internal.h"
+
+#define LUO_FILES_NODE_NAME	"file-descriptors"
+#define LUO_FILES_COMPATIBLE	"file-descriptors-v1"
+
+static DEFINE_XARRAY(luo_files_xa_in);
+static DEFINE_XARRAY(luo_files_xa_out);
+static bool luo_files_xa_in_recreated;
+
+/* Registered filesystems. */
+static DECLARE_RWSEM(luo_filesystems_list_rwsem);
+static LIST_HEAD(luo_filesystems_list);
+
+static void *luo_fdt_out;
+static void *luo_fdt_in;
+
+static u64 luo_next_file_token;
+
+/**
+ * struct luo_file - Represents a file descriptor instance preserved
+ * across live update.
+ * @fs:            Pointer to the &struct liveupdate_filesystem containing
+ *                 the implementation of prepare, freeze, cancel, and finish
+ *                 operations specific to this file's type.
+ * @file:          A pointer to the kernel's &struct file object representing
+ *                 the open file descriptor that is being preserved.
+ * @private_data:  Internal storage used by the live update core framework
+ *                 between phases.
+ * @reclaimed:     Flag indicating whether this preserved file descriptor has
+ *                 been successfully 'reclaimed' (e.g., requested via an ioctl)
+ *                 by user-space or the owning kernel subsystem in the new
+ *                 kernel after the live update.
+ * @state:         The current state of the file descriptor. An FD may be
+ *                 prepared, frozen, and finished independently, before the
+ *                 global state switch.
+ * @mutex:         Lock protecting the FD state, allowing it to change
+ *                 independently of the global state.
+ *
+ * This structure holds the necessary callbacks and context for managing a
+ * specific open file descriptor throughout the different phases of a live
+ * update process. Instances of this structure are typically allocated,
+ * populated with file-specific details (&file, &arg, callbacks, compatibility
+ * string, token), and linked into a central list managed by the LUO. The
+ * private_data field is used internally by the core logic to store state
+ * between phases.
+ */
+struct luo_file {
+	struct liveupdate_filesystem *fs;
+	struct file *file;
+	u64 private_data;
+	bool reclaimed;
+	enum liveupdate_state state;
+	struct mutex mutex;
+};
+
+/**
+ * luo_files_startup - Validates the LUO file-descriptors FDT node at startup.
+ * @fdt: Pointer to the LUO FDT blob passed from the previous kernel.
+ *
+ * This __init function checks the existence and validity of the
+ * '/file-descriptors' node in the FDT. This node is considered mandatory. It
+ * calls panic() if the node is missing, inaccessible, or invalid (e.g., missing
+ * compatible, wrong compatible string), indicating a critical configuration
+ * error for LUO.
+ */
+void __init luo_files_startup(void *fdt)
+{
+	int ret, node_offset;
+
+	node_offset = fdt_subnode_offset(fdt, 0, LUO_FILES_NODE_NAME);
+	if (node_offset < 0)
+		panic("Failed to find /file-descriptors node\n");
+
+	ret = fdt_node_check_compatible(fdt, node_offset,
+					LUO_FILES_COMPATIBLE);
+	if (ret) {
+		panic("FDT '%s' is incompatible with '%s' [%d]\n",
+		      LUO_FILES_NODE_NAME, LUO_FILES_COMPATIBLE, ret);
+	}
+	luo_fdt_in = fdt;
+}
+
+static void luo_files_recreate_luo_files_xa_in(void)
+{
+	int parent_node_offset, file_node_offset;
+	const char *node_name, *fdt_compat_str;
+	struct liveupdate_filesystem *fs;
+	struct luo_file *luo_file;
+	const void *data_ptr;
+	int ret = 0;
+
+	if (luo_files_xa_in_recreated || !luo_fdt_in)
+		return;
+
+	/* Take the write lock to guarantee that the list is re-created once */
+	down_write(&luo_filesystems_list_rwsem);
+	if (luo_files_xa_in_recreated)
+		goto exit_unlock;
+
+	parent_node_offset = fdt_subnode_offset(luo_fdt_in, 0,
+						LUO_FILES_NODE_NAME);
+
+	fdt_for_each_subnode(file_node_offset, luo_fdt_in, parent_node_offset) {
+		bool handler_found = false;
+		u64 token;
+
+		node_name = fdt_get_name(luo_fdt_in, file_node_offset, NULL);
+		if (!node_name) {
+			panic("Skipping FDT subnode at offset %d: Cannot get name\n",
+			      file_node_offset);
+		}
+
+		ret = kstrtou64(node_name, 0, &token);
+		if (ret < 0) {
+			panic("Skipping FDT node '%s': Failed to parse token\n",
+			      node_name);
+		}
+
+		fdt_compat_str = fdt_getprop(luo_fdt_in, file_node_offset,
+					     "compatible", NULL);
+		if (!fdt_compat_str) {
+			panic("Skipping FDT node '%s': Missing 'compatible' property\n",
+			      node_name);
+		}
+
+		data_ptr = fdt_getprop(luo_fdt_in, file_node_offset, "data",
+				       NULL);
+		if (!data_ptr) {
+			panic("Can't recover property 'data' for FDT node '%s'\n",
+			      node_name);
+		}
+
+		list_for_each_entry(fs, &luo_filesystems_list, list) {
+			if (!strcmp(fs->compatible, fdt_compat_str)) {
+				handler_found = true;
+				break;
+			}
+		}
+
+		if (!handler_found) {
+			panic("Skipping FDT node '%s': No registered handler for compatible '%s'\n",
+			      node_name, fdt_compat_str);
+		}
+
+		luo_file = kmalloc(sizeof(*luo_file),
+				   GFP_KERNEL | __GFP_NOFAIL);
+		luo_file->fs = fs;
+		luo_file->file = NULL;
+		memcpy(&luo_file->private_data, data_ptr, sizeof(u64));
+		luo_file->reclaimed = false;
+		mutex_init(&luo_file->mutex);
+		luo_file->state = LIVEUPDATE_STATE_UPDATED;
+		ret = xa_err(xa_store(&luo_files_xa_in, token, luo_file,
+				      GFP_KERNEL | __GFP_NOFAIL));
+		if (ret < 0) {
+			panic("Failed to store luo_file for token %llu in XArray: %d\n",
+			      token, ret);
+		}
+	}
+	luo_files_xa_in_recreated = true;
+
+exit_unlock:
+	up_write(&luo_filesystems_list_rwsem);
+}
+
+/**
+ * luo_files_fdt_setup - Adds and populates the 'file-descriptors' node in the
+ * FDT.
+ * @fdt: Pointer to the LUO FDT blob.
+ *
+ * Add file-descriptors node and each FD node to the LUO FDT blob.
+ *
+ * Returns: 0 on success, negative errno on failure.
+ */
+int luo_files_fdt_setup(void *fdt)
+{
+	int ret, files_node_offset, node_offset;
+	const u64 zero_data = 0;
+	unsigned long token;
+	struct luo_file *h;
+	char token_str[19];
+
+	ret = fdt_add_subnode(fdt, 0, LUO_FILES_NODE_NAME);
+	if (ret < 0)
+		goto exit_error;
+
+	files_node_offset = ret;
+	ret = fdt_setprop_string(fdt, files_node_offset, "compatible",
+				 LUO_FILES_COMPATIBLE);
+	if (ret < 0)
+		goto exit_error;
+
+	xa_for_each(&luo_files_xa_out, token, h) {
+		snprintf(token_str, sizeof(token_str), "%#0llx", (u64)token);
+
+		ret = fdt_add_subnode(fdt, files_node_offset, token_str);
+		if (ret < 0)
+			goto exit_error;
+
+		node_offset = ret;
+		ret = fdt_setprop_string(fdt, node_offset, "compatible",
+					 h->fs->compatible);
+		if (ret < 0)
+			goto exit_error;
+
+		ret = fdt_setprop(fdt, node_offset, "data",
+				  &zero_data, sizeof(zero_data));
+	}
+
+	luo_fdt_out = fdt;
+
+	return 0;
+exit_error:
+	pr_err("Failed to setup 'file-descriptors' node to FDT: %s\n",
+	       fdt_strerror(ret));
+	return -ENOSPC;
+}
+
+/**
+ * luo_do_files_prepare_calls - Calls prepare callbacks and updates FDT
+ * if all prepares succeed. Handles cancellation on failure.
+ *
+ * Phase 1: Calls 'prepare' for all files and stores results temporarily.
+ * If any 'prepare' fails, calls 'cancel' on previously prepared files
+ * and returns the error.
+ * Phase 2: If all 'prepare' calls succeeded, writes the stored data to the FDT.
+ * If any FDT write fails, calls 'cancel' on *all* prepared files and
+ * returns the FDT error.
+ *
+ * Returns: 0 on success. Negative errno on failure.
+ */
+int luo_do_files_prepare_calls(void)
+{
+	return 0;
+}
+
+/**
+ * luo_do_files_freeze_calls - Calls freeze callbacks and updates FDT
+ * if all calls succeed. Handles cancellation on failure.
+ *
+ * Phase 1: Calls 'freeze' for all files and stores results temporarily.
+ * If any 'freeze' fails, calls 'cancel' on previously called files
+ * and returns the error.
+ * Phase 2: If all 'freeze' calls succeeded, writes the stored data to the FDT.
+ * If any FDT write fails, calls 'cancel' on *all* files and returns the FDT
+ * error.
+ *
+ * Returns: 0 on success. Negative errno on failure.
+ */
+int luo_do_files_freeze_calls(void)
+{
+	return 0;
+}
+
+/**
+ * luo_do_files_finish_calls - Calls finish callbacks for all file descriptors.
+ *
+ * This function is called at the end of the live update cycle to do the
+ * final clean-up or housekeeping of the post-live update states.
+ */
+void luo_do_files_finish_calls(void)
+{
+	luo_files_recreate_luo_files_xa_in();
+}
+
+/**
+ * luo_do_files_cancel_calls - Calls cancel callbacks for all file descriptors.
+ *
+ * This function is typically called when the live update process needs to be
+ * aborted externally, for example, after the prepare phase may have run but
+ * before actual reboot. It iterates through all registered files and calls
+ * the 'cancel' callback for those that implement it and likely completed
+ * prepare.
+ */
+void luo_do_files_cancel_calls(void)
+{
+}
+
+/**
+ * luo_register_file - Register a file descriptor for live update management.
+ * @tokenp: Return argument for the token value.
+ * @file: Pointer to the struct file to be preserved.
+ *
+ * Context: Must be called when LUO is in 'normal' or 'updated' state.
+ *
+ * Return: 0 on success. Negative errno on failure.
+ */
+int luo_register_file(u64 *tokenp, struct file *file)
+{
+	struct liveupdate_filesystem *fs;
+	bool found = false;
+	int ret = -ENOENT;
+	u64 token;
+
+	luo_state_read_enter();
+	if (!liveupdate_state_normal() && !liveupdate_state_updated()) {
+		pr_warn("File can be registered only in normal or updated state\n");
+		luo_state_read_exit();
+		return -EBUSY;
+	}
+
+	down_read(&luo_filesystems_list_rwsem);
+	list_for_each_entry(fs, &luo_filesystems_list, list) {
+		if (fs->can_preserve(file, fs->arg)) {
+			found = true;
+			break;
+		}
+	}
+
+	if (found) {
+		struct luo_file *luo_file = kmalloc(sizeof(*luo_file),
+						    GFP_KERNEL);
+
+		if (!luo_file) {
+			ret = -ENOMEM;
+			goto exit_unlock;
+		}
+
+		token = luo_next_file_token;
+		luo_next_file_token++;
+
+		luo_file->private_data = 0;
+		luo_file->reclaimed = false;
+
+		luo_file->file = file;
+		luo_file->fs = fs;
+		mutex_init(&luo_file->mutex);
+		luo_file->state = LIVEUPDATE_STATE_NORMAL;
+		ret = xa_err(xa_store(&luo_files_xa_out, token, luo_file,
+				      GFP_KERNEL));
+		if (ret < 0) {
+			pr_warn("Failed to store file for token %llu in XArray: %d\n",
+				token, ret);
+			kfree(luo_file);
+			goto exit_unlock;
+		}
+		*tokenp = token;
+	}
+
+exit_unlock:
+	up_read(&luo_filesystems_list_rwsem);
+	luo_state_read_exit();
+
+	return ret;
+}
+
+/**
+ * luo_unregister_file - Unregister a file instance using its token.
+ * @token: The unique token of the file instance to unregister.
+ *
+ * Finds the &struct luo_file associated with the @token in the
+ * outgoing XArray, removes it, and frees the memory allocated for the
+ * &struct luo_file itself. The underlying &struct file is not released
+ * by this function; only the LUO tracking entry for the token is
+ * dropped.
+ *
+ * Context: Must be called when LUO is in 'normal' or 'updated' state,
+ * for example when a preserved file descriptor is closed or no longer
+ * needs live update management.
+ *
+ * Return: 0 on success. Negative errno on failure.
+ */
+int luo_unregister_file(u64 token)
+{
+	struct luo_file *luo_file;
+	int ret = 0;
+
+	luo_state_read_enter();
+	if (!liveupdate_state_normal() && !liveupdate_state_updated()) {
+		pr_warn("File can be unregistered only in normal or updated state\n");
+		luo_state_read_exit();
+		return -EBUSY;
+	}
+
+	luo_file = xa_erase(&luo_files_xa_out, token);
+	if (luo_file) {
+		kfree(luo_file);
+	} else {
+		pr_warn("Failed to unregister: token %llu not found.\n",
+			token);
+		ret = -ENOENT;
+	}
+	luo_state_read_exit();
+
+	return ret;
+}
+
+/**
+ * luo_retrieve_file - Find a registered file instance by its token.
+ * @token: The unique token of the file instance to retrieve.
+ * @file: Output parameter. On success (return value 0), this will point
+ * to the retrieved "struct file".
+ *
+ * Searches the global list for a &struct luo_file matching the @token. Uses a
+ * read lock, allowing concurrent retrievals.
+ *
+ * Return: 0 on success. Negative errno on failure.
+ */
+int luo_retrieve_file(u64 token, struct file **file)
+{
+	struct luo_file *luo_file;
+	int ret = 0;
+
+	luo_files_recreate_luo_files_xa_in();
+	luo_state_read_enter();
+	if (!liveupdate_state_updated()) {
+		pr_warn("File can be retrieved only in updated state\n");
+		luo_state_read_exit();
+		return -EBUSY;
+	}
+
+	luo_file = xa_load(&luo_files_xa_in, token);
+	if (luo_file && !luo_file->reclaimed) {
+		luo_file->reclaimed = true;
+		ret = luo_file->fs->retrieve(luo_file->fs->arg,
+					     luo_file->private_data,
+					     file);
+		if (!ret)
+			luo_file->file = *file;
+	} else if (luo_file && luo_file->reclaimed) {
+		pr_err("The file descriptor for token %llu has already been retrieved\n",
+		       token);
+		ret = -EINVAL;
+	} else {
+		ret = -ENOENT;
+	}
+
+	luo_state_read_exit();
+
+	return ret;
+}
+
+/**
+ * liveupdate_register_filesystem - Register a filesystem handler with LUO.
+ * @fs: Pointer to a caller-allocated &struct liveupdate_filesystem.
+ * The caller must initialize this structure, including a unique
+ * 'compatible' string and valid callbacks. This function adds the
+ * handler to the global list of supported filesystem handlers.
+ *
+ * Context: Typically called during module initialization for filesystems or
+ * file types that support live update preservation.
+ *
+ * Return: 0 on success. Negative errno on failure.
+ */
+int liveupdate_register_filesystem(struct liveupdate_filesystem *fs)
+{
+	struct liveupdate_filesystem *fs_iter;
+	int ret = 0;
+
+	luo_state_read_enter();
+	if (!liveupdate_state_normal() && !liveupdate_state_updated()) {
+		luo_state_read_exit();
+		return -EBUSY;
+	}
+
+	down_write(&luo_filesystems_list_rwsem);
+	list_for_each_entry(fs_iter, &luo_filesystems_list, list) {
+		if (!strcmp(fs_iter->compatible, fs->compatible)) {
+			pr_err("Filesystem handler registration failed: Compatible string '%s' already registered.\n",
+			       fs->compatible);
+			ret = -EEXIST;
+			goto exit_unlock;
+		}
+	}
+
+	INIT_LIST_HEAD(&fs->list);
+	list_add_tail(&fs->list, &luo_filesystems_list);
+
+exit_unlock:
+	up_write(&luo_filesystems_list_rwsem);
+	luo_state_read_exit();
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(liveupdate_register_filesystem);
+
+/**
+ * liveupdate_unregister_filesystem - Unregister a filesystem handler.
+ * @fs: Pointer to the specific &struct liveupdate_filesystem instance
+ * that was previously returned by or passed to liveupdate_register_filesystem.
+ *
+ * Removes the specified handler instance @fs from the global list of
+ * registered filesystem handlers. This function only removes the entry from the
+ * list; it does not free the memory associated with @fs itself. The caller
+ * is responsible for freeing the structure memory after this function returns
+ * successfully.
+ *
+ * Return: 0 on success. Negative errno on failure.
+ */
+int liveupdate_unregister_filesystem(struct liveupdate_filesystem *fs)
+{
+	int ret = 0;
+
+	luo_state_read_enter();
+	if (!liveupdate_state_normal() && !liveupdate_state_updated()) {
+		luo_state_read_exit();
+		return -EBUSY;
+	}
+
+	down_write(&luo_filesystems_list_rwsem);
+	list_del_init(&fs->list);
+	up_write(&luo_filesystems_list_rwsem);
+	luo_state_read_exit();
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(liveupdate_unregister_filesystem);
diff --git a/drivers/misc/liveupdate/luo_internal.h b/drivers/misc/liveupdate/luo_internal.h
index 63a8b93254a6..b7a0f31ddc99 100644
--- a/drivers/misc/liveupdate/luo_internal.h
+++ b/drivers/misc/liveupdate/luo_internal.h
@@ -23,6 +23,17 @@ int luo_do_subsystems_freeze_calls(void);
 void luo_do_subsystems_finish_calls(void);
 void luo_do_subsystems_cancel_calls(void);
 
+void luo_files_startup(void *fdt);
+int luo_files_fdt_setup(void *fdt);
+int luo_do_files_prepare_calls(void);
+int luo_do_files_freeze_calls(void);
+void luo_do_files_finish_calls(void);
+void luo_do_files_cancel_calls(void);
+
+int luo_retrieve_file(u64 token, struct file **file);
+int luo_register_file(u64 *token, struct file *file);
+int luo_unregister_file(u64 token);
+
 extern const char *const luo_state_str[];
 
 /* Get the current state as a string */
diff --git a/include/linux/liveupdate.h b/include/linux/liveupdate.h
index 7a130680b5f2..7afe0aac5ce4 100644
--- a/include/linux/liveupdate.h
+++ b/include/linux/liveupdate.h
@@ -86,6 +86,55 @@ enum liveupdate_state  {
 	LIVEUPDATE_STATE_UPDATED = 3,
 };
 
+/* Forward declaration needed if definition isn't included */
+struct file;
+
+/**
+ * struct liveupdate_filesystem - Represents a handler for a live-updatable
+ * filesystem/file type.
+ * @prepare:       Optional. Saves state for a specific file instance (@file,
+ *                 @arg) before update, potentially returning value via @data.
+ *                 Returns 0 on success, negative errno on failure.
+ * @freeze:        Optional. Performs final actions just before kernel
+ *                 transition, potentially reading/updating the handle via
+ *                 @data.
+ *                 Returns 0 on success, negative errno on failure.
+ * @cancel:        Optional. Cleans up state/resources if update is aborted
+ *                 after prepare/freeze succeeded, using the @data handle (by
+ *                 value) from the successful prepare. Returns void.
+ * @finish:        Optional. Performs final cleanup in the new kernel using the
+ *                 preserved @data handle (by value). Returns void.
+ * @retrieve:      Retrieve the preserved file. Must be called before finish.
+ * @can_preserve:  callback to determine if @file with associated context (@arg)
+ *                 can be preserved by this handler.
+ *                 Return bool (true if preservable, false otherwise).
+ * @compatible:    The compatibility string (e.g., "memfd-v1", "vfiofd-v1")
+ *                 that uniquely identifies the filesystem or file type this
+ *                 handler supports. This is matched against the compatible
+ *                 string associated with individual &struct liveupdate_file
+ *                 instances.
+ * @arg:           An opaque pointer to implementation-specific context data
+ *                 associated with this filesystem handler registration.
+ * @list:          used for linking this handler instance into a global list of
+ *                 registered filesystem handlers.
+ *
+ * Modules that want to support live update for specific file types should
+ * register an instance of this structure. LUO uses this registration to
+ * determine if a given file can be preserved and to find the appropriate
+ * operations to manage its state across the update.
+ */
+struct liveupdate_filesystem {
+	int (*prepare)(struct file *file, void *arg, u64 *data);
+	int (*freeze)(struct file *file, void *arg, u64 *data);
+	void (*cancel)(struct file *file, void *arg, u64 data);
+	void (*finish)(struct file *file, void *arg, u64 data, bool reclaimed);
+	int (*retrieve)(void *arg, u64 data, struct file **file);
+	bool (*can_preserve)(struct file *file, void *arg);
+	const char *compatible;
+	void *arg;
+	struct list_head list;
+};
+
 /**
  * struct liveupdate_subsystem - Represents a subsystem participating in LUO
  * @prepare:      Optional. Called during LUO prepare phase. Should perform
@@ -142,6 +191,9 @@ int liveupdate_register_subsystem(struct liveupdate_subsystem *h);
 int liveupdate_unregister_subsystem(struct liveupdate_subsystem *h);
 int liveupdate_get_subsystem_data(struct liveupdate_subsystem *h, u64 *data);
 
+int liveupdate_register_filesystem(struct liveupdate_filesystem *h);
+int liveupdate_unregister_filesystem(struct liveupdate_filesystem *h);
+
 #else /* CONFIG_LIVEUPDATE */
 
 static inline int liveupdate_reboot(void)
@@ -180,5 +232,15 @@ static inline int liveupdate_get_subsystem_data(struct liveupdate_subsystem *h,
 	return -ENODATA;
 }
 
+static inline int liveupdate_register_filesystem(struct liveupdate_filesystem *h)
+{
+	return 0;
+}
+
+static inline int liveupdate_unregister_filesystem(struct liveupdate_filesystem *h)
+{
+	return 0;
+}
+
 #endif /* CONFIG_LIVEUPDATE */
 #endif /* _LINUX_LIVEUPDATE_H */
-- 
2.49.0.1101.gccaa498523-goog



* [RFC v2 09/16] luo: luo_files: implement file systems callbacks
  2025-05-15 18:23 [RFC v2 00/16] Live Update Orchestrator Pasha Tatashin
                   ` (7 preceding siblings ...)
  2025-05-15 18:23 ` [RFC v2 08/16] luo: luo_files: add infrastructure for FDs Pasha Tatashin
@ 2025-05-15 18:23 ` Pasha Tatashin
  2025-06-05 16:03   ` Pratyush Yadav
  2025-05-15 18:23 ` [RFC v2 10/16] luo: luo_ioctl: add ioctl interface Pasha Tatashin
                   ` (8 subsequent siblings)
  17 siblings, 1 reply; 102+ messages in thread
From: Pasha Tatashin @ 2025-05-15 18:23 UTC (permalink / raw)
  To: pratyush, jasonmiu, graf, changyuanl, pasha.tatashin, rppt,
	dmatlack, rientjes, corbet, rdunlap, ilpo.jarvinen, kanie, ojeda,
	aliceryhl, masahiroy, akpm, tj, yoann.congal, mmaurer,
	roman.gushchin, chenridong, axboe, mark.rutland, jannh,
	vincent.guittot, hannes, dan.j.williams, david, joel.granados,
	rostedt, anna.schumaker, song, zhangguopeng, linux, linux-kernel,
	linux-doc, linux-mm, gregkh, tglx, mingo, bp, dave.hansen, x86,
	hpa, rafael, dakr, bartosz.golaszewski, cw00.choi, myungjoo.ham,
	yesanishhere, Jonathan.Cameron, quic_zijuhu, aleksander.lobakin,
	ira.weiny, andriy.shevchenko, leon, lukas, bhelgaas, wagi,
	djeffery, stuart.w.hayes, ptyadav

Implement the core logic within luo_files.c to invoke the prepare,
freeze, finish, and cancel callbacks for preserved file instances,
replacing the previous stub implementations. It also handles
the persistence and retrieval of the u64 data payload associated with
each file via the LUO FDT.
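
The failure handling can be pictured with the userspace sketch below:
callbacks run in order, and on the first failure only the entries
before the failing one are cancelled. This is a simplified model with
hypothetical demo_* names; it uses a plain array where the patch walks
an XArray, and it omits the FDT commit step.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

struct demo_file {
	int fail_prepare;	/* force prepare to fail for this entry */
	int prepared;
	int cancelled;
	uint64_t data;		/* stand-in for the u64 payload */
};

static int demo_prepare(struct demo_file *f)
{
	if (f->fail_prepare)
		return -EINVAL;
	f->prepared = 1;
	f->data = 0x1000;	/* arbitrary payload for the demo */
	return 0;
}

static void demo_cancel(struct demo_file *f)
{
	f->cancelled = 1;
	f->data = 0;
}

/* Cancel entries in order, stopping before @boundary (NULL cancels all). */
static void demo_cancel_until(struct demo_file *files, size_t n,
			      const struct demo_file *boundary)
{
	size_t i;

	for (i = 0; i < n; i++) {
		if (&files[i] == boundary)
			break;
		demo_cancel(&files[i]);
	}
}

/* Prepare every entry; on failure roll back the ones already prepared. */
static int demo_prepare_all(struct demo_file *files, size_t n)
{
	size_t i;
	int ret;

	assert(files || n == 0);

	for (i = 0; i < n; i++) {
		ret = demo_prepare(&files[i]);
		if (ret < 0) {
			demo_cancel_until(files, n, &files[i]);
			return ret;
		}
	}
	return 0;
}
```

With three entries where the second fails, the first is prepared and
then cancelled, while the failing entry and the one after it are left
untouched. Passing NULL as the boundary corresponds to the
cancel-everything case.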

This completes the core mechanism enabling registered filesystem
handlers to actively manage file state across the live update
transition using the LUO framework.

Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
---
 drivers/misc/liveupdate/luo_files.c | 105 +++++++++++++++++++++++++++-
 1 file changed, 103 insertions(+), 2 deletions(-)

diff --git a/drivers/misc/liveupdate/luo_files.c b/drivers/misc/liveupdate/luo_files.c
index 953fc40db3d7..091bf07e051a 100644
--- a/drivers/misc/liveupdate/luo_files.c
+++ b/drivers/misc/liveupdate/luo_files.c
@@ -272,6 +272,48 @@ int luo_files_fdt_setup(void *fdt)
 	return -ENOSPC;
 }
 
+static void __luo_do_files_cancel_calls(struct luo_file *boundary_file)
+{
+	unsigned long token;
+	struct luo_file *h;
+
+	xa_for_each(&luo_files_xa_out, token, h) {
+		if (h == boundary_file)
+			break;
+
+		if (h->fs->cancel) {
+			h->fs->cancel(h->file, h->fs->arg, h->private_data);
+			h->private_data = 0;
+		}
+	}
+}
+
+static int luo_files_commit_data_to_fdt(void)
+{
+	int files_node_offset, node_offset, ret;
+	unsigned long token;
+	char token_str[19];
+	struct luo_file *h;
+
+	files_node_offset = fdt_subnode_offset(luo_fdt_out, 0,
+					       LUO_FILES_NODE_NAME);
+	xa_for_each(&luo_files_xa_out, token, h) {
+		snprintf(token_str, sizeof(token_str), "%#0llx", (u64)token);
+		node_offset = fdt_subnode_offset(luo_fdt_out,
+						 files_node_offset,
+						 token_str);
+		ret = fdt_setprop(luo_fdt_out, node_offset, "data",
+				  &h->private_data, sizeof(h->private_data));
+		if (ret < 0) {
+			pr_err("Failed to set data property for token %s: %s\n",
+			       token_str, fdt_strerror(ret));
+			return -ENOSPC;
+		}
+	}
+
+	return 0;
+}
+
 /**
  * luo_do_files_prepare_calls - Calls prepare callbacks and updates FDT
  * if all prepares succeed. Handles cancellation on failure.
@@ -287,7 +329,29 @@ int luo_files_fdt_setup(void *fdt)
  */
 int luo_do_files_prepare_calls(void)
 {
-	return 0;
+	unsigned long token;
+	struct luo_file *h;
+	int ret;
+
+	xa_for_each(&luo_files_xa_out, token, h) {
+		if (h->fs->prepare) {
+			ret = h->fs->prepare(h->file, h->fs->arg,
+					     &h->private_data);
+			if (ret < 0) {
+				pr_err("Prepare failed for file token %#0llx handler '%s' [%d]\n",
+				       (u64)token, h->fs->compatible, ret);
+				__luo_do_files_cancel_calls(h);
+
+				return ret;
+			}
+		}
+	}
+
+	ret = luo_files_commit_data_to_fdt();
+	if (ret)
+		__luo_do_files_cancel_calls(NULL);
+
+	return ret;
 }
 
 /**
@@ -305,7 +369,29 @@ int luo_do_files_prepare_calls(void)
  */
 int luo_do_files_freeze_calls(void)
 {
-	return 0;
+	unsigned long token;
+	struct luo_file *h;
+	int ret;
+
+	xa_for_each(&luo_files_xa_out, token, h) {
+		if (h->fs->freeze) {
+			ret = h->fs->freeze(h->file, h->fs->arg,
+					    &h->private_data);
+			if (ret < 0) {
+				pr_err("Freeze callback failed for file token %#0llx handler '%s' [%d]\n",
+				       (u64)token, h->fs->compatible, ret);
+				__luo_do_files_cancel_calls(h);
+
+				return ret;
+			}
+		}
+	}
+
+	ret = luo_files_commit_data_to_fdt();
+	if (ret)
+		__luo_do_files_cancel_calls(NULL);
+
+	return ret;
 }
 
 /**
@@ -316,7 +402,20 @@ int luo_do_files_freeze_calls(void)
  */
 void luo_do_files_finish_calls(void)
 {
+	unsigned long token;
+	struct luo_file *h;
+
 	luo_files_recreate_luo_files_xa_in();
+	xa_for_each(&luo_files_xa_in, token, h) {
+		mutex_lock(&h->mutex);
+		if (h->state == LIVEUPDATE_STATE_UPDATED && h->fs->finish) {
+			h->fs->finish(h->file, h->fs->arg,
+				      h->private_data,
+				      h->reclaimed);
+			h->state = LIVEUPDATE_STATE_NORMAL;
+		}
+		mutex_unlock(&h->mutex);
+	}
 }
 
 /**
@@ -330,6 +429,8 @@ void luo_do_files_finish_calls(void)
  */
 void luo_do_files_cancel_calls(void)
 {
+	__luo_do_files_cancel_calls(NULL);
+	luo_files_commit_data_to_fdt();
 }
 
 /**
-- 
2.49.0.1101.gccaa498523-goog



* [RFC v2 10/16] luo: luo_ioctl: add ioctl interface
  2025-05-15 18:23 [RFC v2 00/16] Live Update Orchestrator Pasha Tatashin
                   ` (8 preceding siblings ...)
  2025-05-15 18:23 ` [RFC v2 09/16] luo: luo_files: implement file systems callbacks Pasha Tatashin
@ 2025-05-15 18:23 ` Pasha Tatashin
  2025-05-26  8:42   ` Mike Rapoport
                     ` (3 more replies)
  2025-05-15 18:23 ` [RFC v2 11/16] luo: luo_sysfs: add sysfs state monitoring Pasha Tatashin
                   ` (7 subsequent siblings)
  17 siblings, 4 replies; 102+ messages in thread
From: Pasha Tatashin @ 2025-05-15 18:23 UTC (permalink / raw)

Introduce the user-space interface for the Live Update Orchestrator
via ioctl commands, enabling external control over the live update
process and management of preserved resources.

Create a misc character device at /dev/liveupdate. Access
to this device requires the CAP_SYS_ADMIN capability.

A new UAPI header, <uapi/linux/liveupdate.h>, defines the necessary
structures. The magic number is registered in
Documentation/userspace-api/ioctl/ioctl-number.rst.
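
To make the FD preserve/restore flow concrete, here is a minimal user-space
sketch. The structure and ioctl numbers are copied from the UAPI header in
this patch (mirrored locally only because the header may not be installed
yet); luo_preserve_fd() and luo_restore_fd() are hypothetical helper names
chosen for illustration.

```c
#include <errno.h>
#include <stdint.h>
#include <sys/ioctl.h>

/* Mirrored from the UAPI header in this patch (assumption: not yet installed). */
struct liveupdate_fd {
	int		fd;
	uint32_t	flags;
	uint64_t	token;
};

#define LIVEUPDATE_IOCTL_TYPE		0xBA
#define LIVEUPDATE_IOCTL_FD_PRESERVE					\
	_IOWR(LIVEUPDATE_IOCTL_TYPE, 0x00, struct liveupdate_fd)
#define LIVEUPDATE_IOCTL_FD_RESTORE					\
	_IOWR(LIVEUPDATE_IOCTL_TYPE, 0x02, struct liveupdate_fd)

/*
 * Hypothetical helper: ask the kernel to preserve @fd. On success the
 * kernel-generated token is stored in @token. @luo_dev is an open
 * descriptor for /dev/liveupdate.
 */
static int luo_preserve_fd(int luo_dev, int fd, uint64_t *token)
{
	struct liveupdate_fd arg = { .fd = fd, .flags = 0, .token = 0 };

	if (ioctl(luo_dev, LIVEUPDATE_IOCTL_FD_PRESERVE, &arg) < 0)
		return -errno;

	*token = arg.token;
	return 0;
}

/*
 * Hypothetical helper: in the new kernel, exchange @token for a new file
 * descriptor. Returns the new fd (>= 0) or a negative errno value.
 */
static int luo_restore_fd(int luo_dev, uint64_t token)
{
	struct liveupdate_fd arg = { .fd = -1, .flags = 0, .token = token };

	if (ioctl(luo_dev, LIVEUPDATE_IOCTL_FD_RESTORE, &arg) < 0)
		return -errno;

	return arg.fd;
}
```

User space has to carry the token across the kexec itself (for example in a
preserved memfd or on disk), and all restores have to happen before
LIVEUPDATE_IOCTL_EVENT_FINISH is issued.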

Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
---
 .../userspace-api/ioctl/ioctl-number.rst      |   1 +
 drivers/misc/liveupdate/Makefile              |   1 +
 drivers/misc/liveupdate/luo_ioctl.c           | 199 ++++++++++++
 include/linux/liveupdate.h                    |  34 +-
 include/uapi/linux/liveupdate.h               | 300 ++++++++++++++++++
 5 files changed, 502 insertions(+), 33 deletions(-)
 create mode 100644 drivers/misc/liveupdate/luo_ioctl.c
 create mode 100644 include/uapi/linux/liveupdate.h

diff --git a/Documentation/userspace-api/ioctl/ioctl-number.rst b/Documentation/userspace-api/ioctl/ioctl-number.rst
index 7a1409ecc238..279c124048f2 100644
--- a/Documentation/userspace-api/ioctl/ioctl-number.rst
+++ b/Documentation/userspace-api/ioctl/ioctl-number.rst
@@ -375,6 +375,7 @@ Code  Seq#    Include File                                           Comments
 0xB8  01-02  uapi/misc/mrvl_cn10k_dpi.h                              Marvell CN10K DPI driver
 0xB8  all    uapi/linux/mshv.h                                       Microsoft Hyper-V /dev/mshv driver
                                                                      <mailto:linux-hyperv@vger.kernel.org>
+0xBA  all    uapi/linux/liveupdate.h                                 <mailto:pasha.tatashin@soleen.com>
 0xC0  00-0F  linux/usb/iowarrior.h
 0xCA  00-0F  uapi/misc/cxl.h                                         Dead since 6.15
 0xCA  10-2F  uapi/misc/ocxl.h
diff --git a/drivers/misc/liveupdate/Makefile b/drivers/misc/liveupdate/Makefile
index b4cdd162574f..7a0cd08919c9 100644
--- a/drivers/misc/liveupdate/Makefile
+++ b/drivers/misc/liveupdate/Makefile
@@ -1,4 +1,5 @@
 # SPDX-License-Identifier: GPL-2.0
+obj-y					+= luo_ioctl.o
 obj-y					+= luo_core.o
 obj-y					+= luo_files.o
 obj-y					+= luo_subsystems.o
diff --git a/drivers/misc/liveupdate/luo_ioctl.c b/drivers/misc/liveupdate/luo_ioctl.c
new file mode 100644
index 000000000000..76c687ff650b
--- /dev/null
+++ b/drivers/misc/liveupdate/luo_ioctl.c
@@ -0,0 +1,199 @@
+// SPDX-License-Identifier: GPL-2.0
+
+/*
+ * Copyright (c) 2025, Google LLC.
+ * Pasha Tatashin <pasha.tatashin@soleen.com>
+ */
+
+/**
+ * DOC: LUO ioctl Interface
+ *
+ * Implements the ioctl-based user-space control interface for LUO.
+ * It registers a misc character device, typically found at ``/dev/liveupdate``,
+ * which allows privileged userspace applications (requiring %CAP_SYS_ADMIN) to
+ * manage and monitor the LUO state machine and associated resources like
+ * preservable file descriptors.
+ */
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include <linux/errno.h>
+#include <linux/file.h>
+#include <linux/fs.h>
+#include <linux/init.h>
+#include <linux/kernel.h>
+#include <linux/miscdevice.h>
+#include <linux/module.h>
+#include <linux/uaccess.h>
+#include <uapi/linux/liveupdate.h>
+#include "luo_internal.h"
+
+static int luo_ioctl_fd_preserve(struct liveupdate_fd *luo_fd)
+{
+	struct file *file;
+	int ret;
+
+	file = fget(luo_fd->fd);
+	if (!file) {
+		pr_err("Bad file descriptor\n");
+		return -EBADF;
+	}
+
+	ret = luo_register_file(&luo_fd->token, file);
+	if (ret)
+		fput(file);
+
+	return ret;
+}
+
+static int luo_ioctl_fd_unpreserve(u64 token)
+{
+	return luo_unregister_file(token);
+}
+
+static int luo_ioctl_fd_restore(struct liveupdate_fd *luo_fd)
+{
+	struct file *file;
+	int ret;
+	int fd;
+
+	fd = get_unused_fd_flags(O_CLOEXEC);
+	if (fd < 0) {
+		pr_err("Failed to allocate new fd: %d\n", fd);
+		return fd;
+	}
+
+	ret = luo_retrieve_file(luo_fd->token, &file);
+	if (ret < 0) {
+		put_unused_fd(fd);
+
+		return ret;
+	}
+
+	fd_install(fd, file);
+	luo_fd->fd = fd;
+
+	return 0;
+}
+
+static int luo_open(struct inode *inodep, struct file *filep)
+{
+	if (!capable(CAP_SYS_ADMIN))
+		return -EACCES;
+
+	if (filep->f_flags & O_EXCL)
+		return -EINVAL;
+
+	return 0;
+}
+
+static long luo_ioctl(struct file *filep, unsigned int cmd, unsigned long arg)
+{
+	void __user *argp = (void __user *)arg;
+	struct liveupdate_fd luo_fd;
+	enum liveupdate_state state;
+	int ret = 0;
+	u64 token;
+
+	if (_IOC_TYPE(cmd) != LIVEUPDATE_IOCTL_TYPE)
+		return -ENOTTY;
+
+	switch (cmd) {
+	case LIVEUPDATE_IOCTL_GET_STATE:
+		state = READ_ONCE(luo_state);
+		if (copy_to_user(argp, &state, sizeof(state)))
+			ret = -EFAULT;
+		break;
+
+	case LIVEUPDATE_IOCTL_EVENT_PREPARE:
+		ret = luo_prepare();
+		break;
+
+	case LIVEUPDATE_IOCTL_EVENT_FREEZE:
+		ret = luo_freeze();
+		break;
+
+	case LIVEUPDATE_IOCTL_EVENT_FINISH:
+		ret = luo_finish();
+		break;
+
+	case LIVEUPDATE_IOCTL_EVENT_CANCEL:
+		ret = luo_cancel();
+		break;
+
+	case LIVEUPDATE_IOCTL_FD_PRESERVE:
+		if (copy_from_user(&luo_fd, argp, sizeof(luo_fd))) {
+			ret = -EFAULT;
+			break;
+		}
+
+		ret = luo_ioctl_fd_preserve(&luo_fd);
+		if (!ret && copy_to_user(argp, &luo_fd, sizeof(luo_fd)))
+			ret = -EFAULT;
+		break;
+
+	case LIVEUPDATE_IOCTL_FD_UNPRESERVE:
+		if (copy_from_user(&token, argp, sizeof(u64))) {
+			ret = -EFAULT;
+			break;
+		}
+
+		ret = luo_ioctl_fd_unpreserve(token);
+		break;
+
+	case LIVEUPDATE_IOCTL_FD_RESTORE:
+		if (copy_from_user(&luo_fd, argp, sizeof(luo_fd))) {
+			ret = -EFAULT;
+			break;
+		}
+
+		ret = luo_ioctl_fd_restore(&luo_fd);
+		if (!ret && copy_to_user(argp, &luo_fd, sizeof(luo_fd)))
+			ret = -EFAULT;
+		break;
+
+	default:
+		pr_warn("ioctl: unknown command nr: 0x%x\n", _IOC_NR(cmd));
+		ret = -ENOTTY;
+		break;
+	}
+
+	return ret;
+}
+
+static const struct file_operations fops = {
+	.owner          = THIS_MODULE,
+	.open           = luo_open,
+	.unlocked_ioctl = luo_ioctl,
+};
+
+static struct miscdevice liveupdate_miscdev = {
+	.minor = MISC_DYNAMIC_MINOR,
+	.name  = "liveupdate",
+	.fops  = &fops,
+};
+
+static int __init liveupdate_init(void)
+{
+	int err;
+
+	err = misc_register(&liveupdate_miscdev);
+	if (err < 0) {
+		pr_err("Failed to register misc device '%s': %d\n",
+		       liveupdate_miscdev.name, err);
+	}
+
+	return err;
+}
+module_init(liveupdate_init);
+
+static void __exit liveupdate_exit(void)
+{
+	misc_deregister(&liveupdate_miscdev);
+}
+module_exit(liveupdate_exit);
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Pasha Tatashin");
+MODULE_DESCRIPTION("Live Update Orchestrator");
+MODULE_VERSION("0.1");
diff --git a/include/linux/liveupdate.h b/include/linux/liveupdate.h
index 7afe0aac5ce4..ff4f2ab5c673 100644
--- a/include/linux/liveupdate.h
+++ b/include/linux/liveupdate.h
@@ -10,6 +10,7 @@
 #include <linux/bug.h>
 #include <linux/types.h>
 #include <linux/list.h>
+#include <uapi/linux/liveupdate.h>
 
 /**
  * enum liveupdate_event - Events that trigger live update callbacks.
@@ -53,39 +54,6 @@ enum liveupdate_event {
 	LIVEUPDATE_CANCEL,
 };
 
-/**
- * enum liveupdate_state - Defines the possible states of the live update
- * orchestrator.
- * @LIVEUPDATE_STATE_NORMAL:         Default state, no live update in progress.
- * @LIVEUPDATE_STATE_PREPARED:       Live update is prepared for reboot; the
- *                                   LIVEUPDATE_PREPARE callbacks have completed
- *                                   successfully.
- *                                   Devices might operate in a limited state
- *                                   for example the participating devices might
- *                                   not be allowed to unbind, and also the
- *                                   setting up of new DMA mappings might be
- *                                   disabled in this state.
- * @LIVEUPDATE_STATE_FROZEN:         The final reboot event
- *                                   (%LIVEUPDATE_FREEZE) has been sent, and the
- *                                   system is performing its final state saving
- *                                   within the "blackout window". User
- *                                   workloads must be suspended. The actual
- *                                   reboot (kexec) into the next kernel is
- *                                   imminent.
- * @LIVEUPDATE_STATE_UPDATED:        The system has rebooted into the next
- *                                   kernel via live update the system is now
- *                                   running the next kernel, awaiting the
- *                                   finish event.
- *
- * These states track the progress and outcome of a live update operation.
- */
-enum liveupdate_state  {
-	LIVEUPDATE_STATE_NORMAL = 0,
-	LIVEUPDATE_STATE_PREPARED = 1,
-	LIVEUPDATE_STATE_FROZEN = 2,
-	LIVEUPDATE_STATE_UPDATED = 3,
-};
-
 /* Forward declaration needed if definition isn't included */
 struct file;
 
diff --git a/include/uapi/linux/liveupdate.h b/include/uapi/linux/liveupdate.h
new file mode 100644
index 000000000000..c673d08a29ea
--- /dev/null
+++ b/include/uapi/linux/liveupdate.h
@@ -0,0 +1,300 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+
+/*
+ * Userspace interface for /dev/liveupdate
+ * Live Update Orchestrator
+ *
+ * Copyright (c) 2025, Google LLC.
+ * Pasha Tatashin <pasha.tatashin@soleen.com>
+ */
+
+#ifndef _UAPI_LIVEUPDATE_H
+#define _UAPI_LIVEUPDATE_H
+
+#include <linux/ioctl.h>
+#include <linux/types.h>
+
+/**
+ * enum liveupdate_state - Defines the possible states of the live update
+ * orchestrator.
+ * @LIVEUPDATE_STATE_NORMAL:         Default state, no live update in progress.
+ * @LIVEUPDATE_STATE_PREPARED:       Live update is prepared for reboot; the
+ *                                   LIVEUPDATE_PREPARE callbacks have completed
+ *                                   successfully.
+ *                                   Devices might operate in a limited state;
+ *                                   for example, participating devices might
+ *                                   not be allowed to unbind, and setting up
+ *                                   new DMA mappings might be disabled in this
+ *                                   state.
+ * @LIVEUPDATE_STATE_FROZEN:         The final reboot event
+ *                                   (%LIVEUPDATE_FREEZE) has been sent, and the
+ *                                   system is performing its final state saving
+ *                                   within the "blackout window". User
+ *                                   workloads must be suspended. The actual
+ *                                   reboot (kexec) into the next kernel is
+ *                                   imminent.
+ * @LIVEUPDATE_STATE_UPDATED:        The system has rebooted into the next
+ *                                   kernel via live update. It is now
+ *                                   running the next kernel and awaiting the
+ *                                   finish event.
+ *
+ * These states track the progress and outcome of a live update operation.
+ */
+enum liveupdate_state  {
+	LIVEUPDATE_STATE_NORMAL = 0,
+	LIVEUPDATE_STATE_PREPARED = 1,
+	LIVEUPDATE_STATE_FROZEN = 2,
+	LIVEUPDATE_STATE_UPDATED = 3,
+};
+
+/**
+ * struct liveupdate_fd - Holds parameters for preserving and restoring file
+ * descriptors across live update.
+ * @fd:    Input for %LIVEUPDATE_IOCTL_FD_PRESERVE: The user-space file
+ *         descriptor to be preserved.
+ *         Output for %LIVEUPDATE_IOCTL_FD_RESTORE: The new file descriptor
+ *         representing the fully restored kernel resource.
+ * @flags: Unused, reserved for future expansion, must be set to 0.
+ * @token: Output for %LIVEUPDATE_IOCTL_FD_PRESERVE: An opaque, unique token
+ *         generated by the kernel representing the successfully preserved
+ *         resource state.
+ *         Input for %LIVEUPDATE_IOCTL_FD_RESTORE: The token previously
+ *         returned by the preserve ioctl for the resource to be restored.
+ *
+ * This structure is used as the argument for the %LIVEUPDATE_IOCTL_FD_PRESERVE
+ * and %LIVEUPDATE_IOCTL_FD_RESTORE ioctls. These ioctls allow specific types
+ * of file descriptors (for example memfd, kvm, iommufd, and VFIO) to have their
+ * underlying kernel state preserved across a live update cycle.
+ *
+ * To preserve an FD, user space passes this struct to
+ * %LIVEUPDATE_IOCTL_FD_PRESERVE with the @fd field set. On success, the
+ * kernel populates the @token field.
+ *
+ * After the live update transition, user space passes the struct populated with
+ * the *same* @token to %LIVEUPDATE_IOCTL_FD_RESTORE. The kernel uses the @token
+ * to find the preserved state and, on success, populates the @fd field with a
+ * new file descriptor referring to the fully restored resource.
+ */
+struct liveupdate_fd {
+	int		fd;
+	__u32		flags;
+	__u64		token;
+};
+
+/* The ioctl type, documented in ioctl-number.rst */
+#define LIVEUPDATE_IOCTL_TYPE		0xBA
+
+/**
+ * LIVEUPDATE_IOCTL_FD_PRESERVE - Validate and initiate preservation for a file
+ * descriptor.
+ *
+ * Argument: Pointer to &struct liveupdate_fd.
+ *
+ * User sets the @fd field identifying the file descriptor to preserve
+ * (e.g., memfd, kvm, iommufd, VFIO). The kernel validates if this FD type
+ * and its dependencies are supported for preservation. If validation passes,
+ * the kernel marks the FD internally and *initiates the process* of preparing
+ * its state for saving. The actual snapshotting of the state typically occurs
+ * during the subsequent %LIVEUPDATE_IOCTL_EVENT_PREPARE execution phase, though
+ * some finalization might occur during %LIVEUPDATE_IOCTL_EVENT_FREEZE.
+ * On successful validation and initiation, the kernel populates the @token
+ * field with an opaque identifier representing the resource being preserved.
+ * This token confirms the FD is targeted for preservation and is required for
+ * the subsequent %LIVEUPDATE_IOCTL_FD_RESTORE call after the live update. This
+ * is an I/O read/write operation.
+ *
+ * Return: 0 on success (validation passed, preservation initiated), negative
+ * error code on failure (e.g., unsupported FD type, dependency issue,
+ * validation failed).
+ */
+#define LIVEUPDATE_IOCTL_FD_PRESERVE					\
+	_IOWR(LIVEUPDATE_IOCTL_TYPE, 0x00, struct liveupdate_fd)
+
+/**
+ * LIVEUPDATE_IOCTL_FD_UNPRESERVE - Remove a file descriptor from the
+ * preservation list.
+ *
+ * Argument: Pointer to __u64 token.
+ *
+ * Allows user space to explicitly remove a file descriptor from the set of
+ * items marked as potentially preservable. User space provides a pointer to the
+ * __u64 @token that was previously returned by a successful
+ * %LIVEUPDATE_IOCTL_FD_PRESERVE call (potentially from a prior, possibly
+ * cancelled, live update attempt). The kernel reads the token value from the
+ * provided user-space address.
+ *
+ * On success, the kernel removes the corresponding entry (identified by the
+ * token value read from the user pointer) from its internal preservation list.
+ * The provided @token (representing the now-removed entry) becomes invalid
+ * after this call.
+ *
+ * This operation can only be called when the live update orchestrator is in the
+ * %LIVEUPDATE_STATE_NORMAL state.
+ *
+ * This is an I/O write operation (_IOW), signifying the kernel reads data (the
+ * token) from the user-provided pointer.
+ *
+ * Return: 0 on success, negative error code on failure (e.g., -EBUSY or -EINVAL
+ * if not in %LIVEUPDATE_STATE_NORMAL, bad address provided, invalid token value
+ * read, token not found).
+ */
+#define LIVEUPDATE_IOCTL_FD_UNPRESERVE					\
+	_IOW(LIVEUPDATE_IOCTL_TYPE, 0x01, __u64)
+
+/**
+ * LIVEUPDATE_IOCTL_FD_RESTORE - Restore a previously preserved file descriptor.
+ *
+ * Argument: Pointer to &struct liveupdate_fd.
+ *
+ * User sets the @token field to the value obtained from a successful
+ * %LIVEUPDATE_IOCTL_FD_PRESERVE call before the live update. On success,
+ * the kernel restores the state (saved during the PREPARE/FREEZE phases)
+ * associated with the token and populates the @fd field with a new file
+ * descriptor referencing the restored resource in the current (new) kernel.
+ * This operation must be performed *before* signaling completion via
+ * %LIVEUPDATE_IOCTL_EVENT_FINISH. This is an I/O read/write operation.
+ *
+ * Return: 0 on success, negative error code on failure (e.g., invalid token).
+ */
+#define LIVEUPDATE_IOCTL_FD_RESTORE					\
+	_IOWR(LIVEUPDATE_IOCTL_TYPE, 0x02, struct liveupdate_fd)
+
+/**
+ * LIVEUPDATE_IOCTL_GET_STATE - Query the current state of the live update
+ * orchestrator.
+ *
+ * Argument: Pointer to &enum liveupdate_state.
+ *
+ * The kernel fills the enum value pointed to by the argument with the current
+ * state of the live update subsystem. Possible states are:
+ *
+ * - %LIVEUPDATE_STATE_NORMAL:   Default state; no live update operation is
+ *                               currently in progress.
+ * - %LIVEUPDATE_STATE_PREPARED: The preparation phase (triggered by
+ *                               %LIVEUPDATE_IOCTL_EVENT_PREPARE) has completed
+ *                               successfully. Some device operations (e.g.,
+ *                               unbinding, new DMA mappings) might be
+ *                               restricted in this state.
+ * - %LIVEUPDATE_STATE_FROZEN:   The %LIVEUPDATE_FREEZE event has been sent
+ *                               and the reboot (kexec) into the new kernel
+ *                               is imminent.
+ * - %LIVEUPDATE_STATE_UPDATED:  The system has successfully rebooted into the
+ *                               new kernel via live update and is awaiting
+ *                               the completion signal from user space via
+ *                               %LIVEUPDATE_IOCTL_EVENT_FINISH after
+ *                               restoration tasks are done.
+ *
+ * See the definition of &enum liveupdate_state for more details on each state.
+ * This is an I/O read operation (kernel writes to the user-provided pointer).
+ *
+ * Return: 0 on success, negative error code on failure.
+ */
+#define LIVEUPDATE_IOCTL_GET_STATE					\
+	_IOR(LIVEUPDATE_IOCTL_TYPE, 0x03, enum liveupdate_state)
+
+/**
+ * LIVEUPDATE_IOCTL_EVENT_PREPARE - Initiate preparation phase and trigger state
+ * saving.
+ *
+ * Argument: None.
+ *
+ * Initiates the live update preparation phase. This action corresponds to
+ * the internal %LIVEUPDATE_PREPARE kernel event and can only be triggered
+ * through this ioctl; the sysfs interface is read-only. It typically
+ * triggers the main state saving process for items marked via the PRESERVE
+ * ioctls. This occurs *before* the main "blackout window", while user
+ * applications (e.g., VMs) may still be running. Kernel subsystems
+ * receiving the %LIVEUPDATE_PREPARE event should serialize necessary state.
+ * This command does not transfer data.
+ *
+ * Return: 0 on success, negative error code on failure. Transitions state
+ * towards %LIVEUPDATE_STATE_PREPARED on success.
+ */
+#define LIVEUPDATE_IOCTL_EVENT_PREPARE					\
+	_IO(LIVEUPDATE_IOCTL_TYPE, 0x04)
+
+/**
+ * LIVEUPDATE_IOCTL_EVENT_FREEZE - Notify subsystems of imminent reboot
+ * transition.
+ *
+ * Argument: None.
+ *
+ * Notifies the live update subsystem and associated components that the kernel
+ * is about to execute the final reboot transition into the new kernel (e.g.,
+ * via kexec). This action triggers the internal %LIVEUPDATE_FREEZE kernel
+ * event. This event provides subsystems a final, brief opportunity (within the
+ * "blackout window") to save critical state or perform last-moment quiescing.
+ * Any remaining or deferred state saving for items marked via the PRESERVE
+ * ioctls typically occurs in response to the %LIVEUPDATE_FREEZE event.
+ *
+ * This ioctl should only be called when the system is in the
+ * %LIVEUPDATE_STATE_PREPARED state. This command does not transfer data.
+ *
+ * Return: 0 if the kernel successfully processes the notification (the reboot
+ * is then expected to follow). Returns a negative error code if the
+ * notification fails or the system is not in %LIVEUPDATE_STATE_PREPARED.
+ */
+#define LIVEUPDATE_IOCTL_EVENT_FREEZE					\
+	_IO(LIVEUPDATE_IOCTL_TYPE, 0x05)
+
+/**
+ * LIVEUPDATE_IOCTL_EVENT_CANCEL - Cancel the live update preparation phase.
+ *
+ * Argument: None.
+ *
+ * Notifies the live update subsystem to abort the preparation sequence
+ * potentially initiated by %LIVEUPDATE_IOCTL_EVENT_PREPARE. This action
+ * typically corresponds to the internal %LIVEUPDATE_CANCEL kernel event,
+ * which might also be triggered automatically if the PREPARE stage fails
+ * internally.
+ *
+ * When triggered, subsystems receiving the %LIVEUPDATE_CANCEL event should
+ * revert any state changes or actions taken specifically for the aborted
+ * prepare phase (e.g., discard partially serialized state). The kernel
+ * releases resources allocated specifically for this *aborted preparation
+ * attempt*.
+ *
+ * This operation cancels the current *attempt* to prepare for a live update
+ * but does **not** remove previously validated items from the internal list
+ * of potentially preservable resources. Consequently, preservation tokens
+ * previously generated by successful %LIVEUPDATE_IOCTL_FD_PRESERVE calls
+ * generally **remain valid** as identifiers for those potentially preservable
+ * resources. However, since the system state returns towards
+ * %LIVEUPDATE_STATE_NORMAL, user space must initiate a new live update sequence
+ * (starting with %LIVEUPDATE_IOCTL_EVENT_PREPARE) to proceed with an update
+ * using these (or other) tokens.
+ *
+ * This command does not transfer data. Kernel callbacks for the
+ * %LIVEUPDATE_CANCEL event must not fail.
+ *
+ * Return: 0 on success, negative error code on failure. Transitions state back
+ * towards %LIVEUPDATE_STATE_NORMAL on success.
+ */
+#define LIVEUPDATE_IOCTL_EVENT_CANCEL					\
+	_IO(LIVEUPDATE_IOCTL_TYPE, 0x06)
+
+/**
+ * LIVEUPDATE_IOCTL_EVENT_FINISH - Signal restoration completion and trigger
+ * cleanup.
+ *
+ * Argument: None.
+ *
+ * Signals that user space has completed all necessary restoration actions in
+ * the new kernel (after a live update reboot). This action corresponds to the
+ * internal %LIVEUPDATE_FINISH kernel event. The sysfs interface is read-only
+ * and cannot trigger it.
+ * Calling this ioctl triggers the cleanup phase: any resources that were
+ * successfully preserved but were *not* subsequently restored (reclaimed) via
+ * the RESTORE ioctls will have their preserved state discarded and associated
+ * kernel resources released. Involved devices may be reset. All desired
+ * restorations *must* be completed *before* this. Kernel callbacks for the
+ * %LIVEUPDATE_FINISH event must not fail. Successfully completing this phase
+ * transitions the system state from %LIVEUPDATE_STATE_UPDATED back to
+ * %LIVEUPDATE_STATE_NORMAL. This command does not transfer data.
+ *
+ * Return: 0 on success, negative error code on failure.
+ */
+#define LIVEUPDATE_IOCTL_EVENT_FINISH					\
+	_IO(LIVEUPDATE_IOCTL_TYPE, 0x07)
+
+#endif /* _UAPI_LIVEUPDATE_H */
-- 
2.49.0.1101.gccaa498523-goog



* [RFC v2 11/16] luo: luo_sysfs: add sysfs state monitoring
  2025-05-15 18:23 [RFC v2 00/16] Live Update Orchestrator Pasha Tatashin
                   ` (9 preceding siblings ...)
  2025-05-15 18:23 ` [RFC v2 10/16] luo: luo_ioctl: add ioctl interface Pasha Tatashin
@ 2025-05-15 18:23 ` Pasha Tatashin
  2025-06-05 16:20   ` Pratyush Yadav
  2025-05-15 18:23 ` [RFC v2 12/16] reboot: call liveupdate_reboot() before kexec Pasha Tatashin
                   ` (6 subsequent siblings)
  17 siblings, 1 reply; 102+ messages in thread
From: Pasha Tatashin @ 2025-05-15 18:23 UTC (permalink / raw)

Introduce a sysfs interface for the Live Update Orchestrator
under /sys/kernel/liveupdate/. This interface provides a way for
userspace tools and scripts to monitor the current state of the LUO
state machine.

The main feature is a read-only file, state, which displays the
current LUO state as a string ("normal", "prepared", "frozen",
"updated"). The interface uses sysfs_notify to allow userspace
listeners (e.g., via poll) to be efficiently notified of state changes.

ABI documentation for this new sysfs interface is added in
Documentation/ABI/testing/sysfs-kernel-liveupdate.

This read-only sysfs interface complements the main ioctl interface
provided by /dev/liveupdate, which handles LUO control operations and
resource management.

Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
---
 .../ABI/testing/sysfs-kernel-liveupdate       | 51 ++++++++++
 drivers/misc/liveupdate/Kconfig               | 18 ++++
 drivers/misc/liveupdate/Makefile              |  1 +
 drivers/misc/liveupdate/luo_core.c            |  1 +
 drivers/misc/liveupdate/luo_internal.h        |  6 ++
 drivers/misc/liveupdate/luo_sysfs.c           | 92 +++++++++++++++++++
 6 files changed, 169 insertions(+)
 create mode 100644 Documentation/ABI/testing/sysfs-kernel-liveupdate
 create mode 100644 drivers/misc/liveupdate/luo_sysfs.c

diff --git a/Documentation/ABI/testing/sysfs-kernel-liveupdate b/Documentation/ABI/testing/sysfs-kernel-liveupdate
new file mode 100644
index 000000000000..7631410a10c3
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-kernel-liveupdate
@@ -0,0 +1,51 @@
+What:		/sys/kernel/liveupdate/
+Date:		May 2025
+KernelVersion:	6.16.0
+Contact:	pasha.tatashin@soleen.com
+Description:	Directory containing interfaces to query the live
+		update orchestrator. Live update is the ability to reboot the
+		host kernel (e.g., via kexec, without a full power cycle) while
+		keeping specifically designated devices operational ("alive")
+		across the transition. After the new kernel boots, these devices
+		can be re-attached to their original workloads (e.g., virtual
+		machines) with their state preserved. This is particularly
+		useful, for example, for quick hypervisor updates without
+		terminating running virtual machines.
+
+
+What:		/sys/kernel/liveupdate/state
+Date:		May 2025
+KernelVersion:	6.16.0
+Contact:	pasha.tatashin@soleen.com
+Description:	Read-only file that displays the current state of the live
+		update orchestrator as a string. Possible values are:
+
+		"normal":	No live update operation is in progress. This is
+				the default operational state.
+
+		"prepared":	The live update preparation phase has completed
+				successfully (triggered via the PREPARE ioctl).
+				Kernel subsystems have been notified via
+				the %LIVEUPDATE_PREPARE event/callback and
+				should have initiated state saving. User
+				workloads (e.g., VMs) are generally still
+				running, but some operations (like device
+				unbinding or new DMA mappings) might be
+				restricted. The system is ready for the reboot
+				trigger.
+
+		"frozen":	The final reboot notification has been sent
+				(triggered via the FREEZE ioctl),
+				corresponding to the %LIVEUPDATE_FREEZE kernel
+				event. Subsystems have had their final chance to
+				save state. User workloads must be suspended.
+				The system is about to execute the reboot into
+				the new kernel (imminent kexec). This state
+				corresponds to the "blackout window".
+
+		"updated":	The system has successfully rebooted into the
+				new kernel via live update. Restoration of
+				preserved resources can now occur (typically via
+				ioctl commands). The system is awaiting the
+				final 'finish' signal after user space completes
+				restoration tasks.
diff --git a/drivers/misc/liveupdate/Kconfig b/drivers/misc/liveupdate/Kconfig
index a7424ceeba0b..09940f9a724a 100644
--- a/drivers/misc/liveupdate/Kconfig
+++ b/drivers/misc/liveupdate/Kconfig
@@ -25,3 +25,21 @@ config LIVEUPDATE
 	  running virtual machines.
 
 	  If unsure, say N.
+
+config LIVEUPDATE_SYSFS_API
+	bool "Live Update sysfs monitoring interface"
+	depends on SYSFS
+	depends on LIVEUPDATE
+	help
+	  Enable a sysfs interface for the Live Update Orchestrator
+	  at /sys/kernel/liveupdate/.
+
+	  This allows monitoring the LUO state ('normal', 'prepared',
+	  'frozen', 'updated') via the read-only 'state' file.
+
+	  This interface complements the primary /dev/liveupdate ioctl
+	  interface, which handles the full update process.
+	  This sysfs API may be useful for scripting, or userspace monitoring
+	  needed to coordinate application restarts and minimize downtime.
+
+	  If unsure, say N.
diff --git a/drivers/misc/liveupdate/Makefile b/drivers/misc/liveupdate/Makefile
index 7a0cd08919c9..190323c10220 100644
--- a/drivers/misc/liveupdate/Makefile
+++ b/drivers/misc/liveupdate/Makefile
@@ -3,3 +3,4 @@ obj-y					+= luo_ioctl.o
 obj-y					+= luo_core.o
 obj-y					+= luo_files.o
 obj-y					+= luo_subsystems.o
+obj-$(CONFIG_LIVEUPDATE_SYSFS_API)	+= luo_sysfs.o
diff --git a/drivers/misc/liveupdate/luo_core.c b/drivers/misc/liveupdate/luo_core.c
index ab1d76221fe2..1a5163c116a4 100644
--- a/drivers/misc/liveupdate/luo_core.c
+++ b/drivers/misc/liveupdate/luo_core.c
@@ -79,6 +79,7 @@ static inline bool is_current_luo_state(enum liveupdate_state expected_state)
 static void __luo_set_state(enum liveupdate_state state)
 {
 	WRITE_ONCE(luo_state, state);
+	luo_sysfs_notify();
 }
 
 static inline void luo_set_state(enum liveupdate_state state)
diff --git a/drivers/misc/liveupdate/luo_internal.h b/drivers/misc/liveupdate/luo_internal.h
index b7a0f31ddc99..bf1ba18722e2 100644
--- a/drivers/misc/liveupdate/luo_internal.h
+++ b/drivers/misc/liveupdate/luo_internal.h
@@ -34,6 +34,12 @@ int luo_retrieve_file(u64 token, struct file **file);
 int luo_register_file(u64 *token, struct file *file);
 int luo_unregister_file(u64 token);
 
+#ifdef CONFIG_LIVEUPDATE_SYSFS_API
+void luo_sysfs_notify(void);
+#else
+static inline void luo_sysfs_notify(void) {}
+#endif
+
 extern const char *const luo_state_str[];
 
 /* Get the current state as a string */
diff --git a/drivers/misc/liveupdate/luo_sysfs.c b/drivers/misc/liveupdate/luo_sysfs.c
new file mode 100644
index 000000000000..756b341dd886
--- /dev/null
+++ b/drivers/misc/liveupdate/luo_sysfs.c
@@ -0,0 +1,92 @@
+// SPDX-License-Identifier: GPL-2.0
+
+/*
+ * Copyright (c) 2025, Google LLC.
+ * Pasha Tatashin <pasha.tatashin@soleen.com>
+ */
+
+/**
+ * DOC: LUO sysfs interface
+ *
+ * Provides a sysfs interface at ``/sys/kernel/liveupdate/`` for monitoring LUO
+ * state.  Live update allows rebooting the kernel (via kexec) while preserving
+ * designated device state for attached workloads (e.g., VMs), useful for
+ * minimizing downtime during hypervisor updates.
+ *
+ * /sys/kernel/liveupdate/state
+ * ----------------------------
+ * - Permissions:  Read-only
+ * - Description:  Displays the current LUO state string.
+ * - Valid States:
+ *     @normal
+ *       Idle state.
+ *     @prepared
+ *       Preparation phase complete (triggered via 'prepare'). Resources
+ *       checked, state saving initiated via %LIVEUPDATE_PREPARE event.
+ *       Workloads mostly running but may be restricted. Ready for reboot
+ *       trigger.
+ *     @frozen
+ *       Final reboot notification sent (triggered via 'reboot'). Corresponds to
+ *       %LIVEUPDATE_FREEZE event. Final state saving. Workloads must be
+ *       suspended. System about to kexec ("blackout window").
+ *     @updated
+ *       New kernel booted via live update. Awaiting 'finish' signal.
+ *
+ * Userspace Interaction & Blackout Window Reduction
+ * -------------------------------------------------
+ * Userspace monitors the ``state`` file to coordinate actions:
+ *   - Suspend workloads before @frozen state is entered.
+ *   - Initiate resource restoration upon entering @updated state.
+ *   - Resume workloads after restoration, minimizing downtime.
+ */
+
+#include <linux/kobject.h>
+#include <linux/liveupdate.h>
+#include <linux/sysfs.h>
+#include "luo_internal.h"
+
+static bool luo_sysfs_initialized;
+
+#define LUO_DIR_NAME	"liveupdate"
+
+void luo_sysfs_notify(void)
+{
+	if (luo_sysfs_initialized)
+		sysfs_notify(kernel_kobj, LUO_DIR_NAME, "state");
+}
+
+/* Show the current live update state */
+static ssize_t state_show(struct kobject *kobj, struct kobj_attribute *attr,
+			  char *buf)
+{
+	return sysfs_emit(buf, "%s\n", LUO_STATE_STR);
+}
+
+static struct kobj_attribute state_attribute = __ATTR_RO(state);
+
+static struct attribute *luo_attrs[] = {
+	&state_attribute.attr,
+	NULL
+};
+
+static struct attribute_group luo_attr_group = {
+	.attrs = luo_attrs,
+	.name = LUO_DIR_NAME,
+};
+
+static int __init luo_init(void)
+{
+	int ret;
+
+	ret = sysfs_create_group(kernel_kobj, &luo_attr_group);
+	if (ret) {
+		pr_err("Failed to create group\n");
+		return ret;
+	}
+
+	luo_sysfs_initialized = true;
+	pr_info("Initialized\n");
+
+	return 0;
+}
+subsys_initcall(luo_init);
-- 
2.49.0.1101.gccaa498523-goog
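
Note on consuming the notification from userspace: luo_sysfs_notify() in
this patch calls sysfs_notify() on the 'state' attribute, so a monitor can
sleep in poll() rather than re-reading the file in a loop. Below is a minimal
sketch, assuming the usual sysfs poll convention (wait for POLLPRI/POLLERR,
then re-read from offset 0). The helper names luo_open_state(),
luo_wait_state_change() and luo_read_state() are hypothetical and not part of
this series.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <string.h>
#include <unistd.h>

/* Path taken from the Kconfig help text in this patch. */
#define LUO_SYSFS_STATE	"/sys/kernel/liveupdate/state"

/* Hypothetical helper: open the read-only state attribute. */
static int luo_open_state(void)
{
	return open(LUO_SYSFS_STATE, O_RDONLY);
}

/*
 * Hypothetical helper: block until the attribute is signalled.
 * Returns 1 if notified, 0 on timeout, -errno on error.
 */
static int luo_wait_state_change(int fd, int timeout_ms)
{
	struct pollfd pfd = { .fd = fd, .events = POLLPRI | POLLERR };
	int ret;

	ret = poll(&pfd, 1, timeout_ms);
	if (ret < 0)
		return -errno;

	return ret > 0;
}

/*
 * Hypothetical helper: re-read the attribute from offset 0 and strip the
 * trailing newline. Returns 0 on success, -errno or -EIO on failure.
 */
static int luo_read_state(int fd, char *buf, size_t len)
{
	ssize_t n;

	assert(buf && len > 1);

	if (lseek(fd, 0, SEEK_SET) < 0)
		return -errno;

	n = read(fd, buf, len - 1);
	if (n < 0)
		return -errno;
	if (n == 0)
		return -EIO;	/* an empty read is not a valid state */

	buf[n] = '\0';
	buf[strcspn(buf, "\n")] = '\0';

	return 0;
}
```

A monitor would typically call luo_read_state() once before the first
luo_wait_state_change(), since sysfs only reports changes that happen after
the attribute has been read.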



* [RFC v2 12/16] reboot: call liveupdate_reboot() before kexec
  2025-05-15 18:23 [RFC v2 00/16] Live Update Orchestrator Pasha Tatashin
                   ` (10 preceding siblings ...)
  2025-05-15 18:23 ` [RFC v2 11/16] luo: luo_sysfs: add sysfs state monitoring Pasha Tatashin
@ 2025-05-15 18:23 ` Pasha Tatashin
  2025-05-15 18:23 ` [RFC v2 13/16] luo: add selftests for subsystems un/registration Pasha Tatashin
                   ` (5 subsequent siblings)
  17 siblings, 0 replies; 102+ messages in thread
From: Pasha Tatashin @ 2025-05-15 18:23 UTC (permalink / raw)

Modify the reboot() syscall handler in kernel/reboot.c to call
liveupdate_reboot() when processing the LINUX_REBOOT_CMD_KEXEC
command.

This ensures that the Live Update Orchestrator is notified just
before the kernel executes the kexec jump. The liveupdate_reboot()
function triggers the final LIVEUPDATE_FREEZE event, allowing
participating subsystems to perform last-minute state saving within
the blackout window, and transitions the LUO state machine to FROZEN.

The call is placed immediately before kernel_kexec() to ensure LUO
finalization happens at the latest possible moment before the kernel
transition.

If liveupdate_reboot() returns an error (indicating a failure during
LUO finalization), the kexec operation is aborted to prevent proceeding
with an inconsistent state.

Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
---
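
Note: the ordering described in the commit message (LUO hook first, kexec
only if the hook succeeds) can be modeled in plain C as below. This is a
userspace sketch for illustration only; kexec_with_luo() and the stub_*
functions are hypothetical stand-ins for liveupdate_reboot() and
kernel_kexec(), not kernel code.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-in type for liveupdate_reboot() / kernel_kexec(). */
typedef int (*step_fn)(void);

/*
 * Model of the LINUX_REBOOT_CMD_KEXEC case after this patch: the LUO
 * hook runs first, and a non-zero return skips the kexec step entirely.
 */
static int kexec_with_luo(step_fn luo_reboot, step_fn do_kexec)
{
	int ret;

	assert(luo_reboot && do_kexec);

	ret = luo_reboot();
	if (ret)
		return ret;	/* abort: LUO finalization failed */

	return do_kexec();
}

/* Stubs used to exercise both paths. */
static int kexec_calls;

static int stub_luo_ok(void)
{
	return 0;
}

static int stub_luo_fail(void)
{
	return -EBUSY;
}

static int stub_kexec(void)
{
	kexec_calls++;
	return 0;
}
```

With stub_luo_fail() the kexec stub is never reached and the error is
returned to the caller, which matches the "abort to avoid proceeding with an
inconsistent state" behaviour described above.
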
 kernel/reboot.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/kernel/reboot.c b/kernel/reboot.c
index ec087827c85c..bdeb04a773db 100644
--- a/kernel/reboot.c
+++ b/kernel/reboot.c
@@ -13,6 +13,7 @@
 #include <linux/kexec.h>
 #include <linux/kmod.h>
 #include <linux/kmsg_dump.h>
+#include <linux/liveupdate.h>
 #include <linux/reboot.h>
 #include <linux/suspend.h>
 #include <linux/syscalls.h>
@@ -797,6 +798,9 @@ SYSCALL_DEFINE4(reboot, int, magic1, int, magic2, unsigned int, cmd,
 
 #ifdef CONFIG_KEXEC_CORE
 	case LINUX_REBOOT_CMD_KEXEC:
+		ret = liveupdate_reboot();
+		if (ret)
+			break;
 		ret = kernel_kexec();
 		break;
 #endif
-- 
2.49.0.1101.gccaa498523-goog



* [RFC v2 13/16] luo: add selftests for subsystems un/registration
  2025-05-15 18:23 [RFC v2 00/16] Live Update Orchestrator Pasha Tatashin
                   ` (11 preceding siblings ...)
  2025-05-15 18:23 ` [RFC v2 12/16] reboot: call liveupdate_reboot() before kexec Pasha Tatashin
@ 2025-05-15 18:23 ` Pasha Tatashin
  2025-05-26  8:52   ` Mike Rapoport
  2025-05-15 18:23 ` [RFC v2 14/16] selftests/liveupdate: add subsystem/state tests Pasha Tatashin
                   ` (4 subsequent siblings)
  17 siblings, 1 reply; 102+ messages in thread
From: Pasha Tatashin @ 2025-05-15 18:23 UTC (permalink / raw)

Introduce a self-test mechanism for the LUO to allow verification of
core subsystem management functionality. This is primarily intended
for developers and system integrators validating the live update
feature.

The tests are enabled via the new Kconfig option
CONFIG_LIVEUPDATE_SELFTESTS (default 'n') and are triggered through
a new ioctl command, LIVEUPDATE_IOCTL_SELFTESTS, added to the
/dev/liveupdate device node.

This ioctl accepts the following commands, defined in luo_selftests.h:
- LUO_CMD_SUBSYSTEM_REGISTER: Creates and registers a dummy LUO
  subsystem using the liveupdate_register_subsystem() function. It
  allocates a data page and copies initial data from userspace.
- LUO_CMD_SUBSYSTEM_UNREGISTER: Unregisters the specified dummy
  subsystem using the liveupdate_unregister_subsystem() function and
  cleans up associated test resources.
- LUO_CMD_SUBSYSTEM_GETDATA: Copies the data page associated with a
  registered test subsystem back to userspace, allowing verification of
  data potentially modified or preserved by test callbacks.

This provides a way to test the fundamental registration and
unregistration flows within the LUO framework from userspace without
requiring a full live update sequence.

Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
---
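
Note: for readers who want to poke at the interface by hand, here is a
minimal userspace sketch of how a selftest command is wrapped and issued.
The structures and macros are local mirrors of the definitions added by this
patch (an assumption made so the snippet builds without the patched headers;
real code would include <linux/liveupdate.h> and luo_selftests.h instead).
The helper luo_selftest() is hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/ioctl.h>

/* Local mirrors of the definitions in this patch (assumed layout). */
#define LUO_NAME_LENGTH			32
#define LUO_CMD_SUBSYSTEM_REGISTER	0
#define LUO_CMD_SUBSYSTEM_UNREGISTER	1
#define LUO_CMD_SUBSYSTEM_GETDATA	2

struct luo_arg_subsystem {
	char name[LUO_NAME_LENGTH];
	void *data_page;
};

struct liveupdate_selftest {
	uint64_t cmd;
	uint64_t arg;
};

#define LIVEUPDATE_IOCTL_TYPE		0xBA
#define LIVEUPDATE_IOCTL_SELFTESTS \
	_IOWR(LIVEUPDATE_IOCTL_TYPE, 0x08, struct liveupdate_selftest)

/*
 * Hypothetical helper: run one selftest command against the subsystem
 * called 'name'. 'page' is the page-sized user buffer used by REGISTER
 * and GETDATA; it may be NULL for UNREGISTER.
 */
static int luo_selftest(int fd, uint64_t cmd, const char *name, void *page)
{
	struct luo_arg_subsystem sub;
	struct liveupdate_selftest st;

	assert(name);

	memset(&sub, 0, sizeof(sub));
	snprintf(sub.name, sizeof(sub.name), "%s", name);
	sub.data_page = page;

	/* The outer struct carries the command and a pointer to 'sub'. */
	st.cmd = cmd;
	st.arg = (uint64_t)(uintptr_t)&sub;

	/* Returns 0 on success, -1 with errno set on failure. */
	return ioctl(fd, LIVEUPDATE_IOCTL_SELFTESTS, &st);
}
```

A register/getdata/unregister round trip is then three calls to
luo_selftest() on a descriptor opened from /dev/liveupdate, with the same
name each time.
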
 drivers/misc/liveupdate/Kconfig         |  15 ++
 drivers/misc/liveupdate/Makefile        |   1 +
 drivers/misc/liveupdate/luo_internal.h  |   9 +
 drivers/misc/liveupdate/luo_ioctl.c     |   4 +
 drivers/misc/liveupdate/luo_selftests.c | 283 ++++++++++++++++++++++++
 drivers/misc/liveupdate/luo_selftests.h |  23 ++
 include/uapi/linux/liveupdate.h         |  24 ++
 7 files changed, 359 insertions(+)
 create mode 100644 drivers/misc/liveupdate/luo_selftests.c
 create mode 100644 drivers/misc/liveupdate/luo_selftests.h

diff --git a/drivers/misc/liveupdate/Kconfig b/drivers/misc/liveupdate/Kconfig
index 09940f9a724a..304217e2fe95 100644
--- a/drivers/misc/liveupdate/Kconfig
+++ b/drivers/misc/liveupdate/Kconfig
@@ -43,3 +43,18 @@ config LIVEUPDATE_SYSFS_API
 	  needed to coordinate application restarts and minimize downtime.
 
 	  If unsure, say N.
+
+config LIVEUPDATE_SELFTESTS
+	bool "Live Update Orchestrator - self tests"
+	depends on LIVEUPDATE
+	help
+	  Say Y here to build self-tests for the LUO framework. When enabled,
+	  these tests can be initiated via the ioctl interface to help verify
+	  the core live update functionality.
+
+	  This option is primarily intended for developers working on the
+	  live update feature or for validation purposes during system
+	  integration.
+
+	  If you are unsure or are building a production kernel where size
+	  or attack surface is a concern, say N.
diff --git a/drivers/misc/liveupdate/Makefile b/drivers/misc/liveupdate/Makefile
index 190323c10220..1afa4059b99f 100644
--- a/drivers/misc/liveupdate/Makefile
+++ b/drivers/misc/liveupdate/Makefile
@@ -2,5 +2,6 @@
 obj-y					+= luo_ioctl.o
 obj-y					+= luo_core.o
 obj-y					+= luo_files.o
+obj-$(CONFIG_LIVEUPDATE_SELFTESTS)	+= luo_selftests.o
 obj-y					+= luo_subsystems.o
 obj-$(CONFIG_LIVEUPDATE_SYSFS_API)	+= luo_sysfs.o
diff --git a/drivers/misc/liveupdate/luo_internal.h b/drivers/misc/liveupdate/luo_internal.h
index bf1ba18722e2..45bf8398ab6e 100644
--- a/drivers/misc/liveupdate/luo_internal.h
+++ b/drivers/misc/liveupdate/luo_internal.h
@@ -40,6 +40,15 @@ void luo_sysfs_notify(void);
 static inline void luo_sysfs_notify(void) {}
 #endif
 
+#ifdef CONFIG_LIVEUPDATE_SELFTESTS
+int luo_ioctl_selftests(void __user *argp);
+#else
+static inline int luo_ioctl_selftests(void __user *argp)
+{
+	return -EOPNOTSUPP;
+}
+#endif
+
 extern const char *const luo_state_str[];
 
 /* Get the current state as a string */
diff --git a/drivers/misc/liveupdate/luo_ioctl.c b/drivers/misc/liveupdate/luo_ioctl.c
index 76c687ff650b..f92cea7eff82 100644
--- a/drivers/misc/liveupdate/luo_ioctl.c
+++ b/drivers/misc/liveupdate/luo_ioctl.c
@@ -152,6 +152,10 @@ static long luo_ioctl(struct file *filep, unsigned int cmd, unsigned long arg)
 			ret = -EFAULT;
 		break;
 
+	case LIVEUPDATE_IOCTL_SELFTESTS:
+		ret = luo_ioctl_selftests((void __user *)arg);
+		break;
+
 	default:
 		pr_warn("ioctl: unknown command nr: 0x%x\n", _IOC_NR(cmd));
 		ret = -ENOTTY;
diff --git a/drivers/misc/liveupdate/luo_selftests.c b/drivers/misc/liveupdate/luo_selftests.c
new file mode 100644
index 000000000000..7956e5c2371f
--- /dev/null
+++ b/drivers/misc/liveupdate/luo_selftests.c
@@ -0,0 +1,283 @@
+// SPDX-License-Identifier: GPL-2.0
+
+/*
+ * Copyright (c) 2025, Google LLC.
+ * Pasha Tatashin <pasha.tatashin@soleen.com>
+ */
+
+/**
+ * DOC: LUO Selftests
+ *
+ * We provide an ioctl-based selftest interface for the LUO. It provides a
+ * mechanism to test core LUO functionality, particularly the registration,
+ * unregistration, and data handling aspects of LUO subsystems, without
+ * requiring a full live update event sequence.
+ *
+ * The tests are intended primarily for developers working on the LUO framework
+ * or for validation purposes during system integration. This functionality is
+ * conditionally compiled based on the `CONFIG_LIVEUPDATE_SELFTESTS` Kconfig
+ * option and should typically be disabled in production kernels.
+ *
+ * Interface:
+ * The selftests are accessed via the `/dev/liveupdate` character device using
+ * the `LIVEUPDATE_IOCTL_SELFTESTS` ioctl command. The argument to the ioctl
+ * is a pointer to a `struct liveupdate_selftest` structure (defined in
+ * `uapi/linux/liveupdate.h`), which contains:
+ * - `cmd`: The specific selftest command to execute (e.g.,
+ * `LUO_CMD_SUBSYSTEM_REGISTER`).
+ * - `arg`: A pointer to a command-specific argument structure. For subsystem
+ * tests, this points to a `struct luo_arg_subsystem` (defined in
+ * `luo_selftests.h`).
+ *
+ * Commands:
+ * - `LUO_CMD_SUBSYSTEM_REGISTER`:
+ * Registers a new dummy LUO subsystem. It allocates kernel memory for test
+ * data, copies initial data from the user-provided `data_page`, sets up
+ * simple logging callbacks, and calls the core
+ * `liveupdate_register_subsystem()`
+ * function. Requires `arg` pointing to `struct luo_arg_subsystem`.
+ * - `LUO_CMD_SUBSYSTEM_UNREGISTER`:
+ * Unregisters a previously registered dummy subsystem identified by `name`.
+ * It calls the core `liveupdate_unregister_subsystem()` function and then
+ * frees the associated kernel memory and internal tracking structures.
+ * Requires `arg` pointing to `struct luo_arg_subsystem` (only `name` used).
+ * - `LUO_CMD_SUBSYSTEM_GETDATA`:
+ * Copies the content of the kernel data page associated with the specified
+ * dummy subsystem (`name`) back to the user-provided `data_page`. This allows
+ * userspace to verify the state of the data after potential test operations.
+ * Requires `arg` pointing to `struct luo_arg_subsystem`.
+ */
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include <linux/errno.h>
+#include <linux/gfp.h>
+#include <linux/kexec_handover.h>
+#include <linux/liveupdate.h>
+#include <linux/mutex.h>
+#include <linux/uaccess.h>
+#include <uapi/linux/liveupdate.h>
+#include "luo_internal.h"
+#include "luo_selftests.h"
+
+struct luo_subsystems {
+	struct liveupdate_subsystem handle;
+	char name[LUO_NAME_LENGTH];
+	void *data;
+	bool in_use;
+} luo_subsystems[LUO_MAX_SUBSYSTEMS];
+
+/* Only allow one selftest ioctl operation at a time */
+static DEFINE_MUTEX(luo_ioctl_mutex);
+
+static int luo_subsystem_prepare(void *arg, u64 *data)
+{
+	unsigned long i = (unsigned long)arg;
+	unsigned long phys_addr = __pa(luo_subsystems[i].data);
+	int ret;
+
+	ret = kho_preserve_phys(phys_addr, PAGE_SIZE);
+	if (ret)
+		return ret;
+
+	*data = phys_addr;
+	pr_info("Subsystem '%s' prepare data[%lx]\n",
+		luo_subsystems[i].name, phys_addr);
+
+	return 0;
+}
+
+static int luo_subsystem_freeze(void *arg, u64 *data)
+{
+	unsigned long i = (unsigned long)arg;
+
+	pr_info("Subsystem '%s' freeze data[%llx]\n",
+		luo_subsystems[i].name, *data);
+
+	return 0;
+}
+
+static void luo_subsystem_cancel(void *arg, u64 data)
+{
+	unsigned long i = (unsigned long)arg;
+
+	pr_info("Subsystem '%s' cancel data[%llx]\n",
+		luo_subsystems[i].name, data);
+}
+
+static void luo_subsystem_finish(void *arg, u64 data)
+{
+	unsigned long i = (unsigned long)arg;
+
+	pr_info("Subsystem '%s' finish data[%llx]\n",
+		luo_subsystems[i].name, data);
+}
+
+static int luo_subsystem_idx(char *name)
+{
+	int i;
+
+	for (i = 0; i < LUO_MAX_SUBSYSTEMS; i++) {
+		if (luo_subsystems[i].in_use &&
+		    !strcmp(luo_subsystems[i].name, name))
+			break;
+	}
+
+	if (i == LUO_MAX_SUBSYSTEMS) {
+		pr_warn("Subsystem with name '%s' is not registered\n", name);
+
+		return -EINVAL;
+	}
+
+	return i;
+}
+
+static void luo_put_and_free_subsystem(char *name)
+{
+	int i = luo_subsystem_idx(name);
+
+	if (i < 0)
+		return;
+
+	free_page((unsigned long)luo_subsystems[i].data);
+	luo_subsystems[i].in_use = false;
+}
+
+static int luo_get_and_alloc_subsystem(char *name, void __user *data,
+				       struct liveupdate_subsystem **hp)
+{
+	unsigned long page_addr, i;
+
+	page_addr = get_zeroed_page(GFP_KERNEL);
+	if (!page_addr) {
+		pr_warn("Failed to allocate memory for subsystem data\n");
+		return -ENOMEM;
+	}
+
+	if (copy_from_user((void *)page_addr, data, PAGE_SIZE)) {
+		free_page(page_addr);
+		return -EFAULT;
+	}
+
+	for (i = 0; i < LUO_MAX_SUBSYSTEMS; i++) {
+		if (!luo_subsystems[i].in_use)
+			break;
+	}
+
+	if (i == LUO_MAX_SUBSYSTEMS) {
+		pr_warn("Maximum number of subsystems registered\n");
+		free_page(page_addr);
+		return -ENOMEM;
+	}
+	luo_subsystems[i].in_use = true;
+	luo_subsystems[i].handle.prepare = luo_subsystem_prepare;
+	luo_subsystems[i].handle.freeze = luo_subsystem_freeze;
+	luo_subsystems[i].handle.cancel = luo_subsystem_cancel;
+	luo_subsystems[i].handle.finish = luo_subsystem_finish;
+	luo_subsystems[i].handle.name = luo_subsystems[i].name;
+	luo_subsystems[i].handle.arg = (void *)i;
+	strscpy(luo_subsystems[i].name, name, LUO_NAME_LENGTH);
+	luo_subsystems[i].data = (void *)page_addr;
+
+	*hp = &luo_subsystems[i].handle;
+
+	return 0;
+}
+
+static int luo_cmd_subsystem_unregister(void __user *argp)
+{
+	struct luo_arg_subsystem arg;
+	int ret, i;
+
+	if (copy_from_user(&arg, argp, sizeof(arg)))
+		return -EFAULT;
+
+	i = luo_subsystem_idx(arg.name);
+	if (i < 0)
+		return i;
+
+	ret = liveupdate_unregister_subsystem(&luo_subsystems[i].handle);
+	if (ret)
+		return ret;
+
+	luo_put_and_free_subsystem(arg.name);
+
+	return 0;
+}
+
+static int luo_cmd_subsystem_register(void __user *argp)
+{
+	struct liveupdate_subsystem *h;
+	struct luo_arg_subsystem arg;
+	int ret;
+
+	if (copy_from_user(&arg, argp, sizeof(arg)))
+		return -EFAULT;
+
+	ret = luo_get_and_alloc_subsystem(arg.name,
+					  (void __user *)arg.data_page, &h);
+	if (ret)
+		return ret;
+
+	ret = liveupdate_register_subsystem(h);
+	if (ret)
+		luo_put_and_free_subsystem(arg.name);
+
+	return ret;
+}
+
+static int luo_cmd_subsystem_getdata(void __user *argp)
+{
+	struct luo_arg_subsystem arg;
+	int i;
+
+	if (copy_from_user(&arg, argp, sizeof(arg)))
+		return -EFAULT;
+
+	i = luo_subsystem_idx(arg.name);
+	if (i < 0)
+		return i;
+
+	if (copy_to_user((void __user *)arg.data_page, luo_subsystems[i].data,
+			 PAGE_SIZE)) {
+		return -EFAULT;
+	}
+
+	return 0;
+}
+
+int luo_ioctl_selftests(void __user *argp)
+{
+	struct liveupdate_selftest luo_st;
+	void __user *cmd_argp;
+	int ret = 0;
+
+	if (copy_from_user(&luo_st, argp, sizeof(luo_st)))
+		return -EFAULT;
+
+	cmd_argp = (void __user *)luo_st.arg;
+
+	mutex_lock(&luo_ioctl_mutex);
+	switch (luo_st.cmd) {
+	case LUO_CMD_SUBSYSTEM_REGISTER:
+		ret = luo_cmd_subsystem_register(cmd_argp);
+		break;
+
+	case LUO_CMD_SUBSYSTEM_UNREGISTER:
+		ret = luo_cmd_subsystem_unregister(cmd_argp);
+		break;
+
+	case LUO_CMD_SUBSYSTEM_GETDATA:
+		ret = luo_cmd_subsystem_getdata(cmd_argp);
+		break;
+
+	default:
+		pr_warn("ioctl: unknown self-test command nr: 0x%llx\n",
+			luo_st.cmd);
+		ret = -ENOTTY;
+		break;
+	}
+	mutex_unlock(&luo_ioctl_mutex);
+
+	return ret;
+}
diff --git a/drivers/misc/liveupdate/luo_selftests.h b/drivers/misc/liveupdate/luo_selftests.h
new file mode 100644
index 000000000000..a30c6ce2273e
--- /dev/null
+++ b/drivers/misc/liveupdate/luo_selftests.h
@@ -0,0 +1,23 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+/*
+ * Copyright (c) 2025, Google LLC.
+ * Pasha Tatashin <pasha.tatashin@soleen.com>
+ */
+
+#ifndef _LINUX_LUO_SELFTESTS_H
+#define _LINUX_LUO_SELFTESTS_H
+
+/* Maximum number of subsystems the self-tests can register */
+#define LUO_MAX_SUBSYSTEMS		16
+#define LUO_NAME_LENGTH			32
+
+#define LUO_CMD_SUBSYSTEM_REGISTER	0
+#define LUO_CMD_SUBSYSTEM_UNREGISTER	1
+#define LUO_CMD_SUBSYSTEM_GETDATA	2
+struct luo_arg_subsystem {
+	char name[LUO_NAME_LENGTH];
+	void *data_page;
+};
+
+#endif /* _LINUX_LUO_SELFTESTS_H */
diff --git a/include/uapi/linux/liveupdate.h b/include/uapi/linux/liveupdate.h
index c673d08a29ea..e77a7b4e3448 100644
--- a/include/uapi/linux/liveupdate.h
+++ b/include/uapi/linux/liveupdate.h
@@ -81,6 +81,18 @@ struct liveupdate_fd {
 	__u64		token;
 };
 
+/**
+ * struct liveupdate_selftest - Holds directions for the self-test operations.
+ * @cmd:    Selftest command defined in luo_selftests.h.
+ * @arg:    Argument for the selftest command.
+ *
+ * This structure is used only for selftest purposes.
+ */
+struct liveupdate_selftest {
+	__u64		cmd;
+	__u64		arg;
+};
+
 /* The ioctl type, documented in ioctl-number.rst */
 #define LIVEUPDATE_IOCTL_TYPE		0xBA
 
@@ -297,4 +309,16 @@ struct liveupdate_fd {
 #define LIVEUPDATE_IOCTL_EVENT_FINISH					\
 	_IO(LIVEUPDATE_IOCTL_TYPE, 0x07)
 
+/**
+ * LIVEUPDATE_IOCTL_SELFTESTS - Interface for the LUO selftests
+ *
+ * Argument: Pointer to &struct liveupdate_selftest.
+ *
+ * Used by the LUO selftests; commands are declared in luo_selftests.h.
+ *
+ * Return: 0 on success, negative error code on failure (e.g., bad command).
+ */
+#define LIVEUPDATE_IOCTL_SELFTESTS					\
+	_IOWR(LIVEUPDATE_IOCTL_TYPE, 0x08, struct liveupdate_selftest)
+
 #endif /* _UAPI_LIVEUPDATE_H */
-- 
2.49.0.1101.gccaa498523-goog



* [RFC v2 14/16] selftests/liveupdate: add subsystem/state tests
  2025-05-15 18:23 [RFC v2 00/16] Live Update Orchestrator Pasha Tatashin
                   ` (12 preceding siblings ...)
  2025-05-15 18:23 ` [RFC v2 13/16] luo: add selftests for subsystems un/registration Pasha Tatashin
@ 2025-05-15 18:23 ` Pasha Tatashin
  2025-05-15 18:23 ` [RFC v2 15/16] docs: add luo documentation Pasha Tatashin
                   ` (3 subsequent siblings)
  17 siblings, 0 replies; 102+ messages in thread
From: Pasha Tatashin @ 2025-05-15 18:23 UTC (permalink / raw)

Introduce a new set of userspace selftests for the LUO. These tests
verify LUO functionality by using the kernel-side selftest ioctls
provided by the LUO module, focusing primarily on subsystem management
and basic LUO state transitions.

Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
---
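
Note: the sysfs checks in these tests come down to mapping the text read from
the state file onto a state index. Below is a standalone sketch of that
mapping. It differs slightly from get_sysfs_state() in that an empty read is
reported as -EIO instead of relying on errno. The enum and the helper
luo_parse_state() are hypothetical; the ordering is assumed to follow the
luo_state_str[] table in the test.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

/* Assumed to follow the order of luo_state_str[] in the selftest. */
enum luo_state { LUO_NORMAL, LUO_PREPARED, LUO_FROZEN, LUO_UPDATED };

static const char *const luo_state_names[] = {
	[LUO_NORMAL]   = "normal",
	[LUO_PREPARED] = "prepared",
	[LUO_FROZEN]   = "frozen",
	[LUO_UPDATED]  = "updated",
};

/*
 * Map 'len' bytes read from the state file to a state index.
 * Returns -EIO for empty or unrecognized input, so a short read can
 * never be mistaken for a valid state.
 */
static int luo_parse_state(const char *buf, size_t len)
{
	size_t i;

	assert(buf);

	if (len && buf[len - 1] == '\n')
		len--;			/* ignore the trailing newline */
	if (!len)
		return -EIO;

	for (i = 0; i < sizeof(luo_state_names) / sizeof(luo_state_names[0]); i++) {
		if (strlen(luo_state_names[i]) == len &&
		    !memcmp(buf, luo_state_names[i], len))
			return (int)i;
	}

	return -EIO;
}
```

Keeping the parsing separate from the file I/O also makes it easy to unit
test the string handling without a live /sys/kernel/liveupdate/state file.
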
 tools/testing/selftests/Makefile              |   1 +
 tools/testing/selftests/liveupdate/.gitignore |   1 +
 tools/testing/selftests/liveupdate/Makefile   |   7 +
 tools/testing/selftests/liveupdate/config     |   6 +
 .../testing/selftests/liveupdate/liveupdate.c | 440 ++++++++++++++++++
 5 files changed, 455 insertions(+)
 create mode 100644 tools/testing/selftests/liveupdate/.gitignore
 create mode 100644 tools/testing/selftests/liveupdate/Makefile
 create mode 100644 tools/testing/selftests/liveupdate/config
 create mode 100644 tools/testing/selftests/liveupdate/liveupdate.c

diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile
index 80fb84fa3cfc..1a96e806a5dd 100644
--- a/tools/testing/selftests/Makefile
+++ b/tools/testing/selftests/Makefile
@@ -52,6 +52,7 @@ TARGETS += kvm
 TARGETS += landlock
 TARGETS += lib
 TARGETS += livepatch
+TARGETS += liveupdate
 TARGETS += lkdtm
 TARGETS += lsm
 TARGETS += membarrier
diff --git a/tools/testing/selftests/liveupdate/.gitignore b/tools/testing/selftests/liveupdate/.gitignore
new file mode 100644
index 000000000000..af6e773cf98f
--- /dev/null
+++ b/tools/testing/selftests/liveupdate/.gitignore
@@ -0,0 +1 @@
+/liveupdate
diff --git a/tools/testing/selftests/liveupdate/Makefile b/tools/testing/selftests/liveupdate/Makefile
new file mode 100644
index 000000000000..2a573c36016e
--- /dev/null
+++ b/tools/testing/selftests/liveupdate/Makefile
@@ -0,0 +1,7 @@
+# SPDX-License-Identifier: GPL-2.0-only
+CFLAGS += -Wall -O2 -Wno-unused-function
+CFLAGS += $(KHDR_INCLUDES)
+
+TEST_GEN_PROGS += liveupdate
+
+include ../lib.mk
diff --git a/tools/testing/selftests/liveupdate/config b/tools/testing/selftests/liveupdate/config
new file mode 100644
index 000000000000..382c85b89570
--- /dev/null
+++ b/tools/testing/selftests/liveupdate/config
@@ -0,0 +1,6 @@
+CONFIG_KEXEC_FILE=y
+CONFIG_KEXEC_HANDOVER=y
+CONFIG_KEXEC_HANDOVER_DEBUG=y
+CONFIG_LIVEUPDATE=y
+CONFIG_LIVEUPDATE_SYSFS_API=y
+CONFIG_LIVEUPDATE_SELFTESTS=y
diff --git a/tools/testing/selftests/liveupdate/liveupdate.c b/tools/testing/selftests/liveupdate/liveupdate.c
new file mode 100644
index 000000000000..0007085e2b96
--- /dev/null
+++ b/tools/testing/selftests/liveupdate/liveupdate.c
@@ -0,0 +1,440 @@
+// SPDX-License-Identifier: GPL-2.0-only
+
+/*
+ * Copyright (c) 2025, Google LLC.
+ * Pasha Tatashin <pasha.tatashin@soleen.com>
+ */
+
+#include <errno.h>
+#include <fcntl.h>
+#include <stdbool.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <unistd.h>
+
+#include <sys/ioctl.h>
+#include <sys/mman.h>
+
+#include <linux/liveupdate.h>
+
+#include "../kselftest.h"
+#include "../kselftest_harness.h"
+#include "../../../../drivers/misc/liveupdate/luo_selftests.h"
+
+struct subsystem_info {
+	void *data_page;
+	void *verify_page;
+	char test_name[LUO_NAME_LENGTH];
+	bool registered;
+};
+
+FIXTURE(subsystem) {
+	enum liveupdate_state state;
+	int fd;
+	struct subsystem_info si[LUO_MAX_SUBSYSTEMS];
+};
+
+FIXTURE(state) {
+	enum liveupdate_state state;
+	int fd;
+};
+
+#define LUO_DEVICE	"/dev/liveupdate"
+#define LUO_SYSFS_STATE	"/sys/kernel/liveupdate/state"
+static size_t page_size;
+
+const char *const luo_state_str[] = {
+	[LIVEUPDATE_STATE_NORMAL]   = "normal",
+	[LIVEUPDATE_STATE_PREPARED] = "prepared",
+	[LIVEUPDATE_STATE_FROZEN]   = "frozen",
+	[LIVEUPDATE_STATE_UPDATED]  = "updated",
+};
+
+static int run_luo_selftest_cmd(int fd, __u64 cmd_code,
+				struct luo_arg_subsystem *subsys_arg)
+{
+	struct liveupdate_selftest k_arg;
+
+	if (fd < 0) {
+		errno = EBADF;
+		return -1;
+	}
+
+	k_arg.cmd = cmd_code;
+	k_arg.arg = (__u64)(unsigned long)subsys_arg;
+
+	return ioctl(fd, LIVEUPDATE_IOCTL_SELFTESTS, &k_arg);
+}
+
+static int __register_subsystem(int fd, char *name, void *data_page)
+{
+	struct luo_arg_subsystem subsys_arg;
+
+	memset(&subsys_arg, 0, sizeof(subsys_arg));
+	snprintf(subsys_arg.name, LUO_NAME_LENGTH, "%s", name);
+	subsys_arg.data_page = data_page;
+
+	return run_luo_selftest_cmd(fd, LUO_CMD_SUBSYSTEM_REGISTER,
+				    &subsys_arg);
+}
+
+static int __unregister_subsystem(int fd, char *name)
+{
+	struct luo_arg_subsystem subsys_arg;
+
+	memset(&subsys_arg, 0, sizeof(subsys_arg));
+	snprintf(subsys_arg.name, LUO_NAME_LENGTH, "%s", name);
+
+	return run_luo_selftest_cmd(fd, LUO_CMD_SUBSYSTEM_UNREGISTER,
+				    &subsys_arg);
+}
+
+static int get_sysfs_state(void)
+{
+	char buf[64];
+	ssize_t len;
+	int fd, i;
+
+	fd = open(LUO_SYSFS_STATE, O_RDONLY);
+	if (fd < 0) {
+		ksft_print_msg("Failed to open sysfs state file '%s': %s\n",
+			       LUO_SYSFS_STATE, strerror(errno));
+		return -errno;
+	}
+
+	len = read(fd, buf, sizeof(buf) - 1);
+	close(fd);
+
+	if (len <= 0) {
+		ksft_print_msg("Failed to read sysfs state file '%s': %s\n",
+			       LUO_SYSFS_STATE, strerror(errno));
+		return len < 0 ? -errno : -EIO;
+	}
+	if (buf[len - 1] == '\n')
+		buf[len - 1] = '\0';
+	else
+		buf[len] = '\0';
+
+	for (i = 0; i < ARRAY_SIZE(luo_state_str); i++) {
+		if (!strcmp(buf, luo_state_str[i]))
+			return i;
+	}
+
+	return -EIO;
+}
+
+FIXTURE_SETUP(state)
+{
+	page_size = sysconf(_SC_PAGE_SIZE);
+	self->fd = open(LUO_DEVICE, O_RDWR);
+	if (self->fd < 0) {
+		ksft_exit_skip("Setup: Cannot open %s (errno %d).\n",
+			       LUO_DEVICE, errno);
+	}
+	self->state = LIVEUPDATE_STATE_NORMAL;
+}
+
+FIXTURE_TEARDOWN(state)
+{
+	page_size = sysconf(_SC_PAGE_SIZE);
+	if (self->state != LIVEUPDATE_STATE_NORMAL)
+		ioctl(self->fd, LIVEUPDATE_IOCTL_EVENT_CANCEL, NULL);
+	close(self->fd);
+}
+
+FIXTURE_SETUP(subsystem)
+{
+	int i;
+
+	page_size = sysconf(_SC_PAGE_SIZE);
+	memset(&self->si, 0, sizeof(self->si));
+	self->fd = open(LUO_DEVICE, O_RDWR);
+	if (self->fd < 0) {
+		ksft_exit_skip("Setup: Cannot open %s (errno %d).\n",
+			       LUO_DEVICE, errno);
+	}
+	self->state = LIVEUPDATE_STATE_NORMAL;
+
+	for (i = 0; i < LUO_MAX_SUBSYSTEMS; i++) {
+		snprintf(self->si[i].test_name, LUO_NAME_LENGTH,
+			 "ksft_luo_%d.%d", getpid(), i);
+
+		self->si[i].data_page = mmap(NULL, page_size,
+					     PROT_READ | PROT_WRITE,
+					     MAP_PRIVATE | MAP_ANONYMOUS,
+					     -1, 0);
+
+		if (self->si[i].data_page == MAP_FAILED) {
+			ksft_print_msg("Setup: mmap data_page failed\n");
+			goto exit_fail;
+		}
+		memset(self->si[i].data_page, 'A' + i, page_size);
+
+		self->si[i].verify_page = mmap(NULL, page_size,
+					       PROT_READ | PROT_WRITE,
+					       MAP_PRIVATE | MAP_ANONYMOUS,
+					       -1, 0);
+		if (self->si[i].verify_page == MAP_FAILED) {
+			ksft_print_msg("Setup: mmap verify_page failed\n");
+			goto exit_fail;
+		}
+		memset(self->si[i].verify_page, 0, page_size);
+	}
+
+	return;
+exit_fail:
+	close(self->fd);
+
+	for (i = 0; i < LUO_MAX_SUBSYSTEMS; i++) {
+		void *page;
+
+		page = self->si[i].data_page;
+		if (page && page != MAP_FAILED)
+			munmap(page, page_size);
+
+		page = self->si[i].verify_page;
+		if (page && page != MAP_FAILED)
+			munmap(page, page_size);
+	}
+	ksft_exit_fail();
+}
+
+FIXTURE_TEARDOWN(subsystem)
+{
+	int i;
+
+	if (self->state != LIVEUPDATE_STATE_NORMAL)
+		ioctl(self->fd, LIVEUPDATE_IOCTL_EVENT_CANCEL, NULL);
+
+	for (i = 0; i < LUO_MAX_SUBSYSTEMS; i++) {
+		if (self->si[i].registered) {
+			struct luo_arg_subsystem subsys_arg;
+
+			memset(&subsys_arg, 0, sizeof(subsys_arg));
+			snprintf(subsys_arg.name, LUO_NAME_LENGTH, "%s",
+				 self->si[i].test_name);
+			subsys_arg.data_page = NULL;
+			run_luo_selftest_cmd(self->fd, LUO_CMD_SUBSYSTEM_UNREGISTER,
+					     &subsys_arg);
+		}
+		munmap(self->si[i].data_page, page_size);
+		munmap(self->si[i].verify_page, page_size);
+	}
+
+	close(self->fd);
+}
+
+TEST_F(state, normal)
+{
+	enum liveupdate_state state;
+	int ret;
+
+	ret = ioctl(self->fd, LIVEUPDATE_IOCTL_GET_STATE, &state);
+	ASSERT_EQ(0, ret);
+	ASSERT_EQ(state, LIVEUPDATE_STATE_NORMAL);
+}
+
+TEST_F(state, prepared)
+{
+	enum liveupdate_state state;
+	int ret;
+
+	ret = ioctl(self->fd, LIVEUPDATE_IOCTL_EVENT_PREPARE, NULL);
+	ASSERT_EQ(0, ret);
+	self->state = LIVEUPDATE_STATE_PREPARED;
+
+	ret = ioctl(self->fd, LIVEUPDATE_IOCTL_GET_STATE, &state);
+	ASSERT_EQ(0, ret);
+	ASSERT_EQ(state, LIVEUPDATE_STATE_PREPARED);
+
+	ret = ioctl(self->fd, LIVEUPDATE_IOCTL_EVENT_CANCEL, NULL);
+	ASSERT_EQ(0, ret);
+	self->state = LIVEUPDATE_STATE_NORMAL;
+
+	ret = ioctl(self->fd, LIVEUPDATE_IOCTL_GET_STATE, &state);
+	ASSERT_EQ(0, ret);
+	ASSERT_EQ(state, LIVEUPDATE_STATE_NORMAL);
+}
+
+TEST_F(state, sysfs_normal)
+{
+	int state = get_sysfs_state();
+
+	if (state < 0) {
+		if (state == -ENOENT || state == -EACCES) {
+			ksft_test_result_skip("Sysfs state file not accessible (%d)\n",
+					      state);
+			return;
+		}
+	}
+
+	ASSERT_EQ(LIVEUPDATE_STATE_NORMAL, state);
+}
+
+TEST_F(state, sysfs_prepared)
+{
+	int ret, state;
+
+	ret = ioctl(self->fd, LIVEUPDATE_IOCTL_EVENT_PREPARE, NULL);
+	ASSERT_EQ(0, ret);
+	self->state = LIVEUPDATE_STATE_PREPARED;
+
+	state = get_sysfs_state();
+	if (state < 0) {
+		if (state == -ENOENT || state == -EACCES) {
+			ksft_test_result_skip("Sysfs state file not accessible (%d)\n",
+					      state);
+			ioctl(self->fd, LIVEUPDATE_IOCTL_EVENT_CANCEL, NULL);
+			self->state = LIVEUPDATE_STATE_NORMAL;
+			return;
+		}
+	}
+	ASSERT_EQ(LIVEUPDATE_STATE_PREPARED, state);
+
+	ret = ioctl(self->fd, LIVEUPDATE_IOCTL_EVENT_CANCEL, NULL);
+	ASSERT_EQ(0, ret);
+	self->state = LIVEUPDATE_STATE_NORMAL;
+	state = get_sysfs_state();
+	ASSERT_EQ(LIVEUPDATE_STATE_NORMAL, state);
+}
+
+TEST_F(state, sysfs_frozen)
+{
+	int ret, state;
+
+	ret = ioctl(self->fd, LIVEUPDATE_IOCTL_EVENT_PREPARE, NULL);
+	ASSERT_EQ(0, ret);
+	self->state = LIVEUPDATE_STATE_PREPARED;
+
+	state = get_sysfs_state();
+	if (state < 0) {
+		if (state == -ENOENT || state == -EACCES) {
+			ksft_test_result_skip("Sysfs state file not accessible (%d)\n", state);
+			ioctl(self->fd, LIVEUPDATE_IOCTL_EVENT_CANCEL, NULL);
+			self->state = LIVEUPDATE_STATE_NORMAL;
+			return;
+		}
+	}
+	ASSERT_EQ(LIVEUPDATE_STATE_PREPARED, state);
+
+	ret = ioctl(self->fd, LIVEUPDATE_IOCTL_EVENT_FREEZE, NULL);
+	ASSERT_EQ(0, ret);
+	self->state = LIVEUPDATE_STATE_FROZEN;
+	state = get_sysfs_state();
+	ASSERT_EQ(LIVEUPDATE_STATE_FROZEN, state);
+
+	ret = ioctl(self->fd, LIVEUPDATE_IOCTL_EVENT_CANCEL, NULL);
+	ASSERT_EQ(0, ret);
+	self->state = LIVEUPDATE_STATE_NORMAL;
+	state = get_sysfs_state();
+	ASSERT_EQ(LIVEUPDATE_STATE_NORMAL, state);
+}
+
+TEST_F(subsystem, register_unregister)
+{
+	int ret;
+
+	ret = __register_subsystem(self->fd, self->si[0].test_name,
+				   self->si[0].data_page);
+	ASSERT_EQ(0, ret);
+	self->si[0].registered = true;
+
+	ret = __unregister_subsystem(self->fd, self->si[0].test_name);
+	ASSERT_EQ(0, ret);
+	self->si[0].registered = false;
+}
+
+TEST_F(subsystem, double_unregister)
+{
+	int ret;
+
+	ret = __register_subsystem(self->fd, self->si[0].test_name,
+				   self->si[0].data_page);
+	ASSERT_EQ(0, ret);
+	self->si[0].registered = true;
+
+	ret = __unregister_subsystem(self->fd, self->si[0].test_name);
+	ASSERT_EQ(0, ret);
+	self->si[0].registered = false;
+
+	ret = __unregister_subsystem(self->fd, self->si[0].test_name);
+	EXPECT_NE(0, ret);
+	EXPECT_TRUE(errno == EINVAL || errno == ENOENT);
+	self->si[0].registered = false;
+}
+
+TEST_F(subsystem, register_unregister_many)
+{
+	int ret;
+	int i;
+
+	for (i = 0; i < LUO_MAX_SUBSYSTEMS; i++) {
+		ret = __register_subsystem(self->fd, self->si[i].test_name,
+					   self->si[i].data_page);
+		ASSERT_EQ(0, ret);
+		self->si[i].registered = true;
+	}
+
+	for (i = 0; i < LUO_MAX_SUBSYSTEMS; i++) {
+		ret = __unregister_subsystem(self->fd, self->si[i].test_name);
+		ASSERT_EQ(0, ret);
+		self->si[i].registered = false;
+	}
+
+}
+
+TEST_F(subsystem, getdata_verify)
+{
+	enum liveupdate_state state;
+	int ret;
+	int i;
+
+	for (i = 0; i < LUO_MAX_SUBSYSTEMS; i++) {
+		ret = __register_subsystem(self->fd, self->si[i].test_name,
+					   self->si[i].data_page);
+		ASSERT_EQ(0, ret);
+		self->si[i].registered = true;
+	}
+
+	ret = ioctl(self->fd, LIVEUPDATE_IOCTL_EVENT_PREPARE, NULL);
+	ASSERT_EQ(0, ret);
+	self->state = LIVEUPDATE_STATE_PREPARED;
+
+	ret = ioctl(self->fd, LIVEUPDATE_IOCTL_GET_STATE, &state);
+	ASSERT_EQ(0, ret);
+	ASSERT_EQ(state, LIVEUPDATE_STATE_PREPARED);
+
+	for (i = 0; i < LUO_MAX_SUBSYSTEMS; i++) {
+		struct luo_arg_subsystem subsys_arg;
+
+		memset(&subsys_arg, 0, sizeof(subsys_arg));
+		snprintf(subsys_arg.name, LUO_NAME_LENGTH, "%s",
+			 self->si[i].test_name);
+		subsys_arg.data_page = self->si[i].verify_page;
+
+		ret = run_luo_selftest_cmd(self->fd, LUO_CMD_SUBSYSTEM_GETDATA,
+					   &subsys_arg);
+
+		ASSERT_EQ(0, ret);
+		ASSERT_EQ(0, memcmp(self->si[i].data_page,
+				    self->si[i].verify_page,
+				    page_size));
+	}
+
+	ret = ioctl(self->fd, LIVEUPDATE_IOCTL_EVENT_CANCEL, NULL);
+	ASSERT_EQ(0, ret);
+	self->state = LIVEUPDATE_STATE_NORMAL;
+
+	ret = ioctl(self->fd, LIVEUPDATE_IOCTL_GET_STATE, &state);
+	ASSERT_EQ(0, ret);
+	ASSERT_EQ(state, LIVEUPDATE_STATE_NORMAL);
+
+	for (i = 0; i < LUO_MAX_SUBSYSTEMS; i++) {
+		ret = __unregister_subsystem(self->fd, self->si[i].test_name);
+		ASSERT_EQ(0, ret);
+		self->si[i].registered = false;
+	}
+}
+
+TEST_HARNESS_MAIN
-- 
2.49.0.1101.gccaa498523-goog



* [RFC v2 15/16] docs: add luo documentation
  2025-05-15 18:23 [RFC v2 00/16] Live Update Orchestrator Pasha Tatashin
                   ` (13 preceding siblings ...)
  2025-05-15 18:23 ` [RFC v2 14/16] selftests/liveupdate: add subsystem/state tests Pasha Tatashin
@ 2025-05-15 18:23 ` Pasha Tatashin
  2025-05-26  9:00   ` Mike Rapoport
  2025-05-15 18:23 ` [RFC v2 16/16] MAINTAINERS: add liveupdate entry Pasha Tatashin
                   ` (2 subsequent siblings)
  17 siblings, 1 reply; 102+ messages in thread
From: Pasha Tatashin @ 2025-05-15 18:23 UTC (permalink / raw)
  To: pratyush, jasonmiu, graf, changyuanl, pasha.tatashin, rppt,
	dmatlack, rientjes, corbet, rdunlap, ilpo.jarvinen, kanie, ojeda,
	aliceryhl, masahiroy, akpm, tj, yoann.congal, mmaurer,
	roman.gushchin, chenridong, axboe, mark.rutland, jannh,
	vincent.guittot, hannes, dan.j.williams, david, joel.granados,
	rostedt, anna.schumaker, song, zhangguopeng, linux, linux-kernel,
	linux-doc, linux-mm, gregkh, tglx, mingo, bp, dave.hansen, x86,
	hpa, rafael, dakr, bartosz.golaszewski, cw00.choi, myungjoo.ham,
	yesanishhere, Jonathan.Cameron, quic_zijuhu, aleksander.lobakin,
	ira.weiny, andriy.shevchenko, leon, lukas, bhelgaas, wagi,
	djeffery, stuart.w.hayes, ptyadav

Add the main documentation file for the Live Update Orchestrator
subsystem at Documentation/admin-guide/liveupdate.rst.

The new file is included in the main
Documentation/admin-guide/index.rst table of contents.

Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
---
 Documentation/admin-guide/index.rst      |  1 +
 Documentation/admin-guide/liveupdate.rst | 62 ++++++++++++++++++++++++
 2 files changed, 63 insertions(+)
 create mode 100644 Documentation/admin-guide/liveupdate.rst

diff --git a/Documentation/admin-guide/index.rst b/Documentation/admin-guide/index.rst
index 259d79fbeb94..3f59ccf32760 100644
--- a/Documentation/admin-guide/index.rst
+++ b/Documentation/admin-guide/index.rst
@@ -95,6 +95,7 @@ likely to be of interest on almost any system.
    cgroup-v2
    cgroup-v1/index
    cpu-load
+   liveupdate
    mm/index
    module-signing
    namespaces/index
diff --git a/Documentation/admin-guide/liveupdate.rst b/Documentation/admin-guide/liveupdate.rst
new file mode 100644
index 000000000000..bff9475d2518
--- /dev/null
+++ b/Documentation/admin-guide/liveupdate.rst
@@ -0,0 +1,62 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==============================
+Live Update Orchestrator (LUO)
+==============================
+:Author: Pasha Tatashin <pasha.tatashin@soleen.com>
+
+.. kernel-doc:: drivers/misc/liveupdate/luo_core.c
+   :doc: Live Update Orchestrator (LUO)
+
+LUO Subsystems Participation
+============================
+.. kernel-doc:: drivers/misc/liveupdate/luo_subsystems.c
+   :doc: LUO Subsystems support
+
+LUO Preserving File Descriptors
+===============================
+.. kernel-doc:: drivers/misc/liveupdate/luo_files.c
+   :doc: LUO file descriptors
+
+LUO ioctl interface
+===================
+.. kernel-doc:: drivers/misc/liveupdate/luo_ioctl.c
+   :doc: LUO ioctl Interface
+
+LUO sysfs interface
+===================
+.. kernel-doc:: drivers/misc/liveupdate/luo_sysfs.c
+   :doc: LUO sysfs interface
+
+LUO selftests ioctl
+===================
+.. kernel-doc:: drivers/misc/liveupdate/luo_selftests.c
+   :doc: LUO Selftests
+
+ioctl uAPI
+==========
+.. kernel-doc:: include/uapi/linux/liveupdate.h
+
+Public API
+==========
+.. kernel-doc:: include/linux/liveupdate.h
+
+.. kernel-doc:: drivers/misc/liveupdate/luo_core.c
+   :export:
+
+.. kernel-doc:: drivers/misc/liveupdate/luo_subsystems.c
+   :export:
+
+.. kernel-doc:: drivers/misc/liveupdate/luo_files.c
+   :export:
+
+Internal API
+============
+.. kernel-doc:: drivers/misc/liveupdate/luo_core.c
+   :internal:
+
+.. kernel-doc:: drivers/misc/liveupdate/luo_subsystems.c
+   :internal:
+
+.. kernel-doc:: drivers/misc/liveupdate/luo_files.c
+   :internal:
-- 
2.49.0.1101.gccaa498523-goog



* [RFC v2 16/16] MAINTAINERS: add liveupdate entry
  2025-05-15 18:23 [RFC v2 00/16] Live Update Orchestrator Pasha Tatashin
                   ` (14 preceding siblings ...)
  2025-05-15 18:23 ` [RFC v2 15/16] docs: add luo documentation Pasha Tatashin
@ 2025-05-15 18:23 ` Pasha Tatashin
  2025-05-20  7:25 ` [RFC v2 00/16] Live Update Orchestrator Mike Rapoport
  2025-05-26  6:32 ` Mike Rapoport
  17 siblings, 0 replies; 102+ messages in thread
From: Pasha Tatashin @ 2025-05-15 18:23 UTC (permalink / raw)
  To: pratyush, jasonmiu, graf, changyuanl, pasha.tatashin, rppt,
	dmatlack, rientjes, corbet, rdunlap, ilpo.jarvinen, kanie, ojeda,
	aliceryhl, masahiroy, akpm, tj, yoann.congal, mmaurer,
	roman.gushchin, chenridong, axboe, mark.rutland, jannh,
	vincent.guittot, hannes, dan.j.williams, david, joel.granados,
	rostedt, anna.schumaker, song, zhangguopeng, linux, linux-kernel,
	linux-doc, linux-mm, gregkh, tglx, mingo, bp, dave.hansen, x86,
	hpa, rafael, dakr, bartosz.golaszewski, cw00.choi, myungjoo.ham,
	yesanishhere, Jonathan.Cameron, quic_zijuhu, aleksander.lobakin,
	ira.weiny, andriy.shevchenko, leon, lukas, bhelgaas, wagi,
	djeffery, stuart.w.hayes, ptyadav

Add a MAINTAINERS file entry for the new Live Update Orchestrator
introduced in previous patches.

Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
---
 MAINTAINERS | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index 4fc28b6674bd..327b2084ab79 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -13806,6 +13806,17 @@ F:	kernel/module/livepatch.c
 F:	samples/livepatch/
 F:	tools/testing/selftests/livepatch/
 
+LIVE UPDATE
+M:	Pasha Tatashin <pasha.tatashin@soleen.com>
+L:	linux-kernel@vger.kernel.org
+S:	Maintained
+F:	Documentation/ABI/testing/sysfs-kernel-liveupdate
+F:	Documentation/admin-guide/liveupdate.rst
+F:	drivers/misc/liveupdate/
+F:	include/linux/liveupdate.h
+F:	include/uapi/linux/liveupdate.h
+F:	tools/testing/selftests/liveupdate/
+
 LLC (802.2)
 L:	netdev@vger.kernel.org
 S:	Odd fixes
-- 
2.49.0.1101.gccaa498523-goog



* Re: [RFC v2 08/16] luo: luo_files: add infrastructure for FDs
  2025-05-15 18:23 ` [RFC v2 08/16] luo: luo_files: add infrastructure for FDs Pasha Tatashin
@ 2025-05-15 23:15   ` James Houghton
  2025-05-23 18:09     ` Pasha Tatashin
  2025-05-26  7:55   ` Mike Rapoport
  2025-06-05 15:56   ` Pratyush Yadav
  2 siblings, 1 reply; 102+ messages in thread
From: James Houghton @ 2025-05-15 23:15 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: pratyush, jasonmiu, graf, changyuanl, rppt, dmatlack, rientjes,
	corbet, rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl,
	masahiroy, akpm, tj, yoann.congal, mmaurer, roman.gushchin,
	chenridong, axboe, mark.rutland, jannh, vincent.guittot, hannes,
	dan.j.williams, david, joel.granados, rostedt, anna.schumaker,
	song, zhangguopeng, linux, linux-kernel, linux-doc, linux-mm,
	gregkh, tglx, mingo, bp, dave.hansen, x86, hpa, rafael, dakr,
	bartosz.golaszewski, cw00.choi, myungjoo.ham, yesanishhere,
	Jonathan.Cameron, quic_zijuhu, aleksander.lobakin, ira.weiny,
	andriy.shevchenko, leon, lukas, bhelgaas, wagi, djeffery,
	stuart.w.hayes, ptyadav

On Thu, May 15, 2025 at 11:23 AM Pasha Tatashin
<pasha.tatashin@soleen.com> wrote:
> +/**
> + * luo_retrieve_file - Find a registered file instance by its token.
> + * @token: The unique token of the file instance to retrieve.
> + * @file: Output parameter. On success (return value 0), this will point
> + * to the retrieved "struct file".
> + *
> + * Searches the global list for a &struct luo_file matching the @token. Uses a
> + * read lock, allowing concurrent retrievals.
> + *
> + * Return: 0 on success. Negative errno on failure.
> + */
> +int luo_retrieve_file(u64 token, struct file **file)
> +{
> +       struct luo_file *luo_file;
> +       int ret = 0;
> +
> +       luo_files_recreate_luo_files_xa_in();
> +       luo_state_read_enter();
> +       if (!liveupdate_state_updated()) {
> +               pr_warn("File can be retrieved only in updated state\n");
> +               luo_state_read_exit();
> +               return -EBUSY;
> +       }
> +
> +       luo_file = xa_load(&luo_files_xa_in, token);
> +       if (luo_file && !luo_file->reclaimed) {
> +               luo_file->reclaimed = true;

I haven't been able to pay too much attention to the series yet, and I
know this was posted as an RFC, so pardon my nit-picking.

I think you need to have xchg here for this not to be racy, so something like:

`if (luo_file && !xchg(&luo_file->reclaimed, true))`

Or maybe you meant to avoid this race some other way; IIUC,
luo_state_read_enter() is not sufficient.

Thanks!

> +               ret = luo_file->fs->retrieve(luo_file->fs->arg,
> +                                            luo_file->private_data,
> +                                            file);
> +               if (!ret)
> +                       luo_file->file = *file;
> +       } else if (luo_file && luo_file->reclaimed) {
> +               pr_err("The file descriptor for token %lld has already been retrieved\n",
> +                      token);
> +               ret = -EINVAL;
> +       } else {
> +               ret = -ENOENT;
> +       }
> +
> +       luo_state_read_exit();
> +
> +       return ret;
> +}
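
To make the xchg suggestion above concrete, here is a minimal userspace
sketch of the claim-once pattern. It uses C11 atomic_exchange() as a
stand-in for the kernel's xchg(); struct demo_file and demo_claim() are
hypothetical names for illustration and are not part of the patch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct luo_file; only the flag matters here. */
struct demo_file {
	atomic_bool reclaimed;
};

/*
 * Returns true for exactly one caller per file. atomic_exchange() stores
 * true and returns the previous value in one atomic step, so two racing
 * callers cannot both observe false.
 */
static bool demo_claim(struct demo_file *f)
{
	assert(f);
	return !atomic_exchange(&f->reclaimed, true);
}
```

The separate "test, then set" in the quoted code leaves a window between
the load and the store; folding both into one exchange closes it without
taking an additional lock.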


* Re: [RFC v2 00/16] Live Update Orchestrator
  2025-05-15 18:23 [RFC v2 00/16] Live Update Orchestrator Pasha Tatashin
                   ` (15 preceding siblings ...)
  2025-05-15 18:23 ` [RFC v2 16/16] MAINTAINERS: add liveupdate entry Pasha Tatashin
@ 2025-05-20  7:25 ` Mike Rapoport
  2025-05-23 18:07   ` Pasha Tatashin
  2025-05-26  6:32 ` Mike Rapoport
  17 siblings, 1 reply; 102+ messages in thread
From: Mike Rapoport @ 2025-05-20  7:25 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: pratyush, jasonmiu, graf, changyuanl, dmatlack, rientjes, corbet,
	rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl, masahiroy, akpm,
	tj, yoann.congal, mmaurer, roman.gushchin, chenridong, axboe,
	mark.rutland, jannh, vincent.guittot, hannes, dan.j.williams,
	david, joel.granados, rostedt, anna.schumaker, song, zhangguopeng,
	linux, linux-kernel, linux-doc, linux-mm, gregkh, tglx, mingo, bp,
	dave.hansen, x86, hpa, rafael, dakr, bartosz.golaszewski,
	cw00.choi, myungjoo.ham, yesanishhere, Jonathan.Cameron,
	quic_zijuhu, aleksander.lobakin, ira.weiny, andriy.shevchenko,
	leon, lukas, bhelgaas, wagi, djeffery, stuart.w.hayes, ptyadav

Hi Pasha,

On Thu, May 15, 2025 at 06:23:04PM +0000, Pasha Tatashin wrote:
> This v2 series introduces the LUO, a kernel subsystem designed to
> facilitate live kernel updates with minimal downtime,
> particularly in cloud deployments aiming to update without fully
> disrupting running virtual machines.
> 
> This series builds upon KHO framework [1] by adding programmatic
> control over KHO's lifecycle and leveraging KHO for persisting LUO's
> own metadata across the kexec boundary. The git branch for this series
> can be found at:
> https://github.com/googleprodkernel/linux-liveupdate/tree/luo/rfc-v2
> 
> What is Live Update?
> Live Update is a specialized reboot process where selected kernel
> resources (memory, file descriptors, and eventually devices) are kept
> operational or their state preserved across a kernel transition (e.g.,
> via kexec). For certain resources, DMA and interrupt activity might
> continue with minimal interruption during the kernel reboot.
> 
> LUO v2 Overview:
> LUO v2 provides a framework for coordinating live updates. It features:
> State Machine: Manages the live update process through states:
> NORMAL, PREPARED, FROZEN, UPDATED.
> 
> KHO Integration:
> 
> LUO programmatically drives KHO's finalization and abort sequences.
> KHO's debugfs interface is now optional, configured via
> CONFIG_KEXEC_HANDOVER_DEBUG.
> 
> LUO preserves its own metadata via KHO's kho_add_subtree and
> kho_preserve_phys() mechanisms.
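
For reference, the state machine described in the quoted cover letter can
be sketched as a small transition function. This is a userspace model, not
the kernel code: the sm_* names are made up for illustration, the FINISH
rule (UPDATED back to NORMAL in the new kernel) is an assumption, and the
FROZEN to UPDATED step that happens across kexec is not modelled. The
PREPARE, FREEZE and CANCEL rules follow what the selftests in this series
exercise.

```c
#include <assert.h>
#include <errno.h>

/* Illustrative names only; the real enums live in the uAPI header. */
enum sm_state { SM_NORMAL, SM_PREPARED, SM_FROZEN, SM_UPDATED };
enum sm_event { SM_EV_PREPARE, SM_EV_FREEZE, SM_EV_CANCEL, SM_EV_FINISH };

/*
 * Apply one event to *state. Returns 0 and updates *state on a legal
 * transition, or -EINVAL (leaving *state untouched) otherwise.
 */
static int sm_transition(enum sm_state *state, enum sm_event ev)
{
	assert(state);

	switch (ev) {
	case SM_EV_PREPARE:
		if (*state != SM_NORMAL)
			return -EINVAL;
		*state = SM_PREPARED;
		return 0;
	case SM_EV_FREEZE:
		if (*state != SM_PREPARED)
			return -EINVAL;
		*state = SM_FROZEN;
		return 0;
	case SM_EV_CANCEL:
		if (*state != SM_PREPARED && *state != SM_FROZEN)
			return -EINVAL;
		*state = SM_NORMAL;
		return 0;
	case SM_EV_FINISH:
		/* Assumption: finish runs in the new kernel, UPDATED -> NORMAL. */
		if (*state != SM_UPDATED)
			return -EINVAL;
		*state = SM_NORMAL;
		return 0;
	}
	return -EINVAL;
}
```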

I've only had time to skim through the patches. One thing that came to mind
is that since LUO is quite tightly coupled with KHO, maybe we should put them
together in, say, kernel/liveupdate?

-- 
Sincerely yours,
Mike.


* Re: [RFC v2 00/16] Live Update Orchestrator
  2025-05-20  7:25 ` [RFC v2 00/16] Live Update Orchestrator Mike Rapoport
@ 2025-05-23 18:07   ` Pasha Tatashin
  0 siblings, 0 replies; 102+ messages in thread
From: Pasha Tatashin @ 2025-05-23 18:07 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: pratyush, jasonmiu, graf, changyuanl, dmatlack, rientjes, corbet,
	rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl, masahiroy, akpm,
	tj, yoann.congal, mmaurer, roman.gushchin, chenridong, axboe,
	mark.rutland, jannh, vincent.guittot, hannes, dan.j.williams,
	david, joel.granados, rostedt, anna.schumaker, song, zhangguopeng,
	linux, linux-kernel, linux-doc, linux-mm, gregkh, tglx, mingo, bp,
	dave.hansen, x86, hpa, rafael, dakr, bartosz.golaszewski,
	cw00.choi, myungjoo.ham, yesanishhere, Jonathan.Cameron,
	quic_zijuhu, aleksander.lobakin, ira.weiny, andriy.shevchenko,
	leon, lukas, bhelgaas, wagi, djeffery, stuart.w.hayes, ptyadav

On Tue, May 20, 2025 at 3:25 AM Mike Rapoport <rppt@kernel.org> wrote:
>
> Hi Pasha,
>
> On Thu, May 15, 2025 at 06:23:04PM +0000, Pasha Tatashin wrote:
> > This v2 series introduces the LUO, a kernel subsystem designed to
> > facilitate live kernel updates with minimal downtime,
> > particularly in cloud deployments aiming to update without fully
> > disrupting running virtual machines.
> >
> > This series builds upon KHO framework [1] by adding programmatic
> > control over KHO's lifecycle and leveraging KHO for persisting LUO's
> > own metadata across the kexec boundary. The git branch for this series
> > can be found at:
> > https://github.com/googleprodkernel/linux-liveupdate/tree/luo/rfc-v2
> >
> > What is Live Update?
> > Live Update is a specialized reboot process where selected kernel
> > resources (memory, file descriptors, and eventually devices) are kept
> > operational or their state preserved across a kernel transition (e.g.,
> > via kexec). For certain resources, DMA and interrupt activity might
> > continue with minimal interruption during the kernel reboot.
> >
> > LUO v2 Overview:
> > LUO v2 provides a framework for coordinating live updates. It features:
> > State Machine: Manages the live update process through states:
> > NORMAL, PREPARED, FROZEN, UPDATED.
> >
> > KHO Integration:
> >
> > LUO programmatically drives KHO's finalization and abort sequences.
> > KHO's debugfs interface is now optional, configured via
> > CONFIG_KEXEC_HANDOVER_DEBUG.
> >
> > LUO preserves its own metadata via KHO's kho_add_subtree and
> > kho_preserve_phys() mechanisms.
>
> I've only had time to skim through the patches. One thing that came to mind
> is that since LUO is quite tightly coupled with KHO, maybe we should put them
> together in, say, kernel/liveupdate?

Thank you, Mike. Yes, that is a good idea. I also thought it would make
sense for them to be in the same place. Initially I thought KHO should
perhaps be moved to misc/liveupdate/, but since it is already landing in
kernel/kexec_* and works with a number of core kernel subsystems, it
makes sense to move LUO and KHO together under kernel/liveupdate/.

Pasha


* Re: [RFC v2 08/16] luo: luo_files: add infrastructure for FDs
  2025-05-15 23:15   ` James Houghton
@ 2025-05-23 18:09     ` Pasha Tatashin
  0 siblings, 0 replies; 102+ messages in thread
From: Pasha Tatashin @ 2025-05-23 18:09 UTC (permalink / raw)
  To: James Houghton
  Cc: pratyush, jasonmiu, graf, changyuanl, rppt, dmatlack, rientjes,
	corbet, rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl,
	masahiroy, akpm, tj, yoann.congal, mmaurer, roman.gushchin,
	chenridong, axboe, mark.rutland, jannh, vincent.guittot, hannes,
	dan.j.williams, david, joel.granados, rostedt, anna.schumaker,
	song, zhangguopeng, linux, linux-kernel, linux-doc, linux-mm,
	gregkh, tglx, mingo, bp, dave.hansen, x86, hpa, rafael, dakr,
	bartosz.golaszewski, cw00.choi, myungjoo.ham, yesanishhere,
	Jonathan.Cameron, quic_zijuhu, aleksander.lobakin, ira.weiny,
	andriy.shevchenko, leon, lukas, bhelgaas, wagi, djeffery,
	stuart.w.hayes, ptyadav

On Thu, May 15, 2025 at 7:16 PM James Houghton <jthoughton@google.com> wrote:
>
> On Thu, May 15, 2025 at 11:23 AM Pasha Tatashin
> <pasha.tatashin@soleen.com> wrote:
> > +/**
> > + * luo_retrieve_file - Find a registered file instance by its token.
> > + * @token: The unique token of the file instance to retrieve.
> > + * @file: Output parameter. On success (return value 0), this will point
> > + * to the retrieved "struct file".
> > + *
> > + * Searches the global list for a &struct luo_file matching the @token. Uses a
> > + * read lock, allowing concurrent retrievals.
> > + *
> > + * Return: 0 on success. Negative errno on failure.
> > + */
> > +int luo_retrieve_file(u64 token, struct file **file)
> > +{
> > +       struct luo_file *luo_file;
> > +       int ret = 0;
> > +
> > +       luo_files_recreate_luo_files_xa_in();
> > +       luo_state_read_enter();
> > +       if (!liveupdate_state_updated()) {
> > +               pr_warn("File can be retrieved only in updated state\n");
> > +               luo_state_read_exit();
> > +               return -EBUSY;
> > +       }
> > +
> > +       luo_file = xa_load(&luo_files_xa_in, token);
> > +       if (luo_file && !luo_file->reclaimed) {
> > +               luo_file->reclaimed = true;
>
> I haven't been able to pay too much attention to the series yet, and I
> know this was posted as an RFC, so pardon my nit-picking.
>
> I think you need to have xchg here for this not to be racy, so something like:
>
> `if (luo_file && !xchg(&luo_file->reclaimed, true))`
>
> Or maybe you meant to avoid this race some other way; IIUC,
> luo_state_read_enter() is not sufficient.

Thank you for catching this. This is a bug: I actually added a per-FD
mutex to struct luo_file that is supposed to be used here. I am
going to address this in the next version.

Thanks,
Pasha

>
> Thanks!
>
> > +               ret = luo_file->fs->retrieve(luo_file->fs->arg,
> > +                                            luo_file->private_data,
> > +                                            file);
> > +               if (!ret)
> > +                       luo_file->file = *file;
> > +       } else if (luo_file && luo_file->reclaimed) {
> > +               pr_err("The file descriptor for token %lld has already been retrieved\n",
> > +                      token);
> > +               ret = -EINVAL;
> > +       } else {
> > +               ret = -ENOENT;
> > +       }
> > +
> > +       luo_state_read_exit();
> > +
> > +       return ret;
> > +}
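
For illustration, the per-FD lock approach mentioned above could look
roughly like the following userspace sketch. It uses a pthread mutex in
place of a kernel mutex; struct demo_locked_file and demo_retrieve() are
hypothetical names, and the actual fix in the next version may differ.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct luo_file with a per-file lock. */
struct demo_locked_file {
	pthread_mutex_t lock;
	bool reclaimed;
};

/*
 * Check and set the flag under the per-file lock. Returns 0 for the
 * first caller and -EINVAL for any later caller, mirroring the error
 * the quoted code uses for an already retrieved token.
 */
static int demo_retrieve(struct demo_locked_file *f)
{
	int ret = 0;

	assert(f);

	pthread_mutex_lock(&f->lock);
	if (f->reclaimed)
		ret = -EINVAL;
	else
		f->reclaimed = true;
	pthread_mutex_unlock(&f->lock);

	return ret;
}
```

Compared with a bare xchg, a lock can also be held across the retrieve
callback, which would allow the flag to be rolled back if that callback
fails. Whether that is wanted here is a design decision for the series.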


* Re: [RFC v2 04/16] luo: luo_core: Live Update Orchestrator
  2025-05-15 18:23 ` [RFC v2 04/16] luo: luo_core: Live Update Orchestrator Pasha Tatashin
@ 2025-05-26  6:31   ` Mike Rapoport
  2025-05-30  5:00     ` Pasha Tatashin
  2025-06-04 15:17   ` Pratyush Yadav
  1 sibling, 1 reply; 102+ messages in thread
From: Mike Rapoport @ 2025-05-26  6:31 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: pratyush, jasonmiu, graf, changyuanl, dmatlack, rientjes, corbet,
	rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl, masahiroy, akpm,
	tj, yoann.congal, mmaurer, roman.gushchin, chenridong, axboe,
	mark.rutland, jannh, vincent.guittot, hannes, dan.j.williams,
	david, joel.granados, rostedt, anna.schumaker, song, zhangguopeng,
	linux, linux-kernel, linux-doc, linux-mm, gregkh, tglx, mingo, bp,
	dave.hansen, x86, hpa, rafael, dakr, bartosz.golaszewski,
	cw00.choi, myungjoo.ham, yesanishhere, Jonathan.Cameron,
	quic_zijuhu, aleksander.lobakin, ira.weiny, andriy.shevchenko,
	leon, lukas, bhelgaas, wagi, djeffery, stuart.w.hayes, ptyadav

On Thu, May 15, 2025 at 06:23:08PM +0000, Pasha Tatashin wrote:
> Introduce LUO, a mechanism intended to facilitate kernel updates while
> keeping designated devices operational across the transition (e.g., via
> kexec). The primary use case is updating hypervisors with minimal
> disruption to running virtual machines. For userspace side of hypervisor
> update we have copyless migration. LUO is for updating the kernel.
> 
> This initial patch lays the groundwork for the LUO subsystem.
> 
> Further functionality, including the implementation of state transition
> logic, integration with KHO, and hooks for subsystems and file
> descriptors, will be added in subsequent patches.
> 
> Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
> ---
>  drivers/misc/Kconfig                   |   1 +
>  drivers/misc/Makefile                  |   1 +
>  drivers/misc/liveupdate/Kconfig        |  27 +++
>  drivers/misc/liveupdate/Makefile       |   2 +
>  drivers/misc/liveupdate/luo_core.c     | 296 +++++++++++++++++++++++++
>  drivers/misc/liveupdate/luo_internal.h |  26 +++
>  include/linux/liveupdate.h             | 131 +++++++++++
>  7 files changed, 484 insertions(+)
>  create mode 100644 drivers/misc/liveupdate/Kconfig
>  create mode 100644 drivers/misc/liveupdate/Makefile
>  create mode 100644 drivers/misc/liveupdate/luo_core.c
>  create mode 100644 drivers/misc/liveupdate/luo_internal.h
>  create mode 100644 include/linux/liveupdate.h
> 
> diff --git a/drivers/misc/Kconfig b/drivers/misc/Kconfig
> index 6b37d61150ee..851fd9c33b36 100644
> --- a/drivers/misc/Kconfig
> +++ b/drivers/misc/Kconfig
> @@ -636,6 +636,7 @@ source "drivers/misc/c2port/Kconfig"
>  source "drivers/misc/eeprom/Kconfig"
>  source "drivers/misc/cb710/Kconfig"
>  source "drivers/misc/lis3lv02d/Kconfig"
> +source "drivers/misc/liveupdate/Kconfig"
>  source "drivers/misc/altera-stapl/Kconfig"
>  source "drivers/misc/mei/Kconfig"
>  source "drivers/misc/vmw_vmci/Kconfig"
> diff --git a/drivers/misc/Makefile b/drivers/misc/Makefile
> index d6c917229c45..ed5b5bc71b85 100644
> --- a/drivers/misc/Makefile
> +++ b/drivers/misc/Makefile
> @@ -41,6 +41,7 @@ obj-y				+= eeprom/
>  obj-y				+= cb710/
>  obj-$(CONFIG_VMWARE_BALLOON)	+= vmw_balloon.o
>  obj-$(CONFIG_PCH_PHUB)		+= pch_phub.o
> +obj-$(CONFIG_LIVEUPDATE)	+= liveupdate/
>  obj-y				+= lis3lv02d/
>  obj-$(CONFIG_ALTERA_STAPL)	+=altera-stapl/
>  obj-$(CONFIG_INTEL_MEI)		+= mei/
> diff --git a/drivers/misc/liveupdate/Kconfig b/drivers/misc/liveupdate/Kconfig
> new file mode 100644
> index 000000000000..a7424ceeba0b
> --- /dev/null
> +++ b/drivers/misc/liveupdate/Kconfig
> @@ -0,0 +1,27 @@
> +# SPDX-License-Identifier: GPL-2.0-only
> +#
> +# Copyright (c) 2025, Google LLC.
> +# Pasha Tatashin <pasha.tatashin@soleen.com>
> +#
> +# Live Update Orchestrator
> +#
> +
> +config LIVEUPDATE
> +	bool "Live Update Orchestrator"
> +	depends on KEXEC_HANDOVER
> +	help
> +	  Enable the Live Update Orchestrator. Live Update is a mechanism,
> +	  typically based on kexec, that allows the kernel to be updated
> +	  while keeping selected devices operational across the transition.
> +	  These devices are intended to be reclaimed by the new kernel and
> +	  re-attached to their original workload without requiring a device
> +	  reset.
> +
> +	  This functionality depends on specific support within device drivers
> +	  and related kernel subsystems.

It is not clear whether it is the ability to reattach a device to the new
kernel or the entire live update functionality that depends on specific
support within drivers.

Probably better to phrase it as

	  Ability to hand over a device from old to new kernel depends ...

> +
> +	  This feature is primarily used in cloud environments to quickly
> +	  update the kernel hypervisor with minimal disruption to the
> +	  running virtual machines.

I wouldn't put it into Kconfig. If anything I'd make it

	  This feature primarily targets virtual machine hosts to quickly ...

> +
> +	  If unsure, say N.
> diff --git a/drivers/misc/liveupdate/Makefile b/drivers/misc/liveupdate/Makefile
> new file mode 100644
> index 000000000000..3bfb4b9fed11
> --- /dev/null
> +++ b/drivers/misc/liveupdate/Makefile
> @@ -0,0 +1,2 @@
> +# SPDX-License-Identifier: GPL-2.0
> +obj-y					+= luo_core.o
> diff --git a/drivers/misc/liveupdate/luo_core.c b/drivers/misc/liveupdate/luo_core.c
> new file mode 100644
> index 000000000000..919c37b0b4d1
> --- /dev/null
> +++ b/drivers/misc/liveupdate/luo_core.c
> @@ -0,0 +1,296 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +/*
> + * Copyright (c) 2025, Google LLC.
> + * Pasha Tatashin <pasha.tatashin@soleen.com>
> + */
> +
> +/**
> + * DOC: Live Update Orchestrator (LUO)
> + *
> + * Live Update is a specialized reboot process where selected devices are
> + * kept operational across a kernel transition. For these devices, DMA activity
> + * may continue during the kernel reboot.
> + *
> + * The primary use case is in cloud environments, allowing hypervisor updates
> + * without disrupting running virtual machines. During a live update, VMs can be
> + * suspended (with their state preserved in memory), while the hypervisor kernel
> + * reboots. Devices attached to these VMs (e.g., NICs, block devices) are kept
> + * operational by the LUO during the hypervisor reboot, allowing the VMs to be
> + * quickly resumed on the new kernel.
> + *
> + * The core of LUO is a state machine that tracks the progress of a live update,
> + * along with a callback API that allows other kernel subsystems to participate
> + * in the process. Example subsystems that can hook into LUO include: kvm,
> + * iommu, interrupts, vfio, participating filesystems, and mm.

Please spell out memory management.

> + * LUO uses KHO to transfer memory state from the current Kernel to the next

A link to KHO docs would have been nice, but I'm not sure kernel-doc can do
that nicely.

> + * Kernel.

Why capital 'K'? :)

> + * The LUO state machine ensures that operations are performed in the correct
> + * sequence and provides a mechanism to track and recover from potential
> + * failures, and select devices and subsystems that should participate in
> + * live update sequence.
> + */
> +
> +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
> +
> +#include <linux/err.h>
> +#include <linux/kobject.h>
> +#include <linux/liveupdate.h>
> +#include <linux/rwsem.h>
> +#include <linux/string.h>
> +#include "luo_internal.h"
> +
> +static DECLARE_RWSEM(luo_state_rwsem);
> +
> +enum liveupdate_state luo_state;

static?

Hmm, luo_state is initialized to 0 (NORMAL), which means we always start
from NORMAL, although the second kernel is not in the normal state until
the handover is complete. Maybe we need an initial "unknown" state until
some LUO code runs and sets an actual known state?

> +
> +const char *const luo_state_str[] = {
> +	[LIVEUPDATE_STATE_NORMAL]	= "normal",
> +	[LIVEUPDATE_STATE_PREPARED]	= "prepared",
> +	[LIVEUPDATE_STATE_FROZEN]	= "frozen",
> +	[LIVEUPDATE_STATE_UPDATED]	= "updated",
> +};
> +
> +bool luo_enabled;

static?

> +static int __init early_liveupdate_param(char *buf)
> +{
> +	return kstrtobool(buf, &luo_enabled);
> +}
> +early_param("liveupdate", early_liveupdate_param);
> +
> +/* Return true if the current state is equal to the provided state */
> +static inline bool is_current_luo_state(enum liveupdate_state expected_state)
> +{
> +	return READ_ONCE(luo_state) == expected_state;
> +}
> +
> +static void __luo_set_state(enum liveupdate_state state)
> +{
> +	WRITE_ONCE(luo_state, state);
> +}
> +
> +static inline void luo_set_state(enum liveupdate_state state)
> +{
> +	pr_info("Switched from [%s] to [%s] state\n",
> +		LUO_STATE_STR, luo_state_str[state]);

Maybe LUO_CURRENT_STATE_STR?

> +	__luo_set_state(state);
> +}
> +
> +static int luo_do_freeze_calls(void)
> +{
> +	return 0;
> +}
> +
> +static void luo_do_finish_calls(void)
> +{
> +}
> +
> +int luo_prepare(void)
> +{
> +	return 0;
> +}
> +
> +/**
> + * luo_freeze() - Initiate the final freeze notification phase for live update.
> + *
> + * Attempts to transition the live update orchestrator state from
> + * %LIVEUPDATE_STATE_PREPARED to %LIVEUPDATE_STATE_FROZEN. This function is
> + * typically called just before the actual reboot system call (e.g., kexec)
> + * is invoked, either directly by the orchestration tool or potentially from
> + * within the reboot syscall path itself.
> + *
> + * Based on the outcome of the notification process:
> + * - If luo_do_freeze_calls() returns 0 (all callbacks succeeded), the state
> + * is set to %LIVEUPDATE_STATE_FROZEN using luo_set_state(), indicating
> + * readiness for the imminent kexec.
> + * - If luo_do_freeze_calls() returns a negative error code (a callback
> + * failed), the state is reverted to %LIVEUPDATE_STATE_NORMAL using
> + * luo_set_state() to cancel the live update attempt.

The kernel-doc comments are mostly for users of a function and describe how
it should be used rather than how it is implemented.

I don't think it's important to mention the return values of
luo_do_freeze_calls() here. The important things are whether the registered
subsystems managed to freeze and how the state changes.
I'd also mention that if a subsystem fails to freeze, everything is
canceled.
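
One possible shape for such a comment, shown on a simplified user-space
model of luo_freeze(). This is only a sketch: the locking (and its -EAGAIN
path) is left out, and freeze_result is a hypothetical knob standing in for
the subsystem callbacks.

```c
#include <errno.h>

enum liveupdate_state {
	LIVEUPDATE_STATE_NORMAL = 0,
	LIVEUPDATE_STATE_PREPARED = 1,
	LIVEUPDATE_STATE_FROZEN = 2,
	LIVEUPDATE_STATE_UPDATED = 3,
};

static enum liveupdate_state luo_state;
static int freeze_result;	/* hypothetical: 0 or a negative errno */

/**
 * luo_freeze() - Enter the blackout window right before kexec.
 *
 * Call this while LUO is in the prepared state, immediately before
 * rebooting into the next kernel. All registered subsystems are asked to
 * freeze.
 *
 * If every subsystem freezes, LUO moves to the frozen state. If any
 * subsystem fails to freeze, the whole live update is canceled and LUO
 * goes back to the normal state.
 *
 * Return: 0 on success, -EINVAL if LUO is not in the prepared state, or
 * the error reported by the failing subsystem.
 */
static int luo_freeze(void)
{
	if (luo_state != LIVEUPDATE_STATE_PREPARED)
		return -EINVAL;

	if (freeze_result)
		luo_state = LIVEUPDATE_STATE_NORMAL;
	else
		luo_state = LIVEUPDATE_STATE_FROZEN;

	return freeze_result;
}
```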

> + *
> + * @return  0: Success. Negative error otherwise. State is reverted to
> + * %LIVEUPDATE_STATE_NORMAL in case of an error during callbacks.
> + */
> +int luo_freeze(void)
> +{
> +	int ret;
> +
> +	if (down_write_killable(&luo_state_rwsem)) {
> +		pr_warn("[freeze] event canceled by user\n");
> +		return -EAGAIN;
> +	}
> +
> +	if (!is_current_luo_state(LIVEUPDATE_STATE_PREPARED)) {
> +		pr_warn("Can't switch to [%s] from [%s] state\n",
> +			luo_state_str[LIVEUPDATE_STATE_FROZEN],
> +			LUO_STATE_STR);
> +		up_write(&luo_state_rwsem);
> +
> +		return -EINVAL;
> +	}
> +
> +	ret = luo_do_freeze_calls();
> +	if (!ret)
> +		luo_set_state(LIVEUPDATE_STATE_FROZEN);
> +	else
> +		luo_set_state(LIVEUPDATE_STATE_NORMAL);
> +
> +	up_write(&luo_state_rwsem);
> +
> +	return ret;
> +}
> +
> +/**
> + * luo_finish - Finalize the live update process in the new kernel.
> + *
> + * This function is called  after a successful live update reboot into a new
> + * kernel, once the new kernel is ready to transition to the normal operational
> + * state. It signals the completion of the live update sequence to subsystems.
> + *
> + * It first attempts to acquire the write lock for the orchestrator state.
> + *
> + * Then, it checks if the system is in the ``LIVEUPDATE_STATE_UPDATED`` state.
> + * If not, it logs a warning and returns ``-EINVAL``.
> + *
> + * If the state is correct, it triggers the ``LIVEUPDATE_FINISH`` notifier

Here too, you describe what the function does rather than how it should be
used.

> + * chain. Note that the return value of the notifier is intentionally ignored as
> + * finish callbacks must not fail. Finally, the orchestrator state is

And what should happen if there was an error in a finish callback?

> + * transitioned back to ``LIVEUPDATE_STATE_NORMAL``, indicating the end of the
> + * live update process.
> + *
> + * @return 0 on success, ``-EAGAIN`` if the state change was cancelled by the
> + * user while waiting for the lock, or ``-EINVAL`` if the orchestrator is not in
> + * the updated state.
> + */
> +int luo_finish(void)
> +{
> +	if (down_write_killable(&luo_state_rwsem)) {
> +		pr_warn("[finish] event canceled by user\n");
> +		return -EAGAIN;
> +	}
> +
> +	if (!is_current_luo_state(LIVEUPDATE_STATE_UPDATED)) {
> +		pr_warn("Can't switch to [%s] from [%s] state\n",
> +			luo_state_str[LIVEUPDATE_STATE_NORMAL],
> +			LUO_STATE_STR);
> +		up_write(&luo_state_rwsem);
> +
> +		return -EINVAL;
> +	}
> +
> +	luo_do_finish_calls();
> +	luo_set_state(LIVEUPDATE_STATE_NORMAL);
> +
> +	up_write(&luo_state_rwsem);
> +
> +	return 0;
> +}
> +
> +int luo_cancel(void)
> +{
> +	return 0;
> +}
> +
> +void luo_state_read_enter(void)
> +{
> +	down_read(&luo_state_rwsem);
> +}
> +
> +void luo_state_read_exit(void)
> +{
> +	up_read(&luo_state_rwsem);
> +}
> +
> +static int __init luo_startup(void)
> +{
> +	__luo_set_state(LIVEUPDATE_STATE_NORMAL);
> +
> +	return 0;
> +}
> +early_initcall(luo_startup);

This means that the second kernel starts with luo_state == NORMAL, then
at early_initcall transitions to NORMAL again and later is set to UPDATED,
doesn't it?

> +
> +/* Public Functions */
> +
> +/**
> + * liveupdate_reboot() - Kernel reboot notifier for live update final
> + * serialization.
> + *
> + * This function is invoked directly from the reboot() syscall pathway if a
> + * reboot is initiated while the live update state is %LIVEUPDATE_STATE_PREPARED
> + * (i.e., if the user did not explicitly trigger the frozen state). It handles
> + * the implicit transition into the final frozen state.
> + *
> + * It triggers the %LIVEUPDATE_FREEZE event callbacks for participating
> + * subsystems. These callbacks must perform final state saving very quickly as
> + * they execute during the blackout period just before kexec.
> + *
> + * If any %LIVEUPDATE_FREEZE callback fails, this function triggers the
> + * %LIVEUPDATE_CANCEL event for all participants to revert their state, aborts
> + * the live update, and returns an error.
> + */
> +int liveupdate_reboot(void)
> +{
> +	if (!is_current_luo_state(LIVEUPDATE_STATE_PREPARED))
> +		return 0;
> +
> +	return luo_freeze();
> +}
> +EXPORT_SYMBOL_GPL(liveupdate_reboot);
> +
> +/**
> + * liveupdate_state_updated - Check if the system is in the live update
> + * 'updated' state.
> + *
> + * This function checks if the live update orchestrator is in the
> + * ``LIVEUPDATE_STATE_UPDATED`` state. This state indicates that the system has
> + * successfully rebooted into a new kernel as part of a live update, and the
> + * preserved devices are expected to be in the process of being reclaimed.
> + *
> + * This is typically used by subsystems during early boot of the new kernel
> + * to determine if they need to attempt to restore state from a previous
> + * live update.
> + *
> + * @return true if the system is in the ``LIVEUPDATE_STATE_UPDATED`` state,
> + * false otherwise.
> + */
> +bool liveupdate_state_updated(void)
> +{
> +	return is_current_luo_state(LIVEUPDATE_STATE_UPDATED);
> +}
> +EXPORT_SYMBOL_GPL(liveupdate_state_updated);
> +
> +/**
> + * liveupdate_state_normal - Check if the system is in the live update 'normal'
> + * state.
> + *
> + * This function checks if the live update orchestrator is in the
> + * ``LIVEUPDATE_STATE_NORMAL`` state. This state indicates that no live update
> + * is in progress. It represents the default operational state of the system.
> + *
> + * This can be used to gate actions that should only be performed when no
> + * live update activity is occurring.
> + *
> + * @return true if the system is in the ``LIVEUPDATE_STATE_NORMAL`` state,
> + * false otherwise.
> + */
> +bool liveupdate_state_normal(void)
> +{
> +	return is_current_luo_state(LIVEUPDATE_STATE_NORMAL);
> +}
> +EXPORT_SYMBOL_GPL(liveupdate_state_normal);

Won't liveupdate_get_state() do?
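
A rough user-space sketch of what a single accessor could look like.
liveupdate_get_state() is only the name suggested above, not an existing
function, and the kernel version would presumably read the state with
READ_ONCE().

```c
enum liveupdate_state {
	LIVEUPDATE_STATE_NORMAL = 0,
	LIVEUPDATE_STATE_PREPARED = 1,
	LIVEUPDATE_STATE_FROZEN = 2,
	LIVEUPDATE_STATE_UPDATED = 3,
};

static enum liveupdate_state luo_state;

/* Hypothetical single accessor replacing the per-state predicates. */
enum liveupdate_state liveupdate_get_state(void)
{
	/* The kernel code would use READ_ONCE(luo_state) here. */
	return luo_state;
}
```

Callers would then compare the result themselves, for example
liveupdate_get_state() == LIVEUPDATE_STATE_UPDATED, instead of needing one
exported helper per state.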

> +
> +/**
> + * liveupdate_enabled - Check if the live update feature is enabled.
> + *
> + * This function returns the state of the live update feature flag, which
> + * can be controlled via the ``liveupdate`` kernel command-line parameter.
> + *
> + * @return true if live update is enabled, false otherwise.
> + */
> +bool liveupdate_enabled(void)
> +{
> +	return luo_enabled;
> +}
> +EXPORT_SYMBOL_GPL(liveupdate_enabled);
> diff --git a/drivers/misc/liveupdate/luo_internal.h b/drivers/misc/liveupdate/luo_internal.h
> new file mode 100644
> index 000000000000..34e73fb0318c
> --- /dev/null
> +++ b/drivers/misc/liveupdate/luo_internal.h
> @@ -0,0 +1,26 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +
> +/*
> + * Copyright (c) 2025, Google LLC.
> + * Pasha Tatashin <pasha.tatashin@soleen.com>
> + */
> +
> +#ifndef _LINUX_LUO_INTERNAL_H
> +#define _LINUX_LUO_INTERNAL_H
> +
> +int luo_cancel(void);
> +int luo_prepare(void);
> +int luo_freeze(void);
> +int luo_finish(void);
> +
> +void luo_state_read_enter(void);
> +void luo_state_read_exit(void);
> +
> +extern const char *const luo_state_str[];
> +
> +/* Get the current state as a string */
> +#define LUO_STATE_STR luo_state_str[READ_ONCE(luo_state)]

IIUC you need the macro to have LUO_STATE_STR available in all files in
liveupdate/ but without exposing luo_state.

I think we can use a function call to get that string instead; that would
make things nicer IMHO.
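
A small user-space sketch of the function-call variant. The name
luo_current_state_str() is hypothetical; with it, luo_state could stay
private to luo_core.c and only the prototype would go into luo_internal.h.

```c
enum liveupdate_state {
	LIVEUPDATE_STATE_NORMAL = 0,
	LIVEUPDATE_STATE_PREPARED = 1,
	LIVEUPDATE_STATE_FROZEN = 2,
	LIVEUPDATE_STATE_UPDATED = 3,
};

/* Both stay file-local in this sketch; nothing outside needs them. */
static enum liveupdate_state luo_state;

static const char *const luo_state_str[] = {
	[LIVEUPDATE_STATE_NORMAL]	= "normal",
	[LIVEUPDATE_STATE_PREPARED]	= "prepared",
	[LIVEUPDATE_STATE_FROZEN]	= "frozen",
	[LIVEUPDATE_STATE_UPDATED]	= "updated",
};

/* Hypothetical replacement for the LUO_STATE_STR macro. */
const char *luo_current_state_str(void)
{
	/* The kernel code would use READ_ONCE(luo_state) here. */
	return luo_state_str[luo_state];
}
```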

> +
> +extern enum liveupdate_state luo_state;
> +
> +#endif /* _LINUX_LUO_INTERNAL_H */
> diff --git a/include/linux/liveupdate.h b/include/linux/liveupdate.h
> new file mode 100644
> index 000000000000..c2740da70958
> --- /dev/null
> +++ b/include/linux/liveupdate.h
> @@ -0,0 +1,131 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +
> +/*
> + * Copyright (c) 2025, Google LLC.
> + * Pasha Tatashin <pasha.tatashin@soleen.com>
> + */
> +#ifndef _LINUX_LIVEUPDATE_H
> +#define _LINUX_LIVEUPDATE_H
> +
> +#include <linux/bug.h>
> +#include <linux/types.h>
> +#include <linux/list.h>
> +
> +/**
> + * enum liveupdate_event - Events that trigger live update callbacks.
> + * @LIVEUPDATE_PREPARE: PREPARE should happens *before* the blackout window.

should happen or happens ;-)

> + *                      Subsystems should prepare for an upcoming reboot by
> + *                      serializing their states. However, it must be considered

It's not only about state serialization, it's also about adjusting the
operational mode so that the serialized state won't change, or at least so
that changes between PREPARE and FREEZE are accounted for somehow.

> + *                      that user applications, e.g. virtual machines are still
> + *                      running during this phase.
> + * @LIVEUPDATE_FREEZE:  FREEZE sent from the reboot() syscall, when the current
> + *                      kernel is on its way out. This is the final opportunity
> + *                      for subsystems to save any state that must persist
> + *                      across the reboot. Callbacks for this event should be as
> + *                      fast as possible since they are on the critical path of
> + *                      rebooting into the next kernel.
> + * @LIVEUPDATE_FINISH:  FINISH is sent in the newly booted kernel after a
> + *                      successful live update and normally *after* the blackout
> + *                      window. Subsystems should perform any final cleanup
> + *                      during this phase. This phase also provides an
> + *                      opportunity to clean up devices that were preserved but
> + *                      never explicitly reclaimed during the live update
> + *                      process. State restoration should have already occurred
> + *                      before this event. Callbacks for this event must not
> + *                      fail. The completion of this call transitions the
> + *                      machine from ``updated`` to ``normal`` state.
> + * @LIVEUPDATE_CANCEL:  CANCEL the live update and go back to normal state. This
> + *                      event is user initiated, or is done automatically when
> + *                      LIVEUPDATE_PREPARE or LIVEUPDATE_FREEZE stage fails.
> + *                      Subsystems should revert any actions taken during the
> + *                      corresponding prepare event. Callbacks for this event
> + *                      must not fail.
> + *
> + * These events represent the different stages and actions within the live
> + * update process that subsystems (like device drivers and bus drivers)
> + * need to be aware of to correctly serialize and restore their state.
> + *
> + */
> +enum liveupdate_event {
> +	LIVEUPDATE_PREPARE,
> +	LIVEUPDATE_FREEZE,
> +	LIVEUPDATE_FINISH,
> +	LIVEUPDATE_CANCEL,
> +};
> +
> +/**
> + * enum liveupdate_state - Defines the possible states of the live update
> + * orchestrator.
> + * @LIVEUPDATE_STATE_NORMAL:         Default state, no live update in progress.
> + * @LIVEUPDATE_STATE_PREPARED:       Live update is prepared for reboot; the
> + *                                   LIVEUPDATE_PREPARE callbacks have completed
> + *                                   successfully.
> + *                                   Devices might operate in a limited state
> + *                                   for example the participating devices might
> + *                                   not be allowed to unbind, and also the
> + *                                   setting up of new DMA mappings might be
> + *                                   disabled in this state.
> + * @LIVEUPDATE_STATE_FROZEN:         The final reboot event
> + *                                   (%LIVEUPDATE_FREEZE) has been sent, and the
> + *                                   system is performing its final state saving
> + *                                   within the "blackout window". User
> + *                                   workloads must be suspended. The actual
> + *                                   reboot (kexec) into the next kernel is
> + *                                   imminent.
> + * @LIVEUPDATE_STATE_UPDATED:        The system has rebooted into the next
> + *                                   kernel via live update the system is now
> + *                                   running the next kernel, awaiting the
> + *                                   finish event.
> + *
> + * These states track the progress and outcome of a live update operation.
> + */
> +enum liveupdate_state  {
> +	LIVEUPDATE_STATE_NORMAL = 0,
> +	LIVEUPDATE_STATE_PREPARED = 1,
> +	LIVEUPDATE_STATE_FROZEN = 2,
> +	LIVEUPDATE_STATE_UPDATED = 3,
> +};
> +
> +#ifdef CONFIG_LIVEUPDATE
> +
> +/* Return true if live update orchestrator is enabled */
> +bool liveupdate_enabled(void);
> +
> +/* Called during reboot to tell participants to complete serialization */
> +int liveupdate_reboot(void);
> +
> +/*
> + * Return true if machine is in updated state (i.e. live update boot in
> + * progress)
> + */
> +bool liveupdate_state_updated(void);
> +
> +/*
> + * Return true if machine is in normal state (i.e. no live update in progress).
> + */
> +bool liveupdate_state_normal(void);
> +
> +#else /* CONFIG_LIVEUPDATE */
> +
> +static inline int liveupdate_reboot(void)
> +{
> +	return 0;
> +}
> +
> +static inline bool liveupdate_enabled(void)
> +{
> +	return false;
> +}
> +
> +static inline bool liveupdate_state_updated(void)
> +{
> +	return false;
> +}
> +
> +static inline bool liveupdate_state_normal(void)
> +{
> +	return true;
> +}
> +
> +#endif /* CONFIG_LIVEUPDATE */
> +#endif /* _LINUX_LIVEUPDATE_H */
> -- 
> 2.49.0.1101.gccaa498523-goog
> 
> 

-- 
Sincerely yours,
Mike.


* Re: [RFC v2 00/16] Live Update Orchestrator
  2025-05-15 18:23 [RFC v2 00/16] Live Update Orchestrator Pasha Tatashin
                   ` (16 preceding siblings ...)
  2025-05-20  7:25 ` [RFC v2 00/16] Live Update Orchestrator Mike Rapoport
@ 2025-05-26  6:32 ` Mike Rapoport
  17 siblings, 0 replies; 102+ messages in thread
From: Mike Rapoport @ 2025-05-26  6:32 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: pratyush, jasonmiu, graf, changyuanl, dmatlack, rientjes, corbet,
	rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl, masahiroy, akpm,
	tj, yoann.congal, mmaurer, roman.gushchin, chenridong, axboe,
	mark.rutland, jannh, vincent.guittot, hannes, dan.j.williams,
	david, joel.granados, rostedt, anna.schumaker, song, zhangguopeng,
	linux, linux-kernel, linux-doc, linux-mm, gregkh, tglx, mingo, bp,
	dave.hansen, x86, hpa, rafael, dakr, bartosz.golaszewski,
	cw00.choi, myungjoo.ham, yesanishhere, Jonathan.Cameron,
	quic_zijuhu, aleksander.lobakin, ira.weiny, andriy.shevchenko,
	leon, lukas, bhelgaas, wagi, djeffery, stuart.w.hayes, ptyadav,
	linux-api

(cc'ing linux-api)

On Thu, May 15, 2025 at 06:23:04PM +0000, Pasha Tatashin wrote:
> This v2 series introduces the LUO, a kernel subsystem designed to
> facilitate live kernel updates with minimal downtime,
> particularly in cloud deployments aiming to update without fully
> disrupting running virtual machines.
> 
> This series builds upon KHO framework [1] by adding programmatic
> control over KHO's lifecycle and leveraging KHO for persisting LUO's
> own metadata across the kexec boundary. The git branch for this series
> can be found at:
> https://github.com/googleprodkernel/linux-liveupdate/tree/luo/rfc-v2
> 
> Changelog from v1:
> - Control Interface: Shifted from sysfs-based control
>   (/sys/kernel/liveupdate/{prepare,finish}) to an ioctl interface
>   (/dev/liveupdate). Sysfs is now primarily for monitoring the state.
> - Event/State Renaming: LIVEUPDATE_REBOOT event/phase is now
>   LIVEUPDATE_FREEZE.
> - FD Preservation: A new component for preserving file descriptors.
> - Subsystem Registration: A formal mechanism for kernel subsystems
>   to participate.
> - Device Layer: removed device list handling from this series, it is
>   going to be added separately.
> - Selftests: Kernel-side selftest hooks and userspace selftests are
>   now included.
> KHO Enhancements:
> - KHO debugfs became optional, and kernel APIs for finalize/abort
>   were added (driven by LUO's needs).
> - KHO unpreserve functions were also added.
> 
> What is Live Update?
> Live Update is a specialized reboot process where selected kernel
> resources (memory, file descriptors, and eventually devices) are kept
> operational or their state preserved across a kernel transition (e.g.,
> via kexec). For certain resources, DMA and interrupt activity might
> continue with minimal interruption during the kernel reboot.
> 
> LUO v2 Overview:
> LUO v2 provides a framework for coordinating live updates. It features:
> State Machine: Manages the live update process through states:
> NORMAL, PREPARED, FROZEN, UPDATED.
> 
> KHO Integration:
> 
> LUO programmatically drives KHO's finalization and abort sequences.
> KHO's debugfs interface is now optional, configured via
> CONFIG_KEXEC_HANDOVER_DEBUG.
> 
> LUO preserves its own metadata via KHO's kho_add_subtree and
> kho_preserve_phys() mechanisms.
> 
> Subsystem Participation: A callback API liveupdate_register_subsystem()
> allows kernel subsystems (e.g., KVM, IOMMU, VFIO, PCI) to register
> handlers for LUO events (PREPARE, FREEZE, FINISH, CANCEL) and persist a
> u64 payload via the LUO FDT.
> 
> File Descriptor Preservation: Infrastructure
> liveupdate_register_filesystem, luo_register_file, luo_retrieve_file to
> allow specific types of file descriptors (e.g., memfd, vfio) to be
> preserved and restored.
> 
> Handlers for specific file types can be registered to manage their
> preservation and restoration, storing a u64 payload in the LUO FDT.
> 
> Example WIP for memfd preservation can be found here [2].
> 
> User-space Interface:
> 
> ioctl (/dev/liveupdate): The primary control interface for
> triggering LUO state transitions (prepare, freeze, finish, cancel)
> and managing the preservation/restoration of file descriptors.
> Access requires CAP_SYS_ADMIN.
> 
> sysfs (/sys/kernel/liveupdate/state): A read-only interface for
> monitoring the current LUO state. This allows userspace services to
> track progress and coordinate actions.
> 
> Selftests: Includes kernel-side hooks and userspace selftests to
> verify core LUO functionality, particularly subsystem registration and
> basic state transitions.
> 
> LUO State Machine and Events:
> 
> NORMAL:   Default operational state.
> PREPARED: Initial preparation complete after LIVEUPDATE_PREPARE
>           event. Subsystems have saved initial state.
> FROZEN:   Final "blackout window" state after LIVEUPDATE_FREEZE
>           event, just before kexec. Workloads must be suspended.
> UPDATED:  Next kernel has booted via live update. Awaiting restoration
>           and LIVEUPDATE_FINISH.
> 
> Events:
> LIVEUPDATE_PREPARE: Prepare for reboot, serialize state.
> LIVEUPDATE_FREEZE:  Final opportunity to save state before kexec.
> LIVEUPDATE_FINISH:  Post-reboot cleanup in the next kernel.
> LIVEUPDATE_CANCEL:  Abort prepare or freeze, revert changes.
> 
> [1] https://lore.kernel.org/all/20250509074635.3187114-1-changyuanl@google.com
>     https://github.com/googleprodkernel/linux-liveupdate/tree/luo/kho-v8
> [2] https://github.com/googleprodkernel/linux-liveupdate/tree/luo/memfd-v0.1
> 
> RFC v1: https://lore.kernel.org/all/20250320024011.2995837-1-pasha.tatashin@soleen.com
> 
> Changyuan Lyu (1):
>   kho: add kho_unpreserve_folio/phys
> 
> Pasha Tatashin (15):
>   kho: make debugfs interface optional
>   kho: allow to drive kho from within kernel
>   luo: luo_core: Live Update Orchestrator
>   luo: luo_core: integrate with KHO
>   luo: luo_subsystems: add subsystem registration
>   luo: luo_subsystems: implement subsystem callbacks
>   luo: luo_files: add infrastructure for FDs
>   luo: luo_files: implement file systems callbacks
>   luo: luo_ioctl: add ioctl interface
>   luo: luo_sysfs: add sysfs state monitoring
>   reboot: call liveupdate_reboot() before kexec
>   luo: add selftests for subsystems un/registration
>   selftests/liveupdate: add subsystem/state tests
>   docs: add luo documentation
>   MAINTAINERS: add liveupdate entry
> 
>  .../ABI/testing/sysfs-kernel-liveupdate       |  51 ++
>  Documentation/admin-guide/index.rst           |   1 +
>  Documentation/admin-guide/liveupdate.rst      |  62 ++
>  .../userspace-api/ioctl/ioctl-number.rst      |   1 +
>  MAINTAINERS                                   |  14 +-
>  drivers/misc/Kconfig                          |   1 +
>  drivers/misc/Makefile                         |   1 +
>  drivers/misc/liveupdate/Kconfig               |  60 ++
>  drivers/misc/liveupdate/Makefile              |   7 +
>  drivers/misc/liveupdate/luo_core.c            | 547 +++++++++++++++
>  drivers/misc/liveupdate/luo_files.c           | 664 ++++++++++++++++++
>  drivers/misc/liveupdate/luo_internal.h        |  59 ++
>  drivers/misc/liveupdate/luo_ioctl.c           | 203 ++++++
>  drivers/misc/liveupdate/luo_selftests.c       | 283 ++++++++
>  drivers/misc/liveupdate/luo_selftests.h       |  23 +
>  drivers/misc/liveupdate/luo_subsystems.c      | 413 +++++++++++
>  drivers/misc/liveupdate/luo_sysfs.c           |  92 +++
>  include/linux/kexec_handover.h                |  27 +
>  include/linux/liveupdate.h                    | 214 ++++++
>  include/uapi/linux/liveupdate.h               | 324 +++++++++
>  kernel/Kconfig.kexec                          |  10 +
>  kernel/Makefile                               |   1 +
>  kernel/kexec_handover.c                       | 343 +++------
>  kernel/kexec_handover_debug.c                 | 237 +++++++
>  kernel/kexec_handover_internal.h              |  74 ++
>  kernel/reboot.c                               |   4 +
>  tools/testing/selftests/Makefile              |   1 +
>  tools/testing/selftests/liveupdate/.gitignore |   1 +
>  tools/testing/selftests/liveupdate/Makefile   |   7 +
>  tools/testing/selftests/liveupdate/config     |   6 +
>  .../testing/selftests/liveupdate/liveupdate.c | 440 ++++++++++++
>  31 files changed, 3933 insertions(+), 238 deletions(-)
>  create mode 100644 Documentation/ABI/testing/sysfs-kernel-liveupdate
>  create mode 100644 Documentation/admin-guide/liveupdate.rst
>  create mode 100644 drivers/misc/liveupdate/Kconfig
>  create mode 100644 drivers/misc/liveupdate/Makefile
>  create mode 100644 drivers/misc/liveupdate/luo_core.c
>  create mode 100644 drivers/misc/liveupdate/luo_files.c
>  create mode 100644 drivers/misc/liveupdate/luo_internal.h
>  create mode 100644 drivers/misc/liveupdate/luo_ioctl.c
>  create mode 100644 drivers/misc/liveupdate/luo_selftests.c
>  create mode 100644 drivers/misc/liveupdate/luo_selftests.h
>  create mode 100644 drivers/misc/liveupdate/luo_subsystems.c
>  create mode 100644 drivers/misc/liveupdate/luo_sysfs.c
>  create mode 100644 include/linux/liveupdate.h
>  create mode 100644 include/uapi/linux/liveupdate.h
>  create mode 100644 kernel/kexec_handover_debug.c
>  create mode 100644 kernel/kexec_handover_internal.h
>  create mode 100644 tools/testing/selftests/liveupdate/.gitignore
>  create mode 100644 tools/testing/selftests/liveupdate/Makefile
>  create mode 100644 tools/testing/selftests/liveupdate/config
>  create mode 100644 tools/testing/selftests/liveupdate/liveupdate.c
> 
> -- 
> 2.49.0.1101.gccaa498523-goog
> 

-- 
Sincerely yours,
Mike.


* Re: [RFC v2 05/16] luo: luo_core: integrate with KHO
  2025-05-15 18:23 ` [RFC v2 05/16] luo: luo_core: integrate with KHO Pasha Tatashin
@ 2025-05-26  7:18   ` Mike Rapoport
  2025-06-07 17:50     ` Pasha Tatashin
  2025-06-04 16:00   ` Pratyush Yadav
  1 sibling, 1 reply; 102+ messages in thread
From: Mike Rapoport @ 2025-05-26  7:18 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: pratyush, jasonmiu, graf, changyuanl, dmatlack, rientjes, corbet,
	rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl, masahiroy, akpm,
	tj, yoann.congal, mmaurer, roman.gushchin, chenridong, axboe,
	mark.rutland, jannh, vincent.guittot, hannes, dan.j.williams,
	david, joel.granados, rostedt, anna.schumaker, song, zhangguopeng,
	linux, linux-kernel, linux-doc, linux-mm, gregkh, tglx, mingo, bp,
	dave.hansen, x86, hpa, rafael, dakr, bartosz.golaszewski,
	cw00.choi, myungjoo.ham, yesanishhere, Jonathan.Cameron,
	quic_zijuhu, aleksander.lobakin, ira.weiny, andriy.shevchenko,
	leon, lukas, bhelgaas, wagi, djeffery, stuart.w.hayes, ptyadav

On Thu, May 15, 2025 at 06:23:09PM +0000, Pasha Tatashin wrote:
> Integrate the LUO with the KHO framework to enable passing LUO state
> across a kexec reboot.
> 
> This patch introduces the following changes:
> - During the KHO finalization phase allocate FDT blob.
> - Populate this FDT with a LUO compatibility string ("luo-v1") and the
>   current LUO state (`luo_state`).
> - Implement a KHO notifier

Would be nice to have more details about how LUO interacts with KHO, like
how LUO states correspond to the state of KHO, what may trigger
corresponding state transitions etc.
 
> LUO now depends on `CONFIG_KEXEC_HANDOVER`. The core state transition
> logic (`luo_do_*_calls`) remains unimplemented in this patch.
> 
> Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
> ---
>  drivers/misc/liveupdate/luo_core.c | 222 ++++++++++++++++++++++++++++-
>  1 file changed, 219 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/misc/liveupdate/luo_core.c b/drivers/misc/liveupdate/luo_core.c
> index 919c37b0b4d1..a76e886bc3b1 100644
> --- a/drivers/misc/liveupdate/luo_core.c
> +++ b/drivers/misc/liveupdate/luo_core.c
> @@ -36,9 +36,12 @@
>  #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
>  
>  #include <linux/err.h>
> +#include <linux/kexec_handover.h>
>  #include <linux/kobject.h>
> +#include <linux/libfdt.h>
>  #include <linux/liveupdate.h>
>  #include <linux/rwsem.h>
> +#include <linux/sizes.h>
>  #include <linux/string.h>
>  #include "luo_internal.h"
>  
> @@ -55,6 +58,12 @@ const char *const luo_state_str[] = {
>  
>  bool luo_enabled;
>  
> +static void *luo_fdt_out;
> +static void *luo_fdt_in;
> +#define LUO_FDT_SIZE		SZ_1M

Does LUO really need that much?

> +#define LUO_KHO_ENTRY_NAME	"LUO"
> +#define LUO_COMPATIBLE		"luo-v1"
> +
>  static int __init early_liveupdate_param(char *buf)
>  {
>  	return kstrtobool(buf, &luo_enabled);
> @@ -79,6 +88,60 @@ static inline void luo_set_state(enum liveupdate_state state)
>  	__luo_set_state(state);
>  }
>  
> +/* Called during the prepare phase, to create LUO fdt tree */
> +static int luo_fdt_setup(struct kho_serialization *ser)
> +{
> +	void *fdt_out;
> +	int ret;
> +
> +	fdt_out = (void *)__get_free_pages(GFP_KERNEL | __GFP_ZERO,
> +					   get_order(LUO_FDT_SIZE));
> +	if (!fdt_out) {
> +		pr_err("failed to allocate FDT memory\n");
> +		return -ENOMEM;
> +	}
> +
> +	ret = fdt_create_empty_tree(fdt_out, LUO_FDT_SIZE);
> +	if (ret)
> +		goto exit_free;
> +
> +	ret = fdt_setprop(fdt_out, 0, "compatible", LUO_COMPATIBLE,
> +			  strlen(LUO_COMPATIBLE) + 1);
> +	if (ret)
> +		goto exit_free;
> +
> +	ret = kho_preserve_phys(__pa(fdt_out), LUO_FDT_SIZE);
> +	if (ret)
> +		goto exit_free;
> +
> +	ret = kho_add_subtree(ser, LUO_KHO_ENTRY_NAME, fdt_out);
> +	if (ret)
> +		goto exit_unpreserve;
> +	luo_fdt_out = fdt_out;
> +
> +	return 0;
> +
> +exit_unpreserve:
> +	kho_unpreserve_phys(__pa(fdt_out), LUO_FDT_SIZE);
> +exit_free:
> +	free_pages((unsigned long)fdt_out, get_order(LUO_FDT_SIZE));
> +	pr_err("failed to prepare LUO FDT: %d\n", ret);
> +
> +	return ret;
> +}
> +
> +static void luo_fdt_destroy(void)
> +{
> +	kho_unpreserve_phys(__pa(luo_fdt_out), LUO_FDT_SIZE);
> +	free_pages((unsigned long)luo_fdt_out, get_order(LUO_FDT_SIZE));
> +	luo_fdt_out = NULL;
> +}
> +
> +static int luo_do_prepare_calls(void)
> +{
> +	return 0;
> +}
> +
>  static int luo_do_freeze_calls(void)
>  {
>  	return 0;
> @@ -88,11 +151,111 @@ static void luo_do_finish_calls(void)
>  {
>  }
>  
> -int luo_prepare(void)
> +static void luo_do_cancel_calls(void)
> +{
> +}
> +
> +static int __luo_prepare(struct kho_serialization *ser)
>  {
> +	int ret;
> +
> +	if (down_write_killable(&luo_state_rwsem)) {
> +		pr_warn("[prepare] event canceled by user\n");
> +		return -EAGAIN;
> +	}
> +
> +	if (!is_current_luo_state(LIVEUPDATE_STATE_NORMAL)) {
> +		pr_warn("Can't switch to [%s] from [%s] state\n",
> +			luo_state_str[LIVEUPDATE_STATE_PREPARED],
> +			LUO_STATE_STR);
> +		ret = -EINVAL;
> +		goto exit_unlock;
> +	}
> +
> +	ret = luo_fdt_setup(ser);
> +	if (ret)
> +		goto exit_unlock;

At this point LUO should know how many subsystems are participating in live
update, I believe it can properly size the fdt. 
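
A back-of-the-envelope sketch of such sizing, in user-space C. All three
constants and the helper name are assumptions made for illustration; the
real per-subsystem cost depends on node name lengths and the properties
each subsystem stores.

```c
#include <stddef.h>

#define SKETCH_PAGE_SIZE	4096u	/* assumed page size */
#define LUO_FDT_BASE_SIZE	4096u	/* hypothetical: header, root node, slack */
#define LUO_FDT_PER_SUBSYS	128u	/* hypothetical budget per subsystem node */

/* Estimate the FDT buffer size for nr_subsystems, rounded up to whole pages. */
static size_t luo_fdt_size_sketch(size_t nr_subsystems)
{
	size_t size = LUO_FDT_BASE_SIZE + nr_subsystems * LUO_FDT_PER_SUBSYS;

	return (size + SKETCH_PAGE_SIZE - 1) & ~((size_t)SKETCH_PAGE_SIZE - 1);
}
```

Under these assumed numbers, 32 subsystems fit in two pages, far below the
fixed SZ_1M allocation.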

> +
> +	ret = luo_do_prepare_calls();
> +	if (ret)
> +		goto exit_unlock;
> +
> +	luo_set_state(LIVEUPDATE_STATE_PREPARED);
> +
> +exit_unlock:
> +	up_write(&luo_state_rwsem);
> +
> +	return ret;
> +}
> +
> +static int __luo_cancel(void)
> +{
> +	if (down_write_killable(&luo_state_rwsem)) {
> +		pr_warn("[cancel] event canceled by user\n");
> +		return -EAGAIN;
> +	}
> +
> +	if (!is_current_luo_state(LIVEUPDATE_STATE_PREPARED) &&
> +	    !is_current_luo_state(LIVEUPDATE_STATE_FROZEN)) {
> +		pr_warn("Can't switch to [%s] from [%s] state\n",
> +			luo_state_str[LIVEUPDATE_STATE_NORMAL],
> +			LUO_STATE_STR);
> +		up_write(&luo_state_rwsem);
> +
> +		return -EINVAL;
> +	}
> +
> +	luo_do_cancel_calls();
> +	luo_fdt_destroy();
> +	luo_set_state(LIVEUPDATE_STATE_NORMAL);
> +
> +	up_write(&luo_state_rwsem);
> +
>  	return 0;
>  }
>  
> +static int luo_kho_notifier(struct notifier_block *self,
> +			    unsigned long cmd, void *v)
> +{
> +	int ret;
> +
> +	switch (cmd) {
> +	case KEXEC_KHO_FINALIZE:
> +		ret = __luo_prepare((struct kho_serialization *)v);
> +		break;
> +	case KEXEC_KHO_ABORT:
> +		ret = __luo_cancel();
> +		break;
> +	default:
> +		return NOTIFY_BAD;
> +	}
> +
> +	return notifier_from_errno(ret);
> +}
> +
> +static struct notifier_block luo_kho_notifier_nb = {
> +	.notifier_call = luo_kho_notifier,
> +};
> +
> +/**
> + * luo_prepare - Initiate the live update preparation phase.
> + *
> + * This function is called to begin the live update process. It attempts to
> + * transition the luo to the ``LIVEUPDATE_STATE_PREPARED`` state.
> + *
> + * If the calls complete successfully, the orchestrator state is set
> + * to ``LIVEUPDATE_STATE_PREPARED``. If any  call fails a
> + * ``LIVEUPDATE_CANCEL`` is sent to roll back any actions.
> + *
> + * @return 0 on success, ``-EAGAIN`` if the state change was cancelled by the
> + * user while waiting for the lock, ``-EINVAL`` if the orchestrator is not in
> + * the normal state, or a negative error code returned by the calls.
> + */
> +int luo_prepare(void)
> +{
> +	return kho_finalize();
> +}
> +
>  /**
>   * luo_freeze() - Initiate the final freeze notification phase for live update.
>   *
> @@ -188,9 +351,23 @@ int luo_finish(void)
>  	return 0;
>  }
>  
> +/**
> + * luo_cancel - Cancel the ongoing live update from prepared or frozen states.
> + *
> + * This function is called to abort a live update that is currently in the
> + * ``LIVEUPDATE_STATE_PREPARED`` or ``LIVEUPDATE_STATE_FROZEN`` state.
> + *
> + * If the state is correct, it triggers the ``LIVEUPDATE_CANCEL`` notifier chain
> + * to allow subsystems to undo any actions performed during the prepare or
> + * freeze events. Finally, the orchestrator state is transitioned back to
> + * ``LIVEUPDATE_STATE_NORMAL``.
> + *
> + * @return 0 on success, or ``-EAGAIN`` if the state change was cancelled by the
> + * user while waiting for the lock.
> + */
>  int luo_cancel(void)
>  {
> -	return 0;
> +	return kho_abort();
>  }
>  
>  void luo_state_read_enter(void)
> @@ -205,7 +382,46 @@ void luo_state_read_exit(void)
>  
>  static int __init luo_startup(void)
>  {
> -	__luo_set_state(LIVEUPDATE_STATE_NORMAL);
> +	phys_addr_t fdt_phys;
> +	int ret;
> +
> +	if (!kho_is_enabled()) {
> +		if (luo_enabled)
> +			pr_warn("Disabling liveupdate because KHO is disabled\n");
> +		luo_enabled = false;
> +		return 0;
> +	}
> +
> +	ret = register_kho_notifier(&luo_kho_notifier_nb);
> +	if (ret) {
> +		luo_enabled = false;
> +		pr_warn("Failed to register with KHO [%d]\n", ret);
> +	}
> +
> +	/*
> +	 * Retrieve LUO subtree, and verify its format.  Panic in case of
> +	 * exceptions, since machine devices and memory are in an unpredictable
> +	 * state.
> +	 */
> +	ret = kho_retrieve_subtree(LUO_KHO_ENTRY_NAME, &fdt_phys);
> +	if (ret) {
> +		if (ret != -ENOENT) {
> +			panic("failed to retrieve FDT '%s' from KHO: %d\n",
> +			      LUO_KHO_ENTRY_NAME, ret);
> +		}
> +		__luo_set_state(LIVEUPDATE_STATE_NORMAL);
> +
> +		return 0;
> +	}
> +
> +	luo_fdt_in = __va(fdt_phys);
> +	ret = fdt_node_check_compatible(luo_fdt_in, 0, LUO_COMPATIBLE);
> +	if (ret) {
> +		panic("FDT '%s' is incompatible with '%s' [%d]\n",
> +		      LUO_KHO_ENTRY_NAME, LUO_COMPATIBLE, ret);
> +	}
> +
> +	__luo_set_state(LIVEUPDATE_STATE_UPDATED);
>  
>  	return 0;
>  }
> -- 
> 2.49.0.1101.gccaa498523-goog
> 
> 

-- 
Sincerely yours,
Mike.

* Re: [RFC v2 06/16] luo: luo_subsystems: add subsystem registration
  2025-05-15 18:23 ` [RFC v2 06/16] luo: luo_subsystems: add subsystem registration Pasha Tatashin
@ 2025-05-26  7:31   ` Mike Rapoport
  2025-06-07 23:42     ` Pasha Tatashin
  2025-05-28 19:12   ` David Matlack
  2025-06-04 16:30   ` Pratyush Yadav
  2 siblings, 1 reply; 102+ messages in thread
From: Mike Rapoport @ 2025-05-26  7:31 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: pratyush, jasonmiu, graf, changyuanl, dmatlack, rientjes, corbet,
	rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl, masahiroy, akpm,
	tj, yoann.congal, mmaurer, roman.gushchin, chenridong, axboe,
	mark.rutland, jannh, vincent.guittot, hannes, dan.j.williams,
	david, joel.granados, rostedt, anna.schumaker, song, zhangguopeng,
	linux, linux-kernel, linux-doc, linux-mm, gregkh, tglx, mingo, bp,
	dave.hansen, x86, hpa, rafael, dakr, bartosz.golaszewski,
	cw00.choi, myungjoo.ham, yesanishhere, Jonathan.Cameron,
	quic_zijuhu, aleksander.lobakin, ira.weiny, andriy.shevchenko,
	leon, lukas, bhelgaas, wagi, djeffery, stuart.w.hayes, ptyadav

On Thu, May 15, 2025 at 06:23:10PM +0000, Pasha Tatashin wrote:
> Introduce the framework for kernel subsystems (e.g., KVM, IOMMU, device
> drivers) to register with LUO and participate in the live update process
> via callbacks.

...

> diff --git a/include/linux/liveupdate.h b/include/linux/liveupdate.h
> index c2740da70958..7a130680b5f2 100644
> --- a/include/linux/liveupdate.h
> +++ b/include/linux/liveupdate.h
> @@ -86,6 +86,39 @@ enum liveupdate_state  {
>  	LIVEUPDATE_STATE_UPDATED = 3,
>  };
>  
> +/**
> + * struct liveupdate_subsystem - Represents a subsystem participating in LUO
> + * @prepare:      Optional. Called during LUO prepare phase. Should perform
> + *                preparatory actions and can store a u64 handle/state
> + *                via the 'data' pointer for use in later callbacks.
> + *                Return 0 on success, negative error code on failure.
> + * @freeze:       Optional. Called during LUO freeze event (before actual jump
> + *                to new kernel). Should perform final state saving actions and
> + *                can update the u64 handle/state via the 'data' pointer. Return
> + *                0 on success, negative error code on failure.
> + * @cancel:       Optional. Called if the live update process is canceled after
> + *                prepare (or freeze) was called. Receives the u64 data
> + *                set by prepare/freeze. Used for cleanup.
> + * @finish:       Optional. Called after the live update is finished in the new
> + *                kernel.
> + *                Receives the u64 data set by prepare/freeze. Used for cleanup.
> + * @name:         Mandatory. Unique name identifying the subsystem.
> + * @arg:          Add this argument to callback functions.
> + * @list:         List head used internally by LUO. Should not be modified by
> + *                caller after registration.
> + * @private_data: For LUO internal use, cached value of data field.
> + */
> +struct liveupdate_subsystem {
> +	int (*prepare)(void *arg, u64 *data);
> +	int (*freeze)(void *arg, u64 *data);
> +	void (*cancel)(void *arg, u64 data);
> +	void (*finish)(void *arg, u64 data);

What is the intended use of arg in all these?

> +	const char *name;
> +	void *arg;
> +	struct list_head list;
> +	u64 private_data;
> +};

I suggest splitting the callbacks into, say, liveupdate_ops so we can
constify them.
And then it seems that the data in liveupdate_subsystem can be private to
LUO.
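
A rough userspace sketch of that split, just to make the shape concrete.
The names liveupdate_ops, luo_subsys_entry and luo_register() are
hypothetical, the u64 typedef stands in for the kernel type, and the
fixed-size table replaces the real list and locking:

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef uint64_t u64;

/* Callbacks only, so providers can declare the table const. */
struct liveupdate_ops {
	int (*prepare)(void *arg, u64 *data);
	int (*freeze)(void *arg, u64 *data);
	void (*cancel)(void *arg, u64 data);
	void (*finish)(void *arg, u64 data);
};

/* Bookkeeping that stays private to LUO (hypothetical layout). */
struct luo_subsys_entry {
	const char *name;
	const struct liveupdate_ops *ops;
	void *arg;
	u64 private_data;
};

#define LUO_MAX_SUBSYS 8
static struct luo_subsys_entry luo_subsys[LUO_MAX_SUBSYS];
static size_t luo_nr_subsys;

/* Register by name; callers never see or touch luo_subsys_entry. */
static int luo_register(const char *name, const struct liveupdate_ops *ops,
			void *arg)
{
	size_t i;

	if (!name || !ops)
		return -EINVAL;
	for (i = 0; i < luo_nr_subsys; i++)
		if (!strcmp(luo_subsys[i].name, name))
			return -EEXIST;
	if (luo_nr_subsys == LUO_MAX_SUBSYS)
		return -ENOSPC;

	luo_subsys[luo_nr_subsys++] = (struct luo_subsys_entry){
		.name = name, .ops = ops, .arg = arg,
	};
	return 0;
}

/* Example provider: the ops table can live in read-only data. */
static int demo_prepare(void *arg, u64 *data)
{
	(void)arg;
	*data = 42;
	return 0;
}

static const struct liveupdate_ops demo_ops = {
	.prepare = demo_prepare,
};
```

A provider would then call luo_register("demo", &demo_ops, NULL) and
never embed a list head or private_data of its own.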

> +
>  #ifdef CONFIG_LIVEUPDATE
>  
>  /* Return true if live update orchestrator is enabled */
> @@ -105,6 +138,10 @@ bool liveupdate_state_updated(void);
>   */
>  bool liveupdate_state_normal(void);
>  
> +int liveupdate_register_subsystem(struct liveupdate_subsystem *h);

int liveupdate_register_subsystem(name, ops, data) ?

> +int liveupdate_unregister_subsystem(struct liveupdate_subsystem *h);
> +int liveupdate_get_subsystem_data(struct liveupdate_subsystem *h, u64 *data);
> +
>  #else /* CONFIG_LIVEUPDATE */

-- 
Sincerely yours,
Mike.

* Re: [RFC v2 08/16] luo: luo_files: add infrastructure for FDs
  2025-05-15 18:23 ` [RFC v2 08/16] luo: luo_files: add infrastructure for FDs Pasha Tatashin
  2025-05-15 23:15   ` James Houghton
@ 2025-05-26  7:55   ` Mike Rapoport
  2025-06-05 11:56     ` Pratyush Yadav
  2025-06-08 13:13     ` Pasha Tatashin
  2025-06-05 15:56   ` Pratyush Yadav
  2 siblings, 2 replies; 102+ messages in thread
From: Mike Rapoport @ 2025-05-26  7:55 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: pratyush, jasonmiu, graf, changyuanl, dmatlack, rientjes, corbet,
	rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl, masahiroy, akpm,
	tj, yoann.congal, mmaurer, roman.gushchin, chenridong, axboe,
	mark.rutland, jannh, vincent.guittot, hannes, dan.j.williams,
	david, joel.granados, rostedt, anna.schumaker, song, zhangguopeng,
	linux, linux-kernel, linux-doc, linux-mm, gregkh, tglx, mingo, bp,
	dave.hansen, x86, hpa, rafael, dakr, bartosz.golaszewski,
	cw00.choi, myungjoo.ham, yesanishhere, Jonathan.Cameron,
	quic_zijuhu, aleksander.lobakin, ira.weiny, andriy.shevchenko,
	leon, lukas, bhelgaas, wagi, djeffery, stuart.w.hayes, ptyadav

On Thu, May 15, 2025 at 06:23:12PM +0000, Pasha Tatashin wrote:
> Introduce the framework within LUO to support preserving specific types
> of file descriptors across a live update transition. This allows
> stateful FDs (like memfds or vfio FDs used by VMs) to be recreated in
> the new kernel.
> 
> Note: The core logic for iterating through the luo_files_list and
> invoking the handler callbacks (prepare, freeze, cancel, finish)
> within luo_do_files_*_calls, as well as managing the u64 data
> persistence via the FDT for individual files, is currently implemented
> as stubs in this patch. This patch sets up the registration, FDT layout,
> and retrieval framework.
> 
> Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
> ---
>  drivers/misc/liveupdate/Makefile       |   1 +
>  drivers/misc/liveupdate/luo_core.c     |  19 +
>  drivers/misc/liveupdate/luo_files.c    | 563 +++++++++++++++++++++++++
>  drivers/misc/liveupdate/luo_internal.h |  11 +
>  include/linux/liveupdate.h             |  62 +++
>  5 files changed, 656 insertions(+)
>  create mode 100644 drivers/misc/liveupdate/luo_files.c
> 
> diff --git a/drivers/misc/liveupdate/Makefile b/drivers/misc/liveupdate/Makefile
> index df1c9709ba4f..b4cdd162574f 100644
> --- a/drivers/misc/liveupdate/Makefile
> +++ b/drivers/misc/liveupdate/Makefile
> @@ -1,3 +1,4 @@
>  # SPDX-License-Identifier: GPL-2.0
>  obj-y					+= luo_core.o
> +obj-y					+= luo_files.o
>  obj-y					+= luo_subsystems.o
> diff --git a/drivers/misc/liveupdate/luo_core.c b/drivers/misc/liveupdate/luo_core.c
> index 417e7f6bf36c..ab1d76221fe2 100644
> --- a/drivers/misc/liveupdate/luo_core.c
> +++ b/drivers/misc/liveupdate/luo_core.c
> @@ -110,6 +110,10 @@ static int luo_fdt_setup(struct kho_serialization *ser)
>  	if (ret)
>  		goto exit_free;
>  
> +	ret = luo_files_fdt_setup(fdt_out);
> +	if (ret)
> +		goto exit_free;
> +
>  	ret = luo_subsystems_fdt_setup(fdt_out);
>  	if (ret)
>  		goto exit_free;

The duplication of files and subsystems does not look nice here and below.
Can't we make files a subsystem?
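
If files were registered as just another subsystem, the core would need a
single loop with rollback instead of the paired files/subsystems calls.
A userspace model of that loop, where luo_participant, luo_prepare_all()
and the array-based registry are all hypothetical stand-ins:

```c
#include <stddef.h>

/* Hypothetical common shape shared by files and other subsystems. */
struct luo_participant {
	const char *name;
	int (*prepare)(void *arg);
	void (*cancel)(void *arg);
	void *arg;
};

/*
 * Call prepare on every participant in order. On the first failure,
 * cancel the earlier participants in reverse order and return the
 * error. The failing participant itself is not cancelled.
 */
static int luo_prepare_all(struct luo_participant *p, size_t n)
{
	size_t i;
	int ret;

	for (i = 0; i < n; i++) {
		if (!p[i].prepare)
			continue;
		ret = p[i].prepare(p[i].arg);
		if (ret)
			goto rollback;
	}
	return 0;

rollback:
	while (i-- > 0)
		if (p[i].cancel)
			p[i].cancel(p[i].arg);
	return ret;
}

/* Demo callbacks: count calls through the int passed as arg. */
static int demo_ok(void *arg)    { ++*(int *)arg; return 0; }
static int demo_fail(void *arg)  { (void)arg; return -5; }
static void demo_undo(void *arg) { --*(int *)arg; }
```

The same loop shape would serve freeze, and luo_do_prepare_calls() would
no longer need to know that files exist.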

> @@ -145,7 +149,13 @@ static int luo_do_prepare_calls(void)
>  {
>  	int ret;
>  
> +	ret = luo_do_files_prepare_calls();
> +	if (ret)
> +		return ret;
> +
>  	ret = luo_do_subsystems_prepare_calls();
> +	if (ret)
> +		luo_do_files_cancel_calls();
>  
>  	return ret;
>  }
> @@ -154,18 +164,26 @@ static int luo_do_freeze_calls(void)
>  {
>  	int ret;
>  
> +	ret = luo_do_files_freeze_calls();
> +	if (ret)
> +		return ret;
> +
>  	ret = luo_do_subsystems_freeze_calls();
> +	if (ret)
> +		luo_do_files_cancel_calls();
>  
>  	return ret;
>  }
>  
>  static void luo_do_finish_calls(void)
>  {
> +	luo_do_files_finish_calls();
>  	luo_do_subsystems_finish_calls();
>  }
>  
>  static void luo_do_cancel_calls(void)
>  {
> +	luo_do_files_cancel_calls();
>  	luo_do_subsystems_cancel_calls();
>  }
>  
> @@ -436,6 +454,7 @@ static int __init luo_startup(void)
>  	}
>  
>  	__luo_set_state(LIVEUPDATE_STATE_UPDATED);
> +	luo_files_startup(luo_fdt_in);
>  	luo_subsystems_startup(luo_fdt_in);
>  
>  	return 0;
> diff --git a/drivers/misc/liveupdate/luo_files.c b/drivers/misc/liveupdate/luo_files.c
> new file mode 100644
> index 000000000000..953fc40db3d7
> --- /dev/null
> +++ b/drivers/misc/liveupdate/luo_files.c
> @@ -0,0 +1,563 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +/*
> + * Copyright (c) 2025, Google LLC.
> + * Pasha Tatashin <pasha.tatashin@soleen.com>
> + */
> +
> +/**
> + * DOC: LUO file descriptors
> + *
> + * LUO provides the infrastructure necessary to preserve
> + * specific types of stateful file descriptors across a kernel live
> + * update transition. The primary goal is to allow workloads, such as virtual
> + * machines using vfio, memfd, or iommufd to retain access to their essential
> + * resources without interruption after the underlying kernel is updated.
> + *
> + * The framework operates based on handler registration and instance tracking:
> + *
> + * 1. Handler Registration: Kernel modules responsible for specific file
> + * types (e.g., memfd, vfio) register a &struct liveupdate_filesystem
> + * handler. This handler contains callbacks (&liveupdate_filesystem.prepare,
> + * &liveupdate_filesystem.freeze, &liveupdate_filesystem.finish, etc.)
> + * and a unique 'compatible' string identifying the file type.
> + * Registration occurs via liveupdate_register_filesystem().

I wouldn't use filesystem here, as the obvious users are not really
filesystems. Maybe liveupdate_register_file_ops?

> + *
> + * 2. File Instance Tracking: When a potentially preservable file needs to be
> + * managed for live update, the core LUO logic (luo_register_file()) finds a
> + * compatible registered handler using its &liveupdate_filesystem.can_preserve
> + * callback. If found, an internal &struct luo_file instance is created,
> + * assigned a unique u64 'token', and added to a list.
> + *
> + * 3. State Persistence (FDT): During the LUO prepare/freeze phases, the
> + * registered handler callbacks are invoked for each tracked file instance.
> + * These callbacks can generate a u64 data payload representing the minimal
> + * state needed for restoration. This payload, along with the handler's
> + * compatible string and the unique token, is stored in a dedicated
> + * '/file-descriptors' node within the main LUO FDT blob passed via
> + * Kexec Handover (KHO).
> + *
> + * 4. Restoration: In the new kernel, the LUO framework parses the incoming
> + * FDT to reconstruct the list of &struct luo_file instances. When the
> + * original owner requests the file, luo_retrieve_file() uses the corresponding
> + * handler's &liveupdate_filesystem.retrieve callback, passing the persisted
> + * u64 data, to recreate or find the appropriate &struct file object.
> + */

The DOC is mostly about what luo_files does; we'd also need a description
of its intended use, both internally in the kernel and by userspace.

> +
> +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
> +

...

> +/**
> + * luo_register_file - Register a file descriptor for live update management.
> + * @tokenp: Return argument for the token value.
> + * @file: Pointer to the struct file to be preserved.
> + *
> + * Context: Must be called when LUO is in 'normal' state.
> + *
> + * Return: 0 on success. Negative errno on failure.
> + */
> +int luo_register_file(u64 *tokenp, struct file *file)
> +{
> +	struct liveupdate_filesystem *fs;
> +	bool found = false;
> +	int ret = -ENOENT;
> +	u64 token;
> +
> +	luo_state_read_enter();
> +	if (!liveupdate_state_normal() && !liveupdate_state_updated()) {
> +		pr_warn("File can be registered only in normal or updated state\n");
> +		luo_state_read_exit();
> +		return -EBUSY;
> +	}
> +
> +	down_read(&luo_filesystems_list_rwsem);
> +	list_for_each_entry(fs, &luo_filesystems_list, list) {
> +		if (fs->can_preserve(file, fs->arg)) {
> +			found = true;
> +			break;
> +		}
> +	}
> +
> +	if (found) {

	if (!found)
		goto exit_unlock;

> +		struct luo_file *luo_file = kmalloc(sizeof(*luo_file),
> +						    GFP_KERNEL);
> +
> +		if (!luo_file) {
> +			ret = -ENOMEM;
> +			goto exit_unlock;
> +		}
> +
> +		token = luo_next_file_token;
> +		luo_next_file_token++;
> +
> +		luo_file->private_data = 0;
> +		luo_file->reclaimed = false;
> +
> +		luo_file->file = file;
> +		luo_file->fs = fs;
> +		mutex_init(&luo_file->mutex);
> +		luo_file->state = LIVEUPDATE_STATE_NORMAL;
> +		ret = xa_err(xa_store(&luo_files_xa_out, token, luo_file,
> +				      GFP_KERNEL));
> +		if (ret < 0) {
> +			pr_warn("Failed to store file for token %llu in XArray: %d\n",
> +				token, ret);
> +			kfree(luo_file);
> +			goto exit_unlock;
> +		}
> +		*tokenp = token;
> +	}
> +
> +exit_unlock:
> +	up_read(&luo_filesystems_list_rwsem);
> +	luo_state_read_exit();
> +
> +	return ret;
> +}
> +
> diff --git a/include/linux/liveupdate.h b/include/linux/liveupdate.h
> index 7a130680b5f2..7afe0aac5ce4 100644
> --- a/include/linux/liveupdate.h
> +++ b/include/linux/liveupdate.h
> @@ -86,6 +86,55 @@ enum liveupdate_state  {
>  	LIVEUPDATE_STATE_UPDATED = 3,
>  };
>  
> +/* Forward declaration needed if definition isn't included */
> +struct file;
> +
> +/**
> + * struct liveupdate_filesystem - Represents a handler for a live-updatable
> + * filesystem/file type.
> + * @prepare:       Optional. Saves state for a specific file instance (@file,
> + *                 @arg) before update, potentially returning value via @data.
> + *                 Returns 0 on success, negative errno on failure.
> + * @freeze:        Optional. Performs final actions just before kernel
> + *                 transition, potentially reading/updating the handle via
> + *                 @data.
> + *                 Returns 0 on success, negative errno on failure.
> + * @cancel:        Optional. Cleans up state/resources if update is aborted
> + *                 after prepare/freeze succeeded, using the @data handle (by
> + *                 value) from the successful prepare. Returns void.
> + * @finish:        Optional. Performs final cleanup in the new kernel using the
> + *                 preserved @data handle (by value). Returns void.
> + * @retrieve:      Retrieve the preserved file. Must be called before finish.
> + * @can_preserve:  callback to determine if @file with associated context (@arg)
> + *                 can be preserved by this handler.
> + *                 Return bool (true if preservable, false otherwise).
> + * @compatible:    The compatibility string (e.g., "memfd-v1", "vfiofd-v1")
> + *                 that uniquely identifies the filesystem or file type this
> + *                 handler supports. This is matched against the compatible
> + *                 string associated with individual &struct liveupdate_file
> + *                 instances.
> + * @arg:           An opaque pointer to implementation-specific context data
> + *                 associated with this filesystem handler registration.
> + * @list:          used for linking this handler instance into a global list of
> + *                 registered filesystem handlers.
> + *
> + * Modules that want to support live update for specific file types should
> + * register an instance of this structure. LUO uses this registration to
> + * determine if a given file can be preserved and to find the appropriate
> + * operations to manage its state across the update.
> + */
> +struct liveupdate_filesystem {
> +	int (*prepare)(struct file *file, void *arg, u64 *data);
> +	int (*freeze)(struct file *file, void *arg, u64 *data);
> +	void (*cancel)(struct file *file, void *arg, u64 data);
> +	void (*finish)(struct file *file, void *arg, u64 data, bool reclaimed);
> +	int (*retrieve)(void *arg, u64 data, struct file **file);
> +	bool (*can_preserve)(struct file *file, void *arg);
> +	const char *compatible;
> +	void *arg;
> +	struct list_head list;
> +};
> +

Like with subsystems, I'd split ops and make the data part private to
luo_files.c

>  /**
>   * struct liveupdate_subsystem - Represents a subsystem participating in LUO
>   * @prepare:      Optional. Called during LUO prepare phase. Should perform
> @@ -142,6 +191,9 @@ int liveupdate_register_subsystem(struct liveupdate_subsystem *h);
>  int liveupdate_unregister_subsystem(struct liveupdate_subsystem *h);
>  int liveupdate_get_subsystem_data(struct liveupdate_subsystem *h, u64 *data);
>  
> +int liveupdate_register_filesystem(struct liveupdate_filesystem *h);
> +int liveupdate_unregister_filesystem(struct liveupdate_filesystem *h);

int liveupdate_register_file_ops(name, ops, data, ret_token) ?

> +
>  #else /* CONFIG_LIVEUPDATE */
>  
>  static inline int liveupdate_reboot(void)
> @@ -180,5 +232,15 @@ static inline int liveupdate_get_subsystem_data(struct liveupdate_subsystem *h,
>  	return -ENODATA;
>  }
>  
> +static inline int liveupdate_register_filesystem(struct liveupdate_filesystem *h)
> +{
> +	return 0;
> +}
> +
> +static inline int liveupdate_unregister_filesystem(struct liveupdate_filesystem *h)
> +{
> +	return 0;
> +}
> +
>  #endif /* CONFIG_LIVEUPDATE */
>  #endif /* _LINUX_LIVEUPDATE_H */
> -- 
> 2.49.0.1101.gccaa498523-goog
> 
> 

-- 
Sincerely yours,
Mike.

* Re: [RFC v2 10/16] luo: luo_ioctl: add ioctl interface
  2025-05-15 18:23 ` [RFC v2 10/16] luo: luo_ioctl: add ioctl interface Pasha Tatashin
@ 2025-05-26  8:42   ` Mike Rapoport
  2025-06-08 15:08     ` Pasha Tatashin
  2025-05-28 20:29   ` David Matlack
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 102+ messages in thread
From: Mike Rapoport @ 2025-05-26  8:42 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: pratyush, jasonmiu, graf, changyuanl, dmatlack, rientjes, corbet,
	rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl, masahiroy, akpm,
	tj, yoann.congal, mmaurer, roman.gushchin, chenridong, axboe,
	mark.rutland, jannh, vincent.guittot, hannes, dan.j.williams,
	david, joel.granados, rostedt, anna.schumaker, song, zhangguopeng,
	linux, linux-kernel, linux-doc, linux-mm, gregkh, tglx, mingo, bp,
	dave.hansen, x86, hpa, rafael, dakr, bartosz.golaszewski,
	cw00.choi, myungjoo.ham, yesanishhere, Jonathan.Cameron,
	quic_zijuhu, aleksander.lobakin, ira.weiny, andriy.shevchenko,
	leon, lukas, bhelgaas, wagi, djeffery, stuart.w.hayes, ptyadav

On Thu, May 15, 2025 at 06:23:14PM +0000, Pasha Tatashin wrote:
> Introduce the user-space interface for the Live Update Orchestrator
> via ioctl commands, enabling external control over the live update
> process and management of preserved resources.
> 
> Create a misc character device at /dev/liveupdate. Access
> to this device requires the CAP_SYS_ADMIN capability.
> 
> A new UAPI header, <uapi/linux/liveupdate.h>, defines the necessary
> structures. The magic number is registered in
> Documentation/userspace-api/ioctl/ioctl-number.rst.
> 
> Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
 
...

> -/**
> - * enum liveupdate_state - Defines the possible states of the live update
> - * orchestrator.
> - * @LIVEUPDATE_STATE_NORMAL:         Default state, no live update in progress.
> - * @LIVEUPDATE_STATE_PREPARED:       Live update is prepared for reboot; the
> - *                                   LIVEUPDATE_PREPARE callbacks have completed
> - *                                   successfully.
> - *                                   Devices might operate in a limited state
> - *                                   for example the participating devices might
> - *                                   not be allowed to unbind, and also the
> - *                                   setting up of new DMA mappings might be
> - *                                   disabled in this state.
> - * @LIVEUPDATE_STATE_FROZEN:         The final reboot event
> - *                                   (%LIVEUPDATE_FREEZE) has been sent, and the
> - *                                   system is performing its final state saving
> - *                                   within the "blackout window". User
> - *                                   workloads must be suspended. The actual
> - *                                   reboot (kexec) into the next kernel is
> - *                                   imminent.
> - * @LIVEUPDATE_STATE_UPDATED:        The system has rebooted into the next
> - *                                   kernel via live update the system is now
> - *                                   running the next kernel, awaiting the
> - *                                   finish event.
> - *
> - * These states track the progress and outcome of a live update operation.
> - */
> -enum liveupdate_state  {
> -	LIVEUPDATE_STATE_NORMAL = 0,
> -	LIVEUPDATE_STATE_PREPARED = 1,
> -	LIVEUPDATE_STATE_FROZEN = 2,
> -	LIVEUPDATE_STATE_UPDATED = 3,
> -};
> -

Nit: this seems like unnecessary churn; these definitions can go to
include/uapi from the start.

> diff --git a/include/uapi/linux/liveupdate.h b/include/uapi/linux/liveupdate.h
> +/**
> + * struct liveupdate_fd - Holds parameters for preserving and restoring file
> + * descriptors across live update.
> + * @fd:    Input for %LIVEUPDATE_IOCTL_FD_PRESERVE: The user-space file
> + *         descriptor to be preserved.
> + *         Output for %LIVEUPDATE_IOCTL_FD_RESTORE: The new file descriptor
> + *         representing the fully restored kernel resource.
> + * @flags: Unused, reserved for future expansion, must be set to 0.
> + * @token: Output for %LIVEUPDATE_IOCTL_FD_PRESERVE: An opaque, unique token
> + *         generated by the kernel representing the successfully preserved
> + *         resource state.
> + *         Input for %LIVEUPDATE_IOCTL_FD_RESTORE: The token previously
> + *         returned by the preserve ioctl for the resource to be restored.
> + *
> + * This structure is used as the argument for the %LIVEUPDATE_IOCTL_FD_PRESERVE
> + * and %LIVEUPDATE_IOCTL_FD_RESTORE ioctls. These ioctls allow specific types
> + * of file descriptors (for example memfd, kvm, iommufd, and VFIO) to have their
> + * underlying kernel state preserved across a live update cycle.
> + *
> + * To preserve an FD, user space passes this struct to
> + * %LIVEUPDATE_IOCTL_FD_PRESERVE with the @fd field set. On success, the
> + * kernel populates the @token field.
> + *
> + * After the live update transition, user space passes the struct populated with
> + * the *same* @token to %LIVEUPDATE_IOCTL_FD_RESTORE. The kernel uses the @token
> + * to find the preserved state and, on success, populates the @fd field with a
> + * new file descriptor referring to the fully restored resource.
> + */
> +struct liveupdate_fd {
> +	int		fd;
> +	__u32		flags;
> +	__u64		token;
> +};

Consider using __aligned_u64 here for size-based versioning.
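
To illustrate, a userspace model of what size-based versioning could look
like here. The size and reserved fields and the luo_copy_struct() helper
are hypothetical (the helper mimics the semantics of the kernel's
copy_struct_from_user()), and the typedefs stand in for the UAPI types:

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef uint32_t u32;
typedef int32_t s32;
/* Stand-in for __aligned_u64: 8-byte aligned on every ABI, i386 included. */
typedef uint64_t aligned_u64 __attribute__((aligned(8)));

/* Hypothetical extensible layout with no implicit padding. */
struct liveupdate_fd_v1 {
	u32 size;		/* sizeof(struct) as known to the caller */
	s32 fd;
	u32 flags;		/* must be 0 */
	u32 reserved;		/* must be 0 */
	aligned_u64 token;
};

_Static_assert(offsetof(struct liveupdate_fd_v1, token) == 16, "token offset");
_Static_assert(sizeof(struct liveupdate_fd_v1) == 24, "struct size");

/*
 * Copy a caller struct of usize bytes into a kernel struct of ksize bytes.
 * Older (shorter) callers get the tail zero-filled; newer (longer) callers
 * are accepted only if the bytes the kernel does not know about are zero.
 */
static int luo_copy_struct(void *dst, size_t ksize, const void *src,
			   size_t usize)
{
	const unsigned char *s = src;
	size_t i, n = usize < ksize ? usize : ksize;

	for (i = ksize; i < usize; i++)
		if (s[i])
			return -E2BIG;

	memcpy(dst, src, n);
	memset((unsigned char *)dst + n, 0, ksize - n);
	return 0;
}
```

With the explicit alignment, 32-bit and 64-bit userspace agree on the
layout, so new fields can be appended later without a compat ioctl.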

> +
> +/* The ioctl type, documented in ioctl-number.rst */
> +#define LIVEUPDATE_IOCTL_TYPE		0xBA

...

> +/**
> + * LIVEUPDATE_IOCTL_EVENT_PREPARE - Initiate preparation phase and trigger state
> + * saving.

This (and others below) is more a command than an event IMHO. Maybe just
LIVEUPDATE_IOCTL_PREPARE?

> + * Argument: None.
> + *
> + * Initiates the live update preparation phase. This action corresponds to
> + * the internal %LIVEUPDATE_PREPARE kernel event and can also be triggered

This action is the reason for the LIVEUPDATE_PREPARE event, isn't it?
The same applies to the other IOCTL_EVENTs.

> + * by writing '1' to ``/sys/kernel/liveupdate/prepare``. This typically
> + * triggers the main state saving process for items marked via the PRESERVE
> + * ioctls. This occurs *before* the main "blackout window", while user
> + * applications (e.g., VMs) may still be running. Kernel subsystems
> + * receiving the %LIVEUPDATE_PREPARE event should serialize necessary state.
> + * This command does not transfer data.

I'm not sure I follow what this sentence means.

-- 
Sincerely yours,
Mike.

* Re: [RFC v2 13/16] luo: add selftests for subsystems un/registration
  2025-05-15 18:23 ` [RFC v2 13/16] luo: add selftests for subsystems un/registration Pasha Tatashin
@ 2025-05-26  8:52   ` Mike Rapoport
  2025-06-08 16:47     ` Pasha Tatashin
  0 siblings, 1 reply; 102+ messages in thread
From: Mike Rapoport @ 2025-05-26  8:52 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: pratyush, jasonmiu, graf, changyuanl, dmatlack, rientjes, corbet,
	rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl, masahiroy, akpm,
	tj, yoann.congal, mmaurer, roman.gushchin, chenridong, axboe,
	mark.rutland, jannh, vincent.guittot, hannes, dan.j.williams,
	david, joel.granados, rostedt, anna.schumaker, song, zhangguopeng,
	linux, linux-kernel, linux-doc, linux-mm, gregkh, tglx, mingo, bp,
	dave.hansen, x86, hpa, rafael, dakr, bartosz.golaszewski,
	cw00.choi, myungjoo.ham, yesanishhere, Jonathan.Cameron,
	quic_zijuhu, aleksander.lobakin, ira.weiny, andriy.shevchenko,
	leon, lukas, bhelgaas, wagi, djeffery, stuart.w.hayes, ptyadav

On Thu, May 15, 2025 at 06:23:17PM +0000, Pasha Tatashin wrote:
> Introduce a self-test mechanism for the LUO to allow verification of
> core subsystem management functionality. This is primarily intended
> for developers and system integrators validating the live update
> feature.
> 
> The tests are enabled via the new Kconfig option
> CONFIG_LIVEUPDATE_SELFTESTS (default 'n') and are triggered through
> a new ioctl command, LIVEUPDATE_IOCTL_SELFTESTS, added to the
> /dev/liveupdate device node.
> 
> This ioctl accepts commands defined in luo_selftests.h to:
> - LUO_CMD_SUBSYSTEM_REGISTER: Creates and registers a dummy LUO
>   subsystem using the liveupdate_register_subsystem() function. It
>   allocates a data page and copies initial data from userspace.
> - LUO_CMD_SUBSYSTEM_UNREGISTER: Unregisters the specified dummy
>   subsystem using the liveupdate_unregister_subsystem() function and
>   cleans up associated test resources.
> - LUO_CMD_SUBSYSTEM_GETDATA: Copies the data page associated with a
>   registered test subsystem back to userspace, allowing verification of
>   data potentially modified or preserved by test callbacks.
> This provides a way to test the fundamental registration and
> unregistration flows within the LUO framework from userspace without
> requiring a full live update sequence.

I don't think ioctl for selftest is a good idea.
Can't we test register/unregister and state machine transitions with kunit?

And have a separate test module that registers as a subsystem, preserves
its state and then verifies the state after the reboot. This will require
running qemu, and qemu usage in tools/testing is a mess right now, but
still.

-- 
Sincerely yours,
Mike.

* Re: [RFC v2 15/16] docs: add luo documentation
  2025-05-15 18:23 ` [RFC v2 15/16] docs: add luo documentation Pasha Tatashin
@ 2025-05-26  9:00   ` Mike Rapoport
  0 siblings, 0 replies; 102+ messages in thread
From: Mike Rapoport @ 2025-05-26  9:00 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: pratyush, jasonmiu, graf, changyuanl, dmatlack, rientjes, corbet,
	rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl, masahiroy, akpm,
	tj, yoann.congal, mmaurer, roman.gushchin, chenridong, axboe,
	mark.rutland, jannh, vincent.guittot, hannes, dan.j.williams,
	david, joel.granados, rostedt, anna.schumaker, song, zhangguopeng,
	linux, linux-kernel, linux-doc, linux-mm, gregkh, tglx, mingo, bp,
	dave.hansen, x86, hpa, rafael, dakr, bartosz.golaszewski,
	cw00.choi, myungjoo.ham, yesanishhere, Jonathan.Cameron,
	quic_zijuhu, aleksander.lobakin, ira.weiny, andriy.shevchenko,
	leon, lukas, bhelgaas, wagi, djeffery, stuart.w.hayes, ptyadav

On Thu, May 15, 2025 at 06:23:19PM +0000, Pasha Tatashin wrote:
> Add the main documentation file for the Live Update Orchestrator
> subsystem at Documentation/admin-guide/liveupdate.rst.
> 
> The new file is included in the main
> Documentation/admin-guide/index.rst table of contents.
> 
> Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
> ---
>  Documentation/admin-guide/index.rst      |  1 +
>  Documentation/admin-guide/liveupdate.rst | 62 ++++++++++++++++++++++++
>  2 files changed, 63 insertions(+)
>  create mode 100644 Documentation/admin-guide/liveupdate.rst
> 
> diff --git a/Documentation/admin-guide/index.rst b/Documentation/admin-guide/index.rst
> index 259d79fbeb94..3f59ccf32760 100644
> --- a/Documentation/admin-guide/index.rst
> +++ b/Documentation/admin-guide/index.rst
> @@ -95,6 +95,7 @@ likely to be of interest on almost any system.
>     cgroup-v2
>     cgroup-v1/index
>     cpu-load
> +   liveupdate

I'm afraid it's not the right place for everything :)
LUO has admin-guide parts, userspace-api parts and subsystems-api parts at least.

>     mm/index
>     module-signing
>     namespaces/index

-- 
Sincerely yours,
Mike.

* Re: [RFC v2 06/16] luo: luo_subsystems: add subsystem registration
  2025-05-15 18:23 ` [RFC v2 06/16] luo: luo_subsystems: add subsystem registration Pasha Tatashin
  2025-05-26  7:31   ` Mike Rapoport
@ 2025-05-28 19:12   ` David Matlack
  2025-06-07 23:58     ` Pasha Tatashin
  2025-06-04 16:30   ` Pratyush Yadav
  2 siblings, 1 reply; 102+ messages in thread
From: David Matlack @ 2025-05-28 19:12 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: pratyush, jasonmiu, graf, changyuanl, rppt, rientjes, corbet,
	rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl, masahiroy, akpm,
	tj, yoann.congal, mmaurer, roman.gushchin, chenridong, axboe,
	mark.rutland, jannh, vincent.guittot, hannes, dan.j.williams,
	david, joel.granados, rostedt, anna.schumaker, song, zhangguopeng,
	linux, linux-kernel, linux-doc, linux-mm, gregkh, tglx, mingo, bp,
	dave.hansen, x86, hpa, rafael, dakr, bartosz.golaszewski,
	cw00.choi, myungjoo.ham, yesanishhere, Jonathan.Cameron,
	quic_zijuhu, aleksander.lobakin, ira.weiny, andriy.shevchenko,
	leon, lukas, bhelgaas, wagi, djeffery, stuart.w.hayes, ptyadav

On Thu, May 15, 2025 at 11:23 AM Pasha Tatashin
<pasha.tatashin@soleen.com> wrote:
>
> +int liveupdate_register_subsystem(struct liveupdate_subsystem *h)
> +{
> +       struct liveupdate_subsystem *iter;
> +       int ret = 0;
> +
> +       luo_state_read_enter();
> +       if (!liveupdate_state_normal() && !liveupdate_state_updated()) {
> +               luo_state_read_exit();
> +               return -EBUSY;
> +       }
> +
> +       mutex_lock(&luo_subsystem_list_mutex);
> +       list_for_each_entry(iter, &luo_subsystems_list, list) {
> +               if (iter == h) {
> +                       pr_warn("Subsystem '%s' (%p) already registered.\n",
> +                               h->name, h);
> +                       ret = -EEXIST;
> +                       goto out_unlock;
> +               }
> +
> +               if (!strcmp(iter->name, h->name)) {
> +                       pr_err("Subsystem with name '%s' already registered.\n",
> +                              h->name);
> +                       ret = -EEXIST;
> +                       goto out_unlock;
> +               }
> +       }
> +
> +       INIT_LIST_HEAD(&h->list);
> +       list_add_tail(&h->list, &luo_subsystems_list);
> +
> +out_unlock:
> +       mutex_unlock(&luo_subsystem_list_mutex);
> +       luo_state_read_exit();
> +
> +       return ret;
> +}

Suggest using guard()() and scoped_guard() throughout this series
instead of manual lock/unlock and up/down. That will simplify the code
and reduce the chance of silly bugs where a code path misses an
unlock/down.
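
To illustrate the idea, here is a minimal userspace sketch, not the
kernel code. It assumes GCC/Clang's cleanup attribute (the same
mechanism the kernel's guard() builds on) and uses pthread mutexes as a
stand-in; MUTEX_GUARD, register_subsystem and the fixed-size table are
hypothetical names made up for this example.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <string.h>

/* Runs automatically when the guard variable goes out of scope. */
static void mutex_guard_unlock(pthread_mutex_t **m)
{
	if (*m)
		pthread_mutex_unlock(*m);
}

/* Hypothetical userspace stand-in for the kernel's guard(mutex)(&lock). */
#define MUTEX_GUARD(lock)						\
	pthread_mutex_t *scope_guard					\
		__attribute__((cleanup(mutex_guard_unlock), unused)) =	\
		(pthread_mutex_lock(lock), (lock))

#define MAX_SUBSYSTEMS 8

static pthread_mutex_t list_mutex = PTHREAD_MUTEX_INITIALIZER;
static const char *registered[MAX_SUBSYSTEMS];
static int nr_registered;

/* Hypothetical analogue of the registration path quoted above. */
static int register_subsystem(const char *name)
{
	MUTEX_GUARD(&list_mutex);

	for (int i = 0; i < nr_registered; i++) {
		if (!strcmp(registered[i], name))
			return -EEXIST;	/* lock is dropped on return */
	}

	if (nr_registered == MAX_SUBSYSTEMS)
		return -ENOSPC;		/* lock is dropped here too */

	registered[nr_registered++] = name;
	return 0;
}
```

Every return path releases the mutex without an out_unlock label, which
is the property that makes missed unlocks harder to write.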

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [RFC v2 10/16] luo: luo_ioctl: add ioctl interface
  2025-05-15 18:23 ` [RFC v2 10/16] luo: luo_ioctl: add ioctl interface Pasha Tatashin
  2025-05-26  8:42   ` Mike Rapoport
@ 2025-05-28 20:29   ` David Matlack
  2025-06-08 16:32     ` Pasha Tatashin
  2025-06-05 16:15   ` Pratyush Yadav
  2025-06-24  9:50   ` Christian Brauner
  3 siblings, 1 reply; 102+ messages in thread
From: David Matlack @ 2025-05-28 20:29 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: pratyush, jasonmiu, graf, changyuanl, rppt, rientjes, corbet,
	rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl, masahiroy, akpm,
	tj, yoann.congal, mmaurer, roman.gushchin, chenridong, axboe,
	mark.rutland, jannh, vincent.guittot, hannes, dan.j.williams,
	david, joel.granados, rostedt, anna.schumaker, song, zhangguopeng,
	linux, linux-kernel, linux-doc, linux-mm, gregkh, tglx, mingo, bp,
	dave.hansen, x86, hpa, rafael, dakr, bartosz.golaszewski,
	cw00.choi, myungjoo.ham, yesanishhere, Jonathan.Cameron,
	quic_zijuhu, aleksander.lobakin, ira.weiny, andriy.shevchenko,
	leon, lukas, bhelgaas, wagi, djeffery, stuart.w.hayes, ptyadav

On Thu, May 15, 2025 at 11:23 AM Pasha Tatashin
<pasha.tatashin@soleen.com> wrote:
> +static int luo_open(struct inode *inodep, struct file *filep)
> +{
> +       if (!capable(CAP_SYS_ADMIN))
> +               return -EACCES;

It makes sense that LIVEUPDATE_IOCTL_EVENT* would require
CAP_SYS_ADMIN. But I think requiring it for LIVEUPDATE_IOCTL_FD* will
add a lot of complexity.

It would essentially require a central userspace process to mediate
all preserving/restoring of file descriptors across Live Update to
enforce security. If we need a central authority to enforce security,
I don't see why that authority can't just be the kernel or what the
industry gains by punting the problem to userspace. It seems like all
users of LUO are going to want the same security guarantees when it
comes to FDs: a FD preserved inside a given "security domain" should
not be accessible outside that domain.

One way to do this in the kernel would be to have the kernel hand out
Live Update security tokens (say, some large random number). Then
require userspace to pass in a security token when preserving an FD.
Userspace can then only restore or unpreserve an FD if it passes back
in the security token associated with the FD. Then it's just up to
each userspace process to remember their token across kexec, keep it
secret from other untrusted processes, and pass it back in when
recovering FDs.

All the kernel has to do is generate secure tokens, which I imagine
can't be that hard.
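
A rough userspace sketch of the token idea, under stated assumptions:
the table, the function names and the "id" used to name a preserved FD
are all hypothetical, and reading /dev/urandom stands in for whatever
random source the kernel would use. A real implementation would also
want a constant-time token comparison.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define MAX_PRESERVED 16

/* One preserved FD, owned by whoever holds the matching token. */
struct preserved_fd {
	uint64_t token;
	int id;
	bool in_use;
};

static struct preserved_fd table[MAX_PRESERVED];

/* Hand out a random 64-bit token (sketch: read from /dev/urandom). */
static int token_create(uint64_t *token)
{
	FILE *f = fopen("/dev/urandom", "rb");
	size_t n;

	if (!f)
		return -EIO;
	n = fread(token, sizeof(*token), 1, f);
	fclose(f);
	return n == 1 ? 0 : -EIO;
}

/* Record that @id is preserved on behalf of the holder of @token. */
static int fd_preserve(uint64_t token, int id)
{
	for (int i = 0; i < MAX_PRESERVED; i++) {
		if (!table[i].in_use) {
			table[i] = (struct preserved_fd){
				.token = token, .id = id, .in_use = true,
			};
			return 0;
		}
	}
	return -ENOSPC;
}

/* Restore @id only if the caller presents the token it was saved with. */
static int fd_restore(uint64_t token, int id)
{
	for (int i = 0; i < MAX_PRESERVED; i++) {
		if (!table[i].in_use || table[i].id != id)
			continue;
		if (table[i].token != token)
			return -EPERM;
		table[i].in_use = false;
		return 0;
	}
	return -ENOENT;
}
```

With this shape no central userspace mediator is needed: a process that
never learns the token gets -EPERM for an FD it does not own.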

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [RFC v2 04/16] luo: luo_core: Live Update Orchestrator
  2025-05-26  6:31   ` Mike Rapoport
@ 2025-05-30  5:00     ` Pasha Tatashin
  0 siblings, 0 replies; 102+ messages in thread
From: Pasha Tatashin @ 2025-05-30  5:00 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: pratyush, jasonmiu, graf, changyuanl, dmatlack, rientjes, corbet,
	rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl, masahiroy, akpm,
	tj, yoann.congal, mmaurer, roman.gushchin, chenridong, axboe,
	mark.rutland, jannh, vincent.guittot, hannes, dan.j.williams,
	david, joel.granados, rostedt, anna.schumaker, song, zhangguopeng,
	linux, linux-kernel, linux-doc, linux-mm, gregkh, tglx, mingo, bp,
	dave.hansen, x86, hpa, rafael, dakr, bartosz.golaszewski,
	cw00.choi, myungjoo.ham, yesanishhere, Jonathan.Cameron,
	quic_zijuhu, aleksander.lobakin, ira.weiny, andriy.shevchenko,
	leon, lukas, bhelgaas, wagi, djeffery, stuart.w.hayes, ptyadav

> > +config LIVEUPDATE
> > +     bool "Live Update Orchestrator"
> > +     depends on KEXEC_HANDOVER
> > +     help
> > +       Enable the Live Update Orchestrator. Live Update is a mechanism,
> > +       typically based on kexec, that allows the kernel to be updated
> > +       while keeping selected devices operational across the transition.
> > +       These devices are intended to be reclaimed by the new kernel and
> > +       re-attached to their original workload without requiring a device
> > +       reset.
> > +
> > +       This functionality depends on specific support within device drivers
> > +       and related kernel subsystems.
>
> This is not clear if the ability to reattach a device to the new kernel or
> the entire live update functionality depends on specific support with
> drivers.
>
> Probably better phrase it as
>
>           Ability to handover a device from old to new kernel depends ...

Updated

>
> > +
> > +       This feature is primarily used in cloud environments to quickly
> > +       update the kernel hypervisor with minimal disruption to the
> > +       running virtual machines.
>
> I wouldn't put it into Kconfig. If anything I'd make it
>
>           This feature primarily targets virtual machine hosts to quickly ...

Ok

> > + * The core of LUO is a state machine that tracks the progress of a live update,
> > + * along with a callback API that allows other kernel subsystems to participate
> > + * in the process. Example subsystems that can hook into LUO include: kvm,
> > + * iommu, interrupts, vfio, participating filesystems, and mm.
>
> Please spell out memory management.

Done.

>
> > + * LUO uses KHO to transfer memory state from the current Kernel to the next
>
> A link to KHO docs would have been nice, but I'm not sure kernel-doc can do
> that nicely.

Added a link. A simple path to the .rst file is apparently converted
to a link correctly by Sphinx.

>
> > + * Kernel.
>
> Why capital 'K'? :)

Fixed.

>
> > + * The LUO state machine ensures that operations are performed in the correct
> > + * sequence and provides a mechanism to track and recover from potential
> > + * failures, and select devices and subsystems that should participate in
> > + * live update sequence.
> > + */
> > +
> > +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
> > +
> > +#include <linux/err.h>
> > +#include <linux/kobject.h>
> > +#include <linux/liveupdate.h>
> > +#include <linux/rwsem.h>
> > +#include <linux/string.h>
> > +#include "luo_internal.h"
> > +
> > +static DECLARE_RWSEM(luo_state_rwsem);
> > +
> > +enum liveupdate_state luo_state;
>
> static?

Fixed

> Hmm, luo_state is initialized to 0 (NORMAL) which means we always start
> from NORMAL, although the second kernel is not in the normal state until
> the handover is complete. Maybe we need an initial "unknown" state until
> some of luo code starts running and would set an actual known state?

Added: LIVEUPDATE_STATE_UNDEFINED that exists only before LUO is
initialized during boot.

> > +const char *const luo_state_str[] = {
> > +     [LIVEUPDATE_STATE_NORMAL]       = "normal",
> > +     [LIVEUPDATE_STATE_PREPARED]     = "prepared",
> > +     [LIVEUPDATE_STATE_FROZEN]       = "frozen",
> > +     [LIVEUPDATE_STATE_UPDATED]      = "updated",
> > +};
> > +
> > +bool luo_enabled;
>
> static?

Fixed.

>
> > +static int __init early_liveupdate_param(char *buf)
> > +{
> > +     return kstrtobool(buf, &luo_enabled);
> > +}
> > +early_param("liveupdate", early_liveupdate_param);
> > +
> > +/* Return true if the current state is equal to the provided state */
> > +static inline bool is_current_luo_state(enum liveupdate_state expected_state)
> > +{
> > +     return READ_ONCE(luo_state) == expected_state;
> > +}
> > +
> > +static void __luo_set_state(enum liveupdate_state state)
> > +{
> > +     WRITE_ONCE(luo_state, state);
> > +}
> > +
> > +static inline void luo_set_state(enum liveupdate_state state)
> > +{
> > +     pr_info("Switched from [%s] to [%s] state\n",
> > +             LUO_STATE_STR, luo_state_str[state]);
>
> Maybe LUO_CURRENT_STATE_STR?

Done

> > +     __luo_set_state(state);
> > +}
> > +
> > +static int luo_do_freeze_calls(void)
> > +{
> > +     return 0;
> > +}
> > +
> > +static void luo_do_finish_calls(void)
> > +{
> > +}
> > +
> > +int luo_prepare(void)
> > +{
> > +     return 0;
> > +}
> > +
> > +/**
> > + * luo_freeze() - Initiate the final freeze notification phase for live update.
> > + *
> > + * Attempts to transition the live update orchestrator state from
> > + * %LIVEUPDATE_STATE_PREPARED to %LIVEUPDATE_STATE_FROZEN. This function is
> > + * typically called just before the actual reboot system call (e.g., kexec)
> > + * is invoked, either directly by the orchestration tool or potentially from
> > + * within the reboot syscall path itself.
> > + *
> > + * Based on the outcome of the notification process:
> > + * - If luo_do_freeze_calls() returns 0 (all callbacks succeeded), the state
> > + * is set to %LIVEUPDATE_STATE_FROZEN using luo_set_state(), indicating
> > + * readiness for the imminent kexec.
> > + * - If luo_do_freeze_calls() returns a negative error code (a callback
> > + * failed), the state is reverted to %LIVEUPDATE_STATE_NORMAL using
> > + * luo_set_state() to cancel the live update attempt.
>
> The kernel-doc comments are mostly for users of a function and describe how
> it should be used rather how it is implemented.

SGTM, cleaned-up.

> I don't think it's important to mention return values of
> luo_do_freeze_calls() here. The important things are whether registered
> subsystems succeeded to freeze or not and the state changes.
> I'd also mention that if a subsystem fails to freeze, everything is
> canceled.

Added

> > +/**
> > + * luo_finish - Finalize the live update process in the new kernel.
> > + *
> > + * This function is called  after a successful live update reboot into a new
> > + * kernel, once the new kernel is ready to transition to the normal operational
> > + * state. It signals the completion of the live update sequence to subsystems.
> > + *
> > + * It first attempts to acquire the write lock for the orchestrator state.
> > + *
> > + * Then, it checks if the system is in the ``LIVEUPDATE_STATE_UPDATED`` state.
> > + * If not, it logs a warning and returns ``-EINVAL``.
> > + *
> > + * If the state is correct, it triggers the ``LIVEUPDATE_FINISH`` notifier
>
> Here too, you describe what the function does rather how it should be used

Fixed

>
> > + * chain. Note that the return value of the notifier is intentionally ignored as
> > + * finish callbacks must not fail. Finally, the orchestrator state is
>
> And what should happen if there was an error in a finish callback?

Scream, warn, panic: we cannot allow a system to keep running past a
live update if some state was not properly passed from the previous
kernel to the current kernel. This may result in catastrophic memory
leaks.

> > +static int __init luo_startup(void)
> > +{
> > +     __luo_set_state(LIVEUPDATE_STATE_NORMAL);
> > +
> > +     return 0;
> > +}
> > +early_initcall(luo_startup);
>
> This means that the second kernel starts with luo_state == NORMAL, then
> at early_initcall transitions to NORMAL again and later is set to UPDATED,
> doesn't it?

In the next patch, this function transitions to UPDATED. So,
technically, we go from NORMAL to UPDATED. However, I added an
UNDEFINED state, so in this function we now go either from UNDEFINED to
UPDATED or from UNDEFINED to NORMAL.
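
Roughly, as a userspace sketch rather than the actual patch: the enum
ordering, the "undefined" string, the _sketch function names and the
booted_via_liveupdate flag are assumptions for illustration. It also
shows the current state string coming from a function call instead of a
macro.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* UNDEFINED is 0, so a zero-initialized state is never mistaken for NORMAL. */
enum liveupdate_state {
	LIVEUPDATE_STATE_UNDEFINED = 0,
	LIVEUPDATE_STATE_NORMAL,
	LIVEUPDATE_STATE_PREPARED,
	LIVEUPDATE_STATE_FROZEN,
	LIVEUPDATE_STATE_UPDATED,
};

static enum liveupdate_state luo_state;	/* starts as UNDEFINED */

static const char *const luo_state_str[] = {
	[LIVEUPDATE_STATE_UNDEFINED]	= "undefined",
	[LIVEUPDATE_STATE_NORMAL]	= "normal",
	[LIVEUPDATE_STATE_PREPARED]	= "prepared",
	[LIVEUPDATE_STATE_FROZEN]	= "frozen",
	[LIVEUPDATE_STATE_UPDATED]	= "updated",
};

/* Accessor instead of a macro, so luo_state itself can stay static. */
static const char *luo_current_state_str(void)
{
	return luo_state_str[luo_state];
}

/*
 * Boot-time transition: leave UNDEFINED exactly once, going either to
 * UPDATED (we arrived via live update) or to NORMAL (regular boot).
 */
static void luo_startup_sketch(bool booted_via_liveupdate)
{
	luo_state = booted_via_liveupdate ? LIVEUPDATE_STATE_UPDATED
					  : LIVEUPDATE_STATE_NORMAL;
}
```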


> > + * @return true if the system is in the ``LIVEUPDATE_STATE_NORMAL`` state,
> > + * false otherwise.
> > + */
> > +bool liveupdate_state_normal(void)
> > +{
> > +     return is_current_luo_state(LIVEUPDATE_STATE_NORMAL);
> > +}
> > +EXPORT_SYMBOL_GPL(liveupdate_state_normal);
>
> Won't liveupdate_get_state() do?

Yeah, we can simply return the state and let the caller compare.
However, I think the caller is only interested in whether this is the
normal state or a live update is in progress. I will keep them, and I
have also added liveupdate_get_state().

> > +
> > +/**
> > + * liveupdate_enabled - Check if the live update feature is enabled.
> > + *
> > + * This function returns the state of the live update feature flag, which
> > + * can be controlled via the ``liveupdate`` kernel command-line parameter.
> > + *
> > + * @return true if live update is enabled, false otherwise.
> > + */
> > +bool liveupdate_enabled(void)
> > +{
> > +     return luo_enabled;
> > +}
> > +EXPORT_SYMBOL_GPL(liveupdate_enabled);
> > diff --git a/drivers/misc/liveupdate/luo_internal.h b/drivers/misc/liveupdate/luo_internal.h
> > new file mode 100644
> > index 000000000000..34e73fb0318c
> > --- /dev/null
> > +++ b/drivers/misc/liveupdate/luo_internal.h
> > @@ -0,0 +1,26 @@
> > +/* SPDX-License-Identifier: GPL-2.0 */
> > +
> > +/*
> > + * Copyright (c) 2025, Google LLC.
> > + * Pasha Tatashin <pasha.tatashin@soleen.com>
> > + */
> > +
> > +#ifndef _LINUX_LUO_INTERNAL_H
> > +#define _LINUX_LUO_INTERNAL_H
> > +
> > +int luo_cancel(void);
> > +int luo_prepare(void);
> > +int luo_freeze(void);
> > +int luo_finish(void);
> > +
> > +void luo_state_read_enter(void);
> > +void luo_state_read_exit(void);
> > +
> > +extern const char *const luo_state_str[];
> > +
> > +/* Get the current state as a string */
> > +#define LUO_STATE_STR luo_state_str[READ_ONCE(luo_state)]
>
> IIUC you need the macro to have LUO_STATE_STR available in all files in
> liveupdate/ but without exposing luo_state.
>
> I think that we can do a function call to get that string, will make things
> nicer IMHO.

Done.

>
> > +
> > +extern enum liveupdate_state luo_state;
> > +
> > +#endif /* _LINUX_LUO_INTERNAL_H */
> > diff --git a/include/linux/liveupdate.h b/include/linux/liveupdate.h
> > new file mode 100644
> > index 000000000000..c2740da70958
> > --- /dev/null
> > +++ b/include/linux/liveupdate.h
> > @@ -0,0 +1,131 @@
> > +/* SPDX-License-Identifier: GPL-2.0 */
> > +
> > +/*
> > + * Copyright (c) 2025, Google LLC.
> > + * Pasha Tatashin <pasha.tatashin@soleen.com>
> > + */
> > +#ifndef _LINUX_LIVEUPDATE_H
> > +#define _LINUX_LIVEUPDATE_H
> > +
> > +#include <linux/bug.h>
> > +#include <linux/types.h>
> > +#include <linux/list.h>
> > +
> > +/**
> > + * enum liveupdate_event - Events that trigger live update callbacks.
> > + * @LIVEUPDATE_PREPARE: PREPARE should happens *before* the blackout window.
>
> should happen or happens ;-)

Done

>
> > + *                      Subsystems should prepare for an upcoming reboot by
> > + *                      serializing their states. However, it must be considered
>
> It's not only about state serialization, it's also about adjusting
> operational mode so that state that was serialized won't be changed or at
> least the changes from PREPARE to FREEZE would be accounted somehow.

By serialization I mean saving their state, but I agree: the devices
and resources should also be in a limited state in which the serialized
data cannot be altered between prepare and freeze (i.e. no memfd
resizing, no new DMA mappings, etc.).

Thank you for your comments.
Pasha

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [RFC v2 03/16] kho: add kho_unpreserve_folio/phys
  2025-05-15 18:23 ` [RFC v2 03/16] kho: add kho_unpreserve_folio/phys Pasha Tatashin
@ 2025-06-04 15:00   ` Pratyush Yadav
  2025-06-06 16:22     ` Pasha Tatashin
  0 siblings, 1 reply; 102+ messages in thread
From: Pratyush Yadav @ 2025-06-04 15:00 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: pratyush, jasonmiu, graf, changyuanl, rppt, dmatlack, rientjes,
	corbet, rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl,
	masahiroy, akpm, tj, yoann.congal, mmaurer, roman.gushchin,
	chenridong, axboe, mark.rutland, jannh, vincent.guittot, hannes,
	dan.j.williams, david, joel.granados, rostedt, anna.schumaker,
	song, zhangguopeng, linux, linux-kernel, linux-doc, linux-mm,
	gregkh, tglx, mingo, bp, dave.hansen, x86, hpa, rafael, dakr,
	bartosz.golaszewski, cw00.choi, myungjoo.ham, yesanishhere,
	Jonathan.Cameron, quic_zijuhu, aleksander.lobakin, ira.weiny,
	andriy.shevchenko, leon, lukas, bhelgaas, wagi, djeffery,
	stuart.w.hayes

On Thu, May 15 2025, Pasha Tatashin wrote:

> From: Changyuan Lyu <changyuanl@google.com>
>
> Allow users of KHO to cancel the previous preservation by adding the
> necessary interfaces to unpreserve folio.
>
> Signed-off-by: Changyuan Lyu <changyuanl@google.com>
> Co-developed-by: Pasha Tatashin <pasha.tatashin@soleen.com>
> Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
> ---
>  include/linux/kexec_handover.h | 12 +++++
>  kernel/kexec_handover.c        | 84 ++++++++++++++++++++++++++++------
>  2 files changed, 83 insertions(+), 13 deletions(-)
>
[...]
> diff --git a/kernel/kexec_handover.c b/kernel/kexec_handover.c
> index 8ff561e36a87..eb305e7e6129 100644
> --- a/kernel/kexec_handover.c
> +++ b/kernel/kexec_handover.c
> @@ -101,26 +101,33 @@ static void *xa_load_or_alloc(struct xarray *xa, unsigned long index, size_t sz)
>  	return elm;
>  }
>  
> -static void __kho_unpreserve(struct kho_mem_track *track, unsigned long pfn,
> -			     unsigned long end_pfn)
> +static void __kho_unpreserve_order(struct kho_mem_track *track, unsigned long pfn,
> +				   unsigned int order)
>  {
>  	struct kho_mem_phys_bits *bits;
>  	struct kho_mem_phys *physxa;
> +	const unsigned long pfn_high = pfn >> order;
>  
> -	while (pfn < end_pfn) {
> -		const unsigned int order =
> -			min(count_trailing_zeros(pfn), ilog2(end_pfn - pfn));
> -		const unsigned long pfn_high = pfn >> order;
> +	physxa = xa_load(&track->orders, order);
> +	if (!physxa)
> +		return;
>  
> -		physxa = xa_load(&track->orders, order);
> -		if (!physxa)
> -			continue;
> +	bits = xa_load(&physxa->phys_bits, pfn_high / PRESERVE_BITS);
> +	if (!bits)
> +		return;
>  
> -		bits = xa_load(&physxa->phys_bits, pfn_high / PRESERVE_BITS);
> -		if (!bits)
> -			continue;
> +	clear_bit(pfn_high % PRESERVE_BITS, bits->preserve);
> +}
>  
> -		clear_bit(pfn_high % PRESERVE_BITS, bits->preserve);
> +static void __kho_unpreserve(struct kho_mem_track *track, unsigned long pfn,
> +			     unsigned long end_pfn)
> +{
> +	unsigned int order;
> +
> +	while (pfn < end_pfn) {
> +		order = min(count_trailing_zeros(pfn), ilog2(end_pfn - pfn));

This is fragile. If the preserve call spans, say, 4 PFNs, it gets
preserved as an order-2 allocation, but if the PFNs are unpreserved
one by one, __kho_unpreserve_order() will unpreserve from the order-0
xarray, which ends up doing nothing and leaks those pages.

It should either look through all orders to find the PFN, or at least
have a requirement in the API that the same phys and size combination as
the preserve call must be given to unpreserve.
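
To make the mismatch concrete, here is a small userspace model, not the
KHO code. Plain bool arrays stand in for the per-order xarrays and
bitmaps, the names and limits are made up, and only the order selection
mirrors the min(count_trailing_zeros(pfn), ilog2(end_pfn - pfn)) logic
from the patch.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_ORDER_SKETCH 4
#define MAX_PFNS 64

/* preserved[order][pfn >> order] models one bit in that order's bitmap. */
static bool preserved[MAX_ORDER_SKETCH + 1][MAX_PFNS];

static unsigned int ilog2_ul(unsigned long v)
{
	unsigned int r = 0;

	while (v >>= 1)
		r++;
	return r;
}

/* Same order selection as the quoted loop: limited by alignment and size. */
static unsigned int pick_order(unsigned long pfn, unsigned long end_pfn)
{
	unsigned int span = ilog2_ul(end_pfn - pfn);
	unsigned int align = pfn ? (unsigned int)__builtin_ctzl(pfn) : span;
	unsigned int order = align < span ? align : span;

	return order > MAX_ORDER_SKETCH ? MAX_ORDER_SKETCH : order;
}

/* Preserve (value == true) or unpreserve (value == false) a PFN range. */
static void set_range(unsigned long pfn, unsigned long end_pfn, bool value)
{
	while (pfn < end_pfn) {
		unsigned int order = pick_order(pfn, end_pfn);

		preserved[order][pfn >> order] = value;
		pfn += 1UL << order;
	}
}

/* Is @pfn covered by a preserved block of any order? */
static bool is_preserved(unsigned long pfn)
{
	for (unsigned int order = 0; order <= MAX_ORDER_SKETCH; order++) {
		if (preserved[order][pfn >> order])
			return true;
	}
	return false;
}
```

In this model, set_range(8, 12, true) sets a single order-2 bit.
Unpreserving PFNs 8..11 one at a time only touches order-0 bits, so
is_preserved(9) still returns true afterwards; only
set_range(8, 12, false) clears it.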

> +
> +		__kho_unpreserve_order(track, pfn, order);
>  
>  		pfn += 1 << order;
>  	}
> @@ -607,6 +614,29 @@ int kho_preserve_folio(struct folio *folio)
>  }
>  EXPORT_SYMBOL_GPL(kho_preserve_folio);
>  
> +/**
> + * kho_unpreserve_folio - unpreserve a folio.
> + * @folio: folio to unpreserve.
> + *
> + * Instructs KHO to unpreserve a folio that was preserved by
> + * kho_preserve_folio() before.
> + *
> + * Return: 0 on success, error code on failure
> + */
> +int kho_unpreserve_folio(struct folio *folio)
> +{
> +	const unsigned long pfn = folio_pfn(folio);
> +	const unsigned int order = folio_order(folio);
> +	struct kho_mem_track *track = &kho_out.ser.track;
> +
> +	if (kho_out.finalized)
> +		return -EBUSY;
> +
> +	__kho_unpreserve_order(track, pfn, order);
> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(kho_unpreserve_folio);
> +
>  /**
>   * kho_preserve_phys - preserve a physically contiguous range across kexec.
>   * @phys: physical address of the range.
> @@ -652,6 +682,34 @@ int kho_preserve_phys(phys_addr_t phys, size_t size)
>  }
>  EXPORT_SYMBOL_GPL(kho_preserve_phys);
>  
> +/**
> + * kho_unpreserve_phys - unpreserve a physically contiguous range across kexec.
> + * @phys: physical address of the range.
> + * @size: size of the range.
> + *
> + * Instructs KHO to unpreserve the memory range from @phys to @phys + @size
> + * across kexec.
> + *
> + * Return: 0 on success, error code on failure
> + */
> +int kho_unpreserve_phys(phys_addr_t phys, size_t size)
> +{
> +	struct kho_mem_track *track = &kho_out.ser.track;
> +	unsigned long pfn = PHYS_PFN(phys);
> +	unsigned long end_pfn = PHYS_PFN(phys + size);
> +
> +	if (kho_out.finalized)
> +		return -EBUSY;
> +
> +	if (!PAGE_ALIGNED(phys) || !PAGE_ALIGNED(size))
> +		return -EINVAL;
> +
> +	__kho_unpreserve(track, pfn, end_pfn);
> +
> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(kho_unpreserve_phys);
> +
>  int __kho_abort(void)
>  {
>  	int err;

-- 
Regards,
Pratyush Yadav

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [RFC v2 04/16] luo: luo_core: Live Update Orchestrator
  2025-05-15 18:23 ` [RFC v2 04/16] luo: luo_core: Live Update Orchestrator Pasha Tatashin
  2025-05-26  6:31   ` Mike Rapoport
@ 2025-06-04 15:17   ` Pratyush Yadav
  2025-06-07 17:11     ` Pasha Tatashin
  1 sibling, 1 reply; 102+ messages in thread
From: Pratyush Yadav @ 2025-06-04 15:17 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: pratyush, jasonmiu, graf, changyuanl, rppt, dmatlack, rientjes,
	corbet, rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl,
	masahiroy, akpm, tj, yoann.congal, mmaurer, roman.gushchin,
	chenridong, axboe, mark.rutland, jannh, vincent.guittot, hannes,
	dan.j.williams, david, joel.granados, rostedt, anna.schumaker,
	song, zhangguopeng, linux, linux-kernel, linux-doc, linux-mm,
	gregkh, tglx, mingo, bp, dave.hansen, x86, hpa, rafael, dakr,
	bartosz.golaszewski, cw00.choi, myungjoo.ham, yesanishhere,
	Jonathan.Cameron, quic_zijuhu, aleksander.lobakin, ira.weiny,
	andriy.shevchenko, leon, lukas, bhelgaas, wagi, djeffery,
	stuart.w.hayes

On Thu, May 15 2025, Pasha Tatashin wrote:

> Introduce LUO, a mechanism intended to facilitate kernel updates while
> keeping designated devices operational across the transition (e.g., via
> kexec). The primary use case is updating hypervisors with minimal
> disruption to running virtual machines. For userspace side of hypervisor
> update we have copyless migration. LUO is for updating the kernel.
>
> This initial patch lays the groundwork for the LUO subsystem.
>
> Further functionality, including the implementation of state transition
> logic, integration with KHO, and hooks for subsystems and file
> descriptors, will be added in subsequent patches.
>
> Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
> ---
[...]
> +/**
> + * luo_freeze() - Initiate the final freeze notification phase for live update.
> + *
> + * Attempts to transition the live update orchestrator state from
> + * %LIVEUPDATE_STATE_PREPARED to %LIVEUPDATE_STATE_FROZEN. This function is
> + * typically called just before the actual reboot system call (e.g., kexec)
> + * is invoked, either directly by the orchestration tool or potentially from
> + * within the reboot syscall path itself.
> + *
> + * Based on the outcome of the notification process:
> + * - If luo_do_freeze_calls() returns 0 (all callbacks succeeded), the state
> + * is set to %LIVEUPDATE_STATE_FROZEN using luo_set_state(), indicating
> + * readiness for the imminent kexec.
> + * - If luo_do_freeze_calls() returns a negative error code (a callback
> + * failed), the state is reverted to %LIVEUPDATE_STATE_NORMAL using
> + * luo_set_state() to cancel the live update attempt.

Would we end up with a more robust serialization in subsystems or
filesystems if we do not allow freeze to fail? Then they would be forced
to ensure they have everything in order by the time the system goes into
prepared state, and only need to make small adjustments in the freeze
callback.

> + *
> + * @return  0: Success. Negative error otherwise. State is reverted to
> + * %LIVEUPDATE_STATE_NORMAL in case of an error during callbacks.
> + */
> +int luo_freeze(void)
> +{
> +	int ret;
> +
> +	if (down_write_killable(&luo_state_rwsem)) {
> +		pr_warn("[freeze] event canceled by user\n");
> +		return -EAGAIN;
> +	}
> +
> +	if (!is_current_luo_state(LIVEUPDATE_STATE_PREPARED)) {
> +		pr_warn("Can't switch to [%s] from [%s] state\n",
> +			luo_state_str[LIVEUPDATE_STATE_FROZEN],
> +			LUO_STATE_STR);
> +		up_write(&luo_state_rwsem);
> +
> +		return -EINVAL;
> +	}
> +
> +	ret = luo_do_freeze_calls();
> +	if (!ret)
> +		luo_set_state(LIVEUPDATE_STATE_FROZEN);
> +	else
> +		luo_set_state(LIVEUPDATE_STATE_NORMAL);
> +
> +	up_write(&luo_state_rwsem);
> +
> +	return ret;
> +}
[...]

-- 
Regards,
Pratyush Yadav

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [RFC v2 05/16] luo: luo_core: integrate with KHO
  2025-05-15 18:23 ` [RFC v2 05/16] luo: luo_core: integrate with KHO Pasha Tatashin
  2025-05-26  7:18   ` Mike Rapoport
@ 2025-06-04 16:00   ` Pratyush Yadav
  2025-06-07 23:30     ` Pasha Tatashin
  1 sibling, 1 reply; 102+ messages in thread
From: Pratyush Yadav @ 2025-06-04 16:00 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: pratyush, jasonmiu, graf, changyuanl, rppt, dmatlack, rientjes,
	corbet, rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl,
	masahiroy, akpm, tj, yoann.congal, mmaurer, roman.gushchin,
	chenridong, axboe, mark.rutland, jannh, vincent.guittot, hannes,
	dan.j.williams, david, joel.granados, rostedt, anna.schumaker,
	song, zhangguopeng, linux, linux-kernel, linux-doc, linux-mm,
	gregkh, tglx, mingo, bp, dave.hansen, x86, hpa, rafael, dakr,
	bartosz.golaszewski, cw00.choi, myungjoo.ham, yesanishhere,
	Jonathan.Cameron, quic_zijuhu, aleksander.lobakin, ira.weiny,
	andriy.shevchenko, leon, lukas, bhelgaas, wagi, djeffery,
	stuart.w.hayes

On Thu, May 15 2025, Pasha Tatashin wrote:

> Integrate the LUO with the KHO framework to enable passing LUO state
> across a kexec reboot.
>
> This patch introduces the following changes:
> - During the KHO finalization phase allocate FDT blob.
> - Populate this FDT with a LUO compatibility string ("luo-v1") and the
>   current LUO state (`luo_state`).
> - Implement a KHO notifier
>
> LUO now depends on `CONFIG_KEXEC_HANDOVER`. The core state transition
> logic (`luo_do_*_calls`) remains unimplemented in this patch.
>
> Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
> ---
>  drivers/misc/liveupdate/luo_core.c | 222 ++++++++++++++++++++++++++++-
>  1 file changed, 219 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/misc/liveupdate/luo_core.c b/drivers/misc/liveupdate/luo_core.c
> index 919c37b0b4d1..a76e886bc3b1 100644
> --- a/drivers/misc/liveupdate/luo_core.c
> +++ b/drivers/misc/liveupdate/luo_core.c
> @@ -36,9 +36,12 @@
>  #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
>  
>  #include <linux/err.h>
> +#include <linux/kexec_handover.h>
>  #include <linux/kobject.h>
> +#include <linux/libfdt.h>
>  #include <linux/liveupdate.h>
>  #include <linux/rwsem.h>
> +#include <linux/sizes.h>
>  #include <linux/string.h>
>  #include "luo_internal.h"
>  
> @@ -55,6 +58,12 @@ const char *const luo_state_str[] = {
>  
>  bool luo_enabled;
>  
> +static void *luo_fdt_out;
> +static void *luo_fdt_in;
> +#define LUO_FDT_SIZE		SZ_1M
> +#define LUO_KHO_ENTRY_NAME	"LUO"
> +#define LUO_COMPATIBLE		"luo-v1"
> +
>  static int __init early_liveupdate_param(char *buf)
>  {
>  	return kstrtobool(buf, &luo_enabled);
> @@ -79,6 +88,60 @@ static inline void luo_set_state(enum liveupdate_state state)
>  	__luo_set_state(state);
>  }
>  
> +/* Called during the prepare phase, to create LUO fdt tree */
> +static int luo_fdt_setup(struct kho_serialization *ser)
> +{
> +	void *fdt_out;
> +	int ret;
> +
> +	fdt_out = (void *)__get_free_pages(GFP_KERNEL | __GFP_ZERO,
> +					   get_order(LUO_FDT_SIZE));

Why not alloc_folio()? KHO already deals with folios so it seems
simpler. The kho_{un,}preserve_folio() functions exist exactly for these
kinds of allocations, so LUO also ends up being a first user. You also
won't end up needing kho_unpreserve_phys() and all the __pa() calls.

> +	if (!fdt_out) {
> +		pr_err("failed to allocate FDT memory\n");
> +		return -ENOMEM;
> +	}
> +
> +	ret = fdt_create_empty_tree(fdt_out, LUO_FDT_SIZE);

You are using FDT read/write functions throughout the series to create
new FDTs. The sequential write functions are generally more efficient
since they are meant for creating new FDT blobs. The read/write
functions are better for modifying an existing FDT blob.

Is there a particular reason you do this?

When using FDT SW functions, the creation of the tree would be something
like:

        fdt_create()
        fdt_finish_reservemap()
        fdt_begin_node()

        // Add stuff to FDT

        fdt_end_node()
        fdt_finish()

In this patch, the FDT does not change much after creation so it doesn't
look like it matters much, but in later patches, the FDT is passed to
luo_files_fdt_setup() and luo_subsystems_fdt_setup() which probably
modify the FDT a fair bit.

> +	if (ret)
> +		goto exit_free;
> +
> +	ret = fdt_setprop(fdt_out, 0, "compatible", LUO_COMPATIBLE,
> +			  strlen(LUO_COMPATIBLE) + 1);

fdt_setprop_string() instead? Or if you change to FDT SW,
fdt_property_string().

> +	if (ret)
> +		goto exit_free;
> +
> +	ret = kho_preserve_phys(__pa(fdt_out), LUO_FDT_SIZE);
> +	if (ret)
> +		goto exit_free;
> +
> +	ret = kho_add_subtree(ser, LUO_KHO_ENTRY_NAME, fdt_out);
> +	if (ret)
> +		goto exit_unpreserve;
> +	luo_fdt_out = fdt_out;
> +
> +	return 0;
> +
> +exit_unpreserve:
> +	kho_unpreserve_phys(__pa(fdt_out), LUO_FDT_SIZE);
> +exit_free:
> +	free_pages((unsigned long)fdt_out, get_order(LUO_FDT_SIZE));
> +	pr_err("failed to prepare LUO FDT: %d\n", ret);
> +
> +	return ret;
> +}
> +
> +static void luo_fdt_destroy(void)
> +{
> +	kho_unpreserve_phys(__pa(luo_fdt_out), LUO_FDT_SIZE);
> +	free_pages((unsigned long)luo_fdt_out, get_order(LUO_FDT_SIZE));
> +	luo_fdt_out = NULL;
> +}
> +
> +static int luo_do_prepare_calls(void)
> +{
> +	return 0;
> +}
> +
>  static int luo_do_freeze_calls(void)
>  {
>  	return 0;
> @@ -88,11 +151,111 @@ static void luo_do_finish_calls(void)
>  {
>  }
>  
> -int luo_prepare(void)
> +static void luo_do_cancel_calls(void)
> +{
> +}
> +
> +static int __luo_prepare(struct kho_serialization *ser)
>  {
> +	int ret;
> +
> +	if (down_write_killable(&luo_state_rwsem)) {
> +		pr_warn("[prepare] event canceled by user\n");
> +		return -EAGAIN;
> +	}
> +
> +	if (!is_current_luo_state(LIVEUPDATE_STATE_NORMAL)) {
> +		pr_warn("Can't switch to [%s] from [%s] state\n",
> +			luo_state_str[LIVEUPDATE_STATE_PREPARED],
> +			LUO_STATE_STR);
> +		ret = -EINVAL;
> +		goto exit_unlock;
> +	}
> +
> +	ret = luo_fdt_setup(ser);
> +	if (ret)
> +		goto exit_unlock;
> +
> +	ret = luo_do_prepare_calls();
> +	if (ret)
> +		goto exit_unlock;

With subsystems/filesystems support in place, this can fail. But since
luo_fdt_setup() called kho_add_subtree(), the debugfs file stays around,
and later calls to __luo_prepare() fail because the next
kho_add_subtree() tries to create a debugfs file that already exists. So
you would see an error like below:

    [  767.339920] debugfs: File 'LUO' in directory 'sub_fdts' already present!
    [  767.340613] luo_core: failed to prepare LUO FDT: -17
    [  767.341071] KHO: Failed to convert KHO state tree: -17
    [  767.341593] luo_core: Can't switch to [normal] from [normal] state
    [  767.342175] KHO: Failed to abort KHO finalization: -22

You probably need a kho_remove_subtree() that can be called from the
error paths.
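
To illustrate the unwind I have in mind, here is a small userspace model,
not kernel code. registry_add() stands in for kho_add_subtree(),
registry_remove() stands in for the kho_remove_subtree() proposed above
(which does not exist today), and prepare() plays the role of
__luo_prepare(). All of these names are made up for the sketch.

```c
#include <errno.h>
#include <stdbool.h>
#include <string.h>

/* Single-slot stand-in for KHO's sub-FDT registry (and its debugfs file). */
static char registered[32];

/* Models kho_add_subtree(): refuses a name that is already present. */
static int registry_add(const char *name)
{
	if (!strcmp(registered, name))
		return -EEXIST;
	strncpy(registered, name, sizeof(registered) - 1);
	return 0;
}

/* Models the proposed kho_remove_subtree(); this helper is hypothetical. */
static void registry_remove(const char *name)
{
	if (!strcmp(registered, name))
		registered[0] = '\0';
}

/* Models __luo_prepare(); calls_fail simulates a failing prepare callback. */
int prepare(bool calls_fail)
{
	int ret;

	ret = registry_add("LUO");
	if (ret)
		return ret;

	ret = calls_fail ? -EIO : 0;
	if (ret)
		goto exit_remove;

	return 0;

exit_remove:
	/* Without this, every later prepare() fails with -EEXIST (-17). */
	registry_remove("LUO");
	return ret;
}
```

With the exit_remove path in place, a failed prepare() followed by a
second attempt succeeds instead of tripping over the stale entry.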

Note that __luo_cancel() is called because failure in a KHO finalize
notifier calls the abort notifiers.

This is also something to fix, since if prepare fails, all other KHO
users who are already serialized won't even get to abort.

This weirdness happens because luo_prepare() and luo_cancel() control
the KHO state machine, but then also get controlled by it via the
notifier callbacks. So the relationship between them is not clear.
__luo_prepare() at least needs access to struct kho_serialization, so it
needs to come from the callback. So I don't have a clear way to clean
this all up off the top of my head.

I suppose one way to fix this would be to move the state check to
luo_cancel() instead. It would probably fix this problem but won't
actually do anything about the murky hierarchy between KHO and LUO.

> +
> +	luo_set_state(LIVEUPDATE_STATE_PREPARED);
> +
> +exit_unlock:
> +	up_write(&luo_state_rwsem);
> +
> +	return ret;
> +}
> +
> +static int __luo_cancel(void)
> +{
> +	if (down_write_killable(&luo_state_rwsem)) {
> +		pr_warn("[cancel] event canceled by user\n");
> +		return -EAGAIN;
> +	}
> +
> +	if (!is_current_luo_state(LIVEUPDATE_STATE_PREPARED) &&
> +	    !is_current_luo_state(LIVEUPDATE_STATE_FROZEN)) {
> +		pr_warn("Can't switch to [%s] from [%s] state\n",
> +			luo_state_str[LIVEUPDATE_STATE_NORMAL],
> +			LUO_STATE_STR);
> +		up_write(&luo_state_rwsem);
> +
> +		return -EINVAL;
> +	}
> +
> +	luo_do_cancel_calls();
> +	luo_fdt_destroy();
> +	luo_set_state(LIVEUPDATE_STATE_NORMAL);
> +
> +	up_write(&luo_state_rwsem);
> +
>  	return 0;
>  }
>  
> +static int luo_kho_notifier(struct notifier_block *self,
> +			    unsigned long cmd, void *v)
> +{
> +	int ret;
> +
> +	switch (cmd) {
> +	case KEXEC_KHO_FINALIZE:
> +		ret = __luo_prepare((struct kho_serialization *)v);
> +		break;
> +	case KEXEC_KHO_ABORT:
> +		ret = __luo_cancel();
> +		break;
> +	default:
> +		return NOTIFY_BAD;
> +	}
> +
> +	return notifier_from_errno(ret);
> +}
> +
> +static struct notifier_block luo_kho_notifier_nb = {
> +	.notifier_call = luo_kho_notifier,
> +};
> +
> +/**
> + * luo_prepare - Initiate the live update preparation phase.
> + *
> + * This function is called to begin the live update process. It attempts to
> + * transition the luo to the ``LIVEUPDATE_STATE_PREPARED`` state.
> + *
> + * If the calls complete successfully, the orchestrator state is set
> + * to ``LIVEUPDATE_STATE_PREPARED``. If any  call fails a
> + * ``LIVEUPDATE_CANCEL`` is sent to roll back any actions.
> + *
> + * @return 0 on success, ``-EAGAIN`` if the state change was cancelled by the
> + * user while waiting for the lock, ``-EINVAL`` if the orchestrator is not in
> + * the normal state, or a negative error code returned by the calls.
> + */
> +int luo_prepare(void)
> +{
> +	return kho_finalize();
> +}
> +
>  /**
>   * luo_freeze() - Initiate the final freeze notification phase for live update.
>   *
> @@ -188,9 +351,23 @@ int luo_finish(void)
>  	return 0;
>  }
>  
> +/**
> + * luo_cancel - Cancel the ongoing live update from prepared or frozen states.
> + *
> + * This function is called to abort a live update that is currently in the
> + * ``LIVEUPDATE_STATE_PREPARED`` state.
> + *
> + * If the state is correct, it triggers the ``LIVEUPDATE_CANCEL`` notifier chain
> + * to allow subsystems to undo any actions performed during the prepare or
> + * freeze events. Finally, the orchestrator state is transitioned back to
> + * ``LIVEUPDATE_STATE_NORMAL``.
> + *
> + * @return 0 on success, or ``-EAGAIN`` if the state change was cancelled by the
> + * user while waiting for the lock.
> + */
>  int luo_cancel(void)
>  {
> -	return 0;
> +	return kho_abort();
>  }
>  
>  void luo_state_read_enter(void)
> @@ -205,7 +382,46 @@ void luo_state_read_exit(void)
>  
>  static int __init luo_startup(void)
>  {
> -	__luo_set_state(LIVEUPDATE_STATE_NORMAL);
> +	phys_addr_t fdt_phys;
> +	int ret;
> +
> +	if (!kho_is_enabled()) {
> +		if (luo_enabled)
> +			pr_warn("Disabling liveupdate because KHO is disabled\n");
> +		luo_enabled = false;
> +		return 0;
> +	}
> +
> +	ret = register_kho_notifier(&luo_kho_notifier_nb);
> +	if (ret) {
> +		luo_enabled = false;

You set luo_enabled to false here, but none of the LUO entry points like
luo_prepare() or luo_freeze() actually check it. So LUO will appear to work
just fine even when it hasn't initialized properly.
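
Something along these lines is what I would expect at each entry point.
This is a standalone model rather than a patch: do_prepare() stands in
for the real work (kho_finalize() in this series),
luo_prepare_checked() is a made-up name, and -EOPNOTSUPP is only my
guess at a suitable error code.

```c
#include <errno.h>
#include <stdbool.h>

/* Stand-in for the luo_enabled flag that luo_startup() clears on failure. */
static bool luo_enabled;

/* Stand-in for the real work; the kernel version calls kho_finalize(). */
static int do_prepare(void)
{
	return 0;
}

/* Entry point that refuses to run when LUO did not initialize. */
int luo_prepare_checked(void)
{
	if (!luo_enabled)
		return -EOPNOTSUPP;

	return do_prepare();
}
```

The same check would apply to luo_freeze(), luo_finish() and
luo_cancel().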

> +		pr_warn("Failed to register with KHO [%d]\n", ret);

I guess you don't return here so a previous liveupdate can still be
recovered, even though we won't be able to make the next one. If so, a
comment would be nice to point this out.

> +	}
> +
> +	/*
> +	 * Retrieve LUO subtree, and verify its format.  Panic in case of
> +	 * exceptions, since machine devices and memory is in unpredictable
> +	 * state.
> +	 */
> +	ret = kho_retrieve_subtree(LUO_KHO_ENTRY_NAME, &fdt_phys);
> +	if (ret) {
> +		if (ret != -ENOENT) {
> +			panic("failed to retrieve FDT '%s' from KHO: %d\n",
> +			      LUO_KHO_ENTRY_NAME, ret);
> +		}
> +		__luo_set_state(LIVEUPDATE_STATE_NORMAL);
> +
> +		return 0;
> +	}
> +
> +	luo_fdt_in = __va(fdt_phys);
> +	ret = fdt_node_check_compatible(luo_fdt_in, 0, LUO_COMPATIBLE);
> +	if (ret) {
> +		panic("FDT '%s' is incompatible with '%s' [%d]\n",
> +		      LUO_KHO_ENTRY_NAME, LUO_COMPATIBLE, ret);
> +	}
> +
> +	__luo_set_state(LIVEUPDATE_STATE_UPDATED);
>  
>  	return 0;
>  }

-- 
Regards,
Pratyush Yadav


* Re: [RFC v2 01/16] kho: make debugfs interface optional
  2025-05-15 18:23 ` [RFC v2 01/16] kho: make debugfs interface optional Pasha Tatashin
@ 2025-06-04 16:03   ` Pratyush Yadav
  2025-06-06 16:12     ` Pasha Tatashin
  0 siblings, 1 reply; 102+ messages in thread
From: Pratyush Yadav @ 2025-06-04 16:03 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: pratyush, jasonmiu, graf, changyuanl, rppt, dmatlack, rientjes,
	corbet, rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl,
	masahiroy, akpm, tj, yoann.congal, mmaurer, roman.gushchin,
	chenridong, axboe, mark.rutland, jannh, vincent.guittot, hannes,
	dan.j.williams, david, joel.granados, rostedt, anna.schumaker,
	song, zhangguopeng, linux, linux-kernel, linux-doc, linux-mm,
	gregkh, tglx, mingo, bp, dave.hansen, x86, hpa, rafael, dakr,
	bartosz.golaszewski, cw00.choi, myungjoo.ham, yesanishhere,
	Jonathan.Cameron, quic_zijuhu, aleksander.lobakin, ira.weiny,
	andriy.shevchenko, leon, lukas, bhelgaas, wagi, djeffery,
	stuart.w.hayes

On Thu, May 15 2025, Pasha Tatashin wrote:

> Currently, KHO is controlled via debugfs interface, but once LUO is
> introduced, it can control KHO, and the debug interface becomes
> optional.
>
> Add a separate config CONFIG_KEXEC_HANDOVER_DEBUG that enables
> the debugfs interface, and allows to inspect the tree.
>
> Move all debufs related code to a new file to keep the .c files

Nit: s/debufs/debugfs/

I don't have any other feedback for this patch so a lot of bits wasted
for one typo fix ;-)

> clear of ifdefs.
>
> Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
[...]

-- 
Regards,
Pratyush Yadav


* Re: [RFC v2 06/16] luo: luo_subsystems: add subsystem registration
  2025-05-15 18:23 ` [RFC v2 06/16] luo: luo_subsystems: add subsystem registration Pasha Tatashin
  2025-05-26  7:31   ` Mike Rapoport
  2025-05-28 19:12   ` David Matlack
@ 2025-06-04 16:30   ` Pratyush Yadav
  2025-06-08  0:04     ` Pasha Tatashin
  2 siblings, 1 reply; 102+ messages in thread
From: Pratyush Yadav @ 2025-06-04 16:30 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: pratyush, jasonmiu, graf, changyuanl, rppt, dmatlack, rientjes,
	corbet, rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl,
	masahiroy, akpm, tj, yoann.congal, mmaurer, roman.gushchin,
	chenridong, axboe, mark.rutland, jannh, vincent.guittot, hannes,
	dan.j.williams, david, joel.granados, rostedt, anna.schumaker,
	song, zhangguopeng, linux, linux-kernel, linux-doc, linux-mm,
	gregkh, tglx, mingo, bp, dave.hansen, x86, hpa, rafael, dakr,
	bartosz.golaszewski, cw00.choi, myungjoo.ham, yesanishhere,
	Jonathan.Cameron, quic_zijuhu, aleksander.lobakin, ira.weiny,
	andriy.shevchenko, leon, lukas, bhelgaas, wagi, djeffery,
	stuart.w.hayes

On Thu, May 15 2025, Pasha Tatashin wrote:

> Introduce the framework for kernel subsystems (e.g., KVM, IOMMU, device
> drivers) to register with LUO and participate in the live update process
> via callbacks.
>
> Subsystem Registration:
> - Defines struct liveupdate_subsystem in linux/liveupdate.h,
>   which subsystems use to provide their name and optional callbacks
>   (prepare, freeze, cancel, finish). The callbacks accept
>   a u64 *data intended for passing state/handles.
> - Exports liveupdate_register_subsystem() and
>   liveupdate_unregister_subsystem() API functions.
> - Adds drivers/misc/liveupdate/luo_subsystems.c to manage a list
>   of registered subsystems.
>   Registration/unregistration is restricted to
>   specific LUO states (NORMAL/UPDATED).
>
> Callback Framework:
> - The main luo_core.c state transition functions
>   now delegate to new luo_do_subsystems_*_calls() functions
>   defined in luo_subsystems.c.
> - These new functions are intended to iterate through the registered
>   subsystems and invoke their corresponding callbacks.
>
> FDT Integration:
> - Adds a /subsystems subnode within the main LUO FDT created in
>   luo_core.c. This node has its own compatibility string
>   (subsystems-v1).
> - luo_subsystems_fdt_setup() populates this node by adding a
>   property for each registered subsystem, using the subsystem's
>   name.
>   Currently, these properties are initialized with a placeholder
>   u64 value (0).
> - luo_subsystems_startup() is called from luo_core.c on boot to
>   find and validate the /subsystems node in the FDT received via
>   KHO. It panics if the node is missing or incompatible.
> - Adds a stub API function liveupdate_get_subsystem_data() intended
>   for subsystems to retrieve their persisted u64 data from the FDT
>       in the new kernel.
>
> Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
[...]
> +/**
> + * liveupdate_unregister_subsystem - Unregister a kernel subsystem handler from
> + * LUO
> + * @h: Pointer to the same liveupdate_subsystem structure that was used during
> + * registration.
> + *
> + * Unregisters a previously registered subsystem handler. Typically called
> + * during module exit or subsystem teardown. LUO removes the structure from its
> + * internal list; the caller is responsible for any necessary memory cleanup
> + * of the structure itself.
> + *
> + * Return: 0 on success, negative error code otherwise.
> + * -EINVAL if h is NULL.
> + * -ENOENT if the specified handler @h is not found in the registration list.
> + * -EBUSY if LUO is not in the NORMAL state.
> + */
> +int liveupdate_unregister_subsystem(struct liveupdate_subsystem *h)
> +{
> +	struct liveupdate_subsystem *iter;
> +	bool found = false;
> +	int ret = 0;
> +
> +	luo_state_read_enter();
> +	if (!liveupdate_state_normal() && !liveupdate_state_updated()) {
> +		luo_state_read_exit();
> +		return -EBUSY;
> +	}
> +
> +	mutex_lock(&luo_subsystem_list_mutex);
> +	list_for_each_entry(iter, &luo_subsystems_list, list) {
> +		if (iter == h) {
> +			found = true;

Nit: you don't actually need the found variable. You can do the same
check that list_for_each_entry() uses, which is to call
list_entry_is_head().
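
A rough userspace illustration of the pattern, with a minimal
re-implementation of the list helpers so it compiles on its own (the
macros mirror the ones in <linux/list.h> but are simplified, and
__typeof__ is a GNU extension). struct subsys, register_subsys() and
unregister_subsys() are names made up for the sketch.

```c
#include <errno.h>
#include <stddef.h>

struct list_head { struct list_head *next, *prev; };

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))
#define list_entry(ptr, type, member) container_of(ptr, type, member)
#define list_entry_is_head(pos, head, member) (&(pos)->member == (head))
#define list_for_each_entry(pos, head, member)				\
	for (pos = list_entry((head)->next, __typeof__(*pos), member);	\
	     !list_entry_is_head(pos, head, member);			\
	     pos = list_entry((pos)->member.next, __typeof__(*pos), member))

static void list_add_tail(struct list_head *new, struct list_head *head)
{
	new->prev = head->prev;
	new->next = head;
	head->prev->next = new;
	head->prev = new;
}

static void list_del_init(struct list_head *entry)
{
	entry->prev->next = entry->next;
	entry->next->prev = entry->prev;
	entry->next = entry;
	entry->prev = entry;
}

struct subsys {
	const char *name;
	struct list_head list;
};

static struct list_head subsystems = { &subsystems, &subsystems };

void register_subsys(struct subsys *h)
{
	list_add_tail(&h->list, &subsystems);
}

int unregister_subsys(struct subsys *h)
{
	struct subsys *iter;

	list_for_each_entry(iter, &subsystems, list) {
		if (iter == h)
			break;
	}

	/* If the loop ran to completion, iter aliases the list head. */
	if (list_entry_is_head(iter, &subsystems, list))
		return -ENOENT;

	list_del_init(&h->list);
	return 0;
}
```

The loop cursor itself tells you whether the search hit, so the extra
bool goes away.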

> +			break;
> +		}
> +	}
> +
> +	if (found) {
> +		list_del_init(&h->list);
> +	} else {
> +		pr_warn("Subsystem handler '%s' not found for unregistration.\n",
> +			h->name);
> +		ret = -ENOENT;
> +	}
> +
> +	mutex_unlock(&luo_subsystem_list_mutex);
> +	luo_state_read_exit();
> +
> +	return ret;
> +}
> +EXPORT_SYMBOL_GPL(liveupdate_unregister_subsystem);
[...]

-- 
Regards,
Pratyush Yadav


* Re: [RFC v2 08/16] luo: luo_files: add infrastructure for FDs
  2025-05-26  7:55   ` Mike Rapoport
@ 2025-06-05 11:56     ` Pratyush Yadav
  2025-06-08 13:13     ` Pasha Tatashin
  1 sibling, 0 replies; 102+ messages in thread
From: Pratyush Yadav @ 2025-06-05 11:56 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Pasha Tatashin, pratyush, jasonmiu, graf, changyuanl, dmatlack,
	rientjes, corbet, rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl,
	masahiroy, akpm, tj, yoann.congal, mmaurer, roman.gushchin,
	chenridong, axboe, mark.rutland, jannh, vincent.guittot, hannes,
	dan.j.williams, david, joel.granados, rostedt, anna.schumaker,
	song, zhangguopeng, linux, linux-kernel, linux-doc, linux-mm,
	gregkh, tglx, mingo, bp, dave.hansen, x86, hpa, rafael, dakr,
	bartosz.golaszewski, cw00.choi, myungjoo.ham, yesanishhere,
	Jonathan.Cameron, quic_zijuhu, aleksander.lobakin, ira.weiny,
	andriy.shevchenko, leon, lukas, bhelgaas, wagi, djeffery,
	stuart.w.hayes

On Mon, May 26 2025, Mike Rapoport wrote:

> On Thu, May 15, 2025 at 06:23:12PM +0000, Pasha Tatashin wrote:
>> Introduce the framework within LUO to support preserving specific types
>> of file descriptors across a live update transition. This allows
>> stateful FDs (like memfds or vfio FDs used by VMs) to be recreated in
>> the new kernel.
>> 
>> Note: The core logic for iterating through the luo_files_list and
>> invoking the handler callbacks (prepare, freeze, cancel, finish)
>> within luo_do_files_*_calls, as well as managing the u64 data
>> persistence via the FDT for individual files, is currently implemented
>> as stubs in this patch. This patch sets up the registration, FDT layout,
>> and retrieval framework.
>> 
>> Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
>> ---
[...]
>> diff --git a/drivers/misc/liveupdate/luo_core.c b/drivers/misc/liveupdate/luo_core.c
>> index 417e7f6bf36c..ab1d76221fe2 100644
>> --- a/drivers/misc/liveupdate/luo_core.c
>> +++ b/drivers/misc/liveupdate/luo_core.c
>> @@ -110,6 +110,10 @@ static int luo_fdt_setup(struct kho_serialization *ser)
>>  	if (ret)
>>  		goto exit_free;
>>  
>> +	ret = luo_files_fdt_setup(fdt_out);
>> +	if (ret)
>> +		goto exit_free;
>> +
>>  	ret = luo_subsystems_fdt_setup(fdt_out);
>>  	if (ret)
>>  		goto exit_free;
>
> The duplication of files and subsystems does not look nice here and below.
> Can't we make files to be a subsystem?

+1

It would also give subsystems a user.

[...]
>> diff --git a/drivers/misc/liveupdate/luo_files.c b/drivers/misc/liveupdate/luo_files.c
>> new file mode 100644
>> index 000000000000..953fc40db3d7
>> --- /dev/null
>> +++ b/drivers/misc/liveupdate/luo_files.c
>> @@ -0,0 +1,563 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +
>> +/*
>> + * Copyright (c) 2025, Google LLC.
>> + * Pasha Tatashin <pasha.tatashin@soleen.com>
>> + */
>> +
>> +/**
>> + * DOC: LUO file descriptors
>> + *
>> + * LUO provides the infrastructure necessary to preserve
>> + * specific types of stateful file descriptors across a kernel live
>> + * update transition. The primary goal is to allow workloads, such as virtual
>> + * machines using vfio, memfd, or iommufd to retain access to their essential
>> + * resources without interruption after the underlying kernel is  updated.
>> + *
>> + * The framework operates based on handler registration and instance tracking:
>> + *
>> + * 1. Handler Registration: Kernel modules responsible for specific file
>> + * types (e.g., memfd, vfio) register a &struct liveupdate_filesystem
>> + * handler. This handler contains callbacks (&liveupdate_filesystem.prepare,
>> + * &liveupdate_filesystem.freeze, &liveupdate_filesystem.finish, etc.)
>> + * and a unique 'compatible' string identifying the file type.
>> + * Registration occurs via liveupdate_register_filesystem().
>
> I wouldn't use filesystem here, as the obvious users are not really
> filesystems. Maybe liveupdate_register_file_ops?

+1

[...]

-- 
Regards,
Pratyush Yadav


* Re: [RFC v2 08/16] luo: luo_files: add infrastructure for FDs
  2025-05-15 18:23 ` [RFC v2 08/16] luo: luo_files: add infrastructure for FDs Pasha Tatashin
  2025-05-15 23:15   ` James Houghton
  2025-05-26  7:55   ` Mike Rapoport
@ 2025-06-05 15:56   ` Pratyush Yadav
  2025-06-08 13:37     ` Pasha Tatashin
  2 siblings, 1 reply; 102+ messages in thread
From: Pratyush Yadav @ 2025-06-05 15:56 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: pratyush, jasonmiu, graf, changyuanl, rppt, dmatlack, rientjes,
	corbet, rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl,
	masahiroy, akpm, tj, yoann.congal, mmaurer, roman.gushchin,
	chenridong, axboe, mark.rutland, jannh, vincent.guittot, hannes,
	dan.j.williams, david, joel.granados, rostedt, anna.schumaker,
	song, zhangguopeng, linux, linux-kernel, linux-doc, linux-mm,
	gregkh, tglx, mingo, bp, dave.hansen, x86, hpa, rafael, dakr,
	bartosz.golaszewski, cw00.choi, myungjoo.ham, yesanishhere,
	Jonathan.Cameron, quic_zijuhu, aleksander.lobakin, ira.weiny,
	andriy.shevchenko, leon, lukas, bhelgaas, wagi, djeffery,
	stuart.w.hayes

On Thu, May 15 2025, Pasha Tatashin wrote:

> Introduce the framework within LUO to support preserving specific types
> of file descriptors across a live update transition. This allows
> stateful FDs (like memfds or vfio FDs used by VMs) to be recreated in
> the new kernel.
>
> Note: The core logic for iterating through the luo_files_list and
> invoking the handler callbacks (prepare, freeze, cancel, finish)
> within luo_do_files_*_calls, as well as managing the u64 data
> persistence via the FDT for individual files, is currently implemented
> as stubs in this patch. This patch sets up the registration, FDT layout,
> and retrieval framework.
>
> Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
> ---
>  drivers/misc/liveupdate/Makefile       |   1 +
>  drivers/misc/liveupdate/luo_core.c     |  19 +
>  drivers/misc/liveupdate/luo_files.c    | 563 +++++++++++++++++++++++++
>  drivers/misc/liveupdate/luo_internal.h |  11 +
>  include/linux/liveupdate.h             |  62 +++
>  5 files changed, 656 insertions(+)
>  create mode 100644 drivers/misc/liveupdate/luo_files.c
>
> diff --git a/drivers/misc/liveupdate/Makefile b/drivers/misc/liveupdate/Makefile
> index df1c9709ba4f..b4cdd162574f 100644
> --- a/drivers/misc/liveupdate/Makefile
> +++ b/drivers/misc/liveupdate/Makefile
> @@ -1,3 +1,4 @@
>  # SPDX-License-Identifier: GPL-2.0
>  obj-y					+= luo_core.o
> +obj-y					+= luo_files.o
>  obj-y					+= luo_subsystems.o
[...]
> diff --git a/drivers/misc/liveupdate/luo_files.c b/drivers/misc/liveupdate/luo_files.c
> new file mode 100644
> index 000000000000..953fc40db3d7
> --- /dev/null
> +++ b/drivers/misc/liveupdate/luo_files.c
> @@ -0,0 +1,563 @@
[...]
> +struct luo_file {
> +	struct liveupdate_filesystem *fs;
> +	struct file *file;
> +	u64 private_data;
> +	bool reclaimed;
> +	enum liveupdate_state state;
> +	struct mutex mutex;
> +};
> +
> +/**
> + * luo_files_startup - Validates the LUO file-descriptors FDT node at startup.
> + * @fdt: Pointer to the LUO FDT blob passed from the previous kernel.
> + *
> + * This __init function checks the existence and validity of the
> + * '/file-descriptors' node in the FDT. This node is considered mandatory. It

Why is it mandatory? Can't a user just preserve some subsystems, and no
FDs?

> + * calls panic() if the node is missing, inaccessible, or invalid (e.g., missing
> + * compatible, wrong compatible string), indicating a critical configuration
> + * error for LUO.
> + */
> +void __init luo_files_startup(void *fdt)
> +{
> +	int ret, node_offset;
> +
> +	node_offset = fdt_subnode_offset(fdt, 0, LUO_FILES_NODE_NAME);
> +	if (node_offset < 0)
> +		panic("Failed to find /file-descriptors node\n");
> +
> +	ret = fdt_node_check_compatible(fdt, node_offset,
> +					LUO_FILES_COMPATIBLE);
> +	if (ret) {
> +		panic("FDT '%s' is incompatible with '%s' [%d]\n",
> +		      LUO_FILES_NODE_NAME, LUO_FILES_COMPATIBLE, ret);
> +	}
> +	luo_fdt_in = fdt;
> +}
> +
> +static void luo_files_recreate_luo_files_xa_in(void)
> +{
> +	int parent_node_offset, file_node_offset;
> +	const char *node_name, *fdt_compat_str;
> +	struct liveupdate_filesystem *fs;
> +	struct luo_file *luo_file;
> +	const void *data_ptr;
> +	int ret = 0;
> +
> +	if (luo_files_xa_in_recreated || !luo_fdt_in)
> +		return;
> +
> +	/* Take write in order to gurantee that we re-create list once */

Typo: s/gurantee/guarantee/

> +	down_write(&luo_filesystems_list_rwsem);
> +	if (luo_files_xa_in_recreated)
> +		goto exit_unlock;
> +
> +	parent_node_offset = fdt_subnode_offset(luo_fdt_in, 0,
> +						LUO_FILES_NODE_NAME);
> +
> +	fdt_for_each_subnode(file_node_offset, luo_fdt_in, parent_node_offset) {
> +		bool handler_found = false;
> +		u64 token;
> +
> +		node_name = fdt_get_name(luo_fdt_in, file_node_offset, NULL);
> +		if (!node_name) {
> +			panic("Skipping FDT subnode at offset %d: Cannot get name\n",
> +			      file_node_offset);

Should failure to parse a specific FD really be a panic? Wouldn't it be
better to continue and let userspace decide if it can live with the FD
missing?
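
As a sketch of the "skip and carry on" behaviour, here is a userspace
model of the loop. It is not the real parser: strtoull() stands in for
kstrtou64() (which is stricter), the node names arrive as a plain array
instead of an FDT walk, and parse_token()/recover_nodes() are names I
made up.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Parse one node name into a token; strtoull() stands in for kstrtou64(). */
static int parse_token(const char *name, uint64_t *token)
{
	char *end;

	errno = 0;
	*token = strtoull(name, &end, 0);
	if (errno || end == name || *end != '\0')
		return -EINVAL;
	return 0;
}

/* Returns how many nodes were recovered; bad ones are skipped, not fatal. */
int recover_nodes(const char *const *names, int count, uint64_t *tokens)
{
	int i, recovered = 0;

	for (i = 0; i < count; i++) {
		if (parse_token(names[i], &tokens[recovered]))
			continue;	/* kernel code would pr_warn() here */
		recovered++;
	}

	return recovered;
}
```

Userspace then simply fails to retrieve the skipped token and can decide
for itself whether that is fatal.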

> +		}
> +
> +		ret = kstrtou64(node_name, 0, &token);
> +		if (ret < 0) {
> +			panic("Skipping FDT node '%s': Failed to parse token\n",
> +			      node_name);
> +		}
> +
> +		fdt_compat_str = fdt_getprop(luo_fdt_in, file_node_offset,
> +					     "compatible", NULL);
> +		if (!fdt_compat_str) {
> +			panic("Skipping FDT node '%s': Missing 'compatible' property\n",
> +			      node_name);
> +		}
> +
> +		data_ptr = fdt_getprop(luo_fdt_in, file_node_offset, "data",
> +				       NULL);
> +		if (!data_ptr) {
> +			panic("Can't recover property 'data' for FDT node '%s'\n",
> +			      node_name);
> +		}
> +
> +		list_for_each_entry(fs, &luo_filesystems_list, list) {
> +			if (!strcmp(fs->compatible, fdt_compat_str)) {
> +				handler_found = true;
> +				break;
> +			}
> +		}
> +
> +		if (!handler_found) {
> +			panic("Skipping FDT node '%s': No registered handler for compatible '%s'\n",
> +			      node_name, fdt_compat_str);

Thinking out loud here: this means that by the time of first retrieval,
all file systems must be registered. Since this is called from
luo_do_files_finish_calls() or luo_retrieve_file(), it will come from
userspace, so all built in modules would be initialized by then. But
some loadable module might not be. I don't see much of a use case for
loadable modules to participate in LUO, so I don't think it should be a
problem.

> +		}
> +
> +		luo_file = kmalloc(sizeof(*luo_file),
> +				   GFP_KERNEL | __GFP_NOFAIL);
> +		luo_file->fs = fs;
> +		luo_file->file = NULL;
> +		memcpy(&luo_file->private_data, data_ptr, sizeof(u64));

Why not make sure the data property is exactly sizeof(u64) bytes when we
parse it, and then simply do
luo_file->private_data = *(const u64 *)data_ptr ?

Because if the previous kernel wrote more than a u64 in data, then
something is broken and we should catch that error anyway.
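
Roughly like the helper below, written as plain userspace C so it stands
alone. data and len model what fdt_getprop() hands back when you pass it
a length pointer; luo_parse_data_prop() is a made-up name. I kept the
memcpy() in the sketch because, as far as I know, FDT property values
are only guaranteed 4-byte alignment, but the length check is the part I
care about.

```c
#include <errno.h>
#include <stdint.h>
#include <string.h>

/*
 * data/len model the value and length returned by fdt_getprop() for the
 * "data" property. Returns 0 and fills *out only if the property holds
 * exactly one u64; anything else is treated as a corrupt FDT.
 */
int luo_parse_data_prop(const void *data, int len, uint64_t *out)
{
	if (!data || len != (int)sizeof(*out))
		return -EINVAL;

	/* The value may not be 8-byte aligned, so copy instead of casting. */
	memcpy(out, data, sizeof(*out));
	return 0;
}
```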

> +		luo_file->reclaimed = false;
> +		mutex_init(&luo_file->mutex);
> +		luo_file->state = LIVEUPDATE_STATE_UPDATED;
> +		ret = xa_err(xa_store(&luo_files_xa_in, token, luo_file,
> +				      GFP_KERNEL | __GFP_NOFAIL));

Should you also check if something is already at token's slot, in case
previous kernel generated wrong tokens or FDT is broken?
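
xa_insert() instead of xa_store() might be enough here, since it returns
-EBUSY when the index is already occupied. The semantics I mean, modelled
in userspace with a tiny array in place of the XArray (slots[] and
token_insert() are made up for the sketch):

```c
#include <errno.h>
#include <stddef.h>

#define MAX_TOKENS 16

/* Tiny stand-in for luo_files_xa_in, indexed directly by token. */
static void *slots[MAX_TOKENS];

/* Models xa_insert(): store only if the slot is empty, else -EBUSY. */
int token_insert(unsigned long token, void *entry)
{
	if (token >= MAX_TOKENS)
		return -EINVAL;
	if (slots[token])
		return -EBUSY;

	slots[token] = entry;
	return 0;
}
```

A duplicate token then shows up as an error at restore time instead of
silently replacing the earlier luo_file.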

> +		if (ret < 0) {
> +			panic("Failed to store luo_file for token %llu in XArray: %d\n",
> +			      token, ret);
> +		}
> +	}
> +	luo_files_xa_in_recreated = true;
> +
> +exit_unlock:
> +	up_write(&luo_filesystems_list_rwsem);
> +}
> +
[...]
> diff --git a/include/linux/liveupdate.h b/include/linux/liveupdate.h
> index 7a130680b5f2..7afe0aac5ce4 100644
> --- a/include/linux/liveupdate.h
> +++ b/include/linux/liveupdate.h
> @@ -86,6 +86,55 @@ enum liveupdate_state  {
>  	LIVEUPDATE_STATE_UPDATED = 3,
>  };
>  
> +/* Forward declaration needed if definition isn't included */
> +struct file;
> +
> +/**
> + * struct liveupdate_filesystem - Represents a handler for a live-updatable
> + * filesystem/file type.
> + * @prepare:       Optional. Saves state for a specific file instance (@file,
> + *                 @arg) before update, potentially returning value via @data.
> + *                 Returns 0 on success, negative errno on failure.
> + * @freeze:        Optional. Performs final actions just before kernel
> + *                 transition, potentially reading/updating the handle via
> + *                 @data.
> + *                 Returns 0 on success, negative errno on failure.
> + * @cancel:        Optional. Cleans up state/resources if update is aborted
> + *                 after prepare/freeze succeeded, using the @data handle (by
> + *                 value) from the successful prepare. Returns void.
> + * @finish:        Optional. Performs final cleanup in the new kernel using the
> + *                 preserved @data handle (by value). Returns void.
> + * @retrieve:      Retrieve the preserved file. Must be called before finish.
> + * @can_preserve:  callback to determine if @file with associated context (@arg)
> + *                 can be preserved by this handler.
> + *                 Return bool (true if preservable, false otherwise).
> + * @compatible:    The compatibility string (e.g., "memfd-v1", "vfiofd-v1")
> + *                 that uniquely identifies the filesystem or file type this
> + *                 handler supports. This is matched against the compatible
> + *                 string associated with individual &struct liveupdate_file
> + *                 instances.
> + * @arg:           An opaque pointer to implementation-specific context data
> + *                 associated with this filesystem handler registration.
> + * @list:          used for linking this handler instance into a global list of
> + *                 registered filesystem handlers.
> + *
> + * Modules that want to support live update for specific file types should
> + * register an instance of this structure. LUO uses this registration to
> + * determine if a given file can be preserved and to find the appropriate
> + * operations to manage its state across the update.
> + */
> +struct liveupdate_filesystem {
> +	int (*prepare)(struct file *file, void *arg, u64 *data);
> +	int (*freeze)(struct file *file, void *arg, u64 *data);
> +	void (*cancel)(struct file *file, void *arg, u64 data);
> +	void (*finish)(struct file *file, void *arg, u64 data, bool reclaimed);
> +	int (*retrieve)(void *arg, u64 data, struct file **file);
> +	bool (*can_preserve)(struct file *file, void *arg);
> +	const char *compatible;
> +	void *arg;

What is the use for this arg? I would expect one file type/system to
register one set of handlers. So they can keep their arg in a global in
their code. I don't see why a per-filesystem arg is needed.

What I do think is needed is a per-file arg. Each callback gets 'data',
which is the serialized data, but there is no place to store runtime
state, like some flags or serialization metadata. Sure, you could make
place for it somewhere in the inode, but I think it would be a lot
cleaner to be able to store it in struct luo_file.

So perhaps rename private_data in struct luo_file to say
serialized_data, and have a field called "private" that filesystems can
use for their runtime state?

Same suggestion for subsystems as well.
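
To make the split concrete, a trimmed-down model of what I am
suggesting. None of this is from the patch: struct luo_file_sketch,
struct demo_state and demo_prepare() are hypothetical, and the handle
value is a placeholder.

```c
#include <stdint.h>

/* Trimmed-down model of struct luo_file with the two roles separated. */
struct luo_file_sketch {
	uint64_t serialized_data;	/* handed to the next kernel via FDT */
	void *private;			/* runtime-only state, never serialized */
};

/* Hypothetical per-file runtime state kept by some handler. */
struct demo_state {
	unsigned int flags;
};

/* Models a prepare callback: records a handle and stashes runtime state. */
int demo_prepare(struct luo_file_sketch *lf, struct demo_state *st)
{
	st->flags |= 1u;			/* e.g. "prepared" */
	lf->private = st;
	lf->serialized_data = 0xabcdULL;	/* placeholder handle */
	return 0;
}
```

freeze and cancel would then get at the handler's flags through
lf->private without having to hang anything off the inode.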

> +	struct list_head list;
> +};
> +
>  /**
>   * struct liveupdate_subsystem - Represents a subsystem participating in LUO
>   * @prepare:      Optional. Called during LUO prepare phase. Should perform
[...]

-- 
Regards,
Pratyush Yadav


* Re: [RFC v2 09/16] luo: luo_files: implement file systems callbacks
  2025-05-15 18:23 ` [RFC v2 09/16] luo: luo_files: implement file systems callbacks Pasha Tatashin
@ 2025-06-05 16:03   ` Pratyush Yadav
  2025-06-08 13:49     ` Pasha Tatashin
  0 siblings, 1 reply; 102+ messages in thread
From: Pratyush Yadav @ 2025-06-05 16:03 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: pratyush, jasonmiu, graf, changyuanl, rppt, dmatlack, rientjes,
	corbet, rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl,
	masahiroy, akpm, tj, yoann.congal, mmaurer, roman.gushchin,
	chenridong, axboe, mark.rutland, jannh, vincent.guittot, hannes,
	dan.j.williams, david, joel.granados, rostedt, anna.schumaker,
	song, zhangguopeng, linux, linux-kernel, linux-doc, linux-mm,
	gregkh, tglx, mingo, bp, dave.hansen, x86, hpa, rafael, dakr,
	bartosz.golaszewski, cw00.choi, myungjoo.ham, yesanishhere,
	Jonathan.Cameron, quic_zijuhu, aleksander.lobakin, ira.weiny,
	andriy.shevchenko, leon, lukas, bhelgaas, wagi, djeffery,
	stuart.w.hayes

On Thu, May 15 2025, Pasha Tatashin wrote:

> Implements the core logic within luo_files.c to invoke the prepare,
> reboot, finish, and cancel callbacks for preserved file instances,
> replacing the previous stub implementations. It also handles
> the persistence and retrieval of the u64 data payload associated with
> each file via the LUO FDT.
>
> This completes the core mechanism enabling registered filesystem
> handlers to actively manage file state across the live update
> transition using the LUO framework.
>
> Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
> ---
>  drivers/misc/liveupdate/luo_files.c | 105 +++++++++++++++++++++++++++-
>  1 file changed, 103 insertions(+), 2 deletions(-)
>
[...]
> @@ -305,7 +369,29 @@ int luo_do_files_prepare_calls(void)
>   */
>  int luo_do_files_freeze_calls(void)
>  {
> -	return 0;
> +	unsigned long token;
> +	struct luo_file *h;
> +	int ret;
> +
> +	xa_for_each(&luo_files_xa_out, token, h) {

Should we also ensure at this point that there are no open handles to
this file? How else would a file system ensure the file is in quiescent
state to do its final serialization?

This conflicts with my suggestion to have freeze callbacks never fail,
but now that I think of it, this is also important, so maybe we have to
live with freeze that can fail.

> +		if (h->fs->freeze) {
> +			ret = h->fs->freeze(h->file, h->fs->arg,
> +					    &h->private_data);
> +			if (ret < 0) {
> +				pr_err("Freeze callback failed for file token %#0llx handler '%s' [%d]\n",
> +				       (u64)token, h->fs->compatible, ret);
> +				__luo_do_files_cancel_calls(h);
> +
> +				return ret;
> +			}
> +		}
> +	}
> +
> +	ret = luo_files_commit_data_to_fdt();
> +	if (ret)
> +		__luo_do_files_cancel_calls(NULL);
> +
> +	return ret;
>  }
>  
>  /**
> @@ -316,7 +402,20 @@ int luo_do_files_freeze_calls(void)
>   */
>  void luo_do_files_finish_calls(void)
>  {
> +	unsigned long token;
> +	struct luo_file *h;
> +
>  	luo_files_recreate_luo_files_xa_in();
> +	xa_for_each(&luo_files_xa_in, token, h) {
> +		mutex_lock(&h->mutex);
> +		if (h->state == LIVEUPDATE_STATE_UPDATED && h->fs->finish) {
> +			h->fs->finish(h->file, h->fs->arg,
> +				      h->private_data,
> +				      h->reclaimed);
> +			h->state = LIVEUPDATE_STATE_NORMAL;
> +		}
> +		mutex_unlock(&h->mutex);
> +	}

We can also clean up luo_files_xa_in at this point, right?

>  }
>  
>  /**
> @@ -330,6 +429,8 @@ void luo_do_files_finish_calls(void)
>   */
>  void luo_do_files_cancel_calls(void)
>  {
> +	__luo_do_files_cancel_calls(NULL);
> +	luo_files_commit_data_to_fdt();
>  }
>  
>  /**

-- 
Regards,
Pratyush Yadav

* Re: [RFC v2 10/16] luo: luo_ioctl: add ioctl interface
  2025-05-15 18:23 ` [RFC v2 10/16] luo: luo_ioctl: add ioctl interface Pasha Tatashin
  2025-05-26  8:42   ` Mike Rapoport
  2025-05-28 20:29   ` David Matlack
@ 2025-06-05 16:15   ` Pratyush Yadav
  2025-06-08 16:35     ` Pasha Tatashin
  2025-06-24  9:50   ` Christian Brauner
  3 siblings, 1 reply; 102+ messages in thread
From: Pratyush Yadav @ 2025-06-05 16:15 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: pratyush, jasonmiu, graf, changyuanl, rppt, dmatlack, rientjes,
	corbet, rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl,
	masahiroy, akpm, tj, yoann.congal, mmaurer, roman.gushchin,
	chenridong, axboe, mark.rutland, jannh, vincent.guittot, hannes,
	dan.j.williams, david, joel.granados, rostedt, anna.schumaker,
	song, zhangguopeng, linux, linux-kernel, linux-doc, linux-mm,
	gregkh, tglx, mingo, bp, dave.hansen, x86, hpa, rafael, dakr,
	bartosz.golaszewski, cw00.choi, myungjoo.ham, yesanishhere,
	Jonathan.Cameron, quic_zijuhu, aleksander.lobakin, ira.weiny,
	andriy.shevchenko, leon, lukas, bhelgaas, wagi, djeffery,
	stuart.w.hayes

On Thu, May 15 2025, Pasha Tatashin wrote:

> Introduce the user-space interface for the Live Update Orchestrator
> via ioctl commands, enabling external control over the live update
> process and management of preserved resources.
>
> Create a misc character device at /dev/liveupdate. Access
> to this device requires the CAP_SYS_ADMIN capability.
>
> A new UAPI header, <uapi/linux/liveupdate.h>, defines the necessary
> structures. The magic number is registered in
> Documentation/userspace-api/ioctl/ioctl-number.rst.
>
> Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
> ---
[...]
> +static int luo_ioctl_fd_preserve(struct liveupdate_fd *luo_fd)
> +{
> +	struct file *file;
> +	int ret;
> +
> +	file = fget(luo_fd->fd);
> +	if (!file) {
> +		pr_err("Bad file descriptor\n");
> +		return -EBADF;
> +	}
> +
> +	ret = luo_register_file(&luo_fd->token, file);
> +	if (ret)
> +		fput(file);
> +
> +	return ret;
> +}
> +
> +static int luo_ioctl_fd_unpreserve(u64 token)
> +{

This leaks the refcount on the file that preserve took. Perhaps
luo_unregister_file() should return the file it unregistered, so this
can do fput(file)?
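
To illustrate the shape I have in mind, here is a userspace model only,
not kernel code. All model_* names are hypothetical; the "file" is just
a reference count, and the unregister helper hands the file back so the
caller can drop the reference that preserve took:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Stand-in for struct file: only the reference count is modeled. */
struct model_file {
	int refcount;
};

#define MODEL_MAX_TOKENS 8
static struct model_file *model_registry[MODEL_MAX_TOKENS];

/* Models fput(): drop one reference. */
static void model_fput(struct model_file *f)
{
	assert(f->refcount > 0);
	f->refcount--;
}

/* Models the preserve path: take a reference and store it under a token. */
static int model_preserve(struct model_file *f, unsigned int *token)
{
	unsigned int i;

	for (i = 0; i < MODEL_MAX_TOKENS; i++) {
		if (!model_registry[i]) {
			f->refcount++;	/* the fget() done by preserve */
			model_registry[i] = f;
			*token = i;
			return 0;
		}
	}
	return -ENOSPC;
}

/* Hypothetical unregister variant that returns the file it removed. */
static int model_unregister_file(unsigned int token,
				 struct model_file **filep)
{
	if (token >= MODEL_MAX_TOKENS || !model_registry[token])
		return -ENOENT;
	*filep = model_registry[token];
	model_registry[token] = NULL;
	return 0;
}

/* Unpreserve drops the reference only after a successful unregister. */
static int model_unpreserve(unsigned int token)
{
	struct model_file *f;
	int ret;

	ret = model_unregister_file(token, &f);
	if (ret)
		return ret;
	model_fput(f);
	return 0;
}
```

With this shape the reference count after preserve followed by
unpreserve is back to where it started, and a second unpreserve of the
same token fails without touching the count.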

> +	return luo_unregister_file(token);
> +}
> +
> +static int luo_ioctl_fd_restore(struct liveupdate_fd *luo_fd)
> +{
> +	struct file *file;
> +	int ret;
> +	int fd;
> +
> +	fd = get_unused_fd_flags(O_CLOEXEC);
> +	if (fd < 0) {
> +		pr_err("Failed to allocate new fd: %d\n", fd);
> +		return fd;
> +	}
> +
> +	ret = luo_retrieve_file(luo_fd->token, &file);
> +	if (ret < 0) {
> +		put_unused_fd(fd);
> +
> +		return ret;
> +	}
> +
> +	fd_install(fd, file);
> +	luo_fd->fd = fd;
> +
> +	return 0;
> +}
> +
> +static int luo_open(struct inode *inodep, struct file *filep)
> +{
> +	if (!capable(CAP_SYS_ADMIN))
> +		return -EACCES;
> +
> +	if (filep->f_flags & O_EXCL)
> +		return -EINVAL;
> +
> +	return 0;
> +}
> +
> +static long luo_ioctl(struct file *filep, unsigned int cmd, unsigned long arg)
> +{
> +	void __user *argp = (void __user *)arg;
> +	struct liveupdate_fd luo_fd;
> +	enum liveupdate_state state;
> +	int ret = 0;
> +	u64 token;
> +
> +	if (_IOC_TYPE(cmd) != LIVEUPDATE_IOCTL_TYPE)
> +		return -ENOTTY;
> +
> +	switch (cmd) {
> +	case LIVEUPDATE_IOCTL_GET_STATE:
> +		state = READ_ONCE(luo_state);
> +		if (copy_to_user(argp, &state, sizeof(luo_state)))
> +			ret = -EFAULT;
> +		break;
> +
> +	case LIVEUPDATE_IOCTL_EVENT_PREPARE:
> +		ret = luo_prepare();
> +		break;
> +
> +	case LIVEUPDATE_IOCTL_EVENT_FREEZE:
> +		ret = luo_freeze();
> +		break;
> +
> +	case LIVEUPDATE_IOCTL_EVENT_FINISH:
> +		ret = luo_finish();
> +		break;
> +
> +	case LIVEUPDATE_IOCTL_EVENT_CANCEL:
> +		ret = luo_cancel();
> +		break;
> +
> +	case LIVEUPDATE_IOCTL_FD_PRESERVE:
> +		if (copy_from_user(&luo_fd, argp, sizeof(luo_fd))) {
> +			ret = -EFAULT;
> +			break;
> +		}
> +
> +		ret = luo_ioctl_fd_preserve(&luo_fd);
> +		if (!ret && copy_to_user(argp, &luo_fd, sizeof(luo_fd)))
> +			ret = -EFAULT;

luo_unregister_file() is needed here on error.

> +		break;
> +
> +	case LIVEUPDATE_IOCTL_FD_UNPRESERVE:
> +		if (copy_from_user(&token, argp, sizeof(u64))) {
> +			ret = -EFAULT;
> +			break;
> +		}
> +
> +		ret = luo_ioctl_fd_unpreserve(token);
> +		break;
> +
> +	case LIVEUPDATE_IOCTL_FD_RESTORE:
> +		if (copy_from_user(&luo_fd, argp, sizeof(luo_fd))) {
> +			ret = -EFAULT;
> +			break;
> +		}
> +
> +		ret = luo_ioctl_fd_restore(&luo_fd);
> +		if (!ret && copy_to_user(argp, &luo_fd, sizeof(luo_fd)))
> +			ret = -EFAULT;
> +		break;
> +
> +	default:
> +		pr_warn("ioctl: unknown command nr: 0x%x\n", _IOC_NR(cmd));
> +		ret = -ENOTTY;
> +		break;
> +	}
> +
> +	return ret;
> +}
[...]

-- 
Regards,
Pratyush Yadav

* Re: [RFC v2 11/16] luo: luo_sysfs: add sysfs state monitoring
  2025-05-15 18:23 ` [RFC v2 11/16] luo: luo_sysfs: add sysfs state monitoring Pasha Tatashin
@ 2025-06-05 16:20   ` Pratyush Yadav
  2025-06-08 16:36     ` Pasha Tatashin
  0 siblings, 1 reply; 102+ messages in thread
From: Pratyush Yadav @ 2025-06-05 16:20 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: pratyush, jasonmiu, graf, changyuanl, rppt, dmatlack, rientjes,
	corbet, rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl,
	masahiroy, akpm, tj, yoann.congal, mmaurer, roman.gushchin,
	chenridong, axboe, mark.rutland, jannh, vincent.guittot, hannes,
	dan.j.williams, david, joel.granados, rostedt, anna.schumaker,
	song, zhangguopeng, linux, linux-kernel, linux-doc, linux-mm,
	gregkh, tglx, mingo, bp, dave.hansen, x86, hpa, rafael, dakr,
	bartosz.golaszewski, cw00.choi, myungjoo.ham, yesanishhere,
	Jonathan.Cameron, quic_zijuhu, aleksander.lobakin, ira.weiny,
	andriy.shevchenko, leon, lukas, bhelgaas, wagi, djeffery,
	stuart.w.hayes

On Thu, May 15 2025, Pasha Tatashin wrote:

> Introduce a sysfs interface for the Live Update Orchestrator
> under /sys/kernel/liveupdate/. This interface provides a way for
> userspace tools and scripts to monitor the current state of the LUO
> state machine.

I am not sure if adding and maintaining a new UAPI that does the same
thing is worth it. Can't we just have commandline utilities that can do
the ioctls and fetch the LUO state, and those can be called from tools
and scripts?

>
> The main feature is a read-only file, state, which displays the
> current LUO state as a string ("normal", "prepared", "frozen",
> "updated"). The interface uses sysfs_notify to allow userspace
> listeners (e.g., via poll) to be efficiently notified of state changes.
>
> ABI documentation for this new sysfs interface is added in
> Documentation/ABI/testing/sysfs-kernel-liveupdate.
>
> This read-only sysfs interface complements the main ioctl interface
> provided by /dev/liveupdate, which handles LUO control operations and
> resource management.
>
> Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
[...]

-- 
Regards,
Pratyush Yadav

* Re: [RFC v2 01/16] kho: make debugfs interface optional
  2025-06-04 16:03   ` Pratyush Yadav
@ 2025-06-06 16:12     ` Pasha Tatashin
  0 siblings, 0 replies; 102+ messages in thread
From: Pasha Tatashin @ 2025-06-06 16:12 UTC (permalink / raw)
  To: Pratyush Yadav
  Cc: jasonmiu, graf, changyuanl, rppt, dmatlack, rientjes, corbet,
	rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl, masahiroy, akpm,
	tj, yoann.congal, mmaurer, roman.gushchin, chenridong, axboe,
	mark.rutland, jannh, vincent.guittot, hannes, dan.j.williams,
	david, joel.granados, rostedt, anna.schumaker, song, zhangguopeng,
	linux, linux-kernel, linux-doc, linux-mm, gregkh, tglx, mingo, bp,
	dave.hansen, x86, hpa, rafael, dakr, bartosz.golaszewski,
	cw00.choi, myungjoo.ham, yesanishhere, Jonathan.Cameron,
	quic_zijuhu, aleksander.lobakin, ira.weiny, andriy.shevchenko,
	leon, lukas, bhelgaas, wagi, djeffery, stuart.w.hayes

On Wed, Jun 4, 2025 at 12:03 PM Pratyush Yadav <pratyush@kernel.org> wrote:
>
> On Thu, May 15 2025, Pasha Tatashin wrote:
>
> > Currently, KHO is controlled via debugfs interface, but once LUO is
> > introduced, it can control KHO, and the debug interface becomes
> > optional.
> >
> > Add a separate config CONFIG_KEXEC_HANDOVER_DEBUG that enables
> > the debugfs interface, and allows to inspect the tree.
> >
> > Move all debufs related code to a new file to keep the .c files
>
> Nit: s/debufs/debugfs/

Done.

Thanks,
Pasha

* Re: [RFC v2 03/16] kho: add kho_unpreserve_folio/phys
  2025-06-04 15:00   ` Pratyush Yadav
@ 2025-06-06 16:22     ` Pasha Tatashin
  0 siblings, 0 replies; 102+ messages in thread
From: Pasha Tatashin @ 2025-06-06 16:22 UTC (permalink / raw)
  To: Pratyush Yadav
  Cc: jasonmiu, graf, changyuanl, rppt, dmatlack, rientjes, corbet,
	rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl, masahiroy, akpm,
	tj, yoann.congal, mmaurer, roman.gushchin, chenridong, axboe,
	mark.rutland, jannh, vincent.guittot, hannes, dan.j.williams,
	david, joel.granados, rostedt, anna.schumaker, song, zhangguopeng,
	linux, linux-kernel, linux-doc, linux-mm, gregkh, tglx, mingo, bp,
	dave.hansen, x86, hpa, rafael, dakr, bartosz.golaszewski,
	cw00.choi, myungjoo.ham, yesanishhere, Jonathan.Cameron,
	quic_zijuhu, aleksander.lobakin, ira.weiny, andriy.shevchenko,
	leon, lukas, bhelgaas, wagi, djeffery, stuart.w.hayes

On Wed, Jun 4, 2025 at 11:00 AM Pratyush Yadav <pratyush@kernel.org> wrote:
>
> On Thu, May 15 2025, Pasha Tatashin wrote:
>
> > From: Changyuan Lyu <changyuanl@google.com>
> >
> > Allow users of KHO to cancel the previous preservation by adding the
> > necessary interfaces to unpreserve folio.
> >
> > Signed-off-by: Changyuan Lyu <changyuanl@google.com>
> > Co-developed-by: Pasha Tatashin <pasha.tatashin@soleen.com>
> > Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
> > ---
> >  include/linux/kexec_handover.h | 12 +++++
> >  kernel/kexec_handover.c        | 84 ++++++++++++++++++++++++++++------
> >  2 files changed, 83 insertions(+), 13 deletions(-)
> >
> [...]
> > diff --git a/kernel/kexec_handover.c b/kernel/kexec_handover.c
> > index 8ff561e36a87..eb305e7e6129 100644
> > --- a/kernel/kexec_handover.c
> > +++ b/kernel/kexec_handover.c
> > @@ -101,26 +101,33 @@ static void *xa_load_or_alloc(struct xarray *xa, unsigned long index, size_t sz)
> >       return elm;
> >  }
> >
> > -static void __kho_unpreserve(struct kho_mem_track *track, unsigned long pfn,
> > -                          unsigned long end_pfn)
> > +static void __kho_unpreserve_order(struct kho_mem_track *track, unsigned long pfn,
> > +                                unsigned int order)
> >  {
> >       struct kho_mem_phys_bits *bits;
> >       struct kho_mem_phys *physxa;
> > +     const unsigned long pfn_high = pfn >> order;
> >
> > -     while (pfn < end_pfn) {
> > -             const unsigned int order =
> > -                     min(count_trailing_zeros(pfn), ilog2(end_pfn - pfn));
> > -             const unsigned long pfn_high = pfn >> order;
> > +     physxa = xa_load(&track->orders, order);
> > +     if (!physxa)
> > +             return;
> >
> > -             physxa = xa_load(&track->orders, order);
> > -             if (!physxa)
> > -                     continue;
> > +     bits = xa_load(&physxa->phys_bits, pfn_high / PRESERVE_BITS);
> > +     if (!bits)
> > +             return;
> >
> > -             bits = xa_load(&physxa->phys_bits, pfn_high / PRESERVE_BITS);
> > -             if (!bits)
> > -                     continue;
> > +     clear_bit(pfn_high % PRESERVE_BITS, bits->preserve);
> > +}
> >
> > -             clear_bit(pfn_high % PRESERVE_BITS, bits->preserve);
> > +static void __kho_unpreserve(struct kho_mem_track *track, unsigned long pfn,
> > +                          unsigned long end_pfn)
> > +{
> > +     unsigned int order;
> > +
> > +     while (pfn < end_pfn) {
> > +             order = min(count_trailing_zeros(pfn), ilog2(end_pfn - pfn));
>
> This is fragile. If the preserve call spans say 4 PFNs, then it gets
> preserved as a order 2 allocation, but if the PFNs are unpreserved
> one-by-one, __kho_unpreserve_order() will unpreserve from the order 0
> xarray, which will end up doing nothing, leaking those pages.
>
> It should either look through all orders to find the PFN, or at least
> have a requirement in the API that the same phys and size combination as
> the preserve call must be given to unpreserve.

Thank you Pratyush, this is an excellent point. I will add to the
comments of these functions that it is a requirement to unpreserve
exactly the memory that was preserved, and that unpreserving
subsections is not allowed. I do not think partial unpreserve is needed
now, but if a use case arises in the future, we can relax this
requirement.
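
For the record, the mismatch can be reproduced with a small userspace
model. This is only a sketch: the model_* names, the fixed array sizes,
and the loop step (pfn += 1 << order) are assumptions made for the
illustration, not the kernel code. Each order gets its own bitmap
indexed by pfn >> order, and a range is split into the largest
naturally aligned chunks:

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_MAX_ORDER 4u
#define MODEL_MAX_PFN   64ul

/* One bitmap per order, indexed by (pfn >> order). */
static bool preserved[MODEL_MAX_ORDER + 1][MODEL_MAX_PFN];

/* Portable stand-in for count_trailing_zeros(). */
static unsigned int model_ctz(unsigned long v)
{
	unsigned int n = 0;

	if (!v)
		return 8 * sizeof(v);
	while (!(v & 1)) {
		v >>= 1;
		n++;
	}
	return n;
}

/* Portable stand-in for ilog2(); the caller guarantees v > 0. */
static unsigned int model_ilog2(unsigned long v)
{
	unsigned int n = 0;

	while ((v >>= 1) != 0)
		n++;
	return n;
}

/* Set or clear the tracking bits for [pfn, end_pfn). */
static void model_mark(unsigned long pfn, unsigned long end_pfn, bool value)
{
	assert(end_pfn <= MODEL_MAX_PFN);
	while (pfn < end_pfn) {
		unsigned int a = model_ctz(pfn);
		unsigned int b = model_ilog2(end_pfn - pfn);
		unsigned int order = a < b ? a : b;

		if (order > MODEL_MAX_ORDER)
			order = MODEL_MAX_ORDER;	/* model limit only */
		preserved[order][pfn >> order] = value;
		pfn += 1ul << order;
	}
}

static void model_preserve(unsigned long pfn, unsigned long end_pfn)
{
	model_mark(pfn, end_pfn, true);
}

static void model_unpreserve(unsigned long pfn, unsigned long end_pfn)
{
	model_mark(pfn, end_pfn, false);
}

/* Number of bits still set across all orders. */
static unsigned int model_count(void)
{
	unsigned int order, n = 0;
	unsigned long i;

	for (order = 0; order <= MODEL_MAX_ORDER; order++)
		for (i = 0; i < MODEL_MAX_PFN; i++)
			n += preserved[order][i];
	return n;
}
```

In this model, preserving PFNs [8, 12) sets a single order-2 bit.
Unpreserving the same four PFNs one at a time only touches the order-0
bitmap, so the order-2 bit stays set; unpreserving [8, 12) in one call
clears it.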

Pasha

* Re: [RFC v2 04/16] luo: luo_core: Live Update Orchestrator
  2025-06-04 15:17   ` Pratyush Yadav
@ 2025-06-07 17:11     ` Pasha Tatashin
  0 siblings, 0 replies; 102+ messages in thread
From: Pasha Tatashin @ 2025-06-07 17:11 UTC (permalink / raw)
  To: Pratyush Yadav
  Cc: jasonmiu, graf, changyuanl, rppt, dmatlack, rientjes, corbet,
	rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl, masahiroy, akpm,
	tj, yoann.congal, mmaurer, roman.gushchin, chenridong, axboe,
	mark.rutland, jannh, vincent.guittot, hannes, dan.j.williams,
	david, joel.granados, rostedt, anna.schumaker, song, zhangguopeng,
	linux, linux-kernel, linux-doc, linux-mm, gregkh, tglx, mingo, bp,
	dave.hansen, x86, hpa, rafael, dakr, bartosz.golaszewski,
	cw00.choi, myungjoo.ham, yesanishhere, Jonathan.Cameron,
	quic_zijuhu, aleksander.lobakin, ira.weiny, andriy.shevchenko,
	leon, lukas, bhelgaas, wagi, djeffery, stuart.w.hayes

> > + * Based on the outcome of the notification process:
> > + * - If luo_do_freeze_calls() returns 0 (all callbacks succeeded), the state
> > + * is set to %LIVEUPDATE_STATE_FROZEN using luo_set_state(), indicating
> > + * readiness for the imminent kexec.
> > + * - If luo_do_freeze_calls() returns a negative error code (a callback
> > + * failed), the state is reverted to %LIVEUPDATE_STATE_NORMAL using
> > + * luo_set_state() to cancel the live update attempt.
>
> Would we end up with a more robust serialization in subsystems or
> filesystems if we do not allow freeze to fail? Then they would be forced
> to ensure they have everything in order by the time the system goes into
> prepared state, and only need to make small adjustments in the freeze
> callback.
>

The reboot syscall is allowed to fail. Since freeze happens once we
leave userspace, it is the only chance left to conduct proper
verification that serialization assumptions have been maintained. For
example, if, after the prepare phase, some mutations are not allowed
for preserved resources (such as DMA re-mappings, etc.), the freeze
phase is the only place where we can perform this verification and
return an error to the user. So, while I agree it could simplify the
state machine by allowing cancellation only from the prepared state, I
think it is important to leave this ability for the freeze phase as
well.
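
As a rough userspace sketch of the transitions described in the quoted
comment (the model_* names are hypothetical, and requiring the prepared
state before freeze is an assumption of this model): a failing freeze
callback reverts the state to normal and the error is returned to the
caller.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

enum model_state {
	MODEL_NORMAL,
	MODEL_PREPARED,
	MODEL_FROZEN,
};

/* normal -> prepared; any other starting state is rejected. */
static int model_prepare(enum model_state *state)
{
	assert(state != NULL);
	if (*state != MODEL_NORMAL)
		return -EINVAL;
	*state = MODEL_PREPARED;
	return 0;
}

/*
 * prepared -> frozen when the callback succeeds. When it fails, the
 * state is reverted to normal and the error is propagated.
 */
static int model_freeze(enum model_state *state, int (*freeze_cb)(void))
{
	int ret;

	assert(state != NULL && freeze_cb != NULL);
	if (*state != MODEL_PREPARED)
		return -EINVAL;

	ret = freeze_cb();
	if (ret < 0) {
		*state = MODEL_NORMAL;
		return ret;
	}

	*state = MODEL_FROZEN;
	return 0;
}

/* Callback that finds everything still consistent. */
static int model_freeze_ok(void)
{
	return 0;
}

/* Callback that detects a forbidden mutation made after prepare. */
static int model_freeze_sees_mutation(void)
{
	return -EBUSY;
}
```

The second callback stands for the verification case above: something
changed after prepare, so freeze refuses and the caller gets the error
back instead of going through with the kexec.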

Pasha

* Re: [RFC v2 05/16] luo: luo_core: integrate with KHO
  2025-05-26  7:18   ` Mike Rapoport
@ 2025-06-07 17:50     ` Pasha Tatashin
  2025-06-09  2:14       ` Pasha Tatashin
  0 siblings, 1 reply; 102+ messages in thread
From: Pasha Tatashin @ 2025-06-07 17:50 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: pratyush, jasonmiu, graf, changyuanl, dmatlack, rientjes, corbet,
	rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl, masahiroy, akpm,
	tj, yoann.congal, mmaurer, roman.gushchin, chenridong, axboe,
	mark.rutland, jannh, vincent.guittot, hannes, dan.j.williams,
	david, joel.granados, rostedt, anna.schumaker, song, zhangguopeng,
	linux, linux-kernel, linux-doc, linux-mm, gregkh, tglx, mingo, bp,
	dave.hansen, x86, hpa, rafael, dakr, bartosz.golaszewski,
	cw00.choi, myungjoo.ham, yesanishhere, Jonathan.Cameron,
	quic_zijuhu, aleksander.lobakin, ira.weiny, andriy.shevchenko,
	leon, lukas, bhelgaas, wagi, djeffery, stuart.w.hayes, ptyadav

On Mon, May 26, 2025 at 3:19 AM Mike Rapoport <rppt@kernel.org> wrote:
>
> On Thu, May 15, 2025 at 06:23:09PM +0000, Pasha Tatashin wrote:
> > Integrate the LUO with the KHO framework to enable passing LUO state
> > across a kexec reboot.
> >
> > This patch introduces the following changes:
> > - During the KHO finalization phase allocate FDT blob.
> > - Populate this FDT with a LUO compatibility string ("luo-v1") and the
> >   current LUO state (`luo_state`).
> > - Implement a KHO notifier
>
> Would be nice to have more details about how LUO interacts with KHO, like
> how LUO states correspond to the state of KHO, what may trigger
> corresponding state transitions etc.

Updated.

>
> > LUO now depends on `CONFIG_KEXEC_HANDOVER`. The core state transition
> > logic (`luo_do_*_calls`) remains unimplemented in this patch.
> >
> > Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
> > ---
> >  drivers/misc/liveupdate/luo_core.c | 222 ++++++++++++++++++++++++++++-
> >  1 file changed, 219 insertions(+), 3 deletions(-)
> >
> > diff --git a/drivers/misc/liveupdate/luo_core.c b/drivers/misc/liveupdate/luo_core.c
> > index 919c37b0b4d1..a76e886bc3b1 100644
> > --- a/drivers/misc/liveupdate/luo_core.c
> > +++ b/drivers/misc/liveupdate/luo_core.c
> > @@ -36,9 +36,12 @@
> >  #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
> >
> >  #include <linux/err.h>
> > +#include <linux/kexec_handover.h>
> >  #include <linux/kobject.h>
> > +#include <linux/libfdt.h>
> >  #include <linux/liveupdate.h>
> >  #include <linux/rwsem.h>
> > +#include <linux/sizes.h>
> >  #include <linux/string.h>
> >  #include "luo_internal.h"
> >
> > @@ -55,6 +58,12 @@ const char *const luo_state_str[] = {
> >
> >  bool luo_enabled;
> >
> > +static void *luo_fdt_out;
> > +static void *luo_fdt_in;
> > +#define LUO_FDT_SIZE         SZ_1M
>
> Does LUO really need that much?

Not really, but I am keeping it simple in this patch. I added the
following comment:

/*
 * The LUO FDT size depends on the number of participating subsystems,
 * preserved file descriptors, and devices. While the total size could be
 * calculated precisely during the "prepare" phase, it would require
 * iterating through all participants twice: once to calculate the required
 * size, and a second time to actually preserve the data and populate the FDT.
 *
 * Given that each participant stores only a small amount of metadata
 * (e.g., an 8-byte payload or pointer) directly in the LUO FDT, and that
 * this FDT is used only during the relatively short kexec transition
 * period (including the blackout window and early boot of the next kernel),
 * a fixed size is used for simplicity.
 *
 * The current fixed size (1M) is large enough to handle a reasonable number
 * of preserved entities. If this size ever becomes insufficient, it can
 * either be increased, or a dynamic size calculation mechanism could be
 * implemented in the future.
 */

Pasha

* Re: [RFC v2 05/16] luo: luo_core: integrate with KHO
  2025-06-04 16:00   ` Pratyush Yadav
@ 2025-06-07 23:30     ` Pasha Tatashin
  2025-06-13 14:58       ` Pratyush Yadav
  0 siblings, 1 reply; 102+ messages in thread
From: Pasha Tatashin @ 2025-06-07 23:30 UTC (permalink / raw)
  To: Pratyush Yadav
  Cc: jasonmiu, graf, changyuanl, rppt, dmatlack, rientjes, corbet,
	rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl, masahiroy, akpm,
	tj, yoann.congal, mmaurer, roman.gushchin, chenridong, axboe,
	mark.rutland, jannh, vincent.guittot, hannes, dan.j.williams,
	david, joel.granados, rostedt, anna.schumaker, song, zhangguopeng,
	linux, linux-kernel, linux-doc, linux-mm, gregkh, tglx, mingo, bp,
	dave.hansen, x86, hpa, rafael, dakr, bartosz.golaszewski,
	cw00.choi, myungjoo.ham, yesanishhere, Jonathan.Cameron,
	quic_zijuhu, aleksander.lobakin, ira.weiny, andriy.shevchenko,
	leon, lukas, bhelgaas, wagi, djeffery, stuart.w.hayes

> > +     fdt_out = (void *)__get_free_pages(GFP_KERNEL | __GFP_ZERO,
> > +                                        get_order(LUO_FDT_SIZE));
>
> Why not alloc_folio()? KHO already deals with folios so it seems
> simpler. The kho_{un,}preserve_folio() functions exist exactly for these
> kinds of allocations, so LUO also ends up being a first user. You also
> won't end up needing kho_unpreserve_phys() and all the __pa() calls.

I prefer phys here because, this way, the size and alignment are not
bound to a specific order; the region can be any number of pages instead.

> > +     if (!fdt_out) {
> > +             pr_err("failed to allocate FDT memory\n");
> > +             return -ENOMEM;
> > +     }
> > +
> > +     ret = fdt_create_empty_tree(fdt_out, LUO_FDT_SIZE);
>
> You are using FDT read/write functions throughout the series to create
> new FDTs. The sequential write functions are generally more efficient
> since they are meant for creating new FDT blobs. The read/write
> functions are better for modifying an existing FDT blob.
>
> Is there a particular reason you do this?
>
> When using FDT SW functions, the creation of the tree would be something
> like:
>
>         fdt_create()
>         fdt_finish_reservemap()
>         fdt_begin_node()
>
>         // Add stuff to FDT
>
>         fdt_end_node()
>         fdt_finish()
>
> In this patch, the FDT does not change much after creation so it doesn't
> look like it matters much, but in later patches, the FDT is passed to
> luo_files_fdt_setup() and luo_subsystems_fdt_setup() which probably
> modify the FDT a fair bit.

The number of changes to the empty-tree FDT is small, and this is done
only once, so I do not think the extra cost is substantial. This could
be a future optimization. Also, we could use a hybrid approach where
luo_files/luo_subsystems do the SW updates, while here we do
read/write updates.

> > +     if (ret)
> > +             goto exit_free;
> > +
> > +     ret = fdt_setprop(fdt_out, 0, "compatible", LUO_COMPATIBLE,
> > +                       strlen(LUO_COMPATIBLE) + 1);
>
> fdt_setprop_string() instead? Or if you change to FDT SW,

Updated, thanks!

> fdt_property_string().
>
> > +     if (ret)
> > +             goto exit_free;
> > +
> > +     ret = kho_preserve_phys(__pa(fdt_out), LUO_FDT_SIZE);
> > +     if (ret)
> > +             goto exit_free;
> > +
> > +     ret = kho_add_subtree(ser, LUO_KHO_ENTRY_NAME, fdt_out);
> > +     if (ret)
> > +             goto exit_unpreserve;
> > +     luo_fdt_out = fdt_out;
> > +
> > +     return 0;
> > +
> > +exit_unpreserve:
> > +     kho_unpreserve_phys(__pa(fdt_out), LUO_FDT_SIZE);
> > +exit_free:
> > +     free_pages((unsigned long)fdt_out, get_order(LUO_FDT_SIZE));
> > +     pr_err("failed to prepare LUO FDT: %d\n", ret);
> > +
> > +     return ret;
> > +}
> > +
> > +static void luo_fdt_destroy(void)
> > +{
> > +     kho_unpreserve_phys(__pa(luo_fdt_out), LUO_FDT_SIZE);
> > +     free_pages((unsigned long)luo_fdt_out, get_order(LUO_FDT_SIZE));
> > +     luo_fdt_out = NULL;
> > +}
> > +
> > +static int luo_do_prepare_calls(void)
> > +{
> > +     return 0;
> > +}
> > +
> >  static int luo_do_freeze_calls(void)
> >  {
> >       return 0;
> > @@ -88,11 +151,111 @@ static void luo_do_finish_calls(void)
> >  {
> >  }
> >
> > -int luo_prepare(void)
> > +static void luo_do_cancel_calls(void)
> > +{
> > +}
> > +
> > +static int __luo_prepare(struct kho_serialization *ser)
> >  {
> > +     int ret;
> > +
> > +     if (down_write_killable(&luo_state_rwsem)) {
> > +             pr_warn("[prepare] event canceled by user\n");
> > +             return -EAGAIN;
> > +     }
> > +
> > +     if (!is_current_luo_state(LIVEUPDATE_STATE_NORMAL)) {
> > +             pr_warn("Can't switch to [%s] from [%s] state\n",
> > +                     luo_state_str[LIVEUPDATE_STATE_PREPARED],
> > +                     LUO_STATE_STR);
> > +             ret = -EINVAL;
> > +             goto exit_unlock;
> > +     }
> > +
> > +     ret = luo_fdt_setup(ser);
> > +     if (ret)
> > +             goto exit_unlock;
> > +
> > +     ret = luo_do_prepare_calls();
> > +     if (ret)
> > +             goto exit_unlock;
>
> With subsystems/filesystems support in place, this can fail. But since
> luo_fdt_setup() called kho_add_subtree(), the debugfs file stays around,
> and later calls to __luo_prepare() fail because the next
> kho_add_subtree() tries to create a debugfs file that already exists. So
> you would see an error like below:
>
>     [  767.339920] debugfs: File 'LUO' in directory 'sub_fdts' already present!
>     [  767.340613] luo_core: failed to prepare LUO FDT: -17
>     [  767.341071] KHO: Failed to convert KHO state tree: -17
>     [  767.341593] luo_core: Can't switch to [normal] from [normal] state
>     [  767.342175] KHO: Failed to abort KHO finalization: -22
>
> You probably need a kho_remove_subtree() that can be called from the
> error paths.
> Note that __luo_cancel() is called because failure in a KHO finalize
> notifier calls the abort notifiers.
>
> This is also something to fix, since if prepare fails, all other KHO
> users who are already serialized won't even get to abort.

Thank you for reporting this. This should not happen: if
__luo_prepare() fails, kho_abort should follow. However, KHO does not
do kho_out_update_debugfs_fdt() when kho_finalize() fails, so I added
that, which fixes the problem. I also added a selftest case for this.

>
> This weirdness happens because luo_prepare() and luo_cancel() control
> the KHO state machine, but then also get controlled by it via the
> notifier callbacks. So the relationship between them is not clear.
> __luo_prepare() at least needs access to struct kho_serialization, so it
> needs to come from the callback. So I don't have a clear way to clean
> this all up off the top of my head.

On a production machine, without KHO_DEBUGFS, only LUO can control the
KHO state. If debugfs is enabled, KHO can also be finalized manually,
in which case LUO transitions to the prepared state. In both cases, the
path is identical. The KHO debugfs path is only for development and
debugging purposes.

> >  static int __init luo_startup(void)
> >  {
> > -     __luo_set_state(LIVEUPDATE_STATE_NORMAL);
> > +     phys_addr_t fdt_phys;
> > +     int ret;
> > +
> > +     if (!kho_is_enabled()) {
> > +             if (luo_enabled)
> > +                     pr_warn("Disabling liveupdate because KHO is disabled\n");
> > +             luo_enabled = false;
> > +             return 0;
> > +     }
> > +
> > +     ret = register_kho_notifier(&luo_kho_notifier_nb);
> > +     if (ret) {
> > +             luo_enabled = false;
>
> You set luo_enabled to false here, but none of the LUO entry points like
> luo_prepare() or luo_freeze() actually check it. So LUO will appear to
> work just fine even when it hasn't initialized properly.

The luo_enabled check was missing from luo_ioctl.c; we should not
create the device if LUO is not enabled. This is fixed.

>
> > +             pr_warn("Failed to register with KHO [%d]\n", ret);
>
> I guess you don't return here so a previous liveupdate can still be
> recovered, even though we won't be able to make the next one. If so, a
> comment would be nice to point this out.

This is correct, but it is not going to work, because with the current
change I am disabling "/dev/liveupdate" iff luo_enabled == false. Let's
just return here: failing to register with KHO should not really
happen; it usually means that another notifier with the same name has
already registered.

* Re: [RFC v2 06/16] luo: luo_subsystems: add subsystem registration
  2025-05-26  7:31   ` Mike Rapoport
@ 2025-06-07 23:42     ` Pasha Tatashin
  0 siblings, 0 replies; 102+ messages in thread
From: Pasha Tatashin @ 2025-06-07 23:42 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: pratyush, jasonmiu, graf, changyuanl, dmatlack, rientjes, corbet,
	rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl, masahiroy, akpm,
	tj, yoann.congal, mmaurer, roman.gushchin, chenridong, axboe,
	mark.rutland, jannh, vincent.guittot, hannes, dan.j.williams,
	david, joel.granados, rostedt, anna.schumaker, song, zhangguopeng,
	linux, linux-kernel, linux-doc, linux-mm, gregkh, tglx, mingo, bp,
	dave.hansen, x86, hpa, rafael, dakr, bartosz.golaszewski,
	cw00.choi, myungjoo.ham, yesanishhere, Jonathan.Cameron,
	quic_zijuhu, aleksander.lobakin, ira.weiny, andriy.shevchenko,
	leon, lukas, bhelgaas, wagi, djeffery, stuart.w.hayes, ptyadav

> > +struct liveupdate_subsystem {
> > +     int (*prepare)(void *arg, u64 *data);
> > +     int (*freeze)(void *arg, u64 *data);
> > +     void (*cancel)(void *arg, u64 data);
> > +     void (*finish)(void *arg, u64 data);
>
> What is the intended use of arg in all these?

It can be used when multiple instances of the same subsystem want to
register. For example, if there is a host device driver registered
with LUO directly (i.e. devices that are not referenced through FDs),
it might use the argument to distinguish between multiple instances of
the device.
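
A minimal userspace sketch of that usage follows. The struct is a
trimmed model of the one quoted above; model_dev, its fields, and the
helper names are hypothetical and only serve the illustration. Two
instances share one prepare callback and are told apart by arg:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t u64;

/* Trimmed model of struct liveupdate_subsystem from the patch. */
struct model_subsystem {
	int (*prepare)(void *arg, u64 *data);
	const char *name;
	void *arg;
	u64 private_data;
};

/* Hypothetical per-device state owned by a driver. */
struct model_dev {
	int id;
	u64 state_handle;
};

/* One callback shared by all instances; arg selects the device. */
static int model_dev_prepare(void *arg, u64 *data)
{
	struct model_dev *dev = arg;

	*data = dev->state_handle;
	return 0;
}

/* Models how the orchestrator would invoke a registered callback. */
static int model_call_prepare(struct model_subsystem *h)
{
	assert(h->prepare != NULL);
	return h->prepare(h->arg, &h->private_data);
}
```

Each registration carries its own arg, so the same callback serializes
a different device depending on which model_subsystem it is invoked
through.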

> > +     const char *name;
> > +     void *arg;
> > +     struct list_head list;
> > +     u64 private_data;
> > +};
>
> I suggest to split callbacks into, say, liveupdate_ops so we could constify
> them.
> And then it seems that the data in liveupdate_subsystem can be private to
> LUO.

Let's keep it as is. I do not really see a big advantage: subsystems
can still globally declare and set static callbacks in struct
liveupdate_subsystem { }

>
> > +
> >  #ifdef CONFIG_LIVEUPDATE
> >
> >  /* Return true if live update orchestrator is enabled */
> > @@ -105,6 +138,10 @@ bool liveupdate_state_updated(void);
> >   */
> >  bool liveupdate_state_normal(void);
> >
> > +int liveupdate_register_subsystem(struct liveupdate_subsystem *h);
>
> int liveupdate_register_subsystem(name, ops, data) ?
>
> > +int liveupdate_unregister_subsystem(struct liveupdate_subsystem *h);
> > +int liveupdate_get_subsystem_data(struct liveupdate_subsystem *h, u64 *data);
> > +
> >  #else /* CONFIG_LIVEUPDATE */
>
> --
> Sincerely yours,
> Mike.

* Re: [RFC v2 06/16] luo: luo_subsystems: add subsystem registration
  2025-05-28 19:12   ` David Matlack
@ 2025-06-07 23:58     ` Pasha Tatashin
  0 siblings, 0 replies; 102+ messages in thread
From: Pasha Tatashin @ 2025-06-07 23:58 UTC (permalink / raw)
  To: David Matlack
  Cc: pratyush, jasonmiu, graf, changyuanl, rppt, rientjes, corbet,
	rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl, masahiroy, akpm,
	tj, yoann.congal, mmaurer, roman.gushchin, chenridong, axboe,
	mark.rutland, jannh, vincent.guittot, hannes, dan.j.williams,
	david, joel.granados, rostedt, anna.schumaker, song, zhangguopeng,
	linux, linux-kernel, linux-doc, linux-mm, gregkh, tglx, mingo, bp,
	dave.hansen, x86, hpa, rafael, dakr, bartosz.golaszewski,
	cw00.choi, myungjoo.ham, yesanishhere, Jonathan.Cameron,
	quic_zijuhu, aleksander.lobakin, ira.weiny, andriy.shevchenko,
	leon, lukas, bhelgaas, wagi, djeffery, stuart.w.hayes, ptyadav

> Suggest using guard()() and scoped_guard() throughout this series
> instead of manual lock/unlock and up/down. That will simplify the code
> and reduce the chance of silly bugs where a code path misses an
> unlock/down.

This is an interesting suggestion. I have not really considered using
guard()/scoped_guard(). I personally prefer regular lock/unlock/goto
error handling; IMO the code is more readable this way, but I may
revisit this in the future. Also, at least in mm, guards are not used,
i.e. `git grep scoped_guard mm` returns no results.
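
To make the comparison concrete, here is a userspace sketch of the two
styles. As far as I know, the kernel helpers in <linux/cleanup.h> are
built on the compiler's cleanup attribute (a GCC/Clang extension), which
is what this model uses directly. The toy_* names, the boolean "lock",
and the TOY_GUARD() macro are made up for the illustration and are not
kernel APIs:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* A toy lock standing in for a mutex; it only records whether it is held. */
struct toy_lock {
	bool held;
};

static void toy_lock_acquire(struct toy_lock *l)
{
	assert(!l->held);
	l->held = true;
}

static void toy_lock_release(struct toy_lock *l)
{
	assert(l->held);
	l->held = false;
}

/* Cleanup handler: receives a pointer to the guard variable. */
static void toy_guard_exit(struct toy_lock **l)
{
	toy_lock_release(*l);
}

/* Acquire now, release automatically when the enclosing scope ends. */
#define TOY_GUARD(l)							\
	struct toy_lock *toy_guard_					\
		__attribute__((cleanup(toy_guard_exit), unused)) =	\
		(toy_lock_acquire(l), (l))

static struct toy_lock list_lock;
static int nr_registered;

/* Manual style: every exit path must reach the unlock label. */
static int register_manual(int id)
{
	int ret = 0;

	toy_lock_acquire(&list_lock);
	if (id < 0) {
		ret = -EINVAL;
		goto out_unlock;
	}
	nr_registered++;
out_unlock:
	toy_lock_release(&list_lock);
	return ret;
}

/* Guard style: early returns release the lock without a label. */
static int register_guarded(int id)
{
	TOY_GUARD(&list_lock);

	if (id < 0)
		return -EINVAL;
	nr_registered++;
	return 0;
}
```

Both variants leave the lock released on the error path; the difference
is only whether the release is spelled out or attached to the scope.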

* Re: [RFC v2 06/16] luo: luo_subsystems: add subsystem registration
  2025-06-04 16:30   ` Pratyush Yadav
@ 2025-06-08  0:04     ` Pasha Tatashin
  0 siblings, 0 replies; 102+ messages in thread
From: Pasha Tatashin @ 2025-06-08  0:04 UTC (permalink / raw)
  To: Pratyush Yadav
  Cc: jasonmiu, graf, changyuanl, rppt, dmatlack, rientjes, corbet,
	rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl, masahiroy, akpm,
	tj, yoann.congal, mmaurer, roman.gushchin, chenridong, axboe,
	mark.rutland, jannh, vincent.guittot, hannes, dan.j.williams,
	david, joel.granados, rostedt, anna.schumaker, song, zhangguopeng,
	linux, linux-kernel, linux-doc, linux-mm, gregkh, tglx, mingo, bp,
	dave.hansen, x86, hpa, rafael, dakr, bartosz.golaszewski,
	cw00.choi, myungjoo.ham, yesanishhere, Jonathan.Cameron,
	quic_zijuhu, aleksander.lobakin, ira.weiny, andriy.shevchenko,
	leon, lukas, bhelgaas, wagi, djeffery, stuart.w.hayes

On Wed, Jun 4, 2025 at 12:30 PM Pratyush Yadav <pratyush@kernel.org> wrote:
>
> On Thu, May 15 2025, Pasha Tatashin wrote:
>
> > Introduce the framework for kernel subsystems (e.g., KVM, IOMMU, device
> > drivers) to register with LUO and participate in the live update process
> > via callbacks.
> >
> > Subsystem Registration:
> > - Defines struct liveupdate_subsystem in linux/liveupdate.h,
> >   which subsystems use to provide their name and optional callbacks
> >   (prepare, freeze, cancel, finish). The callbacks accept
> >   a u64 *data intended for passing state/handles.
> > - Exports liveupdate_register_subsystem() and
> >   liveupdate_unregister_subsystem() API functions.
> > - Adds drivers/misc/liveupdate/luo_subsystems.c to manage a list
> >   of registered subsystems.
> >   Registration/unregistration is restricted to
> >   specific LUO states (NORMAL/UPDATED).
> >
> > Callback Framework:
> > - The main luo_core.c state transition functions
> >   now delegate to new luo_do_subsystems_*_calls() functions
> >   defined in luo_subsystems.c.
> > - These new functions are intended to iterate through the registered
> >   subsystems and invoke their corresponding callbacks.
> >
> > FDT Integration:
> > - Adds a /subsystems subnode within the main LUO FDT created in
> >   luo_core.c. This node has its own compatibility string
> >   (subsystems-v1).
> > - luo_subsystems_fdt_setup() populates this node by adding a
> >   property for each registered subsystem, using the subsystem's
> >   name.
> >   Currently, these properties are initialized with a placeholder
> >   u64 value (0).
> > - luo_subsystems_startup() is called from luo_core.c on boot to
> >   find and validate the /subsystems node in the FDT received via
> >   KHO. It panics if the node is missing or incompatible.
> > - Adds a stub API function liveupdate_get_subsystem_data() intended
> >   for subsystems to retrieve their persisted u64 data from the FDT
> > >   in the new kernel.
> >
> > Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
> [...]
> > +/**
> > + * liveupdate_unregister_subsystem - Unregister a kernel subsystem handler from
> > + * LUO
> > + * @h: Pointer to the same liveupdate_subsystem structure that was used during
> > + * registration.
> > + *
> > + * Unregisters a previously registered subsystem handler. Typically called
> > + * during module exit or subsystem teardown. LUO removes the structure from its
> > + * internal list; the caller is responsible for any necessary memory cleanup
> > + * of the structure itself.
> > + *
> > + * Return: 0 on success, negative error code otherwise.
> > + * -EINVAL if h is NULL.
> > + * -ENOENT if the specified handler @h is not found in the registration list.
> > + * -EBUSY if LUO is not in the NORMAL or UPDATED state.
> > + */
> > +int liveupdate_unregister_subsystem(struct liveupdate_subsystem *h)
> > +{
> > +     struct liveupdate_subsystem *iter;
> > +     bool found = false;
> > +     int ret = 0;
> > +
> > +     luo_state_read_enter();
> > +     if (!liveupdate_state_normal() && !liveupdate_state_updated()) {
> > +             luo_state_read_exit();
> > +             return -EBUSY;
> > +     }
> > +
> > +     mutex_lock(&luo_subsystem_list_mutex);
> > +     list_for_each_entry(iter, &luo_subsystems_list, list) {
> > +             if (iter == h) {
> > +                     found = true;
>
> Nit: you don't actually need the found variable. You can do the same
> check that list_for_each_entry() uses, which is to call
> list_entry_is_head().

True, but for readability, 'found' makes more sense here. I do not
like using the iterator outside of the loop, and also
if (list_entry_is_head(iter, &luo_subsystems_list, list)) {} is harder
to understand and would require a comment, unlike a simple
if (found) {}
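
To make the comparison concrete, here is a small userspace sketch of
both variants. It is only an illustration under stated assumptions: the
list_head, container_of() and list_add_tail() definitions below are
simplified stand-ins for the kernel ones, struct subsys and the
is_registered_*() helpers are hypothetical, and the loop walks raw
list_head pointers instead of using list_for_each_entry().

```c
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-ins for the kernel's circular doubly linked list. */
struct list_head {
	struct list_head *next, *prev;
};

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical registered object, analogous to liveupdate_subsystem. */
struct subsys {
	const char *name;
	struct list_head list;
};

static void list_add_tail(struct list_head *new, struct list_head *head)
{
	new->prev = head->prev;
	new->next = head;
	head->prev->next = new;
	head->prev = new;
}

/* Variant 1: an explicit flag, the cursor is never used after the loop. */
static bool is_registered_flag(struct list_head *head, struct subsys *h)
{
	struct list_head *pos;
	bool found = false;

	for (pos = head->next; pos != head; pos = pos->next) {
		if (container_of(pos, struct subsys, list) == h) {
			found = true;
			break;
		}
	}
	return found;
}

/* Variant 2: no flag, inspect the cursor after the loop instead. */
static bool is_registered_iter(struct list_head *head, struct subsys *h)
{
	struct list_head *pos;

	for (pos = head->next; pos != head; pos = pos->next) {
		if (container_of(pos, struct subsys, list) == h)
			break;
	}
	/* The cursor is back at the head only if nothing matched. */
	return pos != head;
}
```

The second variant saves one variable, but the final comparison only
makes sense to a reader who knows how the loop terminates, which is the
readability cost mentioned above.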

>
> > +                     break;
> > +             }
> > +     }
> > +
> > +     if (found) {
> > +             list_del_init(&h->list);
> > +     } else {
> > +             pr_warn("Subsystem handler '%s' not found for unregistration.\n",
> > +                     h->name);
> > +             ret = -ENOENT;
> > +     }
> > +
> > +     mutex_unlock(&luo_subsystem_list_mutex);
> > +     luo_state_read_exit();
> > +
> > +     return ret;
> > +}
> > +EXPORT_SYMBOL_GPL(liveupdate_unregister_subsystem);
> [...]
>
> --
> Regards,
> Pratyush Yadav

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [RFC v2 08/16] luo: luo_files: add infrastructure for FDs
  2025-05-26  7:55   ` Mike Rapoport
  2025-06-05 11:56     ` Pratyush Yadav
@ 2025-06-08 13:13     ` Pasha Tatashin
  1 sibling, 0 replies; 102+ messages in thread
From: Pasha Tatashin @ 2025-06-08 13:13 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: pratyush, jasonmiu, graf, changyuanl, dmatlack, rientjes, corbet,
	rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl, masahiroy, akpm,
	tj, yoann.congal, mmaurer, roman.gushchin, chenridong, axboe,
	mark.rutland, jannh, vincent.guittot, hannes, dan.j.williams,
	david, joel.granados, rostedt, anna.schumaker, song, zhangguopeng,
	linux, linux-kernel, linux-doc, linux-mm, gregkh, tglx, mingo, bp,
	dave.hansen, x86, hpa, rafael, dakr, bartosz.golaszewski,
	cw00.choi, myungjoo.ham, yesanishhere, Jonathan.Cameron,
	quic_zijuhu, aleksander.lobakin, ira.weiny, andriy.shevchenko,
	leon, lukas, bhelgaas, wagi, djeffery, stuart.w.hayes, ptyadav

On Mon, May 26, 2025 at 3:55 AM Mike Rapoport <rppt@kernel.org> wrote:
>
> On Thu, May 15, 2025 at 06:23:12PM +0000, Pasha Tatashin wrote:
> > Introduce the framework within LUO to support preserving specific types
> > of file descriptors across a live update transition. This allows
> > stateful FDs (like memfds or vfio FDs used by VMs) to be recreated in
> > the new kernel.
> >
> > Note: The core logic for iterating through the luo_files_list and
> > invoking the handler callbacks (prepare, freeze, cancel, finish)
> > within luo_do_files_*_calls, as well as managing the u64 data
> > persistence via the FDT for individual files, is currently implemented
> > as stubs in this patch. This patch sets up the registration, FDT layout,
> > and retrieval framework.
> >
> > Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
> > ---
> >  drivers/misc/liveupdate/Makefile       |   1 +
> >  drivers/misc/liveupdate/luo_core.c     |  19 +
> >  drivers/misc/liveupdate/luo_files.c    | 563 +++++++++++++++++++++++++
> >  drivers/misc/liveupdate/luo_internal.h |  11 +
> >  include/linux/liveupdate.h             |  62 +++
> >  5 files changed, 656 insertions(+)
> >  create mode 100644 drivers/misc/liveupdate/luo_files.c
> >
> > diff --git a/drivers/misc/liveupdate/Makefile b/drivers/misc/liveupdate/Makefile
> > index df1c9709ba4f..b4cdd162574f 100644
> > --- a/drivers/misc/liveupdate/Makefile
> > +++ b/drivers/misc/liveupdate/Makefile
> > @@ -1,3 +1,4 @@
> >  # SPDX-License-Identifier: GPL-2.0
> >  obj-y                                        += luo_core.o
> > +obj-y                                        += luo_files.o
> >  obj-y                                        += luo_subsystems.o
> > diff --git a/drivers/misc/liveupdate/luo_core.c b/drivers/misc/liveupdate/luo_core.c
> > index 417e7f6bf36c..ab1d76221fe2 100644
> > --- a/drivers/misc/liveupdate/luo_core.c
> > +++ b/drivers/misc/liveupdate/luo_core.c
> > @@ -110,6 +110,10 @@ static int luo_fdt_setup(struct kho_serialization *ser)
> >       if (ret)
> >               goto exit_free;
> >
> > +     ret = luo_files_fdt_setup(fdt_out);
> > +     if (ret)
> > +             goto exit_free;
> > +
> >       ret = luo_subsystems_fdt_setup(fdt_out);
> >       if (ret)
> >               goto exit_free;
>
> The duplication of files and subsystems does not look nice here and below.
> Can't we make files to be a subsystem?

Good idea, let me work on this.

>
> > @@ -145,7 +149,13 @@ static int luo_do_prepare_calls(void)
> >  {
> >       int ret;
> >
> > +     ret = luo_do_files_prepare_calls();
> > +     if (ret)
> > +             return ret;
> > +
> >       ret = luo_do_subsystems_prepare_calls();
> > +     if (ret)
> > +             luo_do_files_cancel_calls();
> >
> >       return ret;
> >  }
> > @@ -154,18 +164,26 @@ static int luo_do_freeze_calls(void)
> >  {
> >       int ret;
> >
> > +     ret = luo_do_files_freeze_calls();
> > +     if (ret)
> > +             return ret;
> > +
> >       ret = luo_do_subsystems_freeze_calls();
> > +     if (ret)
> > +             luo_do_files_cancel_calls();
> >
> >       return ret;
> >  }
> >
> >  static void luo_do_finish_calls(void)
> >  {
> > +     luo_do_files_finish_calls();
> >       luo_do_subsystems_finish_calls();
> >  }
> >
> >  static void luo_do_cancel_calls(void)
> >  {
> > +     luo_do_files_cancel_calls();
> >       luo_do_subsystems_cancel_calls();
> >  }
> >
> > @@ -436,6 +454,7 @@ static int __init luo_startup(void)
> >       }
> >
> >       __luo_set_state(LIVEUPDATE_STATE_UPDATED);
> > +     luo_files_startup(luo_fdt_in);
> >       luo_subsystems_startup(luo_fdt_in);
> >
> >       return 0;
> > diff --git a/drivers/misc/liveupdate/luo_files.c b/drivers/misc/liveupdate/luo_files.c
> > new file mode 100644
> > index 000000000000..953fc40db3d7
> > --- /dev/null
> > +++ b/drivers/misc/liveupdate/luo_files.c
> > @@ -0,0 +1,563 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +
> > +/*
> > + * Copyright (c) 2025, Google LLC.
> > + * Pasha Tatashin <pasha.tatashin@soleen.com>
> > + */
> > +
> > +/**
> > + * DOC: LUO file descriptors
> > + *
> > + * LUO provides the infrastructure necessary to preserve
> > + * specific types of stateful file descriptors across a kernel live
> > + * update transition. The primary goal is to allow workloads, such as virtual
> > + * machines using vfio, memfd, or iommufd to retain access to their essential
> > + * resources without interruption after the underlying kernel is updated.
> > + *
> > + * The framework operates based on handler registration and instance tracking:
> > + *
> > + * 1. Handler Registration: Kernel modules responsible for specific file
> > + * types (e.g., memfd, vfio) register a &struct liveupdate_filesystem
> > + * handler. This handler contains callbacks (&liveupdate_filesystem.prepare,
> > + * &liveupdate_filesystem.freeze, &liveupdate_filesystem.finish, etc.)
> > + * and a unique 'compatible' string identifying the file type.
> > + * Registration occurs via liveupdate_register_filesystem().
>
> I wouldn't use filesystem here, as the obvious users are not really
> filesystems. Maybe liveupdate_register_file_ops?

This corresponds to the way these structs are named in Linux, so I
think the name is OK.

>
> > + *
> > + * 2. File Instance Tracking: When a potentially preservable file needs to be
> > + * managed for live update, the core LUO logic (luo_register_file()) finds a
> > + * compatible registered handler using its &liveupdate_filesystem.can_preserve
> > + * callback. If found, an internal &struct luo_file instance is created,
> > + * assigned a unique u64 'token', and added to a list.
> > + *
> > + * 3. State Persistence (FDT): During the LUO prepare/freeze phases, the
> > + * registered handler callbacks are invoked for each tracked file instance.
> > + * These callbacks can generate a u64 data payload representing the minimal
> > + * state needed for restoration. This payload, along with the handler's
> > + * compatible string and the unique token, is stored in a dedicated
> > + * '/file-descriptors' node within the main LUO FDT blob passed via
> > + * Kexec Handover (KHO).
> > + *
> > + * 4. Restoration: In the new kernel, the LUO framework parses the incoming
> > + * FDT to reconstruct the list of &struct luo_file instances. When the
> > + * original owner requests the file, luo_retrieve_file() uses the corresponding
> > + * handler's &liveupdate_filesystem.retrieve callback, passing the persisted
> > + * u64 data, to recreate or find the appropriate &struct file object.
> > + */
>
> The DOC is mostly about what luo_files does, we'd also need a description
> of its intended use, both internally in the kernel and by the userspace.
>
> > +
> > +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
> > +
>
> ...
>
> > +/**
> > + * luo_register_file - Register a file descriptor for live update management.
> > + * @tokenp: Return argument for the token value.
> > + * @file: Pointer to the struct file to be preserved.
> > + *
> > + * Context: Must be called when LUO is in 'normal' state.
> > + *
> > + * Return: 0 on success. Negative errno on failure.
> > + */
> > +int luo_register_file(u64 *tokenp, struct file *file)
> > +{
> > +     struct liveupdate_filesystem *fs;
> > +     bool found = false;
> > +     int ret = -ENOENT;
> > +     u64 token;
> > +
> > +     luo_state_read_enter();
> > +     if (!liveupdate_state_normal() && !liveupdate_state_updated()) {
> > +             pr_warn("File can be registered only in normal or updated state\n");
> > +             luo_state_read_exit();
> > +             return -EBUSY;
> > +     }
> > +
> > +     down_read(&luo_filesystems_list_rwsem);
> > +     list_for_each_entry(fs, &luo_filesystems_list, list) {
> > +             if (fs->can_preserve(file, fs->arg)) {
> > +                     found = true;
> > +                     break;
> > +             }
> > +     }
> > +
> > +     if (found) {
>
>         if (!found)
>                 goto exit_unlock;

Done, thank you.


> > + * struct liveupdate_filesystem - Represents a handler for a live-updatable
> > + * filesystem/file type.
> > + * @prepare:       Optional. Saves state for a specific file instance (@file,
> > + *                 @arg) before update, potentially returning value via @data.
> > + *                 Returns 0 on success, negative errno on failure.
> > + * @freeze:        Optional. Performs final actions just before kernel
> > + *                 transition, potentially reading/updating the handle via
> > + *                 @data.
> > + *                 Returns 0 on success, negative errno on failure.
> > + * @cancel:        Optional. Cleans up state/resources if update is aborted
> > + *                 after prepare/freeze succeeded, using the @data handle (by
> > + *                 value) from the successful prepare. Returns void.
> > + * @finish:        Optional. Performs final cleanup in the new kernel using the
> > + *                 preserved @data handle (by value). Returns void.
> > + * @retrieve:      Retrieve the preserved file. Must be called before finish.
> > + * @can_preserve:  callback to determine if @file with associated context (@arg)
> > + *                 can be preserved by this handler.
> > + *                 Return bool (true if preservable, false otherwise).
> > + * @compatible:    The compatibility string (e.g., "memfd-v1", "vfiofd-v1")
> > + *                 that uniquely identifies the filesystem or file type this
> > + *                 handler supports. This is matched against the compatible
> > + *                 string associated with individual &struct liveupdate_file
> > + *                 instances.
> > + * @arg:           An opaque pointer to implementation-specific context data
> > + *                 associated with this filesystem handler registration.
> > + * @list:          used for linking this handler instance into a global list of
> > + *                 registered filesystem handlers.
> > + *
> > + * Modules that want to support live update for specific file types should
> > + * register an instance of this structure. LUO uses this registration to
> > + * determine if a given file can be preserved and to find the appropriate
> > + * operations to manage its state across the update.
> > + */
> > +struct liveupdate_filesystem {
> > +     int (*prepare)(struct file *file, void *arg, u64 *data);
> > +     int (*freeze)(struct file *file, void *arg, u64 *data);
> > +     void (*cancel)(struct file *file, void *arg, u64 data);
> > +     void (*finish)(struct file *file, void *arg, u64 data, bool reclaimed);
> > +     int (*retrieve)(void *arg, u64 data, struct file **file);
> > +     bool (*can_preserve)(struct file *file, void *arg);
> > +     const char *compatible;
> > +     void *arg;
> > +     struct list_head list;
> > +};
> > +
>
> Like with subsystems, I'd split ops and make the data part private to
> luo_files.c

For simplicity, I would like to keep them together, the same as in subsystems.


>
> >  /**
> >   * struct liveupdate_subsystem - Represents a subsystem participating in LUO
> >   * @prepare:      Optional. Called during LUO prepare phase. Should perform
> > @@ -142,6 +191,9 @@ int liveupdate_register_subsystem(struct liveupdate_subsystem *h);
> >  int liveupdate_unregister_subsystem(struct liveupdate_subsystem *h);
> >  int liveupdate_get_subsystem_data(struct liveupdate_subsystem *h, u64 *data);
> >
> > +int liveupdate_register_filesystem(struct liveupdate_filesystem *h);
> > +int liveupdate_unregister_filesystem(struct liveupdate_filesystem *h);
>
> int liveupdate_register_file_ops(name, ops, data, ret_token) ?
>
> > +
> >  #else /* CONFIG_LIVEUPDATE */
> >
> >  static inline int liveupdate_reboot(void)
> > @@ -180,5 +232,15 @@ static inline int liveupdate_get_subsystem_data(struct liveupdate_subsystem *h,
> >       return -ENODATA;
> >  }
> >
> > +static inline int liveupdate_register_filesystem(struct liveupdate_filesystem *h)
> > +{
> > +     return 0;
> > +}
> > +
> > +static inline int liveupdate_unregister_filesystem(struct liveupdate_filesystem *h)
> > +{
> > +     return 0;
> > +}
> > +
> >  #endif /* CONFIG_LIVEUPDATE */
> >  #endif /* _LINUX_LIVEUPDATE_H */
> > --
> > 2.49.0.1101.gccaa498523-goog
> >
> >
>
> --
> Sincerely yours,
> Mike.

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [RFC v2 08/16] luo: luo_files: add infrastructure for FDs
  2025-06-05 15:56   ` Pratyush Yadav
@ 2025-06-08 13:37     ` Pasha Tatashin
  2025-06-13 15:27       ` Pratyush Yadav
  0 siblings, 1 reply; 102+ messages in thread
From: Pasha Tatashin @ 2025-06-08 13:37 UTC (permalink / raw)
  To: Pratyush Yadav
  Cc: jasonmiu, graf, changyuanl, rppt, dmatlack, rientjes, corbet,
	rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl, masahiroy, akpm,
	tj, yoann.congal, mmaurer, roman.gushchin, chenridong, axboe,
	mark.rutland, jannh, vincent.guittot, hannes, dan.j.williams,
	david, joel.granados, rostedt, anna.schumaker, song, zhangguopeng,
	linux, linux-kernel, linux-doc, linux-mm, gregkh, tglx, mingo, bp,
	dave.hansen, x86, hpa, rafael, dakr, bartosz.golaszewski,
	cw00.choi, myungjoo.ham, yesanishhere, Jonathan.Cameron,
	quic_zijuhu, aleksander.lobakin, ira.weiny, andriy.shevchenko,
	leon, lukas, bhelgaas, wagi, djeffery, stuart.w.hayes

> > +
> > +/**
> > + * luo_files_startup - Validates the LUO file-descriptors FDT node at startup.
> > + * @fdt: Pointer to the LUO FDT blob passed from the previous kernel.
> > + *
> > + * This __init function checks the existence and validity of the
> > + * '/file-descriptors' node in the FDT. This node is considered mandatory. It
>
> Why is it mandatory? Can't a user just preserve some subsystems, and no
> FDs?

Yes, that is legal; in that case this node is going to be empty.

>
> > + * calls panic() if the node is missing, inaccessible, or invalid (e.g., missing
> > + * compatible, wrong compatible string), indicating a critical configuration
> > + * error for LUO.
> > + */
> > +void __init luo_files_startup(void *fdt)
> > +{
> > +     int ret, node_offset;
> > +
> > +     node_offset = fdt_subnode_offset(fdt, 0, LUO_FILES_NODE_NAME);
> > +     if (node_offset < 0)
> > +             panic("Failed to find /file-descriptors node\n");
> > +
> > +     ret = fdt_node_check_compatible(fdt, node_offset,
> > +                                     LUO_FILES_COMPATIBLE);
> > +     if (ret) {
> > +             panic("FDT '%s' is incompatible with '%s' [%d]\n",
> > +                   LUO_FILES_NODE_NAME, LUO_FILES_COMPATIBLE, ret);
> > +     }
> > +     luo_fdt_in = fdt;
> > +}
> > +
> > +static void luo_files_recreate_luo_files_xa_in(void)
> > +{
> > +     int parent_node_offset, file_node_offset;
> > +     const char *node_name, *fdt_compat_str;
> > +     struct liveupdate_filesystem *fs;
> > +     struct luo_file *luo_file;
> > +     const void *data_ptr;
> > +     int ret = 0;
> > +
> > +     if (luo_files_xa_in_recreated || !luo_fdt_in)
> > +             return;
> > +
> > +     /* Take write in order to gurantee that we re-create list once */
>
> Typo: s/gurantee/guarantee

Done, thanks.

>
> > +     down_write(&luo_filesystems_list_rwsem);
> > +     if (luo_files_xa_in_recreated)
> > +             goto exit_unlock;
> > +
> > +     parent_node_offset = fdt_subnode_offset(luo_fdt_in, 0,
> > +                                             LUO_FILES_NODE_NAME);
> > +
> > +     fdt_for_each_subnode(file_node_offset, luo_fdt_in, parent_node_offset) {
> > +             bool handler_found = false;
> > +             u64 token;
> > +
> > +             node_name = fdt_get_name(luo_fdt_in, file_node_offset, NULL);
> > +             if (!node_name) {
> > +                     panic("Skipping FDT subnode at offset %d: Cannot get name\n",
> > +                           file_node_offset);
>
> Should failure to parse a specific FD really be a panic? Wouldn't it be
> better to continue and let userspace decide if it can live with the FD
> missing?

This is not safe: the memory might be in use for DMA or owned by a
sensitive process, and if we proceed with the liveupdate reboot without
properly handling that memory, we can get corruption and memory leaks.
Therefore, if there are exceptions during liveupdate boot, we should
panic.

> > +             }
> > +
> > +             ret = kstrtou64(node_name, 0, &token);
> > +             if (ret < 0) {
> > +                     panic("Skipping FDT node '%s': Failed to parse token\n",
> > +                           node_name);
> > +             }
> > +
> > +             fdt_compat_str = fdt_getprop(luo_fdt_in, file_node_offset,
> > +                                          "compatible", NULL);
> > +             if (!fdt_compat_str) {
> > +                     panic("Skipping FDT node '%s': Missing 'compatible' property\n",
> > +                           node_name);
> > +             }
> > +
> > +             data_ptr = fdt_getprop(luo_fdt_in, file_node_offset, "data",
> > +                                    NULL);
> > +             if (!data_ptr) {
> > +                     panic("Can't recover property 'data' for FDT node '%s'\n",
> > +                           node_name);
> > +             }
> > +
> > +             list_for_each_entry(fs, &luo_filesystems_list, list) {
> > +                     if (!strcmp(fs->compatible, fdt_compat_str)) {
> > +                             handler_found = true;
> > +                             break;
> > +                     }
> > +             }
> > +
> > +             if (!handler_found) {
> > +                     panic("Skipping FDT node '%s': No registered handler for compatible '%s'\n",
> > +                           node_name, fdt_compat_str);
>
> Thinking out loud here: this means that by the time of first retrieval,
> all file systems must be registered. Since this is called from
> luo_do_files_finish_calls() or luo_retrieve_file(), it will come from
> userspace, so all built in modules would be initialized by then. But
> some loadable module might not be. I don't see much of a use case for
> loadable modules to participate in LUO, so I don't think it should be a
> problem.

Yes, in practice I am against supporting liveupdate for loadable
modules for FDs and devices; however, if userspace decides to use
them, it has to be very careful about when the data is retrieved and
when the modules are loaded.

> > +             }
> > +
> > +             luo_file = kmalloc(sizeof(*luo_file),
> > +                                GFP_KERNEL | __GFP_NOFAIL);
> > +             luo_file->fs = fs;
> > +             luo_file->file = NULL;
> > +             memcpy(&luo_file->private_data, data_ptr, sizeof(u64));
>
> Why not make sure data_ptr is exactly sizeof(u64) when we parse it, and
> then simply do luo_file->private_data = (u64)*data_ptr ?

Because FDT alignment is 4 bytes, we can't simply assign it.
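
A minimal userspace sketch of the two points combined (copying because
of the 4-byte alignment, and rejecting a property whose length is not
exactly sizeof(u64)) could look like this. read_prop_u64() is a
hypothetical helper, not part of the series; in the kernel the length
would come from the lenp argument of fdt_getprop(), which the quoted
code currently passes as NULL.

```c
#include <errno.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper: read a u64 payload from a property buffer that
 * is only guaranteed to be 4-byte aligned. Dereferencing the buffer
 * as a u64 pointer could be a misaligned access, so copy instead.
 */
static int read_prop_u64(const void *prop, int len, uint64_t *out)
{
	/* Anything other than exactly 8 bytes means the FDT is broken. */
	if (!prop || len != (int)sizeof(*out))
		return -EINVAL;

	memcpy(out, prop, sizeof(*out));
	return 0;
}
```

With a helper along these lines the caller keeps the memcpy() and still
catches a previous kernel that wrote more or less than a u64.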

> Because if the previous kernel wrote more than a u64 in data, then
> something is broken and we should catch that error anyway.
>
> > +             luo_file->reclaimed = false;
> > +             mutex_init(&luo_file->mutex);
> > +             luo_file->state = LIVEUPDATE_STATE_UPDATED;
> > +             ret = xa_err(xa_store(&luo_files_xa_in, token, luo_file,
> > +                                   GFP_KERNEL | __GFP_NOFAIL));
>
> Should you also check if something is already at token's slot, in case
> previous kernel generated wrong tokens or FDT is broken?

Good idea, added.
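
The check can be illustrated with a small userspace sketch. This is an
assumption about the shape of the fix, not the code that was added:
the fixed-size table and token_insert() below are hypothetical. In the
kernel, xa_insert() provides the same behaviour for an XArray by
returning -EBUSY when the index is already occupied, whereas
xa_store() overwrites the existing entry.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_FILES 16

/* Hypothetical token table standing in for luo_files_xa_in. */
struct slot {
	uint64_t token;
	void *entry;
};

static struct slot table[MAX_FILES];
static size_t nr_slots;

/* Insert an entry, refusing to overwrite an already used token. */
static int token_insert(uint64_t token, void *entry)
{
	size_t i;

	for (i = 0; i < nr_slots; i++) {
		if (table[i].token == token)
			return -EBUSY;	/* duplicate token in the FDT */
	}
	if (nr_slots == MAX_FILES)
		return -ENOMEM;

	table[nr_slots].token = token;
	table[nr_slots].entry = entry;
	nr_slots++;
	return 0;
}
```

A duplicate token then shows up as an error at parse time instead of
silently replacing the earlier file.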

>
> > +             if (ret < 0) {
> > +                     panic("Failed to store luo_file for token %llu in XArray: %d\n",
> > +                           token, ret);
> > +             }
> > +     }
> > +     luo_files_xa_in_recreated = true;
> > +
> > +exit_unlock:
> > +     up_write(&luo_filesystems_list_rwsem);
> > +}
> > +
> [...]
> > diff --git a/include/linux/liveupdate.h b/include/linux/liveupdate.h
> > index 7a130680b5f2..7afe0aac5ce4 100644
> > --- a/include/linux/liveupdate.h
> > +++ b/include/linux/liveupdate.h
> > @@ -86,6 +86,55 @@ enum liveupdate_state  {
> >       LIVEUPDATE_STATE_UPDATED = 3,
> >  };
> >
> > +/* Forward declaration needed if definition isn't included */
> > +struct file;
> > +
> > +/**
> > + * struct liveupdate_filesystem - Represents a handler for a live-updatable
> > + * filesystem/file type.
> > + * @prepare:       Optional. Saves state for a specific file instance (@file,
> > + *                 @arg) before update, potentially returning value via @data.
> > + *                 Returns 0 on success, negative errno on failure.
> > + * @freeze:        Optional. Performs final actions just before kernel
> > + *                 transition, potentially reading/updating the handle via
> > + *                 @data.
> > + *                 Returns 0 on success, negative errno on failure.
> > + * @cancel:        Optional. Cleans up state/resources if update is aborted
> > + *                 after prepare/freeze succeeded, using the @data handle (by
> > + *                 value) from the successful prepare. Returns void.
> > + * @finish:        Optional. Performs final cleanup in the new kernel using the
> > + *                 preserved @data handle (by value). Returns void.
> > + * @retrieve:      Retrieve the preserved file. Must be called before finish.
> > + * @can_preserve:  callback to determine if @file with associated context (@arg)
> > + *                 can be preserved by this handler.
> > + *                 Return bool (true if preservable, false otherwise).
> > + * @compatible:    The compatibility string (e.g., "memfd-v1", "vfiofd-v1")
> > + *                 that uniquely identifies the filesystem or file type this
> > + *                 handler supports. This is matched against the compatible
> > + *                 string associated with individual &struct liveupdate_file
> > + *                 instances.
> > + * @arg:           An opaque pointer to implementation-specific context data
> > + *                 associated with this filesystem handler registration.
> > + * @list:          used for linking this handler instance into a global list of
> > + *                 registered filesystem handlers.
> > + *
> > + * Modules that want to support live update for specific file types should
> > + * register an instance of this structure. LUO uses this registration to
> > + * determine if a given file can be preserved and to find the appropriate
> > + * operations to manage its state across the update.
> > + */
> > +struct liveupdate_filesystem {
> > +     int (*prepare)(struct file *file, void *arg, u64 *data);
> > +     int (*freeze)(struct file *file, void *arg, u64 *data);
> > +     void (*cancel)(struct file *file, void *arg, u64 data);
> > +     void (*finish)(struct file *file, void *arg, u64 data, bool reclaimed);
> > +     int (*retrieve)(void *arg, u64 data, struct file **file);
> > +     bool (*can_preserve)(struct file *file, void *arg);
> > +     const char *compatible;
> > +     void *arg;
>
> What is the use for this arg? I would expect one file type/system to
> register one set of handlers. So they can keep their arg in a global in
> their code. I don't see why a per-filesystem arg is needed.

I think arg is useful in case we support a subsystem that is registered
multiple times with some differences, e.g. based on mount point or
file type handling. Let's keep it for now; if needed, we can
remove it in a future revision.

> What I do think is needed is a per-file arg. Each callback gets 'data',
> which is the serialized data, but there is no place to store runtime
> state, like some flags or serialization metadata. Sure, you could make
> place for it somewhere in the inode, but I think it would be a lot
> cleaner to be able to store it in struct luo_file.
>
> So perhaps rename private_data in struct luo_file to say
> serialized_data, and have a field called "private" that filesystems can
> use for their runtime state?

I am not against this, but let's make this change when it is actually
needed by a registered filesystem.

Thanks,
Pasha

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [RFC v2 09/16] luo: luo_files: implement file systems callbacks
  2025-06-05 16:03   ` Pratyush Yadav
@ 2025-06-08 13:49     ` Pasha Tatashin
  2025-06-13 15:18       ` Pratyush Yadav
  0 siblings, 1 reply; 102+ messages in thread
From: Pasha Tatashin @ 2025-06-08 13:49 UTC (permalink / raw)
  To: Pratyush Yadav
  Cc: jasonmiu, graf, changyuanl, rppt, dmatlack, rientjes, corbet,
	rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl, masahiroy, akpm,
	tj, yoann.congal, mmaurer, roman.gushchin, chenridong, axboe,
	mark.rutland, jannh, vincent.guittot, hannes, dan.j.williams,
	david, joel.granados, rostedt, anna.schumaker, song, zhangguopeng,
	linux, linux-kernel, linux-doc, linux-mm, gregkh, tglx, mingo, bp,
	dave.hansen, x86, hpa, rafael, dakr, bartosz.golaszewski,
	cw00.choi, myungjoo.ham, yesanishhere, Jonathan.Cameron,
	quic_zijuhu, aleksander.lobakin, ira.weiny, andriy.shevchenko,
	leon, lukas, bhelgaas, wagi, djeffery, stuart.w.hayes

On Thu, Jun 5, 2025 at 12:04 PM Pratyush Yadav <pratyush@kernel.org> wrote:
>
> On Thu, May 15 2025, Pasha Tatashin wrote:
>
> > Implements the core logic within luo_files.c to invoke the prepare,
> > reboot, finish, and cancel callbacks for preserved file instances,
> > replacing the previous stub implementations. It also handles
> > the persistence and retrieval of the u64 data payload associated with
> > each file via the LUO FDT.
> >
> > This completes the core mechanism enabling registered filesystem
> > handlers to actively manage file state across the live update
> > transition using the LUO framework.
> >
> > Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
> > ---
> >  drivers/misc/liveupdate/luo_files.c | 105 +++++++++++++++++++++++++++-
> >  1 file changed, 103 insertions(+), 2 deletions(-)
> >
> [...]
> > @@ -305,7 +369,29 @@ int luo_do_files_prepare_calls(void)
> >   */
> >  int luo_do_files_freeze_calls(void)
> >  {
> > -     return 0;
> > +     unsigned long token;
> > +     struct luo_file *h;
> > +     int ret;
> > +
> > +     xa_for_each(&luo_files_xa_out, token, h) {
>
> Should we also ensure at this point that there are no open handles to
> this file? How else would a file system ensure the file is in quiescent
> state to do its final serialization?

Do you mean checking the refcnt here? If so, this is a good idea, but
first we need to implement the lifecycle of the liveupdate agent
correctly, where the owner of the FD must survive through entering
reboot() with /dev/liveupdate still open.

> This conflicts with my suggestion to have freeze callbacks never fail,
> but now that I think of it, this is also important, so maybe we have to
> live with freeze that can fail.
>
> > +             if (h->fs->freeze) {
> > +                     ret = h->fs->freeze(h->file, h->fs->arg,
> > +                                         &h->private_data);
> > +                     if (ret < 0) {
> > +                             pr_err("Freeze callback failed for file token %#0llx handler '%s' [%d]\n",
> > +                                    (u64)token, h->fs->compatible, ret);
> > +                             __luo_do_files_cancel_calls(h);
> > +
> > +                             return ret;
> > +                     }
> > +             }
> > +     }
> > +
> > +     ret = luo_files_commit_data_to_fdt();
> > +     if (ret)
> > +             __luo_do_files_cancel_calls(NULL);
> > +
> > +     return ret;
> >  }
> >
> >  /**
> > @@ -316,7 +402,20 @@ int luo_do_files_freeze_calls(void)
> >   */
> >  void luo_do_files_finish_calls(void)
> >  {
> > +     unsigned long token;
> > +     struct luo_file *h;
> > +
> >       luo_files_recreate_luo_files_xa_in();
> > +     xa_for_each(&luo_files_xa_in, token, h) {
> > +             mutex_lock(&h->mutex);
> > +             if (h->state == LIVEUPDATE_STATE_UPDATED && h->fs->finish) {
> > +                     h->fs->finish(h->file, h->fs->arg,
> > +                                   h->private_data,
> > +                                   h->reclaimed);
> > +                     h->state = LIVEUPDATE_STATE_NORMAL;
> > +             }
> > +             mutex_unlock(&h->mutex);
> > +     }
>
> We can also clean up luo_files_xa_in at this point, right?

Yes, we can.

Thank you,
Pasha

>
> >  }
> >
> >  /**
> > @@ -330,6 +429,8 @@ void luo_do_files_finish_calls(void)
> >   */
> >  void luo_do_files_cancel_calls(void)
> >  {
> > +     __luo_do_files_cancel_calls(NULL);
> > +     luo_files_commit_data_to_fdt();
> >  }
> >
> >  /**
>
> --
> Regards,
> Pratyush Yadav

* Re: [RFC v2 10/16] luo: luo_ioctl: add ioctl interface
  2025-05-26  8:42   ` Mike Rapoport
@ 2025-06-08 15:08     ` Pasha Tatashin
  0 siblings, 0 replies; 102+ messages in thread
From: Pasha Tatashin @ 2025-06-08 15:08 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: pratyush, jasonmiu, graf, changyuanl, dmatlack, rientjes, corbet,
	rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl, masahiroy, akpm,
	tj, yoann.congal, mmaurer, roman.gushchin, chenridong, axboe,
	mark.rutland, jannh, vincent.guittot, hannes, dan.j.williams,
	david, joel.granados, rostedt, anna.schumaker, song, zhangguopeng,
	linux, linux-kernel, linux-doc, linux-mm, gregkh, tglx, mingo, bp,
	dave.hansen, x86, hpa, rafael, dakr, bartosz.golaszewski,
	cw00.choi, myungjoo.ham, yesanishhere, Jonathan.Cameron,
	quic_zijuhu, aleksander.lobakin, ira.weiny, andriy.shevchenko,
	leon, lukas, bhelgaas, wagi, djeffery, stuart.w.hayes, ptyadav

On Mon, May 26, 2025 at 4:43 AM Mike Rapoport <rppt@kernel.org> wrote:
>
> On Thu, May 15, 2025 at 06:23:14PM +0000, Pasha Tatashin wrote:
> > Introduce the user-space interface for the Live Update Orchestrator
> > via ioctl commands, enabling external control over the live update
> > process and management of preserved resources.
> >
> > Create a misc character device at /dev/liveupdate. Access
> > to this device requires the CAP_SYS_ADMIN capability.
> >
> > A new UAPI header, <uapi/linux/liveupdate.h>, defines the necessary
> > structures. The magic number is registered in
> > Documentation/userspace-api/ioctl/ioctl-number.rst.
> >
> > Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
>
> ...
>
> > -/**
> > - * enum liveupdate_state - Defines the possible states of the live update
> > - * orchestrator.
> > - * @LIVEUPDATE_STATE_NORMAL:         Default state, no live update in progress.
> > - * @LIVEUPDATE_STATE_PREPARED:       Live update is prepared for reboot; the
> > - *                                   LIVEUPDATE_PREPARE callbacks have completed
> > - *                                   successfully.
> > - *                                   Devices might operate in a limited state
> > - *                                   for example the participating devices might
> > - *                                   not be allowed to unbind, and also the
> > - *                                   setting up of new DMA mappings might be
> > - *                                   disabled in this state.
> > - * @LIVEUPDATE_STATE_FROZEN:         The final reboot event
> > - *                                   (%LIVEUPDATE_FREEZE) has been sent, and the
> > - *                                   system is performing its final state saving
> > - *                                   within the "blackout window". User
> > - *                                   workloads must be suspended. The actual
> > - *                                   reboot (kexec) into the next kernel is
> > - *                                   imminent.
> > - * @LIVEUPDATE_STATE_UPDATED:        The system has rebooted into the next
> > - *                                   kernel via live update the system is now
> > - *                                   running the next kernel, awaiting the
> > - *                                   finish event.
> > - *
> > - * These states track the progress and outcome of a live update operation.
> > - */
> > -enum liveupdate_state  {
> > -     LIVEUPDATE_STATE_NORMAL = 0,
> > -     LIVEUPDATE_STATE_PREPARED = 1,
> > -     LIVEUPDATE_STATE_FROZEN = 2,
> > -     LIVEUPDATE_STATE_UPDATED = 3,
> > -};
> > -
>
> Nit: this seems an unnecessary churn, these definitions can go to
> include/uapi from the start.

True, but we do not have a user API at that point yet :-)

>
> > diff --git a/include/uapi/linux/liveupdate.h b/include/uapi/linux/liveupdate.h
> > +/**
> > + * struct liveupdate_fd - Holds parameters for preserving and restoring file
> > + * descriptors across live update.
> > + * @fd:    Input for %LIVEUPDATE_IOCTL_FD_PRESERVE: The user-space file
> > + *         descriptor to be preserved.
> > + *         Output for %LIVEUPDATE_IOCTL_FD_RESTORE: The new file descriptor
> > + *         representing the fully restored kernel resource.
> > + * @flags: Unused, reserved for future expansion, must be set to 0.
> > + * @token: Output for %LIVEUPDATE_IOCTL_FD_PRESERVE: An opaque, unique token
> > + *         generated by the kernel representing the successfully preserved
> > + *         resource state.
> > + *         Input for %LIVEUPDATE_IOCTL_FD_RESTORE: The token previously
> > + *         returned by the preserve ioctl for the resource to be restored.
> > + *
> > + * This structure is used as the argument for the %LIVEUPDATE_IOCTL_FD_PRESERVE
> > + * and %LIVEUPDATE_IOCTL_FD_RESTORE ioctls. These ioctls allow specific types
> > + * of file descriptors (for example memfd, kvm, iommufd, and VFIO) to have their
> > + * underlying kernel state preserved across a live update cycle.
> > + *
> > + * To preserve an FD, user space passes this struct to
> > + * %LIVEUPDATE_IOCTL_FD_PRESERVE with the @fd field set. On success, the
> > + * kernel populates the @token field.
> > + *
> > + * After the live update transition, user space passes the struct populated with
> > + * the *same* @token to %LIVEUPDATE_IOCTL_FD_RESTORE. The kernel uses the @token
> > + * to find the preserved state and, on success, populates the @fd field with a
> > + * new file descriptor referring to the fully restored resource.
> > + */
> > +struct liveupdate_fd {
> > +     int             fd;
> > +     __u32           flags;
> > +     __u64           token;
> > +};
>
> Consider using __aligned_u64 here for size-based versioning.

Good suggestion, added.
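
For illustration, a minimal userspace sketch of what the aligned variant of
the struct could look like. The `_sketch` names and the local typedef are
hypothetical stand-ins; a real UAPI header would use `__aligned_u64` from
<linux/types.h>. The assumption here is that the three fields stay as in the
quoted patch. The explicit alignment pins the struct's alignment requirement
to 8 bytes even on ABIs where a plain 64-bit integer is only 4-byte aligned,
which helps keep the layout stable if fields are appended later.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Local stand-in for the kernel's __aligned_u64 (assumption: 8-byte aligned). */
typedef uint64_t sketch_aligned_u64 __attribute__((aligned(8)));

/* Hypothetical mirror of struct liveupdate_fd with the suggested change. */
struct liveupdate_fd_sketch {
	int			fd;
	uint32_t		flags;
	sketch_aligned_u64	token;
};

/* Compile-time checks: no hidden padding, token sits at a fixed offset. */
static_assert(sizeof(struct liveupdate_fd_sketch) == 16, "unexpected size");
static_assert(offsetof(struct liveupdate_fd_sketch, token) == 8,
	      "unexpected token offset");
```

With a fixed 16-byte layout, a later revision could grow the struct and let
the kernel tell versions apart by the size userspace passes in.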

>
> > +
> > +/* The ioctl type, documented in ioctl-number.rst */
> > +#define LIVEUPDATE_IOCTL_TYPE                0xBA
>
> ...
>
> > +/**
> > + * LIVEUPDATE_IOCTL_EVENT_PREPARE - Initiate preparation phase and trigger state
> > + * saving.
>
> This (and others below) is more a command than an event IMHO. Maybe just
> LIVEUPDATE_IOCTL_PREPARE?

Renamed.

>
> > + * Argument: None.
> > + *
> > + * Initiates the live update preparation phase. This action corresponds to
> > + * the internal %LIVEUPDATE_PREPARE kernel event and can also be triggered
>
> This action is a reason for LIVEUPDATE_PREPARE event, isn't it?
> The same applies to other IOCTL_EVENTS

It is.

>
> > + * by writing '1' to ``/sys/kernel/liveupdate/prepare``. This typically

Oops, this is a leftover from LUO RFCv1, fixed.

> > + * triggers the main state saving process for items marked via the PRESERVE
> > + * ioctls. This occurs *before* the main "blackout window", while user
> > + * applications (e.g., VMs) may still be running. Kernel subsystems
> > + * receiving the %LIVEUPDATE_PREPARE event should serialize necessary state.
> > + * This command does not transfer data.
>
> I'm not sure I follow what this sentence means.

Fixed

Thanks,
Pasha

* Re: [RFC v2 10/16] luo: luo_ioctl: add ioctl interface
  2025-05-28 20:29   ` David Matlack
@ 2025-06-08 16:32     ` Pasha Tatashin
  0 siblings, 0 replies; 102+ messages in thread
From: Pasha Tatashin @ 2025-06-08 16:32 UTC (permalink / raw)
  To: David Matlack
  Cc: pratyush, jasonmiu, graf, changyuanl, rppt, rientjes, corbet,
	rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl, masahiroy, akpm,
	tj, yoann.congal, mmaurer, roman.gushchin, chenridong, axboe,
	mark.rutland, jannh, vincent.guittot, hannes, dan.j.williams,
	david, joel.granados, rostedt, anna.schumaker, song, zhangguopeng,
	linux, linux-kernel, linux-doc, linux-mm, gregkh, tglx, mingo, bp,
	dave.hansen, x86, hpa, rafael, dakr, bartosz.golaszewski,
	cw00.choi, myungjoo.ham, yesanishhere, Jonathan.Cameron,
	quic_zijuhu, aleksander.lobakin, ira.weiny, andriy.shevchenko,
	leon, lukas, bhelgaas, wagi, djeffery, stuart.w.hayes, ptyadav

On Wed, May 28, 2025 at 4:29 PM David Matlack <dmatlack@google.com> wrote:
>
> On Thu, May 15, 2025 at 11:23 AM Pasha Tatashin
> <pasha.tatashin@soleen.com> wrote:
> > +static int luo_open(struct inode *inodep, struct file *filep)
> > +{
> > +       if (!capable(CAP_SYS_ADMIN))
> > +               return -EACCES;
>
> It makes sense that LIVEUPDATE_IOCTL_EVENT* would require
> CAP_SYS_ADMIN. But I think requiring it for LIVEUPDATE_IOCTL_FD* will
> add a lot of complexity.
> It would essentially require a central userspace process to mediate
> all preserving/restoring of file descriptors across Live Update to
> enforce security. If we need a central authority to enforce security,
> I don't see why that authority can't just be the kernel or what the
> industry gains by punting the problem to userspace. It seems like all
> users of LUO are going to want the same security guarantees when it
> comes to FDs: a FD preserved inside a given "security domain" should
> not be accessible outside that domain.
>
> One way to do this in the kernel would be to have the kernel hand out
> Live Update security tokens (say, some large random number). Then
> require userspace to pass in a security token when preserving an FD.
> Userspace can then only restore or unpreserve an FD if it passes back
> in the security token associated with the FD. Then it's just up to
> each userspace process to remember their token across kexec, keep it
> secret from other untrusted processes, and pass it back in when
> recovering FDs.
>
> All the kernel has to do is generate secure tokens, which I imagine
> can't be that hard.

Based on current discussions at the bi-weekly hypervisor live update
sync [1], one proposed idea is for LIVEUPDATE_IOCTL_FD_* operations to
be managed by a dedicated userspace agent. This agent would be
responsible for preserving and restoring file descriptors,
subsequently passing them to their respective owners (e.g., VMMs).
While the complexity of implementing such a userspace architecture in
a cloud environment is unclear to me, introducing kernel-enforced
security boundaries around /dev/liveupdate tokens themselves (instead
of CAP_SYS_ADMIN for the device node) seems too complex and
potentially risky to incorporate at this stage of LUO's development.
If finer-grained, token-based security is necessary, it could perhaps
be an optional extension to LUO in the future managed by a dedicated
CONFIG_*.

[1] https://lore.kernel.org/all/958b2ec3-f5f1-b714-1256-1b06dcf7470f@google.com/

* Re: [RFC v2 10/16] luo: luo_ioctl: add ioctl interface
  2025-06-05 16:15   ` Pratyush Yadav
@ 2025-06-08 16:35     ` Pasha Tatashin
  0 siblings, 0 replies; 102+ messages in thread
From: Pasha Tatashin @ 2025-06-08 16:35 UTC (permalink / raw)
  To: Pratyush Yadav
  Cc: jasonmiu, graf, changyuanl, rppt, dmatlack, rientjes, corbet,
	rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl, masahiroy, akpm,
	tj, yoann.congal, mmaurer, roman.gushchin, chenridong, axboe,
	mark.rutland, jannh, vincent.guittot, hannes, dan.j.williams,
	david, joel.granados, rostedt, anna.schumaker, song, zhangguopeng,
	linux, linux-kernel, linux-doc, linux-mm, gregkh, tglx, mingo, bp,
	dave.hansen, x86, hpa, rafael, dakr, bartosz.golaszewski,
	cw00.choi, myungjoo.ham, yesanishhere, Jonathan.Cameron,
	quic_zijuhu, aleksander.lobakin, ira.weiny, andriy.shevchenko,
	leon, lukas, bhelgaas, wagi, djeffery, stuart.w.hayes

On Thu, Jun 5, 2025 at 12:16 PM Pratyush Yadav <pratyush@kernel.org> wrote:
>
> On Thu, May 15 2025, Pasha Tatashin wrote:
>
> > Introduce the user-space interface for the Live Update Orchestrator
> > via ioctl commands, enabling external control over the live update
> > process and management of preserved resources.
> >
> > Create a misc character device at /dev/liveupdate. Access
> > to this device requires the CAP_SYS_ADMIN capability.
> >
> > A new UAPI header, <uapi/linux/liveupdate.h>, defines the necessary
> > structures. The magic number is registered in
> > Documentation/userspace-api/ioctl/ioctl-number.rst.
> >
> > Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
> > ---
> [...]
> > +static int luo_ioctl_fd_preserve(struct liveupdate_fd *luo_fd)
> > +{
> > +     struct file *file;
> > +     int ret;
> > +
> > +     file = fget(luo_fd->fd);
> > +     if (!file) {
> > +             pr_err("Bad file descriptor\n");
> > +             return -EBADF;
> > +     }
> > +
> > +     ret = luo_register_file(&luo_fd->token, file);
> > +     if (ret)
> > +             fput(file);
> > +
> > +     return ret;
> > +}
> > +
> > +static int luo_ioctl_fd_unpreserve(u64 token)
> > +{
>
> This leaks the refcount on the file that preserve took. Perhaps
> luo_unregister_file() should return the file it unregistered, so this
> can do fput(file)?

Thank you. David Matlack also noticed this leak; I fixed it.
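
The suggested shape of the fix can be pictured with a toy userspace model.
This is only a sketch: `toy_file`, the slot table, and every `toy_*` function
are hypothetical and stand in for struct file, fget()/fput(), and the LUO
registration calls. None of it is the actual LUO code. The point is that
unregister hands the file back, so the caller can drop the reference that
preserve took.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stand-in for struct file: only the reference count matters here. */
struct toy_file {
	int refcnt;
};

#define TOY_SLOTS 8
static struct toy_file *toy_preserved[TOY_SLOTS];

void toy_fget(struct toy_file *f) { f->refcnt++; }
void toy_fput(struct toy_file *f) { f->refcnt--; }

/* Preserve: take a reference and hand back a token (slot index + 1). */
int toy_preserve(struct toy_file *f, uint64_t *token)
{
	for (size_t i = 0; i < TOY_SLOTS; i++) {
		if (!toy_preserved[i]) {
			toy_fget(f);
			toy_preserved[i] = f;
			*token = i + 1;
			return 0;
		}
	}
	return -ENOSPC;
}

/* Unregister returns the file so the caller can release its reference. */
struct toy_file *toy_unregister(uint64_t token)
{
	struct toy_file *f;

	if (token == 0 || token > TOY_SLOTS)
		return NULL;
	f = toy_preserved[token - 1];
	toy_preserved[token - 1] = NULL;
	return f;
}

/* Unpreserve: drops the reference taken in toy_preserve(). */
int toy_unpreserve(uint64_t token)
{
	struct toy_file *f = toy_unregister(token);

	if (!f)
		return -ENOENT;
	toy_fput(f);
	return 0;
}
```

In this model a preserve followed by an unpreserve leaves the reference count
where it started, and a second unpreserve with the same token fails.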

>
> > +     return luo_unregister_file(token);
> > +}
> > +
> > +static int luo_ioctl_fd_restore(struct liveupdate_fd *luo_fd)
> > +{
> > +     struct file *file;
> > +     int ret;
> > +     int fd;
> > +
> > +     fd = get_unused_fd_flags(O_CLOEXEC);
> > +     if (fd < 0) {
> > +             pr_err("Failed to allocate new fd: %d\n", fd);
> > +             return fd;
> > +     }
> > +
> > +     ret = luo_retrieve_file(luo_fd->token, &file);
> > +     if (ret < 0) {
> > +             put_unused_fd(fd);
> > +
> > +             return ret;
> > +     }
> > +
> > +     fd_install(fd, file);
> > +     luo_fd->fd = fd;
> > +
> > +     return 0;
> > +}
> > +
> > +static int luo_open(struct inode *inodep, struct file *filep)
> > +{
> > +     if (!capable(CAP_SYS_ADMIN))
> > +             return -EACCES;
> > +
> > +     if (filep->f_flags & O_EXCL)
> > +             return -EINVAL;
> > +
> > +     return 0;
> > +}
> > +
> > +static long luo_ioctl(struct file *filep, unsigned int cmd, unsigned long arg)
> > +{
> > +     void __user *argp = (void __user *)arg;
> > +     struct liveupdate_fd luo_fd;
> > +     enum liveupdate_state state;
> > +     int ret = 0;
> > +     u64 token;
> > +
> > +     if (_IOC_TYPE(cmd) != LIVEUPDATE_IOCTL_TYPE)
> > +             return -ENOTTY;
> > +
> > +     switch (cmd) {
> > +     case LIVEUPDATE_IOCTL_GET_STATE:
> > +             state = READ_ONCE(luo_state);
> > +             if (copy_to_user(argp, &state, sizeof(luo_state)))
> > +                     ret = -EFAULT;
> > +             break;
> > +
> > +     case LIVEUPDATE_IOCTL_EVENT_PREPARE:
> > +             ret = luo_prepare();
> > +             break;
> > +
> > +     case LIVEUPDATE_IOCTL_EVENT_FREEZE:
> > +             ret = luo_freeze();
> > +             break;
> > +
> > +     case LIVEUPDATE_IOCTL_EVENT_FINISH:
> > +             ret = luo_finish();
> > +             break;
> > +
> > +     case LIVEUPDATE_IOCTL_EVENT_CANCEL:
> > +             ret = luo_cancel();
> > +             break;
> > +
> > +     case LIVEUPDATE_IOCTL_FD_PRESERVE:
> > +             if (copy_from_user(&luo_fd, argp, sizeof(luo_fd))) {
> > +                     ret = -EFAULT;
> > +                     break;
> > +             }
> > +
> > +             ret = luo_ioctl_fd_preserve(&luo_fd);
> > +             if (!ret && copy_to_user(argp, &luo_fd, sizeof(luo_fd)))
> > +                     ret = -EFAULT;
>
> luo_unregister_file() is needed here on error.
>

Done, thank you.

Pasha

* Re: [RFC v2 11/16] luo: luo_sysfs: add sysfs state monitoring
  2025-06-05 16:20   ` Pratyush Yadav
@ 2025-06-08 16:36     ` Pasha Tatashin
  2025-06-13 15:13       ` Pratyush Yadav
  0 siblings, 1 reply; 102+ messages in thread
From: Pasha Tatashin @ 2025-06-08 16:36 UTC (permalink / raw)
  To: Pratyush Yadav
  Cc: jasonmiu, graf, changyuanl, rppt, dmatlack, rientjes, corbet,
	rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl, masahiroy, akpm,
	tj, yoann.congal, mmaurer, roman.gushchin, chenridong, axboe,
	mark.rutland, jannh, vincent.guittot, hannes, dan.j.williams,
	david, joel.granados, rostedt, anna.schumaker, song, zhangguopeng,
	linux, linux-kernel, linux-doc, linux-mm, gregkh, tglx, mingo, bp,
	dave.hansen, x86, hpa, rafael, dakr, bartosz.golaszewski,
	cw00.choi, myungjoo.ham, yesanishhere, Jonathan.Cameron,
	quic_zijuhu, aleksander.lobakin, ira.weiny, andriy.shevchenko,
	leon, lukas, bhelgaas, wagi, djeffery, stuart.w.hayes

On Thu, Jun 5, 2025 at 12:20 PM Pratyush Yadav <pratyush@kernel.org> wrote:
>
> On Thu, May 15 2025, Pasha Tatashin wrote:
>
> > Introduce a sysfs interface for the Live Update Orchestrator
> > under /sys/kernel/liveupdate/. This interface provides a way for
> > userspace tools and scripts to monitor the current state of the LUO
> > state machine.
>
> I am not sure if adding and maintaining a new UAPI that does the same
> thing is worth it. Can't we just have commandline utilities that can do
> the ioctls and fetch the LUO state, and those can be called from tools
> and scripts?
>

This is based on a discussion with the systemd people. It is much simpler
for units to check the current 'state' via sysfs and act accordingly.
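
As a rough sketch of what such a consumer could look like in userspace: the
state strings and numeric values are taken from the patch description and the
enum quoted earlier in this thread, while the `luo_*` helper names and the
`LUO_SKETCH_*` enum are hypothetical. The wait helper assumes the usual sysfs
notification convention, where a read arms the notification and poll() with
POLLPRI wakes up once the kernel calls sysfs_notify() on the attribute.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <poll.h>
#include <string.h>
#include <unistd.h>

/* Values mirror enum liveupdate_state as quoted in this thread. */
enum luo_state_sketch {
	LUO_SKETCH_UNKNOWN = -1,
	LUO_SKETCH_NORMAL = 0,
	LUO_SKETCH_PREPARED = 1,
	LUO_SKETCH_FROZEN = 2,
	LUO_SKETCH_UPDATED = 3,
};

/* Map the sysfs string (optionally newline-terminated) to a state. */
enum luo_state_sketch luo_parse_state(const char *buf)
{
	static const char *const names[] = {
		"normal", "prepared", "frozen", "updated",
	};
	size_t len, i;

	assert(buf != NULL);
	len = strcspn(buf, "\n");
	for (i = 0; i < sizeof(names) / sizeof(names[0]); i++) {
		if (strlen(names[i]) == len && !strncmp(buf, names[i], len))
			return (enum luo_state_sketch)i;
	}
	return LUO_SKETCH_UNKNOWN;
}

/* Read the attribute from offset 0; LUO_SKETCH_UNKNOWN on any error. */
enum luo_state_sketch luo_read_state(int fd)
{
	char buf[32];
	ssize_t n = pread(fd, buf, sizeof(buf) - 1, 0);

	if (n <= 0)
		return LUO_SKETCH_UNKNOWN;
	buf[n] = '\0';
	return luo_parse_state(buf);
}

/* Block until the attribute is notified, then return the new state. */
enum luo_state_sketch luo_wait_state_change(int fd)
{
	struct pollfd pfd = { .fd = fd, .events = POLLPRI | POLLERR };

	luo_read_state(fd);	/* a read arms the notification */
	if (poll(&pfd, 1, -1) < 0)
		return LUO_SKETCH_UNKNOWN;
	return luo_read_state(fd);
}
```

A caller would open /sys/kernel/liveupdate/state read-only and pass the
descriptor to luo_read_state() or luo_wait_state_change(). No ioctl and no
CAP_SYS_ADMIN-gated device open is involved, which is presumably what makes
this convenient for units.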

> >
> > The main feature is a read-only file, state, which displays the
> > current LUO state as a string ("normal", "prepared", "frozen",
> > "updated"). The interface uses sysfs_notify to allow userspace
> > listeners (e.g., via poll) to be efficiently notified of state changes.
> >
> > ABI documentation for this new sysfs interface is added in
> > Documentation/ABI/testing/sysfs-kernel-liveupdate.
> >
> > This read-only sysfs interface complements the main ioctl interface
> > provided by /dev/liveupdate, which handles LUO control operations and
> > resource management.
> >
> > Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
> [...]
>
> --
> Regards,
> Pratyush Yadav

* Re: [RFC v2 13/16] luo: add selftests for subsystems un/registration
  2025-05-26  8:52   ` Mike Rapoport
@ 2025-06-08 16:47     ` Pasha Tatashin
  0 siblings, 0 replies; 102+ messages in thread
From: Pasha Tatashin @ 2025-06-08 16:47 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: pratyush, jasonmiu, graf, changyuanl, dmatlack, rientjes, corbet,
	rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl, masahiroy, akpm,
	tj, yoann.congal, mmaurer, roman.gushchin, chenridong, axboe,
	mark.rutland, jannh, vincent.guittot, hannes, dan.j.williams,
	david, joel.granados, rostedt, anna.schumaker, song, zhangguopeng,
	linux, linux-kernel, linux-doc, linux-mm, gregkh, tglx, mingo, bp,
	dave.hansen, x86, hpa, rafael, dakr, bartosz.golaszewski,
	cw00.choi, myungjoo.ham, yesanishhere, Jonathan.Cameron,
	quic_zijuhu, aleksander.lobakin, ira.weiny, andriy.shevchenko,
	leon, lukas, bhelgaas, wagi, djeffery, stuart.w.hayes, ptyadav

On Mon, May 26, 2025 at 4:52 AM Mike Rapoport <rppt@kernel.org> wrote:
>
> On Thu, May 15, 2025 at 06:23:17PM +0000, Pasha Tatashin wrote:
> > Introduce a self-test mechanism for the LUO to allow verification of
> > core subsystem management functionality. This is primarily intended
> > for developers and system integrators validating the live update
> > feature.
> >
> > The tests are enabled via the new Kconfig option
> > CONFIG_LIVEUPDATE_SELFTESTS (default 'n') and are triggered through
> > a new ioctl command, LIVEUPDATE_IOCTL_SELFTESTS, added to the
> > /dev/liveupdate device node.
> >
> > This ioctl accepts commands defined in luo_selftests.h to:
> > - LUO_CMD_SUBSYSTEM_REGISTER: Creates and registers a dummy LUO
> >   subsystem using the liveupdate_register_subsystem() function. It
> >   allocates a data page and copies initial data from userspace.
> > - LUO_CMD_SUBSYSTEM_UNREGISTER: Unregisters the specified dummy
> >   subsystem using the liveupdate_unregister_subsystem() function and
> >   cleans up associated test resources.
> > - LUO_CMD_SUBSYSTEM_GETDATA: Copies the data page associated with a
> >   registered test subsystem back to userspace, allowing verification of
> >   data potentially modified or preserved by test callbacks.
> > This provides a way to test the fundamental registration and
> > unregistration flows within the LUO framework from userspace without
> > requiring a full live update sequence.
>
> I don't think ioctl for selftest is a good idea.
> Can't we test register/unregister and state machine transitions with kunit?
>
> And have a separate test module that registers as a subsystem, preserves
> its state and then verifies the state after the reboot. This will require
> running qemu and qemu usage in tools/testing is a mess right now, but
> still.

Normally, I would agree with you, but LUO is special as it has two
parts: user states and kernel states, and it is already driven through
the ioctl() interface for state transitions and preservation
management. So, in this particular case, having extended IOCTLs to
configure a specific kernel state, and then using the normal IOCTLs to
drive the tests, is very useful. In the future, I plan to add support to
QEMU, but more work is needed before that can happen.

Pasha

* Re: [RFC v2 05/16] luo: luo_core: integrate with KHO
  2025-06-07 17:50     ` Pasha Tatashin
@ 2025-06-09  2:14       ` Pasha Tatashin
  0 siblings, 0 replies; 102+ messages in thread
From: Pasha Tatashin @ 2025-06-09  2:14 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: pratyush, jasonmiu, graf, changyuanl, dmatlack, rientjes, corbet,
	rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl, masahiroy, akpm,
	tj, yoann.congal, mmaurer, roman.gushchin, chenridong, axboe,
	mark.rutland, jannh, vincent.guittot, hannes, dan.j.williams,
	david, joel.granados, rostedt, anna.schumaker, song, zhangguopeng,
	linux, linux-kernel, linux-doc, linux-mm, gregkh, tglx, mingo, bp,
	dave.hansen, x86, hpa, rafael, dakr, bartosz.golaszewski,
	cw00.choi, myungjoo.ham, yesanishhere, Jonathan.Cameron,
	quic_zijuhu, aleksander.lobakin, ira.weiny, andriy.shevchenko,
	leon, lukas, bhelgaas, wagi, djeffery, stuart.w.hayes, ptyadav

> > > +static void *luo_fdt_out;
> > > +static void *luo_fdt_in;
> > > +#define LUO_FDT_SIZE         SZ_1M
> >
> > Does LUO really need that much?
>
> > Not really, but I am keeping it simple in this patch. I added the
> following comment:

Actually, given that we are moving files to be another subsystem, this
can be reduced to only one page (i.e. it is unlikely that more than one
page worth of subsystems will ever register), and for files we can
dynamically calculate the required size. So, I am going to fix this.

* Re: [RFC v2 05/16] luo: luo_core: integrate with KHO
  2025-06-07 23:30     ` Pasha Tatashin
@ 2025-06-13 14:58       ` Pratyush Yadav
  2025-06-17 15:23         ` Jason Gunthorpe
  0 siblings, 1 reply; 102+ messages in thread
From: Pratyush Yadav @ 2025-06-13 14:58 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: Pratyush Yadav, jasonmiu, graf, changyuanl, rppt, dmatlack,
	rientjes, corbet, rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl,
	masahiroy, akpm, tj, yoann.congal, mmaurer, roman.gushchin,
	chenridong, axboe, mark.rutland, jannh, vincent.guittot, hannes,
	dan.j.williams, david, joel.granados, rostedt, anna.schumaker,
	song, zhangguopeng, linux, linux-kernel, linux-doc, linux-mm,
	gregkh, tglx, mingo, bp, dave.hansen, x86, hpa, rafael, dakr,
	bartosz.golaszewski, cw00.choi, myungjoo.ham, yesanishhere,
	Jonathan.Cameron, quic_zijuhu, aleksander.lobakin, ira.weiny,
	andriy.shevchenko, leon, lukas, bhelgaas, wagi, djeffery,
	stuart.w.hayes

On Sat, Jun 07 2025, Pasha Tatashin wrote:
[...]
>>
>> This weirdness happens because luo_prepare() and luo_cancel() control
>> the KHO state machine, but then also get controlled by it via the
>> notifier callbacks. So the relationship between them is not clear.
>> __luo_prepare() at least needs access to struct kho_serialization, so it
>> needs to come from the callback. So I don't have a clear way to clean
>> this all up off the top of my head.
>
> On a production machine, without KHO_DEBUGFS, only LUO can control KHO
> state, but if debugfs is enabled, KHO can be finalized manually, and
> in this case LUO transitions to prepared state. In both cases, the
> path is identical. The KHO debugfs path is only for
> developers/debugging purposes.

What I meant is that even without KHO_DEBUGFS, LUO drives KHO, but then
KHO calls into LUO from the notifier, which makes the control flow
somewhat convoluted. If LUO is supposed to be the only thing that
interacts directly with KHO, maybe we should get rid of the notifier and
only let LUO drive things.

This can be done later though; it doesn't have to be in the initial
revision.

>
>> >  static int __init luo_startup(void)
>> >  {
>> > -     __luo_set_state(LIVEUPDATE_STATE_NORMAL);
>> > +     phys_addr_t fdt_phys;
>> > +     int ret;
>> > +
>> > +     if (!kho_is_enabled()) {
>> > +             if (luo_enabled)
>> > +                     pr_warn("Disabling liveupdate because KHO is disabled\n");
>> > +             luo_enabled = false;
>> > +             return 0;
>> > +     }
>> > +
>> > +     ret = register_kho_notifier(&luo_kho_notifier_nb);
>> > +     if (ret) {
>> > +             luo_enabled = false;
>>
>> You set luo_enabled to false here, but none of the LUO entry points like
>> luo_prepare() or luo_freeze() actually check it. So LUO will appear to
>> work just fine even when it hasn't initialized properly.
>
> luo_enabled check was missing from luo_ioctl.c, as we should not
> create a device if LUO is not enabled. This is fixed.
>
>>
>> > +             pr_warn("Failed to register with KHO [%d]\n", ret);
>>
>> I guess you don't return here so a previous liveupdate can still be
>> recovered, even though we won't be able to make the next one. If so, a
>> comment would be nice to point this out.
>
> This is correct, but this is not going to work, because with the
> current change I am disabling "/dev/liveupdate" iff luo_enabled ==
> false. Let's just return here; failing to register with KHO should not
> really happen, as it usually means that another notifier with the same
> name has already registered.

Okay, fair enough.

-- 
Regards,
Pratyush Yadav

* Re: [RFC v2 11/16] luo: luo_sysfs: add sysfs state monitoring
  2025-06-08 16:36     ` Pasha Tatashin
@ 2025-06-13 15:13       ` Pratyush Yadav
  0 siblings, 0 replies; 102+ messages in thread
From: Pratyush Yadav @ 2025-06-13 15:13 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: Pratyush Yadav, jasonmiu, graf, changyuanl, rppt, dmatlack,
	rientjes, corbet, rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl,
	masahiroy, akpm, tj, yoann.congal, mmaurer, roman.gushchin,
	chenridong, axboe, mark.rutland, jannh, vincent.guittot, hannes,
	dan.j.williams, david, joel.granados, rostedt, anna.schumaker,
	song, zhangguopeng, linux, linux-kernel, linux-doc, linux-mm,
	gregkh, tglx, mingo, bp, dave.hansen, x86, hpa, rafael, dakr,
	bartosz.golaszewski, cw00.choi, myungjoo.ham, yesanishhere,
	Jonathan.Cameron, quic_zijuhu, aleksander.lobakin, ira.weiny,
	andriy.shevchenko, leon, lukas, bhelgaas, wagi, djeffery,
	stuart.w.hayes

On Sun, Jun 08 2025, Pasha Tatashin wrote:

> On Thu, Jun 5, 2025 at 12:20 PM Pratyush Yadav <pratyush@kernel.org> wrote:
>>
>> On Thu, May 15 2025, Pasha Tatashin wrote:
>>
>> > Introduce a sysfs interface for the Live Update Orchestrator
>> > under /sys/kernel/liveupdate/. This interface provides a way for
>> > userspace tools and scripts to monitor the current state of the LUO
>> > state machine.
>>
>> I am not sure if adding and maintaining a new UAPI that does the same
>> thing is worth it. Can't we just have commandline utilities that can do
>> the ioctls and fetch the LUO state, and those can be called from tools
>> and scripts?
>>
>
> This is based on discussion from SystemD people. It is much simpler
> for units to check the current 'state' via sysfs, and act accordingly.

Ok, fair enough.

[...]

-- 
Regards,
Pratyush Yadav

* Re: [RFC v2 09/16] luo: luo_files: implement file systems callbacks
  2025-06-08 13:49     ` Pasha Tatashin
@ 2025-06-13 15:18       ` Pratyush Yadav
  2025-06-13 20:26         ` Pasha Tatashin
  0 siblings, 1 reply; 102+ messages in thread
From: Pratyush Yadav @ 2025-06-13 15:18 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: Pratyush Yadav, jasonmiu, graf, changyuanl, rppt, dmatlack,
	rientjes, corbet, rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl,
	masahiroy, akpm, tj, yoann.congal, mmaurer, roman.gushchin,
	chenridong, axboe, mark.rutland, jannh, vincent.guittot, hannes,
	dan.j.williams, david, joel.granados, rostedt, anna.schumaker,
	song, zhangguopeng, linux, linux-kernel, linux-doc, linux-mm,
	gregkh, tglx, mingo, bp, dave.hansen, x86, hpa, rafael, dakr,
	bartosz.golaszewski, cw00.choi, myungjoo.ham, yesanishhere,
	Jonathan.Cameron, quic_zijuhu, aleksander.lobakin, ira.weiny,
	andriy.shevchenko, leon, lukas, bhelgaas, wagi, djeffery,
	stuart.w.hayes

On Sun, Jun 08 2025, Pasha Tatashin wrote:

> On Thu, Jun 5, 2025 at 12:04 PM Pratyush Yadav <pratyush@kernel.org> wrote:
>>
>> On Thu, May 15 2025, Pasha Tatashin wrote:
>>
>> > Implements the core logic within luo_files.c to invoke the prepare,
>> > reboot, finish, and cancel callbacks for preserved file instances,
>> > replacing the previous stub implementations. It also handles
>> > the persistence and retrieval of the u64 data payload associated with
>> > each file via the LUO FDT.
>> >
>> > This completes the core mechanism enabling registered filesystem
>> > handlers to actively manage file state across the live update
>> > transition using the LUO framework.
>> >
>> > Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
>> > ---
>> >  drivers/misc/liveupdate/luo_files.c | 105 +++++++++++++++++++++++++++-
>> >  1 file changed, 103 insertions(+), 2 deletions(-)
>> >
>> [...]
>> > @@ -305,7 +369,29 @@ int luo_do_files_prepare_calls(void)
>> >   */
>> >  int luo_do_files_freeze_calls(void)
>> >  {
>> > -     return 0;
>> > +     unsigned long token;
>> > +     struct luo_file *h;
>> > +     int ret;
>> > +
>> > +     xa_for_each(&luo_files_xa_out, token, h) {
>>
>> Should we also ensure at this point that there are no open handles to
>> this file? How else would a file system ensure the file is in quiescent
>> state to do its final serialization?
>
> Do you mean check refcnt here? If so, this is a good idea, but first
> we need to implement the lifecycle of liveupdate agent correctectly,
> where owner of FD must survive through entering into reboot() with
> /dev/liveupdate still open.

Yes, by this point we should ensure refcnt == 1. IIUC you plan to
implement the lifecycle change in the next revision, so this can be
added there as well I suppose.

[...]

-- 
Regards,
Pratyush Yadav

* Re: [RFC v2 08/16] luo: luo_files: add infrastructure for FDs
  2025-06-08 13:37     ` Pasha Tatashin
@ 2025-06-13 15:27       ` Pratyush Yadav
  2025-06-15 18:02         ` Pasha Tatashin
  0 siblings, 1 reply; 102+ messages in thread
From: Pratyush Yadav @ 2025-06-13 15:27 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: Pratyush Yadav, jasonmiu, graf, changyuanl, rppt, dmatlack,
	rientjes, corbet, rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl,
	masahiroy, akpm, tj, yoann.congal, mmaurer, roman.gushchin,
	chenridong, axboe, mark.rutland, jannh, vincent.guittot, hannes,
	dan.j.williams, david, joel.granados, rostedt, anna.schumaker,
	song, zhangguopeng, linux, linux-kernel, linux-doc, linux-mm,
	gregkh, tglx, mingo, bp, dave.hansen, x86, hpa, rafael, dakr,
	bartosz.golaszewski, cw00.choi, myungjoo.ham, yesanishhere,
	Jonathan.Cameron, quic_zijuhu, aleksander.lobakin, ira.weiny,
	andriy.shevchenko, leon, lukas, bhelgaas, wagi, djeffery,
	stuart.w.hayes

On Sun, Jun 08 2025, Pasha Tatashin wrote:

[...]
>> > +     down_write(&luo_filesystems_list_rwsem);
>> > +     if (luo_files_xa_in_recreated)
>> > +             goto exit_unlock;
>> > +
>> > +     parent_node_offset = fdt_subnode_offset(luo_fdt_in, 0,
>> > +                                             LUO_FILES_NODE_NAME);
>> > +
>> > +     fdt_for_each_subnode(file_node_offset, luo_fdt_in, parent_node_offset) {
>> > +             bool handler_found = false;
>> > +             u64 token;
>> > +
>> > +             node_name = fdt_get_name(luo_fdt_in, file_node_offset, NULL);
>> > +             if (!node_name) {
>> > +                     panic("Skipping FDT subnode at offset %d: Cannot get name\n",
>> > +                           file_node_offset);
>>
>> Should failure to parse a specific FD really be a panic? Wouldn't it be
>> better to continue and let userspace decide if it can live with the FD
>> missing?
>
> This is not safe, the memory might be DMA or owned by a sensetive
> process, and if we proceed liveupdate reboot without properly handling
> memory, we can get corruptions, and memory leaks. Therefore, during
> liveupdate boot if there are exceptions, we should panic.

I don't get how it would result in memory leaks or corruptions, since
KHO would have marked that memory as preserved, and the new kernel won't
touch it until someone restores it.

So it can at most lead to loss of data, and in that case, userspace can
very well decide if it can live with that loss or not.

Or are you assuming here that even data in KHO is broken? In that case,
it would probably be a good idea to panic early.

[...]
>> > +             }
>> > +
>> > +             luo_file = kmalloc(sizeof(*luo_file),
>> > +                                GFP_KERNEL | __GFP_NOFAIL);
>> > +             luo_file->fs = fs;
>> > +             luo_file->file = NULL;
>> > +             memcpy(&luo_file->private_data, data_ptr, sizeof(u64));
>>
>> Why not make sure data_ptr is exactly sizeof(u64) when we parse it, and
>> then simply do luo_file->private_data = (u64)*data_ptr ?
>
> Because FDT alignment is 4 bytes, we can't simply assign it.

Hmm, good catch. Didn't think of that.
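
To spell out the alignment point, here is a small userspace sketch (not
the kernel code; read_u64_unaligned() is a hypothetical name). The payload
sits at a 4-byte-aligned offset, so it is copied out with memcpy() rather
than read through a cast u64 pointer. In the kernel, get_unaligned() would
be another way to express the same thing.

```c
#include <stdint.h>
#include <string.h>

/*
 * Copy a 64-bit payload out of a buffer that is only guaranteed to be
 * 4-byte aligned. memcpy() makes no alignment assumption about its
 * source, unlike dereferencing a cast (const uint64_t *) pointer.
 */
static uint64_t read_u64_unaligned(const void *p)
{
	uint64_t v;

	memcpy(&v, p, sizeof(v));
	return v;
}
```

This copies the bytes as stored, so it assumes the writer and reader agree
on byte order.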

>
>> Because if the previous kernel wrote more than a u64 in data, then
>> something is broken and we should catch that error anyway.
>>
>> > +             luo_file->reclaimed = false;
>> > +             mutex_init(&luo_file->mutex);
>> > +             luo_file->state = LIVEUPDATE_STATE_UPDATED;
>> > +             ret = xa_err(xa_store(&luo_files_xa_in, token, luo_file,
>> > +                                   GFP_KERNEL | __GFP_NOFAIL));
>>
[...]
>> > +struct liveupdate_filesystem {
>> > +     int (*prepare)(struct file *file, void *arg, u64 *data);
>> > +     int (*freeze)(struct file *file, void *arg, u64 *data);
>> > +     void (*cancel)(struct file *file, void *arg, u64 data);
>> > +     void (*finish)(struct file *file, void *arg, u64 data, bool reclaimed);
>> > +     int (*retrieve)(void *arg, u64 data, struct file **file);
>> > +     bool (*can_preserve)(struct file *file, void *arg);
>> > +     const char *compatible;
>> > +     void *arg;
>>
>> What is the use for this arg? I would expect one file type/system to
>> register one set of handlers. So they can keep their arg in a global in
>> their code. I don't see why a per-filesystem arg is needed.
>
> I think, arg is useful in case we support a subsystem is registered
> multiple times with some differences: i.e. based on mount point, or
> file types handling. Let's keep it for now, but if needed, we can
> remove that in future revisions.
>
>> What I do think is needed is a per-file arg. Each callback gets 'data',
>> which is the serialized data, but there is no place to store runtime
>> state, like some flags or serialization metadata. Sure, you could make
>> place for it somewhere in the inode, but I think it would be a lot
>> cleaner to be able to store it in struct luo_file.
>>
>> So perhaps rename private_data in struct luo_file to say
>> serialized_data, and have a field called "private" that filesystems can
>> use for their runtime state?
>
> I am not against this, but let's make this change when it is actually
> needed by a registered filesystem.

Okay, fair enough.

-- 
Regards,
Pratyush Yadav

* Re: [RFC v2 09/16] luo: luo_files: implement file systems callbacks
  2025-06-13 15:18       ` Pratyush Yadav
@ 2025-06-13 20:26         ` Pasha Tatashin
  2025-06-16 10:43           ` Pratyush Yadav
  0 siblings, 1 reply; 102+ messages in thread
From: Pasha Tatashin @ 2025-06-13 20:26 UTC (permalink / raw)
  To: Pratyush Yadav
  Cc: jasonmiu, graf, changyuanl, rppt, dmatlack, rientjes, corbet,
	rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl, masahiroy, akpm,
	tj, yoann.congal, mmaurer, roman.gushchin, chenridong, axboe,
	mark.rutland, jannh, vincent.guittot, hannes, dan.j.williams,
	david, joel.granados, rostedt, anna.schumaker, song, zhangguopeng,
	linux, linux-kernel, linux-doc, linux-mm, gregkh, tglx, mingo, bp,
	dave.hansen, x86, hpa, rafael, dakr, bartosz.golaszewski,
	cw00.choi, myungjoo.ham, yesanishhere, Jonathan.Cameron,
	quic_zijuhu, aleksander.lobakin, ira.weiny, andriy.shevchenko,
	leon, lukas, bhelgaas, wagi, djeffery, stuart.w.hayes

On Fri, Jun 13, 2025 at 11:18 AM Pratyush Yadav <pratyush@kernel.org> wrote:
>
> On Sun, Jun 08 2025, Pasha Tatashin wrote:
>
> > On Thu, Jun 5, 2025 at 12:04 PM Pratyush Yadav <pratyush@kernel.org> wrote:
> >>
> >> On Thu, May 15 2025, Pasha Tatashin wrote:
> >>
> >> > Implements the core logic within luo_files.c to invoke the prepare,
> >> > reboot, finish, and cancel callbacks for preserved file instances,
> >> > replacing the previous stub implementations. It also handles
> >> > the persistence and retrieval of the u64 data payload associated with
> >> > each file via the LUO FDT.
> >> >
> >> > This completes the core mechanism enabling registered filesystem
> >> > handlers to actively manage file state across the live update
> >> > transition using the LUO framework.
> >> >
> >> > Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
> >> > ---
> >> >  drivers/misc/liveupdate/luo_files.c | 105 +++++++++++++++++++++++++++-
> >> >  1 file changed, 103 insertions(+), 2 deletions(-)
> >> >
> >> [...]
> >> > @@ -305,7 +369,29 @@ int luo_do_files_prepare_calls(void)
> >> >   */
> >> >  int luo_do_files_freeze_calls(void)
> >> >  {
> >> > -     return 0;
> >> > +     unsigned long token;
> >> > +     struct luo_file *h;
> >> > +     int ret;
> >> > +
> >> > +     xa_for_each(&luo_files_xa_out, token, h) {
> >>
> >> Should we also ensure at this point that there are no open handles to
> >> this file? How else would a file system ensure the file is in quiescent
> >> state to do its final serialization?
> >
> > Do you mean check refcnt here? If so, this is a good idea, but first
> > we need to implement the lifecycle of liveupdate agent correctectly,
> > where owner of FD must survive through entering into reboot() with
> > /dev/liveupdate still open.
>
> Yes, by this point we should ensure refcnt == 1. IIUC you plan to
> implement the lifecycle change in the next revision, so this can be
> added there as well I suppose.

Yes, I am working on that. The current WIP patch looks like this:
https://github.com/soleen/linux/commit/fecf912d8b70acd23d24185a8c0504764e43a279

However, I am not sure about refcnt == 1 at freeze() time. We can have
programs that never terminated while we were still in userspace (i.e.
kexec -e -> reboot() -> freeze()); in that case refcnt can be anything
at the time of freeze, no?

Pasha

>
> [...]
>
> --
> Regards,
> Pratyush Yadav

* Re: [RFC v2 08/16] luo: luo_files: add infrastructure for FDs
  2025-06-13 15:27       ` Pratyush Yadav
@ 2025-06-15 18:02         ` Pasha Tatashin
  0 siblings, 0 replies; 102+ messages in thread
From: Pasha Tatashin @ 2025-06-15 18:02 UTC (permalink / raw)
  To: Pratyush Yadav
  Cc: jasonmiu, graf, changyuanl, rppt, dmatlack, rientjes, corbet,
	rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl, masahiroy, akpm,
	tj, yoann.congal, mmaurer, roman.gushchin, chenridong, axboe,
	mark.rutland, jannh, vincent.guittot, hannes, dan.j.williams,
	david, joel.granados, rostedt, anna.schumaker, song, zhangguopeng,
	linux, linux-kernel, linux-doc, linux-mm, gregkh, tglx, mingo, bp,
	dave.hansen, x86, hpa, rafael, dakr, bartosz.golaszewski,
	cw00.choi, myungjoo.ham, yesanishhere, Jonathan.Cameron,
	quic_zijuhu, aleksander.lobakin, ira.weiny, andriy.shevchenko,
	leon, lukas, bhelgaas, wagi, djeffery, stuart.w.hayes

> > This is not safe, the memory might be DMA or owned by a sensetive
> > process, and if we proceed liveupdate reboot without properly handling
> > memory, we can get corruptions, and memory leaks. Therefore, during
> > liveupdate boot if there are exceptions, we should panic.
>
> I don't get how it would result in memory leaks or corruptions, since
> KHO would have marked that memory as preserved, and the new kernel won't
> touch it until someone restores it.
>
> So it can at most lead to loss of data, and in that case, userspace can
> very well decide if it can live with that loss or not.
>
> Or are you assuming here that even data in KHO is broken? In that case,
> it would probably be a good idea to panic early.

A broken LUO format is a catastrophic failure. It's unclear at this
point in boot whether the problem lies with KHO, LUO itself, or
mismatched interface assumptions between kernel versions. Regardless,
falling back to a cold reboot is the safest course of action, rather
than attempting to boot into a potentially broken environment. Since
VMs or any preserved userspace won't survive, the additional delay of
a full reboot should not significantly worsen the impact.

>
> [...]
> >> > +             }
> >> > +
> >> > +             luo_file = kmalloc(sizeof(*luo_file),
> >> > +                                GFP_KERNEL | __GFP_NOFAIL);
> >> > +             luo_file->fs = fs;
> >> > +             luo_file->file = NULL;
> >> > +             memcpy(&luo_file->private_data, data_ptr, sizeof(u64));
> >>
> >> Why not make sure data_ptr is exactly sizeof(u64) when we parse it, and
> >> then simply do luo_file->private_data = (u64)*data_ptr ?
> >
> > Because FDT alignment is 4 bytes, we can't simply assign it.
>
> Hmm, good catch. Didn't think of that.
>
> >
> >> Because if the previous kernel wrote more than a u64 in data, then
> >> something is broken and we should catch that error anyway.
> >>
> >> > +             luo_file->reclaimed = false;
> >> > +             mutex_init(&luo_file->mutex);
> >> > +             luo_file->state = LIVEUPDATE_STATE_UPDATED;
> >> > +             ret = xa_err(xa_store(&luo_files_xa_in, token, luo_file,
> >> > +                                   GFP_KERNEL | __GFP_NOFAIL));
> >>
> [...]
> >> > +struct liveupdate_filesystem {
> >> > +     int (*prepare)(struct file *file, void *arg, u64 *data);
> >> > +     int (*freeze)(struct file *file, void *arg, u64 *data);
> >> > +     void (*cancel)(struct file *file, void *arg, u64 data);
> >> > +     void (*finish)(struct file *file, void *arg, u64 data, bool reclaimed);
> >> > +     int (*retrieve)(void *arg, u64 data, struct file **file);
> >> > +     bool (*can_preserve)(struct file *file, void *arg);
> >> > +     const char *compatible;
> >> > +     void *arg;
> >>
> >> What is the use for this arg? I would expect one file type/system to
> >> register one set of handlers. So they can keep their arg in a global in
> >> their code. I don't see why a per-filesystem arg is needed.
> >
> > I think, arg is useful in case we support a subsystem is registered
> > multiple times with some differences: i.e. based on mount point, or
> > file types handling. Let's keep it for now, but if needed, we can
> > remove that in future revisions.
> >
> >> What I do think is needed is a per-file arg. Each callback gets 'data',
> >> which is the serialized data, but there is no place to store runtime
> >> state, like some flags or serialization metadata. Sure, you could make
> >> place for it somewhere in the inode, but I think it would be a lot
> >> cleaner to be able to store it in struct luo_file.
> >>
> >> So perhaps rename private_data in struct luo_file to say
> >> serialized_data, and have a field called "private" that filesystems can
> >> use for their runtime state?
> >
> > I am not against this, but let's make this change when it is actually
> > needed by a registered filesystem.
>
> Okay, fair enough.
>
> --
> Regards,
> Pratyush Yadav

* Re: [RFC v2 09/16] luo: luo_files: implement file systems callbacks
  2025-06-13 20:26         ` Pasha Tatashin
@ 2025-06-16 10:43           ` Pratyush Yadav
  2025-06-16 14:57             ` Pasha Tatashin
  0 siblings, 1 reply; 102+ messages in thread
From: Pratyush Yadav @ 2025-06-16 10:43 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: Pratyush Yadav, jasonmiu, graf, changyuanl, rppt, dmatlack,
	rientjes, corbet, rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl,
	masahiroy, akpm, tj, yoann.congal, mmaurer, roman.gushchin,
	chenridong, axboe, mark.rutland, jannh, vincent.guittot, hannes,
	dan.j.williams, david, joel.granados, rostedt, anna.schumaker,
	song, zhangguopeng, linux, linux-kernel, linux-doc, linux-mm,
	gregkh, tglx, mingo, bp, dave.hansen, x86, hpa, rafael, dakr,
	bartosz.golaszewski, cw00.choi, myungjoo.ham, yesanishhere,
	Jonathan.Cameron, quic_zijuhu, aleksander.lobakin, ira.weiny,
	andriy.shevchenko, leon, lukas, bhelgaas, wagi, djeffery,
	stuart.w.hayes

On Fri, Jun 13 2025, Pasha Tatashin wrote:

> On Fri, Jun 13, 2025 at 11:18 AM Pratyush Yadav <pratyush@kernel.org> wrote:
>>
>> On Sun, Jun 08 2025, Pasha Tatashin wrote:
>>
>> > On Thu, Jun 5, 2025 at 12:04 PM Pratyush Yadav <pratyush@kernel.org> wrote:
>> >>
>> >> On Thu, May 15 2025, Pasha Tatashin wrote:
>> >>
>> >> > Implements the core logic within luo_files.c to invoke the prepare,
>> >> > reboot, finish, and cancel callbacks for preserved file instances,
>> >> > replacing the previous stub implementations. It also handles
>> >> > the persistence and retrieval of the u64 data payload associated with
>> >> > each file via the LUO FDT.
>> >> >
>> >> > This completes the core mechanism enabling registered filesystem
>> >> > handlers to actively manage file state across the live update
>> >> > transition using the LUO framework.
>> >> >
>> >> > Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
>> >> > ---
>> >> >  drivers/misc/liveupdate/luo_files.c | 105 +++++++++++++++++++++++++++-
>> >> >  1 file changed, 103 insertions(+), 2 deletions(-)
>> >> >
>> >> [...]
>> >> > @@ -305,7 +369,29 @@ int luo_do_files_prepare_calls(void)
>> >> >   */
>> >> >  int luo_do_files_freeze_calls(void)
>> >> >  {
>> >> > -     return 0;
>> >> > +     unsigned long token;
>> >> > +     struct luo_file *h;
>> >> > +     int ret;
>> >> > +
>> >> > +     xa_for_each(&luo_files_xa_out, token, h) {
>> >>
>> >> Should we also ensure at this point that there are no open handles to
>> >> this file? How else would a file system ensure the file is in quiescent
>> >> state to do its final serialization?
>> >
>> > Do you mean check refcnt here? If so, this is a good idea, but first
>> > we need to implement the lifecycle of liveupdate agent correctectly,
>> > where owner of FD must survive through entering into reboot() with
>> > /dev/liveupdate still open.
>>
>> Yes, by this point we should ensure refcnt == 1. IIUC you plan to
>> implement the lifecycle change in the next revision, so this can be
>> added there as well I suppose.
>
> Yes, I am working on that. Current, WIP patch looks like this:
> https://github.com/soleen/linux/commit/fecf912d8b70acd23d24185a8c0504764e43a279
>
> However, I am not sure about refcnt == 1 at freeze() time. We can have
> programs, that never terminated while we were still in userspace (i.e.
> kexec -e -> reboot() -> freeze()), in that case refcnt can be anything
> at the time of freeze, no?

Do you mean the agent that controls the liveupdate session? Then in that
case the agent can keep running with the /dev/liveupdate FD open, but it
must close all of the FDs preserved via LUO before doing kexec -e.

-- 
Regards,
Pratyush Yadav

* Re: [RFC v2 09/16] luo: luo_files: implement file systems callbacks
  2025-06-16 10:43           ` Pratyush Yadav
@ 2025-06-16 14:57             ` Pasha Tatashin
  2025-06-18 13:16               ` Pratyush Yadav
  0 siblings, 1 reply; 102+ messages in thread
From: Pasha Tatashin @ 2025-06-16 14:57 UTC (permalink / raw)
  To: Pratyush Yadav
  Cc: jasonmiu, graf, changyuanl, rppt, dmatlack, rientjes, corbet,
	rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl, masahiroy, akpm,
	tj, yoann.congal, mmaurer, roman.gushchin, chenridong, axboe,
	mark.rutland, jannh, vincent.guittot, hannes, dan.j.williams,
	david, joel.granados, rostedt, anna.schumaker, song, zhangguopeng,
	linux, linux-kernel, linux-doc, linux-mm, gregkh, tglx, mingo, bp,
	dave.hansen, x86, hpa, rafael, dakr, bartosz.golaszewski,
	cw00.choi, myungjoo.ham, yesanishhere, Jonathan.Cameron,
	quic_zijuhu, aleksander.lobakin, ira.weiny, andriy.shevchenko,
	leon, lukas, bhelgaas, wagi, djeffery, stuart.w.hayes

On Mon, Jun 16, 2025 at 6:43 AM Pratyush Yadav <pratyush@kernel.org> wrote:
>
> On Fri, Jun 13 2025, Pasha Tatashin wrote:
>
> > On Fri, Jun 13, 2025 at 11:18 AM Pratyush Yadav <pratyush@kernel.org> wrote:
> >>
> >> On Sun, Jun 08 2025, Pasha Tatashin wrote:
> >>
> >> > On Thu, Jun 5, 2025 at 12:04 PM Pratyush Yadav <pratyush@kernel.org> wrote:
> >> >>
> >> >> On Thu, May 15 2025, Pasha Tatashin wrote:
> >> >>
> >> >> > Implements the core logic within luo_files.c to invoke the prepare,
> >> >> > reboot, finish, and cancel callbacks for preserved file instances,
> >> >> > replacing the previous stub implementations. It also handles
> >> >> > the persistence and retrieval of the u64 data payload associated with
> >> >> > each file via the LUO FDT.
> >> >> >
> >> >> > This completes the core mechanism enabling registered filesystem
> >> >> > handlers to actively manage file state across the live update
> >> >> > transition using the LUO framework.
> >> >> >
> >> >> > Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
> >> >> > ---
> >> >> >  drivers/misc/liveupdate/luo_files.c | 105 +++++++++++++++++++++++++++-
> >> >> >  1 file changed, 103 insertions(+), 2 deletions(-)
> >> >> >
> >> >> [...]
> >> >> > @@ -305,7 +369,29 @@ int luo_do_files_prepare_calls(void)
> >> >> >   */
> >> >> >  int luo_do_files_freeze_calls(void)
> >> >> >  {
> >> >> > -     return 0;
> >> >> > +     unsigned long token;
> >> >> > +     struct luo_file *h;
> >> >> > +     int ret;
> >> >> > +
> >> >> > +     xa_for_each(&luo_files_xa_out, token, h) {
> >> >>
> >> >> Should we also ensure at this point that there are no open handles to
> >> >> this file? How else would a file system ensure the file is in quiescent
> >> >> state to do its final serialization?
> >> >
> >> > Do you mean check refcnt here? If so, this is a good idea, but first
> >> > we need to implement the lifecycle of liveupdate agent correctectly,
> >> > where owner of FD must survive through entering into reboot() with
> >> > /dev/liveupdate still open.
> >>
> >> Yes, by this point we should ensure refcnt == 1. IIUC you plan to
> >> implement the lifecycle change in the next revision, so this can be
> >> added there as well I suppose.
> >
> > Yes, I am working on that. Current, WIP patch looks like this:
> > https://github.com/soleen/linux/commit/fecf912d8b70acd23d24185a8c0504764e43a279
> >
> > However, I am not sure about refcnt == 1 at freeze() time. We can have
> > programs, that never terminated while we were still in userspace (i.e.
> > kexec -e -> reboot() -> freeze()), in that case refcnt can be anything
> > at the time of freeze, no?
>
> Do you mean the agent that controls the liveupdate session? Then in that

Yes.

> case the agent can keep running with the /dev/liveupdate FD open, but it
> must close all of the FDs preserved via LUO before doing kexec -e.

Right, but in this case the agent would basically have to kill all the
processes that registered FDs through it prior to 'kexec -e', and I am
not sure that is its job. However, we can add a pr_warn_once() when
refcnt != 1; I think this is a minor change. Let's do that once we have
a more developed userspace setup. We need to start working on
liveupdated, which would store and restore FDs through some sort of RPC
calls.
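
Roughly, the warning could behave like the userspace sketch below. This is
only a model of the idea, not the kernel change: struct preserved_file and
check_refs_at_freeze() are hypothetical, and the static flag stands in for
what pr_warn_once() does in the kernel.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for a preserved file; not the kernel's struct luo_file. */
struct preserved_file {
	unsigned long token;
	long refcnt;
};

/*
 * Walk the preserved files and count those with extra references.
 * Prints at most one warning for the lifetime of the program, which is
 * how pr_warn_once() behaves. The freeze itself is not failed.
 */
static int check_refs_at_freeze(const struct preserved_file *files, int n)
{
	static bool warned;
	int busy = 0;

	for (int i = 0; i < n; i++) {
		if (files[i].refcnt == 1)
			continue;
		busy++;
		if (!warned) {
			warned = true;
			fprintf(stderr,
				"luo: token %lu has %ld references at freeze\n",
				files[i].token, files[i].refcnt);
		}
	}
	return busy;
}
```

The point of warning instead of returning an error is that freeze() still
proceeds when userspace has not closed everything.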

Pasha

* Re: [RFC v2 05/16] luo: luo_core: integrate with KHO
  2025-06-13 14:58       ` Pratyush Yadav
@ 2025-06-17 15:23         ` Jason Gunthorpe
  2025-06-17 19:32           ` Pasha Tatashin
  0 siblings, 1 reply; 102+ messages in thread
From: Jason Gunthorpe @ 2025-06-17 15:23 UTC (permalink / raw)
  To: Pratyush Yadav
  Cc: Pasha Tatashin, jasonmiu, graf, changyuanl, rppt, dmatlack,
	rientjes, corbet, rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl,
	masahiroy, akpm, tj, yoann.congal, mmaurer, roman.gushchin,
	chenridong, axboe, mark.rutland, jannh, vincent.guittot, hannes,
	dan.j.williams, david, joel.granados, rostedt, anna.schumaker,
	song, zhangguopeng, linux, linux-kernel, linux-doc, linux-mm,
	gregkh, tglx, mingo, bp, dave.hansen, x86, hpa, rafael, dakr,
	bartosz.golaszewski, cw00.choi, myungjoo.ham, yesanishhere,
	Jonathan.Cameron, quic_zijuhu, aleksander.lobakin, ira.weiny,
	andriy.shevchenko, leon, lukas, bhelgaas, wagi, djeffery,
	stuart.w.hayes

On Fri, Jun 13, 2025 at 04:58:27PM +0200, Pratyush Yadav wrote:
> On Sat, Jun 07 2025, Pasha Tatashin wrote:
> [...]
> >>
> >> This weirdness happens because luo_prepare() and luo_cancel() control
> >> the KHO state machine, but then also get controlled by it via the
> >> notifier callbacks. So the relationship between then is not clear.
> >> __luo_prepare() at least needs access to struct kho_serialization, so it
> >> needs to come from the callback. So I don't have a clear way to clean
> >> this all up off the top of my head.
> >
> > On production machine, without KHO_DEBUGFS, only LUO can control KHO
> > state, but if debugfs is enabled, KHO can be finalized manually, and
> > in this case LUO transitions to prepared state. In both cases, the
> > path is identical. The KHO debugfs path is only for
> > developers/debugging purposes.
> 
> What I meant is that even without KHO_DEBUGFS, LUO drives KHO, but then
> KHO calls into LUO from the notifier, which makes the control flow
> somewhat convoluted. If LUO is supposed to be the only thing that
> interacts directly with KHO, maybe we should get rid of the notifier and
> only let LUO drive things.

Yes, we should. I think we should consider the KHO notifiers and self
orchestration as obsoleted by LUO. That's why it was in debugfs
because we were not ready to commit to it.

Jason

* Re: [RFC v2 05/16] luo: luo_core: integrate with KHO
  2025-06-17 15:23         ` Jason Gunthorpe
@ 2025-06-17 19:32           ` Pasha Tatashin
  2025-06-18 13:11             ` Pratyush Yadav
  0 siblings, 1 reply; 102+ messages in thread
From: Pasha Tatashin @ 2025-06-17 19:32 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Pratyush Yadav, jasonmiu, graf, changyuanl, rppt, dmatlack,
	rientjes, corbet, rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl,
	masahiroy, akpm, tj, yoann.congal, mmaurer, roman.gushchin,
	chenridong, axboe, mark.rutland, jannh, vincent.guittot, hannes,
	dan.j.williams, david, joel.granados, rostedt, anna.schumaker,
	song, zhangguopeng, linux, linux-kernel, linux-doc, linux-mm,
	gregkh, tglx, mingo, bp, dave.hansen, x86, hpa, rafael, dakr,
	bartosz.golaszewski, cw00.choi, myungjoo.ham, yesanishhere,
	Jonathan.Cameron, quic_zijuhu, aleksander.lobakin, ira.weiny,
	andriy.shevchenko, leon, lukas, bhelgaas, wagi, djeffery,
	stuart.w.hayes

On Tue, Jun 17, 2025 at 11:24 AM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Fri, Jun 13, 2025 at 04:58:27PM +0200, Pratyush Yadav wrote:
> > On Sat, Jun 07 2025, Pasha Tatashin wrote:
> > [...]
> > >>
> > >> This weirdness happens because luo_prepare() and luo_cancel() control
> > >> the KHO state machine, but then also get controlled by it via the
> > >> notifier callbacks. So the relationship between then is not clear.
> > >> __luo_prepare() at least needs access to struct kho_serialization, so it
> > >> needs to come from the callback. So I don't have a clear way to clean
> > >> this all up off the top of my head.
> > >
> > > On production machine, without KHO_DEBUGFS, only LUO can control KHO
> > > state, but if debugfs is enabled, KHO can be finalized manually, and
> > > in this case LUO transitions to prepared state. In both cases, the
> > > path is identical. The KHO debugfs path is only for
> > > developers/debugging purposes.
> >
> > What I meant is that even without KHO_DEBUGFS, LUO drives KHO, but then
> > KHO calls into LUO from the notifier, which makes the control flow
> > somewhat convoluted. If LUO is supposed to be the only thing that
> > interacts directly with KHO, maybe we should get rid of the notifier and
> > only let LUO drive things.
>
> Yes, we should. I think we should consider the KHO notifiers and self
> orchestration as obsoleted by LUO. That's why it was in debugfs
> because we were not ready to commit to it.

We could do that; however, there is one example KHO user,
`reserve_mem`, that is not liveupdate related. So it should
either be removed or modified to be handled by LUO.

Mike, what do you think?

Pasha

* Re: [RFC v2 05/16] luo: luo_core: integrate with KHO
  2025-06-17 19:32           ` Pasha Tatashin
@ 2025-06-18 13:11             ` Pratyush Yadav
  2025-06-18 14:48               ` Pasha Tatashin
  0 siblings, 1 reply; 102+ messages in thread
From: Pratyush Yadav @ 2025-06-18 13:11 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: Jason Gunthorpe, Pratyush Yadav, jasonmiu, graf, changyuanl, rppt,
	dmatlack, rientjes, corbet, rdunlap, ilpo.jarvinen, kanie, ojeda,
	aliceryhl, masahiroy, akpm, tj, yoann.congal, mmaurer,
	roman.gushchin, chenridong, axboe, mark.rutland, jannh,
	vincent.guittot, hannes, dan.j.williams, david, joel.granados,
	rostedt, anna.schumaker, song, zhangguopeng, linux, linux-kernel,
	linux-doc, linux-mm, gregkh, tglx, mingo, bp, dave.hansen, x86,
	hpa, rafael, dakr, bartosz.golaszewski, cw00.choi, myungjoo.ham,
	yesanishhere, Jonathan.Cameron, quic_zijuhu, aleksander.lobakin,
	ira.weiny, andriy.shevchenko, leon, lukas, bhelgaas, wagi,
	djeffery, stuart.w.hayes

On Tue, Jun 17 2025, Pasha Tatashin wrote:

> On Tue, Jun 17, 2025 at 11:24 AM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>>
>> On Fri, Jun 13, 2025 at 04:58:27PM +0200, Pratyush Yadav wrote:
>> > On Sat, Jun 07 2025, Pasha Tatashin wrote:
>> > [...]
>> > >>
>> > >> This weirdness happens because luo_prepare() and luo_cancel() control
>> > >> the KHO state machine, but then also get controlled by it via the
>> > >> notifier callbacks. So the relationship between then is not clear.
>> > >> __luo_prepare() at least needs access to struct kho_serialization, so it
>> > >> needs to come from the callback. So I don't have a clear way to clean
>> > >> this all up off the top of my head.
>> > >
>> > > On production machine, without KHO_DEBUGFS, only LUO can control KHO
>> > > state, but if debugfs is enabled, KHO can be finalized manually, and
>> > > in this case LUO transitions to prepared state. In both cases, the
>> > > path is identical. The KHO debugfs path is only for
>> > > developers/debugging purposes.
>> >
>> > What I meant is that even without KHO_DEBUGFS, LUO drives KHO, but then
>> > KHO calls into LUO from the notifier, which makes the control flow
>> > somewhat convoluted. If LUO is supposed to be the only thing that
>> > interacts directly with KHO, maybe we should get rid of the notifier and
>> > only let LUO drive things.
>>
>> Yes, we should. I think we should consider the KHO notifiers and self
>> orchestration as obsoleted by LUO. That's why it was in debugfs
>> because we were not ready to commit to it.
>
> We could do that, however, there is one example KHO user
> `reserve_mem`, that is also not liveupdate related. So, it should
> either be removed or modified to be handled by LUO.

It still depends on kho_finalize() being called, so it still needs
something to trigger its serialization. It is not automatic. And with
your proposed patch to make debugfs interface optional, it can't even be
used with the config disabled.

So if it must be explicitly triggered to be preserved, why not let the
trigger point be LUO instead of KHO? You can make reservemem a LUO
subsystem instead.

Although to be honest, things like reservemem (or IMA perhaps?) don't
really fit well with the explicit trigger mechanism. They can be carried
across kexec without needing userspace explicitly driving it. Maybe we
allow LUO subsystems to mark themselves as auto-preservable and LUO will
preserve them regardless of state being prepared? Something to think
about later down the line I suppose.
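
To make the auto-preservable idea a bit more concrete, here is a rough
userspace model of it, not kernel code. The struct luo_subsys_model,
its auto_preserve flag, and luo_model_collect() are all hypothetical
names made up for illustration; nothing in this series defines them.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a LUO subsystem; not the real structure. */
struct luo_subsys_model {
	const char *name;
	bool auto_preserve;	/* assumed flag: preserve without "prepare" */
};

/*
 * Collect the subsystems that would be carried across kexec.
 * In the prepared state everything is preserved; otherwise only
 * the subsystems that marked themselves auto-preservable are.
 * Returns the number of entries written to out[].
 */
static size_t luo_model_collect(const struct luo_subsys_model *subs, size_t n,
				bool prepared,
				const struct luo_subsys_model **out)
{
	size_t count = 0;

	for (size_t i = 0; i < n; i++) {
		if (prepared || subs[i].auto_preserve)
			out[count++] = &subs[i];
	}
	return count;
}
```

With something like this, a reserve_mem-style entry would be picked up
even when userspace never asks for the prepared state, while the FD
based participants would still need the explicit trigger.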

-- 
Regards,
Pratyush Yadav


* Re: [RFC v2 09/16] luo: luo_files: implement file systems callbacks
  2025-06-16 14:57             ` Pasha Tatashin
@ 2025-06-18 13:16               ` Pratyush Yadav
  0 siblings, 0 replies; 102+ messages in thread
From: Pratyush Yadav @ 2025-06-18 13:16 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: Pratyush Yadav, jasonmiu, graf, changyuanl, rppt, dmatlack,
	rientjes, corbet, rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl,
	masahiroy, akpm, tj, yoann.congal, mmaurer, roman.gushchin,
	chenridong, axboe, mark.rutland, jannh, vincent.guittot, hannes,
	dan.j.williams, david, joel.granados, rostedt, anna.schumaker,
	song, zhangguopeng, linux, linux-kernel, linux-doc, linux-mm,
	gregkh, tglx, mingo, bp, dave.hansen, x86, hpa, rafael, dakr,
	bartosz.golaszewski, cw00.choi, myungjoo.ham, yesanishhere,
	Jonathan.Cameron, quic_zijuhu, aleksander.lobakin, ira.weiny,
	andriy.shevchenko, leon, lukas, bhelgaas, wagi, djeffery,
	stuart.w.hayes

On Mon, Jun 16 2025, Pasha Tatashin wrote:

> On Mon, Jun 16, 2025 at 6:43 AM Pratyush Yadav <pratyush@kernel.org> wrote:
>>
>> On Fri, Jun 13 2025, Pasha Tatashin wrote:
>>
>> > On Fri, Jun 13, 2025 at 11:18 AM Pratyush Yadav <pratyush@kernel.org> wrote:
>> >>
>> >> On Sun, Jun 08 2025, Pasha Tatashin wrote:
>> >>
>> >> > On Thu, Jun 5, 2025 at 12:04 PM Pratyush Yadav <pratyush@kernel.org> wrote:
>> >> >>
>> >> >> On Thu, May 15 2025, Pasha Tatashin wrote:
>> >> >>
>> >> >> > Implements the core logic within luo_files.c to invoke the prepare,
>> >> >> > reboot, finish, and cancel callbacks for preserved file instances,
>> >> >> > replacing the previous stub implementations. It also handles
>> >> >> > the persistence and retrieval of the u64 data payload associated with
>> >> >> > each file via the LUO FDT.
>> >> >> >
>> >> >> > This completes the core mechanism enabling registered filesystem
>> >> >> > handlers to actively manage file state across the live update
>> >> >> > transition using the LUO framework.
>> >> >> >
>> >> >> > Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
>> >> >> > ---
>> >> >> >  drivers/misc/liveupdate/luo_files.c | 105 +++++++++++++++++++++++++++-
>> >> >> >  1 file changed, 103 insertions(+), 2 deletions(-)
>> >> >> >
>> >> >> [...]
>> >> >> > @@ -305,7 +369,29 @@ int luo_do_files_prepare_calls(void)
>> >> >> >   */
>> >> >> >  int luo_do_files_freeze_calls(void)
>> >> >> >  {
>> >> >> > -     return 0;
>> >> >> > +     unsigned long token;
>> >> >> > +     struct luo_file *h;
>> >> >> > +     int ret;
>> >> >> > +
>> >> >> > +     xa_for_each(&luo_files_xa_out, token, h) {
>> >> >>
>> >> >> Should we also ensure at this point that there are no open handles to
>> >> >> this file? How else would a file system ensure the file is in quiescent
>> >> >> state to do its final serialization?
>> >> >
>> >> > Do you mean check refcnt here? If so, this is a good idea, but first
>> >> > we need to implement the lifecycle of liveupdate agent correctly,
>> >> > where owner of FD must survive through entering into reboot() with
>> >> > /dev/liveupdate still open.
>> >>
>> >> Yes, by this point we should ensure refcnt == 1. IIUC you plan to
>> >> implement the lifecycle change in the next revision, so this can be
>> >> added there as well I suppose.
>> >
>> > Yes, I am working on that. Current, WIP patch looks like this:
>> > https://github.com/soleen/linux/commit/fecf912d8b70acd23d24185a8c0504764e43a279
>> >
>> > However, I am not sure about refcnt == 1 at freeze() time. We can have
>> > programs, that never terminated while we were still in userspace (i.e.
>> > kexec -e -> reboot() -> freeze()), in that case refcnt can be anything
>> > at the time of freeze, no?
>>
>> Do you mean the agent that controls the liveupdate session? Then in that
> Yes
>> case the agent can keep running with the /dev/liveupdate FD open, but it
>> must close all of the FDs preserved via LUO before doing kexec -e.
>
> Right, but in this case the agent would have to basically kill all the

Or the participating processes can be cooperative and simply exit
cleanly, or at least close the FDs before triggering the kexec. The
whole live update process needs a lot of parts to cooperate anyway.

> processes that registered FDs through it prior to 'kexec -e', I am not
> sure it is its job. However, we can add a pr_warn_once() when refcnt
> != 1; I think this is a minor change. Let's do that once we have a more
> developed userspace setup. We need to start working on liveupdated

Sure, makes sense.

> that would store and restore FDs through some sort of RPC calls.

I have been playing around with some ideas on how to do this. Will try
some things out and see if I can come up with a PoC soon.
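
For reference, the "warn but do not fail" behaviour discussed above
could look roughly like the userspace model below. It is only a sketch:
struct preserved_file_model and freeze_check_refcnt() are hypothetical
names, the refcnt field stands in for whatever reference count the real
code would inspect, and fprintf() stands in for pr_warn_once().

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for a preserved file; not struct luo_file. */
struct preserved_file_model {
	unsigned long token;
	long refcnt;	/* references held on the file at freeze time */
};

/*
 * Walk the preserved files and count those with extra references.
 * Mimics pr_warn_once(): the warning is printed at most once over
 * the lifetime of the program, and freeze is not failed.
 */
static size_t freeze_check_refcnt(const struct preserved_file_model *files,
				  size_t n)
{
	static bool warned;
	size_t busy = 0;

	for (size_t i = 0; i < n; i++) {
		if (files[i].refcnt == 1)
			continue;
		busy++;
		if (!warned) {
			warned = true;
			fprintf(stderr,
				"luo: token %lu has refcnt %ld at freeze\n",
				files[i].token, files[i].refcnt);
		}
	}
	return busy;
}
```

Returning the count rather than an error keeps the policy decision
(ignore, warn, or eventually refuse to freeze) with the caller.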

-- 
Regards,
Pratyush Yadav


* Re: [RFC v2 05/16] luo: luo_core: integrate with KHO
  2025-06-18 13:11             ` Pratyush Yadav
@ 2025-06-18 14:48               ` Pasha Tatashin
  2025-06-18 16:40                 ` Mike Rapoport
  0 siblings, 1 reply; 102+ messages in thread
From: Pasha Tatashin @ 2025-06-18 14:48 UTC (permalink / raw)
  To: Pratyush Yadav
  Cc: Jason Gunthorpe, jasonmiu, graf, changyuanl, rppt, dmatlack,
	rientjes, corbet, rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl,
	masahiroy, akpm, tj, yoann.congal, mmaurer, roman.gushchin,
	chenridong, axboe, mark.rutland, jannh, vincent.guittot, hannes,
	dan.j.williams, david, joel.granados, rostedt, anna.schumaker,
	song, zhangguopeng, linux, linux-kernel, linux-doc, linux-mm,
	gregkh, tglx, mingo, bp, dave.hansen, x86, hpa, rafael, dakr,
	bartosz.golaszewski, cw00.choi, myungjoo.ham, yesanishhere,
	Jonathan.Cameron, quic_zijuhu, aleksander.lobakin, ira.weiny,
	andriy.shevchenko, leon, lukas, bhelgaas, wagi, djeffery,
	stuart.w.hayes

On Wed, Jun 18, 2025 at 9:12 AM Pratyush Yadav <pratyush@kernel.org> wrote:
>
> On Tue, Jun 17 2025, Pasha Tatashin wrote:
>
> > On Tue, Jun 17, 2025 at 11:24 AM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> >>
> >> On Fri, Jun 13, 2025 at 04:58:27PM +0200, Pratyush Yadav wrote:
> >> > On Sat, Jun 07 2025, Pasha Tatashin wrote:
> >> > [...]
> >> > >>
> >> > >> This weirdness happens because luo_prepare() and luo_cancel() control
> >> > >> the KHO state machine, but then also get controlled by it via the
> >> > >> notifier callbacks. So the relationship between then is not clear.
> >> > >> __luo_prepare() at least needs access to struct kho_serialization, so it
> >> > >> needs to come from the callback. So I don't have a clear way to clean
> >> > >> this all up off the top of my head.
> >> > >
> >> > > On production machine, without KHO_DEBUGFS, only LUO can control KHO
> >> > > state, but if debugfs is enabled, KHO can be finalized manually, and
> >> > > in this case LUO transitions to prepared state. In both cases, the
> >> > > path is identical. The KHO debugfs path is only for
> >> > > developers/debugging purposes.
> >> >
> >> > What I meant is that even without KHO_DEBUGFS, LUO drives KHO, but then
> >> > KHO calls into LUO from the notifier, which makes the control flow
> >> > somewhat convoluted. If LUO is supposed to be the only thing that
> >> > interacts directly with KHO, maybe we should get rid of the notifier and
> >> > only let LUO drive things.
> >>
> >> Yes, we should. I think we should consider the KHO notifiers and self
> >> orchestration as obsoleted by LUO. That's why it was in debugfs
> >> because we were not ready to commit to it.
> >
> > We could do that, however, there is one example KHO user
> > `reserve_mem`, that is also not liveupdate related. So, it should
> > either be removed or modified to be handled by LUO.
>
> It still depends on kho_finalize() being called, so it still needs
> something to trigger its serialization. It is not automatic. And with
> your proposed patch to make debugfs interface optional, it can't even be
> used with the config disabled.

At least for now, it can still be used via LUO going into the prepared
state, since LUO moves KHO into the finalized state and reserve_mem is
registered to be called back from KHO.
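
As a simplified userspace model of that flow (a sketch only, not the
kernel notifier API): struct kho_chain_model and the model_*()
functions are hypothetical names. LUO entering prepare triggers
finalize, and finalize runs every registered callback, including the
one standing in for reserve_mem.

```c
#include <stddef.h>

#define MODEL_MAX_NOTIFIERS 8

typedef int (*model_notifier_fn)(void *data);

/* Hypothetical notifier chain; the kernel's notifier API is richer. */
struct kho_chain_model {
	model_notifier_fn fn[MODEL_MAX_NOTIFIERS];
	void *data[MODEL_MAX_NOTIFIERS];
	size_t nr;
	int finalized;
};

/* Add a callback to the chain. Returns 0, or -1 if the chain is full. */
static int model_register(struct kho_chain_model *kho, model_notifier_fn fn,
			  void *data)
{
	if (kho->nr == MODEL_MAX_NOTIFIERS)
		return -1;
	kho->fn[kho->nr] = fn;
	kho->data[kho->nr] = data;
	kho->nr++;
	return 0;
}

/* Model of finalize: run every registered callback, stop on error. */
static int model_finalize(struct kho_chain_model *kho)
{
	for (size_t i = 0; i < kho->nr; i++) {
		int ret = kho->fn[i](kho->data[i]);

		if (ret)
			return ret;
	}
	kho->finalized = 1;
	return 0;
}

/* Model of LUO entering prepare: it is the one that triggers finalize. */
static int model_luo_prepare(struct kho_chain_model *kho)
{
	return model_finalize(kho);
}

/* Example callback standing in for reserve_mem's serialization. */
static int model_reserve_mem_cb(void *data)
{
	*(int *)data += 1;	/* count how many times we were called */
	return 0;
}
```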

> So if it must be explicitly triggered to be preserved, why not let the
> trigger point be LUO instead of KHO? You can make reservemem a LUO
> subsystem instead.

Yes, LUO can do that; the only concern I raised is that `reserve_mem`
is not really related to live update.

> Although to be honest, things like reservemem (or IMA perhaps?) don't
> really fit well with the explicit trigger mechanism. They can be carried

Agreed. Another example I was thinking about is "kexec telemetry":
precise timing information about kexec, including shutdown, purgatory,
and boot. We are planning to propose kexec telemetry, and it could be a
LUO subsystem. On the other hand, it could be useful even without live
update, just to measure the precise kexec reboot time.

> across kexec without needing userspace explicitly driving it. Maybe we
> allow LUO subsystems to mark themselves as auto-preservable and LUO will
> preserve them regardless of state being prepared? Something to think
> about later down the line I suppose.

We can start by adding `reserve_mem` as a regular subsystem, and make
the auto-preserve option a future extension, if and when it is needed.
Presumably, `luoctl prepare` would work for whoever plans to use just
`reserve_mem`.


* Re: [RFC v2 05/16] luo: luo_core: integrate with KHO
  2025-06-18 14:48               ` Pasha Tatashin
@ 2025-06-18 16:40                 ` Mike Rapoport
  2025-06-18 17:00                   ` Pasha Tatashin
  0 siblings, 1 reply; 102+ messages in thread
From: Mike Rapoport @ 2025-06-18 16:40 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: Pratyush Yadav, Jason Gunthorpe, jasonmiu, graf, changyuanl,
	dmatlack, rientjes, corbet, rdunlap, ilpo.jarvinen, kanie, ojeda,
	aliceryhl, masahiroy, akpm, tj, yoann.congal, mmaurer,
	roman.gushchin, chenridong, axboe, mark.rutland, jannh,
	vincent.guittot, hannes, dan.j.williams, david, joel.granados,
	rostedt, anna.schumaker, song, zhangguopeng, linux, linux-kernel,
	linux-doc, linux-mm, gregkh, tglx, mingo, bp, dave.hansen, x86,
	hpa, rafael, dakr, bartosz.golaszewski, cw00.choi, myungjoo.ham,
	yesanishhere, Jonathan.Cameron, quic_zijuhu, aleksander.lobakin,
	ira.weiny, andriy.shevchenko, leon, lukas, bhelgaas, wagi,
	djeffery, stuart.w.hayes

On Wed, Jun 18, 2025 at 10:48:09AM -0400, Pasha Tatashin wrote:
> On Wed, Jun 18, 2025 at 9:12 AM Pratyush Yadav <pratyush@kernel.org> wrote:
> >
> > On Tue, Jun 17 2025, Pasha Tatashin wrote:
> >
> > > On Tue, Jun 17, 2025 at 11:24 AM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> > >>
> > >> On Fri, Jun 13, 2025 at 04:58:27PM +0200, Pratyush Yadav wrote:
> > >> > On Sat, Jun 07 2025, Pasha Tatashin wrote:
> > >> > [...]
> > >> > >>
> > >> > >> This weirdness happens because luo_prepare() and luo_cancel() control
> > >> > >> the KHO state machine, but then also get controlled by it via the
> > >> > >> notifier callbacks. So the relationship between then is not clear.
> > >> > >> __luo_prepare() at least needs access to struct kho_serialization, so it
> > >> > >> needs to come from the callback. So I don't have a clear way to clean
> > >> > >> this all up off the top of my head.
> > >> > >
> > >> > > On production machine, without KHO_DEBUGFS, only LUO can control KHO
> > >> > > state, but if debugfs is enabled, KHO can be finalized manually, and
> > >> > > in this case LUO transitions to prepared state. In both cases, the
> > >> > > path is identical. The KHO debugfs path is only for
> > >> > > developers/debugging purposes.
> > >> >
> > >> > What I meant is that even without KHO_DEBUGFS, LUO drives KHO, but then
> > >> > KHO calls into LUO from the notifier, which makes the control flow
> > >> > somewhat convoluted. If LUO is supposed to be the only thing that
> > >> > interacts directly with KHO, maybe we should get rid of the notifier and
> > >> > only let LUO drive things.
> > >>
> > >> Yes, we should. I think we should consider the KHO notifiers and self
> > >> orchestration as obsoleted by LUO. That's why it was in debugfs
> > >> because we were not ready to commit to it.
> > >
> > > We could do that, however, there is one example KHO user
> > > `reserve_mem`, that is also not liveupdate related. So, it should
> > > either be removed or modified to be handled by LUO.
> >
> > It still depends on kho_finalize() being called, so it still needs
> > something to trigger its serialization. It is not automatic. And with
> > your proposed patch to make debugfs interface optional, it can't even be
> > used with the config disabled.
> 
> At least for now, it can still be used via LUO going into prepare
> state, since LUO changes KHO into finalized state and reserve_mem is
> registered to be called back from KHO.
> 
> > So if it must be explicitly triggered to be preserved, why not let the
> > trigger point be LUO instead of KHO? You can make reservemem a LUO
> > subsystem instead.
> 
> Yes, LUO can do that, the only concern I raised is that  `reserve_mem`
> is not really live update related.

I only now realized what bothered me about "liveupdate". It's the name of
the driving usecase rather than the name of the technology it implements.
In the end, what LUO provides is a (more) sophisticated control for KHO.

But essentially it's not that it actually implements live update, it
provides kexec handover control plane that enables live update.

And since the same machinery can be used regardless of live update, and I'm
sure other usecases will appear as soon as the technology becomes more
mature, it makes me think that we probably should just
s/liveupdate_/kho_control/g or something along those lines.
 
> > Although to be honest, things like reservemem (or IMA perhaps?) don't
> > really fit well with the explicit trigger mechanism. They can be carried
> 
> Agreed. Another example I was thinking about is "kexec telemetry":
> precise time information about kexec, including shutdown, purgatory,
> boot. We are planning to propose kexec telemetry, and it could be LUO
> subsystem. On the other hand, it could be useful even without live
> update, just to measure precise kexec reboot time.
> 
> > across kexec without needing userspace explicitly driving it. Maybe we
> > allow LUO subsystems to mark themselves as auto-preservable and LUO will
> > preserve them regardless of state being prepared? Something to think
> > about later down the line I suppose.
> 
> We can start with adding `reserve_mem` as regular subsystem, and make
> this auto-preserve option a future expansion, when if needed.
> Presumably, `luoctl prepare` would work for whoever plans to use just
> `reserve_mem`.

I think it would be nice to support auto-preserve sooner rather than later.
reserve_mem can already be useful for ftrace and pstore folks, and it would
be great if it survived a kexec without any userspace intervention.

-- 
Sincerely yours,
Mike.


* Re: [RFC v2 05/16] luo: luo_core: integrate with KHO
  2025-06-18 16:40                 ` Mike Rapoport
@ 2025-06-18 17:00                   ` Pasha Tatashin
  2025-06-18 17:43                     ` Pasha Tatashin
  0 siblings, 1 reply; 102+ messages in thread
From: Pasha Tatashin @ 2025-06-18 17:00 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Pratyush Yadav, Jason Gunthorpe, jasonmiu, graf, changyuanl,
	dmatlack, rientjes, corbet, rdunlap, ilpo.jarvinen, kanie, ojeda,
	aliceryhl, masahiroy, akpm, tj, yoann.congal, mmaurer,
	roman.gushchin, chenridong, axboe, mark.rutland, jannh,
	vincent.guittot, hannes, dan.j.williams, david, joel.granados,
	rostedt, anna.schumaker, song, zhangguopeng, linux, linux-kernel,
	linux-doc, linux-mm, gregkh, tglx, mingo, bp, dave.hansen, x86,
	hpa, rafael, dakr, bartosz.golaszewski, cw00.choi, myungjoo.ham,
	yesanishhere, Jonathan.Cameron, quic_zijuhu, aleksander.lobakin,
	ira.weiny, andriy.shevchenko, leon, lukas, bhelgaas, wagi,
	djeffery, stuart.w.hayes

On Wed, Jun 18, 2025 at 12:40 PM Mike Rapoport <rppt@kernel.org> wrote:
>
> On Wed, Jun 18, 2025 at 10:48:09AM -0400, Pasha Tatashin wrote:
> > On Wed, Jun 18, 2025 at 9:12 AM Pratyush Yadav <pratyush@kernel.org> wrote:
> > >
> > > On Tue, Jun 17 2025, Pasha Tatashin wrote:
> > >
> > > > On Tue, Jun 17, 2025 at 11:24 AM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> > > >>
> > > >> On Fri, Jun 13, 2025 at 04:58:27PM +0200, Pratyush Yadav wrote:
> > > >> > On Sat, Jun 07 2025, Pasha Tatashin wrote:
> > > >> > [...]
> > > >> > >>
> > > >> > >> This weirdness happens because luo_prepare() and luo_cancel() control
> > > >> > >> the KHO state machine, but then also get controlled by it via the
> > > >> > >> notifier callbacks. So the relationship between then is not clear.
> > > >> > >> __luo_prepare() at least needs access to struct kho_serialization, so it
> > > >> > >> needs to come from the callback. So I don't have a clear way to clean
> > > >> > >> this all up off the top of my head.
> > > >> > >
> > > >> > > On production machine, without KHO_DEBUGFS, only LUO can control KHO
> > > >> > > state, but if debugfs is enabled, KHO can be finalized manually, and
> > > >> > > in this case LUO transitions to prepared state. In both cases, the
> > > >> > > path is identical. The KHO debugfs path is only for
> > > >> > > developers/debugging purposes.
> > > >> >
> > > >> > What I meant is that even without KHO_DEBUGFS, LUO drives KHO, but then
> > > >> > KHO calls into LUO from the notifier, which makes the control flow
> > > >> > somewhat convoluted. If LUO is supposed to be the only thing that
> > > >> > interacts directly with KHO, maybe we should get rid of the notifier and
> > > >> > only let LUO drive things.
> > > >>
> > > >> Yes, we should. I think we should consider the KHO notifiers and self
> > > >> orchestration as obsoleted by LUO. That's why it was in debugfs
> > > >> because we were not ready to commit to it.
> > > >
> > > > We could do that, however, there is one example KHO user
> > > > `reserve_mem`, that is also not liveupdate related. So, it should
> > > > either be removed or modified to be handled by LUO.
> > >
> > > It still depends on kho_finalize() being called, so it still needs
> > > something to trigger its serialization. It is not automatic. And with
> > > your proposed patch to make debugfs interface optional, it can't even be
> > > used with the config disabled.
> >
> > At least for now, it can still be used via LUO going into prepare
> > state, since LUO changes KHO into finalized state and reserve_mem is
> > registered to be called back from KHO.
> >
> > > So if it must be explicitly triggered to be preserved, why not let the
> > > trigger point be LUO instead of KHO? You can make reservemem a LUO
> > > subsystem instead.
> >
> > Yes, LUO can do that, the only concern I raised is that  `reserve_mem`
> > is not really live update related.
>
> I only now realized what bothered me about "liveupdate". It's the name of
> the driving usecase rather then the name of the technology it implements.
> In the end what LUO does is a (more) sophisticated control for KHO.
>
> But essentially it's not that it actually implements live update, it
> provides kexec handover control plane that enables live update.
>
> And since the same machinery can be used regardless of live update, and I'm
> sure other usecases will appear as soon as the technology will become more
> mature, it makes me think that we probably should just
> s/liveupdate_/kho_control/g or something along those lines.

I disagree. LUO is for live update flows and is designed specifically
around them: brownout/blackout/post-liveupdate. It should not be
generalized to anticipate some other random states, and it should only
support participants that are related to live update
(iommufd/vfiofd/kvmfd/memfd/eventfd), controlled via "liveupdated", the
userspace agent.

KHO is for preserving memory; LUO uses KHO as a backbone for Live Update.

> > > Although to be honest, things like reservemem (or IMA perhaps?) don't
> > > really fit well with the explicit trigger mechanism. They can be carried
> >
> > Agreed. Another example I was thinking about is "kexec telemetry":
> > precise time information about kexec, including shutdown, purgatory,
> > boot. We are planning to propose kexec telemetry, and it could be LUO
> > subsystem. On the other hand, it could be useful even without live
> > update, just to measure precise kexec reboot time.
> >
> > > across kexec without needing userspace explicitly driving it. Maybe we
> > > allow LUO subsystems to mark themselves as auto-preservable and LUO will
> > > preserve them regardless of state being prepared? Something to think
> > > about later down the line I suppose.
> >
> > We can start with adding `reserve_mem` as regular subsystem, and make
> > this auto-preserve option a future expansion, when if needed.
> > Presumably, `luoctl prepare` would work for whoever plans to use just
> > `reserve_mem`.
>
> I think it would be nice to support auto-preserve sooner than later.

Makes sense.

> reserve_mem can already be useful for ftrace and pstore folks and if it
> would survive a kexec without any userspace intervention it would be great.

The pstore use case is only potential, correct? Or can it already use
reserve_mem?

Pasha


* Re: [RFC v2 05/16] luo: luo_core: integrate with KHO
  2025-06-18 17:00                   ` Pasha Tatashin
@ 2025-06-18 17:43                     ` Pasha Tatashin
  2025-06-19 12:00                       ` Mike Rapoport
  2025-06-23  7:32                       ` Mike Rapoport
  0 siblings, 2 replies; 102+ messages in thread
From: Pasha Tatashin @ 2025-06-18 17:43 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Pratyush Yadav, Jason Gunthorpe, jasonmiu, graf, changyuanl,
	dmatlack, rientjes, corbet, rdunlap, ilpo.jarvinen, kanie, ojeda,
	aliceryhl, masahiroy, akpm, tj, yoann.congal, mmaurer,
	roman.gushchin, chenridong, axboe, mark.rutland, jannh,
	vincent.guittot, hannes, dan.j.williams, david, joel.granados,
	rostedt, anna.schumaker, song, zhangguopeng, linux, linux-kernel,
	linux-doc, linux-mm, gregkh, tglx, mingo, bp, dave.hansen, x86,
	hpa, rafael, dakr, bartosz.golaszewski, cw00.choi, myungjoo.ham,
	yesanishhere, Jonathan.Cameron, quic_zijuhu, aleksander.lobakin,
	ira.weiny, andriy.shevchenko, leon, lukas, bhelgaas, wagi,
	djeffery, stuart.w.hayes

On Wed, Jun 18, 2025 at 1:00 PM Pasha Tatashin
<pasha.tatashin@soleen.com> wrote:
>
> On Wed, Jun 18, 2025 at 12:40 PM Mike Rapoport <rppt@kernel.org> wrote:
> >
> > On Wed, Jun 18, 2025 at 10:48:09AM -0400, Pasha Tatashin wrote:
> > > On Wed, Jun 18, 2025 at 9:12 AM Pratyush Yadav <pratyush@kernel.org> wrote:
> > > >
> > > > On Tue, Jun 17 2025, Pasha Tatashin wrote:
> > > >
> > > > > On Tue, Jun 17, 2025 at 11:24 AM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> > > > >>
> > > > >> On Fri, Jun 13, 2025 at 04:58:27PM +0200, Pratyush Yadav wrote:
> > > > >> > On Sat, Jun 07 2025, Pasha Tatashin wrote:
> > > > >> > [...]
> > > > >> > >>
> > > > >> > >> This weirdness happens because luo_prepare() and luo_cancel() control
> > > > >> > >> the KHO state machine, but then also get controlled by it via the
> > > > >> > >> notifier callbacks. So the relationship between then is not clear.
> > > > >> > >> __luo_prepare() at least needs access to struct kho_serialization, so it
> > > > >> > >> needs to come from the callback. So I don't have a clear way to clean
> > > > >> > >> this all up off the top of my head.
> > > > >> > >
> > > > >> > > On production machine, without KHO_DEBUGFS, only LUO can control KHO
> > > > >> > > state, but if debugfs is enabled, KHO can be finalized manually, and
> > > > >> > > in this case LUO transitions to prepared state. In both cases, the
> > > > >> > > path is identical. The KHO debugfs path is only for
> > > > >> > > developers/debugging purposes.
> > > > >> >
> > > > >> > What I meant is that even without KHO_DEBUGFS, LUO drives KHO, but then
> > > > >> > KHO calls into LUO from the notifier, which makes the control flow
> > > > >> > somewhat convoluted. If LUO is supposed to be the only thing that
> > > > >> > interacts directly with KHO, maybe we should get rid of the notifier and
> > > > >> > only let LUO drive things.
> > > > >>
> > > > >> Yes, we should. I think we should consider the KHO notifiers and self
> > > > >> orchestration as obsoleted by LUO. That's why it was in debugfs
> > > > >> because we were not ready to commit to it.
> > > > >
> > > > > We could do that, however, there is one example KHO user
> > > > > `reserve_mem`, that is also not liveupdate related. So, it should
> > > > > either be removed or modified to be handled by LUO.
> > > >
> > > > It still depends on kho_finalize() being called, so it still needs
> > > > something to trigger its serialization. It is not automatic. And with
> > > > your proposed patch to make debugfs interface optional, it can't even be
> > > > used with the config disabled.
> > >
> > > At least for now, it can still be used via LUO going into prepare
> > > state, since LUO changes KHO into finalized state and reserve_mem is
> > > registered to be called back from KHO.
> > >
> > > > So if it must be explicitly triggered to be preserved, why not let the
> > > > trigger point be LUO instead of KHO? You can make reservemem a LUO
> > > > subsystem instead.
> > >
> > > Yes, LUO can do that, the only concern I raised is that  `reserve_mem`
> > > is not really live update related.
> >
> > I only now realized what bothered me about "liveupdate". It's the name of
> > the driving usecase rather then the name of the technology it implements.
> > In the end what LUO does is a (more) sophisticated control for KHO.
> >
> > But essentially it's not that it actually implements live update, it
> > provides kexec handover control plane that enables live update.
> >
> > And since the same machinery can be used regardless of live update, and I'm
> > sure other usecases will appear as soon as the technology will become more
> > mature, it makes me think that we probably should just
> > s/liveupdate_/kho_control/g or something along those lines.
>
> I disagree, LUO is for liveupdate flows, and is designed specifically
> around the live update flows: brownout/blackout/post-liveupdate, it
> should not be generalized to anticipate some other random states, and
> it should only support participants that are related to live update:
> iommufd/vfiofd/kvmfd/memfd/eventfd and controled via "liveupdated" the
> userspace agent.
>
> KHO is for preserving memory, LUO uses KHO as a backbone for Live Update.
>
> > > > Although to be honest, things like reservemem (or IMA perhaps?) don't
> > > > really fit well with the explicit trigger mechanism. They can be carried
> > >
> > > Agreed. Another example I was thinking about is "kexec telemetry":
> > > precise time information about kexec, including shutdown, purgatory,
> > > boot. We are planning to propose kexec telemetry, and it could be LUO
> > > subsystem. On the other hand, it could be useful even without live
> > > update, just to measure precise kexec reboot time.
> > >
> > > > across kexec without needing userspace explicitly driving it. Maybe we
> > > > allow LUO subsystems to mark themselves as auto-preservable and LUO will
> > > > preserve them regardless of state being prepared? Something to think
> > > > about later down the line I suppose.
> > >
> > > We can start with adding `reserve_mem` as regular subsystem, and make
> > > this auto-preserve option a future expansion, when if needed.
> > > Presumably, `luoctl prepare` would work for whoever plans to use just
> > > `reserve_mem`.
> >
> > I think it would be nice to support auto-preserve sooner than later.
>
> Makes sense.
>
> > reserve_mem can already be useful for ftrace and pstore folks and if it
> > would survive a kexec without any userspace intervention it would be great.
>
> The pstore use case is only potential, correct? Or can it already use
> reserve_mem?

So currently, KHO provides the following two types of internal API:

Preserve memory and metadata
============================
kho_preserve_folio() / kho_preserve_phys()
kho_unpreserve_folio() / kho_unpreserve_phys()
kho_restore_folio()

kho_add_subtree() / kho_retrieve_subtree()

State machine
=============
register_kho_notifier() / unregister_kho_notifier()

kho_finalize() / kho_abort()

We should remove the "State machine" part and keep only the "Preserve
memory and metadata" API functions. When these functions are called,
KHO itself should make sure that the memory gets preserved across the
reboot.

This way, reserve_mem_init() would call kho_preserve_folio() and
kho_add_subtree() during boot, and be done with it.
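
A minimal userspace model of what "preserved as soon as the call
returns" would mean, with no finalize step in between. This is only a
sketch of the proposal: struct kho_model, kho_model_preserve(),
kho_model_is_preserved() and reserve_mem_model_init() are hypothetical
names, the address and size are arbitrary, and the real KHO
bookkeeping is different.

```c
#include <stddef.h>
#include <stdint.h>

#define KHO_MODEL_MAX 16

/* Hypothetical record of one preserved physical range. */
struct kho_model_range {
	uint64_t phys;
	uint64_t size;
};

/* Hypothetical tracker; the real KHO bookkeeping is different. */
struct kho_model {
	struct kho_model_range ranges[KHO_MODEL_MAX];
	size_t nr;
};

/*
 * Model of a preserve call with no finalize step: the range is
 * recorded, and therefore considered preserved, as soon as the
 * call returns. Returns 0 on success, -1 if the table is full.
 */
static int kho_model_preserve(struct kho_model *kho, uint64_t phys,
			      uint64_t size)
{
	if (kho->nr == KHO_MODEL_MAX)
		return -1;
	kho->ranges[kho->nr].phys = phys;
	kho->ranges[kho->nr].size = size;
	kho->nr++;
	return 0;
}

/* Would this physical address survive a kexec issued right now? */
static int kho_model_is_preserved(const struct kho_model *kho, uint64_t phys)
{
	for (size_t i = 0; i < kho->nr; i++) {
		const struct kho_model_range *r = &kho->ranges[i];

		if (phys >= r->phys && phys - r->phys < r->size)
			return 1;
	}
	return 0;
}

/* Stand-in for a boot-time user such as reserve_mem_init(). */
static int reserve_mem_model_init(struct kho_model *kho)
{
	return kho_model_preserve(kho, 0x100000, 0x4000);
}
```

The point of the model is that the boot-time caller needs neither a
notifier registration nor a later trigger from userspace.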

Pasha


* Re: [RFC v2 05/16] luo: luo_core: integrate with KHO
  2025-06-18 17:43                     ` Pasha Tatashin
@ 2025-06-19 12:00                       ` Mike Rapoport
  2025-06-19 14:22                         ` Pasha Tatashin
  2025-06-23  7:32                       ` Mike Rapoport
  1 sibling, 1 reply; 102+ messages in thread
From: Mike Rapoport @ 2025-06-19 12:00 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: Pratyush Yadav, Jason Gunthorpe, jasonmiu, graf, changyuanl,
	dmatlack, rientjes, corbet, rdunlap, ilpo.jarvinen, kanie, ojeda,
	aliceryhl, masahiroy, akpm, tj, yoann.congal, mmaurer,
	roman.gushchin, chenridong, axboe, mark.rutland, jannh,
	vincent.guittot, hannes, dan.j.williams, david, joel.granados,
	rostedt, anna.schumaker, song, zhangguopeng, linux, linux-kernel,
	linux-doc, linux-mm, gregkh, tglx, mingo, bp, dave.hansen, x86,
	hpa, rafael, dakr, bartosz.golaszewski, cw00.choi, myungjoo.ham,
	yesanishhere, Jonathan.Cameron, quic_zijuhu, aleksander.lobakin,
	ira.weiny, andriy.shevchenko, leon, lukas, bhelgaas, wagi,
	djeffery, stuart.w.hayes

On Wed, Jun 18, 2025 at 01:43:18PM -0400, Pasha Tatashin wrote:
> > > > > >> >
> > > > > >> > What I meant is that even without KHO_DEBUGFS, LUO drives KHO, but then
> > > > > >> > KHO calls into LUO from the notifier, which makes the control flow
> > > > > >> > somewhat convoluted. If LUO is supposed to be the only thing that
> > > > > >> > interacts directly with KHO, maybe we should get rid of the notifier and
> > > > > >> > only let LUO drive things.
> > > > > >>
> > > > > >> Yes, we should. I think we should consider the KHO notifiers and self
> > > > > >> orchestration as obsoleted by LUO. That's why it was in debugfs
> > > > > >> because we were not ready to commit to it.
> > > > > >
> > > > > > We could do that, however, there is one example KHO user
> > > > > > `reserve_mem`, that is also not liveupdate related. So, it should
> > > > > > either be removed or modified to be handled by LUO.
> > > > >
> > > > > It still depends on kho_finalize() being called, so it still needs
> > > > > something to trigger its serialization. It is not automatic. And with
> > > > > your proposed patch to make debugfs interface optional, it can't even be
> > > > > used with the config disabled.
> > > >
> > > > At least for now, it can still be used via LUO going into prepare
> > > > state, since LUO changes KHO into finalized state and reserve_mem is
> > > > registered to be called back from KHO.
> > > >
> > > > > So if it must be explicitly triggered to be preserved, why not let the
> > > > > trigger point be LUO instead of KHO? You can make reservemem a LUO
> > > > > subsystem instead.
> > > >
> > > > Yes, LUO can do that, the only concern I raised is that  `reserve_mem`
> > > > is not really live update related.
> > >
> > > I only now realized what bothered me about "liveupdate". It's the name of
> > > the driving usecase rather then the name of the technology it implements.
> > > In the end what LUO does is a (more) sophisticated control for KHO.
> > >
> > > But essentially it's not that it actually implements live update, it
> > > provides kexec handover control plane that enables live update.
> > >
> > > And since the same machinery can be used regardless of live update, and I'm
> > > sure other usecases will appear as soon as the technology will become more
> > > mature, it makes me think that we probably should just
> > > s/liveupdate_/kho_control/g or something along those lines.
> >
> > I disagree, LUO is for liveupdate flows, and is designed specifically
> > around the live update flows: brownout/blackout/post-liveupdate, it
> > should not be generalized to anticipate some other random states, and
> > it should only support participants that are related to live update:
> > iommufd/vfiofd/kvmfd/memfd/eventfd and controlled via "liveupdated" the
> > userspace agent.

But that's not how things work. Once there's an API, anyone can use it,
right?

How do you intend to restrict this API usage to subsystems that are related
to the live update flow? Or userspace driving ioctls outside "liveupdated"
user agent?

There are a lot of examples of kernel subsystems that were designed for a
particular thing and later were extended to support additional use cases.

I'm not saying LUO should "anticipate some other random states", what I'm
saying is that usecases other than liveupdate may appear and use the APIs
LUO provides for something else.

> > KHO is for preserving memory, LUO uses KHO as a backbone for Live Update.

If we make LUO the only uABI to drive KHO it becomes misnamed from the
start.
As you mentioned yourself, reserve_mem and potentially IMA and kexec
telemetry are not necessarily related to LUO, but it still would be useful
to support them without LUO.

While it's easy to make memblock a LUO subsystem, to me the naming seems
semantically wrong.

> > > > > Although to be honest, things like reservemem (or IMA perhaps?) don't
> > > > > really fit well with the explicit trigger mechanism. They can be carried
> > > >
> > > > Agreed. Another example I was thinking about is "kexec telemetry":
> > > > precise time information about kexec, including shutdown, purgatory,
> > > > boot. We are planning to propose kexec telemetry, and it could be LUO
> > > > subsystem. On the other hand, it could be useful even without live
> > > > update, just to measure precise kexec reboot time.
> > > >
> > > > > across kexec without needing userspace explicitly driving it. Maybe we
> > > > > allow LUO subsystems to mark themselves as auto-preservable and LUO will
> > > > > preserve them regardless of state being prepared? Something to think
> > > > > about later down the line I suppose.
> > > >
> > > > We can start with adding `reserve_mem` as regular subsystem, and make
> > > > this auto-preserve option a future expansion, when if needed.
> > > > Presumably, `luoctl prepare` would work for whoever plans to use just
> > > > `reserve_mem`.
> > >
> > > I think it would be nice to support auto-preserve sooner than later.
> >
> > Makes sense.
> >
> > > reserve_mem can already be useful for ftrace and pstore folks and if it
> > > would survive a kexec without any userspace intervention it would be great.
> >
> > The pstore use case is only potential, correct? Or can it already use
> > reserve_mem?

pstore can use reserve_mem already.
 
> So currently, KHO provides the following two types of  internal API:
> 
> Preserve memory and metadata
> =========================
> kho_preserve_folio() / kho_preserve_phys()
> kho_unpreserve_folio() / kho_unpreserve_phys()
> kho_restore_folio()
> 
> kho_add_subtree() kho_retrieve_subtree()
> 
> State machine
> ===========
> register_kho_notifier() / unregister_kho_notifier()
> 
> kho_finalize() / kho_abort()
> 
> We should remove the "State machine", and only keep the "Preserve
> Memory" API functions. At the time these functions are called, KHO
> should do the magic of making sure that the memory gets preserved
> across the reboot.
> 
> This way, reserve_mem_init() would call: kho_preserve_folio() and
> kho_add_subtree() during boot, and be done with it.

Right, but we still need something to drive kho_mem_serialize().
And it has to be done before kexec load, at least until we resolve this.

Currently this is triggered either by KHO debugfs or by LUO ioctls. If we
completely drop KHO debugfs and notifiers, we still need something that
would trigger the magic.

I'm not saying we should keep KHO debugfs and notifiers, I'm saying that if
we make LUO the only thing driving KHO, liveupdate is not an appropriate
name.

> Pasha
> 

-- 
Sincerely yours,
Mike.

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [RFC v2 05/16] luo: luo_core: integrate with KHO
  2025-06-19 12:00                       ` Mike Rapoport
@ 2025-06-19 14:22                         ` Pasha Tatashin
  2025-06-20 15:28                           ` Pratyush Yadav
  0 siblings, 1 reply; 102+ messages in thread
From: Pasha Tatashin @ 2025-06-19 14:22 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Pratyush Yadav, Jason Gunthorpe, jasonmiu, graf, changyuanl,
	dmatlack, rientjes, corbet, rdunlap, ilpo.jarvinen, kanie, ojeda,
	aliceryhl, masahiroy, akpm, tj, yoann.congal, mmaurer,
	roman.gushchin, chenridong, axboe, mark.rutland, jannh,
	vincent.guittot, hannes, dan.j.williams, david, joel.granados,
	rostedt, anna.schumaker, song, zhangguopeng, linux, linux-kernel,
	linux-doc, linux-mm, gregkh, tglx, mingo, bp, dave.hansen, x86,
	hpa, rafael, dakr, bartosz.golaszewski, cw00.choi, myungjoo.ham,
	yesanishhere, Jonathan.Cameron, quic_zijuhu, aleksander.lobakin,
	ira.weiny, andriy.shevchenko, leon, lukas, bhelgaas, wagi,
	djeffery, stuart.w.hayes

> > > I disagree, LUO is for liveupdate flows, and is designed specifically
> > > around the live update flows: brownout/blackout/post-liveupdate, it
> > > should not be generalized to anticipate some other random states, and
> > > it should only support participants that are related to live update:
> > > iommufd/vfiofd/kvmfd/memfd/eventfd and controlled via "liveupdated" the
> > > userspace agent.
>
> But that's not how things work. Once there's an API, anyone can use it,
> right?
>
> How do you intend to restrict this API usage to subsystems that are related
> to the live update flow? Or userspace driving ioctls outside "liveupdated"
> user agent?

Hi Mike,

LUO provides both kernel and user APIs specifically for live update
scenarios. Live Update is the ability to reboot the kernel while keeping
some device operations and FDs intact. That is the only uAPI that LUO
provides. It enables users to preserve resources via FDs for memfd,
vfiofd, guestmemfd, kvmfd, eventfd, and any other supported FD. It
also provides a well-defined state machine for the user to add and
retrieve the resources, and for the kernel to properly serialize these
resources. Since this is the only uAPI that LUO provides, I do not see
how it can be used for other scenarios.

> There are a lot of examples of kernel subsystems that were designed for a
> particular thing and later were extended to support additional use cases.

If that ever becomes necessary, either the core part would need to be
moved out to be a separate thing, or a separate state machine on top
of KHO targeting that use case would need to be developed.

Currently, I don't see an immediate need for this, especially if KHO
itself is updated so the state machine is removed, and therefore
finalization is not required.

> I'm not saying LUO should "anticipate some other random states", what I'm
> saying is that usecases other than liveupdate may appear and use the APIs
> LUO provides for something else.
>
> > > KHO is for preserving memory, LUO uses KHO as a backbone for Live Update.
>
> If we make LUO the only uABI to drive KHO it becomes misnamed from the
> start.
> As you mentioned yourself, reserve_mem and potentially IMA and kexec

Kernel-internal components like pstore/reserve_mem or IMA do not
require a uAPI to drive their KHO interactions. They can, and should,
directly use KHO's kernel-level APIs kho_preserve_folio() and
kho_restore_folio().

KHO itself must offer these preservation primitives, rather than
embedding a state machine that dictates a single "finalize" point for
all users.

> pstore can use reserve_mem already.

That's good to know; I'll investigate how pstore currently utilizes
reserve_mem. My current approach involves reserving the memmap for
pstore via kernel parameters.

> > So currently, KHO provides the following two types of  internal API:
> >
> > Preserve memory and metadata
> > =========================
> > kho_preserve_folio() / kho_preserve_phys()
> > kho_unpreserve_folio() / kho_unpreserve_phys()
> > kho_restore_folio()
> >
> > kho_add_subtree() kho_retrieve_subtree()
> >
> > State machine
> > ===========
> > register_kho_notifier() / unregister_kho_notifier()
> >
> > kho_finalize() / kho_abort()
> >
> > We should remove the "State machine", and only keep the "Preserve
> > Memory" API functions. At the time these functions are called, KHO
> > should do the magic of making sure that the memory gets preserved
> > across the reboot.
> >
> > This way, reserve_mem_init() would call: kho_preserve_folio() and
> > kho_add_subtree() during boot, and be done with it.
>
> Right, but we still need something to drive kho_mem_serialize().

My view is that an explicit, global kho_mem_serialize() call driven
externally (like by LUO or debugfs) is not necessary for KHO
operations.

When kho_preserve_folio() or kho_add_subtree() is called, KHO itself
should perform the immediate actions required to ensure that specific
folio or subtree metadata is staged for preservation across a kexec.
Similarly, kho_unpreserve_folio() or kho_remove_subtree() (which is
currently missing from the KHO API) should immediately update KHO's
state to reflect that the item is no longer preserved.
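
To make the proposed behaviour concrete, here is a minimal userspace
model, not kernel code. Every name in it (model_kho, model_preserve(),
model_unpreserve(), model_staged_count()) is a hypothetical stand-in and
not part of the KHO API. It only illustrates the idea that each
preserve/unpreserve call updates the tracked state immediately, so no
later global "finalize" pass is needed.

```c
/* Userspace model only: none of these names exist in the kernel. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_MAX_RANGES 16

struct model_range {
	uint64_t phys;	/* start of the preserved range */
	uint64_t size;	/* length in bytes */
	bool used;	/* slot currently holds a staged range */
};

/* Hypothetical stand-in for KHO's tracking state. */
struct model_kho {
	struct model_range ranges[MODEL_MAX_RANGES];
};

/* Stage a range right away; returns 0 on success, -1 if the table is full. */
int model_preserve(struct model_kho *kho, uint64_t phys, uint64_t size)
{
	for (size_t i = 0; i < MODEL_MAX_RANGES; i++) {
		if (!kho->ranges[i].used) {
			kho->ranges[i].phys = phys;
			kho->ranges[i].size = size;
			kho->ranges[i].used = true;
			return 0;
		}
	}
	return -1;
}

/* Drop a staged range right away; returns -1 if it was not staged. */
int model_unpreserve(struct model_kho *kho, uint64_t phys)
{
	for (size_t i = 0; i < MODEL_MAX_RANGES; i++) {
		if (kho->ranges[i].used && kho->ranges[i].phys == phys) {
			kho->ranges[i].used = false;
			return 0;
		}
	}
	return -1;
}

/* Number of ranges that would be handed over if kexec happened now. */
size_t model_staged_count(const struct model_kho *kho)
{
	size_t n = 0;

	for (size_t i = 0; i < MODEL_MAX_RANGES; i++)
		n += kho->ranges[i].used ? 1 : 0;
	return n;
}
```

In this model the handover set is valid after every call, which is the
property the paragraph above asks of KHO itself.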

> And it has to be done before kexec load, at least until we resolve this.

The before-kexec-load constraint has been fixed. The only
"finalization" constraint we have is it should be before
reboot(LINUX_REBOOT_CMD_KEXEC) and only because memory allocations
during kernel shutdown are undesirable. Once KHO moves away from a
monolithic state machine this constraint disappears. Kernel components
could preserve their resources at appropriate times, not necessarily
tied to a shutdown-time. For live update scenarios, LUO already
orchestrates this timing.

> Currently this is triggered either by KHO debugfs or by LUO ioctls. If we
> completely drop KHO debugfs and notifiers, we still need something that
> would trigger the magic.

An external "magic trigger" for KHO (like the current finalize
notifier or debugfs command) is necessary for scenarios like live
update, where userspace resources are being preserved in a coordinated
fashion just before kexec.

For kernel-internal resources that are unrelated to such a
userspace-driven live update flow, the respective kernel components
should directly use KHO's primitive preservation APIs
(kho_preserve_folio, etc.) when they need to mark their resources for
handover. No separate state machine or external trigger should be
required for these individual, self-contained preservation acts.

> I'm not saying we should keep KHO debugfs and notifiers, I'm saying that if
> we make LUO the only thing driving KHO, liveupdate is not an appropriate
> name.

LUO drives KHO specifically for the purpose of live updates. If a
different userspace use-case emerges that needs another distinct
purpose (e.g., not to preserve an FD or a device across kernel reboot
(i.e. something for which LUO does not provide uAPI)), then that would
probably need a uAPI separate from LUO instead of extending the LUO
uAPI.

Pasha

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [RFC v2 05/16] luo: luo_core: integrate with KHO
  2025-06-19 14:22                         ` Pasha Tatashin
@ 2025-06-20 15:28                           ` Pratyush Yadav
  2025-06-20 16:03                             ` Pasha Tatashin
  0 siblings, 1 reply; 102+ messages in thread
From: Pratyush Yadav @ 2025-06-20 15:28 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: Mike Rapoport, Pratyush Yadav, Jason Gunthorpe, jasonmiu, graf,
	changyuanl, dmatlack, rientjes, corbet, rdunlap, ilpo.jarvinen,
	kanie, ojeda, aliceryhl, masahiroy, akpm, tj, yoann.congal,
	mmaurer, roman.gushchin, chenridong, axboe, mark.rutland, jannh,
	vincent.guittot, hannes, dan.j.williams, david, joel.granados,
	rostedt, anna.schumaker, song, zhangguopeng, linux, linux-kernel,
	linux-doc, linux-mm, gregkh, tglx, mingo, bp, dave.hansen, x86,
	hpa, rafael, dakr, bartosz.golaszewski, cw00.choi, myungjoo.ham,
	yesanishhere, Jonathan.Cameron, quic_zijuhu, aleksander.lobakin,
	ira.weiny, andriy.shevchenko, leon, lukas, bhelgaas, wagi,
	djeffery, stuart.w.hayes

Hi Pasha,

On Thu, Jun 19 2025, Pasha Tatashin wrote:

[...]
>> And it has to be done before kexec load, at least until we resolve this.
>
> The before-kexec-load constraint has been fixed. The only
> "finalization" constraint we have is it should be before
> reboot(LINUX_REBOOT_CMD_KEXEC) and only because memory allocations
> during kernel shutdown are undesirable. Once KHO moves away from a
> monolithic state machine this constraint disappears. Kernel components
> could preserve their resources at appropriate times, not necessarily
> tied to a shutdown-time. For live update scenarios, LUO already
> orchestrates this timing.
>
>> Currently this is triggered either by KHO debugfs or by LUO ioctls. If we
>> completely drop KHO debugfs and notifiers, we still need something that
>> would trigger the magic.
>
> An external "magic trigger" for KHO (like the current finalize
> notifier or debugfs command) is necessary for scenarios like live
> update, where userspace resources are being preserved in a coordinated
> fashion just before kexec.
>
> For kernel-internal resources that are unrelated to such a
> userspace-driven live update flow, the respective kernel components
> should directly use KHO's primitive preservation APIs
> (kho_preserve_folio, etc.) when they need to mark their resources for
> handover. No separate state machine or external trigger should be
> required for these individual, self-contained preservation acts.

For kernel-internal components, I think this makes a lot of sense,
especially now that we don't need to get everything done by kexec load
time. I suppose the liveupdate_reboot() call at reboot time to prepare
final things can be useful, but subsystems can just as well register
reboot notifiers to get the same notification.

>
>> I'm not saying we should keep KHO debugfs and notifiers, I'm saying that if
>> we make LUO the only thing driving KHO, liveupdate is not an appropriate
>> name.
>
> LUO drives KHO specifically for the purpose of live updates. If a
> different userspace use-case emerges that needs another distinct
> purpose (e.g., not to preserve an FD or a device across kernel reboot
> (i.e. something for which LUO does not provide uAPI)), then that would
> probably need a uAPI separate from LUO instead of extending the LUO
> uAPI.

Outside of hypervisor live update, I have a very clear use case in mind:
userspace memory handover (on guest side). Say a guest running an
in-memory cache like memcached with many gigabytes of cache wants to
reboot. It can just shove the cache into a memfd, give it to LUO, and
restore it after reboot. Some services that suffer from long reboots are
looking into using this to reduce downtime. Since it pretty much
overlaps with the hypervisor work for now, I haven't been talking about
it as much.
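
A rough sketch of the first half of that flow, under stated assumptions:
cache_to_memfd() is a hypothetical helper name, and the step that hands
the fd to LUO is deliberately left out because the ioctl details are not
part of this message. Only memfd_create(), ftruncate(), mmap() and
friends are real interfaces here.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Copy an in-memory cache into a freshly created memfd.
 * Returns the fd on success, -1 on error.
 * Passing the fd to LUO for preservation is NOT shown.
 */
int cache_to_memfd(const void *buf, size_t len)
{
	void *map;
	int fd;

	if (len == 0)
		return -1;

	fd = memfd_create("cache-handover", 0);
	if (fd < 0)
		return -1;

	/* Size the file, then fill it through a shared mapping. */
	if (ftruncate(fd, (off_t)len) < 0)
		goto err_close;

	map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (map == MAP_FAILED)
		goto err_close;

	memcpy(map, buf, len);
	munmap(map, len);
	return fd;

err_close:
	close(fd);
	return -1;
}
```

A real service would more likely allocate its cache in the memfd from the
start rather than copy into it; the copy just keeps the sketch short.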

Would you also call this use case "live update"? Does it also fit with
your vision of where LUO should go?

If not, why do you think we should have a parallel set of uAPIs that do
similar work? Why can't we accommodate other use cases under one API,
especially as long as they don't have conflicting goals? In practice,
outside of s/luo/khoctl/g, I don't think much would change as of now.
The state machine and APIs will stay the same.

When those use cases start to diverge from the liveupdate, or conflict
with it, we can then decide to have a separate interface for them, but
when going the other way round, we won't end up with a somewhat
confusing name for a more widely applicable technology.

I've been thinking about the naming since the start, but I didn't want
to bikeshed on it too much. But if we are also talking about the scope
of LUO, then I think this is a conversation worth having.

PS: I don't have real data, but I have a feeling that after luo/khoctl
    mature, more use cases will come out of the woodwork to optimize
    reboots.

-- 
Regards,
Pratyush Yadav

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [RFC v2 05/16] luo: luo_core: integrate with KHO
  2025-06-20 15:28                           ` Pratyush Yadav
@ 2025-06-20 16:03                             ` Pasha Tatashin
  2025-06-24 16:12                               ` Pratyush Yadav
  0 siblings, 1 reply; 102+ messages in thread
From: Pasha Tatashin @ 2025-06-20 16:03 UTC (permalink / raw)
  To: Pratyush Yadav
  Cc: Mike Rapoport, Jason Gunthorpe, jasonmiu, graf, changyuanl,
	dmatlack, rientjes, corbet, rdunlap, ilpo.jarvinen, kanie, ojeda,
	aliceryhl, masahiroy, akpm, tj, yoann.congal, mmaurer,
	roman.gushchin, chenridong, axboe, mark.rutland, jannh,
	vincent.guittot, hannes, dan.j.williams, david, joel.granados,
	rostedt, anna.schumaker, song, zhangguopeng, linux, linux-kernel,
	linux-doc, linux-mm, gregkh, tglx, mingo, bp, dave.hansen, x86,
	hpa, rafael, dakr, bartosz.golaszewski, cw00.choi, myungjoo.ham,
	yesanishhere, Jonathan.Cameron, quic_zijuhu, aleksander.lobakin,
	ira.weiny, andriy.shevchenko, leon, lukas, bhelgaas, wagi,
	djeffery, stuart.w.hayes

On Fri, Jun 20, 2025 at 11:28 AM Pratyush Yadav <pratyush@kernel.org> wrote:
>
> Hi Pasha,
>
> On Thu, Jun 19 2025, Pasha Tatashin wrote:
>
> [...]
> >> And it has to be done before kexec load, at least until we resolve this.
> >
> > The before-kexec-load constraint has been fixed. The only
> > "finalization" constraint we have is it should be before
> > reboot(LINUX_REBOOT_CMD_KEXEC) and only because memory allocations
> > during kernel shutdown are undesirable. Once KHO moves away from a
> > monolithic state machine this constraint disappears. Kernel components
> > could preserve their resources at appropriate times, not necessarily
> > tied to a shutdown-time. For live update scenarios, LUO already
> > orchestrates this timing.
> >
> >> Currently this is triggered either by KHO debugfs or by LUO ioctls. If we
> >> completely drop KHO debugfs and notifiers, we still need something that
> >> would trigger the magic.
> >
> > An external "magic trigger" for KHO (like the current finalize
> > notifier or debugfs command) is necessary for scenarios like live
> > update, where userspace resources are being preserved in a coordinated
> > fashion just before kexec.
> >
> > For kernel-internal resources that are unrelated to such a
> > userspace-driven live update flow, the respective kernel components
> > should directly use KHO's primitive preservation APIs
> > (kho_preserve_folio, etc.) when they need to mark their resources for
> > handover. No separate state machine or external trigger should be
> > required for these individual, self-contained preservation acts.
>

Hi Pratyush,

> For kernel-internal components, I think this makes a lot of sense,
> especially now that we don't need to get everything done by kexec load
> time. I suppose the liveupdate_reboot() call at reboot time to prepare
> final things can be useful, but subsystems can just as well register
> reboot notifiers to get the same notification.

Correct. If subsystems unrelated to the userspace live update flow,
such as pstore, tracing, telemetry, debugging, or IMA, need to be
notified about a reboot, they can simply register their own reboot
notifier.
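
As a rough userspace model of that pattern (in the kernel the real entry
point is register_reboot_notifier() taking a struct notifier_block; the
model_* names below are hypothetical stand-ins), each subsystem registers
its own callback and the chain is walked once at reboot time:

```c
/* Userspace model of a notifier chain; not the kernel implementation. */
#include <assert.h>
#include <stddef.h>

#define MODEL_MAX_NOTIFIERS 8

typedef int (*model_notifier_fn)(unsigned long event, void *data);

struct model_chain {
	model_notifier_fn fns[MODEL_MAX_NOTIFIERS];
	size_t count;
};

/* Add a callback to the chain; returns -1 if the chain is full. */
int model_chain_register(struct model_chain *chain, model_notifier_fn fn)
{
	if (chain->count == MODEL_MAX_NOTIFIERS)
		return -1;
	chain->fns[chain->count++] = fn;
	return 0;
}

/* Invoke every registered callback; returns how many were called. */
size_t model_chain_call(struct model_chain *chain, unsigned long event,
			void *data)
{
	for (size_t i = 0; i < chain->count; i++)
		chain->fns[i](event, data);
	return chain->count;
}

/* Example subscriber: counts the reboot events it has seen. */
int model_on_reboot(unsigned long event, void *data)
{
	(void)event;
	*(int *)data += 1;
	return 0;
}
```

The point of the model is only that the subscriber needs no LUO state
machine: it is told about the reboot and does its own preservation work.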

> >> I'm not saying we should keep KHO debugfs and notifiers, I'm saying that if
> >> we make LUO the only thing driving KHO, liveupdate is not an appropriate
> >> name.
> >
> > LUO drives KHO specifically for the purpose of live updates. If a
> > different userspace use-case emerges that needs another distinct
> > purpose (e.g., not to preserve an FD or a device across kernel reboot
> > (i.e. something for which LUO does not provide uAPI)), then that would
> > probably need a uAPI separate from LUO instead of extending the LUO
> > uAPI.
>
> Outside of hypervisor live update, I have a very clear use case in mind:
> userspace memory handover (on guest side). Say a guest running an
> in-memory cache like memcached with many gigabytes of cache wants to
> reboot. It can just shove the cache into a memfd, give it to LUO, and
> restore it after reboot. Some services that suffer from long reboots are
> looking into using this to reduce downtime. Since it pretty much
> overlaps with the hypervisor work for now, I haven't been talking about
> it as much.
>
> Would you also call this use case "live update"? Does it also fit with
> your vision of where LUO should go?

Yes, absolutely. The use case you described (preserving a memcached
instance via memfd) is a perfect fit for LUO's vision.

While the primary use case driving this work is supporting the
preservation of virtual machines on a hypervisor, the framework itself
is not restricted to that scenario. We define "live update" as the
process of updating the kernel from one version to another while
preserving FD-based resources and keeping selected devices
operational. The machine itself can be running storage, database,
networking, containers, or anything else.

A good parallel is Kernel Live Patching: we don't distinguish what
workload is running on a machine when applying a security patch; we
simply patch the running kernel. In the same way, Live Update is
designed to be workload-agnostic. Whether the system is running an
in-memory database, containers, or VMs, its primary goal is to enable
a full kernel update while preserving the userspace-requested state.

Thanks,
Pasha

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [RFC v2 05/16] luo: luo_core: integrate with KHO
  2025-06-18 17:43                     ` Pasha Tatashin
  2025-06-19 12:00                       ` Mike Rapoport
@ 2025-06-23  7:32                       ` Mike Rapoport
  2025-06-23 11:29                         ` Pasha Tatashin
  1 sibling, 1 reply; 102+ messages in thread
From: Mike Rapoport @ 2025-06-23  7:32 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: Pratyush Yadav, Jason Gunthorpe, jasonmiu, graf, changyuanl,
	dmatlack, rientjes, corbet, rdunlap, ilpo.jarvinen, kanie, ojeda,
	aliceryhl, masahiroy, akpm, tj, yoann.congal, mmaurer,
	roman.gushchin, chenridong, axboe, mark.rutland, jannh,
	vincent.guittot, hannes, dan.j.williams, david, joel.granados,
	rostedt, anna.schumaker, song, zhangguopeng, linux, linux-kernel,
	linux-doc, linux-mm, gregkh, tglx, mingo, bp, dave.hansen, x86,
	hpa, rafael, dakr, bartosz.golaszewski, cw00.choi, myungjoo.ham,
	yesanishhere, Jonathan.Cameron, quic_zijuhu, aleksander.lobakin,
	ira.weiny, andriy.shevchenko, leon, lukas, bhelgaas, wagi,
	djeffery, stuart.w.hayes

On Wed, Jun 18, 2025 at 01:43:18PM -0400, Pasha Tatashin wrote:
> On Wed, Jun 18, 2025 at 1:00 PM Pasha Tatashin
>
> So currently, KHO provides the following two types of  internal API:
> 
> Preserve memory and metadata
> =========================
> kho_preserve_folio() / kho_preserve_phys()
> kho_unpreserve_folio() / kho_unpreserve_phys()
> kho_restore_folio()
> 
> kho_add_subtree() kho_retrieve_subtree()
> 
> State machine
> ===========
> register_kho_notifier() / unregister_kho_notifier()
> 
> kho_finalize() / kho_abort()
> 
> We should remove the "State machine", and only keep the "Preserve
> Memory" API functions. At the time these functions are called, KHO
> should do the magic of making sure that the memory gets preserved
> across the reboot.
> 
> This way, reserve_mem_init() would call: kho_preserve_folio() and
> kho_add_subtree() during boot, and be done with it.

I agree that there's no need for notifiers.

I even have a half-cooked patch for this on top of "kho: allow to drive kho
from within kernel"

From 02716e4731480bde997a9c1676b7246aa8e358de Mon Sep 17 00:00:00 2001
From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
Date: Sun, 22 Jun 2025 14:37:17 +0300
Subject: [PATCH] kho: drop notifiers

Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
---
 include/linux/kexec_handover.h   |  27 +-------
 kernel/kexec_handover.c          | 114 ++++++++++++++-----------------
 kernel/kexec_handover_debug.c    |   3 +-
 kernel/kexec_handover_internal.h |   3 +-
 mm/memblock.c                    |  56 +++------------
 5 files changed, 65 insertions(+), 138 deletions(-)

diff --git a/include/linux/kexec_handover.h b/include/linux/kexec_handover.h
index f98565def593..ac9cb6eae71f 100644
--- a/include/linux/kexec_handover.h
+++ b/include/linux/kexec_handover.h
@@ -10,14 +10,7 @@ struct kho_scratch {
 	phys_addr_t size;
 };
 
-/* KHO Notifier index */
-enum kho_event {
-	KEXEC_KHO_FINALIZE = 0,
-	KEXEC_KHO_ABORT = 1,
-};
-
 struct folio;
-struct notifier_block;
 
 #define DECLARE_KHOSER_PTR(name, type) \
 	union {                        \
@@ -36,20 +29,15 @@ struct notifier_block;
 		(typeof((s).ptr))((s).phys ? phys_to_virt((s).phys) : NULL); \
 	})
 
-struct kho_serialization;
-
 #ifdef CONFIG_KEXEC_HANDOVER
 bool kho_is_enabled(void);
 
 int kho_preserve_folio(struct folio *folio);
 int kho_preserve_phys(phys_addr_t phys, size_t size);
 struct folio *kho_restore_folio(phys_addr_t phys);
-int kho_add_subtree(struct kho_serialization *ser, const char *name, void *fdt);
+int kho_add_subtree(const char *name, void *fdt);
 int kho_retrieve_subtree(const char *name, phys_addr_t *phys);
 
-int register_kho_notifier(struct notifier_block *nb);
-int unregister_kho_notifier(struct notifier_block *nb);
-
 void kho_memory_init(void);
 
 void kho_populate(phys_addr_t fdt_phys, u64 fdt_len, phys_addr_t scratch_phys,
@@ -79,8 +67,7 @@ static inline struct folio *kho_restore_folio(phys_addr_t phys)
 	return NULL;
 }
 
-static inline int kho_add_subtree(struct kho_serialization *ser,
-				  const char *name, void *fdt)
+static inline int kho_add_subtree(const char *name, void *fdt)
 {
 	return -EOPNOTSUPP;
 }
@@ -90,16 +77,6 @@ static inline int kho_retrieve_subtree(const char *name, phys_addr_t *phys)
 	return -EOPNOTSUPP;
 }
 
-static inline int register_kho_notifier(struct notifier_block *nb)
-{
-	return -EOPNOTSUPP;
-}
-
-static inline int unregister_kho_notifier(struct notifier_block *nb)
-{
-	return -EOPNOTSUPP;
-}
-
 static inline void kho_memory_init(void)
 {
 }
diff --git a/kernel/kexec_handover.c b/kernel/kexec_handover.c
index 176eaf2c31ab..b609eaf92550 100644
--- a/kernel/kexec_handover.c
+++ b/kernel/kexec_handover.c
@@ -15,7 +15,6 @@
 #include <linux/libfdt.h>
 #include <linux/list.h>
 #include <linux/memblock.h>
-#include <linux/notifier.h>
 #include <linux/page-isolation.h>
 
 #include <asm/early_ioremap.h>
@@ -552,7 +551,6 @@ static void __init kho_reserve_scratch(void)
 
 /**
  * kho_add_subtree - record the physical address of a sub FDT in KHO root tree.
- * @ser: serialization control object passed by KHO notifiers.
  * @name: name of the sub tree.
  * @fdt: the sub tree blob.
  *
@@ -566,11 +564,12 @@ static void __init kho_reserve_scratch(void)
  *
  * Return: 0 on success, error code on failure
  */
-int kho_add_subtree(struct kho_serialization *ser, const char *name, void *fdt)
+int kho_add_subtree(const char *name, void *fdt)
 {
+	struct kho_serialization *ser = &kho_out.ser;
 	int err = 0;
 	u64 phys = (u64)virt_to_phys(fdt);
-	void *root = page_to_virt(ser->fdt);
+	void *root = ser->fdt;
 
 	err |= fdt_begin_node(root, name);
 	err |= fdt_property(root, PROP_SUB_FDT, &phys, sizeof(phys));
@@ -584,7 +583,6 @@ int kho_add_subtree(struct kho_serialization *ser, const char *name, void *fdt)
 EXPORT_SYMBOL_GPL(kho_add_subtree);
 
 struct kho_out kho_out = {
-	.chain_head = BLOCKING_NOTIFIER_INIT(kho_out.chain_head),
 	.lock = __MUTEX_INITIALIZER(kho_out.lock),
 	.ser = {
 		.fdt_list = LIST_HEAD_INIT(kho_out.ser.fdt_list),
@@ -595,18 +593,6 @@ struct kho_out kho_out = {
 	.finalized = false,
 };
 
-int register_kho_notifier(struct notifier_block *nb)
-{
-	return blocking_notifier_chain_register(&kho_out.chain_head, nb);
-}
-EXPORT_SYMBOL_GPL(register_kho_notifier);
-
-int unregister_kho_notifier(struct notifier_block *nb)
-{
-	return blocking_notifier_chain_unregister(&kho_out.chain_head, nb);
-}
-EXPORT_SYMBOL_GPL(unregister_kho_notifier);
-
 /**
  * kho_preserve_folio - preserve a folio across kexec.
  * @folio: folio to preserve.
@@ -676,7 +662,6 @@ EXPORT_SYMBOL_GPL(kho_preserve_phys);
 
 int __kho_abort(void)
 {
-	int err;
 	unsigned long order;
 	struct kho_mem_phys *physxa;
 
@@ -697,44 +682,15 @@ int __kho_abort(void)
 		kho_out.ser.preserved_mem_map = NULL;
 	}
 
-	err = blocking_notifier_call_chain(&kho_out.chain_head, KEXEC_KHO_ABORT,
-					   NULL);
-	err = notifier_to_errno(err);
-
-	if (err)
-		pr_err("Failed to abort KHO finalization: %d\n", err);
-
-	return err;
+	return 0;
 }
 
 int __kho_finalize(void)
 {
 	int err = 0;
-	u64 *preserved_mem_map;
-	void *fdt = page_to_virt(kho_out.ser.fdt);
-
-	err |= fdt_create(fdt, PAGE_SIZE);
-	err |= fdt_finish_reservemap(fdt);
-	err |= fdt_begin_node(fdt, "");
-	err |= fdt_property_string(fdt, "compatible", KHO_FDT_COMPATIBLE);
-	/**
-	 * Reserve the preserved-memory-map property in the root FDT, so
-	 * that all property definitions will precede subnodes created by
-	 * KHO callers.
-	 */
-	err |= fdt_property_placeholder(fdt, PROP_PRESERVED_MEMORY_MAP,
-					sizeof(*preserved_mem_map),
-					(void **)&preserved_mem_map);
-	if (err)
-		goto abort;
+	void *fdt = kho_out.ser.fdt;
 
-	err = kho_preserve_folio(page_folio(kho_out.ser.fdt));
-	if (err)
-		goto abort;
-
-	err = blocking_notifier_call_chain(&kho_out.chain_head,
-					   KEXEC_KHO_FINALIZE, &kho_out.ser);
-	err = notifier_to_errno(err);
+	err = kho_preserve_folio(page_folio(virt_to_page(kho_out.ser.fdt)));
 	if (err)
 		goto abort;
 
@@ -742,7 +698,7 @@ int __kho_finalize(void)
 	if (err)
 		goto abort;
 
-	*preserved_mem_map = (u64)virt_to_phys(kho_out.ser.preserved_mem_map);
+	*kho_out.ser.fdt_mem_map = (u64)virt_to_phys(kho_out.ser.preserved_mem_map);
 
 	err |= fdt_end_node(fdt);
 	err |= fdt_finish(fdt);
@@ -863,19 +819,13 @@ static __init int kho_init(void)
 	if (!kho_enable)
 		return 0;
 
-	kho_out.ser.fdt = alloc_page(GFP_KERNEL);
-	if (!kho_out.ser.fdt) {
-		err = -ENOMEM;
-		goto err_free_scratch;
-	}
-
 	err = kho_debugfs_init();
 	if (err)
-		goto err_free_fdt;
+		goto err_free_scratch;
 
 	err = kho_out_debugfs_init();
 	if (err)
-		goto err_free_fdt;
+		goto err_free_scratch;
 
 	if (fdt) {
 		kho_in_debugfs_init(fdt);
@@ -894,9 +844,6 @@ static __init int kho_init(void)
 
 	return 0;
 
-err_free_fdt:
-	put_page(kho_out.ser.fdt);
-	kho_out.ser.fdt = NULL;
 err_free_scratch:
 	for (int i = 0; i < kho_scratch_cnt; i++) {
 		void *start = __va(kho_scratch[i].addr);
@@ -933,10 +880,50 @@ static void __init kho_release_scratch(void)
 	}
 }
 
+static int __init kho_out_fdt_init(void)
+{
+	void *fdt;
+	int err = 0;
+
+	fdt = memblock_alloc(PAGE_SIZE, PAGE_SIZE);
+	if (!fdt)
+		return -ENOMEM;
+
+	err |= fdt_create(fdt, PAGE_SIZE);
+	err |= fdt_finish_reservemap(fdt);
+	err |= fdt_begin_node(fdt, "");
+	err |= fdt_property_string(fdt, "compatible", KHO_FDT_COMPATIBLE);
+	/**
+	 * Reserve the preserved-memory-map property in the root FDT, so
+	 * that all property definitions will precede subnodes created by
+	 * KHO callers.
+	 */
+	err |= fdt_property_placeholder(fdt, PROP_PRESERVED_MEMORY_MAP,
+					sizeof(*kho_out.ser.fdt_mem_map),
+					(void **)&kho_out.ser.fdt_mem_map);
+	if (err)
+		goto err_free_fdt;
+
+	kho_out.ser.fdt = fdt;
+	return 0;
+
+err_free_fdt:
+	memblock_free(fdt, PAGE_SIZE);
+	return err;
+}
+
 void __init kho_memory_init(void)
 {
 	struct folio *folio;
 
+	int err = kho_out_fdt_init();
+
+	if (err) {
+		pr_err("failed to allocate root FDT, disabling KHO\n");
+		kho_enable = false;
+		return;
+	}
+
 	if (kho_in.scratch_phys) {
 		kho_scratch = phys_to_virt(kho_in.scratch_phys);
 		kho_release_scratch();
@@ -1008,6 +995,7 @@ void __init kho_populate(phys_addr_t fdt_phys, u64 fdt_len,
 	}
 
 	memblock_reserve(scratch_phys, scratch_len);
+	memblock_reserve(fdt_phys, PAGE_SIZE);
 
 	/*
 	 * Now that we have a viable region of scratch memory, let's tell
@@ -1043,7 +1031,7 @@ int kho_fill_kimage(struct kimage *image)
 	if (!kho_enable)
 		return 0;
 
-	image->kho.fdt = page_to_phys(kho_out.ser.fdt);
+	image->kho.fdt = virt_to_phys(kho_out.ser.fdt);
 
 	scratch_size = sizeof(*kho_scratch) * kho_scratch_cnt;
 	scratch = (struct kexec_buf){
diff --git a/kernel/kexec_handover_debug.c b/kernel/kexec_handover_debug.c
index a15c238ec98e..a34997a1adae 100644
--- a/kernel/kexec_handover_debug.c
+++ b/kernel/kexec_handover_debug.c
@@ -62,8 +62,7 @@ int kho_out_update_debugfs_fdt(void)
 
 	if (kho_out.finalized) {
 		err = __kho_debugfs_fdt_add(&kho_out.ser.fdt_list, kho_out.dir,
-					    "fdt",
-					    page_to_virt(kho_out.ser.fdt));
+					    "fdt", kho_out.ser.fdt);
 	} else {
 		list_for_each_entry_safe(ff, tmp, &kho_out.ser.fdt_list, list) {
 			debugfs_remove(ff->file);
diff --git a/kernel/kexec_handover_internal.h b/kernel/kexec_handover_internal.h
index 0b534758d39d..bf78ecb06996 100644
--- a/kernel/kexec_handover_internal.h
+++ b/kernel/kexec_handover_internal.h
@@ -16,7 +16,8 @@ struct kho_mem_track {
 };
 
 struct kho_serialization {
-	struct page *fdt;
+	void *fdt;
+	u64 *fdt_mem_map;
 	struct list_head fdt_list;
 	struct kho_mem_track track;
 	/* First chunk of serialized preserved memory map */
diff --git a/mm/memblock.c b/mm/memblock.c
index 154f1d73b61f..6af0b51b1bb7 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -2501,51 +2501,18 @@ int reserve_mem_release_by_name(const char *name)
 #define MEMBLOCK_KHO_FDT "memblock"
 #define MEMBLOCK_KHO_NODE_COMPATIBLE "memblock-v1"
 #define RESERVE_MEM_KHO_NODE_COMPATIBLE "reserve-mem-v1"
-static struct page *kho_fdt;
-
-static int reserve_mem_kho_finalize(struct kho_serialization *ser)
-{
-	int err = 0, i;
-
-	for (i = 0; i < reserved_mem_count; i++) {
-		struct reserve_mem_table *map = &reserved_mem_table[i];
-
-		err |= kho_preserve_phys(map->start, map->size);
-	}
-
-	err |= kho_preserve_folio(page_folio(kho_fdt));
-	err |= kho_add_subtree(ser, MEMBLOCK_KHO_FDT, page_to_virt(kho_fdt));
-
-	return notifier_from_errno(err);
-}
-
-static int reserve_mem_kho_notifier(struct notifier_block *self,
-				    unsigned long cmd, void *v)
-{
-	switch (cmd) {
-	case KEXEC_KHO_FINALIZE:
-		return reserve_mem_kho_finalize((struct kho_serialization *)v);
-	case KEXEC_KHO_ABORT:
-		return NOTIFY_DONE;
-	default:
-		return NOTIFY_BAD;
-	}
-}
-
-static struct notifier_block reserve_mem_kho_nb = {
-	.notifier_call = reserve_mem_kho_notifier,
-};
 
 static int __init prepare_kho_fdt(void)
 {
 	int err = 0, i;
+	struct page *fdt_page;
 	void *fdt;
 
-	kho_fdt = alloc_page(GFP_KERNEL);
-	if (!kho_fdt)
+	fdt_page = alloc_page(GFP_KERNEL);
+	if (!fdt_page)
 		return -ENOMEM;
 
-	fdt = page_to_virt(kho_fdt);
+	fdt = page_to_virt(fdt_page);
 
 	err |= fdt_create(fdt, PAGE_SIZE);
 	err |= fdt_finish_reservemap(fdt);
@@ -2555,6 +2522,7 @@ static int __init prepare_kho_fdt(void)
 	for (i = 0; i < reserved_mem_count; i++) {
 		struct reserve_mem_table *map = &reserved_mem_table[i];
 
+		err |= kho_preserve_phys(map->start, map->size);
 		err |= fdt_begin_node(fdt, map->name);
 		err |= fdt_property_string(fdt, "compatible", RESERVE_MEM_KHO_NODE_COMPATIBLE);
 		err |= fdt_property(fdt, "start", &map->start, sizeof(map->start));
@@ -2562,13 +2530,14 @@ static int __init prepare_kho_fdt(void)
 		err |= fdt_end_node(fdt);
 	}
 	err |= fdt_end_node(fdt);
-
 	err |= fdt_finish(fdt);
 
+	err |= kho_preserve_folio(page_folio(fdt_page));
+	err |= kho_add_subtree(MEMBLOCK_KHO_FDT, fdt);
+
 	if (err) {
 		pr_err("failed to prepare memblock FDT for KHO: %d\n", err);
-		put_page(kho_fdt);
-		kho_fdt = NULL;
+		put_page(fdt_page);
 	}
 
 	return err;
@@ -2584,13 +2553,6 @@ static int __init reserve_mem_init(void)
 	err = prepare_kho_fdt();
 	if (err)
 		return err;
-
-	err = register_kho_notifier(&reserve_mem_kho_nb);
-	if (err) {
-		put_page(kho_fdt);
-		kho_fdt = NULL;
-	}
-
 	return err;
 }
 late_initcall(reserve_mem_init);
-- 
2.47.2




> Pasha
> 

-- 
Sincerely yours,
Mike.


* Re: [RFC v2 05/16] luo: luo_core: integrate with KHO
  2025-06-23  7:32                       ` Mike Rapoport
@ 2025-06-23 11:29                         ` Pasha Tatashin
  2025-06-25 13:46                           ` Mike Rapoport
  0 siblings, 1 reply; 102+ messages in thread
From: Pasha Tatashin @ 2025-06-23 11:29 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Pratyush Yadav, Jason Gunthorpe, jasonmiu, graf, changyuanl,
	dmatlack, rientjes, corbet, rdunlap, ilpo.jarvinen, kanie, ojeda,
	aliceryhl, masahiroy, akpm, tj, yoann.congal, mmaurer,
	roman.gushchin, chenridong, axboe, mark.rutland, jannh,
	vincent.guittot, hannes, dan.j.williams, david, joel.granados,
	rostedt, anna.schumaker, song, zhangguopeng, linux, linux-kernel,
	linux-doc, linux-mm, gregkh, tglx, mingo, bp, dave.hansen, x86,
	hpa, rafael, dakr, bartosz.golaszewski, cw00.choi, myungjoo.ham,
	yesanishhere, Jonathan.Cameron, quic_zijuhu, aleksander.lobakin,
	ira.weiny, andriy.shevchenko, leon, lukas, bhelgaas, wagi,
	djeffery, stuart.w.hayes

On Mon, Jun 23, 2025 at 3:32 AM Mike Rapoport <rppt@kernel.org> wrote:
>
> On Wed, Jun 18, 2025 at 01:43:18PM -0400, Pasha Tatashin wrote:
> > On Wed, Jun 18, 2025 at 1:00 PM Pasha Tatashin
> >
> > So currently, KHO provides the following two types of internal API:
> >
> > Preserve memory and metadata
> > =========================
> > kho_preserve_folio() / kho_preserve_phys()
> > kho_unpreserve_folio() / kho_unpreserve_phys()
> > kho_restore_folio()
> >
> > kho_add_subtree() kho_retrieve_subtree()
> >
> > State machine
> > ===========
> > register_kho_notifier() / unregister_kho_notifier()
> >
> > kho_finalize() / kho_abort()
> >
> > We should remove the "State machine", and only keep the "Preserve
> > Memory" API functions. At the time these functions are called, KHO
> > should do the magic of making sure that the memory gets preserved
> > across the reboot.
> >
> > This way, reserve_mem_init() would call: kho_preserve_folio() and
> > kho_add_subtree() during boot, and be done with it.
>
> I agree that there's no need in notifiers.
>
> I even have a half cooked patch for this on top of "kho: allow to drive kho
> from within kernel"
>
> From 02716e4731480bde997a9c1676b7246aa8e358de Mon Sep 17 00:00:00 2001
> From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
> Date: Sun, 22 Jun 2025 14:37:17 +0300
> Subject: [PATCH] kho: drop notifiers
>
> Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> ---
>  include/linux/kexec_handover.h   |  27 +-------
>  kernel/kexec_handover.c          | 114 ++++++++++++++-----------------
>  kernel/kexec_handover_debug.c    |   3 +-
>  kernel/kexec_handover_internal.h |   3 +-
>  mm/memblock.c                    |  56 +++------------
>  5 files changed, 65 insertions(+), 138 deletions(-)
>
> diff --git a/include/linux/kexec_handover.h b/include/linux/kexec_handover.h
> index f98565def593..ac9cb6eae71f 100644
> --- a/include/linux/kexec_handover.h
> +++ b/include/linux/kexec_handover.h
> @@ -10,14 +10,7 @@ struct kho_scratch {
>         phys_addr_t size;
>  };
>
> -/* KHO Notifier index */
> -enum kho_event {
> -       KEXEC_KHO_FINALIZE = 0,
> -       KEXEC_KHO_ABORT = 1,
> -};
> -
>  struct folio;
> -struct notifier_block;
>
>  #define DECLARE_KHOSER_PTR(name, type) \
>         union {                        \
> @@ -36,20 +29,15 @@ struct notifier_block;
>                 (typeof((s).ptr))((s).phys ? phys_to_virt((s).phys) : NULL); \
>         })
>
> -struct kho_serialization;
> -
>  #ifdef CONFIG_KEXEC_HANDOVER
>  bool kho_is_enabled(void);
>
>  int kho_preserve_folio(struct folio *folio);
>  int kho_preserve_phys(phys_addr_t phys, size_t size);
>  struct folio *kho_restore_folio(phys_addr_t phys);
> -int kho_add_subtree(struct kho_serialization *ser, const char *name, void *fdt);
> +int kho_add_subtree(const char *name, void *fdt);

For completeness, we also need `void kho_remove_subtree(const char
*name);`. Currently, all trees are removed during kho_abort(). Let's
rebase and include this patch on top of the next version of LUO, which
we are exchanging off-list, and send it together later this week.
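
To make the intended shape concrete, here is a rough userspace model of
the notifier-free subtree API: callers register a named FDT blob at any
time, and a per-name removal helper drops just that entry. This is only
a sketch. The model_* names, the fixed-size table, and the -EEXIST and
-ENOSPC return values are assumptions for illustration, not what
kernel/kexec_handover.c does.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define MAX_SUBTREES 8	/* arbitrary limit, for the model only */

struct subtree {
	const char *name;	/* NULL marks a free slot */
	void *fdt;
};

static struct subtree subtrees[MAX_SUBTREES];

/*
 * Models kho_add_subtree(name, fdt) without a serialization handle.
 * Rejecting duplicate names is an assumption of this sketch.
 */
static int model_add_subtree(const char *name, void *fdt)
{
	struct subtree *slot = NULL;

	for (size_t i = 0; i < MAX_SUBTREES; i++) {
		if (subtrees[i].name && !strcmp(subtrees[i].name, name))
			return -EEXIST;
		if (!subtrees[i].name && !slot)
			slot = &subtrees[i];
	}
	if (!slot)
		return -ENOSPC;

	slot->name = name;
	slot->fdt = fdt;
	return 0;
}

/* Models the suggested per-name removal: drop one entry, keep the rest. */
static void model_remove_subtree(const char *name)
{
	for (size_t i = 0; i < MAX_SUBTREES; i++) {
		if (subtrees[i].name && !strcmp(subtrees[i].name, name)) {
			subtrees[i].name = NULL;
			subtrees[i].fdt = NULL;
			return;
		}
	}
}
```

With this shape, a caller such as reserve_mem_init() would add its
subtree once during boot, and only a caller that changes its mind would
need the removal helper.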

Thanks,
Pasha


* Re: [RFC v2 10/16] luo: luo_ioctl: add ioctl interface
  2025-05-15 18:23 ` [RFC v2 10/16] luo: luo_ioctl: add ioctl interface Pasha Tatashin
                     ` (2 preceding siblings ...)
  2025-06-05 16:15   ` Pratyush Yadav
@ 2025-06-24  9:50   ` Christian Brauner
  2025-06-24 14:27     ` Pasha Tatashin
  2025-07-06 14:24     ` Mike Rapoport
  3 siblings, 2 replies; 102+ messages in thread
From: Christian Brauner @ 2025-06-24  9:50 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: pratyush, jasonmiu, graf, changyuanl, rppt, dmatlack, rientjes,
	corbet, rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl,
	masahiroy, akpm, tj, yoann.congal, mmaurer, roman.gushchin,
	chenridong, axboe, mark.rutland, jannh, vincent.guittot, hannes,
	dan.j.williams, david, joel.granados, rostedt, anna.schumaker,
	song, zhangguopeng, linux, linux-kernel, linux-doc, linux-mm,
	gregkh, tglx, mingo, bp, dave.hansen, x86, hpa, rafael, dakr,
	bartosz.golaszewski, cw00.choi, myungjoo.ham, yesanishhere,
	Jonathan.Cameron, quic_zijuhu, aleksander.lobakin, ira.weiny,
	andriy.shevchenko, leon, lukas, bhelgaas, wagi, djeffery,
	stuart.w.hayes, ptyadav

On Thu, May 15, 2025 at 06:23:14PM +0000, Pasha Tatashin wrote:
> Introduce the user-space interface for the Live Update Orchestrator
> via ioctl commands, enabling external control over the live update
> process and management of preserved resources.
> 
> Create a misc character device at /dev/liveupdate. Access
> to this device requires the CAP_SYS_ADMIN capability.
> 
> A new UAPI header, <uapi/linux/liveupdate.h>, defines the necessary
> structures. The magic number is registered in
> Documentation/userspace-api/ioctl/ioctl-number.rst.
> 
> Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
> ---
>  .../userspace-api/ioctl/ioctl-number.rst      |   1 +
>  drivers/misc/liveupdate/Makefile              |   1 +
>  drivers/misc/liveupdate/luo_ioctl.c           | 199 ++++++++++++
>  include/linux/liveupdate.h                    |  34 +-
>  include/uapi/linux/liveupdate.h               | 300 ++++++++++++++++++
>  5 files changed, 502 insertions(+), 33 deletions(-)
>  create mode 100644 drivers/misc/liveupdate/luo_ioctl.c
>  create mode 100644 include/uapi/linux/liveupdate.h
> 
> diff --git a/Documentation/userspace-api/ioctl/ioctl-number.rst b/Documentation/userspace-api/ioctl/ioctl-number.rst
> index 7a1409ecc238..279c124048f2 100644
> --- a/Documentation/userspace-api/ioctl/ioctl-number.rst
> +++ b/Documentation/userspace-api/ioctl/ioctl-number.rst
> @@ -375,6 +375,7 @@ Code  Seq#    Include File                                           Comments
>  0xB8  01-02  uapi/misc/mrvl_cn10k_dpi.h                              Marvell CN10K DPI driver
>  0xB8  all    uapi/linux/mshv.h                                       Microsoft Hyper-V /dev/mshv driver
>                                                                       <mailto:linux-hyperv@vger.kernel.org>
> +0xBA  all    uapi/linux/liveupdate.h                                 <mailto:Pasha Tatashin <pasha.tatashin@soleen.com>
>  0xC0  00-0F  linux/usb/iowarrior.h
>  0xCA  00-0F  uapi/misc/cxl.h                                         Dead since 6.15
>  0xCA  10-2F  uapi/misc/ocxl.h
> diff --git a/drivers/misc/liveupdate/Makefile b/drivers/misc/liveupdate/Makefile
> index b4cdd162574f..7a0cd08919c9 100644
> --- a/drivers/misc/liveupdate/Makefile
> +++ b/drivers/misc/liveupdate/Makefile
> @@ -1,4 +1,5 @@
>  # SPDX-License-Identifier: GPL-2.0
> +obj-y					+= luo_ioctl.o
>  obj-y					+= luo_core.o
>  obj-y					+= luo_files.o
>  obj-y					+= luo_subsystems.o
> diff --git a/drivers/misc/liveupdate/luo_ioctl.c b/drivers/misc/liveupdate/luo_ioctl.c
> new file mode 100644
> index 000000000000..76c687ff650b
> --- /dev/null
> +++ b/drivers/misc/liveupdate/luo_ioctl.c
> @@ -0,0 +1,199 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +/*
> + * Copyright (c) 2025, Google LLC.
> + * Pasha Tatashin <pasha.tatashin@soleen.com>
> + */
> +
> +/**
> + * DOC: LUO ioctl Interface
> + *
> + * The IOCTL user-space control interface for the LUO subsystem.
> + * It registers a misc character device, typically found at ``/dev/liveupdate``,
> + * which allows privileged userspace applications (requiring %CAP_SYS_ADMIN) to
> + * manage and monitor the LUO state machine and associated resources like
> + * preservable file descriptors.
> + */
> +
> +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
> +
> +#include <linux/errno.h>
> +#include <linux/file.h>
> +#include <linux/fs.h>
> +#include <linux/init.h>
> +#include <linux/kernel.h>
> +#include <linux/miscdevice.h>
> +#include <linux/module.h>
> +#include <linux/uaccess.h>
> +#include <uapi/linux/liveupdate.h>
> +#include "luo_internal.h"
> +
> +static int luo_ioctl_fd_preserve(struct liveupdate_fd *luo_fd)
> +{
> +	struct file *file;
> +	int ret;
> +
> +	file = fget(luo_fd->fd);
> +	if (!file) {
> +		pr_err("Bad file descriptor\n");
> +		return -EBADF;
> +	}
> +
> +	ret = luo_register_file(&luo_fd->token, file);
> +	if (ret)
> +		fput(file);
> +
> +	return ret;
> +}
> +
> +static int luo_ioctl_fd_unpreserve(u64 token)
> +{
> +	return luo_unregister_file(token);
> +}
> +
> +static int luo_ioctl_fd_restore(struct liveupdate_fd *luo_fd)
> +{
> +	struct file *file;
> +	int ret;
> +	int fd;
> +
> +	fd = get_unused_fd_flags(O_CLOEXEC);
> +	if (fd < 0) {
> +		pr_err("Failed to allocate new fd: %d\n", fd);
> +		return fd;
> +	}
> +
> +	ret = luo_retrieve_file(luo_fd->token, &file);
> +	if (ret < 0) {
> +		put_unused_fd(fd);
> +
> +		return ret;
> +	}
> +
> +	fd_install(fd, file);
> +	luo_fd->fd = fd;
> +
> +	return 0;
> +}
> +
> +static int luo_open(struct inode *inodep, struct file *filep)
> +{
> +	if (!capable(CAP_SYS_ADMIN))
> +		return -EACCES;
> +
> +	if (filep->f_flags & O_EXCL)
> +		return -EINVAL;
> +
> +	return 0;
> +}
> +
> +static long luo_ioctl(struct file *filep, unsigned int cmd, unsigned long arg)
> +{
> +	void __user *argp = (void __user *)arg;
> +	struct liveupdate_fd luo_fd;
> +	enum liveupdate_state state;
> +	int ret = 0;
> +	u64 token;
> +
> +	if (_IOC_TYPE(cmd) != LIVEUPDATE_IOCTL_TYPE)
> +		return -ENOTTY;
> +
> +	switch (cmd) {
> +	case LIVEUPDATE_IOCTL_GET_STATE:
> +		state = READ_ONCE(luo_state);
> +		if (copy_to_user(argp, &state, sizeof(luo_state)))
> +			ret = -EFAULT;
> +		break;
> +
> +	case LIVEUPDATE_IOCTL_EVENT_PREPARE:
> +		ret = luo_prepare();
> +		break;
> +
> +	case LIVEUPDATE_IOCTL_EVENT_FREEZE:
> +		ret = luo_freeze();
> +		break;
> +
> +	case LIVEUPDATE_IOCTL_EVENT_FINISH:
> +		ret = luo_finish();
> +		break;
> +
> +	case LIVEUPDATE_IOCTL_EVENT_CANCEL:
> +		ret = luo_cancel();
> +		break;
> +
> +	case LIVEUPDATE_IOCTL_FD_PRESERVE:
> +		if (copy_from_user(&luo_fd, argp, sizeof(luo_fd))) {
> +			ret = -EFAULT;
> +			break;
> +		}
> +
> +		ret = luo_ioctl_fd_preserve(&luo_fd);
> +		if (!ret && copy_to_user(argp, &luo_fd, sizeof(luo_fd)))
> +			ret = -EFAULT;
> +		break;
> +
> +	case LIVEUPDATE_IOCTL_FD_UNPRESERVE:
> +		if (copy_from_user(&token, argp, sizeof(u64))) {
> +			ret = -EFAULT;
> +			break;
> +		}
> +
> +		ret = luo_ioctl_fd_unpreserve(token);
> +		break;
> +
> +	case LIVEUPDATE_IOCTL_FD_RESTORE:
> +		if (copy_from_user(&luo_fd, argp, sizeof(luo_fd))) {
> +			ret = -EFAULT;
> +			break;
> +		}
> +
> +		ret = luo_ioctl_fd_restore(&luo_fd);
> +		if (!ret && copy_to_user(argp, &luo_fd, sizeof(luo_fd)))
> +			ret = -EFAULT;
> +		break;
> +
> +	default:
> +		pr_warn("ioctl: unknown command nr: 0x%x\n", _IOC_NR(cmd));
> +		ret = -ENOTTY;
> +		break;
> +	}
> +
> +	return ret;
> +}
> +
> +static const struct file_operations fops = {
> +	.owner          = THIS_MODULE,
> +	.open           = luo_open,
> +	.unlocked_ioctl = luo_ioctl,
> +};
> +
> +static struct miscdevice liveupdate_miscdev = {
> +	.minor = MISC_DYNAMIC_MINOR,
> +	.name  = "liveupdate",
> +	.fops  = &fops,
> +};
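
For reference while reading the rest of the review, this is roughly how
userspace would drive the interface as posted. It is a sketch under
assumptions: the struct and ioctl numbers are copied locally from the
patch because <uapi/linux/liveupdate.h> is not in any released headers,
and the luo_* names and helpers are made up for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <linux/ioctl.h>
#include <sys/ioctl.h>

/* Local mirror of the proposed struct liveupdate_fd (layout from the patch). */
struct luo_fd_sketch {
	int      fd;	/* in for PRESERVE, out for RESTORE */
	uint32_t flags;	/* reserved, must be 0 */
	uint64_t token;	/* out for PRESERVE, in for RESTORE */
};
static_assert(sizeof(struct luo_fd_sketch) == 16, "unexpected padding");

/* Numbers taken from the RFC; they may change before merge. */
#define LUO_TYPE	0xBA
#define LUO_FD_PRESERVE	_IOWR(LUO_TYPE, 0x00, struct luo_fd_sketch)
#define LUO_FD_RESTORE	_IOWR(LUO_TYPE, 0x02, struct luo_fd_sketch)
#define LUO_PREPARE	_IO(LUO_TYPE, 0x04)

/* Old kernel: mark @fd for preservation. Returns 0 and fills *token,
 * or a negative errno value. */
static int luo_preserve(int luo, int fd, uint64_t *token)
{
	struct luo_fd_sketch arg = { .fd = fd, .flags = 0, .token = 0 };

	if (ioctl(luo, LUO_FD_PRESERVE, &arg) < 0)
		return -errno;
	*token = arg.token;
	return 0;
}

/* New kernel: trade @token for a fresh fd, or return a negative errno. */
static int luo_restore(int luo, uint64_t token)
{
	struct luo_fd_sketch arg = { .fd = -1, .flags = 0, .token = token };

	if (ioctl(luo, LUO_FD_RESTORE, &arg) < 0)
		return -errno;
	return arg.fd;
}
```

A caller would open /dev/liveupdate, call luo_preserve() for each fd,
stash the tokens somewhere that survives the kexec, issue LUO_PREPARE,
and after the reboot call luo_restore() with each token before sending
the finish event.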

I'm not sure why people are so in love with character device based apis.
It's terrible. It glues everything to devtmpfs which isn't namespacable
in any way. It's terrible to delegate and extremely restrictive in terms
of extensibility if you need additional device entries (aka the loop
driver folly).

One stupid question: I probably have asked this before and just swapped
out that I a) asked this already and b) received an explanation. But why
isn't this a singleton simple in-memory filesystem with a flat
hierarchy?

mount -t kexecfs kexecfs /kexecfs

So userspace mounts kexecfs (or the kernel does it automagically) and
then to add fds into that thing you do the following:

linkat(fd_my_anon_inode_memfd, "", -EBADF, "kexecfs/my_serialized_memfd", AT_EMPTY_PATH)

which will serialize the fd_my_anon_inode_memfd. You can also do this
with ioctls on the kexecfs filesystem of course.

The advantages are:

* implement your own lookup permission
* you're able to use ls -al /kexecfs to see all the things that you
  serialized in there trivially
* you're free to turn this into a multi-instance thing so that you can
  serialize things per-service for example
* it's naturally hierarchical integrating with systemd a lot more easily
* profit from fs apis like unlink() and so on to remove entries getting
  rid of at least complex ioctl()ing for the basic use-cases
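
From the userspace side, the linkat() step could look roughly like the
sketch below. kexecfs does not exist, so the mount point and the
serialize_fd_by_link() name are hypothetical. The sketch passes
AT_FDCWD as the new directory fd so that a relative target path still
resolves. Note that linkat(2) documents AT_EMPTY_PATH as requiring
CAP_DAC_READ_SEARCH, and a real filesystem would need to accept links
to inodes that live elsewhere.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

#ifndef AT_EMPTY_PATH
#define AT_EMPTY_PATH 0x1000	/* Linux value, in case _GNU_SOURCE came too late */
#endif

/*
 * Give the file behind @fd a name at @path, e.g. a path under a
 * (hypothetical) kexecfs mount. An empty old path plus AT_EMPTY_PATH
 * tells linkat() to operate on @fd itself.
 * Returns 0 on success or a negative errno value.
 */
static int serialize_fd_by_link(int fd, const char *path)
{
	if (linkat(fd, "", AT_FDCWD, path, AT_EMPTY_PATH) < 0)
		return -errno;
	return 0;
}
```

Dropping an entry again would then be a plain unlink() on the same
path, with no dedicated ioctl needed.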

> +
> +static int __init liveupdate_init(void)
> +{
> +	int err;
> +
> +	err = misc_register(&liveupdate_miscdev);
> +	if (err < 0) {
> +		pr_err("Failed to register misc device '%s': %d\n",
> +		       liveupdate_miscdev.name, err);
> +	}
> +
> +	return err;
> +}
> +module_init(liveupdate_init);
> +
> +static void __exit liveupdate_exit(void)
> +{
> +	misc_deregister(&liveupdate_miscdev);
> +}
> +module_exit(liveupdate_exit);
> +
> +MODULE_LICENSE("GPL");
> +MODULE_AUTHOR("Pasha Tatashin");
> +MODULE_DESCRIPTION("Live Update Orchestrator");
> +MODULE_VERSION("0.1");
> diff --git a/include/linux/liveupdate.h b/include/linux/liveupdate.h
> index 7afe0aac5ce4..ff4f2ab5c673 100644
> --- a/include/linux/liveupdate.h
> +++ b/include/linux/liveupdate.h
> @@ -10,6 +10,7 @@
>  #include <linux/bug.h>
>  #include <linux/types.h>
>  #include <linux/list.h>
> +#include <uapi/linux/liveupdate.h>
>  
>  /**
>   * enum liveupdate_event - Events that trigger live update callbacks.
> @@ -53,39 +54,6 @@ enum liveupdate_event {
>  	LIVEUPDATE_CANCEL,
>  };
>  
> -/**
> - * enum liveupdate_state - Defines the possible states of the live update
> - * orchestrator.
> - * @LIVEUPDATE_STATE_NORMAL:         Default state, no live update in progress.
> - * @LIVEUPDATE_STATE_PREPARED:       Live update is prepared for reboot; the
> - *                                   LIVEUPDATE_PREPARE callbacks have completed
> - *                                   successfully.
> - *                                   Devices might operate in a limited state
> - *                                   for example the participating devices might
> - *                                   not be allowed to unbind, and also the
> - *                                   setting up of new DMA mappings might be
> - *                                   disabled in this state.
> - * @LIVEUPDATE_STATE_FROZEN:         The final reboot event
> - *                                   (%LIVEUPDATE_FREEZE) has been sent, and the
> - *                                   system is performing its final state saving
> - *                                   within the "blackout window". User
> - *                                   workloads must be suspended. The actual
> - *                                   reboot (kexec) into the next kernel is
> - *                                   imminent.
> - * @LIVEUPDATE_STATE_UPDATED:        The system has rebooted into the next
> - *                                   kernel via live update the system is now
> - *                                   running the next kernel, awaiting the
> - *                                   finish event.
> - *
> - * These states track the progress and outcome of a live update operation.
> - */
> -enum liveupdate_state  {
> -	LIVEUPDATE_STATE_NORMAL = 0,
> -	LIVEUPDATE_STATE_PREPARED = 1,
> -	LIVEUPDATE_STATE_FROZEN = 2,
> -	LIVEUPDATE_STATE_UPDATED = 3,
> -};
> -
>  /* Forward declaration needed if definition isn't included */
>  struct file;
>  
> diff --git a/include/uapi/linux/liveupdate.h b/include/uapi/linux/liveupdate.h
> new file mode 100644
> index 000000000000..c673d08a29ea
> --- /dev/null
> +++ b/include/uapi/linux/liveupdate.h
> @@ -0,0 +1,300 @@
> +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
> +
> +/*
> + * Userspace interface for /dev/liveupdate
> + * Live Update Orchestrator
> + *
> + * Copyright (c) 2025, Google LLC.
> + * Pasha Tatashin <pasha.tatashin@soleen.com>
> + */
> +
> +#ifndef _UAPI_LIVEUPDATE_H
> +#define _UAPI_LIVEUPDATE_H
> +
> +#include <linux/ioctl.h>
> +#include <linux/types.h>
> +
> +/**
> + * enum liveupdate_state - Defines the possible states of the live update
> + * orchestrator.
> + * @LIVEUPDATE_STATE_NORMAL:         Default state, no live update in progress.
> + * @LIVEUPDATE_STATE_PREPARED:       Live update is prepared for reboot; the
> + *                                   LIVEUPDATE_PREPARE callbacks have completed
> + *                                   successfully.
> + *                                   Devices might operate in a limited state
> + *                                   for example the participating devices might
> + *                                   not be allowed to unbind, and also the
> + *                                   setting up of new DMA mappings might be
> + *                                   disabled in this state.
> + * @LIVEUPDATE_STATE_FROZEN:         The final reboot event
> + *                                   (%LIVEUPDATE_FREEZE) has been sent, and the
> + *                                   system is performing its final state saving
> + *                                   within the "blackout window". User
> + *                                   workloads must be suspended. The actual
> + *                                   reboot (kexec) into the next kernel is
> + *                                   imminent.
> + * @LIVEUPDATE_STATE_UPDATED:        The system has rebooted into the next
> + *                                   kernel via live update the system is now
> + *                                   running the next kernel, awaiting the
> + *                                   finish event.
> + *
> + * These states track the progress and outcome of a live update operation.
> + */
> +enum liveupdate_state  {
> +	LIVEUPDATE_STATE_NORMAL = 0,
> +	LIVEUPDATE_STATE_PREPARED = 1,
> +	LIVEUPDATE_STATE_FROZEN = 2,
> +	LIVEUPDATE_STATE_UPDATED = 3,
> +};
> +
> +/**
> + * struct liveupdate_fd - Holds parameters for preserving and restoring file
> + * descriptors across live update.
> + * @fd:    Input for %LIVEUPDATE_IOCTL_FD_PRESERVE: The user-space file
> + *         descriptor to be preserved.
> + *         Output for %LIVEUPDATE_IOCTL_FD_RESTORE: The new file descriptor
> + *         representing the fully restored kernel resource.
> + * @flags: Unused, reserved for future expansion, must be set to 0.
> + * @token: Output for %LIVEUPDATE_IOCTL_FD_PRESERVE: An opaque, unique token
> + *         generated by the kernel representing the successfully preserved
> + *         resource state.
> + *         Input for %LIVEUPDATE_IOCTL_FD_RESTORE: The token previously
> + *         returned by the preserve ioctl for the resource to be restored.
> + *
> + * This structure is used as the argument for the %LIVEUPDATE_IOCTL_FD_PRESERVE
> + * and %LIVEUPDATE_IOCTL_FD_RESTORE ioctls. These ioctls allow specific types
> + * of file descriptors (for example memfd, kvm, iommufd, and VFIO) to have their
> + * underlying kernel state preserved across a live update cycle.
> + *
> + * To preserve an FD, user space passes this struct to
> + * %LIVEUPDATE_IOCTL_FD_PRESERVE with the @fd field set. On success, the
> + * kernel populates the @token field.
> + *
> + * After the live update transition, user space passes the struct populated with
> + * the *same* @token to %LIVEUPDATE_IOCTL_FD_RESTORE. The kernel uses the @token
> + * to find the preserved state and, on success, populates the @fd field with a
> + * new file descriptor referring to the fully restored resource.
> + */
> +struct liveupdate_fd {
> +	int		fd;
> +	__u32		flags;
> +	__u64		token;
> +};
> +
> +/* The ioctl type, documented in ioctl-number.rst */
> +#define LIVEUPDATE_IOCTL_TYPE		0xBA
> +
> +/**
> + * LIVEUPDATE_IOCTL_FD_PRESERVE - Validate and initiate preservation for a file
> + * descriptor.
> + *
> + * Argument: Pointer to &struct liveupdate_fd.
> + *
> + * User sets the @fd field identifying the file descriptor to preserve
> + * (e.g., memfd, kvm, iommufd, VFIO). The kernel validates if this FD type
> + * and its dependencies are supported for preservation. If validation passes,
> + * the kernel marks the FD internally and *initiates the process* of preparing
> + * its state for saving. The actual snapshotting of the state typically occurs
> + * during the subsequent %LIVEUPDATE_IOCTL_EVENT_PREPARE execution phase, though
> + * some finalization might occur during %LIVEUPDATE_IOCTL_EVENT_FREEZE.
> + * On successful validation and initiation, the kernel populates the @token
> + * field with an opaque identifier representing the resource being preserved.
> + * This token confirms the FD is targeted for preservation and is required for
> + * the subsequent %LIVEUPDATE_IOCTL_FD_RESTORE call after the live update. This
> + * is an I/O read/write operation.
> + *
> + * Return: 0 on success (validation passed, preservation initiated), negative
> + * error code on failure (e.g., unsupported FD type, dependency issue,
> + * validation failed).
> + */
> +#define LIVEUPDATE_IOCTL_FD_PRESERVE					\
> +	_IOWR(LIVEUPDATE_IOCTL_TYPE, 0x00, struct liveupdate_fd)
> +
> +/**
> + * LIVEUPDATE_IOCTL_FD_UNPRESERVE - Remove a file descriptor from the
> + * preservation list.
> + *
> + * Argument: Pointer to __u64 token.
> + *
> + * Allows user space to explicitly remove a file descriptor from the set of
> + * items marked as potentially preservable. User space provides a pointer to the
> + * __u64 @token that was previously returned by a successful
> + * %LIVEUPDATE_IOCTL_FD_PRESERVE call (potentially from a prior, possibly
> + * cancelled, live update attempt). The kernel reads the token value from the
> + * provided user-space address.
> + *
> + * On success, the kernel removes the corresponding entry (identified by the
> + * token value read from the user pointer) from its internal preservation list.
> + * The provided @token (representing the now-removed entry) becomes invalid
> + * after this call.
> + *
> + * This operation can only be called when the live update orchestrator is in the
> + *  %LIVEUPDATE_STATE_NORMAL state.**
> + *
> + * This is an I/O write operation (_IOW), signifying the kernel reads data (the
> + * token) from the user-provided pointer.
> + *
> + * Return: 0 on success, negative error code on failure (e.g., -EBUSY or -EINVAL
> + * if not in %LIVEUPDATE_STATE_NORMAL, bad address provided, invalid token value
> + * read, token not found).
> + */
> +#define LIVEUPDATE_IOCTL_FD_UNPRESERVE					\
> +	_IOW(LIVEUPDATE_IOCTL_TYPE, 0x01, __u64)
> +
> +/**
> + * LIVEUPDATE_IOCTL_FD_RESTORE - Restore a previously preserved file descriptor.
> + *
> + * Argument: Pointer to &struct liveupdate_fd.
> + *
> + * User sets the @token field to the value obtained from a successful
> + * %LIVEUPDATE_IOCTL_FD_PRESERVE call before the live update. On success,
> + * the kernel restores the state (saved during the PREPARE/FREEZE phases)
> + * associated with the token and populates the @fd field with a new file
> + * descriptor referencing the restored resource in the current (new) kernel.
> + * This operation must be performed *before* signaling completion via
> + * %LIVEUPDATE_IOCTL_EVENT_FINISH. This is an I/O read/write operation.
> + *
> + * Return: 0 on success, negative error code on failure (e.g., invalid token).
> + */
> +#define LIVEUPDATE_IOCTL_FD_RESTORE					\
> +	_IOWR(LIVEUPDATE_IOCTL_TYPE, 0x02, struct liveupdate_fd)
> +
> +/**
> + * LIVEUPDATE_IOCTL_GET_STATE - Query the current state of the live update
> + * orchestrator.
> + *
> + * Argument: Pointer to &enum liveupdate_state.
> + *
> + * The kernel fills the enum value pointed to by the argument with the current
> + * state of the live update subsystem. Possible states are:
> + *
> + * - %LIVEUPDATE_STATE_NORMAL:   Default state; no live update operation is
> + *                               currently in progress.
> + * - %LIVEUPDATE_STATE_PREPARED: The preparation phase (triggered by
> + *                               %LIVEUPDATE_IOCTL_EVENT_PREPARE) has completed
> + *                               successfully. The system is ready for the
> + *                               reboot transition initiated by
> + *                               %LIVEUPDATE_IOCTL_EVENT_FREEZE. Note that some
> + *                               device operations (e.g., unbinding, new DMA
> + *                               mappings) might be restricted in this state.
> + * - %LIVEUPDATE_STATE_UPDATED:  The system has successfully rebooted into the
> + *                               new kernel via live update. It is now running
> + *                               the new kernel code and is awaiting the
> + *                               completion signal from user space via
> + *                               %LIVEUPDATE_IOCTL_EVENT_FINISH after
> + *                               restoration tasks are done.
> + *
> + * See the definition of &enum liveupdate_state for more details on each state.
> + * This is an I/O read operation (kernel writes to the user-provided pointer).
> + *
> + * Return: 0 on success, negative error code on failure.
> + */
> +#define LIVEUPDATE_IOCTL_GET_STATE					\
> +	_IOR(LIVEUPDATE_IOCTL_TYPE, 0x03, enum liveupdate_state)
> +
> +/**
> + * LIVEUPDATE_IOCTL_EVENT_PREPARE - Initiate preparation phase and trigger state
> + * saving.
> + *
> + * Argument: None.
> + *
> + * Initiates the live update preparation phase. This action corresponds to
> + * the internal %LIVEUPDATE_PREPARE kernel event and can also be triggered
> + * by writing '1' to ``/sys/kernel/liveupdate/prepare``. This typically
> + * triggers the main state saving process for items marked via the PRESERVE
> + * ioctls. This occurs *before* the main "blackout window", while user
> + * applications (e.g., VMs) may still be running. Kernel subsystems
> + * receiving the %LIVEUPDATE_PREPARE event should serialize necessary state.
> + * This command does not transfer data.
> + *
> + * Return: 0 on success, negative error code on failure. Transitions state
> + * towards %LIVEUPDATE_STATE_PREPARED on success.
> + */
> +#define LIVEUPDATE_IOCTL_EVENT_PREPARE					\
> +	_IO(LIVEUPDATE_IOCTL_TYPE, 0x04)
> +
> +/**
> + * LIVEUPDATE_IOCTL_EVENT_FREEZE - Notify subsystems of imminent reboot
> + * transition.
> + *
> + * Argument: None.
> + *
> + * Notifies the live update subsystem and associated components that the kernel
> + * is about to execute the final reboot transition into the new kernel (e.g.,
> + * via kexec). This action triggers the internal %LIVEUPDATE_FREEZE kernel
> + * event. This event provides subsystems a final, brief opportunity (within the
> + * "blackout window") to save critical state or perform last-moment quiescing.
> + * Any remaining or deferred state saving for items marked via the PRESERVE
> + * ioctls typically occurs in response to the %LIVEUPDATE_FREEZE event.
> + *
> + * This ioctl should only be called when the system is in the
> + * %LIVEUPDATE_STATE_PREPARED state. This command does not transfer data.
> + *
> + * Return: 0 if the notification is successfully processed by the kernel (but
> + * reboot follows). Returns a negative error code if the notification fails
> + * or if the system is not in the %LIVEUPDATE_STATE_PREPARED state.
> + */
> +#define LIVEUPDATE_IOCTL_EVENT_FREEZE					\
> +	_IO(LIVEUPDATE_IOCTL_TYPE, 0x05)
> +
> +/**
> + * LIVEUPDATE_IOCTL_EVENT_CANCEL - Cancel the live update preparation phase.
> + *
> + * Argument: None.
> + *
> + * Notifies the live update subsystem to abort the preparation sequence
> + * potentially initiated by %LIVEUPDATE_IOCTL_EVENT_PREPARE. This action
> + * typically corresponds to the internal %LIVEUPDATE_CANCEL kernel event,
> + * which might also be triggered automatically if the PREPARE stage fails
> + * internally.
> + *
> + * When triggered, subsystems receiving the %LIVEUPDATE_CANCEL event should
> + * revert any state changes or actions taken specifically for the aborted
> + * prepare phase (e.g., discard partially serialized state). The kernel
> + * releases resources allocated specifically for this *aborted preparation
> + * attempt*.
> + *
> + * This operation cancels the current *attempt* to prepare for a live update
> + * but does **not** remove previously validated items from the internal list
> + * of potentially preservable resources. Consequently, preservation tokens
> + * previously generated by successful %LIVEUPDATE_IOCTL_FD_PRESERVE calls
> + * generally **remain valid** as identifiers for those potentially preservable
> + * resources. However, since the system state returns towards
> + * %LIVEUPDATE_STATE_NORMAL, user space must initiate a new live update sequence
> + * (starting with %LIVEUPDATE_IOCTL_EVENT_PREPARE) to proceed with an update
> + * using these (or other) tokens.
> + *
> + * This command does not transfer data. Kernel callbacks for the
> + * %LIVEUPDATE_CANCEL event must not fail.
> + *
> + * Return: 0 on success, negative error code on failure. Transitions state back
> + * towards %LIVEUPDATE_STATE_NORMAL on success.
> + */
> +#define LIVEUPDATE_IOCTL_EVENT_CANCEL					\
> +	_IO(LIVEUPDATE_IOCTL_TYPE, 0x06)
> +
> +/**
> + * LIVEUPDATE_IOCTL_EVENT_FINISH - Signal restoration completion and trigger
> + * cleanup.
> + *
> + * Argument: None.
> + *
> + * Signals that user space has completed all necessary restoration actions in
> + * the new kernel (after a live update reboot). This action corresponds to the
> + * internal %LIVEUPDATE_FINISH kernel event and may also be triggerable via
> + * sysfs (e.g., writing '1' to ``/sys/kernel/liveupdate/finish``).
> + * Calling this ioctl triggers the cleanup phase: any resources that were
> + * successfully preserved but were *not* subsequently restored (reclaimed) via
> + * the RESTORE ioctls will have their preserved state discarded and associated
> + * kernel resources released. Involved devices may be reset. All desired
> + * restorations *must* be completed *before* this. Kernel callbacks for the
> + * %LIVEUPDATE_FINISH event must not fail. Successfully completing this phase
> + * transitions the system state from %LIVEUPDATE_STATE_UPDATED back to
> + * %LIVEUPDATE_STATE_NORMAL. This command does not transfer data.
> + *
> + * Return: 0 on success, negative error code on failure.
> + */
> +#define LIVEUPDATE_IOCTL_EVENT_FINISH					\
> +	_IO(LIVEUPDATE_IOCTL_TYPE, 0x07)
> +
> +#endif /* _UAPI_LIVEUPDATE_H */
> -- 
> 2.49.0.1101.gccaa498523-goog
> 

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [RFC v2 10/16] luo: luo_ioctl: add ioctl interface
  2025-06-24  9:50   ` Christian Brauner
@ 2025-06-24 14:27     ` Pasha Tatashin
  2025-06-25  9:36       ` Christian Brauner
  2025-07-06 14:24     ` Mike Rapoport
  1 sibling, 1 reply; 102+ messages in thread
From: Pasha Tatashin @ 2025-06-24 14:27 UTC (permalink / raw)
  To: Christian Brauner
  Cc: pratyush, jasonmiu, graf, changyuanl, rppt, dmatlack, rientjes,
	corbet, rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl,
	masahiroy, akpm, tj, yoann.congal, mmaurer, roman.gushchin,
	chenridong, axboe, mark.rutland, jannh, vincent.guittot, hannes,
	dan.j.williams, david, joel.granados, rostedt, anna.schumaker,
	song, zhangguopeng, linux, linux-kernel, linux-doc, linux-mm,
	gregkh, tglx, mingo, bp, dave.hansen, x86, hpa, rafael, dakr,
	bartosz.golaszewski, cw00.choi, myungjoo.ham, yesanishhere,
	Jonathan.Cameron, quic_zijuhu, aleksander.lobakin, ira.weiny,
	andriy.shevchenko, leon, lukas, bhelgaas, wagi, djeffery,
	stuart.w.hayes, ptyadav

On Tue, Jun 24, 2025 at 5:51 AM Christian Brauner <brauner@kernel.org> wrote:
>
> On Thu, May 15, 2025 at 06:23:14PM +0000, Pasha Tatashin wrote:
> > Introduce the user-space interface for the Live Update Orchestrator
> > via ioctl commands, enabling external control over the live update
> > process and management of preserved resources.
> >
> > Create a misc character device at /dev/liveupdate. Access
> > to this device requires the CAP_SYS_ADMIN capability.
> >
> > A new UAPI header, <uapi/linux/liveupdate.h>, defines the necessary
> > structures. The magic number is registered in
> > Documentation/userspace-api/ioctl/ioctl-number.rst.
> >
> > Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
> > ---
> >  .../userspace-api/ioctl/ioctl-number.rst      |   1 +
> >  drivers/misc/liveupdate/Makefile              |   1 +
> >  drivers/misc/liveupdate/luo_ioctl.c           | 199 ++++++++++++
> >  include/linux/liveupdate.h                    |  34 +-
> >  include/uapi/linux/liveupdate.h               | 300 ++++++++++++++++++
> >  5 files changed, 502 insertions(+), 33 deletions(-)
> >  create mode 100644 drivers/misc/liveupdate/luo_ioctl.c
> >  create mode 100644 include/uapi/linux/liveupdate.h
> >
> > diff --git a/Documentation/userspace-api/ioctl/ioctl-number.rst b/Documentation/userspace-api/ioctl/ioctl-number.rst
> > index 7a1409ecc238..279c124048f2 100644
> > --- a/Documentation/userspace-api/ioctl/ioctl-number.rst
> > +++ b/Documentation/userspace-api/ioctl/ioctl-number.rst
> > @@ -375,6 +375,7 @@ Code  Seq#    Include File                                           Comments
> >  0xB8  01-02  uapi/misc/mrvl_cn10k_dpi.h                              Marvell CN10K DPI driver
> >  0xB8  all    uapi/linux/mshv.h                                       Microsoft Hyper-V /dev/mshv driver
> >                                                                       <mailto:linux-hyperv@vger.kernel.org>
> > +0xBA  all    uapi/linux/liveupdate.h                                 <mailto:Pasha Tatashin <pasha.tatashin@soleen.com>
> >  0xC0  00-0F  linux/usb/iowarrior.h
> >  0xCA  00-0F  uapi/misc/cxl.h                                         Dead since 6.15
> >  0xCA  10-2F  uapi/misc/ocxl.h
> > diff --git a/drivers/misc/liveupdate/Makefile b/drivers/misc/liveupdate/Makefile
> > index b4cdd162574f..7a0cd08919c9 100644
> > --- a/drivers/misc/liveupdate/Makefile
> > +++ b/drivers/misc/liveupdate/Makefile
> > @@ -1,4 +1,5 @@
> >  # SPDX-License-Identifier: GPL-2.0
> > +obj-y                                        += luo_ioctl.o
> >  obj-y                                        += luo_core.o
> >  obj-y                                        += luo_files.o
> >  obj-y                                        += luo_subsystems.o
> > diff --git a/drivers/misc/liveupdate/luo_ioctl.c b/drivers/misc/liveupdate/luo_ioctl.c
> > new file mode 100644
> > index 000000000000..76c687ff650b
> > --- /dev/null
> > +++ b/drivers/misc/liveupdate/luo_ioctl.c
> > @@ -0,0 +1,199 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +
> > +/*
> > + * Copyright (c) 2025, Google LLC.
> > + * Pasha Tatashin <pasha.tatashin@soleen.com>
> > + */
> > +
> > +/**
> > + * DOC: LUO ioctl Interface
> > + *
> > + * The IOCTL user-space control interface for the LUO subsystem.
> > + * It registers a misc character device, typically found at ``/dev/liveupdate``,
> > + * which allows privileged userspace applications (requiring %CAP_SYS_ADMIN) to
> > + * manage and monitor the LUO state machine and associated resources like
> > + * preservable file descriptors.
> > + */
> > +
> > +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
> > +
> > +#include <linux/errno.h>
> > +#include <linux/file.h>
> > +#include <linux/fs.h>
> > +#include <linux/init.h>
> > +#include <linux/kernel.h>
> > +#include <linux/miscdevice.h>
> > +#include <linux/module.h>
> > +#include <linux/uaccess.h>
> > +#include <uapi/linux/liveupdate.h>
> > +#include "luo_internal.h"
> > +
> > +static int luo_ioctl_fd_preserve(struct liveupdate_fd *luo_fd)
> > +{
> > +     struct file *file;
> > +     int ret;
> > +
> > +     file = fget(luo_fd->fd);
> > +     if (!file) {
> > +             pr_err("Bad file descriptor\n");
> > +             return -EBADF;
> > +     }
> > +
> > +     ret = luo_register_file(&luo_fd->token, file);
> > +     if (ret)
> > +             fput(file);
> > +
> > +     return ret;
> > +}
> > +
> > +static int luo_ioctl_fd_unpreserve(u64 token)
> > +{
> > +     return luo_unregister_file(token);
> > +}
> > +
> > +static int luo_ioctl_fd_restore(struct liveupdate_fd *luo_fd)
> > +{
> > +     struct file *file;
> > +     int ret;
> > +     int fd;
> > +
> > +     fd = get_unused_fd_flags(O_CLOEXEC);
> > +     if (fd < 0) {
> > +             pr_err("Failed to allocate new fd: %d\n", fd);
> > +             return fd;
> > +     }
> > +
> > +     ret = luo_retrieve_file(luo_fd->token, &file);
> > +     if (ret < 0) {
> > +             put_unused_fd(fd);
> > +
> > +             return ret;
> > +     }
> > +
> > +     fd_install(fd, file);
> > +     luo_fd->fd = fd;
> > +
> > +     return 0;
> > +}
> > +
> > +static int luo_open(struct inode *inodep, struct file *filep)
> > +{
> > +     if (!capable(CAP_SYS_ADMIN))
> > +             return -EACCES;
> > +
> > +     if (filep->f_flags & O_EXCL)
> > +             return -EINVAL;
> > +
> > +     return 0;
> > +}
> > +
> > +static long luo_ioctl(struct file *filep, unsigned int cmd, unsigned long arg)
> > +{
> > +     void __user *argp = (void __user *)arg;
> > +     struct liveupdate_fd luo_fd;
> > +     enum liveupdate_state state;
> > +     int ret = 0;
> > +     u64 token;
> > +
> > +     if (_IOC_TYPE(cmd) != LIVEUPDATE_IOCTL_TYPE)
> > +             return -ENOTTY;
> > +
> > +     switch (cmd) {
> > +     case LIVEUPDATE_IOCTL_GET_STATE:
> > +             state = READ_ONCE(luo_state);
> > +             if (copy_to_user(argp, &state, sizeof(luo_state)))
> > +                     ret = -EFAULT;
> > +             break;
> > +
> > +     case LIVEUPDATE_IOCTL_EVENT_PREPARE:
> > +             ret = luo_prepare();
> > +             break;
> > +
> > +     case LIVEUPDATE_IOCTL_EVENT_FREEZE:
> > +             ret = luo_freeze();
> > +             break;
> > +
> > +     case LIVEUPDATE_IOCTL_EVENT_FINISH:
> > +             ret = luo_finish();
> > +             break;
> > +
> > +     case LIVEUPDATE_IOCTL_EVENT_CANCEL:
> > +             ret = luo_cancel();
> > +             break;
> > +
> > +     case LIVEUPDATE_IOCTL_FD_PRESERVE:
> > +             if (copy_from_user(&luo_fd, argp, sizeof(luo_fd))) {
> > +                     ret = -EFAULT;
> > +                     break;
> > +             }
> > +
> > +             ret = luo_ioctl_fd_preserve(&luo_fd);
> > +             if (!ret && copy_to_user(argp, &luo_fd, sizeof(luo_fd)))
> > +                     ret = -EFAULT;
> > +             break;
> > +
> > +     case LIVEUPDATE_IOCTL_FD_UNPRESERVE:
> > +             if (copy_from_user(&token, argp, sizeof(u64))) {
> > +                     ret = -EFAULT;
> > +                     break;
> > +             }
> > +
> > +             ret = luo_ioctl_fd_unpreserve(token);
> > +             break;
> > +
> > +     case LIVEUPDATE_IOCTL_FD_RESTORE:
> > +             if (copy_from_user(&luo_fd, argp, sizeof(luo_fd))) {
> > +                     ret = -EFAULT;
> > +                     break;
> > +             }
> > +
> > +             ret = luo_ioctl_fd_restore(&luo_fd);
> > +             if (!ret && copy_to_user(argp, &luo_fd, sizeof(luo_fd)))
> > +                     ret = -EFAULT;
> > +             break;
> > +
> > +     default:
> > +             pr_warn("ioctl: unknown command nr: 0x%x\n", _IOC_NR(cmd));
> > +             ret = -ENOTTY;
> > +             break;
> > +     }
> > +
> > +     return ret;
> > +}
> > +
> > +static const struct file_operations fops = {
> > +     .owner          = THIS_MODULE,
> > +     .open           = luo_open,
> > +     .unlocked_ioctl = luo_ioctl,
> > +};
> > +
> > +static struct miscdevice liveupdate_miscdev = {
> > +     .minor = MISC_DYNAMIC_MINOR,
> > +     .name  = "liveupdate",
> > +     .fops  = &fops,
> > +};
>
> I'm not sure why people are so in love with character device based apis.
> It's terrible. It glues everything to devtmpfs which isn't namespacable
> in any way. It's terrible to delegate and extremely restrictive in terms
> of extensibility if you need additional device entries (aka the loop
> driver folly).
>
> One stupid question: I probably have asked this before and just swapped
> out that I a) asked this already and b) received an explanation. But why
> isn't this a singleton simple in-memory filesystem with a flat
> hierarchy?

Hi Christian,

Thank you for the detailed feedback and for raising this important
design question. I appreciate the points you've made about the
benefits of a filesystem-based API.

I have thought thoroughly about this and explored various alternatives
before settling on the ioctl-based interface. This design isn't a
sudden decision but is based on ongoing conversations that have been
happening for over two years at LPC, as well as incorporating direct
feedback I received on LUOv1 at LSF/MM.

The choice for an ioctl-based character device was ultimately driven
by the specific lifecycle and dependency management requirements of
the live update process. While a filesystem API offers great
advantages in visibility and hierarchy, filesystems are not typically
designed to be state machines with the complex lifecycle, dependency,
and ownership tracking that LUO needs to manage.

Let me elaborate on the key aspects that led to the current design:

1. session-based lifecycle management: The preservation of an FD is
tied to the open instance of /dev/liveupdate. If a userspace agent
opens /dev/liveupdate, registers several FDs for preservation, and
then crashes or exits before the prepare phase is triggered, all FDs
it registered are automatically unregistered. This "session-scoped"
behavior is crucial to prevent leaking preserved resources into the
next kernel if the controlling process fails. This is naturally
handled by the open() and release() file operations on a character
device. It's not immediately obvious how a similar automatic,
session-based cleanup would be implemented with a singleton
filesystem.

2. state machine: LUO is fundamentally a state machine (NORMAL ->
PREPARED -> FROZEN -> UPDATED -> NORMAL). As part of this, it provides
a crucial guarantee: any resource that was successfully preserved but
not explicitly reclaimed by userspace in the new kernel by the time
the FINISH event is triggered will be automatically cleaned up and its
memory released. This prevents leaks of unreclaimed resources and is
managed by the orchestrator, which is a concept that doesn't map
cleanly onto standard VFS semantics.

3. dependency tracking: Unlike normal files, preserved resources for
live update have strong, often complex interdependencies. For example,
a kvmfd might depend on a guestmemfd; an iommufd can depend on vfiofd,
eventfd, memfd, and kvmfd. LUO's current design provides explicit
callback points (prepare, freeze) where these dependencies can be
validated and tracked by the participating subsystems. If a dependency
is not met when we are about to freeze, we can fail the entire
operation and return an error to userspace. The cancel callback
further allows this complex dependency graph to be unwound safely. A
filesystem interface based on linkat() or unlink() doesn't inherently
provide these critical, ordered points for dependency verification and
rollback.
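
To make the state machine in point 2 concrete, here is a minimal userspace
model of the transitions, not the kernel implementation. The state names
follow this thread; the event names, the luo_step() helper, and the choice to
accept cancel only from PREPARED are assumptions made for illustration.

```c
#include <errno.h>

/* States as named in this thread; numeric values are illustrative only. */
enum luo_state { LUO_NORMAL, LUO_PREPARED, LUO_FROZEN, LUO_UPDATED };

/* Hypothetical event names. LUO_EV_BOOTED stands for "the new kernel is up". */
enum luo_event { LUO_EV_PREPARE, LUO_EV_FREEZE, LUO_EV_CANCEL,
		 LUO_EV_BOOTED, LUO_EV_FINISH };

/*
 * Apply one event to *state. Returns 0 and updates *state if the event is
 * legal in the current state; otherwise returns -EINVAL and leaves *state
 * untouched.
 */
int luo_step(enum luo_state *state, enum luo_event ev)
{
	static const struct {
		enum luo_state from;
		enum luo_event ev;
		enum luo_state to;
	} table[] = {
		{ LUO_NORMAL,   LUO_EV_PREPARE, LUO_PREPARED },
		{ LUO_PREPARED, LUO_EV_FREEZE,  LUO_FROZEN   },
		/* Assumption: cancel from FROZEN is not modelled here. */
		{ LUO_PREPARED, LUO_EV_CANCEL,  LUO_NORMAL   },
		{ LUO_FROZEN,   LUO_EV_BOOTED,  LUO_UPDATED  },
		{ LUO_UPDATED,  LUO_EV_FINISH,  LUO_NORMAL   },
	};
	unsigned int i;

	for (i = 0; i < sizeof(table) / sizeof(table[0]); i++) {
		if (table[i].from == *state && table[i].ev == ev) {
			*state = table[i].to;
			return 0;
		}
	}
	return -EINVAL;
}
```

In this model an out-of-order request, such as a freeze while still in NORMAL,
is rejected without changing the state, which is the property the ordered
callback points rely on.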

While I agree that a filesystem offers superior introspection and
integration with standard tools, building this complex, stateful
orchestration logic on top of VFS seemed to be forcing a square peg
into a round hole. The ioctl interface, while more opaque, provides a
direct and explicit way to command the state machine and manage these
complex lifecycle and dependency rules.
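
For reference, querying the state machine from user space could look roughly
like the sketch below. The ioctl type (0xBA) and command number (0x03) are
copied from the RFC v2 patch and may change in later revisions. The
luo_query_state() helper is hypothetical, and it passes an int on the
assumption that enum liveupdate_state has the same size.

```c
#include <errno.h>
#include <fcntl.h>
#include <linux/ioctl.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Values taken from the RFC v2 UAPI header quoted in this thread. */
#define LIVEUPDATE_IOCTL_TYPE		0xBA
/* Assumption: int stands in for enum liveupdate_state (same size). */
#define LIVEUPDATE_IOCTL_GET_STATE	_IOR(LIVEUPDATE_IOCTL_TYPE, 0x03, int)

/*
 * Open the control device at @path (normally /dev/liveupdate) and read the
 * current state. Returns 0 and fills *state on success, or a negative errno
 * value; *state is only written by a successful ioctl.
 */
int luo_query_state(const char *path, int *state)
{
	int ret = 0;
	int fd;

	fd = open(path, O_RDWR);
	if (fd < 0)
		return -errno;

	if (ioctl(fd, LIVEUPDATE_IOCTL_GET_STATE, state) < 0)
		ret = -errno;

	close(fd);
	return ret;
}
```

A caller would pass "/dev/liveupdate" and needs CAP_SYS_ADMIN for the open to
succeed; on a kernel without this series the helper simply returns -ENOENT.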

Thanks,
Pasha

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [RFC v2 05/16] luo: luo_core: integrate with KHO
  2025-06-20 16:03                             ` Pasha Tatashin
@ 2025-06-24 16:12                               ` Pratyush Yadav
  2025-06-24 16:55                                 ` Pasha Tatashin
  2025-06-24 18:31                                 ` Jason Gunthorpe
  0 siblings, 2 replies; 102+ messages in thread
From: Pratyush Yadav @ 2025-06-24 16:12 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: Pratyush Yadav, Mike Rapoport, Jason Gunthorpe, jasonmiu, graf,
	changyuanl, dmatlack, rientjes, corbet, rdunlap, ilpo.jarvinen,
	kanie, ojeda, aliceryhl, masahiroy, akpm, tj, yoann.congal,
	mmaurer, roman.gushchin, chenridong, axboe, mark.rutland, jannh,
	vincent.guittot, hannes, dan.j.williams, david, joel.granados,
	rostedt, anna.schumaker, song, zhangguopeng, linux, linux-kernel,
	linux-doc, linux-mm, gregkh, tglx, mingo, bp, dave.hansen, x86,
	hpa, rafael, dakr, bartosz.golaszewski, cw00.choi, myungjoo.ham,
	yesanishhere, Jonathan.Cameron, quic_zijuhu, aleksander.lobakin,
	ira.weiny, andriy.shevchenko, leon, lukas, bhelgaas, wagi,
	djeffery, stuart.w.hayes

On Fri, Jun 20 2025, Pasha Tatashin wrote:

> On Fri, Jun 20, 2025 at 11:28 AM Pratyush Yadav <pratyush@kernel.org> wrote:
>> On Thu, Jun 19 2025, Pasha Tatashin wrote:
[...]
>> Outside of hypervisor live update, I have a very clear use case in mind:
>> userspace memory handover (on guest side). Say a guest running an
>> in-memory cache like memcached with many gigabytes of cache wants to
>> reboot. It can just shove the cache into a memfd, give it to LUO, and
>> restore it after reboot. Some services that suffer from long reboots are
>> looking into using this to reduce downtime. Since it pretty much
>> overlaps with the hypervisor work for now, I haven't been talking about
>> it as much.
>>
>> Would you also call this use case "live update"? Does it also fit with
>> your vision of where LUO should go?
>
> Yes, absolutely. The use case you described (preserving a memcached
> instance via memfd) is a perfect fit for LUO's vision.
>
> While the primary use case driving this work is supporting the
> preservation of virtual machines on a hypervisor, the framework itself
> is not restricted to that scenario. We define "live update" as the
> process of updating the kernel from one version to another while
> preserving FD-based resources and keeping selected devices
> operational. The machine itself can be running storage, database,
> networking, containers, or anything else.
>
> A good parallel is Kernel Live Patching: we don't distinguish what
> workload is running on a machine when applying a security patch; we
> simply patch the running kernel. In the same way, Live Update is
> designed to be workload-agnostic. Whether the system is running an
> in-memory database, containers, or VMs, its primary goal is to enable
> a full kernel update while preserving the userspace-requested state.

Okay, then we are on the same page and I can live with whatever name we
go with :-)

BTW, I think it would be useful to make this clarification on the LUO
docs as well so the intended use case/audience of the API is clear.
Currently the doc string in luo_core.c only talks about hypervisors and
VMs.

-- 
Regards,
Pratyush Yadav

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [RFC v2 05/16] luo: luo_core: integrate with KHO
  2025-06-24 16:12                               ` Pratyush Yadav
@ 2025-06-24 16:55                                 ` Pasha Tatashin
  2025-06-24 18:31                                 ` Jason Gunthorpe
  1 sibling, 0 replies; 102+ messages in thread
From: Pasha Tatashin @ 2025-06-24 16:55 UTC (permalink / raw)
  To: Pratyush Yadav
  Cc: Mike Rapoport, Jason Gunthorpe, jasonmiu, graf, changyuanl,
	dmatlack, rientjes, corbet, rdunlap, ilpo.jarvinen, kanie, ojeda,
	aliceryhl, masahiroy, akpm, tj, yoann.congal, mmaurer,
	roman.gushchin, chenridong, axboe, mark.rutland, jannh,
	vincent.guittot, hannes, dan.j.williams, david, joel.granados,
	rostedt, anna.schumaker, song, zhangguopeng, linux, linux-kernel,
	linux-doc, linux-mm, gregkh, tglx, mingo, bp, dave.hansen, x86,
	hpa, rafael, dakr, bartosz.golaszewski, cw00.choi, myungjoo.ham,
	yesanishhere, Jonathan.Cameron, quic_zijuhu, aleksander.lobakin,
	ira.weiny, andriy.shevchenko, leon, lukas, bhelgaas, wagi,
	djeffery, stuart.w.hayes

On Tue, Jun 24, 2025 at 12:12 PM Pratyush Yadav <pratyush@kernel.org> wrote:
>
> On Fri, Jun 20 2025, Pasha Tatashin wrote:
>
> > On Fri, Jun 20, 2025 at 11:28 AM Pratyush Yadav <pratyush@kernel.org> wrote:
> >> On Thu, Jun 19 2025, Pasha Tatashin wrote:
> [...]
> >> Outside of hypervisor live update, I have a very clear use case in mind:
> >> userspace memory handover (on guest side). Say a guest running an
> >> in-memory cache like memcached with many gigabytes of cache wants to
> >> reboot. It can just shove the cache into a memfd, give it to LUO, and
> >> restore it after reboot. Some services that suffer from long reboots are
> >> looking into using this to reduce downtime. Since it pretty much
> >> overlaps with the hypervisor work for now, I haven't been talking about
> >> it as much.
> >>
> >> Would you also call this use case "live update"? Does it also fit with
> >> your vision of where LUO should go?
> >
> > Yes, absolutely. The use case you described (preserving a memcached
> > instance via memfd) is a perfect fit for LUO's vision.
> >
> > While the primary use case driving this work is supporting the
> > preservation of virtual machines on a hypervisor, the framework itself
> > is not restricted to that scenario. We define "live update" as the
> > process of updating the kernel from one version to another while
> > preserving FD-based resources and keeping selected devices
> > operational. The machine itself can be running storage, database,
> > networking, containers, or anything else.
> >
> > A good parallel is Kernel Live Patching: we don't distinguish what
> > workload is running on a machine when applying a security patch; we
> > simply patch the running kernel. In the same way, Live Update is
> > designed to be workload-agnostic. Whether the system is running an
> > in-memory database, containers, or VMs, its primary goal is to enable
> > a full kernel update while preserving the userspace-requested state.
>
> Okay, then we are on the same page and I can live with whatever name we
> go with :-)
>
> BTW, I think it would be useful to make this clarification on the LUO
> docs as well so the intended use case/audience of the API is clear.
> Currently the doc string in luo_core.c only talks about hypervisors and
> VMs.

Thank you for the feedback, I will expand Documentation.

Pasha

>
> --
> Regards,
> Pratyush Yadav

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [RFC v2 05/16] luo: luo_core: integrate with KHO
  2025-06-24 16:12                               ` Pratyush Yadav
  2025-06-24 16:55                                 ` Pasha Tatashin
@ 2025-06-24 18:31                                 ` Jason Gunthorpe
  1 sibling, 0 replies; 102+ messages in thread
From: Jason Gunthorpe @ 2025-06-24 18:31 UTC (permalink / raw)
  To: Pratyush Yadav
  Cc: Pasha Tatashin, Mike Rapoport, jasonmiu, graf, changyuanl,
	dmatlack, rientjes, corbet, rdunlap, ilpo.jarvinen, kanie, ojeda,
	aliceryhl, masahiroy, akpm, tj, yoann.congal, mmaurer,
	roman.gushchin, chenridong, axboe, mark.rutland, jannh,
	vincent.guittot, hannes, dan.j.williams, david, joel.granados,
	rostedt, anna.schumaker, song, zhangguopeng, linux, linux-kernel,
	linux-doc, linux-mm, gregkh, tglx, mingo, bp, dave.hansen, x86,
	hpa, rafael, dakr, bartosz.golaszewski, cw00.choi, myungjoo.ham,
	yesanishhere, Jonathan.Cameron, quic_zijuhu, aleksander.lobakin,
	ira.weiny, andriy.shevchenko, leon, lukas, bhelgaas, wagi,
	djeffery, stuart.w.hayes

On Tue, Jun 24, 2025 at 06:12:14PM +0200, Pratyush Yadav wrote:
> On Fri, Jun 20 2025, Pasha Tatashin wrote:
> 
> > On Fri, Jun 20, 2025 at 11:28 AM Pratyush Yadav <pratyush@kernel.org> wrote:
> >> On Thu, Jun 19 2025, Pasha Tatashin wrote:
> [...]
> >> Outside of hypervisor live update, I have a very clear use case in mind:
> >> userspace memory handover (on guest side). Say a guest running an
> >> in-memory cache like memcached with many gigabytes of cache wants to
> >> reboot. It can just shove the cache into a memfd, give it to LUO, and
> >> restore it after reboot. Some services that suffer from long reboots are
> >> looking into using this to reduce downtime. Since it pretty much
> >> overlaps with the hypervisor work for now, I haven't been talking about
> >> it as much.
> >>
> >> Would you also call this use case "live update"? Does it also fit with
> >> your vision of where LUO should go?
> >
> > Yes, absolutely. The use case you described (preserving a memcached
> > instance via memfd) is a perfect fit for LUO's vision.
> >
> > While the primary use case driving this work is supporting the
> > preservation of virtual machines on a hypervisor, the framework itself
> > is not restricted to that scenario. We define "live update" as the
> > process of updating the kernel from one version to another while
> > preserving FD-based resources and keeping selected devices
> > operational. The machine itself can be running storage, database,
> > networking, containers, or anything else.
> >
> > A good parallel is Kernel Live Patching: we don't distinguish what
> > workload is running on a machine when applying a security patch; we
> > simply patch the running kernel. In the same way, Live Update is
> > designed to be workload-agnostic. Whether the system is running an
> > in-memory database, containers, or VMs, its primary goal is to enable
> > a full kernel update while preserving the userspace-requested state.
> 
> Okay, then we are on the same page and I can live with whatever name we
> go with :-)
> 
> BTW, I think it would be useful to make this clarification on the LUO
> docs as well so the intended use case/audience of the API is clear.
> Currently the doc string in luo_core.c only talks about hypervisors and
> VMs.

Just to be clear though - you used the word "reboot" and here we are
really only talking about kexec. kexec is not really a reboot, but it
is sort of close.

LUO is a way to pass lots of different things across a kexec, and if
you are happy to use kexec then you should be able to use it.

Jason

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [RFC v2 10/16] luo: luo_ioctl: add ioctl interface
  2025-06-24 14:27     ` Pasha Tatashin
@ 2025-06-25  9:36       ` Christian Brauner
  2025-06-25 16:12         ` David Matlack
  2025-06-25 16:58         ` pasha.tatashin
  0 siblings, 2 replies; 102+ messages in thread
From: Christian Brauner @ 2025-06-25  9:36 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: pratyush, jasonmiu, graf, changyuanl, rppt, dmatlack, rientjes,
	corbet, rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl,
	masahiroy, akpm, tj, yoann.congal, mmaurer, roman.gushchin,
	chenridong, axboe, mark.rutland, jannh, vincent.guittot, hannes,
	dan.j.williams, david, joel.granados, rostedt, anna.schumaker,
	song, zhangguopeng, linux, linux-kernel, linux-doc, linux-mm,
	gregkh, tglx, mingo, bp, dave.hansen, x86, hpa, rafael, dakr,
	bartosz.golaszewski, cw00.choi, myungjoo.ham, yesanishhere,
	Jonathan.Cameron, quic_zijuhu, aleksander.lobakin, ira.weiny,
	andriy.shevchenko, leon, lukas, bhelgaas, wagi, djeffery,
	stuart.w.hayes, ptyadav

> > I'm not sure why people are so in love with character device based apis.
> > It's terrible. It glues everything to devtmpfs which isn't namespacable
> > in any way. It's terrible to delegate and extremely restrictive in terms
> > of extensibility if you need additional device entries (aka the loop
> > driver folly).
> >
> > One stupid question: I probably have asked this before and just swapped
> > out that I a) asked this already and b) received an explanation. But why
> > isn't this a singleton simple in-memory filesystem with a flat
> > hierarchy?
> 
> Hi Christian,
> 
> Thank you for the detailed feedback and for raising this important

I don't know about detailed but no problem.

> design question. I appreciate the points you've made about the
> benefits of a filesystem-based API.
> 
> I have thought thoroughly about this and explored various alternatives
> before settling on the ioctl-based interface. This design isn't a
> sudden decision but is based on ongoing conversations that have been
> happening for over two years at LPC, as well as incorporating direct
> feedback I received on LUOv1 at LSF/MM.

Well, Mike mentioned that ultimately you want to interface this with
systemd? And we certainly have never been privy to any of these
uapi design conversations. Which is usually not a good sign...

> 
> The choice for an ioctl-based character device was ultimately driven
> by the specific lifecycle and dependency management requirements of
> the live update process. While a filesystem API offers great
> advantages in visibility and hierarchy, filesystems are not typically
> designed to be state machines with the complex lifecycle, dependency,
> and ownership tracking that LUO needs to manage.
> 
> Let me elaborate on the key aspects that led to the current design:
> 
> 1. session-based lifecycle management: The preservation of an FD is
> tied to the open instance of /dev/liveupdate. If a userspace agent
> opens /dev/liveupdate, registers several FDs for preservation, and
> then crashes or exits before the prepare phase is triggered, all FDs
> it registered are automatically unregistered. This "session-scoped"
> behavior is crucial to prevent leaking preserved resources into the
> next kernel if the controlling process fails. This is naturally
> handled by the open() and release() file operations on a character
> device. It's not immediately obvious how a similar automatic,
> session-based cleanup would be implemented with a singleton
> filesystem.

fwiw

fd_context = fsopen("kexecfs")
fsconfig(fd_context, FSCONFIG_CMD_CREATE, ...)
fd_mnt = fsmount(fd_context, ...)

This gets you a private kexecfs instance that's never visible anywhere
in the filesystem hierarchy. When the fd is closed everything gets auto
cleaned up by the kernel. No need to umount or anything.
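
Spelled out in C, the sequence above might look like the following sketch. It
uses raw syscalls so it does not depend on libc wrappers. The detached_mount()
helper name is made up for illustration, and "kexecfs" does not exist today,
so the filesystem type is a parameter (e.g. "tmpfs" works for experimenting,
given sufficient privileges).

```c
#define _GNU_SOURCE
#include <errno.h>
#include <linux/mount.h>	/* FSOPEN_CLOEXEC, FSCONFIG_CMD_CREATE, FSMOUNT_CLOEXEC */
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Create a private, detached instance of filesystem type @fsname.
 * Returns a mount fd on success or a negative errno value. The instance is
 * not attached anywhere in the mount tree; closing the returned fd drops it.
 */
int detached_mount(const char *fsname)
{
	int ctx, mnt, err;

	/* Step 1: open a filesystem configuration context. */
	ctx = syscall(SYS_fsopen, fsname, FSOPEN_CLOEXEC);
	if (ctx < 0)
		return -errno;

	/* Step 2: instantiate the superblock from that context. */
	if (syscall(SYS_fsconfig, ctx, FSCONFIG_CMD_CREATE, NULL, NULL, 0) < 0) {
		err = errno;
		close(ctx);
		return -err;
	}

	/* Step 3: turn the context into a detached mount fd. */
	mnt = syscall(SYS_fsmount, ctx, FSMOUNT_CLOEXEC, 0);
	err = errno;
	close(ctx);

	return mnt < 0 ? -err : mnt;
}
```

The returned fd can then be used as the dirfd for openat() and friends.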

> 2. state machine: LUO is fundamentally a state machine (NORMAL ->
> PREPARED -> FROZEN -> UPDATED -> NORMAL). As part of this, it provides
> a crucial guarantee: any resource that was successfully preserved but
> not explicitly reclaimed by userspace in the new kernel by the time
> the FINISH event is triggered will be automatically cleaned up and its
> memory released. This prevents leaks of unreclaimed resources and is
> managed by the orchestrator, which is a concept that doesn't map
> cleanly onto standard VFS semantics.

I'm not following this. See above. And also any umount can trivially
just destroy whatever resource is still left in the filesystem.

> 
> 3. dependency tracking: Unlike normal files, preserved resources for
> live update have strong, often complex interdependencies. For example,
> a kvmfd might depend on a guestmemfd; an iommufd can depend on vfiofd,
> eventfd, memfd, and kvmfd. LUO's current design provides explicit
> callback points (prepare, freeze) where these dependencies can be
> validated and tracked by the participating subsystems. If a dependency
> is not met when we are about to freeze, we can fail the entire
> operation and return an error to userspace. The cancel callback
> further allows this complex dependency graph to be unwound safely. A
> filesystem interface based on linkat() or unlink() doesn't inherently
> provide these critical, ordered points for dependency verification and
> rollback.
> 
> While I agree that a filesystem offers superior introspection and
> integration with standard tools, building this complex, stateful
> orchestration logic on top of VFS seemed to be forcing a square peg
> into a round hole. The ioctl interface, while more opaque, provides a
> direct and explicit way to command the state machine and manage these
> complex lifecycle and dependency rules.

I'm not going to argue that you have to switch to this kexecfs idea
but...

You're using a character device that's tied to devtmpfs. In other words,
you're already using a filesystem interface. Literally the whole code
here is built on top of filesystem APIs. So this argument is just very
wrong imho. If you can build it on top of a character device using VFS
interfaces you can do it as a minimal filesystem.

You're free to define the filesystem interface any way you like it. We
have a ton of examples there. All your ioctls would just be tied to the
filesystem instance instead of the /dev/somethingsomething character
device. The state machine could just be implemented the same way.

One of my points is that with an fs interface you can have easy state
serialization on a per-service level. IOW, you have a bunch of virtual
machines running as services or some networking services or whatever.
You could just bind-mount an instance of kexecfs into the service and
the service can persist state into the instance and easily recover it
after kexec.

But anyway, you seem to be set on the ioctl() interface, fine.

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [RFC v2 05/16] luo: luo_core: integrate with KHO
  2025-06-23 11:29                         ` Pasha Tatashin
@ 2025-06-25 13:46                           ` Mike Rapoport
  0 siblings, 0 replies; 102+ messages in thread
From: Mike Rapoport @ 2025-06-25 13:46 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: Pratyush Yadav, Jason Gunthorpe, jasonmiu, graf, changyuanl,
	dmatlack, rientjes, corbet, rdunlap, ilpo.jarvinen, kanie, ojeda,
	aliceryhl, masahiroy, akpm, tj, yoann.congal, mmaurer,
	roman.gushchin, chenridong, axboe, mark.rutland, jannh,
	vincent.guittot, hannes, dan.j.williams, david, joel.granados,
	rostedt, anna.schumaker, song, zhangguopeng, linux, linux-kernel,
	linux-doc, linux-mm, gregkh, tglx, mingo, bp, dave.hansen, x86,
	hpa, rafael, dakr, bartosz.golaszewski, cw00.choi, myungjoo.ham,
	yesanishhere, Jonathan.Cameron, quic_zijuhu, aleksander.lobakin,
	ira.weiny, andriy.shevchenko, leon, lukas, bhelgaas, wagi,
	djeffery, stuart.w.hayes

On Mon, Jun 23, 2025 at 07:29:09AM -0400, Pasha Tatashin wrote:
> On Mon, Jun 23, 2025 at 3:32 AM Mike Rapoport <rppt@kernel.org> wrote:
> >
> > On Wed, Jun 18, 2025 at 01:43:18PM -0400, Pasha Tatashin wrote:
> > > On Wed, Jun 18, 2025 at 1:00 PM Pasha Tatashin
> > >
> > > So currently, KHO provides the following two types of  internal API:
> > >
> > > Preserve memory and metadata
> > > =========================
> > > kho_preserve_folio() / kho_preserve_phys()
> > > kho_unpreserve_folio() / kho_unpreserve_phys()
> > > kho_restore_folio()
> > >
> > > kho_add_subtree() kho_retrieve_subtree()
> > >
> > > State machine
> > > ===========
> > > register_kho_notifier() / unregister_kho_notifier()
> > >
> > > kho_finalize() / kho_abort()
> > >
> > > We should remove the "State machine", and only keep the "Preserve
> > > Memory" API functions. At the time these functions are called, KHO
> > > should do the magic of making sure that the memory gets preserved
> > > across the reboot.
> > >
> > > This way, reserve_mem_init() would call: kho_preserve_folio() and
> > > kho_add_subtree() during boot, and be done with it.
> 
> For completeness, we also need `void kho_remove_subtree(const char
> *name);`. Currently, all trees are removed during kho_abort(). Let's
> rebase and include this patch on top of the next version of LUO that
> we are exchanging off list, and send it together later this week.

Oh, here's what I've got:
https://git.kernel.org/pub/scm/linux/kernel/git/rppt/linux.git/tree/?h=luo/v1.1

It compiles and passes the basic KHO test and the LUO selftest.
 
> Thanks,
> Pasha

-- 
Sincerely yours,
Mike.

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [RFC v2 10/16] luo: luo_ioctl: add ioctl interface
  2025-06-25  9:36       ` Christian Brauner
@ 2025-06-25 16:12         ` David Matlack
  2025-06-26 15:42           ` Pratyush Yadav
  2025-06-25 16:58         ` pasha.tatashin
  1 sibling, 1 reply; 102+ messages in thread
From: David Matlack @ 2025-06-25 16:12 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Pasha Tatashin, pratyush, jasonmiu, graf, changyuanl, rppt,
	rientjes, corbet, rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl,
	masahiroy, akpm, tj, yoann.congal, mmaurer, roman.gushchin,
	chenridong, axboe, mark.rutland, jannh, vincent.guittot, hannes,
	dan.j.williams, david, joel.granados, rostedt, anna.schumaker,
	song, zhangguopeng, linux, linux-kernel, linux-doc, linux-mm,
	gregkh, tglx, mingo, bp, dave.hansen, x86, hpa, rafael, dakr,
	bartosz.golaszewski, cw00.choi, myungjoo.ham, yesanishhere,
	Jonathan.Cameron, quic_zijuhu, aleksander.lobakin, ira.weiny,
	andriy.shevchenko, leon, lukas, bhelgaas, wagi, djeffery,
	stuart.w.hayes, ptyadav

On Wed, Jun 25, 2025 at 2:36 AM Christian Brauner <brauner@kernel.org> wrote:
> >
> > While I agree that a filesystem offers superior introspection and
> > integration with standard tools, building this complex, stateful
> > orchestration logic on top of VFS seemed to be forcing a square peg
> > into a round hole. The ioctl interface, while more opaque, provides a
> > direct and explicit way to command the state machine and manage these
> > complex lifecycle and dependency rules.
>
> I'm not going to argue that you have to switch to this kexecfs idea
> but...
>
> You're using a character device that's tied to devtmpfs. In other words,
> you're already using a filesystem interface. Literally the whole code
> here is built on top of filesystem APIs. So this argument is just very
> wrong imho. If you can build it on top of a character device using VFS
> interfaces you can do it as a minimal filesystem.
>
> You're free to define the filesystem interface any way you like it. We
> have a ton of examples there. All your ioctls would just be tied to the
> filesystem instance instead of the /dev/somethingsomething character
> device. The state machine could just be implemented the same way.
>
> One of my points is that with an fs interface you can have easy state
> serialization on a per-service level. IOW, you have a bunch of virtual
> machines running as services or some networking services or whatever.
> You could just bind-mount an instance of kexecfs into the service and
> the service can persist state into the instance and easily recover it
> after kexec.

This approach sounds worth exploring more. It would avoid the need for
a centralized daemon to mediate the preservation and restoration of
all file descriptors.

I'm not sure that we can get rid of the machine-wide state machine
though, as there is some kernel state that will necessarily cross
these kexecfs domains (e.g. IOMMU driver state). So we still might
need /dev/liveupdate for that.

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [RFC v2 10/16] luo: luo_ioctl: add ioctl interface
  2025-06-25  9:36       ` Christian Brauner
  2025-06-25 16:12         ` David Matlack
@ 2025-06-25 16:58         ` pasha.tatashin
  1 sibling, 0 replies; 102+ messages in thread
From: pasha.tatashin @ 2025-06-25 16:58 UTC (permalink / raw)
  To: Christian Brauner
  Cc: pratyush, jasonmiu, graf, changyuanl, rppt, dmatlack, rientjes,
	corbet, rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl,
	masahiroy, akpm, tj, yoann.congal, mmaurer, roman.gushchin,
	chenridong, axboe, mark.rutland, jannh, vincent.guittot, hannes,
	dan.j.williams, david, joel.granados, rostedt, anna.schumaker,
	song, zhangguopeng, linux, linux-kernel, linux-doc, linux-mm,
	gregkh, tglx, mingo, bp, dave.hansen, x86, hpa, rafael, dakr,
	bartosz.golaszewski, cw00.choi, myungjoo.ham, yesanishhere,
	Jonathan.Cameron, quic_zijuhu, aleksander.lobakin, ira.weiny,
	andriy.shevchenko, leon, lukas, bhelgaas, wagi, djeffery,
	stuart.w.hayes, ptyadav

On Wed, Jun 25, 2025 at 5:36 AM Christian Brauner <brauner@kernel.org> wrote:
>
> > > I'm not sure why people are so in love with character device based apis.
> > > It's terrible. It glues everything to devtmpfs which isn't namespacable
> > > in any way. It's terrible to delegate and extremely restrictive in terms
> > > of extensibility if you need additional device entries (aka the loop
> > > driver folly).
> > >
> > > One stupid question: I probably have asked this before and just swapped
> > > out that I a) asked this already and b) received an explanation. But why
> > > isn't this a singleton simple in-memory filesystem with a flat
> > > hierarchy?
> >
> > Hi Christian,
> >
> > Thank you for the detailed feedback and for raising this important
>
> I don't know about detailed but no problem.
>
> > design question. I appreciate the points you've made about the
> > benefits of a filesystem-based API.
> >
> > I have thought thoroughly about this and explored various alternatives
> > before settling on the ioctl-based interface. This design isn't a
> > sudden decision but is based on ongoing conversations that have been
> > happening for over two years at LPC, as well as incorporating direct
> > feedback I received on LUOv1 at LSF/MM.
>
> Well, Mike mentioned that ultimately you want to interface this with
> systemd? And we certainly have never been privy to any of these
> uapi design conversations. Which is usually not a good sign...
>
> >
> > The choice for an ioctl-based character device was ultimately driven
> > by the specific lifecycle and dependency management requirements of
> > the live update process. While a filesystem API offers great
> > advantages in visibility and hierarchy, filesystems are not typically
> > designed to be state machines with the complex lifecycle, dependency,
> > and ownership tracking that LUO needs to manage.
> >
> > Let me elaborate on the key aspects that led to the current design:
> >
> > 1. session based lifecycle management: The preservation of an FD is
> > tied to the open instance of /dev/liveupdate. If a userspace agent
> > opens /dev/liveupdate, registers several FDs for preservation, and
> > then crashes or exits before the prepare phase is triggered, all FDs
> > it registered are automatically unregistered. This "session-scoped"
> > behavior is crucial to prevent leaking preserved resources into the
> > next kernel if the controlling process fails. This is naturally
> > handled by the open() and release() file operations on a character
> > device. It's not immediately obvious how a similar automatic,
> > session-based cleanup would be implemented with a singleton
> > filesystem.
>
> fwiw
>
> fd_context = fsopen("kexecfs", ...)
> fsconfig(fd_context, FSCONFIG_CMD_CREATE, ...)
> fd_mnt = fsmount(fd_context, ...)

How is this kexecfs mount going to be restored into the container
view? Will we need to preserve fd_context in some global(?)
preservation way, i.e. in a root? Or is there a different way to
recreate fd_context upon reboot?

> This gets you a private kexecfs instances that's never visible anywhere
> in the filesystem hierarchy. When the fd is closed everything gets auto
> cleaned up by the kernel. No need to umount or anything.

Yes, this is a very good property of using a file system.

> > 2. state machine: LUO is fundamentally a state machine (NORMAL ->
> > PREPARED -> FROZEN -> UPDATED -> NORMAL). As part of this, it provides
> > a crucial guarantee: any resource that was successfully preserved but
> > not explicitly reclaimed by userspace in the new kernel by the time
> > the FINISH event is triggered will be automatically cleaned up and its
> > memory released. This prevents leaks of unreclaimed resources and is
> > managed by the orchestrator, which is a concept that doesn't map
> > cleanly onto standard VFS semantics.
>
> I'm not following this. See above. And also any umount can trivially
> just destroy whatever resource is still left in the filesystem.

LUO provides more than just resource preservation; it orchestrates the
serialization. While LUO can support various scenarios, let's use
virtual machines as an example.

The process involves distinct phases:

Brownout: Before suspending a VM, the Virtual Machine Monitor may take
actions to quiesce the guest's activity. For example, it might
temporarily prevent guest reboots to avoid new DMA mappings or PCI
device resets. We refer to this preparatory, limited-functionality
period as the "brownout."

Preparation: Following the brownout, LUO is transitioned into the
PREPARED state. This allows device states and other resources that
require significant time to serialize to be processed while the VMs are
still running. For most guests, this preparation period is unnoticeable.

Blackout: Once preparation is complete, the VMs are fully suspended in
memory, and the "blackout" period begins. The goal is to perform the
minimal required shutdown sequence and execute
reboot(LINUX_REBOOT_CMD_KEXEC) as quickly as possible. During this
shutdown, the VMM process itself might or might not be terminated.
With the FS approach, the VMM would have to stay alive for its state to
be preserved; with liveupdated, the VMM can be terminated and its
session in liveupdated carries the state into the kernel shutdown.

Restoration and Finish: After the new kernel boots, a userspace agent
like liveupdated would manage the preserved resources. It restores and
returns these resources to their respective VMMs or containers upon
request. Once all workloads have resumed, LUO is notified via the
FINISH event. LUO then cleans up any post live update state and
transitions the system back to the NORMAL state.
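
To make the ordering of these phases concrete, here is a small
userspace model of the transitions in C. This is only a sketch: the
state names follow the ones used in this thread, but the enum, the
event names (including the EV_KEXEC pseudo-event standing in for the
reboot into the new kernel) and luo_model_step() are invented for
illustration and are not taken from the patches. Allowing cancel from
both PREPARED and FROZEN is likewise an assumption of the model.

```c
#include <errno.h>

/* Illustrative model only; none of these names come from the series. */
enum luo_model_state { LUO_NORMAL, LUO_PREPARED, LUO_FROZEN, LUO_UPDATED };
enum luo_model_event { EV_PREPARE, EV_FREEZE, EV_KEXEC, EV_FINISH, EV_CANCEL };

/*
 * Apply @ev to *@state. Returns 0 and updates *@state if the transition
 * is allowed, or -EINVAL and leaves *@state untouched otherwise.
 */
static int luo_model_step(enum luo_model_state *state, enum luo_model_event ev)
{
	switch (*state) {
	case LUO_NORMAL:
		/* Brownout is over; start serializing slow state. */
		if (ev == EV_PREPARE) { *state = LUO_PREPARED; return 0; }
		break;
	case LUO_PREPARED:
		/* Blackout begins, or userspace backs out. */
		if (ev == EV_FREEZE) { *state = LUO_FROZEN; return 0; }
		if (ev == EV_CANCEL) { *state = LUO_NORMAL; return 0; }
		break;
	case LUO_FROZEN:
		/* The new kernel comes up in UPDATED (modelled as an event). */
		if (ev == EV_KEXEC)  { *state = LUO_UPDATED; return 0; }
		if (ev == EV_CANCEL) { *state = LUO_NORMAL; return 0; }
		break;
	case LUO_UPDATED:
		/* All workloads resumed; unreclaimed resources are dropped. */
		if (ev == EV_FINISH) { *state = LUO_NORMAL; return 0; }
		break;
	}
	return -EINVAL;
}
```

In this model an out-of-order request, e.g. FREEZE while still in
NORMAL, is rejected and the state is left unchanged, which is the kind
of explicit, ordered checkpoint the callbacks are meant to provide.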

> >
> > 3. dependency tracking: Unlike normal files, preserved resources for
> > live update have strong, often complex interdependencies. For example,
> > a kvmfd might depend on a guestmemfd; an iommufd can depend on vfiofd,
> > eventfd, memfd, and kvmfd. LUO's current design provides explicit
> > callback points (prepare, freeze) where these dependencies can be
> > validated and tracked by the participating subsystems. If a dependency
> > is not met when we are about to freeze, we can fail the entire
> > operation and return an error to userspace. The cancel callback
> > further allows this complex dependency graph to be unwound safely. A
> > filesystem interface based on linkat() or unlink() doesn't inherently
> > provide these critical, ordered points for dependency verification and
> > rollback.
> >
> > While I agree that a filesystem offers superior introspection and
> > integration with standard tools, building this complex, stateful
> > orchestration logic on top of VFS seemed to be forcing a square peg
> > into a round hole. The ioctl interface, while more opaque, provides a
> > direct and explicit way to command the state machine and manage these
> > complex lifecycle and dependency rules.
>
> I'm not going to argue that you have to switch to this kexecfs idea
> but...
>
> You're using a character device that's tied to devtmpfs. In other words,
> you're already using a filesystem interface. Literally the whole code
> here is built on top of filesystem APIs. So this argument is just very
> wrong imho. If you can build it on top of a character device using VFS
> interfaces you can do it as a minimal filesystem.
>
> You're free to define the filesystem interface any way you like it. We
> have a ton of examples there. All your ioctls would just be tied to the
> filesystem instance instead of the /dev/somethingsomething character
> device. The state machine could just be implemented the same way.
>
> One of my points is that with an fs interface you can have easy state
> serialization on a per-service level. IOW, you have a bunch of virtual
> machines running as services or some networking services or whatever.
> You could just bind-mount an instance of kexecfs into the service and
> the service can persist state into the instance and easily recover it
> after kexec.
>
> But anyway, you seem to be set on the ioctl() interface, fine.

I am not against your proposal, it should be discussed, perhaps at the
hypervisor live update bi-weekly meeting.

[1] https://lore.kernel.org/all/ee353d62-2e4c-b69c-39e6-1d273bfb01a0@google.com/

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [RFC v2 10/16] luo: luo_ioctl: add ioctl interface
  2025-06-25 16:12         ` David Matlack
@ 2025-06-26 15:42           ` Pratyush Yadav
  2025-06-26 16:24             ` David Matlack
  2025-07-06 14:33             ` Mike Rapoport
  0 siblings, 2 replies; 102+ messages in thread
From: Pratyush Yadav @ 2025-06-26 15:42 UTC (permalink / raw)
  To: David Matlack
  Cc: Christian Brauner, Pasha Tatashin, pratyush, jasonmiu, graf,
	changyuanl, rppt, rientjes, corbet, rdunlap, ilpo.jarvinen, kanie,
	ojeda, aliceryhl, masahiroy, akpm, tj, yoann.congal, mmaurer,
	roman.gushchin, chenridong, axboe, mark.rutland, jannh,
	vincent.guittot, hannes, dan.j.williams, david, joel.granados,
	rostedt, anna.schumaker, song, zhangguopeng, linux, linux-kernel,
	linux-doc, linux-mm, gregkh, tglx, mingo, bp, dave.hansen, x86,
	hpa, rafael, dakr, bartosz.golaszewski, cw00.choi, myungjoo.ham,
	yesanishhere, Jonathan.Cameron, quic_zijuhu, aleksander.lobakin,
	ira.weiny, andriy.shevchenko, leon, lukas, bhelgaas, wagi,
	djeffery, stuart.w.hayes

On Wed, Jun 25 2025, David Matlack wrote:

> On Wed, Jun 25, 2025 at 2:36 AM Christian Brauner <brauner@kernel.org> wrote:
>> >
>> > While I agree that a filesystem offers superior introspection and
>> > integration with standard tools, building this complex, stateful
>> > orchestration logic on top of VFS seemed to be forcing a square peg
>> > into a round hole. The ioctl interface, while more opaque, provides a
>> > direct and explicit way to command the state machine and manage these
>> > complex lifecycle and dependency rules.
>>
>> I'm not going to argue that you have to switch to this kexecfs idea
>> but...
>>
>> You're using a character device that's tied to devtmpfs. In other words,
>> you're already using a filesystem interface. Literally the whole code
>> here is built on top of filesystem APIs. So this argument is just very
>> wrong imho. If you can build it on top of a character device using VFS
>> interfaces you can do it as a minimal filesystem.
>>
>> You're free to define the filesystem interface any way you like it. We
>> have a ton of examples there. All your ioctls would just be tied to the
>> filesystem instance instead of the /dev/somethingsomething character
>> device. The state machine could just be implemented the same way.
>>
>> One of my points is that with an fs interface you can have easy state
>> serialization on a per-service level. IOW, you have a bunch of virtual
>> machines running as services or some networking services or whatever.
>> You could just bind-mount an instance of kexecfs into the service and
>> the service can persist state into the instance and easily recover it
>> after kexec.
>
> This approach sounds worth exploring more. It would avoid the need for
> a centralized daemon to mediate the preservation and restoration of
> all file descriptors.

One of the jobs of the centralized daemon is to decide the _policy_ of
who gets to preserve things and more importantly, make sure the right
party unpreserves the right FDs after a kexec. I don't see how this
interface fixes this problem. You would still need a way to identify
which kexecfs instance belongs to who and enforce that. The kernel
probably shouldn't be the one doing this kind of policy so you still
need some userspace component to make those decisions.

>
> I'm not sure that we can get rid of the machine-wide state machine
> though, as there is some kernel state that will necessarily cross
> these kexecfs domains (e.g. IOMMU driver state). So we still might
> need /dev/liveupdate for that.

Generally speaking, I think both VFS-based and IOCTL-based interfaces
are more or less equally expressive/powerful. Most of the ioctl
operations can be translated to a VFS operation and vice versa.

For example, the fsopen() call is similar to open("/dev/liveupdate") --
both would create a live update session which auto closes when the FD is
closed or FS unmounted. Similarly, each ioctl can be replaced with a
file in the FS. For example, LIVEUPDATE_IOCTL_FD_PRESERVE can be
replaced with a fd_preserve file where you write() the FD number.
LIVEUPDATE_IOCTL_GET_STATE or LIVEUPDATE_IOCTL_PREPARE, etc. can be
replaced by a "state" file where you can read() or write() the state.

I think the main benefit of the VFS-based interface is ease of use.
There already exist a bunch of utilities and libraries that we can use to
interact with files. When we have ioctls, we would need to write
everything ourselves. For example, instead of
LIVEUPDATE_IOCTL_GET_STATE, you can do "cat state", which is a bit
easier to do.
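
As a rough illustration of the file-based variant, reading the state
would need nothing beyond stdio. The sketch below assumes a "state"
file that returns the current state as one line of text; the file, its
format and the helper name read_state_file() are all hypothetical,
since no such interface exists in this series.

```c
#include <stdio.h>
#include <string.h>

/*
 * Read the first line of @path into @buf (at most @len - 1 bytes) and
 * strip the trailing newline. Returns 0 on success, -1 on any error.
 * Hypothetical helper for a hypothetical "state" file.
 */
static int read_state_file(const char *path, char *buf, size_t len)
{
	FILE *f;

	if (len == 0)
		return -1;

	f = fopen(path, "r");
	if (!f)
		return -1;

	if (!fgets(buf, (int)len, f)) {
		fclose(f);
		return -1;
	}
	fclose(f);

	/* Cut the line at the first newline, if there is one. */
	buf[strcspn(buf, "\n")] = '\0';
	return 0;
}
```

The ioctl equivalent would need the UAPI header and a struct definition
instead, which is the extra effort referred to above.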

As for downsides, I think we might end up with a bit more boilerplate
code, but beyond that I am not sure.

-- 
Regards,
Pratyush Yadav

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [RFC v2 10/16] luo: luo_ioctl: add ioctl interface
  2025-06-26 15:42           ` Pratyush Yadav
@ 2025-06-26 16:24             ` David Matlack
  2025-07-14 14:56               ` Pratyush Yadav
  2025-07-06 14:33             ` Mike Rapoport
  1 sibling, 1 reply; 102+ messages in thread
From: David Matlack @ 2025-06-26 16:24 UTC (permalink / raw)
  To: Pratyush Yadav
  Cc: Christian Brauner, Pasha Tatashin, jasonmiu, graf, changyuanl,
	rppt, rientjes, corbet, rdunlap, ilpo.jarvinen, kanie, ojeda,
	aliceryhl, masahiroy, akpm, tj, yoann.congal, mmaurer,
	roman.gushchin, chenridong, axboe, mark.rutland, jannh,
	vincent.guittot, hannes, dan.j.williams, david, joel.granados,
	rostedt, anna.schumaker, song, zhangguopeng, linux, linux-kernel,
	linux-doc, linux-mm, gregkh, tglx, mingo, bp, dave.hansen, x86,
	hpa, rafael, dakr, bartosz.golaszewski, cw00.choi, myungjoo.ham,
	yesanishhere, Jonathan.Cameron, quic_zijuhu, aleksander.lobakin,
	ira.weiny, andriy.shevchenko, leon, lukas, bhelgaas, wagi,
	djeffery, stuart.w.hayes

On Thu, Jun 26, 2025 at 8:42 AM Pratyush Yadav <pratyush@kernel.org> wrote:
>
> On Wed, Jun 25 2025, David Matlack wrote:
>
> > On Wed, Jun 25, 2025 at 2:36 AM Christian Brauner <brauner@kernel.org> wrote:
> >> >
> >> > While I agree that a filesystem offers superior introspection and
> >> > integration with standard tools, building this complex, stateful
> >> > orchestration logic on top of VFS seemed to be forcing a square peg
> >> > into a round hole. The ioctl interface, while more opaque, provides a
> >> > direct and explicit way to command the state machine and manage these
> >> > complex lifecycle and dependency rules.
> >>
> >> I'm not going to argue that you have to switch to this kexecfs idea
> >> but...
> >>
> >> You're using a character device that's tied to devtmpfs. In other words,
> >> you're already using a filesystem interface. Literally the whole code
> >> here is built on top of filesystem APIs. So this argument is just very
> >> wrong imho. If you can build it on top of a character device using VFS
> >> interfaces you can do it as a minimal filesystem.
> >>
> >> You're free to define the filesystem interface any way you like it. We
> >> have a ton of examples there. All your ioctls would just be tied to the
> >> filesystem instance instead of the /dev/somethingsomething character
> >> device. The state machine could just be implemented the same way.
> >>
> >> One of my points is that with an fs interface you can have easy state
> >> serialization on a per-service level. IOW, you have a bunch of virtual
> >> machines running as services or some networking services or whatever.
> >> You could just bind-mount an instance of kexecfs into the service and
> >> the service can persist state into the instance and easily recover it
> >> after kexec.
> >
> > This approach sounds worth exploring more. It would avoid the need for
> > a centralized daemon to mediate the preservation and restoration of
> > all file descriptors.
>
> One of the jobs of the centralized daemon is to decide the _policy_ of
> who gets to preserve things and more importantly, make sure the right
> party unpreserves the right FDs after a kexec. I don't see how this
> interface fixes this problem. You would still need a way to identify
> which kexecfs instance belongs to who and enforce that. The kernel
> probably shouldn't be the one doing this kind of policy so you still
> need some userspace component to make those decisions.

The main benefit I see in kexecfs is that it avoids needing to send
all FDs over UDS to/from liveupdated and therefore the need for
dynamic cross-process communication (e.g. RPCs).

Instead, something just needs to set up a kexecfs for each VM when it
is created, and give the same kexecfs back to each VM after kexec.
Then VMs are free to save/restore any FDs in that kexecfs without
cross-process communication or transferring file descriptors.

Policy can be enforced by controlling access to kexecfs mounts. This
naturally fits into the standard architecture of running untrusted VMs
(e.g. using chroots and containers to enforce security and isolation).

>
> >
> > I'm not sure that we can get rid of the machine-wide state machine
> > though, as there is some kernel state that will necessarily cross
> > these kexecfs domains (e.g. IOMMU driver state). So we still might
> > need /dev/liveupdate for that.
>
> Generally speaking, I think both VFS-based and IOCTL-based interfaces
> are more or less equally expressive/powerful. Most of the ioctl
> operations can be translated to a VFS operation and vice versa.
>
> For example, the fsopen() call is similar to open("/dev/liveupdate") --
> both would create a live update session which auto closes when the FD is
> closed or FS unmounted. Similarly, each ioctl can be replaced with a
> file in the FS. For example, LIVEUPDATE_IOCTL_FD_PRESERVE can be
> replaced with a fd_preserve file where you write() the FD number.
> LIVEUPDATE_IOCTL_GET_STATE or LIVEUPDATE_IOCTL_PREPARE, etc. can be
> replaced by a "state" file where you can read() or write() the state.
>
> I think the main benefit of the VFS-based interface is ease of use.
> There already exist a bunch of utilities and libraries that we can use to
> interact with files. When we have ioctls, we would need to write
> everything ourselves. For example, instead of
> LIVEUPDATE_IOCTL_GET_STATE, you can do "cat state", which is a bit
> easier to do.
>
> As for downsides, I think we might end up with a bit more boilerplate
> code, but beyond that I am not sure.

I agree we can more or less get to the same end state with either
approach. And also, I don't think we have to do one or the other. I
think kexecfs is something that we can build on top of this series.
For example, kexecfs would be a new kernel subsystem that registers
with LUO.

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [RFC v2 10/16] luo: luo_ioctl: add ioctl interface
  2025-06-24  9:50   ` Christian Brauner
  2025-06-24 14:27     ` Pasha Tatashin
@ 2025-07-06 14:24     ` Mike Rapoport
  2025-07-09 21:27       ` Pratyush Yadav
  1 sibling, 1 reply; 102+ messages in thread
From: Mike Rapoport @ 2025-07-06 14:24 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Pasha Tatashin, pratyush, jasonmiu, graf, changyuanl, dmatlack,
	rientjes, corbet, rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl,
	masahiroy, akpm, tj, yoann.congal, mmaurer, roman.gushchin,
	chenridong, axboe, mark.rutland, jannh, vincent.guittot, hannes,
	dan.j.williams, david, joel.granados, rostedt, anna.schumaker,
	song, zhangguopeng, linux, linux-kernel, linux-doc, linux-mm,
	gregkh, tglx, mingo, bp, dave.hansen, x86, hpa, rafael, dakr,
	bartosz.golaszewski, cw00.choi, myungjoo.ham, yesanishhere,
	Jonathan.Cameron, quic_zijuhu, aleksander.lobakin, ira.weiny,
	andriy.shevchenko, leon, lukas, bhelgaas, wagi, djeffery,
	stuart.w.hayes, ptyadav

On Tue, Jun 24, 2025 at 11:50:49AM +0200, Christian Brauner wrote:
> On Thu, May 15, 2025 at 06:23:14PM +0000, Pasha Tatashin wrote:
> > Introduce the user-space interface for the Live Update Orchestrator
> > via ioctl commands, enabling external control over the live update
> > process and management of preserved resources.
> > 
> > Create a misc character device at /dev/liveupdate. Access
> > to this device requires the CAP_SYS_ADMIN capability.
> > 
> > A new UAPI header, <uapi/linux/liveupdate.h>, defines the necessary
> > structures. The magic number is registered in
> > Documentation/userspace-api/ioctl/ioctl-number.rst.
> > 
> > Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
> > ---
> >  .../userspace-api/ioctl/ioctl-number.rst      |   1 +
> >  drivers/misc/liveupdate/Makefile              |   1 +
> >  drivers/misc/liveupdate/luo_ioctl.c           | 199 ++++++++++++
> >  include/linux/liveupdate.h                    |  34 +-
> >  include/uapi/linux/liveupdate.h               | 300 ++++++++++++++++++
> >  5 files changed, 502 insertions(+), 33 deletions(-)
> >  create mode 100644 drivers/misc/liveupdate/luo_ioctl.c
> >  create mode 100644 include/uapi/linux/liveupdate.h

...

> > +static const struct file_operations fops = {
> > +	.owner          = THIS_MODULE,
> > +	.open           = luo_open,
> > +	.unlocked_ioctl = luo_ioctl,
> > +};
> > +
> > +static struct miscdevice liveupdate_miscdev = {
> > +	.minor = MISC_DYNAMIC_MINOR,
> > +	.name  = "liveupdate",
> > +	.fops  = &fops,
> > +};
> 
> I'm not sure why people are so in love with character device based apis.
> It's terrible. It glues everything to devtmpfs which isn't namespacable
> in any way. It's terrible to delegate and extremely restrictive in terms
> of extensibility if you need additional device entries (aka the loop
> driver folly).
> 
> One stupid question: I probably have asked this before and just swapped
> out that I a) asked this already and b) received an explanation. But why
> isn't this a singleton simple in-memory filesystem with a flat
> hierarchy?
> 
> mount -t kexecfs kexecfs /kexecfs
> 
> So userspace mounts kexecfs (or the kernel does it automagically) and
> then to add fds into that thing you do the following:
> 
> linkat(fd_my_anon_inode_memfd, "", -EBADF, "kexecfs/my_serialized_memfd", AT_EMPTY_PATH)

Having an ability to link a file descriptor to kexecfs would have been
nice. We could even create a dependency hierarchy there, e.g.

mkdir -p kexecfs/vm1/kvm/{iommu,memfd}

linkat(kvmfd, "", -EBADF, "kexecfs/vm1/kvm/kvmfd", AT_EMPTY_PATH)
linkat(iommufd, "", -EBADF, "kexecfs/vm1/kvm/iommu/iommufd", AT_EMPTY_PATH)
linkat(memfd, "", -EBADF, "kexecfs/vm1/kvm/memfd/memfd", AT_EMPTY_PATH)

But unfortunately this won't work because VFS checks that new and old paths
are on the same mount. And even if cross-mount links were allowed, VFS does
not pass the file objects to link* APIs, so preserving a file backed by
anon_inode is another issue.
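
The cross-mount restriction is easy to observe from userspace. The sketch
below is only an illustration: link_fd_to_path() is a hypothetical helper,
not part of any proposed interface, and the exact errno depends on the
kernel version and on the caller's capabilities (typically EXDEV for a
cross-mount attempt, or ENOENT when AT_EMPTY_PATH is not permitted for the
caller).

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE		/* for AT_EMPTY_PATH in <fcntl.h> */
#endif
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

#ifndef AT_EMPTY_PATH
#define AT_EMPTY_PATH 0x1000	/* value from the Linux UAPI headers */
#endif

/*
 * Hypothetical helper: try to give an already-open file descriptor a
 * name at newpath. Returns 0 on success or a negative errno value.
 */
int link_fd_to_path(int fd, const char *newpath)
{
	if (linkat(fd, "", AT_FDCWD, newpath, AT_EMPTY_PATH) == 0)
		return 0;
	return -errno;
}
```

Calling it with a pipe or memfd descriptor and a target under /tmp is
expected to fail, because such descriptors live on internal kernel mounts
and the target is on a different one.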

> which will serialize the fd_my_anon_inode_memfd. You can also do this
> with ioctls on the kexecfs filesystem of course.

ioctls seem to be the only option, but I agree they don't have to be bound
to a miscdev.

-- 
Sincerely yours,
Mike.


* Re: [RFC v2 10/16] luo: luo_ioctl: add ioctl interface
  2025-06-26 15:42           ` Pratyush Yadav
  2025-06-26 16:24             ` David Matlack
@ 2025-07-06 14:33             ` Mike Rapoport
  2025-07-07 12:56               ` Jason Gunthorpe
  1 sibling, 1 reply; 102+ messages in thread
From: Mike Rapoport @ 2025-07-06 14:33 UTC (permalink / raw)
  To: Pratyush Yadav
  Cc: David Matlack, Christian Brauner, Pasha Tatashin, jasonmiu, graf,
	changyuanl, rientjes, corbet, rdunlap, ilpo.jarvinen, kanie,
	ojeda, aliceryhl, masahiroy, akpm, tj, yoann.congal, mmaurer,
	roman.gushchin, chenridong, axboe, mark.rutland, jannh,
	vincent.guittot, hannes, dan.j.williams, david, joel.granados,
	rostedt, anna.schumaker, song, zhangguopeng, linux, linux-kernel,
	linux-doc, linux-mm, gregkh, tglx, mingo, bp, dave.hansen, x86,
	hpa, rafael, dakr, bartosz.golaszewski, cw00.choi, myungjoo.ham,
	yesanishhere, Jonathan.Cameron, quic_zijuhu, aleksander.lobakin,
	ira.weiny, andriy.shevchenko, leon, lukas, bhelgaas, wagi,
	djeffery, stuart.w.hayes

On Thu, Jun 26, 2025 at 05:42:28PM +0200, Pratyush Yadav wrote:
> On Wed, Jun 25 2025, David Matlack wrote:
> 
> > On Wed, Jun 25, 2025 at 2:36 AM Christian Brauner <brauner@kernel.org> wrote:
> >> >
> >> > While I agree that a filesystem offers superior introspection and
> >> > integration with standard tools, building this complex, stateful
> >> > orchestration logic on top of VFS seemed to be forcing a square peg
> >> > into a round hole. The ioctl interface, while more opaque, provides a
> >> > direct and explicit way to command the state machine and manage these
> >> > complex lifecycle and dependency rules.
> >>
> >> I'm not going to argue that you have to switch to this kexecfs idea
> >> but...
> >>
> >> You're using a character device that's tied to devtmpfs. In other words,
> >> you're already using a filesystem interface. Literally the whole code
> >> here is built on top of filesystem APIs. So this argument is just very
> >> wrong imho. If you can build it on top of a character device using VFS
> >> interfaces you can do it as a minimal filesystem.
> >>
> >> You're free to define the filesystem interface any way you like it. We
> >> have a ton of examples there. All your ioctls would just be tied to the
> >> filesystem instance instead of the /dev/somethingsomething character
> >> device. The state machine could just be implemented the same way.
> >>
> >> One of my points is that with an fs interface you can have easy state
> >> serialization on a per-service level. IOW, you have a bunch of virtual
> >> machines running as services or some networking services or whatever.
> >> You could just bind-mount an instance of kexecfs into the service and
> >> the service can persist state into the instance and easily recover it
> >> after kexec.
> >
> > This approach sounds worth exploring more. It would avoid the need for
> > a centralized daemon to mediate the preservation and restoration of
> > all file descriptors.
> 
> One of the jobs of the centralized daemon is to decide the _policy_ of
> who gets to preserve things and more importantly, make sure the right
> party unpreserves the right FDs after a kexec. I don't see how this
> interface fixes this problem. You would still need a way to identify
> which kexecfs instance belongs to who and enforce that. The kernel
> probably shouldn't be the one doing this kind of policy so you still
> need some userspace component to make those decisions.
> 
> >
> > I'm not sure that we can get rid of the machine-wide state machine
> > though, as there is some kernel state that will necessarily cross
> > these kexecfs domains (e.g. IOMMU driver state). So we still might
> > need /dev/liveupdate for that.
> 
> Generally speaking, I think both VFS-based and IOCTL-based interfaces
> are more or less equally expressive/powerful. Most of the ioctl
> operations can be translated to a VFS operation and vice versa.
> 
> For example, the fsopen() call is similar to open("/dev/liveupdate") --
> both would create a live update session which auto closes when the FD is
> closed or FS unmounted. Similarly, each ioctl can be replaced with a
> file in the FS. For example, LIVEUPDATE_IOCTL_FD_PRESERVE can be
> replaced with a fd_preserve file where you write() the FD number.
> LIVEUPDATE_IOCTL_GET_STATE or LIVEUPDATE_IOCTL_PREPARE, etc. can be
> replaced by a "state" file where you can read() or write() the state.
> 
> I think the main benefit of the VFS-based interface is ease of use.
> There already exist a bunch of utilities and libraries that we can use to
> interact with files. When we have ioctls, we would need to write
> everything ourselves. For example, instead of
> LIVEUPDATE_IOCTL_GET_STATE, you can do "cat state", which is a bit
> easier to do.
>
> As for downsides, I think we might end up with a bit more boilerplate
> code, but beyond that I am not sure.

One of the points in Christian's suggestion was that ioctl doesn't have to
be bound to a misc device. Even if we don't use read()/write()/link() etc,
we can have a filesystem that exposes, say, a "control" file and that file
has the same liveupdate_ioctl() in its fops as we have now in miscdev.

The cost is indeed a bit of boilerplate code to create the filesystem, but
it would be easier to extend for per-service and containers support.

And we won't need a sysfs entry for status, as it can also be pre-populated
in kexecfs (or whatever it'll be called).
 
> -- 
> Regards,
> Pratyush Yadav

-- 
Sincerely yours,
Mike.


* Re: [RFC v2 10/16] luo: luo_ioctl: add ioctl interface
  2025-07-06 14:33             ` Mike Rapoport
@ 2025-07-07 12:56               ` Jason Gunthorpe
  0 siblings, 0 replies; 102+ messages in thread
From: Jason Gunthorpe @ 2025-07-07 12:56 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Pratyush Yadav, David Matlack, Christian Brauner, Pasha Tatashin,
	jasonmiu, graf, changyuanl, rientjes, corbet, rdunlap,
	ilpo.jarvinen, kanie, ojeda, aliceryhl, masahiroy, akpm, tj,
	yoann.congal, mmaurer, roman.gushchin, chenridong, axboe,
	mark.rutland, jannh, vincent.guittot, hannes, dan.j.williams,
	david, joel.granados, rostedt, anna.schumaker, song, zhangguopeng,
	linux, linux-kernel, linux-doc, linux-mm, gregkh, tglx, mingo, bp,
	dave.hansen, x86, hpa, rafael, dakr, bartosz.golaszewski,
	cw00.choi, myungjoo.ham, yesanishhere, Jonathan.Cameron,
	quic_zijuhu, aleksander.lobakin, ira.weiny, andriy.shevchenko,
	leon, lukas, bhelgaas, wagi, djeffery, stuart.w.hayes

On Sun, Jul 06, 2025 at 05:33:04PM +0300, Mike Rapoport wrote:

> One of the points in Christian's suggestion was that ioctl doesn't have to
> be bound to a misc device. Even if we don't use read()/write()/link() etc,
> we can have a filesystem that exposes, say, a "control" file and that file
> has the same liveupdate_ioctl() in its fops as we have now in miscdev.

IMHO for this application there is nothing wrong with a misc
device. The intention is for a single userspace process to use this as
some kind of request broker and provide the required policy layer.

Creating a VFS and then running ioctl inside the VFS just seems like
over-engineering to me. We can't really avoid the ioctls.

This is not really managing files in the sense of string named objects
with bytestreams associated with them.

I've also heard people saying things like configs were a mistake, so
I'm not so sure about this. IIRC VFS brings a bunch of standard use
models and their associated races that the kernel is forced to deal
with, while the simple ioctl here has none of that complexity.

> The cost is indeed a bit of boilerplate code to create the filesystem, but
> it would be easier to extend for per-service and containers support.

I don't think it really improves that. You still have a single policy
agent in userspace that has to control this thing. 

On the other side you'd have a much more complex serialization job
because you have to capture an open ended filesystem instead of the
much simpler u64 key/value scheme the ioctl is using.
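
To make the comparison concrete, here is a minimal sketch of what a flat
u64 key/value table can look like. It is not LUO's actual data structure:
struct kv_table, kv_put(), kv_get() and KV_MAX are all made up for
illustration. The point is only that such a table is a fixed-layout array
with no pointers, names or hierarchy in it.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define KV_MAX 16	/* arbitrary capacity for the sketch */

struct kv_entry {
	uint64_t key;
	uint64_t value;
};

/* Hypothetical flat table: no pointers, so the layout is trivially copyable. */
struct kv_table {
	size_t count;
	struct kv_entry entries[KV_MAX];
};

/* Look up key; returns true and stores the value in *value if present. */
bool kv_get(const struct kv_table *t, uint64_t key, uint64_t *value)
{
	for (size_t i = 0; i < t->count; i++) {
		if (t->entries[i].key == key) {
			*value = t->entries[i].value;
			return true;
		}
	}
	return false;
}

/* Insert a new pair; fails on a duplicate key or a full table. */
bool kv_put(struct kv_table *t, uint64_t key, uint64_t value)
{
	uint64_t unused;

	if (t->count == KV_MAX || kv_get(t, key, &unused))
		return false;
	t->entries[t->count].key = key;
	t->entries[t->count].value = value;
	t->count++;
	return true;
}
```

A filesystem-shaped store would additionally have to record names,
directories and their relationships, which is the extra serialization work
referred to above.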

Jason


* Re: [RFC v2 10/16] luo: luo_ioctl: add ioctl interface
  2025-07-06 14:24     ` Mike Rapoport
@ 2025-07-09 21:27       ` Pratyush Yadav
  2025-07-10  7:26         ` Mike Rapoport
  0 siblings, 1 reply; 102+ messages in thread
From: Pratyush Yadav @ 2025-07-09 21:27 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Christian Brauner, Pasha Tatashin, pratyush, jasonmiu, graf,
	changyuanl, dmatlack, rientjes, corbet, rdunlap, ilpo.jarvinen,
	kanie, ojeda, aliceryhl, masahiroy, akpm, tj, yoann.congal,
	mmaurer, roman.gushchin, chenridong, axboe, mark.rutland, jannh,
	vincent.guittot, hannes, dan.j.williams, david, joel.granados,
	rostedt, anna.schumaker, song, zhangguopeng, linux, linux-kernel,
	linux-doc, linux-mm, gregkh, tglx, mingo, bp, dave.hansen, x86,
	hpa, rafael, dakr, bartosz.golaszewski, cw00.choi, myungjoo.ham,
	yesanishhere, Jonathan.Cameron, quic_zijuhu, aleksander.lobakin,
	ira.weiny, andriy.shevchenko, leon, lukas, bhelgaas, wagi,
	djeffery, stuart.w.hayes

On Sun, Jul 06 2025, Mike Rapoport wrote:

> On Tue, Jun 24, 2025 at 11:50:49AM +0200, Christian Brauner wrote:
>> On Thu, May 15, 2025 at 06:23:14PM +0000, Pasha Tatashin wrote:
>> > Introduce the user-space interface for the Live Update Orchestrator
>> > via ioctl commands, enabling external control over the live update
>> > process and management of preserved resources.
>> > 
>> > Create a misc character device at /dev/liveupdate. Access
>> > to this device requires the CAP_SYS_ADMIN capability.
>> > 
>> > A new UAPI header, <uapi/linux/liveupdate.h>, defines the necessary
>> > structures. The magic number is registered in
>> > Documentation/userspace-api/ioctl/ioctl-number.rst.
>> > 
>> > Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
>> > ---
>> >  .../userspace-api/ioctl/ioctl-number.rst      |   1 +
>> >  drivers/misc/liveupdate/Makefile              |   1 +
>> >  drivers/misc/liveupdate/luo_ioctl.c           | 199 ++++++++++++
>> >  include/linux/liveupdate.h                    |  34 +-
>> >  include/uapi/linux/liveupdate.h               | 300 ++++++++++++++++++
>> >  5 files changed, 502 insertions(+), 33 deletions(-)
>> >  create mode 100644 drivers/misc/liveupdate/luo_ioctl.c
>> >  create mode 100644 include/uapi/linux/liveupdate.h
>
> ...
>
>> > +static const struct file_operations fops = {
>> > +	.owner          = THIS_MODULE,
>> > +	.open           = luo_open,
>> > +	.unlocked_ioctl = luo_ioctl,
>> > +};
>> > +
>> > +static struct miscdevice liveupdate_miscdev = {
>> > +	.minor = MISC_DYNAMIC_MINOR,
>> > +	.name  = "liveupdate",
>> > +	.fops  = &fops,
>> > +};
>> 
>> I'm not sure why people are so in love with character device based apis.
>> It's terrible. It glues everything to devtmpfs which isn't namespacable
>> in any way. It's terrible to delegate and extremely restrictive in terms
>> of extensibility if you need additional device entries (aka the loop
>> driver folly).
>> 
>> One stupid question: I probably have asked this before and just swapped
>> out that I a) asked this already and b) received an explanation. But why
>> isn't this a singleton simple in-memory filesystem with a flat
>> hierarchy?
>> 
>> mount -t kexecfs kexecfs /kexecfs
>> 
>> So userspace mounts kexecfs (or the kernel does it automagically) and
>> then to add fds into that thing you do the following:
>> 
>> linkat(fd_my_anon_inode_memfd, "", -EBADF, "kexecfs/my_serialized_memfd", AT_EMPTY_PATH)
>
> Having an ability to link a file descriptor to kexecfs would have been
> nice. We could even create a dependency hierarchy there, e.g.
>
> mkdir -p kexecfs/vm1/kvm/{iommu,memfd}
>
> linkat(kvmfd, "", -EBADF, "kexecfs/vm1/kvm/kvmfd", AT_EMPTY_PATH)
> linkat(iommufd, "", -EBADF, "kexecfs/vm1/kvm/iommu/iommufd", AT_EMPTY_PATH)
> linkat(memfd, "", -EBADF, "kexecfs/vm1/kvm/memfd/memfd", AT_EMPTY_PATH)
>
> But unfortunately this won't work because VFS checks that new and old paths
> are on the same mount. And even if cross-mount links were allowed, VFS does
> not pass the file objects to link* APIs, so preserving a file backed by
> anon_inode is another issue.

Yep, I was poking around the VFS code last week and saw the same
problem.

>
>> which will serialize the fd_my_anon_inode_memfd. You can also do this
>> with ioctls on the kexecfs filesystem of course.
>
> ioctls seem to be the only option, but I agree they don't have to be bound
> to a miscdev.

I suppose you can have a special file, say "preserve_fd", where you can
write() the FD number.

This is in some ways similar to how you would write it to the ioctl()
via the arg buffer/struct. And I suppose you can have other special
files to do the things that other ioctls would do.

That is one way to do it, although I dunno if it classifies as a
"proper" use of the VFS APIs...
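
A rough sketch of the userspace side of that idea follows. Everything in
it is an assumption: no "preserve_fd" file exists, and the decimal-text
format is just one plausible encoding. write_fd_number() is a
hypothetical helper name.

```c
#include <errno.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Hypothetical helper: write fd as a decimal string followed by a
 * newline to ctl_fd (which would be the opened "preserve_fd" file).
 * Returns 0 on success or a negative errno value.
 */
int write_fd_number(int ctl_fd, int fd)
{
	char buf[16];
	int len = snprintf(buf, sizeof(buf), "%d\n", fd);
	ssize_t n;

	if (len < 0 || (size_t)len >= sizeof(buf))
		return -EINVAL;
	n = write(ctl_fd, buf, (size_t)len);
	if (n < 0)
		return -errno;
	/* Treat a short write as an error rather than retrying. */
	return n == len ? 0 : -EIO;
}
```

As with the ioctl argument, the number is only meaningful in the file
descriptor table of the process doing the write().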

-- 
Regards,
Pratyush Yadav


* Re: [RFC v2 10/16] luo: luo_ioctl: add ioctl interface
  2025-07-09 21:27       ` Pratyush Yadav
@ 2025-07-10  7:26         ` Mike Rapoport
  2025-07-14 14:34           ` Jason Gunthorpe
  0 siblings, 1 reply; 102+ messages in thread
From: Mike Rapoport @ 2025-07-10  7:26 UTC (permalink / raw)
  To: Pratyush Yadav
  Cc: Christian Brauner, Pasha Tatashin, jasonmiu, graf, changyuanl,
	dmatlack, rientjes, corbet, rdunlap, ilpo.jarvinen, kanie, ojeda,
	aliceryhl, masahiroy, akpm, tj, yoann.congal, mmaurer,
	roman.gushchin, chenridong, axboe, mark.rutland, jannh,
	vincent.guittot, hannes, dan.j.williams, david, joel.granados,
	rostedt, anna.schumaker, song, zhangguopeng, linux, linux-kernel,
	linux-doc, linux-mm, gregkh, tglx, mingo, bp, dave.hansen, x86,
	hpa, rafael, dakr, bartosz.golaszewski, cw00.choi, myungjoo.ham,
	yesanishhere, Jonathan.Cameron, quic_zijuhu, aleksander.lobakin,
	ira.weiny, andriy.shevchenko, leon, lukas, bhelgaas, wagi,
	djeffery, stuart.w.hayes

On Wed, Jul 09, 2025 at 11:27:08PM +0200, Pratyush Yadav wrote:
> On Sun, Jul 06 2025, Mike Rapoport wrote:
> 
> > On Tue, Jun 24, 2025 at 11:50:49AM +0200, Christian Brauner wrote:
> >> On Thu, May 15, 2025 at 06:23:14PM +0000, Pasha Tatashin wrote:
> >> > Introduce the user-space interface for the Live Update Orchestrator
> >> > via ioctl commands, enabling external control over the live update
> >> > process and management of preserved resources.
> >> > 
> >> > Create a misc character device at /dev/liveupdate. Access
> >> > to this device requires the CAP_SYS_ADMIN capability.
> >> > 
> >> > A new UAPI header, <uapi/linux/liveupdate.h>, defines the necessary
> >> > structures. The magic number is registered in
> >> > Documentation/userspace-api/ioctl/ioctl-number.rst.
> >> > 
> >> > Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
> >> > ---
> >> >  .../userspace-api/ioctl/ioctl-number.rst      |   1 +
> >> >  drivers/misc/liveupdate/Makefile              |   1 +
> >> >  drivers/misc/liveupdate/luo_ioctl.c           | 199 ++++++++++++
> >> >  include/linux/liveupdate.h                    |  34 +-
> >> >  include/uapi/linux/liveupdate.h               | 300 ++++++++++++++++++
> >> >  5 files changed, 502 insertions(+), 33 deletions(-)
> >> >  create mode 100644 drivers/misc/liveupdate/luo_ioctl.c
> >> >  create mode 100644 include/uapi/linux/liveupdate.h
> >
> > ...
> >
> >> > +static const struct file_operations fops = {
> >> > +	.owner          = THIS_MODULE,
> >> > +	.open           = luo_open,
> >> > +	.unlocked_ioctl = luo_ioctl,
> >> > +};
> >> > +
> >> > +static struct miscdevice liveupdate_miscdev = {
> >> > +	.minor = MISC_DYNAMIC_MINOR,
> >> > +	.name  = "liveupdate",
> >> > +	.fops  = &fops,
> >> > +};
> >> 
> >> I'm not sure why people are so in love with character device based apis.
> >> It's terrible. It glues everything to devtmpfs which isn't namespacable
> >> in any way. It's terrible to delegate and extremely restrictive in terms
> >> of extensibility if you need additional device entries (aka the loop
> >> driver folly).
> >> 
> >> One stupid question: I probably have asked this before and just swapped
> >> out that I a) asked this already and b) received an explanation. But why
> >> isn't this a singleton simple in-memory filesystem with a flat
> >> hierarchy?
> >> 
> >> mount -t kexecfs kexecfs /kexecfs
> >> 
> >> So userspace mounts kexecfs (or the kernel does it automagically) and
> >> then to add fds into that thing you do the following:
> >> 
> >> linkat(fd_my_anon_inode_memfd, "", -EBADF, "kexecfs/my_serialized_memfd", AT_EMPTY_PATH)
> >
> > Having an ability to link a file descriptor to kexecfs would have been
> > nice. We could even create a dependency hierarchy there, e.g.
> >
> > mkdir -p kexecfs/vm1/kvm/{iommu,memfd}
> >
> > linkat(kvmfd, "", -EBADF, "kexecfs/vm1/kvm/kvmfd", AT_EMPTY_PATH)
> > linkat(iommufd, "", -EBADF, "kexecfs/vm1/kvm/iommu/iommufd", AT_EMPTY_PATH)
> > linkat(memfd, "", -EBADF, "kexecfs/vm1/kvm/memfd/memfd", AT_EMPTY_PATH)
> >
> > But unfortunately this won't work because VFS checks that new and old paths
> > are on the same mount. And even if cross-mount links were allowed, VFS does
> > not pass the file objects to link* APIs, so preserving a file backed by
> > anon_inode is another issue.
> 
> Yep, I was poking around the VFS code last week and saw the same
> problem.
> 
> >
> >> which will serialize the fd_my_anon_inode_memfd. You can also do this
> >> with ioctls on the kexecfs filesystem of course.
> >
> > ioctls seem to be the only option, but I agree they don't have to be bound
> > to a miscdev.
> 
> I suppose you can have a special file, say "preserve_fd", where you can
> write() the FD number.
> 
> This is in some ways similar to how you would write it to the ioctl()
> via the arg buffer/struct. And I suppose you can have other special
> files to do the things that other ioctls would do.
> 
> That is one way to do it, although I dunno if it classifies as a
> "proper" use of the VFS APIs...

IIUC Christian's point was mostly not about using VFS APIs (i.e.
read/write) but about using a special pseudo fs rather than devtmpfs to
drive ioctls.
 
So instead of 

	fd = open("/dev/liveupdate", ...);
	ioctl(fd, ...);

we'd use

	fd = open("/sys/fs/kexec/control", ...);
	ioctl(fd, ...);

> -- 
> Regards,
> Pratyush Yadav

-- 
Sincerely yours,
Mike.


* Re: [RFC v2 10/16] luo: luo_ioctl: add ioctl interface
  2025-07-10  7:26         ` Mike Rapoport
@ 2025-07-14 14:34           ` Jason Gunthorpe
  2025-07-16  9:43             ` Greg KH
  0 siblings, 1 reply; 102+ messages in thread
From: Jason Gunthorpe @ 2025-07-14 14:34 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Pratyush Yadav, Christian Brauner, Pasha Tatashin, jasonmiu, graf,
	changyuanl, dmatlack, rientjes, corbet, rdunlap, ilpo.jarvinen,
	kanie, ojeda, aliceryhl, masahiroy, akpm, tj, yoann.congal,
	mmaurer, roman.gushchin, chenridong, axboe, mark.rutland, jannh,
	vincent.guittot, hannes, dan.j.williams, david, joel.granados,
	rostedt, anna.schumaker, song, zhangguopeng, linux, linux-kernel,
	linux-doc, linux-mm, gregkh, tglx, mingo, bp, dave.hansen, x86,
	hpa, rafael, dakr, bartosz.golaszewski, cw00.choi, myungjoo.ham,
	yesanishhere, Jonathan.Cameron, quic_zijuhu, aleksander.lobakin,
	ira.weiny, andriy.shevchenko, leon, lukas, bhelgaas, wagi,
	djeffery, stuart.w.hayes

On Thu, Jul 10, 2025 at 10:26:45AM +0300, Mike Rapoport wrote:
> IIUC Christian's point was mostly not about using VFS APIs (i.e.
> read/write) but about using a special pseudo fs rather than devtmpfs to
> drive ioctls.
>  
> So instead of 
> 
> 	fd = open("/dev/liveupdate", ...);
> 	ioctl(fd, ...);
> 
> we'd use
> 
> 	fd = open("/sys/fs/kexec/control", ...);
> 	ioctl(fd, ...);

Please no, /sys/ is much worse.

/dev/ has lots of infrastructure to control permissions/etc that /sys/
does not.

If you want to do ioctls on something, then what you open() is a character
dev, and you accept the limitations with namespaces, coarse permissions
and so on.

Jason


* Re: [RFC v2 10/16] luo: luo_ioctl: add ioctl interface
  2025-06-26 16:24             ` David Matlack
@ 2025-07-14 14:56               ` Pratyush Yadav
  2025-07-17 16:17                 ` David Matlack
  0 siblings, 1 reply; 102+ messages in thread
From: Pratyush Yadav @ 2025-07-14 14:56 UTC (permalink / raw)
  To: David Matlack
  Cc: Pratyush Yadav, Christian Brauner, Pasha Tatashin, jasonmiu, graf,
	changyuanl, rppt, rientjes, corbet, rdunlap, ilpo.jarvinen, kanie,
	ojeda, aliceryhl, masahiroy, akpm, tj, yoann.congal, mmaurer,
	roman.gushchin, chenridong, axboe, mark.rutland, jannh,
	vincent.guittot, hannes, dan.j.williams, david, joel.granados,
	rostedt, anna.schumaker, song, zhangguopeng, linux, linux-kernel,
	linux-doc, linux-mm, gregkh, tglx, mingo, bp, dave.hansen, x86,
	hpa, rafael, dakr, bartosz.golaszewski, cw00.choi, myungjoo.ham,
	yesanishhere, Jonathan.Cameron, quic_zijuhu, aleksander.lobakin,
	ira.weiny, andriy.shevchenko, leon, lukas, bhelgaas, wagi,
	djeffery, stuart.w.hayes

Hi David,

On Thu, Jun 26 2025, David Matlack wrote:

> On Thu, Jun 26, 2025 at 8:42 AM Pratyush Yadav <pratyush@kernel.org> wrote:
>>
>> On Wed, Jun 25 2025, David Matlack wrote:
>>
>> > On Wed, Jun 25, 2025 at 2:36 AM Christian Brauner <brauner@kernel.org> wrote:
>> >> >
>> >> > While I agree that a filesystem offers superior introspection and
>> >> > integration with standard tools, building this complex, stateful
>> >> > orchestration logic on top of VFS seemed to be forcing a square peg
>> >> > into a round hole. The ioctl interface, while more opaque, provides a
>> >> > direct and explicit way to command the state machine and manage these
>> >> > complex lifecycle and dependency rules.
>> >>
>> >> I'm not going to argue that you have to switch to this kexecfs idea
>> >> but...
>> >>
>> >> You're using a character device that's tied to devtmpfs. In other words,
>> >> you're already using a filesystem interface. Literally the whole code
>> >> here is built on top of filesystem APIs. So this argument is just very
>> >> wrong imho. If you can build it on top of a character device using VFS
>> >> interfaces you can do it as a minimal filesystem.
>> >>
>> >> You're free to define the filesystem interface any way you like it. We
>> >> have a ton of examples there. All your ioctls would just be tied to the
>> >> filesystem instance instead of the /dev/somethingsomething character
>> >> device. The state machine could just be implemented the same way.
>> >>
>> >> One of my points is that with an fs interface you can have easy state
>> >> serialization on a per-service level. IOW, you have a bunch of virtual
>> >> machines running as services or some networking services or whatever.
>> >> You could just bind-mount an instance of kexecfs into the service and
>> >> the service can persist state into the instance and easily recover it
>> >> after kexec.
>> >
>> > This approach sounds worth exploring more. It would avoid the need for
>> > a centralized daemon to mediate the preservation and restoration of
>> > all file descriptors.
>>
>> One of the jobs of the centralized daemon is to decide the _policy_ of
>> who gets to preserve things and more importantly, make sure the right
>> party unpreserves the right FDs after a kexec. I don't see how this
>> interface fixes this problem. You would still need a way to identify
>> which kexecfs instance belongs to who and enforce that. The kernel
>> probably shouldn't be the one doing this kind of policy so you still
>> need some userspace component to make those decisions.
>
> The main benefit I see of kexecfs is that it avoids needing to send
> all FDs over UDS to/from liveupdated and therefore the need for
> dynamic cross-process communication (e.g. RPCs).
>
> Instead, something just needs to set up a kexecfs for each VM when it
> is created, and give the same kexecfs back to each VM after kexec.
> Then VMs are free to save/restore any FDs in that kexecfs without
> cross-process communication or transferring file descriptors.

Isn't giving back the right kexecfs instance to the right VMM the main
problem? After a kexec, you need a way to make that policy decision. You
would need a userspace agent to do that.

I think what you are suggesting does make a lot of sense -- the agent
should be handing out sessions instead of FDs, which would make FD
save/restore simpler for applications. But that can be done using the
ioctl interface as well. Each time you open() /dev/liveupdate, you
get a new session. Instead of file FDs like memfd or iommufd, we can
have the agent hand out these session FDs and anything that was saved
using this session would be ready for restoring.
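
Handing a session FD from the agent to a VMM would presumably use
ordinary SCM_RIGHTS passing over a Unix domain socket. A minimal sketch of
that mechanism, with send_fd() and recv_fd() as hypothetical helper names
and nothing LUO-specific in it:

```c
#include <errno.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/* Send fd over the connected Unix socket sock. Returns 0 or -errno. */
int send_fd(int sock, int fd)
{
	char byte = 0;	/* at least one byte of real data must accompany the fd */
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union {
		struct cmsghdr align;	/* forces correct alignment of buf */
		char buf[CMSG_SPACE(sizeof(int))];
	} ctrl;
	struct msghdr msg = { 0 };
	struct cmsghdr *cmsg;

	memset(&ctrl, 0, sizeof(ctrl));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = ctrl.buf;
	msg.msg_controllen = sizeof(ctrl.buf);

	cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

	return sendmsg(sock, &msg, 0) < 0 ? -errno : 0;
}

/* Receive one fd from sock. Returns the new fd or a negative errno. */
int recv_fd(int sock)
{
	char byte;
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union {
		struct cmsghdr align;
		char buf[CMSG_SPACE(sizeof(int))];
	} ctrl;
	struct msghdr msg = { 0 };
	struct cmsghdr *cmsg;
	int fd;

	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = ctrl.buf;
	msg.msg_controllen = sizeof(ctrl.buf);

	if (recvmsg(sock, &msg, 0) < 0)
		return -errno;

	cmsg = CMSG_FIRSTHDR(&msg);
	if (!cmsg || cmsg->cmsg_level != SOL_SOCKET ||
	    cmsg->cmsg_type != SCM_RIGHTS ||
	    cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
		return -EPROTO;	/* no single fd attached to this message */

	memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
	return fd;
}
```

With session FDs, the agent would do this once per client rather than
once per preserved file.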

My main point is that this can be done with the current interface as
well as kexecfs. I think there is very much a reason for considering
kexecfs (like not being dependent on devtmpfs), but I don't think this
is necessarily the main one.

>
> Policy can be enforced by controlling access to kexecfs mounts. This
> naturally fits into the standard architecture of running untrusted VMs
> (e.g. using chroots and containers to enforce security and isolation).

How? After a kexec, how do you tell which process can get which kexecfs
mount/instance? If any of them can get any, then we lose all sorts of
policy enforcement.

>
>>
>> >
>> > I'm not sure that we can get rid of the machine-wide state machine
>> > though, as there is some kernel state that will necessarily cross
>> > these kexecfs domains (e.g. IOMMU driver state). So we still might
>> > need /dev/liveupdate for that.
>>
>> Generally speaking, I think both VFS-based and IOCTL-based interfaces
>> are more or less equally expressive/powerful. Most of the ioctl
>> operations can be translated to a VFS operation and vice versa.
>>
>> For example, the fsopen() call is similar to open("/dev/liveupdate") --
>> both would create a live update session which auto closes when the FD is
>> closed or FS unmounted. Similarly, each ioctl can be replaced with a
>> file in the FS. For example, LIVEUPDATE_IOCTL_FD_PRESERVE can be
>> replaced with a fd_preserve file where you write() the FD number.
>> LIVEUPDATE_IOCTL_GET_STATE or LIVEUPDATE_IOCTL_PREPARE, etc. can be
>> replaced by a "state" file where you can read() or write() the state.
>>
>> I think the main benefit of the VFS-based interface is ease of use.
>> There already exist a bunch of utilities and libraries that we can use to
>> interact with files. When we have ioctls, we would need to write
>> everything ourselves. For example, instead of
>> LIVEUPDATE_IOCTL_GET_STATE, you can do "cat state", which is a bit
>> easier to do.
>>
>> As for downsides, I think we might end up with a bit more boilerplate
>> code, but beyond that I am not sure.
>
> I agree we can more or less get to the same end state with either
> approach. And also, I don't think we have to do one or the other. I
> think kexecfs is something that we can build on top of this series.
> For example, kexecfs would be a new kernel subsystem that registers
> with LUO.

Yeah, fair point. Though I'd rather we agree on one and go with that.
Having two interfaces for the same thing isn't the best.

-- 
Regards,
Pratyush Yadav


* Re: [RFC v2 10/16] luo: luo_ioctl: add ioctl interface
  2025-07-14 14:34           ` Jason Gunthorpe
@ 2025-07-16  9:43             ` Greg KH
  0 siblings, 0 replies; 102+ messages in thread
From: Greg KH @ 2025-07-16  9:43 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Mike Rapoport, Pratyush Yadav, Christian Brauner, Pasha Tatashin,
	jasonmiu, graf, changyuanl, dmatlack, rientjes, corbet, rdunlap,
	ilpo.jarvinen, kanie, ojeda, aliceryhl, masahiroy, akpm, tj,
	yoann.congal, mmaurer, roman.gushchin, chenridong, axboe,
	mark.rutland, jannh, vincent.guittot, hannes, dan.j.williams,
	david, joel.granados, rostedt, anna.schumaker, song, zhangguopeng,
	linux, linux-kernel, linux-doc, linux-mm, tglx, mingo, bp,
	dave.hansen, x86, hpa, rafael, dakr, bartosz.golaszewski,
	cw00.choi, myungjoo.ham, yesanishhere, Jonathan.Cameron,
	quic_zijuhu, aleksander.lobakin, ira.weiny, andriy.shevchenko,
	leon, lukas, bhelgaas, wagi, djeffery, stuart.w.hayes

On Mon, Jul 14, 2025 at 11:34:43AM -0300, Jason Gunthorpe wrote:
> On Thu, Jul 10, 2025 at 10:26:45AM +0300, Mike Rapoport wrote:
> > IIUC Christian's point was mostly not about using VFS APIs (i.e.
> > read/write) but about using a special pseudo fs rather than devtmpfs to
> > drive ioctls.
> >  
> > So instead of 
> > 
> > 	fd = open("/dev/liveupdate", ...);
> > 	ioctl(fd, ...);
> > 
> > we'd use
> > 
> > 	fd = open("/sys/fs/kexec/control", ...);
> > 	ioctl(fd, ...);
> 
> Please no, /sys/ is much worse.
> 
> /dev/ has lots of infrastructure to control permissions/etc that /sys/
> does not.
> 
> If you want to do ioctls on something, then what you open() is a character
> dev, and you accept the limitations with namespaces, coarse permissions
> and so on.

Then use a special filesystem, and not sysfs.  It's easy to embed a
virtual filesystem in a driver, please do that instead.

thanks,

greg k-h


* Re: [RFC v2 10/16] luo: luo_ioctl: add ioctl interface
  2025-07-14 14:56               ` Pratyush Yadav
@ 2025-07-17 16:17                 ` David Matlack
  2025-07-23 14:51                   ` Pratyush Yadav
  0 siblings, 1 reply; 102+ messages in thread
From: David Matlack @ 2025-07-17 16:17 UTC (permalink / raw)
  To: Pratyush Yadav
  Cc: Christian Brauner, Pasha Tatashin, jasonmiu, graf, changyuanl,
	rppt, rientjes, corbet, rdunlap, ilpo.jarvinen, kanie, ojeda,
	aliceryhl, masahiroy, akpm, tj, yoann.congal, mmaurer,
	roman.gushchin, chenridong, axboe, mark.rutland, jannh,
	vincent.guittot, hannes, dan.j.williams, david, joel.granados,
	rostedt, anna.schumaker, song, zhangguopeng, linux, linux-kernel,
	linux-doc, linux-mm, gregkh, tglx, mingo, bp, dave.hansen, x86,
	hpa, rafael, dakr, bartosz.golaszewski, cw00.choi, myungjoo.ham,
	yesanishhere, Jonathan.Cameron, quic_zijuhu, aleksander.lobakin,
	ira.weiny, andriy.shevchenko, leon, lukas, bhelgaas, wagi,
	djeffery, stuart.w.hayes

On Mon, Jul 14, 2025 at 7:56 AM Pratyush Yadav <pratyush@kernel.org> wrote:
> On Thu, Jun 26 2025, David Matlack wrote:
> > On Thu, Jun 26, 2025 at 8:42 AM Pratyush Yadav <pratyush@kernel.org> wrote:
> >> On Wed, Jun 25 2025, David Matlack wrote:
> >> > On Wed, Jun 25, 2025 at 2:36 AM Christian Brauner <brauner@kernel.org> wrote:
> >> >> >
> >> >> > While I agree that a filesystem offers superior introspection and
> >> >> > integration with standard tools, building this complex, stateful
> >> >> > orchestration logic on top of VFS seemed to be forcing a square peg
> >> >> > into a round hole. The ioctl interface, while more opaque, provides a
> >> >> > direct and explicit way to command the state machine and manage these
> >> >> > complex lifecycle and dependency rules.
> >> >>
> >> >> I'm not going to argue that you have to switch to this kexecfs idea
> >> >> but...
> >> >>
> >> >> You're using a character device that's tied to devmptfs. In other words,
> >> >> you're already using a filesystem interface. Literally the whole code
> >> >> here is built on top of filesystem APIs. So this argument is just very
> >> >> wrong imho. If you can build it on top of a character device using VFS
> >> >> interfaces you can do it as a minimal filesystem.
> >> >>
> >> >> You're free to define the filesystem interface any way you like it. We
> >> >> have a ton of examples there. All your ioctls would just be tied to the
> >> >> filesystem instance instead of the /dev/somethingsomething character
> >> >> device. The state machine could just be implemented the same way.
> >> >>
> >> >> One of my points is that with an fs interface you can have easy state
> >> >> serialization on a per-service level. IOW, you have a bunch of virtual
> >> >> machines running as services or some networking services or whatever.
> >> >> You could just bind-mount an instance of kexecfs into the service and
> >> >> the service can persist state into the instance and easily recover it
> >> >> after kexec.
> >> >
> >> > This approach sounds worth exploring more. It would avoid the need for
> >> > a centralized daemon to mediate the preservation and restoration of
> >> > all file descriptors.
> >>
> >> One of the jobs of the centralized daemon is to decide the _policy_ of
> >> who gets to preserve things and more importantly, make sure the right
> >> party unpreserves the right FDs after a kexec. I don't see how this
> >> interface fixes this problem. You would still need a way to identify
> >> which kexecfs instance belongs to who and enforce that. The kernel
> >> probably shouldn't be the one doing this kind of policy so you still
> >> need some userspace component to make those decisions.
> >
> > The main benefit I see of kexecfs is that it avoids needing to send
> > all FDs over UDS to/from liveupdated and therefore the need for
> > dynamic cross-process communication (e.g. RPCs).
> >
> > Instead, something just needs to set up a kexecfs for each VM when it
> > is created, and give the same kexecfs back to each VM after kexec.
> > Then VMs are free to save/restore any FDs in that kexecfs without
> > cross-process communication or transferring file descriptors.
>
> Isn't giving back the right kexecfs instance to the right VMM the main
> problem? After a kexec, you need a way to make that policy decision. You
> would need a userspace agent to do that.
>
> I think what you are suggesting does make a lot of sense -- the agent
> should be handing out sessions instead of FDs, which would make FD
> save/restore simpler for applications. But that can be done using the
> ioctl interface as well. Each time you open() the /dev/liveupdate, you
> get a new session. Instead of file FDs like memfd or iommufs, we can
> have the agent hand out these session FDs and anything that was saved
> using this session would be ready for restoring.
>
> My main point is that this can be done with the current interface as
> well as kexecfs. I think there is very much a reason for considering
> kexecfs (like not being dependent on devtmpfs), but I don't think this
> is necessarily the main one.

The main problem I'd like solved is requiring all FDs to be preserved and
restored in the context of a central daemon, since I think this will
inevitably cause problems for KVM. I agree with you that this problem
can also be solved in other ways, such as session FDs (good idea!).

>
> >
> > Policy can be enforced by controlling access to kexecfs mounts. This
> > naturally fits into the standard architecture of running untrusted VMs
> > (e.g. using chroots and containers to enforce security and isolation).
>
> How? After a kexec, how do you tell which process can get which kexecfs
> mount/instance? If any of them can get any, then we lose all sorts of
> policy enforcement.

I was imagining that whatever process/daemon creates the kexecfs
instances before kexec is also responsible for reassociating them with
the right processes after kexec.

If you are asking how that association would be done mechanically, I
was imagining it would be through a combination of filesystem
permissions, mounts, and chroots. For example, the kexecfs instance
for VM A would be mounted in VM A's chroot. VM A would then only have
access to its own kexecfs instance.
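
As a rough sketch of that mechanism (kexecfs does not exist, so the
"kexecfs" filesystem type, the idea of naming an instance through the
mount source string, and both helper names below are assumptions made
up for illustration):

```c
#include <stdio.h>
#include <sys/mount.h>
#include <sys/types.h>

/*
 * Build "<chroot_dir>/kexecfs" into buf.
 * Returns 0 on success, -1 if the path does not fit in buf.
 */
static int kexecfs_mountpoint(const char *chroot_dir, char *buf, size_t len)
{
	int n = snprintf(buf, len, "%s/kexecfs", chroot_dir);

	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/*
 * Mount one per-VM instance inside that VM's chroot. The filesystem
 * type "kexecfs" and the use of the source string as an instance name
 * are hypothetical; on current kernels this mount(2) call fails.
 * The mount point directory is assumed to exist already.
 */
static int mount_kexecfs_for_vm(const char *chroot_dir, const char *instance)
{
	char target[4096];

	if (kexecfs_mountpoint(chroot_dir, target, sizeof(target)))
		return -1;

	return mount(instance, target, "kexecfs",
		     MS_NOSUID | MS_NODEV | MS_NOEXEC, NULL);
}
```

The daemon would call something like mount_kexecfs_for_vm("/srv/vm-a",
"vm-a") both when the VM is created and again after kexec, so the VMM
only ever sees the instance mounted inside its own chroot.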

> >> > I'm not sure that we can get rid of the machine-wide state machine
> >> > though, as there is some kernel state that will necessarily cross
> >> > these kexecfs domains (e.g. IOMMU driver state). So we still might
> >> > need /dev/liveupdate for that.
> >>
> >> Generally speaking, I think both VFS-based and IOCTL-based interfaces
> >> are more or less equally expressive/powerful. Most of the ioctl
> >> operations can be translated to a VFS operation and vice versa.
> >>
> >> For example, the fsopen() call is similar to open("/dev/liveupdate") --
> >> both would create a live update session which auto closes when the FD is
> >> closed or FS unmounted. Similarly, each ioctl can be replaced with a
> >> file in the FS. For example, LIVEUPDATE_IOCTL_FD_PRESERVE can be
> >> replaced with a fd_preserve file where you write() the FD number.
> >> LIVEUPDATE_IOCTL_GET_STATE or LIVEUPDATE_IOCTL_PREPARE, etc. can be
> >> replaced by a "state" file where you can read() or write() the state.
> >>
> >> I think the main benefit of the VFS-based interface is ease of use.
> >> There already exist a bunch of utilities and libraries that we can use to
> >> interact with files. When we have ioctls, we would need to write
> >> everything ourselves. For example, instead of
> >> LIVEUPDATE_IOCTL_GET_STATE, you can do "cat state", which is a bit
> >> easier to do.
> >>
> >> As for downsides, I think we might end up with a bit more boilerplate
> >> code, but beyond that I am not sure.
> >
> > I agree we can more or less get to the same end state with either
> > approach. And also, I don't think we have to do one or the other. I
> > think kexecfs is something that we can build on top of this series.
> > For example, kexecfs would be a new kernel subsystem that registers
> > with LUO.
>
> Yeah, fair point. Though I'd rather we agree on one and go with that.
> Having two interfaces for the same thing isn't the best.

Agreed, it would be better to have a single way to preserve FDs rather
than two (LUO ioctl and kexecfs).


* Re: [RFC v2 10/16] luo: luo_ioctl: add ioctl interface
  2025-07-17 16:17                 ` David Matlack
@ 2025-07-23 14:51                   ` Pratyush Yadav
  0 siblings, 0 replies; 102+ messages in thread
From: Pratyush Yadav @ 2025-07-23 14:51 UTC (permalink / raw)
  To: David Matlack
  Cc: Pratyush Yadav, Christian Brauner, Pasha Tatashin, jasonmiu, graf,
	changyuanl, rppt, rientjes, corbet, rdunlap, ilpo.jarvinen, kanie,
	ojeda, aliceryhl, masahiroy, akpm, tj, yoann.congal, mmaurer,
	roman.gushchin, chenridong, axboe, mark.rutland, jannh,
	vincent.guittot, hannes, dan.j.williams, david, joel.granados,
	rostedt, anna.schumaker, song, zhangguopeng, linux, linux-kernel,
	linux-doc, linux-mm, gregkh, tglx, mingo, bp, dave.hansen, x86,
	hpa, rafael, dakr, bartosz.golaszewski, cw00.choi, myungjoo.ham,
	yesanishhere, Jonathan.Cameron, quic_zijuhu, aleksander.lobakin,
	ira.weiny, andriy.shevchenko, leon, lukas, bhelgaas, wagi,
	djeffery, stuart.w.hayes

On Thu, Jul 17 2025, David Matlack wrote:

> On Mon, Jul 14, 2025 at 7:56 AM Pratyush Yadav <pratyush@kernel.org> wrote:
>> On Thu, Jun 26 2025, David Matlack wrote:
>> > On Thu, Jun 26, 2025 at 8:42 AM Pratyush Yadav <pratyush@kernel.org> wrote:
>> >> On Wed, Jun 25 2025, David Matlack wrote:
>> >> > On Wed, Jun 25, 2025 at 2:36 AM Christian Brauner <brauner@kernel.org> wrote:
[...]
>>
>> Isn't giving back the right kexecfs instance to the right VMM the main
>> problem? After a kexec, you need a way to make that policy decision. You
>> would need a userspace agent to do that.
>>
>> I think what you are suggesting does make a lot of sense -- the agent
>> should be handing out sessions instead of FDs, which would make FD
>> save/restore simpler for applications. But that can be done using the
>> ioctl interface as well. Each time you open() the /dev/liveupdate, you
>> get a new session. Instead of file FDs like memfd or iommufs, we can
>> have the agent hand out these session FDs and anything that was saved
>> using this session would be ready for restoring.
>>
>> My main point is that this can be done with the current interface as
>> well as kexecfs. I think there is very much a reason for considering
>> kexecfs (like not being dependent on devtmpfs), but I don't think this
>> is necessarily the main one.
>
> The main problem I'd like solved is requiring all FDs to be preserved and
> restored in the context of a central daemon, since I think this will
> inevitably cause problems for KVM. I agree with you that this problem
> can also be solved in other ways, such as session FDs (good idea!).

Another benefit of session FDs: the central daemon can decide whether it
wants to check each FD it hands over to a process, or just hand over a
session and let the process do whatever it wants. With the current
patches, only the former model can be implemented.
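
Handing a session over would use ordinary FD passing on a Unix domain
socket. A minimal sketch with SCM_RIGHTS follows; the helper names are
made up, and it works on any FD, since per-open() sessions on
/dev/liveupdate are only a proposal at this point:

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Control buffer sized and aligned for exactly one file descriptor. */
union fd_cmsg {
	char buf[CMSG_SPACE(sizeof(int))];
	struct cmsghdr align;
};

/* Send fd over a connected AF_UNIX socket. Returns 0 or -1. */
static int send_fd(int sock, int fd)
{
	char byte = 'S';	/* one byte of ordinary data must accompany the FD */
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union fd_cmsg u;
	struct msghdr msg = { 0 };
	struct cmsghdr *cmsg;

	memset(&u, 0, sizeof(u));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = u.buf;
	msg.msg_controllen = sizeof(u.buf);

	cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

	return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one FD sent by send_fd(). Returns the new FD or -1. */
static int recv_fd(int sock)
{
	char byte;
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union fd_cmsg u;
	struct msghdr msg = { 0 };
	struct cmsghdr *cmsg;
	int fd;

	memset(&u, 0, sizeof(u));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = u.buf;
	msg.msg_controllen = sizeof(u.buf);

	if (recvmsg(sock, &msg, 0) != 1)
		return -1;

	cmsg = CMSG_FIRSTHDR(&msg);
	if (!cmsg || cmsg->cmsg_level != SOL_SOCKET ||
	    cmsg->cmsg_type != SCM_RIGHTS ||
	    cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
		return -1;

	memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
	return fd;
}
```

The agent would call send_fd() once per VMM with that VMM's session FD,
and the VMM would then save and restore its own files through the FD
returned by recv_fd() without further round trips to the agent.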

>> >
>> > Policy can be enforced by controlling access to kexecfs mounts. This
>> > naturally fits into the standard architecture of running untrusted VMs
>> > (e.g. using chroots and containers to enforce security and isolation).
>>
>> How? After a kexec, how do you tell which process can get which kexecfs
>> mount/instance? If any of them can get any, then we lose all sorts of
>> policy enforcement.
>
> I was imagining that whatever process/daemon creates the kexecfs
> instances before kexec is also responsible for reassociating them with
> the right processes after kexec.
>
> If you are asking how that association would be done mechanically, I
> was imagining it would be through a combination of filesystem
> permissions, mounts, and chroots. For example, the kexecfs instance
> for VM A would be mounted in VM A's chroot. VM A would then only have
> access to its own kexecfs instance.

Hmm, good point. This would be quite a clean way of doing it I think.

[...]

-- 
Regards,
Pratyush Yadav


end of thread, other threads:[~2025-07-23 14:51 UTC | newest]

Thread overview: 102+ messages
2025-05-15 18:23 [RFC v2 00/16] Live Update Orchestrator Pasha Tatashin
2025-05-15 18:23 ` [RFC v2 01/16] kho: make debugfs interface optional Pasha Tatashin
2025-06-04 16:03   ` Pratyush Yadav
2025-06-06 16:12     ` Pasha Tatashin
2025-05-15 18:23 ` [RFC v2 02/16] kho: allow to drive kho from within kernel Pasha Tatashin
2025-05-15 18:23 ` [RFC v2 03/16] kho: add kho_unpreserve_folio/phys Pasha Tatashin
2025-06-04 15:00   ` Pratyush Yadav
2025-06-06 16:22     ` Pasha Tatashin
2025-05-15 18:23 ` [RFC v2 04/16] luo: luo_core: Live Update Orchestrator Pasha Tatashin
2025-05-26  6:31   ` Mike Rapoport
2025-05-30  5:00     ` Pasha Tatashin
2025-06-04 15:17   ` Pratyush Yadav
2025-06-07 17:11     ` Pasha Tatashin
2025-05-15 18:23 ` [RFC v2 05/16] luo: luo_core: integrate with KHO Pasha Tatashin
2025-05-26  7:18   ` Mike Rapoport
2025-06-07 17:50     ` Pasha Tatashin
2025-06-09  2:14       ` Pasha Tatashin
2025-06-04 16:00   ` Pratyush Yadav
2025-06-07 23:30     ` Pasha Tatashin
2025-06-13 14:58       ` Pratyush Yadav
2025-06-17 15:23         ` Jason Gunthorpe
2025-06-17 19:32           ` Pasha Tatashin
2025-06-18 13:11             ` Pratyush Yadav
2025-06-18 14:48               ` Pasha Tatashin
2025-06-18 16:40                 ` Mike Rapoport
2025-06-18 17:00                   ` Pasha Tatashin
2025-06-18 17:43                     ` Pasha Tatashin
2025-06-19 12:00                       ` Mike Rapoport
2025-06-19 14:22                         ` Pasha Tatashin
2025-06-20 15:28                           ` Pratyush Yadav
2025-06-20 16:03                             ` Pasha Tatashin
2025-06-24 16:12                               ` Pratyush Yadav
2025-06-24 16:55                                 ` Pasha Tatashin
2025-06-24 18:31                                 ` Jason Gunthorpe
2025-06-23  7:32                       ` Mike Rapoport
2025-06-23 11:29                         ` Pasha Tatashin
2025-06-25 13:46                           ` Mike Rapoport
2025-05-15 18:23 ` [RFC v2 06/16] luo: luo_subsystems: add subsystem registration Pasha Tatashin
2025-05-26  7:31   ` Mike Rapoport
2025-06-07 23:42     ` Pasha Tatashin
2025-05-28 19:12   ` David Matlack
2025-06-07 23:58     ` Pasha Tatashin
2025-06-04 16:30   ` Pratyush Yadav
2025-06-08  0:04     ` Pasha Tatashin
2025-05-15 18:23 ` [RFC v2 07/16] luo: luo_subsystems: implement subsystem callbacks Pasha Tatashin
2025-05-15 18:23 ` [RFC v2 08/16] luo: luo_files: add infrastructure for FDs Pasha Tatashin
2025-05-15 23:15   ` James Houghton
2025-05-23 18:09     ` Pasha Tatashin
2025-05-26  7:55   ` Mike Rapoport
2025-06-05 11:56     ` Pratyush Yadav
2025-06-08 13:13     ` Pasha Tatashin
2025-06-05 15:56   ` Pratyush Yadav
2025-06-08 13:37     ` Pasha Tatashin
2025-06-13 15:27       ` Pratyush Yadav
2025-06-15 18:02         ` Pasha Tatashin
2025-05-15 18:23 ` [RFC v2 09/16] luo: luo_files: implement file systems callbacks Pasha Tatashin
2025-06-05 16:03   ` Pratyush Yadav
2025-06-08 13:49     ` Pasha Tatashin
2025-06-13 15:18       ` Pratyush Yadav
2025-06-13 20:26         ` Pasha Tatashin
2025-06-16 10:43           ` Pratyush Yadav
2025-06-16 14:57             ` Pasha Tatashin
2025-06-18 13:16               ` Pratyush Yadav
2025-05-15 18:23 ` [RFC v2 10/16] luo: luo_ioctl: add ioctl interface Pasha Tatashin
2025-05-26  8:42   ` Mike Rapoport
2025-06-08 15:08     ` Pasha Tatashin
2025-05-28 20:29   ` David Matlack
2025-06-08 16:32     ` Pasha Tatashin
2025-06-05 16:15   ` Pratyush Yadav
2025-06-08 16:35     ` Pasha Tatashin
2025-06-24  9:50   ` Christian Brauner
2025-06-24 14:27     ` Pasha Tatashin
2025-06-25  9:36       ` Christian Brauner
2025-06-25 16:12         ` David Matlack
2025-06-26 15:42           ` Pratyush Yadav
2025-06-26 16:24             ` David Matlack
2025-07-14 14:56               ` Pratyush Yadav
2025-07-17 16:17                 ` David Matlack
2025-07-23 14:51                   ` Pratyush Yadav
2025-07-06 14:33             ` Mike Rapoport
2025-07-07 12:56               ` Jason Gunthorpe
2025-06-25 16:58         ` pasha.tatashin
2025-07-06 14:24     ` Mike Rapoport
2025-07-09 21:27       ` Pratyush Yadav
2025-07-10  7:26         ` Mike Rapoport
2025-07-14 14:34           ` Jason Gunthorpe
2025-07-16  9:43             ` Greg KH
2025-05-15 18:23 ` [RFC v2 11/16] luo: luo_sysfs: add sysfs state monitoring Pasha Tatashin
2025-06-05 16:20   ` Pratyush Yadav
2025-06-08 16:36     ` Pasha Tatashin
2025-06-13 15:13       ` Pratyush Yadav
2025-05-15 18:23 ` [RFC v2 12/16] reboot: call liveupdate_reboot() before kexec Pasha Tatashin
2025-05-15 18:23 ` [RFC v2 13/16] luo: add selftests for subsystems un/registration Pasha Tatashin
2025-05-26  8:52   ` Mike Rapoport
2025-06-08 16:47     ` Pasha Tatashin
2025-05-15 18:23 ` [RFC v2 14/16] selftests/liveupdate: add subsystem/state tests Pasha Tatashin
2025-05-15 18:23 ` [RFC v2 15/16] docs: add luo documentation Pasha Tatashin
2025-05-26  9:00   ` Mike Rapoport
2025-05-15 18:23 ` [RFC v2 16/16] MAINTAINERS: add liveupdate entry Pasha Tatashin
2025-05-20  7:25 ` [RFC v2 00/16] Live Update Orchestrator Mike Rapoport
2025-05-23 18:07   ` Pasha Tatashin
2025-05-26  6:32 ` Mike Rapoport
