Linux userland API discussions

Linux userland API discussions
 help / color / mirror / Atom feed

* [PATCH v4 12/30] liveupdate: luo_ioctl: add user interface
From: Pasha Tatashin @ 2025-09-29  1:03 UTC (permalink / raw)
  To: pratyush, jasonmiu, graf, changyuanl, pasha.tatashin, rppt,
	dmatlack, rientjes, corbet, rdunlap, ilpo.jarvinen, kanie, ojeda,
	aliceryhl, masahiroy, akpm, tj, yoann.congal, mmaurer,
	roman.gushchin, chenridong, axboe, mark.rutland, jannh,
	vincent.guittot, hannes, dan.j.williams, david, joel.granados,
	rostedt, anna.schumaker, song, zhangguopeng, linux, linux-kernel,
	linux-doc, linux-mm, gregkh, tglx, mingo, bp, dave.hansen, x86,
	hpa, rafael, dakr, bartosz.golaszewski, cw00.choi, myungjoo.ham,
	yesanishhere, Jonathan.Cameron, quic_zijuhu, aleksander.lobakin,
	ira.weiny, andriy.shevchenko, leon, lukas, bhelgaas, wagi,
	djeffery, stuart.w.hayes, ptyadav, lennart, brauner, linux-api,
	linux-fsdevel, saeedm, ajayachandra, jgg, parav, leonro, witu,
	hughd, skhawaja, chrisl, steven.sistare
In-Reply-To: <20250929010321.3462457-1-pasha.tatashin@soleen.com>

Introduce the user-space interface for the Live Update Orchestrator
via ioctl commands, enabling external control over the live update
process and management of preserved resources.

The idea is that there is going to be a single userspace agent driving
the live update, therefore, only a single process can ever hold this
device opened at a time.

The following ioctl commands are introduced:

LIVEUPDATE_IOCTL_GET_STATE
Allows userspace to query the current state of the LUO state machine
(e.g., NORMAL, PREPARED, UPDATED).

LIVEUPDATE_IOCTL_SET_EVENT
Enables userspace to drive the LUO state machine by sending global
events. This includes:

LIVEUPDATE_PREPARE
To begin the state-saving process.

LIVEUPDATE_FINISH
To signal completion of restoration in the new kernel.

LIVEUPDATE_CANCEL
To abort a prepared update.

LIVEUPDATE_IOCTL_CREATE_SESSION
Provides a way for userspace to create a named session for grouping file
descriptors that need to be preserved. It returns a new file descriptor
representing the session.

LIVEUPDATE_IOCTL_RETRIEVE_SESSION
Allows the userspace agent in the new kernel to reclaim a preserved
session by its name, receiving a new file descriptor to manage the
restored resources.

Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
---
 include/uapi/linux/liveupdate.h  | 199 ++++++++++++++++++++++++++++++
 kernel/liveupdate/luo_internal.h |  20 +++
 kernel/liveupdate/luo_ioctl.c    | 201 +++++++++++++++++++++++++++++++
 3 files changed, 420 insertions(+)

diff --git a/include/uapi/linux/liveupdate.h b/include/uapi/linux/liveupdate.h
index e8c0c210a790..2e38ef3094aa 100644
--- a/include/uapi/linux/liveupdate.h
+++ b/include/uapi/linux/liveupdate.h
@@ -14,6 +14,32 @@
 #include <linux/ioctl.h>
 #include <linux/types.h>
 
+/**
+ * DOC: General ioctl format
+ *
+ * The ioctl interface follows a general format to allow for extensibility. Each
+ * ioctl is passed in a structure pointer as the argument providing the size of
+ * the structure in the first u32. The kernel checks that any structure space
+ * beyond what it understands is 0. This allows userspace to use the backward
+ * compatible portion while consistently using the newer, larger, structures.
+ *
+ * ioctls use a standard meaning for common errnos:
+ *
+ *  - ENOTTY: The IOCTL number itself is not supported at all
+ *  - E2BIG: The IOCTL number is supported, but the provided structure has
+ *    non-zero in a part the kernel does not understand.
+ *  - EOPNOTSUPP: The IOCTL number is supported, and the structure is
+ *    understood, however a known field has a value the kernel does not
+ *    understand or support.
+ *  - EINVAL: Everything about the IOCTL was understood, but a field is not
+ *    correct.
+ *  - ENOENT: A provided token does not exist.
+ *  - ENOMEM: Out of memory.
+ *  - EOVERFLOW: Mathematics overflowed.
+ *
+ * As well as additional errnos, within specific ioctls.
+ */
+
 /**
  * enum liveupdate_state - Defines the possible states of the live update
  * orchestrator.
@@ -94,4 +120,177 @@ enum liveupdate_event {
 /* The maximum length of session name including null termination */
 #define LIVEUPDATE_SESSION_NAME_LENGTH 56
 
+/* The ioctl type, documented in ioctl-number.rst */
+#define LIVEUPDATE_IOCTL_TYPE		0xBA
+
+/* The /dev/liveupdate ioctl commands */
+enum {
+	LIVEUPDATE_CMD_BASE = 0x00,
+	LIVEUPDATE_CMD_GET_STATE = LIVEUPDATE_CMD_BASE,
+	LIVEUPDATE_CMD_SET_EVENT = 0x01,
+	LIVEUPDATE_CMD_CREATE_SESSION = 0x02,
+	LIVEUPDATE_CMD_RETRIEVE_SESSION = 0x03,
+};
+
+/**
+ * struct liveupdate_ioctl_get_state - ioctl(LIVEUPDATE_IOCTL_GET_STATE)
+ * @size:  Input; sizeof(struct liveupdate_ioctl_get_state)
+ * @state: Output; The current live update state.
+ *
+ * Query the current state of the live update orchestrator.
+ *
+ * The kernel fills the @state with the current
+ * state of the live update subsystem. Possible states are:
+ *
+ * - %LIVEUPDATE_STATE_NORMAL:   Default state; no live update operation is
+ *                               currently in progress.
+ * - %LIVEUPDATE_STATE_PREPARED: The preparation phase (triggered by
+ *                               %LIVEUPDATE_PREPARE) has completed
+ *                               successfully. The system is ready for the
+ *                               reboot transition. Note that some
+ *                               device operations (e.g., unbinding, new DMA
+ *                               mappings) might be restricted in this state.
+ * - %LIVEUPDATE_STATE_UPDATED:  The system has successfully rebooted into the
+ *                               new kernel via live update. It is now running
+ *                               the new kernel code and is awaiting the
+ *                               completion signal from user space via
+ *                               %LIVEUPDATE_FINISH after restoration tasks are
+ *                               done.
+ *
+ * See the definition of &enum liveupdate_state for more details on each state.
+ *
+ * Return: 0 on success, negative error code on failure.
+ */
+struct liveupdate_ioctl_get_state {
+	__u32	size;
+	__u32	state;
+};
+
+#define LIVEUPDATE_IOCTL_GET_STATE					\
+	_IO(LIVEUPDATE_IOCTL_TYPE, LIVEUPDATE_CMD_GET_STATE)
+
+/**
+ * struct liveupdate_ioctl_set_event - ioctl(LIVEUPDATE_IOCTL_SET_EVENT)
+ * @size:  Input; sizeof(struct liveupdate_ioctl_set_event)
+ * @event: Input; The live update event.
+ *
+ * Notify live update orchestrator about global event, that causes a state
+ * transition.
+ *
+ * Event, can be one of the following:
+ *
+ * - %LIVEUPDATE_PREPARE: Initiates the live update preparation phase. This
+ *                        typically triggers the saving process for items marked
+ *                        via the PRESERVE ioctls. This typically occurs
+ *                        *before* the "blackout window", while user
+ *                        applications (e.g., VMs) may still be running. Kernel
+ *                        subsystems receiving the %LIVEUPDATE_PREPARE event
+ *                        should serialize necessary state. This command does
+ *                        not transfer data.
+ * - %LIVEUPDATE_FINISH:  Signal restoration completion and triggercleanup.
+ *
+ *                        Signals that user space has completed all necessary
+ *                        restoration actions in the new kernel (after a live
+ *                        update reboot). Calling this ioctl triggers the
+ *                        cleanup phase: any resources that were successfully
+ *                        preserved but were *not* subsequently restored
+ *                        (reclaimed) via the RESTORE ioctls will have their
+ *                        preserved state discarded and associated kernel
+ *                        resources released. Involved devices may be reset. All
+ *                        desired restorations *must* be completed *before*
+ *                        this. Kernel callbacks for the %LIVEUPDATE_FINISH
+ *                        event must not fail. Successfully completing this
+ *                        phase transitions the system state from
+ *                        %LIVEUPDATE_STATE_UPDATED back to
+ *                        %LIVEUPDATE_STATE_NORMAL. This command does
+ *                        not transfer data.
+ * - %LIVEUPDATE_CANCEL:  Cancel the live update preparation phase.
+ *
+ *                        Notifies the live update subsystem to abort the
+ *                        preparation sequence potentially initiated by
+ *                        %LIVEUPDATE_PREPARE event.
+ *
+ *                        When triggered, subsystems receiving the
+ *                        %LIVEUPDATE_CANCEL event should revert any state
+ *                        changes or actions taken specifically for the aborted
+ *                        prepare phase (e.g., discard partially serialized
+ *                        state). The kernel releases resources allocated
+ *                        specifically for this *aborted preparation attempt*.
+ *
+ *                        This operation cancels the current *attempt* to
+ *                        prepare for a live update but does **not** remove
+ *                        previously validated items from the internal list
+ *                        of potentially preservable resources.
+ *
+ *                        This command does not transfer data. Kernel callbacks
+ *                        for the %LIVEUPDATE_CANCEL event must not fail.
+ *
+ * See the definition of &enum liveupdate_event for more details on each state.
+ *
+ * Return: 0 on success, negative error code on failure.
+ */
+struct liveupdate_ioctl_set_event {
+	__u32	size;
+	__u32	event;
+};
+
+#define LIVEUPDATE_IOCTL_SET_EVENT					\
+	_IO(LIVEUPDATE_IOCTL_TYPE, LIVEUPDATE_CMD_SET_EVENT)
+
+/**
+ * struct liveupdate_ioctl_create_session - ioctl(LIVEUPDATE_IOCTL_CREATE_SESSION)
+ * @size:	Input; sizeof(struct liveupdate_ioctl_create_session)
+ * @fd:		Output; The new file descriptor for the created session.
+ * @name:	Input; A null-terminated string for the session name, max
+ *		length %LIVEUPDATE_SESSION_NAME_LENGTH including termination
+ *		char.
+ *
+ * Creates a new live update session for managing preserved resources.
+ * This ioctl can only be called on the main /dev/liveupdate device.
+ *
+ * Return: 0 on success, negative error code on failure.
+ */
+struct liveupdate_ioctl_create_session {
+	__u32		size;
+	__s32		fd;
+	__u8		name[LIVEUPDATE_SESSION_NAME_LENGTH];
+};
+
+#define LIVEUPDATE_IOCTL_CREATE_SESSION					\
+	_IO(LIVEUPDATE_IOCTL_TYPE, LIVEUPDATE_CMD_CREATE_SESSION)
+
+/**
+ * struct liveupdate_ioctl_retrieve_session - ioctl(LIVEUPDATE_IOCTL_RETRIEVE_SESSION)
+ * @size:    Input; sizeof(struct liveupdate_ioctl_retrieve_session)
+ * @fd:      Output; The new file descriptor for the retrieved session.
+ * @name:    Input; A null-terminated string identifying the session to retrieve.
+ *           The name must exactly match the name used when the session was
+ *           created in the previous kernel.
+ *
+ * Retrieves a handle (a new file descriptor) for a preserved session by its
+ * name. This is the primary mechanism for a userspace agent to regain control
+ * of its preserved resources after a live update.
+ *
+ * The userspace application provides the null-terminated `name` of a session
+ * it created before the live update. If a preserved session with a matching
+ * name is found, the kernel instantiates it and returns a new file descriptor
+ * in the `fd` field. This new session FD can then be used for all file-specific
+ * operations, such as restoring individual file descriptors with
+ * LIVEUPDATE_SESSION_RESTORE_FD.
+ *
+ * It is the responsibility of the userspace application to know the names of
+ * the sessions it needs to retrieve. If no session with the given name is
+ * found, the ioctl will fail with -ENOENT.
+ *
+ * This ioctl can only be called on the main /dev/liveupdate device when the
+ * system is in the LIVEUPDATE_STATE_UPDATED state.
+ */
+struct liveupdate_ioctl_retrieve_session {
+	__u32		size;
+	__s32		fd;
+	__u8		name[64];
+};
+
+#define LIVEUPDATE_IOCTL_RETRIEVE_SESSION \
+	_IO(LIVEUPDATE_IOCTL_TYPE, LIVEUPDATE_CMD_RETRIEVE_SESSION)
 #endif /* _UAPI_LIVEUPDATE_H */
diff --git a/kernel/liveupdate/luo_internal.h b/kernel/liveupdate/luo_internal.h
index 9223f71844ca..a14e0b685ccb 100644
--- a/kernel/liveupdate/luo_internal.h
+++ b/kernel/liveupdate/luo_internal.h
@@ -17,6 +17,26 @@
  */
 #define luo_restore_fail(__fmt, ...) panic(__fmt, ##__VA_ARGS__)
 
+struct luo_ucmd {
+	void __user *ubuffer;
+	u32 user_size;
+	void *cmd;
+};
+
+static inline int luo_ucmd_respond(struct luo_ucmd *ucmd,
+				   size_t kernel_cmd_size)
+{
+	/*
+	 * Copy the minimum of what the user provided and what we actually
+	 * have.
+	 */
+	if (copy_to_user(ucmd->ubuffer, ucmd->cmd,
+			 min_t(size_t, ucmd->user_size, kernel_cmd_size))) {
+		return -EFAULT;
+	}
+	return 0;
+}
+
 int luo_cancel(void);
 int luo_prepare(void);
 int luo_freeze(void);
diff --git a/kernel/liveupdate/luo_ioctl.c b/kernel/liveupdate/luo_ioctl.c
index fc2afb450ad5..01ccb8a6d3f4 100644
--- a/kernel/liveupdate/luo_ioctl.c
+++ b/kernel/liveupdate/luo_ioctl.c
@@ -5,6 +5,25 @@
  * Pasha Tatashin <pasha.tatashin@soleen.com>
  */
 
+/**
+ * DOC: LUO ioctl Interface
+ *
+ * The IOCTL user-space control interface for the LUO subsystem.
+ * It registers a character device, typically found at ``/dev/liveupdate``,
+ * which allows a userspace agent to manage the LUO state machine and its
+ * associated resources, such as preservable file descriptors.
+ *
+ * To ensure that the state machine is controlled by a single entity, access
+ * to this device is exclusive: only one process is permitted to have
+ * ``/dev/liveupdate`` open at any given time. Subsequent open attempts will
+ * fail with -EBUSY until the first process closes its file descriptor.
+ * This singleton model simplifies state management by preventing conflicting
+ * commands from multiple userspace agents.
+ */
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include <linux/atomic.h>
 #include <linux/errno.h>
 #include <linux/file.h>
 #include <linux/fs.h>
@@ -19,10 +38,191 @@
 
 struct luo_device_state {
 	struct miscdevice miscdev;
+	atomic_t in_use;
+};
+
+static int luo_ioctl_get_state(struct luo_ucmd *ucmd)
+{
+	struct liveupdate_ioctl_get_state *argp = ucmd->cmd;
+
+	argp->state = liveupdate_get_state();
+
+	return luo_ucmd_respond(ucmd, sizeof(*argp));
+}
+
+static int luo_ioctl_set_event(struct luo_ucmd *ucmd)
+{
+	struct liveupdate_ioctl_set_event *argp = ucmd->cmd;
+	int ret;
+
+	switch (argp->event) {
+	case LIVEUPDATE_PREPARE:
+		ret = luo_prepare();
+		break;
+	case LIVEUPDATE_FINISH:
+		ret = luo_finish();
+		break;
+	case LIVEUPDATE_CANCEL:
+		ret = luo_cancel();
+		break;
+	default:
+		ret = -EOPNOTSUPP;
+	}
+
+	return ret;
+}
+
+static int luo_ioctl_create_session(struct luo_ucmd *ucmd)
+{
+	struct liveupdate_ioctl_create_session *argp = ucmd->cmd;
+	struct file *file;
+	int ret;
+
+	argp->fd = get_unused_fd_flags(O_CLOEXEC);
+	if (argp->fd < 0)
+		return argp->fd;
+
+	ret = luo_session_create(argp->name, &file);
+	if (ret)
+		return ret;
+
+	ret = luo_ucmd_respond(ucmd, sizeof(*argp));
+	if (ret) {
+		fput(file);
+		put_unused_fd(argp->fd);
+		return ret;
+	}
+
+	fd_install(argp->fd, file);
+
+	return 0;
+}
+
+static int luo_ioctl_retrieve_session(struct luo_ucmd *ucmd)
+{
+	struct liveupdate_ioctl_retrieve_session *argp = ucmd->cmd;
+	struct file *file;
+	int ret;
+
+	argp->fd = get_unused_fd_flags(O_CLOEXEC);
+	if (argp->fd < 0)
+		return argp->fd;
+
+	ret = luo_session_retrieve(argp->name, &file);
+	if (ret < 0) {
+		put_unused_fd(argp->fd);
+
+		return ret;
+	}
+
+	ret = luo_ucmd_respond(ucmd, sizeof(*argp));
+	if (ret) {
+		fput(file);
+		put_unused_fd(argp->fd);
+		return ret;
+	}
+
+	fd_install(argp->fd, file);
+
+	return 0;
+}
+
+static int luo_open(struct inode *inodep, struct file *filep)
+{
+	struct luo_device_state *ldev = container_of(filep->private_data,
+						     struct luo_device_state,
+						     miscdev);
+
+	if (atomic_cmpxchg(&ldev->in_use, 0, 1))
+		return -EBUSY;
+
+	return 0;
+}
+
+static int luo_release(struct inode *inodep, struct file *filep)
+{
+	struct luo_device_state *ldev = container_of(filep->private_data,
+						     struct luo_device_state,
+						     miscdev);
+	atomic_set(&ldev->in_use, 0);
+
+	return 0;
+}
+
+union ucmd_buffer {
+	struct liveupdate_ioctl_create_session create;
+	struct liveupdate_ioctl_get_state state;
+	struct liveupdate_ioctl_retrieve_session retrieve;
+	struct liveupdate_ioctl_set_event event;
 };
 
+struct luo_ioctl_op {
+	unsigned int size;
+	unsigned int min_size;
+	unsigned int ioctl_num;
+	int (*execute)(struct luo_ucmd *ucmd);
+};
+
+#define IOCTL_OP(_ioctl, _fn, _struct, _last)                                  \
+	[_IOC_NR(_ioctl) - LIVEUPDATE_CMD_BASE] = {                            \
+		.size = sizeof(_struct) +                                      \
+			BUILD_BUG_ON_ZERO(sizeof(union ucmd_buffer) <          \
+					  sizeof(_struct)),                    \
+		.min_size = offsetofend(_struct, _last),                       \
+		.ioctl_num = _ioctl,                                           \
+		.execute = _fn,                                                \
+	}
+
+static const struct luo_ioctl_op luo_ioctl_ops[] = {
+	IOCTL_OP(LIVEUPDATE_IOCTL_CREATE_SESSION, luo_ioctl_create_session,
+		 struct liveupdate_ioctl_create_session, name),
+	IOCTL_OP(LIVEUPDATE_IOCTL_GET_STATE, luo_ioctl_get_state,
+		 struct liveupdate_ioctl_get_state, state),
+	IOCTL_OP(LIVEUPDATE_IOCTL_RETRIEVE_SESSION, luo_ioctl_retrieve_session,
+		 struct liveupdate_ioctl_retrieve_session, name),
+	IOCTL_OP(LIVEUPDATE_IOCTL_SET_EVENT, luo_ioctl_set_event,
+		 struct liveupdate_ioctl_set_event, event),
+};
+
+static long luo_ioctl(struct file *filep, unsigned int cmd, unsigned long arg)
+{
+	const struct luo_ioctl_op *op;
+	struct luo_ucmd ucmd = {};
+	union ucmd_buffer buf;
+	unsigned int nr;
+	int ret;
+
+	nr = _IOC_NR(cmd);
+	if (nr < LIVEUPDATE_CMD_BASE ||
+	    (nr - LIVEUPDATE_CMD_BASE) >= ARRAY_SIZE(luo_ioctl_ops)) {
+		return -EINVAL;
+	}
+
+	ucmd.ubuffer = (void __user *)arg;
+	ret = get_user(ucmd.user_size, (u32 __user *)ucmd.ubuffer);
+	if (ret)
+		return ret;
+
+	op = &luo_ioctl_ops[nr - LIVEUPDATE_CMD_BASE];
+	if (op->ioctl_num != cmd)
+		return -ENOIOCTLCMD;
+	if (ucmd.user_size < op->min_size)
+		return -EINVAL;
+
+	ucmd.cmd = &buf;
+	ret = copy_struct_from_user(ucmd.cmd, op->size, ucmd.ubuffer,
+				    ucmd.user_size);
+	if (ret)
+		return ret;
+
+	return op->execute(&ucmd);
+}
+
 static const struct file_operations luo_fops = {
 	.owner		= THIS_MODULE,
+	.open		= luo_open,
+	.release	= luo_release,
+	.unlocked_ioctl	= luo_ioctl,
 };
 
 static struct luo_device_state luo_dev = {
@@ -31,6 +231,7 @@ static struct luo_device_state luo_dev = {
 		.name  = "liveupdate",
 		.fops  = &luo_fops,
 	},
+	.in_use = ATOMIC_INIT(0),
 };
 
 static int __init liveupdate_init(void)
-- 
2.51.0.536.g15c5d4f767-goog


^ permalink raw reply related

* [PATCH v4 10/30] liveupdate: luo_subsystems: implement subsystem callbacks
From: Pasha Tatashin @ 2025-09-29  1:03 UTC (permalink / raw)
  To: pratyush, jasonmiu, graf, changyuanl, pasha.tatashin, rppt,
	dmatlack, rientjes, corbet, rdunlap, ilpo.jarvinen, kanie, ojeda,
	aliceryhl, masahiroy, akpm, tj, yoann.congal, mmaurer,
	roman.gushchin, chenridong, axboe, mark.rutland, jannh,
	vincent.guittot, hannes, dan.j.williams, david, joel.granados,
	rostedt, anna.schumaker, song, zhangguopeng, linux, linux-kernel,
	linux-doc, linux-mm, gregkh, tglx, mingo, bp, dave.hansen, x86,
	hpa, rafael, dakr, bartosz.golaszewski, cw00.choi, myungjoo.ham,
	yesanishhere, Jonathan.Cameron, quic_zijuhu, aleksander.lobakin,
	ira.weiny, andriy.shevchenko, leon, lukas, bhelgaas, wagi,
	djeffery, stuart.w.hayes, ptyadav, lennart, brauner, linux-api,
	linux-fsdevel, saeedm, ajayachandra, jgg, parav, leonro, witu,
	hughd, skhawaja, chrisl, steven.sistare
In-Reply-To: <20250929010321.3462457-1-pasha.tatashin@soleen.com>

Implement the core logic within luo_subsystems.c to handle the
invocation of registered subsystem callbacks and manage the persistence
of their state via the LUO FDT. This replaces the stub implementations
from the previous patch.

This completes the core mechanism enabling subsystems to actively
participate in the LUO state machine, execute phase-specific logic, and
persist/restore a u64 state across the live update transition
using the FDT.

Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
---
 kernel/liveupdate/luo_subsystems.c | 167 ++++++++++++++++++++++++++++-
 1 file changed, 164 insertions(+), 3 deletions(-)

diff --git a/kernel/liveupdate/luo_subsystems.c b/kernel/liveupdate/luo_subsystems.c
index 69f00d5c000e..ebb7c0db08f3 100644
--- a/kernel/liveupdate/luo_subsystems.c
+++ b/kernel/liveupdate/luo_subsystems.c
@@ -101,8 +101,81 @@ void __init luo_subsystems_startup(void *fdt)
 	luo_fdt_in = fdt;
 }
 
+static void __luo_do_subsystems_cancel_calls(struct liveupdate_subsystem *boundary_subsystem)
+{
+	struct liveupdate_subsystem *subsystem;
+
+	list_for_each_entry(subsystem, &luo_subsystems_list, list) {
+		if (subsystem == boundary_subsystem)
+			break;
+
+		if (subsystem->ops->cancel) {
+			subsystem->ops->cancel(subsystem,
+					       subsystem->private_data);
+		}
+		subsystem->private_data = 0;
+	}
+}
+
+static void luo_subsystems_retrieve_data_from_fdt(void)
+{
+	struct liveupdate_subsystem *subsystem;
+	int node_offset, prop_len;
+	const void *prop;
+
+	if (!luo_fdt_in)
+		return;
+
+	node_offset = fdt_subnode_offset(luo_fdt_in, 0,
+					 LUO_SUBSYSTEMS_NODE_NAME);
+	list_for_each_entry(subsystem, &luo_subsystems_list, list) {
+		prop = fdt_getprop(luo_fdt_in, node_offset,
+				   subsystem->name, &prop_len);
+
+		if (!prop || prop_len != sizeof(u64)) {
+			luo_restore_fail("In FDT node '/%s' can't find property '%s': %s\n",
+					 LUO_SUBSYSTEMS_NODE_NAME,
+					 subsystem->name,
+					 fdt_strerror(node_offset));
+		}
+		memcpy(&subsystem->private_data, prop, sizeof(u64));
+	}
+}
+
+static int luo_subsystems_commit_data_to_fdt(void)
+{
+	struct liveupdate_subsystem *subsystem;
+	int ret, node_offset;
+
+	node_offset = fdt_subnode_offset(luo_fdt_out, 0,
+					 LUO_SUBSYSTEMS_NODE_NAME);
+	list_for_each_entry(subsystem, &luo_subsystems_list, list) {
+		ret = fdt_setprop(luo_fdt_out, node_offset, subsystem->name,
+				  &subsystem->private_data, sizeof(u64));
+		if (ret < 0) {
+			pr_err("Failed to set FDT property for subsystem '%s' %s\n",
+			       subsystem->name, fdt_strerror(ret));
+			return -ENOENT;
+		}
+	}
+
+	return 0;
+}
+
 static int luo_get_subsystem_data(struct liveupdate_subsystem *h, u64 *data)
 {
+	int node_offset, prop_len;
+	const void *prop;
+
+	node_offset = fdt_subnode_offset(luo_fdt_in, 0,
+					 LUO_SUBSYSTEMS_NODE_NAME);
+	prop = fdt_getprop(luo_fdt_in, node_offset, h->name, &prop_len);
+	if (!prop || prop_len != sizeof(u64)) {
+		luo_state_read_exit();
+		return -ENOENT;
+	}
+	memcpy(data, prop, sizeof(u64));
+
 	return 0;
 }
 
@@ -121,7 +194,30 @@ static int luo_get_subsystem_data(struct liveupdate_subsystem *h, u64 *data)
  */
 int luo_do_subsystems_prepare_calls(void)
 {
-	return 0;
+	struct liveupdate_subsystem *subsystem;
+	int ret;
+
+	guard(mutex)(&luo_subsystem_list_mutex);
+	list_for_each_entry(subsystem, &luo_subsystems_list, list) {
+		if (!subsystem->ops->prepare)
+			continue;
+
+		ret = subsystem->ops->prepare(subsystem,
+					      &subsystem->private_data);
+		if (ret < 0) {
+			pr_err("Subsystem '%s' prepare callback failed [%d]\n",
+			       subsystem->name, ret);
+			__luo_do_subsystems_cancel_calls(subsystem);
+
+			return ret;
+		}
+	}
+
+	ret = luo_subsystems_commit_data_to_fdt();
+	if (ret)
+		__luo_do_subsystems_cancel_calls(NULL);
+
+	return ret;
 }
 
 /**
@@ -139,7 +235,30 @@ int luo_do_subsystems_prepare_calls(void)
  */
 int luo_do_subsystems_freeze_calls(void)
 {
-	return 0;
+	struct liveupdate_subsystem *subsystem;
+	int ret;
+
+	guard(mutex)(&luo_subsystem_list_mutex);
+	list_for_each_entry(subsystem, &luo_subsystems_list, list) {
+		if (!subsystem->ops->freeze)
+			continue;
+
+		ret = subsystem->ops->freeze(subsystem,
+					     &subsystem->private_data);
+		if (ret < 0) {
+			pr_err("Subsystem '%s' freeze callback failed [%d]\n",
+			       subsystem->name, ret);
+			__luo_do_subsystems_cancel_calls(subsystem);
+
+			return ret;
+		}
+	}
+
+	ret = luo_subsystems_commit_data_to_fdt();
+	if (ret)
+		__luo_do_subsystems_cancel_calls(NULL);
+
+	return ret;
 }
 
 /**
@@ -150,6 +269,18 @@ int luo_do_subsystems_freeze_calls(void)
  */
 void luo_do_subsystems_finish_calls(void)
 {
+	struct liveupdate_subsystem *subsystem;
+
+	guard(mutex)(&luo_subsystem_list_mutex);
+	luo_subsystems_retrieve_data_from_fdt();
+
+	list_for_each_entry(subsystem, &luo_subsystems_list, list) {
+		if (subsystem->ops->finish) {
+			subsystem->ops->finish(subsystem,
+					       subsystem->private_data);
+		}
+		subsystem->private_data = 0;
+	}
 }
 
 /**
@@ -163,6 +294,9 @@ void luo_do_subsystems_finish_calls(void)
  */
 void luo_do_subsystems_cancel_calls(void)
 {
+	guard(mutex)(&luo_subsystem_list_mutex);
+	__luo_do_subsystems_cancel_calls(NULL);
+	luo_subsystems_commit_data_to_fdt();
 }
 
 /**
@@ -285,7 +419,34 @@ int liveupdate_unregister_subsystem(struct liveupdate_subsystem *h)
 	return ret;
 }
 
+/**
+ * liveupdate_get_subsystem_data - Retrieve raw private data for a subsystem
+ * from FDT.
+ * @h:      Pointer to the liveupdate_subsystem structure representing the
+ * subsystem instance. The 'name' field is used to find the property.
+ * @data:   Output pointer where the subsystem's raw private u64 data will be
+ * stored via memcpy.
+ *
+ * Reads the 8-byte data property associated with the subsystem @h->name
+ * directly from the '/subsystems' node within the globally accessible
+ * 'luo_fdt_in' blob. Returns appropriate error codes if inputs are invalid, or
+ * nodes/properties are missing or invalid.
+ *
+ * Return:  0 on success. -ENOENT on error.
+ */
 int liveupdate_get_subsystem_data(struct liveupdate_subsystem *h, u64 *data)
 {
-	return 0;
+	int ret;
+
+	luo_state_read_enter();
+	if (WARN_ON_ONCE(!luo_fdt_in || !liveupdate_state_updated())) {
+		luo_state_read_exit();
+		return -ENOENT;
+	}
+
+	scoped_guard(mutex, &luo_subsystem_list_mutex)
+		ret = luo_get_subsystem_data(h, data);
+	luo_state_read_exit();
+
+	return ret;
 }
-- 
2.51.0.536.g15c5d4f767-goog


^ permalink raw reply related

* [PATCH v4 09/30] liveupdate: luo_subsystems: add subsystem registration
From: Pasha Tatashin @ 2025-09-29  1:03 UTC (permalink / raw)
  To: pratyush, jasonmiu, graf, changyuanl, pasha.tatashin, rppt,
	dmatlack, rientjes, corbet, rdunlap, ilpo.jarvinen, kanie, ojeda,
	aliceryhl, masahiroy, akpm, tj, yoann.congal, mmaurer,
	roman.gushchin, chenridong, axboe, mark.rutland, jannh,
	vincent.guittot, hannes, dan.j.williams, david, joel.granados,
	rostedt, anna.schumaker, song, zhangguopeng, linux, linux-kernel,
	linux-doc, linux-mm, gregkh, tglx, mingo, bp, dave.hansen, x86,
	hpa, rafael, dakr, bartosz.golaszewski, cw00.choi, myungjoo.ham,
	yesanishhere, Jonathan.Cameron, quic_zijuhu, aleksander.lobakin,
	ira.weiny, andriy.shevchenko, leon, lukas, bhelgaas, wagi,
	djeffery, stuart.w.hayes, ptyadav, lennart, brauner, linux-api,
	linux-fsdevel, saeedm, ajayachandra, jgg, parav, leonro, witu,
	hughd, skhawaja, chrisl, steven.sistare
In-Reply-To: <20250929010321.3462457-1-pasha.tatashin@soleen.com>

Introduce the framework for kernel subsystems (e.g., KVM, IOMMU, device
drivers) to register with LUO and participate in the live update process
via callbacks.

Subsystem Registration:
- Defines struct liveupdate_subsystem in linux/liveupdate.h,
  which subsystems use to provide their name and optional callbacks
  (prepare, freeze, cancel, finish). The callbacks accept
  a u64 *data intended for passing state/handles.
- Exports liveupdate_register_subsystem() and
  liveupdate_unregister_subsystem() API functions.
- Adds drivers/misc/liveupdate/luo_subsystems.c to manage a list
  of registered subsystems.
  Registration/unregistration is restricted to
  specific LUO states (NORMAL/UPDATED).

Callback Framework:
- The main luo_core.c state transition functions
  now delegate to new luo_do_subsystems_*_calls() functions
  defined in luo_subsystems.c.
- These new functions are intended to iterate through the registered
  subsystems and invoke their corresponding callbacks.

FDT Integration:
- Adds a /subsystems subnode within the main LUO FDT created in
  luo_core.c. This node has its own compatibility string
  (subsystems-v1).
- luo_subsystems_fdt_setup() populates this node by adding a
  property for each registered subsystem, using the subsystem's
  name.
  Currently, these properties are initialized with a placeholder
  u64 value (0).
- luo_subsystems_startup() is called from luo_core.c on boot to
  find and validate the /subsystems node in the FDT received via
  KHO.
- Adds a stub API function liveupdate_get_subsystem_data() intended
  for subsystems to retrieve their persisted u64 data from the FDT
      in the new kernel.

Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
---
 include/linux/liveupdate.h         |  66 +++++++
 kernel/liveupdate/Makefile         |   3 +-
 kernel/liveupdate/luo_core.c       |  19 +-
 kernel/liveupdate/luo_internal.h   |   7 +
 kernel/liveupdate/luo_subsystems.c | 291 +++++++++++++++++++++++++++++
 5 files changed, 383 insertions(+), 3 deletions(-)
 create mode 100644 kernel/liveupdate/luo_subsystems.c

diff --git a/include/linux/liveupdate.h b/include/linux/liveupdate.h
index 85a6828c95b0..4c378a986cfe 100644
--- a/include/linux/liveupdate.h
+++ b/include/linux/liveupdate.h
@@ -12,6 +12,52 @@
 #include <linux/list.h>
 #include <uapi/linux/liveupdate.h>
 
+struct liveupdate_subsystem;
+
+/**
+ * struct liveupdate_subsystem_ops - LUO events callback functions
+ * @prepare:      Optional. Called during LUO prepare phase. Should perform
+ *                preparatory actions and can store a u64 handle/state
+ *                via the 'data' pointer for use in later callbacks.
+ *                Return 0 on success, negative error code on failure.
+ * @freeze:       Optional. Called during LUO freeze event (before actual jump
+ *                to new kernel). Should perform final state saving actions and
+ *                can update the u64 handle/state via the 'data' pointer. Retur:
+ *                0 on success, negative error code on failure.
+ * @cancel:       Optional. Called if the live update process is canceled after
+ *                prepare (or freeze) was called. Receives the u64 data
+ *                set by prepare/freeze. Used for cleanup.
+ * @boot:         Optional. Call durng boot post live update. This callback is
+ *                done when subsystem register during live update.
+ * @finish:       Optional. Called after the live update is finished in the new
+ *                kernel.
+ *                Receives the u64 data set by prepare/freeze. Used for cleanup.
+ * @owner:        Module reference
+ */
+struct liveupdate_subsystem_ops {
+	int (*prepare)(struct liveupdate_subsystem *handle, u64 *data);
+	int (*freeze)(struct liveupdate_subsystem *handle, u64 *data);
+	void (*cancel)(struct liveupdate_subsystem *handle, u64 data);
+	void (*boot)(struct liveupdate_subsystem *handle, u64 data);
+	void (*finish)(struct liveupdate_subsystem *handle, u64 data);
+	struct module *owner;
+};
+
+/**
+ * struct liveupdate_subsystem - Represents a subsystem participating in LUO
+ * @ops:          Callback functions
+ * @name:         Unique name identifying the subsystem.
+ * @list:         List head used internally by LUO. Should not be modified by
+ *                caller after registration.
+ * @private_data: For LUO internal use, cached value of data field.
+ */
+struct liveupdate_subsystem {
+	const struct liveupdate_subsystem_ops *ops;
+	const char *name;
+	struct list_head list;
+	u64 private_data;
+};
+
 #ifdef CONFIG_LIVEUPDATE
 
 /* Return true if live update orchestrator is enabled */
@@ -33,6 +79,10 @@ bool liveupdate_state_normal(void);
 
 enum liveupdate_state liveupdate_get_state(void);
 
+int liveupdate_register_subsystem(struct liveupdate_subsystem *h);
+int liveupdate_unregister_subsystem(struct liveupdate_subsystem *h);
+int liveupdate_get_subsystem_data(struct liveupdate_subsystem *h, u64 *data);
+
 #else /* CONFIG_LIVEUPDATE */
 
 static inline int liveupdate_reboot(void)
@@ -60,5 +110,21 @@ static inline enum liveupdate_state liveupdate_get_state(void)
 	return LIVEUPDATE_STATE_NORMAL;
 }
 
+static inline int liveupdate_register_subsystem(struct liveupdate_subsystem *h)
+{
+	return 0;
+}
+
+static inline int liveupdate_unregister_subsystem(struct liveupdate_subsystem *h)
+{
+	return 0;
+}
+
+static inline int liveupdate_get_subsystem_data(struct liveupdate_subsystem *h,
+						u64 *data)
+{
+	return -ENODATA;
+}
+
 #endif /* CONFIG_LIVEUPDATE */
 #endif /* _LINUX_LIVEUPDATE_H */
diff --git a/kernel/liveupdate/Makefile b/kernel/liveupdate/Makefile
index d90cc3b4bf7b..2881bab0c6df 100644
--- a/kernel/liveupdate/Makefile
+++ b/kernel/liveupdate/Makefile
@@ -2,7 +2,8 @@
 
 luo-y :=								\
 		luo_core.o						\
-		luo_ioctl.o
+		luo_ioctl.o						\
+		luo_subsystems.o
 
 obj-$(CONFIG_KEXEC_HANDOVER)		+= kexec_handover.o
 obj-$(CONFIG_KEXEC_HANDOVER_DEBUG)	+= kexec_handover_debug.o
diff --git a/kernel/liveupdate/luo_core.c b/kernel/liveupdate/luo_core.c
index 10796481447a..92edaeaaad3e 100644
--- a/kernel/liveupdate/luo_core.c
+++ b/kernel/liveupdate/luo_core.c
@@ -130,6 +130,10 @@ static int luo_fdt_setup(void)
 	if (ret)
 		goto exit_free;
 
+	ret = luo_subsystems_fdt_setup(fdt_out);
+	if (ret)
+		goto exit_free;
+
 	ret = kho_add_subtree(LUO_KHO_ENTRY_NAME, fdt_out);
 	if (ret)
 		goto exit_free;
@@ -153,20 +157,30 @@ static void luo_fdt_destroy(void)
 
 static int luo_do_prepare_calls(void)
 {
-	return 0;
+	int ret;
+
+	ret = luo_do_subsystems_prepare_calls();
+
+	return ret;
 }
 
 static int luo_do_freeze_calls(void)
 {
-	return 0;
+	int ret;
+
+	ret = luo_do_subsystems_freeze_calls();
+
+	return ret;
 }
 
 static void luo_do_finish_calls(void)
 {
+	luo_do_subsystems_finish_calls();
 }
 
 static void luo_do_cancel_calls(void)
 {
+	luo_do_subsystems_cancel_calls();
 }
 
 static int __luo_prepare(void)
@@ -408,6 +422,7 @@ static int __init luo_startup(void)
 	}
 
 	__luo_set_state(LIVEUPDATE_STATE_UPDATED);
+	luo_subsystems_startup(luo_fdt_in);
 
 	return 0;
 }
diff --git a/kernel/liveupdate/luo_internal.h b/kernel/liveupdate/luo_internal.h
index c98842caa4a0..c62fbbb0790c 100644
--- a/kernel/liveupdate/luo_internal.h
+++ b/kernel/liveupdate/luo_internal.h
@@ -32,4 +32,11 @@ void *luo_contig_alloc_preserve(size_t size);
 void luo_contig_free_unpreserve(void *mem, size_t size);
 void luo_contig_free_restore(void *mem, size_t size);
 
+void luo_subsystems_startup(void *fdt);
+int luo_subsystems_fdt_setup(void *fdt);
+int luo_do_subsystems_prepare_calls(void);
+int luo_do_subsystems_freeze_calls(void);
+void luo_do_subsystems_finish_calls(void);
+void luo_do_subsystems_cancel_calls(void);
+
 #endif /* _LINUX_LUO_INTERNAL_H */
diff --git a/kernel/liveupdate/luo_subsystems.c b/kernel/liveupdate/luo_subsystems.c
new file mode 100644
index 000000000000..69f00d5c000e
--- /dev/null
+++ b/kernel/liveupdate/luo_subsystems.c
@@ -0,0 +1,291 @@
+// SPDX-License-Identifier: GPL-2.0
+
+/*
+ * Copyright (c) 2025, Google LLC.
+ * Pasha Tatashin <pasha.tatashin@soleen.com>
+ */
+
+/**
+ * DOC: LUO Subsystems support
+ *
+ * Various kernel subsystems register with the Live Update Orchestrator to
+ * participate in the live update process. These subsystems are notified at
+ * different stages of the live update sequence, allowing them to serialize
+ * device state before the reboot and restore it afterwards. Examples include
+ * the device layer, interrupt controllers, KVM, IOMMU, and specific device
+ * drivers.
+ */
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include <linux/err.h>
+#include <linux/libfdt.h>
+#include <linux/liveupdate.h>
+#include <linux/module.h>
+#include <linux/mutex.h>
+#include <linux/string.h>
+#include "luo_internal.h"
+
+#define LUO_SUBSYSTEMS_NODE_NAME	"subsystems"
+#define LUO_SUBSYSTEMS_COMPATIBLE	"subsystems-v1"
+
+static DEFINE_MUTEX(luo_subsystem_list_mutex);
+static LIST_HEAD(luo_subsystems_list);
+static void *luo_fdt_out;
+static void *luo_fdt_in;
+
+/**
+ * luo_subsystems_fdt_setup - Adds and populates the 'subsystems' node in the
+ * FDT.
+ * @fdt: Pointer to the LUO FDT blob.
+ *
+ * Add subsystems node and each subsystem to the LUO FDT blob.
+ *
+ * Returns: 0 on success, negative errno on failure.
+ */
+int luo_subsystems_fdt_setup(void *fdt)
+{
+	struct liveupdate_subsystem *subsystem;
+	const u64 zero_data = 0;
+	int ret, node_offset;
+
+	guard(mutex)(&luo_subsystem_list_mutex);
+	ret = fdt_add_subnode(fdt, 0, LUO_SUBSYSTEMS_NODE_NAME);
+	if (ret < 0)
+		goto exit_error;
+
+	node_offset = ret;
+	ret = fdt_setprop_string(fdt, node_offset, "compatible",
+				 LUO_SUBSYSTEMS_COMPATIBLE);
+	if (ret < 0)
+		goto exit_error;
+
+	list_for_each_entry(subsystem, &luo_subsystems_list, list) {
+		ret = fdt_setprop(fdt, node_offset, subsystem->name,
+				  &zero_data, sizeof(zero_data));
+		if (ret < 0)
+			goto exit_error;
+	}
+
+	luo_fdt_out = fdt;
+	return 0;
+exit_error:
+	pr_err("Failed to setup 'subsystems' node to FDT: %s\n",
+	       fdt_strerror(ret));
+	return -ENOSPC;
+}
+
+/**
+ * luo_subsystems_startup - Validates the LUO subsystems FDT node at startup.
+ * @fdt: Pointer to the LUO FDT blob passed from the previous kernel.
+ *
+ * This __init function checks the existence and validity of the '/subsystems'
+ * node in the FDT. This node is considered mandatory.
+ */
+void __init luo_subsystems_startup(void *fdt)
+{
+	int ret, node_offset;
+
+	guard(mutex)(&luo_subsystem_list_mutex);
+	node_offset = fdt_subnode_offset(fdt, 0, LUO_SUBSYSTEMS_NODE_NAME);
+	if (node_offset < 0)
+		luo_restore_fail("Failed to find /subsystems node\n");
+
+	ret = fdt_node_check_compatible(fdt, node_offset,
+					LUO_SUBSYSTEMS_COMPATIBLE);
+	if (ret) {
+		luo_restore_fail("FDT '%s' is incompatible with '%s' [%d]\n",
+				 LUO_SUBSYSTEMS_NODE_NAME,
+				 LUO_SUBSYSTEMS_COMPATIBLE, ret);
+	}
+	luo_fdt_in = fdt;
+}
+
+static int luo_get_subsystem_data(struct liveupdate_subsystem *h, u64 *data)
+{
+	return 0;
+}
+
+/**
+ * luo_do_subsystems_prepare_calls - Calls prepare callbacks and updates FDT
+ * if all prepares succeed. Handles cancellation on failure.
+ *
+ * Phase 1: Calls 'prepare' for all subsystems and stores results temporarily.
+ * If any 'prepare' fails, calls 'cancel' on previously prepared subsystems
+ * and returns the error.
+ * Phase 2: If all 'prepare' calls succeeded, writes the stored data to the FDT.
+ * If any FDT write fails, calls 'cancel' on *all* prepared subsystems and
+ * returns the FDT error.
+ *
+ * Returns: 0 on success. Negative errno on failure.
+ */
+int luo_do_subsystems_prepare_calls(void)
+{
+	return 0;
+}
+
+/**
+ * luo_do_subsystems_freeze_calls - Calls freeze callbacks and updates FDT
+ * if all freezes succeed. Handles cancellation on failure.
+ *
+ * Phase 1: Calls 'freeze' for all subsystems and stores results temporarily.
+ * If any 'freeze' fails, calls 'cancel' on previously called subsystems
+ * and returns the error.
+ * Phase 2: If all 'freeze' calls succeeded, writes the stored data to the FDT.
+ * If any FDT write fails, calls 'cancel' on *all* subsystems and
+ * returns the FDT error.
+ *
+ * Returns: 0 on success. Negative errno on failure.
+ */
+int luo_do_subsystems_freeze_calls(void)
+{
+	return 0;
+}
+
+/**
+ * luo_do_subsystems_finish_calls- Calls finish callbacks for all subsystems.
+ *
+ * This function is called at the end of live update cycle to do the final
+ * clean-up or housekeeping of the post-live update states.
+ */
+void luo_do_subsystems_finish_calls(void)
+{
+}
+
+/**
+ * luo_do_subsystems_cancel_calls - Calls cancel callbacks for all subsystems.
+ *
+ * This function is typically called when the live update process needs to be
+ * aborted externally, for example, after the prepare phase may have run but
+ * before actual reboot. It iterates through all registered subsystems and calls
+ * the 'cancel' callback for those that implement it and likely completed
+ * prepare.
+ */
+void luo_do_subsystems_cancel_calls(void)
+{
+}
+
+/**
+ * liveupdate_register_subsystem - Register a kernel subsystem handler with LUO
+ * @h: Pointer to the liveupdate_subsystem structure allocated and populated
+ * by the calling subsystem.
+ *
+ * Registers a subsystem handler that provides callbacks for different events
+ * of the live update cycle. Registration is typically done during the
+ * subsystem's module init or core initialization.
+ *
+ * Can only be called when LUO is in the NORMAL or UPDATED states.
+ * The provided name (@h->name) must be unique among registered subsystems.
+ *
+ * Return: 0 on success, negative error code otherwise.
+ */
+int liveupdate_register_subsystem(struct liveupdate_subsystem *h)
+{
+	struct liveupdate_subsystem *iter;
+	int ret = 0;
+
+	luo_state_read_enter();
+	if (!liveupdate_state_normal() && !liveupdate_state_updated()) {
+		luo_state_read_exit();
+		return -EBUSY;
+	}
+
+	guard(mutex)(&luo_subsystem_list_mutex);
+	list_for_each_entry(iter, &luo_subsystems_list, list) {
+		if (iter == h) {
+			pr_warn("Subsystem '%s' (%p) already registered.\n",
+				h->name, h);
+			ret = -EEXIST;
+			goto out_unlock;
+		}
+
+		if (!strcmp(iter->name, h->name)) {
+			pr_err("Subsystem with name '%s' already registered.\n",
+			       h->name);
+			ret = -EEXIST;
+			goto out_unlock;
+		}
+	}
+
+	if (!try_module_get(h->ops->owner)) {
+		pr_warn("Subsystem '%s' unable to get reference.\n", h->name);
+		ret = -EAGAIN;
+		goto out_unlock;
+	}
+
+	INIT_LIST_HEAD(&h->list);
+	list_add_tail(&h->list, &luo_subsystems_list);
+
+out_unlock:
+	/*
+	 * If we are booting during live update, and subsystem provided a boot
+	 * callback, do it now, since we know that subsystem has already
+	 * initialized.
+	 */
+	if (!ret && liveupdate_state_updated() && h->ops->boot) {
+		u64 data;
+
+		ret = luo_get_subsystem_data(h, &data);
+		if (!WARN_ON_ONCE(ret))
+			h->ops->boot(h, data);
+	}
+
+	luo_state_read_exit();
+
+	return ret;
+}
+
+/**
+ * liveupdate_unregister_subsystem - Unregister a kernel subsystem handler from
+ * LUO
+ * @h: Pointer to the same liveupdate_subsystem structure that was used during
+ * registration.
+ *
+ * Unregisters a previously registered subsystem handler. Typically called
+ * during module exit or subsystem teardown. LUO removes the structure from its
+ * internal list; the caller is responsible for any necessary memory cleanup
+ * of the structure itself.
+ *
+ * Return: 0 on success, negative error code otherwise.
+ * -EINVAL if h is NULL.
+ * -ENOENT if the specified handler @h is not found in the registration list.
+ * -EBUSY if LUO is not in the NORMAL state.
+ */
+int liveupdate_unregister_subsystem(struct liveupdate_subsystem *h)
+{
+	struct liveupdate_subsystem *iter;
+	bool found = false;
+	int ret = 0;
+
+	luo_state_read_enter();
+	if (!liveupdate_state_normal() && !liveupdate_state_updated()) {
+		luo_state_read_exit();
+		return -EBUSY;
+	}
+
+	guard(mutex)(&luo_subsystem_list_mutex);
+	list_for_each_entry(iter, &luo_subsystems_list, list) {
+		if (iter == h) {
+			found = true;
+			break;
+		}
+	}
+
+	if (found) {
+		list_del_init(&h->list);
+	} else {
+		pr_warn("Subsystem handler '%s' not found for unregistration.\n",
+			h->name);
+		ret = -ENOENT;
+	}
+
+	module_put(h->ops->owner);
+	luo_state_read_exit();
+
+	return ret;
+}
+
+int liveupdate_get_subsystem_data(struct liveupdate_subsystem *h, u64 *data)
+{
+	return 0;
+}
-- 
2.51.0.536.g15c5d4f767-goog


^ permalink raw reply related

* [PATCH v4 08/30] liveupdate: luo_core: integrate with KHO
From: Pasha Tatashin @ 2025-09-29  1:02 UTC (permalink / raw)
  To: pratyush, jasonmiu, graf, changyuanl, pasha.tatashin, rppt,
	dmatlack, rientjes, corbet, rdunlap, ilpo.jarvinen, kanie, ojeda,
	aliceryhl, masahiroy, akpm, tj, yoann.congal, mmaurer,
	roman.gushchin, chenridong, axboe, mark.rutland, jannh,
	vincent.guittot, hannes, dan.j.williams, david, joel.granados,
	rostedt, anna.schumaker, song, zhangguopeng, linux, linux-kernel,
	linux-doc, linux-mm, gregkh, tglx, mingo, bp, dave.hansen, x86,
	hpa, rafael, dakr, bartosz.golaszewski, cw00.choi, myungjoo.ham,
	yesanishhere, Jonathan.Cameron, quic_zijuhu, aleksander.lobakin,
	ira.weiny, andriy.shevchenko, leon, lukas, bhelgaas, wagi,
	djeffery, stuart.w.hayes, ptyadav, lennart, brauner, linux-api,
	linux-fsdevel, saeedm, ajayachandra, jgg, parav, leonro, witu,
	hughd, skhawaja, chrisl, steven.sistare
In-Reply-To: <20250929010321.3462457-1-pasha.tatashin@soleen.com>

Integrate the LUO with the KHO framework to enable passing LUO state
across a kexec reboot.

When LUO is transitioned to a "prepared" state, it tells KHO to
finalize, so all memory segments that were added to KHO preservation
list are getting preserved. After "Prepared" state no new segments
can be preserved. If LUO is canceled, it also tells KHO to cancel the
serialization, and therefore, later LUO can go back into the prepared
state.

This patch introduces the following changes:
- During the KHO finalization phase allocate FDT blob.
- Populate this FDT with a LUO compatibility string ("luo-v1").

LUO now depends on `CONFIG_KEXEC_HANDOVER`. The core state transition
logic (`luo_do_*_calls`) remains unimplemented in this patch.

Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
---
 kernel/liveupdate/luo_core.c     | 282 ++++++++++++++++++++++++++++++-
 kernel/liveupdate/luo_internal.h |  13 ++
 2 files changed, 292 insertions(+), 3 deletions(-)

diff --git a/kernel/liveupdate/luo_core.c b/kernel/liveupdate/luo_core.c
index 954d533bd8c4..10796481447a 100644
--- a/kernel/liveupdate/luo_core.c
+++ b/kernel/liveupdate/luo_core.c
@@ -47,9 +47,13 @@
 #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
 
 #include <linux/err.h>
+#include <linux/kexec_handover.h>
 #include <linux/kobject.h>
+#include <linux/libfdt.h>
 #include <linux/liveupdate.h>
+#include <linux/mm.h>
 #include <linux/rwsem.h>
+#include <linux/sizes.h>
 #include <linux/string.h>
 #include "luo_internal.h"
 
@@ -67,6 +71,21 @@ static const char *const luo_state_str[] = {
 
 static bool luo_enabled;
 
+static void *luo_fdt_out;
+static void *luo_fdt_in;
+
+/*
+ * The LUO FDT size depends on the number of participating subsystems,
+ *
+ * The current fixed size (4K) is large enough to handle reasonable number of
+ * preserved entities. If this size ever becomes insufficient, it can either be
+ * increased, or a dynamic size calculation mechanism could be implemented in
+ * the future.
+ */
+#define LUO_FDT_SIZE		PAGE_SIZE
+#define LUO_KHO_ENTRY_NAME	"LUO"
+#define LUO_COMPATIBLE		"luo-v1"
+
 static int __init early_liveupdate_param(char *buf)
 {
 	return kstrtobool(buf, &luo_enabled);
@@ -91,6 +110,52 @@ static inline void luo_set_state(enum liveupdate_state state)
 	__luo_set_state(state);
 }
 
+/* Called during the prepare phase, to create LUO fdt tree */
+static int luo_fdt_setup(void)
+{
+	void *fdt_out;
+	int ret;
+
+	fdt_out = luo_contig_alloc_preserve(LUO_FDT_SIZE);
+	if (IS_ERR(fdt_out)) {
+		pr_err("failed to allocate/preserve FDT memory\n");
+		return PTR_ERR(fdt_out);
+	}
+
+	ret = fdt_create_empty_tree(fdt_out, LUO_FDT_SIZE);
+	if (ret)
+		goto exit_free;
+
+	ret = fdt_setprop_string(fdt_out, 0, "compatible", LUO_COMPATIBLE);
+	if (ret)
+		goto exit_free;
+
+	ret = kho_add_subtree(LUO_KHO_ENTRY_NAME, fdt_out);
+	if (ret)
+		goto exit_free;
+	luo_fdt_out = fdt_out;
+
+	return 0;
+
+exit_free:
+	luo_contig_free_unpreserve(fdt_out, LUO_FDT_SIZE);
+	pr_err("failed to prepare LUO FDT: %d\n", ret);
+
+	return ret;
+}
+
+static void luo_fdt_destroy(void)
+{
+	kho_remove_subtree(luo_fdt_out);
+	luo_contig_free_unpreserve(luo_fdt_out, LUO_FDT_SIZE);
+	luo_fdt_out = NULL;
+}
+
+static int luo_do_prepare_calls(void)
+{
+	return 0;
+}
+
 static int luo_do_freeze_calls(void)
 {
 	return 0;
@@ -100,6 +165,71 @@ static void luo_do_finish_calls(void)
 {
 }
 
+static void luo_do_cancel_calls(void)
+{
+}
+
+static int __luo_prepare(void)
+{
+	int ret;
+
+	if (down_write_killable(&luo_state_rwsem)) {
+		pr_warn("[prepare] event canceled by user\n");
+		return -EAGAIN;
+	}
+
+	if (!is_current_luo_state(LIVEUPDATE_STATE_NORMAL)) {
+		pr_warn("Can't switch to [%s] from [%s] state\n",
+			luo_state_str[LIVEUPDATE_STATE_PREPARED],
+			luo_current_state_str());
+		ret = -EINVAL;
+		goto exit_unlock;
+	}
+
+	ret = luo_fdt_setup();
+	if (ret)
+		goto exit_unlock;
+
+	ret = luo_do_prepare_calls();
+	if (ret) {
+		luo_fdt_destroy();
+		goto exit_unlock;
+	}
+
+	luo_set_state(LIVEUPDATE_STATE_PREPARED);
+
+exit_unlock:
+	up_write(&luo_state_rwsem);
+
+	return ret;
+}
+
+static int __luo_cancel(void)
+{
+	if (down_write_killable(&luo_state_rwsem)) {
+		pr_warn("[cancel] event canceled by user\n");
+		return -EAGAIN;
+	}
+
+	if (!is_current_luo_state(LIVEUPDATE_STATE_PREPARED) &&
+	    !is_current_luo_state(LIVEUPDATE_STATE_FROZEN)) {
+		pr_warn("Can't switch to [%s] from [%s] state\n",
+			luo_state_str[LIVEUPDATE_STATE_NORMAL],
+			luo_current_state_str());
+		up_write(&luo_state_rwsem);
+
+		return -EINVAL;
+	}
+
+	luo_do_cancel_calls();
+	luo_fdt_destroy();
+	luo_set_state(LIVEUPDATE_STATE_NORMAL);
+
+	up_write(&luo_state_rwsem);
+
+	return 0;
+}
+
 /* Get the current state as a string */
 const char *luo_current_state_str(void)
 {
@@ -111,9 +241,28 @@ enum liveupdate_state liveupdate_get_state(void)
 	return READ_ONCE(luo_state);
 }
 
+/**
+ * luo_prepare - Initiate the live update preparation phase.
+ *
+ * This function is called to begin the live update process. It attempts to
+ * transition the luo to the ``LIVEUPDATE_STATE_PREPARED`` state.
+ *
+ * If the calls complete successfully, the orchestrator state is set
+ * to ``LIVEUPDATE_STATE_PREPARED``. If any  call fails a
+ * ``LIVEUPDATE_CANCEL`` is sent to roll back any actions.
+ *
+ * @return 0 on success, ``-EAGAIN`` if the state change was cancelled by the
+ * user while waiting for the lock, ``-EINVAL`` if the orchestrator is not in
+ * the normal state, or a negative error code returned by the calls.
+ */
 int luo_prepare(void)
 {
-	return 0;
+	int err = __luo_prepare();
+
+	if (err)
+		return err;
+
+	return kho_finalize();
 }
 
 /**
@@ -193,9 +342,28 @@ int luo_finish(void)
 	return 0;
 }
 
+/**
+ * luo_cancel - Cancel the ongoing live update from prepared or frozen states.
+ *
+ * This function is called to abort a live update that is currently in the
+ * ``LIVEUPDATE_STATE_PREPARED`` state.
+ *
+ * If the state is correct, it triggers the ``LIVEUPDATE_CANCEL`` notifier chain
+ * to allow subsystems to undo any actions performed during the prepare or
+ * freeze events. Finally, the orchestrator state is transitioned back to
+ * ``LIVEUPDATE_STATE_NORMAL``.
+ *
+ * @return 0 on success, or ``-EAGAIN`` if the state change was cancelled by the
+ * user while waiting for the lock.
+ */
 int luo_cancel(void)
 {
-	return 0;
+	int err =  kho_abort();
+
+	if (err)
+		return err;
+
+	return __luo_cancel();
 }
 
 void luo_state_read_enter(void)
@@ -210,7 +378,36 @@ void luo_state_read_exit(void)
 
 static int __init luo_startup(void)
 {
-	__luo_set_state(LIVEUPDATE_STATE_NORMAL);
+	phys_addr_t fdt_phys;
+	int ret;
+
+	if (!kho_is_enabled()) {
+		if (luo_enabled)
+			pr_warn("Disabling liveupdate because KHO is disabled\n");
+		luo_enabled = false;
+		return 0;
+	}
+
+	/* Retrieve LUO subtree, and verify its format. */
+	ret = kho_retrieve_subtree(LUO_KHO_ENTRY_NAME, &fdt_phys);
+	if (ret) {
+		if (ret != -ENOENT) {
+			luo_restore_fail("failed to retrieve FDT '%s' from KHO: %d\n",
+					 LUO_KHO_ENTRY_NAME, ret);
+		}
+		__luo_set_state(LIVEUPDATE_STATE_NORMAL);
+
+		return 0;
+	}
+
+	luo_fdt_in = __va(fdt_phys);
+	ret = fdt_node_check_compatible(luo_fdt_in, 0, LUO_COMPATIBLE);
+	if (ret) {
+		luo_restore_fail("FDT '%s' is incompatible with '%s' [%d]\n",
+				 LUO_KHO_ENTRY_NAME, LUO_COMPATIBLE, ret);
+	}
+
+	__luo_set_state(LIVEUPDATE_STATE_UPDATED);
 
 	return 0;
 }
@@ -295,3 +492,82 @@ bool liveupdate_enabled(void)
 {
 	return luo_enabled;
 }
+
+/**
+ * luo_contig_alloc_preserve - Allocate, zero, and preserve contiguous memory.
+ * @size: The number of bytes to allocate.
+ *
+ * Allocates a physically contiguous block of zeroed pages that is large
+ * enough to hold @size bytes. The allocated memory is then registered with
+ * KHO for preservation across a kexec.
+ *
+ * Note: The actual allocated size will be rounded up to the nearest
+ * power-of-two page boundary.
+ *
+ * @return A virtual pointer to the allocated and preserved memory on success,
+ * or an ERR_PTR() encoded error on failure.
+ */
+void *luo_contig_alloc_preserve(size_t size)
+{
+	int order, ret;
+	void *mem;
+
+	if (!size)
+		return ERR_PTR(-EINVAL);
+
+	order = get_order(size);
+	if (order > MAX_PAGE_ORDER)
+		return ERR_PTR(-E2BIG);
+
+	mem = (void *)__get_free_pages(GFP_KERNEL | __GFP_ZERO, order);
+	if (!mem)
+		return ERR_PTR(-ENOMEM);
+
+	ret = kho_preserve_pages(virt_to_page(mem), 1 << order);
+	if (ret) {
+		free_pages((unsigned long)mem, order);
+		return ERR_PTR(ret);
+	}
+
+	return mem;
+}
+
+/**
+ * luo_contig_free_unpreserve - Unpreserve and free contiguous memory.
+ * @mem:  Pointer to the memory allocated by luo_contig_alloc_preserve().
+ * @size: The original size requested during allocation. This is used to
+ *        recalculate the correct order for freeing the pages.
+ *
+ * Unregisters the memory from KHO preservation and frees the underlying
+ * pages back to the system. This function should be called to clean up
+ * memory allocated with luo_contig_alloc_preserve().
+ */
+void luo_contig_free_unpreserve(void *mem, size_t size)
+{
+	unsigned int order;
+
+	if (!mem || !size)
+		return;
+
+	order = get_order(size);
+	if (WARN_ON_ONCE(order > MAX_PAGE_ORDER))
+		return;
+
+	WARN_ON_ONCE(kho_unpreserve_pages(virt_to_page(mem), 1 << order));
+	free_pages((unsigned long)mem, order);
+}
+
+void luo_contig_free_restore(void *mem, size_t size)
+{
+	unsigned int order;
+
+	if (!mem || !size)
+		return;
+
+	order = get_order(size);
+	if (WARN_ON_ONCE(order > MAX_PAGE_ORDER))
+		return;
+
+	WARN_ON_ONCE(!kho_restore_pages(__pa(mem), 1 << order));
+	free_pages((unsigned long)mem, order);
+}
diff --git a/kernel/liveupdate/luo_internal.h b/kernel/liveupdate/luo_internal.h
index 2e0861781673..c98842caa4a0 100644
--- a/kernel/liveupdate/luo_internal.h
+++ b/kernel/liveupdate/luo_internal.h
@@ -8,6 +8,15 @@
 #ifndef _LINUX_LUO_INTERNAL_H
 #define _LINUX_LUO_INTERNAL_H
 
+/*
+ * Handles a deserialization failure: devices and memory is in unpredictable
+ * state.
+ *
+ * Continuing the boot process after a failure is dangerous because it could
+ * lead to leaks of private data.
+ */
+#define luo_restore_fail(__fmt, ...) panic(__fmt, ##__VA_ARGS__)
+
 int luo_cancel(void);
 int luo_prepare(void);
 int luo_freeze(void);
@@ -19,4 +28,8 @@ extern struct rw_semaphore luo_state_rwsem;
 
 const char *luo_current_state_str(void);
 
+void *luo_contig_alloc_preserve(size_t size);
+void luo_contig_free_unpreserve(void *mem, size_t size);
+void luo_contig_free_restore(void *mem, size_t size);
+
 #endif /* _LINUX_LUO_INTERNAL_H */
-- 
2.51.0.536.g15c5d4f767-goog


^ permalink raw reply related

* [PATCH v4 07/30] liveupdate: luo_core: luo_ioctl: Live Update Orchestrator
From: Pasha Tatashin @ 2025-09-29  1:02 UTC (permalink / raw)
  To: pratyush, jasonmiu, graf, changyuanl, pasha.tatashin, rppt,
	dmatlack, rientjes, corbet, rdunlap, ilpo.jarvinen, kanie, ojeda,
	aliceryhl, masahiroy, akpm, tj, yoann.congal, mmaurer,
	roman.gushchin, chenridong, axboe, mark.rutland, jannh,
	vincent.guittot, hannes, dan.j.williams, david, joel.granados,
	rostedt, anna.schumaker, song, zhangguopeng, linux, linux-kernel,
	linux-doc, linux-mm, gregkh, tglx, mingo, bp, dave.hansen, x86,
	hpa, rafael, dakr, bartosz.golaszewski, cw00.choi, myungjoo.ham,
	yesanishhere, Jonathan.Cameron, quic_zijuhu, aleksander.lobakin,
	ira.weiny, andriy.shevchenko, leon, lukas, bhelgaas, wagi,
	djeffery, stuart.w.hayes, ptyadav, lennart, brauner, linux-api,
	linux-fsdevel, saeedm, ajayachandra, jgg, parav, leonro, witu,
	hughd, skhawaja, chrisl, steven.sistare
In-Reply-To: <20250929010321.3462457-1-pasha.tatashin@soleen.com>

Introduce LUO, a mechanism intended to facilitate kernel updates while
keeping designated devices operational across the transition (e.g., via
kexec). The primary use case is updating hypervisors with minimal
disruption to running virtual machines. For userspace side of hypervisor
update we have copyless migration. LUO is for updating the kernel.

This initial patch lays the groundwork for the LUO subsystem.

Further functionality, including the implementation of state transition
logic, integration with KHO, and hooks for subsystems and file
descriptors, will be added in subsequent patches.

Create a character device at /dev/liveupdate.

A new uAPI header, <uapi/linux/liveupdate.h>, will define the necessary
structures. The magic number for IOCTL is registered in
Documentation/userspace-api/ioctl/ioctl-number.rst.

Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
---
 .../userspace-api/ioctl/ioctl-number.rst      |   2 +
 include/linux/liveupdate.h                    |  64 ++++
 include/uapi/linux/liveupdate.h               |  94 ++++++
 kernel/liveupdate/Kconfig                     |  27 ++
 kernel/liveupdate/Makefile                    |   6 +
 kernel/liveupdate/luo_core.c                  | 297 ++++++++++++++++++
 kernel/liveupdate/luo_internal.h              |  22 ++
 kernel/liveupdate/luo_ioctl.c                 |  54 ++++
 8 files changed, 566 insertions(+)
 create mode 100644 include/linux/liveupdate.h
 create mode 100644 include/uapi/linux/liveupdate.h
 create mode 100644 kernel/liveupdate/luo_core.c
 create mode 100644 kernel/liveupdate/luo_internal.h
 create mode 100644 kernel/liveupdate/luo_ioctl.c

diff --git a/Documentation/userspace-api/ioctl/ioctl-number.rst b/Documentation/userspace-api/ioctl/ioctl-number.rst
index 7c527a01d1cf..7232b3544cec 100644
--- a/Documentation/userspace-api/ioctl/ioctl-number.rst
+++ b/Documentation/userspace-api/ioctl/ioctl-number.rst
@@ -385,6 +385,8 @@ Code  Seq#    Include File                                             Comments
 0xB8  01-02  uapi/misc/mrvl_cn10k_dpi.h                                Marvell CN10K DPI driver
 0xB8  all    uapi/linux/mshv.h                                         Microsoft Hyper-V /dev/mshv driver
                                                                        <mailto:linux-hyperv@vger.kernel.org>
+0xBA  00-0F  uapi/linux/liveupdate.h                                   Pasha Tatashin
+                                                                       <mailto:pasha.tatashin@soleen.com>
 0xC0  00-0F  linux/usb/iowarrior.h
 0xCA  00-0F  uapi/misc/cxl.h                                           Dead since 6.15
 0xCA  10-2F  uapi/misc/ocxl.h
diff --git a/include/linux/liveupdate.h b/include/linux/liveupdate.h
new file mode 100644
index 000000000000..85a6828c95b0
--- /dev/null
+++ b/include/linux/liveupdate.h
@@ -0,0 +1,64 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+/*
+ * Copyright (c) 2025, Google LLC.
+ * Pasha Tatashin <pasha.tatashin@soleen.com>
+ */
+#ifndef _LINUX_LIVEUPDATE_H
+#define _LINUX_LIVEUPDATE_H
+
+#include <linux/bug.h>
+#include <linux/types.h>
+#include <linux/list.h>
+#include <uapi/linux/liveupdate.h>
+
+#ifdef CONFIG_LIVEUPDATE
+
+/* Return true if live update orchestrator is enabled */
+bool liveupdate_enabled(void);
+
+/* Called during reboot to tell participants to complete serialization */
+int liveupdate_reboot(void);
+
+/*
+ * Return true if machine is in updated state (i.e. live update boot in
+ * progress)
+ */
+bool liveupdate_state_updated(void);
+
+/*
+ * Return true if machine is in normal state (i.e. no live update in progress).
+ */
+bool liveupdate_state_normal(void);
+
+enum liveupdate_state liveupdate_get_state(void);
+
+#else /* CONFIG_LIVEUPDATE */
+
+static inline int liveupdate_reboot(void)
+{
+	return 0;
+}
+
+static inline bool liveupdate_enabled(void)
+{
+	return false;
+}
+
+static inline bool liveupdate_state_updated(void)
+{
+	return false;
+}
+
+static inline bool liveupdate_state_normal(void)
+{
+	return true;
+}
+
+static inline enum liveupdate_state liveupdate_get_state(void)
+{
+	return LIVEUPDATE_STATE_NORMAL;
+}
+
+#endif /* CONFIG_LIVEUPDATE */
+#endif /* _LINUX_LIVEUPDATE_H */
diff --git a/include/uapi/linux/liveupdate.h b/include/uapi/linux/liveupdate.h
new file mode 100644
index 000000000000..3cb09b2c4353
--- /dev/null
+++ b/include/uapi/linux/liveupdate.h
@@ -0,0 +1,94 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+
+/*
+ * Userspace interface for /dev/liveupdate
+ * Live Update Orchestrator
+ *
+ * Copyright (c) 2025, Google LLC.
+ * Pasha Tatashin <pasha.tatashin@soleen.com>
+ */
+
+#ifndef _UAPI_LIVEUPDATE_H
+#define _UAPI_LIVEUPDATE_H
+
+#include <linux/ioctl.h>
+#include <linux/types.h>
+
+/**
+ * enum liveupdate_state - Defines the possible states of the live update
+ * orchestrator.
+ * @LIVEUPDATE_STATE_UNDEFINED:      State has not yet been initialized.
+ * @LIVEUPDATE_STATE_NORMAL:         Default state, no live update in progress.
+ * @LIVEUPDATE_STATE_PREPARED:       Live update is prepared for reboot; the
+ *                                   LIVEUPDATE_PREPARE callbacks have completed
+ *                                   successfully.
+ *                                   Devices might operate in a limited state
+ *                                   for example the participating devices might
+ *                                   not be allowed to unbind, and also the
+ *                                   setting up of new DMA mappings might be
+ *                                   disabled in this state.
+ * @LIVEUPDATE_STATE_FROZEN:         The final reboot event
+ *                                   (%LIVEUPDATE_FREEZE) has been sent, and the
+ *                                   system is performing its final state saving
+ *                                   within the "blackout window". User
+ *                                   workloads must be suspended. The actual
+ *                                   reboot (kexec) into the next kernel is
+ *                                   imminent.
+ * @LIVEUPDATE_STATE_UPDATED:        The system has rebooted into the next
+ *                                   kernel via live update the system is now
+ *                                   running the next kernel, awaiting the
+ *                                   finish event.
+ *
+ * These states track the progress and outcome of a live update operation.
+ */
+enum liveupdate_state  {
+	LIVEUPDATE_STATE_UNDEFINED = 0,
+	LIVEUPDATE_STATE_NORMAL = 1,
+	LIVEUPDATE_STATE_PREPARED = 2,
+	LIVEUPDATE_STATE_FROZEN = 3,
+	LIVEUPDATE_STATE_UPDATED = 4,
+};
+
+/**
+ * enum liveupdate_event - Events that trigger live update callbacks.
+ * @LIVEUPDATE_PREPARE: PREPARE should happen *before* the blackout window.
+ *                      Subsystems should prepare for an upcoming reboot by
+ *                      serializing their states. However, it must be considered
+ *                      that user applications, e.g. virtual machines are still
+ *                      running during this phase.
+ * @LIVEUPDATE_FREEZE:  FREEZE sent from the reboot() syscall, when the current
+ *                      kernel is on its way out. This is the final opportunity
+ *                      for subsystems to save any state that must persist
+ *                      across the reboot. Callbacks for this event should be as
+ *                      fast as possible since they are on the critical path of
+ *                      rebooting into the next kernel.
+ * @LIVEUPDATE_FINISH:  FINISH is sent in the newly booted kernel after a
+ *                      successful live update and normally *after* the blackout
+ *                      window. Subsystems should perform any final cleanup
+ *                      during this phase. This phase also provides an
+ *                      opportunity to clean up devices that were preserved but
+ *                      never explicitly reclaimed during the live update
+ *                      process. State restoration should have already occurred
+ *                      before this event. Callbacks for this event must not
+ *                      fail. The completion of this call transitions the
+ *                      machine from ``updated`` to ``normal`` state.
+ * @LIVEUPDATE_CANCEL:  CANCEL the live update and go back to normal state. This
+ *                      event is user initiated, or is done automatically when
+ *                      LIVEUPDATE_PREPARE or LIVEUPDATE_FREEZE stage fails.
+ *                      Subsystems should revert any actions taken during the
+ *                      corresponding prepare event. Callbacks for this event
+ *                      must not fail.
+ *
+ * These events represent the different stages and actions within the live
+ * update process that subsystems (like device drivers and bus drivers)
+ * need to be aware of to correctly serialize and restore their state.
+ *
+ */
+enum liveupdate_event {
+	LIVEUPDATE_PREPARE = 0,
+	LIVEUPDATE_FREEZE = 1,
+	LIVEUPDATE_FINISH = 2,
+	LIVEUPDATE_CANCEL = 3,
+};
+
+#endif /* _UAPI_LIVEUPDATE_H */
diff --git a/kernel/liveupdate/Kconfig b/kernel/liveupdate/Kconfig
index eebe564b385d..f6b0bde188d9 100644
--- a/kernel/liveupdate/Kconfig
+++ b/kernel/liveupdate/Kconfig
@@ -1,7 +1,34 @@
 # SPDX-License-Identifier: GPL-2.0-only
+#
+# Copyright (c) 2025, Google LLC.
+# Pasha Tatashin <pasha.tatashin@soleen.com>
+#
+# Live Update Orchestrator
+#
 
 menu "Live Update"
 
+config LIVEUPDATE
+	bool "Live Update Orchestrator"
+	depends on KEXEC_HANDOVER
+	help
+	  Enable the Live Update Orchestrator. Live Update is a mechanism,
+	  typically based on kexec, that allows the kernel to be updated
+	  while keeping selected devices operational across the transition.
+	  These devices are intended to be reclaimed by the new kernel and
+	  re-attached to their original workload without requiring a device
+	  reset.
+
+	  Ability to handover a device from current to the next kernel depends
+	  on specific support within device drivers and related kernel
+	  subsystems.
+
+	  This feature primarily targets virtual machine hosts to quickly update
+	  the kernel hypervisor with minimal disruption to the running virtual
+	  machines.
+
+	  If unsure, say N.
+
 config KEXEC_HANDOVER
 	bool "kexec handover"
 	depends on ARCH_SUPPORTS_KEXEC_HANDOVER && ARCH_SUPPORTS_KEXEC_FILE
diff --git a/kernel/liveupdate/Makefile b/kernel/liveupdate/Makefile
index 67c7f71b33fa..d90cc3b4bf7b 100644
--- a/kernel/liveupdate/Makefile
+++ b/kernel/liveupdate/Makefile
@@ -1,4 +1,10 @@
 # SPDX-License-Identifier: GPL-2.0
 
+luo-y :=								\
+		luo_core.o						\
+		luo_ioctl.o
+
 obj-$(CONFIG_KEXEC_HANDOVER)		+= kexec_handover.o
 obj-$(CONFIG_KEXEC_HANDOVER_DEBUG)	+= kexec_handover_debug.o
+
+obj-$(CONFIG_LIVEUPDATE)		+= luo.o
diff --git a/kernel/liveupdate/luo_core.c b/kernel/liveupdate/luo_core.c
new file mode 100644
index 000000000000..954d533bd8c4
--- /dev/null
+++ b/kernel/liveupdate/luo_core.c
@@ -0,0 +1,297 @@
+// SPDX-License-Identifier: GPL-2.0
+
+/*
+ * Copyright (c) 2025, Google LLC.
+ * Pasha Tatashin <pasha.tatashin@soleen.com>
+ */
+
+/**
+ * DOC: Live Update Orchestrator (LUO)
+ *
+ * Live Update is a specialized, kexec-based reboot process that allows a
+ * running kernel to be updated from one version to another while preserving
+ * the state of selected resources and keeping designated hardware devices
+ * operational. For these devices, DMA activity may continue throughout the
+ * kernel transition.
+ *
+ * While the primary use case driving this work is supporting live updates of
+ * the Linux kernel when it is used as a hypervisor in cloud environments, the
+ * LUO framework itself is designed to be workload-agnostic. Much like Kernel
+ * Live Patching, which applies security fixes regardless of the workload,
+ * Live Update facilitates a full kernel version upgrade for any type of system.
+ *
+ * For example, a non-hypervisor system running an in-memory cache like
+ * memcached with many gigabytes of data can use LUO. The userspace service
+ * can place its cache into a memfd, have its state preserved by LUO, and
+ * restore it immediately after the kernel kexec.
+ *
+ * Whether the system is running virtual machines, containers, a
+ * high-performance database, or networking services, LUO's primary goal is to
+ * enable a full kernel update by preserving critical userspace state and
+ * keeping essential devices operational.
+ *
+ * The core of LUO is a state machine that tracks the progress of a live update,
+ * along with a callback API that allows other kernel subsystems to participate
+ * in the process. Example subsystems that can hook into LUO include: kvm,
+ * iommu, interrupts, vfio, participating filesystems, and memory management.
+ *
+ * LUO uses Kexec Handover to transfer memory state from the current kernel to
+ * the next kernel. For more details see
+ * Documentation/core-api/kho/concepts.rst.
+ *
+ * The LUO state machine ensures that operations are performed in the correct
+ * sequence and provides a mechanism to track and recover from potential
+ * failures.
+ */
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include <linux/err.h>
+#include <linux/kobject.h>
+#include <linux/liveupdate.h>
+#include <linux/rwsem.h>
+#include <linux/string.h>
+#include "luo_internal.h"
+
+DECLARE_RWSEM(luo_state_rwsem);
+
+static enum liveupdate_state luo_state = LIVEUPDATE_STATE_UNDEFINED;
+
+static const char *const luo_state_str[] = {
+	[LIVEUPDATE_STATE_UNDEFINED]	= "undefined",
+	[LIVEUPDATE_STATE_NORMAL]	= "normal",
+	[LIVEUPDATE_STATE_PREPARED]	= "prepared",
+	[LIVEUPDATE_STATE_FROZEN]	= "frozen",
+	[LIVEUPDATE_STATE_UPDATED]	= "updated",
+};
+
+static bool luo_enabled;
+
+static int __init early_liveupdate_param(char *buf)
+{
+	return kstrtobool(buf, &luo_enabled);
+}
+early_param("liveupdate", early_liveupdate_param);
+
+/* Return true if the current state is equal to the provided state */
+static inline bool is_current_luo_state(enum liveupdate_state expected_state)
+{
+	return liveupdate_get_state() == expected_state;
+}
+
+static void __luo_set_state(enum liveupdate_state state)
+{
+	WRITE_ONCE(luo_state, state);
+}
+
+static inline void luo_set_state(enum liveupdate_state state)
+{
+	pr_info("Switched from [%s] to [%s] state\n",
+		luo_current_state_str(), luo_state_str[state]);
+	__luo_set_state(state);
+}
+
+static int luo_do_freeze_calls(void)
+{
+	return 0;
+}
+
+static void luo_do_finish_calls(void)
+{
+}
+
+/* Get the current state as a string */
+const char *luo_current_state_str(void)
+{
+	return luo_state_str[liveupdate_get_state()];
+}
+
+enum liveupdate_state liveupdate_get_state(void)
+{
+	return READ_ONCE(luo_state);
+}
+
+int luo_prepare(void)
+{
+	return 0;
+}
+
+/**
+ * luo_freeze() - Initiate the final freeze notification phase for live update.
+ *
+ * Attempts to transition the live update orchestrator state from
+ * %LIVEUPDATE_STATE_PREPARED to %LIVEUPDATE_STATE_FROZEN. This function is
+ * typically called just before the actual reboot system call (e.g., kexec)
+ * is invoked, either directly by the orchestration tool or potentially from
+ * within the reboot syscall path itself.
+ *
+ * @return  0: Success. Negative error otherwise. State is reverted to
+ * %LIVEUPDATE_STATE_NORMAL in case of an error during callbacks, and everything
+ * is canceled via cancel notifcation.
+ */
+int luo_freeze(void)
+{
+	int ret;
+
+	if (down_write_killable(&luo_state_rwsem)) {
+		pr_warn("[freeze] event canceled by user\n");
+		return -EAGAIN;
+	}
+
+	if (!is_current_luo_state(LIVEUPDATE_STATE_PREPARED)) {
+		pr_warn("Can't switch to [%s] from [%s] state\n",
+			luo_state_str[LIVEUPDATE_STATE_FROZEN],
+			luo_current_state_str());
+		up_write(&luo_state_rwsem);
+
+		return -EINVAL;
+	}
+
+	ret = luo_do_freeze_calls();
+	if (!ret)
+		luo_set_state(LIVEUPDATE_STATE_FROZEN);
+	else
+		luo_set_state(LIVEUPDATE_STATE_NORMAL);
+
+	up_write(&luo_state_rwsem);
+
+	return ret;
+}
+
+/**
+ * luo_finish - Finalize the live update process in the new kernel.
+ *
+ * This function is called  after a successful live update reboot into a new
+ * kernel, once the new kernel is ready to transition to the normal operational
+ * state. It signals the completion of the live update sequence to subsystems.
+ *
+ * @return 0 on success, ``-EAGAIN`` if the state change was cancelled by the
+ * user while waiting for the lock, or ``-EINVAL`` if the orchestrator is not in
+ * the updated state.
+ */
+int luo_finish(void)
+{
+	if (down_write_killable(&luo_state_rwsem)) {
+		pr_warn("[finish] event canceled by user\n");
+		return -EAGAIN;
+	}
+
+	if (!is_current_luo_state(LIVEUPDATE_STATE_UPDATED)) {
+		pr_warn("Can't switch to [%s] from [%s] state\n",
+			luo_state_str[LIVEUPDATE_STATE_NORMAL],
+			luo_current_state_str());
+		up_write(&luo_state_rwsem);
+
+		return -EINVAL;
+	}
+
+	luo_do_finish_calls();
+	luo_set_state(LIVEUPDATE_STATE_NORMAL);
+
+	up_write(&luo_state_rwsem);
+
+	return 0;
+}
+
+int luo_cancel(void)
+{
+	return 0;
+}
+
+void luo_state_read_enter(void)
+{
+	down_read(&luo_state_rwsem);
+}
+
+void luo_state_read_exit(void)
+{
+	up_read(&luo_state_rwsem);
+}
+
+static int __init luo_startup(void)
+{
+	__luo_set_state(LIVEUPDATE_STATE_NORMAL);
+
+	return 0;
+}
+early_initcall(luo_startup);
+
+/* Public Functions */
+
+/**
+ * liveupdate_reboot() - Kernel reboot notifier for live update final
+ * serialization.
+ *
+ * This function is invoked directly from the reboot() syscall pathway if a
+ * reboot is initiated while the live update state is %LIVEUPDATE_STATE_PREPARED
+ * (i.e., if the user did not explicitly trigger the frozen state). It handles
+ * the implicit transition into the final frozen state.
+ *
+ * It triggers the %LIVEUPDATE_REBOOT event callbacks for participating
+ * subsystems. These callbacks must perform final state saving very quickly as
+ * they execute during the blackout period just before kexec.
+ *
+ * If any %LIVEUPDATE_FREEZE callback fails, this function triggers the
+ * %LIVEUPDATE_CANCEL event for all participants to revert their state, aborts
+ * the live update, and returns an error.
+ */
+int liveupdate_reboot(void)
+{
+	if (!is_current_luo_state(LIVEUPDATE_STATE_PREPARED))
+		return 0;
+
+	return luo_freeze();
+}
+
+/**
+ * liveupdate_state_updated - Check if the system is in the live update
+ * 'updated' state.
+ *
+ * This function checks if the live update orchestrator is in the
+ * ``LIVEUPDATE_STATE_UPDATED`` state. This state indicates that the system has
+ * successfully rebooted into a new kernel as part of a live update, and the
+ * preserved devices are expected to be in the process of being reclaimed.
+ *
+ * This is typically used by subsystems during early boot of the new kernel
+ * to determine if they need to attempt to restore state from a previous
+ * live update.
+ *
+ * @return true if the system is in the ``LIVEUPDATE_STATE_UPDATED`` state,
+ * false otherwise.
+ */
+bool liveupdate_state_updated(void)
+{
+	return is_current_luo_state(LIVEUPDATE_STATE_UPDATED);
+}
+
+/**
+ * liveupdate_state_normal - Check if the system is in the live update 'normal'
+ * state.
+ *
+ * This function checks if the live update orchestrator is in the
+ * ``LIVEUPDATE_STATE_NORMAL`` state. This state indicates that no live update
+ * is in progress. It represents the default operational state of the system.
+ *
+ * This can be used to gate actions that should only be performed when no
+ * live update activity is occurring.
+ *
+ * @return true if the system is in the ``LIVEUPDATE_STATE_NORMAL`` state,
+ * false otherwise.
+ */
+bool liveupdate_state_normal(void)
+{
+	return is_current_luo_state(LIVEUPDATE_STATE_NORMAL);
+}
+
+/**
+ * liveupdate_enabled - Check if the live update feature is enabled.
+ *
+ * This function returns the state of the live update feature flag, which
+ * can be controlled via the ``liveupdate`` kernel command-line parameter.
+ *
+ * @return true if live update is enabled, false otherwise.
+ */
+bool liveupdate_enabled(void)
+{
+	return luo_enabled;
+}
diff --git a/kernel/liveupdate/luo_internal.h b/kernel/liveupdate/luo_internal.h
new file mode 100644
index 000000000000..2e0861781673
--- /dev/null
+++ b/kernel/liveupdate/luo_internal.h
@@ -0,0 +1,22 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+/*
+ * Copyright (c) 2025, Google LLC.
+ * Pasha Tatashin <pasha.tatashin@soleen.com>
+ */
+
+#ifndef _LINUX_LUO_INTERNAL_H
+#define _LINUX_LUO_INTERNAL_H
+
+int luo_cancel(void);
+int luo_prepare(void);
+int luo_freeze(void);
+int luo_finish(void);
+
+void luo_state_read_enter(void);
+void luo_state_read_exit(void);
+extern struct rw_semaphore luo_state_rwsem;
+
+const char *luo_current_state_str(void);
+
+#endif /* _LINUX_LUO_INTERNAL_H */
diff --git a/kernel/liveupdate/luo_ioctl.c b/kernel/liveupdate/luo_ioctl.c
new file mode 100644
index 000000000000..fc2afb450ad5
--- /dev/null
+++ b/kernel/liveupdate/luo_ioctl.c
@@ -0,0 +1,54 @@
+// SPDX-License-Identifier: GPL-2.0
+
+/*
+ * Copyright (c) 2025, Google LLC.
+ * Pasha Tatashin <pasha.tatashin@soleen.com>
+ */
+
+#include <linux/errno.h>
+#include <linux/file.h>
+#include <linux/fs.h>
+#include <linux/init.h>
+#include <linux/kernel.h>
+#include <linux/liveupdate.h>
+#include <linux/miscdevice.h>
+#include <linux/module.h>
+#include <linux/uaccess.h>
+#include <uapi/linux/liveupdate.h>
+#include "luo_internal.h"
+
+struct luo_device_state {
+	struct miscdevice miscdev;
+};
+
+static const struct file_operations luo_fops = {
+	.owner		= THIS_MODULE,
+};
+
+static struct luo_device_state luo_dev = {
+	.miscdev = {
+		.minor = MISC_DYNAMIC_MINOR,
+		.name  = "liveupdate",
+		.fops  = &luo_fops,
+	},
+};
+
+static int __init liveupdate_init(void)
+{
+	if (!liveupdate_enabled())
+		return 0;
+
+	return misc_register(&luo_dev.miscdev);
+}
+module_init(liveupdate_init);
+
+static void __exit liveupdate_exit(void)
+{
+	misc_deregister(&luo_dev.miscdev);
+}
+module_exit(liveupdate_exit);
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Pasha Tatashin");
+MODULE_DESCRIPTION("Live Update Orchestrator");
+MODULE_VERSION("0.1");
-- 
2.51.0.536.g15c5d4f767-goog


^ permalink raw reply related

* [PATCH v4 06/30] liveupdate: kho: move to kernel/liveupdate
From: Pasha Tatashin @ 2025-09-29  1:02 UTC (permalink / raw)
  To: pratyush, jasonmiu, graf, changyuanl, pasha.tatashin, rppt,
	dmatlack, rientjes, corbet, rdunlap, ilpo.jarvinen, kanie, ojeda,
	aliceryhl, masahiroy, akpm, tj, yoann.congal, mmaurer,
	roman.gushchin, chenridong, axboe, mark.rutland, jannh,
	vincent.guittot, hannes, dan.j.williams, david, joel.granados,
	rostedt, anna.schumaker, song, zhangguopeng, linux, linux-kernel,
	linux-doc, linux-mm, gregkh, tglx, mingo, bp, dave.hansen, x86,
	hpa, rafael, dakr, bartosz.golaszewski, cw00.choi, myungjoo.ham,
	yesanishhere, Jonathan.Cameron, quic_zijuhu, aleksander.lobakin,
	ira.weiny, andriy.shevchenko, leon, lukas, bhelgaas, wagi,
	djeffery, stuart.w.hayes, ptyadav, lennart, brauner, linux-api,
	linux-fsdevel, saeedm, ajayachandra, jgg, parav, leonro, witu,
	hughd, skhawaja, chrisl, steven.sistare
In-Reply-To: <20250929010321.3462457-1-pasha.tatashin@soleen.com>

Move KHO to kernel/liveupdate/ in preparation of placing all Live Update
core kernel related files to the same place.

Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
---
 Documentation/core-api/kho/concepts.rst       |  2 +-
 MAINTAINERS                                   |  2 +-
 init/Kconfig                                  |  2 ++
 kernel/Kconfig.kexec                          | 25 ----------------
 kernel/Makefile                               |  3 +-
 kernel/liveupdate/Kconfig                     | 30 +++++++++++++++++++
 kernel/liveupdate/Makefile                    |  4 +++
 kernel/{ => liveupdate}/kexec_handover.c      |  6 ++--
 .../{ => liveupdate}/kexec_handover_debug.c   |  0
 .../kexec_handover_internal.h                 |  0
 10 files changed, 42 insertions(+), 32 deletions(-)
 create mode 100644 kernel/liveupdate/Kconfig
 create mode 100644 kernel/liveupdate/Makefile
 rename kernel/{ => liveupdate}/kexec_handover.c (99%)
 rename kernel/{ => liveupdate}/kexec_handover_debug.c (100%)
 rename kernel/{ => liveupdate}/kexec_handover_internal.h (100%)

diff --git a/Documentation/core-api/kho/concepts.rst b/Documentation/core-api/kho/concepts.rst
index 36d5c05cfb30..d626d1dbd678 100644
--- a/Documentation/core-api/kho/concepts.rst
+++ b/Documentation/core-api/kho/concepts.rst
@@ -70,5 +70,5 @@ in the FDT. That state is called the KHO finalization phase.
 
 Public API
 ==========
-.. kernel-doc:: kernel/kexec_handover.c
+.. kernel-doc:: kernel/liveupdate/kexec_handover.c
    :export:
diff --git a/MAINTAINERS b/MAINTAINERS
index a6cbcc7fb396..e5c800ed4819 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -13766,7 +13766,7 @@ S:	Maintained
 F:	Documentation/admin-guide/mm/kho.rst
 F:	Documentation/core-api/kho/*
 F:	include/linux/kexec_handover.h
-F:	kernel/kexec_handover*
+F:	kernel/liveupdate/kexec_handover*
 F:	tools/testing/selftests/kho/
 
 KEYS-ENCRYPTED
diff --git a/init/Kconfig b/init/Kconfig
index d295e79d3547..9952c112154f 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -2138,6 +2138,8 @@ config TRACEPOINTS
 
 source "kernel/Kconfig.kexec"
 
+source "kernel/liveupdate/Kconfig"
+
 endmenu		# General setup
 
 source "arch/Kconfig"
diff --git a/kernel/Kconfig.kexec b/kernel/Kconfig.kexec
index e68156d8c72b..15632358bcf7 100644
--- a/kernel/Kconfig.kexec
+++ b/kernel/Kconfig.kexec
@@ -94,31 +94,6 @@ config KEXEC_JUMP
 	  Jump between original kernel and kexeced kernel and invoke
 	  code in physical address mode via KEXEC
 
-config KEXEC_HANDOVER
-	bool "kexec handover"
-	depends on ARCH_SUPPORTS_KEXEC_HANDOVER && ARCH_SUPPORTS_KEXEC_FILE
-	depends on !DEFERRED_STRUCT_PAGE_INIT
-	select MEMBLOCK_KHO_SCRATCH
-	select KEXEC_FILE
-	select DEBUG_FS
-	select LIBFDT
-	select CMA
-	help
-	  Allow kexec to hand over state across kernels by generating and
-	  passing additional metadata to the target kernel. This is useful
-	  to keep data or state alive across the kexec. For this to work,
-	  both source and target kernels need to have this option enabled.
-
-config KEXEC_HANDOVER_DEBUG
-	bool "kexec handover debug interface"
-	depends on KEXEC_HANDOVER
-	depends on DEBUG_FS
-	help
-	  Allow to control kexec handover device tree via debugfs
-	  interface, i.e. finalize the state or aborting the finalization.
-	  Also, enables inspecting the KHO fdt trees with the debugfs binary
-	  blobs.
-
 config CRASH_DUMP
 	bool "kernel crash dumps"
 	default ARCH_DEFAULT_CRASH_DUMP
diff --git a/kernel/Makefile b/kernel/Makefile
index 9fe722305c9b..e83669841b8c 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -52,6 +52,7 @@ obj-y += printk/
 obj-y += irq/
 obj-y += rcu/
 obj-y += livepatch/
+obj-y += liveupdate/
 obj-y += dma/
 obj-y += entry/
 obj-y += unwind/
@@ -82,8 +83,6 @@ obj-$(CONFIG_CRASH_DUMP_KUNIT_TEST) += crash_core_test.o
 obj-$(CONFIG_KEXEC) += kexec.o
 obj-$(CONFIG_KEXEC_FILE) += kexec_file.o
 obj-$(CONFIG_KEXEC_ELF) += kexec_elf.o
-obj-$(CONFIG_KEXEC_HANDOVER) += kexec_handover.o
-obj-$(CONFIG_KEXEC_HANDOVER_DEBUG) += kexec_handover_debug.o
 obj-$(CONFIG_BACKTRACE_SELF_TEST) += backtracetest.o
 obj-$(CONFIG_COMPAT) += compat.o
 obj-$(CONFIG_CGROUPS) += cgroup/
diff --git a/kernel/liveupdate/Kconfig b/kernel/liveupdate/Kconfig
new file mode 100644
index 000000000000..eebe564b385d
--- /dev/null
+++ b/kernel/liveupdate/Kconfig
@@ -0,0 +1,30 @@
+# SPDX-License-Identifier: GPL-2.0-only
+
+menu "Live Update"
+
+config KEXEC_HANDOVER
+	bool "kexec handover"
+	depends on ARCH_SUPPORTS_KEXEC_HANDOVER && ARCH_SUPPORTS_KEXEC_FILE
+	depends on !DEFERRED_STRUCT_PAGE_INIT
+	select MEMBLOCK_KHO_SCRATCH
+	select KEXEC_FILE
+	select DEBUG_FS
+	select LIBFDT
+	select CMA
+	help
+	  Allow kexec to hand over state across kernels by generating and
+	  passing additional metadata to the target kernel. This is useful
+	  to keep data or state alive across the kexec. For this to work,
+	  both source and target kernels need to have this option enabled.
+
+config KEXEC_HANDOVER_DEBUG
+	bool "kexec handover debug interface"
+	depends on KEXEC_HANDOVER
+	depends on DEBUG_FS
+	help
+	  Allow to control kexec handover device tree via debugfs
+	  interface, i.e. finalize the state or aborting the finalization.
+	  Also, enables inspecting the KHO fdt trees with the debugfs binary
+	  blobs.
+
+endmenu
diff --git a/kernel/liveupdate/Makefile b/kernel/liveupdate/Makefile
new file mode 100644
index 000000000000..67c7f71b33fa
--- /dev/null
+++ b/kernel/liveupdate/Makefile
@@ -0,0 +1,4 @@
+# SPDX-License-Identifier: GPL-2.0
+
+obj-$(CONFIG_KEXEC_HANDOVER)		+= kexec_handover.o
+obj-$(CONFIG_KEXEC_HANDOVER_DEBUG)	+= kexec_handover_debug.o
diff --git a/kernel/kexec_handover.c b/kernel/liveupdate/kexec_handover.c
similarity index 99%
rename from kernel/kexec_handover.c
rename to kernel/liveupdate/kexec_handover.c
index 61b31cfc75f2..71e98c44cf47 100644
--- a/kernel/kexec_handover.c
+++ b/kernel/liveupdate/kexec_handover.c
@@ -24,8 +24,8 @@
  * KHO is tightly coupled with mm init and needs access to some of mm
  * internal APIs.
  */
-#include "../mm/internal.h"
-#include "kexec_internal.h"
+#include "../../mm/internal.h"
+#include "../kexec_internal.h"
 #include "kexec_handover_internal.h"
 
 #define KHO_FDT_COMPATIBLE "kho-v1"
@@ -1129,7 +1129,7 @@ static int __kho_finalize(void)
 	err |= fdt_finish_reservemap(root);
 	err |= fdt_begin_node(root, "");
 	err |= fdt_property_string(root, "compatible", KHO_FDT_COMPATIBLE);
-	/**
+	/*
 	 * Reserve the preserved-memory-map property in the root FDT, so
 	 * that all property definitions will precede subnodes created by
 	 * KHO callers.
diff --git a/kernel/kexec_handover_debug.c b/kernel/liveupdate/kexec_handover_debug.c
similarity index 100%
rename from kernel/kexec_handover_debug.c
rename to kernel/liveupdate/kexec_handover_debug.c
diff --git a/kernel/kexec_handover_internal.h b/kernel/liveupdate/kexec_handover_internal.h
similarity index 100%
rename from kernel/kexec_handover_internal.h
rename to kernel/liveupdate/kexec_handover_internal.h
-- 
2.51.0.536.g15c5d4f767-goog


^ permalink raw reply related

* [PATCH v4 05/30] kho: don't unpreserve memory during abort
From: Pasha Tatashin @ 2025-09-29  1:02 UTC (permalink / raw)
  To: pratyush, jasonmiu, graf, changyuanl, pasha.tatashin, rppt,
	dmatlack, rientjes, corbet, rdunlap, ilpo.jarvinen, kanie, ojeda,
	aliceryhl, masahiroy, akpm, tj, yoann.congal, mmaurer,
	roman.gushchin, chenridong, axboe, mark.rutland, jannh,
	vincent.guittot, hannes, dan.j.williams, david, joel.granados,
	rostedt, anna.schumaker, song, zhangguopeng, linux, linux-kernel,
	linux-doc, linux-mm, gregkh, tglx, mingo, bp, dave.hansen, x86,
	hpa, rafael, dakr, bartosz.golaszewski, cw00.choi, myungjoo.ham,
	yesanishhere, Jonathan.Cameron, quic_zijuhu, aleksander.lobakin,
	ira.weiny, andriy.shevchenko, leon, lukas, bhelgaas, wagi,
	djeffery, stuart.w.hayes, ptyadav, lennart, brauner, linux-api,
	linux-fsdevel, saeedm, ajayachandra, jgg, parav, leonro, witu,
	hughd, skhawaja, chrisl, steven.sistare
In-Reply-To: <20250929010321.3462457-1-pasha.tatashin@soleen.com>

KHO allows clients to preserve memory regions at any point before the
KHO state is finalized. The finalization process itself involves KHO
performing its own actions, such as serializing the overall
preserved memory map.

If this finalization process is aborted, the current implementation
destroys KHO's internal memory tracking structures
(`kho_out.ser.track.orders`). This behavior effectively unpreserves
all memory from KHO's perspective, regardless of whether those
preservations were made by clients before the finalization attempt
or by KHO itself during finalization.

This premature unpreservation is incorrect. An abort of the
finalization process should only undo actions taken by KHO as part of
that specific finalization attempt. Individual memory regions
preserved by clients prior to finalization should remain preserved,
as their lifecycle is managed by the clients themselves. These
clients might still need to call kho_unpreserve_folio() or
kho_unpreserve_phys() based on their own logic, even after a KHO
finalization attempt is aborted.

Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
---
 kernel/kexec_handover.c | 21 +--------------------
 1 file changed, 1 insertion(+), 20 deletions(-)

diff --git a/kernel/kexec_handover.c b/kernel/kexec_handover.c
index 26e035eb1314..61b31cfc75f2 100644
--- a/kernel/kexec_handover.c
+++ b/kernel/kexec_handover.c
@@ -1083,31 +1083,12 @@ EXPORT_SYMBOL_GPL(kho_restore_vmalloc);

 static int __kho_abort(void)
 {
-	int err = 0;
-	unsigned long order;
-	struct kho_mem_phys *physxa;
-
-	xa_for_each(&kho_out.track.orders, order, physxa) {
-		struct kho_mem_phys_bits *bits;
-		unsigned long phys;
-
-		xa_for_each(&physxa->phys_bits, phys, bits)
-			kfree(bits);
-
-		xa_destroy(&physxa->phys_bits);
-		kfree(physxa);
-	}
-	xa_destroy(&kho_out.track.orders);
-
 	if (kho_out.preserved_mem_map) {
 		kho_mem_ser_free(kho_out.preserved_mem_map);
 		kho_out.preserved_mem_map = NULL;
 	}

-	if (err)
-		pr_err("Failed to abort KHO finalization: %d\n", err);
-
-	return err;
+	return 0;
 }

 int kho_abort(void)
-- 
2.51.0.536.g15c5d4f767-goog

^ permalink raw reply related

* [PATCH v4 04/30] kho: add interfaces to unpreserve folios and page ranes
From: Pasha Tatashin @ 2025-09-29  1:02 UTC (permalink / raw)
  To: pratyush, jasonmiu, graf, changyuanl, pasha.tatashin, rppt,
	dmatlack, rientjes, corbet, rdunlap, ilpo.jarvinen, kanie, ojeda,
	aliceryhl, masahiroy, akpm, tj, yoann.congal, mmaurer,
	roman.gushchin, chenridong, axboe, mark.rutland, jannh,
	vincent.guittot, hannes, dan.j.williams, david, joel.granados,
	rostedt, anna.schumaker, song, zhangguopeng, linux, linux-kernel,
	linux-doc, linux-mm, gregkh, tglx, mingo, bp, dave.hansen, x86,
	hpa, rafael, dakr, bartosz.golaszewski, cw00.choi, myungjoo.ham,
	yesanishhere, Jonathan.Cameron, quic_zijuhu, aleksander.lobakin,
	ira.weiny, andriy.shevchenko, leon, lukas, bhelgaas, wagi,
	djeffery, stuart.w.hayes, ptyadav, lennart, brauner, linux-api,
	linux-fsdevel, saeedm, ajayachandra, jgg, parav, leonro, witu,
	hughd, skhawaja, chrisl, steven.sistare
In-Reply-To: <20250929010321.3462457-1-pasha.tatashin@soleen.com>

Allow users of KHO to cancel the previous preservation by adding the
necessary interfaces to unpreserve folio and pages.

Co-developed-by: Changyuan Lyu <changyuanl@google.com>
Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
---
 include/linux/kexec_handover.h | 12 +++++
 kernel/kexec_handover.c        | 85 ++++++++++++++++++++++++++++------
 2 files changed, 84 insertions(+), 13 deletions(-)

diff --git a/include/linux/kexec_handover.h b/include/linux/kexec_handover.h
index 2faf290803ce..4ba145713838 100644
--- a/include/linux/kexec_handover.h
+++ b/include/linux/kexec_handover.h
@@ -43,7 +43,9 @@ bool kho_is_enabled(void);
 bool is_kho_boot(void);
 
 int kho_preserve_folio(struct folio *folio);
+int kho_unpreserve_folio(struct folio *folio);
 int kho_preserve_pages(struct page *page, unsigned int nr_pages);
+int kho_unpreserve_pages(struct page *page, unsigned int nr_pages);
 int kho_preserve_vmalloc(void *ptr, struct kho_vmalloc *preservation);
 struct folio *kho_restore_folio(phys_addr_t phys);
 struct page *kho_restore_pages(phys_addr_t phys, unsigned int nr_pages);
@@ -76,11 +78,21 @@ static inline int kho_preserve_folio(struct folio *folio)
 	return -EOPNOTSUPP;
 }
 
+static inline int kho_unpreserve_folio(struct folio *folio)
+{
+	return -EOPNOTSUPP;
+}
+
 static inline int kho_preserve_pages(struct page *page, unsigned int nr_pages)
 {
 	return -EOPNOTSUPP;
 }
 
+static inline int kho_unpreserve_pages(struct page *page, unsigned int nr_pages)
+{
+	return -EOPNOTSUPP;
+}
+
 static inline int kho_preserve_vmalloc(void *ptr,
 				       struct kho_vmalloc *preservation)
 {
diff --git a/kernel/kexec_handover.c b/kernel/kexec_handover.c
index e0dc0ed565ef..26e035eb1314 100644
--- a/kernel/kexec_handover.c
+++ b/kernel/kexec_handover.c
@@ -153,26 +153,33 @@ static void *xa_load_or_alloc(struct xarray *xa, unsigned long index, size_t sz)
 	return elm;
 }
 
-static void __kho_unpreserve(struct kho_mem_track *track, unsigned long pfn,
-			     unsigned long end_pfn)
+static void __kho_unpreserve_order(struct kho_mem_track *track, unsigned long pfn,
+				   unsigned int order)
 {
 	struct kho_mem_phys_bits *bits;
 	struct kho_mem_phys *physxa;
+	const unsigned long pfn_high = pfn >> order;
 
-	while (pfn < end_pfn) {
-		const unsigned int order =
-			min(count_trailing_zeros(pfn), ilog2(end_pfn - pfn));
-		const unsigned long pfn_high = pfn >> order;
+	physxa = xa_load(&track->orders, order);
+	if (!physxa)
+		return;
+
+	bits = xa_load(&physxa->phys_bits, pfn_high / PRESERVE_BITS);
+	if (!bits)
+		return;
 
-		physxa = xa_load(&track->orders, order);
-		if (!physxa)
-			continue;
+	clear_bit(pfn_high % PRESERVE_BITS, bits->preserve);
+}
+
+static void __kho_unpreserve(struct kho_mem_track *track, unsigned long pfn,
+			     unsigned long end_pfn)
+{
+	unsigned int order;
 
-		bits = xa_load(&physxa->phys_bits, pfn_high / PRESERVE_BITS);
-		if (!bits)
-			continue;
+	while (pfn < end_pfn) {
+		order = min(count_trailing_zeros(pfn), ilog2(end_pfn - pfn));
 
-		clear_bit(pfn_high % PRESERVE_BITS, bits->preserve);
+		__kho_unpreserve_order(track, pfn, order);
 
 		pfn += 1 << order;
 	}
@@ -734,6 +741,30 @@ int kho_preserve_folio(struct folio *folio)
 }
 EXPORT_SYMBOL_GPL(kho_preserve_folio);
 
+/**
+ * kho_unpreserve_folio - unpreserve a folio.
+ * @folio: folio to unpreserve.
+ *
+ * Instructs KHO to unpreserve a folio that was preserved by
+ * kho_preserve_folio() before. The provided @folio (pfn and order)
+ * must exactly match a previously preserved folio.
+ *
+ * Return: 0 on success, error code on failure
+ */
+int kho_unpreserve_folio(struct folio *folio)
+{
+	const unsigned long pfn = folio_pfn(folio);
+	const unsigned int order = folio_order(folio);
+	struct kho_mem_track *track = &kho_out.track;
+
+	if (kho_out.finalized)
+		return -EBUSY;
+
+	__kho_unpreserve_order(track, pfn, order);
+	return 0;
+}
+EXPORT_SYMBOL_GPL(kho_unpreserve_folio);
+
 /**
  * kho_preserve_pages - preserve contiguous pages across kexec
  * @page: first page in the list.
@@ -773,6 +804,34 @@ int kho_preserve_pages(struct page *page, unsigned int nr_pages)
 }
 EXPORT_SYMBOL_GPL(kho_preserve_pages);
 
+/**
+ * kho_unpreserve_pages - unpreserve contiguous pages.
+ * @page: first page in the list.
+ * @nr_pages: number of pages.
+ *
+ * Instructs KHO to unpreserve @nr_pages contigious  pages starting from @page.
+ * This call must exactly match a granularity at which memory was originally
+ * preserved by kho_preserve_pages, call with the same @page and
+ * @nr_pages). Unpreserving arbitrary sub-ranges of larger preserved blocks is
+ * not supported.
+ *
+ * Return: 0 on success, error code on failure
+ */
+int kho_unpreserve_pages(struct page *page, unsigned int nr_pages)
+{
+	struct kho_mem_track *track = &kho_out.track;
+	const unsigned long start_pfn = page_to_pfn(page);
+	const unsigned long end_pfn = start_pfn + nr_pages;
+
+	if (kho_out.finalized)
+		return -EBUSY;
+
+	__kho_unpreserve(track, start_pfn, end_pfn);
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(kho_unpreserve_pages);
+
 struct kho_vmalloc_hdr {
 	DECLARE_KHOSER_PTR(next, struct kho_vmalloc_chunk *);
 };
-- 
2.51.0.536.g15c5d4f767-goog


^ permalink raw reply related

* [PATCH v4 03/30] kho: drop notifiers
From: Pasha Tatashin @ 2025-09-29  1:02 UTC (permalink / raw)
  To: pratyush, jasonmiu, graf, changyuanl, pasha.tatashin, rppt,
	dmatlack, rientjes, corbet, rdunlap, ilpo.jarvinen, kanie, ojeda,
	aliceryhl, masahiroy, akpm, tj, yoann.congal, mmaurer,
	roman.gushchin, chenridong, axboe, mark.rutland, jannh,
	vincent.guittot, hannes, dan.j.williams, david, joel.granados,
	rostedt, anna.schumaker, song, zhangguopeng, linux, linux-kernel,
	linux-doc, linux-mm, gregkh, tglx, mingo, bp, dave.hansen, x86,
	hpa, rafael, dakr, bartosz.golaszewski, cw00.choi, myungjoo.ham,
	yesanishhere, Jonathan.Cameron, quic_zijuhu, aleksander.lobakin,
	ira.weiny, andriy.shevchenko, leon, lukas, bhelgaas, wagi,
	djeffery, stuart.w.hayes, ptyadav, lennart, brauner, linux-api,
	linux-fsdevel, saeedm, ajayachandra, jgg, parav, leonro, witu,
	hughd, skhawaja, chrisl, steven.sistare
In-Reply-To: <20250929010321.3462457-1-pasha.tatashin@soleen.com>

From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>

The KHO framework uses a notifier chain as the mechanism for clients to
participate in the finalization process. While this works for a single,
central state machine, it is too restrictive for kernel-internal
components like pstore/reserve_mem or IMA. These components need a
simpler, direct way to register their state for preservation (e.g.,
during their initcall) without being part of a complex,
shutdown-time notifier sequence. The notifier model forces all
participants into a single finalization flow and makes direct
preservation from an arbitrary context difficult.
This patch refactors the client participation model by removing the
notifier chain and introducing a direct API for managing FDT subtrees.

The core kho_finalize() and kho_abort() state machine remains, but
clients now register their data with KHO beforehand.

Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
---
 include/linux/kexec_handover.h   |  28 +----
 kernel/kexec_handover.c          | 184 +++++++++++++++----------------
 kernel/kexec_handover_debug.c    |  17 +--
 kernel/kexec_handover_internal.h |   5 +-
 mm/memblock.c                    |  60 ++--------
 5 files changed, 118 insertions(+), 176 deletions(-)

diff --git a/include/linux/kexec_handover.h b/include/linux/kexec_handover.h
index 04d0108db98e..2faf290803ce 100644
--- a/include/linux/kexec_handover.h
+++ b/include/linux/kexec_handover.h
@@ -10,14 +10,7 @@ struct kho_scratch {
 	phys_addr_t size;
 };
 
-/* KHO Notifier index */
-enum kho_event {
-	KEXEC_KHO_FINALIZE = 0,
-	KEXEC_KHO_ABORT = 1,
-};
-
 struct folio;
-struct notifier_block;
 struct page;
 
 #define DECLARE_KHOSER_PTR(name, type) \
@@ -37,8 +30,6 @@ struct page;
 		(typeof((s).ptr))((s).phys ? phys_to_virt((s).phys) : NULL); \
 	})
 
-struct kho_serialization;
-
 struct kho_vmalloc_chunk;
 struct kho_vmalloc {
 	DECLARE_KHOSER_PTR(first, struct kho_vmalloc_chunk *);
@@ -57,12 +48,10 @@ int kho_preserve_vmalloc(void *ptr, struct kho_vmalloc *preservation);
 struct folio *kho_restore_folio(phys_addr_t phys);
 struct page *kho_restore_pages(phys_addr_t phys, unsigned int nr_pages);
 void *kho_restore_vmalloc(const struct kho_vmalloc *preservation);
-int kho_add_subtree(struct kho_serialization *ser, const char *name, void *fdt);
+int kho_add_subtree(const char *name, void *fdt);
+void kho_remove_subtree(void *fdt);
 int kho_retrieve_subtree(const char *name, phys_addr_t *phys);
 
-int register_kho_notifier(struct notifier_block *nb);
-int unregister_kho_notifier(struct notifier_block *nb);
-
 void kho_memory_init(void);
 
 void kho_populate(phys_addr_t fdt_phys, u64 fdt_len, phys_addr_t scratch_phys,
@@ -114,23 +103,16 @@ static inline void *kho_restore_vmalloc(const struct kho_vmalloc *preservation)
 	return NULL;
 }
 
-static inline int kho_add_subtree(struct kho_serialization *ser,
-				  const char *name, void *fdt)
+static inline int kho_add_subtree(const char *name, void *fdt)
 {
 	return -EOPNOTSUPP;
 }
 
-static inline int kho_retrieve_subtree(const char *name, phys_addr_t *phys)
+static inline void kho_remove_subtree(void *fdt)
 {
-	return -EOPNOTSUPP;
 }
 
-static inline int register_kho_notifier(struct notifier_block *nb)
-{
-	return -EOPNOTSUPP;
-}
-
-static inline int unregister_kho_notifier(struct notifier_block *nb)
+static inline int kho_retrieve_subtree(const char *name, phys_addr_t *phys)
 {
 	return -EOPNOTSUPP;
 }
diff --git a/kernel/kexec_handover.c b/kernel/kexec_handover.c
index f0f6c6b8ad83..e0dc0ed565ef 100644
--- a/kernel/kexec_handover.c
+++ b/kernel/kexec_handover.c
@@ -15,7 +15,6 @@
 #include <linux/libfdt.h>
 #include <linux/list.h>
 #include <linux/memblock.h>
-#include <linux/notifier.h>
 #include <linux/page-isolation.h>
 #include <linux/vmalloc.h>
 
@@ -99,33 +98,34 @@ struct kho_mem_track {
 
 struct khoser_mem_chunk;
 
-struct kho_serialization {
-	struct page *fdt;
-	struct kho_mem_track track;
-	/* First chunk of serialized preserved memory map */
-	struct khoser_mem_chunk *preserved_mem_map;
+struct kho_sub_fdt {
+	struct list_head l;
+	const char *name;
+	void *fdt;
 };
 
 struct kho_out {
-	struct blocking_notifier_head chain_head;
+	void *fdt;
+	bool finalized;
+	struct mutex lock; /* protects KHO FDT finalization */
 
-	struct dentry *dir;
+	struct list_head sub_fdts;
+	struct mutex fdts_lock;
 
-	struct mutex lock; /* protects KHO FDT finalization */
+	struct kho_mem_track track;
+	/* First chunk of serialized preserved memory map */
+	struct khoser_mem_chunk *preserved_mem_map;
 
-	struct kho_serialization ser;
-	bool finalized;
+	struct kho_debugfs dbg;
 };
 
 static struct kho_out kho_out = {
-	.chain_head = BLOCKING_NOTIFIER_INIT(kho_out.chain_head),
 	.lock = __MUTEX_INITIALIZER(kho_out.lock),
-	.ser = {
-		.fdt_list = LIST_HEAD_INIT(kho_out.ser.fdt_list),
-		.track = {
-			.orders = XARRAY_INIT(kho_out.ser.track.orders, 0),
-		},
+	.track = {
+		.orders = XARRAY_INIT(kho_out.track.orders, 0),
 	},
+	.sub_fdts = LIST_HEAD_INIT(kho_out.sub_fdts),
+	.fdts_lock = __MUTEX_INITIALIZER(kho_out.fdts_lock),
 	.finalized = false,
 };
 
@@ -366,14 +366,14 @@ static void kho_mem_ser_free(struct khoser_mem_chunk *first_chunk)
 	}
 }
 
-static int kho_mem_serialize(struct kho_serialization *ser)
+static int kho_mem_serialize(struct kho_out *kho_out)
 {
 	struct khoser_mem_chunk *first_chunk = NULL;
 	struct khoser_mem_chunk *chunk = NULL;
 	struct kho_mem_phys *physxa;
 	unsigned long order;
 
-	xa_for_each(&ser->track.orders, order, physxa) {
+	xa_for_each(&kho_out->track.orders, order, physxa) {
 		struct kho_mem_phys_bits *bits;
 		unsigned long phys;
 
@@ -401,7 +401,7 @@ static int kho_mem_serialize(struct kho_serialization *ser)
 		}
 	}
 
-	ser->preserved_mem_map = first_chunk;
+	kho_out->preserved_mem_map = first_chunk;
 
 	return 0;
 
@@ -660,28 +660,8 @@ static void __init kho_reserve_scratch(void)
 	kho_enable = false;
 }
 
-struct kho_out {
-	struct blocking_notifier_head chain_head;
-	struct mutex lock; /* protects KHO FDT finalization */
-	struct kho_serialization ser;
-	bool finalized;
-	struct kho_debugfs dbg;
-};
-
-static struct kho_out kho_out = {
-	.chain_head = BLOCKING_NOTIFIER_INIT(kho_out.chain_head),
-	.lock = __MUTEX_INITIALIZER(kho_out.lock),
-	.ser = {
-		.track = {
-			.orders = XARRAY_INIT(kho_out.ser.track.orders, 0),
-		},
-	},
-	.finalized = false,
-};
-
 /**
  * kho_add_subtree - record the physical address of a sub FDT in KHO root tree.
- * @ser: serialization control object passed by KHO notifiers.
  * @name: name of the sub tree.
  * @fdt: the sub tree blob.
  *
@@ -695,34 +675,45 @@ static struct kho_out kho_out = {
  *
  * Return: 0 on success, error code on failure
  */
-int kho_add_subtree(struct kho_serialization *ser, const char *name, void *fdt)
+int kho_add_subtree(const char *name, void *fdt)
 {
-	int err = 0;
-	u64 phys = (u64)virt_to_phys(fdt);
-	void *root = page_to_virt(ser->fdt);
+	struct kho_sub_fdt *sub_fdt;
+	int err;
 
-	err |= fdt_begin_node(root, name);
-	err |= fdt_property(root, PROP_SUB_FDT, &phys, sizeof(phys));
-	err |= fdt_end_node(root);
+	sub_fdt = kmalloc(sizeof(*sub_fdt), GFP_KERNEL);
+	if (!sub_fdt)
+		return -ENOMEM;
 
-	if (err)
-		return err;
+	INIT_LIST_HEAD(&sub_fdt->l);
+	sub_fdt->name = name;
+	sub_fdt->fdt = fdt;
+
+	mutex_lock(&kho_out.fdts_lock);
+	list_add_tail(&sub_fdt->l, &kho_out.sub_fdts);
+	err = kho_debugfs_fdt_add(&kho_out.dbg, name, fdt, false);
+	mutex_unlock(&kho_out.fdts_lock);
 
-	return kho_debugfs_fdt_add(&kho_out.dbg, name, fdt, false);
+	return err;
 }
 EXPORT_SYMBOL_GPL(kho_add_subtree);
 
-int register_kho_notifier(struct notifier_block *nb)
+void kho_remove_subtree(void *fdt)
 {
-	return blocking_notifier_chain_register(&kho_out.chain_head, nb);
-}
-EXPORT_SYMBOL_GPL(register_kho_notifier);
+	struct kho_sub_fdt *sub_fdt;
+
+	mutex_lock(&kho_out.fdts_lock);
+	list_for_each_entry(sub_fdt, &kho_out.sub_fdts, l) {
+		if (sub_fdt->fdt == fdt) {
+			list_del(&sub_fdt->l);
+			kfree(sub_fdt);
+			kho_debugfs_fdt_remove(&kho_out.dbg, fdt);
+			break;
+		}
+	}
+	mutex_unlock(&kho_out.fdts_lock);
 
-int unregister_kho_notifier(struct notifier_block *nb)
-{
-	return blocking_notifier_chain_unregister(&kho_out.chain_head, nb);
 }
-EXPORT_SYMBOL_GPL(unregister_kho_notifier);
+EXPORT_SYMBOL_GPL(kho_remove_subtree);
 
 /**
  * kho_preserve_folio - preserve a folio across kexec.
@@ -737,7 +728,7 @@ int kho_preserve_folio(struct folio *folio)
 {
 	const unsigned long pfn = folio_pfn(folio);
 	const unsigned int order = folio_order(folio);
-	struct kho_mem_track *track = &kho_out.ser.track;
+	struct kho_mem_track *track = &kho_out.track;
 
 	return __kho_preserve_order(track, pfn, order);
 }
@@ -755,7 +746,7 @@ EXPORT_SYMBOL_GPL(kho_preserve_folio);
  */
 int kho_preserve_pages(struct page *page, unsigned int nr_pages)
 {
-	struct kho_mem_track *track = &kho_out.ser.track;
+	struct kho_mem_track *track = &kho_out.track;
 	const unsigned long start_pfn = page_to_pfn(page);
 	const unsigned long end_pfn = start_pfn + nr_pages;
 	unsigned long pfn = start_pfn;
@@ -851,7 +842,7 @@ static struct kho_vmalloc_chunk *new_vmalloc_chunk(struct kho_vmalloc_chunk *cur
 
 static void kho_vmalloc_unpreserve_chunk(struct kho_vmalloc_chunk *chunk)
 {
-	struct kho_mem_track *track = &kho_out.ser.track;
+	struct kho_mem_track *track = &kho_out.track;
 	unsigned long pfn = PHYS_PFN(virt_to_phys(chunk));
 
 	__kho_unpreserve(track, pfn, pfn + 1);
@@ -1033,11 +1024,11 @@ EXPORT_SYMBOL_GPL(kho_restore_vmalloc);
 
 static int __kho_abort(void)
 {
-	int err;
+	int err = 0;
 	unsigned long order;
 	struct kho_mem_phys *physxa;
 
-	xa_for_each(&kho_out.ser.track.orders, order, physxa) {
+	xa_for_each(&kho_out.track.orders, order, physxa) {
 		struct kho_mem_phys_bits *bits;
 		unsigned long phys;
 
@@ -1047,17 +1038,13 @@ static int __kho_abort(void)
 		xa_destroy(&physxa->phys_bits);
 		kfree(physxa);
 	}
-	xa_destroy(&kho_out.ser.track.orders);
+	xa_destroy(&kho_out.track.orders);
 
-	if (kho_out.ser.preserved_mem_map) {
-		kho_mem_ser_free(kho_out.ser.preserved_mem_map);
-		kho_out.ser.preserved_mem_map = NULL;
+	if (kho_out.preserved_mem_map) {
+		kho_mem_ser_free(kho_out.preserved_mem_map);
+		kho_out.preserved_mem_map = NULL;
 	}
 
-	err = blocking_notifier_call_chain(&kho_out.chain_head, KEXEC_KHO_ABORT,
-					   NULL);
-	err = notifier_to_errno(err);
-
 	if (err)
 		pr_err("Failed to abort KHO finalization: %d\n", err);
 
@@ -1084,7 +1071,7 @@ int kho_abort(void)
 
 	kho_out.finalized = false;
 
-	kho_debugfs_cleanup(&kho_out.dbg);
+	kho_debugfs_fdt_remove(&kho_out.dbg, kho_out.fdt);
 
 unlock:
 	mutex_unlock(&kho_out.lock);
@@ -1095,41 +1082,46 @@ static int __kho_finalize(void)
 {
 	int err = 0;
 	u64 *preserved_mem_map;
-	void *fdt = page_to_virt(kho_out.ser.fdt);
+	void *root = kho_out.fdt;
+	struct kho_sub_fdt *fdt;
 
-	err |= fdt_create(fdt, PAGE_SIZE);
-	err |= fdt_finish_reservemap(fdt);
-	err |= fdt_begin_node(fdt, "");
-	err |= fdt_property_string(fdt, "compatible", KHO_FDT_COMPATIBLE);
+	err |= fdt_create(root, PAGE_SIZE);
+	err |= fdt_finish_reservemap(root);
+	err |= fdt_begin_node(root, "");
+	err |= fdt_property_string(root, "compatible", KHO_FDT_COMPATIBLE);
 	/**
 	 * Reserve the preserved-memory-map property in the root FDT, so
 	 * that all property definitions will precede subnodes created by
 	 * KHO callers.
 	 */
-	err |= fdt_property_placeholder(fdt, PROP_PRESERVED_MEMORY_MAP,
+	err |= fdt_property_placeholder(root, PROP_PRESERVED_MEMORY_MAP,
 					sizeof(*preserved_mem_map),
 					(void **)&preserved_mem_map);
 	if (err)
 		goto abort;
 
-	err = kho_preserve_folio(page_folio(kho_out.ser.fdt));
+	err = kho_preserve_folio(virt_to_folio(kho_out.fdt));
 	if (err)
 		goto abort;
 
-	err = blocking_notifier_call_chain(&kho_out.chain_head,
-					   KEXEC_KHO_FINALIZE, &kho_out.ser);
-	err = notifier_to_errno(err);
+	err = kho_mem_serialize(&kho_out);
 	if (err)
 		goto abort;
 
-	err = kho_mem_serialize(&kho_out.ser);
-	if (err)
-		goto abort;
+	*preserved_mem_map = (u64)virt_to_phys(kho_out.preserved_mem_map);
 
-	*preserved_mem_map = (u64)virt_to_phys(kho_out.ser.preserved_mem_map);
+	mutex_lock(&kho_out.fdts_lock);
+	list_for_each_entry(fdt, &kho_out.sub_fdts, l) {
+		phys_addr_t phys = virt_to_phys(fdt->fdt);
 
-	err |= fdt_end_node(fdt);
-	err |= fdt_finish(fdt);
+		err |= fdt_begin_node(root, fdt->name);
+		err |= fdt_property(root, PROP_SUB_FDT, &phys, sizeof(phys));
+		err |= fdt_end_node(root);
+	};
+	mutex_unlock(&kho_out.fdts_lock);
+
+	err |= fdt_end_node(root);
+	err |= fdt_finish(root);
 
 abort:
 	if (err) {
@@ -1160,7 +1152,7 @@ int kho_finalize(void)
 
 	kho_out.finalized = true;
 	ret = kho_debugfs_fdt_add(&kho_out.dbg, "fdt",
-				  page_to_virt(kho_out.ser.fdt), true);
+				  kho_out.fdt, true);
 
 unlock:
 	mutex_unlock(&kho_out.lock);
@@ -1252,15 +1244,17 @@ static __init int kho_init(void)
 {
 	int err = 0;
 	const void *fdt = kho_get_fdt();
+	struct page *fdt_page;
 
 	if (!kho_enable)
 		return 0;
 
-	kho_out.ser.fdt = alloc_page(GFP_KERNEL);
-	if (!kho_out.ser.fdt) {
+	fdt_page = alloc_page(GFP_KERNEL);
+	if (!fdt_page) {
 		err = -ENOMEM;
 		goto err_free_scratch;
 	}
+	kho_out.fdt = page_to_virt(fdt_page);
 
 	err = kho_debugfs_init();
 	if (err)
@@ -1288,8 +1282,8 @@ static __init int kho_init(void)
 	return 0;
 
 err_free_fdt:
-	put_page(kho_out.ser.fdt);
-	kho_out.ser.fdt = NULL;
+	put_page(fdt_page);
+	kho_out.fdt = NULL;
 err_free_scratch:
 	for (int i = 0; i < kho_scratch_cnt; i++) {
 		void *start = __va(kho_scratch[i].addr);
@@ -1300,7 +1294,7 @@ static __init int kho_init(void)
 	kho_enable = false;
 	return err;
 }
-late_initcall(kho_init);
+fs_initcall(kho_init);
 
 static void __init kho_release_scratch(void)
 {
@@ -1436,7 +1430,7 @@ int kho_fill_kimage(struct kimage *image)
 	if (!kho_out.finalized)
 		return 0;
 
-	image->kho.fdt = page_to_phys(kho_out.ser.fdt);
+	image->kho.fdt = virt_to_phys(kho_out.fdt);
 
 	scratch_size = sizeof(*kho_scratch) * kho_scratch_cnt;
 	scratch = (struct kexec_buf){
diff --git a/kernel/kexec_handover_debug.c b/kernel/kexec_handover_debug.c
index b88d138a97be..af4bad225630 100644
--- a/kernel/kexec_handover_debug.c
+++ b/kernel/kexec_handover_debug.c
@@ -61,14 +61,17 @@ int kho_debugfs_fdt_add(struct kho_debugfs *dbg, const char *name,
 	return __kho_debugfs_fdt_add(&dbg->fdt_list, dir, name, fdt);
 }
 
-void kho_debugfs_cleanup(struct kho_debugfs *dbg)
+void kho_debugfs_fdt_remove(struct kho_debugfs *dbg, void *fdt)
 {
-	struct fdt_debugfs *ff, *tmp;
-
-	list_for_each_entry_safe(ff, tmp, &dbg->fdt_list, list) {
-		debugfs_remove(ff->file);
-		list_del(&ff->list);
-		kfree(ff);
+	struct fdt_debugfs *ff;
+
+	list_for_each_entry(ff, &dbg->fdt_list, list) {
+		if (ff->wrapper.data == fdt) {
+			debugfs_remove(ff->file);
+			list_del(&ff->list);
+			kfree(ff);
+			break;
+		}
 	}
 }
 
diff --git a/kernel/kexec_handover_internal.h b/kernel/kexec_handover_internal.h
index f6f172ddcae4..229a05558b99 100644
--- a/kernel/kexec_handover_internal.h
+++ b/kernel/kexec_handover_internal.h
@@ -30,7 +30,7 @@ void kho_in_debugfs_init(struct kho_debugfs *dbg, const void *fdt);
 int kho_out_debugfs_init(struct kho_debugfs *dbg);
 int kho_debugfs_fdt_add(struct kho_debugfs *dbg, const char *name,
 			const void *fdt, bool root);
-void kho_debugfs_cleanup(struct kho_debugfs *dbg);
+void kho_debugfs_fdt_remove(struct kho_debugfs *dbg, void *fdt);
 #else
 static inline int kho_debugfs_init(void) { return 0; }
 static inline void kho_in_debugfs_init(struct kho_debugfs *dbg,
@@ -38,7 +38,8 @@ static inline void kho_in_debugfs_init(struct kho_debugfs *dbg,
 static inline int kho_out_debugfs_init(struct kho_debugfs *dbg) { return 0; }
 static inline int kho_debugfs_fdt_add(struct kho_debugfs *dbg, const char *name,
 				      const void *fdt, bool root) { return 0; }
-static inline void kho_debugfs_cleanup(struct kho_debugfs *dbg) {}
+static inline void kho_debugfs_fdt_remove(struct kho_debugfs *dbg,
+					  void *fdt) { }
 #endif /* CONFIG_KEXEC_HANDOVER_DEBUG */
 
 #endif /* LINUX_KEXEC_HANDOVER_INTERNAL_H */
diff --git a/mm/memblock.c b/mm/memblock.c
index e23e16618e9b..c4b2d4e4c715 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -2444,53 +2444,18 @@ int reserve_mem_release_by_name(const char *name)
 #define MEMBLOCK_KHO_FDT "memblock"
 #define MEMBLOCK_KHO_NODE_COMPATIBLE "memblock-v1"
 #define RESERVE_MEM_KHO_NODE_COMPATIBLE "reserve-mem-v1"
-static struct page *kho_fdt;
-
-static int reserve_mem_kho_finalize(struct kho_serialization *ser)
-{
-	int err = 0, i;
-
-	for (i = 0; i < reserved_mem_count; i++) {
-		struct reserve_mem_table *map = &reserved_mem_table[i];
-		struct page *page = phys_to_page(map->start);
-		unsigned int nr_pages = map->size >> PAGE_SHIFT;
-
-		err |= kho_preserve_pages(page, nr_pages);
-	}
-
-	err |= kho_preserve_folio(page_folio(kho_fdt));
-	err |= kho_add_subtree(ser, MEMBLOCK_KHO_FDT, page_to_virt(kho_fdt));
-
-	return notifier_from_errno(err);
-}
-
-static int reserve_mem_kho_notifier(struct notifier_block *self,
-				    unsigned long cmd, void *v)
-{
-	switch (cmd) {
-	case KEXEC_KHO_FINALIZE:
-		return reserve_mem_kho_finalize((struct kho_serialization *)v);
-	case KEXEC_KHO_ABORT:
-		return NOTIFY_DONE;
-	default:
-		return NOTIFY_BAD;
-	}
-}
-
-static struct notifier_block reserve_mem_kho_nb = {
-	.notifier_call = reserve_mem_kho_notifier,
-};
 
 static int __init prepare_kho_fdt(void)
 {
 	int err = 0, i;
+	struct page *fdt_page;
 	void *fdt;
 
-	kho_fdt = alloc_page(GFP_KERNEL);
-	if (!kho_fdt)
+	fdt_page = alloc_page(GFP_KERNEL);
+	if (!fdt_page)
 		return -ENOMEM;
 
-	fdt = page_to_virt(kho_fdt);
+	fdt = page_to_virt(fdt_page);
 
 	err |= fdt_create(fdt, PAGE_SIZE);
 	err |= fdt_finish_reservemap(fdt);
@@ -2499,7 +2464,10 @@ static int __init prepare_kho_fdt(void)
 	err |= fdt_property_string(fdt, "compatible", MEMBLOCK_KHO_NODE_COMPATIBLE);
 	for (i = 0; i < reserved_mem_count; i++) {
 		struct reserve_mem_table *map = &reserved_mem_table[i];
+		struct page *page = phys_to_page(map->start);
+		unsigned int nr_pages = map->size >> PAGE_SHIFT;
 
+		err |= kho_preserve_pages(page, nr_pages);
 		err |= fdt_begin_node(fdt, map->name);
 		err |= fdt_property_string(fdt, "compatible", RESERVE_MEM_KHO_NODE_COMPATIBLE);
 		err |= fdt_property(fdt, "start", &map->start, sizeof(map->start));
@@ -2507,13 +2475,14 @@ static int __init prepare_kho_fdt(void)
 		err |= fdt_end_node(fdt);
 	}
 	err |= fdt_end_node(fdt);
-
 	err |= fdt_finish(fdt);
 
+	err |= kho_preserve_folio(page_folio(fdt_page));
+	err |= kho_add_subtree(MEMBLOCK_KHO_FDT, fdt);
+
 	if (err) {
 		pr_err("failed to prepare memblock FDT for KHO: %d\n", err);
-		put_page(kho_fdt);
-		kho_fdt = NULL;
+		put_page(fdt_page);
 	}
 
 	return err;
@@ -2529,13 +2498,6 @@ static int __init reserve_mem_init(void)
 	err = prepare_kho_fdt();
 	if (err)
 		return err;
-
-	err = register_kho_notifier(&reserve_mem_kho_nb);
-	if (err) {
-		put_page(kho_fdt);
-		kho_fdt = NULL;
-	}
-
 	return err;
 }
 late_initcall(reserve_mem_init);
-- 
2.51.0.536.g15c5d4f767-goog


^ permalink raw reply related

* [PATCH v4 02/30] kho: make debugfs interface optional
From: Pasha Tatashin @ 2025-09-29  1:02 UTC (permalink / raw)
  To: pratyush, jasonmiu, graf, changyuanl, pasha.tatashin, rppt,
	dmatlack, rientjes, corbet, rdunlap, ilpo.jarvinen, kanie, ojeda,
	aliceryhl, masahiroy, akpm, tj, yoann.congal, mmaurer,
	roman.gushchin, chenridong, axboe, mark.rutland, jannh,
	vincent.guittot, hannes, dan.j.williams, david, joel.granados,
	rostedt, anna.schumaker, song, zhangguopeng, linux, linux-kernel,
	linux-doc, linux-mm, gregkh, tglx, mingo, bp, dave.hansen, x86,
	hpa, rafael, dakr, bartosz.golaszewski, cw00.choi, myungjoo.ham,
	yesanishhere, Jonathan.Cameron, quic_zijuhu, aleksander.lobakin,
	ira.weiny, andriy.shevchenko, leon, lukas, bhelgaas, wagi,
	djeffery, stuart.w.hayes, ptyadav, lennart, brauner, linux-api,
	linux-fsdevel, saeedm, ajayachandra, jgg, parav, leonro, witu,
	hughd, skhawaja, chrisl, steven.sistare
In-Reply-To: <20250929010321.3462457-1-pasha.tatashin@soleen.com>

Currently, KHO is controlled via debugfs interface, but once LUO is
introduced, it can control KHO, and the debug interface becomes
optional.

Add a separate config CONFIG_KEXEC_HANDOVER_DEBUG that enables
the debugfs interface, and allows to inspect the tree.

Move all debugfs related code to a new file to keep the .c files
clear of ifdefs.

Co-developed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
---
 MAINTAINERS                      |   3 +-
 kernel/Kconfig.kexec             |  10 ++
 kernel/Makefile                  |   1 +
 kernel/kexec_handover.c          | 255 +++++--------------------------
 kernel/kexec_handover_debug.c    | 218 ++++++++++++++++++++++++++
 kernel/kexec_handover_internal.h |  44 ++++++
 6 files changed, 311 insertions(+), 220 deletions(-)
 create mode 100644 kernel/kexec_handover_debug.c
 create mode 100644 kernel/kexec_handover_internal.h

diff --git a/MAINTAINERS b/MAINTAINERS
index 156fa8eefa69..a6cbcc7fb396 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -13759,13 +13759,14 @@ KEXEC HANDOVER (KHO)
 M:	Alexander Graf <graf@amazon.com>
 M:	Mike Rapoport <rppt@kernel.org>
 M:	Changyuan Lyu <changyuanl@google.com>
+M:	Pasha Tatashin <pasha.tatashin@soleen.com>
 L:	kexec@lists.infradead.org
 L:	linux-mm@kvack.org
 S:	Maintained
 F:	Documentation/admin-guide/mm/kho.rst
 F:	Documentation/core-api/kho/*
 F:	include/linux/kexec_handover.h
-F:	kernel/kexec_handover.c
+F:	kernel/kexec_handover*
 F:	tools/testing/selftests/kho/
 
 KEYS-ENCRYPTED
diff --git a/kernel/Kconfig.kexec b/kernel/Kconfig.kexec
index 422270d64820..e68156d8c72b 100644
--- a/kernel/Kconfig.kexec
+++ b/kernel/Kconfig.kexec
@@ -109,6 +109,16 @@ config KEXEC_HANDOVER
 	  to keep data or state alive across the kexec. For this to work,
 	  both source and target kernels need to have this option enabled.
 
+config KEXEC_HANDOVER_DEBUG
+	bool "kexec handover debug interface"
+	depends on KEXEC_HANDOVER
+	depends on DEBUG_FS
+	help
+	  Allow to control kexec handover device tree via debugfs
+	  interface, i.e. finalize the state or aborting the finalization.
+	  Also, enables inspecting the KHO fdt trees with the debugfs binary
+	  blobs.
+
 config CRASH_DUMP
 	bool "kernel crash dumps"
 	default ARCH_DEFAULT_CRASH_DUMP
diff --git a/kernel/Makefile b/kernel/Makefile
index df3dd8291bb6..9fe722305c9b 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -83,6 +83,7 @@ obj-$(CONFIG_KEXEC) += kexec.o
 obj-$(CONFIG_KEXEC_FILE) += kexec_file.o
 obj-$(CONFIG_KEXEC_ELF) += kexec_elf.o
 obj-$(CONFIG_KEXEC_HANDOVER) += kexec_handover.o
+obj-$(CONFIG_KEXEC_HANDOVER_DEBUG) += kexec_handover_debug.o
 obj-$(CONFIG_BACKTRACE_SELF_TEST) += backtracetest.o
 obj-$(CONFIG_COMPAT) += compat.o
 obj-$(CONFIG_CGROUPS) += cgroup/
diff --git a/kernel/kexec_handover.c b/kernel/kexec_handover.c
index 0ba5a2dbae28..f0f6c6b8ad83 100644
--- a/kernel/kexec_handover.c
+++ b/kernel/kexec_handover.c
@@ -10,7 +10,6 @@
 
 #include <linux/cma.h>
 #include <linux/count_zeros.h>
-#include <linux/debugfs.h>
 #include <linux/kexec.h>
 #include <linux/kexec_handover.h>
 #include <linux/libfdt.h>
@@ -28,6 +27,7 @@
  */
 #include "../mm/internal.h"
 #include "kexec_internal.h"
+#include "kexec_handover_internal.h"
 
 #define KHO_FDT_COMPATIBLE "kho-v1"
 #define PROP_PRESERVED_MEMORY_MAP "preserved-memory-map"
@@ -101,8 +101,6 @@ struct khoser_mem_chunk;
 
 struct kho_serialization {
 	struct page *fdt;
-	struct list_head fdt_list;
-	struct dentry *sub_fdt_dir;
 	struct kho_mem_track track;
 	/* First chunk of serialized preserved memory map */
 	struct khoser_mem_chunk *preserved_mem_map;
@@ -465,8 +463,8 @@ static void __init kho_mem_deserialize(const void *fdt)
  * area for early allocations that happen before page allocator is
  * initialized.
  */
-static struct kho_scratch *kho_scratch;
-static unsigned int kho_scratch_cnt;
+struct kho_scratch *kho_scratch;
+unsigned int kho_scratch_cnt;
 
 /*
  * The scratch areas are scaled by default as percent of memory allocated from
@@ -662,36 +660,24 @@ static void __init kho_reserve_scratch(void)
 	kho_enable = false;
 }
 
-struct fdt_debugfs {
-	struct list_head list;
-	struct debugfs_blob_wrapper wrapper;
-	struct dentry *file;
+struct kho_out {
+	struct blocking_notifier_head chain_head;
+	struct mutex lock; /* protects KHO FDT finalization */
+	struct kho_serialization ser;
+	bool finalized;
+	struct kho_debugfs dbg;
 };
 
-static int kho_debugfs_fdt_add(struct list_head *list, struct dentry *dir,
-			       const char *name, const void *fdt)
-{
-	struct fdt_debugfs *f;
-	struct dentry *file;
-
-	f = kmalloc(sizeof(*f), GFP_KERNEL);
-	if (!f)
-		return -ENOMEM;
-
-	f->wrapper.data = (void *)fdt;
-	f->wrapper.size = fdt_totalsize(fdt);
-
-	file = debugfs_create_blob(name, 0400, dir, &f->wrapper);
-	if (IS_ERR(file)) {
-		kfree(f);
-		return PTR_ERR(file);
-	}
-
-	f->file = file;
-	list_add(&f->list, list);
-
-	return 0;
-}
+static struct kho_out kho_out = {
+	.chain_head = BLOCKING_NOTIFIER_INIT(kho_out.chain_head),
+	.lock = __MUTEX_INITIALIZER(kho_out.lock),
+	.ser = {
+		.track = {
+			.orders = XARRAY_INIT(kho_out.ser.track.orders, 0),
+		},
+	},
+	.finalized = false,
+};
 
 /**
  * kho_add_subtree - record the physical address of a sub FDT in KHO root tree.
@@ -704,7 +690,8 @@ static int kho_debugfs_fdt_add(struct list_head *list, struct dentry *dir,
  * by KHO for the new kernel to retrieve it after kexec.
  *
  * A debugfs blob entry is also created at
- * ``/sys/kernel/debug/kho/out/sub_fdts/@name``.
+ * ``/sys/kernel/debug/kho/out/sub_fdts/@name`` when kernel is configured with
+ * CONFIG_KEXEC_HANDOVER_DEBUG
  *
  * Return: 0 on success, error code on failure
  */
@@ -721,7 +708,7 @@ int kho_add_subtree(struct kho_serialization *ser, const char *name, void *fdt)
 	if (err)
 		return err;
 
-	return kho_debugfs_fdt_add(&ser->fdt_list, ser->sub_fdt_dir, name, fdt);
+	return kho_debugfs_fdt_add(&kho_out.dbg, name, fdt, false);
 }
 EXPORT_SYMBOL_GPL(kho_add_subtree);
 
@@ -1044,29 +1031,6 @@ void *kho_restore_vmalloc(const struct kho_vmalloc *preservation)
 }
 EXPORT_SYMBOL_GPL(kho_restore_vmalloc);
 
-/* Handling for debug/kho/out */
-
-static struct dentry *debugfs_root;
-
-static int kho_out_update_debugfs_fdt(void)
-{
-	int err = 0;
-	struct fdt_debugfs *ff, *tmp;
-
-	if (kho_out.finalized) {
-		err = kho_debugfs_fdt_add(&kho_out.ser.fdt_list, kho_out.dir,
-					  "fdt", page_to_virt(kho_out.ser.fdt));
-	} else {
-		list_for_each_entry_safe(ff, tmp, &kho_out.ser.fdt_list, list) {
-			debugfs_remove(ff->file);
-			list_del(&ff->list);
-			kfree(ff);
-		}
-	}
-
-	return err;
-}
-
 static int __kho_abort(void)
 {
 	int err;
@@ -1119,7 +1083,8 @@ int kho_abort(void)
 		goto unlock;
 
 	kho_out.finalized = false;
-	ret = kho_out_update_debugfs_fdt();
+
+	kho_debugfs_cleanup(&kho_out.dbg);
 
 unlock:
 	mutex_unlock(&kho_out.lock);
@@ -1169,7 +1134,7 @@ static int __kho_finalize(void)
 abort:
 	if (err) {
 		pr_err("Failed to convert KHO state tree: %d\n", err);
-		kho_abort();
+		__kho_abort();
 	}
 
 	return err;
@@ -1194,119 +1159,32 @@ int kho_finalize(void)
 		goto unlock;
 
 	kho_out.finalized = true;
-	ret = kho_out_update_debugfs_fdt();
+	ret = kho_debugfs_fdt_add(&kho_out.dbg, "fdt",
+				  page_to_virt(kho_out.ser.fdt), true);
 
 unlock:
 	mutex_unlock(&kho_out.lock);
 	return ret;
 }
 
-static int kho_out_finalize_get(void *data, u64 *val)
+bool kho_finalized(void)
 {
-	mutex_lock(&kho_out.lock);
-	*val = kho_out.finalized;
-	mutex_unlock(&kho_out.lock);
-
-	return 0;
-}
-
-static int kho_out_finalize_set(void *data, u64 _val)
-{
-	int ret = 0;
-	bool val = !!_val;
+	bool ret;
 
 	mutex_lock(&kho_out.lock);
-
-	if (val == kho_out.finalized) {
-		if (kho_out.finalized)
-			ret = -EEXIST;
-		else
-			ret = -ENOENT;
-		goto unlock;
-	}
-
-	if (val)
-		ret = kho_finalize();
-	else
-		ret = kho_abort();
-
-	if (ret)
-		goto unlock;
-
-	kho_out.finalized = val;
-	ret = kho_out_update_debugfs_fdt();
-
-unlock:
+	ret = kho_out.finalized;
 	mutex_unlock(&kho_out.lock);
-	return ret;
-}
-
-DEFINE_DEBUGFS_ATTRIBUTE(fops_kho_out_finalize, kho_out_finalize_get,
-			 kho_out_finalize_set, "%llu\n");
-
-static int scratch_phys_show(struct seq_file *m, void *v)
-{
-	for (int i = 0; i < kho_scratch_cnt; i++)
-		seq_printf(m, "0x%llx\n", kho_scratch[i].addr);
-
-	return 0;
-}
-DEFINE_SHOW_ATTRIBUTE(scratch_phys);
 
-static int scratch_len_show(struct seq_file *m, void *v)
-{
-	for (int i = 0; i < kho_scratch_cnt; i++)
-		seq_printf(m, "0x%llx\n", kho_scratch[i].size);
-
-	return 0;
-}
-DEFINE_SHOW_ATTRIBUTE(scratch_len);
-
-static __init int kho_out_debugfs_init(void)
-{
-	struct dentry *dir, *f, *sub_fdt_dir;
-
-	dir = debugfs_create_dir("out", debugfs_root);
-	if (IS_ERR(dir))
-		return -ENOMEM;
-
-	sub_fdt_dir = debugfs_create_dir("sub_fdts", dir);
-	if (IS_ERR(sub_fdt_dir))
-		goto err_rmdir;
-
-	f = debugfs_create_file("scratch_phys", 0400, dir, NULL,
-				&scratch_phys_fops);
-	if (IS_ERR(f))
-		goto err_rmdir;
-
-	f = debugfs_create_file("scratch_len", 0400, dir, NULL,
-				&scratch_len_fops);
-	if (IS_ERR(f))
-		goto err_rmdir;
-
-	f = debugfs_create_file("finalize", 0600, dir, NULL,
-				&fops_kho_out_finalize);
-	if (IS_ERR(f))
-		goto err_rmdir;
-
-	kho_out.dir = dir;
-	kho_out.ser.sub_fdt_dir = sub_fdt_dir;
-	return 0;
-
-err_rmdir:
-	debugfs_remove_recursive(dir);
-	return -ENOENT;
+	return ret;
 }
 
 struct kho_in {
-	struct dentry *dir;
 	phys_addr_t fdt_phys;
 	phys_addr_t scratch_phys;
-	struct list_head fdt_list;
+	struct kho_debugfs dbg;
 };
 
 static struct kho_in kho_in = {
-	.fdt_list = LIST_HEAD_INIT(kho_in.fdt_list),
 };
 
 static const void *kho_get_fdt(void)
@@ -1370,56 +1248,6 @@ int kho_retrieve_subtree(const char *name, phys_addr_t *phys)
 }
 EXPORT_SYMBOL_GPL(kho_retrieve_subtree);
 
-/* Handling for debugfs/kho/in */
-
-static __init int kho_in_debugfs_init(const void *fdt)
-{
-	struct dentry *sub_fdt_dir;
-	int err, child;
-
-	kho_in.dir = debugfs_create_dir("in", debugfs_root);
-	if (IS_ERR(kho_in.dir))
-		return PTR_ERR(kho_in.dir);
-
-	sub_fdt_dir = debugfs_create_dir("sub_fdts", kho_in.dir);
-	if (IS_ERR(sub_fdt_dir)) {
-		err = PTR_ERR(sub_fdt_dir);
-		goto err_rmdir;
-	}
-
-	err = kho_debugfs_fdt_add(&kho_in.fdt_list, kho_in.dir, "fdt", fdt);
-	if (err)
-		goto err_rmdir;
-
-	fdt_for_each_subnode(child, fdt, 0) {
-		int len = 0;
-		const char *name = fdt_get_name(fdt, child, NULL);
-		const u64 *fdt_phys;
-
-		fdt_phys = fdt_getprop(fdt, child, "fdt", &len);
-		if (!fdt_phys)
-			continue;
-		if (len != sizeof(*fdt_phys)) {
-			pr_warn("node `%s`'s prop `fdt` has invalid length: %d\n",
-				name, len);
-			continue;
-		}
-		err = kho_debugfs_fdt_add(&kho_in.fdt_list, sub_fdt_dir, name,
-					  phys_to_virt(*fdt_phys));
-		if (err) {
-			pr_warn("failed to add fdt `%s` to debugfs: %d\n", name,
-				err);
-			continue;
-		}
-	}
-
-	return 0;
-
-err_rmdir:
-	debugfs_remove_recursive(kho_in.dir);
-	return err;
-}
-
 static __init int kho_init(void)
 {
 	int err = 0;
@@ -1434,27 +1262,16 @@ static __init int kho_init(void)
 		goto err_free_scratch;
 	}
 
-	debugfs_root = debugfs_create_dir("kho", NULL);
-	if (IS_ERR(debugfs_root)) {
-		err = -ENOENT;
+	err = kho_debugfs_init();
+	if (err)
 		goto err_free_fdt;
-	}
 
-	err = kho_out_debugfs_init();
+	err = kho_out_debugfs_init(&kho_out.dbg);
 	if (err)
 		goto err_free_fdt;
 
 	if (fdt) {
-		err = kho_in_debugfs_init(fdt);
-		/*
-		 * Failure to create /sys/kernel/debug/kho/in does not prevent
-		 * reviving state from KHO and setting up KHO for the next
-		 * kexec.
-		 */
-		if (err)
-			pr_err("failed exposing handover FDT in debugfs: %d\n",
-			       err);
-
+		kho_in_debugfs_init(&kho_in.dbg, fdt);
 		return 0;
 	}
 
diff --git a/kernel/kexec_handover_debug.c b/kernel/kexec_handover_debug.c
new file mode 100644
index 000000000000..b88d138a97be
--- /dev/null
+++ b/kernel/kexec_handover_debug.c
@@ -0,0 +1,218 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * kexec_handover.c - kexec handover metadata processing
+ * Copyright (C) 2023 Alexander Graf <graf@amazon.com>
+ * Copyright (C) 2025 Microsoft Corporation, Mike Rapoport <rppt@kernel.org>
+ * Copyright (C) 2025 Google LLC, Changyuan Lyu <changyuanl@google.com>
+ * Copyright (C) 2025 Google LLC, Pasha Tatashin <pasha.tatashin@soleen.com>
+ */
+
+#define pr_fmt(fmt) "KHO: " fmt
+
+#include <linux/init.h>
+#include <linux/io.h>
+#include <linux/libfdt.h>
+#include <linux/mm.h>
+#include "kexec_handover_internal.h"
+
+static struct dentry *debugfs_root;
+
+struct fdt_debugfs {
+	struct list_head list;
+	struct debugfs_blob_wrapper wrapper;
+	struct dentry *file;
+};
+
+static int __kho_debugfs_fdt_add(struct list_head *list, struct dentry *dir,
+				 const char *name, const void *fdt)
+{
+	struct fdt_debugfs *f;
+	struct dentry *file;
+
+	f = kmalloc(sizeof(*f), GFP_KERNEL);
+	if (!f)
+		return -ENOMEM;
+
+	f->wrapper.data = (void *)fdt;
+	f->wrapper.size = fdt_totalsize(fdt);
+
+	file = debugfs_create_blob(name, 0400, dir, &f->wrapper);
+	if (IS_ERR(file)) {
+		kfree(f);
+		return PTR_ERR(file);
+	}
+
+	f->file = file;
+	list_add(&f->list, list);
+
+	return 0;
+}
+
+int kho_debugfs_fdt_add(struct kho_debugfs *dbg, const char *name,
+			const void *fdt, bool root)
+{
+	struct dentry *dir;
+
+	if (root)
+		dir = dbg->dir;
+	else
+		dir = dbg->sub_fdt_dir;
+
+	return __kho_debugfs_fdt_add(&dbg->fdt_list, dir, name, fdt);
+}
+
+void kho_debugfs_cleanup(struct kho_debugfs *dbg)
+{
+	struct fdt_debugfs *ff, *tmp;
+
+	list_for_each_entry_safe(ff, tmp, &dbg->fdt_list, list) {
+		debugfs_remove(ff->file);
+		list_del(&ff->list);
+		kfree(ff);
+	}
+}
+
+static int kho_out_finalize_get(void *data, u64 *val)
+{
+	*val = kho_finalized();
+
+	return 0;
+}
+
+static int kho_out_finalize_set(void *data, u64 _val)
+{
+	bool val = !!_val;
+
+	if (val)
+		return kho_finalize();
+
+	return kho_abort();
+}
+
+DEFINE_DEBUGFS_ATTRIBUTE(kho_out_finalize_fops, kho_out_finalize_get,
+			 kho_out_finalize_set, "%llu\n");
+
+static int scratch_phys_show(struct seq_file *m, void *v)
+{
+	for (int i = 0; i < kho_scratch_cnt; i++)
+		seq_printf(m, "0x%llx\n", kho_scratch[i].addr);
+
+	return 0;
+}
+DEFINE_SHOW_ATTRIBUTE(scratch_phys);
+
+static int scratch_len_show(struct seq_file *m, void *v)
+{
+	for (int i = 0; i < kho_scratch_cnt; i++)
+		seq_printf(m, "0x%llx\n", kho_scratch[i].size);
+
+	return 0;
+}
+DEFINE_SHOW_ATTRIBUTE(scratch_len);
+
+__init void kho_in_debugfs_init(struct kho_debugfs *dbg, const void *fdt)
+{
+	struct dentry *dir, *sub_fdt_dir;
+	int err, child;
+
+	INIT_LIST_HEAD(&dbg->fdt_list);
+
+	dir = debugfs_create_dir("in", debugfs_root);
+	if (IS_ERR(dir)) {
+		err = PTR_ERR(dir);
+		goto err_out;
+	}
+
+	sub_fdt_dir = debugfs_create_dir("sub_fdts", dir);
+	if (IS_ERR(sub_fdt_dir)) {
+		err = PTR_ERR(sub_fdt_dir);
+		goto err_rmdir;
+	}
+
+	err = __kho_debugfs_fdt_add(&dbg->fdt_list, dir, "fdt", fdt);
+	if (err)
+		goto err_rmdir;
+
+	fdt_for_each_subnode(child, fdt, 0) {
+		int len = 0;
+		const char *name = fdt_get_name(fdt, child, NULL);
+		const u64 *fdt_phys;
+
+		fdt_phys = fdt_getprop(fdt, child, "fdt", &len);
+		if (!fdt_phys)
+			continue;
+		if (len != sizeof(*fdt_phys)) {
+			pr_warn("node %s prop fdt has invalid length: %d\n",
+				name, len);
+			continue;
+		}
+		err = __kho_debugfs_fdt_add(&dbg->fdt_list, sub_fdt_dir, name,
+					    phys_to_virt(*fdt_phys));
+		if (err) {
+			pr_warn("failed to add fdt %s to debugfs: %d\n", name,
+				err);
+			continue;
+		}
+	}
+
+	dbg->dir = dir;
+	dbg->sub_fdt_dir = sub_fdt_dir;
+
+	return;
+err_rmdir:
+	debugfs_remove_recursive(dir);
+err_out:
+	/*
+	 * Failure to create /sys/kernel/debug/kho/in does not prevent
+	 * reviving state from KHO and setting up KHO for the next
+	 * kexec.
+	 */
+	if (err)
+		pr_err("failed exposing handover FDT in debugfs: %d\n", err);
+}
+
+__init int kho_out_debugfs_init(struct kho_debugfs *dbg)
+{
+	struct dentry *dir, *f, *sub_fdt_dir;
+
+	INIT_LIST_HEAD(&dbg->fdt_list);
+
+	dir = debugfs_create_dir("out", debugfs_root);
+	if (IS_ERR(dir))
+		return -ENOMEM;
+
+	sub_fdt_dir = debugfs_create_dir("sub_fdts", dir);
+	if (IS_ERR(sub_fdt_dir))
+		goto err_rmdir;
+
+	f = debugfs_create_file("scratch_phys", 0400, dir, NULL,
+				&scratch_phys_fops);
+	if (IS_ERR(f))
+		goto err_rmdir;
+
+	f = debugfs_create_file("scratch_len", 0400, dir, NULL,
+				&scratch_len_fops);
+	if (IS_ERR(f))
+		goto err_rmdir;
+
+	f = debugfs_create_file("finalize", 0600, dir, NULL,
+				&kho_out_finalize_fops);
+	if (IS_ERR(f))
+		goto err_rmdir;
+
+	dbg->dir = dir;
+	dbg->sub_fdt_dir = sub_fdt_dir;
+	return 0;
+
+err_rmdir:
+	debugfs_remove_recursive(dir);
+	return -ENOENT;
+}
+
+__init int kho_debugfs_init(void)
+{
+	debugfs_root = debugfs_create_dir("kho", NULL);
+	if (IS_ERR(debugfs_root))
+		return -ENOENT;
+	return 0;
+}
diff --git a/kernel/kexec_handover_internal.h b/kernel/kexec_handover_internal.h
new file mode 100644
index 000000000000..f6f172ddcae4
--- /dev/null
+++ b/kernel/kexec_handover_internal.h
@@ -0,0 +1,44 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef LINUX_KEXEC_HANDOVER_INTERNAL_H
+#define LINUX_KEXEC_HANDOVER_INTERNAL_H
+
+#include <linux/kexec_handover.h>
+#include <linux/list.h>
+#include <linux/types.h>
+
+#ifdef CONFIG_KEXEC_HANDOVER_DEBUG
+#include <linux/debugfs.h>
+
+struct kho_debugfs {
+	struct dentry *dir;
+	struct dentry *sub_fdt_dir;
+	struct list_head fdt_list;
+};
+
+#else
+struct kho_debugfs {};
+#endif
+
+extern struct kho_scratch *kho_scratch;
+extern unsigned int kho_scratch_cnt;
+
+bool kho_finalized(void);
+
+#ifdef CONFIG_KEXEC_HANDOVER_DEBUG
+int kho_debugfs_init(void);
+void kho_in_debugfs_init(struct kho_debugfs *dbg, const void *fdt);
+int kho_out_debugfs_init(struct kho_debugfs *dbg);
+int kho_debugfs_fdt_add(struct kho_debugfs *dbg, const char *name,
+			const void *fdt, bool root);
+void kho_debugfs_cleanup(struct kho_debugfs *dbg);
+#else
+static inline int kho_debugfs_init(void) { return 0; }
+static inline void kho_in_debugfs_init(struct kho_debugfs *dbg,
+				       const void *fdt) { }
+static inline int kho_out_debugfs_init(struct kho_debugfs *dbg) { return 0; }
+static inline int kho_debugfs_fdt_add(struct kho_debugfs *dbg, const char *name,
+				      const void *fdt, bool root) { return 0; }
+static inline void kho_debugfs_cleanup(struct kho_debugfs *dbg) {}
+#endif /* CONFIG_KEXEC_HANDOVER_DEBUG */
+
+#endif /* LINUX_KEXEC_HANDOVER_INTERNAL_H */
-- 
2.51.0.536.g15c5d4f767-goog


^ permalink raw reply related

* [PATCH v4 01/30] kho: allow to drive kho from within kernel
From: Pasha Tatashin @ 2025-09-29  1:02 UTC (permalink / raw)
  To: pratyush, jasonmiu, graf, changyuanl, pasha.tatashin, rppt,
	dmatlack, rientjes, corbet, rdunlap, ilpo.jarvinen, kanie, ojeda,
	aliceryhl, masahiroy, akpm, tj, yoann.congal, mmaurer,
	roman.gushchin, chenridong, axboe, mark.rutland, jannh,
	vincent.guittot, hannes, dan.j.williams, david, joel.granados,
	rostedt, anna.schumaker, song, zhangguopeng, linux, linux-kernel,
	linux-doc, linux-mm, gregkh, tglx, mingo, bp, dave.hansen, x86,
	hpa, rafael, dakr, bartosz.golaszewski, cw00.choi, myungjoo.ham,
	yesanishhere, Jonathan.Cameron, quic_zijuhu, aleksander.lobakin,
	ira.weiny, andriy.shevchenko, leon, lukas, bhelgaas, wagi,
	djeffery, stuart.w.hayes, ptyadav, lennart, brauner, linux-api,
	linux-fsdevel, saeedm, ajayachandra, jgg, parav, leonro, witu,
	hughd, skhawaja, chrisl, steven.sistare
In-Reply-To: <20250929010321.3462457-1-pasha.tatashin@soleen.com>

Allow to do finalize and abort from kernel modules, so LUO could
drive the KHO sequence via its own state machine.

Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
---
 include/linux/kexec_handover.h | 15 +++++++++
 kernel/kexec_handover.c        | 56 ++++++++++++++++++++++++++++++++--
 2 files changed, 69 insertions(+), 2 deletions(-)

diff --git a/include/linux/kexec_handover.h b/include/linux/kexec_handover.h
index 25042c1d8d54..04d0108db98e 100644
--- a/include/linux/kexec_handover.h
+++ b/include/linux/kexec_handover.h
@@ -67,6 +67,10 @@ void kho_memory_init(void);
 
 void kho_populate(phys_addr_t fdt_phys, u64 fdt_len, phys_addr_t scratch_phys,
 		  u64 scratch_len);
+
+int kho_finalize(void);
+int kho_abort(void);
+
 #else
 static inline bool kho_is_enabled(void)
 {
@@ -139,6 +143,17 @@ static inline void kho_populate(phys_addr_t fdt_phys, u64 fdt_len,
 				phys_addr_t scratch_phys, u64 scratch_len)
 {
 }
+
+static inline int kho_finalize(void)
+{
+	return -EOPNOTSUPP;
+}
+
+static inline int kho_abort(void)
+{
+	return -EOPNOTSUPP;
+}
+
 #endif /* CONFIG_KEXEC_HANDOVER */
 
 #endif /* LINUX_KEXEC_HANDOVER_H */
diff --git a/kernel/kexec_handover.c b/kernel/kexec_handover.c
index 76f0940fb485..0ba5a2dbae28 100644
--- a/kernel/kexec_handover.c
+++ b/kernel/kexec_handover.c
@@ -1067,7 +1067,7 @@ static int kho_out_update_debugfs_fdt(void)
 	return err;
 }
 
-static int kho_abort(void)
+static int __kho_abort(void)
 {
 	int err;
 	unsigned long order;
@@ -1100,7 +1100,33 @@ static int kho_abort(void)
 	return err;
 }
 
-static int kho_finalize(void)
+int kho_abort(void)
+{
+	int ret = 0;
+
+	if (!kho_enable)
+		return -EOPNOTSUPP;
+
+	mutex_lock(&kho_out.lock);
+
+	if (!kho_out.finalized) {
+		ret = -ENOENT;
+		goto unlock;
+	}
+
+	ret = __kho_abort();
+	if (ret)
+		goto unlock;
+
+	kho_out.finalized = false;
+	ret = kho_out_update_debugfs_fdt();
+
+unlock:
+	mutex_unlock(&kho_out.lock);
+	return ret;
+}
+
+static int __kho_finalize(void)
 {
 	int err = 0;
 	u64 *preserved_mem_map;
@@ -1149,6 +1175,32 @@ static int kho_finalize(void)
 	return err;
 }
 
+int kho_finalize(void)
+{
+	int ret = 0;
+
+	if (!kho_enable)
+		return -EOPNOTSUPP;
+
+	mutex_lock(&kho_out.lock);
+
+	if (kho_out.finalized) {
+		ret = -EEXIST;
+		goto unlock;
+	}
+
+	ret = __kho_finalize();
+	if (ret)
+		goto unlock;
+
+	kho_out.finalized = true;
+	ret = kho_out_update_debugfs_fdt();
+
+unlock:
+	mutex_unlock(&kho_out.lock);
+	return ret;
+}
+
 static int kho_out_finalize_get(void *data, u64 *val)
 {
 	mutex_lock(&kho_out.lock);
-- 
2.51.0.536.g15c5d4f767-goog


^ permalink raw reply related

* [PATCH v4 00/30] Live Update Orchestrator
From: Pasha Tatashin @ 2025-09-29  1:02 UTC (permalink / raw)
  To: pratyush, jasonmiu, graf, changyuanl, pasha.tatashin, rppt,
	dmatlack, rientjes, corbet, rdunlap, ilpo.jarvinen, kanie, ojeda,
	aliceryhl, masahiroy, akpm, tj, yoann.congal, mmaurer,
	roman.gushchin, chenridong, axboe, mark.rutland, jannh,
	vincent.guittot, hannes, dan.j.williams, david, joel.granados,
	rostedt, anna.schumaker, song, zhangguopeng, linux, linux-kernel,
	linux-doc, linux-mm, gregkh, tglx, mingo, bp, dave.hansen, x86,
	hpa, rafael, dakr, bartosz.golaszewski, cw00.choi, myungjoo.ham,
	yesanishhere, Jonathan.Cameron, quic_zijuhu, aleksander.lobakin,
	ira.weiny, andriy.shevchenko, leon, lukas, bhelgaas, wagi,
	djeffery, stuart.w.hayes, ptyadav, lennart, brauner, linux-api,
	linux-fsdevel, saeedm, ajayachandra, jgg, parav, leonro, witu,
	hughd, skhawaja, chrisl, steven.sistare

This series introduces the Live Update Orchestrator (LUO), a kernel
subsystem designed to facilitate live kernel updates. LUO enables
kexec-based reboots with minimal downtime, a critical capability for
cloud environments where hypervisors must be updated without disrupting
running virtual machines. By preserving the state of selected resources,
such as file descriptors and memory, LUO allows workloads to resume
seamlessly in the new kernel.

The git branch for this series can be found at:
https://github.com/googleprodkernel/linux-liveupdate/tree/luo/v4

The patch series applies against linux-next tag: next-20250926

While this series is showed cased using memfd preservation. There are
works to preserve devices:
1. IOMMU: https://lore.kernel.org/all/20250928190624.3735830-16-skhawaja@google.com
2. PCI: https://lore.kernel.org/all/20250916-luo-pci-v2-0-c494053c3c08@kernel.org

=======================================================================
Changelog since v3:
(https://lore.kernel.org/all/20250807014442.3829950-1-pasha.tatashin@soleen.com):

- The main architectural change in this version is introduction of
  "sessions" to manage the lifecycle of preserved file descriptors.
  In v3, session management was left to a single userspace agent. This
  approach has been revised to improve robustness. Now, each session is
  represented by a file descriptor (/dev/liveupdate). The lifecycle of
  all preserved resources within a session is tied to this FD, ensuring
  automatic cleanup by the kernel if the controlling userspace agent
  crashes or exits unexpectedly.

- The first three KHO fixes from the previous series have been merged
  into Linus' tree.

- Various bug fixes and refactorings, including correcting memory
  unpreservation logic during a kho_abort() sequence.

- Addressing all comments from reviewers.

- Removing sysfs interface (/sys/kernel/liveupdate/state), the state
  can now be queried  only via ioctl() API.

=======================================================================

What is Live Update?

Live Update is a kexec-based reboot process where selected kernel
resources (memory, file descriptors, and eventually devices) are kept
operational or their state is preserved across a kernel transition. For
certain resources, DMA and interrupt activity might continue with
minimal interruption during the kernel reboot.

LUO provides a framework for coordinating live updates. It features:

State Machine
Manages the live update process through states: NORMAL, PREPARED,
FROZEN, UPDATED.

Session Management
==================
Userspace creates named sessions (driven by LUOD: Live Update
Orchestrator Daemon, see: https://tinyurl.com/luoddesign), each
represented by a file descriptor. Preserved resources are tied to a
session, and their lifecycle is managed by the session's FD, ensuring 
automatic cleanup if the controlling process exits unexpectedly.
Furthermore, sessions can be finished, prepared, and frozen
independently of the global LUO states. This granular control allows a
VMM to serialize and resume specific VMs as soon as their resources are
ready, without having to wait for all VMs to be prepared.

After a reboot, a central live update agent can retrieve a session
handle and pass it to the VMM process, which then restores its own file
descriptors. This ensures that resource allocations, such as cgroup
memory charges, are correctly accounted against the workload's cgroup
instead of the administrative agent's.

KHO Integration
===============
LUO programmatically drives KHO's finalization and abort sequences
(KHO may soon to become completely stateless, which will make KHO
interraction with LUO even simpler:
https://lore.kernel.org/all/20250917025019.1585041-1-jasonmiu@google.com)

KHO's debugfs interface is now optional, configured via
CONFIG_KEXEC_HANDOVER_DEBUG. LUO preserves its own metadata via KHO's
kho_add_subtree() and kho_preserve_phys() mechanisms.

Subsystem Participation
=======================
A callback API, liveupdate_register_subsystem(), allows kernel
subsystems (e.g., KVM, IOMMU, VFIO, PCI) to register handlers for LUO
events (PREPARE, FREEZE, FINISH, CANCEL) and persist a u64 payload via
the LUO FDT.

File Descriptor Preservation
============================
An infrastructure (liveupdate_register_file_handler, luo_preserve_file,
luo_retrieve_file) allows specific types of file descriptors (e.g.,
memfd, vfio) to be preserved and restored within a session. Handlers for
specific file types can be registered to manage their preservation,
storing a u64 payload in the LUO FDT.

Userspace Interface
===================
ioctl (/dev/liveupdate): The primary control interface for creating and
retrieving sessions, triggering global LUO state transitions (prepare,
finish, cancel), and managing preserved file descriptors within a
session.

sysfs (/sys/kernel/liveupdate/state)
A read-only interface for monitoring the current LUO state.

Selftests
=========
Includes kernel-side hooks and an extensive userspace selftest suite to
verify core LUO functionality, including subsystem registration, state
transitions, and complex multi-kexec session lifecycles.

LUO State Machine and Events
============================
NORMAL:   Default operational state.
PREPARED: Initial preparation complete after LIVEUPDATE_PREPARE event.
          Subsystems have saved initial state.
FROZEN:   Final "blackout window" state after LIVEUPDATE_FREEZE event,
          just before kexec. Workloads must be suspended.
UPDATED:  Next kernel has booted via live update, awaiting restoration
          and LIVEUPDATE_FINISH.

Events
LIVEUPDATE_PREPARE: Prepare for reboot, serialize state.
LIVEUPDATE_FREEZE:  Final opportunity to save state before kexec.
LIVEUPDATE_FINISH:  Post-reboot cleanup in the next kernel.
LIVEUPDATE_CANCEL:  Abort prepare or freeze, revert changes.

Mike Rapoport (Microsoft) (1):
  kho: drop notifiers

Pasha Tatashin (24):
  kho: allow to drive kho from within kernel
  kho: make debugfs interface optional
  kho: add interfaces to unpreserve folios and page ranes
  kho: don't unpreserve memory during abort
  liveupdate: kho: move to kernel/liveupdate
  liveupdate: luo_core: luo_ioctl: Live Update Orchestrator
  liveupdate: luo_core: integrate with KHO
  liveupdate: luo_subsystems: add subsystem registration
  liveupdate: luo_subsystems: implement subsystem callbacks
  liveupdate: luo_session: Add sessions support
  liveupdate: luo_ioctl: add user interface
  liveupdate: luo_file: implement file systems callbacks
  liveupdate: luo_session: Add ioctls for file preservation and state
    management
  reboot: call liveupdate_reboot() before kexec
  kho: move kho debugfs directory to liveupdate
  liveupdate: add selftests for subsystems un/registration
  selftests/liveupdate: add subsystem/state tests
  docs: add luo documentation
  MAINTAINERS: add liveupdate entry
  selftests/liveupdate: Add multi-kexec session lifecycle test
  selftests/liveupdate: Add multi-file and unreclaimed file test
  selftests/liveupdate: Add multi-session workflow and state interaction
    test
  selftests/liveupdate: Add test for unreclaimed resource cleanup
  selftests/liveupdate: Add tests for per-session state and cancel
    cycles

Pratyush Yadav (5):
  mm: shmem: use SHMEM_F_* flags instead of VM_* flags
  mm: shmem: allow freezing inode mapping
  mm: shmem: export some functions to internal.h
  luo: allow preserving memfd
  docs: add documentation for memfd preservation via LUO

 Documentation/core-api/index.rst              |   1 +
 Documentation/core-api/kho/concepts.rst       |   2 +-
 Documentation/core-api/liveupdate.rst         |  64 ++
 Documentation/mm/index.rst                    |   1 +
 Documentation/mm/memfd_preservation.rst       | 138 +++
 Documentation/userspace-api/index.rst         |   1 +
 .../userspace-api/ioctl/ioctl-number.rst      |   2 +
 Documentation/userspace-api/liveupdate.rst    |  25 +
 MAINTAINERS                                   |  18 +-
 include/linux/kexec_handover.h                |  53 +-
 include/linux/liveupdate.h                    | 209 +++++
 include/linux/shmem_fs.h                      |  23 +
 include/uapi/linux/liveupdate.h               | 460 +++++++++
 init/Kconfig                                  |   2 +
 kernel/Kconfig.kexec                          |  15 -
 kernel/Makefile                               |   2 +-
 kernel/liveupdate/Kconfig                     |  72 ++
 kernel/liveupdate/Makefile                    |  14 +
 kernel/{ => liveupdate}/kexec_handover.c      | 507 ++++------
 kernel/liveupdate/kexec_handover_debug.c      | 222 +++++
 kernel/liveupdate/kexec_handover_internal.h   |  45 +
 kernel/liveupdate/luo_core.c                  | 588 ++++++++++++
 kernel/liveupdate/luo_file.c                  | 599 ++++++++++++
 kernel/liveupdate/luo_internal.h              | 114 +++
 kernel/liveupdate/luo_ioctl.c                 | 255 +++++
 kernel/liveupdate/luo_selftests.c             | 345 +++++++
 kernel/liveupdate/luo_selftests.h             |  84 ++
 kernel/liveupdate/luo_session.c               | 887 ++++++++++++++++++
 kernel/liveupdate/luo_subsystems.c            | 452 +++++++++
 kernel/reboot.c                               |   4 +
 mm/Makefile                                   |   1 +
 mm/internal.h                                 |   6 +
 mm/memblock.c                                 |  60 +-
 mm/memfd_luo.c                                | 523 +++++++++++
 mm/shmem.c                                    |  51 +-
 tools/testing/selftests/Makefile              |   1 +
 tools/testing/selftests/liveupdate/.gitignore |   2 +
 tools/testing/selftests/liveupdate/Makefile   |  48 +
 tools/testing/selftests/liveupdate/config     |   6 +
 .../testing/selftests/liveupdate/do_kexec.sh  |   6 +
 .../testing/selftests/liveupdate/liveupdate.c | 404 ++++++++
 .../selftests/liveupdate/luo_multi_file.c     | 119 +++
 .../selftests/liveupdate/luo_multi_kexec.c    | 182 ++++
 .../selftests/liveupdate/luo_multi_session.c  | 155 +++
 .../selftests/liveupdate/luo_test_utils.c     | 241 +++++
 .../selftests/liveupdate/luo_test_utils.h     |  51 +
 .../selftests/liveupdate/luo_unreclaimed.c    | 107 +++
 47 files changed, 6757 insertions(+), 410 deletions(-)
 create mode 100644 Documentation/core-api/liveupdate.rst
 create mode 100644 Documentation/mm/memfd_preservation.rst
 create mode 100644 Documentation/userspace-api/liveupdate.rst
 create mode 100644 include/linux/liveupdate.h
 create mode 100644 include/uapi/linux/liveupdate.h
 create mode 100644 kernel/liveupdate/Kconfig
 create mode 100644 kernel/liveupdate/Makefile
 rename kernel/{ => liveupdate}/kexec_handover.c (80%)
 create mode 100644 kernel/liveupdate/kexec_handover_debug.c
 create mode 100644 kernel/liveupdate/kexec_handover_internal.h
 create mode 100644 kernel/liveupdate/luo_core.c
 create mode 100644 kernel/liveupdate/luo_file.c
 create mode 100644 kernel/liveupdate/luo_internal.h
 create mode 100644 kernel/liveupdate/luo_ioctl.c
 create mode 100644 kernel/liveupdate/luo_selftests.c
 create mode 100644 kernel/liveupdate/luo_selftests.h
 create mode 100644 kernel/liveupdate/luo_session.c
 create mode 100644 kernel/liveupdate/luo_subsystems.c
 create mode 100644 mm/memfd_luo.c
 create mode 100644 tools/testing/selftests/liveupdate/.gitignore
 create mode 100644 tools/testing/selftests/liveupdate/Makefile
 create mode 100644 tools/testing/selftests/liveupdate/config
 create mode 100755 tools/testing/selftests/liveupdate/do_kexec.sh
 create mode 100644 tools/testing/selftests/liveupdate/liveupdate.c
 create mode 100644 tools/testing/selftests/liveupdate/luo_multi_file.c
 create mode 100644 tools/testing/selftests/liveupdate/luo_multi_kexec.c
 create mode 100644 tools/testing/selftests/liveupdate/luo_multi_session.c
 create mode 100644 tools/testing/selftests/liveupdate/luo_test_utils.c
 create mode 100644 tools/testing/selftests/liveupdate/luo_test_utils.h
 create mode 100644 tools/testing/selftests/liveupdate/luo_unreclaimed.c

-- 
2.51.0.536.g15c5d4f767-goog


^ permalink raw reply

* Re: [PATCH v2 0/3] ext4: Add support for mounted updates to the superblock via an ioctl
From: Theodore Ts'o @ 2025-09-26 21:47 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: linux-ext4, linux-api, stable
In-Reply-To: <20250916-tune2fs-v2-0-d594dc7486f0@mit.edu>


On Tue, 16 Sep 2025 23:22:46 -0400, Theodore Ts'o wrote:
> This patch series enables a future version of tune2fs to be able to
> modify certain parts of the ext4 superblock without to write to the
> block device.
> 
> The first patch fixes a potential buffer overrun caused by a
> maliciously moified superblock.  The second patch adds support for
> 32-bit uid and gid's which can have access to the reserved blocks pool.
> The last patch adds the ioctl's which will be used by tune2fs.
> 
> [...]

Applied, thanks!

[1/3] ext4: avoid potential buffer over-read in parse_apply_sb_mount_options()
      commit: 8ecb790ea8c3fc69e77bace57f14cf0d7c177bd8
[2/3] ext4: add support for 32-bit default reserved uid and gid values
      commit: 12c84dd4d308551568d85203fd6ed2685e861fda
[3/3] ext4: implemet new ioctls to set and get superblock parameters
      commit: 04a91570ac67760301e5458d65eaf1342ecca314

Best regards,
-- 
Theodore Ts'o <tytso@mit.edu>

^ permalink raw reply

* Re: [PATCH v5 5/8] man/man2/move_mount.2: document "new" mount API
From: Alejandro Colomar @ 2025-09-26 12:56 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: Michael T. Kerrisk, Alexander Viro, Jan Kara, Askar Safin,
	G. Branden Robinson, linux-man, linux-api, linux-fsdevel,
	linux-kernel, David Howells, Christian Brauner
In-Reply-To: <20250925-new-mount-api-v5-5-028fb88023f2@cyphar.com>

[-- Attachment #1: Type: text/plain, Size: 17521 bytes --]

Hi Aleksa,

On Thu, Sep 25, 2025 at 01:31:27AM +1000, Aleksa Sarai wrote:
> This is loosely based on the original documentation written by David
> Howells and later maintained by Christian Brauner, but has been
> rewritten to be more from a user perspective (as well as fixing a few
> critical mistakes).
> 
> Co-authored-by: David Howells <dhowells@redhat.com>
> Signed-off-by: David Howells <dhowells@redhat.com>
> Co-authored-by: Christian Brauner <brauner@kernel.org>
> Signed-off-by: Christian Brauner <brauner@kernel.org>
> Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>

Thanks!  I've applied this patch, with some minor amendments (see below).
<https://www.alejandro-colomar.es/src/alx/linux/man-pages/man-pages.git/commit/?h=contrib&id=eb37a3066ccce4f44ab69fae559016a524e4eac>

> ---
>  man/man2/move_mount.2 | 646 ++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 646 insertions(+)
> 
> diff --git a/man/man2/move_mount.2 b/man/man2/move_mount.2
> new file mode 100644
> index 0000000000000000000000000000000000000000..f954f36c43c444afb167088cc665607dfeb10676
> --- /dev/null
> +++ b/man/man2/move_mount.2
> @@ -0,0 +1,646 @@
> +.\" Copyright, the authors of the Linux man-pages project
> +.\"
> +.\" SPDX-License-Identifier: Linux-man-pages-copyleft
> +.\"
> +.TH move_mount 2 (date) "Linux man-pages (unreleased)"
> +.SH NAME
> +move_mount \- move or attach mount object to filesystem
> +.SH LIBRARY
> +Standard C library
> +.RI ( libc ,\~ \-lc )
> +.SH SYNOPSIS
> +.nf
> +.BR "#include <fcntl.h>" "          /* Definition of " AT_* " constants */"
> +.B #include <sys/mount.h>
> +.P
> +.BI "int move_mount(int " from_dirfd ", const char *" from_path ,
> +.BI "               int " to_dirfd ", const char *" to_path ,
> +.BI "               unsigned int " flags );
> +.fi
> +.SH DESCRIPTION
> +The
> +.BR move_mount ()
> +system call is part of
> +the suite of file-descriptor-based mount facilities in Linux.
> +.P
> +.BR move_mount ()
> +moves the mount object indicated by
> +.I from_dirfd
> +and
> +.I from_path
> +to the path indicated by
> +.I to_dirfd
> +and
> +.IR to_path .
> +The mount object being moved
> +can be an existing mount point in the current mount namespace,
> +or a detached mount object created by
> +.BR fsmount (2)
> +or
> +.BR open_tree (2)
> +with
> +.BR \%OPEN_TREE_CLONE .
> +.P
> +To access the source mount object
> +or the destination mount point,
> +no permissions are required on the object itself,
> +but if either pathname is supplied,
> +execute (search) permission is required
> +on all of the directories specified in
> +.I from_path
> +or
> +.IR to_path .
> +.P
> +The calling process must have the
> +.B \%CAP_SYS_ADMIN
> +capability in order to move or attach a mount object.
> +.P
> +As with "*at()" system calls,
> +.BR move_mount ()
> +uses the
> +.I from_dirfd
> +and
> +.I to_dirfd
> +arguments
> +in conjunction with the
> +.I from_path
> +and
> +.I to_path
> +arguments to determine the source and destination objects to operate on
> +(respectively), as follows:
> +.IP \[bu] 3
> +If the pathname given in
> +.I *_path

In this case, where the non-variable part is already in italics, the
variable part is written in roman, for distinguishing it.  (See
groff_man(7).)


Have a lovely day!
Alex

> +is absolute, then
> +the corresponding
> +.I *_dirfd
> +is ignored.
> +.IP \[bu]
> +If the pathname given in
> +.I *_path
> +is relative and
> +the corresponding
> +.I *_dirfd
> +is the special value
> +.BR \%AT_FDCWD ,
> +then
> +.I *_path
> +is interpreted relative to
> +the current working directory
> +of the calling process (like
> +.BR open (2)).
> +.IP \[bu]
> +If the pathname given in
> +.I *_path
> +is relative,
> +then it is interpreted relative to
> +the directory referred to by
> +the corresponding file descriptor
> +.I *_dirfd
> +(rather than relative to
> +the current working directory
> +of the calling process,
> +as is done by
> +.BR open (2)
> +for a relative pathname).
> +In this case,
> +the corresponding
> +.I *_dirfd
> +must be a directory
> +that was opened for reading
> +.RB ( O_RDONLY )
> +or using the
> +.B O_PATH
> +flag.
> +.IP \[bu]
> +If
> +.I *_path
> +is an empty string,
> +and
> +.I flags
> +contains the appropriate
> +.BI \%MOVE_MOUNT_ * _EMPTY_PATH
> +flag,
> +then the corresponding file descriptor
> +.I *_dirfd
> +is operated on directly.
> +In this case,
> +the corresponding
> +.I *_dirfd
> +may refer to any type of file,
> +not just a directory.
> +.P
> +See
> +.BR openat (2)
> +for an explanation of why the
> +.I *_dirfd
> +arguments are useful.
> +.P
> +.I flags
> +can be used to control aspects of the path lookup
> +for both the source and destination objects,
> +as well as other properties of the mount operation.
> +A value for
> +.I flags
> +is constructed by bitwise ORing
> +zero or more of the following constants:
> +.RS
> +.TP
> +.B MOVE_MOUNT_F_EMPTY_PATH
> +If
> +.I from_path
> +is an empty string, operate on the file referred to by
> +.I from_dirfd
> +(which may have been obtained from
> +.BR open (2),
> +.BR fsmount (2),
> +or
> +.BR open_tree (2)).
> +In this case,
> +.I from_dirfd
> +may refer to any type of file,
> +not just a directory.
> +If
> +.I from_dirfd
> +is
> +.BR \%AT_FDCWD ,
> +.BR move_mount ()
> +will operate on the current working directory
> +of the calling process.
> +.IP
> +This is the most common mechanism
> +used to attach detached mount objects
> +produced by
> +.BR fsmount (2)
> +and
> +.BR open_tree (2)
> +to a mount point.
> +.TP
> +.B MOVE_MOUNT_T_EMPTY_PATH
> +As with
> +.BR \%MOVE_MOUNT_F_EMPTY_PATH ,
> +except operating on
> +.I to_dirfd
> +and
> +.IR to_path .
> +.TP
> +.B MOVE_MOUNT_F_SYMLINKS
> +If
> +.I from_path
> +references a symbolic link,
> +then dereference it.
> +The default behaviour for
> +.BR move_mount ()
> +is to
> +.I not follow
> +symbolic links.
> +.TP
> +.B MOVE_MOUNT_T_SYMLINKS
> +As with
> +.BR \%MOVE_MOUNT_F_SYMLINKS ,
> +except operating on
> +.I to_dirfd
> +and
> +.IR to_path .
> +.TP
> +.B MOVE_MOUNT_F_NO_AUTOMOUNT
> +Do not automount the terminal ("basename") component of
> +.I \%from_path
> +if it is a directory that is an automount point.
> +This allows a mount object
> +that has an automount point at its root
> +to be moved
> +and prevents unintended triggering of an automount point.
> +This flag has no effect
> +if the automount point has already been mounted over.
> +.TP
> +.B MOVE_MOUNT_T_NO_AUTOMOUNT
> +As with
> +.BR \%MOVE_MOUNT_F_NO_AUTOMOUNT ,
> +except operating on
> +.I to_dirfd
> +and
> +.IR to_path .
> +This allows an automount point to be manually mounted over.
> +.TP
> +.BR MOVE_MOUNT_SET_GROUP " (since Linux 5.15)"
> +Add the attached private-propagation mount object indicated by
> +.I to_dirfd
> +and
> +.I to_path
> +into the mount propagation "peer group"
> +of the attached non-private-propagation mount object indicated by
> +.I from_dirfd
> +and
> +.IR from_path .
> +.IP
> +Unlike other
> +.BR move_mount ()
> +operations,
> +this operation does not move or attach any mount objects.
> +Instead, it only updates the metadata
> +of attached mount objects.
> +(Also, take careful note of
> +the argument order\[em]\c
> +the mount object being modified
> +by this operation is the one specified by
> +.I to_dirfd
> +and
> +.IR to_path .)
> +.IP
> +This makes it possible to first create a mount tree
> +consisting only of private mounts
> +and then configure the desired propagation layout afterwards.
> +(See the "SHARED SUBTREES" section of
> +.BR mount_namespaces (7)
> +for more information about mount propagation and peer groups.)
> +.TP
> +.BR MOVE_MOUNT_BENEATH " (since Linux 6.5)"
> +If the path indicated by
> +.I to_dirfd
> +and
> +.I to_path
> +is an existing mount object,
> +rather than attaching or moving the mount object
> +indicated by
> +.I from_dirfd
> +and
> +.I from_path
> +on top of the mount stack,
> +attach or move it beneath the current top mount
> +on the mount stack.
> +.IP
> +After using
> +.BR \%MOVE_MOUNT_BENEATH ,
> +it is possible to
> +.BR umount (2)
> +the top mount
> +in order to reveal the mount object
> +which was attached beneath it earlier.
> +This allows for the seamless (and atomic) replacement
> +of intricate mount trees,
> +which can further be used
> +to "upgrade" a mount tree with a newer version.
> +.IP
> +This operation has several restrictions:
> +.RS
> +.IP \[bu] 3
> +Mount objects cannot be attached beneath the filesystem root,
> +including cases where
> +the filesystem root was configured by
> +.BR chroot (2)
> +or
> +.BR pivot_root (2).
> +To mount beneath the filesystem root,
> +.BR pivot_root (2)
> +must be used.
> +.IP \[bu]
> +The target path indicated by
> +.I to_dirfd
> +and
> +.I to_path
> +must not be a detached mount object,
> +such as those produced by
> +.BR open_tree (2)
> +with
> +.B \%OPEN_TREE_CLONE
> +or
> +.BR fsmount (2).
> +.IP \[bu]
> +The current top mount
> +of the target path's mount stack
> +and its parent mount
> +must be in the calling process's mount namespace.
> +.IP \[bu]
> +The caller must have sufficient privileges
> +to unmount the top mount
> +of the target path's mount stack,
> +to prove they have privileges
> +to reveal the underlying mount.
> +.IP \[bu]
> +Mount propagation events triggered by this
> +.BR move_mount ()
> +operation
> +(as described in
> +.BR mount_namespaces (7))
> +are calculated based on the parent mount
> +of the current top mount
> +of the target path's mount stack.
> +.IP \[bu]
> +The target path's mount
> +cannot be an ancestor in the mount tree of
> +the source mount object.
> +.IP \[bu]
> +The source mount object
> +must not have any overmounts,
> +otherwise it would be possible to create "shadow mounts"
> +(i.e., two mounts mounted on the same parent mount at the same mount point).
> +.IP \[bu]
> +It is not possible to move a mount
> +beneath a top mount
> +if the parent mount
> +of the current top mount
> +propagates to the top mount itself.
> +Otherwise,
> +.B \%MOVE_MOUNT_BENEATH
> +would cause the mount object
> +to be propagated
> +to the top mount
> +from the parent mount,
> +defeating the purpose of using
> +.BR \%MOVE_MOUNT_BENEATH .
> +.IP \[bu]
> +It is not possible to move a mount
> +beneath a top mount
> +if the parent mount
> +of the current top mount
> +propagates to the mount object
> +being mounted beneath.
> +Otherwise, this would cause a similar propagation issue
> +to the previous point,
> +also defeating the purpose of using
> +.BR \%MOVE_MOUNT_BENEATH .
> +.RE
> +.RE
> +.P
> +If
> +.I from_dirfd
> +is a mount object file descriptor and
> +.BR move_mount ()
> +is operating on it directly,
> +.I from_dirfd
> +will remain associated with the mount object after
> +.BR move_mount ()
> +succeeds,
> +so you may repeatedly use
> +.I from_dirfd
> +with
> +.BR move_mount (2)
> +and/or "*at()" system calls
> +as many times as necessary.
> +.SH RETURN VALUE
> +On success,
> +.BR move_mount ()
> +returns 0.
> +On error, \-1 is returned, and
> +.I errno
> +is set to indicate the error.
> +.SH ERRORS
> +.TP
> +.B EACCES
> +Search permission is denied
> +for one of the directories
> +in the path prefix of one of
> +.I from_path
> +or
> +.IR to_path .
> +(See also
> +.BR path_resolution (7).)
> +.TP
> +.B EBADF
> +One of
> +.I from_dirfd
> +or
> +.I to_dirfd
> +is not a valid file descriptor.
> +.TP
> +.B EFAULT
> +One of
> +.I from_path
> +or
> +.I to_path
> +is NULL
> +or a pointer to a location
> +outside the calling process's accessible address space.
> +.TP
> +.B EINVAL
> +Invalid flag specified in
> +.IR flags .
> +.TP
> +.B EINVAL
> +The path indicated by
> +.I from_dirfd
> +and
> +.I from_path
> +is not a mount object.
> +.TP
> +.B EINVAL
> +The mount object type
> +of the source mount object and target inode
> +are not compatible
> +(i.e., the source is a file but the target is a directory, or vice-versa).
> +.TP
> +.B EINVAL
> +The source mount object or target path
> +are not in the calling process's mount namespace
> +(or an anonymous mount namespace of the calling process).
> +.TP
> +.B EINVAL
> +The source mount object's parent mount
> +has shared mount propagation,
> +and thus cannot be moved
> +(as described in
> +.BR mount_namespaces (7)).
> +.TP
> +.B EINVAL
> +The source mount has
> +.B MS_UNBINDABLE
> +child mounts
> +but the target path
> +resides on a mount tree with shared mount propagation,
> +which would otherwise cause the unbindable mounts to be propagated
> +(as described in
> +.BR mount_namespaces (7)).
> +.TP
> +.B EINVAL
> +.B \%MOVE_MOUNT_BENEATH
> +was attempted,
> +but one of the listed restrictions was violated.
> +.TP
> +.B ELOOP
> +Too many symbolic links encountered
> +when resolving one of
> +.I from_path
> +or
> +.IR to_path .
> +.TP
> +.B ENAMETOOLONG
> +One of
> +.I from_path
> +or
> +.I to_path
> +is longer than
> +.BR PATH_MAX .
> +.TP
> +.B ENOENT
> +A component of one of
> +.I from_path
> +or
> +.I to_path
> +does not exist.
> +.TP
> +.B ENOENT
> +One of
> +.I from_path
> +or
> +.I to_path
> +is an empty string,
> +but the corresponding
> +.BI MOVE_MOUNT_ * _EMPTY_PATH
> +flag is not specified in
> +.IR flags .
> +.TP
> +.B ENOTDIR
> +A component of the path prefix of one of
> +.I from_path
> +or
> +.I to_path
> +is not a directory,
> +or one of
> +.I from_path
> +or
> +.I to_path
> +is relative
> +and the corresponding
> +.I from_dirfd
> +or
> +.I to_dirfd
> +is a file descriptor referring to a file other than a directory.
> +.TP
> +.B ENOMEM
> +The kernel could not allocate sufficient memory to complete the operation.
> +.TP
> +.B EPERM
> +The calling process does not have the required
> +.B \%CAP_SYS_ADMIN
> +capability.
> +.SH STANDARDS
> +Linux.
> +.SH HISTORY
> +Linux 5.2.
> +.\" commit 2db154b3ea8e14b04fee23e3fdfd5e9d17fbc6ae
> +.\" commit 400913252d09f9cfb8cce33daee43167921fc343
> +glibc 2.36.
> +.SH EXAMPLES
> +.BR move_mount ()
> +can be used to move attached mounts like the following:
> +.P
> +.in +4n
> +.EX
> +move_mount(AT_FDCWD, "/a", AT_FDCWD, "/b", 0);
> +.EE
> +.in
> +.P
> +This would move the mount object mounted on
> +.I /a
> +to
> +.IR /b .
> +The above procedure is functionally equivalent to
> +the following mount operation
> +using
> +.BR mount (2):
> +.P
> +.in +4n
> +.EX
> +mount("/a", "/b", NULL, MS_MOVE, NULL);
> +.EE
> +.in
> +.P
> +.BR move_mount ()
> +can also be used in conjunction with file descriptors returned from
> +.BR open_tree (2)
> +or
> +.BR open (2):
> +.P
> +.in +4n
> +.EX
> +int fd = open_tree(AT_FDCWD, "/mnt", 0); /* open("/mnt", O_PATH); */
> +move_mount(fd, "", AT_FDCWD, "/mnt2", MOVE_MOUNT_F_EMPTY_PATH);
> +move_mount(fd, "", AT_FDCWD, "/mnt3", MOVE_MOUNT_F_EMPTY_PATH);
> +move_mount(fd, "", AT_FDCWD, "/mnt4", MOVE_MOUNT_F_EMPTY_PATH);
> +.EE
> +.in
> +.P
> +This would move the mount object mounted at
> +.I /mnt
> +to
> +.IR /mnt2 ,
> +then
> +.IR /mnt3 ,
> +and then
> +.IR /mnt4 .
> +.P
> +If the source mount object
> +indicated by
> +.I from_dirfd
> +and
> +.I from_path
> +is a detached mount object,
> +.BR move_mount ()
> +can be used to attach it to a mount point:
> +.P
> +.in +4n
> +.EX
> +int fsfd, mntfd;
> +\&
> +fsfd = fsopen("ext4", FSOPEN_CLOEXEC);
> +fsconfig(fsfd, FSCONFIG_SET_STRING, "source", "/dev/sda1", 0);
> +fsconfig(fsfd, FSCONFIG_SET_FLAG, "user_xattr", NULL, 0);
> +fsconfig(fsfd, FSCONFIG_CMD_CREATE, NULL, NULL, 0);
> +mntfd = fsmount(fsfd, FSMOUNT_CLOEXEC, MOUNT_ATTR_NODEV);
> +move_mount(mntfd, "", AT_FDCWD, "/home", MOVE_MOUNT_F_EMPTY_PATH);
> +.EE
> +.in
> +.P
> +This would create a new filesystem configuration context for ext4,
> +configure it,
> +create a detached mount object,
> +and then attach it to
> +.IR /home .
> +The above procedure is functionally equivalent to
> +the following mount operation
> +using
> +.BR mount (2):
> +.P
> +.in +4n
> +.EX
> +mount("/dev/sda1", "/home", "ext4", MS_NODEV, "user_xattr");
> +.EE
> +.in
> +.P
> +The same operation also works with detached bind-mounts created with
> +.BR open_tree (2)
> +with
> +.BR OPEN_TREE_CLONE :
> +.P
> +.in +4n
> +.EX
> +int mntfd = open_tree(AT_FDCWD, "/home/cyphar", OPEN_TREE_CLONE);
> +move_mount(mntfd, "", AT_FDCWD, "/root", MOVE_MOUNT_F_EMPTY_PATH);
> +.EE
> +.in
> +.P
> +This would create a new bind-mount of
> +.I /home/cyphar
> +as a detached mount object,
> +and then attach it to
> +.IR /root .
> +The above procedure is functionally equivalent to
> +the following mount operation
> +using
> +.BR mount (2):
> +.P
> +.in +4n
> +.EX
> +mount("/home/cyphar", "/root", NULL, MS_BIND, NULL);
> +.EE
> +.in
> +.SH SEE ALSO
> +.BR fsconfig (2),
> +.BR fsmount (2),
> +.BR fsopen (2),
> +.BR fspick (2),
> +.BR mount (2),
> +.BR mount_setattr (2),
> +.BR open_tree (2),
> +.BR mount_namespaces (7)
> 
> -- 
> 2.51.0
> 
> 

-- 
<https://www.alejandro-colomar.es>
Use port 80 (that is, <...:80/>).

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply

* Re: [PATCH v5 4/8] man/man2/fsmount.2: document "new" mount API
From: Alejandro Colomar @ 2025-09-26 12:51 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: Michael T. Kerrisk, Alexander Viro, Jan Kara, Askar Safin,
	G. Branden Robinson, linux-man, linux-api, linux-fsdevel,
	linux-kernel, David Howells, Christian Brauner
In-Reply-To: <20250925-new-mount-api-v5-4-028fb88023f2@cyphar.com>

[-- Attachment #1: Type: text/plain, Size: 7538 bytes --]

Hi Aleksa,

On Thu, Sep 25, 2025 at 01:31:26AM +1000, Aleksa Sarai wrote:
> This is loosely based on the original documentation written by David
> Howells and later maintained by Christian Brauner, but has been
> rewritten to be more from a user perspective (as well as fixing a few
> critical mistakes).
> 
> Co-authored-by: David Howells <dhowells@redhat.com>
> Signed-off-by: David Howells <dhowells@redhat.com>
> Co-authored-by: Christian Brauner <brauner@kernel.org>
> Signed-off-by: Christian Brauner <brauner@kernel.org>
> Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>

Thanks!  I've applied this patch.
<https://www.alejandro-colomar.es/src/alx/linux/man-pages/man-pages.git/commit/?h=contrib&id=24243cc66e191fd917c9c13a01b7ac037ce0972e>


Cheers,
Alex

> ---
>  man/man2/fsmount.2 | 231 +++++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 231 insertions(+)
> 
> diff --git a/man/man2/fsmount.2 b/man/man2/fsmount.2
> new file mode 100644
> index 0000000000000000000000000000000000000000..b62850a68443bb8f6178389eb6cb1a5f9029ab30
> --- /dev/null
> +++ b/man/man2/fsmount.2
> @@ -0,0 +1,231 @@
> +.\" Copyright, the authors of the Linux man-pages project
> +.\"
> +.\" SPDX-License-Identifier: Linux-man-pages-copyleft
> +.\"
> +.TH fsmount 2 (date) "Linux man-pages (unreleased)"
> +.SH NAME
> +fsmount \- instantiate mount object from filesystem context
> +.SH LIBRARY
> +Standard C library
> +.RI ( libc ,\~ \-lc )
> +.SH SYNOPSIS
> +.nf
> +.B #include <sys/mount.h>
> +.P
> +.BI "int fsmount(int " fsfd ", unsigned int " flags \
> +", unsigned int " attr_flags );
> +.fi
> +.SH DESCRIPTION
> +The
> +.BR fsmount ()
> +system call is part of
> +the suite of file-descriptor-based mount facilities in Linux.
> +.P
> +.BR fsmount ()
> +creates a new detached mount object
> +for the root of the new filesystem instance
> +referenced by the filesystem context file descriptor
> +.IR fsfd .
> +A new file descriptor
> +associated with the detached mount object
> +is then returned.
> +In order to create a mount object with
> +.BR fsmount (),
> +the calling process must have the
> +.B \%CAP_SYS_ADMIN
> +capability.
> +.P
> +The filesystem context must have been created with a call to
> +.BR fsopen (2)
> +and then had a filesystem instance instantiated with a call to
> +.BR fsconfig (2)
> +with
> +.B \%FSCONFIG_CMD_CREATE
> +or
> +.B \%FSCONFIG_CMD_CREATE_EXCL
> +in order to be in the correct state
> +for this operation
> +(the "awaiting-mount" mode in kernel-developer parlance).
> +.\" FS_CONTEXT_AWAITING_MOUNT is the term the kernel uses for this.
> +Unlike
> +.BR open_tree (2)
> +with
> +.BR \%OPEN_TREE_CLONE ,
> +.BR fsmount ()
> +can only be called once
> +in the lifetime of a filesystem context
> +to produce a mount object.
> +.P
> +As with file descriptors returned from
> +.BR open_tree (2)
> +called with
> +.BR OPEN_TREE_CLONE ,
> +the returned file descriptor
> +can then be used with
> +.BR move_mount (2),
> +.BR mount_setattr (2),
> +or other such system calls to do further mount operations.
> +This mount object will be unmounted and destroyed
> +when the file descriptor is closed
> +if it was not otherwise attached to a mount point
> +by calling
> +.BR move_mount (2).
> +(Note that the unmount operation on
> +.BR close (2)
> +is lazy\[em]akin to calling
> +.BR umount2 (2)
> +with
> +.BR MNT_DETACH ;
> +any existing open references to files
> +from the mount object
> +will continue to work,
> +and the mount object will only be completely destroyed
> +once it ceases to be busy.)
> +The returned file descriptor
> +also acts the same as one produced by
> +.BR open (2)
> +with
> +.BR O_PATH ,
> +meaning it can also be used as a
> +.I dirfd
> +argument
> +to "*at()" system calls.
> +.P
> +.I flags
> +controls the creation of the returned file descriptor.
> +A value for
> +.I flags
> +is constructed by bitwise ORing
> +zero or more of the following constants:
> +.RS
> +.TP
> +.B FSMOUNT_CLOEXEC
> +Set the close-on-exec
> +.RB ( FD_CLOEXEC )
> +flag on the new file descriptor.
> +See the description of the
> +.B O_CLOEXEC
> +flag in
> +.BR open (2)
> +for reasons why this may be useful.
> +.RE
> +.P
> +.I attr_flags
> +specifies mount attributes
> +which will be applied to the created mount object,
> +in the form of
> +.BI \%MOUNT_ATTR_ *
> +flags.
> +The flags are interpreted as though
> +.BR mount_setattr (2)
> +was called with
> +.I attr.attr_set
> +set to the same value as
> +.IR attr_flags .
> +.BI \%MOUNT_ATTR_ *
> +flags which would require
> +specifying additional fields in
> +.BR mount_attr (2type)
> +(such as
> +.BR \%MOUNT_ATTR_IDMAP )
> +are not valid flag values for
> +.IR attr_flags .
> +.P
> +If the
> +.BR fsmount ()
> +operation is successful,
> +the filesystem context
> +associated with the file descriptor
> +.I fsfd
> +is reset
> +and placed into reconfiguration mode,
> +as if it were just returned by
> +.BR fspick (2).
> +You may continue to use
> +.BR fsconfig (2)
> +with the now-reset filesystem context,
> +including issuing the
> +.B \%FSCONFIG_CMD_RECONFIGURE
> +command
> +to reconfigure the filesystem instance.
> +.SH RETURN VALUE
> +On success, a new file descriptor is returned.
> +On error, \-1 is returned, and
> +.I errno
> +is set to indicate the error.
> +.SH ERRORS
> +.TP
> +.B EBUSY
> +The filesystem context associated with
> +.I fsfd
> +is not in the right state
> +to be used by
> +.BR fsmount ().
> +.TP
> +.B EINVAL
> +.I flags
> +had an invalid flag set.
> +.TP
> +.B EINVAL
> +.I attr_flags
> +had an invalid
> +.BI MOUNT_ATTR_ *
> +flag set.
> +.TP
> +.B EMFILE
> +The calling process has too many open files to create more.
> +.TP
> +.B ENFILE
> +The system has too many open files to create more.
> +.TP
> +.B ENOSPC
> +The "anonymous" mount namespace
> +necessary to contain the new mount object
> +could not be allocated,
> +as doing so would exceed
> +the configured per-user limit on
> +the number of mount namespaces in the current user namespace.
> +(See also
> +.BR namespaces (7).)
> +.TP
> +.B ENOMEM
> +The kernel could not allocate sufficient memory to complete the operation.
> +.TP
> +.B EPERM
> +The calling process does not have the required
> +.B CAP_SYS_ADMIN
> +capability.
> +.SH STANDARDS
> +Linux.
> +.SH HISTORY
> +Linux 5.2.
> +.\" commit 93766fbd2696c2c4453dd8e1070977e9cd4e6b6d
> +.\" commit 400913252d09f9cfb8cce33daee43167921fc343
> +glibc 2.36.
> +.SH EXAMPLES
> +.in +4n
> +.EX
> +int fsfd, mntfd, tmpfd;
> +\&
> +fsfd = fsopen("tmpfs", FSOPEN_CLOEXEC);
> +fsconfig(fsfd, FSCONFIG_CMD_CREATE, NULL, NULL, 0);
> +mntfd = fsmount(fsfd, FSMOUNT_CLOEXEC,
> +                MOUNT_ATTR_NODEV | MOUNT_ATTR_NOEXEC);
> +\&
> +/* Create a new file without attaching the mount object */
> +tmpfd = openat(mntfd, "tmpfile", O_CREAT | O_EXCL | O_RDWR, 0600);
> +unlinkat(mntfd, "tmpfile", 0);
> +\&
> +/* Attach the mount object to "/tmp" */
> +move_mount(mntfd, "", AT_FDCWD, "/tmp", MOVE_MOUNT_F_EMPTY_PATH);
> +.EE
> +.in
> +.SH SEE ALSO
> +.BR fsconfig (2),
> +.BR fsopen (2),
> +.BR fspick (2),
> +.BR mount (2),
> +.BR mount_setattr (2),
> +.BR move_mount (2),
> +.BR open_tree (2),
> +.BR mount_namespaces (7)
> 
> -- 
> 2.51.0
> 
> 

-- 
<https://www.alejandro-colomar.es>
Use port 80 (that is, <...:80/>).

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply

* Re: [PATCH v5 3/8] man/man2/fsconfig.2: document "new" mount API
From: Alejandro Colomar @ 2025-09-26 12:48 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: Michael T. Kerrisk, Alexander Viro, Jan Kara, Askar Safin,
	G. Branden Robinson, linux-man, linux-api, linux-fsdevel,
	linux-kernel, David Howells, Christian Brauner
In-Reply-To: <20250925-new-mount-api-v5-3-028fb88023f2@cyphar.com>

[-- Attachment #1: Type: text/plain, Size: 22388 bytes --]

Hi Aleksa,

On Thu, Sep 25, 2025 at 01:31:25AM +1000, Aleksa Sarai wrote:
> This is loosely based on the original documentation written by David
> Howells and later maintained by Christian Brauner, but has been
> rewritten to be more from a user perspective (as well as fixing a few
> critical mistakes).
> 
> Co-authored-by: David Howells <dhowells@redhat.com>
> Signed-off-by: David Howells <dhowells@redhat.com>
> Co-authored-by: Christian Brauner <brauner@kernel.org>
> Signed-off-by: Christian Brauner <brauner@kernel.org>
> Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>

I've applied this patch, with a few amendments.  Thanks!
<https://www.alejandro-colomar.es:80/src/alx/linux/man-pages/man-pages.git/commit/?h=contrib&id=d0b3cd8a5399f734037e87b8f93d5b536af20d7d>
(Use port :80/)


Have a lovely day!
Alex

> ---
>  man/man2/fsconfig.2 | 729 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 729 insertions(+)
> 
> diff --git a/man/man2/fsconfig.2 b/man/man2/fsconfig.2
> new file mode 100644
> index 0000000000000000000000000000000000000000..a2d844a105c74f17af640d6991046dbd5fa69cf0
> --- /dev/null
> +++ b/man/man2/fsconfig.2
> @@ -0,0 +1,729 @@
> +.\" Copyright, the authors of the Linux man-pages project
> +.\"
> +.\" SPDX-License-Identifier: Linux-man-pages-copyleft
> +.\"
> +.TH fsconfig 2 (date) "Linux man-pages (unreleased)"
> +.SH NAME
> +fsconfig \- configure new or existing filesystem context
> +.SH LIBRARY
> +Standard C library
> +.RI ( libc ,\~ \-lc )
> +.SH SYNOPSIS
> +.nf
> +.B #include <sys/mount.h>
> +.P
> +.BI "int fsconfig(int " fd ", unsigned int " cmd ,
> +.BI "             const char *_Nullable " key ,
> +.BI "             const void *_Nullable " value ", int " aux );
> +.fi
> +.SH DESCRIPTION
> +The
> +.BR fsconfig ()
> +system call is part of
> +the suite of file-descriptor-based mount facilities in Linux.
> +.P
> +.BR fsconfig ()
> +is used to supply parameters to
> +and issue commands against
> +the filesystem configuration context
> +associated with the file descriptor
> +.IR fd .
> +Filesystem configuration contexts can be created with
> +.BR fsopen (2)
> +or be instantiated from an extant filesystem instance with
> +.BR fspick (2).
> +.P
> +The
> +.I cmd
> +argument indicates the command to be issued.
> +Some commands supply parameters to the context
> +(equivalent to mount options specified with
> +.BR mount (8)),
> +while others are meta-operations on the filesystem context.
> +The list of valid
> +.I cmd
> +values are:
> +.RS
> +.TP
> +.B FSCONFIG_SET_FLAG
> +Set the flag parameter named by
> +.IR key .
> +.I value
> +must be NULL,
> +and
> +.I aux
> +must be 0.
> +.TP
> +.B FSCONFIG_SET_STRING
> +Set the string parameter named by
> +.I key
> +to the value specified by
> +.IR value .
> +.I value
> +points to a null-terminated string,
> +and
> +.I aux
> +must be 0.
> +.TP
> +.B FSCONFIG_SET_BINARY
> +Set the blob parameter named by
> +.I key
> +to the contents of the binary blob
> +specified by
> +.IR value .
> +.I value
> +points to
> +the start of a buffer
> +that is
> +.I aux
> +bytes in length.
> +.TP
> +.B FSCONFIG_SET_FD
> +Set the file parameter named by
> +.I key
> +to the open file description
> +referenced by the file descriptor
> +.IR aux .
> +.I value
> +must be NULL.
> +.IP
> +You may also use
> +.B \%FSCONFIG_SET_STRING
> +for file parameters,
> +with
> +.I value
> +set to a null-terminated string
> +containing a base-10 representation
> +of the file descriptor number.
> +This mechanism is primarily intended for compatibility
> +with older
> +.BR mount (2)-based
> +programs,
> +and only works for parameters
> +that
> +.I only
> +accept file descriptor arguments.
> +.TP
> +.B FSCONFIG_SET_PATH
> +Set the path parameter named by
> +.I key
> +to the object at a provided path,
> +resolved in a similar manner to
> +.BR openat (2).
> +.I value
> +points to a null-terminated pathname string,
> +and
> +.I aux
> +is equivalent to the
> +.I dirfd
> +argument to
> +.BR openat (2).
> +See
> +.BR openat (2)
> +for an explanation of the need for
> +.BR \%FSCONFIG_SET_PATH .
> +.IP
> +You may also use
> +.B \%FSCONFIG_SET_STRING
> +for path parameters,
> +the behaviour of which is equivalent to
> +.B \%FSCONFIG_SET_PATH
> +with
> +.I aux
> +set to
> +.BR \%AT_FDCWD .
> +.TP
> +.B FSCONFIG_SET_PATH_EMPTY
> +As with
> +.BR \%FSCONFIG_SET_PATH ,
> +except that if
> +.I value
> +is an empty string,
> +the file descriptor specified by
> +.I aux
> +is operated on directly
> +and may be any type of file
> +(not just a directory).
> +This is equivalent to the behaviour of
> +.B \%AT_EMPTY_PATH
> +with most "*at()" system calls.
> +If
> +.I aux
> +is
> +.BR \%AT_FDCWD ,
> +the parameter will be set to
> +the current working directory
> +of the calling process.
> +.TP
> +.B FSCONFIG_CMD_CREATE
> +This command instructs the filesystem driver
> +to instantiate an instance of the filesystem in the kernel
> +with the parameters specified in the filesystem configuration context.
> +.I key
> +and
> +.I value
> +must be NULL,
> +and
> +.I aux
> +must be 0.
> +.IP
> +This command can only be issued once
> +in the lifetime of a filesystem context.
> +If the operation succeeds,
> +the filesystem context
> +associated with file descriptor
> +.I fd
> +now references the created filesystem instance,
> +and is placed into a special "awaiting-mount" mode
> +that allows you to use
> +.BR fsmount (2)
> +to create a mount object from the filesystem instance.
> +.\" FS_CONTEXT_AWAITING_MOUNT is the term the kernel uses for this.
> +If the operation fails,
> +in most cases
> +the filesystem context is placed in a failed mode
> +and cannot be used for any further
> +.BR fsconfig ()
> +operations
> +(though you may still retrieve diagnostic messages
> +through the message retrieval interface,
> +as described in
> +the corresponding subsection of
> +.BR fsopen (2)).
> +.IP
> +This command can only be issued against
> +filesystem configuration contexts
> +that were created with
> +.BR fsopen (2).
> +In order to create a filesystem instance,
> +the calling process must have the
> +.B \%CAP_SYS_ADMIN
> +capability.
> +.IP
> +An important thing to be aware of is that
> +the Linux kernel will
> +.I silently
> +reuse extant filesystem instances
> +depending on the filesystem type
> +and the configured parameters
> +(each filesystem driver has
> +its own policy for
> +how filesystem instances are reused).
> +This means that
> +the filesystem instance "created" by
> +.B \%FSCONFIG_CMD_CREATE
> +may, in fact, be a reference
> +to an extant filesystem instance in the kernel.
> +(For reference,
> +this behaviour also applies to
> +.BR mount (2).)
> +.IP
> +One side-effect of this behaviour is that
> +if an extant filesystem instance is reused,
> +.I all
> +parameters configured
> +for this filesystem configuration context
> +are
> +.I silently ignored
> +(with the exception of the
> +.I ro
> +and
> +.I rw
> +flag parameters;
> +if the state of the read-only flag in the
> +extant filesystem instance and the filesystem configuration context
> +do not match, this operation will return
> +.BR EBUSY ).
> +This also means that
> +.B \%FSCONFIG_CMD_RECONFIGURE
> +commands issued against
> +the "created" filesystem instance
> +will also affect any mount objects associated with
> +the extant filesystem instance.
> +.IP
> +Programs that need to ensure
> +that they create a new filesystem instance
> +with specific parameters
> +(notably, security-related parameters
> +such as
> +.I acl
> +to enable POSIX ACLs\[em]\c
> +as described in
> +.BR acl (5))
> +should use
> +.B \%FSCONFIG_CMD_CREATE_EXCL
> +instead.
> +.TP
> +.BR FSCONFIG_CMD_CREATE_EXCL " (since Linux 6.6)"
> +.\" commit 22ed7ecdaefe0cac0c6e6295e83048af60435b13
> +.\" commit 84ab1277ce5a90a8d1f377707d662ac43cc0918a
> +As with
> +.BR \%FSCONFIG_CMD_CREATE ,
> +except that the kernel is instructed
> +to not reuse extant filesystem instances.
> +If the operation
> +would be forced to
> +reuse an extant filesystem instance,
> +this operation will return
> +.B EBUSY
> +instead.
> +.IP
> +As a result (unlike
> +.BR \%FSCONFIG_CMD_CREATE ),
> +if this operation succeeds
> +then the calling process can be sure that
> +all of the parameters successfully configured with
> +.BR fsconfig ()
> +will actually be applied
> +to the created filesystem instance.
> +.TP
> +.B FSCONFIG_CMD_RECONFIGURE
> +This command instructs the filesystem driver
> +to apply the parameters specified in the filesystem configuration context
> +to the extant filesystem instance
> +referenced by the filesystem configuration context.
> +.I key
> +and
> +.I value
> +must be NULL,
> +and
> +.I aux
> +must be 0.
> +.IP
> +This is primarily intended for use with
> +.BR fspick (2),
> +but may also be used to modify
> +the parameters of a filesystem instance
> +after
> +.B \%FSCONFIG_CMD_CREATE
> +was used to create it
> +and a mount object was created using
> +.BR fsmount (2).
> +In order to reconfigure an extant filesystem instance,
> +the calling process must have the
> +.B CAP_SYS_ADMIN
> +capability.
> +.IP
> +If the operation succeeds,
> +the filesystem context is reset
> +but remains in reconfiguration mode
> +and thus can be reused for subsequent
> +.B \%FSCONFIG_CMD_RECONFIGURE
> +commands.
> +If the operation fails,
> +in most cases
> +the filesystem context is placed in a failed mode
> +and cannot be used for any further
> +.BR fsconfig ()
> +operations
> +(though you may still retrieve diagnostic messages
> +through the message retrieval interface,
> +as described in
> +the corresponding subsection of
> +.BR fsopen (2)).
> +.RE
> +.P
> +Parameters specified with
> +.BI FSCONFIG_SET_ *
> +do not take effect
> +until a corresponding
> +.B \%FSCONFIG_CMD_CREATE
> +or
> +.B \%FSCONFIG_CMD_RECONFIGURE
> +command is issued.
> +.SH RETURN VALUE
> +On success,
> +.BR fsconfig ()
> +returns 0.
> +On error, \-1 is returned, and
> +.I errno
> +is set to indicate the error.
> +.SH ERRORS
> +If an error occurs, the filesystem driver may provide
> +additional information about the error
> +through the message retrieval interface for filesystem configuration contexts.
> +This additional information can be retrieved at any time by calling
> +.BR read (2)
> +on the filesystem instance or filesystem configuration context
> +referenced by the file descriptor
> +.IR fd .
> +(See the "Message retrieval interface" subsection in
> +.BR fsopen (2)
> +for more details on the message format.)
> +.P
> +Even after an error occurs,
> +the filesystem configuration context is
> +.I not
> +invalidated,
> +and thus can still be used with other
> +.BR fsconfig ()
> +commands.
> +This means that users can probe support for filesystem parameters
> +on a per-parameter basis,
> +and adjust which parameters they wish to set.
> +.P
> +The error values given below result from
> +filesystem type independent errors.
> +Each filesystem type may have its own special errors
> +and its own special behavior.
> +See the Linux kernel source code for details.
> +.TP
> +.B EACCES
> +A component of a path
> +provided as a path parameter
> +was not searchable.
> +(See also
> +.BR path_resolution (7).)
> +.TP
> +.B EACCES
> +.B \%FSCONFIG_CMD_CREATE
> +was attempted
> +for a read-only filesystem
> +without specifying the
> +.RB ' ro '
> +flag parameter.
> +.TP
> +.B EACCES
> +A specified block device parameter
> +is located on a filesystem
> +mounted with the
> +.B \%MS_NODEV
> +option.
> +.TP
> +.B EBADF
> +The file descriptor given by
> +.I fd
> +(or possibly by
> +.IR aux ,
> +depending on the command)
> +is invalid.
> +.TP
> +.B EBUSY
> +The filesystem context associated with
> +.I fd
> +is in the wrong state
> +for the given command.
> +.TP
> +.B EBUSY
> +The filesystem instance cannot be reconfigured as read-only
> +with
> +.B \%FSCONFIG_CMD_RECONFIGURE
> +because some programs
> +still hold files open for writing.
> +.TP
> +.B EBUSY
> +A new filesystem instance was requested with
> +.B \%FSCONFIG_CMD_CREATE_EXCL
> +but a matching superblock already existed.
> +.TP
> +.B EFAULT
> +One of the pointer arguments
> +points to a location
> +outside the calling process's accessible address space.
> +.TP
> +.B EINVAL
> +.I fd
> +does not refer to
> +a filesystem configuration context
> +or filesystem instance.
> +.TP
> +.B EINVAL
> +One of the values of
> +.IR name ,
> +.IR value ,
> +and/or
> +.I aux
> +were set to a non-zero value when
> +.I cmd
> +required that they be zero
> +(or NULL).
> +.TP
> +.B EINVAL
> +The parameter named by
> +.I name
> +cannot be set
> +using the type specified with
> +.IR cmd .
> +.TP
> +.B EINVAL
> +One of the source parameters
> +referred to
> +an invalid superblock.
> +.TP
> +.B ELOOP
> +Too many links encountered
> +during pathname resolution
> +of a path argument.
> +.TP
> +.B ENAMETOOLONG
> +A path argument was longer than
> +.BR PATH_MAX .
> +.TP
> +.B ENOENT
> +A path argument had a non-existent component.
> +.TP
> +.B ENOENT
> +A path argument is an empty string,
> +but
> +.I cmd
> +is not
> +.BR \%FSCONFIG_SET_PATH_EMPTY .
> +.TP
> +.B ENOMEM
> +The kernel could not allocate sufficient memory to complete the operation.
> +.TP
> +.B ENOTBLK
> +The parameter named by
> +.I name
> +must be a block device,
> +but the provided parameter value was not a block device.
> +.TP
> +.B ENOTDIR
> +A component of the path prefix
> +of a path argument
> +was not a directory.
> +.TP
> +.B EOPNOTSUPP
> +The command given by
> +.I cmd
> +is not valid.
> +.TP
> +.B ENXIO
> +The major number
> +of a block device parameter
> +is out of range.
> +.TP
> +.B EPERM
> +The command given by
> +.I cmd
> +was
> +.BR \%FSCONFIG_CMD_CREATE ,
> +.BR \%FSCONFIG_CMD_CREATE_EXCL ,
> +or
> +.BR \%FSCONFIG_CMD_RECONFIGURE ,
> +but the calling process does not have the required
> +.B \%CAP_SYS_ADMIN
> +capability.
> +.SH STANDARDS
> +Linux.
> +.SH HISTORY
> +Linux 5.2.
> +.\" commit ecdab150fddb42fe6a739335257949220033b782
> +.\" commit 400913252d09f9cfb8cce33daee43167921fc343
> +glibc 2.36.
> +.SH NOTES
> +.SS Generic filesystem parameters
> +Each filesystem driver is responsible for
> +parsing most parameters specified with
> +.BR fsconfig (),
> +meaning that individual filesystems
> +may have very different behaviour
> +when encountering parameters with the same name.
> +In general,
> +you should not assume that the behaviour of
> +.BR fsconfig ()
> +when specifying a parameter to one filesystem type
> +will match the behaviour of the same parameter
> +with a different filesystem type.
> +.P
> +However,
> +the following generic parameters
> +apply to all filesystems and have unified behaviour.
> +They are set using the listed
> +.BI \%FSCONFIG_SET_ *
> +command.
> +.TP
> +\fIro\fP and \fIrw\fP (\fB\%FSCONFIG_SET_FLAG\fP)
> +Configure whether the filesystem instance is read-only.
> +.TP
> +\fIdirsync\fP (\fB\%FSCONFIG_SET_FLAG\fP)
> +Make directory changes on this filesystem instance synchronous.
> +.TP
> +\fIsync\fP and \fIasync\fP (\fB\%FSCONFIG_SET_FLAG\fP)
> +Configure whether writes on this filesystem instance
> +will be made synchronous
> +(as though the
> +.B O_SYNC
> +flag to
> +.BR open (2)
> +was specified for
> +all file opens in this filesystem instance).
> +.TP
> +\fIlazytime\fP and \fInolazytime\fP (\fB\%FSCONFIG_SET_FLAG\fP)
> +Configure whether to reduce on-disk updates
> +of inode timestamps on this filesystem instance
> +(as described in the
> +.B \%MS_LAZYTIME
> +section of
> +.BR mount (2)).
> +.TP
> +\fImand\fP and \fInomand\fP (\fB\%FSCONFIG_SET_FLAG\fP)
> +Configure whether the filesystem instance should permit mandatory locking.
> +Since Linux 5.15,
> +.\" commit f7e33bdbd6d1bdf9c3df8bba5abcf3399f957ac3
> +mandatory locking has been deprecated
> +and setting this flag is a no-op.
> +.TP
> +\fIsource\fP (\fB\%FSCONFIG_SET_STRING\fP)
> +This parameter is equivalent to the
> +.I source
> +parameter passed to
> +.BR mount (2)
> +for the same filesystem type,
> +and is usually the pathname of a block device
> +containing the filesystem.
> +This parameter may only be set once
> +per filesystem configuration context transaction.
> +.P
> +In addition,
> +any filesystem parameters associated with
> +Linux Security Modules (LSMs)
> +are also generic with respect to the underlying filesystem.
> +See the documentation for the LSM you wish to configure for more details.
> +.SH CAVEATS
> +.SS Filesystem parameter types
> +As a result of
> +each filesystem driver being responsible for
> +parsing most parameters specified with
> +.BR fsconfig (),
> +some filesystem drivers
> +may have unintuitive behaviour
> +with regards to which
> +.BI \%FSCONFIG_SET_ *
> +commands are permitted
> +to configure a given parameter.
> +.P
> +In order for
> +filesystem parameters to be backwards compatible with
> +.BR mount (2),
> +they must be parseable as strings;
> +this almost universally means that
> +.B \%FSCONFIG_SET_STRING
> +can also be used to configure them.
> +.\" Aleksa Sarai
> +.\"   Theoretically, a filesystem could check fc->oldapi and refuse
> +.\"   FSCONFIG_SET_STRING if the operation is coming from the new API, but no
> +.\"   filesystems do this (and probably never will).
> +However, other
> +.BI \%FSCONFIG_SET_ *
> +commands need to be opted into
> +by each filesystem driver's parameter parser.
> +.P
> +One of the most user-visible instances of
> +this inconsistency is that
> +many filesystems do not support
> +configuring path parameters with
> +.B \%FSCONFIG_SET_PATH
> +(despite the name),
> +which can lead to somewhat confusing
> +.B EINVAL
> +errors.
> +(For example, the generic
> +.I source
> +parameter\[em]\c
> +which is usually a path\[em]\c
> +can only be configured
> +with
> +.BR \%FSCONFIG_SET_STRING .)
> +.P
> +When writing programs that use
> +.BR fsconfig ()
> +to configure parameters
> +with commands other than
> +.BR \%FSCONFIG_SET_STRING ,
> +users should verify
> +that the
> +.BI \%FSCONFIG_SET_ *
> +commands used to configure each parameter
> +are supported by the corresponding filesystem driver.
> +.\" Aleksa Sarai
> +.\"   While this (quite confusing) inconsistency in behaviour is true today
> +.\"   (and has been true since this was merged), this appears to mostly be an
> +.\"   unintended consequence of filesystem drivers hand-coding fsparam parsing.
> +.\"   Path parameters are the most eggregious causes of confusion.
> +.\"   Hopefully we can make this no longer the case in a future kernel.
> +.SH EXAMPLES
> +To illustrate the different kinds of flags that can be configured with
> +.BR fsconfig (),
> +here are a few examples of some different filesystems being created:
> +.P
> +.in +4n
> +.EX
> +int fsfd, mntfd;
> +\&
> +fsfd = fsopen("tmpfs", FSOPEN_CLOEXEC);
> +fsconfig(fsfd, FSCONFIG_SET_FLAG, "inode64", NULL, 0);
> +fsconfig(fsfd, FSCONFIG_SET_STRING, "uid", "1234", 0);
> +fsconfig(fsfd, FSCONFIG_SET_STRING, "huge", "never", 0);
> +fsconfig(fsfd, FSCONFIG_SET_FLAG, "casefold", NULL, 0);
> +fsconfig(fsfd, FSCONFIG_CMD_CREATE, NULL, NULL, 0);
> +mntfd = fsmount(fsfd, FSMOUNT_CLOEXEC, MOUNT_ATTR_NOEXEC);
> +move_mount(mntfd, "", AT_FDCWD, "/tmp", MOVE_MOUNT_F_EMPTY_PATH);
> +\&
> +fsfd = fsopen("erofs", FSOPEN_CLOEXEC);
> +fsconfig(fsfd, FSCONFIG_SET_STRING, "source", "/dev/loop0", 0);
> +fsconfig(fsfd, FSCONFIG_SET_FLAG, "acl", NULL, 0);
> +fsconfig(fsfd, FSCONFIG_SET_FLAG, "user_xattr", NULL, 0);
> +fsconfig(fsfd, FSCONFIG_CMD_CREATE_EXCL, NULL, NULL, 0);
> +mntfd = fsmount(fsfd, FSMOUNT_CLOEXEC, MOUNT_ATTR_NOSUID);
> +move_mount(mntfd, "", AT_FDCWD, "/mnt", MOVE_MOUNT_F_EMPTY_PATH);
> +.EE
> +.in
> +.P
> +Usually,
> +specifying the same parameter named by
> +.I key
> +multiple times with
> +.BR fsconfig ()
> +causes the parameter value to be replaced.
> +However, some filesystems may have unique behaviour:
> +.P
> +.in +4n
> +.EX
> +\&
> +int fsfd, mntfd;
> +int lowerdirfd = open("/o/ctr/lower1", O_DIRECTORY | O_CLOEXEC);
> +\&
> +fsfd = fsopen("overlay", FSOPEN_CLOEXEC);
> +/* "lowerdir+" appends to the lower dir stack each time */
> +fsconfig(fsfd, FSCONFIG_SET_FD, "lowerdir+", NULL, lowerdirfd);
> +fsconfig(fsfd, FSCONFIG_SET_STRING, "lowerdir+", "/o/ctr/lower2", 0);
> +fsconfig(fsfd, FSCONFIG_SET_STRING, "lowerdir+", "/o/ctr/lower3", 0);
> +fsconfig(fsfd, FSCONFIG_SET_STRING, "lowerdir+", "/o/ctr/lower4", 0);
> +.\" fsconfig(fsfd, FSCONFIG_SET_PATH, "lowerdir+", "/o/ctr/lower5", AT_FDCWD);
> +.\" fsconfig(fsfd, FSCONFIG_SET_PATH_EMPTY, "lowerdir+", "", lowerdirfd);
> +.\" Aleksa Sarai: Hopefully these will also be supported in the future.
> +fsconfig(fsfd, FSCONFIG_SET_STRING, "xino", "auto", 0);
> +fsconfig(fsfd, FSCONFIG_SET_STRING, "nfs_export", "off", 0);
> +fsconfig(fsfd, FSCONFIG_CMD_CREATE, NULL, NULL, 0);
> +mntfd = fsmount(fsfd, FSMOUNT_CLOEXEC, 0);
> +move_mount(mntfd, "", AT_FDCWD, "/mnt", MOVE_MOUNT_F_EMPTY_PATH);
> +.EE
> +.in
> +.P
> +And here is an example of how
> +.BR fspick (2)
> +can be used with
> +.BR fsconfig ()
> +to reconfigure the parameters
> +of an extant filesystem instance
> +attached to
> +.IR /proc :
> +.P
> +.in +4n
> +.EX
> +int fsfd = fspick(AT_FDCWD, "/proc", FSPICK_CLOEXEC);
> +fsconfig(fsfd, FSCONFIG_SET_STRING, "hidepid", "ptraceable", 0);
> +fsconfig(fsfd, FSCONFIG_SET_STRING, "subset", "pid", 0);
> +fsconfig(fsfd, FSCONFIG_CMD_RECONFIGURE, NULL, NULL, 0);
> +.EE
> +.in
> +.SH SEE ALSO
> +.BR fsmount (2),
> +.BR fsopen (2),
> +.BR fspick (2),
> +.BR mount (2),
> +.BR mount_setattr (2),
> +.BR move_mount (2),
> +.BR open_tree (2),
> +.BR mount_namespaces (7)
> 
> -- 
> 2.51.0
> 
> 

-- 
<https://www.alejandro-colomar.es>
Use port 80 (that is, <...:80/>).

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply

* Re: [PATCH v5 3/8] man/man2/fsconfig.2: document "new" mount API
From: Alejandro Colomar @ 2025-09-25 16:52 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: Michael T. Kerrisk, Alexander Viro, Jan Kara, Askar Safin,
	G. Branden Robinson, linux-man, linux-api, linux-fsdevel,
	linux-kernel, David Howells, Christian Brauner
In-Reply-To: <2025-09-25-azure-rubber-flair-menus-42bRw8@cyphar.com>

[-- Attachment #1: Type: text/plain, Size: 793 bytes --]

On Fri, Sep 26, 2025 at 01:15:14AM +1000, Aleksa Sarai wrote:
> > > +.TP
> > > +.B ENOTBLK
> > > +The parameter named by
> > > +.I name
> > 
> > There's no such parameter.  (I guess you meant 'key'?)
> 
> Ah yes, I did mean "key". The same mistake was repeated for two EINVAL
> cases above as well:

Thanks!

> 
>     EINVAL One of the values of *name*, value, and/or aux were set to a
>            non-zero value when cmd required that they be zero (or NULL).
> 
>     EINVAL The parameter named by *name* cannot be set using the type
>            specified with cmd.
> 
> Do you want me to send another version or would you able to fix it when
> you apply?

I can fix it.

Cheers,
Alex

-- 
<https://www.alejandro-colomar.es>
Use port 80 (that is, <...:80/>).

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply

* Re: [PATCH v5 3/8] man/man2/fsconfig.2: document "new" mount API
From: Aleksa Sarai @ 2025-09-25 15:15 UTC (permalink / raw)
  To: Alejandro Colomar
  Cc: Michael T. Kerrisk, Alexander Viro, Jan Kara, Askar Safin,
	G. Branden Robinson, linux-man, linux-api, linux-fsdevel,
	linux-kernel, David Howells, Christian Brauner
In-Reply-To: <brqynohvpwo4hqdepvqks3hluq3jng6bnd7xtensee5adgtxem@3ughtcvv57si>

[-- Attachment #1: Type: text/plain, Size: 24361 bytes --]

On 2025-09-25, Alejandro Colomar <alx@kernel.org> wrote:
> Hi Aleksa,
> 
> On Thu, Sep 25, 2025 at 01:31:25AM +1000, Aleksa Sarai wrote:
> > This is loosely based on the original documentation written by David
> > Howells and later maintained by Christian Brauner, but has been
> > rewritten to be more from a user perspective (as well as fixing a few
> > critical mistakes).
> > 
> > Co-authored-by: David Howells <dhowells@redhat.com>
> > Signed-off-by: David Howells <dhowells@redhat.com>
> > Co-authored-by: Christian Brauner <brauner@kernel.org>
> > Signed-off-by: Christian Brauner <brauner@kernel.org>
> > Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
> > ---
> >  man/man2/fsconfig.2 | 729 ++++++++++++++++++++++++++++++++++++++++++++++++++++
> >  1 file changed, 729 insertions(+)
> > 
> > diff --git a/man/man2/fsconfig.2 b/man/man2/fsconfig.2
> > new file mode 100644
> > index 0000000000000000000000000000000000000000..a2d844a105c74f17af640d6991046dbd5fa69cf0
> > --- /dev/null
> > +++ b/man/man2/fsconfig.2
> > @@ -0,0 +1,729 @@
> > +.\" Copyright, the authors of the Linux man-pages project
> > +.\"
> > +.\" SPDX-License-Identifier: Linux-man-pages-copyleft
> > +.\"
> > +.TH fsconfig 2 (date) "Linux man-pages (unreleased)"
> > +.SH NAME
> > +fsconfig \- configure new or existing filesystem context
> > +.SH LIBRARY
> > +Standard C library
> > +.RI ( libc ,\~ \-lc )
> > +.SH SYNOPSIS
> > +.nf
> > +.B #include <sys/mount.h>
> > +.P
> > +.BI "int fsconfig(int " fd ", unsigned int " cmd ,
> > +.BI "             const char *_Nullable " key ,
> > +.BI "             const void *_Nullable " value ", int " aux );
> > +.fi
> > +.SH DESCRIPTION
> > +The
> > +.BR fsconfig ()
> > +system call is part of
> > +the suite of file-descriptor-based mount facilities in Linux.
> > +.P
> > +.BR fsconfig ()
> > +is used to supply parameters to
> > +and issue commands against
> > +the filesystem configuration context
> > +associated with the file descriptor
> > +.IR fd .
> > +Filesystem configuration contexts can be created with
> > +.BR fsopen (2)
> > +or be instantiated from an extant filesystem instance with
> > +.BR fspick (2).
> > +.P
> > +The
> > +.I cmd
> > +argument indicates the command to be issued.
> > +Some commands supply parameters to the context
> > +(equivalent to mount options specified with
> > +.BR mount (8)),
> > +while others are meta-operations on the filesystem context.
> > +The list of valid
> > +.I cmd
> > +values are:
> > +.RS
> > +.TP
> > +.B FSCONFIG_SET_FLAG
> > +Set the flag parameter named by
> > +.IR key .
> > +.I value
> > +must be NULL,
> > +and
> > +.I aux
> > +must be 0.
> > +.TP
> > +.B FSCONFIG_SET_STRING
> > +Set the string parameter named by
> > +.I key
> > +to the value specified by
> > +.IR value .
> > +.I value
> > +points to a null-terminated string,
> > +and
> > +.I aux
> > +must be 0.
> > +.TP
> > +.B FSCONFIG_SET_BINARY
> > +Set the blob parameter named by
> > +.I key
> > +to the contents of the binary blob
> > +specified by
> > +.IR value .
> > +.I value
> > +points to
> > +the start of a buffer
> > +that is
> > +.I aux
> > +bytes in length.
> > +.TP
> > +.B FSCONFIG_SET_FD
> > +Set the file parameter named by
> > +.I key
> > +to the open file description
> > +referenced by the file descriptor
> > +.IR aux .
> > +.I value
> > +must be NULL.
> > +.IP
> > +You may also use
> > +.B \%FSCONFIG_SET_STRING
> > +for file parameters,
> > +with
> > +.I value
> > +set to a null-terminated string
> > +containing a base-10 representation
> > +of the file descriptor number.
> > +This mechanism is primarily intended for compatibility
> > +with older
> > +.BR mount (2)-based
> > +programs,
> > +and only works for parameters
> > +that
> > +.I only
> > +accept file descriptor arguments.
> > +.TP
> > +.B FSCONFIG_SET_PATH
> > +Set the path parameter named by
> > +.I key
> > +to the object at a provided path,
> > +resolved in a similar manner to
> > +.BR openat (2).
> > +.I value
> > +points to a null-terminated pathname string,
> > +and
> > +.I aux
> > +is equivalent to the
> > +.I dirfd
> > +argument to
> > +.BR openat (2).
> > +See
> > +.BR openat (2)
> > +for an explanation of the need for
> > +.BR \%FSCONFIG_SET_PATH .
> > +.IP
> > +You may also use
> > +.B \%FSCONFIG_SET_STRING
> > +for path parameters,
> > +the behaviour of which is equivalent to
> > +.B \%FSCONFIG_SET_PATH
> > +with
> > +.I aux
> > +set to
> > +.BR \%AT_FDCWD .
> > +.TP
> > +.B FSCONFIG_SET_PATH_EMPTY
> > +As with
> > +.BR \%FSCONFIG_SET_PATH ,
> > +except that if
> > +.I value
> > +is an empty string,
> > +the file descriptor specified by
> > +.I aux
> > +is operated on directly
> > +and may be any type of file
> > +(not just a directory).
> > +This is equivalent to the behaviour of
> > +.B \%AT_EMPTY_PATH
> > +with most "*at()" system calls.
> > +If
> > +.I aux
> > +is
> > +.BR \%AT_FDCWD ,
> > +the parameter will be set to
> > +the current working directory
> > +of the calling process.
> > +.TP
> > +.B FSCONFIG_CMD_CREATE
> > +This command instructs the filesystem driver
> > +to instantiate an instance of the filesystem in the kernel
> > +with the parameters specified in the filesystem configuration context.
> > +.I key
> > +and
> > +.I value
> > +must be NULL,
> > +and
> > +.I aux
> > +must be 0.
> > +.IP
> > +This command can only be issued once
> > +in the lifetime of a filesystem context.
> > +If the operation succeeds,
> > +the filesystem context
> > +associated with file descriptor
> > +.I fd
> > +now references the created filesystem instance,
> > +and is placed into a special "awaiting-mount" mode
> > +that allows you to use
> > +.BR fsmount (2)
> > +to create a mount object from the filesystem instance.
> > +.\" FS_CONTEXT_AWAITING_MOUNT is the term the kernel uses for this.
> > +If the operation fails,
> > +in most cases
> > +the filesystem context is placed in a failed mode
> > +and cannot be used for any further
> > +.BR fsconfig ()
> > +operations
> > +(though you may still retrieve diagnostic messages
> > +through the message retrieval interface,
> > +as described in
> > +the corresponding subsection of
> > +.BR fsopen (2)).
> > +.IP
> > +This command can only be issued against
> > +filesystem configuration contexts
> > +that were created with
> > +.BR fsopen (2).
> > +In order to create a filesystem instance,
> > +the calling process must have the
> > +.B \%CAP_SYS_ADMIN
> > +capability.
> > +.IP
> > +An important thing to be aware of is that
> > +the Linux kernel will
> > +.I silently
> > +reuse extant filesystem instances
> > +depending on the filesystem type
> > +and the configured parameters
> > +(each filesystem driver has
> > +its own policy for
> > +how filesystem instances are reused).
> > +This means that
> > +the filesystem instance "created" by
> > +.B \%FSCONFIG_CMD_CREATE
> > +may, in fact, be a reference
> > +to an extant filesystem instance in the kernel.
> > +(For reference,
> > +this behaviour also applies to
> > +.BR mount (2).)
> > +.IP
> > +One side-effect of this behaviour is that
> > +if an extant filesystem instance is reused,
> > +.I all
> > +parameters configured
> > +for this filesystem configuration context
> > +are
> > +.I silently ignored
> > +(with the exception of the
> > +.I ro
> > +and
> > +.I rw
> > +flag parameters;
> > +if the state of the read-only flag in the
> > +extant filesystem instance and the filesystem configuration context
> > +do not match, this operation will return
> > +.BR EBUSY ).
> > +This also means that
> > +.B \%FSCONFIG_CMD_RECONFIGURE
> > +commands issued against
> > +the "created" filesystem instance
> > +will also affect any mount objects associated with
> > +the extant filesystem instance.
> > +.IP
> > +Programs that need to ensure
> > +that they create a new filesystem instance
> > +with specific parameters
> > +(notably, security-related parameters
> > +such as
> > +.I acl
> > +to enable POSIX ACLs\[em]\c
> > +as described in
> > +.BR acl (5))
> > +should use
> > +.B \%FSCONFIG_CMD_CREATE_EXCL
> > +instead.
> > +.TP
> > +.BR FSCONFIG_CMD_CREATE_EXCL " (since Linux 6.6)"
> > +.\" commit 22ed7ecdaefe0cac0c6e6295e83048af60435b13
> > +.\" commit 84ab1277ce5a90a8d1f377707d662ac43cc0918a
> > +As with
> > +.BR \%FSCONFIG_CMD_CREATE ,
> > +except that the kernel is instructed
> > +to not reuse extant filesystem instances.
> > +If the operation
> > +would be forced to
> > +reuse an extant filesystem instance,
> > +this operation will return
> > +.B EBUSY
> > +instead.
> > +.IP
> > +As a result (unlike
> > +.BR \%FSCONFIG_CMD_CREATE ),
> > +if this operation succeeds
> > +then the calling process can be sure that
> > +all of the parameters successfully configured with
> > +.BR fsconfig ()
> > +will actually be applied
> > +to the created filesystem instance.
> > +.TP
> > +.B FSCONFIG_CMD_RECONFIGURE
> > +This command instructs the filesystem driver
> > +to apply the parameters specified in the filesystem configuration context
> > +to the extant filesystem instance
> > +referenced by the filesystem configuration context.
> > +.I key
> > +and
> > +.I value
> > +must be NULL,
> > +and
> > +.I aux
> > +must be 0.
> > +.IP
> > +This is primarily intended for use with
> > +.BR fspick (2),
> > +but may also be used to modify
> > +the parameters of a filesystem instance
> > +after
> > +.B \%FSCONFIG_CMD_CREATE
> > +was used to create it
> > +and a mount object was created using
> > +.BR fsmount (2).
> > +In order to reconfigure an extant filesystem instance,
> > +the calling process must have the
> > +.B CAP_SYS_ADMIN
> > +capability.
> > +.IP
> > +If the operation succeeds,
> > +the filesystem context is reset
> > +but remains in reconfiguration mode
> > +and thus can be reused for subsequent
> > +.B \%FSCONFIG_CMD_RECONFIGURE
> > +commands.
> > +If the operation fails,
> > +in most cases
> > +the filesystem context is placed in a failed mode
> > +and cannot be used for any further
> > +.BR fsconfig ()
> > +operations
> > +(though you may still retrieve diagnostic messages
> > +through the message retrieval interface,
> > +as described in
> > +the corresponding subsection of
> > +.BR fsopen (2)).
> > +.RE
> > +.P
> > +Parameters specified with
> > +.BI FSCONFIG_SET_ *
> > +do not take effect
> > +until a corresponding
> > +.B \%FSCONFIG_CMD_CREATE
> > +or
> > +.B \%FSCONFIG_CMD_RECONFIGURE
> > +command is issued.
> > +.SH RETURN VALUE
> > +On success,
> > +.BR fsconfig ()
> > +returns 0.
> > +On error, \-1 is returned, and
> > +.I errno
> > +is set to indicate the error.
> > +.SH ERRORS
> > +If an error occurs, the filesystem driver may provide
> > +additional information about the error
> > +through the message retrieval interface for filesystem configuration contexts.
> > +This additional information can be retrieved at any time by calling
> > +.BR read (2)
> > +on the filesystem instance or filesystem configuration context
> > +referenced by the file descriptor
> > +.IR fd .
> > +(See the "Message retrieval interface" subsection in
> > +.BR fsopen (2)
> > +for more details on the message format.)
> > +.P
> > +Even after an error occurs,
> > +the filesystem configuration context is
> > +.I not
> > +invalidated,
> > +and thus can still be used with other
> > +.BR fsconfig ()
> > +commands.
> > +This means that users can probe support for filesystem parameters
> > +on a per-parameter basis,
> > +and adjust which parameters they wish to set.
> > +.P
> > +The error values given below result from
> > +filesystem type independent errors.
> > +Each filesystem type may have its own special errors
> > +and its own special behavior.
> > +See the Linux kernel source code for details.
> > +.TP
> > +.B EACCES
> > +A component of a path
> > +provided as a path parameter
> > +was not searchable.
> > +(See also
> > +.BR path_resolution (7).)
> > +.TP
> > +.B EACCES
> > +.B \%FSCONFIG_CMD_CREATE
> > +was attempted
> > +for a read-only filesystem
> > +without specifying the
> > +.RB ' ro '
> > +flag parameter.
> > +.TP
> > +.B EACCES
> > +A specified block device parameter
> > +is located on a filesystem
> > +mounted with the
> > +.B \%MS_NODEV
> > +option.
> > +.TP
> > +.B EBADF
> > +The file descriptor given by
> > +.I fd
> > +(or possibly by
> > +.IR aux ,
> > +depending on the command)
> > +is invalid.
> > +.TP
> > +.B EBUSY
> > +The filesystem context associated with
> > +.I fd
> > +is in the wrong state
> > +for the given command.
> > +.TP
> > +.B EBUSY
> > +The filesystem instance cannot be reconfigured as read-only
> > +with
> > +.B \%FSCONFIG_CMD_RECONFIGURE
> > +because some programs
> > +still hold files open for writing.
> > +.TP
> > +.B EBUSY
> > +A new filesystem instance was requested with
> > +.B \%FSCONFIG_CMD_CREATE_EXCL
> > +but a matching superblock already existed.
> > +.TP
> > +.B EFAULT
> > +One of the pointer arguments
> > +points to a location
> > +outside the calling process's accessible address space.
> > +.TP
> > +.B EINVAL
> > +.I fd
> > +does not refer to
> > +a filesystem configuration context
> > +or filesystem instance.
> > +.TP
> > +.B EINVAL
> > +One of the values of
> > +.IR name ,
> > +.IR value ,
> > +and/or
> > +.I aux
> > +were set to a non-zero value when
> > +.I cmd
> > +required that they be zero
> > +(or NULL).
> > +.TP
> > +.B EINVAL
> > +The parameter named by
> > +.I name
> > +cannot be set
> > +using the type specified with
> > +.IR cmd .
> > +.TP
> > +.B EINVAL
> > +One of the source parameters
> > +referred to
> > +an invalid superblock.
> > +.TP
> > +.B ELOOP
> > +Too many links encountered
> > +during pathname resolution
> > +of a path argument.
> > +.TP
> > +.B ENAMETOOLONG
> > +A path argument was longer than
> > +.BR PATH_MAX .
> > +.TP
> > +.B ENOENT
> > +A path argument had a non-existent component.
> > +.TP
> > +.B ENOENT
> > +A path argument is an empty string,
> > +but
> > +.I cmd
> > +is not
> > +.BR \%FSCONFIG_SET_PATH_EMPTY .
> > +.TP
> > +.B ENOMEM
> > +The kernel could not allocate sufficient memory to complete the operation.
> > +.TP
> > +.B ENOTBLK
> > +The parameter named by
> > +.I name
> 
> There's no such parameter.  (I guess you meant 'key'?)

Ah yes, I did mean "key". The same mistake was repeated for two EINVAL
cases above as well:

    EINVAL One of the values of *name*, value, and/or aux were set to a
           non-zero value when cmd required that they be zero (or NULL).

    EINVAL The parameter named by *name* cannot be set using the type
           specified with cmd.

Do you want me to send another version or would you able to fix it when
you apply?

Thanks.

> Cheers,
> Alex
> 
> > +must be a block device,
> > +but the provided parameter value was not a block device.
> > +.TP
> > +.B ENOTDIR
> > +A component of the path prefix
> > +of a path argument
> > +was not a directory.
> > +.TP
> > +.B EOPNOTSUPP
> > +The command given by
> > +.I cmd
> > +is not valid.
> > +.TP
> > +.B ENXIO
> > +The major number
> > +of a block device parameter
> > +is out of range.
> > +.TP
> > +.B EPERM
> > +The command given by
> > +.I cmd
> > +was
> > +.BR \%FSCONFIG_CMD_CREATE ,
> > +.BR \%FSCONFIG_CMD_CREATE_EXCL ,
> > +or
> > +.BR \%FSCONFIG_CMD_RECONFIGURE ,
> > +but the calling process does not have the required
> > +.B \%CAP_SYS_ADMIN
> > +capability.
> > +.SH STANDARDS
> > +Linux.
> > +.SH HISTORY
> > +Linux 5.2.
> > +.\" commit ecdab150fddb42fe6a739335257949220033b782
> > +.\" commit 400913252d09f9cfb8cce33daee43167921fc343
> > +glibc 2.36.
> > +.SH NOTES
> > +.SS Generic filesystem parameters
> > +Each filesystem driver is responsible for
> > +parsing most parameters specified with
> > +.BR fsconfig (),
> > +meaning that individual filesystems
> > +may have very different behaviour
> > +when encountering parameters with the same name.
> > +In general,
> > +you should not assume that the behaviour of
> > +.BR fsconfig ()
> > +when specifying a parameter to one filesystem type
> > +will match the behaviour of the same parameter
> > +with a different filesystem type.
> > +.P
> > +However,
> > +the following generic parameters
> > +apply to all filesystems and have unified behaviour.
> > +They are set using the listed
> > +.BI \%FSCONFIG_SET_ *
> > +command.
> > +.TP
> > +\fIro\fP and \fIrw\fP (\fB\%FSCONFIG_SET_FLAG\fP)
> > +Configure whether the filesystem instance is read-only.
> > +.TP
> > +\fIdirsync\fP (\fB\%FSCONFIG_SET_FLAG\fP)
> > +Make directory changes on this filesystem instance synchronous.
> > +.TP
> > +\fIsync\fP and \fIasync\fP (\fB\%FSCONFIG_SET_FLAG\fP)
> > +Configure whether writes on this filesystem instance
> > +will be made synchronous
> > +(as though the
> > +.B O_SYNC
> > +flag to
> > +.BR open (2)
> > +was specified for
> > +all file opens in this filesystem instance).
> > +.TP
> > +\fIlazytime\fP and \fInolazytime\fP (\fB\%FSCONFIG_SET_FLAG\fP)
> > +Configure whether to reduce on-disk updates
> > +of inode timestamps on this filesystem instance
> > +(as described in the
> > +.B \%MS_LAZYTIME
> > +section of
> > +.BR mount (2)).
> > +.TP
> > +\fImand\fP and \fInomand\fP (\fB\%FSCONFIG_SET_FLAG\fP)
> > +Configure whether the filesystem instance should permit mandatory locking.
> > +Since Linux 5.15,
> > +.\" commit f7e33bdbd6d1bdf9c3df8bba5abcf3399f957ac3
> > +mandatory locking has been deprecated
> > +and setting this flag is a no-op.
> > +.TP
> > +\fIsource\fP (\fB\%FSCONFIG_SET_STRING\fP)
> > +This parameter is equivalent to the
> > +.I source
> > +parameter passed to
> > +.BR mount (2)
> > +for the same filesystem type,
> > +and is usually the pathname of a block device
> > +containing the filesystem.
> > +This parameter may only be set once
> > +per filesystem configuration context transaction.
> > +.P
> > +In addition,
> > +any filesystem parameters associated with
> > +Linux Security Modules (LSMs)
> > +are also generic with respect to the underlying filesystem.
> > +See the documentation for the LSM you wish to configure for more details.
> > +.SH CAVEATS
> > +.SS Filesystem parameter types
> > +As a result of
> > +each filesystem driver being responsible for
> > +parsing most parameters specified with
> > +.BR fsconfig (),
> > +some filesystem drivers
> > +may have unintuitive behaviour
> > +with regards to which
> > +.BI \%FSCONFIG_SET_ *
> > +commands are permitted
> > +to configure a given parameter.
> > +.P
> > +In order for
> > +filesystem parameters to be backwards compatible with
> > +.BR mount (2),
> > +they must be parseable as strings;
> > +this almost universally means that
> > +.B \%FSCONFIG_SET_STRING
> > +can also be used to configure them.
> > +.\" Aleksa Sarai
> > +.\"   Theoretically, a filesystem could check fc->oldapi and refuse
> > +.\"   FSCONFIG_SET_STRING if the operation is coming from the new API, but no
> > +.\"   filesystems do this (and probably never will).
> > +However, other
> > +.BI \%FSCONFIG_SET_ *
> > +commands need to be opted into
> > +by each filesystem driver's parameter parser.
> > +.P
> > +One of the most user-visible instances of
> > +this inconsistency is that
> > +many filesystems do not support
> > +configuring path parameters with
> > +.B \%FSCONFIG_SET_PATH
> > +(despite the name),
> > +which can lead to somewhat confusing
> > +.B EINVAL
> > +errors.
> > +(For example, the generic
> > +.I source
> > +parameter\[em]\c
> > +which is usually a path\[em]\c
> > +can only be configured
> > +with
> > +.BR \%FSCONFIG_SET_STRING .)
> > +.P
> > +When writing programs that use
> > +.BR fsconfig ()
> > +to configure parameters
> > +with commands other than
> > +.BR \%FSCONFIG_SET_STRING ,
> > +users should verify
> > +that the
> > +.BI \%FSCONFIG_SET_ *
> > +commands used to configure each parameter
> > +are supported by the corresponding filesystem driver.
> > +.\" Aleksa Sarai
> > +.\"   While this (quite confusing) inconsistency in behaviour is true today
> > +.\"   (and has been true since this was merged), this appears to mostly be an
> > +.\"   unintended consequence of filesystem drivers hand-coding fsparam parsing.
> > +.\"   Path parameters are the most eggregious causes of confusion.
> > +.\"   Hopefully we can make this no longer the case in a future kernel.
> > +.SH EXAMPLES
> > +To illustrate the different kinds of flags that can be configured with
> > +.BR fsconfig (),
> > +here are a few examples of some different filesystems being created:
> > +.P
> > +.in +4n
> > +.EX
> > +int fsfd, mntfd;
> > +\&
> > +fsfd = fsopen("tmpfs", FSOPEN_CLOEXEC);
> > +fsconfig(fsfd, FSCONFIG_SET_FLAG, "inode64", NULL, 0);
> > +fsconfig(fsfd, FSCONFIG_SET_STRING, "uid", "1234", 0);
> > +fsconfig(fsfd, FSCONFIG_SET_STRING, "huge", "never", 0);
> > +fsconfig(fsfd, FSCONFIG_SET_FLAG, "casefold", NULL, 0);
> > +fsconfig(fsfd, FSCONFIG_CMD_CREATE, NULL, NULL, 0);
> > +mntfd = fsmount(fsfd, FSMOUNT_CLOEXEC, MOUNT_ATTR_NOEXEC);
> > +move_mount(mntfd, "", AT_FDCWD, "/tmp", MOVE_MOUNT_F_EMPTY_PATH);
> > +\&
> > +fsfd = fsopen("erofs", FSOPEN_CLOEXEC);
> > +fsconfig(fsfd, FSCONFIG_SET_STRING, "source", "/dev/loop0", 0);
> > +fsconfig(fsfd, FSCONFIG_SET_FLAG, "acl", NULL, 0);
> > +fsconfig(fsfd, FSCONFIG_SET_FLAG, "user_xattr", NULL, 0);
> > +fsconfig(fsfd, FSCONFIG_CMD_CREATE_EXCL, NULL, NULL, 0);
> > +mntfd = fsmount(fsfd, FSMOUNT_CLOEXEC, MOUNT_ATTR_NOSUID);
> > +move_mount(mntfd, "", AT_FDCWD, "/mnt", MOVE_MOUNT_F_EMPTY_PATH);
> > +.EE
> > +.in
> > +.P
> > +Usually,
> > +specifying the same parameter named by
> > +.I key
> > +multiple times with
> > +.BR fsconfig ()
> > +causes the parameter value to be replaced.
> > +However, some filesystems may have unique behaviour:
> > +.P
> > +.in +4n
> > +.EX
> > +\&
> > +int fsfd, mntfd;
> > +int lowerdirfd = open("/o/ctr/lower1", O_DIRECTORY | O_CLOEXEC);
> > +\&
> > +fsfd = fsopen("overlay", FSOPEN_CLOEXEC);
> > +/* "lowerdir+" appends to the lower dir stack each time */
> > +fsconfig(fsfd, FSCONFIG_SET_FD, "lowerdir+", NULL, lowerdirfd);
> > +fsconfig(fsfd, FSCONFIG_SET_STRING, "lowerdir+", "/o/ctr/lower2", 0);
> > +fsconfig(fsfd, FSCONFIG_SET_STRING, "lowerdir+", "/o/ctr/lower3", 0);
> > +fsconfig(fsfd, FSCONFIG_SET_STRING, "lowerdir+", "/o/ctr/lower4", 0);
> > +.\" fsconfig(fsfd, FSCONFIG_SET_PATH, "lowerdir+", "/o/ctr/lower5", AT_FDCWD);
> > +.\" fsconfig(fsfd, FSCONFIG_SET_PATH_EMPTY, "lowerdir+", "", lowerdirfd);
> > +.\" Aleksa Sarai: Hopefully these will also be supported in the future.
> > +fsconfig(fsfd, FSCONFIG_SET_STRING, "xino", "auto", 0);
> > +fsconfig(fsfd, FSCONFIG_SET_STRING, "nfs_export", "off", 0);
> > +fsconfig(fsfd, FSCONFIG_CMD_CREATE, NULL, NULL, 0);
> > +mntfd = fsmount(fsfd, FSMOUNT_CLOEXEC, 0);
> > +move_mount(mntfd, "", AT_FDCWD, "/mnt", MOVE_MOUNT_F_EMPTY_PATH);
> > +.EE
> > +.in
> > +.P
> > +And here is an example of how
> > +.BR fspick (2)
> > +can be used with
> > +.BR fsconfig ()
> > +to reconfigure the parameters
> > +of an extant filesystem instance
> > +attached to
> > +.IR /proc :
> > +.P
> > +.in +4n
> > +.EX
> > +int fsfd = fspick(AT_FDCWD, "/proc", FSPICK_CLOEXEC);
> > +fsconfig(fsfd, FSCONFIG_SET_STRING, "hidepid", "ptraceable", 0);
> > +fsconfig(fsfd, FSCONFIG_SET_STRING, "subset", "pid", 0);
> > +fsconfig(fsfd, FSCONFIG_CMD_RECONFIGURE, NULL, NULL, 0);
> > +.EE
> > +.in
> > +.SH SEE ALSO
> > +.BR fsmount (2),
> > +.BR fsopen (2),
> > +.BR fspick (2),
> > +.BR mount (2),
> > +.BR mount_setattr (2),
> > +.BR move_mount (2),
> > +.BR open_tree (2),
> > +.BR mount_namespaces (7)
> > 
> > -- 
> > 2.51.0
> > 
> > 
> 
> -- 
> <https://www.alejandro-colomar.es>
> Use port 80 (that is, <...:80/>).



-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
https://www.cyphar.com/

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 265 bytes --]

^ permalink raw reply

* [PATCH-RFC] init: simplify initrd code (was Re: [PATCH RESEND 00/62] initrd: remove classic initrd support).
From: nschichan @ 2025-09-25 13:10 UTC (permalink / raw)
  To: nschichan
  Cc: akpm, andy.shevchenko, axboe, brauner, cyphar, devicetree,
	ecurtin, email2tema, graf, gregkh, hca, hch, hsiangkao, initramfs,
	jack, julian.stecklina, kees, linux-acpi, linux-alpha, linux-api,
	linux-arch, linux-arm-kernel, linux-block, linux-csky, linux-doc,
	linux-efi, linux-ext4, linux-fsdevel, linux-hexagon, linux-kernel,
	linux-m68k, linux-mips, linux-openrisc, linux-parisc, linux-riscv,
	linux-s390, linux-sh, linux-snps-arc, linux-um, linuxppc-dev,
	loongarch, mcgrof, mingo, monstr, mzxreary, patches, rob,
	safinaskar, sparclinux, thomas.weissschuh, thorsten.blum,
	torvalds, tytso, viro, x86
In-Reply-To: <CAHNNwZC7gC7zaZGiSBhobSAb4m2O1BuoZ4r=SQBF-tCQyuAPvw@mail.gmail.com>

From: Nicolas Schichan <nschichan@freebox.fr>

- drop prompt_ramdisk and ramdisk_start kernel parameters
- drop compression support
- drop image autodetection, the whole /initrd.image content is now
  copied into /dev/ram0
- remove rd_load_disk() which doesn't seem to be used anywhere.

There is now no more limitation on the type of initrd filesystem that
can be loaded since the code trying to guess the initrd filesystem
size is gone (the whole /initrd.image file is used).

A few global variables in do_mounts_rd.c are now put as local
variables in rd_load_image() since they do not need to be visible
outside this function.
---

Hello,

Hopefully my email config is now better and reaches gmail users
correctly.

The patch below could probably split in a few patches, but I think
this simplify the code greatly without removing the functionality we
depend on (and this allows now to use EROFS initrd images).

Coupled with keeping the function populate_initrd_image() in
init/initramfs.c, this will keep what we need from the initrd code.

This removes support of loading bzip/gz/xz/... compressed images as
well, not sure if many user depend on this feature anymore.

No signoff because I'm only seeking comments about those changes right
now.

 init/do_mounts.h    |   2 -
 init/do_mounts_rd.c | 243 +-------------------------------------------
 2 files changed, 4 insertions(+), 241 deletions(-)

diff --git a/init/do_mounts.h b/init/do_mounts.h
index 6069ea3eb80d..c0028ee3cff6 100644
--- a/init/do_mounts.h
+++ b/init/do_mounts.h
@@ -24,12 +24,10 @@ static inline __init int create_dev(char *name, dev_t dev)
 
 #ifdef CONFIG_BLK_DEV_RAM
 
-int __init rd_load_disk(int n);
 int __init rd_load_image(char *from);
 
 #else
 
-static inline int rd_load_disk(int n) { return 0; }
 static inline int rd_load_image(char *from) { return 0; }
 
 #endif
diff --git a/init/do_mounts_rd.c b/init/do_mounts_rd.c
index ac021ae6e6fa..5a69ff43f5ee 100644
--- a/init/do_mounts_rd.c
+++ b/init/do_mounts_rd.c
@@ -14,173 +14,9 @@
 
 #include <linux/decompress/generic.h>
 
-static struct file *in_file, *out_file;
-static loff_t in_pos, out_pos;
-
-static int __init prompt_ramdisk(char *str)
-{
-	pr_warn("ignoring the deprecated prompt_ramdisk= option\n");
-	return 1;
-}
-__setup("prompt_ramdisk=", prompt_ramdisk);
-
-int __initdata rd_image_start;		/* starting block # of image */
-
-static int __init ramdisk_start_setup(char *str)
-{
-	rd_image_start = simple_strtol(str,NULL,0);
-	return 1;
-}
-__setup("ramdisk_start=", ramdisk_start_setup);
-
-static int __init crd_load(decompress_fn deco);
-
-/*
- * This routine tries to find a RAM disk image to load, and returns the
- * number of blocks to read for a non-compressed image, 0 if the image
- * is a compressed image, and -1 if an image with the right magic
- * numbers could not be found.
- *
- * We currently check for the following magic numbers:
- *	minix
- *	ext2
- *	romfs
- *	cramfs
- *	squashfs
- *	gzip
- *	bzip2
- *	lzma
- *	xz
- *	lzo
- *	lz4
- */
-static int __init
-identify_ramdisk_image(struct file *file, loff_t pos,
-		decompress_fn *decompressor)
-{
-	const int size = 512;
-	struct minix_super_block *minixsb;
-	struct romfs_super_block *romfsb;
-	struct cramfs_super *cramfsb;
-	struct squashfs_super_block *squashfsb;
-	int nblocks = -1;
-	unsigned char *buf;
-	const char *compress_name;
-	unsigned long n;
-	int start_block = rd_image_start;
-
-	buf = kmalloc(size, GFP_KERNEL);
-	if (!buf)
-		return -ENOMEM;
-
-	minixsb = (struct minix_super_block *) buf;
-	romfsb = (struct romfs_super_block *) buf;
-	cramfsb = (struct cramfs_super *) buf;
-	squashfsb = (struct squashfs_super_block *) buf;
-	memset(buf, 0xe5, size);
-
-	/*
-	 * Read block 0 to test for compressed kernel
-	 */
-	pos = start_block * BLOCK_SIZE;
-	kernel_read(file, buf, size, &pos);
-
-	*decompressor = decompress_method(buf, size, &compress_name);
-	if (compress_name) {
-		printk(KERN_NOTICE "RAMDISK: %s image found at block %d\n",
-		       compress_name, start_block);
-		if (!*decompressor)
-			printk(KERN_EMERG
-			       "RAMDISK: %s decompressor not configured!\n",
-			       compress_name);
-		nblocks = 0;
-		goto done;
-	}
-
-	/* romfs is at block zero too */
-	if (romfsb->word0 == ROMSB_WORD0 &&
-	    romfsb->word1 == ROMSB_WORD1) {
-		printk(KERN_NOTICE
-		       "RAMDISK: romfs filesystem found at block %d\n",
-		       start_block);
-		nblocks = (ntohl(romfsb->size)+BLOCK_SIZE-1)>>BLOCK_SIZE_BITS;
-		goto done;
-	}
-
-	if (cramfsb->magic == CRAMFS_MAGIC) {
-		printk(KERN_NOTICE
-		       "RAMDISK: cramfs filesystem found at block %d\n",
-		       start_block);
-		nblocks = (cramfsb->size + BLOCK_SIZE - 1) >> BLOCK_SIZE_BITS;
-		goto done;
-	}
-
-	/* squashfs is at block zero too */
-	if (le32_to_cpu(squashfsb->s_magic) == SQUASHFS_MAGIC) {
-		printk(KERN_NOTICE
-		       "RAMDISK: squashfs filesystem found at block %d\n",
-		       start_block);
-		nblocks = (le64_to_cpu(squashfsb->bytes_used) + BLOCK_SIZE - 1)
-			 >> BLOCK_SIZE_BITS;
-		goto done;
-	}
-
-	/*
-	 * Read 512 bytes further to check if cramfs is padded
-	 */
-	pos = start_block * BLOCK_SIZE + 0x200;
-	kernel_read(file, buf, size, &pos);
-
-	if (cramfsb->magic == CRAMFS_MAGIC) {
-		printk(KERN_NOTICE
-		       "RAMDISK: cramfs filesystem found at block %d\n",
-		       start_block);
-		nblocks = (cramfsb->size + BLOCK_SIZE - 1) >> BLOCK_SIZE_BITS;
-		goto done;
-	}
-
-	/*
-	 * Read block 1 to test for minix and ext2 superblock
-	 */
-	pos = (start_block + 1) * BLOCK_SIZE;
-	kernel_read(file, buf, size, &pos);
-
-	/* Try minix */
-	if (minixsb->s_magic == MINIX_SUPER_MAGIC ||
-	    minixsb->s_magic == MINIX_SUPER_MAGIC2) {
-		printk(KERN_NOTICE
-		       "RAMDISK: Minix filesystem found at block %d\n",
-		       start_block);
-		nblocks = minixsb->s_nzones << minixsb->s_log_zone_size;
-		goto done;
-	}
-
-	/* Try ext2 */
-	n = ext2_image_size(buf);
-	if (n) {
-		printk(KERN_NOTICE
-		       "RAMDISK: ext2 filesystem found at block %d\n",
-		       start_block);
-		nblocks = n;
-		goto done;
-	}
-
-	printk(KERN_NOTICE
-	       "RAMDISK: Couldn't find valid RAM disk image starting at %d.\n",
-	       start_block);
-
-done:
-	kfree(buf);
-	return nblocks;
-}
-
 static unsigned long nr_blocks(struct file *file)
 {
-	struct inode *inode = file->f_mapping->host;
-
-	if (!S_ISBLK(inode->i_mode))
-		return 0;
-	return i_size_read(inode) >> 10;
+	return i_size_read(file->f_mapping->host) >> 10;
 }
 
 int __init rd_load_image(char *from)
@@ -190,10 +26,11 @@ int __init rd_load_image(char *from)
 	int nblocks, i;
 	char *buf = NULL;
 	unsigned short rotate = 0;
-	decompress_fn decompressor = NULL;
 #if !defined(CONFIG_S390)
 	char rotator[4] = { '|' , '/' , '-' , '\\' };
 #endif
+	struct file *in_file, *out_file;
+	loff_t in_pos = 0, out_pos = 0;
 
 	out_file = filp_open("/dev/ram", O_RDWR, 0);
 	if (IS_ERR(out_file))
@@ -203,17 +40,6 @@ int __init rd_load_image(char *from)
 	if (IS_ERR(in_file))
 		goto noclose_input;
 
-	in_pos = rd_image_start * BLOCK_SIZE;
-	nblocks = identify_ramdisk_image(in_file, in_pos, &decompressor);
-	if (nblocks < 0)
-		goto done;
-
-	if (nblocks == 0) {
-		if (crd_load(decompressor) == 0)
-			goto successful_load;
-		goto done;
-	}
-
 	/*
 	 * NOTE NOTE: nblocks is not actually blocks but
 	 * the number of kibibytes of data to load into a ramdisk.
@@ -228,10 +54,7 @@ int __init rd_load_image(char *from)
 	/*
 	 * OK, time to copy in the data
 	 */
-	if (strcmp(from, "/initrd.image") == 0)
-		devblocks = nblocks;
-	else
-		devblocks = nr_blocks(in_file);
+	nblocks = devblocks = nr_blocks(in_file);
 
 	if (devblocks == 0) {
 		printk(KERN_ERR "RAMDISK: could not determine device size\n");
@@ -264,7 +87,6 @@ int __init rd_load_image(char *from)
 	}
 	pr_cont("done.\n");
 
-successful_load:
 	res = 1;
 done:
 	fput(in_file);
@@ -275,60 +97,3 @@ int __init rd_load_image(char *from)
 	init_unlink("/dev/ram");
 	return res;
 }
-
-int __init rd_load_disk(int n)
-{
-	create_dev("/dev/root", ROOT_DEV);
-	create_dev("/dev/ram", MKDEV(RAMDISK_MAJOR, n));
-	return rd_load_image("/dev/root");
-}
-
-static int exit_code;
-static int decompress_error;
-
-static long __init compr_fill(void *buf, unsigned long len)
-{
-	long r = kernel_read(in_file, buf, len, &in_pos);
-	if (r < 0)
-		printk(KERN_ERR "RAMDISK: error while reading compressed data");
-	else if (r == 0)
-		printk(KERN_ERR "RAMDISK: EOF while reading compressed data");
-	return r;
-}
-
-static long __init compr_flush(void *window, unsigned long outcnt)
-{
-	long written = kernel_write(out_file, window, outcnt, &out_pos);
-	if (written != outcnt) {
-		if (decompress_error == 0)
-			printk(KERN_ERR
-			       "RAMDISK: incomplete write (%ld != %ld)\n",
-			       written, outcnt);
-		decompress_error = 1;
-		return -1;
-	}
-	return outcnt;
-}
-
-static void __init error(char *x)
-{
-	printk(KERN_ERR "%s\n", x);
-	exit_code = 1;
-	decompress_error = 1;
-}
-
-static int __init crd_load(decompress_fn deco)
-{
-	int result;
-
-	if (!deco) {
-		pr_emerg("Invalid ramdisk decompression routine.  "
-			 "Select appropriate config option.\n");
-		panic("Could not decompress initial ramdisk image.");
-	}
-
-	result = deco(NULL, 0, compr_fill, compr_flush, NULL, NULL, error);
-	if (decompress_error)
-		result = 1;
-	return result;
-}
-- 
2.34.1


^ permalink raw reply related

* Re: [PATCH v5 3/8] man/man2/fsconfig.2: document "new" mount API
From: Alejandro Colomar @ 2025-09-25 11:32 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: Michael T. Kerrisk, Alexander Viro, Jan Kara, Askar Safin,
	G. Branden Robinson, linux-man, linux-api, linux-fsdevel,
	linux-kernel, David Howells, Christian Brauner
In-Reply-To: <20250925-new-mount-api-v5-3-028fb88023f2@cyphar.com>

[-- Attachment #1: Type: text/plain, Size: 22219 bytes --]

Hi Aleksa,

On Thu, Sep 25, 2025 at 01:31:25AM +1000, Aleksa Sarai wrote:
> This is loosely based on the original documentation written by David
> Howells and later maintained by Christian Brauner, but has been
> rewritten to be more from a user perspective (as well as fixing a few
> critical mistakes).
> 
> Co-authored-by: David Howells <dhowells@redhat.com>
> Signed-off-by: David Howells <dhowells@redhat.com>
> Co-authored-by: Christian Brauner <brauner@kernel.org>
> Signed-off-by: Christian Brauner <brauner@kernel.org>
> Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
> ---
>  man/man2/fsconfig.2 | 729 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 729 insertions(+)
> 
> diff --git a/man/man2/fsconfig.2 b/man/man2/fsconfig.2
> new file mode 100644
> index 0000000000000000000000000000000000000000..a2d844a105c74f17af640d6991046dbd5fa69cf0
> --- /dev/null
> +++ b/man/man2/fsconfig.2
> @@ -0,0 +1,729 @@
> +.\" Copyright, the authors of the Linux man-pages project
> +.\"
> +.\" SPDX-License-Identifier: Linux-man-pages-copyleft
> +.\"
> +.TH fsconfig 2 (date) "Linux man-pages (unreleased)"
> +.SH NAME
> +fsconfig \- configure new or existing filesystem context
> +.SH LIBRARY
> +Standard C library
> +.RI ( libc ,\~ \-lc )
> +.SH SYNOPSIS
> +.nf
> +.B #include <sys/mount.h>
> +.P
> +.BI "int fsconfig(int " fd ", unsigned int " cmd ,
> +.BI "             const char *_Nullable " key ,
> +.BI "             const void *_Nullable " value ", int " aux );
> +.fi
> +.SH DESCRIPTION
> +The
> +.BR fsconfig ()
> +system call is part of
> +the suite of file-descriptor-based mount facilities in Linux.
> +.P
> +.BR fsconfig ()
> +is used to supply parameters to
> +and issue commands against
> +the filesystem configuration context
> +associated with the file descriptor
> +.IR fd .
> +Filesystem configuration contexts can be created with
> +.BR fsopen (2)
> +or be instantiated from an extant filesystem instance with
> +.BR fspick (2).
> +.P
> +The
> +.I cmd
> +argument indicates the command to be issued.
> +Some commands supply parameters to the context
> +(equivalent to mount options specified with
> +.BR mount (8)),
> +while others are meta-operations on the filesystem context.
> +The list of valid
> +.I cmd
> +values are:
> +.RS
> +.TP
> +.B FSCONFIG_SET_FLAG
> +Set the flag parameter named by
> +.IR key .
> +.I value
> +must be NULL,
> +and
> +.I aux
> +must be 0.
> +.TP
> +.B FSCONFIG_SET_STRING
> +Set the string parameter named by
> +.I key
> +to the value specified by
> +.IR value .
> +.I value
> +points to a null-terminated string,
> +and
> +.I aux
> +must be 0.
> +.TP
> +.B FSCONFIG_SET_BINARY
> +Set the blob parameter named by
> +.I key
> +to the contents of the binary blob
> +specified by
> +.IR value .
> +.I value
> +points to
> +the start of a buffer
> +that is
> +.I aux
> +bytes in length.
> +.TP
> +.B FSCONFIG_SET_FD
> +Set the file parameter named by
> +.I key
> +to the open file description
> +referenced by the file descriptor
> +.IR aux .
> +.I value
> +must be NULL.
> +.IP
> +You may also use
> +.B \%FSCONFIG_SET_STRING
> +for file parameters,
> +with
> +.I value
> +set to a null-terminated string
> +containing a base-10 representation
> +of the file descriptor number.
> +This mechanism is primarily intended for compatibility
> +with older
> +.BR mount (2)-based
> +programs,
> +and only works for parameters
> +that
> +.I only
> +accept file descriptor arguments.
> +.TP
> +.B FSCONFIG_SET_PATH
> +Set the path parameter named by
> +.I key
> +to the object at a provided path,
> +resolved in a similar manner to
> +.BR openat (2).
> +.I value
> +points to a null-terminated pathname string,
> +and
> +.I aux
> +is equivalent to the
> +.I dirfd
> +argument to
> +.BR openat (2).
> +See
> +.BR openat (2)
> +for an explanation of the need for
> +.BR \%FSCONFIG_SET_PATH .
> +.IP
> +You may also use
> +.B \%FSCONFIG_SET_STRING
> +for path parameters,
> +the behaviour of which is equivalent to
> +.B \%FSCONFIG_SET_PATH
> +with
> +.I aux
> +set to
> +.BR \%AT_FDCWD .
> +.TP
> +.B FSCONFIG_SET_PATH_EMPTY
> +As with
> +.BR \%FSCONFIG_SET_PATH ,
> +except that if
> +.I value
> +is an empty string,
> +the file descriptor specified by
> +.I aux
> +is operated on directly
> +and may be any type of file
> +(not just a directory).
> +This is equivalent to the behaviour of
> +.B \%AT_EMPTY_PATH
> +with most "*at()" system calls.
> +If
> +.I aux
> +is
> +.BR \%AT_FDCWD ,
> +the parameter will be set to
> +the current working directory
> +of the calling process.
> +.TP
> +.B FSCONFIG_CMD_CREATE
> +This command instructs the filesystem driver
> +to instantiate an instance of the filesystem in the kernel
> +with the parameters specified in the filesystem configuration context.
> +.I key
> +and
> +.I value
> +must be NULL,
> +and
> +.I aux
> +must be 0.
> +.IP
> +This command can only be issued once
> +in the lifetime of a filesystem context.
> +If the operation succeeds,
> +the filesystem context
> +associated with file descriptor
> +.I fd
> +now references the created filesystem instance,
> +and is placed into a special "awaiting-mount" mode
> +that allows you to use
> +.BR fsmount (2)
> +to create a mount object from the filesystem instance.
> +.\" FS_CONTEXT_AWAITING_MOUNT is the term the kernel uses for this.
> +If the operation fails,
> +in most cases
> +the filesystem context is placed in a failed mode
> +and cannot be used for any further
> +.BR fsconfig ()
> +operations
> +(though you may still retrieve diagnostic messages
> +through the message retrieval interface,
> +as described in
> +the corresponding subsection of
> +.BR fsopen (2)).
> +.IP
> +This command can only be issued against
> +filesystem configuration contexts
> +that were created with
> +.BR fsopen (2).
> +In order to create a filesystem instance,
> +the calling process must have the
> +.B \%CAP_SYS_ADMIN
> +capability.
> +.IP
> +An important thing to be aware of is that
> +the Linux kernel will
> +.I silently
> +reuse extant filesystem instances
> +depending on the filesystem type
> +and the configured parameters
> +(each filesystem driver has
> +its own policy for
> +how filesystem instances are reused).
> +This means that
> +the filesystem instance "created" by
> +.B \%FSCONFIG_CMD_CREATE
> +may, in fact, be a reference
> +to an extant filesystem instance in the kernel.
> +(For reference,
> +this behaviour also applies to
> +.BR mount (2).)
> +.IP
> +One side-effect of this behaviour is that
> +if an extant filesystem instance is reused,
> +.I all
> +parameters configured
> +for this filesystem configuration context
> +are
> +.I silently ignored
> +(with the exception of the
> +.I ro
> +and
> +.I rw
> +flag parameters;
> +if the state of the read-only flag in the
> +extant filesystem instance and the filesystem configuration context
> +do not match, this operation will return
> +.BR EBUSY ).
> +This also means that
> +.B \%FSCONFIG_CMD_RECONFIGURE
> +commands issued against
> +the "created" filesystem instance
> +will also affect any mount objects associated with
> +the extant filesystem instance.
> +.IP
> +Programs that need to ensure
> +that they create a new filesystem instance
> +with specific parameters
> +(notably, security-related parameters
> +such as
> +.I acl
> +to enable POSIX ACLs\[em]\c
> +as described in
> +.BR acl (5))
> +should use
> +.B \%FSCONFIG_CMD_CREATE_EXCL
> +instead.
> +.TP
> +.BR FSCONFIG_CMD_CREATE_EXCL " (since Linux 6.6)"
> +.\" commit 22ed7ecdaefe0cac0c6e6295e83048af60435b13
> +.\" commit 84ab1277ce5a90a8d1f377707d662ac43cc0918a
> +As with
> +.BR \%FSCONFIG_CMD_CREATE ,
> +except that the kernel is instructed
> +to not reuse extant filesystem instances.
> +If the operation
> +would be forced to
> +reuse an extant filesystem instance,
> +this operation will return
> +.B EBUSY
> +instead.
> +.IP
> +As a result (unlike
> +.BR \%FSCONFIG_CMD_CREATE ),
> +if this operation succeeds
> +then the calling process can be sure that
> +all of the parameters successfully configured with
> +.BR fsconfig ()
> +will actually be applied
> +to the created filesystem instance.
> +.TP
> +.B FSCONFIG_CMD_RECONFIGURE
> +This command instructs the filesystem driver
> +to apply the parameters specified in the filesystem configuration context
> +to the extant filesystem instance
> +referenced by the filesystem configuration context.
> +.I key
> +and
> +.I value
> +must be NULL,
> +and
> +.I aux
> +must be 0.
> +.IP
> +This is primarily intended for use with
> +.BR fspick (2),
> +but may also be used to modify
> +the parameters of a filesystem instance
> +after
> +.B \%FSCONFIG_CMD_CREATE
> +was used to create it
> +and a mount object was created using
> +.BR fsmount (2).
> +In order to reconfigure an extant filesystem instance,
> +the calling process must have the
> +.B CAP_SYS_ADMIN
> +capability.
> +.IP
> +If the operation succeeds,
> +the filesystem context is reset
> +but remains in reconfiguration mode
> +and thus can be reused for subsequent
> +.B \%FSCONFIG_CMD_RECONFIGURE
> +commands.
> +If the operation fails,
> +in most cases
> +the filesystem context is placed in a failed mode
> +and cannot be used for any further
> +.BR fsconfig ()
> +operations
> +(though you may still retrieve diagnostic messages
> +through the message retrieval interface,
> +as described in
> +the corresponding subsection of
> +.BR fsopen (2)).
> +.RE
> +.P
> +Parameters specified with
> +.BI FSCONFIG_SET_ *
> +do not take effect
> +until a corresponding
> +.B \%FSCONFIG_CMD_CREATE
> +or
> +.B \%FSCONFIG_CMD_RECONFIGURE
> +command is issued.
> +.SH RETURN VALUE
> +On success,
> +.BR fsconfig ()
> +returns 0.
> +On error, \-1 is returned, and
> +.I errno
> +is set to indicate the error.
> +.SH ERRORS
> +If an error occurs, the filesystem driver may provide
> +additional information about the error
> +through the message retrieval interface for filesystem configuration contexts.
> +This additional information can be retrieved at any time by calling
> +.BR read (2)
> +on the filesystem instance or filesystem configuration context
> +referenced by the file descriptor
> +.IR fd .
> +(See the "Message retrieval interface" subsection in
> +.BR fsopen (2)
> +for more details on the message format.)
> +.P
> +Even after an error occurs,
> +the filesystem configuration context is
> +.I not
> +invalidated,
> +and thus can still be used with other
> +.BR fsconfig ()
> +commands.
> +This means that users can probe support for filesystem parameters
> +on a per-parameter basis,
> +and adjust which parameters they wish to set.
> +.P
> +The error values given below result from
> +filesystem type independent errors.
> +Each filesystem type may have its own special errors
> +and its own special behavior.
> +See the Linux kernel source code for details.
> +.TP
> +.B EACCES
> +A component of a path
> +provided as a path parameter
> +was not searchable.
> +(See also
> +.BR path_resolution (7).)
> +.TP
> +.B EACCES
> +.B \%FSCONFIG_CMD_CREATE
> +was attempted
> +for a read-only filesystem
> +without specifying the
> +.RB ' ro '
> +flag parameter.
> +.TP
> +.B EACCES
> +A specified block device parameter
> +is located on a filesystem
> +mounted with the
> +.B \%MS_NODEV
> +option.
> +.TP
> +.B EBADF
> +The file descriptor given by
> +.I fd
> +(or possibly by
> +.IR aux ,
> +depending on the command)
> +is invalid.
> +.TP
> +.B EBUSY
> +The filesystem context associated with
> +.I fd
> +is in the wrong state
> +for the given command.
> +.TP
> +.B EBUSY
> +The filesystem instance cannot be reconfigured as read-only
> +with
> +.B \%FSCONFIG_CMD_RECONFIGURE
> +because some programs
> +still hold files open for writing.
> +.TP
> +.B EBUSY
> +A new filesystem instance was requested with
> +.B \%FSCONFIG_CMD_CREATE_EXCL
> +but a matching superblock already existed.
> +.TP
> +.B EFAULT
> +One of the pointer arguments
> +points to a location
> +outside the calling process's accessible address space.
> +.TP
> +.B EINVAL
> +.I fd
> +does not refer to
> +a filesystem configuration context
> +or filesystem instance.
> +.TP
> +.B EINVAL
> +One of the values of
> +.IR name ,
> +.IR value ,
> +and/or
> +.I aux
> +were set to a non-zero value when
> +.I cmd
> +required that they be zero
> +(or NULL).
> +.TP
> +.B EINVAL
> +The parameter named by
> +.I name
> +cannot be set
> +using the type specified with
> +.IR cmd .
> +.TP
> +.B EINVAL
> +One of the source parameters
> +referred to
> +an invalid superblock.
> +.TP
> +.B ELOOP
> +Too many links encountered
> +during pathname resolution
> +of a path argument.
> +.TP
> +.B ENAMETOOLONG
> +A path argument was longer than
> +.BR PATH_MAX .
> +.TP
> +.B ENOENT
> +A path argument had a non-existent component.
> +.TP
> +.B ENOENT
> +A path argument is an empty string,
> +but
> +.I cmd
> +is not
> +.BR \%FSCONFIG_SET_PATH_EMPTY .
> +.TP
> +.B ENOMEM
> +The kernel could not allocate sufficient memory to complete the operation.
> +.TP
> +.B ENOTBLK
> +The parameter named by
> +.I name

There's no such parameter.  (I guess you meant 'key'?)


Cheers,
Alex

> +must be a block device,
> +but the provided parameter value was not a block device.
> +.TP
> +.B ENOTDIR
> +A component of the path prefix
> +of a path argument
> +was not a directory.
> +.TP
> +.B EOPNOTSUPP
> +The command given by
> +.I cmd
> +is not valid.
> +.TP
> +.B ENXIO
> +The major number
> +of a block device parameter
> +is out of range.
> +.TP
> +.B EPERM
> +The command given by
> +.I cmd
> +was
> +.BR \%FSCONFIG_CMD_CREATE ,
> +.BR \%FSCONFIG_CMD_CREATE_EXCL ,
> +or
> +.BR \%FSCONFIG_CMD_RECONFIGURE ,
> +but the calling process does not have the required
> +.B \%CAP_SYS_ADMIN
> +capability.
> +.SH STANDARDS
> +Linux.
> +.SH HISTORY
> +Linux 5.2.
> +.\" commit ecdab150fddb42fe6a739335257949220033b782
> +.\" commit 400913252d09f9cfb8cce33daee43167921fc343
> +glibc 2.36.
> +.SH NOTES
> +.SS Generic filesystem parameters
> +Each filesystem driver is responsible for
> +parsing most parameters specified with
> +.BR fsconfig (),
> +meaning that individual filesystems
> +may have very different behaviour
> +when encountering parameters with the same name.
> +In general,
> +you should not assume that the behaviour of
> +.BR fsconfig ()
> +when specifying a parameter to one filesystem type
> +will match the behaviour of the same parameter
> +with a different filesystem type.
> +.P
> +However,
> +the following generic parameters
> +apply to all filesystems and have unified behaviour.
> +They are set using the listed
> +.BI \%FSCONFIG_SET_ *
> +command.
> +.TP
> +\fIro\fP and \fIrw\fP (\fB\%FSCONFIG_SET_FLAG\fP)
> +Configure whether the filesystem instance is read-only.
> +.TP
> +\fIdirsync\fP (\fB\%FSCONFIG_SET_FLAG\fP)
> +Make directory changes on this filesystem instance synchronous.
> +.TP
> +\fIsync\fP and \fIasync\fP (\fB\%FSCONFIG_SET_FLAG\fP)
> +Configure whether writes on this filesystem instance
> +will be made synchronous
> +(as though the
> +.B O_SYNC
> +flag to
> +.BR open (2)
> +was specified for
> +all file opens in this filesystem instance).
> +.TP
> +\fIlazytime\fP and \fInolazytime\fP (\fB\%FSCONFIG_SET_FLAG\fP)
> +Configure whether to reduce on-disk updates
> +of inode timestamps on this filesystem instance
> +(as described in the
> +.B \%MS_LAZYTIME
> +section of
> +.BR mount (2)).
> +.TP
> +\fImand\fP and \fInomand\fP (\fB\%FSCONFIG_SET_FLAG\fP)
> +Configure whether the filesystem instance should permit mandatory locking.
> +Since Linux 5.15,
> +.\" commit f7e33bdbd6d1bdf9c3df8bba5abcf3399f957ac3
> +mandatory locking has been deprecated
> +and setting this flag is a no-op.
> +.TP
> +\fIsource\fP (\fB\%FSCONFIG_SET_STRING\fP)
> +This parameter is equivalent to the
> +.I source
> +parameter passed to
> +.BR mount (2)
> +for the same filesystem type,
> +and is usually the pathname of a block device
> +containing the filesystem.
> +This parameter may only be set once
> +per filesystem configuration context transaction.
> +.P
> +In addition,
> +any filesystem parameters associated with
> +Linux Security Modules (LSMs)
> +are also generic with respect to the underlying filesystem.
> +See the documentation for the LSM you wish to configure for more details.
> +.SH CAVEATS
> +.SS Filesystem parameter types
> +As a result of
> +each filesystem driver being responsible for
> +parsing most parameters specified with
> +.BR fsconfig (),
> +some filesystem drivers
> +may have unintuitive behaviour
> +with regards to which
> +.BI \%FSCONFIG_SET_ *
> +commands are permitted
> +to configure a given parameter.
> +.P
> +In order for
> +filesystem parameters to be backwards compatible with
> +.BR mount (2),
> +they must be parseable as strings;
> +this almost universally means that
> +.B \%FSCONFIG_SET_STRING
> +can also be used to configure them.
> +.\" Aleksa Sarai
> +.\"   Theoretically, a filesystem could check fc->oldapi and refuse
> +.\"   FSCONFIG_SET_STRING if the operation is coming from the new API, but no
> +.\"   filesystems do this (and probably never will).
> +However, other
> +.BI \%FSCONFIG_SET_ *
> +commands need to be opted into
> +by each filesystem driver's parameter parser.
> +.P
> +One of the most user-visible instances of
> +this inconsistency is that
> +many filesystems do not support
> +configuring path parameters with
> +.B \%FSCONFIG_SET_PATH
> +(despite the name),
> +which can lead to somewhat confusing
> +.B EINVAL
> +errors.
> +(For example, the generic
> +.I source
> +parameter\[em]\c
> +which is usually a path\[em]\c
> +can only be configured
> +with
> +.BR \%FSCONFIG_SET_STRING .)
> +.P
> +When writing programs that use
> +.BR fsconfig ()
> +to configure parameters
> +with commands other than
> +.BR \%FSCONFIG_SET_STRING ,
> +users should verify
> +that the
> +.BI \%FSCONFIG_SET_ *
> +commands used to configure each parameter
> +are supported by the corresponding filesystem driver.
> +.\" Aleksa Sarai
> +.\"   While this (quite confusing) inconsistency in behaviour is true today
> +.\"   (and has been true since this was merged), this appears to mostly be an
> +.\"   unintended consequence of filesystem drivers hand-coding fsparam parsing.
> +.\"   Path parameters are the most eggregious causes of confusion.
> +.\"   Hopefully we can make this no longer the case in a future kernel.
> +.SH EXAMPLES
> +To illustrate the different kinds of flags that can be configured with
> +.BR fsconfig (),
> +here are a few examples of some different filesystems being created:
> +.P
> +.in +4n
> +.EX
> +int fsfd, mntfd;
> +\&
> +fsfd = fsopen("tmpfs", FSOPEN_CLOEXEC);
> +fsconfig(fsfd, FSCONFIG_SET_FLAG, "inode64", NULL, 0);
> +fsconfig(fsfd, FSCONFIG_SET_STRING, "uid", "1234", 0);
> +fsconfig(fsfd, FSCONFIG_SET_STRING, "huge", "never", 0);
> +fsconfig(fsfd, FSCONFIG_SET_FLAG, "casefold", NULL, 0);
> +fsconfig(fsfd, FSCONFIG_CMD_CREATE, NULL, NULL, 0);
> +mntfd = fsmount(fsfd, FSMOUNT_CLOEXEC, MOUNT_ATTR_NOEXEC);
> +move_mount(mntfd, "", AT_FDCWD, "/tmp", MOVE_MOUNT_F_EMPTY_PATH);
> +\&
> +fsfd = fsopen("erofs", FSOPEN_CLOEXEC);
> +fsconfig(fsfd, FSCONFIG_SET_STRING, "source", "/dev/loop0", 0);
> +fsconfig(fsfd, FSCONFIG_SET_FLAG, "acl", NULL, 0);
> +fsconfig(fsfd, FSCONFIG_SET_FLAG, "user_xattr", NULL, 0);
> +fsconfig(fsfd, FSCONFIG_CMD_CREATE_EXCL, NULL, NULL, 0);
> +mntfd = fsmount(fsfd, FSMOUNT_CLOEXEC, MOUNT_ATTR_NOSUID);
> +move_mount(mntfd, "", AT_FDCWD, "/mnt", MOVE_MOUNT_F_EMPTY_PATH);
> +.EE
> +.in
> +.P
> +Usually,
> +specifying the same parameter named by
> +.I key
> +multiple times with
> +.BR fsconfig ()
> +causes the parameter value to be replaced.
> +However, some filesystems may have unique behaviour:
> +.P
> +.in +4n
> +.EX
> +\&
> +int fsfd, mntfd;
> +int lowerdirfd = open("/o/ctr/lower1", O_DIRECTORY | O_CLOEXEC);
> +\&
> +fsfd = fsopen("overlay", FSOPEN_CLOEXEC);
> +/* "lowerdir+" appends to the lower dir stack each time */
> +fsconfig(fsfd, FSCONFIG_SET_FD, "lowerdir+", NULL, lowerdirfd);
> +fsconfig(fsfd, FSCONFIG_SET_STRING, "lowerdir+", "/o/ctr/lower2", 0);
> +fsconfig(fsfd, FSCONFIG_SET_STRING, "lowerdir+", "/o/ctr/lower3", 0);
> +fsconfig(fsfd, FSCONFIG_SET_STRING, "lowerdir+", "/o/ctr/lower4", 0);
> +.\" fsconfig(fsfd, FSCONFIG_SET_PATH, "lowerdir+", "/o/ctr/lower5", AT_FDCWD);
> +.\" fsconfig(fsfd, FSCONFIG_SET_PATH_EMPTY, "lowerdir+", "", lowerdirfd);
> +.\" Aleksa Sarai: Hopefully these will also be supported in the future.
> +fsconfig(fsfd, FSCONFIG_SET_STRING, "xino", "auto", 0);
> +fsconfig(fsfd, FSCONFIG_SET_STRING, "nfs_export", "off", 0);
> +fsconfig(fsfd, FSCONFIG_CMD_CREATE, NULL, NULL, 0);
> +mntfd = fsmount(fsfd, FSMOUNT_CLOEXEC, 0);
> +move_mount(mntfd, "", AT_FDCWD, "/mnt", MOVE_MOUNT_F_EMPTY_PATH);
> +.EE
> +.in
> +.P
> +And here is an example of how
> +.BR fspick (2)
> +can be used with
> +.BR fsconfig ()
> +to reconfigure the parameters
> +of an extant filesystem instance
> +attached to
> +.IR /proc :
> +.P
> +.in +4n
> +.EX
> +int fsfd = fspick(AT_FDCWD, "/proc", FSPICK_CLOEXEC);
> +fsconfig(fsfd, FSCONFIG_SET_STRING, "hidepid", "ptraceable", 0);
> +fsconfig(fsfd, FSCONFIG_SET_STRING, "subset", "pid", 0);
> +fsconfig(fsfd, FSCONFIG_CMD_RECONFIGURE, NULL, NULL, 0);
> +.EE
> +.in
> +.SH SEE ALSO
> +.BR fsmount (2),
> +.BR fsopen (2),
> +.BR fspick (2),
> +.BR mount (2),
> +.BR mount_setattr (2),
> +.BR move_mount (2),
> +.BR open_tree (2),
> +.BR mount_namespaces (7)
> 
> -- 
> 2.51.0
> 
> 

-- 
<https://www.alejandro-colomar.es>
Use port 80 (that is, <...:80/>).

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply

* Re: [PATCH v5 2/8] man/man2/fspick.2: document "new" mount API
From: Alejandro Colomar @ 2025-09-25 11:14 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: Michael T. Kerrisk, Alexander Viro, Jan Kara, Askar Safin,
	G. Branden Robinson, linux-man, linux-api, linux-fsdevel,
	linux-kernel, David Howells, Christian Brauner
In-Reply-To: <20250925-new-mount-api-v5-2-028fb88023f2@cyphar.com>

[-- Attachment #1: Type: text/plain, Size: 9639 bytes --]

Hi Aleksa,

On Thu, Sep 25, 2025 at 01:31:24AM +1000, Aleksa Sarai wrote:
> This is loosely based on the original documentation written by David
> Howells and later maintained by Christian Brauner, but has been
> rewritten to be more from a user perspective (as well as fixing a few
> critical mistakes).
> 
> Co-authored-by: David Howells <dhowells@redhat.com>
> Signed-off-by: David Howells <dhowells@redhat.com>
> Co-authored-by: Christian Brauner <brauner@kernel.org>
> Signed-off-by: Christian Brauner <brauner@kernel.org>
> Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>

Patch applied (with minor amends; see below).  Thanks!

> ---
>  man/man2/fspick.2 | 343 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 343 insertions(+)
> 
> diff --git a/man/man2/fspick.2 b/man/man2/fspick.2
> new file mode 100644
> index 0000000000000000000000000000000000000000..800aed81d38384be4563f2558e3cef846d7e7cee
> --- /dev/null
> +++ b/man/man2/fspick.2
> @@ -0,0 +1,343 @@
> +.\" Copyright, the authors of the Linux man-pages project
> +.\"
> +.\" SPDX-License-Identifier: Linux-man-pages-copyleft
> +.\"
> +.TH fspick 2 (date) "Linux man-pages (unreleased)"
> +.SH NAME
> +fspick \- select filesystem for reconfiguration
> +.SH LIBRARY
> +Standard C library
> +.RI ( libc ,\~ \-lc )
> +.SH SYNOPSIS
> +.nf
> +.BR "#include <fcntl.h>" "          /* Definition of " AT_* " constants */"
> +.B #include <sys/mount.h>
> +.P
> +.BI "int fspick(int " dirfd ", const char *" path ", unsigned int " flags );
> +.fi
> +.SH DESCRIPTION
> +The
> +.BR fspick ()
> +system call is part of
> +the suite of file-descriptor-based mount facilities in Linux.
> +.P
> +.BR fspick ()
> +creates a new filesystem configuration context
> +for the extant filesystem instance
> +associated with the path described by
> +.I dirfd
> +and
> +.IR path ,
> +places it into reconfiguration mode
> +(similar to
> +.BR mount (8)
> +with the
> +.I \-o\~remount

I've changed this to bold.  Command options are literal, and thus we
document them with bold (see for example ldd(1)).

> +option).
> +A new file descriptor
> +associated with the filesystem configuration context
> +is then returned.
> +The calling process must have the
> +.B \%CAP_SYS_ADMIN
> +capability in order to create a new filesystem configuration context.
> +.P
> +The resultant file descriptor can be used with
> +.BR fsconfig (2)
> +to specify the desired set of changes to
> +filesystem parameters of the filesystem instance.
> +Once the desired set of changes have been configured,
> +the changes can be effectuated by calling
> +.BR fsconfig (2)
> +with the
> +.B \%FSCONFIG_CMD_RECONFIGURE
> +command.
> +In contrast to
> +the behaviour of
> +.B MS_REMOUNT
> +with
> +.BR mount (2),
> +.BR fspick ()
> +instantiates the filesystem configuration context
> +with a copy of
> +the extant filesystem's filesystem parameters;
> +thus,
> +subsequent
> +.B \%FSCONFIG_CMD_RECONFIGURE
> +operations
> +will only update filesystem parameters
> +explicitly modified with
> +.BR fsconfig (2).
> +.P
> +As with "*at()" system calls,
> +.BR fspick ()
> +uses the
> +.I dirfd
> +argument in conjunction with the
> +.I path
> +argument to determine the path to operate on, as follows:
> +.IP \[bu] 3
> +If the pathname given in
> +.I path
> +is absolute, then
> +.I dirfd
> +is ignored.
> +.IP \[bu]
> +If the pathname given in
> +.I path
> +is relative and
> +.I dirfd
> +is the special value
> +.BR \%AT_FDCWD ,
> +then
> +.I path
> +is interpreted relative to
> +the current working directory
> +of the calling process (like
> +.BR open (2)).
> +.IP \[bu]
> +If the pathname given in
> +.I path
> +is relative,
> +then it is interpreted relative to
> +the directory referred to by the file descriptor
> +.I dirfd
> +(rather than relative to
> +the current working directory
> +of the calling process,
> +as is done by
> +.BR open (2)
> +for a relative pathname).
> +In this case,
> +.I dirfd
> +must be a directory
> +that was opened for reading
> +.RB ( O_RDONLY )
> +or using the
> +.B O_PATH
> +flag.
> +.IP \[bu]
> +If
> +.I path
> +is an empty string,
> +and
> +.I flags
> +contains
> +.BR \%FSPICK_EMPTY_PATH ,
> +then the file descriptor
> +.I dirfd
> +is operated on directly.
> +In this case,
> +.I dirfd
> +may refer to any type of file,
> +not just a directory.
> +.P
> +See
> +.BR openat (2)
> +for an explanation of why the
> +.I dirfd
> +argument is useful.
> +.P
> +.I flags
> +can be used to control aspects of how
> +.I path
> +is resolved and
> +properties of the returned file descriptor.
> +A value for
> +.I flags
> +is constructed by bitwise ORing
> +zero or more of the following constants:
> +.RS
> +.TP
> +.B FSPICK_CLOEXEC
> +Set the close-on-exec
> +.RB ( FD_CLOEXEC )
> +flag on the new file descriptor.
> +See the description of the
> +.B O_CLOEXEC
> +flag in
> +.BR open (2)
> +for reasons why this may be useful.
> +.TP
> +.B FSPICK_EMPTY_PATH
> +If
> +.I path
> +is an empty string,
> +operate on the file referred to by
> +.I dirfd
> +(which may have been obtained from
> +.BR open (2),
> +.BR fsmount (2),
> +or
> +.BR open_tree (2)).
> +In this case,
> +.I dirfd
> +may refer to any type of file,
> +not just a directory.
> +If
> +.I dirfd
> +is
> +.BR \%AT_FDCWD ,
> +.BR fspick ()
> +will operate on the current working directory
> +of the calling process.
> +.TP
> +.B FSPICK_SYMLINK_NOFOLLOW
> +Do not follow symbolic links
> +in the terminal component of
> +.IR path .
> +If
> +.I path
> +references a symbolic link,
> +the returned filesystem context will reference
> +the filesystem that the symbolic link itself resides on.
> +.TP
> +.B FSPICK_NO_AUTOMOUNT
> +Do not automount the terminal ("basename") component of
> +.I path
> +if it is a directory that is an automount point.
> +This allows you to reconfigure an automount point,
> +rather than the location that would be mounted.
> +This flag has no effect
> +if the automount point has already been mounted over.
> +.RE
> +.P
> +As with filesystem contexts created with
> +.BR fsopen (2),
> +the file descriptor returned by
> +.BR fspick ()
> +may be queried for message strings at any time by calling
> +.BR read (2)
> +on the file descriptor.
> +(See the "Message retrieval interface" subsection in
> +.BR fsopen (2)
> +for more details on the message format.)
> +.SH RETURN VALUE
> +On success, a new file descriptor is returned.
> +On error, \-1 is returned, and
> +.I errno
> +is set to indicate the error.
> +.SH ERRORS
> +.TP
> +.B EACCES
> +Search permission is denied
> +for one of the directories
> +in the path prefix of
> +.IR path .
> +(See also
> +.BR path_resolution (7).)
> +.TP
> +.B EBADF
> +.I path
> +is relative but
> +.I dirfd
> +is neither
> +.B \%AT_FDCWD
> +nor a valid file descriptor.
> +.TP
> +.B EFAULT
> +.I path
> +is NULL
> +or a pointer to a location
> +outside the calling process's accessible address space.
> +.TP
> +.B EINVAL
> +Invalid flag specified in
> +.IR flags .
> +.TP
> +.B ELOOP
> +Too many symbolic links encountered when resolving
> +.IR path .
> +.TP
> +.B EMFILE
> +The calling process has too many open files to create more.
> +.TP
> +.B ENAMETOOLONG
> +.I path
> +is longer than
> +.BR PATH_MAX .
> +.TP
> +.B ENFILE
> +The system has too many open files to create more.
> +.TP
> +.B ENOENT
> +A component of
> +.I path
> +does not exist,
> +or is a dangling symbolic link.
> +.TP
> +.B ENOENT
> +.I path
> +is an empty string, but
> +.B \%FSPICK_EMPTY_PATH
> +is not specified in
> +.IR flags .
> +.TP
> +.B ENOTDIR
> +A component of the path prefix of
> +.I path
> +is not a directory;
> +or
> +.I path
> +is relative and
> +.I dirfd
> +is a file descriptor referring to a file other than a directory.
> +.TP
> +.B ENOMEM
> +The kernel could not allocate sufficient memory to complete the operation.
> +.TP
> +.B EPERM
> +The calling process does not have the required
> +.B \%CAP_SYS_ADMIN
> +capability.
> +.SH STANDARDS
> +Linux.
> +.SH HISTORY
> +Linux 5.2.
> +.\" commit cf3cba4a429be43e5527a3f78859b1bfd9ebc5fb
> +.\" commit 400913252d09f9cfb8cce33daee43167921fc343
> +glibc 2.36.
> +.SH EXAMPLES
> +The following example sets the read-only flag
> +on the filesystem instance referenced by
> +the mount object attached at
> +.IR /tmp .
> +.P
> +.in +4n
> +.EX
> +int fsfd = fspick(AT_FDCWD, "/tmp", FSPICK_CLOEXEC);
> +fsconfig(fsfd, FSCONFIG_SET_FLAG, "ro", NULL, 0);
> +fsconfig(fsfd, FSCONFIG_CMD_RECONFIGURE, NULL, NULL, 0);
> +.EE
> +.in
> +.P
> +The above procedure is roughly equivalent to
> +the following mount operation using
> +.BR mount (2):
> +.P
> +.in +4n
> +.EX
> +mount(NULL, "/tmp", NULL, MS_REMOUNT | MS_RDONLY, NULL);
> +.EE
> +.in
> +.P
> +With the notable caveat that
> +in this example,
> +.BR mount (2)
> +will clear all other filesystem parameters
> +(such as
> +.B MS_DIRSYNC
> +or
> +.BR MS_SYNCHRONOUS );
> +.BR fsconfig (2)
> +will only modify the
> +.I ro

I've also changed this to bold, as it's something literal.


Cheers,
Alex

> +parameter.
> +.SH SEE ALSO
> +.BR fsconfig (2),
> +.BR fsmount (2),
> +.BR fsopen (2),
> +.BR mount (2),
> +.BR mount_setattr (2),
> +.BR move_mount (2),
> +.BR open_tree (2),
> +.BR mount_namespaces (7)

-- 
<https://www.alejandro-colomar.es>
Use port 80 (that is, <...:80/>).

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply

* Re: [PATCH v5 1/8] man/man2/fsopen.2: document "new" mount API
From: Alejandro Colomar @ 2025-09-25 10:13 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: Michael T. Kerrisk, Alexander Viro, Jan Kara, Askar Safin,
	G. Branden Robinson, linux-man, linux-api, linux-fsdevel,
	linux-kernel, David Howells, Christian Brauner
In-Reply-To: <20250925-new-mount-api-v5-1-028fb88023f2@cyphar.com>

[-- Attachment #1: Type: text/plain, Size: 13334 bytes --]

Hi Aleksa,

On Thu, Sep 25, 2025 at 01:31:23AM +1000, Aleksa Sarai wrote:
> This is loosely based on the original documentation written by David
> Howells and later maintained by Christian Brauner, but has been
> rewritten to be more from a user perspective (as well as fixing a few
> critical mistakes).
> 
> Co-authored-by: David Howells <dhowells@redhat.com>
> Signed-off-by: David Howells <dhowells@redhat.com>
> Co-authored-by: Christian Brauner <brauner@kernel.org>
> Signed-off-by: Christian Brauner <brauner@kernel.org>
> Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>

Patch applied.  Thanks!


Have a lovely day!
Alex

> ---
>  man/man2/fsopen.2 | 385 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 385 insertions(+)
> 
> diff --git a/man/man2/fsopen.2 b/man/man2/fsopen.2
> new file mode 100644
> index 0000000000000000000000000000000000000000..7fbc6c3d28e2e741cd9003c105621b4242abd487
> --- /dev/null
> +++ b/man/man2/fsopen.2
> @@ -0,0 +1,385 @@
> +.\" Copyright, the authors of the Linux man-pages project
> +.\"
> +.\" SPDX-License-Identifier: Linux-man-pages-copyleft
> +.\"
> +.TH fsopen 2 (date) "Linux man-pages (unreleased)"
> +.SH NAME
> +fsopen \- create a new filesystem context
> +.SH LIBRARY
> +Standard C library
> +.RI ( libc ,\~ \-lc )
> +.SH SYNOPSIS
> +.nf
> +.B #include <sys/mount.h>
> +.P
> +.BI "int fsopen(const char *" fsname ", unsigned int " flags );
> +.fi
> +.SH DESCRIPTION
> +The
> +.BR fsopen ()
> +system call is part of
> +the suite of file-descriptor-based mount facilities in Linux.
> +.P
> +.BR fsopen ()
> +creates a blank filesystem configuration context within the kernel
> +for the filesystem named by
> +.I fsname
> +and places it into creation mode.
> +A new file descriptor
> +associated with the filesystem configuration context
> +is then returned.
> +The calling process must have the
> +.B \%CAP_SYS_ADMIN
> +capability in order to create a new filesystem configuration context.
> +.P
> +A filesystem configuration context is
> +an in-kernel representation of a pending transaction,
> +containing a set of configuration parameters that are to be applied
> +when creating a new instance of a filesystem
> +(or modifying the configuration of an existing filesystem instance,
> +such as when using
> +.BR fspick (2)).
> +.P
> +After obtaining a filesystem configuration context with
> +.BR fsopen (),
> +the general workflow for operating on the context looks like the following:
> +.IP (1) 5
> +Pass the filesystem context file descriptor to
> +.BR fsconfig (2)
> +to specify any desired filesystem parameters.
> +This may be done as many times as necessary.
> +.IP (2)
> +Pass the same filesystem context file descriptor to
> +.BR fsconfig (2)
> +with
> +.B \%FSCONFIG_CMD_CREATE
> +to create an instance of the configured filesystem.
> +.IP (3)
> +Pass the same filesystem context file descriptor to
> +.BR fsmount (2)
> +to create a new detached mount object for
> +the root of the filesystem instance,
> +which is then attached to a new file descriptor.
> +(This also places the filesystem context file descriptor into
> +reconfiguration mode,
> +similar to the mode produced by
> +.BR fspick (2).)
> +Once a mount object has been created with
> +.BR fsmount (2),
> +the filesystem context file descriptor can be safely closed.
> +.IP (4)
> +Now that a mount object has been created,
> +you may
> +.RS
> +.IP \[bu] 3
> +use the detached mount object file descriptor as a
> +.I dirfd
> +argument to "*at()" system calls;
> +and/or
> +.IP \[bu]
> +attach the mount object to a mount point
> +by passing the mount object file descriptor to
> +.BR move_mount (2).
> +This will also prevent the mount object from
> +being unmounted and destroyed when
> +the mount object file descriptor is closed.
> +.RE
> +.IP
> +The mount object file descriptor will
> +remain associated with the mount object
> +even after doing the above operations,
> +so you may repeatedly use the mount object file descriptor with
> +.BR move_mount (2)
> +and/or "*at()" system calls
> +as many times as necessary.
> +.P
> +A filesystem context will move between different modes
> +throughout its lifecycle
> +(such as the creation phase
> +when created with
> +.BR fsopen (),
> +the reconfiguration phase
> +when an existing filesystem instance is selected with
> +.BR fspick (2),
> +and the intermediate "awaiting-mount" phase
> +.\" FS_CONTEXT_AWAITING_MOUNT is the term the kernel uses for this.
> +between
> +.B \%FSCONFIG_CMD_CREATE
> +and
> +.BR fsmount (2)),
> +which has an impact on
> +what operations are permitted on the filesystem context.
> +.P
> +The file descriptor returned by
> +.BR fsopen ()
> +also acts as a channel for filesystem drivers to
> +provide more comprehensive diagnostic information
> +than is normally provided through the standard
> +.BR errno (3)
> +interface for system calls.
> +If an error occurs at any time during the workflow mentioned above,
> +calling
> +.BR read (2)
> +on the filesystem context file descriptor
> +will retrieve any ancillary information about the encountered errors.
> +(See the "Message retrieval interface" section
> +for more details on the message format.)
> +.P
> +.I flags
> +can be used to control aspects of
> +the creation of the filesystem configuration context file descriptor.
> +A value for
> +.I flags
> +is constructed by bitwise ORing
> +zero or more of the following constants:
> +.RS
> +.TP
> +.B FSOPEN_CLOEXEC
> +Set the close-on-exec
> +.RB ( FD_CLOEXEC )
> +flag on the new file descriptor.
> +See the description of the
> +.B O_CLOEXEC
> +flag in
> +.BR open (2)
> +for reasons why this may be useful.
> +.RE
> +.P
> +A list of filesystems supported by the running kernel
> +(and thus a list of valid values for
> +.IR fsname )
> +can be obtained from
> +.IR /proc/filesystems .
> +(See also
> +.BR proc_filesystems (5).)
> +.SS Message retrieval interface
> +When doing operations on a filesystem configuration context,
> +the filesystem driver may choose to provide
> +ancillary information to userspace
> +in the form of message strings.
> +.P
> +The filesystem context file descriptors returned by
> +.BR fsopen ()
> +and
> +.BR fspick (2)
> +may be queried for message strings at any time by calling
> +.BR read (2)
> +on the file descriptor.
> +Each call to
> +.BR read (2)
> +will return a single message,
> +prefixed to indicate its class:
> +.RS
> +.TP
> +.BI e\~ message
> +An error message was logged.
> +This is usually associated with an error being returned
> +from the corresponding system call which triggered this message.
> +.TP
> +.BI w\~ message
> +A warning message was logged.
> +.TP
> +.BI i\~ message
> +An informational message was logged.
> +.RE
> +.P
> +Messages are removed from the queue as they are read.
> +Note that the message queue has limited depth,
> +so it is possible for messages to get lost.
> +If there are no messages in the message queue,
> +.B read(2)
> +will return \-1 and
> +.I errno
> +will be set to
> +.BR \%ENODATA .
> +If the
> +.I buf
> +argument to
> +.BR read (2)
> +is not large enough to contain the entire message,
> +.BR read (2)
> +will return \-1 and
> +.I errno
> +will be set to
> +.BR \%EMSGSIZE .
> +(See BUGS.)
> +.P
> +If there are multiple filesystem contexts
> +referencing the same filesystem instance
> +(such as if you call
> +.BR fspick (2)
> +multiple times for the same mount),
> +each one gets its own independent message queue.
> +This does not apply to multiple file descriptors that are
> +tied to the same underlying open file description
> +(such as those created with
> +.BR dup (2)).
> +.P
> +Message strings will usually be prefixed by
> +the name of the filesystem or kernel subsystem
> +that logged the message,
> +though this may not always be the case.
> +See the Linux kernel source code for details.
> +.SH RETURN VALUE
> +On success, a new file descriptor is returned.
> +On error, \-1 is returned, and
> +.I errno
> +is set to indicate the error.
> +.SH ERRORS
> +.TP
> +.B EFAULT
> +.I fsname
> +is NULL
> +or a pointer to a location
> +outside the calling process's accessible address space.
> +.TP
> +.B EINVAL
> +.I flags
> +had an invalid flag set.
> +.TP
> +.B EMFILE
> +The calling process has too many open files to create more.
> +.TP
> +.B ENFILE
> +The system has too many open files to create more.
> +.TP
> +.B ENODEV
> +The filesystem named by
> +.I fsname
> +is not supported by the kernel.
> +.TP
> +.B ENOMEM
> +The kernel could not allocate sufficient memory to complete the operation.
> +.TP
> +.B EPERM
> +The calling process does not have the required
> +.B \%CAP_SYS_ADMIN
> +capability.
> +.SH STANDARDS
> +Linux.
> +.SH HISTORY
> +Linux 5.2.
> +.\" commit 24dcb3d90a1f67fe08c68a004af37df059d74005
> +.\" commit 400913252d09f9cfb8cce33daee43167921fc343
> +glibc 2.36.
> +.SH BUGS
> +.SS Message retrieval interface and \fB\%EMSGSIZE\fP
> +As described in the "Message retrieval interface" subsection above,
> +calling
> +.BR read (2)
> +with too small a buffer to contain
> +the next pending message in the message queue
> +for the filesystem configuration context
> +will cause
> +.BR read (2)
> +to return \-1 and set
> +.BR errno (3)
> +to
> +.BR \%EMSGSIZE .
> +.P
> +However,
> +this failed operation still
> +consumes the message from the message queue.
> +This effectively discards the message silently,
> +as no data is copied into the
> +.BR read (2)
> +buffer.
> +.P
> +Programs should take care to ensure that
> +their buffers are sufficiently large
> +to contain any reasonable message string,
> +in order to avoid silently losing valuable diagnostic information.
> +.\" Aleksa Sarai
> +.\"   This unfortunate behaviour has existed since this feature was merged, but
> +.\"   I have sent a patchset which will finally fix it.
> +.\"   <https://lore.kernel.org/r/20250807-fscontext-log-cleanups-v3-1-8d91d6242dc3@cyphar.com/>
> +.SH EXAMPLES
> +To illustrate the workflow for creating a new mount,
> +the following is an example of how to mount an
> +.BR ext4 (5)
> +filesystem stored on
> +.I /dev/sdb1
> +onto
> +.IR /mnt .
> +.P
> +.in +4n
> +.EX
> +int fsfd, mntfd;
> +\&
> +fsfd = fsopen("ext4", FSOPEN_CLOEXEC);
> +fsconfig(fsfd, FSCONFIG_SET_FLAG, "ro", NULL, 0);
> +fsconfig(fsfd, FSCONFIG_SET_PATH, "source", "/dev/sdb1", AT_FDCWD);
> +fsconfig(fsfd, FSCONFIG_SET_FLAG, "noatime", NULL, 0);
> +fsconfig(fsfd, FSCONFIG_SET_FLAG, "acl", NULL, 0);
> +fsconfig(fsfd, FSCONFIG_SET_FLAG, "user_xattr", NULL, 0);
> +fsconfig(fsfd, FSCONFIG_SET_FLAG, "iversion", NULL, 0)
> +fsconfig(fsfd, FSCONFIG_CMD_CREATE, NULL, NULL, 0);
> +mntfd = fsmount(fsfd, FSMOUNT_CLOEXEC, MOUNT_ATTR_RELATIME);
> +move_mount(mntfd, "", AT_FDCWD, "/mnt", MOVE_MOUNT_F_EMPTY_PATH);
> +.EE
> +.in
> +.P
> +First,
> +an ext4 configuration context is created and attached to the file descriptor
> +.IR fsfd .
> +Then, a series of parameters
> +(such as the source of the filesystem)
> +are provided using
> +.BR fsconfig (2),
> +followed by the filesystem instance being created with
> +.BR \%FSCONFIG_CMD_CREATE .
> +.BR fsmount (2)
> +is then used to create a new mount object attached to the file descriptor
> +.IR mntfd ,
> +which is then attached to the intended mount point using
> +.BR move_mount (2).
> +.P
> +The above procedure is functionally equivalent to
> +the following mount operation using
> +.BR mount (2):
> +.P
> +.in +4n
> +.EX
> +mount("/dev/sdb1", "/mnt", "ext4", MS_RELATIME,
> +      "ro,noatime,acl,user_xattr,iversion");
> +.EE
> +.in
> +.P
> +And here's an example of creating a mount object
> +of an NFS server share
> +and setting a Smack security module label.
> +However, instead of attaching it to a mount point,
> +the program uses the mount object directly
> +to open a file from the NFS share.
> +.P
> +.in +4n
> +.EX
> +int fsfd, mntfd, fd;
> +\&
> +fsfd = fsopen("nfs", 0);
> +fsconfig(fsfd, FSCONFIG_SET_STRING, "source", "example.com/pub", 0);
> +fsconfig(fsfd, FSCONFIG_SET_STRING, "nfsvers", "3", 0);
> +fsconfig(fsfd, FSCONFIG_SET_STRING, "rsize", "65536", 0);
> +fsconfig(fsfd, FSCONFIG_SET_STRING, "wsize", "65536", 0);
> +fsconfig(fsfd, FSCONFIG_SET_STRING, "smackfsdef", "foolabel", 0);
> +fsconfig(fsfd, FSCONFIG_SET_FLAG, "rdma", NULL, 0);
> +fsconfig(fsfd, FSCONFIG_CMD_CREATE, NULL, NULL, 0);
> +mntfd = fsmount(fsfd, 0, MOUNT_ATTR_NODEV);
> +fd = openat(mntfd, "src/linux-5.2.tar.xz", O_RDONLY);
> +.EE
> +.in
> +.P
> +Unlike the previous example,
> +this operation has no trivial equivalent with
> +.BR mount (2),
> +as it was not previously possible to create a mount object
> +that is not attached to any mount point.
> +.SH SEE ALSO
> +.BR fsconfig (2),
> +.BR fsmount (2),
> +.BR fspick (2),
> +.BR mount (2),
> +.BR mount_setattr (2),
> +.BR move_mount (2),
> +.BR open_tree (2),
> +.BR mount_namespaces (7)
> 
> -- 
> 2.51.0
> 
> 

-- 
<https://www.alejandro-colomar.es>
Use port 80 (that is, <...:80/>).

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply

* Re: [PATCH v3 07/30] kho: add interfaces to unpreserve folios and physical memory ranges
From: Mike Rapoport @ 2025-09-25  5:26 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: Jason Gunthorpe, pratyush, jasonmiu, graf, changyuanl, dmatlack,
	rientjes, corbet, rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl,
	masahiroy, akpm, tj, yoann.congal, mmaurer, roman.gushchin,
	chenridong, axboe, mark.rutland, jannh, vincent.guittot, hannes,
	dan.j.williams, david, joel.granados, rostedt, anna.schumaker,
	song, zhangguopeng, linux, linux-kernel, linux-doc, linux-mm,
	gregkh, tglx, mingo, bp, dave.hansen, x86, hpa, rafael, dakr,
	bartosz.golaszewski, cw00.choi, myungjoo.ham, yesanishhere,
	Jonathan.Cameron, quic_zijuhu, aleksander.lobakin, ira.weiny,
	andriy.shevchenko, leon, lukas, bhelgaas, wagi, djeffery,
	stuart.w.hayes, ptyadav, lennart, brauner, linux-api,
	linux-fsdevel, saeedm, ajayachandra, parav, leonro, witu
In-Reply-To: <CA+CK2bDc+-R=EuGM2pU=Phq8Ui-8xsDm0ppH6yjNR0U_o4TMHg@mail.gmail.com>

On Sun, Sep 21, 2025 at 06:20:50PM -0400, Pasha Tatashin wrote:
> On Mon, Aug 18, 2025 at 9:55 AM Jason Gunthorpe <jgg@nvidia.com> wrote:
> >
> > On Fri, Aug 15, 2025 at 12:12:10PM +0300, Mike Rapoport wrote:
> > > > Which is perhaps another comment, if this __get_free_pages() is going
> > > > to be a common pattern (and I guess it will be) then the API should be
> > > > streamlined alot more:
> > > >
> > > >  void *kho_alloc_preserved_memory(gfp, size);
> > > >  void kho_free_preserved_memory(void *);
> > >
> > > This looks backwards to me. KHO should not deal with memory allocation,
> > > it's responsibility to preserve/restore memory objects it supports.
> >
> > Then maybe those are luo_ helpers
> >
> > But having users open code __get_free_pages() and convert to/from
> > struct page, phys, etc is not a great idea.
> 
> I added:
> 
> void *luo_contig_alloc_preserve(size_t size);
> void luo_contig_free_unpreserve(void *mem, size_t size);

Why not add them to KHO?
 
> Allocate contiguous, zeroed, and preserved memory.
> 
> Pasha

-- 
Sincerely yours,
Mike.

^ permalink raw reply

* Re: [PATCH RESEND 00/62] initrd: remove classic initrd support
From: Rob Landley @ 2025-09-24 19:20 UTC (permalink / raw)
  To: Alexander Patrakov, Christophe Leroy
  Cc: Askar Safin, linux-fsdevel, linux-kernel, Linus Torvalds,
	Greg Kroah-Hartman, Christian Brauner, Al Viro, Jan Kara,
	Christoph Hellwig, Jens Axboe, Andy Shevchenko, Aleksa Sarai,
	Thomas Weißschuh, Julian Stecklina, Gao Xiang, Art Nikpal,
	Andrew Morton, Eric Curtin, Alexander Graf, Lennart Poettering,
	linux-arch, linux-alpha, linux-snps-arc, linux-arm-kernel,
	linux-csky, linux-hexagon, loongarch, linux-m68k, linux-mips,
	linux-openrisc, linux-parisc, linuxppc-dev, linux-riscv,
	linux-s390, linux-sh, sparclinux, linux-um, x86, Ingo Molnar,
	linux-block, initramfs, linux-api, linux-doc, linux-efi,
	linux-ext4, Theodore Y . Ts'o, linux-acpi, Michal Simek,
	devicetree, Luis Chamberlain, Kees Cook, Thorsten Blum,
	Heiko Carstens, patches
In-Reply-To: <CAN_LGv3Opj9RW0atfXODy-Epn++5mt_DLEi-ewxR9Me5x46Bkg@mail.gmail.com>

On 9/24/25 11:17, Alexander Patrakov wrote:
>> Therefore is it really initrd you are removing or just some corner case
>> ? If it is really initrd, then how does QEMU still work with that
>> -initrd parameter ?
> 
> The QEMU -initrd parameter is a misnomer. It can be used to pass an
> initrd or an initramfs, and the kernel automatically figures out what
> it is.

It's not a misnomer, initrams has always been able to make use of the 
existing initrd loading mechanism to read images externally supplied by 
the bootloader. It's what grub calls it too. I documented it in the 
"External initramfs images" section of 
https://kernel.org/doc/Documentation/filesystems/ramfs-rootfs-initramfs.txt 
back in 2005. The mechanism itself is 30 years old 
(Documentation/initrd.txt was written by Werner Almsberger in linux 
1.3.73 from March 7, 1996, ala 
https://github.com/mpe/linux-fullhistory/commit/afc106342783 ).

Since initrd contents could always be in a bunch of different 
autodetected formats (and optionally compressed just like the kernel), 
initramfs just hooked in to the staircase and said "if the format is 
cpio, call this function to handle it". The patch series proposes 
removing all the other formats, but not otherwise changing the existing 
external image loader mechanism. (Personally I think removing the 
architecture-specific hacks but leaving the generic support under init/ 
would probably have made more sense as a first step.)

The bootloader hands off an initrd image, initramfs is the boot-time 
cpio extraction plumbing that's _init tagged and gets freed, and rootfs 
is the persistent mounted instance of ramfs or tmpfs that's always there 
and is analogous to the init task (PID 1) except for the mount tree. 
(And is often overmounted so it's not visible, but it's still there. And 
is NOT SPECIAL: overmounts aren't a new concept, nor is hiding them in 
things like "df".)

There's a REASON my documentation file was called 
ramfs-rootfs-initramfs.txt: the naming's always been a bit... layered. 
(And yes, I have always spelled initmpfs with only one t.)

Rob

^ permalink raw reply

* Re: [PATCH RESEND 00/62] initrd: remove classic initrd support
From: Alexander Patrakov @ 2025-09-24 16:17 UTC (permalink / raw)
  To: Christophe Leroy
  Cc: Askar Safin, linux-fsdevel, linux-kernel, Linus Torvalds,
	Greg Kroah-Hartman, Christian Brauner, Al Viro, Jan Kara,
	Christoph Hellwig, Jens Axboe, Andy Shevchenko, Aleksa Sarai,
	Thomas Weißschuh, Julian Stecklina, Gao Xiang, Art Nikpal,
	Andrew Morton, Eric Curtin, Alexander Graf, Rob Landley,
	Lennart Poettering, linux-arch, linux-alpha, linux-snps-arc,
	linux-arm-kernel, linux-csky, linux-hexagon, loongarch,
	linux-m68k, linux-mips, linux-openrisc, linux-parisc,
	linuxppc-dev, linux-riscv, linux-s390, linux-sh, sparclinux,
	linux-um, x86, Ingo Molnar, linux-block, initramfs, linux-api,
	linux-doc, linux-efi, linux-ext4, Theodore Y . Ts'o,
	linux-acpi, Michal Simek, devicetree, Luis Chamberlain, Kees Cook,
	Thorsten Blum, Heiko Carstens, patches
In-Reply-To: <ffbf1a04-047d-4787-ac1e-f5362e1ca600@csgroup.eu>

On Tue, Sep 23, 2025 at 8:22 PM Christophe Leroy
<christophe.leroy@csgroup.eu> wrote:
>
>
>
> Le 13/09/2025 à 02:37, Askar Safin a écrit :
> > [Vous ne recevez pas souvent de courriers de safinaskar@gmail.com. Découvrez pourquoi ceci est important à https://aka.ms/LearnAboutSenderIdentification ]
> >
> > Intro
> > ====
> > This patchset removes classic initrd (initial RAM disk) support,
> > which was deprecated in 2020.
> > Initramfs still stays, and RAM disk itself (brd) still stays, too.
> > init/do_mounts* and init/*initramfs* are listed in VFS entry in
> > MAINTAINERS, so I think this patchset should go through VFS tree.
> > This patchset touchs every subdirectory in arch/, so I tested it
> > on 8 (!!!) archs in Qemu (see details below).
> > Warning: this patchset renames CONFIG_BLK_DEV_INITRD (!!!) to CONFIG_INITRAMFS
> > and CONFIG_RD_* to CONFIG_INITRAMFS_DECOMPRESS_* (for example,
> > CONFIG_RD_GZIP to CONFIG_INITRAMFS_DECOMPRESS_GZIP).
> > If you still use initrd, see below for workaround.
>
> Apologise if my question looks stupid, but I'm using QEMU for various
> tests, and the way QEMU is started is something like:
>
> qemu-system-ppc -kernel ./vmlinux -cpu g4 -M mac99 -initrd
> ./qemu/rootfs.cpio.gz
>
> I was therefore expecting (and fearing) it to fail with your series
> applied, but surprisingly it still works.
>
> Therefore is it really initrd you are removing or just some corner case
> ? If it is really initrd, then how does QEMU still work with that
> -initrd parameter ?

The QEMU -initrd parameter is a misnomer. It can be used to pass an
initrd or an initramfs, and the kernel automatically figures out what
it is. What you are passing is an initramfs (a gzipped cpio archive
with all the files), which is a modern and supported use case.

-- 
Alexander Patrakov

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox