From: Jason Gunthorpe <jgg@nvidia.com>
To: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: pratyush@kernel.org, jasonmiu@google.com, graf@amazon.com,
changyuanl@google.com, rppt@kernel.org, dmatlack@google.com,
rientjes@google.com, corbet@lwn.net, rdunlap@infradead.org,
ilpo.jarvinen@linux.intel.com, kanie@linux.alibaba.com,
ojeda@kernel.org, aliceryhl@google.com, masahiroy@kernel.org,
akpm@linux-foundation.org, tj@kernel.org, yoann.congal@smile.fr,
mmaurer@google.com, roman.gushchin@linux.dev,
chenridong@huawei.com, axboe@kernel.dk, mark.rutland@arm.com,
jannh@google.com, vincent.guittot@linaro.org, hannes@cmpxchg.org,
dan.j.williams@intel.com, david@redhat.com,
joel.granados@kernel.org, rostedt@goodmis.org,
anna.schumaker@oracle.com, song@kernel.org,
zhangguopeng@kylinos.cn, linux@weissschuh.net,
linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org,
linux-mm@kvack.org, gregkh@linuxfoundation.org,
tglx@linutronix.de, mingo@redhat.com, bp@alien8.de,
dave.hansen@linux.intel.com, x86@kernel.org, hpa@zytor.com,
rafael@kernel.org, dakr@kernel.org,
bartosz.golaszewski@linaro.org, cw00.choi@samsung.com,
myungjoo.ham@samsung.com, yesanishhere@gmail.com,
Jonathan.Cameron@huawei.com, quic_zijuhu@quicinc.com,
aleksander.lobakin@intel.com, ira.weiny@intel.com,
andriy.shevchenko@linux.intel.com, leon@kernel.org,
lukas@wunner.de, bhelgaas@google.com, wagi@kernel.org,
djeffery@redhat.com, stuart.w.hayes@gmail.com, ptyadav@amazon.de,
lennart@poettering.net, brauner@kernel.org,
linux-api@vger.kernel.org, linux-fsdevel@vger.kernel.org,
saeedm@nvidia.com, ajayachandra@nvidia.com, parav@nvidia.com,
leonro@nvidia.com, witu@nvidia.com
Subject: Re: [PATCH v2 16/32] liveupdate: luo_ioctl: add ioctl interface
Date: Tue, 29 Jul 2025 13:35:36 -0300
Message-ID: <20250729163536.GN36037@nvidia.com>
In-Reply-To: <20250723144649.1696299-17-pasha.tatashin@soleen.com>
On Wed, Jul 23, 2025 at 02:46:29PM +0000, Pasha Tatashin wrote:
> Introduce the user-space interface for the Live Update Orchestrator
> via ioctl commands, enabling external control over the live update
> process and management of preserved resources.
I strongly recommend copying something like fwctl (which is copying
iommufd, which is copying some other best practices). I will try to
outline the main points below.
The design of the fwctl scheme allows a lot of options for ABI
compatible future extensions and I very strongly recommend that
complex ioctl style APIs be built with that in mind. I have so many
scars from trying to undo fixed ABI design :)
> +/**
> + * struct liveupdate_fd - Holds parameters for preserving and restoring file
> + * descriptors across live update.
> + * @fd: Input for %LIVEUPDATE_IOCTL_FD_PRESERVE: The user-space file
> + * descriptor to be preserved.
> + * Output for %LIVEUPDATE_IOCTL_FD_RESTORE: The new file descriptor
> + * representing the fully restored kernel resource.
> + * @flags: Unused, reserved for future expansion, must be set to 0.
> + * @token: Input for %LIVEUPDATE_IOCTL_FD_PRESERVE: An opaque, unique token
> + * preserved for preserved resource.
> + * Input for %LIVEUPDATE_IOCTL_FD_RESTORE: The token previously
> + * provided to the preserve ioctl for the resource to be restored.
> + *
> + * This structure is used as the argument for the %LIVEUPDATE_IOCTL_FD_PRESERVE
> + * and %LIVEUPDATE_IOCTL_FD_RESTORE ioctls. These ioctls allow specific types
> + * of file descriptors (for example memfd, kvm, iommufd, and VFIO) to have their
> + * underlying kernel state preserved across a live update cycle.
> + *
> + * To preserve an FD, user space passes this struct to
> + * %LIVEUPDATE_IOCTL_FD_PRESERVE with the @fd field set. On success, the
> + * kernel uses the @token field to uniquly associate the preserved FD.
> + *
> + * After the live update transition, user space passes the struct populated with
> + * the *same* @token to %LIVEUPDATE_IOCTL_FD_RESTORE. The kernel uses the @token
> + * to find the preserved state and, on success, populates the @fd field with a
> + * new file descriptor referring to the restored resource.
> + */
> +struct liveupdate_fd {
> + int fd;
'int' should not appear in uapi structs. Fds are __s32
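
As a rough sketch of that point, the quoted struct written with only fixed-width uapi types could look like the following. The name liveupdate_fd_example is hypothetical; the fields mirror the quoted patch, and the snippet is assumed to be built in userspace against <linux/types.h>.

```c
#include <assert.h>		/* static_assert (C11) */
#include <stddef.h>		/* offsetof */
#include <linux/types.h>	/* __s32, __u32, __aligned_u64 */

/* Hypothetical name; same fields as the quoted struct liveupdate_fd. */
struct liveupdate_fd_example {
	__s32 fd;		/* fixed-width type instead of plain 'int' */
	__u32 flags;
	__aligned_u64 token;
};

/* With fixed-width members the layout does not depend on the ABI's 'int'. */
static_assert(sizeof(struct liveupdate_fd_example) == 16, "unexpected size");
static_assert(offsetof(struct liveupdate_fd_example, token) == 8,
	      "unexpected token offset");
```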
> + __u32 flags;
> + __aligned_u64 token;
> +};
> +
> +/* The ioctl type, documented in ioctl-number.rst */
> +#define LIVEUPDATE_IOCTL_TYPE 0xBA
I have found it very helpful to organize the ioctl numbering like this:

#define IOMMUFD_TYPE (';')

enum {
	IOMMUFD_CMD_BASE = 0x80,
	IOMMUFD_CMD_DESTROY = IOMMUFD_CMD_BASE,
	IOMMUFD_CMD_IOAS_ALLOC = 0x81,
	IOMMUFD_CMD_IOAS_ALLOW_IOVAS = 0x82,
[..]

#define IOMMU_DESTROY _IO(IOMMUFD_TYPE, IOMMUFD_CMD_DESTROY)

The numbers should be tightly packed and non-overlapping. It becomes
difficult to manage this if the numbers are sprinkled all over the
file. The above structuring will enforce git am conflicts if things
get muddled up. Saved me a few times already in iommufd.
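
Applied to this patch, the same numbering layout might look roughly like the sketch below. Only the 0xBA type value and the 0x00/0x02 command numbers come from the quoted patch; every identifier, and the 0x01 slot for unpreserve, is an assumption made for illustration.

```c
#include <assert.h>		/* static_assert (C11) */
#include <linux/ioctl.h>	/* _IO, _IOC_NR, _IOC_TYPE */

/* 0xBA is the type from the quoted patch; the names are hypothetical. */
#define LIVEUPDATE_EXAMPLE_TYPE 0xBA

/* All command numbers live in one place, tightly packed. */
enum {
	LIVEUPDATE_EXAMPLE_CMD_BASE = 0x00,
	LIVEUPDATE_EXAMPLE_CMD_FD_PRESERVE = LIVEUPDATE_EXAMPLE_CMD_BASE,
	LIVEUPDATE_EXAMPLE_CMD_FD_UNPRESERVE = 0x01,	/* assumed slot */
	LIVEUPDATE_EXAMPLE_CMD_FD_RESTORE = 0x02,
};

#define LIVEUPDATE_EXAMPLE_FD_PRESERVE \
	_IO(LIVEUPDATE_EXAMPLE_TYPE, LIVEUPDATE_EXAMPLE_CMD_FD_PRESERVE)
#define LIVEUPDATE_EXAMPLE_FD_UNPRESERVE \
	_IO(LIVEUPDATE_EXAMPLE_TYPE, LIVEUPDATE_EXAMPLE_CMD_FD_UNPRESERVE)
#define LIVEUPDATE_EXAMPLE_FD_RESTORE \
	_IO(LIVEUPDATE_EXAMPLE_TYPE, LIVEUPDATE_EXAMPLE_CMD_FD_RESTORE)

/* A gap or reshuffle in the enum is caught at build time. */
static_assert(LIVEUPDATE_EXAMPLE_CMD_FD_RESTORE ==
	      LIVEUPDATE_EXAMPLE_CMD_BASE + 2, "command numbers not packed");
```

Two patches that both append a command end up touching the same enum lines, which is what produces the git am conflict mentioned above.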
> +/**
> + * LIVEUPDATE_IOCTL_FD_PRESERVE - Validate and initiate preservation for a file
> + * descriptor.
> + *
> + * Argument: Pointer to &struct liveupdate_fd.
> + *
> + * User sets the @fd field identifying the file descriptor to preserve
> + * (e.g., memfd, kvm, iommufd, VFIO). The kernel validates if this FD type
> + * and its dependencies are supported for preservation. If validation passes,
> + * the kernel marks the FD internally and *initiates the process* of preparing
> + * its state for saving. The actual snapshotting of the state typically occurs
> + * during the subsequent %LIVEUPDATE_IOCTL_PREPARE execution phase, though
> + * some finalization might occur during freeze.
> + * On successful validation and initiation, the kernel uses the @token
> + * field with an opaque identifier representing the resource being preserved.
> + * This token confirms the FD is targeted for preservation and is required for
> + * the subsequent %LIVEUPDATE_IOCTL_FD_RESTORE call after the live update.
> + *
> + * Return: 0 on success (validation passed, preservation initiated), negative
> + * error code on failure (e.g., unsupported FD type, dependency issue,
> + * validation failed).
> + */
> +#define LIVEUPDATE_IOCTL_FD_PRESERVE \
> + _IOW(LIVEUPDATE_IOCTL_TYPE, 0x00, struct liveupdate_fd)
From a kdoc perspective I find it works much better to attach the kdoc
to the struct, not the ioctl:

/**
 * struct iommu_destroy - ioctl(IOMMU_DESTROY)
 * @size: sizeof(struct iommu_destroy)
 * @id: iommufd object ID to destroy. Can be any destroyable object type.
 *
 * Destroy any object held within iommufd.
 */
struct iommu_destroy {
	__u32 size;
	__u32 id;
};
#define IOMMU_DESTROY _IO(IOMMUFD_TYPE, IOMMUFD_CMD_DESTROY)

Generates this kdoc:
https://docs.kernel.org/userspace-api/iommufd.html#c.iommu_destroy
You should also make sure to link the uapi header into the kdoc build
under the "userspace API" chapter.
The structs should also be self-describing. I am fairly strongly
against using the size mechanism in the _IOW macro, it is instantly
ABI incompatible and basically impossible to deal with from userspace.
Hence why the IOMMUFD version is _IO().
This means stick a size member in the first 4 bytes of every
struct. More on this later..
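
A minimal sketch of such a self-describing command, assuming hypothetical luo_example_* names and field choices (only the 0xBA type value is taken from the quoted patch). The ioctl number is built with _IO(), so no struct size is baked into it, and userspace reports the size it was compiled with in the first four bytes.

```c
#include <assert.h>		/* static_assert (C11) */
#include <stddef.h>		/* offsetof */
#include <linux/ioctl.h>	/* _IO, _IOC_SIZE */
#include <linux/types.h>	/* __u32, __s32, __u64, __aligned_u64 */

#define LUO_EXAMPLE_TYPE 0xBA			/* type from the quoted patch */
#define LUO_EXAMPLE_CMD_FD_PRESERVE 0x00	/* hypothetical command number */

/* Hypothetical command struct; the size header always comes first. */
struct luo_example_fd_preserve {
	__u32 size;		/* sizeof(struct luo_example_fd_preserve) */
	__s32 fd;
	__aligned_u64 token;
};

/* _IO() encodes no size, so the number survives later struct growth. */
#define LUO_EXAMPLE_FD_PRESERVE \
	_IO(LUO_EXAMPLE_TYPE, LUO_EXAMPLE_CMD_FD_PRESERVE)

static_assert(offsetof(struct luo_example_fd_preserve, size) == 0,
	      "size header must be the first member");

/* Userspace helper: fill in the size the caller was built against. */
static inline struct luo_example_fd_preserve
luo_example_fd_preserve_init(__s32 fd, __u64 token)
{
	struct luo_example_fd_preserve cmd = {
		.size = sizeof(cmd),
		.fd = fd,
		.token = token,
	};

	return cmd;
}
```

The caller would then pass a pointer to the initialized struct as the ioctl argument; the kernel side reads the size field to decide how much to copy.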
> +/**
> + * LIVEUPDATE_IOCTL_FD_UNPRESERVE - Remove a file descriptor from the
> + * preservation list.
> + *
> + * Argument: Pointer to __u64 token.
Every ioctl should have a struct, with the size header. A bare __u64
argument cannot be extended if you want to do more down the road.
> +#define LIVEUPDATE_IOCTL_FD_RESTORE \
> + _IOWR(LIVEUPDATE_IOCTL_TYPE, 0x02, struct liveupdate_fd)
Strongly recommend that every ioctl have a unique struct. Sharing
structs makes future extensibility harder.
> +/**
> + * LIVEUPDATE_IOCTL_PREPARE - Initiate preparation phase and trigger state
> + * saving.
Perhaps these just want to be a single 'set state' ioctl with an enum
input argument?
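
One possible shape for such a 'set state' command is sketched below. Everything here is an assumption for illustration: the struct and enum names are hypothetical, and the list of states is a guess at which phases would be folded into one ioctl (only prepare and freeze are mentioned in the quoted text).

```c
#include <assert.h>		/* static_assert (C11) */
#include <errno.h>		/* EINVAL, EOPNOTSUPP */
#include <linux/types.h>	/* __u32 */

/* Hypothetical state values; the real set is up to the LUO state machine. */
enum luo_example_state {
	LUO_EXAMPLE_STATE_PREPARE = 0,
	LUO_EXAMPLE_STATE_FREEZE = 1,
	LUO_EXAMPLE_STATE_FINISH = 2,	/* assumed */
	LUO_EXAMPLE_STATE_CANCEL = 3,	/* assumed */
};

/* Hypothetical command struct: size header plus the requested state. */
struct luo_example_set_state {
	__u32 size;	/* sizeof(struct luo_example_set_state) */
	__u32 state;	/* enum luo_example_state, carried as a __u32 */
};

static_assert(sizeof(struct luo_example_set_state) == 8, "unexpected size");

/* Input validation only; returns 0 or a negative errno. */
static int luo_example_check_set_state(const struct luo_example_set_state *cmd)
{
	if (cmd->size < sizeof(*cmd))
		return -EINVAL;
	if (cmd->state > LUO_EXAMPLE_STATE_CANCEL)
		return -EOPNOTSUPP;	/* state unknown to this kernel */
	return 0;
}
```

New states can then be added by extending the enum rather than by allocating more ioctl numbers.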
> @@ -7,4 +7,5 @@ obj-$(CONFIG_KEXEC_HANDOVER) += kexec_handover.o
> obj-$(CONFIG_KEXEC_HANDOVER_DEBUG) += kexec_handover_debug.o
> obj-$(CONFIG_LIVEUPDATE) += luo_core.o
> obj-$(CONFIG_LIVEUPDATE) += luo_files.o
> +obj-$(CONFIG_LIVEUPDATE) += luo_ioctl.o
> obj-$(CONFIG_LIVEUPDATE) += luo_subsystems.o
I don't think luo is modular, but I think it is generally better to
write the kbuilds as though it was anyhow if it has a lot of files:

iommufd-y := \
	device.o \
	eventq.o \
	hw_pagetable.o \
	io_pagetable.o \
	ioas.o \
	main.o \
	pages.o \
	vfio_compat.o \
	viommu.o

obj-$(CONFIG_IOMMUFD) += iommufd.o

Basically don't repeat obj-$(CONFIG_LIVEUPDATE), every one of those
lines creates a new module (if it was modular)
> +static int luo_open(struct inode *inodep, struct file *filep)
> +{
> + if (!capable(CAP_SYS_ADMIN))
> + return -EACCES;
IMHO file system permissions should control permission to open. No
capable check.
> + if (filep->f_flags & O_EXCL)
> + return -EINVAL;
O_EXCL doesn't really do anything for cdev, I'd drop this.
The open should have an atomic to check for single open though.
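
The single-open check could be modelled as below. This is a userspace sketch using C11 atomics so that it can be compiled and run standalone; in the kernel the same idea would presumably use an atomic_t with atomic_cmpxchg() in the open and release file operations. All luo_example_* names are hypothetical.

```c
#include <assert.h>	/* assert */
#include <errno.h>	/* EBUSY */
#include <stdatomic.h>	/* atomic_int, atomic_compare_exchange_strong */

/* 0 = device closed, 1 = device currently open. Zero-initialized. */
static atomic_int luo_example_in_use;

/* Model of open(): only the caller that flips 0 -> 1 succeeds. */
static int luo_example_open(void)
{
	int expected = 0;

	if (!atomic_compare_exchange_strong(&luo_example_in_use, &expected, 1))
		return -EBUSY;
	return 0;
}

/* Model of release(): hand the device back for the next opener. */
static void luo_example_release(void)
{
	int was_open = atomic_exchange(&luo_example_in_use, 0);

	/* Releasing a device that was never opened would be a bug. */
	assert(was_open == 1);
}
```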
> +static long luo_ioctl(struct file *filep, unsigned int cmd, unsigned long arg)
> +{
> + void __user *argp = (void __user *)arg;
> + struct liveupdate_fd luo_fd;
> + enum liveupdate_state state;
> + int ret = 0;
> + u64 token;
> +
> + if (_IOC_TYPE(cmd) != LIVEUPDATE_IOCTL_TYPE)
> + return -ENOTTY;
The generic parse/dispatch from fwctl is a really good idea here, you
can cut and paste it and change the names. It makes it really easy to
manage future extensibility:
List the ops and their structs:

static const struct fwctl_ioctl_op fwctl_ioctl_ops[] = {
	IOCTL_OP(FWCTL_INFO, fwctl_cmd_info, struct fwctl_info, out_device_data),
	IOCTL_OP(FWCTL_RPC, fwctl_cmd_rpc, struct fwctl_rpc, out),
};

Index the list and copy_from_user the struct describing the op:

static long fwctl_fops_ioctl(struct file *filp, unsigned int cmd,
			     unsigned long arg)
{
	struct fwctl_uctx *uctx = filp->private_data;
	const struct fwctl_ioctl_op *op;
	struct fwctl_ucmd ucmd = {};
	union fwctl_ucmd_buffer buf;
	unsigned int nr;
	int ret;

	nr = _IOC_NR(cmd);
	if ((nr - FWCTL_CMD_BASE) >= ARRAY_SIZE(fwctl_ioctl_ops))
		return -ENOIOCTLCMD;

	op = &fwctl_ioctl_ops[nr - FWCTL_CMD_BASE];
	if (op->ioctl_num != cmd)
		return -ENOIOCTLCMD;

	ucmd.uctx = uctx;
	ucmd.cmd = &buf;
	ucmd.ubuffer = (void __user *)arg;
	// This is reading/checking the standard 4 byte size header:
	ret = get_user(ucmd.user_size, (u32 __user *)ucmd.ubuffer);
	if (ret)
		return ret;

	if (ucmd.user_size < op->min_size)
		return -EINVAL;

	ret = copy_struct_from_user(ucmd.cmd, op->size, ucmd.ubuffer,
				    ucmd.user_size);

This removes a bunch of boilerplate and easy-to-get-wrong
copy_from_user() calls in the ioctls. It centralizes size validation,
zero-padding checks, etc.
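
To make the size handling concrete, here is a userspace model of the rules copy_struct_from_user() is understood to apply. It is not the kernel implementation: example_copy_struct() is a hypothetical stand-in that works on ordinary memory, and the min_size check from the snippet above is assumed to have been done by the caller.

```c
#include <assert.h>	/* assert */
#include <errno.h>	/* E2BIG */
#include <stddef.h>	/* size_t */
#include <string.h>	/* memcpy, memset */

/*
 * Copy a user-provided struct of usize bytes into a kernel struct of
 * ksize bytes. Returns 0 on success or -E2BIG if userspace set bytes
 * this "kernel" does not know about.
 */
static int example_copy_struct(void *dst, size_t ksize,
			       const void *src, size_t usize)
{
	size_t size = usize < ksize ? usize : ksize;
	const unsigned char *tail;
	size_t i;

	assert(dst && src);
	tail = (const unsigned char *)src + size;

	/* Newer userspace, larger struct: the unknown tail must be zero. */
	for (i = 0; i < usize - size; i++)
		if (tail[i])
			return -E2BIG;

	/* Older userspace, smaller struct: missing fields read as zero. */
	memset((unsigned char *)dst + size, 0, ksize - size);
	memcpy(dst, src, size);
	return 0;
}
```

With this rule, a field appended to the end of a struct later on is seen as zero by an old binary's request, and a new binary only fails on an old kernel if it actually uses the new field.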
> + ret = luo_register_file(luo_fd.token, luo_fd.fd);
> + if (!ret && copy_to_user(argp, &luo_fd, sizeof(luo_fd))) {
> + WARN_ON_ONCE(luo_unregister_file(luo_fd.token));
> + ret = -EFAULT;
Then for extensibility you'd copy back the struct:

static int ucmd_respond(struct fwctl_ucmd *ucmd, size_t cmd_len)
{
	if (copy_to_user(ucmd->ubuffer, ucmd->cmd,
			 min_t(size_t, ucmd->user_size, cmd_len)))
		return -EFAULT;
	return 0;
}

Which truncates it/etc according to some ABI extensibility rules.
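
The truncation rule can be modelled in userspace as below. example_respond() is a hypothetical stand-in that uses memcpy() on ordinary memory in place of copy_to_user(); it only illustrates that the reply never exceeds the size userspace announced.

```c
#include <assert.h>	/* assert */
#include <stddef.h>	/* size_t */
#include <string.h>	/* memcpy */

/*
 * Copy at most user_size bytes of the kernel's reply back to the
 * caller's buffer. Returns the number of bytes written.
 */
static size_t example_respond(void *ubuffer, size_t user_size,
			      const void *cmd, size_t cmd_len)
{
	size_t n = user_size < cmd_len ? user_size : cmd_len;

	assert(ubuffer && cmd);
	memcpy(ubuffer, cmd, n);
	return n;
}
```

An old binary that passes a shorter struct therefore gets only the fields it knows about, and memory beyond its struct is left untouched.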
> +static int __init liveupdate_init(void)
> +{
> + int err;
> +
> + if (!liveupdate_enabled())
> + return 0;
> +
> + err = misc_register(&liveupdate_miscdev);
> + if (err < 0) {
> + pr_err("Failed to register misc device '%s': %d\n",
> + liveupdate_miscdev.name, err);
Should remove most of the pr_err's, here too IMHO..
Jason